Pandas
A high-level data manipulation tool used by data scientists. It is conventionally imported with `import pandas as pd`.
Creating a Pandas Dataframe
From Dictionaries
There are multiple ways to create a Pandas DataFrame. The most common is to build it from a dictionary, where each key is a column label and the corresponding value is the list of observations for that column. Calling `pd.DataFrame(data)` on such a dictionary then creates the DataFrame.
Assigning row labels
Row labels can be assigned with `df.index = ['label1', 'label2', ..., 'labeln']` for the n observations in the DataFrame.
From Lists
To create a DataFrame from two conforming lists, defined as follows, we can use the `list` and `zip` functions.
labels = ['n1', 'n2']
x = [1, 2, 3]
y = [4, 5, 6]
- First, create a list of the column lists: `list_columns = [x, y]`
- Then zip the labels with the column lists, element-wise, and convert the result to a list: `z = list(zip(labels, list_columns))`
- Convert `z` into a dictionary that maps each column name (from `labels`) to its column: `data = dict(z)`
- Finally, use the dictionary method described above, i.e. `pd.DataFrame(data)`, to create the DataFrame, as in the sketch below.
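Putting these steps together, a minimal sketch using the example lists above:

```python
import pandas as pd

labels = ['n1', 'n2']
x = [1, 2, 3]
y = [4, 5, 6]

list_columns = [x, y]                 # list of column lists
z = list(zip(labels, list_columns))   # [('n1', [1, 2, 3]), ('n2', [4, 5, 6])]
data = dict(z)                        # {'n1': [1, 2, 3], 'n2': [4, 5, 6]}
df = pd.DataFrame(data)
print(df)
```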
Broadcasting
Broadcasting in Python is analogous to recycling in R: a single value is reused to fill every observation of a column when fewer values than rows are provided.
x = [1, 2, 3]
y = {'n': x, 'is_int': 'Yes'}
z = pd.DataFrame(y)
Here Pandas broadcasts the single string `Yes` so that it appears in all three rows of the `is_int` column.
Reading and Importing Data
From CSVs
Reading data from CSVs is straightforward: `pd.read_csv('path_to_csv.csv')` reads a file into a DataFrame.
- No column labels: If the data does not have column labels, `pd.read_csv('path_to_csv.csv', header=None)` reads the data without treating the first row as a header.
- External column names: Column names can be supplied with the `names` argument, e.g. `pd.read_csv('path_to_csv.csv', header=None, names=list_of_names)`.
- Null value declaration: If the data uses a convention other than `NaN` for null values, it can be declared explicitly by setting the `na_values` argument to that character, e.g. `pd.read_csv('path_to_csv.csv', na_values=['-1'])`. If more than one kind of `NaN` marker is present in the dataset, a list of values can be used, as in `pd.read_csv('path.csv', na_values=['-1', '999'])`, or a dictionary, as in `pd.read_csv('path.csv', na_values={'col1': '-1', 'col2': '999'})`, when different columns use different null markers.
- Assigning row labels: If the first column of the CSV contains row labels, use `pd.read_csv('path_to_csv.csv', index_col=0)` to use it as the (row) index of the DataFrame.
- Date values: If year, month and day are in separate columns and need to be combined into one column, the `parse_dates` argument of the `read_csv()` function can be used, e.g. `pd.read_csv('path_to_csv.csv', parse_dates=[[1, 2, 3]])`, where the columns with those indices contain the year, month and day data. Alternatively, `parse_dates=True` converts all dates that are ISO 8601 compatible (yyyy-mm-dd hh:mm:ss) into a proper datetime structure.
- Handling comments: If the file contains comments within the data, they can be identified by the character passed to the `comment` argument, as in `pd.read_csv('path_to_csv.csv', comment='#')`.
- Delimiter: The delimiter used while reading a CSV into a Pandas DataFrame can be set with the `sep` argument.
- Skipping rows: Rows can be skipped while reading a CSV file by using the `skiprows` argument, in combination with the `header` argument.
- Skipping footer: Rows at the end of the file can be skipped using the `skipfooter=n` argument, which skips the last `n` rows of the file.

> NOTE: The `skipfooter` argument doesn't work with the default C engine, so we need to specify `engine='python'` when setting this parameter.

A sketch combining a few of these options follows.
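A minimal sketch combining a few of these options; the file path, column names and null markers are hypothetical:

```python
import pandas as pd

# Hypothetical file without a header row, using -1 and 999 as null markers
col_names = ['id', 'value', 'flag']
df = pd.read_csv('path_to_csv.csv',
                 header=None,              # file has no header row
                 names=col_names,          # supply our own column names
                 na_values=['-1', '999'],  # treat these as NaN
                 comment='#',              # ignore everything after '#'
                 sep=',')
print(df.head())
```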
Chunkwise loading
For large datasets, data can be loaded and processed in chunks with the help of a `for` loop, as in `for chunk in pd.read_csv('path_to_csv.csv', chunksize=1000)`.
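A minimal sketch that processes a hypothetical large CSV chunk by chunk instead of loading it all at once:

```python
import pandas as pd

# Accumulate a per-chunk statistic; each chunk is a regular DataFrame.
total_rows = 0
for chunk in pd.read_csv('path_to_csv.csv', chunksize=1000):
    total_rows += len(chunk)
print(total_rows)
```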
Globbing
The process of looking for file names that match a specific pattern, and loading them, is called globbing.
import glob
pattern = '*.csv'
csv_files = glob.glob(pattern)
The code above returns a list of file names, `csv_files`. We can then loop over this list to load all the data frames, and use concatenation to merge them into a single dataset if required, as in the sketch below.
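A minimal sketch that loads every matching CSV and concatenates the results (the pattern and files are hypothetical):

```python
import glob
import pandas as pd

csv_files = glob.glob('*.csv')                    # all CSVs in the current folder
frames = [pd.read_csv(f) for f in csv_files]      # one DataFrame per file
combined = pd.concat(frames, ignore_index=True)   # reset the row index
print(combined.shape)
```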
From URLs
Importing a CSV from a web URL can be done with the `urllib` package as follows:
from urllib.request import urlretrieve
urlretrieve('http://onlinefilepath', 'local_file_path.csv')
Then proceed with the `read_csv()` function as usual. (Note that `pd.read_csv()` can also read directly from a URL.)
From Excel
- Reading a file: A simple read of an Excel spreadsheet can be done with `x = pd.ExcelFile('filepath.xlsx')`.
- Listing sheets: A file can contain multiple sheets, and they can be listed using the `sheet_names` attribute, as in `print(x.sheet_names)`.
- Reading a particular sheet: This is done by passing the sheet name to the `parse()` method, as in `df_sheet = x.parse('sheet1')`.
- Custom headers: Custom column headers can be defined while parsing an Excel sheet by using the `names` argument. A combined sketch follows this list.
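A minimal sketch combining these steps, assuming a hypothetical workbook with a sheet named `sheet1`:

```python
import pandas as pd

x = pd.ExcelFile('filepath.xlsx')   # open the workbook
print(x.sheet_names)                # list the available sheets
df_sheet = x.parse('sheet1')        # read one sheet into a DataFrame
```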
From HDF5 (Hierarchical Data Format version 5)
This is a data format commonly used for storing large quantities of numerical data. It can be read using the following code segment:
import h5py
data = h5py.File("path_to_file.hdf5", "r")
The `data` object obtained this way can be explored much like a dictionary, as in the sketch below.
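A minimal exploration sketch, assuming a hypothetical HDF5 file path:

```python
import h5py

data = h5py.File("path_to_file.hdf5", "r")
for key in data.keys():          # top-level groups behave like dictionary keys
    print(key, data[key])
```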
From Pickles
import pickle

with open('file_path.pkl', 'rb') as file:
    data = pickle.load(file)
From SQL
1. Creating a database engine: This is done with SQLAlchemy, using `from sqlalchemy import create_engine` and then `engine = create_engine("path to sql db connector")`.
2. Listing tables: This can be done using the `table_names()` method of the `engine` object, as in `engine.table_names()`.
3. Connecting to the engine: This is done using the `connect()` method available on every SQLAlchemy `engine` object.
4. Querying: There are two ways to query an SQL database:
   1. The first one uses all of the steps above:

      ```python
      con = engine.connect()
      results = con.execute("SELECT * FROM table_name")
      df = pd.DataFrame(results.fetchall())
      df.columns = results.keys()
      con.close()
      ```

      This syntax is very similar to the **PHP** syntax for this operation. We create a connection, then execute the query and get back a results object. We then use the `fetchall()` method to convert it to a flat structure and store it in a Pandas DataFrame. Finally, we add the column names to the DataFrame we created and close the connection.
   2. The second method is much more concise and harnesses the power of the Pandas library: `df = pd.read_sql_query("SELECT * FROM table_name", engine)`. This single line of code executes the query and returns the results in the form of a DataFrame.

- Fetch fewer rows: Sometimes the SQL query we execute returns humongous results; we can then use the `fetchmany()` method with the `size` argument on the `results` object in order to fetch a certain number of rows instead of all of them (see the sketch below).
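A minimal sketch of chunked fetching with `fetchmany()`; the SQLite path and table name are hypothetical, and the raw-string query follows the older SQLAlchemy 1.x style used in these notes:

```python
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("sqlite:///example.db")   # hypothetical SQLite database
con = engine.connect()
results = con.execute("SELECT * FROM table_name")
df = pd.DataFrame(results.fetchmany(size=100))   # only the first 100 rows
df.columns = results.keys()
con.close()
```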
Exporting Data
- csv: The `to_csv()` method of every DataFrame object allows us to export it to any file we desire. It works as `pd_df.to_csv('filename.csv')`.
- Excel: The `to_excel()` method of every DataFrame object allows us to export it to an Excel spreadsheet file. It works as `pd_df.to_excel('filename.xlsx')`.
- Numpy array: Any Pandas DataFrame can be converted into a NumPy array using the `values` attribute of the DataFrame object.
Selecting and Indexing Data
Column Selection
A column in a DataFrame `df` can easily be selected using the syntax `df["columnName"]`.
NOTE: The object returned by the code above is NOT a DataFrame but a Series. This may lead to unexpected results, so this method is not recommended.
There are two fixes for the problem mentioned above:
- Use double square brackets, as in `df[["columnName"]]`, to select the column. This returns a DataFrame instead of a Series.
- Use the `values` attribute of any Series object to retrieve the numerical values as a NumPy array. A short sketch contrasting the two follows.
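A minimal sketch illustrating the difference (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
print(type(df['a']))      # <class 'pandas.core.series.Series'>
print(type(df[['a']]))    # <class 'pandas.core.frame.DataFrame'>
print(df['a'].values)     # array([1, 2, 3]) -- the underlying NumPy values
```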
Slicing for Row Selection
It is uncommon to use plain square brackets for row selection, but it can be done: `df[1:5]` returns the rows at positions 1 through 4 (as always, the last index is not included). For alternative slicing, `df.iloc[::3, :]` selects every third row of the DataFrame (plain brackets do not accept a row-and-column tuple, so `iloc` is used here).
Loc and iLoc
These are the two most commonly used methods (of Pandas Data Frame objects) for selecting or subsetting data. The loc technique operates on labels and the iloc technique relies on integer positions.
- loc: This method allows us to select rows based on labels: `df.loc[["row_label1", "row_label2"]]` selects the rows with these two labels.

  One trick for slicing a range in reverse is `df.loc["row_label2":"row_label1":-1]`, which selects the rows from `row_label1` to `row_label2` but in reverse order.

  Note: The use of `[[ ]]` is still necessary to make sure that the returned object is indeed a Pandas DataFrame, in order to avoid inconsistencies.

  WARNING: Unlike conventional (numeric) slicing, slicing with `loc` using `'column_name_1':'column_name_2'` includes `column_name_2` in the resulting object. This is different from index-based slicing, which excludes the last index.

  This can be further extended to include only specific columns using a comma, as in `df.loc[["row_label1", "row_label2"], ["column_label1", "column_label2"]]`. This query returns only the columns with labels `column_label1` and `column_label2`.

- iloc: Everything remains the same, except that integer positions are used instead of labels (see the sketch below).
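A minimal sketch contrasting `loc` and `iloc` on hypothetical row and column labels:

```python
import pandas as pd

df = pd.DataFrame({'height': [1.6, 1.8, 1.7], 'weight': [60, 80, 70]},
                  index=['alice', 'bob', 'carol'])

print(df.loc[['alice', 'carol'], ['height']])   # label-based selection
print(df.loc['alice':'carol', :])               # label slice, 'carol' included
print(df.iloc[[0, 2], [0]])                     # the same rows/column by position
```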
Filtering
- any and all: The `any()` and `all()` methods are helpful for filtering columns that have certain properties. They are usually used in combination with the `isnull()` or `notnull()` methods.
- Drop NA: The `dropna()` method can be used on DataFrames to filter out rows with any or all NA values, depending on the argument `how='any'` or `how='all'`. A short sketch follows.
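A minimal filtering sketch on a small hypothetical frame with missing values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [1, np.nan, 3], 'b': [np.nan, np.nan, 6]})

print(df.columns[df.isnull().any()])   # columns containing at least one null
print(df.dropna(how='any'))            # keep only fully populated rows
print(df.dropna(how='all'))            # drop rows where every value is null
```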
Iterations
Columns
A basic `for` loop iterates over the column names. For instance,
for i in pd_df:
print(i)
would simply print the column names that exist in the Pandas DataFrame `pd_df`.
Rows
Rows are iterated over using the `iterrows()` method of the DataFrame we are accessing.
for lab, row in pd_df.iterrows():
print(lab)
print(row)
would then print, first the label, and then the contents of each row as a Series object.
Manipulating Dataframes
NOT COMPLETE: would need to visit more websites and research materials to complete the manipulation section.
- Adding a new column
  1. Single value (loc): The `loc` indexer can be used to add a new column to an existing DataFrame: `pd_df.loc[:, "new_column"] = 2` creates a new column named `new_column` in the `pd_df` DataFrame with the value `2` in every row.
  2. Mutation (apply): `pd_df["new_column"] = pd_df["old_column"].apply(len)` adds a new column with the length of the values in the existing `old_column`. A short sketch of both approaches follows.
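A minimal sketch of both approaches, assuming a hypothetical `old_column` of strings:

```python
import pandas as pd

pd_df = pd.DataFrame({'old_column': ['ab', 'abc', 'a']})

pd_df.loc[:, 'new_constant'] = 2                        # broadcast a single value
pd_df['old_length'] = pd_df['old_column'].apply(len)    # derive a column from another
print(pd_df)
```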
Tidying Data
- Melting DataFrames: If columns contain values instead of variables, we need to use the `melt()` function. It can be used as `pd.melt(frame=df, id_vars='identifier_column', value_vars=['column1', 'column2'], var_name='new_key_column_name', value_name='new_value_column_name')`, where `id_vars` is the column (or set of columns) that contains the ID, and `value_vars` are the columns that need to be merged.
- Pivoting data: This is the opposite of melting. It can be used to spread a column into multiple columns, as in `df.pivot(index='index_column', columns='key_column', values='value_column')`. This works fine if we are dealing with perfect data, i.e. there are no duplicates. If there are duplicates, we need to use the `pivot_table()` method, which takes one additional parameter: `df.pivot_table(index='index_column', columns='key_column', values='value_column', aggfunc=np.mean)`, where the `aggfunc` argument tells Pandas to average the duplicate numeric values. The `reset_index()` method is used on pivoted DataFrames in order to flatten them. A combined sketch appears after this list.
- Concatenating data: A list of DataFrames can be passed to the `pd.concat()` function for concatenating data. The `axis=1` argument can be used for column-wise concatenation.

  Note: We need to reset the index labels by passing the `ignore_index=True` argument to the `pd.concat()` function so that there are no duplicate indices; this avoids having to use the `reset_index()` method later.
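A minimal sketch of melting and then pivoting back, on a small hypothetical frame:

```python
import pandas as pd

# Wide data: the 'jan'/'feb' columns are really values of a 'month' variable.
wide = pd.DataFrame({'city': ['A', 'B'], 'jan': [1, 3], 'feb': [2, 4]})

tidy = pd.melt(frame=wide, id_vars='city', value_vars=['jan', 'feb'],
               var_name='month', value_name='sales')

# Pivot back to the original wide shape.
back = tidy.pivot(index='city', columns='month', values='sales').reset_index()
print(tidy)
print(back)
```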
Merging Data
We can use the Pandas equivalent of a SQL join to merge two DataFrames as follows: `pd.merge(left=left_df, right=right_df, on=None, left_on='left_df_column_name', right_on='right_df_column_name')`. If the column name is the same in both the left and right DataFrames, then only the `on` parameter needs to be specified and the `left_on`/`right_on` arguments become redundant.

There are multiple kinds of merges that can happen in Pandas (see the sketch below):
- One to one: Each key value appears only once on both sides.
- Many to one / one to many: This merge happens when there are duplicate key values in one of the tables. In this case, the value from the other table is duplicated to fill in the missing repetitions.
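A minimal sketch of a one-to-many merge on hypothetical frames:

```python
import pandas as pd

cities = pd.DataFrame({'city': ['A', 'B'], 'population': [100, 200]})
sales = pd.DataFrame({'city': ['A', 'A', 'B'], 'sales': [1, 2, 3]})

merged = pd.merge(left=sales, right=cities, on='city')
print(merged)   # the population of 'A' is repeated for each of its sales rows
```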
Data Type Cleaning
We can observe the data types of the various columns by viewing the `dtypes` attribute of the DataFrame we want to check.

- Converting data types: Data types can be converted using the `astype()` method of any column.
- Convert to categorical column: We often want to convert a column to a categorical variable; we can pass `'category'` to the `astype()` method of that column to convert it into a categorical column.
- Convert to numeric: If a column should be numeric but is not, because of mistreated data or erroneous characters, we can use the `pd.to_numeric(df['column_name'])` function and pass it the additional argument `errors='coerce'` to convert all the erroneous data to `NaN` with ease.
- Drop NA: If only a few data points have missing values, we can drop them with the `dropna()` method.
- Recode missing values: We can customize missing values using the `fillna('missing_value_placeholder')` method of every DataFrame object and its columns.
- String cleaning: The `re` library for regular expressions gives us a neat way to do string manipulations. We can compile a regular expression of the form `x = re.compile(r'\d{3}-\d{3}-\d{4}')`. This creates a regex object `x` with a `match()` method. We can pass any string to this `match()` method to match it against our regular expression; it returns a (truthy) match object if the string matches and `None` otherwise.
- Duplicate data: There may be multiple rows with redundant partial or complete information; these can be removed using the `drop_duplicates()` method of the DataFrame object. A short cleaning sketch follows this list.
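A minimal cleaning sketch on hypothetical messy data:

```python
import pandas as pd
import re

df = pd.DataFrame({'amount': ['10', 'twenty', '30'], 'grade': ['a', 'b', 'a']})

df['amount'] = pd.to_numeric(df['amount'], errors='coerce')  # 'twenty' -> NaN
df['grade'] = df['grade'].astype('category')                 # categorical column
df = df.drop_duplicates()

pattern = re.compile(r'\d{3}-\d{3}-\d{4}')
print(bool(pattern.match('555-123-4567')))                   # True
```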
- Vectorized Operations: Whenever possible, it is recommended to use vectorized computations rather than custom solutions.
  - Operating on strings: The `str` attribute of every DataFrame column that contains strings provides vectorized methods. These functions enable quick vectorized transformations on the DataFrame.
  - Map method: Sometimes the `str` or other vectorized attributes will not be available. In such cases the `map()` method can be used to express the mapping operation succinctly (see the sketch below).
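A minimal sketch of vectorized string methods and `map()` on hypothetical data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['alice', 'bob'], 'sex': ['f', 'm']})

df['name'] = df['name'].str.upper()                            # vectorized string method
df['sex_label'] = df['sex'].map({'f': 'female', 'm': 'male'})  # element-wise mapping
print(df)
```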
- Assigning Index: We can designate a column, or any other NumPy array of the same length, to be the index by assigning it to the `df.index` attribute.
  - Index name: By default the index won't have a name associated with it, but one can assign a name by setting the `df.index.name` attribute. A similar operation assigns a name to the column index using the `df.columns.name` attribute.
  - Using tuples as index: Often we need to set two or more columns as the index (much like composite keys in SQL). This can be done by passing the list of columns that should form the composite index to `set_index(["composite_key_column_1", "composite_key_column_2"])`. This is called a MultiIndex.
  - Sorting the index: If we are using a MultiIndex as shown above, we can also use the `sort_index()` method to sort the index and display it in a more organized manner. This allows for fancy indexing, i.e. calling `df.loc[(slice("index_1_low", "index_1_high"), "index_2"), :]` selects all the columns for rows in the given range of the first index level and belonging to `index_2` at the second level. The `slice()` function must be used when slicing both index levels.
  - Stacking and unstacking the index: We might want to move one of the levels of a MultiIndex out to become columns. To do this, we use the `unstack()` method with `level="index_name_to_remove"`. This gives us a hierarchical data frame, and the effect can be reversed using the `stack()` method in the same format.
  - Swapping index levels: The index levels can be swapped using the `swaplevel(0, 1)` method on any DataFrame. This exchanges the hierarchical arrangement of the indices, and running `sort_index()` right after it does the rest. A MultiIndex sketch follows this list.
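A minimal MultiIndex sketch on hypothetical sales data:

```python
import pandas as pd

df = pd.DataFrame({'state': ['CA', 'CA', 'TX', 'TX'],
                   'month': [1, 2, 1, 2],
                   'sales': [10, 11, 20, 22]})

df = df.set_index(['state', 'month']).sort_index()
print(df.loc[(slice('CA', 'TX'), 2), :])   # month 2 for every state in the range
print(df.unstack(level='month'))           # move 'month' out into columns
print(df.swaplevel(0, 1).sort_index())     # 'month' becomes the outer level
```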
- Aggregation/Reduction: The `groupby()` method is the Pandas equivalent of R's `aggregate()`. It allows us to create virtual groups within the DataFrame. It is usually chained with aggregation functions like `sum()`, `count()`, `unique()` etc. to produce meaningful results. A typical grouping operation looks like `titanic.groupby(['pclass', 'sex'])['survived'].count()` (see the sketch below). There is also the option of computing multiple aggregation statistics on the grouped DataFrame:
  1. Multiple aggregations: We can use `titanic.groupby('pclass').agg(['mean', 'sum'])` to compute multiple aggregation values at once.
  2. Custom aggregations: We can pass custom functions as arguments to the `agg()` method. They take `Series` objects as inputs and produce results from them; when used, they receive multiple `Series` objects (one for each group) and produce grouped results like the built-in functions.
  3. Different aggregations on different columns: We can pass a dictionary to the `agg()` method, with column names as keys and the corresponding aggregation functions as values. This lets us compute different statistics for different columns under the same grouping.
- Transformation: Transformation functions are used to transform one or more columns after they have been grouped, and are usually chained after the `groupby()` method as `transform(transformation_function)`. The transform method passes each group's Series to `transformation_function()`, which can be a user-defined or built-in function, and which returns a transformed series of conforming size.
- Grouping and Filtering: We can loop over the dictionary-like object created by the `groupby()` method and thereby keep only the rows of interest.
- Sorting: We can sort the values in any column using the `sort_values(ascending=False)` method available on the columns of all DataFrame objects.
- Matrix Operations: Direct matrix operations will not work on DataFrames, but each `pandas.core.frame.DataFrame` object has an `as_matrix()` method for converting it into a NumPy 2D array (newer Pandas versions use the `values` attribute or `to_numpy()` instead). This only works for DataFrames containing only numerical values. There is a crucial difference between NumPy 2D arrays and Pandas DataFrames: `X[0]` on a NumPy 2D array returns the 0th row, while on a DataFrame `X[0]` returns the column labelled 0.
- Mathematical Operations: Various mathematical operations are available:
  - pct_change(): This method can be used to compute the percentage change over a particular column or aggregated values.
  - add(): This method can be used to add two Series with corresponding row indices, as in `a.add(b)`, which adds the series `a` and `b`. However, if there are non-matching indices, i.e. an index in `a` has no corresponding index in `b`, the result there is `NaN`. We can change the default value used for missing entries by passing the `fill_value` argument to the `add()` method. The method is chainable, so more than one Series can be added in a single line.
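A minimal sketch of grouping, aggregation and transformation on a small hypothetical frame (the column names loosely follow the titanic example above):

```python
import pandas as pd

df = pd.DataFrame({'pclass': [1, 1, 2, 2],
                   'sex': ['f', 'm', 'f', 'm'],
                   'fare': [80.0, 60.0, 20.0, 15.0],
                   'survived': [1, 0, 1, 0]})

print(df.groupby(['pclass', 'sex'])['survived'].count())
print(df.groupby('pclass').agg({'fare': 'mean', 'survived': 'sum'}))

# Transformation: z-score of fare within each class.
df['fare_z'] = df.groupby('pclass')['fare'].transform(
    lambda s: (s - s.mean()) / s.std())
print(df)
```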
Exploring Data
- Dimensions: The `shape` attribute of any DataFrame can be used to check its dimensions.
- Column Names: The `columns` attribute of a DataFrame returns the names of all the columns.
- Indices: The attributes `columns` and `index` can be used to retrieve the column and row indices of a DataFrame.
- Column Details: Much like the `str` function in R, the `info()` method can be used on any Pandas DataFrame to retrieve meaningful insight into the columns. It returns the name of each column and the number of non-null values present in it.
- Statistical Summary: Statistical summaries for Pandas DataFrames can be quickly generated using the `describe()` method.
- Interquartile range (IQR): Quantile ranges are useful when exploring a dataset and can easily be determined using the `quantile()` method of Pandas DataFrames. For instance, `pd_df.quantile([0.25, 0.75])` returns two values for each column, and half the data for each column lies between those two values.
- Range: The range can be calculated using the `min()` and `max()` methods on any DataFrame.
- Median: The `median()` method can be used to find the median of any given dataset.
- Standard deviation: The `std()` method can be used to find the standard deviation of any given column.
- Unique values: Unique categories in any categorical column can be found using the `unique()` method.
- Frequency Count: The frequency of the factors in a categorical column can be found using the `value_counts()` method on that column. Optionally, the `dropna` argument takes a Boolean specifying whether or not to include null values.
- Data Type: We can explore the data type of any column by looking at the `dtypes` attribute of the DataFrame.
- Index of Max: The `idxmax()` and `idxmin()` methods return the row labels (or, with `axis='columns'`, the column labels) where the maximum or minimum values are located; by default they operate along the index, so nothing needs to be specified for row labels.
- Indices of Non-Null Values: One can get the indices of non-null values using the `notnull()` method available on Pandas Series objects. A short exploration sketch follows this list.
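A minimal exploration sketch on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'age': [22, 35, 58, 41], 'sex': ['m', 'f', 'f', 'm']})

print(df.shape, list(df.columns))
df.info()
print(df.describe())
print(df['age'].quantile([0.25, 0.75]))
print(df['sex'].value_counts(dropna=False))
print(df['age'].idxmax())          # row label of the maximum age
```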
Time Series with Pandas
- Slicing on the basis of time: Interesting selections can be done on datetime indices using selection operators like `pd_df.loc['2017-01-22':'2018-01-02']` to select the values in that date range.
- Reindexing: Any time series can be reindexed using the index of another time series with the syntax `time_s2.reindex(time_s1.index)`.
  - Forward Filling of Null Values: Null values in a time series can be forward filled using `time_series.reindex(new_index, method="ffill")`.
- Resampling: There are two kinds of resampling that can be done with time series in Pandas DataFrames:
  - Downsampling: Reduce datetimes to a lower frequency, e.g. from daily to weekly.
  - Upsampling: Increase datetimes to a higher frequency, e.g. from daily to hourly.

  There are multiple parameters we can pass to the `resample()` method to derive quick statistics from a time series. For instance, `pd_ts.resample('D').mean()` gives daily averages of all numeric columns of the time series DataFrame.

  | Code | Meaning |
  | --- | --- |
  | `min`, `T` | Minute |
  | `H` | Hour |
  | `D` | Day |
  | `B` | Business day |
  | `W` | Week |
  | `M` | Month |
  | `Q` | Quarter |
  | `Y` | Year |

  Numeric values can be used as prefixes to the codes above in order to increase or decrease the sampling size. For instance, `2W` can be used for downsampling to two-week periods, as in `pd_ts.resample('2W').mean()`.

  The `ffill()` and `bfill()` methods can be used for filling in rolling values, as in `pd_ts.resample('2H').ffill()`.

  Interpolation: It can be carried out as follows: `pd_ts.resample('A').first().interpolate('linear')`. Resampling fills the gaps between any two data points with NaN values; the `interpolate()` method then interpolates those values.
method would interpolate the values. -
Datetime Methods: These methods allow us to slice and select specific date time attributes from the data.
- Select hours: For instance
pd_ts['date'].dt.hour
would extract and return the hour where 0 is 12AM and 23 is 12PM. - Timezone: We can define the timezone for a particular column by using
pd_ts['date'].dt.tz_localize('US/Central')
. On the other hand, easy conversions can be made, using the methodtz_convert()
that is specifically used for converting dates and times in one timezone to another.
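A minimal sketch of time-series slicing and resampling on a small, hypothetical hourly series:

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=48, freq='H')
ts = pd.DataFrame({'value': range(48)}, index=idx)

print(ts.loc['2017-01-01':'2017-01-01'])   # slice by date label
print(ts.resample('D').mean())             # downsample to daily averages
print(ts.resample('30T').ffill().head())   # upsample to 30 minutes, forward fill
print(ts.index.hour[:5])                   # datetime attribute on the index
```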
Moving Averages
We can calculate moving averages, which allow us to focus on long-term trends instead of getting stuck in short-term fluctuations. We do so using the `rolling()` method, as in `time_series.rolling(window=24).mean()`. The data remains hourly, but the value at each point becomes the average of the trailing 24 data points, making the series smoother.
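A minimal sketch of a rolling mean on a hypothetical hourly series:

```python
import pandas as pd

idx = pd.date_range('2017-01-01', periods=72, freq='H')
ts = pd.Series(range(72), index=idx)

smooth = ts.rolling(window=24).mean()   # the first 23 values are NaN
print(smooth.tail())
```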
"""
for key in keys['order']:
key_weight = keys['weights'][key] # Weight of the key
key_data = keys['data'][key]
if key_data['type'] == "singular":
print("Jacuzzi")
else:
for section in key_data['order']:
section_weight = key_data['weights'][section]
section_data = key_data['data'][section]
for unit in section_data:
user_df[key+'_'+section+'_'+unit] = user_data_raw[section_data[unit]['user']].notnull().astype(int)
print(user_df.columns)
"""