The purpose of this workbook is to serve as a quick reference guide, a list of recipes for performing common data science tasks efficiently.
- Best practices
- Compute correlation matrix
- Deal with dates
- Deal with NAs
- Deal with constant / quasi-constant variables
- Load data from multiple files and concatenate
- Force Python to reload a module
- Save a dataframe to .csv
There are often several ways of performing a given task in Python, such as iterating over a list. Below I tried to list some common tasks with the corresponding good / best practice that I found after experimenting and crawling Stack Overflow and co.
The idiomatic way is to use items() to iterate across the dictionary.
d = {'First Name': 'John', 'Last Name': 'Doe'}
for key, val in d.items():
    print('Key ' + key + ' has value ' + val)
Key First Name has value John
Key Last Name has value Doe
Here the trick is to use enumerate rather than creating an index variable that we would manually increment. enumerate makes things smooth:
items = ['a', 'b', 'c']
for index, item in enumerate(items, start=0):  # default is zero
    print(index, item)
0 a
1 b
2 c
It can be cumbersome to get the list of the most correlated pairs of variables in a data set. Here is an example of how to do it quite smoothly.
- We first create a toy dataset with 10 features, 5 of which are redundant (hence correlated), to play with
- Then, the correlation matrix is computed using the pandas.DataFrame.corr() command
- To extract the relevant part of the matrix, a boolean mask is created with the numpy.triu() command
- Finally, the matrix is converted to a Pandas Series with a multi-index using the pandas.DataFrame.stack() command
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
X, y = make_classification(n_features=10, n_informative=3, n_redundant=5, n_classes=2,
n_clusters_per_class=2)
col_names = ['feature_' + str(i) for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=col_names)
# compute correlation matrix
cor_matrix = X.corr()
cor_matrix.head()
| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 |
|---|---|---|---|---|---|---|---|---|---|---|
feature_0 | 1.000000 | -0.823849 | 0.973042 | -0.949994 | -0.896065 | 0.764993 | -0.160190 | 0.170617 | -0.032467 | 0.198784 |
feature_1 | -0.823849 | 1.000000 | -0.689932 | 0.812806 | 0.625704 | -0.975139 | 0.105197 | -0.631158 | 0.000368 | -0.579045 |
feature_2 | 0.973042 | -0.689932 | 1.000000 | -0.877041 | -0.958594 | 0.649724 | -0.177422 | 0.051843 | -0.046402 | -0.028937 |
feature_3 | -0.949994 | 0.812806 | -0.877041 | 1.000000 | 0.718516 | -0.694186 | 0.118221 | -0.063240 | 0.013887 | -0.428114 |
feature_4 | -0.896065 | 0.625704 | -0.958594 | 0.718516 | 1.000000 | -0.648473 | 0.193494 | -0.168040 | 0.057019 | 0.225775 |
Then we want to extract the upper part of the matrix (because the correlation matrix is symmetric). To do so, we generate a boolean mask array from an upper triangular matrix.
mask = np.triu(np.ones(cor_matrix.shape), k=1).astype(bool)
print('The following mask has been generated: \n')
print(mask)
The following mask has been generated:
[[False True True True True True True True True True]
[False False True True True True True True True True]
[False False False True True True True True True True]
[False False False False True True True True True True]
[False False False False False True True True True True]
[False False False False False False True True True True]
[False False False False False False False True True True]
[False False False False False False False False True True]
[False False False False False False False False False True]
[False False False False False False False False False False]]
When applied to our correlation matrix, it will only keep the upper part, excluding the diagonal. We use .abs() at the end because we are interested in both positively and negatively correlated variables.
upper_cor_matrix = cor_matrix.where(mask).abs()
upper_cor_matrix
| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 |
|---|---|---|---|---|---|---|---|---|---|---|
feature_0 | NaN | 0.823849 | 0.973042 | 0.949994 | 0.896065 | 0.764993 | 0.160190 | 0.170617 | 0.032467 | 0.198784 |
feature_1 | NaN | NaN | 0.689932 | 0.812806 | 0.625704 | 0.975139 | 0.105197 | 0.631158 | 0.000368 | 0.579045 |
feature_2 | NaN | NaN | NaN | 0.877041 | 0.958594 | 0.649724 | 0.177422 | 0.051843 | 0.046402 | 0.028937 |
feature_3 | NaN | NaN | NaN | NaN | 0.718516 | 0.694186 | 0.118221 | 0.063240 | 0.013887 | 0.428114 |
feature_4 | NaN | NaN | NaN | NaN | NaN | 0.648473 | 0.193494 | 0.168040 | 0.057019 | 0.225775 |
feature_5 | NaN | NaN | NaN | NaN | NaN | NaN | 0.115363 | 0.756552 | 0.006447 | 0.460539 |
feature_6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.034566 | 0.064570 | 0.069284 |
feature_7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.012584 | 0.361545 |
feature_8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.058279 |
feature_9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
To make it easier to use, the columns are stacked into rows, resulting in a multi-index Pandas Series:
cor_series = upper_cor_matrix.stack().sort_values(ascending=False)
print('Display the most-correlated pairs:')
cor_series[cor_series > 0.6]
Display the most-correlated pairs:
feature_1 feature_5 0.975139
feature_0 feature_2 0.973042
feature_2 feature_4 0.958594
feature_0 feature_3 0.949994
feature_4 0.896065
feature_2 feature_3 0.877041
feature_0 feature_1 0.823849
feature_1 feature_3 0.812806
feature_0 feature_5 0.764993
feature_5 feature_7 0.756552
feature_3 feature_4 0.718516
feature_5 0.694186
feature_1 feature_2 0.689932
feature_2 feature_5 0.649724
feature_4 feature_5 0.648473
feature_1 feature_7 0.631158
feature_4 0.625704
dtype: float64
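The steps above can be wrapped into a small reusable helper. This is a sketch of my own; the function name top_correlated_pairs and the toy frame are illustrative choices, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.6):
    """Return column pairs whose absolute correlation exceeds threshold,
    sorted in decreasing order (illustrative helper)."""
    cor_matrix = df.corr()
    # boolean mask selecting the strict upper triangle (k=1 excludes the diagonal)
    mask = np.triu(np.ones(cor_matrix.shape), k=1).astype(bool)
    cor_series = cor_matrix.where(mask).abs().stack().sort_values(ascending=False)
    return cor_series[cor_series > threshold]

# usage: 'b' is exactly twice 'a', so only that pair should exceed the threshold
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 1, 0]})
print(top_correlated_pairs(df))
```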
The common operations involve:
- converting a string to a datetime object, or the reverse operation
- changing the format of the displayed date, like removing day, month and year information to keep only time information
- computing date / time differences
import datetime
my_date = datetime.datetime.strptime('2012-07-22 16:19:00.539570', '%Y-%m-%d %H:%M:%S.%f')
print('The created object has the following type: %s' % type(my_date))
The created object has the following type: <class 'datetime.datetime'>
When loading data from a .csv file, it may be handy to take care of the date column at the same time, rather than modifying the date column in another command. To do so, one may define a date parser function that will be applied to the date column when reading the csv.
To demonstrate this, we generate a fake .csv file with a date column. The date will be written as '15', '16', '17' for 2015, 2016 and 2017 respectively.
import pandas as pd
from datetime import datetime
df = pd.DataFrame(['15', '16', '17'], columns = ['date'])
print(df.head())
# save dataframe
df.to_csv('./data/tmp.csv', index=False)
date
0 15
1 16
2 17
Let's now demonstrate how to define a date parser and load the csv properly.
def date_parser(x):
    return datetime.strptime('20' + x, '%Y')
pd.read_csv('./data/tmp.csv', parse_dates = [0], date_parser=date_parser)
| | date |
|---|---|
| 0 | 2015-01-01 |
| 1 | 2016-01-01 |
| 2 | 2017-01-01 |
To format a datetime object to a string, the function to use is strftime(). The list of date format codes can be found in the Python documentation.
import datetime
print('Output current time as a formatted string:')
datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
Output current time as a formatted string:
'2018-01-05 11:27:51.252837'
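The last operation mentioned above, computing date / time differences, can be sketched as follows: subtracting two datetime objects yields a timedelta. The two timestamps below are arbitrary values chosen for the example:

```python
import datetime

# two arbitrary timestamps, 27 minutes and 51 seconds apart
start = datetime.datetime(2018, 1, 5, 11, 0, 0)
end = datetime.datetime(2018, 1, 5, 11, 27, 51)

delta = end - start  # subtracting two datetimes yields a timedelta
print(delta)                  # 0:27:51
print(delta.total_seconds())  # 1671.0
```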
Dealing with missing values may involve:
- assessing the dataset to find the rows / columns containing missing values
- removing the records containing missing values
- filling the missing values with an alternate value
Pandas can look for NA column-wise or row-wise, drop a label if any or all values are NA and use a threshold for deletion.
A lambda function combined with Pandas.DataFrame.apply() may be used to spot NAs and then take some action. Things to remember:
- if x is a DataFrame, then x.isnull() returns a boolean same-sized object indicating if the values are NA
- applying either all() or any() on x.isnull() checks whether all or any of the values in the DataFrame are NA
Let's take a toy example:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0, np.nan, 1], [np.nan, np.nan, 2]], columns = ['x1', 'x2', 'x3'])
print(df)
# create a boolean mask with columns containing only NAs
cols = df.apply(lambda x: all(x.isnull()), axis=0)
print('\nThe following column(s) contain only NAs:')
print(df.columns[cols].values)
print('\n')
cols = df.apply(lambda x: any(x.isnull()), axis=0)
print('The following column(s) contain at least one NA')
print(df.columns[cols].values)
x1 x2 x3
0 0.0 NaN 1
1 NaN NaN 2
The following column(s) contain only NAs:
['x2']
The following column(s) contain at least one NA
['x1' 'x2']
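The removal and filling steps mentioned at the start of this section can be sketched with dropna() and fillna(). The axis, how, thresh and fill values below are illustrative choices on the same toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[0, np.nan, 1], [np.nan, np.nan, 2]], columns=['x1', 'x2', 'x3'])

# drop columns where every value is NA (removes x2)
print(df.dropna(axis=1, how='all'))

# keep only columns with at least 2 non-NA values (keeps x3 only)
print(df.dropna(axis=1, thresh=2))

# fill the remaining NAs with a constant
print(df.fillna(0))
```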
Once a module has been loaded using import module_name
, running this same command again will not reload the module.
Say you are making changes to a module and testing the result interactively in a Python shell. If you have already loaded the module and want to pick up the new changes, you have to use:
import importlib
importlib.reload(module_name)
Python's glob.glob() command is used to find all the files in a directory matching a path pattern (be careful, the files are returned in arbitrary order). The map() command is then applied to the file list to read the csv files via a lambda function. The result is finally concatenated using the Pandas.concat() function.
import glob
import pandas as pd
files = glob.glob('./data/file*.csv')
print('List of files found:')
print(files)
df1 = pd.concat(map(lambda file: pd.read_csv(file, sep=','), files))
df1
List of files found:
['./data/file1.csv', './data/file2.csv']
| | id | name |
|---|---|---|
0 | 1 | john doe |
1 | 2 | donald trump |
0 | 3 | bill clinton |
1 | 4 | hillary clinton |
Full documentation on the Pandas.DataFrame.to_csv()
command may be found here: official documentation
Ok this one may seem obvious, but if you want your Pandas.DataFrame to be easily readable afterwards, you have to take care:
- if your dataframe does not have an index, I suggest that you pass index=False to Pandas.DataFrame.to_csv() so that row numbers are not saved in the .csv file. This will prevent any trouble from occurring when loading the file again
- if your dataframe already has an index that you want to keep, the default value index=True in Pandas.DataFrame.to_csv() will work fine
- if you don't have an index yet but want to save the dataframe and use an existing column as the index for future reading, then you can specify index=True AND index_label='my_future_index_column'
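The last case above can be sketched as a save-and-reload round trip. The temporary file path and the 'id' index label are placeholder choices for this example:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'name': ['john doe', 'donald trump']}, index=[1, 2])

# write to a temporary location so the example leaves no files behind
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'with_index.csv')

# save the index under an explicit column name so it survives the round trip
df.to_csv(path, index=True, index_label='id')

# reload, restoring the saved column as the index
reloaded = pd.read_csv(path, index_col='id')
print(reloaded)
```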