The purpose of this workbook is to serve as a quick reference guide, a list of recipes for performing common data science tasks efficiently.
- Best practices
- Compute correlation matrix
- Deal with dates
- Deal with NAs
- Deal with constant / quasi-constant variables
- Load data from multiple files and concatenate
- Force Python to reload a module
- Save a dataframe to .csv
There are often several ways of performing a given task in Python, such as iterating over a list. Below I tried to list some common tasks with the corresponding good / best practice that I found after experimenting and crawling Stack Overflow and co.
The idiomatic way is to use items() to iterate across the dictionary.
d = {'First Name': 'John', 'Last Name': 'Doe'}
for key, val in d.items():
    print('Key ' + key + ' has value ' + val)
Key First Name has value John
Key Last Name has value Doe
Here the trick is to use enumerate rather than creating an index variable that we would manually increment. enumerate makes things smooth:
items = ['a', 'b', 'c']
for index, item in enumerate(items, start=0):  # default is zero
    print(index, item)
0 a
1 b
2 c
It can be cumbersome to get the list of the most correlated pairs of variables in a data set. Here is an example of how to do it quite smoothly.
- We first create a toy dataset with 10 features, 5 of which are redundant (hence correlated), to play with
- Then, the correlation matrix is computed using the pandas.DataFrame.corr() command
- To extract the relevant part of the matrix, a boolean mask is created with the numpy.triu() command
- Finally, the matrix is converted to a Pandas Series with a multi-index using the pandas.DataFrame.stack() command
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np
X, y = make_classification(n_features=10, n_informative=3, n_redundant=5, n_classes=2,
n_clusters_per_class=2)
col_names = ['feature_' + str(i) for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=col_names)
# compute correlation matrix
cor_matrix = X.corr()
cor_matrix.head()
| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 |
|---|---|---|---|---|---|---|---|---|---|---|
feature_0 | 1.000000 | -0.823849 | 0.973042 | -0.949994 | -0.896065 | 0.764993 | -0.160190 | 0.170617 | -0.032467 | 0.198784 |
feature_1 | -0.823849 | 1.000000 | -0.689932 | 0.812806 | 0.625704 | -0.975139 | 0.105197 | -0.631158 | 0.000368 | -0.579045 |
feature_2 | 0.973042 | -0.689932 | 1.000000 | -0.877041 | -0.958594 | 0.649724 | -0.177422 | 0.051843 | -0.046402 | -0.028937 |
feature_3 | -0.949994 | 0.812806 | -0.877041 | 1.000000 | 0.718516 | -0.694186 | 0.118221 | -0.063240 | 0.013887 | -0.428114 |
feature_4 | -0.896065 | 0.625704 | -0.958594 | 0.718516 | 1.000000 | -0.648473 | 0.193494 | -0.168040 | 0.057019 | 0.225775 |
Then we want to extract the upper part of the matrix (because the correlation matrix is symmetric). To do so, we generate a boolean mask array from an upper triangular matrix.
mask = np.triu(np.ones(cor_matrix.shape), k=1).astype(bool)
print('The following mask has been generated: \n')
print(mask)
The following mask has been generated:
[[False True True True True True True True True True]
[False False True True True True True True True True]
[False False False True True True True True True True]
[False False False False True True True True True True]
[False False False False False True True True True True]
[False False False False False False True True True True]
[False False False False False False False True True True]
[False False False False False False False False True True]
[False False False False False False False False False True]
[False False False False False False False False False False]]
When applied to our correlation matrix, it will only keep the upper part, excluding the diagonal. We use .abs() at the end because we are interested in both positively and negatively correlated variables.
upper_cor_matrix = cor_matrix.where(mask).abs()
upper_cor_matrix
| | feature_0 | feature_1 | feature_2 | feature_3 | feature_4 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 |
|---|---|---|---|---|---|---|---|---|---|---|
feature_0 | NaN | 0.823849 | 0.973042 | 0.949994 | 0.896065 | 0.764993 | 0.160190 | 0.170617 | 0.032467 | 0.198784 |
feature_1 | NaN | NaN | 0.689932 | 0.812806 | 0.625704 | 0.975139 | 0.105197 | 0.631158 | 0.000368 | 0.579045 |
feature_2 | NaN | NaN | NaN | 0.877041 | 0.958594 | 0.649724 | 0.177422 | 0.051843 | 0.046402 | 0.028937 |
feature_3 | NaN | NaN | NaN | NaN | 0.718516 | 0.694186 | 0.118221 | 0.063240 | 0.013887 | 0.428114 |
feature_4 | NaN | NaN | NaN | NaN | NaN | 0.648473 | 0.193494 | 0.168040 | 0.057019 | 0.225775 |
feature_5 | NaN | NaN | NaN | NaN | NaN | NaN | 0.115363 | 0.756552 | 0.006447 | 0.460539 |
feature_6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.034566 | 0.064570 | 0.069284 |
feature_7 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.012584 | 0.361545 |
feature_8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.058279 |
feature_9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
To make it easier to use, the columns are stacked into rows, resulting in a multi-index Pandas Series:
cor_series = upper_cor_matrix.stack().sort_values(ascending=False)
print('Display the most-correlated pairs:')
cor_series[cor_series > 0.6]
Display the most-correlated pairs:
feature_1 feature_5 0.975139
feature_0 feature_2 0.973042
feature_2 feature_4 0.958594
feature_0 feature_3 0.949994
feature_4 0.896065
feature_2 feature_3 0.877041
feature_0 feature_1 0.823849
feature_1 feature_3 0.812806
feature_0 feature_5 0.764993
feature_5 feature_7 0.756552
feature_3 feature_4 0.718516
feature_5 0.694186
feature_1 feature_2 0.689932
feature_2 feature_5 0.649724
feature_4 feature_5 0.648473
feature_1 feature_7 0.631158
feature_4 0.625704
dtype: float64
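The steps above can be wrapped into a small reusable helper. This is a sketch of my own; the function name top_correlated_pairs and the toy frame are illustrative choices, not part of the original notebook:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df, threshold=0.6):
    """Return column pairs whose absolute correlation exceeds threshold,
    sorted in decreasing order (illustrative helper)."""
    cor_matrix = df.corr()
    # boolean mask selecting the strict upper triangle (k=1 excludes the diagonal)
    mask = np.triu(np.ones(cor_matrix.shape), k=1).astype(bool)
    cor_series = cor_matrix.where(mask).abs().stack().sort_values(ascending=False)
    return cor_series[cor_series > threshold]

# usage: 'b' is exactly twice 'a', so only that pair should exceed the threshold
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8], 'c': [1, 0, 1, 0]})
print(top_correlated_pairs(df))
```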
The common operations involve:
- converting a string to a datetime object, or the reverse operation
- changing the format of the displayed date, like removing day, month and year information to keep only time information
- computing date / time differences
import datetime
my_date = datetime.datetime.strptime('2012-07-22 16:19:00.539570', '%Y-%m-%d %H:%M:%S.%f')
print('The created object has the following type: %s' % type(my_date))
The created object has the following type: <class 'datetime.datetime'>
When loading data from a .csv file, it may be handy to take care of the date column at the same time, rather than modifying the date column in another command. To do so, one may define a date parser function that will be applied to the date column when reading the csv.
To demonstrate this, we generate a fake .csv file with a date column. The date will be written as '15', '16', '17' for 2015, 2016 and 2017 respectively.
import pandas as pd
from datetime import datetime
df = pd.DataFrame(['15', '16', '17'], columns = ['date'])
print(df.head())
# save dataframe
df.to_csv('./data/tmp.csv', index=False)
date
0 15
1 16
2 17
Let's now demonstrate how to define a date parser and load the csv properly.
def date_parser(x):
    return datetime.strptime('20' + x, '%Y')
pd.read_csv('./data/tmp.csv', parse_dates = [0], date_parser=date_parser)
| | date |
|---|---|
| 0 | 2015-01-01 |
| 1 | 2016-01-01 |
| 2 | 2017-01-01 |
To format a datetime object to a string, the function to use is strftime(). The list of date format codes can be found in the Python documentation.
import datetime
print('Output current time as a formatted string:')
datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
Output current time as a formatted string:
'2018-01-05 11:27:51.252837'
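The last operation mentioned above, computing date / time differences, can be sketched as follows: subtracting two datetime objects yields a timedelta. The two timestamps below are arbitrary values chosen for the example:

```python
import datetime

# two arbitrary timestamps, 27 minutes and 51 seconds apart
start = datetime.datetime(2018, 1, 5, 11, 0, 0)
end = datetime.datetime(2018, 1, 5, 11, 27, 51)

delta = end - start  # subtracting two datetimes yields a timedelta
print(delta)                  # 0:27:51
print(delta.total_seconds())  # 1671.0
```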
Dealing with missing values may involve:
- assessing the dataset to find the rows / columns containing missing values
- removing the records containing missing values
- filling the missing values with an alternate value
Pandas can look for NA column-wise or row-wise, drop a label if any or all values are NA and use a threshold for deletion.
A lambda function combined with Pandas.DataFrame.apply() may be used to spot NAs and then take some action. Things to remember:
- if x is a DataFrame, then x.isnull() returns a boolean same-sized object indicating if the values are NA
- applying either all() or any() on x.isnull() checks whether all or any of the values in the DataFrame are NA
Let's take a toy example:
import pandas as pd
import numpy as np
df = pd.DataFrame([[0, np.nan, 1], [np.nan, np.nan, 2]], columns = ['x1', 'x2', 'x3'])
print(df)
# create a boolean mask with columns containing only NAs
cols = df.apply(lambda x: all(x.isnull()), axis=0)
print('\nThe following column(s) contain only NAs:')
print(df.columns[cols].values)
print('\n')
cols = df.apply(lambda x: any(x.isnull()), axis=0)
print('The following column(s) contain at least one NA')
print(df.columns[cols].values)
x1 x2 x3
0 0.0 NaN 1
1 NaN NaN 2
The following column(s) contain only NAs:
['x2']
The following column(s) contain at least one NA
['x1' 'x2']
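The removal and filling steps mentioned at the start of this section can be sketched with dropna() and fillna(). The axis, how, thresh and fill values below are illustrative choices on the same toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[0, np.nan, 1], [np.nan, np.nan, 2]], columns=['x1', 'x2', 'x3'])

# drop columns where every value is NA (removes x2)
print(df.dropna(axis=1, how='all'))

# keep only columns with at least 2 non-NA values (keeps x3 only)
print(df.dropna(axis=1, thresh=2))

# fill the remaining NAs with a constant
print(df.fillna(0))
```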
Once a module has been loaded using import module_name
, running this same command again will not reload the module.
Say you are making changes to a module and testing the result interactively in a Python shell. If you have already loaded the module and want to pick up the new changes, you have to use:
import importlib
importlib.reload(module_name)
Python's glob.glob() command is used to find all the files in a directory matching a path pattern (be careful, the files are returned in arbitrary order). The map() command is then applied to the file list to read the csv files via a lambda function. The result is finally concatenated using the Pandas.concat() function.
import glob
import pandas as pd
files = glob.glob('./data/file*.csv')
print('List of files found:')
print(files)
df1 = pd.concat(map(lambda file: pd.read_csv(file, sep=','), files))
df1
List of files found:
['./data/file1.csv', './data/file2.csv']
| | id | name |
|---|---|---|
0 | 1 | john doe |
1 | 2 | donald trump |
0 | 3 | bill clinton |
1 | 4 | hillary clinton |
Full documentation on the Pandas.DataFrame.to_csv()
command may be found here: official documentation
Ok this one may seem obvious, but if you want your Pandas.DataFrame to be easily readable afterwards, you have to take care:
- if your dataframe does not have an index, I suggest that you pass index=False to Pandas.DataFrame.to_csv() so that row numbers are not saved in the .csv file. This will prevent any trouble from occurring when loading the file again
- if your dataframe already has an index that you want to keep, the default value index=True in Pandas.DataFrame.to_csv() will work fine
- if you don't have an index yet but want to save the dataframe and use an existing column as the index for future reading, then you can specify index=True AND index_label='my_future_index_column'
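The last case above can be sketched as a save-and-reload round trip. The temporary file path and the 'id' index label are placeholder choices for this example:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'name': ['john doe', 'donald trump']}, index=[1, 2])

# write to a temporary location so the example leaves no files behind
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, 'with_index.csv')

# save the index under an explicit column name so it survives the round trip
df.to_csv(path, index=True, index_label='id')

# reload, restoring the saved column as the index
reloaded = pd.read_csv(path, index_col='id')
print(reloaded)
```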