Python Data Science Recipes

Saving time in Python for common Data Science tasks

The purpose of this workbook is to act as a quick reference guide / list of recipes to perform common data science tasks in an efficient way.

Table of contents

  • Best practices
  • Compute correlation matrix
  • Deal with constant and quasi-constant variables
  • Deal with dates
  • Deal with NAs
  • Force Python to reload a module
  • Load data from multiple files and concatenate
  • Save a dataframe to .csv file

Best practices

There are often several ways of performing a given task in Python, such as iterating over a list. Below I list some common tasks together with the corresponding good / best practice that I settled on after experimenting and crawling Stack Overflow and the like.

> Iterate on a dictionary

The idiomatic way is to use items() to iterate over the dictionary's key / value pairs.

d = {'First Name': 'John', 'Last Name': 'Doe'}
for key, val in d.items():
    print('Key ' + key + ' has value ' + val)
Key First Name has value John
Key Last Name has value Doe

> Iterate on a list and access the index

Here the trick is to use enumerate rather than creating an index variable that we would increment manually. enumerate makes things smooth:

items = ['a', 'b', 'c']
for index, item in enumerate(items, start=0):   # default is zero
    print(index, item)
0 a
1 b
2 c

Compute correlation matrix

It can be cumbersome to get the list of the most correlated pairs of variables in a data set. Here is a smooth way to do it:

  • We first create a toy dataset with 10 features, 5 of which are redundant (and thus correlated with the informative ones), to play with
  • Then, the correlation matrix is computed using the pandas.DataFrame.corr() command
  • To extract the relevant part of the matrix, a boolean mask is created with the numpy.triu() command
  • Finally, the matrix is converted to a Pandas Series with a multi-index using the pandas.DataFrame.stack() command
from sklearn.datasets import make_classification
import pandas as pd
import numpy as np

X, y = make_classification(n_features=10, n_informative=3, n_redundant=5, n_classes=2,
    n_clusters_per_class=2)

col_names = ['feature_' + str(i) for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=col_names)
# compute correlation matrix
cor_matrix = X.corr()
cor_matrix.head()
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9
feature_0 1.000000 -0.823849 0.973042 -0.949994 -0.896065 0.764993 -0.160190 0.170617 -0.032467 0.198784
feature_1 -0.823849 1.000000 -0.689932 0.812806 0.625704 -0.975139 0.105197 -0.631158 0.000368 -0.579045
feature_2 0.973042 -0.689932 1.000000 -0.877041 -0.958594 0.649724 -0.177422 0.051843 -0.046402 -0.028937
feature_3 -0.949994 0.812806 -0.877041 1.000000 0.718516 -0.694186 0.118221 -0.063240 0.013887 -0.428114
feature_4 -0.896065 0.625704 -0.958594 0.718516 1.000000 -0.648473 0.193494 -0.168040 0.057019 0.225775

Then we want to extract the upper part of the matrix (because the correlation matrix is symmetric). To do so, we generate a boolean mask array from an upper triangular matrix.

mask = np.triu(np.ones(cor_matrix.shape), k=1).astype(bool)   # np.bool is deprecated, use the builtin bool
print('The following mask has been generated: \n')
print(mask)
The following mask has been generated: 

[[False  True  True  True  True  True  True  True  True  True]
 [False False  True  True  True  True  True  True  True  True]
 [False False False  True  True  True  True  True  True  True]
 [False False False False  True  True  True  True  True  True]
 [False False False False False  True  True  True  True  True]
 [False False False False False False  True  True  True  True]
 [False False False False False False False  True  True  True]
 [False False False False False False False False  True  True]
 [False False False False False False False False False  True]
 [False False False False False False False False False False]]

When applied to our correlation matrix, it will only keep the upper part, excluding the diagonal. We use .abs() at the end because we are interested in both positively and negatively correlated variables.

upper_cor_matrix = cor_matrix.where(mask).abs()
upper_cor_matrix
feature_0 feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9
feature_0 NaN 0.823849 0.973042 0.949994 0.896065 0.764993 0.160190 0.170617 0.032467 0.198784
feature_1 NaN NaN 0.689932 0.812806 0.625704 0.975139 0.105197 0.631158 0.000368 0.579045
feature_2 NaN NaN NaN 0.877041 0.958594 0.649724 0.177422 0.051843 0.046402 0.028937
feature_3 NaN NaN NaN NaN 0.718516 0.694186 0.118221 0.063240 0.013887 0.428114
feature_4 NaN NaN NaN NaN NaN 0.648473 0.193494 0.168040 0.057019 0.225775
feature_5 NaN NaN NaN NaN NaN NaN 0.115363 0.756552 0.006447 0.460539
feature_6 NaN NaN NaN NaN NaN NaN NaN 0.034566 0.064570 0.069284
feature_7 NaN NaN NaN NaN NaN NaN NaN NaN 0.012584 0.361545
feature_8 NaN NaN NaN NaN NaN NaN NaN NaN NaN 0.058279
feature_9 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

To make it easier to use, the columns are stacked into rows, resulting in a multi-index Pandas Series:

cor_series = upper_cor_matrix.stack().sort_values(ascending=False)
print('Display the most-correlated pairs:')
cor_series[cor_series > 0.6]
Display the most-correlated pairs:
feature_1  feature_5    0.975139
feature_0  feature_2    0.973042
feature_2  feature_4    0.958594
feature_0  feature_3    0.949994
           feature_4    0.896065
feature_2  feature_3    0.877041
feature_0  feature_1    0.823849
feature_1  feature_3    0.812806
feature_0  feature_5    0.764993
feature_5  feature_7    0.756552
feature_3  feature_4    0.718516
           feature_5    0.694186
feature_1  feature_2    0.689932
feature_2  feature_5    0.649724
feature_4  feature_5    0.648473
feature_1  feature_7    0.631158
           feature_4    0.625704
dtype: float64

Deal with constant and quasi-constant variables

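Constant and quasi-constant features carry almost no information and are usually dropped before modelling. Below is a minimal sketch of one way to spot them with pandas, assuming a made-up toy dataframe and an arbitrary 90% frequency threshold (sklearn's VarianceThreshold is another option):

import pandas as pd

# toy dataframe: one constant, one quasi-constant and one informative column
df = pd.DataFrame({'constant': [1] * 20,
                   'quasi_constant': [0] * 19 + [1],
                   'informative': range(20)})

# constant columns: a single unique value
constant_cols = [col for col in df.columns if df[col].nunique() == 1]

# quasi-constant columns: the most frequent value covers more than 90% of the rows
# (a constant column also satisfies this criterion)
threshold = 0.90
quasi_constant_cols = [col for col in df.columns
                       if df[col].value_counts(normalize=True).iloc[0] > threshold]

print(constant_cols)        # ['constant']
print(quasi_constant_cols)  # ['constant', 'quasi_constant']

# drop them
df_clean = df.drop(columns=quasi_constant_cols)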

Deal with dates

The common operations involve:

  • converting a string to a datetime object, or the reverse operation
  • changing the format of the displayed date, e.g. removing the day, month and year information and keeping only the time
  • computing date / time differences (see the sketch at the end of this section)

> Parse a string into a datetime object

import datetime
my_date = datetime.datetime.strptime('2012-07-22 16:19:00.539570', '%Y-%m-%d %H:%M:%S.%f')
print('The created object has the following type: %s' % type(my_date))
The created object has the following type: <class 'datetime.datetime'>

> Parse a string into a datetime object when loading a .csv file

When loading data from a .csv file, it may be handy to take care of the date column at the same time, rather than modifying the date column in a separate command. To do so, one may define a date parser function that will be applied to the date column when reading the csv.

To demonstrate this, we generate a fake .csv file with a date column. The date will be written as '15', '16', '17' for 2015, 2016 and 2017 respectively.

import pandas as pd
from datetime import datetime   # pandas.datetime is deprecated, import from the standard library instead

df = pd.DataFrame(['15', '16', '17'], columns = ['date'])
print(df.head())

# save dataframe
df.to_csv('./data/tmp.csv', index=False)
  date
0   15
1   16
2   17

Let's now demonstrate how to define a date parser and load the csv properly.

def date_parser(x):
    return datetime.strptime('20' + x, '%Y')

pd.read_csv('./data/tmp.csv', parse_dates = [0], date_parser=date_parser)
<style> .dataframe thead tr:only-child th { text-align: right; }
.dataframe thead th {
    text-align: left;
}

.dataframe tbody tr th {
    vertical-align: top;
}
</style>
date
0 2015-01-01
1 2016-01-01
2 2017-01-01

> Format a datetime object to a string

To format a datetime object to a string, the function to use is strftime(). The list of date format codes can be found in the Python documentation for strftime() and strptime().

import datetime
print('Output current time as a formatted string:')
datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')
Output current time as a formatted string:

'2018-01-05 11:27:51.252837'

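The last common operation listed above, computing date / time differences, comes for free with the datetime module: subtracting two datetime objects yields a timedelta. A minimal sketch with made-up dates:

import datetime

start = datetime.datetime(2018, 1, 1, 9, 30)
end = datetime.datetime(2018, 1, 2, 11, 45)

delta = end - start           # a datetime.timedelta object
print(delta)                  # 1 day, 2:15:00
print(delta.total_seconds())  # 94500.0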

Deal with NAs

Dealing with missing values may involve:

  • assessing the dataset to find the rows / columns containing missing values
  • removing the records containing missing values
  • filling the missing values with an alternate value

> Drop NAs using Pandas.DataFrame.dropna(axis, how, thresh)

Pandas can look for NAs column-wise or row-wise (axis), drop a label if any or all of its values are NA (how), and keep only labels having a minimum number of non-NA values (thresh).

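A minimal sketch of the three options on a made-up dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame([[0, np.nan, 1], [np.nan, np.nan, 2]], columns=['x1', 'x2', 'x3'])

# drop the columns in which all values are NA (here: x2)
df.dropna(axis=1, how='all')

# drop the rows in which at least one value is NA (here: both rows)
df.dropna(axis=0, how='any')

# keep only the rows having at least 2 non-NA values (here: the first row)
df.dropna(axis=0, thresh=2)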

> Spot NAs using apply() and a lambda function

A lambda function combined with Pandas.DataFrame.apply() may be used to spot NAs and then take some action. Things to remember:

  • if x is a DataFrame, then x.isnull() returns a boolean same-sized object indicating whether the values are NA.
  • applying all() or any() to x.isnull() checks whether all or any of the values in the DataFrame are NA

Let's take a toy example:

import pandas as pd
import numpy as np
df = pd.DataFrame([[0, np.nan, 1], [np.nan, np.nan, 2]], columns = ['x1', 'x2', 'x3'])
print(df)

# create a boolean mask with columns containing only NAs
cols = df.apply(lambda x: all(x.isnull()), axis=0)
print('\nThe following column(s) contain only NAs:')
print(df.columns[cols].values)
print('\n')

cols = df.apply(lambda x: any(x.isnull()), axis=0)
print('The following column(s) contain at least one NA')
print(df.columns[cols].values)
    x1  x2  x3
0  0.0 NaN   1
1  NaN NaN   2

The following column(s) contain only NAs:
['x2']


The following column(s) contain at least one NA
['x1' 'x2']

Force Python to reload a module

Once a module has been loaded using import module_name, running this same command again will not reload the module.

Say you are making changes to a module and testing the result interactively in a Python shell. If you have already loaded the module once, you have to reload it explicitly to see your changes:

import importlib
importlib.reload(module_name)
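
For instance, assuming a hypothetical module my_module exposing a function some_function():

import importlib
import my_module

my_module.some_function()    # first run

# ... edit my_module.py on disk ...

importlib.reload(my_module)  # pick up the changes without restarting the shell
my_module.some_function()    # now runs the updated code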

Load data from multiple files and concatenate

Python's glob.glob() finds all the files in a directory matching a path pattern (be careful, the files are returned in arbitrary order).

map() is then applied to the file list to read each csv file via a lambda function.

The result is finally concatenated using the Pandas.concat() function.

import glob
import pandas as pd
files = glob.glob('./data/file*.csv')
print('List of files found:')
print(files)
df1 = pd.concat(map(lambda file: pd.read_csv(file, sep=','), files))
df1
List of files found:
['./data/file1.csv', './data/file2.csv']
id name
0 1 john doe
1 2 donald trump
0 3 bill clinton
1 4 hillary clinton

Save a dataframe to .csv file

Full documentation on the Pandas.DataFrame.to_csv() command may be found in the official pandas documentation.

Ok, this one may seem obvious, but if you want your Pandas.DataFrame to be easily readable afterwards, you have to take care (a short sketch follows the list below):

  • if your dataframe does not have a meaningful index, I suggest that you pass index=False to Pandas.DataFrame.to_csv() so that row numbers are not saved in the .csv file. This will prevent any trouble from occurring when loading the file again
  • if your dataframe already has an index that you want to keep, the default value index=True in Pandas.DataFrame.to_csv() will work fine
  • if you don't have a meaningful index yet but want the saved index column to be usable as the index when reading the file back, then you can specify index=True AND index_label='my_future_index_column'

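A minimal sketch of the three cases (the file names and columns are made up for illustration):

import pandas as pd

df = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['john doe', 'jane doe', 'bill smith']})

# case 1: no meaningful index, do not write row numbers
df.to_csv('./data/no_index.csv', index=False)

# case 2: the dataframe already has a meaningful index, the default index=True is fine
df_indexed = df.set_index('user_id')
df_indexed.to_csv('./data/with_index.csv')

# case 3: name the saved index column so it can be used when reading the file back
df.to_csv('./data/future_index.csv', index=True, index_label='row_id')
pd.read_csv('./data/future_index.csv', index_col='row_id')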