
sierra073/esh-machine-cleaning


Machine Cleaning at EducationSuperHighway (ESH)

The Machine Cleaning project develops classification models that predict corrected USAC Form 471 line item field values for the fields most commonly edited by ESH Business Analysts. Past analyses identified Purpose and Connect Category as the best "low-hanging fruit" for machine learning.

Workflow/Code Architecture Illustration:

These modules were built to facilitate the model building workflow for this specific ESH application, and to make it easier to use for analysts with a limited machine learning background. The preprocessing module is specific to ESH, and the modeling modules are built as wrappers around sklearn modules. These can and should be extended, with the first priority being to rewrite them to work in Python 3+!

Files and Descriptions:

1. Environment setup

| Directory | File | Description |
| --- | --- | --- |
| / | setup.sh | Before running this script, edit the GITHUB and _FORKED variables for your individual system. Then run `. ./setup.sh && source mc_venv/bin/activate && . /.bash_profile_mc` in your command line to set up and activate the environment. |

2. Loading the datasets

| Directory | File | Description |
| --- | --- | --- |
| src/modeling/sql | get_data_2019_train.sql | Gets the raw USAC FRN and line item data for 2019. This data is used to train the model. |
| src/modeling/sql | get_yvar_dar_prod.sql | Gets the labels used to train the supervised machine learning model that predicts Purpose or Connect Category. |
| src/modeling/sql | get_data_future_predict.sql | Currently a blank file. It will look similar to get_data_2019_train.sql but will pull in the future year's data to predict on. |
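As a rough illustration of how the training data and labels might be combined, the sketch below runs two query files and joins the results on a line item id. It is not taken from the repository: the helper `run_sql_file`, the table and column names, the file names, and the use of an in-memory SQLite database in place of the real data source are all assumptions.

```python
import os
import sqlite3
import tempfile

import pandas as pd


def run_sql_file(path, connection):
    """Hypothetical helper: run the query stored in a .sql file."""
    with open(path) as fh:
        query = fh.read()
    return pd.read_sql_query(query, connection)


# In-memory SQLite stands in for the real database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (id INTEGER, raw_value TEXT)")
conn.execute("CREATE TABLE labels (id INTEGER, clean_value TEXT)")
conn.executemany("INSERT INTO line_items VALUES (?, ?)",
                 [(1, "x"), (2, "y"), (3, "z")])
conn.executemany("INSERT INTO labels VALUES (?, ?)", [(1, "a"), (2, "b")])

# Write two toy query files; the real queries live in src/modeling/sql.
sql_dir = tempfile.mkdtemp()
queries = {
    "features.sql": "SELECT id, raw_value FROM line_items ORDER BY id",
    "labels.sql": "SELECT id, clean_value FROM labels ORDER BY id",
}
for name, text in queries.items():
    with open(os.path.join(sql_dir, name), "w") as fh:
        fh.write(text)

features = run_sql_file(os.path.join(sql_dir, "features.sql"), conn)
labels = run_sql_file(os.path.join(sql_dir, "labels.sql"), conn)

# Keep only the line items that have a label to train on.
training = features.merge(labels, on="id", how="inner")
print(training)
```

The inner join drops unlabeled line items, which is one reasonable choice for supervised training; the actual pipeline may handle them differently.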

3. Preprocessing the dataset

| Directory | File | Description |
| --- | --- | --- |
| src/modeling | preprocess_raw.py | Includes all data preprocessing functions, such as removing nulls and duplicates and converting data to numeric or dummy variables. Also includes a function to remove correlated columns. More detail is available on ReadtheDocs. |
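The preprocessing steps listed above can be sketched with pandas roughly as follows. This is an illustration, not the code in preprocess_raw.py: the function name `preprocess_sketch`, the 0.9 correlation threshold, and the example column names are all hypothetical.

```python
import numpy as np
import pandas as pd


def preprocess_sketch(df, categorical_cols, corr_threshold=0.9):
    """Illustrative only: not the actual preprocess_raw.py implementation."""
    # Drop exact duplicate rows, then rows containing any null value.
    out = df.drop_duplicates().dropna()
    # Expand the categorical columns into 0/1 dummy variables.
    out = pd.get_dummies(out, columns=categorical_cols, dtype=float)
    # Coerce everything to numeric; unparseable values become NaN.
    out = out.apply(pd.to_numeric, errors="coerce")
    # For each highly correlated pair, drop the later of the two columns.
    corr = out.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
    return out.drop(columns=to_drop)


# Toy data with made-up column names and values.
raw = pd.DataFrame({
    "num_lines": ["1", "2", "2", "4", None],
    "bandwidth": [10, 20, 20, 40, 50],
    "monthly_cost": [100.0, 300.0, 300.0, 250.0, 500.0],
    "category": ["A", "B", "B", "A", "B"],
})
clean = preprocess_sketch(raw, categorical_cols=["category"])
print(clean)
```

In this toy data, `bandwidth` is a multiple of `num_lines` and the two dummy columns mirror each other, so one column from each pair is removed.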

4. Training models

| Directory | File | Description |
| --- | --- | --- |
| src/modeling | training_demo.ipynb | Notebook for training and iterating on models. There is a basic demo of the end-to-end process with a Random Forest model. |
| src/modeling | model_setup_fit.py | On ReadtheDocs |
| src/modeling | model_optimization.py | On ReadtheDocs |
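For readers new to sklearn, a minimal sketch of fitting and tuning a Random Forest might look like the following. It calls sklearn directly and does not use the wrapper functions in model_setup_fit.py or model_optimization.py, whose interfaces are documented on ReadtheDocs. The synthetic data and the hyperparameter grid are assumptions chosen only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed line item features and labels.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Search a small hyperparameter grid with 3-fold cross-validation.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Evaluate the refit best model on the held-out split.
best_model = search.best_estimator_
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(search.best_params_, round(test_accuracy, 3))
```

Holding out a test split that the grid search never sees gives a less optimistic accuracy estimate than the cross-validation score alone.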

5. Making Predictions on Purpose and Connect Category using the trained models

| Directory | File | Description |
| --- | --- | --- |
| src/examples | apply_models.ipynb | Run this notebook to call the load_and_predict() function. This function loads in a model and features and applies it to new data to make predictions on purpose and connect category. |

Output:
/data/ml_mass_update.csv

Note: load_and_predict() requires two inputs: the data frame to predict on (after the minimal preprocessing) and a model id (string).
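The load-and-apply pattern described here can be sketched as below. This is not the real load_and_predict(): the function `load_and_predict_sketch`, the pickle file layout, the `model_dir` argument, and the toy column names and labels are all hypothetical, and the CSV is written to a temporary directory rather than /data.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def load_and_predict_sketch(df, model_id, model_dir):
    """Hypothetical stand-in; the real load_and_predict() may differ."""
    # Assumed layout: one pickle per model id, holding the fitted model
    # together with the list of feature names it was trained on.
    with open(os.path.join(model_dir, model_id + ".pkl"), "rb") as fh:
        bundle = pickle.load(fh)
    # Select the training-time features in their original order.
    features = df[bundle["features"]]
    result = df.copy()
    result["prediction"] = bundle["model"].predict(features)
    return result


# Train and save a toy model so the sketch runs end to end.
train = pd.DataFrame({
    "f1": [0, 1, 0, 1, 0, 1],
    "f2": [5, 1, 6, 2, 7, 1],
    "label": ["class_a", "class_b"] * 3,
})
feature_names = ["f1", "f2"]
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(train[feature_names], train["label"])

model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "demo_model.pkl"), "wb") as fh:
    pickle.dump({"model": model, "features": feature_names}, fh)

# New data may carry extra columns and a different column order.
new_data = pd.DataFrame({"f2": [6, 1], "f1": [0, 1], "line_item_id": [101, 102]})
predictions = load_and_predict_sketch(new_data, "demo_model", model_dir)

output_path = os.path.join(model_dir, "ml_mass_update.csv")
predictions.to_csv(output_path, index=False)
```

Storing the feature list next to the model guards against new data arriving with columns in a different order than at training time.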

About

Machine Learning at EducationSuperHighway 2018-2019
