Machine Cleaning at EducationSuperHighway (ESH)

The Machine Cleaning project develops classification models that predict corrected USAC Form 471 line item field values for the fields most commonly edited by ESH Business Analysts. Past analyses identified Purpose and Connect Category as the best "low-hanging fruit" for machine learning.

Workflow/Code Architecture Illustration:

These modules were built to facilitate the model-building workflow for this specific ESH application and to make that workflow easier for analysts with a limited machine learning background. The preprocessing module is specific to ESH, and the modeling modules are wrappers around sklearn modules. These can and should be extended, with the first priority being to rewrite them to work in Python 3+!

Files and Descriptions:

1. Environment setup

Directory	File	Description
/	setup.sh	Before running this script, edit the GITHUB and _FORKED variables for your individual system. Then run `. ./setup.sh && source mc_venv/bin/activate && . /.bash_profile_mc` on the command line to set up and activate the environment.

2. Loading the datasets

Directory	File	Description
src/modeling/sql	get_data_2019_train.sql	Gets the raw USAC FRN and line item data for 2019. This data is used to train the model.
src/modeling/sql	get_yvar_dar_prod.sql	Gets the labels used to train the supervised machine learning models that predict Purpose or Connect Category.
src/modeling/sql	get_data_future_predict.sql	Currently a blank file. It will look similar to `get_data_2019_train.sql` but will pull in the future year's data to predict on.

3. Preprocessing the dataset

Directory	File	Description
src/modeling	preprocess_raw.py	Includes all data preprocessing functions, such as removing nulls and duplicates and converting data to numeric or dummy variables. Also includes a function to remove correlated columns. More detail is on ReadtheDocs.
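
The preprocessing steps described above can be sketched roughly as follows. This is a minimal Python 3 illustration using pandas, not the actual contents of `preprocess_raw.py`: the `preprocess` function, the `corr_threshold` parameter, and the column names in the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def preprocess(df, corr_threshold=0.95):
    """Hypothetical helper; not the real preprocess_raw.py API."""
    # Remove exact duplicate rows, then rows containing any null value.
    df = df.drop_duplicates().dropna()
    # Convert string (categorical) columns into 0/1 dummy columns.
    df = pd.get_dummies(df, dtype=float)
    # Look only at the upper triangle of the absolute correlation matrix so
    # that one column of each highly correlated pair is dropped, not both.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


# Toy data with made-up column names: one duplicate row, one row with a
# null, and one column that is a perfect copy of another.
raw = pd.DataFrame({
    "num_lines": [1, 2, 2, 4, 3, 5, 6, None],
    "num_lines_copy": [1, 2, 2, 4, 3, 5, 6, 7],
    "bandwidth": [100, 50, 50, 1000, 10, 100, 50, 20],
    "connection_type": ["Fiber", "Copper", "Copper", "Fiber",
                        "Wireless", "Copper", "Fiber", "Copper"],
})
clean = preprocess(raw)
print(clean.columns.tolist())
```

In this toy run, the duplicate row and the row with a null are removed, `connection_type` is expanded into dummy columns, and `num_lines_copy` is dropped because it is perfectly correlated with `num_lines`.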

4. Training models

Directory	File	Description
src/modeling	training_demo.ipynb	Notebook for training and iterating on models. There is a basic demo of the end-to-end process with a Random Forest model.
src/modeling	model_setup_fit.py	Documented on ReadtheDocs.
src/modeling	model_optimization.py	Documented on ReadtheDocs.
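
For readers who have not trained a Random Forest before, the core of such an end-to-end run might look like the sketch below. It calls sklearn directly rather than the wrappers in `model_setup_fit.py` or `model_optimization.py`, whose interfaces are not shown here, and it uses synthetic features and made-up label values in place of the real preprocessed line item data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed features and the labels;
# the label values here are illustrative only.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = np.where(X[:, 0] > 0.5, "Internet", "WAN")

# Hold out a quarter of the rows to estimate out-of-sample accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
print("held-out accuracy:", accuracy)
```

Fixing `random_state` keeps the split and the forest reproducible between runs, which helps when comparing model iterations.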

5. Making Predictions on Purpose and Connect Category using the trained models

Directory	File	Description
src/examples	apply_models.ipynb	Run this notebook to call the `load_and_predict()` function, which loads a model and its features and applies them to new data to predict Purpose and Connect Category. Output: `/data/ml_mass_update.csv`. Note: you must pass in the data frame to predict on (after the minimal preprocessing) and a model id (string).
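
The load-and-apply pattern described above can be sketched as follows. This is an assumption-laden illustration, not the real `load_and_predict()`: the function name `load_and_predict_sketch`, the `model_dir` argument, the one-pickle-per-model-id storage layout, the model id `purpose_rf`, and the column names are all hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def load_and_predict_sketch(df, model_id, model_dir):
    """Hypothetical stand-in; the real load_and_predict() may differ."""
    # Assumed layout: one pickle per model id holding the fitted model
    # together with the feature names it was trained on.
    with open(os.path.join(model_dir, model_id + ".pkl"), "rb") as f:
        saved = pickle.load(f)
    # Select the training features, in training order, before predicting.
    features = df[saved["features"]]
    out = df.copy()
    out["prediction"] = saved["model"].predict(features)
    return out


# Train and save a toy model so the sketch runs end to end.
# bootstrap=False keeps this tiny example deterministic.
train = pd.DataFrame({"num_lines": [1, 2, 8, 9], "bandwidth": [10, 20, 900, 1000]})
labels = ["WAN", "WAN", "Internet", "Internet"]
model = RandomForestClassifier(n_estimators=10, bootstrap=False, random_state=0)
model.fit(train, labels)

model_dir = tempfile.mkdtemp()
with open(os.path.join(model_dir, "purpose_rf.pkl"), "wb") as f:
    pickle.dump({"model": model, "features": list(train.columns)}, f)

# New data may carry extra columns (such as an id) in a different order.
new_data = pd.DataFrame({
    "line_item_id": [101, 102],
    "bandwidth": [15, 950],
    "num_lines": [1, 9],
})
predictions = load_and_predict_sketch(new_data, "purpose_rf", model_dir)
csv_path = os.path.join(model_dir, "ml_mass_update.csv")
predictions.to_csv(csv_path, index=False)
```

Storing the feature list next to the model guards against new data arriving with extra or reordered columns, since the same columns are selected in the same order at prediction time.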
