The Machine Cleaning project develops classification models that predict corrected values for the USAC Form 471 line item fields most commonly edited by ESH Business Analysts. Past analyses identified the fields that were the best "low-hanging fruit" for machine learning: Purpose and Connect Category.

These modules were built to facilitate the model-building workflow for this specific ESH application and to make that workflow easier for analysts with a limited machine learning background. The preprocessing module is specific to ESH, and the modeling modules are built as wrappers around sklearn modules. These can and should be extended, with the first priority being to rewrite them to work in Python 3+!

Directory | File | Description |
---|---|---|
/ | setup.sh | Before running this script, edit the `GITHUB` and `_FORKED` variables for your system. Then run `. ./setup.sh && source mc_venv/bin/activate && . /.bash_profile_mc` from the command line to set up and activate the environment. |

Directory | File | Description |
---|---|---|
src/modeling/sql | get_data_2019_train.sql | Gets the raw USAC FRN and line item data for 2019. This data is used to train the model. |
src/modeling/sql | get_yvar_dar_prod.sql | Gets the labels used to train the supervised machine learning model to predict purpose or connect category. |
src/modeling/sql | get_data_future_predict.sql | Currently a blank file. It will look similar to `get_data_2019_train.sql` but will pull in the future year's data to predict on. |

Directory | File | Description |
---|---|---|
src/modeling | preprocess_raw.py | Includes all data preprocessing functions, such as removing nulls and duplicates and converting data to numeric or dummy variables. Also includes a function to remove correlated columns. More detail on Read the Docs. |
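
The preprocessing steps described above can be sketched with pandas. This is a minimal illustration under stated assumptions, not the contents of `preprocess_raw.py`: the function names (`preprocess_sketch`, `drop_correlated_columns`), the column names, the sample values, and the 0.9 correlation threshold are all hypothetical.

```python
import numpy as np
import pandas as pd


def drop_correlated_columns(df, threshold=0.9):
    """Drop the later column of every pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair of columns is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


def preprocess_sketch(raw, categorical_cols, threshold=0.9):
    """Remove nulls and duplicates, dummy-encode categoricals, drop correlated columns."""
    df = raw.dropna().drop_duplicates()
    df = pd.get_dummies(df, columns=categorical_cols, dtype=float)
    return drop_correlated_columns(df, threshold=threshold)


# Hypothetical line item data: row 1 duplicates row 0, and the last row has a null.
raw = pd.DataFrame({
    "bandwidth_mbps": [100.0, 100.0, 1000.0, 50.0, 200.0, 500.0, 100.0, 300.0],
    "bandwidth_kbps": [1e5, 1e5, 1e6, 5e4, 2e5, 5e5, 1e5, 3e5],
    "monthly_cost": [500.0, 500.0, 900.0, 450.0, 300.0, 1200.0, 650.0, None],
    "connection_type": ["fiber", "fiber", "fiber", "copper",
                        "cable", "fiber", "copper", "fiber"],
})

clean = preprocess_sketch(raw, categorical_cols=["connection_type"])
print(clean.columns.tolist())
```

In this toy data, `bandwidth_kbps` is an exact multiple of `bandwidth_mbps`, so it is the column removed by the correlation filter. The real module may order or implement these steps differently.
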

Directory | File | Description |
---|---|---|
src/modeling | training_demo.ipynb | Notebook for training and iterating on models. There is a basic demo of the end-to-end process with a Random Forest model. |
src/modeling | model_setup_fit.py | Documented on Read the Docs. |
src/modeling | model_optimization.py | Documented on Read the Docs. |
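
To give a sense of what a thin wrapper around sklearn for fitting and tuning a Random Forest might look like, here is a rough sketch. The helper name `fit_random_forest`, the parameter grid, and the synthetic data are assumptions made for illustration; they are not taken from `model_setup_fit.py` or `model_optimization.py`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split


def fit_random_forest(X, y, param_grid=None, seed=0):
    """Hypothetical helper: hold out a test split, grid-search a Random Forest,
    and return the best model with its accuracy on the held-out split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    if param_grid is None:
        param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
    search = GridSearchCV(RandomForestClassifier(random_state=seed), param_grid, cv=3)
    search.fit(X_train, y_train)
    # GridSearchCV refits the best parameter set on the full training split by default.
    return search.best_estimator_, search.score(X_test, y_test)


# Synthetic stand-in data: the label depends only on the sign of the first feature.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

model, test_accuracy = fit_random_forest(X, y)
print(round(test_accuracy, 2))
```

Keeping the split, search, and scoring behind one call is one way to make the workflow approachable for analysts who are new to machine learning; the actual wrappers may expose more options.
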

Directory | File | Description |
---|---|---|
src/examples | apply_models.ipynb | Run this notebook to call the `load_and_predict()` function, which loads a model and its features and applies them to new data to predict purpose and connect category. Inputs: the data frame to predict on (after the minimal preprocessing) and a model id (string). Output: `/data/ml_mass_update.csv`. |
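
The load-and-predict flow can be illustrated roughly as follows. This is a hypothetical stand-in, not the project's `load_and_predict()`: the function name `load_and_predict_sketch`, its extra arguments, the pickle storage layout, the toy model, and all column names and labels are assumptions. It writes to a temporary directory rather than `/data/`.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def load_and_predict_sketch(df, model_id, model_dir, output_path):
    """Hypothetical stand-in, not the project's load_and_predict()."""
    # Assumed storage layout: <model_dir>/<model_id>.pkl holds a dict with
    # the fitted model and the feature names it was trained on.
    with open(os.path.join(model_dir, model_id + ".pkl"), "rb") as fh:
        saved = pickle.load(fh)
    # Select the training columns in training order; fill any missing ones with 0.
    features = df.reindex(columns=saved["features"], fill_value=0)
    result = df.copy()
    result["prediction"] = saved["model"].predict(features)
    result.to_csv(output_path, index=False)
    return result


# Train and save a toy model so the sketch runs end to end.
train = pd.DataFrame({
    "bandwidth_mbps": [10, 20, 30, 40, 1000, 2000, 3000, 4000],
    "is_fiber": [0, 0, 0, 0, 1, 1, 1, 1],
})
labels = ["class_a"] * 4 + ["class_b"] * 4
toy_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(train, labels)

work_dir = tempfile.mkdtemp()
with open(os.path.join(work_dir, "demo_model.pkl"), "wb") as fh:
    pickle.dump({"model": toy_model, "features": list(train.columns)}, fh)

# New data may carry extra columns or a different column order.
new_data = pd.DataFrame({
    "line_item_id": [1, 2],
    "is_fiber": [0, 1],
    "bandwidth_mbps": [15, 2500],
})
output_path = os.path.join(work_dir, "ml_mass_update.csv")
predictions = load_and_predict_sketch(new_data, "demo_model", work_dir, output_path)
print(predictions)
```

Storing the feature list next to the model is one way to keep new data aligned with the columns used in training; how the project actually persists models and features is not specified here.
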