GitHub - MitchMedeiros/MLCompare: Quickly compare machine learning models across libraries and datasets

MLCompare is a Python package for running model comparison pipelines, designed to be both simple and flexible. It supports multiple popular ML libraries, retrieval from multiple online dataset repositories, common data processing steps, and results visualization. You can also use your own models and datasets within the pipelines.

Libraries: Scikit-learn, XGBoost

Datasets: Kaggle, OpenML, Hugging Face, locally saved

Data Processing:
train-test split
drop columns
handle NaNs: drop | forward-fill | backward-fill
encoders: OneHot | Ordinal | Target | Label
scalers: Standard | MinMax | MaxAbs | Robust
transformers: Quantile | Power | Normalizer
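
To make the three NaN-handling options concrete, the following is a minimal plain-Python sketch of what drop, forward-fill, and backward-fill do to a single column. It is an illustration only, not MLCompare's implementation, and the helper names (`is_missing`, `drop_missing`, `forward_fill`, `backward_fill`) are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_missing(values):
    """Drop: remove every missing entry."""
    return [v for v in values if not is_missing(v)]


def forward_fill(values):
    """Forward-fill: replace a missing entry with the last seen value."""
    filled, last = [], None
    for v in values:
        if not is_missing(v):
            last = v
        filled.append(last)
    return filled


def backward_fill(values):
    """Backward-fill: replace a missing entry with the next seen value."""
    return forward_fill(values[::-1])[::-1]


column = [None, 1.0, float("nan"), 3.0, None]
print(drop_missing(column))   # [1.0, 3.0]
print(forward_fill(column))   # [None, 1.0, 1.0, 3.0, 3.0]
print(backward_fill(column))  # [1.0, 1.0, 3.0, 3.0, None]
```

Note that forward-fill cannot repair a leading gap and backward-fill cannot repair a trailing one, which is why the choice between them depends on where the missing values sit.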

Installing

It is recommended to create a new virtual environment. Example with Conda:

conda create -n compare_env python==3.11.9
conda activate compare_env

Install this library with pip:

pip install mlcompare

Note that on macOS, both XGBoost and LightGBM require libomp. It can be installed with Homebrew:

brew install libomp

A Simple Example

To run a pipeline over multiple datasets and models, create one list of dictionaries for the datasets and another for the models, then pass both to a pipeline function.

The example below downloads one dataset from OpenML and one from Kaggle, one-hot encodes some of the columns in the Kaggle dataset, and trains and evaluates a Random Forest model and an XGBoost model on both.

import mlcompare

datasets = [
    {
        "type": "openml",
        "id": 8,
        "target": "drinks",
    },
    {
        "type": "kaggle",
        "user": "gorororororo23",
        "dataset": "plant-growth-data-classification",
        "file": "plant_growth_data.csv",
        "target": "Growth_Milestone",
        "oneHotEncode": ["Soil_Type", "Water_Frequency", "Fertilizer_Type"],
    }
]

models = [
    {
        "library": "sklearn",
        "name": "RandomForestRegressor",
    },
    {
        "library": "xgboost",
        "name": "XGBRegressor",
        "params": {"max_leaves": 40, "n_estimators": 200}
    }
]

mlcompare.full_pipeline(datasets, models, "regression")

The XGBoost model is configured with non-default parameter values through its "params" entry.
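
Because the models are plain dictionaries, the list can also be generated programmatically, for example to compare several parameter settings of one model. The sketch below uses only the "library", "name", and "params" keys shown in the example above. The helper `build_model_configs` and the grid values are hypothetical, and it is an assumption that the pipeline accepts several entries that share the same model name.

```python
from itertools import product

# Hypothetical search grid; n_estimators and max_depth are standard
# XGBRegressor arguments, the values are chosen only for illustration.
grid = {"n_estimators": [100, 200], "max_depth": [4, 6]}


def build_model_configs(library, name, grid):
    """Return one model dictionary per combination of grid values."""
    keys = list(grid)
    return [
        {"library": library, "name": name, "params": dict(zip(keys, combo))}
        for combo in product(*(grid[k] for k in keys))
    ]


# Start from a hand-written entry, then append the generated ones.
models = [{"library": "sklearn", "name": "RandomForestRegressor"}]
models += build_model_configs("xgboost", "XGBRegressor", grid)

for model in models:
    print(model)
```

The resulting list has the same shape as the hand-written `models` list, so it could be passed to `mlcompare.full_pipeline(datasets, models, "regression")` in its place.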

Planned Additions

Version 1.3

LightGBM support
CatBoost support
Model results graphing and visualization
Improved documentation
Support for presplit data

Version 1.4

PyTorch support
TensorFlow support
Additional dataset sources
Built-in model and dataset collections for quick testing of similar model types/datasets
Optional pipeline caching
Optional trained model saving

Version 1.5

S3 support
