Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Generalize predict processes for ML models #396

Closed
wants to merge 1 commit into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -13,7 +13,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `fit_regr_random_forest`
- `flatten_dimensions`
- `load_ml_model`
- `predict_random_forest`
- `predict_ml_model`
- `predict_ml_model_probabilities`
- `save_ml_model`
- `unflatten_dimension`
- `vector_buffer`
Expand Down
2 changes: 1 addition & 1 deletion proposals/load_ml_model.json
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@
}
],
"returns": {
"description": "A machine learning model to be used with machine learning processes such as ``predict_random_forest()``.",
"description": "A machine learning model to be used with machine learning processes such as ``predict_ml_model()`` or ``predict_ml_model_probabilities()``.",
"schema": {
"type": "object",
"subtype": "ml-model"
Expand Down
4 changes: 2 additions & 2 deletions proposals/predict_curve.json
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
{
"id": "predict_curve",
"summary": "Predict values",
"summary": "Predict values using a model function",
"description": "Predict values using a model function and pre-computed parameters. The process is primarily intended to compute values for new labels, but it can also fill gaps where existing labels contain no-data (`null`) values.",
"categories": [
"cubes",
Expand Down Expand Up @@ -109,4 +109,4 @@
"message": "A dimension with the specified name does not exist."
}
}
}
}
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
{
"id": "predict_random_forest",
"summary": "Predict values based on a Random Forest model",
"description": "Applies a Random Forest machine learning model to an array and predict a value for it.",
"id": "predict_ml_model",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

predict_ml_model looks a bit weird to me, it reads like you will be predicting the model (which is a confusing statement), instead of letting the model predict classes.

Something like predict_class, ml_model_predict or even ml_predict would feel better.

Note that we for the array, text, date related processes, we also use a prefix based naming (array_append, array_apply, date_shift) instead of a postfix based naming.

Copy link
Member Author

@m-mohr m-mohr Nov 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My aim was to align with predict_curve so that in docs they would be listed side by side. Unfortunately, we are not very consistent with prefixes/suffixes (array_apply / date_shift / load_collection / save_result / reduce_dimension)...

Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The thing with the name predict_class is that that process would not only work for ML classification, but also ML regression, so ideally, the term class should be avoided. Unless there is a good reason to have separate inference/predict processes for ML classification and ML regression, but as far as I know that's not the case.

Is there any other case where predict_class / predict_probabilities could become useful without ML? That was the main reason I added ml_model in the first place...

To me, "predict" implies "machine learning", so having both "ml" and "predict" in the name is slightly redundant.
However, it might be better for discoverability and self-documenting reasons to keep that bit of redundancy.

"summary": "Predict values values using a ML model",
m-mohr marked this conversation as resolved.
Show resolved Hide resolved
"description": "Applies a machine learning model to an array and predicts a value/class for it.",
m-mohr marked this conversation as resolved.
Show resolved Hide resolved
"categories": [
"machine learning",
"reducer"
Expand All @@ -23,7 +23,7 @@
},
{
"name": "model",
"description": "A model object that can be trained with the processes ``fit_regr_random_forest()`` (regression) and ``fit_class_random_forest()`` (classification).",
"description": "A ML model that can be trained with one of the ML processes such as ``fit_class_random_forest()``.",
m-mohr marked this conversation as resolved.
Show resolved Hide resolved
"schema": {
"type": "object",
"subtype": "ml-model"
Expand Down
45 changes: 45 additions & 0 deletions proposals/predict_ml_model_probabilities.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,45 @@
{
"id": "predict_ml_model_probabilities",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like above, this process name looks a bit weird to me.
I'd prefer something like predict_probabilities, ml_model_predict_probabilities or ml_predict_probabilities

Copy link
Member Author

@m-mohr m-mohr Mar 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about the recent discussion (prefix is based on the primary input), it should probably be ml_predict_probabilities (or ml_model_predict_probabilities which is rather long)

"summary": "Predict class probabilities using a ML model",
"description": "Applies a machine learning model to an array and predicts (class) probabilities for them.",
m-mohr marked this conversation as resolved.
Show resolved Hide resolved
"categories": [
"machine learning",
"reducer"
],
"experimental": true,
"parameters": [
{
"name": "data",
"description": "An array of numbers.",
m-mohr marked this conversation as resolved.
Show resolved Hide resolved
"schema": {
"type": "array",
"items": {
"type": [
"number",
"null"
]
}
}
},
{
"name": "model",
"description": "A ML model that can be trained with one of the ML processes such as ``fit_regr_random_forest()``.",
m-mohr marked this conversation as resolved.
Show resolved Hide resolved
"schema": {
"type": "object",
"subtype": "ml-model"
}
}
],
"returns": {
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"description": "The predicted (class) probabilities. Returns `null` if any of the given values in the array is a no-data value.",
"description": "The predicted class probabilities. Returns `null` if any of the given values in the array is a no-data value.",

I'm not sure about the Returns null if any ...: that depends on the capability of the ML model to handle null/nodata I think

Copy link
Contributor

@LukeWeidenwalker LukeWeidenwalker Dec 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually tricky - some thoughts:

  • E.g. in deep learning if I've anticipated missing data, I could have used torch.nan_to_num to replace nan with a carefully chosen value in both training and inference. If I then export that model with e.g. ONNX, that transformation will be baked in and my model will know how to deal with missing data. I'm not sure how every other frameworks handles this (e.g. sklearn or xgboost) - but it's super important that this value is the same during inference as it was during training!
  • If the model handles nans, and the predict_ml_model process just ignores nan-values and lets it through, then I get the correct behaviour.
  • However, if it fails on nan-values, then the only way the user can fix this is to replace nan-values with the correct value in an extra openeo process before. To do that I need to know what that value was during training - this might not be easily available!
  • If the model doesn't handle nans (either they didn't add nan-handling to inference code, or just didn't train on samples with nans at all, etc.), then what would happen really depends on the framework. It might just crash, or it might run through, with the nan values subtly impacting the predictions.
  • some options I can think of:
    • Make this a parameter of the process (fail_on_nan or similar), fail by default and allow overriding if the user knows that their model can handle nans out of the box or is willing to throw the dice.
    • Add information about what to replace nan values with to the ml-model STAC extension and use that if available

I think as a user my preference would be for this process to assume that the model has already been constructed to handle nans correctly in both training and inference and therefore doesn't try to interfere.

Hope this is useful!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you agree that it would be better to drop

Returns null if any of the given values in the array is a no-data value.

from the general description of predict_..._probabilities ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, you agree that it would be better to drop

Returns null if any of the given values in the array is a no-data value.

from the general description of predict_..._probabilities ?

Yeah, exactly!

"schema": {
"type": "array",
"items": {
"type": [
"number",
"null"
]
}
}
}
}