Machine Learning for openEO #441

Open
wants to merge 9 commits into
base: draft
5 changes: 5 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -13,7 +13,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `filter_vector`
- `flatten_dimensions`
- `load_geojson`
- `load_ml_model`
- `load_url`
- `ml_fit_class_random_forest`
- `ml_fit_regr_random_forest`
- `ml_predict`
- `save_ml_model`
- `unflatten_dimension`
- `vector_buffer`
- `vector_reproject`
6 changes: 6 additions & 0 deletions meta/subtype-schemas.json
@@ -232,6 +232,12 @@
}
}
},
"ml-model": {
"type": "object",
"subtype": "ml-model",
"title": "Machine Learning Model",
"description": "A machine learning model, accompanied by STAC metadata that implements the STAC ml-model extension."
},
"output-format": {
"type": "string",
"subtype": "output-format",
46 changes: 46 additions & 0 deletions proposals/load_ml_model.json
@@ -0,0 +1,46 @@
{
"id": "load_ml_model",
"summary": "Load an ML model",
"description": "Loads a machine learning model from a STAC Item.\n\nSuch a model could be trained and saved as part of a previous batch job with processes such as ``ml_fit_regr_random_forest()`` and ``save_ml_model()``.",
"categories": [
"machine learning",
"import"
],
"experimental": true,
"parameters": [
{
"name": "uri",
"description": "The STAC Item to load the machine learning model from. The STAC Item must implement the `ml-model` extension.",
"schema": [
{
"title": "URL",
"type": "string",
"format": "uri",
"subtype": "uri",
"pattern": "^https?://"
},
{
"title": "User-uploaded File",
"type": "string",
"subtype": "file-path",
"pattern": "^[^\r\n\\:'\"]+$"
}
]
}
],
"returns": {
"description": "A machine learning model to be used with machine learning processes such as ``ml_predict()``.",
"schema": {
"type": "object",
"subtype": "ml-model"
}
},
"links": [
{
"href": "https://github.com/stac-extensions/ml-model",
"title": "STAC ml-model extension",
"type": "text/html",
"rel": "about"
}
]
}
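The two schema alternatives for `uri` are distinguished by their `pattern` values. The following is a minimal Python sketch of how a client or back-end might tell them apart. The helper name `classify_model_uri` and the order of the checks are assumptions for illustration and are not part of the proposal.

```python
import re

# Patterns copied from the `uri` parameter schema of load_ml_model.
# The string literals use the same escapes as the JSON source.
URL_PATTERN = re.compile("^https?://")
FILE_PATH_PATTERN = re.compile("^[^\r\n\\:'\"]+$")


def classify_model_uri(uri: str) -> str:
    """Return the subtype ("uri" or "file-path") that the value matches.

    Hypothetical helper: checking the URL pattern first is an assumption.
    """
    if URL_PATTERN.match(uri):
        return "uri"
    # fullmatch avoids accepting a trailing newline, which `$` alone would allow.
    if FILE_PATH_PATTERN.fullmatch(uri):
        return "file-path"
    raise ValueError(f"Value matches neither schema: {uri!r}")
```

Under these assumptions, an `https://` address is treated as a URL, a relative path such as `models/rf/item.json` is treated as a user-uploaded file, and any other value containing a colon is rejected.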
110 changes: 110 additions & 0 deletions proposals/ml_fit_class_random_forest.json
@@ -0,0 +1,110 @@
{
"id": "ml_fit_class_random_forest",
"summary": "Train a random forest classification model",
"description": "Fits a random forest classification model to training data. The process does not split the data into separate test, validation and training sets. The Random Forest classification model is based on the approach by Breiman (2001).",
"categories": [
"machine learning"
],
"experimental": true,
"parameters": [
{
"name": "predictors",
"description": "The predictors for the classification model as a vector data cube. They are aggregated to the features (vectors) of the target input variable.",
"schema": [
{
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "geometry"
},
{
"type": "bands"
}
]
},
{
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "geometry"
},
{
"type": "other"
}
]
}
]
},
{
"name": "target",
"description": "The training sites for the classification model as a vector data cube. This is associated with the target variable for the Random Forest model. The geometry has to be associated with a value to predict (e.g. fractional forest canopy cover).",
"schema": {
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "geometry"
}
]
}
},
{
"name": "max_variables",
"description": "Specifies how many split variables will be used at a node.\n\nThe following options are available:\n\n- *integer*: The given number of variables is considered for each split.\n- `all`: All variables are considered for each split.\n- `log2`: The base-2 logarithm of the number of variables is considered for each split.\n- `onethird`: A third of the number of variables is considered for each split.\n- `sqrt`: The square root of the number of variables is considered for each split. This is often the default for classification.",
"schema": [
{
"type": "integer",
"minimum": 1
},
{
"type": "string",
"enum": [
"all",
"log2",
"onethird",
"sqrt"
]
}
]
},
{
"name": "num_trees",
"description": "The number of trees built within the Random Forest classification.",
"optional": true,
"default": 100,
"schema": {
"type": "integer",
"minimum": 1
}
},
{
"name": "seed",
"description": "A randomization seed to use for the random sampling in training. If not given or `null`, no seed is used and results may differ on subsequent use.",
"optional": true,
"default": null,
"schema": {
"type": [
"integer",
"null"
]
}
}
],
"returns": {
"description": "A model object that can be saved with ``save_ml_model()`` and restored with ``load_ml_model()``.",
"schema": {
"type": "object",
"subtype": "ml-model"
}
},
"links": [
{
"href": "https://doi.org/10.1023/A:1010933404324",
"title": "Breiman (2001): Random Forests",
"type": "text/html",
"rel": "about"
}
]
}
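The `max_variables` options can be read as a mapping from the number of predictor variables to the number of candidate variables per split. The sketch below illustrates that mapping in Python. The function name `resolve_max_variables`, the rounding down of fractional results, and the clamping to at least 1 and at most the number of predictors are assumptions, since the process description does not specify them.

```python
import math
from typing import Union


def resolve_max_variables(max_variables: Union[int, str], num_predictors: int) -> int:
    """Translate `max_variables` into a number of candidate variables per split.

    Assumptions (not stated in the process description): fractional results
    are rounded down and the result is clamped to [1, num_predictors].
    """
    if num_predictors < 1:
        raise ValueError("At least one predictor variable is required.")
    if isinstance(max_variables, bool):
        raise TypeError("max_variables must be an integer or a string.")

    if isinstance(max_variables, int):
        count = max_variables
    elif max_variables == "all":
        count = num_predictors
    elif max_variables == "log2":
        count = math.floor(math.log2(num_predictors))
    elif max_variables == "onethird":
        count = num_predictors // 3
    elif max_variables == "sqrt":
        count = math.isqrt(num_predictors)
    else:
        raise ValueError(f"Unsupported max_variables value: {max_variables!r}")

    # Clamp so that every split considers at least one and at most all variables.
    return max(1, min(count, num_predictors))
```

With 16 predictors, for example, `sqrt` resolves to 4 under these assumptions. A back-end may round differently.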
110 changes: 110 additions & 0 deletions proposals/ml_fit_regr_random_forest.json
@@ -0,0 +1,110 @@
{
"id": "ml_fit_regr_random_forest",
"summary": "Train a random forest regression model",
"description": "Fits a random forest regression model to training data. The process does not split the data into separate test, validation and training sets. The Random Forest regression model is based on the approach by Breiman (2001).",
"categories": [
"machine learning"
],
"experimental": true,
"parameters": [
{
"name": "predictors",
"description": "The predictors for the regression model as a vector data cube. They are aggregated to the features (vectors) of the target input variable.",
"schema": [
{
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "geometry"
},
{
"type": "bands"
}
]
},
{
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "geometry"
},
{
"type": "other"
}
]
}
]
},
{
"name": "target",
"description": "The training sites for the regression model as a vector data cube. This is associated with the target variable for the Random Forest model. The geometry has to be associated with a value to predict (e.g. fractional forest canopy cover).",
"schema": {
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "geometry"
}
]
}
},
{
"name": "max_variables",
"description": "Specifies how many split variables will be used at a node.\n\nThe following options are available:\n\n- *integer*: The given number of variables is considered for each split.\n- `all`: All variables are considered for each split.\n- `log2`: The base-2 logarithm of the number of variables is considered for each split.\n- `onethird`: A third of the number of variables is considered for each split. This is often the default for regression.\n- `sqrt`: The square root of the number of variables is considered for each split.",
"schema": [
{
"type": "integer",
"minimum": 1
},
{
"type": "string",
"enum": [
"all",
"log2",
"onethird",
"sqrt"
]
}
]
},
{
"name": "num_trees",
"description": "The number of trees built within the Random Forest regression.",
"optional": true,
"default": 100,
"schema": {
"type": "integer",
"minimum": 1
}
},
{
"name": "seed",
"description": "A randomization seed to use for the random sampling in training. If not given or `null`, no seed is used and results may differ on subsequent use.",
"optional": true,
"default": null,
"schema": {
"type": [
"integer",
"null"
]
}
}
],
"returns": {
"description": "A model object that can be saved with ``save_ml_model()`` and restored with ``load_ml_model()``.",
"schema": {
"type": "object",
"subtype": "ml-model"
}
},
"links": [
{
"href": "https://doi.org/10.1023/A:1010933404324",
"title": "Breiman (2001): Random Forests",
"type": "text/html",
"rel": "about"
}
]
}
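The effect of the `seed` parameter can be illustrated with the bootstrap sampling step that random forests use to draw a training sample per tree. The Python sketch below uses only the standard library. The function `bootstrap_indices` is hypothetical and is not how any particular back-end implements training.

```python
import random
from typing import List, Optional


def bootstrap_indices(num_samples: int, seed: Optional[int] = None) -> List[int]:
    """Draw `num_samples` indices with replacement, as done per tree in bagging.

    With an integer seed the draw is repeatable. With `seed=None` the
    generator is seeded from a system source, so results may differ per call.
    """
    rng = random.Random(seed)
    return [rng.randrange(num_samples) for _ in range(num_samples)]
```

Calling `bootstrap_indices(10, seed=42)` twice yields the same list, which mirrors the reproducibility that a fixed `seed` is meant to provide. Omitting the seed corresponds to the `null` default.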
49 changes: 49 additions & 0 deletions proposals/ml_predict.json
@@ -0,0 +1,49 @@
{
"id": "ml_predict",
"summary": "Predict using ML",
"description": "Applies a machine learning model to a data cube of input features and returns the predicted values.",
"categories": [
"machine learning"
],
"experimental": true,
"parameters": [
{
"name": "data",
"description": "The data cube containing the input features.",
"schema": {
"type": "object",
"subtype": "datacube"
}
},
{
"name": "model",
"description": "An ML model that was trained with one of the ML training processes such as ``ml_fit_regr_random_forest()``.",
"schema": {
"type": "object",
"subtype": "ml-model"
}
},
{
"name": "dimensions",
"description": "Zero or more dimensions that will be reduced by the model. Fails with a `DimensionNotAvailable` exception if one of the specified dimensions does not exist.",
"schema": {
"type": "array",
"items": {
"type": "string"
}
}
}
],
"returns": {
"description": "A data cube with the predicted values. The specified dimensions are removed and a new dimension for the predicted values is added. The new dimension has the name `predictions` and is of type `other`. If a single value is returned, the dimension has a single label with name `0`.",
"schema": {
"type": "object",
"subtype": "datacube",
"dimensions": [
{
"type": "other"
}
]
}
}
}
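The dimension handling described for `ml_predict` can be sketched on dimension names alone. The Python example below is illustrative: the function `predicted_dimensions` is hypothetical, and appending `predictions` as the last dimension is an assumption because the description does not define the dimension order.

```python
from typing import List


class DimensionNotAvailable(Exception):
    """Raised when a dimension to reduce does not exist in the data cube."""


def predicted_dimensions(cube_dimensions: List[str], reduced: List[str]) -> List[str]:
    """Return the dimension names of the data cube after prediction.

    The reduced dimensions are removed and a `predictions` dimension is added.
    Assumption: the new dimension is appended at the end.
    """
    missing = [name for name in reduced if name not in cube_dimensions]
    if missing:
        raise DimensionNotAvailable(f"Dimensions not available: {missing}")
    remaining = [name for name in cube_dimensions if name not in reduced]
    return remaining + ["predictions"]
```

For a cube with dimensions `x`, `y`, `t` and `bands`, reducing `bands` leaves `x`, `y`, `t` and `predictions` under these assumptions.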
2 changes: 1 addition & 1 deletion proposals/predict_curve.json
@@ -1,6 +1,6 @@
{
"id": "predict_curve",
- "summary": "Predict values",
+ "summary": "Predict values using a model function",
"description": "Predict values using a model function and pre-computed parameters. The process is primarily intended to compute values for new labels, but it can also fill gaps where existing labels contain no-data (`null`) values.",
"categories": [
"cubes",
44 changes: 44 additions & 0 deletions proposals/save_ml_model.json
@@ -0,0 +1,44 @@
{
"id": "save_ml_model",
"summary": "Save an ML model",
"description": "Saves a machine learning model as part of a batch job.\n\nThe model will be accompanied by a separate STAC Item that implements the [ml-model extension](https://github.com/stac-extensions/ml-model).",
Member

> The model will be accompanied by a separate STAC Item that

What does "accompanied" mean in practice? Should there be an additional job result asset? Or should this be a job result link item?

The reason I'm asking is that we want to streamline the detection of the model's URL on the client side.

For example, see Open-EO/openeo-python-client#576, where we currently have a highly implementation-specific hack:

ml_model_metadata_url = [
    link 
    for link in links if 'ml_model_metadata.json' in link['href']
][0]['href']

Member Author

Good question, I guess we should clarify that.

On the other hand, please note that this PR is implicitly outdated as the ML Model extension in STAC is likely going to be replaced by another extension. So this generally needs more work (which I have no plans to do anytime soon).

Member

> ML Model extension in STAC is likely going to be replaced by another extension.

Can you point to the new one, @m-mohr?


"categories": [
"machine learning",
"export"
],
"experimental": true,
"parameters": [
{
"name": "data",
"description": "The data to store as a machine learning model.",
"schema": {
"type": "object",
"subtype": "ml-model"
}
},
{
"name": "options",
"description": "Additional parameters to create the file(s).",
"schema": {
"type": "object",
"additionalProperties": false
},
"default": {},
"optional": true
}
],
"returns": {
"description": "Returns `false` if the process failed to store the model, `true` otherwise.",
"schema": {
"type": "boolean"
}
},
"links": [
{
"href": "https://github.com/stac-extensions/ml-model",
"title": "STAC ml-model extension",
"type": "text/html",
"rel": "about"
}
]
}
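To show how `save_ml_model` might be chained after a training process, the following Python sketch builds a fragment of an openEO process graph as a dictionary. The node names `fit` and `save`, the referenced upstream nodes `predictors` and `target` (omitted here), and the argument values are assumptions chosen for illustration.

```python
import json

# Fragment of a process graph: train a model, then store it as a batch job result.
# The nodes "predictors" and "target" are hypothetical and would be defined elsewhere.
process_graph = {
    "fit": {
        "process_id": "ml_fit_regr_random_forest",
        "arguments": {
            "predictors": {"from_node": "predictors"},
            "target": {"from_node": "target"},
            "max_variables": "onethird",
            "num_trees": 100,
            "seed": 42,
        },
    },
    "save": {
        "process_id": "save_ml_model",
        "arguments": {
            "data": {"from_node": "fit"},
            "options": {},
        },
        # Marks this node as the result of the process graph.
        "result": True,
    },
}

# Serialize to JSON as it would be sent to a back-end.
graph_json = json.dumps(process_graph, indent=2)
```

The `data` argument of `save_ml_model` references the output of the `fit` node, which has the `ml-model` subtype that the parameter schema expects.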