ML Data Cube Regularization #444
Conversation
@m-mohr, I am seeking your eyes whenever you have a moment. I have fixed most failures, but tracing this one is taking me much longer.
FYI: I won't get to it anytime soon, sorry.
Thanks for getting back. It's fine. I'll figure it out soon.
I'm not sure I understand why this process is necessary. The description talks about "irregular", but if your data is in an openEO data cube, then it's pretty regular already. Your time instants could be spaced unevenly, but that doesn't mean that an ML model could not handle that. This process looks like a combination between
In this state, I think more generally: is there a compelling reason to define
The use case has even been explored quite extensively in openEO Platform, and made it into public examples: https://github.com/Open-EO/openeo-community-examples/blob/main/python/BasicSentinelMerge/sentinel_merge.ipynb
@soxofaan thanks for the feedback. On the OEMC project we are planning to come up with a new openEO backend with a stronger focus on ML and DL capabilities for Satellite Image Time Series. A regular data cube in our case means that: (a) there is a unique field function; (b) the spatial support is georeferenced; (c) temporal continuity is assured; (d) all spatiotemporal locations share the same set of attributes; and (e) there are no gaps or missing values in the spatiotemporal extent. In our discussion, there were two philosophies, as shown in the image below, and we would like to support both, i.e. (1) allowing users to define their processes before ML/DL operations and (2) not bothering the users with the underlying processes. @jdries cool, I will check out the examples.
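To make condition (c) concrete, a minimal sketch of what "temporal continuity" could mean in practice: the labels of the temporal dimension are spaced by one fixed step, with no missing slots. The function name and the 10-day (dekadal) step are illustrative assumptions, not part of the openEO specification.

```python
from datetime import date, timedelta

def is_temporally_regular(timestamps, step=timedelta(days=10)):
    # Condition (c): every pair of consecutive temporal labels is
    # exactly `step` apart, so there are no gaps and no uneven spacing.
    return all(b - a == step for a, b in zip(timestamps, timestamps[1:]))

# A cube missing the 2021-01-21 slot fails the check:
labels = [date(2021, 1, 1), date(2021, 1, 11), date(2021, 1, 31)]
print(is_temporally_regular(labels))  # False
```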
Nice, this is exactly what I happen to be working on for the moment, in support of a couple of projects using ML. Maybe you already know, but openEO has a mechanism to build this kind of convenience function that is a combination of existing processes, the openEO 'user defined processes' (UDP). Using this has a couple of advantages:
I see this case arising more often, so maybe we can create an open-source GitHub repo with the definitions of these UDPs. That would allow users to reference the central repo, or allow backends to import those definitions. Now about the actual process:
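As a sketch of what such a shared UDP definition could look like: a hypothetical `regularize_datacube` UDP (the name is my assumption) expressed as a process graph that chains two existing openEO processes, `aggregate_temporal_period` to snap observations onto a fixed calendar grid and `apply_dimension` with `array_interpolate_linear` to fill the resulting gaps along `t`. This is not an official definition, just the shape such a JSON document might take.

```python
import json

# Hypothetical UDP: regularize a cube by aggregating to dekads, then
# linearly interpolating missing values along the temporal dimension.
udp = {
    "id": "regularize_datacube",  # assumed name, not an official process
    "parameters": [
        {"name": "data", "schema": {"type": "object", "subtype": "datacube"}}
    ],
    "process_graph": {
        "agg": {
            "process_id": "aggregate_temporal_period",
            "arguments": {
                "data": {"from_parameter": "data"},
                "period": "dekad",
                "reducer": {"process_graph": {"mean1": {
                    "process_id": "mean",
                    "arguments": {"data": {"from_parameter": "data"}},
                    "result": True,
                }}},
            },
        },
        "fill": {
            "process_id": "apply_dimension",
            "arguments": {
                "data": {"from_node": "agg"},
                "dimension": "t",
                "process": {"process_graph": {"interp1": {
                    "process_id": "array_interpolate_linear",
                    "arguments": {"data": {"from_parameter": "data"}},
                    "result": True,
                }}},
            },
            "result": True,
        },
    },
}
print(json.dumps(udp, indent=2)[:80])  # this is what the shared repo would store
```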
Maybe you already know, but openEO has a mechanism to build this kind of convenience function that is a combination of existing processes, the openEO 'user defined processes' (UDP).
Yeah, maybe some of these processes should go into openeo-community-examples if they can be built on top of other processes? This could also apply to the ard_* processes. All of these are very heavyweight processes that may not fit 100% into the current process landscape. I'll take this to the PSC for discussion.
I think we should at least consider trying to solve this use case with existing processes, i.e. add a "process_graph" member to the process description.
@PondiB I think it would make sense to make PRs against the ml branch, because otherwise all changes from the ml branch will also appear in this PR, which leads to confusion. Please rebase your changes against the ml branch if necessary and set the base branch of the PR to ml.
Sure.
Closing this.
Regularized data cubes are a necessity for machine learning and deep learning on EO time series data. This process aims to eliminate the need for users to chain processes manually in order to obtain a consistent data cube.
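A minimal sketch of the regularization idea being discussed, in plain Python: irregular (date, value) observations are snapped onto a fixed temporal grid, and slots with no nearby observation become `None`, to be gap-filled afterwards. The function name, the nearest-neighbour snapping rule, and the 10-day step are all illustrative assumptions, not the process's actual specification.

```python
from datetime import date, timedelta

def regularize(series, start, end, step=timedelta(days=10)):
    # Snap irregular {date: value} observations onto a regular grid.
    # Each grid slot takes the closest observation within half a step;
    # empty slots get None (left for a later gap-filling step).
    grid, t = [], start
    while t <= end:
        near = sorted(
            (abs(d - t), v) for d, v in series.items() if abs(d - t) <= step / 2
        )
        grid.append((t, near[0][1] if near else None))
        t += step
    return grid

obs = {date(2021, 1, 2): 0.3, date(2021, 1, 12): 0.5, date(2021, 2, 1): 0.6}
out = regularize(obs, date(2021, 1, 1), date(2021, 2, 1))
print([v for _, v in out])  # [0.3, 0.5, None, 0.6]
```

The `None` in the third slot is exactly the kind of gap that would violate condition (e) of a regular cube, and that a follow-up interpolation step would fill.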