aggregate_polygon: Output format #2
Currently we have the possibility to specify the raster or vector data format by using GDAL and OGR identifiers. What you suggest is that we need to incorporate additional standards to model especially the vector data when we use formats like plain JSON or XML. Considering this, I am not sure whether we really should specify this for the output of this particular "zonal_statistics" process. I mean, we haven't specified an output format for any of the processes that produce images as output. But I see that it would be beneficial to model vector data after a specific standard, and I could imagine that this would be best modeled as a process, which can later become part of an openEO core profile.
Basically, as a user of the openEO API, if I compute zonal stats I need to know what I'll get as output, and the API should describe this as clearly as possible, so the definition of this process should somehow reference the output format.
Every process needs to define what it returns, but that is not necessarily the output for a job. Most processes will return an image collection, but this is probably not the case for zonal statistics. Yet, I am not sure how we smoothly convert from a process result to the actual output as defined in the job. If this definition is urgent for any back-end, then I am happy to pick up concrete proposals. Feel free to come up with a proposal/format you think is a good solution and we can discuss it and add it to the current "collection" of processes.

A general remark: process definitions will not be part of the API specification itself, but part of a separate process/profile specification. AFAIK, they are due with D3.3 "openEO data set and process descriptions" in March 2019 (API version 0.4.0). That's why I changed the milestone for now, but that's more a categorization than setting a priority.
Disclaimer: I have very little experience with zonal statistics. This idea evolved from implementing zonal statistics for the GEE back-end.

As far as I know, we expect the user to specify a GeoJSON object as input for the region(s). Therefore, I'd suggest to simply use GeoJSON as the output format, too.

Example 1: Multiple polygons stored in a file with a single aggregation function. Process: zonal_statistics
polygon.json:
Result:
Remarks:

Example 2: A single polygon embedded into the process graph with multiple aggregation functions. Process: zonal_statistics
Result:
We could also wrap the Feature into a FeatureCollection if that would be simpler to read and/or write. What do you think?
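To make the GeoJSON-based proposal above concrete, here is a hypothetical sketch (as plain Python dicts) of what such a result could look like. The property names (`zonal_statistics`, the per-date keys) are my own invention for illustration, not part of any spec:

```python
# Hypothetical sketch: the input Feature is echoed back with the aggregated
# statistics attached as properties, keyed by observation date.
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[3.71, 51.01], [3.72, 51.02], [3.73, 51.01], [3.71, 51.01]]],
    },
    "properties": {
        # "zonal_statistics" and the statistic names are assumed, not specified
        "zonal_statistics": {
            "2015-07-06": {"totalCount": 9, "validCount": 8,
                           "min": 0.58, "max": 0.99, "mean": 0.86},
        }
    },
}

# Wrapping in a FeatureCollection, as suggested, is then trivial:
collection = {"type": "FeatureCollection", "features": [feature]}
```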
Just a small remark: IMHO "totalCount" and "validCount" are essential, because the user polygon used for computing the zonal statistics might be located outside or only partially inside the available data. And/or pixels could be invalid (cloud contamination, etc.). From a GIS point of view these attributes should be mandatory, to enable the user to evaluate the delivered results.
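As an illustration of why these two counts matter, a minimal sketch of how a back-end might derive them, assuming NaN marks invalid pixels (cloud mask, nodata outside the footprint):

```python
import numpy as np

# `pixels` holds the NDVI values clipped to one polygon; NaN = invalid pixel.
pixels = np.array([0.58, 0.71, np.nan, 0.86, np.nan, 0.99])

total_count = pixels.size                # all pixels intersecting the polygon
valid_count = int((~np.isnan(pixels)).sum())  # pixels usable for statistics

stats = {
    "totalCount": total_count,
    "validCount": valid_count,
    # statistics are computed over valid pixels only
    "mean": float(np.nanmean(pixels)) if valid_count else None,
}
```

With `validCount` much smaller than `totalCount`, the user knows the reported mean rests on few observations.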
Given the discussion at VITO this week about vector collections as arrays, I suggest we do the following:
In JSON, this could look like the following (modeled after NetCDF; the metadata part at the beginning is just a quick attempt at encoding relevant info into JSON). Note that there is no duplication of property names:

```js
//openEOCubeJSON
{
  "axes": {
    "time": {
      "type": "regular",
      "data": [{
        "variable": "timestamp",
        "dataType": "datetime",
        "range": {
          "length": 2,
          "offset": "2015-07-06",
          "samplingDistance": "P10D"
        }
      }]
    },
    "zone": { "type": "index" },
    "band": {
      "type": "categorical",
      "data": [{
        "variable": "name",
        "dataType": "string",
        "values": ["NDVI"]
      }]
    },
    "stat": {
      "type": "categorical",
      "data": [{
        "variable": "name",
        "dataType": "string",
        "values": ["totalCount", "validCount", "min", "max", "mean"]
      }]
    }
  },
  "data": [
    {
      "variable": "value",
      "axes": ["zone", "time", "band", "stat"],
      "raster": [
        //zone 0
        [
          //time 2015-07-06
          [
            //band NDVI
            [9, 8, 0.58, 0.99, 0.86]
          ],
          //time 2015-07-16
          [
            //band NDVI
            [9, 8, 0.40, 0.57, 0.49]
          ]
        ],
        //zone 1
        [
          //time 2015-07-06
          [
            //band NDVI
            [9, 9, 0.78, 0.98, 0.975]
          ],
          //time 2015-07-16
          [
            //band NDVI
            [9, 9, 0.37, 0.51, 0.425]
          ]
        ]
      ]
    }
  ]
}
```

Now if polygons or their IDs are needed, they could be returned in the
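A minimal sketch of how a client could index into this layout, assuming the nesting of `raster` follows the declared `axes` order. The `lookup` helper is hypothetical, not part of any proposed spec:

```python
# Abridged version of the openEOCubeJSON example above, as a Python dict.
cube = {
    "axes": {
        "stat": {"data": [{"values": ["totalCount", "validCount", "min", "max", "mean"]}]}
    },
    "data": [{
        "variable": "value",
        "axes": ["zone", "time", "band", "stat"],
        "raster": [
            [[[9, 8, 0.58, 0.99, 0.86]], [[9, 8, 0.40, 0.57, 0.49]]],      # zone 0
            [[[9, 9, 0.78, 0.98, 0.975]], [[9, 9, 0.37, 0.51, 0.425]]],    # zone 1
        ],
    }],
}

def lookup(cube, zone, time_idx, band_idx, stat):
    """Index `raster` by (zone, time, band, stat), following the axis order
    declared in the data variable's "axes" list."""
    stat_idx = cube["axes"]["stat"]["data"][0]["values"].index(stat)
    return cube["data"][0]["raster"][zone][time_idx][band_idx][stat_idx]

# e.g. mean NDVI of zone 1 at the second timestamp:
mean_z1_t1 = lookup(cube, zone=1, time_idx=1, band_idx=0, stat="mean")
```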
I like that. There is also CoverageJSON; see for instance the MULTIPOLYGON or POINT example in https://covjson.org/playground/. What they call
Nice link, I completely forgot about CovJSON. My source was mostly the NetCDF Data Model. Some observations/comparisons of the two:
Very interesting proposal and discussion, thanks @mkadunc. I don't think we'll come up with a final approach until mid-February, so I'll stick with the simple GeoJSON-based approach for now and add the experimental flag for aggregate_zonal. See the current specification: https://open-eo.github.io/openeo-api/v/0.4.0/processreference/#aggregate_zonal
There is also CF-JSON, which seems somewhat similar: http://cf-json.org/specification

Edit: I was told there's also NCO-JSON, which departed from CF-JSON and may be merged at some point again. See pangeo-data/pangeo-datastore#3 (comment)
Didn't know this, but NetCDF also supports it:

However, browsers will not like such a binary format, so I still see a clear use case for something based on JSON.
3rd year planning: @jdries will look into CovJSON and CF-JSON (fka NCO-JSON, see cf-json/cf-json.github.io#10). Related discussions: covjson/specification#86 and pangeo-data/pangeo-datastore#3 (comment)
CovJSON output format has been implemented for aggregate_polygon in the VITO backend (still to be deployed at the moment) through PRs Open-EO/openeo-python-driver#23 and Open-EO/openeo-geopyspark-driver#25.

Usage example against my localhost setup of the VITO backend:

```python
import json

import shapely.geometry
import openeo

connection = openeo.connect("http://localhost:8080/openeo/0.4.2/")

polygon1 = shapely.geometry.Polygon([(3.71, 51.01), (3.72, 51.02), (3.73, 51.01)])
polygon2 = shapely.geometry.Polygon([(3.71, 51.015), (3.725, 51.025), (3.725, 51.03), (3.71, 51.03)])
geometry_collection = shapely.geometry.GeometryCollection([polygon1, polygon2])

cube = (
    connection
    .load_collection('CGS_SENTINEL2_RADIOMETRY_V102_001')
    .filter_temporal('2019-07-15', '2019-07-30')
    .polygonal_mean_timeseries(polygon=geometry_collection)
    .save_result(format="CovJSON")
)
result = cube.execute()
print(json.dumps(result))
```

Output:
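For readers unfamiliar with CovJSON, here is a rough, hand-written sketch of the general shape of a `Coverage` for two polygons and two timestamps. The values and the range name `band_0` are invented; this is not the VITO backend's actual output, just an approximation of the layout described at covjson.org:

```python
# Simplified CovJSON-style Coverage: a MultiPolygonSeries domain plus one
# NdArray range holding the aggregated values, row-major over (t, polygon).
covjson = {
    "type": "Coverage",
    "domain": {
        "type": "Domain",
        "domainType": "MultiPolygonSeries",
        "axes": {
            "t": {"values": ["2019-07-16T00:00:00Z", "2019-07-26T00:00:00Z"]},
            "composite": {
                "dataType": "polygon",
                "coordinates": ["x", "y"],
                "values": [
                    [[[3.71, 51.01], [3.72, 51.02], [3.73, 51.01], [3.71, 51.01]]],
                    [[[3.71, 51.015], [3.725, 51.025], [3.725, 51.03],
                      [3.71, 51.03], [3.71, 51.015]]],
                ],
            },
        },
    },
    "ranges": {
        "band_0": {  # hypothetical range name; real output depends on the band
            "type": "NdArray",
            "dataType": "float",
            "axisNames": ["t", "composite"],
            "shape": [2, 2],
            "values": [0.52, 0.61, 0.47, 0.58],  # made-up mean values
        }
    },
}
```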
What would be the actual return value of aggregate_polygon? A general "vector-cube", which can be passed to the save_result process, or would it return CovJSON (or similar) directly? What should the schema for the return value of the function look like?
vector-cube JSON should be considered just as an output encoding, similar to "GTiff" in save_result.
So, coming from the point (see #68) that we generally don't support vector cubes at the moment: would it be enough to support vector cubes in aggregate_polygon, save_result, load_result and maybe load_collection, add_dimension, drop_dimension and rename_dimension? That's probably what we need to actually use that return type from aggregate_polygon. I guess I'd not enable vector cubes in filter_* processes yet? Or am I missing something?
IMO the only thing we need to do in order to have vector-cube support is allow objects as dimension labels (currently we only allow number, string, date, date-time and time). Then a vector cube is just a cube with simple-feature geometries as dimension labels on the single spatial dimension. If we go for this approach, we already support vector cubes in all processes (but we treat the spatial dimension as nothing special). We could also ignore vector cubes altogether (for now), returning a raster cube with an ordinal dimension to encode the index of the corresponding polygon. This should be quite intuitive for the user...
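A sketch of the "geometries as dimension labels" idea above. The metadata structure is purely illustrative, not the actual openEO data-cube metadata model:

```python
# Hypothetical cube metadata: the single spatial dimension of a vector cube is
# labeled with GeoJSON geometry objects instead of numbers or strings.
vector_cube_metadata = {
    "dimensions": {
        "geometry": {
            "type": "spatial",
            "labels": [  # objects as labels, currently not allowed by the spec
                {"type": "Polygon",
                 "coordinates": [[[3.71, 51.01], [3.72, 51.02],
                                  [3.73, 51.01], [3.71, 51.01]]]},
                {"type": "Polygon",
                 "coordinates": [[[3.71, 51.015], [3.725, 51.025],
                                  [3.725, 51.03], [3.71, 51.03], [3.71, 51.015]]]},
            ],
        },
        "t": {"type": "temporal", "labels": ["2019-07-15", "2019-07-25"]},
    }
}
```

Every process that works with dimension labels would then transparently work with vector cubes, at the cost of labels no longer being simple scalars.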
Moved discussions about vector cubes to #68 and thus this issue can be closed, I think.
FYI, I implemented netCDF as an output format for timeseries. I used the simple orthogonal multidimensional array representation, and converted polygons to their points to make life easier. (Support for polygons in netCDF is being worked on, but seems less standard.)
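The CF "orthogonal multidimensional array" representation mentioned here can be sketched with plain numpy arrays standing in for netCDF variables (names and values are illustrative, not the actual output schema):

```python
import numpy as np

# Orthogonal layout: every station (here: one representative point per
# polygon) shares the same time axis, so the data variable is a dense
# (station, time) array with per-station coordinate variables alongside it.
n_stations, n_times = 2, 3

lon = np.array([3.72, 3.7175])                      # lon(station)
lat = np.array([51.013, 51.025])                    # lat(station)
time = ["2019-07-15", "2019-07-20", "2019-07-25"]   # time(time)
ndvi = np.array([[0.58, 0.49, 0.52],                # ndvi(station, time)
                 [0.78, 0.43, 0.47]])
```

The same arrays map one-to-one onto netCDF dimensions and variables, which is why this representation is the simplest one to write.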
@jdries Sounds good. Do you plan to support NCO-JSON anytime soon? If so, I think I'd also try to implement it for GEE. Also, I'd be interested in an example process graph + result...
I'll give NCO-JSON a try, but will have to install the NCO dependencies for that, as I haven't found a Python library yet that does the conversion. It's a bit of a side thing, so I'm not sure when I'll be able to finish.
Currently the zonal statistics process does not have a clear, standardized output format.
Having a JSON encoding would make the most sense. You could also look into, or borrow concepts from, OGC TimeseriesML.
Here's an example of a simple JSON encoding that we currently use: