Part 3: Types of Collections? #394

m-mohr opened this issue Feb 8, 2024 · 15 comments

m-mohr commented Feb 8, 2024

Part 3 offers a generic way to load from OGC API Collections.

A couple of uncertainties:

  • As far as I can see it doesn't limit the scope. What Collections can it really load from? Is loading from Tiles, Routes, or Styles really meaningful here?
  • Is there any way a Processes API can describe which types of collections it can load from? If the requirement is to support all by default, that's a pretty steep requirement.
  • How to load from other data sources? Would a STAC API fall under loading from OGC API Collections? As STAC API is based on Features, would this only load the "Features" (i.e., the STAC metadata)?
  • Is there a way to load from static STAC or static Records catalogs?
  • Is there a way to query for more specific data? Like e.g. a CQL query for a specific property (both GET/Text encoding and POST/JSON encoding)?

Disclaimer: I'm struggling with all these tiny files in the repo, so I may just not have found the answers yet.


jerstlouis commented Feb 9, 2024

@m-mohr See also my reply about Collection Input / Output in the other issue.

What Collections can it really load from? Is loading from Tiles, Routes, or Styles really meaningful here?

It can load from an OGC API collection as defined by OGC API - Common - Part 2: Geospatial data, which is still in draft, but the concept of a collection is already quite clear and supported by several OGC API data access standards including:

  • OGC API - Features
  • OGC API - EDR
  • OGC API - Tiles
  • OGC API - Maps
  • OGC API - Coverages
  • OGC API - DGGS
  • OGC API - 3D GeoVolumes
  • OGC API - Moving Features

I am less familiar with OGC API - Connected Systems, but I think it also fits into that category (as an extension of Features?).

A couple of clarifications:

  • technically the data could be temporal only, but in general these APIs focus on data with two or three spatial dimensions.
  • a particular collection of data can be available in more than one of these APIs -- I call these access mechanisms, but views on the same data has also been suggested.

In particular, an implementation conceptually does not serve a "collection of tiles" or a collection of DGGS zones; rather, it serves a collection of data (features, gridded coverage cell values, point clouds, 3D meshes) which has been tiled, or organized as a DGGS or as a bounding volume hierarchy.

OGC API - Routes is excluded because it does not depend on Common - Part 2 (no /collections/{collectionId}), and so far has been focused on an API to compute routes (though there is on-going discussion to split "sharing" routes into a separate part -- see opengeospatial/ogcapi-routes#44 and opengeospatial/ogcapi-routes#70 ).
However, the route computation request is fully aligned with the Processes - Part 1 execution request, and the response (the Route Exchange Model) is a feature collection. This allows it to be used together with Processes - Part 3 "Collection Output", so that a route can in fact be used as a collection input (though that becomes fully transparent -- at that point the client is actually doing an OGC API - Tiles (Vector Tiles) or OGC API - Features request). We implemented exactly that in our OpenStreetMap Routing Engine process.

OGC API - Styles is excluded because it shares portrayal information, not spatiotemporal data (no dependency on Common - Part 2 or /collections/{collectionId} either, though styles can be offered for a particular collection at /collections/{collectionId}/styles). However, passing a style URL to a map rendering process (like ours) would make sense, allowing a client to specify that a map should be rendered in a particular style, but that could be done simply through the href mechanism. See also opengeospatial/ogcapi-maps#42 .

OGC API - Records in its "Features"-style incarnation is on the fence: in a sense it inherits from Features, but it is normally about metadata rather than actual spatiotemporal data.

Is there any way a Processes API can describe which types of collections it can load from?

Currently, the assumption is that this is conveyed by the process description inputs: the media types of the input (e.g., GeoTIFF or GeoJSON), and the additional format tag which might clarify that an input is a feature collection (potentially of a particular geometry type -- see #322). From this, the client can figure out whether the input is expected to be a coverage or a feature collection.
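
As an illustration (not text from the draft), a process description input that a client could interpret as accepting either a coverage or a feature collection might look roughly like the sketch below; the exact use of contentMediaType and the format tag here is an assumption based on Part 1 conventions and the discussion in #322:

{
   "inputs" : {
      "data" : {
         "title" : "Input data",
         "schema" : {
            "oneOf" : [
               { "type" : "string", "contentEncoding" : "binary", "contentMediaType" : "image/tiff; application=geotiff" },
               { "type" : "string", "contentMediaType" : "application/geo+json", "format" : "geojson-feature-collection" }
            ]
         }
      }
   }
}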

In terms of which OGC APIs the server actually supports accessing as a client, there is not yet a clear mechanism for specifying that. It would certainly make sense to clarify this -- potentially as parameters to the collection-input requirement class at /conformance at a global level, or by extending the process description with that information. In the meantime, the server can return useful error feedback so that the user at least knows why an execution request failed, and which APIs are or are not supported.

If the requirement is to support all by default, that's a pretty steep requirement.

The requirement is for the API to support at least one of the OGC API data access mechanisms.
See Requirement 6:

The Implementation SHALL support accessing local collections (as defined by OGC API — Common — Part 2: Geospatial data), accessible using at least one OGC API data access mechanism (e.g., OGC API — Tiles, Coverages, DGGS, Features, EDR, Maps), as an input to a process.

About:

How to load from other data sources? Would a STAC API fall under loading from OGC API Collections? As STAC API is based on Features, would this only load the "Features" (i.e., the STAC metadata)?

Like I mentioned for Records above, STAC is about metadata. The idea is likely to load the relevant scenes described by the catalog as a single coverage -- in effect, to load the "data" referenced by STAC rather than just the image footprints as vector features. This is one of the things that the Coverage Scenes requirement class tries to clarify -- supporting the same data model and types of queries as STAC, but having this STAC API at /collections/{collectionId}/scenes instead of at /items, which avoids the confusion with vector features. Perhaps we can clarify Collection Input to say that STAC collections could be identified as such, and that the behavior in that case, for implementations that can work as a STAC client, should be to access the scenes as coverage data rather than as a collection of vector features?
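
For orientation, a rough sketch of the resource layout implied by that requirement class, based on the paths mentioned in this thread (the draft may differ in the details):

   /collections/{collectionId}                              the collection (a coverage / data cube)
   /collections/{collectionId}/coverage                     the coverage made up of all scenes
   /collections/{collectionId}/scenes                       scene-level metadata (STAC-like items)
   /collections/{collectionId}/scenes/{sceneId}             an individual scene's metadata
   /collections/{collectionId}/scenes/{sceneId}/coverage    the coverage of an individual scene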

Is there a way to load from static STAC or static Records catalogs?

If the goal is to load spatiotemporal data (e.g., coverage scenes), it should be clear that this is what is intended. Perhaps my suggestion above to clarify that a collection of STAC scene metadata should be loaded as coverage data input would address this?

I'm not sure whether there would be a use case to extend this collection input to non-spatiotemporal data. The Collection Input/Output mechanisms are really about this idea of spatiotemporal datacubes (a collection is a coverage / is a data cube).

Is there a way to query for more specific data? Like e.g. a CQL query for a specific property (both GET/Text encoding and POST/JSON encoding)?

This is the Input Field Modifiers requirement class, where filter and properties allow filtering, selecting fields, or deriving new fields using CQL2 queries. Input field modifiers can be set on any input coming into a process (including a "collection"); output field modifiers can be set on a "process" object (including the top-level process to which the execution request is being submitted, where the process property is optional).

Processes execution requests are always submitted via POST, but CQL2-Text is a lot more expressive / human readable than the CQL2-JSON equivalent, so the expectation is that CQL2-Text would still be used for those field modifiers within execution requests, though servers could support either CQL2-JSON or CQL2-Text based on the conformance classes they declare.

Note that, separately from the process/workflow execution definition, it is also possible via Collection Output to specify filter= or properties= again on the final output, as part of the subsequent GET Features or Coverage requests that retrieve the actual data / actually trigger the processing for a given area/time/resolution of interest, if the API declares conformance for those capabilities (e.g., Features - Part 3: Filtering / Part n: Property Selection, Coverages - Part 2: Filtering, deriving and aggregating fields...).
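
For illustration, a hedged sketch of an execution request combining a "collection" input with filter (CQL2-Text) and properties field modifiers; the process URL, collection URL and queryable name are hypothetical:

{
   "process" : "https://example.com/ogcapi/processes/SomeProcess",
   "inputs" : {
      "data" : {
         "collection" : "https://example.com/ogcapi/collections/sentinel2-l2a",
         "filter" : "scene.cloudCover < 50",
         "properties" : [ "(B08 - B04) / (B08 + B04)" ]
      }
   }
}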

@fmigneault

I share @m-mohr's concerns, especially regarding points like "Is there any way a Processes API can describe which types of collections it can load from? If the requirement is to support all by default, that's a pretty steep requirement."

I think the specification still lacks a lot of detail regarding "what to do" with whichever collection type gets loaded. I understand that the idea is to offer flexibility, such that a process expecting some input/output format can negotiate with the referenced collections and pick whichever available service type suits best, but my impression is that there are still a lot of assumptions that schemas and content-types are enough for data structures to align with the processing intentions. Calling equivalent data collections would yield different results depending on which services the called server implements. Because the procedure is very abstract, I cannot foresee any way to make this reliably interoperable between implementations.

If there was a single reference implementation to interact with all collection types (and correct me if there is one that I don't know of), that would at least help align interoperability, since implementations could understand the expectations for each case. For example with STAC, there are pystac and pystac-client that help standardize operations and returned formats, and even then there is a lot of interpretation depending on which STAC extensions the items contain -- and that is limited to OGC API - Features queries/formats, not all the other collection types. It gets worse in the case of Collections, as the scope is even larger.


jerstlouis commented Feb 9, 2024

Thanks for the feedback @fmigneault .

Is there any way a Processes API can describe which types of collections it can load from?

I fully agree it would be good to improve on that, with some suggestions on how we could go about it in my comment above.

If the requirement is to support all by default, that's a pretty steep requirement.

That is not the case, so no steep requirement :)

If there was a single reference implementation to interact with all collection types (and correct me if there is one that I don't know of),

So far we implemented the closest thing to a reference implementation at https://maps.gnosis.earth/ogcapi .

The RenderMap process supports both feature collections and coverages. Our server supports accessing remote collections as a client through several OGC APIs: Tiles (Vector Tiles, Coverage Tiles, Map Tiles), Maps, Features, Coverages, EDR (Cube queries only) and Processes (using Collection Output).

The PassThrough process also expects either a Coverage or a Feature Collection, and provides an opportunity to apply filter or properties to filter, select or derive new fields/properties using a CQL2 query. For example, you can use it to compute an NDVI. It can also be used to easily cascade from any of the supported OGC APIs, making the data available through all of the APIs / Tile Matrix Sets / DGGS / formats supported by the GNOSIS Map Server.

The current example is broken due to the cascaded server not being available, so here is an NDVI example to POST at https://maps.gnosis.earth/ogcapi/processes/PassThrough/execution?response=collection:

{
   "process" : "https://maps.gnosis.earth/ogcapi/processes/PassThrough",
   "inputs" : {
      "data" : [
         {
            "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a",
            "properties" : [ "(B08 - B04) / (B08 + B04)" ]
         }
      ]
   }
}
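
With ?response=collection, the response is not the processed data itself but a collection description; the actual processing is then triggered by ordinary OGC API data requests against that virtual collection. A rough sketch of the interaction (the parameter values are made up):

   POST https://maps.gnosis.earth/ogcapi/processes/PassThrough/execution?response=collection
        (body: the execution request above)
   -->  a collection description with links to /coverage, /tiles, etc.

   GET  {virtual collection}/coverage?subset=Lat(10:20),Lon(20:30)&scale-factor=16&f=geotiff
   -->  the NDVI is computed on demand for that area / resolution of interest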

Feature Attributes Combiner supports feature collections.

Elevation Contour Tracer expects a DEM coverage.

To use remote collections, the external APIs need to be explicitly authorized, so let me know if you want to try doing some TIEs with a particular server. We also have some point cloud and 3D mesh client / processing capabilities, so we could also work towards that.

It would be great to experiment on some of this at the Code Sprint in Portugal.

In Testbed 19, WHU did successfully implement Collection Output at http://oge.whu.edu.cn/geocube/gdc_api_t19/ , and Compusult also implemented it in their GDC API (I have to re-test it to confirm whether they addressed the remaining issues). WHU also mostly has a working "Collection Input" for their local collections, with a small tweak needed, as mentioned at #325 (comment). See https://gitlab.ogc.org/ogc/T19-GDC/-/wikis/Ecere#technology-integration-experiments , and see also the demo day presentation for GDC if you have not yet watched it. Multiple clients were able to access the processing as Collection Output. There were also successful Collection Input / Output experiments in Testbed 17 ( https://docs.ogc.org/per/21-027.html#toc64 ).

Some key things to clarify: Collection Input is for collections locally available on the same API.
Remote Collections is what implies support for a "collection" input from an external server, through OGC API data client capabilities.

Whether a collection is suitable for a particular process (a topic looked at in Testbed 16 - DAPA) is a problem that is not limited to Collection Input / Remote Collections. It also applies if you pass the data as GeoJSON or GeoTIFF embedded in the request or as an href. The schema (as in Features - Part 5) needs to be suitable, and currently the Process Description only goes so far in describing what to expect. We use schemas in both Features and Coverages now, so you can describe the properties of the features or the fields of the coverage the same way. The GeoDataClass concept is also a nice solution for this, which would help enhance the description of process inputs. It avoids having to parse schemas to determine the compatibility of a Process Input / Collection (or Style / Collection) pair, instead just comparing a URI (where presumably the schema(s) for this class of geospatial data can also be retrieved) without having to do a full schema comparison.

I really believe Collection Output is a key capability that should be considered for the OGC GeoDataCube API (potentially specifically in terms of OGC API - Coverages client / output support), because it provides access to an Area/Time/Resolution of interest from a data cube in exactly the same way regardless of whether the data is preprocessed or generated on the fly (through whichever process or workflow definition language). And Remote Collection / Collection Input makes it easy to integrate that same capability into a workflow.

Because the procedure is very abstract, I cannot foresee any way to make this reliably interoperable between implementations. ... It gets worse in the case of Collections, as the scope is even larger.

This all needs (a lot) more experimentation / implementations. Hopefully at upcoming Code Sprints and Testbed 20! :)

@fmigneault

My concern is mostly around the expectations for collections which I find hard to track.
The one presented is a good example.

         {
            "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a",
            "properties" : [ "(B08 - B04) / (B08 + B04)" ]
         }

Looking at https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a, I can see that Map, Tile, Coverage, etc. are available for that collection. From the parent process https://maps.gnosis.earth/ogcapi/processes/PassThrough, we can see that a FeatureCollection is expected as input and output. Considering "properties" : [ "(B08 - B04) / (B08 + B04)" ], how is the server (and the user invoking it) supposed to know what resolution will be used? How does the result get translated into the feature collection? What will that collection actually contain? What stops the server from retrieving the coverage for the bands, a combination of tiles, or the entire map instead, and returning the results in completely different manners?

My impression is that this abstraction, although it can simplify some workflow-building aspects, also makes them much harder to follow. For reproducible science, data lineage and interpretability of what happens behind the scenes, it looks like a black box that "somehow just works". There is also more chance that something breaks in a long chain of processes, which becomes harder to debug because the data transferred at each step is more complicated to predict.


jerstlouis commented Feb 9, 2024

From the parent process https://maps.gnosis.earth/ogcapi/processes/PassThrough, we can see that FeatureCollection are expected as input and output.

I need to fix this to a oneOf that also accepts a coverage (e.g., GeoTIFF supported media type) as an input / output. Sorry for the confusing example :) The previous example was a Feature Collection, and the support for Coverages by this process is newer functionality.

how is the server (and the user invoking it) supposed to know what resolution will be used?

Collection Input really works best together with Collection Output. With Collection Output, it is the data requests (e.g., Coverage / Map Tiles or subsets with scaling) that will determine which subset / resolution is being processed.

For Sync / Async execution, the process would either:

  • infer this from other parameters controlling the resolution / subset of the output (similar to how it works with Collection Output) -- there is an example of this in Testbed 17 GDC API by 52 North,
  • have parameters specific to the input for specifying the resolution / subset (or we could discuss adding other members inside the "collection" object for those sync/async execution (not Collection Output) cases based on common OGC API parameter building blocks like subset, bbox, scale-factor, etc.),
  • process the whole thing, or use the collection as a whole (without necessarily having to access the entire collection, e.g. computing a route or line of sight between two points), or
  • not use a "collection" input, but make it an "href" input instead and add e.g., /coverage?subset=Lat(10:20),Lon(20:30)&scale-factor=256&f=geotiff to the collection end-point (see the sketch after this list). -- This is currently how to use the PassThrough process in our implementation, using the Synchronous execution mode instead of Collection Output.
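
To make that last option concrete, a hedged sketch of an href-based synchronous execution request against the same collection (the subset and scale values are arbitrary):

{
   "process" : "https://maps.gnosis.earth/ogcapi/processes/PassThrough",
   "inputs" : {
      "data" : [
         {
            "href" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a/coverage?subset=Lat(10:20),Lon(20:30)&scale-factor=256&f=geotiff",
            "type" : "image/tiff; application=geotiff"
         }
      ]
   }
}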

How does the result get translated into the feature collection?

Sorry again for the bad process description confusion -- PassThrough takes a Coverage or a Feature Collection.

What will that collection actually contain?

The source sentinel2-l2a collection contains all the fields as described in its schema (a.k.a. range type).

The output of PassThrough will contain a single field that is the computed NDVI value specified by the properties CQL2 string.
The syntax "properties" : { "ndvi" : "(B08 - B04) / (B08 + B04)" } should also (or will) be supported to allow naming individual fields of the output.
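
Assuming that named-field syntax, the earlier input could then be written as:

{
   "collection" : "https://maps.gnosis.earth/ogcapi/collections/sentinel2-l2a",
   "properties" : { "ndvi" : "(B08 - B04) / (B08 + B04)" }
}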

What stops the server from retrieving the coverage for the bands, a combination of tiles, or the entire map instead, and returning the results in completely different manners?

Whether the server acting as a client accesses the coverage using /coverage with subset, or /tiles, or /dggs, and which TileMatrixSets / DGGRSs it picks, should not matter, as long as it uses the relevant Area/Time/Resolution of interest. In terms of reproducibility, the results should be the same regardless of the partitioning mechanism, with minimal variation (the result might not be identical byte-for-byte, but visually and numerically it should be within an acceptable margin of error). I agree that in practice this is not guaranteed and requires experimentation/validation, but in theory I do believe this is possible, and that this flexibility is a good thing (though I understand I still have a long way to go to convince both of you ;)).

Data processing intended for a coverage should never use Maps or Map tiles unless the process is specifically intended for that (e.g., a Map Tiling process), or if Maps are the only thing available (in which case it can still treat the red,green,blue channels as separate numeric values -- still not super useful).

My impression is that this abstraction, although it can simplify some workflow-building aspects,

I think it can really simplify things a lot, but it also makes workflows much easier to reuse with different data sources, areas/times/resolutions of interest, and different implementations and deployments. The same workflow should just work everywhere. So the Collection Input / Output approach is really focused on reusability, though it might slightly impact reproducibility (but again, we can keep that in check with experimentation/validation).


m-mohr commented Feb 9, 2024

Ugh, that's a lot of text. Thanks, though it's hard to fully follow the discussion with the limited time available.

Two points though:

  • I would appreciate if process descriptions could include an indicator of which collections they can load. This is especially relevant for generating user-friendly UIs; see Type indications for user interfaces in parameter schemas #395
  • About the Scenes API: I'm obviously no fan of it, and defining a new API that competes with STAC API will complicate things. The other way around would make more sense: make OGC API - Processes digest STAC API rather than define a new API. STAC is also about metadata, but primarily about data (= assets).

https://docs.ogc.org/DRAFTS/21-009.html

Thanks for the link, I really struggled to navigate the documents in this repo.


pvretano commented Feb 9, 2024

@m-mohr I have no idea what "OGC API - Processes digest STAC API" means. Can you explain further?

OGC API Processes exposes "processes". Processes have inputs and outputs. Inputs and outputs have schemas which are defined using JSON-Schema. How exactly does STAC API play in this world?

The STAC API, like the Records API, is the Features API with a predefined schema for the "feature" ... a "record" for Records and a STAC Item for STAC. I can see that STAC, like Records, can be used (in a deployment that uses a catalog and a Processes server) to correlate expected process inputs/outputs with available data products (i.e. matching input formats, bands, etc.) and vice versa, but that is a catalog function that is orthogonal yet tightly related to Processes ... especially if you are trying to deploy something like a thematic exploitation platform. Is that what you mean?


jerstlouis commented Feb 9, 2024

@m-mohr

I would appreciate if process descriptions could include an indicator which collections it can load.

I think there are two cases for this:

  • One where a particular class of data is expected. This is really where I would like GeoDataClasses to come into play. A process description's inputs could have a GeoDataClass URI, and a collection would have a GeoDataClass URI. Then clients can easily know what fits where, regardless of which server the processes or collections come from (see the sketch after this list).
  • Processes that accept any Coverage and/or any Feature Collection (possibly of a particular geometry type). I believe the current process description functionality and the availability of links of specific relation types from the collection (e.g. [ogc-rel:coverage], [ogc-rel:items], [ogc-rel:tilesets-vector], [ogc-rel:tilesets-coverage]) should be enough for the client to easily figure this out.
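
As a purely hypothetical sketch of the first option (neither the geoDataClass property name nor the example URI is defined anywhere yet), a process input and a collection could simply advertise the same URI, and a client would match them by comparing URIs instead of comparing schemas:

   Process description input (hypothetical):

      "dem" : {
         "title" : "Digital Elevation Model",
         "geoDataClass" : "https://example.com/geodataclasses/elevation-coverage"
      }

   Collection description (hypothetical):

      {
         "id" : "gebco",
         "geoDataClass" : "https://example.com/geodataclasses/elevation-coverage"
      }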

Listing local collections only from processes, or listing processes that work with a specific collection, doesn't help with integrating processes / collections across different servers.

About Scenes API: I'm obviously no fan of it and defining a new one that competes with STAC API will complicate things.

Again it's not about competing with STAC but complementing it. The idea is to align it as much as possible with STAC (at least as an option), where the differences are:

  • /collections/{collectionId}/scenes instead of /collections/{collectionId}/items (possibly the only difference in terms of what the API provides -- you could also offer this at /items if you want to offer a fully compliant STAC API, but you may not want to, because you have a different Features data representation for the same collection, or because you want to avoid the confusion, since this collection is a coverage, not a feature collection),
  • the implication that a coverage made up of all scenes is available at /collections/{collectionId}/coverage,
  • the availability of /collections/{collectionId}/scenes/{sceneId}/coverage for the coverage of individual scenes (which may be in a different CRS than the whole-collection coverage),
  • the implication that the scene queryables are also accessible as part of the /collections/{collectionId}/coverage (for a Coverages - Part 2 filter= CQL2 query),
  • the ability to add / update / remove scene images with CRUD operations (an optional capability)

As we discussed previously, I would also like to look into a relational model / binary format (e.g., SQLite -- see 6.3.6.1 STAC relational database), allowing the metadata for all available scenes to be retrieved more efficiently and synchronized across multiple servers, by retrieving a delta from a previous synchronization as new scenes become available or are updated.


m-mohr commented Feb 9, 2024

I have no idea what "OGC API - Processes digest STAC API" means. Can you explain further?

STAC API as an input for data, similar to or exactly how the Collection Input works in Part 3. @pvretano


m-mohr commented Feb 10, 2024

The idea is to align it as much as possible with STAC (at least as an option)

Why just as an option?

the implication that a coverage made up of all scenes is available at /collections/{collectionId}/coverage,

That's just an extension and doesn't need a Scenes API? You could easily add a coverage endpoint to a STAC API.

the availability of /collections/{collectionId}/scenes/{sceneId}/coverage for coverage of individual scenes (which may be in a different CRS than the whole collection coverage),

Same, why does this need a Scenes API? You could easily add a coverage endpoint to the item endpoint of a STAC API.

the implications that the scene queryables are also accessible as part of the /collections/{collectionId}/coverage (for a Coverages - Part 2 filter= CQL2 query)

I don't understand this.

the ability to add / update / remove scene images with CRUD operations (an optional capability)

That's possible with the STAC Transaction extension for Items and Collections (both aligned with the OGC API Transaction extensions whenever possible).

As we discussed previously, I would also like to look into a relational model / binary format (e.g., SQLite -- see 6.3.6.1 STAC relational database)

So something like stac-geoparquet (binary format)? Or more like pgstac (database model for PostgreSQL)?

@jerstlouis


jerstlouis commented Feb 10, 2024

STAC API as an input for data, similar to or exactly how the Collection Input works in Part 3

@m-mohr Having a STAC collection option (which implies a two-step access -- metadata -> data) makes sense, since this is already an established pattern, so we should add clarification text to that effect in the Processes - Part 3 Collection Input requirement class, stating that STAC is one possible access mechanism. However, there would be an expectation that all referenced STAC assets in the collection populate a collection with a consistent schema (fields). I don't think STAC necessarily implies this for any STAC catalog? There may also be confusion about whether the schema of the collection describes the data (the coverage made up of all scenes / assets) or the metadata (the STAC records). We can potentially deconflict this with the schema profile mechanism of Features - Part 5.

Our own experience trying to use a STAC API instance this way was that it did not work for the sentinel-2 global overviews use case that we were trying to use it for. I had trouble making use of the filter capabilities (CQL2 was not yet supported, and I could not make sense of the STAC-specific option in that implementation), and the server would reject responses returning more than a few granules.

If /coverage is available for that collection, then the client can ask for a subset, specific fields, or a downsampled resolution, instead of having to figure out which scenes are needed and then retrieve the data for every scene. This is more practical for global overviews / collections with millions of scenes (if the server supports an overview cache like we implemented).
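
For instance (the parameter values are hypothetical; subset and scale-factor follow the Coverages draft, and properties for field selection is assumed here), a client could request a small overview directly:

   GET /collections/sentinel2-l2a/coverage?subset=Lat(40:50),Lon(-10:0)&properties=B04,B08&scale-factor=32&f=geotiff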

The idea is to align it as much as possible with STAC (at least as an option)
Why just as an option?

Because we may want to support alternative representations of the /scenes/{sceneId} resource, for example using a lighter data model that does not need to repeat identical information (identical schema bands, identical base paths) that is shared across all assets of the coverage.

Another use case: for example, the AWS sentinel-2 collection that we are proxying has assets organized by granules, but we regroup multiple granules as a single scene. So we may want to expose a STAC API that lists the actual assets available on AWS for that collection at /items, but provide a list of scenes at /scenes that are regrouped as fewer scenes.

There is also a distinction between a "Scene" and a STAC asset: a single Scene implies all assets for all fields / bands of the collection schema, whereas a single STAC asset may be for a single band or multiple bands.

That's just an extension and doesn't need a Scenes API? You could easily add a coverage endpoint to a STAC API.

It's a "Scenes requirement class" for the "OGC API - Coverages", not a separate "Scenes API".
The requirement class specifies requirements that clients can expect the implementation to conform to.
So you can think of it as specifying that particular extension to the STAC API, in the context of an OGC API - Coverages implementation.

Same, why does this need a Scenes API? You could easily add a coverage endpoint to the item endpoint of a STAC API.

Same answer -- that's mostly what the Coverage Scenes requirement class does.

scene queryables are also accessible -- I don't understand this.

The scene-level metadata from the individual scenes at /collections/{collectionId}/scenes and /collections/{collectionId}/scenes/{sceneId} can be listed in the queryables at /collections/{collectionId}/queryables and be available for an OGC API - Coverages data request, e.g., /collections/{collectionId}/coverage?filter=scene.cloudCover<70 and SCL <> 9 (using both scene-level and cell-level queryables).

That's possible with the STAC Transaction extension for Items and Collections (both aligned with the OGC API Transaction extensions whenever possible).

We could potentially reference that in the requirement class. However, there are two different and equally valid use cases here being considered:

  • Doing a PUT / POST of Scene / STAC metadata that references assets elsewhere (e.g., on AWS) that are only cataloged by the deployed API instance
  • Doing a PUT / POST of an actual GeoTIFF at /scenes / /scenes/{sceneId} that will cause the scene metadata as well as the coverage for the collection's overall coverage to be updated automatically (the Images API from Testbed 15)

Or more like pgstac (database model for PostgreSQL)?

Yes what we implemented is probably similar to pgstac.


pvretano commented Feb 10, 2024

@m-mohr whether or not a process accepts a STAC item as an input is, as @jerstlouis mentioned, an option. That is really a property of the deployed process and depends on the definition of the process -- over which the provider of the service may or may not have control (i.e. Part 2). Of course, we can certainly define an optional conformance class for that.

In addition, the OGC Best Practice for Earth Observation Application Package has a pretty in-depth discussion of the interaction between STAC and OGC API Processes. Perhaps we can steal some material from that document.

@fmigneault

@jerstlouis
For the "two different and equally valid use cases here being considered" for PUT/POST, more specification of the multipart content-type of the request might be needed. This is what https://docs.ogc.org/per/19-070.html#_payload seems to describe, though it uses custom OGC definitions instead of the well-established Content-Type: multipart/mixed and Content-ID (https://www.w3.org/Protocols/rfc1341/7_2_Multipart.html) to distinguish the data from the metadata parts (not sure why?). This way, the same endpoint can handle adding/updating only the metadata, only the data, or both simultaneously.
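
For illustration only (the boundary, Content-ID values, path and scene identifier below are made up), such a request using standard multipart/mixed with Content-ID headers might look like:

   POST /collections/{collectionId}/scenes HTTP/1.1
   Content-Type: multipart/mixed; boundary=scene-parts

   --scene-parts
   Content-ID: <metadata>
   Content-Type: application/geo+json

   { "type" : "Feature", "id" : "scene-001", "properties" : { "datetime" : "2024-02-10T00:00:00Z" }, "geometry" : null }

   --scene-parts
   Content-ID: <data>
   Content-Type: image/tiff; application=geotiff

   (binary GeoTIFF bytes)
   --scene-parts--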

@pvretano @m-mohr
One thing I think we must be mindful of with STAC (and this problem is not limited to STAC; the same goes for GeoJSON, ZIP, OAP Part 3 Collections, etc.) is that these "generic container" media-types can contain virtually anything. While they allow metadata to be tagged along with the data, process descriptions become too abstract when everything is application/geo+json (or whatever alternative). All processes become artificially chainable by sharing I/O media-types, even though the data probably makes no sense to chain. Overall, it lowers the quality, understanding and reproducibility of workflows, because we are not quite sure what happens behind those abstractions and in-between processes. I brought this up in the OGC OSPD Slack, and would like to have more discussions about how to handle this problem properly.


m-mohr commented Feb 15, 2024

@fmigneault Regarding the "generic container types": This is certainly true, and the reason I opened this issue is to solve that. I'd love to see a solution for this. In openEO we have a general description of which file formats are supported by a back-end for input and output operations (GET /file_formats), but that's relatively high-level and would probably not work well for OGC API - Processes (and CWL-based processes). Maybe we need a way to describe container formats and what they can contain?

We have the same issue in openEO, where our principle is pretty much that everything that goes in is STAC and everything that comes out is STAC. While with a good format abstraction library such as GDAL you can cater for a lot of things, we pretty much just have to throw errors during the workflow if something doesn't meet the expectations. On the other hand, openEO doesn't have as many steps in-between where you actually need to pass around STAC. It really only comes into play in openEO if you switch between platforms, not between processes that run on the same platform. That is a significant difference from OGC API - Processes, where this is fundamental even on the same platform. (Not judging whether something is better or worse, just trying to highlight the differences.)

@bpross-52n

SWG meeting from 2024-03-04: Related to #395. A PR will be created that will likely close this issue as well.
