Alternative to dropping attributes that vary between datasets #743

Open
jbusecke opened this issue May 6, 2024 · 2 comments

@jbusecke
Contributor

jbusecke commented May 6, 2024

When we merge dataset schemas, we currently drop every attribute whose value is not identical across all of the datasets.

Example:

from pangeo_forge_recipes.aggregation import _combine_xarray_schemas, dataset_to_schema, schema_to_template_ds
import xarray as xr

ds_a = xr.Dataset(attrs={'something_same':'a', 'something_different':'a'})
ds_b = xr.Dataset(attrs={'something_same':'a', 'something_different':'b'})

schemas = [dataset_to_schema(ds) for ds in [ds_a, ds_b]]
combined_schema = _combine_xarray_schemas(*schemas)
ds_new = schema_to_template_ds(combined_schema)
ds_new

gives

<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    something_same:  a

I would like a way to preserve the values of something_different on each dataset. Perhaps we could add an option to just make a list of the differing items?

<xarray.Dataset> Size: 0B
Dimensions:  ()
Data variables:
    *empty*
Attributes:
    something_same:  a
    something_different: [a, b]
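
The proposed behavior could be sketched in plain Python as below. This is only an illustration of the merge rule, not the current pangeo-forge-recipes implementation: `combine_attrs_keep_differing` is a hypothetical helper name, and it assumes attribute values can be compared with `==` (plain strings and numbers, not arrays). An attribute that appears in only some of the datasets is kept as-is here, which is one possible choice among several.

```python
from typing import Any, Dict, Iterable, Mapping


def combine_attrs_keep_differing(
    attrs_list: Iterable[Mapping[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical helper: merge attribute dicts without dropping conflicts.

    Attributes with a single distinct value are kept unchanged; attributes
    whose values differ are replaced by a list of the distinct values, in
    the order in which they were first seen.
    """
    attrs_list = list(attrs_list)
    combined: Dict[str, Any] = {}
    for attrs in attrs_list:
        for key in attrs:
            if key in combined:
                continue  # this key was already merged
            # Collect the distinct values of this key across all datasets.
            values = []
            for other in attrs_list:
                if key in other and other[key] not in values:
                    values.append(other[key])
            combined[key] = values[0] if len(values) == 1 else values
    return combined


attrs_a = {"something_same": "a", "something_different": "a"}
attrs_b = {"something_same": "a", "something_different": "b"}
merged = combine_attrs_keep_differing([attrs_a, attrs_b])
print(merged)
```

A rule like this would have to be opt-in, since it changes the type of an attribute from a scalar to a list whenever the inputs disagree.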

This is motivated by a real-world use case. For CMIP6, each file has a unique tracking_id that can be used to find issues with a specific file (which would then affect the entire concatenated dataset). Currently my pangeo-forge-recipes based workflow drops this important information completely.

Happy to help with a PR, but I am not sure what the best way to expose this behavior to the user would be.

Would this be a keyword argument to StoreToZarr?

@TomNicholas

FYI this is a hard problem in general, and we normally recommend promoting unique_tracking_id to be an actual coordinate variable so that it has specific rules for propagation.

pydata/xarray#1614
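
As a rough illustration of the coordinate-based approach in plain xarray (outside of pangeo-forge-recipes), the sketch below moves a per-file attribute onto a coordinate along the concatenation dimension before combining. `promote_attr_to_coord` is a hypothetical helper, the attribute name `tracking_id` follows the CMIP6 example above, and the two small datasets and their id values are made up for the example.

```python
import xarray as xr


def promote_attr_to_coord(ds: xr.Dataset, attr: str, dim: str) -> xr.Dataset:
    """Hypothetical helper: turn a global attribute into a coordinate on `dim`."""
    value = ds.attrs[attr]
    # Repeat the per-file value once for every position along `dim`.
    out = ds.assign_coords({attr: (dim, [value] * ds.sizes[dim])})
    # Remove the attribute so it is no longer subject to attribute merging.
    out.attrs = {k: v for k, v in ds.attrs.items() if k != attr}
    return out


# Two made-up "files" that differ only in their tracking_id attribute.
ds_a = xr.Dataset(
    {"v": ("time", [1.0, 2.0])},
    coords={"time": [0, 1]},
    attrs={"tracking_id": "id-a", "source": "model"},
)
ds_b = xr.Dataset(
    {"v": ("time", [3.0])},
    coords={"time": [2]},
    attrs={"tracking_id": "id-b", "source": "model"},
)

combined = xr.concat(
    [promote_attr_to_coord(ds, "tracking_id", "time") for ds in (ds_a, ds_b)],
    dim="time",
)
print(combined["tracking_id"].values)
```

Because the value now lives on a coordinate, it is concatenated along `time` like any other variable, and each time step stays traceable to its source file.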

@jbusecke
Contributor Author

jbusecke commented May 9, 2024

Interesting. It would be great to have this implemented on the xarray level, but AFAICT that would still not solve the issue here, since we are not using xarray to generate much of the schema?
