Serialization of just coordinates #2347

Closed
hmaarrfk opened this issue Aug 6, 2018 · 6 comments

@hmaarrfk
Contributor

hmaarrfk commented Aug 6, 2018

In the search for the perfect data storage mechanism, I find myself needing to store the metadata of some of the images I am generating separately from the image data. It is really useful for me to serialize just the coordinates of my DataArray.

My serialization method of choice is JSON since it allows me to read the metadata with just a text editor. For that, having the coordinates as a self-contained dictionary is really important.

Currently, I convert just the coordinates to a dataset, and serialize that. The code looks something like this:

import xarray as xr
import numpy as np

# Setup an array with coordinates
n = np.zeros(3)
coords={'x': np.arange(3)}
m = xr.DataArray(n, dims=['x'], coords=coords)

coords_dataset_dict = m.coords.to_dataset().to_dict()
coords_dict = coords_dataset_dict['coords']

# Read/Write dictionary to JSON file

# This works, but I'm essentially creating an empty dataset for it
coords_set = xr.Dataset.from_dict(coords_dataset_dict)
coords2 = coords_set.coords  # so many `coords` :D
m2 = xr.DataArray(np.zeros(shape=m.shape), dims=m.dims, coords=coords2)

Would encapsulating this functionality in the Coordinates class be accepted as a PR?

It would add 2 functions that would look like:

def to_dict(self):
    # offload the heavy lifting to the Dataset class
    return self.to_dataset().to_dict()['coords']

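def from_dict(self, d):
    # Offload the heavy lifting again to the Dataset class.
    # The empty 'data_vars' entry makes Dataset.from_dict read 'coords'
    # as coordinates rather than as a mapping of variables.
    d_dataset = {'dims': {}, 'attrs': {}, 'coords': d, 'data_vars': {}}
    return Dataset.from_dict(d_dataset).coords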
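
As a rough sketch, the same idea can be tried outside the Coordinates class with two standalone functions and a round trip through a JSON string. The names `coords_to_dict` and `coords_from_dict` are hypothetical and not part of xarray, and the empty `data_vars` entry is an assumption about what `Dataset.from_dict` needs in order to treat the input as a full dataset dictionary.

```python
import json

import numpy as np
import xarray as xr


def coords_to_dict(coords):
    # Hypothetical helper: let Dataset.to_dict do the conversion,
    # then keep only the 'coords' entry.
    return coords.to_dataset().to_dict()["coords"]


def coords_from_dict(d):
    # Hypothetical helper: wrap the coordinates in a dataset-shaped dict
    # and read the coordinates back from the resulting Dataset.
    d_dataset = {"coords": d, "data_vars": {}, "dims": {}, "attrs": {}}
    return xr.Dataset.from_dict(d_dataset).coords


m = xr.DataArray(np.zeros(3), dims=["x"], coords={"x": np.arange(3)})

# Round trip through a JSON string instead of a file
text = json.dumps(coords_to_dict(m.coords))
coords2 = coords_from_dict(json.loads(text))
m2 = xr.DataArray(np.zeros(m.shape), dims=m.dims, coords=coords2)
```

The values of `x` survive the round trip, but nothing in the JSON text records the original dtype.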
@rabernat
Contributor

rabernat commented Jan 6, 2019

@hmaarrfk - sorry no one replied to your issue. I would personally be fine with adding this to the Coordinates API.

@hmaarrfk
Contributor Author

hmaarrfk commented Jan 6, 2019

No need to be sorry. These two functions were easy enough for me to write myself in my own codebase.

There are a few issues that I've found doing this though.
Mainly, I can't find a good way to serialize numpy arrays in a round-trippable fashion.
It is difficult to get back lists of arrays, or arrays of uint8. I don't know if you have a good way to solve this problem.
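
To illustrate the round-trip problem: a plain JSON dump keeps the values of an array but not its dtype, so a `uint8` array comes back with numpy's default integer type. One possible workaround, sketched here with the hypothetical helpers `encode_array` and `decode_array`, is to store the dtype string and shape next to the data. This is only an illustration under those assumptions, not something xarray does.

```python
import json

import numpy as np


def encode_array(a):
    # Hypothetical helper: keep dtype and shape alongside the nested list.
    return {"dtype": str(a.dtype), "shape": list(a.shape), "data": a.tolist()}


def decode_array(d):
    # Rebuild the array with the recorded dtype and shape.
    return np.array(d["data"], dtype=d["dtype"]).reshape(d["shape"])


a = np.arange(6, dtype=np.uint8).reshape(2, 3)

# Naive round trip: the values survive, the dtype does not
naive = np.array(json.loads(json.dumps(a.tolist())))

# Round trip with the dtype stored explicitly
restored = decode_array(json.loads(json.dumps(encode_array(a))))
```

The cost is a slightly more verbose JSON file, though it stays readable in a text editor.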

@rabernat
Contributor

rabernat commented Jan 6, 2019

I don't know if you have a good way to solve this problem.

This came up in zarr-developers/zarr-python#156 (comment), where @jewfro-cuban suggested using json-tricks to encode numpy arrays.

My preferred solution would be to use a different serialization format, like netCDF or Zarr.

@hmaarrfk
Contributor Author

hmaarrfk commented Jan 6, 2019

Mind blown! Thanks for that pointer. I haven't touched my serialization code in a while and am kind of scared to go back to it now, but I will keep that library in mind.

I saw Zarr a while back, looks cool. I hope to see it grow.

@dcherian dcherian mentioned this issue Jan 8, 2019
@andersy005
Member

If I'm not mistaken, this appears to have been addressed by #2659. Should we close this issue?

@hmaarrfk
Contributor Author

hmaarrfk commented Jan 9, 2022

This is likely true. Thanks for looking back into this.

@hmaarrfk hmaarrfk closed this as completed Jan 9, 2022