Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature(store): V3 ZipStore #2078

Merged
merged 19 commits into from
Sep 13, 2024
Merged

Conversation

jhamman
Copy link
Member

@jhamman jhamman commented Aug 10, 2024

I threw together a super basic implementation of a v3 zip store. I am still testing this but it seems to be working for basic things.

closes #2010

try it!

In [1]: import zarr
   ...: 
   ...: store = zarr.store.ZipStore('data/example.zip', mode='w')
   ...: root = zarr.group(store=store)
   ...: z = root.create_array(shape=(100, 100), chunks=(10, 10), name="foo")

In [2]: z[:] = 42

In [3]: store.close()

# unzip -l data/example.zip

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

Comment on lines +151 to +152
async def delete(self, key: str) -> None:
raise NotImplementedError
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Today I was reminded that you can't delete anything from inside a ZipFile 😢. This behavior also existed in 2.18.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be nice to have an In-memory version of a Zip store where all the zip data is read in memory. This way it can support deleting and updating entries. Thereafter, a user can persist the data using a method like write_to_file. I think it would be very efficient for data sets whose compressed size is small enough to fit entirely in memory.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed that would be nice @zoj613. I would like to save that until after the 3.0 release though as the minimal zip store is a release blocker at this point.

Comment on lines 18 to 20
supports_writes: bool = True
supports_partial_writes: bool = False
supports_listing: bool = True
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do we think about adding supports_deletes: bool = False as a class attribute?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would be a solid extrapolation of the design we are currently using. one downside of that design is that, even if a class has supports_x set to False, the class will still need an implementation of x. Another solution would be to express supports_x by having the class inherit from a DoesX mixin. But that's maybe out of scope for this PR.

Copy link
Member Author

@jhamman jhamman Aug 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've gone the supports_deletes route for now.

@jhamman jhamman marked this pull request as ready for review August 12, 2024 22:17
@jhamman jhamman requested a review from d-v-b August 12, 2024 22:24
@jhamman jhamman changed the title feature(store): add basic implementation of a zip store feature(store): V3 ZipStore Aug 12, 2024
@jhamman jhamman added the V3 Related to compatibility with V3 spec label Aug 12, 2024
@jhamman jhamman added this to the 3.0.0 milestone Aug 12, 2024
@d-v-b
Copy link
Contributor

d-v-b commented Aug 13, 2024

I played around with this a bit locally.

  • the zip archive gets completely broken if you forget to call store.close after writing (or, presumably, if the program exits before reaching the line where store.close() is invoked). I think this is severe enough that this PR should have the ZipStore implement ContextManager to defend against data corruption issues.
  • attempting to write twice to the array fails, because the zip store doesn't allow deletion. In such a case, the user will see a NotImplementedError. We should probably make something more informative for this case. But I don't think this is a blocker.
  • We will at some point need to address the possibility of zip file stored on s3. Our current store hierarchy can't express this easily I think. but that's a future problem.

@jhamman
Copy link
Member Author

jhamman commented Aug 15, 2024

I played around with this a bit locally.

  • the zip archive gets completely broken if you forget to call store.close after writing (or, presumably, if the program exits before reaching the line where store.close() is invoked). I think this is severe enough that this PR should have the ZipStore implement ContextManager to defend against data corruption issues.

There is now a context manager interface here. I'll note that I'm mildly uncomfortable with this because its not clear if this should be a async context manager or not. While I think this is what you were asking for, its not really the way classes that do async things should work. I've got some ideas for how we can improve this but I think we should hold them for another PR.

Two asides:

  1. not closing the v2 ZipStore also led to bad things. So this is "expected behavior" I guess.
  2. we could consider an atexit cleanup of the file handle but I think we should first decide on the context api.
  • attempting to write twice to the array fails, because the zip store doesn't allow deletion. In such a case, the user will see a NotImplementedError. We should probably make something more informative for this case. But I don't think this is a blocker.

Noting that this behavior is also present in v2:

zarr-python/zarr/storage.py

Lines 1884 to 1885 in 11fd8db

def __delitem__(self, key):
raise NotImplementedError

  • We will at some point need to address the possibility of zip file stored on s3. Our current store hierarchy can't express this easily I think. but that's a future problem.

I think fsspec can handle most of this for us. Some work is likely needed there but it should be possible.

@joshmoore
Copy link
Member

A sidenote that similar to the situation with consolidated metadata, the zip store will bring v3 to parity with v2 (good thing ™️ 👍🏽) but will leave other implementations in the same situation of not having a spec. If there's a chance that the discussion around that spec will lead to changes in the on-disk format, it would be in our best interest to make sure we have a specification to go along with these change so that there's less likelihood of multiple versions we need to support.

@d-v-b
Copy link
Contributor

d-v-b commented Aug 31, 2024

A sidenote that similar to the situation with consolidated metadata, the zip store will bring v3 to parity with v2 (good thing ™️ 👍🏽) but will leave other implementations in the same situation of not having a spec. If there's a chance that the discussion around that spec will lead to changes in the on-disk format, it would be in our best interest to make sure we have a specification to go along with these change so that there's less likelihood of multiple versions we need to support.

+1, it would be great to define a spec for this so other implementations can safely support it

@jhamman
Copy link
Member Author

jhamman commented Sep 11, 2024

Draft spec to go along with this feature is now available for review: zarr-developers/zarr-specs#311

@LDeakin
Copy link

LDeakin commented Sep 12, 2024

I tried this, and it produces zip files that are compatible with the zip store implementation in zarrs. However, I noticed that this fails because it tries to call delete:

z = root.create_array(shape=(100, 100), chunks=(10, 10), name="foo", dtype=np.uint8, fill_value=42)
z[:] = 42

I would expect it to succeed and just not write empty chunks.

We will at some point need to address the possibility of zip file stored on s3. Our current store hierarchy can't express this easily I think. but that's a future problem.

For reference, in zarrs I opted to implement zip support as a "storage adapter" that can be layered on other stores. See example. But zip support is read-only, regardless of the backing store.

Copy link
Contributor

@d-v-b d-v-b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Let's get it in

(@jhamman do you think the issue @LDeakin raised is a blocker, or can we sort that out in a future pr?)

@jhamman
Copy link
Member Author

jhamman commented Sep 13, 2024

Let me see about addressing @LDeakin's finding here before merging.

@jhamman
Copy link
Member Author

jhamman commented Sep 13, 2024

@LDeakin - your example exposed something beyond the reach of this PR. Setting a chunk equal to the fill value triggers a delete - even if the chunk does not exist. This is handled nicely for all other stores but is hard to manage here. Something we could do in the future is allow delete to pass if the key does not exist. That may look something like this:

    async def delete(self, key: str) -> None:
        if await self.exists(key):
            raise NotImplementedError

This is a bit too clever IMO so I'm going to leave this for later.

@jhamman jhamman merged commit f1bd703 into zarr-developers:v3 Sep 13, 2024
26 checks passed
@jhamman jhamman deleted the feature/zip-store branch September 13, 2024 16:46
dcherian added a commit to dcherian/zarr-python that referenced this pull request Sep 16, 2024
* v3:
  fix: opening a group with unspecified format finds either v2 or v3 (zarr-developers#2183)
  test: check that store, array, and group classes are serializable  (zarr-developers#2006)
  feature(store): V3 ZipStore (zarr-developers#2078)
  More typing fixes for tests (zarr-developers#2173)
  refactor: split metadata into v2 and v3 modules (zarr-developers#2163)
  Accept dictionaries for `store` argument (zarr-developers#2164)
  Simplify mypy config for tests (zarr-developers#2156)
  Fixed path segment duplication in open_array (zarr-developers#2167)
  Fixed test warnings (zarr-developers#2168)
  chore: update pre-commit hooks (zarr-developers#2165)
  Ensure that store_dict used for empty dicts (zarr-developers#2162)
  Bump pypa/gh-action-pypi-publish from 1.10.0 to 1.10.1 in the actions group (zarr-developers#2160)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
V3 Related to compatibility with V3 spec
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[v3] Develop ZipStore
5 participants