Implement legacy vlen-utf8 codec #2036

Draft
rabernat wants to merge 11 commits into base: v3

Conversation

@rabernat (Contributor) commented Jul 14, 2024

This is an alternative approach to #2031 for implementing variable length string encoding, using the legacy numcodecs vlen-utf8 codec.

The codec is very simple. It encodes variable length data like this:

| header | item 0 len | item 0 data | item 1 len | item 1 data | ... 
| uint4  | uint4      | variable    | uint4      | variable    | ...

where the header is the total number of items, and each item is preceded by an unsigned int giving its length in bytes.
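
For illustration, here is a minimal pure-Python sketch of that layout (assuming 4-byte little-endian unsigned ints; encode_vlen_utf8/decode_vlen_utf8 are hypothetical names for this sketch, not the actual codec API):

import struct

def encode_vlen_utf8(items: list[str]) -> bytes:
    # header: number of items, then (length, payload) pairs
    out = [struct.pack("<I", len(items))]
    for s in items:
        data = s.encode("utf-8")
        out.append(struct.pack("<I", len(data)))
        out.append(data)
    return b"".join(out)

def decode_vlen_utf8(buf: bytes) -> list[str]:
    (n,) = struct.unpack_from("<I", buf, 0)
    pos, items = 4, []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        items.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return items

assert decode_vlen_utf8(encode_vlen_utf8(["hi", "world"])) == ["hi", "world"]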

We could add this as an "official" Zarr V3 codec if we want. The point of this PR is to demonstrate that it is simple to implement. If desirable, we could extend this to the other vlen types (bytes and array).

I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

However, I would favor using Arrow encoding (#2031) as the preferred default for new data.

xref:

@rabernat (Contributor, Author) commented Jul 14, 2024

I compared this encoding + Zstd compression with the arrow-based encoding in #2031 and got some surprising results. The dataset consists of 47868 city names.

| encoding  | compression | stored bytes |
|-----------|-------------|--------------|
| vlen-utf8 | None        | 642616       |
| vlen-utf8 | Zstd(2)     | 366504       |
| vlen-utf8 | Zstd(5)     | 358428       |
| arrow     | None        | 657185       |
| arrow     | Zstd(2)     | 462208       |
| arrow     | Zstd(5)     | 455234       |

The surprise is that the legacy encoding compresses better than the arrow-based one.

import asyncio
import pandas as pd
import zarr
from zarr.api.synchronous import array
from zarr.buffer import default_buffer_prototype
from zarr.codecs import ZstdCodec

# swap to compare across PRs
from zarr.codecs import VLenUTF8Codec as BytesCodec
# from zarr.codecs import ArrowRecordBatchCodec as BytesCodec

df = pd.read_csv("worldcities.csv")  # the dataset attached below

a = array(df.city.values, chunks=(1000,), fill_value='', codecs=[BytesCodec(), ZstdCodec(level=5)])

# ridiculous workaround for the lack of getsize on the V3 store API
# (uses top-level await, so run in an async REPL such as IPython)
store_path = a._async_array.store_path
all_items = [item async for item in store_path.store.list()]

async def get_item_size(store, item):
    return len(await store.get(item, prototype=default_buffer_prototype))

all_sizes = await asyncio.gather(*[get_item_size(store_path.store, item) for item in all_items])
total = sum(all_sizes)

worldcities.csv

@tomwhite (Contributor)

Thanks for opening this PR @rabernat!

I just tried your implementation of VLenUTF8Codec with the VCF to Zarr conversion in sgkit-dev/bio2zarr#254, and the test passed, which shows that string encoding and decoding are working correctly in this case.

@tomwhite (Contributor)

> I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

+1

@jhamman added the V3 (Related to compatibility with V3 spec) label on Aug 9, 2024
@TomAugspurger (Contributor)

Thanks. Would it be possible to support object-dtype arrays here too (I think zarr-v2 or xarray assumes object-dtype arrays hold strings)?

Right now this raises an exception.

import numpy as np
import zarr
import zarr.store

store = zarr.store.MemoryStore(mode="a")
data = np.array(['a', 'bc', 'def', 'asdf', 'asdfg'], dtype=np.dtype(object))

# round-tripping an object-dtype array currently raises an exception
arr = zarr.create(shape=data.shape, store=store, zarr_format=2, filters=[{"id": "vlen-utf8"}], fill_value="", dtype=object)
arr[:] = data
arr2 = zarr.open_array(store=store, zarr_format=2)
print(arr2[:])

Something like this seems to do the trick for zarr_format=2:

diff --git a/src/zarr/codecs/_v2.py b/src/zarr/codecs/_v2.py
index cc6129e6..c1f4944e 100644
--- a/src/zarr/codecs/_v2.py
+++ b/src/zarr/codecs/_v2.py
@@ -37,7 +37,13 @@ class V2Compressor(ArrayBytesCodec):
 
         # ensure correct dtype
         if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
-            chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
+            if chunk_spec.dtype.kind == "O":
+                # I think we need to assert something about the codec pipeline here
+                # to ensure that we're followed by something that'll do the cast properly.
+                # chunk_numpy_array = chunk_numpy_array.astype(chunk_spec.dtype)
+                pass
+            else:
+                chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
 
         return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)
 
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index 9a808e08..034711d9 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -472,7 +472,7 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
-            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "US" else False
+            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "OUS" else False
         )
 
     def fill(self, value: Any) -> None:

@rabernat (Contributor, Author)

We could do that if we assume that the object dtype only ever holds strings. But that seems like a pretty extreme assumption.

Zarr V2 allows arbitrary python objects, but V3 does not. Seems reasonable to force users to use an explicit string dtype for string arrays.

What's the use case you have in mind for object string arrays?

@TomAugspurger (Contributor)

I'll look a bit closer tomorrow, but this was for reading zarr v2 data (an xarray dataset with an object-dtype coordinate array full of strings).

I agree that being explicit about the dtype here is better. Hopefully we can find a way that works well for older v2 datasets without too much hassle.

@rabernat (Contributor, Author) commented Oct 1, 2024

Here's a little exploration of different types of numpy string arrays:

import numpy as np
strings = ["hello", "world", "this", "is", "a", "test"]
# default
a1 = np.array(strings)
assert a1.dtype == '<U5'
# numpy 2.0 vlen string
a2 = np.array(strings, dtype='T')
# object
a3 = np.array(strings, dtype='O')

# the first two are detectable as string arrays
assert np.issubdtype(np.str_, a1.dtype)
assert np.issubdtype(np.str_, a2.dtype)

# the object is not, even though it contains all strings
assert not np.issubdtype(np.str_, a3.dtype)

# we can cast the object dtype to a vlen string or regular string
assert np.issubdtype(np.str_, a3.astype('T').dtype)

# however, we want to guard against casting if the array contains non-string elements
a4 = np.array(strings + [1], dtype='O') 
a4.astype('T')  # works
a4.astype('T', casting="same_kind")  # -> TypeError

So my proposal for object arrays would be to try to cast them using .astype('T', casting="same_kind") and, if successful, use string encoding.

edit: actually that is useless, because this also fails:

a3.astype('T', casting="same_kind")

So it looks like there is actually no reliable way to know if an object array is a string array.
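
For illustration, the only reliable check seems to be element-wise (a sketch; is_string_object_array is a hypothetical helper, roughly the kind of O(n) scan a consumer like xarray has to do):

import numpy as np

def is_string_object_array(arr: np.ndarray) -> bool:
    # no dtype-level test exists for object arrays; scan every element
    return arr.dtype.kind == "O" and all(isinstance(x, str) for x in arr.flat)

assert is_string_object_array(np.array(["a", "b"], dtype="O"))
assert not is_string_object_array(np.array(["a", 1], dtype="O"))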

@TomAugspurger - do you have a test case you are using for object arrays? Is my example enough to cover what Xarray needs? Under what circumstances does Xarray produce object arrays?


Also, my code is relying on numpy 2.0 vlen-string arrays. I guess that's not a hard dependency for Zarr. 🤔
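
For illustration, one way to keep numpy 2.0 a soft dependency would be feature detection at import time (a sketch; HAS_VLEN_STRING is a hypothetical flag):

try:
    from numpy.dtypes import StringDType  # available in numpy >= 2.0
    HAS_VLEN_STRING = True
except ImportError:
    StringDType = None
    HAS_VLEN_STRING = False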

@TomAugspurger (Contributor) commented Oct 1, 2024

do you have a test case you are using for object arrays?

In xarray, xarray/tests/test_backends.py::TestZarrDictStore::test_roundtrip_object_dtype is the failing test: https://github.com/pydata/xarray/blob/095d47fcb036441532bf6f5aed907a6c4cfdfe0d/xarray/tests/test_backends.py#L506

In [10]: import xarray as xr, numpy as np, zarr

In [11]: ds = xr.Dataset({"a": xr.DataArray(np.array(['a', 'b', 'c'], dtype=object))})

In [12]: store = zarr.store.MemoryStore(mode="a")

In [13]: ds.to_zarr(store)

Those are manually set to object dtype. But that same test also creates float arrays, so there's something more sophisticated going on than "object dtype == strings". I'll do some more looking as I work through those tests, but I think this PR can proceed independently.

@TomAugspurger (Contributor)

OK, I think I understand how v2 did things a bit better now.

  1. It's xarray that treats object-dtype arrays as strings, after validating that's actually true. If so, then xarray sets the dtype passed to Zarr to str: https://github.com/pydata/xarray/blob/5c6418219e551cd59f81e3d1c6a828c1c23cd763/xarray/backends/zarr.py#L899-L900
  2. Then zarr turns dtype=str back into dtype=object with

    zarr-python/zarr/util.py, lines 187 to 192 in 6b9ab9e:

    if isinstance(dtype, str):
        # allow ':' to delimit class from codec arguments
        tokens = dtype.split(":")
        key = tokens[0]
        if key in object_codecs:
            dtype = np.dtype(object)
    (assuming an object codec is provided, which xarray does)

The net result is that in zarr-python 2.x writing zarr v2 data, here's the compressor and filters for various dtypes:

| dtype           | compressor | filters  |
|-----------------|------------|----------|
| strings (U)     | Blosc      |          |
| bytes (S)       | Blosc      |          |
| object[str] (O) | Blosc      | VLenUTF8 |
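
For illustration, the object[str] row in the table above corresponds to this zarr-python 2.x pattern (a sketch using the v2 object_codec argument; the dtype=str to object mapping is the behavior described above):

import numcodecs
import numpy as np
import zarr  # zarr-python 2.x

data = np.array(["a", "bc", "def"], dtype=object)
z = zarr.array(data, dtype=str, object_codec=numcodecs.VLenUTF8())
# dtype=str is mapped back to object, with VLenUTF8 inserted as a filter
assert z.dtype == object
assert z.filters == [numcodecs.VLenUTF8()]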

@rabernat (Contributor, Author) commented Oct 3, 2024

Thanks Tom. Planning to work on this today.

@mkitti commented Oct 3, 2024

Could we use the existing sharding codec for new variable length data? Essentially each variable length element is just an "inner chunk" with some offset and length (nbytes)?

@rabernat (Contributor, Author) commented Oct 3, 2024

I'm not planning on doing anything along those lines here. The "variable length" we're talking about is on the order of bytes, so it doesn't make sense to compress each string individually.

Just trying to get to basic feature parity with Zarr V2, not reinvent anything right now. Plenty of room to experiment with new ideas in the future.

@rabernat (Contributor, Author) commented Oct 3, 2024

I do not understand why the tests are failing in CI on this

tests/v3/test_codecs/test_vlen.py:10: in <module>
    from zarr.store.common import StorePath
E   ModuleNotFoundError: No module named 'zarr.store'

This is just boilerplate I copied from the other codec tests. Works fine locally. Help please @d-v-b or @TomAugspurger! 🙏

@d-v-b (Contributor) commented Oct 3, 2024

> I do not understand why the tests are failing in CI on this
>
>     tests/v3/test_codecs/test_vlen.py:10: in <module>
>         from zarr.store.common import StorePath
>     E   ModuleNotFoundError: No module named 'zarr.store'
>
> This is just boilerplate I copied from the other codec tests. Works fine locally. Help please @d-v-b or @TomAugspurger! 🙏

it's zarr.storage now!
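
Presumably the fix is just updating the import (assuming StorePath kept its location under the renamed package):

from zarr.storage.common import StorePath  # was zarr.store.common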

@d-v-b (Contributor) commented Oct 3, 2024

I would really love to have zarr-developers/zarr-specs#312 resolved at the same time as this PR
