Implement legacy vlen-utf8 codec #2036

Draft
rabernat wants to merge 11 commits into base: v3

Conversation

@rabernat (Contributor) commented Jul 14, 2024

This is an alternative approach to #2031 for implementing variable length string encoding, using the legacy numcodecs vlen-utf8 codec.

The codec is very simple. It encodes variable length data like this:

| header | item 0 len | item 0 data | item 1 len | item 1 data | ... 
| uint4  | uint4      | variable    | uint4      | variable    | ...

where the header is the total number of items, and each item is preceded by an unsigned int giving its length in bytes.
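
For illustration, here is a minimal pure-Python sketch of that layout (assuming 4-byte little-endian unsigned ints; encode_vlen_utf8/decode_vlen_utf8 are hypothetical names for this sketch, not the actual codec API):

import struct

def encode_vlen_utf8(items: list[str]) -> bytes:
    # header: number of items, then (length, payload) pairs
    out = [struct.pack("<I", len(items))]
    for s in items:
        data = s.encode("utf-8")
        out.append(struct.pack("<I", len(data)))
        out.append(data)
    return b"".join(out)

def decode_vlen_utf8(buf: bytes) -> list[str]:
    (n,) = struct.unpack_from("<I", buf, 0)
    pos, items = 4, []
    for _ in range(n):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        items.append(buf[pos:pos + length].decode("utf-8"))
        pos += length
    return items

assert decode_vlen_utf8(encode_vlen_utf8(["hi", "world"])) == ["hi", "world"]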

We could add this as an "official" Zarr V3 codec if we want. The point of this PR is to demonstrate that it is simple to implement. If desirable, we could extend this to the other vlen types (bytes and array).

I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

However, I would favor using Arrow encoding (#2031) as the preferred default for new data.

xref:

@rabernat (Contributor, Author) commented Jul 14, 2024

I compared this encoding + Zstd compression with the arrow-based encoding in #2031 and got some surprising results. The dataset consists of 47868 city names.

| encoding  | compression | stored bytes |
|-----------|-------------|--------------|
| vlen-utf8 | None        | 642616       |
| vlen-utf8 | Zstd(2)     | 366504       |
| vlen-utf8 | Zstd(5)     | 358428       |
| arrow     | None        | 657185       |
| arrow     | Zstd(2)     | 462208       |
| arrow     | Zstd(5)     | 455234       |

The surprise is that the legacy encoding compresses better than the arrow-based one.

import asyncio
import pandas as pd
import zarr
from zarr.api.synchronous import array
from zarr.buffer import default_buffer_prototype
from zarr.codecs import ZstdCodec

# swap to compare across PRs
from zarr.codecs import VLenUTF8Codec as BytesCodec
# from zarr.codecs import ArrowRecordBatchCodec as BytesCodec

df = pd.read_csv("worldcities.csv")  # the dataset attached below

a = array(df.city.values, chunks=(1000,), fill_value='', codecs=[BytesCodec(), ZstdCodec(level=5)])

# ridiculous workaround for the lack of getsize on the V3 store API
# (uses top-level await, so run in an async REPL such as IPython)
store_path = a._async_array.store_path
all_items = [item async for item in store_path.store.list()]

async def get_item_size(store, item):
    return len(await store.get(item, prototype=default_buffer_prototype))

all_sizes = await asyncio.gather(*[get_item_size(store_path.store, item) for item in all_items])
total = sum(all_sizes)

worldcities.csv

@tomwhite (Contributor)

Thanks for opening this PR @rabernat!

I just tried your implementation of VLenUTF8Codec with the VCF to Zarr conversion in sgkit-dev/bio2zarr#254, and the test passed, which shows that string encoding and decoding are working correctly in this case.

@tomwhite (Contributor)

> I am in favor of supporting this "legacy" encoding for backwards compatibility--I'd like existing Zarr V2 data to be convertible to V3 without rewriting the chunks.

+1

@jhamman added the V3 (Related to compatibility with V3 spec) label on Aug 9, 2024
@TomAugspurger (Contributor)

Thanks. Would it be possible to support object-dtype arrays here too (I think zarr-v2 or xarray assumes object-dtype arrays hold strings)?

Right now this raises an exception.

import numpy as np
import zarr
import zarr.store

store = zarr.store.MemoryStore(mode="a")
data = np.array(['a', 'bc', 'def', 'asdf', 'asdfg'], dtype=np.dtype(object))

# round-tripping an object-dtype array currently raises an exception
arr = zarr.create(shape=data.shape, store=store, zarr_format=2, filters=[{"id": "vlen-utf8"}], fill_value="", dtype=object)
arr[:] = data
arr2 = zarr.open_array(store=store, zarr_format=2)
print(arr2[:])

Something like this seems to do the trick for zarr_format=2:

diff --git a/src/zarr/codecs/_v2.py b/src/zarr/codecs/_v2.py
index cc6129e6..c1f4944e 100644
--- a/src/zarr/codecs/_v2.py
+++ b/src/zarr/codecs/_v2.py
@@ -37,7 +37,13 @@ class V2Compressor(ArrayBytesCodec):
 
         # ensure correct dtype
         if str(chunk_numpy_array.dtype) != chunk_spec.dtype:
-            chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
+            if chunk_spec.dtype.kind == "O":
+                # I think we need to assert something about the codec pipeline here
+                # to ensure that we're followed by something that'll do the cast properly.
+                # chunk_numpy_array = chunk_numpy_array.astype(chunk_spec.dtype)
+                pass
+            else:
+                chunk_numpy_array = chunk_numpy_array.view(chunk_spec.dtype)
 
         return get_ndbuffer_class().from_numpy_array(chunk_numpy_array)
 
diff --git a/src/zarr/core/buffer/core.py b/src/zarr/core/buffer/core.py
index 9a808e08..034711d9 100644
--- a/src/zarr/core/buffer/core.py
+++ b/src/zarr/core/buffer/core.py
@@ -472,7 +472,7 @@ class NDBuffer:
         # use array_equal to obtain equal_nan=True functionality
         _data, other = np.broadcast_arrays(self._data, other)
         return np.array_equal(
-            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "US" else False
+            self._data, other, equal_nan=equal_nan if self._data.dtype.kind not in "OUS" else False
         )
 
     def fill(self, value: Any) -> None:

@rabernat (Contributor, Author)

We could do that if we assume that the object dtype only ever holds strings. But that seems like a pretty extreme assumption.

Zarr V2 allows arbitrary python objects, but V3 does not. Seems reasonable to force users to use an explicit string dtype for string arrays.

What's the use case you have in mind for object string arrays?

@TomAugspurger (Contributor)

I'll look a bit closer tomorrow, but this was for reading zarr v2 data (an xarray dataset with an object-dtype coordinate array full of strings).

I agree that being explicit about the dtype here is better. Hopefully we can find a way that works well for older v2 datasets without too much hassle.

@rabernat (Contributor, Author) commented Oct 1, 2024

Here's a little exploration of different types of numpy string arrays:

import numpy as np
strings = ["hello", "world", "this", "is", "a", "test"]
# default
a1 = np.array(strings)
assert a1.dtype == '<U5'
# numpy 2.0 vlen string
a2 = np.array(strings, dtype='T')
# object
a3 = np.array(strings, dtype='O')

# the first two are detectable as string arrays
assert np.issubdtype(np.str_, a1.dtype)
assert np.issubdtype(np.str_, a2.dtype)

# the object is not, even though it contains all strings
assert not np.issubdtype(np.str_, a3.dtype)

# we can cast the object dtype to a vlen string or regular string
assert np.issubdtype(np.str_, a3.astype('T').dtype)

# however, we want to guard against casting if the array contains non-string elements
a4 = np.array(strings + [1], dtype='O') 
a4.astype('T')  # works
a4.astype('T', casting="same_kind")  # -> TypeError

So my proposal for object arrays would be to try to cast them using .astype('T', casting="same_kind") and, if successful, use string encoding.

edit: actually that is useless, because this also fails:

a3.astype('T', casting="same_kind")

So it looks like there is actually no reliable way to know if an object array is a string array.
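
For illustration, the only reliable check seems to be element-wise (a sketch; is_string_object_array is a hypothetical helper, roughly the kind of O(n) scan a consumer like xarray has to do):

import numpy as np

def is_string_object_array(arr: np.ndarray) -> bool:
    # no dtype-level test exists for object arrays; scan every element
    return arr.dtype.kind == "O" and all(isinstance(x, str) for x in arr.flat)

assert is_string_object_array(np.array(["a", "b"], dtype="O"))
assert not is_string_object_array(np.array(["a", 1], dtype="O"))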

@TomAugspurger - do you have a test case you are using for object arrays? Is my example enough to cover what Xarray needs? Under what circumstances does Xarray produce object arrays?


Also, my code is relying on numpy 2.0 vlen-string arrays. I guess that's not a hard dependency for Zarr. 🤔
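
For illustration, one way to keep numpy 2.0 a soft dependency would be feature detection at import time (a sketch; HAS_VLEN_STRING is a hypothetical flag):

try:
    from numpy.dtypes import StringDType  # available in numpy >= 2.0
    HAS_VLEN_STRING = True
except ImportError:
    StringDType = None
    HAS_VLEN_STRING = False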

@TomAugspurger (Contributor) commented Oct 1, 2024

do you have a test case you are using for object arrays?

In xarray, xarray/tests/test_backends.py::TestZarrDictStore::test_roundtrip_object_dtype is the failing test: https://github.com/pydata/xarray/blob/095d47fcb036441532bf6f5aed907a6c4cfdfe0d/xarray/tests/test_backends.py#L506

In [10]: import xarray as xr, numpy as np, zarr

In [11]: ds = xr.Dataset({"a": xr.DataArray(np.array(['a', 'b', 'c'], dtype=object))})

In [12]: store = zarr.store.MemoryStore(mode="a")

In [13]: ds.to_zarr(store)

Those are manually set to object dtype. But that same test also creates float arrays, so there's something more sophisticated going on than "object dtype == strings". I'll do some more looking as I work through those tests, but I think this PR can proceed independently.

@TomAugspurger (Contributor)

OK, I think I understand how v2 did things a bit better now.

  1. It's xarray that treats object-dtype arrays as strings, after validating that's actually true. If so, then xarray sets the dtype passed to Zarr to str: https://github.com/pydata/xarray/blob/5c6418219e551cd59f81e3d1c6a828c1c23cd763/xarray/backends/zarr.py#L899-L900
  2. Then zarr turns dtype=str back into dtype=object with

    zarr-python/zarr/util.py, lines 187 to 192 in 6b9ab9e:

    if isinstance(dtype, str):
        # allow ':' to delimit class from codec arguments
        tokens = dtype.split(":")
        key = tokens[0]
        if key in object_codecs:
            dtype = np.dtype(object)
    (assuming an object codec is provided, which xarray does)

The net result is that in zarr-python 2.x writing zarr v2 data, here's the compressor and filters for various dtypes:

| dtype           | compressor | filters  |
|-----------------|------------|----------|
| strings (U)     | Blosc      |          |
| bytes (S)       | Blosc      |          |
| object[str] (O) | Blosc      | VLenUTF8 |
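
For illustration, the object[str] row in the table above corresponds to this zarr-python 2.x pattern (a sketch using the v2 object_codec argument; the dtype=str to object mapping is the behavior described above):

import numcodecs
import numpy as np
import zarr  # zarr-python 2.x

data = np.array(["a", "bc", "def"], dtype=object)
z = zarr.array(data, dtype=str, object_codec=numcodecs.VLenUTF8())
# dtype=str is mapped back to object, with VLenUTF8 inserted as a filter
assert z.dtype == object
assert z.filters == [numcodecs.VLenUTF8()]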

@rabernat (Contributor, Author) commented Oct 3, 2024

Thanks Tom. Planning to work on this today.

@mkitti commented Oct 3, 2024

Could we use the existing sharding codec for new variable length data? Essentially each variable length element is just an "inner chunk" with some offset and length (nbytes)?

@rabernat (Contributor, Author) commented Oct 3, 2024

I'm not planning on doing anything along those lines here. The "variable length" we're talking about is on the order of bytes, so it doesn't make sense to compress each string individually.

Just trying to get to basic feature parity with Zarr V2, not reinvent anything right now. Plenty of room to experiment with new ideas in the future.

@rabernat (Contributor, Author) commented Oct 3, 2024

I do not understand why the tests are failing in CI on this

tests/v3/test_codecs/test_vlen.py:10: in <module>
    from zarr.store.common import StorePath
E   ModuleNotFoundError: No module named 'zarr.store'

This is just boilerplate I copied from the other codec tests. Works fine locally. Help please @d-v-b or @TomAugspurger! 🙏

@d-v-b (Contributor) commented Oct 3, 2024

> I do not understand why the tests are failing in CI on this
>
>     tests/v3/test_codecs/test_vlen.py:10: in <module>
>         from zarr.store.common import StorePath
>     E   ModuleNotFoundError: No module named 'zarr.store'
>
> This is just boilerplate I copied from the other codec tests. Works fine locally. Help please @d-v-b or @TomAugspurger! 🙏

it's zarr.storage now!
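
Presumably the fix is just updating the import (assuming StorePath kept its location under the renamed package):

from zarr.storage.common import StorePath  # was zarr.store.common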

@d-v-b (Contributor) commented Oct 3, 2024

I would really love to have zarr-developers/zarr-specs#312 resolved at the same time as this PR
