Zarr

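Zarr is a binary file format for chunked, compressed, N-dimensional arrays. It is widely used in the PyData ecosystem, especially in climate and biological science applications.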

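Zarr-Python is the official Python package for reading and writing Zarr arrays. Its main feature is a NumPy-like array that seamlessly translates array operations into file IO. KvikIO provides a GPU backend for Zarr-Python that seamlessly enables GPUDirect Storage (GDS).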

KvikIO supports either zarr-python 2.x or zarr-python 3.x. However, the API provided in kvikio.zarr differs depending on which version of zarr you use, following the differences between zarr-python 2.x and zarr-python 3.x.
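
Because the available API depends on the installed major version, a script that has to run in both environments can branch on it. The following is a minimal sketch and not part of KvikIO: `zarr_major` and `kvikio_entry_point` are hypothetical helper names, and the version string is assumed to come from something like `importlib.metadata.version("zarr")`. The two entry points it names are the ones described in the sections below.

```python
def zarr_major(version: str) -> int:
    """Return the major component of a version string such as '3.0.6'."""
    return int(version.split(".")[0])


def kvikio_entry_point(version: str) -> str:
    """Name the kvikio.zarr entry point this page describes for a zarr-python version.

    Hypothetical helper for illustration only; it returns a dotted name
    and does not import anything.
    """
    major = zarr_major(version)
    if major == 3:
        return "kvikio.zarr.GDSStore"
    if major == 2:
        return "kvikio.zarr.open_cupy_array"
    raise ValueError(f"unsupported zarr-python version: {version}")
```

Keeping the decision in one small function lets the rest of a script stay free of scattered version checks.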

Zarr Python 3.x

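Zarr-python includes native support for reading Zarr chunks into device memory if you configure Zarr to use GPUs. You can use any store, but KvikIO provides kvikio.zarr.GDSStore to efficiently load data directly into GPU memory.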

>>> import zarr
>>> from kvikio.zarr import GDSStore
>>> zarr.config.enable_gpu()
>>> store = GDSStore(root="data.zarr")
>>> z = zarr.create_array(
...     store=store, shape=(100, 100), chunks=(10, 10), dtype="float32", overwrite=True
... )
>>> type(z[:10, :10])
cupy.ndarray
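
To see how the slice `z[:10, :10]` relates to the chunk layout chosen above, it helps to count the chunks a read touches. The sketch below uses plain Python rather than Zarr itself; `chunk_grid` and `chunks_touched` are hypothetical helpers, and they assume a regular chunk grid and a non-empty half-open selection box.

```python
import math


def chunk_grid(shape, chunks):
    """Number of chunks along each axis of a regularly chunked array."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunks_touched(start, stop, chunks):
    """Count the chunks overlapped by the half-open box [start, stop)."""
    count = 1
    for lo, hi, c in zip(start, stop, chunks):
        first = lo // c          # index of the first chunk on this axis
        last = (hi - 1) // c     # index of the last chunk on this axis
        count *= last - first + 1
    return count


# Same shape and chunking as the array created above.
grid = chunk_grid((100, 100), (10, 10))
# A selection aligned with chunk boundaries, like z[:10, :10].
aligned = chunks_touched((0, 0), (10, 10), (10, 10))
# The same-sized selection shifted by half a chunk, like z[5:15, 5:15].
shifted = chunks_touched((5, 5), (15, 15), (10, 10))
```

Here `grid` is a 10 by 10 grid of chunks, the aligned selection touches one chunk, and the shifted selection touches four, so aligning reads to chunk boundaries reduces the number of chunks that have to be loaded and decompressed.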

Zarr Python 2.x

The following uses zarr-python 2.x and shows how to use the convenience function kvikio.zarr.open_cupy_array() to create a new Zarr array and how to open an existing Zarr array.

# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
# See file LICENSE for terms.

import cupy
import numpy
import zarr

import kvikio
import kvikio.zarr


def main(path):
    a = cupy.arange(20)

    # Let's use KvikIO's convenience function `open_cupy_array()` to create
    # a new Zarr file on disk. Its semantics are the same as `zarr.open_array()`
    # but uses a GDS file store, nvCOMP compression, and CuPy arrays.
    z = kvikio.zarr.open_cupy_array(store=path, mode="w", shape=(20,), chunks=(5,))

    # `z` is a regular Zarr Array that we can write to as usual
    z[0:10] = numpy.arange(0, 10)
    # but it also supports direct reads and writes of CuPy arrays
    z[10:20] = cupy.arange(10, 20)

    # Reading `z` returns a CuPy array
    assert isinstance(z[:], cupy.ndarray)
    assert (a == z[:]).all()

    # Normally, we cannot assume that GPU and CPU compressors are compatible.
    # E.g., `open_cupy_array()` uses nvCOMP's Snappy GPU compression by default,
    # which, as far as we know, isn't compatible with any CPU compressor. Thus,
    # let's re-write our Zarr array using a CPU and GPU compatible compressor.
    #
    # Warning: it isn't possible to use `CompatCompressor` as a compressor argument
    #          in Zarr directly. It is only meant for `open_cupy_array()`. However,
    #          in an example further down, we show how to write using regular Zarr.
    z = kvikio.zarr.open_cupy_array(
        store=path,
        mode="w",
        shape=(20,),
        chunks=(5,),
        compressor=kvikio.zarr.CompatCompressor.lz4(),
    )
    z[:] = a

    # Because we are using a CompatCompressor, it is now possible to open the file
    # using Zarr's built-in LZ4 decompressor that uses the CPU.
    z = zarr.open_array(path)
    # `z` is now read as a regular NumPy array
    assert isinstance(z[:], numpy.ndarray)
    assert (a.get() == z[:]).all()
    # and we can write to it as usual
    z[:] = numpy.arange(20, 40)

    # And we can read the Zarr file back into a CuPy array.
    z = kvikio.zarr.open_cupy_array(store=path, mode="r")
    assert isinstance(z[:], cupy.ndarray)
    assert (cupy.arange(20, 40) == z[:]).all()

    # Similarly, we can also open a file written by regular Zarr.
    # Let's write the file without any compressor.
    ary = numpy.arange(10)
    z = zarr.open(store=path, mode="w", shape=ary.shape, compressor=None)
    z[:] = ary
    # This works as before where the file is read as a CuPy array
    z = kvikio.zarr.open_cupy_array(store=path)
    assert isinstance(z[:], cupy.ndarray)
    assert (z[:] == cupy.asarray(ary)).all()

    # Using a compressor is a bit more tricky since not all CPU compressors
    # are GPU compatible. To make sure we use a compatible compressor, we use
    # the CPU-part of `CompatCompressor.lz4()`.
    ary = numpy.arange(10)
    z = zarr.open(
        store=path,
        mode="w",
        shape=ary.shape,
        compressor=kvikio.zarr.CompatCompressor.lz4().cpu,
    )
    z[:] = ary
    # This works as before where the file is read as a CuPy array
    z = kvikio.zarr.open_cupy_array(store=path)
    assert isinstance(z[:], cupy.ndarray)
    assert (z[:] == cupy.asarray(ary)).all()


if __name__ == "__main__":
    main("/tmp/zarr-cupy-nvcomp")