Zarr
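Zarr is a binary file format for chunked, compressed, N-dimensional arrays. It is widely used throughout the PyData ecosystem, especially in climate and biological science applications.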
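Zarr-Python is the official Python package for reading and writing Zarr arrays. Its main feature is a NumPy-like array that translates array operations into file IO seamlessly. KvikIO provides a GPU backend for Zarr-Python that enables GPUDirect Storage (GDS) seamlessly.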
KvikIO supports either zarr-python 2.x or zarr-python 3.x. However, the API provided in kvikio.zarr differs depending on which version of zarr you have installed, following the differences between zarr-python 2.x and zarr-python 3.x.
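Because the available API depends on the installed zarr-python version, it can help to check the major version before choosing a code path. The following is a minimal sketch, not part of KvikIO: the helper names `major_version` and `installed_zarr_major` are hypothetical, and the mapping printed at the end simply mirrors the two sections below.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def major_version(version_string: str) -> int:
    """Return the leading integer of a version string such as '3.0.8'."""
    return int(version_string.split(".")[0])


def installed_zarr_major() -> Optional[int]:
    """Return the major version of the installed zarr package, or None if absent."""
    try:
        return major_version(version("zarr"))
    except PackageNotFoundError:
        return None


zarr_major = installed_zarr_major()
if zarr_major is None:
    print("zarr-python is not installed")
elif zarr_major >= 3:
    print("zarr-python 3.x: see the GDSStore example")
else:
    print("zarr-python 2.x: see the open_cupy_array() example")
```

Only the standard library is used here, so the check runs even when zarr or KvikIO is missing.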
Zarr Python 3.x
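Zarr-python includes native support for reading Zarr chunks into device memory if you configure Zarr to use GPUs. You can use any store, but KvikIO provides kvikio.zarr.GDSStore to efficiently load data directly into GPU memory.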
>>> import zarr
>>> from kvikio.zarr import GDSStore
>>> zarr.config.enable_gpu()
>>> store = GDSStore(root="data.zarr")
>>> z = zarr.create_array(
... store=store, shape=(100, 100), chunks=(10, 10), dtype="float32", overwrite=True
... )
>>> type(z[:10, :10])
cupy.ndarray
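The array above has shape (100, 100) and chunks of (10, 10), so the slice `z[:10, :10]` lines up with exactly one chunk. The sketch below illustrates that arithmetic in plain Python. It does not call Zarr or KvikIO, and the helpers `chunk_grid` and `chunks_touched` are hypothetical names introduced only for illustration.

```python
import math
from itertools import product


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (the last chunk may be partial)."""
    return tuple(math.ceil(s / c) for s, c in zip(shape, chunks))


def chunks_touched(selection, chunks):
    """Chunk indices overlapped by a selection.

    `selection` holds one (start, stop) pair per dimension, with `stop` exclusive.
    """
    ranges = [
        range(start // c, (stop - 1) // c + 1)
        for (start, stop), c in zip(selection, chunks)
    ]
    return list(product(*ranges))


shape, chunks = (100, 100), (10, 10)
grid = chunk_grid(shape, chunks)
print(grid, math.prod(grid))  # (10, 10) 100

# z[:10, :10] is aligned with the chunk boundaries and needs a single chunk.
print(chunks_touched(((0, 10), (0, 10)), chunks))  # [(0, 0)]

# A selection shifted by half a chunk, z[5:15, :10], spans two chunks.
print(chunks_touched(((5, 15), (0, 10)), chunks))  # [(0, 0), (1, 0)]
```

Selections aligned with chunk boundaries touch the fewest chunks, which is worth keeping in mind when choosing a chunk shape for a given access pattern.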
Zarr Python 2.x
The following example uses zarr-python 2.x and shows how to use the convenience function kvikio.zarr.open_cupy_array() to create a new Zarr array and to open an existing one.
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
# See file LICENSE for terms.
import cupy
import numpy
import zarr
import kvikio
import kvikio.zarr


def main(path):
    a = cupy.arange(20)

    # Let's use KvikIO's convenience function `open_cupy_array()` to create
    # a new Zarr file on disk. Its semantics are the same as `zarr.open_array()`
    # but it uses a GDS file store, nvCOMP compression, and CuPy arrays.
    z = kvikio.zarr.open_cupy_array(store=path, mode="w", shape=(20,), chunks=(5,))

    # `z` is a regular Zarr Array that we can write to as usual
    z[0:10] = numpy.arange(0, 10)
    # but it also supports direct reads and writes of CuPy arrays
    z[10:20] = cupy.arange(10, 20)

    # Reading `z` returns a CuPy array
    assert isinstance(z[:], cupy.ndarray)
    assert (a == z[:]).all()

    # Normally, we cannot assume that GPU and CPU compressors are compatible.
    # E.g., `open_cupy_array()` uses nvCOMP's Snappy GPU compression by default,
    # which, as far as we know, isn't compatible with any CPU compressor. Thus,
    # let's re-write our Zarr array using a CPU and GPU compatible compressor.
    #
    # Warning: it isn't possible to use `CompatCompressor` as a compressor argument
    #          in Zarr directly. It is only meant for `open_cupy_array()`. However,
    #          in an example further down, we show how to write using regular Zarr.
    z = kvikio.zarr.open_cupy_array(
        store=path,
        mode="w",
        shape=(20,),
        chunks=(5,),
        compressor=kvikio.zarr.CompatCompressor.lz4(),
    )
    z[:] = a

    # Because we are using a CompatCompressor, it is now possible to open the file
    # using Zarr's built-in LZ4 decompressor that uses the CPU.
    z = zarr.open_array(path)
    # `z` is now read as a regular NumPy array
    assert isinstance(z[:], numpy.ndarray)
    assert (a.get() == z[:]).all()
    # and we can write to it as usual
    z[:] = numpy.arange(20, 40)

    # And we can read the Zarr file back into a CuPy array.
    z = kvikio.zarr.open_cupy_array(store=path, mode="r")
    assert isinstance(z[:], cupy.ndarray)
    assert (cupy.arange(20, 40) == z[:]).all()

    # Similarly, we can also open a file written by regular Zarr.
    # Let's write the file without any compressor.
    ary = numpy.arange(10)
    z = zarr.open(store=path, mode="w", shape=ary.shape, compressor=None)
    z[:] = ary
    # This works as before where the file is read as a CuPy array
    z = kvikio.zarr.open_cupy_array(store=path)
    assert isinstance(z[:], cupy.ndarray)
    assert (z[:] == cupy.asarray(ary)).all()

    # Using a compressor is a bit more tricky since not all CPU compressors
    # are GPU compatible. To make sure we use a compatible compressor, we use
    # the CPU-part of `CompatCompressor.lz4()`.
    ary = numpy.arange(10)
    z = zarr.open(
        store=path,
        mode="w",
        shape=ary.shape,
        compressor=kvikio.zarr.CompatCompressor.lz4().cpu,
    )
    z[:] = ary
    # This works as before where the file is read as a CuPy array
    z = kvikio.zarr.open_cupy_array(store=path)
    assert isinstance(z[:], cupy.ndarray)
    assert (z[:] == cupy.asarray(ary)).all()


if __name__ == "__main__":
    main("/tmp/zarr-cupy-nvcomp")