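HNSW

This is a wrapper around hnswlib that loads a CAGRA index as an immutable HNSW index. The loaded HNSW index is only compatible with cuVS and can be searched with the wrapper functions.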

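Index search parameters

class cuvs.neighbors.hnsw.SearchParams(ef=200, *, num_threads=0)

HNSW search parameters.

Parameters:

ef: int, default = 200: Maximum size of the candidate list used during search.
num_threads: int, default = 0: Number of CPU threads used to increase search parallelism. When set to 0, the number of threads is determined automatically with OpenMP's omp_get_max_threads().

Attributes: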

ef
num_threads
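
To make the role of ef more concrete, the following is a minimal pure-Python sketch of a best-first graph search that keeps a bounded candidate list. It is an illustration of the general idea only, not the hnswlib or cuVS implementation; the function name search_layer, the ring-shaped toy graph, and the data are all assumptions made for this example.

```python
import heapq

import numpy as np


def search_layer(data, graph, query, entry, ef, k):
    """Best-first search that keeps at most `ef` candidates, then returns the k closest ids."""

    def dist(i):
        diff = data[i] - query
        return float(diff @ diff)  # squared Euclidean distance

    visited = {entry}
    d0 = dist(entry)
    frontier = [(d0, entry)]  # min-heap: closest unexpanded node first
    best = [(-d0, entry)]     # max-heap (negated distances): the ef closest nodes seen so far
    while frontier:
        d, node = heapq.heappop(frontier)
        if len(best) >= ef and d > -best[0][0]:
            break  # the closest remaining node cannot improve the candidate list
        for nb in graph[node]:
            nb = int(nb)
            if nb in visited:
                continue
            visited.add(nb)
            dn = dist(nb)
            if len(best) < ef or dn < -best[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(best, (-dn, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the farthest candidate
    ranked = sorted((-neg_d, i) for neg_d, i in best)
    return [i for _, i in ranked[:k]]


# Toy data: 20 points on a line, each linked to its two ring neighbors (hypothetical graph).
data = np.arange(20, dtype=np.float32).reshape(-1, 1)
graph = np.array([[(i - 1) % 20, (i + 1) % 20] for i in range(20)])
query = np.array([7.2], dtype=np.float32)
print(search_layer(data, graph, query, entry=0, ef=20, k=3))
```

A larger ef lets the search keep more candidates alive, which generally improves recall at the cost of more distance computations; a small ef behaves like a greedy walk.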

Index

class cuvs.neighbors.hnsw.Index

HNSW index object. This object stores the trained HNSW index state, which can be used to perform nearest neighbor searches.

Attributes:

trained

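Index conversion

cuvs.neighbors.hnsw.from_cagra(IndexParams index_params, Index cagra_index, temporary_index_path=None, resources=None)

Returns an HNSW index from a CAGRA index.

Note: the behavior depends on index_params.hierarchy:

NONE: This method uses the filesystem: it writes the CAGRA index to /tmp/<random_number>.bin, reads it back as an hnswlib index, and then deletes the temporary file. The returned index is immutable and can only be searched through the hnswlib wrapper in cuVS, because its format is not compatible with the original hnswlib.
CPU: The returned index is mutable and can be extended with additional vectors. The serialized index is also compatible with the original hnswlib library.

Saving and loading the index is experimental. The serialization format is subject to change.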

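Parameters:

index_params: IndexParams: Parameters for converting the CAGRA index to an HNSW index.
cagra_index: cagra.Index: Trained CAGRA index.
temporary_index_path: string, default = None: Path where the temporary index file is saved. If None, the temporary file is saved to /tmp/<random_number>.bin.
resources: Optional cuVS Resources handle for reusing CUDA resources. If Resources are not supplied, CUDA resources are allocated inside the function and synchronized before the function exits. If resources are supplied, you need to synchronize explicitly by calling resources.sync() before accessing the output.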
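
The write, read, and delete sequence described for the NONE hierarchy can be pictured with a small standard-library sketch. This is not cuVS code: roundtrip_via_file is a hypothetical helper, and the bytes payload merely stands in for a serialized index. It only shows why an explicit temporary_index_path can matter, for example when the default temporary directory is small or not writable.

```python
import os
import tempfile


def roundtrip_via_file(payload: bytes, temporary_index_path=None) -> bytes:
    """Write `payload` to a temporary file, read it back, then delete the file."""
    if temporary_index_path is None:
        # Assumption for this sketch: fall back to the system temporary directory.
        fd, path = tempfile.mkstemp(suffix=".bin")
        os.close(fd)
    else:
        path = temporary_index_path
    try:
        with open(path, "wb") as f:
            f.write(payload)
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)  # the temporary file never outlives the call


restored = roundtrip_via_file(b"serialized-index-bytes")
print(len(restored))
```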

Examples

>>> import cupy as cp
>>> from cuvs.neighbors import cagra
>>> from cuvs.neighbors import hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # Serialize the CAGRA index to hnswlib base layer only index format
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), index)

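Index search

cuvs.neighbors.hnsw.search(SearchParams search_params, Index index, queries, k, neighbors=None, distances=None, resources=None)

Finds the k nearest neighbors for each query.

Parameters:

search_params: SearchParams
index: Index: Trained HNSW index.
queries: CPU array interface compliant matrix of shape (n_queries, dim): Supported dtype: [float, int]
k: int: The number of nearest neighbors.
neighbors: Optional CPU array interface compliant matrix of shape (n_queries, k), dtype uint64_t. If supplied, the neighbor indices are written here in place. (default None)
distances: Optional CPU array interface compliant matrix of shape (n_queries, k). If supplied, the distances to the nearest neighbors are written here in place. (default None)
resources: Optional cuVS Resources handle for reusing CUDA resources. If Resources are not supplied, CUDA resources are allocated inside the function and synchronized before the function exits. If resources are supplied, you need to synchronize explicitly by calling resources.sync() before accessing the output.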
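
To illustrate the output layout described above, here is a NumPy-only sketch of an exact brute-force search that fills preallocated (n_queries, k) arrays in place, plus a recall helper for comparing approximate results against it. It does not call cuVS; brute_force_knn and recall are hypothetical helpers, and the float32 dtype chosen for distances is an assumption, since the parameter description does not state it.

```python
import numpy as np


def brute_force_knn(dataset, queries, k, neighbors=None, distances=None):
    """Exact k-NN under squared Euclidean distance, written into (n_queries, k) outputs."""
    n_queries = queries.shape[0]
    if neighbors is None:
        neighbors = np.empty((n_queries, k), dtype=np.uint64)
    if distances is None:
        distances = np.empty((n_queries, k), dtype=np.float32)  # assumed dtype
    # Pairwise squared distances; this is O(n_queries * n_samples * dim) memory,
    # so it is only suitable for small reference checks.
    d2 = ((queries[:, None, :] - dataset[None, :, :]) ** 2).sum(axis=2)
    order = np.argsort(d2, axis=1, kind="stable")[:, :k]
    neighbors[...] = order.astype(np.uint64)
    distances[...] = np.take_along_axis(d2, order, axis=1)
    return distances, neighbors


def recall(found, expected):
    """Fraction of expected neighbor ids that also appear in the found ids, row by row."""
    hits = sum(len(set(f.tolist()) & set(e.tolist())) for f, e in zip(found, expected))
    return hits / expected.size


dataset = np.array([[0, 0], [1, 0], [0, 2], [5, 5]], dtype=np.float32)
queries = np.array([[0.1, 0], [4, 4]], dtype=np.float32)
out_neighbors = np.empty((2, 2), dtype=np.uint64)
distances, neighbors = brute_force_knn(dataset, queries, 2, neighbors=out_neighbors)
print(neighbors)
```

Passing the neighbors returned by an approximate search as found and the brute-force neighbors as expected gives a quick estimate of recall for a chosen ef.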

Examples

>>> import cupy as cp
>>> from cuvs.neighbors import cagra, hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> n_queries = 1000
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # HNSW search runs on the CPU, so copy the queries to host memory
>>> queries = cp.asnumpy(cp.random.random_sample((n_queries, n_features),
...                                              dtype=cp.float32))
>>> k = 10
>>> search_params = hnsw.SearchParams(
...     ef=200,
...     num_threads=0
... )
>>> # Convert CAGRA index to HNSW
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), index)
>>> # Search the converted HNSW index, not the original CAGRA index
>>> distances, neighbors = hnsw.search(search_params, hnsw_index,
...                                     queries, k)
>>> # Optionally copy the results to the GPU
>>> neighbors = cp.asarray(neighbors)
>>> distances = cp.asarray(distances)

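Index save

cuvs.neighbors.hnsw.save(filename, Index index, resources=None)

Saves the CAGRA index to a file as an hnswlib index. If the index was built with hnsw.IndexParams(hierarchy="none"), the saved index is immutable and can only be searched through the hnswlib wrapper in cuVS, because its format is not compatible with the original hnswlib. However, if the index was built with hnsw.IndexParams(hierarchy="cpu"), the saved index is mutable and compatible with the original hnswlib.

Saving and loading the index is experimental. The serialization format is subject to change.

Parameters:

filename: string: Name of the file.
index: Index: Trained HNSW index.
resources: Optional cuVS Resources handle for reusing CUDA resources. If Resources are not supplied, CUDA resources are allocated inside the function and synchronized before the function exits. If resources are supplied, you need to synchronize explicitly by calling resources.sync() before accessing the output.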

Examples

>>> import cupy as cp
>>> from cuvs.neighbors import cagra
>>> from cuvs.neighbors import hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> cagra_index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert the CAGRA index to HNSW and serialize it to a file
>>> hnsw_index = hnsw.from_cagra(hnsw.IndexParams(), cagra_index)
>>> hnsw.save("my_index.bin", hnsw_index)

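Index load

cuvs.neighbors.hnsw.load(IndexParams index_params, filename, dim, dtype, metric='sqeuclidean', resources=None)

Loads an HNSW index. If the index was built with hnsw.IndexParams(hierarchy="none"), the loaded index is immutable and can only be searched through the hnswlib wrapper in cuVS, because its format is not compatible with the original hnswlib. However, if the index was built with hnsw.IndexParams(hierarchy="cpu"), the loaded index is mutable and compatible with the original hnswlib.

Saving and loading the index is experimental. The serialization format is subject to change, so there is no guarantee that an index saved with a previous version of cuVS can be loaded.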

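Parameters:

index_params: IndexParams

Parameters that were used to convert the CAGRA index to an HNSW index.

filename: string

Name of the file.

dim: int

Dimensions of the training dataset.

dtype: np.dtype of the saved index

Valid values for dtype: [np.float32, np.byte, np.ubyte]

metric: string denoting the metric type, default = "sqeuclidean"

Valid values for metric: ["sqeuclidean", "inner_product"], where

sqeuclidean is the Euclidean distance without the square root operation, i.e. distance(a, b) = sum_i (a_i - b_i)^2,
inner_product distance is defined as distance(a, b) = sum_i a_i * b_i.

resources: Optional cuVS Resources handle for reusing CUDA resources.

If Resources are not supplied, CUDA resources are allocated inside the function and synchronized before the function exits. If resources are supplied, you need to synchronize explicitly by calling resources.sync() before accessing the output.

Returns:

index: HnswIndex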
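
The two metric definitions given for the metric parameter can be checked numerically with a short NumPy sketch. The helper names sqeuclidean and inner_product below are chosen only to mirror the metric strings; they are illustrative functions, not part of the cuVS API.

```python
import numpy as np


def sqeuclidean(a, b):
    """distance(a, b) = sum_i (a_i - b_i)^2, i.e. Euclidean distance without the square root."""
    diff = np.asarray(a, dtype=np.float64) - np.asarray(b, dtype=np.float64)
    return float(np.sum(diff * diff))


def inner_product(a, b):
    """distance(a, b) = sum_i a_i * b_i."""
    return float(np.dot(np.asarray(a, dtype=np.float64), np.asarray(b, dtype=np.float64)))


a = [1.0, 2.0, 3.0]
b = [2.0, 0.0, 3.0]
print(sqeuclidean(a, b))    # (1-2)^2 + (2-0)^2 + (3-3)^2 = 5.0
print(inner_product(a, b))  # 1*2 + 2*0 + 3*3 = 11.0
```

Because the square root is monotonic, ranking neighbors by sqeuclidean gives the same order as ranking by the true Euclidean distance.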

Examples

>>> import cupy as cp
>>> import numpy as np
>>> from cuvs.neighbors import cagra
>>> from cuvs.neighbors import hnsw
>>> n_samples = 50000
>>> n_features = 50
>>> dataset = cp.random.random_sample((n_samples, n_features),
...                                   dtype=cp.float32)
>>> # Build index
>>> index = cagra.build(cagra.IndexParams(), dataset)
>>> # Convert the CAGRA index to HNSW and save it in hnswlib format
>>> index_params = hnsw.IndexParams()
>>> hnsw_index = hnsw.from_cagra(index_params, index)
>>> hnsw.save("my_index.bin", hnsw_index)
>>> # Load the saved index with the same index parameters
>>> index = hnsw.load(index_params, "my_index.bin", n_features,
...                   np.float32, "sqeuclidean")