cuML on GPU and CPU

cuML is a Scikit-learn-like suite of fast, GPU-accelerated machine learning algorithms for data science and analytical tasks.

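Starting with version 23.10, cuML provides both GPU-based and CPU-based execution, and switching between them requires zero code changes. This unified CPU/GPU cuML:

Allows users to prototype on systems without GPUs.
Allows library integration without dispatching logic or boilerplate code.
Allows users to train on one type of system and run inference on the other (for a subset of estimators; support will expand in the future).
Provides compatibility with the broader GPU/CPU open source PyData ecosystem.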

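Most cuML estimators can run on both CPU and GPU systems, and a subset of them supports exporting models between GPU and CPU systems. The table below shows support for the most common estimators:

Category	Algorithm	Supports execution on CPU	Supports exporting between CPU and GPU
Clustering	Density-Based Spatial Clustering of Applications with Noise (DBSCAN)	Yes	No
	Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN)	Yes	Partial
	K-Means	Yes	No
	Single-Linkage Agglomerative Clustering	No	No
Dimensionality reduction	Principal Component Analysis (PCA)	Yes	Yes
	Incremental PCA	No	No
	Truncated Singular Value Decomposition (tSVD)	Yes	Yes
	Uniform Manifold Approximation and Projection (UMAP)	Yes	Partial
	Random Projection	No	No
	t-Distributed Stochastic Neighbor Embedding (TSNE)	No	No
Linear models for regression or classification	Linear Regression (OLS)	Yes	Yes
	Linear Regression with Lasso or Ridge Regularization	Yes	Yes
	ElasticNet Regression	Yes	Yes
	LARS Regression	No	No
	Logistic Regression	Yes	Yes
	Naive Bayes	No	No
	Solvers		Yes
Nonlinear models for regression or classification	Random Forest (RF) Classification	No	Partial
	Random Forest (RF) Regression	No	Partial
	Inference for decision tree-based models	No	No
	Nearest Neighbors (NN)	Yes	Yes
	K-Nearest Neighbors (KNN) Classification	Yes	Yes
	K-Nearest Neighbors (KNN) Regression	Yes	Yes
	Support Vector Machine Classifier (SVC)	No	No
	Epsilon-Support Vector Regression (SVR)	No	No
Time series	Holt-Winters Exponential Smoothing	No	No
	Auto-Regressive Integrated Moving Average (ARIMA)	No	No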

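For the estimators marked as supported, this means the same code can run on both GPU and CPU systems. Release 23.12 is planned to add the following algorithms:

Random Forest
Support Vector Machine estimators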

Installation

On GPU systems, cuML still follows the RAPIDS requirements. The cuML packages and wheels are universal and can run in both GPU and CPU modes. To use cuML on a CPU-only system, install it with conda/mamba:

mamba install -c rapidsai -c nvidia -c conda-forge cuml-cpu=23.10
# mamba install -c rapidsai-nightly -c nvidia -c conda-forge cuml-cpu=23.12 # for nightly builds

cuML 23.10 supports Linux and WSL2 via conda on both GPU and CPU systems.
cuML 23.12 will add support for pip wheels and for CPU execution on macOS.

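How to use

There are two main ways to use the CPU capabilities of cuML.

1. Using the CPU package directly

The CPU package, cuml-cpu, is a subset of the cuml package, so code runs without any changes on a CPU-only system. For example, the following script runs both on a system with a GPU and cuml installed and on a system without a GPU and with cuml-cpu installed: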

[1]:

import cuml  # no change is needed, even for the import!
import pandas as pd

from cuml.manifold.umap import UMAP
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.manifold import trustworthiness

# load the iris dataset from sklearn and extract the required information
iris = datasets.load_iris()
dataset = iris.data

iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

# define the cuML UMAP model and use fit_transform to obtain the low-dimensional embedding of the input dataset
embedding = UMAP(
    n_neighbors=10, min_dist=0.01, init="random"
).fit_transform(iris_df)

# calculate the trustworthiness of the embedding obtained from cuML UMAP
trust = trustworthiness(iris_df, embedding)
print(trust)

0.9818028169014085

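This makes it easy to prototype on a CPU system and run production code on GPU servers, or vice versa. As noted above and shown in the examples of the corresponding section, some estimators support training on one type of system and then exporting the model to the other.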

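2. Managing the execution platform with the GPU package

In addition to zero-code-change execution on CPU systems, users on a system with the full cuML package can manually control which device executes particular parts of their code.

For example, using the following data: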

[2]:

import cuml
from cuml.neighbors import NearestNeighbors
from cuml.datasets import make_regression, make_blobs
from cuml.model_selection import train_test_split

X_blobs, y_blobs = make_blobs(n_samples=2000,
                              n_features=20)
X_train_blobs, X_test_blobs, y_train_blobs, y_test_blobs = train_test_split(X_blobs,
                                                                            y_blobs,
                                                                            test_size=0.2, shuffle=True)

X_reg, y_reg = make_regression(n_samples=2000,
                               n_features=20)
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(X_reg,
                                                                    y_reg,
                                                                    test_size=0.2,
                                                                    shuffle=True)

There are two ways to control where the code executes.

a) The `using_device_type` context manager

[3]:

from cuml.neighbors import NearestNeighbors
from cuml.common.device_selection import using_device_type

nn = NearestNeighbors()
with using_device_type('cpu'):
    nn.fit(X_train_blobs)
    nearest_neighbors = nn.kneighbors(X_test_blobs)

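This makes it easy to prototype and to run different estimators on different devices, for example when the dataset is small and the cost of moving data would outweigh the benefit of GPU acceleration.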
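
To make the scoping behaviour concrete, here is a minimal pure-Python sketch of how a context manager of this kind can temporarily override a default and restore it on exit. It does not use cuML and is not cuML's implementation; `_current_device`, `get_device`, and `using_device` are hypothetical names chosen for illustration.

```python
from contextlib import contextmanager

# Hypothetical module-level default; stands in for a library-wide setting.
_current_device = "gpu"


def get_device():
    """Return the device type currently in effect."""
    return _current_device


@contextmanager
def using_device(device_type):
    """Temporarily switch the device type, restoring the previous one on exit."""
    global _current_device
    if device_type not in ("cpu", "gpu"):
        raise ValueError(f"unknown device type: {device_type!r}")
    previous = _current_device
    _current_device = device_type
    try:
        yield
    finally:
        # Runs even if the body raises, so the override never leaks out.
        _current_device = previous


observed = [get_device()]            # before the block
with using_device("cpu"):
    observed.append(get_device())    # inside the block
observed.append(get_device())        # after the block
print(observed)  # ['gpu', 'cpu', 'gpu']
```

The `try`/`finally` is the important design choice in this sketch: the previous setting comes back even when the code inside the `with` block fails.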

It also allows running estimators with parameters that the GPU implementation does not support:

from cuml.manifold import UMAP

umap_model = UMAP(angular_rp_forest=True) # `angular_rp_forest` hyperparameter only available in UMAP library
with using_device_type('cpu'):
    umap_model.fit(X_train_blobs) # will run the UMAP library with the hyperparameter
with using_device_type('gpu'):
    transformed = umap_model.transform(X_test_blobs) # will run the cuML implementation of UMAP, ignoring the unsupported parameter.

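An upcoming feature will allow this dispatch to happen automatically under the hood. This will be useful for integrating cuML into other libraries: if the user passes parameters that are not supported on the GPU, the code will automatically dispatch to the CPU implementation.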
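
Since that feature is described only as upcoming, the following is just an illustration of the idea, not cuML code. It assumes a hypothetical `GPU_SUPPORTED_PARAMS` set and a hypothetical `choose_device` helper; the parameter names are taken from the UMAP examples above.

```python
# Hypothetical set of hyperparameters that a GPU implementation accepts.
GPU_SUPPORTED_PARAMS = {"n_neighbors", "min_dist", "init"}


def choose_device(params, gpu_available=True):
    """Pick 'gpu' only when a GPU exists and every parameter is supported there.

    Returns the chosen device and the sorted list of unsupported parameters.
    """
    unsupported = sorted(set(params) - GPU_SUPPORTED_PARAMS)
    if gpu_available and not unsupported:
        return "gpu", unsupported
    return "cpu", unsupported


print(choose_device({"n_neighbors": 10, "min_dist": 0.01}))
# ('gpu', [])
print(choose_device({"n_neighbors": 10, "angular_rp_forest": True}))
# ('cpu', ['angular_rp_forest'])
```

A library integrating such a helper could log the returned list of unsupported parameters, so that users understand why a call ran on the CPU.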

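b) Global configuration with `set_global_device_type`

By default, cuML executes estimators on the GPU (the device). It also offers a global configuration option to change the default device, which can be useful on shared systems where cuML runs alongside deep learning frameworks that occupy most of the GPU. This is done with the set_global_device_type function: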

[4]:

from cuml.common.device_selection import set_global_device_type, get_global_device_type

initial_device_type = get_global_device_type()
print('default execution device:', initial_device_type)

default execution device: DeviceType.device

[5]:

set_global_device_type('cpu')
print('new device type:', get_global_device_type())

new device type: DeviceType.host

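Cross-device training and inference serialization

As mentioned above, some estimators support training on one device type (CPU or GPU), serializing the trained model, and then deserializing and running it on the other device type.

A simple API is provided for this. For example, to train a model on the GPU but deploy it on the CPU, first train the estimator on the device and save it to disk: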

[6]:

import pickle
from cuml.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X_train_reg, y_train_reg)

pickle.dump(lin_reg, open("lin_reg.pkl", "wb"))
del lin_reg

Then, on the server or other device, restore the estimator on a node with cuml-cpu installed:

[7]:

recovered_lin_reg = pickle.load(open("lin_reg.pkl", "rb"))
predictions = recovered_lin_reg.predict(X_test_reg)
print(predictions[0:10])

[[  7.6141477]
 [-25.442528 ]
 [-48.71788  ]
 [-47.04067  ]
 [ 49.882076 ]
 [ 86.28621  ]
 [131.08463  ]
 [ 34.544495 ]
 [-49.43804  ]
 [ -4.429276 ]]
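
As a small complement, here is a sketch of the same save/load round trip written with `with` blocks, so that each file is closed as soon as its block exits. To stay runnable without cuML, it pickles a hypothetical stand-in for a fitted model (a plain dict of coefficients) into a temporary directory; `state` and `predict` are illustrative names, not part of any library.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model: just its learned coefficients.
state = {"slope": 2.0, "intercept": 1.0}


def predict(model_state, xs):
    """Apply y = slope * x + intercept to each input."""
    return [model_state["slope"] * x + model_state["intercept"] for x in xs]


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # Save: the file is flushed and closed when the block exits.
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # Load: in a real deployment this step would run on the other machine.
    with open(path, "rb") as f:
        recovered = pickle.load(f)

predictions = predict(recovered, [0.0, 1.0, 2.0])
print(predictions)  # [1.0, 3.0, 5.0]
```

Unpickling can execute arbitrary code, so pickle files should only be loaded from sources you trust.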

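Conclusion

The CPU capabilities of cuML are designed to support a range of use cases, lower the barrier to using cuML, and simplify integrating cuML into other tools and deploying models.

Upcoming cuML releases will expand the set of supported estimators, both for CPU execution and for serializing/exporting models between systems with and without GPUs.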