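Training and Evaluating Machine Learning Models#

This notebook explores several basic machine learning estimators in cuML, showing how to train them and evaluate them with built-in metric functions. All models are trained on synthetic data generated by cuML's dataset utilities.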

  1. Random Forest Classifier

  2. UMAP

  3. DBSCAN

  4. Linear Regression

Shared Library Imports#

[1]:
import cuml
from cupy import asnumpy
from joblib import dump, load

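1. Classification#

Random Forest Classification and Accuracy Metrics#

A random forest classification model builds multiple decision trees and aggregates their individual outputs to make a prediction. For more information on the random forest classification model in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier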
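The aggregation step described above can be illustrated with a small pure-Python sketch. It assumes the simplest rule, a majority vote over hard class predictions; this is an illustration only and is not taken from cuML, whose forest may combine tree outputs differently. The function name `majority_vote` and the three "trees" are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine per-tree class predictions into one prediction per sample.

    tree_predictions is a list with one inner list per tree; each inner
    list holds that tree's predicted class for every sample.
    """
    n_samples = len(tree_predictions[0])
    combined = []
    for i in range(n_samples):
        # count how many trees voted for each class on sample i
        votes = Counter(tree[i] for tree in tree_predictions)
        # most_common(1) returns [(label, count)] for the top-voted label
        combined.append(votes.most_common(1)[0][0])
    return combined


# three hypothetical trees voting on four samples
tree_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
forest_prediction = majority_vote(tree_predictions)
print(forest_prediction)
```

Each tree can be wrong on individual samples (the second tree on the third sample, the third tree on the first), yet the combined vote can still agree with the majority on every sample.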

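The accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of a classification model. For more information on the accuracy score, see: https://en.wikipedia.org/wiki/Accuracy_and_precision

For more information on the accuracy score metric in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.metrics.accuracy.accuracy_score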
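As a minimal sketch of the definition above (correct predictions divided by total predictions), the following pure-Python function computes accuracy on a tiny hand-made example. The helper name `accuracy` and the label lists are hypothetical; the library functions used in the notebook operate on arrays instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]  # one mistake, at the third position
score = accuracy(y_true, y_pred)
print(score)  # 4 of the 5 predictions are correct
```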

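The cell below shows an end-to-end pipeline for the random forest classification model. The dataset is generated with cuML's make_classification function, split into training and test sets, and used to train the model and run predictions. The model's accuracy is then computed with both cuML's and sklearn's accuracy metrics, and the two values are compared.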

[2]:
from cuml.datasets.classification import make_classification
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score

# synthetic dataset dimensions
n_samples = 1000
n_features = 10
n_classes = 2

# random forest depth and size
n_estimators = 25
max_depth = 10

# generate synthetic data [ binary classification task ]
X, y = make_classification ( n_classes = n_classes,
                             n_features = n_features,
                             n_samples = n_samples,
                             random_state = 0 )

X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 0 )

model = cuRF( max_depth = max_depth,
              n_estimators = n_estimators,
              random_state  = 0 )

trained_RF = model.fit ( X_train, y_train )

predictions = model.predict ( X_test )

cu_score = cuml.metrics.accuracy_score( y_test, predictions )
sk_score = accuracy_score( asnumpy( y_test ), asnumpy( predictions ) )

print( " cuml accuracy: ", cu_score )
print( " sklearn accuracy : ", sk_score )

# save
dump( trained_RF, 'RF.model')

# reload the saved model
loaded_model = load('RF.model')
/opt/conda/envs/docs/lib/python3.12/site-packages/cuml/internals/api_decorators.py:317: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return init_func(self, *args, **kwargs)
 cuml accuracy:  0.996
 sklearn accuracy :  0.996

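Clustering#

UMAP and Trustworthiness Metrics#

UMAP is a dimensionality reduction algorithm that performs non-linear dimension reduction. It can also be used for visualization. For more information on the UMAP model, see the documentation: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.UMAP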

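Trustworthiness measures the extent to which local structure is retained in the model's embedding. Samples that appear among a point's nearest neighbors in the embedding but are not among its nearest neighbors in the original space are penalized. For more information on the trustworthiness metric, see: https://scikit-learn.cn/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html

The documentation for the trustworthiness metric implementation in cuML is: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness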
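The penalty described above can be made concrete with a brute-force NumPy sketch. It assumes the commonly stated definition T(k) = 1 - 2 / (n k (2n - 3k - 1)) * sum of max(0, r(i, j) - k) over every point i and each of its k nearest embedded neighbors j, where r(i, j) is the rank of j among the neighbors of i in the original space. The function name `trustworthiness_sketch`, the parameter `k`, and the random data are hypothetical; this is not the cuML or sklearn implementation and is only practical for small inputs.

```python
import numpy as np


def trustworthiness_sketch(X, X_emb, k=5):
    """Brute-force trustworthiness for small inputs (assumes k < n / 2)."""
    n = X.shape[0]
    rows = np.arange(n)[:, None]

    def pairwise_dists(A):
        d = np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)  # a point is never its own neighbor
        return d

    # ranks[i, j] = 1 if j is the nearest original-space neighbor of i, etc.
    order = np.argsort(pairwise_dists(X), axis=1)
    ranks = np.empty_like(order)
    ranks[rows, order] = np.arange(1, n + 1)[None, :]

    # the k nearest neighbors of every point in the embedding
    emb_nn = np.argsort(pairwise_dists(X_emb), axis=1)[:, :k]

    # embedded neighbors with original rank <= k contribute no penalty
    penalty = np.maximum(ranks[rows, emb_nn] - k, 0).sum()
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
score_identity = trustworthiness_sketch(X, X.copy(), k=3)            # neighborhoods unchanged
score_random = trustworthiness_sketch(X, rng.normal(size=(30, 2)), k=3)  # unrelated embedding
print(score_identity, score_random)
```

An embedding that leaves every neighborhood unchanged incurs no penalty, while an embedding unrelated to the input places far-away points next to each other and scores lower.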

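The cell below shows an end-to-end pipeline for the UMAP model. A blobs dataset, created with cuML's equivalent of the make_blobs function, is used as input. The embedding that UMAP produces for this input is evaluated with the trustworthiness function, and the values obtained from sklearn's and cuML's trustworthiness are compared.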

[3]:
from cuml.datasets import make_blobs
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness
import numpy as np

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs( n_samples = n_samples,
                               cluster_std = cluster_std,
                               n_features = n_features,
                               random_state = 0,
                               dtype=np.float32 )

trained_UMAP = cuUMAP( n_neighbors = 10 ).fit( X_blobs )
X_embedded = trained_UMAP.transform( X_blobs )

cu_score = cuml.metrics.trustworthiness( X_blobs, X_embedded )
sk_score = trustworthiness( asnumpy( X_blobs ),  asnumpy( X_embedded ) )

print(" cuml's trustworthiness score : ", cu_score )
print(" sklearn's trustworthiness score : ", sk_score )

# save
dump( trained_UMAP, 'UMAP.model')

# to reload the model uncomment the line below
# loaded_model = load('UMAP.model')
[2025-04-14 07:40:34.732] [CUML] [info] Building knn graph using brute force
 cuml's trustworthiness score :  0.850690120967742
 sklearn's trustworthiness score :  0.850690120967742
[3]:
['UMAP.model']

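DBSCAN and Adjusted Rand Index#

DBSCAN is a popular and powerful clustering algorithm. For more information on the DBSCAN model, see the documentation: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.DBSCAN

We create the blobs dataset with cuML's equivalent of the make_blobs function.

The adjusted Rand index is a metric that measures the similarity between two clusterings of the same data, adjusted for the chance grouping of elements. For more information on the adjusted Rand index, see: https://en.wikipedia.org/wiki/Rand_index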
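To show what "adjusted for chance" means in practice, here is a minimal pure-Python sketch of the adjusted Rand index based on pair counts from the contingency table of the two labelings. The function name `adjusted_rand_index` and the toy labelings are hypothetical, and the handling of the degenerate case (returning 1.0 when the denominator is zero) is an assumption of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index of two labelings of the same samples."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumption: too few samples to disagree on any pair

    # pairs of samples grouped together in both, in a only, and in b only
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # agreement expected by chance
    maximum = (together_a + together_b) / 2          # largest possible agreement
    if maximum == expected:
        return 1.0  # assumption: both labelings are trivial and identical
    return (together_both - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])  # labels renamed only
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])         # every pair split
print(same_partition, crossed)
```

The index depends only on which samples are grouped together, not on the label values, so renaming the clusters leaves it at 1.0. Labelings that agree less than chance would predict give a negative value.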

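The cell below shows an end-to-end DBSCAN example. The output of DBSCAN's fit_predict is evaluated with the adjusted Rand index function, and the values obtained from sklearn's and cuML's adjusted Rand metrics are compared.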

[4]:
from cuml.datasets import make_blobs
from cuml import DBSCAN as cumlDBSCAN
from sklearn.metrics import adjusted_rand_score
import numpy as np

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs( n_samples = n_samples,
                               n_features = n_features,
                               cluster_std = cluster_std,
                               random_state = 0,
                               dtype=np.float32 )

cuml_dbscan = cumlDBSCAN( eps = 3,
                          min_samples = 2)

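# fit_predict fits the model and returns the cluster labels in one call,
# so a separate fit() beforehand is not needed
cu_y_pred = cuml_dbscan.fit_predict( X_blobs )
trained_DBSCAN = cuml_dbscan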

cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score( y_blobs, cu_y_pred )
sk_adjusted_rand_index = adjusted_rand_score( asnumpy(y_blobs), asnumpy(cu_y_pred) )

print(" cuml's adjusted random index score : ", cu_adjusted_rand_index)
print(" sklearn's adjusted random index score : ", sk_adjusted_rand_index)

# save and optionally reload
dump( trained_DBSCAN, 'DBSCAN.model')

# to reload the model uncomment the line below
# loaded_model = load('DBSCAN.model')
 cuml's adjusted random index score :  1.0
 sklearn's adjusted random index score :  1.0
[4]:
['DBSCAN.model']

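Regression#

Linear Regression and R^2 Score#

Linear regression is a simple machine learning model in which the response y is modeled as a linear combination of the predictors in X.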
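The idea of modeling y as a linear combination of the columns of X can be sketched with an ordinary least-squares fit in NumPy. This is an illustration under stated assumptions: it uses `np.linalg.lstsq` rather than any cuML solver, the data is noise-free, and the coefficient values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# hypothetical ground truth: y = 2*x0 - 1*x1 + 0.5*x2 + 4
true_coef = np.array([2.0, -1.0, 0.5])
true_intercept = 4.0
y = X @ true_coef + true_intercept  # noise-free response

# append a column of ones so the intercept is learned as one more weight
X_design = np.hstack([X, np.ones((X.shape[0], 1))])
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)
coef, intercept = weights[:-1], weights[-1]

print(coef, intercept)
```

Because the response here is an exact linear function of the predictors, the fitted weights recover the generating coefficients up to floating-point error.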

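The R^2 score, also known as the coefficient of determination, is a metric for evaluating regression models. It measures the proportion of the total variation in the response that the model's predictions explain. For more information on the R^2 score metric, see: https://en.wikipedia.org/wiki/Coefficient_of_determination

For more information on the r2 score metric in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.metrics.regression.r2_score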
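As a minimal sketch of the coefficient of determination, the function below computes R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared prediction errors and SS_tot is the sum of squared deviations of the true values from their mean. The helper name `r2` and the sample values are hypothetical, and the sketch assumes the true values are not all identical.

```python
def r2(y_true, y_pred):
    """Coefficient of determination; assumes y_true is not constant."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r2(y_true, y_true)       # predictions match exactly
baseline = r2(y_true, [2.5] * 4)   # always predict the mean of y_true
print(perfect, baseline)
```

Perfect predictions leave no unexplained variation, while always predicting the mean explains none of it; these two cases mark the usual reference points of the score.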

The cell below uses a linear regression model to compare the R^2 scores computed by cuML and sklearn. For more information on the linear regression model in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#linear-regression

[5]:
from cuml.datasets import make_regression
from cuml.model_selection import train_test_split
from cuml.linear_model import LinearRegression as cuLR
from sklearn.metrics import r2_score

n_samples = 2**10
n_features = 100
n_info = 70

X_reg, y_reg = make_regression( n_samples = n_samples,
                                n_features = n_features,
                                n_informative = n_info,
                                random_state = 123 )

X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split( X_reg,
                                                                     y_reg,
                                                                     train_size = 0.8,
                                                                     random_state = 10 )
cuml_reg_model = cuLR( fit_intercept = True,
                       normalize = True,
                       algorithm = 'eig' )

trained_LR = cuml_reg_model.fit( X_reg_train, y_reg_train )
cu_preds = trained_LR.predict( X_reg_test )

cu_r2 = cuml.metrics.r2_score( y_reg_test, cu_preds )
sk_r2 = r2_score( asnumpy( y_reg_test ), asnumpy( cu_preds ) )

print("cuml's r2 score : ", cu_r2)
print("sklearn's r2 score : ", sk_r2)

# save and reload
dump( trained_LR, 'LR.model')

# to reload the model uncomment the line below
# loaded_model = load('LR.model')
cuml's r2 score :  1.0
sklearn's r2 score :  1.0
[5]:
['LR.model']