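Training and Evaluating Machine Learning Models
This notebook explores several basic machine learning estimators in cuML, showing how to train them and evaluate them with the built-in metric functions. All models are trained on synthetic data generated by cuML's dataset utilities.
Random Forest Classifier
UMAP
DBSCAN
Linear Regression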
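Random Forest Classification and the Accuracy Metric
A random forest classification model builds multiple decision trees and aggregates their individual outputs to make a prediction. For more information on the random forest classification model in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.ensemble.RandomForestClassifier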
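The aggregation step can be illustrated with a minimal sketch of majority voting. This is an assumption made for illustration: the helper `majority_vote` and the per-tree outputs are hypothetical, and real implementations (cuML's included) may combine class probabilities rather than hard votes.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Combine the class labels predicted by each tree for one sample."""
    counts = Counter(tree_predictions)
    # most_common(1) returns [(label, count)] for the most frequent label
    return counts.most_common(1)[0][0]


# hypothetical outputs of five trees (columns) for three samples (rows)
per_tree_outputs = [
    [1, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
]

forest_predictions = [majority_vote(votes) for votes in per_tree_outputs]
print(forest_predictions)
```

Using an odd number of trees, as here, avoids ties in a two-class problem.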
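The accuracy score is the ratio of correct predictions to the total number of predictions. It is used to measure the performance of classification models. For more information on the accuracy score, see: https://en.wikipedia.org/wiki/Accuracy_and_precision
For more information on the accuracy score metric in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.metrics.accuracy.accuracy_score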
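The definition above can be written out by hand. The following is a small pure-Python sketch, not the cuML or sklearn implementation; the function name `accuracy` and the label lists are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)
print(score)  # 4 of the 5 predictions are correct
```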
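The cell below shows an end-to-end pipeline for the random forest classification model. The dataset is generated with cuML's make_classification function and split into training and test sets. The model is trained on the training set and used to predict labels for the test set. The accuracy of those predictions is then computed with both the cuML and the sklearn accuracy metrics, and the two values are compared.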
[2]:
import cuml
from cupy import asnumpy
from joblib import dump, load

from cuml.datasets.classification import make_classification
from cuml.model_selection import train_test_split
from cuml.ensemble import RandomForestClassifier as cuRF
from sklearn.metrics import accuracy_score

# synthetic dataset dimensions
n_samples = 1000
n_features = 10
n_classes = 2

# random forest depth and size
n_estimators = 25
max_depth = 10

# generate synthetic data [ binary classification task ]
X, y = make_classification( n_classes = n_classes,
                            n_features = n_features,
                            n_samples = n_samples,
                            random_state = 0 )

X_train, X_test, y_train, y_test = train_test_split( X, y, random_state = 0 )

model = cuRF( max_depth = max_depth,
              n_estimators = n_estimators,
              random_state = 0 )

trained_RF = model.fit( X_train, y_train )
predictions = model.predict( X_test )

# compute the same metric on the GPU (cuML) and on the host (sklearn)
cu_score = cuml.metrics.accuracy_score( y_test, predictions )
sk_score = accuracy_score( asnumpy( y_test ), asnumpy( predictions ) )

print( " cuml accuracy: ", cu_score )
print( " sklearn accuracy : ", sk_score )

# save
dump( trained_RF, 'RF.model')

# reload the saved model
loaded_model = load('RF.model')
/opt/conda/envs/docs/lib/python3.12/site-packages/cuml/internals/api_decorators.py:317: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
return init_func(self, *args, **kwargs)
cuml accuracy: 0.996
sklearn accuracy : 0.996
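Clustering
UMAP and the Trustworthiness Metric
UMAP is a dimensionality reduction algorithm that performs non-linear dimension reduction. It can also be used for visualization. For more information on the UMAP model, see the documentation: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.UMAP
Trustworthiness measures the extent to which local structure is retained in the model's embedding. Samples that appear among a point's nearest neighbors in the embedding but are not among its nearest neighbors in the original space are penalized. For more information on the trustworthiness metric, see: https://scikit-learn.cn/dev/modules/generated/sklearn.manifold.t_sne.trustworthiness.html
The documentation for the trustworthiness metric implementation in cuML is at: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.metrics.trustworthiness.trustworthiness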
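To make the penalty concrete, here is a minimal pure-Python sketch of the rank-based trustworthiness formula T(k) = 1 - 2 / (n k (2n - 3k - 1)) * sum of max(0, r(i, j) - k), where j runs over the k nearest neighbors of i in the embedding and r(i, j) is the rank of j among the neighbors of i in the original space. It is a brute-force illustration for tiny inputs, not the cuML or sklearn implementation; the helper names and both toy embeddings are made up.

```python
import math


def _neighbor_order(points, i):
    """Indices of all other points, sorted by Euclidean distance from points[i]."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))


def trustworthiness_sketch(X, X_embedded, n_neighbors=2):
    n = len(X)
    k = n_neighbors
    if k >= n / 2:
        raise ValueError("n_neighbors must be less than n_samples / 2")
    penalty = 0
    for i in range(n):
        # rank of every other point in the original space (1 = nearest)
        rank = {j: r for r, j in enumerate(_neighbor_order(X, i), start=1)}
        # penalize embedded neighbors that are not among the k nearest originally
        for j in _neighbor_order(X_embedded, i)[:k]:
            penalty += max(0, rank[j] - k)
    return 1.0 - penalty * 2.0 / (n * k * (2 * n - 3 * k - 1))


# two well-separated groups of three points in 2-D
X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# a 1-D embedding that keeps the groups apart, and one that interleaves them
good_embedding = [(0.0,), (1.0,), (2.0,), (20.0,), (21.0,), (22.0,)]
bad_embedding = [(0.0,), (2.0,), (4.0,), (1.0,), (3.0,), (5.0,)]

good_score = trustworthiness_sketch(X, good_embedding)
bad_score = trustworthiness_sketch(X, bad_embedding)
print(good_score, bad_score)
```

The embedding that preserves every neighborhood receives no penalty, while the interleaved one scores lower because points from the other group intrude into each neighborhood.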
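The cell below shows an end-to-end pipeline for the UMAP model. A blobs dataset, created with cuML's equivalent of the make_blobs function, is used as input. The embedding produced by UMAP (fit followed by transform on the same data) is evaluated with the trustworthiness function, and the values returned by the sklearn and cuML implementations are compared.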
[3]:
import cuml
from cupy import asnumpy
from joblib import dump, load

from cuml.datasets import make_blobs
from cuml.manifold.umap import UMAP as cuUMAP
from sklearn.manifold import trustworthiness
import numpy as np

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs( n_samples = n_samples,
                               cluster_std = cluster_std,
                               n_features = n_features,
                               random_state = 0,
                               dtype=np.float32 )

# learn the embedding, then project the same data into it
trained_UMAP = cuUMAP( n_neighbors = 10 ).fit( X_blobs )
X_embedded = trained_UMAP.transform( X_blobs )

cu_score = cuml.metrics.trustworthiness( X_blobs, X_embedded )
sk_score = trustworthiness( asnumpy( X_blobs ), asnumpy( X_embedded ) )

print(" cuml's trustworthiness score : ", cu_score )
print(" sklearn's trustworthiness score : ", sk_score )

# save
dump( trained_UMAP, 'UMAP.model')

# to reload the model uncomment the line below
# loaded_model = load('UMAP.model')
[2025-04-14 07:40:34.732] [CUML] [info] Building knn graph using brute force
cuml's trustworthiness score : 0.850690120967742
sklearn's trustworthiness score : 0.850690120967742
[3]:
['UMAP.model']
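DBSCAN and the Adjusted Rand Index
DBSCAN is a popular and powerful clustering algorithm. For more information on the DBSCAN model, see the documentation: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.DBSCAN
We create the blobs dataset with cuML's equivalent of the make_blobs function.
The adjusted Rand index is a metric that measures the similarity between two clusterings of the same data, adjusted for the chance grouping of elements. For more information on the adjusted Rand index, see: https://en.wikipedia.org/wiki/Rand_index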
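As an illustration of the chance adjustment, the index can be computed from the contingency table of the two labelings using pair counts. This is a pure-Python sketch rather than the cuML or sklearn code; the function name and label lists are hypothetical, and returning 1.0 in the degenerate cases is a choice made for this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must have the same length")
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treated as trivially identical

    # pairs that are grouped together in both labelings (contingency cells)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # pairs that are grouped together in each labeling on its own
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # e.g. both labelings put every sample in one cluster
    return (sum_cells - expected) / (max_index - expected)


labels_true = [0, 0, 1, 1, 2, 2]
same_up_to_renaming = [1, 1, 0, 0, 2, 2]
unrelated = [0, 1, 0, 1, 0, 1]

perfect = adjusted_rand_index(labels_true, same_up_to_renaming)
poor = adjusted_rand_index(labels_true, unrelated)
print(perfect, poor)
```

The index depends only on which samples are grouped together, so renaming the cluster labels leaves a perfect score unchanged.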
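The cell below shows an end-to-end pipeline for DBSCAN. The cluster labels returned by DBSCAN's fit_predict are evaluated with the adjusted Rand index, and the values returned by the sklearn and cuML implementations of the metric are compared.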
[4]:
import cuml
from cupy import asnumpy
from joblib import dump, load

from cuml.datasets import make_blobs
from cuml import DBSCAN as cumlDBSCAN
from sklearn.metrics import adjusted_rand_score
import numpy as np

n_samples = 1000
n_features = 100
cluster_std = 0.1

X_blobs, y_blobs = make_blobs( n_samples = n_samples,
                               n_features = n_features,
                               cluster_std = cluster_std,
                               random_state = 0,
                               dtype=np.float32 )

cuml_dbscan = cumlDBSCAN( eps = 3,
                          min_samples = 2 )

# fit_predict fits the model and returns the cluster labels in a single call
cu_y_pred = cuml_dbscan.fit_predict( X_blobs )

cu_adjusted_rand_index = cuml.metrics.cluster.adjusted_rand_score( y_blobs, cu_y_pred )
sk_adjusted_rand_index = adjusted_rand_score( asnumpy(y_blobs), asnumpy(cu_y_pred) )

print(" cuml's adjusted random index score : ", cu_adjusted_rand_index)
print(" sklearn's adjusted random index score : ", sk_adjusted_rand_index)

# save and optionally reload
dump( cuml_dbscan, 'DBSCAN.model')

# to reload the model uncomment the line below
# loaded_model = load('DBSCAN.model')
cuml's adjusted random index score : 1.0
sklearn's adjusted random index score : 1.0
[4]:
['DBSCAN.model']
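Regression
Linear Regression and the R^2 Score
Linear regression is a simple machine learning model in which the response y is modeled as a linear combination of the predictors in X.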
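For a single predictor, the least-squares fit has a closed form, which the following sketch works through. It is a one-feature illustration only; the function name and data are made up, and it is not how cuML solves the multi-feature problem.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x must contain at least two distinct values")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# noiseless data generated from y = 2x + 1
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_simple_linear_regression(x, y)
print(slope, intercept)
```

With several predictors the same idea generalizes to solving a linear system for one coefficient per feature plus the intercept.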
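The R^2 score is also known as the coefficient of determination. It is used as a metric for evaluating regression models. It measures the proportion of the total variation in the response that is explained by the model. For more information on the R^2 score metric, see: https://en.wikipedia.org/wiki/Coefficient_of_determination
For more information on the r2 score metric in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#cuml.metrics.regression.r2_score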
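The score can be computed directly from its definition, R^2 = 1 - SS_res / SS_tot. The sketch below is a pure-Python illustration with made-up values, not the cuML or sklearn implementation; the name `r2_sketch` is hypothetical.

```python
def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]

perfect_r2 = r2_sketch(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
baseline_r2 = r2_sketch(y_true, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect_r2, baseline_r2)
```

A model that always predicts the mean of the response explains none of the variation, which is why it serves as the zero point of the scale.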
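The cell below uses a linear regression model to compare the results of the cuML and sklearn R^2 score metrics. For more information on the linear regression model in cuML, see: https://docs.rapids.org.cn/api/cuml/stable/api.html#linear-regression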
[5]:
import cuml
from cupy import asnumpy
from joblib import dump, load

from cuml.datasets import make_regression
from cuml.model_selection import train_test_split
from cuml.linear_model import LinearRegression as cuLR
from sklearn.metrics import r2_score

n_samples = 2**10
n_features = 100
n_info = 70

X_reg, y_reg = make_regression( n_samples = n_samples,
                                n_features = n_features,
                                n_informative = n_info,
                                random_state = 123 )

# hold out 20% of the samples for evaluation
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split( X_reg,
                                                                     y_reg,
                                                                     train_size = 0.8,
                                                                     random_state = 10 )

cuml_reg_model = cuLR( fit_intercept = True,
                       normalize = True,
                       algorithm = 'eig' )

trained_LR = cuml_reg_model.fit( X_reg_train, y_reg_train )
cu_preds = trained_LR.predict( X_reg_test )

cu_r2 = cuml.metrics.r2_score( y_reg_test, cu_preds )
sk_r2 = r2_score( asnumpy( y_reg_test ), asnumpy( cu_preds ) )

print("cuml's r2 score : ", cu_r2)
print("sklearn's r2 score : ", sk_r2)

# save and optionally reload
dump( trained_LR, 'LR.model')

# to reload the model uncomment the line below
# loaded_model = load('LR.model')
cuml's r2 score : 1.0
sklearn's r2 score : 1.0
[5]:
['LR.model']