cuml.accel: Zero Code Change Acceleration with NVIDIA GPUs#

cuml.accel is a new tool in cuML that lets you accelerate Scikit-Learn estimators on NVIDIA GPUs without changing your scripts or notebooks. The following notebook is taken directly from the Scikit-Learn example gallery to demonstrate how unmodified Scikit-Learn code can be accelerated with cuml.accel.
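Beyond the notebook magic used below, the cuML documentation also describes ways to enable cuml.accel for plain Python scripts. The following is a minimal sketch based on that documentation; treat the exact invocations as assumptions and check the docs for your installed cuML version:

# Option 1 (notebooks): the magic used in this example
#   %load_ext cuml.accel

# Option 2 (scripts): a module runner, per the cuML docs
#   python -m cuml.accel my_sklearn_script.py

# Option 3 (programmatic): install the accelerator before importing sklearn
import cuml.accel
cuml.accel.install()

from sklearn.cluster import KMeans  # subsequently GPU-accelerated where supported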

As always, please remember to cite Scikit-Learn in any publication that builds on the excellent work of the Scikit-Learn community.

[1]:
# The following magic is the only change required to enable GPU acceleration with cuml.accel
%load_ext cuml.accel
# If you wish to see results WITHOUT cuml.accel, be sure to comment out the above AND restart the notebook kernel
cuML: Installed accelerator for sklearn.
cuML: Successfully initialized accelerator.

A demo of K-Means clustering on the handwritten digits data#

In this example we compare the various initialization strategies for K-means in terms of runtime and quality of the results.

As the ground truth is known here, we also apply different cluster quality metrics to judge the goodness of fit of the cluster labels to the ground truth.

Cluster quality metrics evaluated (see clustering_evaluation for definitions and discussions of the metrics):

===========  ========================================================
Shorthand    Full name
===========  ========================================================
homo         homogeneity score
compl        completeness score
v-meas       V measure
ARI          adjusted Rand index
AMI          adjusted mutual information
silhouette   silhouette coefficient
===========  ========================================================
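As a quick illustration of how the supervised metrics above behave, here is a minimal sketch on toy label arrays (not part of the original example). The silhouette coefficient is different: it needs the feature matrix rather than ground-truth labels, which is why the benchmark below passes it the data:

import numpy as np
from sklearn import metrics

labels_true = np.array([0, 0, 1, 1, 2, 2])
labels_pred = np.array([0, 0, 1, 2, 2, 2])  # one sample lands in the wrong cluster

print(metrics.homogeneity_score(labels_true, labels_pred))    # penalizes clusters mixing classes
print(metrics.completeness_score(labels_true, labels_pred))   # penalizes classes split across clusters
print(metrics.v_measure_score(labels_true, labels_pred))      # harmonic mean of the two
print(metrics.adjusted_rand_score(labels_true, labels_pred))  # chance-adjusted pair agreement
print(metrics.adjusted_mutual_info_score(labels_true, labels_pred))  # chance-adjusted mutual information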

Load the dataset#

We will start by loading the digits dataset. This dataset contains handwritten digits from 0 to 9. In the context of clustering, one would like to group the images such that the handwritten digits on each image are the same.

[2]:
import numpy as np

from sklearn.datasets import load_digits

data, labels = load_digits(return_X_y=True)
(n_samples, n_features), n_digits = data.shape, np.unique(labels).size

print(f"# digits: {n_digits}; # samples: {n_samples}; # features {n_features}")
# digits: 10; # samples: 1797; # features 64
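To get a feel for the samples, one can render a single row back into an image. This is a small sketch not in the original notebook; each row of data is a flattened 8x8 grayscale image:

import matplotlib.pyplot as plt

# each of the 64 features is one pixel of an 8x8 grayscale image
plt.imshow(data[0].reshape(8, 8), cmap="gray_r")
plt.title(f"true label: {labels[0]}")
plt.axis("off")
plt.show()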

Define our evaluation benchmark#

We will first define our evaluation benchmark. During this benchmark, we intend to compare the different initialization methods for KMeans. Our benchmark will:

  • create a pipeline which will scale the data using a StandardScaler;

  • train and time the pipeline fitting;

  • measure the performance of the clustering obtained via different metrics.

[3]:
from time import time

from sklearn import metrics
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def bench_k_means(kmeans, name, data, labels):
    """Benchmark to evaluate the KMeans initialization methods.

    Parameters
    ----------
    kmeans : KMeans instance
        A :class:`~sklearn.cluster.KMeans` instance with the initialization
        already set.
    name : str
        Name given to the strategy. It will be used to show the results in a
        table.
    data : ndarray of shape (n_samples, n_features)
        The data to cluster.
    labels : ndarray of shape (n_samples,)
        The labels used to compute the clustering metrics which requires some
        supervision.
    """
    t0 = time()
    estimator = make_pipeline(StandardScaler(), kmeans).fit(data)
    fit_time = time() - t0
    results = [name, fit_time, estimator[-1].inertia_]

    # Define the metrics which require only the true labels and estimator
    # labels
    clustering_metrics = [
        metrics.homogeneity_score,
        metrics.completeness_score,
        metrics.v_measure_score,
        metrics.adjusted_rand_score,
        metrics.adjusted_mutual_info_score,
    ]
    results += [m(labels, estimator[-1].labels_) for m in clustering_metrics]

    # The silhouette score requires the full dataset
    results += [
        metrics.silhouette_score(
            data,
            estimator[-1].labels_,
            metric="euclidean",
            sample_size=300,
        )
    ]

    # Show the results
    formatter_result = (
        "{:9s}\t{:.3f}s\t{:.0f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}\t{:.3f}"
    )
    print(formatter_result.format(*results))

Run the benchmark#

We will compare three approaches:

  • an initialization using k-means++. This method is stochastic and we will run the initialization 4 times;

  • a random initialization. This method is stochastic as well and we will run the initialization 4 times;

  • an initialization based on a PCA projection. Indeed, we will use the components of the PCA to initialize KMeans. This method is deterministic and a single initialization suffices.

[4]:
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

print(82 * "_")
print("init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette")

kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=4, random_state=0)
bench_k_means(kmeans=kmeans, name="k-means++", data=data, labels=labels)

kmeans = KMeans(init="random", n_clusters=n_digits, n_init=4, random_state=0)
bench_k_means(kmeans=kmeans, name="random", data=data, labels=labels)

pca = PCA(n_components=n_digits).fit(data)
kmeans = KMeans(init=pca.components_, n_clusters=n_digits, n_init=1)
bench_k_means(kmeans=kmeans, name="PCA-based", data=data, labels=labels)

print(82 * "_")
__________________________________________________________________________________
init            time    inertia homo    compl   v-meas  ARI     AMI     silhouette
k-means++       0.343s  73684   0.374   0.528   0.438   0.280   0.432   0.040
random          0.034s  70558   0.590   0.665   0.625   0.475   0.621   0.121
PCA-based       0.010s  69519   0.610   0.658   0.633   0.480   0.630   0.159
__________________________________________________________________________________

cuml.accel Results#

If you compare the timings in the table above with those obtained without %load_ext cuml.accel, you should notice a significant speedup for the random and PCA-based initializations, but a much smaller one for k-means++. cuml.accel accelerates on the GPU wherever it can, falling back to the CPU for any missing functionality, so runtimes with cuml.accel should be comparable to or significantly better than regular CPU execution.

You will also notice that the silhouette score of the results should be comparable to or better than the one obtained without cuml.accel. Even though the outputs with and without cuml.accel may not match numerically, cuml.accel should deliver equivalent results: the quality of the output should be comparable or better.
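A concrete reason exact outputs can differ while quality metrics agree is that cluster ids are arbitrary: the same partition may come back with permuted labels from one run (or device) to the next. Metrics such as the adjusted Rand index are invariant to such permutations, as the following minimal sketch on toy arrays (not from the original notebook) shows:

import numpy as np
from sklearn.metrics import adjusted_rand_score

labels_true = np.array([0, 0, 1, 1, 2, 2])
run_a = np.array([0, 0, 1, 1, 2, 2])  # one run's labeling
run_b = np.array([2, 2, 0, 0, 1, 1])  # same partition, cluster ids permuted

print(adjusted_rand_score(labels_true, run_a))  # 1.0
print(adjusted_rand_score(labels_true, run_b))  # 1.0 -- identical partition, different ids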

Visualize the results on PCA-reduced data#

PCA allows projecting the data from the original 64-dimensional space into a lower-dimensional space. Subsequently, we can use PCA to project into a 2-dimensional space and plot the data and the clusters in this new space.

[5]:
import matplotlib.pyplot as plt

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init="k-means++", n_clusters=n_digits, n_init=4)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = 0.02  # point in the mesh [x_min, x_max]x[y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(
    Z,
    interpolation="nearest",
    extent=(xx.min(), xx.max(), yy.min(), yy.max()),
    cmap=plt.cm.Paired,
    aspect="auto",
    origin="lower",
)

plt.plot(reduced_data[:, 0], reduced_data[:, 1], "k.", markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(
    centroids[:, 0],
    centroids[:, 1],
    marker="x",
    s=169,
    linewidths=3,
    color="w",
    zorder=10,
)
plt.title(
    "K-means clustering on the digits dataset (PCA-reduced data)\n"
    "Centroids are marked with white cross"
)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()
[Image zero_code_change_examples_plot_kmeans_digits_11_0.png: K-means clustering on the digits dataset (PCA-reduced data); centroids marked with a white cross]