Google Kubernetes Engine

RAPIDS can be deployed on Google Cloud via Google Kubernetes Engine (GKE).

To run RAPIDS, you need a Kubernetes cluster with GPUs available.

Prerequisites

First, install the gcloud CLI, along with tools for managing Kubernetes such as kubectl and helm.

Make sure you are logged in to the gcloud CLI.

$ gcloud init
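
Before continuing, it can help to confirm that all three tools are on your PATH. The following is a minimal sketch; the helper name missing_tools is hypothetical and is not part of gcloud, kubectl, or helm.

```shell
# Hypothetical helper: print the name of each listed command that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Prints nothing when every tool is installed.
missing_tools gcloud kubectl helm
```

Any name printed by the last line is a tool that still needs to be installed.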

Create the Kubernetes cluster

Now we can launch a GPU-enabled GKE cluster.

gcloud container clusters create rapids-gpu-kubeflow \
  --accelerator type=nvidia-tesla-a100,count=2 --machine-type a2-highgpu-2g \
  --zone us-central1-c --release-channel stable

This command launches a GKE cluster named rapids-gpu-kubeflow. It uses nodes of type a2-highgpu-2g, each of which has two A100 GPUs.

Get the cluster credentials

gcloud container clusters get-credentials rapids-gpu-kubeflow \
    --zone us-central1-c

This command updates your kubeconfig with the credentials and endpoint information for the rapids-gpu-kubeflow cluster.
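
To confirm that kubectl now points at the new cluster, you could compare the active context against the cluster name. This is a sketch only: the helper name context_matches is hypothetical, and it assumes the context name written by gcloud contains the cluster name (GKE contexts are typically named gke_PROJECT_ZONE_CLUSTER).

```shell
# Hypothetical helper: succeed if the active kubectl context mentions the given cluster name.
context_matches() {
  case "$(kubectl config current-context)" in
    *"$1"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, context_matches rapids-gpu-kubeflow && kubectl get nodes lists the nodes only when the expected cluster is selected.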

Install the drivers

Next, install the NVIDIA drivers on each node.

$ kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded-latest.yaml
daemonset.apps/nvidia-driver-installer created

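Verify that the NVIDIA drivers were installed successfully. Because of the --watch flag, the command keeps running until you interrupt it with Ctrl+C.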

$ kubectl get po -A --watch | grep nvidia
kube-system   nvidia-driver-installer-6zwcn                                 1/1     Running   0         8m47s
kube-system   nvidia-driver-installer-8zmmn                                 1/1     Running   0         8m47s
kube-system   nvidia-driver-installer-mjkb8                                 1/1     Running   0         8m47s
kube-system   nvidia-gpu-device-plugin-5ffkm                                1/1     Running   0         13m
kube-system   nvidia-gpu-device-plugin-d599s                                1/1     Running   0         13m
kube-system   nvidia-gpu-device-plugin-jrgjh                                1/1     Running   0         13m
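
Another way to check the result is to ask each node how many GPUs it advertises to the scheduler. This is a sketch under the assumption that the device plugin exposes GPUs as the nvidia.com/gpu resource, which is the resource name used in the sample Pod in this guide. The helper name gpu_capacity is hypothetical.

```shell
# Hypothetical helper: list each node with the number of allocatable NVIDIA GPUs.
# The dot in "nvidia.com" is escaped so it is read as part of the resource name.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

With the cluster created above, each a2-highgpu-2g node would be expected to show 2 in the GPUS column once the drivers and device plugin are ready.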

With the drivers installed, you are ready to test the cluster.

Let's create a sample Pod that uses GPU compute to make sure everything works as expected.

cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvidia/samples:vectoradd-cuda11.6.0-ubuntu18.04"
    resources:
       limits:
         nvidia.com/gpu: 1
EOF

$ kubectl logs pod/cuda-vectoradd
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

If you see Test PASSED in the output, you can be confident that your Kubernetes cluster is set up correctly for GPU compute.
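
The Pod needs some time to pull its image and run, so reading the logs immediately may show nothing. A minimal sketch of a scripted check follows. It assumes kubectl 1.23 or newer, which supports --for=jsonpath in kubectl wait, and the helper name check_gpu_test is hypothetical.

```shell
# Hypothetical helper: wait for the Pod to finish, then look for "Test PASSED" in its logs.
# Argument: $1 = Pod name. Returns non-zero on timeout or if the marker is missing.
check_gpu_test() {
  kubectl wait --for=jsonpath='{.status.phase}'=Succeeded "pod/$1" --timeout=300s &&
    kubectl logs "pod/$1" | grep -q "Test PASSED"
}
```

For example, check_gpu_test cuda-vectoradd && echo "GPU test passed" prints the message only when the sample completed successfully.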

Next, clean up the Pod.

$ kubectl delete pod cuda-vectoradd
pod "cuda-vectoradd" deleted

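Install RAPIDS

Now that you have a GPU-enabled Kubernetes cluster on GKE, you can install RAPIDS using any of the supported methods.

Clean up

You can also delete the GKE cluster with the following command to stop billing.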

$ gcloud container clusters delete rapids-gpu-kubeflow --zone us-central1-c
Deleting cluster rapids...⠼
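
Once the deletion finishes, you may want to confirm that nothing is left running. The sketch below lists clusters in the current project that match a given name. The helper name remaining_clusters is hypothetical; it only wraps gcloud container clusters list.

```shell
# Hypothetical helper: print the names of clusters in the current project matching $1.
# Empty output means no cluster with that name remains.
remaining_clusters() {
  gcloud container clusters list --filter="name=$1" --format="value(name)"
}
```

Running remaining_clusters rapids-gpu-kubeflow should print nothing after the cluster has been removed.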