Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker#

Import packages and create Amazon SageMaker and Boto3 sessions#

import time

import boto3
import sagemaker
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")
account, region
('561241433344', 'us-east-2')

Upload the higgs-boson dataset to an S3 bucket#

!mkdir -p ./dataset
!if [ ! -f "dataset/HIGGS.csv" ]; then wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; fi
!if [ ! -f "dataset/HIGGS.csv" ]; then gunzip dataset/HIGGS.csv.gz; fi
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")
s3_data_dir
's3://sagemaker-us-east-2-561241433344/dataset/higgs-dataset'

Pull the latest RAPIDS container from DockerHub#

To build an Amazon SageMaker-compatible RAPIDS Docker container, you start from the base RAPIDS container, which NVIDIA has already built and pushed to DockerHub.

You extend this container to make RAPIDS compatible with SageMaker by creating a Dockerfile, copying your training script into the image, and installing the SageMaker Training Toolkit.

estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:25.04-cuda12.8-py3.12",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}
%%time
!docker pull {estimator_info['rapids_container']}
!cat Dockerfile
ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

# Installs a few more dependencies
RUN conda install --yes -n base \
        cupy \
        flask \
        protobuf \
        'sagemaker-python-sdk>=2.239.0'

# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py

# Defines rapids-higgs.py as script entry point
# ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
ENV SAGEMAKER_PROGRAM rapids-higgs.py

# override entrypoint from the base image with one that accepts
# 'train' and 'serve' (as SageMaker expects to provide)
COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]
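The Dockerfile copies an entrypoint.sh whose contents are not shown in this notebook. A minimal sketch of what such a wrapper might look like, written to a temporary path here purely so it can be exercised locally; the branch names follow the SageMaker convention of starting containers with `train` or `serve`, and the exact commands are assumptions:

```shell
# Hypothetical sketch of entrypoint.sh; the real file is not shown in this
# notebook. SageMaker starts the container with the argument "train" (or
# "serve" for inference); anything else is executed verbatim.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/bash
set -e

if [ "$1" = "train" ]; then
    # run the training script named by SAGEMAKER_PROGRAM
    exec python /opt/ml/code/rapids-higgs.py
elif [ "$1" = "serve" ]; then
    # this example container only supports training
    echo "serving is not implemented in this container" >&2
    exit 1
else
    exec "$@"
fi
EOF
chmod +x /tmp/entrypoint.sh
```

Overriding the entrypoint this way keeps the base RAPIDS image unchanged while still responding to the `train`/`serve` invocations SageMaker issues.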
!docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} .
Sending build context to Docker daemon   7.68kB
Step 1/7 : ARG RAPIDS_IMAGE
Step 2/7 : FROM $RAPIDS_IMAGE as rapids
 ---> a80bdce0d796
Step 3/7 : RUN conda install --yes -n base         cupy         flask         protobuf         sagemaker
 ---> Running in f6522ce9b303
Channels:
 - rapidsai-nightly
 - dask/label/dev
 - pytorch
 - conda-forge
 - nvidia
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - cupy
    - flask
    - protobuf
    - sagemaker


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blinker-1.8.2              |     pyhd8ed1ab_0          14 KB  conda-forge
    boto3-1.34.118             |     pyhd8ed1ab_0          78 KB  conda-forge
    botocore-1.34.118          |pyge310_1234567_0         6.8 MB  conda-forge
    dill-0.3.8                 |     pyhd8ed1ab_0          86 KB  conda-forge
    flask-3.0.3                |     pyhd8ed1ab_0          79 KB  conda-forge
    google-pasta-0.2.0         |     pyh8c360ce_0          42 KB  conda-forge
    itsdangerous-2.2.0         |     pyhd8ed1ab_0          19 KB  conda-forge
    jmespath-1.0.1             |     pyhd8ed1ab_0          21 KB  conda-forge
    multiprocess-0.70.16       |  py310h2372a71_0         238 KB  conda-forge
    openssl-3.3.1              |       h4ab18f5_0         2.8 MB  conda-forge
    pathos-0.3.2               |     pyhd8ed1ab_1          52 KB  conda-forge
    pox-0.3.4                  |     pyhd8ed1ab_0          26 KB  conda-forge
    ppft-1.7.6.8               |     pyhd8ed1ab_0          33 KB  conda-forge
    protobuf-4.25.3            |  py310ha8c1f0e_0         325 KB  conda-forge
    protobuf3-to-dict-0.1.5    |  py310hff52083_8          14 KB  conda-forge
    s3transfer-0.10.1          |     pyhd8ed1ab_0          61 KB  conda-forge
    sagemaker-2.75.1           |     pyhd8ed1ab_0         377 KB  conda-forge
    smdebug-rulesconfig-1.0.1  |     pyhd3deb0d_1          20 KB  conda-forge
    werkzeug-3.0.3             |     pyhd8ed1ab_0         237 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        11.2 MB

The following NEW packages will be INSTALLED:

  blinker            conda-forge/noarch::blinker-1.8.2-pyhd8ed1ab_0 
  boto3              conda-forge/noarch::boto3-1.34.118-pyhd8ed1ab_0 
  botocore           conda-forge/noarch::botocore-1.34.118-pyge310_1234567_0 
  dill               conda-forge/noarch::dill-0.3.8-pyhd8ed1ab_0 
  flask              conda-forge/noarch::flask-3.0.3-pyhd8ed1ab_0 
  google-pasta       conda-forge/noarch::google-pasta-0.2.0-pyh8c360ce_0 
  itsdangerous       conda-forge/noarch::itsdangerous-2.2.0-pyhd8ed1ab_0 
  jmespath           conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0 
  multiprocess       conda-forge/linux-64::multiprocess-0.70.16-py310h2372a71_0 
  pathos             conda-forge/noarch::pathos-0.3.2-pyhd8ed1ab_1 
  pox                conda-forge/noarch::pox-0.3.4-pyhd8ed1ab_0 
  ppft               conda-forge/noarch::ppft-1.7.6.8-pyhd8ed1ab_0 
  protobuf           conda-forge/linux-64::protobuf-4.25.3-py310ha8c1f0e_0 
  protobuf3-to-dict  conda-forge/linux-64::protobuf3-to-dict-0.1.5-py310hff52083_8 
  s3transfer         conda-forge/noarch::s3transfer-0.10.1-pyhd8ed1ab_0 
  sagemaker          conda-forge/noarch::sagemaker-2.75.1-pyhd8ed1ab_0 
  smdebug-rulesconf~ conda-forge/noarch::smdebug-rulesconfig-1.0.1-pyhd3deb0d_1 
  werkzeug           conda-forge/noarch::werkzeug-3.0.3-pyhd8ed1ab_0 

The following packages will be UPDATED:

  openssl                                  3.3.0-h4ab18f5_3 --> 3.3.1-h4ab18f5_0 



Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Removing intermediate container f6522ce9b303
 ---> 883c682b36bc
Step 4/7 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
 ---> 2f6b3e0bec44
Step 5/7 : ENV SAGEMAKER_PROGRAM rapids-higgs.py
 ---> Running in df524941c02e
Removing intermediate container df524941c02e
 ---> 4cf437176c8c
Step 6/7 : COPY entrypoint.sh /opt/entrypoint.sh
 ---> 32d95ff5bd74
Step 7/7 : ENTRYPOINT ["/opt/entrypoint.sh"]
 ---> Running in c396fa9e98ad
Removing intermediate container c396fa9e98ad
 ---> 39f900bfeba0
Successfully built 39f900bfeba0
Successfully tagged sagemaker-rapids-higgs:latest
!docker images

Publish to Elastic Container Registry#

When you run training jobs at scale, whether distributed training or independent experiments, you need the dataset and training code replicated on every instance in the cluster. Thankfully, the trickier part, moving the dataset, is handled for you by Amazon SageMaker. As for the training code, you have already prepared a Docker container; you just need to push it to a container registry, and Amazon SageMaker will pull it onto every training compute instance in the cluster.

Note: SageMaker does not support training images hosted in private Docker registries such as DockerHub, so you need to push the SageMaker-compatible RAPIDS container to Amazon Elastic Container Registry (Amazon ECR), where Amazon SageMaker can pull it.

ECR_container_fullname = (
    f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)
ECR_container_fullname
'561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest'
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}
print(
    f"source      : {estimator_info['ecr_image']}\n"
    f"destination : {ECR_container_fullname}"
)
source      : sagemaker-rapids-higgs:latest
destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com
!docker push {ECR_container_fullname}
The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs]

latest: digest: sha256:c8172a0ad30cd39b091f5fc3f3cde922ceabb103d0a0ec90beb1a5c4c9c6c97c size: 4504

Test the Amazon SageMaker-compatible RAPIDS container locally#

Before you spend time and money running a large experiment on a big cluster, you should run a local Amazon SageMaker training job to confirm the container works as expected. Make sure the SageMaker SDK is installed on your local machine.

Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the cuML documentation page.

hyperparams = {
    "n_estimators": 15,
    "max_depth": 5,
    "n_bins": 8,
    "split_criterion": 0,  # GINI:0, ENTROPY:1
    "bootstrap": 0,  # true: sample with replacement, false: sample without replacement
    "max_leaves": -1,  # unlimited leaves
    "max_features": 0.2,
}
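SageMaker forwards each entry of this dict to the training script as a command-line flag (`--n_estimators 15`, and so on). A minimal sketch of how rapids-higgs.py might parse them; the script itself is not shown in this notebook, so the parsing details here are assumptions:

```python
# Hypothetical sketch: parse the hyperparameters that SageMaker forwards to
# the training script as command-line flags. Inside rapids-higgs.py the
# parsed values would then be handed to cuml.ensemble.RandomForestClassifier.
import argparse


def parse_hyperparams(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_bins", type=int, default=8)
    parser.add_argument("--split_criterion", type=int, default=0)
    parser.add_argument("--bootstrap", type=int, default=0)
    parser.add_argument("--max_leaves", type=int, default=-1)
    parser.add_argument("--max_features", type=float, default=0.2)
    return parser.parse_args(argv)


# simulate the flags SageMaker would pass for one trial
args = parse_hyperparams(["--n_estimators", "100", "--max_features", "0.35"])
print(args.n_estimators, args.max_features)
```

Unspecified flags fall back to the defaults above, so the same script works both locally and under the tuner.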

Now, specify the instance type as local_gpu. This assumes you have a local GPU. If you don't, you can test on an Amazon SageMaker managed GPU instance instead: just replace local_gpu with a p3 or p2 GPU instance by updating the instance_type variable.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  #'local_gpu'
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
%%time
rapids_estimator.fit(inputs=s3_data_dir)
INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371
2024-06-05 02:14:30 Starting - Starting the training job...
2024-06-05 02:14:54 Starting - Preparing the instances for training...
2024-06-05 02:15:26 Downloading - Downloading input data..................
2024-06-05 02:18:16 Downloading - Downloading the training image...
2024-06-05 02:18:47 Training - Training image download completed. Training in progress...@ entrypoint -> launching training script 

2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029

2024-06-05 02:19:35 Completed - Training job completed
Training seconds: 249
Billable seconds: 78
Managed Spot Training savings: 68.7%
CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms
Wall time: 5min 43s
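The test_acc value in the log above is captured through the metric_definitions Regex passed to the Estimator: SageMaker applies that pattern to the job's log stream. The same extraction can be reproduced locally:

```python
# Reproduce the metric extraction SageMaker performs with the
# metric_definitions Regex ("test_acc: ([0-9\.]+)") on the training logs.
import re

log_line = "2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029"
match = re.search(r"test_acc: ([0-9\.]+)", log_line)
print(match.group(1))  # -> 0.7133834362030029
```

This is why the training script must print the metric in exactly this format; otherwise the tuner has no objective value to optimize.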

Congratulations, you have successfully trained a Random Forest model on the HIGGS dataset using the SageMaker-compatible RAPIDS container. You are now ready to run experiments in parallel on a cluster, trying different hyperparameters and options.

Define hyperparameter ranges and run a large-scale search experiment#

Going from local training to training at scale requires few code changes. First, instead of defining a fixed set of hyperparameters, you define ranges using the SageMaker SDK.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(10, 200),
    "max_depth": IntegerParameter(1, 22),
    "n_bins": IntegerParameter(5, 24),
    "split_criterion": CategoricalParameter([0, 1]),
    "bootstrap": CategoricalParameter([True, False]),
    "max_features": ContinuousParameter(0.01, 0.5),
}

Next, change the instance type to the actual GPU instance you want to train on in the cloud. Here you pick an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs, ml.p3.8xlarge. If your training script can take advantage of multiple GPUs, you can select up to 8 GPUs per instance for faster training.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)

Now you can define a HyperparameterTuner object using the estimator defined above.

tuner = HyperparameterTuner(
    rapids_estimator,
    objective_metric_name="test_acc",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type="Maximize",
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
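Once the tuning job finishes, tuner.analytics().dataframe() returns one row per training job, including its final objective value; sorting that frame surfaces the best trial. A sketch using a placeholder frame so the logic runs without AWS credentials; the column names follow the SageMaker analytics schema, but the job names and values are illustrative only, not real results:

```python
# Sketch of post-tuning analysis. In a live session you would obtain the
# frame with:
#     df = tuner.analytics().dataframe()
# A placeholder frame stands in for that call here; values are illustrative.
import pandas as pd

df = pd.DataFrame(
    {
        "TrainingJobName": ["rapidsHPO-001", "rapidsHPO-002"],
        "FinalObjectiveValue": [0.71, 0.74],
    }
)
best = df.sort_values("FinalObjectiveValue", ascending=False).iloc[0]
print(best["TrainingJobName"], best["FinalObjectiveValue"])
```

With objective_type="Maximize", the top row after this sort corresponds to the model SageMaker would report as the best training job.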

Clean up#

  • Delete S3 buckets and files you no longer need

  • Stop any training jobs that no longer need to run

  • Delete the container image and repository you just created

!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}