Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker#

Import packages and create Amazon SageMaker and Boto3 sessions#

import time

import boto3
import sagemaker
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()

region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")
account, region
('561241433344', 'us-east-2')

Upload the higgs-boson dataset to an S3 bucket#

!mkdir -p ./dataset
!if [ ! -f "dataset/HIGGS.csv" ]; then wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; fi
!if [ ! -f "dataset/HIGGS.csv" ]; then gunzip dataset/HIGGS.csv.gz; fi
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")
s3_data_dir
's3://sagemaker-us-east-2-561241433344/dataset/higgs-dataset'

Pull the latest RAPIDS container from DockerHub#

To build an Amazon SageMaker-compatible RAPIDS Docker container, you start from the base RAPIDS container, which NVIDIA has already built and pushed to DockerHub.

You extend this container to make RAPIDS compatible with SageMaker by creating a Dockerfile, copying your training script into the image, and installing the SageMaker Training Toolkit.

estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:25.04-cuda12.8-py3.12",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}
%%time
!docker pull {estimator_info['rapids_container']}
!cat Dockerfile
ARG RAPIDS_IMAGE

FROM $RAPIDS_IMAGE as rapids

# Installs a few more dependencies
RUN conda install --yes -n base \
        cupy \
        flask \
        protobuf \
        'sagemaker-python-sdk>=2.239.0'

# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py

# Defines rapids-higgs.py as script entry point
# ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
ENV SAGEMAKER_PROGRAM rapids-higgs.py

# override entrypoint from the base image with one that accepts
# 'train' and 'serve' (as SageMaker expects to provide)
COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]
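The Dockerfile copies an entrypoint.sh whose contents are not shown in this notebook. A minimal sketch of what such a wrapper might look like, written to a temporary path here purely so it can be exercised locally; the branch names follow the SageMaker convention of starting containers with `train` or `serve`, and the exact commands are assumptions:

```shell
# Hypothetical sketch of entrypoint.sh; the real file is not shown in this
# notebook. SageMaker starts the container with the argument "train" (or
# "serve" for inference); anything else is executed verbatim.
cat > /tmp/entrypoint.sh <<'EOF'
#!/bin/bash
set -e

if [ "$1" = "train" ]; then
    # run the training script named by SAGEMAKER_PROGRAM
    exec python /opt/ml/code/rapids-higgs.py
elif [ "$1" = "serve" ]; then
    # this example container only supports training
    echo "serving is not implemented in this container" >&2
    exit 1
else
    exec "$@"
fi
EOF
chmod +x /tmp/entrypoint.sh
```

Overriding the entrypoint this way keeps the base RAPIDS image unchanged while still responding to the `train`/`serve` invocations SageMaker issues.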
!docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} .
Sending build context to Docker daemon   7.68kB
Step 1/7 : ARG RAPIDS_IMAGE
Step 2/7 : FROM $RAPIDS_IMAGE as rapids
 ---> a80bdce0d796
Step 3/7 : RUN conda install --yes -n base         cupy         flask         protobuf         sagemaker
 ---> Running in f6522ce9b303
Channels:
 - rapidsai-nightly
 - dask/label/dev
 - pytorch
 - conda-forge
 - nvidia
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - cupy
    - flask
    - protobuf
    - sagemaker


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    blinker-1.8.2              |     pyhd8ed1ab_0          14 KB  conda-forge
    boto3-1.34.118             |     pyhd8ed1ab_0          78 KB  conda-forge
    botocore-1.34.118          |pyge310_1234567_0         6.8 MB  conda-forge
    dill-0.3.8                 |     pyhd8ed1ab_0          86 KB  conda-forge
    flask-3.0.3                |     pyhd8ed1ab_0          79 KB  conda-forge
    google-pasta-0.2.0         |     pyh8c360ce_0          42 KB  conda-forge
    itsdangerous-2.2.0         |     pyhd8ed1ab_0          19 KB  conda-forge
    jmespath-1.0.1             |     pyhd8ed1ab_0          21 KB  conda-forge
    multiprocess-0.70.16       |  py310h2372a71_0         238 KB  conda-forge
    openssl-3.3.1              |       h4ab18f5_0         2.8 MB  conda-forge
    pathos-0.3.2               |     pyhd8ed1ab_1          52 KB  conda-forge
    pox-0.3.4                  |     pyhd8ed1ab_0          26 KB  conda-forge
    ppft-1.7.6.8               |     pyhd8ed1ab_0          33 KB  conda-forge
    protobuf-4.25.3            |  py310ha8c1f0e_0         325 KB  conda-forge
    protobuf3-to-dict-0.1.5    |  py310hff52083_8          14 KB  conda-forge
    s3transfer-0.10.1          |     pyhd8ed1ab_0          61 KB  conda-forge
    sagemaker-2.75.1           |     pyhd8ed1ab_0         377 KB  conda-forge
    smdebug-rulesconfig-1.0.1  |     pyhd3deb0d_1          20 KB  conda-forge
    werkzeug-3.0.3             |     pyhd8ed1ab_0         237 KB  conda-forge
    ------------------------------------------------------------
                                           Total:        11.2 MB

The following NEW packages will be INSTALLED:

  blinker            conda-forge/noarch::blinker-1.8.2-pyhd8ed1ab_0 
  boto3              conda-forge/noarch::boto3-1.34.118-pyhd8ed1ab_0 
  botocore           conda-forge/noarch::botocore-1.34.118-pyge310_1234567_0 
  dill               conda-forge/noarch::dill-0.3.8-pyhd8ed1ab_0 
  flask              conda-forge/noarch::flask-3.0.3-pyhd8ed1ab_0 
  google-pasta       conda-forge/noarch::google-pasta-0.2.0-pyh8c360ce_0 
  itsdangerous       conda-forge/noarch::itsdangerous-2.2.0-pyhd8ed1ab_0 
  jmespath           conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0 
  multiprocess       conda-forge/linux-64::multiprocess-0.70.16-py310h2372a71_0 
  pathos             conda-forge/noarch::pathos-0.3.2-pyhd8ed1ab_1 
  pox                conda-forge/noarch::pox-0.3.4-pyhd8ed1ab_0 
  ppft               conda-forge/noarch::ppft-1.7.6.8-pyhd8ed1ab_0 
  protobuf           conda-forge/linux-64::protobuf-4.25.3-py310ha8c1f0e_0 
  protobuf3-to-dict  conda-forge/linux-64::protobuf3-to-dict-0.1.5-py310hff52083_8 
  s3transfer         conda-forge/noarch::s3transfer-0.10.1-pyhd8ed1ab_0 
  sagemaker          conda-forge/noarch::sagemaker-2.75.1-pyhd8ed1ab_0 
  smdebug-rulesconf~ conda-forge/noarch::smdebug-rulesconfig-1.0.1-pyhd3deb0d_1 
  werkzeug           conda-forge/noarch::werkzeug-3.0.3-pyhd8ed1ab_0 

The following packages will be UPDATED:

  openssl                                  3.3.0-h4ab18f5_3 --> 3.3.1-h4ab18f5_0 



Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Removing intermediate container f6522ce9b303
 ---> 883c682b36bc
Step 4/7 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
 ---> 2f6b3e0bec44
Step 5/7 : ENV SAGEMAKER_PROGRAM rapids-higgs.py
 ---> Running in df524941c02e
Removing intermediate container df524941c02e
 ---> 4cf437176c8c
Step 6/7 : COPY entrypoint.sh /opt/entrypoint.sh
 ---> 32d95ff5bd74
Step 7/7 : ENTRYPOINT ["/opt/entrypoint.sh"]
 ---> Running in c396fa9e98ad
Removing intermediate container c396fa9e98ad
 ---> 39f900bfeba0
Successfully built 39f900bfeba0
Successfully tagged sagemaker-rapids-higgs:latest
!docker images

Publish to Elastic Container Registry#

When you run training jobs at scale, whether distributed training or independent experiments, you need the dataset and training code replicated on every instance in the cluster. Thankfully, the trickier part, moving the dataset, is handled for you by Amazon SageMaker. As for the training code, you have already prepared a Docker container; you just need to push it to a container registry, and Amazon SageMaker will pull it onto every training compute instance in the cluster.

Note: SageMaker does not support training images hosted in private Docker registries such as DockerHub, so you need to push the SageMaker-compatible RAPIDS container to Amazon Elastic Container Registry (Amazon ECR), where Amazon SageMaker can pull it.

ECR_container_fullname = (
    f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)
ECR_container_fullname
'561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest'
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}
print(
    f"source      : {estimator_info['ecr_image']}\n"
    f"destination : {ECR_container_fullname}"
)
source      : sagemaker-rapids-higgs:latest
destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com
!docker push {ECR_container_fullname}
The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs]

latest: digest: sha256:c8172a0ad30cd39b091f5fc3f3cde922ceabb103d0a0ec90beb1a5c4c9c6c97c size: 4504

Test the Amazon SageMaker-compatible RAPIDS container locally#

Before you spend time and money running a large experiment on a big cluster, you should run a local Amazon SageMaker training job to confirm the container works as expected. Make sure the SageMaker SDK is installed on your local machine.

Define some default hyperparameters. Take your best guess; you can find the full list of RandomForest hyperparameters on the cuML documentation page.

hyperparams = {
    "n_estimators": 15,
    "max_depth": 5,
    "n_bins": 8,
    "split_criterion": 0,  # GINI:0, ENTROPY:1
    "bootstrap": 0,  # true: sample with replacement, false: sample without replacement
    "max_leaves": -1,  # unlimited leaves
    "max_features": 0.2,
}
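SageMaker forwards each entry of this dict to the training script as a command-line flag (`--n_estimators 15`, and so on). A minimal sketch of how rapids-higgs.py might parse them; the script itself is not shown in this notebook, so the parsing details here are assumptions:

```python
# Hypothetical sketch: parse the hyperparameters that SageMaker forwards to
# the training script as command-line flags. Inside rapids-higgs.py the
# parsed values would then be handed to cuml.ensemble.RandomForestClassifier.
import argparse


def parse_hyperparams(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_estimators", type=int, default=15)
    parser.add_argument("--max_depth", type=int, default=5)
    parser.add_argument("--n_bins", type=int, default=8)
    parser.add_argument("--split_criterion", type=int, default=0)
    parser.add_argument("--bootstrap", type=int, default=0)
    parser.add_argument("--max_leaves", type=int, default=-1)
    parser.add_argument("--max_features", type=float, default=0.2)
    return parser.parse_args(argv)


# simulate the flags SageMaker would pass for one trial
args = parse_hyperparams(["--n_estimators", "100", "--max_features", "0.35"])
print(args.n_estimators, args.max_features)
```

Unspecified flags fall back to the defaults above, so the same script works both locally and under the tuner.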

Now, specify the instance type as local_gpu. This assumes you have a local GPU. If you don't, you can test on an Amazon SageMaker managed GPU instance instead: just replace local_gpu with a p3 or p2 GPU instance by updating the instance_type variable.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",  #'local_gpu'
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
%%time
rapids_estimator.fit(inputs=s3_data_dir)
INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371
2024-06-05 02:14:30 Starting - Starting the training job...
2024-06-05 02:14:54 Starting - Preparing the instances for training...
2024-06-05 02:15:26 Downloading - Downloading input data..................
2024-06-05 02:18:16 Downloading - Downloading the training image...
2024-06-05 02:18:47 Training - Training image download completed. Training in progress...@ entrypoint -> launching training script 

2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029

2024-06-05 02:19:35 Completed - Training job completed
Training seconds: 249
Billable seconds: 78
Managed Spot Training savings: 68.7%
CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms
Wall time: 5min 43s
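The test_acc value in the log above is captured through the metric_definitions Regex passed to the Estimator: SageMaker applies that pattern to the job's log stream. The same extraction can be reproduced locally:

```python
# Reproduce the metric extraction SageMaker performs with the
# metric_definitions Regex ("test_acc: ([0-9\.]+)") on the training logs.
import re

log_line = "2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029"
match = re.search(r"test_acc: ([0-9\.]+)", log_line)
print(match.group(1))  # -> 0.7133834362030029
```

This is why the training script must print the metric in exactly this format; otherwise the tuner has no objective value to optimize.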

Congratulations, you have successfully trained a Random Forest model on the HIGGS dataset using the SageMaker-compatible RAPIDS container. You are now ready to run experiments in parallel on a cluster, trying different hyperparameters and options.

Define hyperparameter ranges and run a large-scale search experiment#

Going from local training to training at scale requires few code changes. First, instead of defining a fixed set of hyperparameters, you define ranges using the SageMaker SDK.

from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner,
    IntegerParameter,
)

hyperparameter_ranges = {
    "n_estimators": IntegerParameter(10, 200),
    "max_depth": IntegerParameter(1, 22),
    "n_bins": IntegerParameter(5, 24),
    "split_criterion": CategoricalParameter([0, 1]),
    "bootstrap": CategoricalParameter([True, False]),
    "max_features": ContinuousParameter(0.01, 0.5),
}

Next, change the instance type to the actual GPU instance you want to train on in the cloud. Here you pick an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs, ml.p3.8xlarge. If your training script can take advantage of multiple GPUs, you can select up to 8 GPUs per instance for faster training.

from sagemaker.estimator import Estimator

rapids_estimator = Estimator(
    image_uri=ECR_container_fullname,
    role=execution_role,
    instance_count=2,
    instance_type="ml.p3.8xlarge",
    max_run=60 * 60 * 24,
    max_wait=(60 * 60 * 24) + 1,
    use_spot_instances=True,
    hyperparameters=hyperparams,
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)

Now you can define a HyperparameterTuner object using the estimator defined above.

tuner = HyperparameterTuner(
    rapids_estimator,
    objective_metric_name="test_acc",
    hyperparameter_ranges=hyperparameter_ranges,
    strategy="Bayesian",
    max_jobs=2,
    max_parallel_jobs=2,
    objective_type="Maximize",
    metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
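Once the tuning job finishes, tuner.analytics().dataframe() returns one row per training job, including its final objective value; sorting that frame surfaces the best trial. A sketch using a placeholder frame so the logic runs without AWS credentials; the column names follow the SageMaker analytics schema, but the job names and values are illustrative only, not real results:

```python
# Sketch of post-tuning analysis. In a live session you would obtain the
# frame with:
#     df = tuner.analytics().dataframe()
# A placeholder frame stands in for that call here; values are illustrative.
import pandas as pd

df = pd.DataFrame(
    {
        "TrainingJobName": ["rapidsHPO-001", "rapidsHPO-002"],
        "FinalObjectiveValue": [0.71, 0.74],
    }
)
best = df.sort_values("FinalObjectiveValue", ascending=False).iloc[0]
print(best["TrainingJobName"], best["FinalObjectiveValue"])
```

With objective_type="Maximize", the top row after this sort corresponds to the model SageMaker would report as the best training job.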

Clean up#

  • Delete S3 buckets and files you no longer need

  • Stop any training jobs that no longer need to run

  • Delete the container image and repository you just created

!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}