Running RAPIDS hyperparameter experiments at scale on Amazon SageMaker#
Import packages and create Amazon SageMaker and Boto3 sessions#
import time
import boto3
import sagemaker
execution_role = sagemaker.get_execution_role()
session = sagemaker.Session()
region = boto3.Session().region_name
account = boto3.client("sts").get_caller_identity().get("Account")
account, region
('561241433344', 'us-east-2')
Upload the higgs-boson dataset to an S3 bucket#
!mkdir -p ./dataset
!if [ ! -f "dataset/HIGGS.csv" ]; then wget -P dataset https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz; fi
!if [ ! -f "dataset/HIGGS.csv" ]; then gunzip dataset/HIGGS.csv.gz; fi
s3_data_dir = session.upload_data(path="dataset", key_prefix="dataset/higgs-dataset")
s3_data_dir
's3://sagemaker-us-east-2-561241433344/dataset/higgs-dataset'
Download the latest RAPIDS container from DockerHub#
To build a RAPIDS Docker container that works with Amazon SageMaker, you start with the base RAPIDS container, which the good folks at NVIDIA have already built and pushed to DockerHub.
You extend this container to make it SageMaker-compatible by creating a Dockerfile, copying in the training script, and installing the SageMaker Training Toolkit.
estimator_info = {
    "rapids_container": "nvcr.io/nvidia/rapidsai/base:25.04-cuda12.8-py3.12",
    "ecr_image": "sagemaker-rapids-higgs:latest",
    "ecr_repository": "sagemaker-rapids-higgs",
}
%%time
!docker pull {estimator_info['rapids_container']}
!cat Dockerfile
ARG RAPIDS_IMAGE
FROM $RAPIDS_IMAGE as rapids
# Installs a few more dependencies
RUN conda install --yes -n base \
    cupy \
    flask \
    protobuf \
    'sagemaker-python-sdk>=2.239.0'
# Copies the training code inside the container
COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
# Defines rapids-higgs.py as script entry point
# ref: https://docs.aws.amazon.com/sagemaker/latest/dg/adapt-training-container.html
ENV SAGEMAKER_PROGRAM rapids-higgs.py
# override entrypoint from the base image with one that accepts
# 'train' and 'serve' (as SageMaker expects to provide)
COPY entrypoint.sh /opt/entrypoint.sh
ENTRYPOINT ["/opt/entrypoint.sh"]
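The entrypoint.sh copied above is not listed in this notebook. Here is a minimal sketch of what it might contain, assuming the SageMaker Training Toolkit (which installs a train console script) is present in the image; the "@ entrypoint" echo matches the line that shows up in the training logs later:
#!/bin/bash
# SageMaker starts the container with "train" or "serve" as the first argument
if [[ "$1" == "train" ]]; then
    echo "@ entrypoint -> launching training script"
    # `train` (from the SageMaker Training Toolkit) runs the script named
    # by the SAGEMAKER_PROGRAM environment variable
    exec train
else
    exec "$@"
fi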
!docker build -t {estimator_info['ecr_image']} --build-arg RAPIDS_IMAGE={estimator_info['rapids_container']} .
Sending build context to Docker daemon 7.68kB
Step 1/7 : ARG RAPIDS_IMAGE
Step 2/7 : FROM $RAPIDS_IMAGE as rapids
---> a80bdce0d796
Step 3/7 : RUN conda install --yes -n base cupy flask protobuf sagemaker
---> Running in f6522ce9b303
Channels:
- rapidsai-nightly
- dask/label/dev
- pytorch
- conda-forge
- nvidia
Platform: linux-64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... done
## Package Plan ##
environment location: /opt/conda
added / updated specs:
- cupy
- flask
- protobuf
- sagemaker
The following packages will be downloaded:
package | build
---------------------------|-----------------
blinker-1.8.2 | pyhd8ed1ab_0 14 KB conda-forge
boto3-1.34.118 | pyhd8ed1ab_0 78 KB conda-forge
botocore-1.34.118 |pyge310_1234567_0 6.8 MB conda-forge
dill-0.3.8 | pyhd8ed1ab_0 86 KB conda-forge
flask-3.0.3 | pyhd8ed1ab_0 79 KB conda-forge
google-pasta-0.2.0 | pyh8c360ce_0 42 KB conda-forge
itsdangerous-2.2.0 | pyhd8ed1ab_0 19 KB conda-forge
jmespath-1.0.1 | pyhd8ed1ab_0 21 KB conda-forge
multiprocess-0.70.16 | py310h2372a71_0 238 KB conda-forge
openssl-3.3.1 | h4ab18f5_0 2.8 MB conda-forge
pathos-0.3.2 | pyhd8ed1ab_1 52 KB conda-forge
pox-0.3.4 | pyhd8ed1ab_0 26 KB conda-forge
ppft-1.7.6.8 | pyhd8ed1ab_0 33 KB conda-forge
protobuf-4.25.3 | py310ha8c1f0e_0 325 KB conda-forge
protobuf3-to-dict-0.1.5 | py310hff52083_8 14 KB conda-forge
s3transfer-0.10.1 | pyhd8ed1ab_0 61 KB conda-forge
sagemaker-2.75.1 | pyhd8ed1ab_0 377 KB conda-forge
smdebug-rulesconfig-1.0.1 | pyhd3deb0d_1 20 KB conda-forge
werkzeug-3.0.3 | pyhd8ed1ab_0 237 KB conda-forge
------------------------------------------------------------
Total: 11.2 MB
The following NEW packages will be INSTALLED:
blinker conda-forge/noarch::blinker-1.8.2-pyhd8ed1ab_0
boto3 conda-forge/noarch::boto3-1.34.118-pyhd8ed1ab_0
botocore conda-forge/noarch::botocore-1.34.118-pyge310_1234567_0
dill conda-forge/noarch::dill-0.3.8-pyhd8ed1ab_0
flask conda-forge/noarch::flask-3.0.3-pyhd8ed1ab_0
google-pasta conda-forge/noarch::google-pasta-0.2.0-pyh8c360ce_0
itsdangerous conda-forge/noarch::itsdangerous-2.2.0-pyhd8ed1ab_0
jmespath conda-forge/noarch::jmespath-1.0.1-pyhd8ed1ab_0
multiprocess conda-forge/linux-64::multiprocess-0.70.16-py310h2372a71_0
pathos conda-forge/noarch::pathos-0.3.2-pyhd8ed1ab_1
pox conda-forge/noarch::pox-0.3.4-pyhd8ed1ab_0
ppft conda-forge/noarch::ppft-1.7.6.8-pyhd8ed1ab_0
protobuf conda-forge/linux-64::protobuf-4.25.3-py310ha8c1f0e_0
protobuf3-to-dict conda-forge/linux-64::protobuf3-to-dict-0.1.5-py310hff52083_8
s3transfer conda-forge/noarch::s3transfer-0.10.1-pyhd8ed1ab_0
sagemaker conda-forge/noarch::sagemaker-2.75.1-pyhd8ed1ab_0
smdebug-rulesconf~ conda-forge/noarch::smdebug-rulesconfig-1.0.1-pyhd3deb0d_1
werkzeug conda-forge/noarch::werkzeug-3.0.3-pyhd8ed1ab_0
The following packages will be UPDATED:
openssl 3.3.0-h4ab18f5_3 --> 3.3.1-h4ab18f5_0
Downloading and Extracting Packages: ...working... done
Preparing transaction: ...working... done
Verifying transaction: ...working... done
Executing transaction: ...working... done
Removing intermediate container f6522ce9b303
---> 883c682b36bc
Step 4/7 : COPY rapids-higgs.py /opt/ml/code/rapids-higgs.py
---> 2f6b3e0bec44
Step 5/7 : ENV SAGEMAKER_PROGRAM rapids-higgs.py
---> Running in df524941c02e
Removing intermediate container df524941c02e
---> 4cf437176c8c
Step 6/7 : COPY entrypoint.sh /opt/entrypoint.sh
---> 32d95ff5bd74
Step 7/7 : ENTRYPOINT ["/opt/entrypoint.sh"]
---> Running in c396fa9e98ad
Removing intermediate container c396fa9e98ad
---> 39f900bfeba0
Successfully built 39f900bfeba0
Successfully tagged sagemaker-rapids-higgs:latest
!docker images
Publish to Elastic Container Registry#
When you run large-scale training jobs, whether for distributed training or independent experiments, you need to make sure the dataset and the training code are replicated on every instance in your cluster. Thankfully, the trickier of the two, moving the dataset, is handled for you by Amazon SageMaker. As for the training code, you already have a Docker container ready: simply push it to a container registry, and Amazon SageMaker will pull it onto every training compute instance in the cluster.
Note: SageMaker does not support training images hosted in private Docker registries such as DockerHub, so you need to push the SageMaker-compatible RAPIDS container to Amazon Elastic Container Registry (Amazon ECR) to make it available to Amazon SageMaker.
ECR_container_fullname = (
f"{account}.dkr.ecr.{region}.amazonaws.com/{estimator_info['ecr_image']}"
)
ECR_container_fullname
'561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest'
!docker tag {estimator_info['ecr_image']} {ECR_container_fullname}
print(
f"source : {estimator_info['ecr_image']}\n"
f"destination : {ECR_container_fullname}"
)
source : sagemaker-rapids-higgs:latest
destination : 561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs:latest
!aws ecr create-repository --repository-name {estimator_info['ecr_repository']}
!aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {account}.dkr.ecr.{region}.amazonaws.com
!docker push {ECR_container_fullname}
The push refers to repository [561241433344.dkr.ecr.us-east-2.amazonaws.com/sagemaker-rapids-higgs]
3be3c6f4: Preparing
a7112765: Preparing
5c05c772: Preparing
bdce5066: Preparing
923ec1b3: Preparing
3fcfb3d4: Preparing
bf18a086: Preparing
f3ff1008: Preparing
b6fb91b8: Preparing
7bf1eb99: Preparing
264186e1: Preparing
7d7711e0: Preparing
ee96f292: Preparing
e2a80b3f: Preparing
0a873d7a: Preparing
bcc60d01: Preparing
1dcee623: Preparing
9a46b795: Preparing
5e83c163: Preparing
5c05c772: Pushed
latest: digest: sha256:c8172a0ad30cd39b091f5fc3f3cde922ceabb103d0a0ec90beb1a5c4c9c6c97c size: 4504
Testing the Amazon SageMaker-compatible RAPIDS container locally#
Before spending time and money running a large experiment on a big cluster, you should run a local Amazon SageMaker training job to ensure the container works as expected. Make sure you have the SageMaker SDK installed on your local machine.
Define some default hyperparameters. An educated guess is fine here; you can find the full list of RandomForest hyperparameters on the cuML docs page.
hyperparams = {
"n_estimators": 15,
"max_depth": 5,
"n_bins": 8,
"split_criterion": 0, # GINI:0, ENTROPY:1
"bootstrap": 0, # true: sample with replacement, false: sample without replacement
"max_leaves": -1, # unlimited leaves
"max_features": 0.2,
}
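For context, here is a minimal sketch of what the rapids-higgs.py training script could look like. This is an illustration rather than the exact script from the RAPIDS examples: it assumes the SageMaker Training Toolkit forwards hyperparameters as command-line arguments, that HIGGS.csv (no header row, label in the first column) is mounted under /opt/ml/input/data/training, and that the script prints test_acc in the format matched by metric_definitions below.
# rapids-higgs.py -- illustrative sketch of a cuML training script for SageMaker
import argparse

import cudf
from cuml.ensemble import RandomForestClassifier
from cuml.model_selection import train_test_split


def parse_bool(value):
    # hyperparameters may arrive as "0"/"1" or "True"/"False" strings
    return str(value).lower() in ("true", "1")


parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int, default=15)
parser.add_argument("--max_depth", type=int, default=5)
parser.add_argument("--n_bins", type=int, default=8)
parser.add_argument("--split_criterion", type=int, default=0)
parser.add_argument("--bootstrap", type=parse_bool, default=False)
parser.add_argument("--max_leaves", type=int, default=-1)
parser.add_argument("--max_features", type=float, default=0.2)
args = parser.parse_args()

# SageMaker mounts each input channel under /opt/ml/input/data/<channel-name>
data = cudf.read_csv("/opt/ml/input/data/training/HIGGS.csv", header=None)
X = data.iloc[:, 1:].astype("float32")
y = data.iloc[:, 0].astype("int32")
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(
    n_estimators=args.n_estimators,
    max_depth=args.max_depth,
    n_bins=args.n_bins,
    split_criterion=args.split_criterion,
    bootstrap=args.bootstrap,
    max_leaves=args.max_leaves,
    max_features=args.max_features,
)
model.fit(X_train, y_train)

# emit the metric in the exact format the metric_definitions regex expects
print(f"test_acc: {model.score(X_test, y_test)}")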
Now, specify the instance type as local_gpu. This assumes that you have a GPU locally. If you don't have a local GPU, you can test on an Amazon SageMaker managed GPU instance instead: simply replace local_gpu with a p3 or p2 GPU instance by updating the instance_type variable.
from sagemaker.estimator import Estimator
rapids_estimator = Estimator(
image_uri=ECR_container_fullname,
role=execution_role,
instance_count=1,
instance_type="ml.p3.2xlarge", #'local_gpu'
max_run=60 * 60 * 24,
max_wait=(60 * 60 * 24) + 1,
use_spot_instances=True,
hyperparameters=hyperparams,
metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
%%time
rapids_estimator.fit(inputs=s3_data_dir)
INFO:sagemaker:Creating training-job with name: sagemaker-rapids-higgs-2024-06-05-02-14-30-371
2024-06-05 02:14:30 Starting - Starting the training job...
2024-06-05 02:14:54 Starting - Preparing the instances for training...
2024-06-05 02:15:26 Downloading - Downloading input data..................
2024-06-05 02:18:16 Downloading - Downloading the training image...
2024-06-05 02:18:47 Training - Training image download completed. Training in progress...@ entrypoint -> launching training script
2024-06-05 02:19:27 Uploading - Uploading generated training modeltest_acc: 0.7133834362030029
2024-06-05 02:19:35 Completed - Training job completed
Training seconds: 249
Billable seconds: 78
Managed Spot Training savings: 68.7%
CPU times: user 793 ms, sys: 29.8 ms, total: 823 ms
Wall time: 5min 43s
Congratulations, you have successfully trained a Random Forest model on the HIGGS dataset using an Amazon SageMaker-compatible RAPIDS container. You are now ready to run experiments on a cluster, trying different hyperparameters and options in parallel.
Define hyperparameter ranges and run a large-scale search experiment#
Going from local training to training at scale requires surprisingly few code changes. First, instead of defining a fixed set of hyperparameters, you define a range for each one using the SageMaker SDK:
from sagemaker.tuner import (
CategoricalParameter,
ContinuousParameter,
HyperparameterTuner,
IntegerParameter,
)
hyperparameter_ranges = {
"n_estimators": IntegerParameter(10, 200),
"max_depth": IntegerParameter(1, 22),
"n_bins": IntegerParameter(5, 24),
"split_criterion": CategoricalParameter([0, 1]),
"bootstrap": CategoricalParameter([True, False]),
"max_features": ContinuousParameter(0.01, 0.5),
}
Next, change the instance type to the actual GPU instance you want to train on in the cloud. Here you will choose an Amazon SageMaker compute instance with 4 NVIDIA Tesla V100 GPUs, ml.p3.8xlarge. If your training script can leverage multiple GPUs, you can choose up to 8 GPUs per instance for faster training.
from sagemaker.estimator import Estimator
rapids_estimator = Estimator(
image_uri=ECR_container_fullname,
role=execution_role,
instance_count=2,
instance_type="ml.p3.8xlarge",
max_run=60 * 60 * 24,
max_wait=(60 * 60 * 24) + 1,
use_spot_instances=True,
hyperparameters=hyperparams,
metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
Now you can define a HyperparameterTuner object using the estimator defined above.
tuner = HyperparameterTuner(
rapids_estimator,
objective_metric_name="test_acc",
hyperparameter_ranges=hyperparameter_ranges,
strategy="Bayesian",
max_jobs=2,
max_parallel_jobs=2,
objective_type="Maximize",
metric_definitions=[{"Name": "test_acc", "Regex": "test_acc: ([0-9\\.]+)"}],
)
job_name = "rapidsHPO" + time.strftime("%Y-%m-%d-%H-%M-%S-%j", time.gmtime())
tuner.fit({"dataset": s3_data_dir}, job_name=job_name)
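Once the tuning job finishes, you can inspect the trials from the notebook. A short sketch using the SageMaker SDK's tuning analytics (the FinalObjectiveValue column holds the objective metric for each trial):
# summarize all trials as a DataFrame, best objective value first
tuner_metrics = sagemaker.HyperparameterTuningJobAnalytics(job_name)
results_df = tuner_metrics.dataframe().sort_values("FinalObjectiveValue", ascending=False)
print(results_df.head())
print("best training job:", tuner.best_training_job())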
Clean up#
Delete S3 buckets and files you no longer need (example commands follow the repository deletion below)
Stop any training jobs that should no longer be running
Delete the container image and the repository you just created
!aws ecr delete-repository --force --repository-name {estimator_info['ecr_repository']}
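The first two items can also be handled from the notebook. Two illustrative commands, assuming the s3_data_dir variable from the upload step above and a placeholder training job name that you would substitute:
!aws s3 rm {s3_data_dir} --recursive
!aws sagemaker stop-training-job --training-job-name <your-training-job-name>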