Scaling up Hyperparameter Optimization with Kubernetes and XGBoost GPU Algorithm#
Choosing an optimal set of hyperparameters is a daunting task, especially for an algorithm like XGBoost that has many hyperparameters to tune. In this notebook, we will speed up hyperparameter optimization by running multiple training jobs in parallel on a Kubernetes cluster.
Prerequisites#
Please follow the instructions in Dask Operator: Installation to install the Dask operator on top of a GPU-enabled Kubernetes cluster. (For the purpose of this example, you may ignore the other sections of the linked document.)
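If the operator is not installed yet, one common route at the time of writing is via Helm. The command below is a sketch that follows the dask-kubernetes documentation; verify the repository and chart names against the linked instructions before running it.
# Install the Dask Kubernetes operator (requires Helm and cluster-admin access).
!helm install --repo https://helm.dask.org --create-namespace -n dask-operator --generate-name dask-kubernetes-operator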
Optional: Kubeflow#
Kubeflow gives you a nice notebook environment for running this notebook inside the k8s cluster. Install Kubeflow by following the instructions in Installing Kubeflow. You may choose any installation method; we tested this example after installing Kubeflow from manifests.
Install extra Python packages#
We will need a few extra Python packages; in particular, dask_kubernetes and Optuna.
!pip install dask_kubernetes optuna
Set up Dask cluster#
Let us set up a Dask cluster using the KubeCluster class. Fill in the following variables according to the configuration of your Kubernetes cluster. Assuming you are using all the nodes in the Kubernetes cluster, here is how to obtain n_workers, where N is the number of nodes (see the sketch after the list for a programmatic way to derive it):
On AWS Elastic Kubernetes Service (EKS):
n_workers = N - 2
On Google Cloud Kubernetes:
n_workers = N - 1
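A minimal sketch for deriving n_workers programmatically; it assumes the kubernetes Python client is installed and that the notebook runs inside the cluster, neither of which is part of the original setup.
# Hypothetical helper: count the nodes via the Kubernetes API and reserve
# capacity for system/notebook pods. Adjust the reservation per provider.
from kubernetes import client as k8s_client, config as k8s_config

k8s_config.load_incluster_config()  # use load_kube_config() outside the cluster
N = len(k8s_client.CoreV1Api().list_node().items)
n_workers = N - 2  # EKS; use N - 1 on Google Cloud Kubernetes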
# Choose the same RAPIDS image you used for launching the notebook session
rapids_image = "nvcr.io/nvidia/rapidsai/base:25.04-cuda12.8-py3.12"
# Use the number of worker nodes in your Kubernetes cluster.
n_workers = 4
from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(
    name="rapids-dask",
    image=rapids_image,
    worker_command="dask-cuda-worker",
    n_workers=n_workers,
    resources={"limits": {"nvidia.com/gpu": "1"}},
    env={"EXTRA_PIP_PACKAGES": "optuna"},
)
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f0b4652b130>
cluster
from dask.distributed import Client
client = Client(cluster)
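Optionally, you can verify that all the Dask-CUDA workers have registered with the scheduler before submitting work; this small sanity check is not part of the original run.
# Block until the expected number of workers is connected, then list them.
client.wait_for_workers(n_workers)
client.scheduler_info()["workers"]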
Perform hyperparameter optimization with a toy example#
Now we can run hyperparameter optimization. The workers will run multiple training jobs in parallel.
def objective(trial):
    x = trial.suggest_float("x", -10, 10)
    return (x - 2) ** 2
import optuna
from dask.distributed import wait
# Number of hyperparameter combinations to try in parallel
n_trials = 100
# Optimize in parallel on your Dask cluster
backend_storage = optuna.storages.InMemoryStorage()
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
study = optuna.create_study(direction="minimize", storage=dask_storage)
futures = []
# Submit trials in batches of n_workers * 4 tasks; each task runs a single
# Optuna trial against the shared Dask-backed storage.
for i in range(0, n_trials, n_workers * 4):
    iter_range = (i, min([i + n_workers * 4, n_trials]))
    futures.append(
        {
            "range": iter_range,
            "futures": [
                client.submit(study.optimize, objective, n_trials=1, pure=False)
                for _ in range(*iter_range)
            ],
        }
    )
for partition in futures:
    iter_range = partition["range"]
    print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}")
    _ = wait(partition["futures"])
/tmp/ipykernel_75/1194069379.py:9: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future.
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
Testing hyperparameter combinations 0..16
Testing hyperparameter combinations 16..32
Testing hyperparameter combinations 32..48
Testing hyperparameter combinations 48..64
Testing hyperparameter combinations 64..80
Testing hyperparameter combinations 80..96
Testing hyperparameter combinations 96..100
study.best_params
{'x': 1.9899853370223668}
study.best_value
0.00010029347455557715
Perform hyperparameter optimization with the XGBoost GPU algorithm#
Now let us try optimizing the hyperparameters of an XGBoost model.
import xgboost as xgb
from optuna.samplers import RandomSampler
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
def objective(trial):
    X, y = load_breast_cancer(return_X_y=True)
    params = {
        "n_estimators": 10,
        "verbosity": 0,
        # Train with the GPU-accelerated histogram algorithm.
        "device": "cuda",
        "tree_method": "hist",
        # L2 regularization weight.
        "lambda": trial.suggest_float("lambda", 1e-8, 100.0, log=True),
        # L1 regularization weight.
        "alpha": trial.suggest_float("alpha", 1e-8, 100.0, log=True),
        # Column subsampling ratio for each tree.
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.2, 1.0),
        "max_depth": trial.suggest_int("max_depth", 2, 10, step=1),
        # Minimum child weight; larger values make the tree more conservative.
        "min_child_weight": trial.suggest_float(
            "min_child_weight", 1e-8, 100, log=True
        ),
        "learning_rate": trial.suggest_float("learning_rate", 1e-8, 1.0, log=True),
        # Minimum loss reduction to make a split; defines how selective the algorithm is.
        "gamma": trial.suggest_float("gamma", 1e-8, 1.0, log=True),
        "grow_policy": "depthwise",
        "eval_metric": "logloss",
    }
    clf = xgb.XGBClassifier(**params)
    fold = KFold(n_splits=5, shuffle=True, random_state=0)
    score = cross_val_score(clf, X, y, cv=fold, scoring="neg_log_loss")
    return score.mean()
# Number of hyperparameter combinations to try in parallel
n_trials = 250
# Optimize in parallel on your Dask cluster
backend_storage = optuna.storages.InMemoryStorage()
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
study = optuna.create_study(
    direction="maximize", sampler=RandomSampler(seed=0), storage=dask_storage
)
futures = []
for i in range(0, n_trials, n_workers * 4):
    iter_range = (i, min([i + n_workers * 4, n_trials]))
    futures.append(
        {
            "range": iter_range,
            "futures": [
                client.submit(study.optimize, objective, n_trials=1, pure=False)
                for _ in range(*iter_range)
            ],
        }
    )
for partition in futures:
    iter_range = partition["range"]
    print(f"Testing hyperparameter combinations {iter_range[0]}..{iter_range[1]}")
    _ = wait(partition["futures"])
/tmp/ipykernel_75/1634478960.py:6: ExperimentalWarning: DaskStorage is experimental (supported from v3.1.0). The interface can change in the future.
dask_storage = optuna.integration.DaskStorage(storage=backend_storage, client=client)
Testing hyperparameter combinations 0..16
Testing hyperparameter combinations 16..32
Testing hyperparameter combinations 32..48
Testing hyperparameter combinations 48..64
Testing hyperparameter combinations 64..80
Testing hyperparameter combinations 80..96
Testing hyperparameter combinations 96..112
Testing hyperparameter combinations 112..128
Testing hyperparameter combinations 128..144
Testing hyperparameter combinations 144..160
Testing hyperparameter combinations 160..176
Testing hyperparameter combinations 176..192
Testing hyperparameter combinations 192..208
Testing hyperparameter combinations 208..224
Testing hyperparameter combinations 224..240
Testing hyperparameter combinations 240..250
study.best_params
{'lambda': 1.9471539598103378,
'alpha': 1.1141784696858766e-08,
'colsample_bytree': 0.7422532294369841,
'max_depth': 4,
'min_child_weight': 0.2248745054413427,
'learning_rate': 0.4983200494234886,
'gamma': 9.77293810275356e-07}
study.best_value
-0.10351124143719839
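As a quick follow-up that is not part of the original notebook, you could refit a single classifier on the full dataset with the best hyperparameters found by the study. The sketch below reuses the imports above and assumes XGBoost >= 2.0 for the device parameter; the fixed settings mirror those in objective().
# Sketch: refit on the full breast-cancer dataset with the best hyperparameters.
X, y = load_breast_cancer(return_X_y=True)
final_params = {
    **study.best_params,
    "n_estimators": 10,
    "device": "cuda",
    "tree_method": "hist",
    "eval_metric": "logloss",
}
final_clf = xgb.XGBClassifier(**final_params)
final_clf.fit(X, y)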
Let us visualize the progress of the hyperparameter optimization.
from optuna.visualization.matplotlib import (
    plot_optimization_history,
    plot_param_importances,
)
plot_optimization_history(study)
/tmp/ipykernel_75/3324289224.py:1: ExperimentalWarning: plot_optimization_history is experimental (supported from v2.2.0). The interface can change in the future.
plot_optimization_history(study)
<AxesSubplot: title={'center': 'Optimization History Plot'}, xlabel='Trial', ylabel='Objective Value'>

plot_param_importances(study)
/tmp/ipykernel_75/3836449081.py:1: ExperimentalWarning: plot_param_importances is experimental (supported from v2.2.0). The interface can change in the future.
plot_param_importances(study)
<AxesSubplot: title={'center': 'Hyperparameter Importances'}, xlabel='Importance for Objective Value', ylabel='Hyperparameter'>

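Finally, once you are done experimenting, it is good practice to shut down the Dask cluster so the GPU nodes are released; this cleanup step is a suggested addition rather than part of the original run.
# Release the Dask-CUDA workers and the scheduler pod.
client.close()
cluster.close()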