How to Setup InfiniBand on Azure#
Azure GPU optimized virtual machines provide low-latency, high-bandwidth InfiniBand networking. This guide walks through the steps to enable InfiniBand for optimal network performance.
Build the Virtual Machine#
Create a GPU optimized virtual machine from the Azure portal. The following is the example we will use for this demonstration.
Create a new VM instance.
Select the East US region.
Change Availability options to Availability set and create a set. If building multiple instances, put the additional instances in the same set.
Use the second generation Ubuntu 24.04 image. Search all images for Ubuntu Server 24.04 and choose the second one in the list.
Change the size to ND40rs_v2.
Set up password login with the credentials user someuser and password somepassword.
Leave all other options at their default settings.
Then connect to the VM with your preferred method.
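If you prefer scripting the setup instead of clicking through the portal, the same configuration can also be created with the Azure CLI. The sketch below is illustrative only and assumes you are already logged in with az login; the resource group and availability set names (rapids-rg, rapids-avset) are placeholders, and the Ubuntu 24.04 Gen2 image URN may differ in your subscription, so verify it with az vm image list before using it.
# Placeholder resource group and availability set names -- adjust for your environment
az group create --name rapids-rg --location eastus
az vm availability-set create --resource-group rapids-rg --name rapids-avset
# The image URN below is an assumption for Ubuntu Server 24.04 Gen2; verify with `az vm image list --all`
az vm create \
  --resource-group rapids-rg \
  --name rapids-ib-vm \
  --availability-set rapids-avset \
  --size Standard_ND40rs_v2 \
  --image Canonical:ubuntu-24_04-lts:server:latest \
  --admin-username someuser \
  --admin-password somepassword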
Install Software#
Before installing the drivers, make sure the system is up to date.
sudo apt-get update
sudo apt-get upgrade -y
NVIDIA Drivers#
The following commands should work on Ubuntu. For details on installing on other operating systems, see the CUDA Toolkit documentation.
sudo apt-get install -y linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers
Reboot the VM instance.
sudo reboot
Once the VM is back up, reconnect and run nvidia-smi to verify the driver installation.
nvidia-smi
Mon Nov 14 20:32:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000001:00:00.0 Off | 0 |
| N/A 34C P0 41W / 300W | 445MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000002:00:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000003:00:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000004:00:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000005:00:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000006:00:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000007:00:00.0 Off | 0 |
| N/A 37C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000008:00:00.0 Off | 0 |
| N/A 38C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1396 G /usr/lib/xorg/Xorg 427MiB |
| 0 N/A N/A 1762 G /usr/bin/gnome-shell 16MiB |
| 1 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
InfiniBand Drivers#
On Ubuntu 24.04
sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
libgfortran5 bison chrpath flex graphviz gfortran tk quilt swig tcl ibverbs-utils
Check the installation.
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.28.4000
node_guid: 0015:5dff:fe33:ff2c
sys_image_guid: 0c42:a103:00b3:2f68
vendor_id: 0x02c9
vendor_part_id: 4120
hw_ver: 0x0
board_id: MT_0000000010
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 115
port_lmc: 0x00
link_layer: InfiniBand
hca_id: rdmaP36305p0s2
transport: InfiniBand (0)
fw_ver: 2.43.7008
node_guid: 6045:bdff:feed:8445
sys_image_guid: 043f:7203:0003:d583
vendor_id: 0x02c9
vendor_part_id: 4100
hw_ver: 0x0
board_id: MT_1090111019
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
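The adapter whose link_layer reads InfiniBand (mlx5_0 above) is the one we care about, and its port state should show PORT_ACTIVE. If multiple devices are listed, you can limit the output to a single one with the -d flag, for example:
ibv_devinfo -d mlx5_0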
Enable IPoIB#
sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
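Optionally, confirm the setting is now uncommented before rebooting:
grep EnableRDMA /etc/waagent.conf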
Reboot and reconnect.
sudo reboot
Check IB#
ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
inet 10.6.0.5/24 brd 10.6.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::6245:bdff:fea7:42cc/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:15:5d:33:ff:16 brd ff:ff:ff:ff:ff:ff
4: enP44906s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
altname enP44906p0s2
5: ibP59423s2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 00:00:09:27:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:16 brd 00:ff:ff:ff:ff:12:40:1b:80:1d:00:00:00:00:00:00:ff:ff:ff:ff
altname ibP59423p0s2
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NV2 NV1 NV2 NODE NODE NV1 NODE NODE 0-19 0
GPU1 NV2 X NV2 NV1 NODE NODE NODE NV1 NODE 0-19 0
GPU2 NV1 NV2 X NV1 NV2 NODE NODE NODE NODE 0-19 0
GPU3 NV2 NV1 NV1 X NODE NV2 NODE NODE NODE 0-19 0
GPU4 NODE NODE NV2 NODE X NV1 NV1 NV2 NODE 0-19 0
GPU5 NODE NODE NODE NV2 NV1 X NV2 NV1 NODE 0-19 0
GPU6 NV1 NODE NODE NODE NV1 NV2 X NV2 NODE 0-19 0
GPU7 NODE NV1 NODE NODE NV2 NV1 NV2 X NODE 0-19 0
mlx5_0 NODE NODE NODE NODE NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Install UCX-Py and Tools#
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
Accept the default settings and allow the conda initialization to run.
~/mambaforge/bin/conda init
Then start a new shell.
Create a conda environment (see the UCX-Py documentation).
mamba create -n ucxpy -c rapidsai -c conda-forge -c nvidia rapids=25.04 python=3.12 cuda-version=12.8 ipython ucx-proc=*=gpu ucx ucx-py dask distributed numpy cupy pytest pynvml -y
mamba activate ucxpy
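As a quick sanity check that the environment is usable, you can import UCX-Py and print the UCX version it was built against (UCX-Py exposes ucp.get_ucx_version() for this):
python -c "import ucp; print(ucp.get_ucx_version())"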
Clone the UCX-Py repository locally.
git clone https://github.com/rapidsai/ucx-py.git
cd ucx-py
Run Tests#
First run the UCX-Py test suite from within the ucx-py repository.
pytest -vs tests/
pytest -vs ucp/_libs/tests/
Now check that InfiniBand is working as expected; to do so, you can run some of the benchmarks included with UCX-Py, for example:
# cd out of the ucx-py directory
cd ..
# Let UCX pick the best transport (expecting NVLink when available,
# otherwise InfiniBand, or TCP in worst case) on devices 0 and 1
python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
# Force TCP-only on devices 0 and 1
UCX_TLS=tcp,cuda_copy python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
We expect the first case above to achieve higher bandwidth than the second. If you have both NVLink and InfiniBand connectivity, you can restrict UCX to particular transports by setting UCX_TLS, for example:
# NVLink (if available) or TCP
UCX_TLS=tcp,cuda_copy,cuda_ipc
# InfiniBand (if available) or TCP
UCX_TLS=tcp,cuda_copy,rc
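For example, to repeat the send/recv benchmark above while restricting UCX to InfiniBand (with TCP as the fallback), combine the two:
# Same benchmark as above, forcing InfiniBand (rc) or TCP on devices 0 and 1
UCX_TLS=tcp,cuda_copy,rc python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB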
Run Benchmarks#
Finally, let's run the merge benchmark from dask-cuda.
This benchmark uses Dask to perform a merge of two dataframes distributed across all GPUs available on the VM. Merges are a challenging benchmark in a distributed setting because they require a communication-intensive shuffle of the participating dataframes (see the Dask documentation for more on this type of operation). To perform the merge, each dataframe is shuffled so that rows with the same join key end up on the same GPU. This produces an all-to-all communication pattern requiring heavy communication between GPUs, so network performance matters a great deal for the benchmark's throughput.
Below we run against devices 0 through 7 (inclusive); you will need to adjust this to the number of devices available on your VM, since the default is to run only on GPU 0. Additionally, --chunk-size 100_000_000 is a safe value for 32GB GPUs; you can scale it proportionally to the size of the GPUs you have (it scales linearly, so 50_000_000 works for 16GB, or 150_000_000 for 48GB).
# Default Dask TCP communication protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | tcp
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
48.51 s | 503.25 MiB/s
47.85 s | 510.23 MiB/s
41.20 s | 592.57 MiB/s
================================================================================
Throughput | 532.43 MiB/s +/- 22.13 MiB/s
Bandwidth | 44.76 MiB/s +/- 0.93 MiB/s
Wall clock | 45.85 s +/- 3.30 s
# UCX protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
9.57 s | 2.49 GiB/s
6.01 s | 3.96 GiB/s
9.80 s | 2.43 GiB/s
================================================================================
Throughput | 2.82 GiB/s +/- 341.13 MiB/s
Bandwidth | 159.89 MiB/s +/- 8.96 MiB/s
Wall clock | 8.46 s +/- 1.73 s