How to Setup InfiniBand on Azure#
Azure GPU optimized virtual machines provide low-latency, high-bandwidth InfiniBand networking. This guide walks through the steps to enable InfiniBand for optimal network performance.
Build the Virtual Machine#
Create a GPU optimized virtual machine from the Azure portal. The following is the example we will use for this demonstration.
Create a new VM instance.
Select the East US region.
Change Availability options to Availability set and create a set. If building multiple instances, put the additional instances in the same set.
Use the second generation Ubuntu 24.04 image. Search all images for Ubuntu Server 24.04 and choose the second one in the list.
Change the size to ND40rs_v2.
Set up password login with the credentials user someuser and password somepassword.
Leave all other options at their default settings.
Then connect to the VM with your preferred method.
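If you prefer scripting the setup instead of clicking through the portal, the same configuration can also be created with the Azure CLI. The sketch below is illustrative only and assumes you are already logged in with az login; the resource group and availability set names (rapids-rg, rapids-avset) are placeholders, and the Ubuntu 24.04 Gen2 image URN may differ in your subscription, so verify it with az vm image list before using it.
# Placeholder resource group and availability set names -- adjust for your environment
az group create --name rapids-rg --location eastus
az vm availability-set create --resource-group rapids-rg --name rapids-avset
# The image URN below is an assumption for Ubuntu Server 24.04 Gen2; verify with `az vm image list --all`
az vm create \
  --resource-group rapids-rg \
  --name rapids-ib-vm \
  --availability-set rapids-avset \
  --size Standard_ND40rs_v2 \
  --image Canonical:ubuntu-24_04-lts:server:latest \
  --admin-username someuser \
  --admin-password somepassword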
Install Software#
Before installing the drivers, make sure the system is up to date.
sudo apt-get update
sudo apt-get upgrade -y
NVIDIA Drivers#
The following commands should work on Ubuntu. For details on installing on other operating systems, see the CUDA Toolkit documentation.
sudo apt-get install -y linux-headers-$(uname -r)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda-drivers
Reboot the VM instance.
sudo reboot
Once the VM is back up, reconnect and run nvidia-smi to verify the driver installation.
nvidia-smi
Mon Nov 14 20:32:39 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05 Driver Version: 520.61.05 CUDA Version: 11.8 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000001:00:00.0 Off | 0 |
| N/A 34C P0 41W / 300W | 445MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... On | 00000002:00:00.0 Off | 0 |
| N/A 37C P0 43W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... On | 00000003:00:00.0 Off | 0 |
| N/A 34C P0 42W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... On | 00000004:00:00.0 Off | 0 |
| N/A 35C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 Tesla V100-SXM2... On | 00000005:00:00.0 Off | 0 |
| N/A 35C P0 41W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 5 Tesla V100-SXM2... On | 00000006:00:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 Tesla V100-SXM2... On | 00000007:00:00.0 Off | 0 |
| N/A 37C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 Tesla V100-SXM2... On | 00000008:00:00.0 Off | 0 |
| N/A 38C P0 44W / 300W | 4MiB / 32768MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1396 G /usr/lib/xorg/Xorg 427MiB |
| 0 N/A N/A 1762 G /usr/bin/gnome-shell 16MiB |
| 1 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 2 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 3 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 4 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 5 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 6 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
| 7 N/A N/A 1396 G /usr/lib/xorg/Xorg 4MiB |
+-----------------------------------------------------------------------------+
InfiniBand Drivers#
On Ubuntu 24.04
sudo apt-get install -y automake dh-make git libcap2 libnuma-dev libtool make pkg-config udev curl librdmacm-dev rdma-core \
libgfortran5 bison chrpath flex graphviz gfortran tk quilt swig tcl ibverbs-utils
Check the installation.
ibv_devinfo
hca_id: mlx5_0
transport: InfiniBand (0)
fw_ver: 16.28.4000
node_guid: 0015:5dff:fe33:ff2c
sys_image_guid: 0c42:a103:00b3:2f68
vendor_id: 0x02c9
vendor_part_id: 4120
hw_ver: 0x0
board_id: MT_0000000010
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 7
port_lid: 115
port_lmc: 0x00
link_layer: InfiniBand
hca_id: rdmaP36305p0s2
transport: InfiniBand (0)
fw_ver: 2.43.7008
node_guid: 6045:bdff:feed:8445
sys_image_guid: 043f:7203:0003:d583
vendor_id: 0x02c9
vendor_part_id: 4100
hw_ver: 0x0
board_id: MT_1090111019
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 1024 (3)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: Ethernet
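The adapter whose link_layer reads InfiniBand (mlx5_0 above) is the one we care about, and its port state should show PORT_ACTIVE. If multiple devices are listed, you can limit the output to a single one with the -d flag, for example:
ibv_devinfo -d mlx5_0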
Enable IPoIB#
sudo sed -i -e 's/# OS.EnableRDMA=y/OS.EnableRDMA=y/g' /etc/waagent.conf
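Optionally, confirm the setting is now uncommented before rebooting:
grep EnableRDMA /etc/waagent.conf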
Reboot and reconnect.
sudo reboot
Check IB#
ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
inet 10.6.0.5/24 brd 10.6.0.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::6245:bdff:fea7:42cc/64 scope link
valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
link/ether 00:15:5d:33:ff:16 brd ff:ff:ff:ff:ff:ff
4: enP44906s1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
link/ether 60:45:bd:a7:42:cc brd ff:ff:ff:ff:ff:ff
altname enP44906p0s2
5: ibP59423s2: <BROADCAST,MULTICAST> mtu 4092 qdisc noop state DOWN group default qlen 256
link/infiniband 00:00:09:27:fe:80:00:00:00:00:00:00:00:15:5d:ff:fd:33:ff:16 brd 00:ff:ff:ff:ff:12:40:1b:80:1d:00:00:00:00:00:00:ff:ff:ff:ff
altname ibP59423p0s2
nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NV2 NV1 NV2 NODE NODE NV1 NODE NODE 0-19 0
GPU1 NV2 X NV2 NV1 NODE NODE NODE NV1 NODE 0-19 0
GPU2 NV1 NV2 X NV1 NV2 NODE NODE NODE NODE 0-19 0
GPU3 NV2 NV1 NV1 X NODE NV2 NODE NODE NODE 0-19 0
GPU4 NODE NODE NV2 NODE X NV1 NV1 NV2 NODE 0-19 0
GPU5 NODE NODE NODE NV2 NV1 X NV2 NV1 NODE 0-19 0
GPU6 NV1 NODE NODE NODE NV1 NV2 X NV2 NODE 0-19 0
GPU7 NODE NV1 NODE NODE NV2 NV1 NV2 X NODE 0-19 0
mlx5_0 NODE NODE NODE NODE NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Install UCX-Py and Tools#
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
Accept the default settings and allow the conda initialization to run.
~/mambaforge/bin/conda init
Then start a new shell.
Create a conda environment (see the UCX-Py documentation).
mamba create -n ucxpy -c rapidsai -c conda-forge -c nvidia rapids=25.04 python=3.12 cuda-version=12.8 ipython ucx-proc=*=gpu ucx ucx-py dask distributed numpy cupy pytest pynvml -y
mamba activate ucxpy
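As a quick sanity check that the environment is usable, you can import UCX-Py and print the UCX version it was built against (UCX-Py exposes ucp.get_ucx_version() for this):
python -c "import ucp; print(ucp.get_ucx_version())"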
Clone the UCX-Py repository locally.
git clone https://github.com/rapidsai/ucx-py.git
cd ucx-py
Run Tests#
First run the UCX-Py test suite from within the ucx-py repository.
pytest -vs tests/
pytest -vs ucp/_libs/tests/
Now check that InfiniBand is working as expected; to do so, you can run some of the benchmarks included with UCX-Py, for example:
# cd out of the ucx-py directory
cd ..
# Let UCX pick the best transport (expecting NVLink when available,
# otherwise InfiniBand, or TCP in worst case) on devices 0 and 1
python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
# Force TCP-only on devices 0 and 1
UCX_TLS=tcp,cuda_copy python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB
We expect the first case above to achieve higher bandwidth than the second. If you have both NVLink and InfiniBand connectivity, you can restrict UCX to particular transports by setting UCX_TLS, for example:
# NVLink (if available) or TCP
UCX_TLS=tcp,cuda_copy,cuda_ipc
# InfiniBand (if available) or TCP
UCX_TLS=tcp,cuda_copy,rc
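For example, to repeat the send/recv benchmark above while restricting UCX to InfiniBand (with TCP as the fallback), combine the two:
# Same benchmark as above, forcing InfiniBand (rc) or TCP on devices 0 and 1
UCX_TLS=tcp,cuda_copy,rc python -m ucp.benchmarks.send_recv --server-dev 0 --client-dev 1 -o rmm --reuse-alloc -n 128MiB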
Run Benchmarks#
Finally, let's run the merge benchmark from dask-cuda.
This benchmark uses Dask to perform a merge of two dataframes distributed across all GPUs available on the VM. Merges are a challenging benchmark in a distributed setting because they require a communication-intensive shuffle of the participating dataframes (see the Dask documentation for more on this type of operation). To perform the merge, each dataframe is shuffled so that rows with the same join key end up on the same GPU. This produces an all-to-all communication pattern requiring heavy communication between GPUs, so network performance matters a great deal for the benchmark's throughput.
Below we run against devices 0 through 7 (inclusive); you will need to adjust this to the number of devices available on your VM, since the default is to run only on GPU 0. Additionally, --chunk-size 100_000_000 is a safe value for 32GB GPUs; you can scale it proportionally to the size of the GPUs you have (it scales linearly, so 50_000_000 works for 16GB, or 150_000_000 for 48GB).
# Default Dask TCP communication protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | tcp
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
48.51 s | 503.25 MiB/s
47.85 s | 510.23 MiB/s
41.20 s | 592.57 MiB/s
================================================================================
Throughput | 532.43 MiB/s +/- 22.13 MiB/s
Bandwidth | 44.76 MiB/s +/- 0.93 MiB/s
Wall clock | 45.85 s +/- 3.30 s
# UCX protocol
python -m dask_cuda.benchmarks.local_cudf_merge --devs 0,1,2,3,4,5,6,7 --chunk-size 100_000_000 --protocol ucx --no-show-p2p-bandwidth
Merge benchmark
--------------------------------------------------------------------------------
Backend | dask
Merge type | gpu
Rows-per-chunk | 100000000
Base-chunks | 8
Other-chunks | 8
Broadcast | default
Protocol | ucx
Device(s) | 0,1,2,3,4,5,6,7
RMM Pool | True
Frac-match | 0.3
TCP | None
InfiniBand | None
NVLink | None
Worker thread(s) | 1
Data processed | 23.84 GiB
Number of workers | 8
================================================================================
Wall clock | Throughput
--------------------------------------------------------------------------------
9.57 s | 2.49 GiB/s
6.01 s | 3.96 GiB/s
9.80 s | 2.43 GiB/s
================================================================================
Throughput | 2.82 GiB/s +/- 341.13 MiB/s
Bandwidth | 159.89 MiB/s +/- 8.96 MiB/s
Wall clock | 8.46 s +/- 1.73 s