Using the Graviton GPU DLAMI - Deep Learning AMI


Using the Graviton GPU DLAMI

The Amazon Deep Learning AMI is ready to use with Arm-based Graviton GPUs. The Graviton GPU DLAMI includes a foundational platform of GPU drivers, along with acceleration libraries that you can use to deploy your own customized deep learning environment. Docker and NVIDIA Docker come preconfigured on the Graviton GPU DLAMI, so you can deploy containerized applications. For more details about the Graviton GPU DLAMI, see the release notes.

Check the GPU status

Use the NVIDIA System Management Interface to check the status of the Graviton GPU.

nvidia-smi

The output of the nvidia-smi command should look similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          On   | 00000000:00:1F.0 Off |                    0 |
| N/A   32C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
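For scripting, nvidia-smi also supports a machine-readable query mode, for example `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`. The sketch below parses one such CSV line; the line is hardcoded from the sample output above so the parsing logic runs even without a GPU attached:

```shell
# One CSV line as emitted by:
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# Hardcoded here (taken from the sample output above) so this runs anywhere.
csv="NVIDIA T4G, 470.82.01, 15109 MiB"

# Split the comma-separated fields with awk
name=$(echo "$csv"   | awk -F', ' '{print $1}')
driver=$(echo "$csv" | awk -F', ' '{print $2}')
memory=$(echo "$csv" | awk -F', ' '{print $3}')

echo "GPU: $name, driver: $driver, memory: $memory"
```

On the DLAMI itself you would replace the hardcoded line with the live nvidia-smi query.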

Check the CUDA version

To check the CUDA version, run the following command:

/usr/local/cuda/bin/nvcc --version | grep Cuda

Your output should look similar to the following:

nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.4, V11.4.120
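If a script needs just the release number rather than the full banner, it can be extracted with sed. A minimal sketch; the sample line is hardcoded here so it runs without the CUDA toolkit installed:

```shell
# On the DLAMI you would pipe the real output instead:
#   /usr/local/cuda/bin/nvcc --version | grep release
nvcc_line="Cuda compilation tools, release 11.4, V11.4.120"

# Pull out the "release X.Y" number
cuda_release=$(echo "$nvcc_line" | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
echo "CUDA release: $cuda_release"
```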

Verify Docker

Run a CUDA container from DockerHub to verify Docker functionality on the Graviton GPU:

sudo docker run --platform=linux/arm64 --rm \
    --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

Your output should look similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          On   | 00000000:00:1F.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
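The container's CUDA version should not exceed what the host driver supports (driver 470.82.01 supports up to CUDA 11.4). A small sketch that builds the image tag from the host's CUDA release; the release is hardcoded here so the tag construction runs anywhere, and the patch level (.2) is an assumption matching the command above:

```shell
# Host CUDA release, as reported by /usr/local/cuda/bin/nvcc --version;
# hardcoded here so this runs without the CUDA toolkit.
cuda_release="11.4"

# nvidia/cuda image tags follow MAJOR.MINOR.PATCH-base-<os>;
# 11.4.2 is the patch level used in the command above.
image="nvidia/cuda:${cuda_release}.2-base-ubuntu20.04"
echo "$image"

# On the DLAMI (requires a GPU), you would then run:
#   sudo docker run --platform=linux/arm64 --rm --gpus all "$image" nvidia-smi
```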

TensorRT

Use the following command to access the TensorRT command line tool:

trtexec

Your output should look similar to the following:

&&&& RUNNING TensorRT.trtexec [TensorRT v8200] # trtexec
...
&&&& PASSED TensorRT.trtexec [TensorRT v8200] # trtexec

TensorRT Python wheels are available for optional installation. You can find the wheels in the following file locations:

/usr/local/tensorrt/graphsurgeon/
└── graphsurgeon-0.4.5-py2.py3-none-any.whl
/usr/local/tensorrt/onnx_graphsurgeon/
└── onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl
/usr/local/tensorrt/python/
├── tensorrt-8.2.0.6-cp36-none-linux_aarch64.whl
├── tensorrt-8.2.0.6-cp37-none-linux_aarch64.whl
├── tensorrt-8.2.0.6-cp38-none-linux_aarch64.whl
└── tensorrt-8.2.0.6-cp39-none-linux_aarch64.whl
/usr/local/tensorrt/uff/
└── uff-0.6.9-py2.py3-none-any.whl
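When installing the TensorRT wheel, the cpNN tag in the filename must match your Python minor version (cp36 through cp39 ship on this DLAMI). A sketch that builds the matching path from the running interpreter; the directory layout is taken from the listing above, and the install line is left commented out:

```shell
# Build the cpNN tag for the running interpreter (e.g. cp38 for Python 3.8).
# Note: only cp36-cp39 wheels are present in the listing above.
py_tag="cp$(python3 -c 'import sys; print("%d%d" % sys.version_info[:2])')"

wheel="/usr/local/tensorrt/python/tensorrt-8.2.0.6-${py_tag}-none-linux_aarch64.whl"
echo "$wheel"

# On the DLAMI, install with:
#   python3 -m pip install "$wheel"
```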

For additional details, see the NVIDIA TensorRT documentation.

Run CUDA samples

The Graviton GPU DLAMI provides precompiled CUDA samples that you can use to verify different CUDA functions. List the available samples with the following command:

ls /usr/local/cuda/compiled_samples

For example, use the following command to run the vectorAdd sample:

/usr/local/cuda/compiled_samples/vectorAdd

Your output should look similar to the following:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
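The "196 blocks of 256 threads" line follows from the usual CUDA grid-size calculation, ceil(N / threadsPerBlock). A quick check of that arithmetic:

```shell
# vectorAdd processes 50000 elements with 256 threads per block;
# blocks = ceil(50000 / 256), computed as the integer ceiling (n + t - 1) / t.
blocks=$(awk 'BEGIN { n = 50000; t = 256; print int((n + t - 1) / t) }')
echo "$blocks"   # prints 196, matching the output above
```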

Run the transpose sample:

/usr/local/cuda/compiled_samples/transpose

Your output should look similar to the following:

Transpose Starting...

GPU Device 0: "Turing" with compute capability 7.5

> Device 0: "NVIDIA T4G"
> SM Capability 7.5 detected:
> [NVIDIA T4G] has 40 MP(s) x 64 (Cores/MP) = 2560 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 185.1781 GB/s, Time = 0.04219 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 163.8616 GB/s, Time = 0.04768 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 98.2805 GB/s, Time = 0.07949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 127.6759 GB/s, Time = 0.06119 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 156.2960 GB/s, Time = 0.04999 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 155.9157 GB/s, Time = 0.05011 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 158.4177 GB/s, Time = 0.04932 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 133.4277 GB/s, Time = 0.05855 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

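The throughput figures can be sanity-checked: a copy or transpose moves each element twice (one read plus one write), and the sample appears to report "GB/s" in binary units (2^30 bytes). A check of the "simple copy" row under those assumptions:

```shell
# simple copy row: 1048576 fp32 elements (4 bytes each) in 0.04219 ms
elements=1048576
bytes_per_elem=4
time_ms=0.04219

# Throughput = 2 * bytes moved / time, expressed in GiB/s
throughput=$(awk -v n="$elements" -v b="$bytes_per_elem" -v t="$time_ms" \
  'BEGIN { printf "%.1f", (2 * n * b) / (t / 1000) / (2^30) }')
echo "$throughput GB/s"   # close to the reported 185.1781 GB/s
```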
Next steps

Using the Graviton GPU TensorFlow DLAMI