Use the Graviton GPU DLAMI - Deep Learning AMI


Use the Graviton GPU DLAMI

The Amazon Deep Learning AMIs are ready to use with Arm-based Graviton GPUs. The Graviton GPU DLAMI provides a base platform with GPU drivers and acceleration libraries for deploying your own customized deep learning environment. Docker and NVIDIA Docker are preconfigured on the Graviton GPU DLAMI, allowing you to deploy containerized applications. Check the release notes for more details about the Graviton GPU DLAMI.

Check the GPU status

Use the NVIDIA System Management Interface to check the status of your Graviton GPU:

nvidia-smi

The output of the nvidia-smi command should look similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          On   | 00000000:00:1F.0 Off |                    0 |
| N/A   32C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
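Beyond the full table, nvidia-smi also supports machine-readable queries (for example `nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader`). If all you have is a captured copy of the table, the version fields can be pulled out with sed; the sketch below parses the header line shown above:

```shell
# Extract the driver and CUDA versions from a captured nvidia-smi header line.
# The header string below is copied from the example output above.
header='| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |'
driver=$(printf '%s\n' "$header" | sed -n 's/.*Driver Version: \([0-9.]*\).*/\1/p')
cuda=$(printf '%s\n' "$header" | sed -n 's/.*CUDA Version: \([0-9.]*\).*/\1/p')
echo "driver=$driver cuda=$cuda"
```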

Check the CUDA version

Run the following command to check your CUDA version:

/usr/local/cuda/bin/nvcc --version | grep Cuda

Your output should look similar to the following:

nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.4, V11.4.120
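For scripting, the release number alone can be extracted from that output; a minimal sketch, using the sample line from the output above:

```shell
# Pull just the CUDA release (e.g. "11.4") out of the nvcc version string.
nvcc_line='Cuda compilation tools, release 11.4, V11.4.120'
release=$(printf '%s\n' "$nvcc_line" | sed -n 's/.*release \([0-9.]*\),.*/\1/p')
echo "$release"
```

On the instance itself you would feed the real output in, e.g. `/usr/local/cuda/bin/nvcc --version | sed -n 's/.*release \([0-9.]*\),.*/\1/p'`.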

Verify Docker

Run a CUDA container from DockerHub to verify Docker functionality on your Graviton GPU:

sudo docker run --platform=linux/arm64 --rm \
    --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi

Your output should look similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          On   | 00000000:00:1F.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
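The same image can run any command in place of nvidia-smi, for example `nvidia-smi -L` to list the visible GPUs. The sketch below only assembles and prints the command (a dry run, since it needs a GPU instance with Docker); remove the echo to execute it directly:

```shell
# Dry-run sketch: the image tag matches the verification example above.
IMAGE=nvidia/cuda:11.4.2-base-ubuntu20.04
CMD="sudo docker run --platform=linux/arm64 --rm --gpus all $IMAGE nvidia-smi -L"
echo "$CMD"
```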

TensorRT

Access the TensorRT command line tool with the following command:

trtexec

Your output should look similar to the following:

&&&& RUNNING TensorRT.trtexec [TensorRT v8200] # trtexec
...
&&&& PASSED TensorRT.trtexec [TensorRT v8200] # trtexec
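Beyond the smoke test, trtexec can build and benchmark a serialized engine from a model file. The flags below are standard trtexec options; `model.onnx` and `model.plan` are placeholder paths for a model you supply. Printed here as a dry run:

```shell
# Dry-run sketch of an engine build (placeholder model paths).
MODEL=model.onnx
ENGINE=model.plan
echo "trtexec --onnx=$MODEL --saveEngine=$ENGINE --fp16"
```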

TensorRT Python wheels are available for optional installation. You can find them in the following file locations:

/usr/local/tensorrt/graphsurgeon/
└── graphsurgeon-0.4.5-py2.py3-none-any.whl
/usr/local/tensorrt/onnx_graphsurgeon/
└── onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl
/usr/local/tensorrt/python/
├── tensorrt-8.2.0.6-cp36-none-linux_aarch64.whl
├── tensorrt-8.2.0.6-cp37-none-linux_aarch64.whl
├── tensorrt-8.2.0.6-cp38-none-linux_aarch64.whl
└── tensorrt-8.2.0.6-cp39-none-linux_aarch64.whl
/usr/local/tensorrt/uff/
└── uff-0.6.9-py2.py3-none-any.whl
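The tensorrt wheels are tagged per Python version (cp36 through cp39), so the one to install must match your interpreter. A minimal sketch that derives the tag; the install line is commented out because the wheel path exists only on the DLAMI:

```shell
# Derive the cpXY tag for the running Python interpreter.
tag=$(python3 -c 'import sys; print("cp%d%d" % sys.version_info[:2])')
echo "$tag"
# Then install the matching wheel (only valid for cp36-cp39 on this DLAMI):
# pip3 install "/usr/local/tensorrt/python/tensorrt-8.2.0.6-${tag}-none-linux_aarch64.whl"
```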

For additional details, see NVIDIA TensorRT.

Run CUDA samples

Graviton GPU DLAMI 提供预编译的 CUDA 样本,以帮助您验证不同的 CUDA 功能。

ls /usr/local/cuda/compiled_samples

For example, run the vectorAdd sample with the following command:

/usr/local/cuda/compiled_samples/vectorAdd

Your output should look similar to the following:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Run the transpose sample:

/usr/local/cuda/compiled_samples/transpose

Your output should look similar to the following:

Transpose Starting...

GPU Device 0: "Turing" with compute capability 7.5

> Device 0: "NVIDIA T4G"
> SM Capability 7.5 detected:
> [NVIDIA T4G] has 40 MP(s) x 64 (Cores/MP) = 2560 (Cores)
> Compute performance scaling factor = 1.00

Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 185.1781 GB/s, Time = 0.04219 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 163.8616 GB/s, Time = 0.04768 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 98.2805 GB/s, Time = 0.07949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 127.6759 GB/s, Time = 0.06119 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 156.2960 GB/s, Time = 0.04999 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 155.9157 GB/s, Time = 0.05011 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 158.4177 GB/s, Time = 0.04932 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 133.4277 GB/s, Time = 0.05855 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
Test passed

Next steps

Use the Graviton GPU TensorFlow DLAMI