Using the Graviton GPU DLAMI
The Amazon Deep Learning AMI is ready to use with Arm processor-based Graviton GPUs. The Graviton GPU DLAMI includes a foundational platform of GPU drivers, along with accelerated libraries that you can use to deploy your own customized deep learning environment. Docker and NVIDIA Docker are preconfigured on the Graviton GPU DLAMI, allowing you to deploy containerized applications. For more details about the Graviton GPU DLAMI, see the release notes.
Check the GPU status
Use the NVIDIA System Management Interface:

nvidia-smi

The output of the nvidia-smi command should be similar to the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          On   | 00000000:00:1F.0 Off |                    0 |
| N/A   32C    P8     8W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
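For scripted checks, nvidia-smi also provides a machine-readable query mode (for example, nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader). The following is a minimal Python sketch for parsing that CSV output; the query flags are standard nvidia-smi options, but the sample string and helper function here are illustrative, not part of the DLAMI:

```python
import csv
import io

def parse_gpu_query(output: str):
    """Parse CSV output from:
    nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
    """
    rows = csv.reader(io.StringIO(output), skipinitialspace=True)
    return [
        {"name": name, "driver": driver, "memory_total": mem}
        for name, driver, mem in rows
    ]

# Sample line matching the values in the table above (illustrative):
sample = "NVIDIA T4G, 470.82.01, 15109 MiB\n"
gpus = parse_gpu_query(sample)
print(gpus[0]["name"])  # NVIDIA T4G
```

On an instance, you would pipe the real command output into this parser instead of the sample string.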
Check the CUDA version

To check the CUDA version, run the following command:
/usr/local/cuda/bin/nvcc --version | grep Cuda
Your output should be similar to the following:
nvcc: NVIDIA (R) Cuda compiler driver
Cuda compilation tools, release 11.4, V11.4.120
Verify Docker

Run a CUDA container from DockerHub:
sudo docker run --platform=linux/arm64 --rm \
    --gpus all nvidia/cuda:11.4.2-base-ubuntu20.04 nvidia-smi
Your output should be similar to the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T4G          On   | 00000000:00:1F.0 Off |                    0 |
| N/A   33C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
TensorRT
Use the following command to access the TensorRT command line tool:
trtexec
Your output should be similar to the following:
&&&& RUNNING TensorRT.trtexec [TensorRT v8200] # trtexec
...
&&&& PASSED TensorRT.trtexec [TensorRT v8200] # trtexec
TensorRT Python wheels are available for optional installation. You can find these wheels in the following file locations:
/usr/local/tensorrt/graphsurgeon/
└── graphsurgeon-0.4.5-py2.py3-none-any.whl
/usr/local/tensorrt/onnx_graphsurgeon/
└── onnx_graphsurgeon-0.3.12-py2.py3-none-any.whl
/usr/local/tensorrt/python/
├── tensorrt-8.2.0.6-cp36-none-linux_aarch64.whl
├── tensorrt-8.2.0.6-cp37-none-linux_aarch64.whl
├── tensorrt-8.2.0.6-cp38-none-linux_aarch64.whl
└── tensorrt-8.2.0.6-cp39-none-linux_aarch64.whl
/usr/local/tensorrt/uff/
└── uff-0.6.9-py2.py3-none-any.whl
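Each tensorrt wheel targets a specific CPython version, encoded in its cp36 through cp39 tag. A small, hypothetical helper for picking the wheel that matches a given interpreter version (the filenames are the ones listed above; the helper function itself is not part of the DLAMI):

```python
WHEELS = [
    "tensorrt-8.2.0.6-cp36-none-linux_aarch64.whl",
    "tensorrt-8.2.0.6-cp37-none-linux_aarch64.whl",
    "tensorrt-8.2.0.6-cp38-none-linux_aarch64.whl",
    "tensorrt-8.2.0.6-cp39-none-linux_aarch64.whl",
]

def matching_wheel(major: int, minor: int, wheels=WHEELS):
    """Return the wheel whose cpXY tag matches the given Python version, or None."""
    tag = f"cp{major}{minor}"
    for name in wheels:
        if f"-{tag}-" in name:
            return name
    return None

# For example, under Python 3.8 this selects the cp38 wheel:
print(matching_wheel(3, 8))  # tensorrt-8.2.0.6-cp38-none-linux_aarch64.whl
```

You could then install the selected wheel with pip, for example: pip install /usr/local/tensorrt/python/tensorrt-8.2.0.6-cp38-none-linux_aarch64.whl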
For additional details, see the NVIDIA TensorRT documentation.
Run CUDA samples
The Graviton GPU DLAMI provides precompiled CUDA samples that you can use to verify different CUDA functionality. To list the available samples, run the following command:

ls /usr/local/cuda/compiled_samples
For example, use the following command to run the vectorAdd sample:

/usr/local/cuda/compiled_samples/vectorAdd
Your output should be similar to the following:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
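The vectorAdd sample computes C = A + B on the GPU and then verifies every element on the host, which is what produces the "Test PASSED" line. The host-side check is, in essence, the following (a Python sketch of the verification logic, not the CUDA source itself):

```python
import random

def verify_vector_add(a, b, c, eps=1e-5):
    """Mirror of the sample's host-side check: every c[i] must equal a[i] + b[i]."""
    return all(abs((x + y) - z) < eps for x, y, z in zip(a, b, c))

n = 50000  # same element count as the sample
a = [random.random() for _ in range(n)]
b = [random.random() for _ in range(n)]
c = [x + y for x, y in zip(a, b)]  # stands in for the GPU result

print("Test PASSED" if verify_vector_add(a, b, c) else "Test FAILED")
```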
Run the transpose sample:

/usr/local/cuda/compiled_samples/transpose
Your output should be similar to the following:
Transpose Starting...

GPU Device 0: "Turing" with compute capability 7.5

> Device 0: "NVIDIA T4G"
> SM Capability 7.5 detected:
> [NVIDIA T4G] has 40 MP(s) x 64 (Cores/MP) = 2560 (Cores)
> Compute performance scaling factor = 1.00
Matrix size: 1024x1024 (64x64 tiles), tile size: 16x16, block size: 16x16

transpose simple copy       , Throughput = 185.1781 GB/s, Time = 0.04219 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose shared memory copy, Throughput = 163.8616 GB/s, Time = 0.04768 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose naive             , Throughput = 98.2805 GB/s, Time = 0.07949 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coalesced         , Throughput = 127.6759 GB/s, Time = 0.06119 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose optimized         , Throughput = 156.2960 GB/s, Time = 0.04999 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose coarse-grained    , Throughput = 155.9157 GB/s, Time = 0.05011 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose fine-grained      , Throughput = 158.4177 GB/s, Time = 0.04932 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256
transpose diagonal          , Throughput = 133.4277 GB/s, Time = 0.05855 ms, Size = 1048576 fp32 elements, NumDevsUsed = 1, Workgroup = 256

Test passed
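All of the kernel variants in the transpose sample compute the same result, an out-of-place matrix transpose, and differ only in their GPU memory-access strategy. The operation itself, sketched in Python for reference (the CUDA sample works on a 1024x1024 matrix in 16x16 tiles; this sketch ignores tiling):

```python
def transpose(matrix):
    """Out-of-place transpose: out[j][i] = in[i][j]."""
    rows, cols = len(matrix), len(matrix[0])
    return [[matrix[i][j] for i in range(rows)] for j in range(cols)]

m = [[1, 2, 3],
     [4, 5, 6]]
print(transpose(m))  # [[1, 4], [2, 5], [3, 6]]
```

The benchmarked variants (shared memory copy, coalesced, diagonal, and so on) trade off bank conflicts and memory coalescing, which is why their throughput numbers above differ while the output matrix is identical.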
Next steps

Use TensorFlow with the Graviton GPU DLAMI