Appendix
Monitor training results via HyperPod recipes
SageMaker HyperPod recipes integrate with TensorBoard so that you can analyze training behavior. The recipes
also incorporate VizTracer, a low-overhead tool for tracing and visualizing Python code
execution. For more information, see
VizTracer.
The TensorBoard logs are generated and stored in log_dir. To access and analyze these logs locally, use the following procedure:
- Download the TensorBoard experiment folder from your training environment to your local machine.
- Open a terminal or command prompt on your local machine.
- Navigate to the directory containing the downloaded experiment folder.
- Launch TensorBoard by running the command:
tensorboard --port=<port> --bind_all --logdir experiment
- Open your web browser and visit http://localhost:<port>, where <port> is the port you passed to the command (for example, http://localhost:8008 if you used --port=8008).
You can now see the status and visualizations of your training jobs in the TensorBoard
interface, which helps you monitor the training process and gain insight into the behavior and
performance of your models. For more information about monitoring and analyzing
training with TensorBoard, see the
NVIDIA NeMo Framework User Guide.
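The launch step in the procedure above can be sketched as a small Python helper that assembles the command line and the matching browser URL from one port value, so the two cannot drift apart. This is only an illustration: the function names tensorboard_command and tensorboard_url are hypothetical, the folder name experiment and port 8008 are taken from the example above, and the only flags used are the ones shown in the procedure.

```python
import shlex


def tensorboard_command(logdir: str, port: int) -> list[str]:
    """Build the argument list for the TensorBoard launch command.

    Hypothetical helper: it only mirrors the flags shown in the procedure
    (--port, --bind_all, --logdir) and does not start TensorBoard itself.
    """
    if not 1 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return ["tensorboard", f"--port={port}", "--bind_all", "--logdir", logdir]


def tensorboard_url(port: int) -> str:
    """Return the local URL that matches the port passed to TensorBoard."""
    return f"http://localhost:{port}"


if __name__ == "__main__":
    # "experiment" is the downloaded experiment folder; 8008 is an example port.
    cmd = tensorboard_command("experiment", 8008)
    print(shlex.join(cmd))
    print(tensorboard_url(8008))
```

To actually start TensorBoard from Python, the returned list could be passed to subprocess.run(cmd, check=True) from the directory that contains the experiment folder.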
VizTracer
To enable VizTracer, modify your recipe to set the environment variable ENABLE_VIZTRACER to 1.
After training completes, your VizTracer profile is in the experiment folder at log_dir/viztracer_xxx.json. To analyze the profile, download it and open it with the vizviewer tool:
vizviewer --port <port> viztracer_xxx.json
This command launches vizviewer on the port you specify (for example, 9001). To view your profile, go to
http://localhost:<port> in your browser, where you can begin
analyzing the training. For more information about using VizTracer, see the
VizTracer documentation.
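Before opening a profile in vizviewer, it can help to get a quick text summary of where time was spent. The following is a minimal sketch, not part of the HyperPod recipes. It assumes the downloaded profile is a JSON file in the Chrome Trace Event layout (a top-level traceEvents list whose complete events have ph set to "X" and carry name and dur fields), and that profiles follow the viztracer_xxx.json naming shown above. The function names find_profiles, total_duration_by_name, and top_entries are hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def find_profiles(log_dir):
    """Return the viztracer_*.json files in log_dir, sorted by name."""
    return sorted(Path(log_dir).glob("viztracer_*.json"))


def total_duration_by_name(profile_path):
    """Sum the 'dur' field of complete ("X") events, grouped by event name.

    Assumption: the file uses the Chrome Trace Event layout with a
    top-level "traceEvents" list. Other event types are ignored.
    """
    with open(profile_path, encoding="utf-8") as fh:
        data = json.load(fh)
    totals = defaultdict(float)
    for event in data.get("traceEvents", []):
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return dict(totals)


def top_entries(totals, n=10):
    """Return the n (name, total duration) pairs with the largest totals."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

For example, calling top_entries(total_duration_by_name(path)) for each path returned by find_profiles("log_dir") lists the functions with the largest summed durations. Because nested calls are counted in both the caller and the callee, the totals are inclusive and are only a rough guide to what to inspect in vizviewer.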