Use TensorBoard to debug and analyze training jobs in Amazon SageMaker

Amazon SageMaker with TensorBoard is a capability of Amazon SageMaker that brings the visualization tools of TensorBoard to SageMaker, integrated with SageMaker Training and SageMaker domain. Through SageMaker domain, you can administer your Amazon account and the users that belong to it, give domain users access to TensorBoard data with appropriate permissions to Amazon S3, and help them debug models using the TensorBoard visualization plugins. SageMaker with TensorBoard is extended with the SageMaker Data Manager plugin, which lets domain users access multiple training jobs in one place within the TensorBoard application.

Note

This feature is for training and debugging deep learning models using the PyTorch or TensorFlow framework.

For data scientists

Training large models can run into scientific problems that data scientists must debug and resolve to improve model convergence and stabilize gradient descent.

When you encounter model training issues, such as loss not converging, or vanishing or exploding weights and gradients, you need to access tensor data to dive deep and analyze the model parameters, scalars, and any custom metrics. Using SageMaker with TensorBoard, you can visualize model output tensors extracted from training jobs. As you experiment with different models, multiple training runs, and model hyperparameters, you can select multiple training jobs in TensorBoard and compare them in one place.

For administrators

Through the TensorBoard landing page in the SageMaker console or SageMaker domain, you can manage TensorBoard application users if you are an administrator of an Amazon account or SageMaker domain. Each domain user can access their own TensorBoard application given the granted permissions. As a SageMaker domain administrator and domain user, you can create and delete the TensorBoard application given the permission level you have.

Supported frameworks and Amazon Web Services Regions

This feature supports the following machine learning frameworks and Amazon Web Services Regions.

Frameworks
  • PyTorch

  • TensorFlow

  • Hugging Face Transformers

Amazon Web Services Regions
  • US East (N. Virginia) (us-east-1)

  • US East (Ohio) (us-east-2)

  • US West (Oregon) (us-west-2)

  • Europe (Frankfurt) (eu-central-1)

  • Europe (Ireland) (eu-west-1)

Note

Amazon SageMaker with TensorBoard runs the TensorBoard application on an ml.r5.large instance and incurs charges after the SageMaker free tier or the free trial period of the feature. For more information, see Amazon SageMaker Pricing.

Prerequisites

The following list shows the prerequisites to start using SageMaker with TensorBoard.

  • A SageMaker domain that's set up with Amazon VPC in your Amazon account.

    For instructions on setting up a domain, see Onboard to Amazon SageMaker domain using quick setup. You also need to add domain user profiles for individual users to access the TensorBoard on SageMaker. For more information, see Add and remove SageMaker domain user profiles.

  • The following list is the minimum set of permissions for using TensorBoard on SageMaker.

    • sagemaker:CreateApp

    • sagemaker:DeleteApp

    • sagemaker:DescribeTrainingJob

    • sagemaker:Search

    • s3:GetObject

    • s3:ListBucket
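One way to express the permissions listed above is as an IAM policy document. The following is a minimal sketch rather than an official policy: the function name build_tensorboard_policy and the wildcard resource are assumptions made for illustration, and in a real account you would likely scope the resource down to specific ARNs.

```python
import json

# Actions copied from the minimum-permissions list above.
TENSORBOARD_ACTIONS = [
    "sagemaker:CreateApp",
    "sagemaker:DeleteApp",
    "sagemaker:DescribeTrainingJob",
    "sagemaker:Search",
    "s3:GetObject",
    "s3:ListBucket",
]


def build_tensorboard_policy(resource="*"):
    """Return an IAM policy document (as a dict) that allows the listed actions.

    The wildcard resource is an assumption made for brevity, not a
    recommendation from the documentation.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": list(TENSORBOARD_ACTIONS),
                "Resource": resource,
            }
        ],
    }


# Serialize the policy so it can be pasted into the IAM console or a template.
policy_json = json.dumps(build_tensorboard_policy(), indent=2)
print(policy_json)
```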

Prepare a training job with a TensorBoard output data configuration

A typical training job for deep learning in SageMaker consists of two main steps: preparing a training script and configuring a SageMaker Training job launcher. This section describes the changes required to collect TensorBoard-compatible data from SageMaker Training.

Step 1: Modify your training script

Decide which output tensors and scalars to collect, and modify your training script to collect them using any of the following tools: TensorBoardX, TensorFlow Summary Writer, PyTorch Summary Writer, or SageMaker Debugger.

Also make sure that you specify the TensorBoard data output path as the log directory (log_dir) for the callback in the training container.
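The script change in Step 1 can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the train function, the placeholder loss formula, and the RecordingWriter stand-in are all hypothetical. The loop only relies on a writer that offers add_scalar and close, which is the interface a PyTorch summary writer provides.

```python
# Local path in the training container that is used as the log directory.
LOG_DIR = "/opt/ml/output/tensorboard"


class RecordingWriter:
    """Hypothetical in-memory stand-in for a summary writer (for dry runs)."""

    def __init__(self):
        self.scalars = []
        self.closed = False

    def add_scalar(self, tag, scalar_value, global_step):
        self.scalars.append((tag, scalar_value, global_step))

    def close(self):
        self.closed = True


def train(num_steps, writer):
    """Toy training loop that logs one scalar per step, then closes the writer.

    The loss formula is a placeholder, not a real model.
    """
    last_loss = None
    for step in range(num_steps):
        last_loss = 1.0 / (step + 1)  # stand-in for a real loss value
        writer.add_scalar("train/loss", last_loss, step)
    writer.close()
    return last_loss


# Dry run outside the training container, without any framework installed.
dry_run = RecordingWriter()
final_loss = train(4, dry_run)
print(final_loss, len(dry_run.scalars))
```

In a PyTorch training script, you would pass torch.utils.tensorboard.SummaryWriter(log_dir=LOG_DIR) as the writer instead of the stand-in, so that the event files land in the directory that SageMaker uploads.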

For more information about callbacks per framework, see the following resources.

Step 2: Construct a SageMaker training launcher with TensorBoard data configuration

Use sagemaker.debugger.TensorBoardOutputConfig when configuring a SageMaker framework estimator. This configuration API maps the S3 bucket you specify for saving TensorBoard data to the local path in the training container (/opt/ml/output/tensorboard). Pass the configuration object to the tensorboard_output_config parameter of the estimator class. The following code snippet shows an example of preparing a TensorFlow estimator with the TensorBoard output configuration parameter.

Note

This example assumes that you use the SageMaker Python SDK. If you use the low-level SageMaker API, include the following in the request syntax of the CreateTrainingJob API.

    "TensorBoardOutputConfig": {
        "LocalPath": "/opt/ml/output/tensorboard",
        "S3OutputPath": "s3_output_bucket"
    }

    import os

    from sagemaker.tensorflow import TensorFlow
    from sagemaker.debugger import TensorBoardOutputConfig

    # Set variables for training job information,
    # such as s3_out_bucket and other unique tags.
    ...

    LOG_DIR = "/opt/ml/output/tensorboard"

    output_path = os.path.join(
        "s3_output_bucket", "sagemaker-output", "date_str", "your-training_job_name"
    )

    tensorboard_output_config = TensorBoardOutputConfig(
        s3_output_path=os.path.join(output_path, 'tensorboard'),
        container_local_output_path=LOG_DIR
    )

    estimator = TensorFlow(
        entry_point="train.py",
        source_dir="src",
        role=role,
        image_uri=image_uri,
        instance_count=1,
        instance_type="ml.c5.xlarge",
        base_job_name="your-training_job_name",
        tensorboard_output_config=tensorboard_output_config,
        hyperparameters=hyperparameters
    )
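The preceding example assembles the S3 output path with os.path.join, which uses the path separator of whatever operating system runs the launcher. As a hedged alternative, the following sketch builds the same kind of path with posixpath.join, which always uses a forward slash, and returns the TensorBoardOutputConfig fragment for the low-level API. The helper name tensorboard_output_config_fragment, the bucket name, and the date string are assumptions for illustration.

```python
import posixpath

# Local path in the training container, as used in the example above.
LOCAL_PATH = "/opt/ml/output/tensorboard"


def tensorboard_output_config_fragment(bucket, *key_parts):
    """Build the TensorBoardOutputConfig fragment for a CreateTrainingJob request.

    posixpath.join always joins with "/", so the S3 URI is assembled
    the same way on every operating system.
    """
    s3_uri = "s3://" + posixpath.join(bucket, *key_parts, "tensorboard")
    return {"LocalPath": LOCAL_PATH, "S3OutputPath": s3_uri}


# Hypothetical bucket name and date string, for illustration only.
fragment = tensorboard_output_config_fragment(
    "amzn-s3-demo-bucket", "sagemaker-output", "2024-01-01", "your-training_job_name"
)
print(fragment)
```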

How to access TensorBoard on SageMaker

You can access TensorBoard by two methods: programmatically using the sagemaker.interactive_apps.tensorboard module that generates an unsigned or a presigned URL, or using the TensorBoard landing page in the SageMaker console. After you open TensorBoard, SageMaker runs the TensorBoard plugin and automatically finds all training job output data in TensorBoard-compatible file format.

Open TensorBoard using the sagemaker.interactive_apps.tensorboard module

The sagemaker.interactive_apps.tensorboard module provides a function called get_app_url that generates unsigned or presigned URLs to open the TensorBoard application in any environment in SageMaker or Amazon EC2. This provides a unified experience for both Studio Classic and non-Studio Classic users. In the Studio Classic environment, you can open TensorBoard by running the get_app_url() function as it is, or you can specify a job name to start tracking as the TensorBoard application opens. In non-Studio Classic environments, you can open TensorBoard by providing your domain and user profile information to the utility function. With this functionality, regardless of where or how you run training code and launch training jobs, you can directly access TensorBoard by running the get_app_url function in your Jupyter notebook or terminal.

Note

This functionality is available in the SageMaker Python SDK v2.184.0 and later. To use this functionality, make sure that you upgrade the SDK by running pip install sagemaker --upgrade.

Option 1: For SageMaker Studio Classic

If you are using SageMaker Studio Classic, you can directly open the TensorBoard application or retrieve an unsigned URL by running the get_app_url function as follows. Because you are already within the Studio Classic environment and signed in as a domain user, get_app_url() generates an unsigned URL; you do not need to authenticate again.

To open the TensorBoard application

The following code automatically opens the TensorBoard application in your environment's default web browser, using the unsigned URL that the get_app_url() function returns.

    from sagemaker.interactive_apps import tensorboard

    region = "us-west-2"
    app = tensorboard.TensorBoardApp(region)

    app.get_app_url(
        training_job_name="your-training_job_name"  # Optional. Specify the job name to track a specific training job
    )

To retrieve an unsigned URL and open the TensorBoard application manually

The following code prints an unsigned URL that you can copy into a web browser to open the TensorBoard application.

    from sagemaker.interactive_apps import tensorboard

    region = "us-west-2"
    app = tensorboard.TensorBoardApp(region)

    print("Navigate to the following URL:")
    print(
        app.get_app_url(
            training_job_name="your-training_job_name",  # Optional. Specify the name of the job to track.
            open_in_default_web_browser=False  # Set to False to print the URL to terminal.
        )
    )

Note that if you run the preceding two code samples outside the SageMaker Studio Classic environment, the function returns a URL to the TensorBoard landing page in the SageMaker console, because those environments do not have sign-in information for your domain and user profile. To create a presigned URL, see Option 2 in the following section.

Option 2: For non-Studio Classic environments

If you use a non-Studio Classic environment, such as a SageMaker notebook instance or Amazon EC2, and want to open TensorBoard directly from that environment, you need to generate a URL presigned with your domain and user profile information. A presigned URL is signed in to Amazon SageMaker Studio Classic with your domain and user profile at the time it is created, and is therefore granted access to all of the domain applications and files associated with your domain. To open TensorBoard through a presigned URL, use the get_app_url function with your domain and user profile name as follows.

Note that this option requires the domain user to have the sagemaker:CreatePresignedDomainUrl permission. Without the permission, the domain user will receive an exception error.

Important

Do not share any presigned URLs. The get_app_url function creates presigned URLs, which automatically authenticate with your domain and user profile and give access to any applications and files associated with your domain.

    print(
        app.get_app_url(
            training_job_name="your-training_job_name",  # Optional. Specify the name of the job to track.
            create_presigned_domain_url=True,  # Required to be set to True for creating a presigned URL.
            domain_id="your-domain-id",  # Required if creating a presigned URL (create_presigned_domain_url=True).
            user_profile_name="your-user-profile-name",  # Required if creating a presigned URL (create_presigned_domain_url=True).
            open_in_default_web_browser=False,  # Optional. Set to False to print the URL to terminal.
            optional_create_presigned_url_kwargs={}  # Optional. Add any additional args for Boto3 create_presigned_domain_url
        )
    )

Tip

The get_app_url function runs the SageMaker.Client.create_presigned_domain_url API in the Amazon SDK for Python (Boto3) in the backend. As the Boto3 create_presigned_domain_url API creates presigned domain URLs that expire in 300 seconds by default, presigned TensorBoard application URLs also expire in 300 seconds. If you want to extend the expiration time, pass the ExpiresInSeconds argument to the optional_create_presigned_url_kwargs argument of the get_app_url function as follows.

    optional_create_presigned_url_kwargs={"ExpiresInSeconds": 1500}

Note

If any input that you pass to get_app_url is invalid, the function outputs a URL to the TensorBoard landing page instead of opening the TensorBoard application. The output message is similar to the following.

    Navigate to the following URL:
    https://us-west-2.console.aws.amazon.com/sagemaker/home?region=us-west-2#/tensor-board-landing
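If you call get_app_url from a script, you might want to detect this fallback instead of reading the printed message. The following is a minimal sketch that inspects the URL fragment with the standard library. The helper name is_landing_page_url is hypothetical, and treating the fragment from the example output above as the marker of the landing page is an assumption, not documented SDK behavior.

```python
from urllib.parse import urlparse

# Fragment taken from the example output above. Treating it as the marker
# of the landing page is an assumption, not documented SDK behavior.
LANDING_FRAGMENT = "/tensor-board-landing"


def is_landing_page_url(url):
    """Return True if the URL fragment matches the TensorBoard landing page."""
    return urlparse(url).fragment == LANDING_FRAGMENT


example = (
    "https://us-west-2.console.aws.amazon.com/sagemaker/home"
    "?region=us-west-2#/tensor-board-landing"
)
print(is_landing_page_url(example))
```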

Open TensorBoard using the get_app_url function as an estimator class method

If you are in the process of running a training job using the estimator class of the SageMaker Python SDK and have an active object of the estimator class, you can also access the get_app_url function as a class method of the estimator class. Open the TensorBoard application or retrieve an unsigned URL by running the get_app_url method as follows. The get_app_url class method pulls the training job name from the estimator and opens the TensorBoard application with the specified job.

Note

This functionality is available in the SageMaker Python SDK v2.184.0 and later. To use this functionality, make sure that you upgrade the SDK by running pip install sagemaker --upgrade.

Option 1: For SageMaker Studio Classic

To open the TensorBoard application

The following code automatically opens the TensorBoard application in your environment's default web browser, using the unsigned URL that the get_app_url() method returns.

    from sagemaker.interactive_apps import SupportedInteractiveAppTypes

    estimator.get_app_url(
        app_type=SupportedInteractiveAppTypes.TENSORBOARD  # Required.
    )

To retrieve an unsigned URL and open the TensorBoard application manually

The following code prints an unsigned URL that you can copy into a web browser to open the TensorBoard application.

    from sagemaker.interactive_apps import SupportedInteractiveAppTypes

    print(
        estimator.get_app_url(
            app_type=SupportedInteractiveAppTypes.TENSORBOARD,  # Required.
            open_in_default_web_browser=False,  # Optional. Set to False to print the URL to terminal.
        )
    )

Note that if you run the preceding two code samples outside the SageMaker Studio Classic environment, the function returns a URL to the TensorBoard landing page in the SageMaker console, because those environments do not have sign-in information for your domain and user profile. To create a presigned URL, see Option 2 in the following section.

Option 2: For non-Studio Classic environments

If you use a non-Studio Classic environment, such as a SageMaker notebook instance or Amazon EC2, and want to generate a presigned URL to open the TensorBoard application, use the get_app_url method with your domain and user profile information as follows.

Note that this option requires the domain user to have the sagemaker:CreatePresignedDomainUrl permission. Without the permission, the domain user will receive an exception error.

Important

Do not share any presigned URLs. The get_app_url function creates presigned URLs, which automatically authenticate with your domain and user profile and give access to any applications and files associated with your domain.

    print(
        estimator.get_app_url(
            app_type=SupportedInteractiveAppTypes.TENSORBOARD,  # Required.
            create_presigned_domain_url=True,  # Required to be set to True for creating a presigned URL.
            domain_id="your-domain-id",  # Required if creating a presigned URL (create_presigned_domain_url=True).
            user_profile_name="your-user-profile-name",  # Required if creating a presigned URL (create_presigned_domain_url=True).
            open_in_default_web_browser=False,  # Optional. Set to False to print the URL to terminal.
            optional_create_presigned_url_kwargs={}  # Optional. Add any additional args for Boto3 create_presigned_domain_url
        )
    )

Open TensorBoard through the SageMaker console

You can also use the SageMaker console UI to open the TensorBoard application. There are two options to open the TensorBoard application through the SageMaker console.

Option 1: Launch TensorBoard from the domain details page

Navigate to the domain details page

The following procedure shows how to navigate to the domain details page.

  1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

  2. On the left navigation pane, choose Admin configurations.

  3. Under Admin configurations, choose Domains.

  4. From the list of domains, select the domain in which you want to launch the TensorBoard application.

Launch a user profile application

The following procedure shows how to launch a Studio Classic application that is scoped to a user profile.

  1. On the domain details page, choose the User profiles tab.

  2. Identify the user profile for which you want to launch the Studio Classic application.

  3. Choose Launch for your selected user profile, then choose TensorBoard.

Option 2: Launch TensorBoard from the TensorBoard landing page

The following procedure describes how to launch a TensorBoard application from the TensorBoard landing page.

  1. Open the Amazon SageMaker console at https://console.aws.amazon.com/sagemaker/.

  2. On the left navigation pane, choose TensorBoard.

  3. Under Get started, select the domain in which you want to launch the Studio Classic application. If your user profile only belongs to one domain, you do not see the option for selecting a domain.

  4. Select the user profile for which you want to launch the Studio Classic application. If there is no user profile in the domain, choose Create user profile. For more information, see Add and Remove User Profiles.

  5. Choose Open TensorBoard.

The following screenshot shows the location of TensorBoard in the left navigation pane of the SageMaker console and the SageMaker with TensorBoard landing page in the main pane.


The TensorBoard landing page

Access and visualize training output data in TensorBoard

You can conduct an online or offline analysis by loading collected output tensors from S3 buckets paired with training jobs during or after training.

When you open the TensorBoard application, TensorBoard opens with the SageMaker Data Manager tab. The following screenshot shows the full view of the SageMaker Data Manager tab in the TensorBoard application.

A screenshot of the SageMaker Data Manager tab in the TensorBoard application

In the SageMaker Data Manager tab, you can select any training job and load TensorBoard-compatible training output data from Amazon S3.

  1. In the Search training jobs section, use the filters to narrow down the list of training jobs you want to find, load, and visualize.

  2. In the List of training jobs section, use the check boxes to choose training jobs from which you want to pull data and visualize for debugging.

  3. Choose Add selected jobs. The selected jobs should appear in the Tracked training jobs section, as shown in the following screenshot.

    A screenshot of the Tracked training jobs section under the SageMaker Data Manager tab in the TensorBoard application

Note

The SageMaker Data Manager tab only shows training jobs configured with the TensorBoardOutputConfig parameter. Make sure you have configured the SageMaker estimator with this parameter. For more information, see Step 2: Construct a SageMaker training launcher with TensorBoard data configuration.

Note

The visualization tabs might not appear if you are using SageMaker with TensorBoard for the first time or if no data is loaded from a previous use. After adding training jobs and waiting a few seconds, refresh the viewer by choosing the clockwise circular arrow in the upper-right corner. The visualization tabs should appear after the job data is successfully loaded. You can also turn on auto-refresh using the Settings button next to the refresh button in the upper-right corner.

Explore training output data visualized in TensorBoard

In the graphics tabs, the left pane lists the loaded training jobs. Use the check boxes next to the training jobs to show or hide their visualizations. TensorBoard activates its plugins dynamically, depending on how you set up your training script to include summary writers and pass callbacks for tensor and scalar collection, so the graphics tabs also appear dynamically. The following screenshots show example views of each tab with visualization of two training jobs that collected metrics for time series, scalar, graph, distribution, and histogram plugins.

The TIME SERIES tab view


A screenshot of the TIME SERIES tab that shows the TensorBoard data of selected training jobs for tracking, and the serialized histogram plots in the main pane.

The SCALARS tab view

A screenshot of the SCALARS tab that shows the TensorBoard data of selected training jobs for tracking, and the plots of scalar values, such as accuracy and loss, with respect to step (or epoch) in the main pane.

The GRAPHS tab view

A screenshot of the GRAPHS tab that shows the model graph of a selected training job.

The DISTRIBUTIONS tab view

A screenshot of the DISTRIBUTIONS tab that shows the TensorBoard data of selected training jobs for tracking, and the changes in distribution of model parameters, such as biases and kernels, with respect to step (or epoch) in the main pane.

The HISTOGRAMS tab view

A screenshot of the HISTOGRAMS tab that shows the TensorBoard data of selected training jobs for tracking, and the serialized histograms of model parameters, such as biases and kernels, in the main pane.

Delete unused TensorBoard applications

After you are done with monitoring and experimenting with jobs in TensorBoard, shut the TensorBoard application down.

  1. Open the SageMaker console.

  2. On the left navigation pane, choose Admin configurations.

  3. Under Admin configurations, choose Domains.

  4. Choose your domain.

  5. Choose your user profile.

  6. Under Apps, choose Delete App for the TensorBoard row.

  7. Choose Yes, delete app.

  8. Type delete in the text box, then choose Delete.

  9. A blue message should appear at the top of the screen: default is being deleted.
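The console steps above can also be approximated programmatically. The following is a sketch under stated assumptions: the helper delete_tensorboard_app and the DryRunClient stub are hypothetical, the app name "default" is inferred from the console message in the last step, and the domain ID shown is a placeholder. The helper calls delete_app on whatever client you pass, which in practice would be boto3.client("sagemaker").

```python
class DryRunClient:
    """Hypothetical stub that records requests instead of calling the service."""

    def __init__(self):
        self.calls = []

    def delete_app(self, **kwargs):
        self.calls.append(kwargs)


def delete_tensorboard_app(client, domain_id, user_profile_name, app_name="default"):
    """Send a delete_app request for a user's TensorBoard application.

    `client` is expected to behave like boto3.client("sagemaker").
    The default app name is an assumption based on the console message.
    """
    request = {
        "DomainId": domain_id,
        "UserProfileName": user_profile_name,
        "AppType": "TensorBoard",
        "AppName": app_name,
    }
    client.delete_app(**request)
    return request


# Dry run with placeholder identifiers. No request reaches the service.
stub = DryRunClient()
sent = delete_tensorboard_app(stub, "your-domain-id", "your-user-profile-name")
print(sent)
```

Passing the client as a parameter keeps the sketch runnable without credentials and makes it easy to swap in a real SageMaker client once the request looks right.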

Considerations

Consider the following when using SageMaker with TensorBoard.

  • You cannot share the TensorBoard applications for collaboration purposes because SageMaker domain does not allow application sharing among users. Users can share the output tensors saved in an S3 bucket, if they have access to the bucket.

  • The visualization plugins might not appear when you first launch the TensorBoard application. After you select training jobs in the SageMaker Data Manager plugin, the TensorBoard application loads the TensorBoard data and populates the visualization plugins.

  • The TensorBoard application automatically shuts down after 1 hour of inactivity. To stop paying for the instance that hosts it as soon as you are done, shut down TensorBoard manually. For instructions on deleting the application, see Delete unused TensorBoard applications.