Prepare data using Amazon Glue Interactive Sessions - Amazon SageMaker

Prepare data using Amazon Glue Interactive Sessions

Amazon Glue Interactive Sessions is an on-demand, serverless, Apache Spark runtime environment that data scientists and engineers can use to rapidly build, test, and run data preparation and analytics applications.

You can initiate an Amazon Glue interactive session by starting a SageMaker Studio Classic notebook. When creating your Studio Classic notebook, choose the built-in Glue PySpark or Glue Spark kernel. This automatically starts an interactive, serverless Spark session. You do not need to provision or manage any compute cluster or infrastructure. After initialization, you can explore the Amazon Glue Data Catalog, execute complex queries, and interactively analyze and prepare data using Spark within your Studio Classic notebook. You can then use the prepared data to build, train, tune, and deploy models using the purpose-built ML tools within SageMaker Studio Classic.

Before starting your Amazon Glue interactive session in SageMaker Studio Classic, you need to set the appropriate roles and policies. You may also need to grant access to other resources, such as an Amazon S3 bucket used for storage, which might require additional policies. For more information about required and additional IAM policies, see Permissions for Amazon Glue Interactive Sessions in SageMaker Studio Classic.

SageMaker Studio Classic provides a default configuration for your Amazon Glue interactive session. However, you can use Amazon Glue’s full catalog of Jupyter magic commands to further customize your environment. For information about the default and additional Jupyter magics that you can use in your Amazon Glue interactive session, see Configure your Amazon Glue interactive session in SageMaker Studio Classic.
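As a rough illustration of what such customization can look like, the following Python sketch renders the text of a notebook cell made up of session magics. The magic names used here (%glue_version, %worker_type, %number_of_workers, %idle_timeout) are Amazon Glue interactive session magics, but the values are illustrative assumptions and the helper render_magics is hypothetical, not part of any SDK. Check the configuration page referenced above for the magics and values your image supports.

```python
def render_magics(settings: dict) -> str:
    """Render one '%name value' line per setting, in insertion order."""
    return "\n".join(f"%{name} {value}" for name, value in settings.items())


# Illustrative values only (assumptions); adjust them to your workload.
session_settings = {
    "glue_version": "3.0",     # Amazon Glue version for the session
    "worker_type": "G.1X",     # worker size
    "number_of_workers": 5,    # number of workers allocated to the session
    "idle_timeout": 60,        # minutes of inactivity before the session stops
}

cell_text = render_magics(session_settings)
print(cell_text)
```

The printed lines can be pasted into a notebook cell. Session configuration magics generally need to run before the first Spark statement, because that statement is what starts the session.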

The supported images and kernels for connecting to an Amazon Glue interactive session are as follows:

  • Images: SparkAnalytics 1.0, SparkAnalytics 2.0

  • Kernels: Glue Python [PySpark and Ray] and Glue Spark

Prerequisites:

The SparkAnalytics image that you select to launch your Amazon Glue session in Studio Classic combines two frameworks: the SparkMagic framework (used with Amazon EMR) and Amazon Glue. For this reason, the prerequisites for both frameworks apply. However, you do not have to set up an Amazon EMR cluster if you only plan to use Amazon Glue Interactive Sessions. Before you start your first Amazon Glue interactive session in Studio Classic, complete the following:

  • Complete the prerequisites required to use the SparkMagic image. For a list of the prerequisites, see the Prerequisites section in Prepare Data at Scale with Studio Classic Notebooks.

  • Create an execution role with permissions for both Amazon Glue and SageMaker Studio Classic. Attach the managed policy AwsGlueSessionUserRestrictedServiceRole, and create a custom policy that includes the permissions sts:GetCallerIdentity, iam:GetRole, and iam:PassRole. For instructions about how to create the necessary permissions, see Permissions for Amazon Glue Interactive Sessions in SageMaker Studio Classic.

  • Create a SageMaker domain with the execution role you created. For instructions about how to create a domain, see Setting up.
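
The custom policy named in the execution role step can be sketched as a JSON document. The following Python snippet builds a minimal version containing only the three actions listed above. The role ARN, the account ID, the statement IDs, and the helper build_custom_policy are hypothetical placeholders, and scoping iam:GetRole and iam:PassRole to the execution role itself is an assumption. Treat the permissions page referenced above as the authoritative source for the full policy.

```python
import json

# Hypothetical placeholder (assumption): replace with your execution role ARN.
# The China Regions use the "aws-cn" partition.
EXECUTION_ROLE_ARN = "arn:aws-cn:iam::111122223333:role/ExampleGlueSageMakerRole"


def build_custom_policy(role_arn: str) -> dict:
    """Return an IAM policy document with the three actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # sts:GetCallerIdentity does not support resource-level scoping.
                "Sid": "GetCallerIdentity",
                "Effect": "Allow",
                "Action": "sts:GetCallerIdentity",
                "Resource": "*",
            },
            {
                # Assumption: limit the role actions to the execution role only.
                "Sid": "GetAndPassExecutionRole",
                "Effect": "Allow",
                "Action": ["iam:GetRole", "iam:PassRole"],
                "Resource": role_arn,
            },
        ],
    }


policy = build_custom_policy(EXECUTION_ROLE_ARN)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console policy editor when you create the custom policy for the execution role.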