

# Building Amazon Glue jobs with interactive sessions
<a name="interactive-sessions-chapter"></a>

 With interactive sessions, data engineers can author Amazon Glue jobs faster and more easily. 

**Topics**
+ [Overview of Amazon Glue interactive sessions](#interactive-sessions-overview)
+ [Getting started with Amazon Glue interactive sessions](interactive-sessions.md)
+ [Configuring Amazon Glue interactive sessions for Jupyter and Amazon Glue Studio notebooks](interactive-sessions-magics.md)
+ [Converting a script or notebook into an Amazon Glue job](interactive-sessions-convert.md)
+ [Working with streaming operations in Amazon Glue interactive sessions](interactive-sessions-streaming.md)
+ [Amazon Glue interactive session pricing](interactive-sessions-session-pricing.md)
+ [Developing and testing Amazon Glue job scripts locally](aws-glue-programming-etl-libraries.md)
+ [Development endpoints](development.md)

## Overview of Amazon Glue interactive sessions
<a name="interactive-sessions-overview"></a>

 With Amazon Glue interactive sessions, you can rapidly build, test, and run data preparation and analytics applications. Interactive sessions provide a programmatic and visual interface for building and testing extract, transform, and load (ETL) scripts for data preparation. Interactive sessions run Apache Spark analytics applications and provide on-demand access to a remote Spark runtime environment. Amazon Glue transparently manages serverless Spark for these interactive sessions. 

 Interactive sessions are flexible, so you can build and test your applications from the environment of your choice. You can create and work with interactive sessions through the Amazon Command Line Interface and the API. You can use Jupyter-compatible notebooks to visually author and test your notebook scripts. Interactive sessions provide an open-source Jupyter kernel that integrates almost anywhere that Jupyter does, including integrating with IDEs such as PyCharm, IntelliJ, and VS Code. This enables you to author code in your local environment and run it seamlessly on the interactive sessions backend. 

 Using the interactive sessions API, customers can programmatically run applications that use Apache Spark analytics without having to manage Spark infrastructure. You can run one or more Spark statements within a single interactive session. 

 Interactive sessions therefore provide a faster, cheaper, and more flexible way to build and run data preparation and analytics applications. To learn how to use interactive sessions, see the documentation in this section. 
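As a minimal sketch of the programmatic path, the following uses the AWS SDK for Python (boto3) to create a session and run a single Spark statement. The session ID, role ARN, and worker settings are placeholders, and the helper functions are illustrative rather than part of any SDK:

```
# A minimal sketch of driving an interactive session through the API with
# boto3. The session ID, role ARN, and worker settings are placeholders.
import time

def build_session_request(session_id, role_arn, workers=2):
    # Assemble the CreateSession request for a serverless Spark (glueetl) session.
    return {
        "Id": session_id,
        "Role": role_arn,
        "Command": {"Name": "glueetl", "PythonVersion": "3"},
        "NumberOfWorkers": workers,
        "WorkerType": "G.1X",
    }

def run_statement_sync(glue, session_id, code, poll_seconds=2):
    # Submit one Spark statement and poll until it reaches a terminal state.
    statement_id = glue.run_statement(SessionId=session_id, Code=code)["Id"]
    while True:
        state = glue.get_statement(
            SessionId=session_id, Id=statement_id)["Statement"]["State"]
        if state in ("AVAILABLE", "ERROR", "CANCELLED"):
            return state
        time.sleep(poll_seconds)

if __name__ == "__main__":
    import boto3  # requires AWS credentials and a recent boto3
    glue = boto3.client("glue")
    glue.create_session(**build_session_request(
        "my-session", "arn:aws:iam::123456789012:role/MyGlueServiceRole"))
    print(run_statement_sync(glue, "my-session", "spark.sql('select 1').show()"))
```

Because `CreateSession` is asynchronous, real code should also wait for the session itself to reach the READY state before submitting statements.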

### Limitations
<a name="interactive-sessions-limitations"></a>
+ Job bookmarks are not supported in interactive sessions.
+  Creating notebook jobs using the Amazon Command Line Interface is not supported. 
+  Amazon Glue Studio notebooks do not support Scala. 

# Getting started with Amazon Glue interactive sessions
<a name="interactive-sessions"></a>

These sections describe how to run Amazon Glue interactive sessions locally.

## Prerequisites for setting up interactive sessions locally
<a name="glue-is-prereqs"></a>

The following are prerequisites for installing interactive sessions:
+ Supported Python versions are 3.6 to 3.10. 
+  See the sections below for MacOS/Linux and Windows instructions. 
+  Review the [interactive sessions pricing](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-session-pricing.html) documentation to understand the cost structure. 

## Installing Jupyter and Amazon Glue interactive sessions Jupyter kernels
<a name="interactive-sessions-install"></a>

 Use the following to install the kernel locally. 

 The `install-glue-kernels` command installs the Jupyter kernelspecs for both the PySpark and Spark kernels, and installs logos in the correct directory. 

```
pip3 install --upgrade jupyter boto3 aws-glue-sessions
```

```
install-glue-kernels
```

## Running Jupyter
<a name="w2aac29c13c13"></a>

 To run Jupyter Notebook, complete the following steps. 

1.  Run the following command to launch Jupyter Notebook. 

   ```
   jupyter notebook
   ```

1.  Choose **New**, and then choose one of the Amazon Glue kernels to begin coding against Amazon Glue. 

## Configuring session credentials and region
<a name="interactive-sessions-credentials"></a>

### MacOS/Linux instructions
<a name="interactive-sessions-macos-linux-instructions"></a>

 Amazon Glue interactive sessions requires the same IAM permissions as Amazon Glue Jobs and Dev Endpoints. Specify the role used with interactive sessions in one of two ways: 

1.  With the `%iam_role` and `%region` magics 

1.  With an additional line in `~/.aws/config` 

 **Configuring a session role with magic** 

 In the first cell that you run, enter `%iam_role <YourGlueServiceRole>`. 

 **Configuring a session role with `~/.aws/config`** 

 The Amazon Glue service role for interactive sessions can either be specified in the notebook itself or stored alongside the Amazon CLI config. If you have a role that you typically use with Amazon Glue jobs, this will be that role. If you do not have a role that you use for Amazon Glue jobs, follow [Configuring IAM permissions for Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/configure-iam-for-glue.html) to set one up. 

 To set this role as the default role for interactive sessions: 

1.  With a text editor, open `~/.aws/config`. 

1.  Look for the profile you use for Amazon Glue. If you don't use a profile, use the `[Default]` profile. 

1.  Add a line in the profile for the role you intend to use like `glue_role_arn=<AWSGlueServiceRole>`. 

1.  [Optional]: If your profile does not have a default Region set, we recommend adding one with `region=us-east-1`, replacing `us-east-1` with your desired Region. 

1.  Save the config. 

 For more information, see [Interactive sessions with IAM](glue-is-security.md). 
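With the steps above applied, and assuming the placeholder Region and role ARN below, the profile in `~/.aws/config` might look like the following:

```
[default]
region=us-east-1
glue_role_arn=arn:aws:iam::123456789012:role/AWSGlueServiceRole
```

Replace the Region and role ARN with the values for your own account and Amazon Glue service role.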

### Windows instructions
<a name="interactive-sessions-windows-instructions"></a>

 Amazon Glue interactive sessions requires the same IAM permissions as Amazon Glue Jobs and Dev Endpoints. Specify the role used with interactive sessions in one of two ways: 

1.  With the `%iam_role` and `%region` magics 

1.  With an additional line in `~/.aws/config` 

 **Configuring a session role with magic** 

 In the first cell that you run, enter `%iam_role <YourGlueServiceRole>`. 

 **Configuring a session role with `~/.aws/config`** 

 The Amazon Glue service role for interactive sessions can either be specified in the notebook itself or stored alongside the Amazon CLI config. If you have a role that you typically use with Amazon Glue jobs, this will be that role. If you do not have a role that you use for Amazon Glue jobs, follow [Setting up IAM permissions for Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/configure-iam-for-glue.html) to set one up. 

 To set this role as the default role for interactive sessions: 

1.  With a text editor, open `~/.aws/config`. 

1.  Look for the profile you use for Amazon Glue. If you don't use a profile, use the `[Default]` profile. 

1.  Add a line in the profile for the role you intend to use like `glue_role_arn=<AWSGlueServiceRole>`. 

1.  [Optional]: If your profile does not have a default Region set, we recommend adding one with `region=us-east-1`, replacing `us-east-1` with your desired Region. 

1.  Save the config. 

 For more information, see [Interactive sessions with IAM](glue-is-security.md). 

## Upgrading from the interactive sessions preview
<a name="interactive-sessions-upgrading-from-preview"></a>

 The kernels were renamed when interactive sessions was released with version 0.27. To clean up the preview versions of the kernels, run the following from a terminal or PowerShell. 

**Note**  
If you are a part of any other Amazon Glue preview that requires a custom service model, removing the kernel will remove the custom service model.

```
# Remove Old Glue Kernels
jupyter kernelspec remove glue_python_kernel
jupyter kernelspec remove glue_scala_kernel

# Remove Custom Model
cd ~/.aws/models
rm -rf glue/
```

# Using interactive sessions with SageMaker AI Studio
<a name="interactive-sessions-sagemaker-studio"></a>

 Amazon Glue Interactive Sessions is an on-demand, serverless, Apache Spark runtime environment that data scientists and engineers can use to rapidly build, test, and run data preparation and analytics applications. You can initiate an Amazon Glue interactive session by starting an Amazon SageMaker AI Studio Classic notebook. 

For more information, see [ Prepare Data using Amazon Glue interactive sessions ](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-glue.html). 

# Using interactive sessions with Microsoft Visual Studio Code
<a name="interactive-sessions-vscode"></a>

 **Prerequisites** 
+  Install Amazon Glue interactive sessions and verify it works with Jupyter Notebook. 
+  Download and install Visual Studio Code with Jupyter. For details, see [Jupyter Notebook in VS Code](https://code.visualstudio.com/docs/datascience/jupyter-notebooks). 

**To get started with interactive sessions with VSCode**

1.  Disable Jupyter AutoStart in VS Code. 

    In Visual Studio Code, Jupyter kernels auto-start, which prevents your magics from taking effect because the session will already be started. To disable **Auto Start** on Windows, go to **File** > **Preferences** > **Extensions** > **Jupyter**, right-click Jupyter, and then choose **Extension Settings**. 

    On MacOS, go to **Code** > **Settings** > **Extensions** > **Jupyter**, right-click Jupyter, and then choose **Extension Settings**. 

    Scroll down until you see **Jupyter: Disable Jupyter Auto Start**. Check the box "When true, disables Jupyter from being automatically started for you. You must instead run a cell to start Jupyter."   
![\[The screenshot shows the checkbox enabled in for the Jupyter Extension in VS Code.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/IS_vscode_step1.png)

1.  Go to **File** > **New File** > **Save** and save the file with a name of your choice and an `.ipynb` extension, or select **Jupyter** under **Select a language** and save the file.   
![\[The screenshot shows the file being saved with a new name.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/IS_vscode_step2.gif)

1.  Double-click on the file. The Jupyter shell will display and a notebook will be opened.   
![\[The screenshot shows the open notebook.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/IS_vscode_step3.png)

1.  On Windows, when you first create a file, by default no kernel is selected. Click on **Select Kernel** and a list of available kernels is displayed. Choose **Glue PySpark**. 

    On MacOS, if you do not see the **Glue PySpark** kernel, try the following steps: 

   1. Run a local Jupyter session to obtain the URL. 

      For example, run the following command to launch Jupyter Notebook.

      ```
      jupyter notebook
      ```

      When the notebook first runs, you will see a URL that looks like `http://localhost:8888/?token=3398XXXXXXXXXXXXXXXX`.

      Copy the URL.

   1. In VS Code, click the current kernel, then **Select Another Kernel...**, then select **Existing Jupyter Server...**. Paste the URL you copied from the step above.

      If you receive an error message, see the [ VS Code Jupyter wiki ](https://github.com/microsoft/vscode-jupyter/wiki/Connecting-to-a-remote-Jupyter-server-from-vscode.dev). 

   1. If successful, this will set the kernel to **Glue PySpark**.  
![\[The screenshot shows the Select Kernel button highlighted.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/IS_vscode_step4a.png)

    Choose the **Glue PySpark** or **Glue Spark** kernel (for Python and Scala respectively).   
![\[The screenshot shows the selection for Amazon Glue PySpark.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/IS_vscode_step4b.png)

    If you don't see the **Amazon Glue PySpark** and **Amazon Glue Spark** kernels in the drop-down list, ensure that you installed the Amazon Glue kernels in the step above, and that your `python.defaultInterpreterPath` setting in Visual Studio Code is correct. For more information, see the [python.defaultInterpreterPath setting description](https://github.com/microsoft/vscode-python/wiki/Setting-descriptions#pythondefaultinterpreterpath). 

1.  Create an Amazon Glue interactive session in the same manner as you did in Jupyter Notebook: specify any magics at the top of your first cell and run a statement of code. 
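For example, a first cell might combine magics and code like the following; the role ARN and Region shown are placeholders:

```
%iam_role arn:aws:iam::123456789012:role/MyGlueServiceRole
%region us-east-2
%idle_timeout 60
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()
```

Running the cell starts the session with the configuration from the magics, then executes the Spark code on the interactive sessions backend.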

# Interactive sessions with IAM
<a name="glue-is-security"></a>

 These sections describe security considerations for Amazon Glue interactive sessions. 

**Topics**
+ [IAM principals used with interactive sessions](#glue-is-security-iam-principals)
+ [Setting up a client principal](#glue-is-client-principals)
+ [Setting up a runtime role](#glue-is-runtime-role)
+ [Make your session private with TagOnCreate](#glue-is-tagoncreate)
+ [IAM policy considerations](#glue-is-security-iam-managed-policy)

## IAM principals used with interactive sessions
<a name="glue-is-security-iam-principals"></a>

 Two IAM principals are used with Amazon Glue interactive sessions. 
+  **Client principal**: The client principal (either a user or a role) authorizes API operations for interactive sessions from an Amazon Glue client that's configured with the principal's identity-based credentials. For example, this could be an IAM role that you typically use to access the Amazon Glue console. This could also be a role given to a user in IAM whose credentials are used for the Amazon Command Line Interface, or an Amazon Glue client used by the interactive sessions Jupyter kernel. 
+  **Runtime role**: The runtime role is an IAM role that the client principal passes to interactive sessions API operations. Amazon Glue uses this role to run statements in your session. For example, this role could be the one used for running Amazon Glue ETL jobs. 

   For more information, see [Setting up a runtime role](#glue-is-runtime-role). 

## Setting up a client principal
<a name="glue-is-client-principals"></a>

 You must attach an identity policy to the client principal to allow it to call the interactive sessions API. This principal must have `iam:PassRole` access to the execution role that you pass to interactive sessions API operations, such as `CreateSession`. For example, you can attach the **AWSGlueConsoleFullAccess** managed policy to an IAM role, which allows users in your account with the policy attached to access all of the sessions created in your account (for example, to run or cancel statements). 

 If you would like to protect your session and make it private to only certain IAM roles, such as those associated with the user who created the session, you can use the interactive sessions tag-based authorization control called TagOnCreate. For more information on how an owner-tag-based, scoped-down managed policy can make your session private, see [Make your session private with TagOnCreate](#glue-is-tagoncreate). For more information on identity-based policies, see [Identity-based policies for Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-id-based-policies). 
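As a sketch (not one of the managed policies), a scoped-down identity policy for a client principal might combine interactive sessions actions with `iam:PassRole` for a specific execution role. The account ID and role name below are placeholders:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:CreateSession",
        "glue:GetSession",
        "glue:ListSessions",
        "glue:RunStatement",
        "glue:GetStatement",
        "glue:CancelStatement",
        "glue:DeleteSession"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws-cn:iam::123456789012:role/MyGlueInteractiveSessionRole",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": "glue.amazonaws.com"
        }
      }
    }
  ]
}
```

Scoping `Resource` on the `iam:PassRole` statement to a single role limits which execution roles the client principal can hand to interactive sessions.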

## Setting up a runtime role
<a name="glue-is-runtime-role"></a>

 You must pass an IAM role to the CreateSession API operation in order to allow Amazon Glue to assume and run statements in interactive sessions. The role should have the same IAM permissions as those required to run a typical Amazon Glue job. For example, you can create a service role using the **AWSGlueServiceRole** policy that allows Amazon Glue to call Amazon services on your behalf. If you use the Amazon Glue console, it will automatically create a service role on your behalf or use an existing one. You can also create your own IAM role and attach your own IAM policy to allow similar permissions. 

 If you would like to protect your session and make it private to only the user who created the session, you can use the interactive sessions tag-based authorization control called TagOnCreate. For more information on how an owner-tag-based, scoped-down managed policy can make your session private, see [Make your session private with TagOnCreate](#glue-is-tagoncreate). For more information on identity-based policies, see [Identity-based policies for Amazon Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-id-based-policies). If you are creating the execution role yourself from the IAM console and you want to make your session private with the TagOnCreate feature, follow the steps below. 

1.  Create an IAM role with role type set to `Glue`. 

1.  Attach this Amazon Glue managed policy: *AwsGlueSessionUserRestrictedServiceRole* 

1.  Prefix the role name with the policy name *AwsGlueSessionUserRestrictedServiceRole*. For example, you can create a role with name *AwsGlueSessionUserRestrictedServiceRole-myrole* and attach Amazon Glue managed policy *AwsGlueSessionUserRestrictedServiceRole*. 

1.  Attach a trust policy like following to allow Amazon Glue to assume the role: 

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "glue.amazonaws.com"
           ]
         },
         "Action": [
           "sts:AssumeRole"
         ]
       }
     ]
   }
   ```

------

 For an interactive sessions Jupyter kernel, you can specify the `iam_role` key in your Amazon Command Line Interface profile. For more information, see [Configuring sessions with `~/.aws/config`](https://docs.aws.amazon.com/glue/latest/ug/interactive-sessions-magics.html#interactive-sessions-named-profiles). If you're interacting with interactive sessions using an Amazon Glue notebook, then you can pass the execution role in the `%iam_role` magic in the first cell that you run. 

## Make your session private with TagOnCreate
<a name="glue-is-tagoncreate"></a>

 Amazon Glue interactive sessions supports tagging and tag-based authorization (TBAC) for interactive sessions as a named resource. In addition to TBAC using the TagResource and UntagResource APIs, interactive sessions supports the TagOnCreate feature, which tags a session with a given tag only during session creation with the CreateSession operation. Those tags are likewise removed on DeleteSession (that is, UntagOnDelete). 

 TagOnCreate offers a powerful security mechanism to make your session private to its creator. For example, you can attach an IAM policy with an "owner" `RequestTag` whose value is `${aws:userId}` to a client principal (such as a user) so that a session can be created only if an "owner" tag with a value matching the caller's userId is provided in the CreateSession request. This policy allows Amazon Glue interactive sessions to create a session resource and tag the session with the userId tag only at session creation time. In addition, you can scope down access to your session (such as running statements) to only the creator of the session (that is, the "owner" tag with value `${aws:userId}`) by attaching an IAM policy with an "owner" `ResourceTag` to the execution role that you pass in during CreateSession. 
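A sketch of the `RequestTag` condition described above, attached to the client principal (this is illustrative, not a verbatim managed policy):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:CreateSession",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "aws:RequestTag/owner": "${aws:userId}"
        }
      }
    }
  ]
}
```

With this condition, `CreateSession` succeeds only when the request tags the new session with an "owner" tag equal to the caller's own user ID.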

 To make it easier to use the TagOnCreate feature to make a session private to its creator, Amazon Glue provides specialized managed policies and service roles. 

 If you want to create an Amazon Glue interactive session using an IAM AssumeRole principal (that is, using credentials vended by assuming an IAM role) and you want to make the session private to the creator, use policies similar to **AWSGlueSessionUserRestrictedNotebookPolicy** and **AWSGlueSessionUserRestrictedNotebookServiceRole**, respectively. These policies allow Amazon Glue to use `${aws:PrincipalTag}` to extract the owner tag value, which requires you to pass a userId tag with the value `${aws:userId}` as a SessionTag in the assume-role credentials. See [ID session tags](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_session-tags.html). If you are using an Amazon EC2 instance with an instance profile vending the credentials, and you want to create a session or interact with a session from within the instance, you must also pass a userId tag with the value `${aws:userId}` as a SessionTag in the assume-role credentials. 

 For example, if you are creating a session using IAM AssumeRole principal credentials and you want to make your session private with the TagOnCreate feature, follow the steps below. 

1.  Create a runtime role yourself from the IAM console. Attach the Amazon Glue managed policy *AwsGlueSessionUserRestrictedNotebookServiceRole* and prefix the role name with the policy name *AwsGlueSessionUserRestrictedNotebookServiceRole*. For example, you can create a role named *AwsGlueSessionUserRestrictedNotebookServiceRole-myrole* and attach the Amazon Glue managed policy *AwsGlueSessionUserRestrictedNotebookServiceRole*. 

1.  Attach a trust policy like the following to allow Amazon Glue to assume the role: 

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "glue.amazonaws.com"
           ]
         },
         "Action": [
           "sts:AssumeRole"
         ]
       }
     ]
   }
   ```

------

1.  Create another role named with the prefix *AwsGlueSessionUserRestrictedNotebookPolicy* and attach the Amazon Glue managed policy *AwsGlueSessionUserRestrictedNotebookPolicy* to make the session private. In addition to the managed policy, attach the following inline policy to allow `iam:PassRole` for the role that you created in step 1. 

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "iam:PassRole"
         ],
         "Resource": [
           "arn:aws-cn:iam::*:role/AwsGlueSessionUserRestrictedNotebookServiceRole*"
         ],
         "Condition": {
           "StringLike": {
             "iam:PassedToService": [
               "glue.amazonaws.com"
             ]
           }
         }
       }
     ]
   }
   ```

------

1.  Attach a trust policy like the following to the IAM role above to allow Amazon Glue to assume the role: 

------
#### [ JSON ]


   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "glue.amazonaws.com"
           ]
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession"
         ]
       }
     ]
   }
   ```

------
**Note**  
 Optionally, you can use a single role (for example, a notebook role) and attach both of the above managed policies, *AwsGlueSessionUserRestrictedNotebookServiceRole* and *AwsGlueSessionUserRestrictedNotebookPolicy*. Also attach the additional inline policy to allow `iam:PassRole` for your role to Amazon Glue. Finally, attach the above trust policy to allow `sts:AssumeRole` and `sts:TagSession`. 

### AWSGlueSessionUserRestrictedNotebookPolicy
<a name="w2aac29c13c33c21c15"></a>

 The **AWSGlueSessionUserRestrictedNotebookPolicy** provides access to create an Amazon Glue interactive session from a notebook only if a tag key "owner" with a value matching the Amazon user ID of the principal (user or role) is provided. For more information, see [Where you can use policy variables](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html#policy-vars-infotouse). This policy is attached to the principal (user or role) that creates Amazon Glue interactive session notebooks from Amazon Glue Studio. It also permits sufficient access for the Amazon Glue Studio notebook to interact with the interactive session resources that are created with an "owner" tag value matching the Amazon user ID of the principal. This policy denies permission to change or remove the "owner" tag from an Amazon Glue session resource after the session is created. 

### AWSGlueSessionUserRestrictedNotebookServiceRole
<a name="w2aac29c13c33c21c17"></a>

 The **AWSGlueSessionUserRestrictedNotebookServiceRole** provides sufficient access for the Amazon Glue Studio notebook to interact with the interactive session resources that are created with an "owner" tag value matching the Amazon user ID of the principal (user or role) that created the notebook. For more information, see [Where you can use policy variables](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html#policy-vars-infotouse). This service-role policy is attached to the role that is passed as a magic to a notebook or passed as the execution role to the CreateSession API. The policy also permits creating an Amazon Glue interactive session from a notebook only if a tag key "owner" with a value matching the Amazon user ID of the principal is provided, and it denies permission to change or remove the "owner" tag from an Amazon Glue session resource after the session is created. The policy also includes permissions for writing to and reading from Amazon S3 buckets, writing CloudWatch logs, and creating and deleting tags for Amazon EC2 resources used by Amazon Glue. 

### Make your session private with user policies
<a name="w2aac29c13c33c21c21"></a>

You can attach the **AWSGlueSessionUserRestrictedPolicy** to the IAM roles attached to each of the users in your account to restrict them to creating sessions only with an "owner" tag whose value matches their own `${aws:userId}`. Instead of the **AWSGlueSessionUserRestrictedNotebookPolicy** and **AWSGlueSessionUserRestrictedNotebookServiceRole**, use policies similar to **AWSGlueSessionUserRestrictedPolicy** and **AWSGlueSessionUserRestrictedServiceRole**, respectively. For more information, see [Using identity-based policies](https://docs.aws.amazon.com/glue/latest/dg/using-identity-based-policies.html). This policy scopes down access to a session to only its creator: the user whose `${aws:userId}` matches the "owner" tag on the session. If you created the execution role yourself using the IAM console by following the steps in [Setting up a runtime role](#glue-is-runtime-role), then in addition to attaching the **AwsGlueSessionUserRestrictedPolicy** managed policy, attach the following inline policy to each of the users in your account to allow `iam:PassRole` for the execution role that you created. 

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": [
        "arn:aws-cn:iam::*:role/AwsGlueSessionUserRestrictedServiceRole*"
      ],
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "glue.amazonaws.com"
          ]
        }
      }
    }
  ]
}
```

------

#### AWSGlueSessionUserRestrictedPolicy
<a name="w2aac29c13c33c21c21c11"></a>

 The **AWSGlueSessionUserRestrictedPolicy** provides access to create an Amazon Glue interactive session using the CreateSession API only if a tag key "owner" with a value matching the caller's Amazon user ID is provided. This identity policy is attached to the user who invokes the CreateSession API. The policy also permits interaction with the interactive session resources that were created with an "owner" tag whose value matches the caller's Amazon user ID, and it denies permission to change or remove the "owner" tag from an Amazon Glue session resource after the session is created. 

#### AWSGlueSessionUserRestrictedServiceRole
<a name="w2aac29c13c33c21c21c13"></a>

 The **AWSGlueSessionUserRestrictedServiceRole** provides full access to all Amazon Glue resources except for sessions and allows users to create and use only the interactive sessions that are associated with the user. This policy also includes other permissions needed by Amazon Glue to manage Glue resources in other Amazon services. The policy also allows adding tags to Amazon Glue resources in other Amazon services. 

## IAM policy considerations
<a name="glue-is-security-iam-managed-policy"></a>

 Interactive sessions are IAM resources in Amazon Glue, so access to and interaction with a session are governed by IAM policies. Based on the IAM policies attached to a client principal or execution role configured by an admin, a client principal (user or role) will be able to create new sessions and interact with its own sessions and other sessions. 

 If an admin has attached an IAM policy such as **AWSGlueConsoleFullAccess** or **AWSGlueServiceRole** that allows access to all Amazon Glue resources in the account, client principals will be able to collaborate with each other. For example, one user will be able to interact with sessions created by other users if the policies allow it. 

 If you'd like to configure a policy tailored to your specific needs, see the [IAM documentation about configuring resources for a policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_identity-vs-resource.html). For example, to isolate sessions that belong to a user, you can use the TagOnCreate feature supported by Amazon Glue interactive sessions. See [Make your session private with TagOnCreate](#glue-is-tagoncreate). 

 Interactive sessions supports limiting session creation based on certain VPC conditions. See [Controlling settings using condition keys](security_iam_id-based-policy-examples.md#glue-identity-based-policy-condition-key-vpc). 

# Configuring Amazon Glue interactive sessions for Jupyter and Amazon Glue Studio notebooks
<a name="interactive-sessions-magics"></a>

## Introduction to Jupyter Magics
<a name="w2aac29c18b3"></a>

 Jupyter Magics are commands that can be run at the beginning of a cell or as a whole cell body. Magics start with `%` for line-magics and `%%` for cell-magics. Line-magics such as `%region` and `%connections` can be run with multiple magics in a cell, or with code included in the cell body like the following example. 

```
%region us-east-2
%connections my_rds_connection
dy_f = glue_context.create_dynamic_frame.from_catalog(database='rds_tables', table_name='sales_table')
```

 Cell magics must use the entire cell and can have the command span multiple lines. An example of `%%sql` is below. 

```
%%sql
select * from rds_tables.sales_table
```

## Magics supported by Amazon Glue interactive sessions for Jupyter
<a name="interactive-sessions-supported-magics"></a><a name="interactive-sessions-magics2"></a>

 The following are magics that you can use with Amazon Glue interactive sessions for Jupyter notebooks. 

 **Sessions magics** 


| Name | Type | Description | 
| --- | --- | --- | 
|  %help  |  n/a  |  Return a list of descriptions and input types for all magic commands.  | 
| %profile | String | Specify a profile in your Amazon configuration to use as the credentials provider. | 
| %region | String |  Specify the Amazon Web Services Region in which to initialize a session. The default is taken from `~/.aws/config`. Example: `%region us-west-1`  | 
| %idle\_timeout | Int |   The number of minutes of inactivity after which a session times out after a cell has been run. The default idle timeout for Spark ETL sessions is 2880 minutes (48 hours). For other session types, consult the documentation for that session type. Example: `%idle_timeout 3000`  | 
| %session\_id | n/a | Return the session ID for the running session.  | 
| %session\_id\_prefix | String |   Define a string that will precede all session IDs in the format **[session\_id\_prefix]-[session\_id]**. If a session ID is not provided, a random UUID will be generated. This magic is not supported when you run a Jupyter Notebook in Amazon Glue Studio.  Example: `%session_id_prefix 001`  | 
| %status |  | Return the status of the current Amazon Glue session, including its duration, configuration, and executing user/role.  | 
| %stop\_session  |  | Stop the current session. | 
| %list\_sessions |  | List all currently running sessions by name and ID. | 
| %session\_type | String |  Set the session type to one of Streaming, ETL, or Ray.  Example: `%session_type Streaming`  | 
| %glue\_version | String |  The version of Amazon Glue to be used by this session.  Example: `%glue_version 3.0`  | 

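As an illustration of how these session magics combine, a first notebook cell might look like the following (the profile name and values here are placeholders, not defaults):

```
%profile my-profile
%session_id_prefix analytics
%idle_timeout 60
%glue_version 3.0
```

Because a session's configuration is fixed once the session starts, configuration magics like these should generally be run before the first code cell.
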
 **Magics for selecting job types** 


| Name | Type | Description | 
| --- | --- | --- | 
| %streaming | String | Changes the session type to Amazon Glue Streaming. | 
| %etl | String | Changes the session type to Amazon Glue ETL. | 
| %glue_ray | String | Changes the session type to Amazon Glue for Ray. See [Magics supported by Amazon Glue Ray interactive sessions](https://docs.amazonaws.cn/glue/latest/dg/is-using-ray-configuration).  | 

 **Amazon Glue for Spark config magics** 

 The `%%configure` magic accepts a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. 


| Name | Type | Description | 
| --- | --- | --- | 
|  %%configure  |  Dictionary  |   Specify a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics.   For a list of parameters and examples on how to use `%%configure`, see [%%configure cell magic arguments](#interactive-sessions-magics-configure-arguments).   | 
| %iam_role | String |   Specify an IAM role ARN to run your session with. Default from `~/.aws/config`.   Example: `%iam_role AWSGlueServiceRole`  | 
| %number_of_workers | Int |  The number of workers of a defined worker_type that are allocated when a job runs. `worker_type` must be set too. The default `number_of_workers` is 5. Example: `%number_of_workers 2`  | 
| %additional_python_modules | List |  Comma-separated list of additional Python modules to include in your cluster (can be from PyPI or Amazon S3). Example: `%additional_python_modules pandas, numpy`  | 
| %%tags | String |   Adds tags to a session. Specify the tags within curly brackets ({ }). Each tag key and value is enclosed in quotation marks (" "), and pairs are separated by commas (,).  <pre>%%tags<br />{"billing":"Data-Platform", "team":"analytics"}<br />                      </pre> Use the `%status` magic to view tags associated with the session. <pre>%status</pre> <pre>Session ID: <sessionId><br /> Status: READY<br /> Role: <example-role><br /> CreatedOn: 2023-05-26 11:12:17.056000-07:00<br /> GlueVersion: 3.0<br /> Job Type: glueetl<br /> Tags: {'owner':'example-owner', 'team':'analytics', 'billing':'Data-Platform'}<br /> Worker Type: G.4X<br /> Number of Workers: 5<br /> Region: us-west-2<br /> Applying the following default arguments:<br /> --glue_kernel_version 0.38.0<br /> --enable-glue-datacatalog true<br /> Arguments Passed: ['--glue_kernel_version: 0.38.0', '--enable-glue-datacatalog: true']                <br />                </pre>  | 
| %%assume_role | Dictionary |  Specify a JSON-formatted dictionary or an IAM role ARN string to create a session for cross-account access. Example with ARN: <pre>%%assume_role<br />{<br />  'arn:aws:iam::XXXXXXXXXXXX:role/AWSGlueServiceRole'<br />}<br />                </pre> Example with credentials: <pre> %%assume_role<br />{<br />    "aws_access_key_id": "XXXXXXXXXXXX",<br />    "aws_secret_access_key": "XXXXXXXXXXXX",<br />    "aws_session_token": "XXXXXXXXXXXX"<br />}</pre>  | 

### %%configure cell magic arguments
<a name="interactive-sessions-magics-configure-arguments"></a>

 The `%%configure` magic accepts a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. The following are examples of arguments supported by the `%%configure` cell magic. Use the `--` prefix for run arguments specified for the job. Example: 

```
%%configure
{
   "--user-jars-first": "true",
   "--enable-glue-datacatalog": "false"
}
```

 For more information on job parameters, see [Job parameters](aws-glue-programming-etl-glue-arguments.md). 

**Session Configuration**


| Parameter | Type | Description | 
| --- | --- | --- | 
| max_retries | Int | The maximum number of times to retry this job if it fails. Example: <pre>%%configure<br />{<br />  "max_retries": "0"<br />}</pre> | 
| max_concurrent_runs | Int | The maximum number of concurrent runs allowed for a job. Example: <pre>%%configure<br />{<br />  "max_concurrent_runs": "3"<br />}</pre> | 

**Session parameters**


| Parameter | Type | Description | 
| --- | --- | --- | 
| --enable-spark-ui | Boolean | Enable Spark UI to monitor and debug Amazon Glue ETL jobs. <pre>%%configure<br />{<br />  "--enable-spark-ui": "true"<br />}</pre> | 
| --spark-event-logs-path | String | Specifies an Amazon S3 path for storing the Spark event logs when using the Spark UI monitoring feature. Example: <pre>%%configure<br />{<br />  "--spark-event-logs-path": "s3://path/to/event/logs/"<br />}</pre> | 
| --script_location | String | Specifies the Amazon S3 path to a script that executes a job. Example:<pre>%%configure <br />{<br />  "script_location": "s3://new-folder-here"<br />}</pre> | 
| --security_configuration | String | The name of an Amazon Glue security configuration. Example: <pre>%%configure<br />{<br />    "--security_configuration": {<br />"encryption_type": "kms",<br />"kms_key_id": "YOUR_KMS_KEY_ARN"<br />}<br />}<br />                  </pre>  | 
| --job-language | String | The script programming language. Accepts a value of 'scala' or 'python'. Default is 'python'. Example: <pre>%%configure <br />{<br />  "--job-language": "scala"<br />}                            <br />                  </pre>  | 
| --class | String | The Scala class that serves as the entry point for your Scala script. Default is null. Example: <pre>%%configure <br />{<br />  "--class": "className"<br />}                            <br />                  </pre>  | 
| --user-jars-first | Boolean | Prioritizes the customer's extra JAR files in the classpath. Default is null. Example: <pre>%%configure <br />{<br />  "--user-jars-first": "true"<br />}                            <br />                  </pre>  | 
| --use-postgres-driver | Boolean | Prioritizes the Postgres JDBC driver in the class path to avoid a conflict with the Amazon Redshift JDBC driver. Default is null. Example: <pre>%%configure <br />{<br />  "--use-postgres-driver": "true"<br />}                            <br />                  </pre>  | 
| --extra-files | List(string) | The Amazon S3 paths to additional files, such as configuration files that Amazon Glue copies to the working directory of your script before executing it. Example: <pre>%%configure <br />{<br />  "--extra-files": "s3://path/to/additional/files/"<br />}                            <br />                  </pre>  | 
| --job-bookmark-option | String | Controls the behavior of a job bookmark. Accepts a value of 'job-bookmark-enable', 'job-bookmark-disable' or 'job-bookmark-pause'. Default is 'job-bookmark-disable'. Example: <pre>%%configure<br />{<br />  "--job-bookmark-option": "job-bookmark-enable"<br />}                            <br />                  </pre>  | 
| --TempDir | String | Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job. Default is null. Example: <pre>%%configure <br />{<br />  "--TempDir": "s3://path/to/temp/dir"<br />}                            <br />                  </pre>  | 
| --enable-s3-parquet-optimized-committer | Boolean | Enables the EMRFS Amazon S3-optimized committer for writing Parquet data into Amazon S3. Default is 'true'. Example: <pre>%%configure <br />{<br />  "--enable-s3-parquet-optimized-committer": "false"<br />}                            <br />                  </pre>  | 
| --enable-rename-algorithm-v2 | Boolean | Sets the EMRFS rename algorithm version to version 2. Default is 'true'. Example: <pre>%%configure <br />{<br />  "--enable-rename-algorithm-v2": "true"<br />}                            <br />                  </pre>  | 
| --enable-glue-datacatalog | Boolean | Enables you to use the Amazon Glue Data Catalog as an Apache Spark Hive metastore. Example: <pre>%%configure <br />{<br />  "--enable-glue-datacatalog": "true"<br />}                            <br />                  </pre>  | 
| --enable-metrics | Boolean | Enables the collection of metrics for job profiling for job run. Default is 'false'. Example: <pre>%%configure <br />{<br />  "--enable-metrics": "true"<br />}                            <br />                  </pre>  | 
| --enable-continuous-cloudwatch-log | Boolean | Enables real-time continuous logging for Amazon Glue jobs. Default is 'false'. Example: <pre>%%configure <br />{<br />  "--enable-continuous-cloudwatch-log": "true"<br />}                            <br />                  </pre>  | 
| --enable-continuous-log-filter | Boolean | Specifies a standard filter or no filter when you create or edit a job enabled for continuous logging. Default is 'true'. Example: <pre>%%configure <br />{<br />  "--enable-continuous-log-filter": "true"<br />}                            <br />                  </pre>  | 
| --continuous-log-stream-prefix | String | Specifies a custom Amazon CloudWatch log stream prefix for a job enabled for continuous logging. Default is null. Example: <pre>%%configure <br />{<br />  "--continuous-log-stream-prefix": "prefix"<br />}                            <br />                  </pre>  | 
| --continuous-log-conversionPattern | String | Specifies a custom conversion log pattern for a job enabled for continuous logging. Default is null. Example: <pre>%%configure <br />{<br />  "--continuous-log-conversionPattern": "pattern"<br />}                      <br />                  </pre>  | 
| --conf | String | Controls Spark config parameters. It is for advanced use cases. Use --conf before each parameter. Example: <pre>%%configure<br />{<br />    "--conf": "spark.hadoop.hive.metastore.glue.catalogid=123456789012 --conf hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf hive.metastore.schema.verification=false"<br />}       <br />        </pre>  | 
| timeout | Int | Determines the maximum amount of time that the Spark session should wait for a statement to complete before terminating it. <pre>%%configure <br />{<br />  "timeout": "30"<br />}</pre>  | 
| auto-scaling | Boolean | Determines whether or not to use auto-scaling. <pre>%%configure <br />{<br />  "--enable-auto-scaling": "true"<br />}</pre>  | 

### Spark jobs (ETL & streaming) magics
<a name="interactive-sessions-magics-spark-jobs"></a>


| Name | Type | Description | 
| --- | --- | --- | 
| %worker_type | String | Standard, G.1X, G.2X, G.4X, G.8X, G.12X, G.16X, R.1X, R.2X, R.4X, or R.8X. number_of_workers must be set too. The default worker_type is G.1X. | 
| %connections | List |  Specify a comma-separated list of connections to use in the session.   Example:  <pre>%connections my_rds_connection<br />                    dy_f = glue_context.create_dynamic_frame.from_catalog(database='rds_tables', table_name='sales_table')</pre>  | 
| %extra_py_files | List | Comma-separated list of additional Python files from Amazon S3. | 
| %extra_jars | List | Comma-separated list of additional JARs to include in the cluster. | 
| %spark_conf | String | Specify custom Spark configurations for your session. For example, `%spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer`. | 

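For example, a hypothetical configuration cell for a Spark ETL session might combine these magics to size the cluster and attach a connection (the connection name and Amazon S3 path below are placeholders):

```
%worker_type G.2X
%number_of_workers 10
%connections my_rds_connection
%extra_py_files s3://amzn-s3-demo-bucket/utils/helpers.py
```
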
### Magics for Ray jobs
<a name="interactive-sessions-magics-ray-jobs"></a>


| Name | Type | Description | 
| --- | --- | --- | 
| %min_workers | Int |  The minimum number of workers that are allocated to a Ray job. Default: 1.  Example: `%min_workers 2`   | 
| %object_memory_head | Int | The percentage of free memory on the instance head node after a warm start. Minimum: 0. Maximum: 100. Example: `%object_memory_head 100`  | 
| %object_memory_worker | Int | The percentage of free memory on the instance worker nodes after a warm start. Minimum: 0. Maximum: 100. Example: `%object_memory_worker 100` | 

### Action magics
<a name="interactive-sessions-magics-action"></a>


| Name | Type | Description | 
| --- | --- | --- | 
| %%sql | String |   Run SQL code. All lines after the initial `%%sql` magic will be passed as part of the SQL code.   Example: `%%sql select * from rds_tables.sales_table`  | 
| %matplot | Matplotlib figure |  Visualize your data using the matplotlib library. Example: <pre>import matplotlib.pyplot as plt<br /><br /># Set X-axis and Y-axis values<br />x = [5, 2, 8, 4, 9]<br />y = [10, 4, 8, 5, 2]<br />  <br /># Create a bar chart <br />plt.bar(x, y)<br />  <br /># Show the plot<br />%matplot plt      <br />                </pre>  | 
| %plotly | Plotly figure |  Visualize your data using the plotly library. Example: <pre>import plotly.express as px<br />                  <br />#Create a graphical figure<br />fig = px.line(x=["a","b","c"], y=[1,3,2], title="sample figure")<br /><br />#Show the figure<br />%plotly fig</pre>  | 

## Naming sessions
<a name="interactive-sessions-naming-sessions"></a>

 Amazon Glue interactive sessions are Amazon resources and require a name. Names should be unique for each session and may be restricted by your IAM administrators. For more information, see [Interactive sessions with IAM](glue-is-security.md). The Jupyter kernel automatically generates unique session names for you. However, sessions can be named manually in two ways: 

1.  Using the Amazon Command Line Interface config file located at `~/.aws/config`. See [Setting Up Amazon Config with the Amazon Command Line Interface](https://docs.aws.amazon.com/config/latest/developerguide/gs-cli.html). 

1.  Using the `%session_id_prefix` magics. See [Magics supported by Amazon Glue interactive sessions for Jupyter](#interactive-sessions-supported-magics). 

 A session name is generated as follows: 
+ When the prefix and session_id are provided: the session name will be {prefix}-{UUID}.
+ When nothing is provided: the session name will be {UUID}.

Prefixing session names allows you to recognize your session when listing it in the Amazon CLI or console.
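
As a plain-Python sketch of the naming scheme above (an illustration of the format only, not the Amazon Glue implementation):

```python
import uuid
from typing import Optional

def session_name(prefix: Optional[str] = None) -> str:
    """Build a session name: {prefix}-{UUID} when a prefix is set, otherwise {UUID}."""
    session_id = str(uuid.uuid4())
    return f"{prefix}-{session_id}" if prefix else session_id

# With a prefix, sessions from the same user or team group together when listed.
name = session_name("analytics")
```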

## Specifying an IAM role for interactive sessions
<a name="iam-role-interactive-sessions"></a>

 You must specify an Amazon Identity and Access Management (IAM) role to use with Amazon Glue ETL code that you run with interactive sessions. 

 The role requires the same IAM permissions as those required to run Amazon Glue jobs. See [Create an IAM role for Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/create-an-iam-role.html) for more information on creating a role for Amazon Glue jobs and interactive sessions. 

 IAM roles can be specified in two ways: 
+  Using the Amazon Command Line Interface config file located at `~/.aws/config` (Recommended). For more information, see [Configuring sessions with ~/.aws/config](https://docs.aws.amazon.com/glue/latest/ug/interactive-sessions-magics.html#interactive-sessions-named-profiles). 
**Note**  
 When the `%profile` magic is used, the configuration for `glue_iam_role` of that profile is honored. 
+  Using the %iam\$1role magic. For more information, see [Magics supported by Amazon Glue interactive sessions for Jupyter](#interactive-sessions-supported-magics). 

## Configuring sessions with named profiles
<a name="interactive-sessions-named-profiles"></a>

 Amazon Glue interactive sessions uses the same credentials as the Amazon Command Line Interface or boto3, and interactive sessions honors and works with named profiles like the Amazon CLI found in `~/.aws/config` (Linux and MacOS) or `%USERPROFILE%\.aws\config` (Windows). For more information, see [ Using named profiles ](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-using-profiles). 

 Interactive sessions take advantage of named profiles by allowing the Amazon Glue service role and session ID prefix to be specified in a profile. To configure a profile, add a line for the `glue_iam_role` key and/or the `session_id_prefix` key to your named profile, as shown below. The `session_id_prefix` value does not require quotes. For example, to add a session ID prefix, set `session_id_prefix=myprefix`. 

```
[default]
region=us-east-1
aws_access_key_id=AKIAIOSFODNN7EXAMPLE 
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRole> 
session_id_prefix=<prefix_for_session_names>

[user1] 
region=eu-west-1
aws_access_key_id=AKIAI44QH8DHBEXAMPLE 
aws_secret_access_key=je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRoleUser1> 
session_id_prefix=<prefix_for_session_names_for_user1>
```

 If you have a custom method of generating credentials, you can also configure your profile to use the `credential_process` parameter in your `~/.aws/config` file. For example: 

```
[profile developer]
region=us-east-1
credential_process = "/Users/Dave/generate_my_credentials.sh" --username helen
```

 You can find more details about sourcing credentials through the `credential_process` parameter here: [ Sourcing credentials with an external process](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sourcing-external.html). 

 If a region or `iam_role` are not set in the profile that you are using, you must specify them using the `%region` and `%iam_role` magics in the first cell that you run. 
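
If your profile sets neither, a minimal first cell could look like the following (the Region and role ARN are placeholders):

```
%region us-east-1
%iam_role arn:aws:iam::123456789012:role/AWSGlueServiceRole
```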

# Converting a script or notebook into an Amazon Glue job
<a name="interactive-sessions-convert"></a>

 There are two ways you can convert a script or notebook into an Amazon Glue job: 
+  Use **nbconvert** to convert your Jupyter `.ipynb` notebook document file into a `.py` file. For more information, see [nbconvert: Convert Notebooks to other formats](https://nbconvert.readthedocs.io/en/latest/). 
+  Upload the file to Amazon Glue Studio Notebooks. 
  +  In the Amazon Glue Studio console, choose **Jobs** from the navigation menu. 
  +  In the **Create job** section, choose **Jupyter Notebook**. 
  +  In the **Options** section, choose **Upload and edit an existing notebook**. 
  +  Select **Choose file** to upload an `.ipynb` file. 
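
As a sketch of the **nbconvert** path (assuming `nbconvert` is installed alongside Jupyter), the following command converts a notebook document into a plain Python script:

```
jupyter nbconvert --to script my_glue_notebook.ipynb
```

This writes `my_glue_notebook.py`, which you can then clean up (for example, by removing magics) and use as an Amazon Glue job script.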

# Working with streaming operations in Amazon Glue interactive sessions
<a name="interactive-sessions-streaming"></a>

## Switching streaming session type
<a name="interactive-sessions-switching-streaming-session-type"></a>

 Use the Amazon Glue interactive sessions configuration magic, `%streaming`, to define the job you are running and initialize a streaming interactive session. 

## Sampling input stream for interactive development
<a name="w2aac29c29b7"></a>

 To enhance the interactive experience in Amazon Glue interactive sessions, a method was added under `GlueContext` to obtain a snapshot of a stream in a static DynamicFrame. This lets you inspect the data, interact with it, and implement your workflow. 

 With the `GlueContext` class instance, you can use the method `getSampleStreamingDynamicFrame`. Required arguments for this method are: 
+  `dataFrame`: The Spark Streaming DataFrame 
+  `options`: See available options below 

 Available options include: 
+  **windowSize**: Also called the microbatch duration. This parameter determines how long a streaming query will wait after the previous batch was triggered. This parameter value must be smaller than `pollingTimeInMs`. 
+  **pollingTimeInMs**: The total length of time the method will run. It fires off at least one microbatch to obtain sample records from the input stream. 
+  **recordPollingLimit**: This parameter helps you limit the total number of records you will poll from the stream. 
+  (Optional) You can also use `writeStreamFunction` to apply a custom function to every record sampled. See below for examples in Scala and Python. 


```
val sampleBatchFunction = (batchDF: DataFrame, batchId: Long) => {
  // Optional: replace with your own forEachBatch function here
}
val jsonString: String = s"""{"pollingTimeInMs": "10000", "windowSize": "5 seconds"}"""
val dynFrame = glueContext.getSampleStreamingDynamicFrame(YOUR_STREAMING_DF, JsonOptions(jsonString), sampleBatchFunction)
dynFrame.show()
```

```
def sample_batch_function(batch_df, batch_id):
    # Optional: replace with your own forEachBatch function here
    pass

options = {
    "pollingTimeInMs": "10000",
    "windowSize": "5 seconds",
}
glue_context.getSampleStreamingDynamicFrame(YOUR_STREAMING_DF, options, sample_batch_function)
```

**Note**  
 When the sampled `DynFrame` is empty, it could be caused by a few reasons: 
+ The streaming source is set to "Latest" and no new data has been ingested during the sampling period. 
+ The polling time is not enough to process the records it ingested. Data won't show up unless the whole batch has been processed. 

## Running streaming applications in interactive sessions
<a name="running-streaming-applications-interactive-sessions"></a>

 In Amazon Glue interactive sessions, you can run an Amazon Glue streaming application much as you would create one in the Amazon Glue console. Because interactive sessions are session-based, encountering exceptions in the runtime does not cause the session to stop. This gives you the added benefit of developing your batch function iteratively. For example: 

```
def batch_function(data_frame, batch_id):
    log.info(data_frame.count())
    invalid_method_call()

glueContext.forEachBatch(frame=streaming_df, batch_function=batch_function, options={})
```

 In the example above, we included an invalid usage of a method. Unlike in regular Amazon Glue jobs, which exit the entire application, the user's coding context and definitions are fully preserved and the session is still operational. There is no need to bootstrap a new cluster and rerun all the preceding transformations. This allows you to focus on quickly iterating your batch function implementation to obtain the desired outcome. 

 It is important to note that interactive sessions evaluate each statement in a blocking manner, so a session executes only one statement at a time. Because streaming queries are continuous and never-ending, a session with an active streaming query can't handle any follow-up statements unless the query is interrupted. You can issue the interruption command directly from the Jupyter notebook, and the kernel handles the cancellation for you. 

 Take the following sequence of statements which are waiting for execution as an example: 

```
Statement 1:
      val number = df.count()
      // Spark action with a deterministic result
      Result: 5

Statement 2:
      streamingQuery.start().awaitTermination()
      // Spark streaming query that executes continuously
      Result: Constantly updated with each microbatch

Statement 3:
      val number2 = df.count()
      // This will not be executed because the previous statement runs indefinitely
```

# Amazon Glue interactive session pricing
<a name="interactive-sessions-session-pricing"></a>

 Amazon charges for Amazon Glue interactive sessions based on how long the session is active and the number of Data Processing Units (DPUs) used. You are charged an hourly rate for the number of DPUs used to run your workloads, billed in increments of one second. Amazon Glue interactive sessions assigns a default of 5 DPUs and requires a minimum of 2 DPUs. There is also a 1-minute minimum billing duration for each interactive session. To see the Amazon Glue rates and pricing examples, or to estimate your costs using the Amazon Pricing Calculator, see [Amazon Glue pricing](https://aws.amazon.com/glue/pricing/). 
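
As an illustration of this billing model (the hourly DPU rate below is a placeholder; see the pricing page for your Region's actual rate), a rough session cost can be estimated as follows:

```python
def estimate_session_cost(dpus: int, runtime_seconds: float, rate_per_dpu_hour: float) -> float:
    """Estimate an interactive session cost: per-second billing with a 1-minute minimum."""
    billed_seconds = max(runtime_seconds, 60)  # 1-minute minimum billing duration
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# A default 5-DPU session active for 10 minutes at a placeholder rate of $0.44 per DPU-hour:
cost = estimate_session_cost(dpus=5, runtime_seconds=600, rate_per_dpu_hour=0.44)
```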

## Configure your Amazon Glue interactive sessions
<a name="interactive-sessions-config"></a>

 You can use Jupyter magics in your Amazon Glue interactive session to modify your session and configuration parameters. Magics are short commands prefixed with `%` at the start of Jupyter cells that provide a quick and easy way to help you control your environment. For example, if you want to change the number of workers allocated to your job from the default 5 to 10, you can specify `%number_of_workers 10`. If you want to configure your session to stop after 10 minutes of idle time instead of the default 2880 minutes, you can specify `%idle_timeout 10`. 

 For the complete list of magics available, see [Configuring Amazon Glue interactive sessions for Jupyter and Amazon Glue Studio notebooks](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html). 

# Developing and testing Amazon Glue job scripts locally
<a name="aws-glue-programming-etl-libraries"></a>

When you develop and test your Amazon Glue for Spark job scripts, there are multiple available options:
+ Amazon Glue Studio console
  + Visual editor
  + Script editor
  + Amazon Glue Studio notebook
+ Interactive sessions
  + Jupyter notebook
+ Docker image
  + Local development
  + Remote development

You can choose any of the above options based on your requirements.

If you prefer a no-code or low-code experience, the Amazon Glue Studio visual editor is a good choice.

If you prefer an interactive notebook experience, Amazon Glue Studio notebook is a good choice. For more information, see [Using Notebooks with Amazon Glue Studio and Amazon Glue](https://docs.amazonaws.cn/glue/latest/ug/notebooks-chapter.html). If you want to use your own local environment, interactive sessions is a good choice. For more information, see [Using interactive sessions with Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/interactive-sessions-chapter.html).

If you prefer a local or remote development experience, the Docker image is a good choice. It helps you develop and test Amazon Glue for Spark job scripts anywhere you prefer without incurring Amazon Glue costs.

If you prefer local development without Docker, installing the Amazon Glue ETL library locally is a good choice.

## Developing using Amazon Glue Studio
<a name="develop-using-studio"></a>

The Amazon Glue Studio visual editor is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in Amazon Glue. You can visually compose data transformation workflows and seamlessly run them on Amazon Glue's Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job. For more information, see the [Amazon Glue Studio User Guide](https://docs.amazonaws.cn/glue/latest/ug/what-is-glue-studio.html).

## Developing using interactive sessions
<a name="develop-using-interactive-sessions"></a>

Interactive sessions allow you to build and test applications from the environment of your choice. For more information, see [Using interactive sessions with Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/interactive-sessions-chapter.html).

# Develop and test Amazon Glue jobs locally using a Docker image
<a name="develop-local-docker-image"></a>

 For a production-ready data platform, the development process and CI/CD pipeline for Amazon Glue jobs is a key topic. You can flexibly develop and test Amazon Glue jobs in a Docker container. Amazon Glue hosts Docker images on Docker Hub to set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL using Amazon Glue ETL library. This topic describes how to develop and test Amazon Glue version 5.0 jobs in a Docker container using a Docker image.

## Available Docker images
<a name="develop-local-available-docker-images-ecr"></a>

 The following Docker images are available for Amazon Glue on [Amazon ECR](https://gallery.ecr.aws/glue/aws-glue-libs): 
+  For Amazon Glue version 5.0: `public.ecr.aws/glue/aws-glue-libs:5` 
+ For Amazon Glue version 4.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01`
+ For Amazon Glue version 3.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01`
+ For Amazon Glue version 2.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_2.0.0_image_01`

**Note**  
 Amazon Glue Docker images are compatible with both x86_64 and arm64. 

 In this example, we use `public.ecr.aws/glue/aws-glue-libs:5` and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for Amazon Glue version 5.0 Spark jobs. The image contains the following: 
+  Amazon Linux 2023 
+  Amazon Glue ETL Library 
+  Apache Spark 3.5.4 
+  Open table format libraries: Apache Iceberg 1.7.1, Apache Hudi 0.15.0, and Delta Lake 3.3.0 
+  Amazon Glue Data Catalog Client 
+  Amazon Redshift connector for Apache Spark 
+  Amazon DynamoDB connector for Apache Hadoop 

 To set up your container, pull the image from ECR Public Gallery and then run the container. This topic demonstrates how to run your container with the following methods, depending on your requirements: 
+ `spark-submit`
+ REPL shell (`pyspark`)
+ `pytest`
+ Visual Studio Code

## Prerequisites
<a name="develop-local-docker-image-prereq"></a>

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for [Mac](https://docs.docker.com/docker-for-mac/install/) or [Linux](https://docs.docker.com/engine/install/). The machine running Docker hosts the Amazon Glue container. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.

 For more information about restrictions when developing Amazon Glue code locally, see [ Local development restrictions ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#local-dev-restrictions). 

### Configuring Amazon
<a name="develop-local-docker-image-config-aws-credentials"></a>

To enable Amazon API calls from the container, set up Amazon credentials by following these steps.

1.  [ Create an Amazon named profile ](https://docs.amazonaws.cn//cli/latest/userguide/cli-configure-files.html). 

1.  Open `cmd` on Windows or a terminal on Mac/Linux and run the following command: 

   ```
   PROFILE_NAME="<your_profile_name>"
   ```

In the following sections, we use this Amazon named profile.

### Pulling the image from Amazon ECR Public Gallery
<a name="develop-local-docker-pull-image-from-ecr-public"></a>

 If you’re running Docker on Windows, choose the Docker icon (right-click) and choose **Switch to Linux containers** before pulling the image. 

Run the following command to pull the image from ECR Public:

```
docker pull public.ecr.aws/glue/aws-glue-libs:5 
```

## Run the container
<a name="develop-local-docker-image-setup-run"></a>

You can now run a container using this image. You can choose any of following based on your requirements.

### spark-submit
<a name="develop-local-docker-image-setup-run-spark-submit"></a>

You can run an Amazon Glue job script by running the `spark-submit` command on the container. 

1.  Write your script and save it as `sample.py` under the `/local_path_to_workspace/src/` directory using the following commands: 

   ```
   $ WORKSPACE_LOCATION=/local_path_to_workspace
   $ SCRIPT_FILE_NAME=sample.py
   $ mkdir -p ${WORKSPACE_LOCATION}/src
   $ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}
   ```

1.  These variables are used in the `docker run` command below. The sample code (`sample.py`) used in the `spark-submit` command is included in the appendix at the end of this topic. 

    Run the following command to execute the `spark-submit` command on the container to submit a new Spark application: 

   ```
   $ docker run -it --rm \
       -v ~/.aws:/home/hadoop/.aws \
       -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
       -e AWS_PROFILE=$PROFILE_NAME \
       --name glue5_spark_submit \
       public.ecr.aws/glue/aws-glue-libs:5 \
       spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME
   ```

1. (Optional) Configure `spark-submit` to match your environment. For example, you can pass your dependencies with the `--jars` option. For more information, see [Dynamically Loading Spark Properties](https://spark.apache.org/docs/latest/configuration.html) in the Spark documentation. 

### REPL shell (PySpark)
<a name="develop-local-docker-image-setup-run-repl-shell"></a>

 You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to start a PySpark REPL shell in the container: 

```
$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```

 You will see the following output: 

```
Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.4-amzn-0
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740643079929).
SparkSession available as 'spark'.
>>>
```

 With this REPL shell, you can code and test interactively. 

### Pytest
<a name="develop-local-docker-image-setup-run-pytest"></a>

 For unit testing, you can use `pytest` with Amazon Glue Spark job scripts. Run the following commands to prepare: 

```
$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}
```
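As a sketch, a hypothetical `test_sample.py` might unit test pure transformation logic that doesn't need a live Spark session. The function and field names here are illustrative, not part of any Amazon Glue sample:

```python
# test_sample.py -- minimal, hypothetical pytest module.
# In a real project the function under test would be imported from
# src/sample.py; it is inlined here to keep the sketch self-contained.

def filter_adults(rows):
    """Keep only records whose age is 18 or older (illustrative logic)."""
    return [r for r in rows if r.get("age", 0) >= 18]

def test_filter_adults_keeps_only_adults():
    rows = [{"name": "a", "age": 20}, {"name": "b", "age": 15}]
    assert filter_adults(rows) == [{"name": "a", "age": 20}]

def test_filter_adults_handles_empty_input():
    assert filter_adults([]) == []
```

Keeping business logic out of the Spark entry point this way lets `pytest` run quickly inside the container without starting a Spark session.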

 Run the following command to run `pytest` using `docker run`: 

```
$ docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"
```

 Once `pytest` finishes executing unit tests, your output will look something like this: 

```
============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/hadoop/workspace
plugins: integration-mark-0.2.0
collected 1 item

tests/test_sample.py .                                                   [100%]

======================== 1 passed, 1 warning in 34.28s =========================
```

### Setting up the container to use Visual Studio Code
<a name="develop-local-docker-image-setup-visual-studio"></a>

 To set up the container with Visual Studio Code, complete the following steps: 

1. Install Visual Studio Code.

1. Install [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python).

1. Install [Visual Studio Code Remote - Containers](https://code.visualstudio.com/docs/remote/containers).

1. Open the workspace folder in Visual Studio Code.

1. Press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac).

1. Type `Preferences: Open Workspace Settings (JSON)`.

1. Press Enter.

1. Paste the following JSON and save it.

   ```
   {
       "python.defaultInterpreterPath": "/usr/bin/python3.11",
       "python.analysis.extraPaths": [
           "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip:/usr/lib/spark/python/:/usr/lib/spark/python/lib/",
       ]
   }
   ```

 To set up the container: 

1. Run the Docker container.

   ```
   $ docker run -it --rm \
       -v ~/.aws:/home/hadoop/.aws \
       -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
       -e AWS_PROFILE=$PROFILE_NAME \
       --name glue5_pyspark \
       public.ecr.aws/glue/aws-glue-libs:5 \
       pyspark
   ```

1. Start Visual Studio Code.

1.  Choose **Remote Explorer** on the left menu, and choose the `glue5_pyspark` container (image `public.ecr.aws/glue/aws-glue-libs:5`). 

1.  Right-click and choose **Attach in Current Window**.   
![\[When right-click, a window with the option to Attach in Current Window is presented.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/vs-code-other-containers.png)

1.  If the following dialog appears, choose **Got it**.   
![\[A window warning with message "Attaching to a container may execute arbitrary code".\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/vs-code-warning-got-it.png)

1. Open `/home/hadoop/workspace/`.  
![\[A window drop-down with the option 'workspace' is highlighted.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/vs-code-open-workspace.png)

1.  Create an Amazon Glue PySpark script and choose **Run**. 

   You will see the successful run of the script.  
![\[The successful run of the script.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/vs-code-run-successful-script.png)

## Changes between Amazon Glue 4.0 and Amazon Glue 5.0 Docker image
<a name="develop-local-docker-glue4-glue5-changes"></a>

 The following are the major changes between the Amazon Glue 4.0 and Amazon Glue 5.0 Docker images: 
+  In Amazon Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from Glue 4.0, where there was one image for batch and another for streaming. 
+  In Amazon Glue 5.0, the default user name of the container is `hadoop`. In Amazon Glue 4.0, the default user name was `glue_user`. 
+  In Amazon Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can install them manually. 
+  In Amazon Glue 5.0, the Iceberg, Hudi, and Delta Lake libraries are all pre-loaded by default, and the `DATALAKE_FORMATS` environment variable is no longer needed. Through Amazon Glue 4.0, this environment variable was used to specify which table formats to load. 

 The above list is specific to the Docker image. To learn more about Amazon Glue 5.0 updates, see [Introducing Amazon Glue 5.0 for Apache Spark ](https://aws.amazon.com/blogs/big-data/introducing-aws-glue-5-0-for-apache-spark/) and [Migrating Amazon Glue for Spark jobs to Amazon Glue version 5.0](https://docs.amazonaws.cn/glue/latest/dg/migrating-version-50.html). 

## Considerations
<a name="develop-local-docker-considerations"></a>

 Keep in mind that the following features are not supported when using the Amazon Glue container image to develop job scripts locally. 
+  [Job bookmarks](https://docs.amazonaws.cn/glue/latest/dg/monitor-continuations.html) 
+  Amazon Glue Parquet writer ([ Using the Parquet format in Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-format-parquet-home.html)) 
+  [ FillMissingValues transform ](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.html) 
+  [FindMatches transform](https://docs.amazonaws.cn/glue/latest/dg/machine-learning.html#find-matches-transform) 
+  [ Vectorized SIMD CSV reader ](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html#aws-glue-programming-etl-format-simd-csv-reader) 
+  The property [ customJdbcDriverS3Path ](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) for loading JDBC driver from Amazon S3 path 
+  [Amazon Glue Data Quality](https://docs.amazonaws.cn/glue/latest/dg/glue-data-quality.html) 
+  [Sensitive Data Detection](https://docs.amazonaws.cn/glue/latest/dg/detect-PII.html) 
+  Amazon Lake Formation permission-based credential vending 

## Appendix: Adding JDBC drivers and Java libraries
<a name="develop-local-docker-image-appendix"></a>

 To add a JDBC driver that is not already available in the container, create a new directory under your workspace with the JAR files you need, and mount the directory to `/opt/spark/jars/` in the `docker run` command. JAR files found under `/opt/spark/jars/` within the container are automatically added to the Spark classpath and are available for use during the job run. 

 For example, use the following `docker run` command to add JDBC driver JARs to the PySpark REPL shell. 

```
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $WORKSPACE_LOCATION/jars/:/opt/spark/jars/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```
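With the driver JAR mounted, the REPL can reference it through Spark's standard JDBC reader. The sketch below builds the reader options with a small pure helper; the URL, driver class, and table name are placeholders, not values from this guide:

```python
def jdbc_options(url, driver, table, user, password):
    """Assemble the options dict for spark.read.format("jdbc") (pure helper)."""
    return {
        "url": url,          # e.g. jdbc:postgresql://host:5432/db
        "driver": driver,    # class name provided by the mounted JAR
        "dbtable": table,
        "user": user,
        "password": password,
    }

# Inside the PySpark REPL, where `spark` already exists, a read might look like:
#
#   df = (spark.read.format("jdbc")
#         .options(**jdbc_options("jdbc:postgresql://host:5432/db",
#                                 "org.postgresql.Driver",
#                                 "public.mytable", "user", "password"))
#         .load())
#   df.show()
```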

 As noted in **Considerations**, the `customJdbcDriverS3Path` connection option cannot be used to import a custom JDBC driver from Amazon S3 in Amazon Glue container images. 

# Development endpoints
<a name="development"></a>

**Note**  
 **The console experience for dev endpoints has been removed as of March 31, 2023.** Creating, updating, and monitoring dev endpoints is still available via the [Development endpoints API](aws-glue-api-dev-endpoint.md) and [ Amazon Glue CLI](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/glue/index.html#cli-aws-glue).

 We strongly recommend migrating from dev endpoints to interactive sessions for the reasons listed below. For required actions on how to migrate from dev endpoints to interactive sessions, see [Migrating from dev endpoints to interactive sessions](https://docs.amazonaws.cn/glue/latest/dg/development-migration-checklist.html). 


| Description | Dev endpoints | Interactive sessions | 
| --- | --- | --- | 
| Glue version support | Supports Amazon Glue version 0.9 and 1.0 | Supports Amazon Glue version 2.0 and later | 
| Region availability | Dev endpoints are not available in the Asia Pacific (Jakarta) (ap-southeast-3), Middle East (UAE) (me-central-1), Europe (Spain) (eu-south-2), Europe (Zurich) (eu-central-2), or other new Regions going forward | Interactive sessions are not currently available in the Middle East (UAE) (me-central-1) Region, but may be made available later | 
| Access method to the Spark cluster | Supports SSH, REPL shell, Jupyter notebook, and IDEs (for example, PyCharm) | Supports Amazon Glue Studio notebook, Jupyter notebook, various IDEs (for example, Visual Studio Code, PyCharm), and SageMaker AI notebook | 
| Time to first query | Requires 10-15 minutes to set up a Spark cluster | Can take up to 1 minute to set up an ephemeral Spark cluster | 
| Price model | Amazon charges for development endpoints based on the time that the endpoint is provisioned and the number of DPUs. Development endpoints do not time out. There is a 10-minute minimum billing duration for each provisioned development endpoint. Additionally, Amazon charges for Jupyter notebooks on Amazon EC2 instances and for SageMaker AI notebooks when you configure them with dev endpoints.  | Amazon charges for interactive sessions based on the time that the session is active and the number of DPUs. Interactive sessions have configurable idle timeouts, and there is a 1-minute minimum billing duration for each interactive session. Amazon Glue Studio notebooks provide a built-in interface for interactive sessions and are offered at no additional cost. | 
| Console experience | Only available via the CLI and API | Available through the Amazon Glue console, CLI, and APIs | 

# Migrating from dev endpoints to interactive sessions
<a name="development-migration-checklist"></a>

 Use the following checklist to determine the appropriate method to migrate from dev endpoints to interactive sessions. 

 **Does your script depend on Amazon Glue 0.9 or 1.0 specific features (for example, HDFS, YARN, etc.)?** 

 If the answer is yes, see [Migrating Amazon Glue jobs to Amazon Glue version 3.0](https://docs.amazonaws.cn/glue/latest/dg/migrating-version-30.html) to learn how to migrate from Glue 0.9 or 1.0 to Glue 3.0 and later. 

 **Which method do you use to access your dev endpoint?** 


| If you use this method | Then do this | 
| --- | --- | 
| SageMaker AI notebook, Jupyter notebook, or JupyterLab | Migrate to an [Amazon Glue Studio notebook](https://docs.amazonaws.cn/glue/latest/dg/interactive-sessions-gs-notebook.html) by downloading your `.ipynb` files from Jupyter and creating a new Amazon Glue Studio notebook job by uploading each `.ipynb` file. Alternatively, you can use [SageMaker AI Studio](https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/) and select the Amazon Glue kernel.  | 
| Zeppelin notebook | Convert the notebook to a Jupyter notebook manually by copying and pasting code or automatically using a third-party converter such as ze2nb. Then, use the notebook in Amazon Glue Studio notebook or SageMaker AI Studio.  | 
| IDE |  See [ Author Amazon Glue jobs with PyCharm using Amazon Glue interactive sessions](https://aws.amazon.com/blogs/big-data/author-aws-glue-jobs-with-pycharm-using-aws-glue-interactive-sessions/), or [ Using interactive sessions with Microsoft Visual Studio Code](https://docs.amazonaws.cn/glue/latest/dg/interactive-sessions-vscode.html).  | 
| REPL |   Install the [Amazon Glue interactive sessions Jupyter kernel](https://docs.amazonaws.cn/glue/latest/dg/interactive-sessions.html) locally, then run the following command:  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/development-migration-checklist.html)  | 
| SSH | No corresponding option on interactive sessions. Alternatively, you can use a Docker image. To learn more, see [Developing using a Docker image](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-docker-image).  | 

The following sections provide information on using dev endpoints to develop jobs in Amazon Glue version 1.0.

**Topics**
+ [Migrating from dev endpoints to interactive sessions](development-migration-checklist.md)
+ [Developing scripts using development endpoints](dev-endpoint.md)
+ [Managing notebooks](notebooks-with-glue.md)

# Developing scripts using development endpoints
<a name="dev-endpoint"></a>

**Note**  
 Development Endpoints are only supported for versions of Amazon Glue prior to 2.0. For an interactive environment where you can author and test ETL scripts, use [Notebooks on Amazon Glue Studio](https://docs.amazonaws.cn/glue/latest/ug/notebooks-chapter.html). 

Amazon Glue can create an environment—known as a *development endpoint*—that you can use to iteratively develop and test your extract, transform, and load (ETL) scripts. You can create, edit, and delete development endpoints using the  Amazon Glue console or API.

## Managing your development environment
<a name="dev-endpoint-managing-dev-environment"></a>

When you create a development endpoint, you provide configuration values to provision the development environment. These values tell Amazon Glue how to set up the network so that you can access the endpoint securely and the endpoint can access your data stores.

You can then create a notebook that connects to the endpoint, and use your notebook to author and test your ETL script. When you're satisfied with the results of your development process, you can create an ETL job that runs your script. With this process, you can add functions and debug your scripts in an interactive manner.

Follow the tutorials in this section to learn how to use your development endpoint with notebooks.

**Topics**
+ [Managing your development environment](#dev-endpoint-managing-dev-environment)
+ [Development endpoint workflow](dev-endpoint-workflow.md)
+ [How Amazon Glue development endpoints work with SageMaker notebooks](dev-endpoint-how-it-works.md)
+ [Adding a development endpoint](add-dev-endpoint.md)
+ [Accessing your development endpoint](dev-endpoint-elastic-ip.md)
+ [Tutorial: Set up a Jupyter notebook in JupyterLab to test and debug ETL scripts](dev-endpoint-tutorial-local-jupyter.md)
+ [Tutorial: Use a SageMaker AI notebook with your development endpoint](dev-endpoint-tutorial-sage.md)
+ [Tutorial: Use a REPL shell with your development endpoint](dev-endpoint-tutorial-repl.md)
+ [Tutorial: Set up PyCharm professional with a development endpoint](dev-endpoint-tutorial-pycharm.md)
+ [Advanced configuration: sharing development endpoints among multiple users](dev-endpoint-sharing.md)

# Development endpoint workflow
<a name="dev-endpoint-workflow"></a>

To use an Amazon Glue development endpoint, you can follow this workflow:

1. Create a development endpoint using the API. The endpoint is launched in a virtual private cloud (VPC) with your defined security groups.

1. The API polls the development endpoint until it is provisioned and ready for work. When it's ready, connect to the development endpoint using one of the following methods to create and test Amazon Glue scripts.
   + Create a SageMaker AI notebook in your account. For more information about how to create a notebook, see [Authoring code with Amazon Glue Studio notebooks](notebooks-chapter.md).
   + Open a terminal window to connect directly to a development endpoint.
   + If you have the professional edition of the JetBrains [PyCharm Python IDE](https://www.jetbrains.com/pycharm/), connect it to a development endpoint and use it to develop interactively. If you insert `pydevd` statements in your script, PyCharm can support remote breakpoints.

1. When you finish debugging and testing on your development endpoint, you can delete it.

# How Amazon Glue development endpoints work with SageMaker notebooks
<a name="dev-endpoint-how-it-works"></a>

One of the common ways to access your development endpoints is to use [Jupyter](https://jupyter.org/) on SageMaker notebooks. The Jupyter notebook is an open-source web application that is widely used for visualization, analytics, and machine learning. An Amazon Glue SageMaker notebook provides a Jupyter notebook experience with Amazon Glue development endpoints. In the Amazon Glue SageMaker notebook, the Jupyter notebook environment is pre-configured with [SparkMagic](https://github.com/jupyter-incubator/sparkmagic), an open-source Jupyter plugin for submitting Spark jobs to a remote Spark cluster. [Apache Livy](https://livy.apache.org) is a service that allows interaction with a remote Spark cluster over a REST API. In the Amazon Glue SageMaker notebook, SparkMagic is configured to call the REST API against a Livy server running on an Amazon Glue development endpoint. 

The following text flow explains how each component works:

 *Amazon Glue SageMaker notebook: (Jupyter → SparkMagic) → (network) →  Amazon Glue development endpoint: (Apache Livy → Apache Spark)* 

When you run a Spark script written in a paragraph of a Jupyter notebook, the Spark code is submitted to the Livy server via SparkMagic, and a Spark job named "livy-session-N" runs on the Spark cluster. This job is called a Livy session. The Spark job runs while the notebook session is alive, and is terminated when you shut down the Jupyter kernel from the notebook or when the session times out. One Spark job is launched per notebook (`.ipynb`) file.

You can use a single Amazon Glue development endpoint with multiple SageMaker notebook instances, and you can create multiple notebook files in each SageMaker notebook instance. When you open a notebook file and run its paragraphs, a Livy session is launched for that notebook file on the Spark cluster via SparkMagic. Each Livy session corresponds to a single Spark job.

## Default behavior for Amazon Glue development endpoints and SageMaker notebooks
<a name="dev-endpoint-default-behavior"></a>

The Spark jobs run based on the [Spark configuration](https://spark.apache.org/docs/2.4.3/configuration.html). There are multiple ways to set the Spark configuration (for example, Spark cluster configuration, SparkMagic's configuration, etc.).

By default, Spark allocates cluster resources to a Livy session based on the Spark cluster configuration. In Amazon Glue development endpoints, the cluster configuration depends on the worker type. The following table shows the common configurations per worker type.


****  

|  | Standard | G.1X | G.2X | 
| --- | --- | --- | --- | 
|  spark.driver.memory  | 5G | 10G | 20G | 
|  spark.executor.memory  | 5G | 10G | 20G | 
|  spark.executor.cores  | 4 | 8 | 16 | 
|  spark.dynamicAllocation.enabled  | TRUE | TRUE | TRUE | 

The maximum number of Spark executors is automatically calculated from the combination of DPU (or `NumberOfWorkers`) and worker type. 


****  

|  | Standard | G.1X | G.2X | 
| --- | --- | --- | --- | 
| The number of max Spark executors |  (DPU - 1) \* 2 - 1  |  (NumberOfWorkers - 1)   |  (NumberOfWorkers - 1)   | 

For example, if your development endpoint has 10 workers and the worker type is `G.1X`, then you will have 9 Spark executors, and the entire cluster will have 90G of executor memory because each executor has 10G of memory.
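The formulas in the table above can be sketched in a few lines of Python (a hedged illustration; `max_executors` is not an Amazon Glue API):

```python
def max_executors(worker_type, workers):
    """Maximum Spark executors for a dev endpoint, per the table above.

    For Standard, `workers` is the DPU count; for G.1X/G.2X it is
    NumberOfWorkers (one worker is reserved for the driver).
    """
    if worker_type == "Standard":
        return (workers - 1) * 2 - 1
    if worker_type in ("G.1X", "G.2X"):
        return workers - 1
    raise ValueError(f"unknown worker type: {worker_type}")

# The example from the text: 10 G.1X workers -> 9 executors.
print(max_executors("G.1X", 10))  # 9
```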

Regardless of the specified worker type, Spark dynamic resource allocation will be turned on. If a dataset is large enough, Spark may allocate all the executors to a single Livy session since `spark.dynamicAllocation.maxExecutors` is not set by default. This means that other Livy sessions on the same dev endpoint will wait to launch new executors. If the dataset is small, Spark will be able to allocate executors to multiple Livy sessions at the same time.

**Note**  
For more information about how resources are allocated in different use cases and how you set a configuration to modify the behavior, see [Advanced configuration: sharing development endpoints among multiple users](dev-endpoint-sharing.md).

# Adding a development endpoint
<a name="add-dev-endpoint"></a>

Use development endpoints to iteratively develop and test your extract, transform, and load (ETL) scripts in Amazon Glue. Working with development endpoints is only available through the Amazon Command Line Interface.

1. In a command line window, enter a command similar to the following.

   ```
   aws glue create-dev-endpoint --endpoint-name "endpoint1" --role-arn "arn:aws-cn:iam::account-id:role/role-name" --number-of-nodes "3" --glue-version "1.0" --arguments '{"GLUE_PYTHON_VERSION": "3"}' --region "region-name"
   ```

   This command specifies Amazon Glue version 1.0. Because this version supports both Python 2 and Python 3, you can use the `arguments` parameter to indicate the desired Python version. If the `glue-version` parameter is omitted, Amazon Glue version 0.9 is assumed. For more information about Amazon Glue versions, see the [Glue version job property](add-job.md#glue-version-table).

   For information about additional command line parameters, see [create-dev-endpoint](https://docs.amazonaws.cn/cli/latest/reference/glue/create-dev-endpoint.html) in the *Amazon CLI Command Reference*.

1. (Optional) Enter the following command to check the development endpoint status. When the status changes to `READY`, the development endpoint is ready to use.

   ```
   aws glue get-dev-endpoint --endpoint-name "endpoint1"
   ```

# Accessing your development endpoint
<a name="dev-endpoint-elastic-ip"></a>

When you create a development endpoint in a virtual private cloud (VPC), Amazon Glue returns only a private IP address. The public IP address field is not populated. When you create a non-VPC development endpoint, Amazon Glue returns only a public IP address.

If your development endpoint has a **Public address**, confirm that it is reachable with the SSH private key for the development endpoint, as in the following example.

```
ssh -i dev-endpoint-private-key.pem glue@public-address
```

Suppose that your development endpoint has a **Private address**, your VPC subnet is routable from the public internet, and its security groups allow inbound access from your client. In this case, follow these steps to attach an *Elastic IP address* to a development endpoint to allow access from the internet.

**Note**  
If you want to use Elastic IP addresses, the subnet that is being used requires an internet gateway associated through the route table.

**To access a development endpoint by attaching an Elastic IP address**

1. Open the Amazon Glue console at [https://console.amazonaws.cn/glue/](https://console.amazonaws.cn/glue/).

1. In the navigation pane, choose **Dev endpoints**, and navigate to the development endpoint details page. Record the **Private address** for use in the next step. 

1. Open the Amazon EC2 console at [https://console.amazonaws.cn/ec2/](https://console.amazonaws.cn/ec2/).

1. In the navigation pane, under **Network & Security**, choose **Network Interfaces**. 

1. Search for the **Private DNS (IPv4)** that corresponds to the **Private address** on the Amazon Glue console development endpoint details page. 

   You might need to modify which columns are displayed on your Amazon EC2 console. Note the **Network interface ID** (ENI) for this address (for example, `eni-12345678`).

1. On the Amazon EC2 console, under **Network & Security**, choose **Elastic IPs**. 

1. Choose **Allocate new address**, and then choose **Allocate** to allocate a new Elastic IP address.

1. On the **Elastic IPs** page, choose the newly allocated **Elastic IP**. Then choose **Actions**, **Associate address**.

1. On the **Associate address** page, do the following:
   + For **Resource type**, choose **Network interface**.
   + In the **Network interface** box, enter the **Network interface ID** (ENI) for the private address.
   + Choose **Associate**.

1. Confirm that the newly associated Elastic IP address is reachable with the SSH private key that is associated with the development endpoint, as in the following example. 

   ```
   ssh -i dev-endpoint-private-key.pem glue@elastic-ip
   ```

   For information about using a bastion host to get SSH access to the development endpoint’s private address, see the Amazon Security Blog post [Securely Connect to Linux Instances Running in a Private Amazon VPC](https://amazonaws-china.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/).

# Tutorial: Set up a Jupyter notebook in JupyterLab to test and debug ETL scripts
<a name="dev-endpoint-tutorial-local-jupyter"></a>

In this tutorial, you connect a Jupyter notebook in JupyterLab running on your local machine to a development endpoint. You do this so that you can interactively run, debug, and test Amazon Glue extract, transform, and load (ETL) scripts before deploying them. This tutorial uses Secure Shell (SSH) port forwarding to connect your local machine to an Amazon Glue development endpoint. For more information, see [Port forwarding](https://en.wikipedia.org/wiki/Port_forwarding) on Wikipedia.

## Step 1: Install JupyterLab and Sparkmagic
<a name="dev-endpoint-tutorial-local-jupyter-install"></a>

You can install JupyterLab by using `conda` or `pip`. `conda` is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. `pip` is the package installer for Python.

If you're installing on macOS, you must have Xcode installed before you can install Sparkmagic.

1. Install JupyterLab, Sparkmagic, and the related extensions.

   ```
   $ conda install -c conda-forge jupyterlab
   $ pip install sparkmagic
   $ jupyter nbextension enable --py --sys-prefix widgetsnbextension
   $ jupyter labextension install @jupyter-widgets/jupyterlab-manager
   ```

1. Find the `sparkmagic` installation directory from the `Location` field. 

   ```
   $ pip show sparkmagic | grep Location
   Location: /Users/username/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages
   ```

1. Change your directory to the one returned for `Location`, and install the kernels for Scala and PySpark.

   ```
   $ cd /Users/username/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages
   $ jupyter-kernelspec install sparkmagic/kernels/sparkkernel
   $ jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
   ```

1. Download a sample `config` file. 

   ```
   $ curl -o ~/.sparkmagic/config.json https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json
   ```

   In this configuration file, you can configure Spark-related parameters like `driverMemory` and `executorCores`.

## Step 2: Start JupyterLab
<a name="dev-endpoint-tutorial-local-jupyter-start"></a>

When you start JupyterLab with the following command, your default web browser opens automatically and displays the URL `http://localhost:8888/lab/workspaces/{workspace_name}`.

```
$ jupyter lab
```

## Step 3: Initiate SSH port forwarding to connect to your development endpoint
<a name="dev-endpoint-tutorial-local-jupyter-port-forward"></a>

Next, use SSH local port forwarding to forward a local port (here, `8998`) to the remote destination that is defined by Amazon Glue (`169.254.76.1:8998`). 

1. Open a separate terminal window that gives you access to SSH. In Microsoft Windows, you can use the BASH shell provided by [Git for Windows](https://git-scm.com/downloads), or you can install [Cygwin](https://www.cygwin.com/).

1. Run the following SSH command, modified as follows:
   + Replace `private-key-file-path` with a path to the `.pem` file that contains the private key corresponding to the public key that you used to create your development endpoint.
   + If you're forwarding a different port than `8998`, replace `8998` with the port number that you're actually using locally. The address `169.254.76.1:8998` is the remote port and isn't changed by you.
   + Replace `dev-endpoint-public-dns` with the public DNS address of your development endpoint. To find this address, navigate to your development endpoint in the Amazon Glue console, choose the name, and copy the **Public address** that's listed on the **Endpoint details** page.

   ```
   ssh -i private-key-file-path -NTL 8998:169.254.76.1:8998 glue@dev-endpoint-public-dns
   ```

   You will likely see a warning message like the following:

   ```
   The authenticity of host 'ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com (xx.xxx.xxx.xx)'
   can't be established.  ECDSA key fingerprint is SHA256:4e97875Brt+1wKzRko+JflSnp21X7aTP3BcFnHYLEts.
   Are you sure you want to continue connecting (yes/no)?
   ```

   Enter **yes** and leave the terminal window open while you use JupyterLab. 

1. Check that SSH port forwarding is working with the development endpoint correctly.

   ```
   $ curl localhost:8998/sessions
   {"from":0,"total":0,"sessions":[]}
   ```

## Step 4: Run a simple script fragment in a notebook paragraph
<a name="dev-endpoint-tutorial-local-jupyter-list-schema"></a>

Now your notebook in JupyterLab should work with your development endpoint. Enter the following script fragment into your notebook and run it.

1. Check that Spark is running successfully. The following command instructs Spark to calculate `1` and then print the value.

   ```
   spark.sql("select 1").show()
   ```

1. Check if Amazon Glue Data Catalog integration is working. The following command lists the tables in the Data Catalog.

   ```
   spark.sql("show tables").show()
   ```

1. Check that a simple script fragment that uses Amazon Glue libraries works.

   The following script uses the `persons_json` table metadata in the Amazon Glue Data Catalog to create a `DynamicFrame` from your sample data. It then prints out the item count and the schema of this data. 

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
 
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
 
# Create a DynamicFrame using the 'persons_json' table
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
 
# Print out information about *this* data
print("Count:  ", persons_DyF.count())
persons_DyF.printSchema()
```

The output of the script is as follows.

```
 Count:  1961
 root
 |-- family_name: string
 |-- name: string
 |-- links: array
 |    |-- element: struct
 |    |    |-- note: string
 |    |    |-- url: string
 |-- gender: string
 |-- image: string
 |-- identifiers: array
 |    |-- element: struct
 |    |    |-- scheme: string
 |    |    |-- identifier: string
 |-- other_names: array
 |    |-- element: struct
 |    |    |-- note: string
 |    |    |-- name: string
 |    |    |-- lang: string
 |-- sort_name: string
 |-- images: array
 |    |-- element: struct
 |    |    |-- url: string
 |-- given_name: string
 |-- birth_date: string
 |-- id: string
 |-- contact_details: array
 |    |-- element: struct
 |    |    |-- type: string
 |    |    |-- value: string
 |-- death_date: string
```

## Troubleshooting
<a name="dev-endpoint-tutorial-local-jupyter-troubleshooting"></a>
+ During the installation of JupyterLab, if your computer is behind a corporate proxy or firewall, you might encounter HTTP and SSL errors due to custom security profiles managed by corporate IT departments.

  The following is an example of a typical error that occurs when `conda` can't connect to its own repositories:

  ```
  CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/win-64/current_repodata.json>
  ```

  This can happen if your company blocks connections to widely used repositories in the Python and JavaScript communities. For more information, see [Installation Problems](https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html#installation-problems) on the JupyterLab website.
+ If you encounter a *connection refused* error when trying to connect to your development endpoint, you might be using a development endpoint that is out of date. Try creating a new development endpoint and reconnecting.

# Tutorial: Use a SageMaker AI notebook with your development endpoint
<a name="dev-endpoint-tutorial-sage"></a>

 In Amazon Glue, you can create a development endpoint and then create a SageMaker AI notebook to help develop your ETL and machine learning scripts. A SageMaker AI notebook is a fully managed machine learning compute instance running the Jupyter Notebook application.

1. In the Amazon Glue console, choose **Dev endpoints** to navigate to the development endpoints list. 

1. Select the check box next to the name of a development endpoint that you want to use, and on the **Action** menu, choose **Create SageMaker notebook**.

1. Fill out the **Create and configure a notebook** page as follows:

   1. Enter a notebook name.

   1. Under **Attach to development endpoint**, verify the development endpoint.

   1. Create or choose an Amazon Identity and Access Management (IAM) role.

      Creating a role is recommended. If you use an existing role, ensure that it has the required permissions. For more information, see [Step 6: Create an IAM policy for SageMaker AI notebooks](create-sagemaker-notebook-policy.md).

   1. (Optional) Choose a VPC, a subnet, and one or more security groups.

   1. (Optional) Choose an Amazon Key Management Service encryption key.

   1. (Optional) Add tags for the notebook instance.

1. Choose **Create notebook**. On the **Notebooks** page, choose the refresh icon at the upper right, and continue until the **Status** shows `Ready`.

1. Select the check box next to the new notebook name, and then choose **Open notebook**.

1. Create a new notebook: On the **jupyter** page, choose **New**, and then choose **Sparkmagic (PySpark)**.

   Your screen should now look like the following:  
![\[The jupyter page has a menu bar, toolbar, and a wide text field into which you can enter statements.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/sagemaker-notebook.png)

1. (Optional) At the top of the page, choose **Untitled**, and give the notebook a name.

1. To start a Spark application, enter the following command into the notebook, and then in the toolbar, choose **Run**.

   ```
   spark
   ```

   After a short delay, you should see the following response:  
![\[The system response shows Spark application status and outputs the following message: SparkSession available as 'spark'.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/spark-command-response.png)

1. Create a dynamic frame and run a query against it: Copy, paste, and run the following code, which outputs the count and schema of the `persons_json` table.

   ```
   import sys
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.transforms import *
   glueContext = GlueContext(SparkContext.getOrCreate())
   persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
   print("Count:  ", persons_DyF.count())
   persons_DyF.printSchema()
   ```

# Tutorial: Use a REPL shell with your development endpoint
<a name="dev-endpoint-tutorial-repl"></a>

 In Amazon Glue, you can create a development endpoint and then invoke a REPL (Read–Evaluate–Print Loop) shell to run PySpark code incrementally so that you can interactively debug your ETL scripts before deploying them.

 In order to use a REPL on a development endpoint, you need to have authorization to SSH to the endpoint. 

1. On your local computer, open a terminal window that can run SSH commands, and paste in the edited SSH command. Run the command.

   Assuming that you accepted Amazon Glue version 1.0 with Python 3 for the development endpoint, the output will look like this:

   ```
   Python 3.6.8 (default, Aug  2 2019, 17:42:44)
   [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   SLF4J: Class path contains multiple SLF4J bindings.
   SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
   SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   2019-09-23 22:12:23,071 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
   2019-09-23 22:12:26,562 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same name resource file:/usr/lib/spark/python/lib/pyspark.zip added multiple times to distributed cache
   2019-09-23 22:12:26,580 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same path resource file:///usr/share/aws/glue/etl/python/PyGlue.zip added multiple times to distributed cache.
   2019-09-23 22:12:26,581 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same path resource file:///usr/lib/spark/python/lib/py4j-src.zip added multiple times to distributed cache.
   2019-09-23 22:12:26,581 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same path resource file:///usr/share/aws/glue/libs/pyspark.zip added multiple times to distributed cache.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
         /_/
   
   Using Python version 3.6.8 (default, Aug  2 2019 17:42:44)
   SparkSession available as 'spark'.
   >>>
   ```

1. Test that the REPL shell is working correctly by typing the statement `print(spark.version)`. If it displays the Spark version, your REPL is ready to use.

1. Now you can try executing the following simple script, line by line, in the shell:

   ```
   import sys
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.transforms import *
   glueContext = GlueContext(SparkContext.getOrCreate())
   persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
   print("Count:  ", persons_DyF.count())
   persons_DyF.printSchema()
   ```

# Tutorial: Set up PyCharm professional with a development endpoint
<a name="dev-endpoint-tutorial-pycharm"></a>

This tutorial shows you how to connect the [PyCharm Professional](https://www.jetbrains.com/pycharm/) Python IDE running on your local machine to a development endpoint so that you can interactively run, debug, and test Amazon Glue ETL (extract, transform, and load) scripts before deploying them. The instructions and screen captures in the tutorial are based on PyCharm Professional version 2019.3.

To connect to a development endpoint interactively, you must have PyCharm Professional installed. You can't do this using the free edition.

**Note**  
The tutorial uses Amazon S3 as a data source. If you want to use a JDBC data source instead, you must run your development endpoint in a virtual private cloud (VPC). To connect with SSH to a development endpoint in a VPC, you must create an SSH tunnel. This tutorial does not include instructions for creating an SSH tunnel. For information on using SSH to connect to a development endpoint in a VPC, see [Securely Connect to Linux Instances Running in a Private Amazon VPC](https://amazonaws-china.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/) in the Amazon security blog.

**Topics**
+ [Connecting PyCharm professional to a development endpoint](#dev-endpoint-tutorial-pycharm-connect)
+ [Deploying the script to your development endpoint](#dev-endpoint-tutorial-pycharm-deploy)
+ [Configuring a remote interpreter](#dev-endpoint-tutorial-pycharm-interpreter)
+ [Running your script on the development endpoint](#dev-endpoint-tutorial-pycharm-debug-run)

## Connecting PyCharm professional to a development endpoint
<a name="dev-endpoint-tutorial-pycharm-connect"></a>

1. Create a new pure-Python project in PyCharm named `legislators`.

1. Create a file named `get_person_schema.py` in the project with the following content:

   ```
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   
   
   def main():
       # Create a Glue context
       glueContext = GlueContext(SparkContext.getOrCreate())
   
       # Create a DynamicFrame using the 'persons_json' table
       persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
   
       # Print out information about this data
       print("Count:  ", persons_DyF.count())
       persons_DyF.printSchema()
   
   
   if __name__ == "__main__":
       main()
   ```

1. Do one of the following:
   + For Amazon Glue version 0.9, download the Amazon Glue Python library file, `PyGlue.zip`, from `https://s3.amazonaws.com/aws-glue-jes-prod-us-east-1-assets/etl/python/PyGlue.zip` to a convenient location on your local machine.
   + For Amazon Glue version 1.0 and later, download the Amazon Glue Python library file, `PyGlue.zip`, from `https://s3.amazonaws.com/aws-glue-jes-prod-us-east-1-assets/etl-1.0/python/PyGlue.zip` to a convenient location on your local machine.

1. Add `PyGlue.zip` as a content root for your project in PyCharm:
   + In PyCharm, choose **File**, **Settings** to open the **Settings** dialog box. (You can also press `Ctrl+Alt+S`.)
   + Expand the `legislators` project and choose **Project Structure**. Then in the right pane, choose **Add Content Root**.
   + Navigate to the location where you saved `PyGlue.zip`, select it, then choose **Apply**.

    The **Settings** screen should look something like the following:  
![\[The PyCharm Settings screen with PyGlue.zip added as a content root.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/PyCharm_AddContentRoot.png)

   Leave the **Settings** dialog box open after you choose **Apply**.

1. Configure deployment options to upload the local script to your development endpoint using SFTP (this capability is available only in PyCharm Professional):
   + In the **Settings** dialog box, expand the **Build, Execution, Deployment** section. Choose the **Deployment** subsection.
   + Choose the **+** icon at the top of the middle pane to add a new server. Set its **Type** to `SFTP` and give it a name.
   + Set the **SFTP host** to the **Public address** of your development endpoint, as listed on its details page. (Choose the name of your development endpoint in the Amazon Glue console to display the details page). For a development endpoint running in a VPC, set **SFTP host** to the host address and local port of your SSH tunnel to the development endpoint.
   + Set the **User name** to `glue`.
   + Set the **Auth type** to **Key pair (OpenSSH or PuTTY)**. Set the **Private key file** by browsing to the location of your development endpoint's private key file. Note that PyCharm supports only DSA, RSA, and ECDSA OpenSSH key types, and does not accept keys in PuTTY's private format. You can use an up-to-date version of `ssh-keygen` to generate a key-pair type that PyCharm accepts, using syntax like the following:

     ```
     ssh-keygen -t rsa -f <key_file_name> -C "<your_email_address>"
     ```
   + Choose **Test connection**, and allow the connection to be tested. If the connection succeeds, choose **Apply**.

    The **Settings** screen should now look something like the following:  
![\[The PyCharm Settings screen with an SFTP server defined.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/PyCharm_SFTP_blurred.png)

   Again, leave the **Settings** dialog box open after you choose **Apply**.

1. Map the local directory to a remote directory for deployment:
   + In the right pane of the **Deployment** page, choose the middle tab at the top, labeled **Mappings**.
   + In the **Deployment Path** column, enter a path under `/home/glue/scripts/` to which your project will be deployed. For example: `/home/glue/scripts/legislators`.
   + Choose **Apply**.

    The **Settings** screen should now look something like the following:  
![\[The PyCharm Settings screen after a deployment mapping.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/PyCharm_Mapping_blurred.png)

   Choose **OK** to close the **Settings** dialog box.

## Deploying the script to your development endpoint
<a name="dev-endpoint-tutorial-pycharm-deploy"></a>

1. Choose **Tools**, **Deployment**, and then choose the name under which you set up your development endpoint, as shown in the following image:  
![\[The menu item for deploying your script.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/PyCharm_Deploy.png)

   After your script has been deployed, the bottom of the screen should look something like the following:  
![\[The bottom of the PyCharm screen after a successful deployment.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/PyCharm_Deployed.png)

1. On the menu bar, choose **Tools**, **Deployment**, **Automatic Upload (always)**. Ensure that a check mark appears next to **Automatic Upload (always)**.

   When this option is enabled, PyCharm automatically uploads changed files to the development endpoint.

## Configuring a remote interpreter
<a name="dev-endpoint-tutorial-pycharm-interpreter"></a>

Configure PyCharm to use the Python interpreter on the development endpoint.

1. From the **File** menu, choose **Settings**.

1. Expand the project **legislators** and choose **Project Interpreter**.

1. Choose the gear icon next to the **Project Interpreter** list, and then choose **Add**.

1. In the **Add Python Interpreter** dialog box, in the left pane, choose **SSH Interpreter**.

1. Choose **Existing server configuration**, and in the **Deployment configuration** list, choose your configuration.

   Your screen should look something like the following image.  
![\[In the left pane, SSH Interpreter is selected, and in the right pane, the Existing server configuration radio button is selected. The Deployment configuration field contains the configuration name and the message "Remote SDK is saved in IDE settings, so it needs the deployment server to be saved there too. Which do you prefer?" The following are the choices beneath that message: "Create copy of this deployment server in IDE settings" and "Move this server to IDE settings."\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/PyCharm_Interpreter1_blurred.png)

1. Choose **Move this server to IDE settings**, and then choose **Next**.

1. In the **Interpreter** field, change the path to `/usr/bin/gluepython` if you are using Python 2, or to `/usr/bin/gluepython3` if you are using Python 3. Then choose **Finish**.

## Running your script on the development endpoint
<a name="dev-endpoint-tutorial-pycharm-debug-run"></a>

To run the script:
+ In the left pane, right-click the file name and choose **Run '*<filename>*'**.

  After a series of messages, the final output should show the count and the schema.

  ```
  Count:   1961
  root
  |-- family_name: string
  |-- name: string
  |-- links: array
  |    |-- element: struct
  |    |    |-- note: string
  |    |    |-- url: string
  |-- gender: string
  |-- image: string
  |-- identifiers: array
  |    |-- element: struct
  |    |    |-- scheme: string
  |    |    |-- identifier: string
  |-- other_names: array
  |    |-- element: struct
  |    |    |-- lang: string
  |    |    |-- note: string
  |    |    |-- name: string
  |-- sort_name: string
  |-- images: array
  |    |-- element: struct
  |    |    |-- url: string
  |-- given_name: string
  |-- birth_date: string
  |-- id: string
  |-- contact_details: array
  |    |-- element: struct
  |    |    |-- type: string
  |    |    |-- value: string
  |-- death_date: string
  
  
  Process finished with exit code 0
  ```

You are now set up to debug your script remotely on your development endpoint.

# Advanced configuration: sharing development endpoints among multiple users
<a name="dev-endpoint-sharing"></a>

This section explains how you can take advantage of development endpoints with SageMaker notebooks in typical use cases to share development endpoints among multiple users.

## Single-tenancy configuration
<a name="dev-endpoint-sharing-sharing-single"></a>

In single-tenancy use cases, to simplify the developer experience and to avoid contention for resources, it is recommended that each developer use their own development endpoint, sized for the project they are working on. This also leaves decisions about worker type and DPU count to the discretion of each developer and project. 

You won't need to manage resource allocation unless you run multiple notebook files concurrently. If you run code in multiple notebook files at the same time, multiple Livy sessions are launched concurrently. To segregate Spark cluster configurations so that multiple Livy sessions can run at the same time, follow the steps introduced in the multi-tenancy configuration.

For example, if your development endpoint has 10 workers and the worker type is `G.1X`, then you will have 9 Spark executors, and the entire cluster will have 90 GB of executor memory, because each executor has 10 GB of memory.
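The arithmetic above can be expressed as a quick calculation. The helper below is a sketch for illustration only, assuming (as stated above) that one worker is reserved for the Spark driver and that each G.1X executor has 10 GB of memory:

```python
def cluster_capacity(number_of_workers: int, memory_per_executor_gb: int = 10):
    """Estimate executor count and total executor memory for a dev endpoint.

    One worker hosts the Spark driver, so executors = workers - 1.
    """
    executors = number_of_workers - 1
    total_memory_gb = executors * memory_per_executor_gb
    return executors, total_memory_gb

print(cluster_capacity(10))  # (9, 90): 9 executors, 90 GB of executor memory
```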

Regardless of the specified worker type, Spark dynamic resource allocation will be turned on. If a dataset is large enough, Spark may allocate all the executors to a single Livy session since `spark.dynamicAllocation.maxExecutors` is not set by default. This means that other Livy sessions on the same dev endpoint will wait to launch new executors. If the dataset is small, Spark will be able to allocate executors to multiple Livy sessions at the same time.

**Note**  
For more information about how resources are allocated in different use cases, and how to set a configuration that modifies the behavior, see [Multi-tenancy configuration](#dev-endpoint-sharing-sharing-multi).

### Multi-tenancy configuration
<a name="dev-endpoint-sharing-sharing-multi"></a>

**Note**  
Development endpoints are intended to emulate the Amazon Glue ETL environment as a single-tenant environment. While multi-tenant use is possible, it is an advanced use case, and it is recommended that most users maintain a pattern of single-tenancy for each development endpoint.

In multi-tenancy use cases, you might need to manage resource allocation. The key factor is the number of concurrent users who use a Jupyter notebook at the same time. If your team works in a "follow-the-sun" workflow and there is only one Jupyter user in each time zone, then the number of concurrent users is only one, so you won't need to be concerned with resource allocation. However, if your notebook is shared among multiple users and each user submits code on an ad hoc basis, then you will need to consider the following points.

To partition Spark cluster resources among multiple users, you can use SparkMagic configurations. There are two different ways to configure SparkMagic.

#### (A) Use the %%configure -f directive
<a name="dev-endpoint-sharing-sharing-multi-a"></a>

If you want to modify the configuration per Livy session from the notebook, you can run the `%%configure -f` directive in a notebook paragraph.

For example, if you want to run a Spark application on 5 executors, you can run the following command in a notebook paragraph.

```
%%configure -f
{"numExecutors":5}
```

Then you will see only 5 executors running for the job in the Spark UI.

We recommend limiting the maximum number of executors for dynamic resource allocation.

```
%%configure -f
{"conf":{"spark.dynamicAllocation.maxExecutors":"5"}}
```

#### (B) Modify the SparkMagic config file
<a name="dev-endpoint-sharing-sharing-multi-b"></a>

SparkMagic works based on the [Livy API](https://livy.incubator.apache.org/docs/latest/rest-api.html). SparkMagic creates Livy sessions with configurations such as `driverMemory`, `driverCores`, `executorMemory`, `executorCores`, `numExecutors`, and `conf`. These are the key factors that determine how many resources are consumed from the entire Spark cluster. SparkMagic lets you provide a config file to specify the parameters that are sent to Livy. You can see a sample config file in this [GitHub repository](https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json).

If you want to modify the configuration across all Livy sessions from a notebook, you can modify `/home/ec2-user/.sparkmagic/config.json` to add a `session_configs` entry.

To modify the config file on a SageMaker notebook instance, you can follow these steps.

1. Open a SageMaker notebook.

1. Open the Terminal kernel.

1. Run the following commands:

   ```
   sh-4.2$ cd .sparkmagic
   sh-4.2$ ls
   config.json logs
   sh-4.2$ sudo vim config.json
   ```

   For example, you can add these lines to `/home/ec2-user/.sparkmagic/config.json` and restart the Jupyter kernel from the notebook.

   ```
     "session_configs": {
       "conf": {
         "spark.dynamicAllocation.maxExecutors":"5"
       }
     },
   ```
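Rather than hand-editing the JSON, you can apply the same change from a short script. The following is a sketch; the `set_max_executors` helper is hypothetical, and the config path shown in the comment assumes the default SparkMagic location described above:

```python
import json

def set_max_executors(config: dict, n: int) -> dict:
    """Add or update session_configs so that each new Livy session
    is capped at n executors by dynamic allocation."""
    conf = config.setdefault("session_configs", {}).setdefault("conf", {})
    conf["spark.dynamicAllocation.maxExecutors"] = str(n)
    return config

# In practice, load and rewrite /home/ec2-user/.sparkmagic/config.json;
# an empty dict stands in for the loaded config here.
config = {}
updated = set_max_executors(config, 5)
print(json.dumps(updated, indent=2))
```

Remember to restart the Jupyter kernel after rewriting the file, as described above, so the new session configuration takes effect.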

### Guidelines and best practices
<a name="dev-endpoint-sharing-sharing-guidelines"></a>

To avoid resource conflicts among users, you can use some basic approaches:
+ Have a larger Spark cluster by increasing the `NumberOfWorkers` (scaling horizontally) and upgrading the `workerType` (scaling vertically)
+ Allocate fewer resources per user (fewer resources per Livy session)

Your approach will depend on your use case. If you have a larger development endpoint, and there is not a huge amount of data, the possibility of a resource conflict will decrease significantly because Spark can allocate resources based on a dynamic allocation strategy.

As described above, the number of Spark executors can be calculated automatically from the combination of DPU (or `NumberOfWorkers`) and worker type. Each Spark application launches one driver and multiple executors, so each session requires `NumberOfWorkers = NumberOfExecutors + 1` workers. The table below shows how much capacity you need in your development endpoint, based on the number of concurrent users.


****  

| Number of concurrent notebook users | Number of Spark executors you want to allocate per user | Total NumberOfWorkers for your dev endpoint | 
| --- | --- | --- | 
| 3 | 5 | 18 | 
| 10 | 5 | 60 | 
| 50 | 5 | 300 | 
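The table values follow from the relationship above: each concurrent user's Livy session consumes one driver worker plus its executors, so the total is users × (executors per user + 1). A quick check (the helper is illustrative only):

```python
def required_workers(concurrent_users: int, executors_per_user: int) -> int:
    """Each Livy session needs one driver worker plus its executors."""
    return concurrent_users * (executors_per_user + 1)

for users in (3, 10, 50):
    print(users, required_workers(users, 5))  # 3 -> 18, 10 -> 60, 50 -> 300
```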

If you want to allocate fewer resources per user, `spark.dynamicAllocation.maxExecutors` (or `numExecutors`) is the easiest parameter to configure as a Livy session parameter. If you set the following configuration in `/home/ec2-user/.sparkmagic/config.json`, then SparkMagic assigns a maximum of 5 executors per Livy session. This helps segregate resources per Livy session.

```
"session_configs": {
    "conf": {
      "spark.dynamicAllocation.maxExecutors":"5"
    }
  },
```

Suppose there is a dev endpoint with 18 workers (G.1X) and 3 concurrent notebook users. If your session config has `spark.dynamicAllocation.maxExecutors=5`, then each user can make use of 1 driver and 5 executors. There won't be any resource conflicts, even when multiple notebook paragraphs run at the same time.

#### Trade-offs
<a name="dev-endpoint-sharing-sharing-multi-tradeoffs"></a>

With the session config `"spark.dynamicAllocation.maxExecutors":"5"`, you can avoid resource conflict errors, and users don't need to wait for resource allocation when there is concurrent access. However, even when many resources are free (for example, when there are no other concurrent users), Spark cannot assign more than 5 executors to your Livy session.

#### Other notes
<a name="dev-endpoint-sharing-sharing-multi-notes"></a>

It is a good practice to stop the Jupyter kernel when you stop using a notebook. This frees resources, and other notebook users can use them immediately without waiting for the kernel to expire (auto-shutdown).

### Common issues
<a name="dev-endpoint-sharing-sharing-issues"></a>

Even when following the guidelines, you may experience certain issues.

#### Session not found
<a name="dev-endpoint-sharing-sharing-issues-session"></a>

When you try to run a notebook paragraph after your Livy session has already been terminated, you will see the following message. To activate the Livy session, restart the Jupyter kernel by choosing **Kernel** > **Restart** in the Jupyter menu, then run the notebook paragraph again.

```
An error was encountered:
Invalid status code '404' from http://localhost:8998/sessions/13 with error payload: "Session '13' not found."
```

#### Not enough YARN resources
<a name="dev-endpoint-sharing-sharing-issues-yarn-resources"></a>

When you try to run a notebook paragraph and your Spark cluster does not have enough resources to start a new Livy session, you will see the following message. You can often avoid this issue by following the guidelines, but you might still encounter it. To work around the issue, check whether there are any unneeded, active Livy sessions. If there are, terminate them to free the cluster resources. See the next section for details.

```
Warning: The Spark session does not have enough YARN resources to start. 
The code failed because of a fatal error:
    Session 16 did not start up in 60 seconds..

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
```

### Monitoring and debugging
<a name="dev-endpoint-sharing-sharing-debugging"></a>

This section describes techniques for monitoring resources and sessions.

#### Monitoring and debugging cluster resource allocation
<a name="dev-endpoint-sharing-sharing-debugging-a"></a>

You can watch the Spark UI to monitor how many resources are allocated per Livy session, and which Spark configurations are in effect for the job. To activate the Spark UI, see [Enabling the Apache Spark Web UI for Development Endpoints](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-dev-endpoints.html).

(Optional) If you need a real-time view of the Spark UI, you can configure an SSH tunnel against the Spark history server running on the Spark cluster.

```
ssh -i <private-key.pem> -N -L 8157:<development endpoint public address>:18080 glue@<development endpoint public address>
```

You can then open `http://localhost:8157` in your browser to view the Spark UI.

#### Free unneeded Livy sessions
<a name="dev-endpoint-sharing-sharing-debugging-b"></a>

Review these procedures to shut down any unneeded Livy sessions from a notebook or a Spark cluster.

**(a). Terminate Livy sessions from a notebook**  
You can shut down the kernel on a Jupyter notebook to terminate unneeded Livy sessions.

**(b). Terminate Livy sessions from a Spark cluster**  
If unneeded Livy sessions are still running, you can shut them down on the Spark cluster.

As a prerequisite for this procedure, you need to configure your SSH public key for your development endpoint.

To log in to the Spark cluster, you can run the following command:

```
$ ssh -i <private-key.pem> glue@<development endpoint public address>
```

You can run the following command to see the active Livy sessions:

```
$ yarn application -list
20/09/25 06:22:21 INFO client.RMProxy: Connecting to ResourceManager at ip-255-1-106-206.ec2.internal/172.38.106.206:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):2
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1601003432160_0005 livy-session-4 SPARK livy default RUNNING UNDEFINED 10% http://ip-255-1-4-130.ec2.internal:41867
application_1601003432160_0004 livy-session-3 SPARK livy default RUNNING UNDEFINED 10% http://ip-255-1-179-185.ec2.internal:33727
```
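If you need to script this cleanup, the application IDs can be scraped from the `yarn application -list` output. The following is a sketch; the `livy_application_ids` helper is hypothetical and assumes the column layout shown above (informational header lines are skipped by the `application_` prefix check):

```python
def livy_application_ids(yarn_list_output: str):
    """Extract application IDs for rows whose name starts with 'livy-session'."""
    ids = []
    for line in yarn_list_output.splitlines():
        fields = line.split()
        if (len(fields) >= 2
                and fields[0].startswith("application_")
                and fields[1].startswith("livy-session")):
            ids.append(fields[0])
    return ids

sample = """application_1601003432160_0005 livy-session-4 SPARK livy default RUNNING UNDEFINED 10% http://ip-255-1-4-130.ec2.internal:41867
application_1601003432160_0004 livy-session-3 SPARK livy default RUNNING UNDEFINED 10% http://ip-255-1-179-185.ec2.internal:33727"""
print(livy_application_ids(sample))
# ['application_1601003432160_0005', 'application_1601003432160_0004']
```

Each returned ID can then be passed to `yarn application -kill`, as shown next.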

You can then shut down the Livy session with the following command:

```
$ yarn application -kill application_1601003432160_0005
20/09/25 06:23:38 INFO client.RMProxy: Connecting to ResourceManager at ip-255-1-106-206.ec2.internal/255.1.106.206:8032
Killing application application_1601003432160_0005
20/09/25 06:23:39 INFO impl.YarnClientImpl: Killed application application_1601003432160_0005
```

# Managing notebooks
<a name="notebooks-with-glue"></a>

**Note**  
 Development Endpoints are only supported for versions of Amazon Glue prior to 2.0. For an interactive environment where you can author and test ETL scripts, use [Notebooks on Amazon Glue Studio](https://docs.amazonaws.cn/glue/latest/ug/notebooks-chapter.html). 

A notebook enables interactive development and testing of your ETL (extract, transform, and load) scripts on a development endpoint. Amazon Glue provides an interface to SageMaker AI Jupyter notebooks. With Amazon Glue, you create and manage SageMaker AI notebooks. You can also open SageMaker AI notebooks from the Amazon Glue console.

In addition, you can use Apache Spark with SageMaker AI on Amazon Glue development endpoints (SageMaker Spark is supported on development endpoints, but not on Amazon Glue ETL jobs). SageMaker Spark is an open source Apache Spark library for SageMaker AI. For more information, see [Using Apache Spark with Amazon SageMaker](https://docs.amazonaws.cn/sagemaker/latest/dg/apache-spark.html). 


Managing SageMaker AI notebooks with Amazon Glue development endpoints is available in the following Amazon Regions:

| Region | Code | 
| --- | --- | 
| US East (Ohio) | `us-east-2` | 
| US East (N. Virginia) | `us-east-1` | 
| US West (N. California) | `us-west-1` | 
| US West (Oregon) | `us-west-2` | 
| Asia Pacific (Tokyo) | `ap-northeast-1` | 
| Asia Pacific (Seoul) | `ap-northeast-2` | 
| Asia Pacific (Mumbai) | `ap-south-1` | 
| Asia Pacific (Singapore) | `ap-southeast-1` | 
| Asia Pacific (Sydney) | `ap-southeast-2` | 
| Canada (Central) | `ca-central-1` | 
| Europe (Frankfurt) | `eu-central-1` | 
| Europe (Ireland) | `eu-west-1` | 
| Europe (London) | `eu-west-2` | 

**Topics**