Configure networking - Amazon SageMaker

Configure networking

This section provides information about how administrators can configure their network to allow communication between Studio or Studio Classic and an Amazon EMR cluster.

The networking instructions vary based on whether Studio and Amazon EMR are deployed within a private Amazon Virtual Private Cloud (VPC) or communicate over the internet.

By default, Studio or Studio Classic runs in an Amazon managed VPC with internet access. When using an internet connection, Studio and Studio Classic access Amazon resources, such as Amazon S3 buckets, over the internet. However, if you have security requirements to control access to your data and job containers, we recommend that you configure Studio or Studio Classic and Amazon EMR so that your data and containers aren’t accessible over the internet. To control access to your resources or run Studio or Studio Classic without public internet access, you can specify the VPC only network access type when you onboard to an Amazon SageMaker domain. In this scenario, both Studio and Studio Classic establish connections with other Amazon services via private VPC endpoints. For information about configuring Studio or Studio Classic in VPC only mode, see Connect SageMaker Studio or Studio Classic notebooks in a VPC to external resources.

The first two sections describe how to ensure communication between Studio or Studio Classic and an Amazon EMR cluster in VPCs without public internet access. The last section covers how to ensure communication between Studio or Studio Classic and Amazon EMR using an internet connection. Before connecting Studio or Studio Classic and Amazon EMR without internet access, make sure to establish VPC endpoints for Amazon Simple Storage Service (data storage), Amazon CloudWatch (logging and monitoring), and Amazon SageMaker Runtime (fine-grained role-based access control (RBAC)).

To connect Studio or Studio Classic and your Amazon EMR cluster, follow the section below that matches your deployment:

Studio and Amazon EMR are in separate VPCs

To allow communication between Studio or Studio Classic and Amazon EMR when they are deployed in separate VPCs:

  1. Start by connecting your VPCs through a VPC peering connection.

  2. Update your routing tables in each VPC to route the network traffic between Studio or Studio Classic subnets and Amazon EMR subnets both ways.

  3. Configure your security groups to allow inbound and outbound traffic.
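As a rough illustration of the first two steps, the following Python sketch checks that two IP ranges do not overlap (a VPC peering connection cannot be created between VPCs with overlapping CIDR blocks) and then derives the route each side needs. The function name `plan_peering_routes` and the peering connection ID are hypothetical; the dictionary keys mirror the parameter names of the EC2 `CreateRoute` API, and the CIDR values are the example ranges used later in this section. It only computes a plan and makes no Amazon API calls.

```python
import ipaddress


def plan_peering_routes(studio_cidr, emr_cidr, peering_id):
    """Return the route each side needs, or raise if the ranges overlap.

    Hypothetical helper: it only builds a plan and calls no Amazon API.
    """
    studio_net = ipaddress.ip_network(studio_cidr)
    emr_net = ipaddress.ip_network(emr_cidr)

    # Peered VPCs must not have overlapping CIDR blocks.
    if studio_net.overlaps(emr_net):
        raise ValueError(f"CIDR blocks overlap: {studio_net} and {emr_net}")

    return {
        # Outbound route, added to the Studio subnet route table.
        "studio_route_table": {
            "DestinationCidrBlock": str(emr_net),
            "VpcPeeringConnectionId": peering_id,
        },
        # Return route, added to the Amazon EMR subnet route table.
        "emr_route_table": {
            "DestinationCidrBlock": str(studio_net),
            "VpcPeeringConnectionId": peering_id,
        },
    }


if __name__ == "__main__":
    # Example ranges from this section; the peering ID is a placeholder.
    plan = plan_peering_routes("10.0.20.0/24", "2.0.1.0/24", "pcx-0123456789abcdef0")
    for table, route in plan.items():
        print(table, route)
```

Each resulting dictionary could be passed, together with the relevant route table ID, to whatever tool you use to create routes.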

The steps to connect Studio or Studio Classic and Amazon EMR are the same whether the resources are deployed in a single Amazon account (Single account use case) or across multiple Amazon accounts (Cross-account use case).

  1. VPC peering

    Create a VPC peering connection to facilitate the networking between the two VPCs (Studio or Studio Classic and Amazon EMR).

    1. From your Studio or Studio Classic account, on the VPC dashboard, choose Peering connections, then Create peering connection.

    2. Create your request to peer the Studio or Studio Classic VPC with the Amazon EMR VPC. When requesting peering in another Amazon account, choose Another account in Select another VPC to peer with.

      For cross-account peering, the administrator must accept the request from the Amazon EMR account.

      When peering private subnets, you should enable private IP DNS resolution at the VPC peering connection level.

  2. Routing tables

    Send the network traffic between Studio or Studio Classic subnets and Amazon EMR subnets both ways.

    After you establish the peering connection, the administrator (on each account for cross-account access) can add routes to the private subnet route tables to route the traffic between Studio or Studio Classic and the cluster subnets. You can define those routes by going to the Route Tables section of each VPC in the VPC dashboard.

    The following illustration of the route table of a Studio VPC subnet shows an example of an outbound route from the Studio account to the Amazon EMR VPC IP range (here 2.0.1.0/24) through the peering connection.

    Route table of a Studio VPC subnet showing the outbound routes from the Studio account to the Amazon EMR VPC IP range (here 2.0.1.0/24) through the peering connection

    The following illustration of a route table of an Amazon EMR VPC subnet shows an example of return routes from the Amazon EMR VPC to Studio VPC IP range (here 10.0.20.0/24) through the peering connection.

    Route table of an Amazon EMR VPC subnet showing the return routes from the Amazon EMR account to the Studio VPC IP range (here 10.0.20.0/24) through the peering connection

  3. Security groups

    Lastly, the security group of your Studio or Studio Classic domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from the Studio or Studio Classic instance security group. Apache Livy is a service that enables interaction with Amazon EMR over a REST interface.
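The inbound rules described in the security group step can be sketched as data. The Python example below builds one TCP rule per service, each allowing traffic only from the Studio security group. The helper name `emr_ingress_permissions` and the security group ID are hypothetical; the dictionary layout follows the `IpPermissions` structure of the EC2 `AuthorizeSecurityGroupIngress` API. The sketch assumes both security groups are in the same account; a cross-account setup may need additional fields that are not shown here.

```python
# Ports named in this section for the Amazon EMR primary node.
EMR_PORTS = {"livy": 8998, "hive": 10000, "presto": 8889}


def emr_ingress_permissions(studio_sg_id, services=("livy", "hive", "presto")):
    """Build inbound rules that admit only the Studio security group.

    Hypothetical helper: it returns plain dictionaries and calls no Amazon API.
    """
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": EMR_PORTS[name],
            "ToPort": EMR_PORTS[name],
            # Reference the source security group instead of an IP range.
            "UserIdGroupPairs": [
                {"GroupId": studio_sg_id, "Description": f"{name} from Studio"}
            ],
        }
        for name in services
    ]


if __name__ == "__main__":
    # Placeholder security group ID; open only the services you use.
    for rule in emr_ingress_permissions("sg-0123456789abcdef0", services=("livy",)):
        print(rule)
```

Passing only the services you actually use keeps the number of open ports to a minimum.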

The following diagram shows an example of an Amazon VPC setup that enables JupyterLab or Studio Classic notebooks to provision Amazon EMR clusters from Amazon CloudFormation templates in the Service Catalog and then connect to an Amazon EMR cluster within the same Amazon account. The diagram provides an additional illustration of the required endpoints for a direct connection to various Amazon services, such as Amazon S3 or Amazon CloudWatch, when the VPCs have no internet access. Alternatively, a NAT gateway must be used to allow instances in private subnets of multiple VPCs to share a single public IP address provided by the internet gateway when accessing the internet.

Architectural diagram of an example Amazon VPC setup in which Studio or Studio Classic notebooks provision and connect to Amazon EMR clusters within the same Amazon account, including the VPC endpoints required when the VPCs have no internet access.

Studio and Amazon EMR are in the same VPC

If Studio or Studio Classic and the Amazon EMR clusters are in different subnets, add routes to each private subnet route table to route the traffic between Studio or Studio Classic and the cluster subnets. You can define those routes by going to the Route Tables section of each VPC in the VPC dashboard. If you deployed Studio or Studio Classic and an Amazon EMR cluster in the same VPC and the same subnet, you do not need to route the traffic between Studio or Studio Classic and the cluster.

Whether or not you needed to update your routing tables, the security group of your Studio or Studio Classic domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from the Studio or Studio Classic instance security group. Apache Livy is a service that enables interaction with an Amazon EMR cluster over a REST interface.
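Once routes and security groups are in place, a simple TCP probe run from a notebook can help confirm that the primary node is reachable on the expected ports. The following Python sketch uses only the standard library; the function names and the host name are hypothetical, and a successful connection only shows that the port accepts TCP connections, not that the service behind it is healthy.

```python
import socket

# Ports named in this section for the Amazon EMR primary node.
EMR_PORTS = {"livy": 8998, "hive": 10000, "presto": 8889}


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_emr_ports(host, ports=None, timeout=3.0):
    """Probe each named port and return a {service: reachable} mapping."""
    ports = EMR_PORTS if ports is None else ports
    return {name: port_is_reachable(host, port, timeout) for name, port in ports.items()}


if __name__ == "__main__":
    # Placeholder host name: replace with your primary node's private DNS name.
    print(probe_emr_ports("emr-primary.example.internal"))
```

A `False` result for a port you expect to be open usually points to a missing route or a security group rule that does not reference the Studio or Studio Classic security group.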

Studio and Amazon EMR communicate over public internet

By default, Studio and Studio Classic provide a network interface that allows communication with the internet through an internet gateway in the VPC associated with the SageMaker domain. If you choose to connect to Amazon EMR through the public internet, your Amazon EMR cluster needs to accept inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from its internet gateway. Apache Livy is a service that enables interaction with an Amazon EMR cluster over a REST interface.

Keep in mind that any port on which you allow inbound traffic represents a potential security vulnerability. Carefully review custom security groups to ensure that you minimize vulnerabilities. For more information, see Control network traffic with security groups.
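One way to review a security group is to list the inbound rules that accept traffic from any address. The Python sketch below scans a list of rules shaped like the `IpPermissions` field returned by the EC2 `DescribeSecurityGroups` API and reports those open to `0.0.0.0/0` or `::/0`. The function name `world_open_rules` and the sample data are hypothetical, and the check is deliberately narrow: it does not flag other overly broad ranges.

```python
ANY_ADDRESS = ("0.0.0.0/0", "::/0")


def world_open_rules(ip_permissions):
    """Return (from_port, to_port, cidr) for each rule open to any address."""
    findings = []
    for perm in ip_permissions:
        cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
        for cidr in cidrs:
            if cidr in ANY_ADDRESS:
                findings.append((perm.get("FromPort"), perm.get("ToPort"), cidr))
    return findings


if __name__ == "__main__":
    # Hypothetical rules: Livy open to the world, Hive limited to one range.
    sample = [
        {"IpProtocol": "tcp", "FromPort": 8998, "ToPort": 8998,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
        {"IpProtocol": "tcp", "FromPort": 10000, "ToPort": 10000,
         "IpRanges": [{"CidrIp": "10.0.20.0/24"}]},
    ]
    for finding in world_open_rules(sample):
        print("open to any address:", finding)
```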

Alternatively, see Blogs and whitepapers for a detailed walkthrough of how to enable Kerberos on Amazon EMR, set the cluster in a private subnet, and access the cluster using a Network Load Balancer (NLB) to expose only specific ports, which are access-controlled via security groups.

Note

When connecting to your Apache Livy endpoint through the public internet, we recommend that you secure communications between Studio or Studio Classic and your Amazon EMR cluster using TLS.

For information on setting up HTTPS with Apache Livy, see Enabling HTTPS with Apache Livy. For information on setting an Amazon EMR cluster with transit encryption enabled, see Providing certificates for encrypting data in transit with Amazon EMR encryption. Additionally, you need to configure Studio or Studio Classic to access your certificate key as specified in Connect to an Amazon EMR cluster over HTTPS.
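To check that a Livy endpoint actually serves HTTPS with a certificate you trust, a small standard-library client can be enough. The Python sketch below is an illustration only and is not how Studio or Studio Classic connects to the cluster. It assumes Livy is reachable on port 8998 and uses Livy's `GET /sessions` REST endpoint; the function names, host name, and CA file path are hypothetical.

```python
import ssl
import urllib.request


def make_tls_context(ca_file=None):
    """Create a context that verifies the server certificate and host name.

    Pass ca_file when the certificate is signed by a private CA; with None,
    the system's default trust store is used.
    """
    return ssl.create_default_context(cafile=ca_file)


def livy_sessions_request(host, port=8998):
    """Build (but do not send) a GET request for Livy's session list."""
    return urllib.request.Request(
        f"https://{host}:{port}/sessions",
        headers={"Accept": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder host and CA path: replace with your own values.
    context = make_tls_context()  # or make_tls_context("/path/to/ca.pem")
    request = livy_sessions_request("emr-primary.example.internal")
    print(request.get_method(), request.full_url)
    # To send the request from a host that can reach the cluster:
    # with urllib.request.urlopen(request, context=context, timeout=10) as resp:
    #     print(resp.status)
```

Leaving certificate verification enabled, as the default context does, means a misconfigured or self-signed certificate shows up as an error instead of being silently accepted.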