Configure networking
This section provides information about how administrators can configure their network to allow communication between Studio or Studio Classic and an Amazon EMR cluster.
The networking instructions vary based on whether Studio and Amazon EMR are deployed within a private Amazon Virtual Private Cloud (VPC) or communicate over the internet.
By default, Studio or Studio Classic runs in an Amazon managed VPC with internet access. When using an internet connection, Studio and Studio Classic access Amazon resources, such as Amazon S3 buckets, over the internet. However, if you have security requirements to control access to your data and job containers, we recommend that you configure Studio or Studio Classic and Amazon EMR so that your data and containers aren’t accessible over the internet. To control access to your resources or run Studio or Studio Classic without public internet access, you can specify the VPC only network access type when you onboard to Amazon SageMaker domain. In this scenario, both Studio and Studio Classic establish connections with other Amazon services via private VPC endpoints. For information about configuring Studio or Studio Classic in VPC only mode, see Connect SageMaker Studio or Studio Classic notebooks in a VPC to external resources.
The first two sections describe how to ensure communication between Studio or Studio Classic and an Amazon EMR cluster in VPCs without public internet access. The last section covers how to ensure communication between Studio or Studio Classic and Amazon EMR using an internet connection. Prior to connecting Studio or Studio Classic and Amazon EMR without internet access, make sure to establish endpoints for Amazon Simple Storage Service (data storage), Amazon CloudWatch (logging and monitoring), and Amazon SageMaker Runtime (fine-grained role-based access control (RBAC)).
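The following sketch shows one way to prepare the endpoint requests listed above with the EC2 `create_vpc_endpoint` API. It is an illustration under assumptions, not a prescribed setup: the function names are hypothetical, the `com.amazonaws.<region>.<service>` naming is assumed and can differ in your partition, Amazon S3 is assumed to use a gateway endpoint, and Amazon CloudWatch is assumed to need both the `logs` and `monitoring` services.

```python
from typing import Any, Dict, List


def endpoint_requests(
    region: str,
    vpc_id: str,
    subnet_ids: List[str],
    security_group_ids: List[str],
    route_table_ids: List[str],
) -> List[Dict[str, Any]]:
    """Build keyword arguments for ec2.create_vpc_endpoint, one dict per service."""
    prefix = f"com.amazonaws.{region}"  # assumed naming; verify it for your partition
    requests: List[Dict[str, Any]] = [
        # Amazon S3 as a gateway endpoint attached to the private route tables.
        {
            "VpcEndpointType": "Gateway",
            "VpcId": vpc_id,
            "ServiceName": f"{prefix}.s3",
            "RouteTableIds": route_table_ids,
        }
    ]
    # CloudWatch Logs, CloudWatch metrics, and SageMaker Runtime as interface endpoints.
    for service in ("logs", "monitoring", "sagemaker.runtime"):
        requests.append(
            {
                "VpcEndpointType": "Interface",
                "VpcId": vpc_id,
                "ServiceName": f"{prefix}.{service}",
                "SubnetIds": subnet_ids,
                "SecurityGroupIds": security_group_ids,
                "PrivateDnsEnabled": True,
            }
        )
    return requests


def create_endpoints(ec2: Any, requests: List[Dict[str, Any]]) -> List[str]:
    """Send each request with a boto3 EC2 client and return the new endpoint IDs."""
    return [
        ec2.create_vpc_endpoint(**request)["VpcEndpoint"]["VpcEndpointId"]
        for request in requests
    ]
```

With boto3 installed, `ec2` would typically be `boto3.client("ec2")`. Building the requests separately from sending them lets you review the service names before anything is created.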
To connect Studio or Studio Classic and your Amazon EMR cluster:
-
If Studio or Studio Classic and Amazon EMR are in separate VPCs, either in the same Amazon account or in different accounts, see Studio and Amazon EMR are in separate VPCs.
-
If Studio or Studio Classic and Amazon EMR are in the same VPC, see Studio and Amazon EMR are in the same VPC.
-
If you choose to connect Studio or Studio Classic and Amazon EMR over the public internet, see Studio and Amazon EMR communicate over public internet.
Studio and Amazon EMR are in separate VPCs
To allow communication between Studio or Studio Classic and Amazon EMR when they are deployed in separate VPCs:
-
Start by connecting your VPCs through a VPC peering connection.
-
Update your routing tables in each VPC to route the network traffic between Studio or Studio Classic subnets and Amazon EMR subnets both ways.
-
Configure your security groups to allow inbound and outbound traffic.
The steps to connect Studio or Studio Classic and Amazon EMR are the same whether the resources are deployed in a single Amazon account (Single account use case) or across multiple Amazon accounts (Cross-account use case).
-
VPC peering
Create a VPC peering connection to facilitate the networking between the two VPCs (Studio or Studio Classic and Amazon EMR).
-
From your Studio or Studio Classic account, on the VPC dashboard, choose Peering connections, then Create peering connection.
-
Create your request to peer the Studio or Studio Classic VPC with the Amazon EMR VPC. When requesting peering in another Amazon account, choose Another account in Select another VPC to peer with.
For cross-account peering, the administrator must accept the request from the Amazon EMR account.
When peering private subnets, you should enable private IP DNS resolution at the VPC peering connection level.
-
Routing tables
Send the network traffic between Studio or Studio Classic subnets and Amazon EMR subnets both ways.
After you establish the peering connection, the administrator (on each account for cross-account access) can add routes to the private subnet route tables to route the traffic between Studio or Studio Classic and the cluster subnets. You can define those routes by going to the Route Tables section of each VPC in the VPC dashboard.
The following illustration of the route table of a Studio VPC subnet shows an example of an outbound route from the Studio account to the Amazon EMR VPC IP range (here 2.0.1.0/24) through the peering connection.
The following illustration of a route table of an Amazon EMR VPC subnet shows an example of return routes from the Amazon EMR VPC to Studio VPC IP range (here 10.0.20.0/24) through the peering connection.
-
Security groups
Lastly, the security group of your Studio or Studio Classic domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from the Studio or Studio Classic instance security group. Apache Livy is a service that enables interaction with Amazon EMR over a REST interface.
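The peering and routing steps above can also be scripted. The following is a minimal sketch that assumes two boto3 EC2 clients, one for the Studio or Studio Classic account and one for the Amazon EMR account (the same client can be passed twice in the single account use case). The function and parameter names are hypothetical, and the sketch omits waiting for the peering connection to become active as well as error handling.

```python
from typing import Any, Dict, Optional


def peer_vpcs(
    studio_ec2: Any,
    emr_ec2: Any,
    studio_vpc_id: str,
    emr_vpc_id: str,
    emr_account_id: Optional[str] = None,
) -> str:
    """Request and accept a VPC peering connection, then enable DNS resolution."""
    params: Dict[str, str] = {"VpcId": studio_vpc_id, "PeerVpcId": emr_vpc_id}
    if emr_account_id is not None:
        params["PeerOwnerId"] = emr_account_id  # cross-account use case
    response = studio_ec2.create_vpc_peering_connection(**params)
    pcx_id = response["VpcPeeringConnection"]["VpcPeeringConnectionId"]

    # The Amazon EMR side accepts the request.
    emr_ec2.accept_vpc_peering_connection(VpcPeeringConnectionId=pcx_id)

    # Each side enables private DNS resolution for its own half of the connection.
    dns = {"AllowDnsResolutionFromRemoteVpc": True}
    studio_ec2.modify_vpc_peering_connection_options(
        VpcPeeringConnectionId=pcx_id, RequesterPeeringConnectionOptions=dns
    )
    emr_ec2.modify_vpc_peering_connection_options(
        VpcPeeringConnectionId=pcx_id, AccepterPeeringConnectionOptions=dns
    )
    return pcx_id


def add_peering_routes(
    studio_ec2: Any,
    emr_ec2: Any,
    pcx_id: str,
    studio_route_table_id: str,
    emr_route_table_id: str,
    studio_cidr: str,
    emr_cidr: str,
) -> None:
    """Route traffic both ways between the Studio and Amazon EMR subnets."""
    # Outbound route: Studio subnets reach the Amazon EMR range through the peering.
    studio_ec2.create_route(
        RouteTableId=studio_route_table_id,
        DestinationCidrBlock=emr_cidr,
        VpcPeeringConnectionId=pcx_id,
    )
    # Return route: Amazon EMR subnets reach the Studio range through the peering.
    emr_ec2.create_route(
        RouteTableId=emr_route_table_id,
        DestinationCidrBlock=studio_cidr,
        VpcPeeringConnectionId=pcx_id,
    )
```

Using the example ranges from this section, `studio_cidr` would be `10.0.20.0/24` and `emr_cidr` would be `2.0.1.0/24`.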
The following diagram shows an example of an Amazon VPC setup that enables JupyterLab or Studio Classic notebooks to provision Amazon EMR clusters from Amazon CloudFormation templates in the Service Catalog and then connect to an Amazon EMR cluster within the same Amazon account. The diagram provides an additional illustration of the required endpoints for a direct connection to various Amazon services, such as Amazon S3 or Amazon CloudWatch, when the VPCs have no internet access. Alternatively, a NAT gateway must be used to allow instances in private subnets of multiple VPCs to share a single public IP address provided by the internet gateway when accessing the internet.
Studio and Amazon EMR are in the same VPC
If Studio or Studio Classic and the Amazon EMR clusters are in different subnets, add routes to each private subnet route table to route the traffic between Studio or Studio Classic and the cluster subnets. You can define those routes by going to the Route Tables section of each VPC in the VPC dashboard. If you deployed Studio or Studio Classic and an Amazon EMR cluster in the same VPC and the same subnet, you do not need to route the traffic between Studio or Studio Classic and the cluster.
Whether or not you needed to update your routing tables, the security group of your Studio or Studio Classic domain must allow outbound traffic, and the security group of the Amazon EMR primary node must allow inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from the Studio or Studio Classic instance security group. Apache Livy is a service that enables interaction with Amazon EMR over a REST interface.
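As an illustration of the inbound rules described above, the following sketch builds the permissions for the EC2 `authorize_security_group_ingress` API. The helper names and the `EMR_PORTS` mapping are hypothetical. The sketch assumes that both security groups are in the same VPC and that the Studio or Studio Classic security group already allows outbound traffic.

```python
from typing import Any, Dict, List

# TCP ports named in this section for Apache Livy, Hive, and Presto.
EMR_PORTS = {"livy": 8998, "hive": 10000, "presto": 8889}


def ingress_permissions(studio_sg_id: str, services: List[str]) -> List[Dict[str, Any]]:
    """Build IpPermissions that admit TCP traffic from the Studio security group."""
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": EMR_PORTS[name],
            "ToPort": EMR_PORTS[name],
            # Reference the source security group instead of a CIDR range.
            "UserIdGroupPairs": [
                {"GroupId": studio_sg_id, "Description": f"Studio to {name}"}
            ],
        }
        for name in services
    ]


def allow_studio_to_emr(
    ec2: Any, emr_primary_sg_id: str, studio_sg_id: str, services: List[str]
) -> None:
    """Add the inbound rules to the Amazon EMR primary node security group."""
    ec2.authorize_security_group_ingress(
        GroupId=emr_primary_sg_id,
        IpPermissions=ingress_permissions(studio_sg_id, services),
    )
```

Passing only the services you use, for example `["livy"]`, keeps the number of open ports to a minimum.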
Studio and Amazon EMR communicate over public internet
By default, Studio and Studio Classic provide a network interface that allows communication with the internet through an internet gateway in the VPC associated with the SageMaker domain. If you choose to connect to Amazon EMR through the public internet, your Amazon EMR cluster needs to accept inbound traffic on Apache Livy, Hive, or Presto TCP ports (respectively 8998, 10000, and 8889) from its internet gateway. Apache Livy is a service that enables interaction with Amazon EMR over a REST interface.
Keep in mind that any port on which you allow inbound traffic represents a potential security vulnerability. Carefully review custom security groups to ensure that you minimize vulnerabilities. For more information, see Control network traffic with security groups.
Alternatively, see Blogs and whitepapers for a detailed walkthrough of how to enable Kerberos on Amazon EMR, set the cluster in a private subnet, and access the cluster using a Network Load Balancer (NLB) to expose only specific ports, which are access-controlled via security groups.
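To support the review of custom security groups recommended above, the following sketch flags inbound rules that are open to any address. It assumes the input is the `SecurityGroups` list returned by the EC2 `describe_security_groups` API. The function name is hypothetical, and the check is deliberately narrow: it reports only rules whose source is `0.0.0.0/0` or `::/0`.

```python
from typing import Any, Dict, List, Tuple

OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def world_open_rules(
    security_groups: List[Dict[str, Any]]
) -> List[Tuple[str, Any, Any, Any]]:
    """Return (group ID, protocol, from port, to port) for rules open to any address."""
    findings = []
    for group in security_groups:
        for permission in group.get("IpPermissions", []):
            # Collect every IPv4 and IPv6 source range attached to this rule.
            cidrs = [r.get("CidrIp") for r in permission.get("IpRanges", [])]
            cidrs += [r.get("CidrIpv6") for r in permission.get("Ipv6Ranges", [])]
            if OPEN_CIDRS.intersection(cidrs):
                findings.append(
                    (
                        group["GroupId"],
                        permission.get("IpProtocol"),
                        permission.get("FromPort"),
                        permission.get("ToPort"),
                    )
                )
    return findings
```

With boto3, the input could come from `ec2.describe_security_groups(GroupIds=[...])["SecurityGroups"]`. An empty result means that no rule in the inspected groups accepts traffic from every address. It does not mean that the remaining rules are appropriately scoped.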
Note
When connecting to your Apache Livy endpoint through the public internet, we recommend that you secure communications between Studio or Studio Classic and your Amazon EMR cluster using TLS.
For information on setting up HTTPS with Apache Livy, see Enabling HTTPS with Apache Livy. For information on setting an Amazon EMR cluster with transit encryption enabled, see Providing certificates for encrypting data in transit with Amazon EMR encryption. Additionally, you need to configure Studio or Studio Classic to access your certificate key as specified in Connect to an Amazon EMR cluster over HTTPS.
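Once HTTPS is enabled, you can confirm from any Python environment that the Apache Livy endpoint presents a certificate your client trusts. The following sketch is a standalone check, not the way Studio or Studio Classic connects. It assumes that Livy listens on port 8998 and that `ca_file` points to the CA certificate that signed the cluster certificate. The function names are hypothetical, and `GET /sessions` is the Livy REST call that lists sessions.

```python
import json
import ssl
import urllib.request
from typing import Any, Optional


def livy_sessions_url(host: str, port: int = 8998) -> str:
    """Return the HTTPS URL of the Livy REST endpoint that lists sessions."""
    return f"https://{host}:{port}/sessions"


def list_livy_sessions(
    host: str,
    ca_file: Optional[str] = None,
    port: int = 8998,
    timeout: float = 10.0,
) -> Any:
    """Fetch the session list, verifying the server certificate.

    When ca_file is None, the system's default trusted CAs are used.
    """
    context = ssl.create_default_context(cafile=ca_file)
    url = livy_sessions_url(host, port)
    with urllib.request.urlopen(url, context=context, timeout=timeout) as response:
        return json.load(response)
```

A certificate that the client does not trust raises an error instead of returning data, which is the behavior you want when verifying a TLS setup.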