Setting up for Amazon Glue Studio
Complete the tasks in this section when you're using Amazon Glue Studio for the first time:
Complete initial Amazon configuration tasks
To use Amazon Glue Studio you must first complete the following tasks:
-
(Recommended) Create an IAM administrator user
-
(Recommended) Create an Amazon user for Amazon Glue Studio.
You can either use the administrator user for creating and managing your ETL jobs, or you can create a separate user for accessing Amazon Glue Studio.
To create additional users for Amazon Glue or Amazon Glue Studio, follow the steps in Creating Your First IAM Delegated User and Group in the IAM User Guide.
Sign up for Amazon
If you do not have an Amazon Web Services account, use the following procedure to create one.
To sign up for Amazon Web Services
Open http://www.amazonaws.cn/ and choose Sign Up. Follow the on-screen instructions.
Create an IAM administrator user
If your account already includes an IAM user with full Amazon administrative permissions, you can skip this section.
Secure IAM users
After you sign up for an Amazon Web Services account, safeguard your administrative user by turning on multi-factor authentication (MFA). For instructions, see Enable a virtual MFA device for an IAM user (console) in the IAM User Guide.
To give other users access to your Amazon Web Services account resources, create IAM users. To secure your IAM users, turn on MFA and only give the IAM users the permissions needed to perform their tasks.
For more information about creating and securing IAM users, see the IAM User Guide.
Sign in as an IAM user
Sign in to the IAM console
For your convenience, the Amazon sign-in page uses a browser cookie to remember your IAM user name and account information. If you previously signed in as a different user, choose the sign-in link beneath the button to return to the main sign-in page. From there, you can enter your Amazon Web Services account ID or account alias to be redirected to the IAM user sign-in page for your account.
Review IAM permissions needed for the Amazon Glue Studio user
To use Amazon Glue Studio, the user must have access to various Amazon resources. The user must be able to view and select Amazon S3 buckets, IAM policies and roles, and Amazon Glue Data Catalog objects.
Amazon Glue service permissions
Amazon Glue Studio uses the actions and resources of the Amazon Glue service. Your user needs permissions on these actions and resources to effectively use Amazon Glue Studio. You can grant the Amazon Glue Studio user the AWSGlueConsoleFullAccess managed policy, or create a custom policy with a smaller set of permissions.
As a security best practice, tighten policies to further restrict access to Amazon S3 buckets and Amazon CloudWatch log groups. For an example Amazon S3 policy, see Writing IAM Policies: How to Grant Access to an Amazon S3 Bucket.
Creating Custom IAM Policies for Amazon Glue Studio
You can create a custom policy with a smaller set of permissions for Amazon Glue Studio. The policy can grant permissions for a subset of objects or actions. Use the following information when creating a custom policy.
To use the Amazon Glue Studio APIs, include glue:UseGlueStudio in the Action list of your IAM policy. Including glue:UseGlueStudio gives you access to all Amazon Glue Studio actions, even as more actions are added to the API over time.
Job Actions
-
GetJob
-
CreateJob
-
DeleteJob
-
GetJobs
-
UpdateJob
Job run Actions
-
StartJobRun
-
GetJobRuns
-
BatchStopJobRun
-
GetJobRun
Database Actions
-
GetDatabases
Plan Actions
-
GetPlan
Table Actions
-
SearchTables
-
GetTables
-
GetTable
Connection Actions
-
CreateConnection
-
DeleteConnection
-
UpdateConnection
-
GetConnections
-
GetConnection
Mapping Actions
-
GetMapping
Security Configuration Actions
-
GetSecurityConfigurations
Script Actions
-
CreateScript (different from API of same name in Amazon Glue)
Accessing Amazon Glue Studio APIs
To access Amazon Glue Studio, add glue:UseGlueStudio to the Action list of your IAM policy.
In the example below, glue:UseGlueStudio is included in the Action list, but the Amazon Glue Studio APIs are not individually identified. That is because including glue:UseGlueStudio automatically grants access to the internal APIs, so you don't have to specify the individual Amazon Glue Studio APIs in the IAM permissions.
The other actions listed in the example (for example, glue:SearchTables) are not Amazon Glue Studio APIs, so you must include them in the IAM permissions as required. You may also want to include Amazon S3 Proxy actions to specify the level of Amazon S3 access to grant.
The example policy below provides access to open Amazon Glue Studio, create a visual job, and save and run it if the selected IAM role has sufficient access.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "glue:UseGlueStudio",
        "iam:ListRoles",
        "iam:ListUsers",
        "iam:ListGroups",
        "iam:ListRolePolicies",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "glue:SearchTables",
        "glue:GetConnections",
        "glue:GetJobs",
        "glue:GetTables",
        "glue:BatchStopJobRun",
        "glue:GetSecurityConfigurations",
        "glue:DeleteJob",
        "glue:GetDatabases",
        "glue:CreateConnection",
        "glue:GetSchema",
        "glue:GetTable",
        "glue:GetMapping",
        "glue:CreateJob",
        "glue:DeleteConnection",
        "glue:CreateScript",
        "glue:UpdateConnection",
        "glue:GetConnection",
        "glue:StartJobRun",
        "glue:GetJobRun",
        "glue:UpdateJob",
        "glue:GetPlan",
        "glue:GetJobRuns",
        "glue:GetTags",
        "glue:GetJob"
      ],
      "Resource": "*"
    },
    {
      "Action": ["iam:PassRole"],
      "Effect": "Allow",
      "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": ["glue.amazonaws.com"]
        }
      }
    }
  ]
}
Notebook and data preview permissions
Data previews and notebooks allow you to see a sample of your data at any stage of your job (reading, transforming, writing), without having to run the job. You specify an Amazon Identity and Access Management (IAM) role for Amazon Glue Studio to use when accessing the data. An IAM role is intended to be assumable and does not have standard long-term credentials, such as a password or access keys, associated with it. Instead, when Amazon Glue Studio assumes the role, IAM provides it with temporary security credentials.
To ensure data previews and notebook commands work correctly, use a role that has a name that starts with the string AWSGlueServiceRole. If you choose to use a different name for your role, then you must add the iam:PassRole permission and configure a policy for the role in IAM. For more information, see Create an IAM policy for roles not named "AWSGlueServiceRole*".
If a role grants the iam:PassRole permission for a notebook, and you implement role chaining, a user could unintentionally gain access to the notebook. There is currently no auditing implemented that would allow you to monitor which users have been granted access to the notebook.
Amazon CloudWatch permissions
You can monitor your Amazon Glue Studio jobs using Amazon CloudWatch, which collects and processes raw data from Amazon Glue into readable, near-real-time metrics. By default, Amazon Glue metrics data is sent to CloudWatch automatically. For more information, see What Is Amazon CloudWatch? in the Amazon CloudWatch User Guide, and Amazon Glue Metrics in the Amazon Glue Developer Guide.
To access CloudWatch dashboards, the user accessing Amazon Glue Studio needs one of the following:
-
The AdministratorAccess policy
-
The CloudWatchFullAccess policy
-
A custom policy that includes one or more of these specific permissions:
-
cloudwatch:GetDashboard and cloudwatch:ListDashboards to view dashboards
-
cloudwatch:PutDashboard to create or modify dashboards
-
cloudwatch:DeleteDashboards to delete dashboards
For more information about changing permissions for an IAM user by using policies, see Changing Permissions for an IAM User in the IAM User Guide.
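As an illustration of the custom-policy option described above, the following is a minimal sketch of a policy that grants all four dashboard permissions. The Sid values are arbitrary labels chosen for this example, and the "*" resource is an assumption made for brevity; you might scope it to specific dashboards or omit the statements you don't need.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleViewDashboards",
      "Effect": "Allow",
      "Action": ["cloudwatch:GetDashboard", "cloudwatch:ListDashboards"],
      "Resource": "*"
    },
    {
      "Sid": "ExampleCreateOrModifyDashboards",
      "Effect": "Allow",
      "Action": ["cloudwatch:PutDashboard"],
      "Resource": "*"
    },
    {
      "Sid": "ExampleDeleteDashboards",
      "Effect": "Allow",
      "Action": ["cloudwatch:DeleteDashboards"],
      "Resource": "*"
    }
  ]
}
```

Keeping only the first statement would give a user read-only access to dashboards.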
Review IAM permissions needed for ETL jobs
When you create a job using Amazon Glue Studio, the job assumes the permissions of the IAM role that you specify when you create it. This IAM role must have permission to extract data from your data source, write data to your target, and access Amazon Glue resources.
The name of the role that you create for the job must start with the string AWSGlueServiceRole for it to be used correctly by Amazon Glue Studio. For example, you might name your role AWSGlueServiceRole-FlightDataJob.
Data source and data target permissions
An Amazon Glue Studio job must have access to Amazon S3 for any sources, targets, scripts, and temporary directories that you use in your job. You can create a policy to provide fine-grained access to specific Amazon S3 resources.
-
Data sources require s3:ListBucket and s3:GetObject permissions.
-
Data targets require s3:ListBucket, s3:PutObject, and s3:DeleteObject permissions.
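The source and target permissions listed above could be expressed in a job role policy along the following lines. This is a sketch under stated assumptions: the bucket names example-source-bucket and example-target-bucket are hypothetical placeholders, the Sid values are arbitrary labels, and the ARNs use the aws partition as in the other examples in this guide (in the China Regions the partition is aws-cn).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HypotheticalListSourceAndTargetBuckets",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::example-source-bucket",
        "arn:aws:s3:::example-target-bucket"
      ]
    },
    {
      "Sid": "HypotheticalReadSourceObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject"],
      "Resource": "arn:aws:s3:::example-source-bucket/*"
    },
    {
      "Sid": "HypotheticalWriteTargetObjects",
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::example-target-bucket/*"
    }
  ]
}
```

Note that s3:ListBucket applies to the bucket ARN itself, while the object-level actions apply to the objects under it (the /* suffix). Locations used for scripts and temporary directories would need similar statements.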
If you choose Amazon Redshift as your data source, you can provide a role for cluster permissions. Jobs that run against an Amazon Redshift cluster issue commands that access Amazon S3 for temporary storage using temporary credentials. If your job runs for more than an hour, these credentials expire, causing the job to fail. To avoid this problem, you can assign a role to the Amazon Redshift cluster itself that grants the necessary permissions to jobs using temporary credentials. For more information, see Moving Data to and from Amazon Redshift in the Amazon Glue Developer Guide.
If the job uses data sources or targets other than Amazon S3, then you must attach the necessary permissions to the IAM role used by the job to access these data sources and targets. For more information, see Setting Up Your Environment to Access Data Stores in the Amazon Glue Developer Guide.
If you're using connectors and connections for your data store, you need additional permissions, as described in Permissions required for using connectors.
Permissions required for deleting jobs
In Amazon Glue Studio you can select multiple jobs in the console to delete. To perform this action, you must have the glue:BatchDeleteJob permission. This is different from the Amazon Glue console, which requires the glue:DeleteJob permission for deleting jobs.
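For a user who deletes jobs from both consoles, one possible statement grants both actions together. This is only a sketch: the Sid is an arbitrary label, and the "*" resource is an assumption that you may want to narrow to specific job ARNs.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleDeleteJobsFromEitherConsole",
      "Effect": "Allow",
      "Action": ["glue:BatchDeleteJob", "glue:DeleteJob"],
      "Resource": "*"
    }
  ]
}
```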
Amazon Key Management Service permissions
If you plan to access Amazon S3 sources and targets that use server-side encryption with Amazon Key Management Service (Amazon KMS), then attach a policy to the Amazon Glue Studio role used by the job that enables the job to decrypt the data. The job role needs the kms:ReEncrypt, kms:GenerateDataKey, and kms:DescribeKey permissions. Additionally, the job role needs the kms:Decrypt permission to upload or download an Amazon S3 object that is encrypted with an Amazon KMS customer master key (CMK).
There are additional charges for using Amazon KMS CMKs. For more information, see Amazon Key Management Service Concepts - Customer Master Keys (CMKs) and Amazon Key Management Service Pricing.
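A job role policy covering the KMS permissions named above might look like the following sketch. Two assumptions are worth calling out: the key ARN is a placeholder (REGION, ACCOUNT-ID, and KEY-ID are not real values), and the re-encryption permission is written as the wildcard kms:ReEncrypt* so that it matches the kms:ReEncryptFrom and kms:ReEncryptTo actions.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HypotheticalUseOfEncryptionKey",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey",
        "kms:DescribeKey"
      ],
      "Resource": "arn:aws:kms:REGION:ACCOUNT-ID:key/KEY-ID"
    }
  ]
}
```

Scoping the Resource element to the specific key used by your buckets, rather than "*", keeps the role from using other keys in the account.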
Permissions required for using connectors
If you're using an Amazon Glue Custom Connector and connection to access a data store, the role used to run the Amazon Glue ETL job needs additional permissions attached:
-
The AWS managed policy AmazonEC2ContainerRegistryReadOnly for accessing connectors purchased from Amazon Web Services Marketplace.
-
The glue:GetJob and glue:GetJobs permissions.
-
Amazon Secrets Manager permissions for accessing secrets that are used with connections. Refer to Example: Permission to retrieve secret values for example IAM policies.
If your Amazon Glue ETL job runs within a VPC, then the VPC must be configured as described in Configure a VPC for your ETL job.
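In addition to attaching the AmazonEC2ContainerRegistryReadOnly managed policy, the Glue and Secrets Manager permissions for connectors could be granted with a policy like the sketch below. The secret name example-connection-secret and the REGION and ACCOUNT-ID placeholders are hypothetical. Treat secretsmanager:GetSecretValue as an assumed minimal permission for reading a secret; your connection may need more.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ExampleReadJobDefinitions",
      "Effect": "Allow",
      "Action": ["glue:GetJob", "glue:GetJobs"],
      "Resource": "*"
    },
    {
      "Sid": "HypotheticalReadConnectionSecret",
      "Effect": "Allow",
      "Action": ["secretsmanager:GetSecretValue"],
      "Resource": "arn:aws:secretsmanager:REGION:ACCOUNT-ID:secret:example-connection-secret-*"
    }
  ]
}
```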
Set up IAM permissions for Amazon Glue Studio
You can create the roles and assign policies to users and job roles by using the Amazon administrator user.
You can use the AWSGlueConsoleFullAccess Amazon managed policy to provide the necessary permissions for using the Amazon Glue Studio console.
To create your own policy, follow the steps documented in Create an IAM Policy for the Amazon Glue Service in the Amazon Glue Developer Guide. Include the IAM permissions described previously in Review IAM permissions needed for the Amazon Glue Studio user.
Create an IAM Role
Amazon Glue Studio needs permissions to access other services on your behalf. You provide those permissions by creating an IAM role and assigning policies to the role. You specify this role when creating jobs, when using the notebook editor, or when using data previews. Amazon Glue Studio or your ETL job assumes the role, gaining temporary permissions to access other services and data locations.
You need to grant your IAM role permissions that Amazon Glue Studio and Amazon Glue can assume when calling other services on your behalf. This includes access to Amazon S3 for storing scripts and temporary files, and any other sources or targets that you use with Amazon Glue Studio.
To create a role for your ETL jobs
-
In the Amazon Web Services Management Console, open the IAM console at https://console.aws.amazon.com/iam/ and choose Roles, then Create role in the left navigation pane.
-
For role type, choose Amazon Service, find and choose Glue, and choose Next: Permissions.
-
On the Attach permissions policy page, choose the policies that contain the required permissions. For example, you might choose the Amazon managed policy AWSGlueServiceRole for general Amazon Glue Studio and Amazon Glue permissions and the Amazon managed policy AmazonS3FullAccess for access to Amazon S3 resources.
-
Add additional policies as needed for additional data stores or services.
-
Choose Next: Review.
-
For Role name, enter a name for your role; for example, AWSGlueServiceRole-Studio. Choose a name that begins with the string AWSGlueServiceRole to allow the role to be passed from console users to the service.
If you choose a different name for your role, you must add a policy that grants your users the iam:PassRole permission for IAM roles that match your naming convention.
-
Choose Create Role to finish creating the role.
Attach policies to the Amazon Glue Studio user
Any Amazon user that signs in to the Amazon Glue Studio console must have permissions to access specific resources. You provide those permissions by assigning IAM policies to the user.
To attach the AWSGlueConsoleFullAccess managed policy to a user
-
Sign in to the Amazon Web Services Management Console and open the IAM console at https://console.amazonaws.cn/iam/.
-
In the navigation pane, choose Policies.
-
In the list of policies, select the check box next to AWSGlueConsoleFullAccess. You can use the Filter menu and the search box to filter the list of policies.
-
Choose Policy actions, and then choose Attach.
-
Choose the user to attach the policy to. You can use the Filter menu and the search box to filter the list of principal entities. After choosing the user to attach the policy to, choose Attach policy.
-
Repeat the previous steps to attach additional policies to the user, as needed.
Create an IAM policy for roles not named "AWSGlueServiceRole*"
To configure an IAM policy for roles used by Amazon Glue Studio
-
Sign in to the Amazon Web Services Management Console and open the IAM console at https://console.amazonaws.cn/iam/.
-
Add a new IAM policy. You can add to an existing policy or create a new IAM inline policy. To create an IAM policy:
Choose Policies, and then choose Create Policy. If a Get Started button appears, choose it, and then choose Create Policy.
Next to Create Your Own Policy, choose Select.
For Policy Name, type any value that is easy for you to refer to later. Optionally, type descriptive text in Description.
For Policy Document, type a policy statement with the following format, and then choose Create Policy:
-
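Copy and paste the following blocks into the policy under the "Statement" array. In the Resource element, replace the AWSGlueServiceRole* pattern with one that matches the naming convention of your roles.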
{
  "Action": ["iam:PassRole"],
  "Effect": "Allow",
  "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
  "Condition": {
    "StringLike": {
      "iam:PassedToService": ["glue.amazonaws.com"]
    }
  }
},
{
  "Effect": "Allow",
  "Principal": {
    "Service": ["glue.amazonaws.com"]
  },
  "Action": "sts:AssumeRole"
}
Here is the full example, with the Version and Statement elements included in the policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": ["iam:PassRole"],
      "Effect": "Allow",
      "Resource": "arn:aws:iam::*:role/AWSGlueServiceRole*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": ["glue.amazonaws.com"]
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "Service": ["glue.amazonaws.com"]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
To enable the policy for a user, choose Users.
Choose the user to whom you want to attach the policy.
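As a variation on the statements above, the following sketch shows how the iam:PassRole statement might look for roles that follow a different naming convention. The prefix ExampleGlueStudioRole is a hypothetical placeholder for your own convention, and the Sid is an arbitrary label.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "HypotheticalPassCustomNamedRoles",
      "Effect": "Allow",
      "Action": ["iam:PassRole"],
      "Resource": "arn:aws:iam::*:role/ExampleGlueStudioRole*",
      "Condition": {
        "StringLike": {
          "iam:PassedToService": ["glue.amazonaws.com"]
        }
      }
    }
  ]
}
```

This sketch leaves out the sts:AssumeRole statement on purpose. IAM accepts a Principal element only in resource-based policies, so a statement of that shape would normally go in the trust policy of the role itself, not in a policy attached to a user.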
Configure a VPC for your ETL job
You can use Amazon Virtual Private Cloud (Amazon VPC) to define a virtual network in your own logically isolated area within the Amazon Web Services Cloud, known as a virtual private cloud (VPC). You can launch your Amazon resources, such as instances, into your VPC. Your VPC closely resembles a traditional network that you might operate in your own data center, with the benefits of using the scalable infrastructure of Amazon. You can configure your VPC; you can select its IP address range, create subnets, and configure route tables, network gateways, and security settings. You can connect instances in your VPC to the internet. You can connect your VPC to your own corporate data center, making the Amazon Web Services Cloud an extension of your data center. To protect the resources in each subnet, you can use multiple layers of security, including security groups and network access control lists. For more information, see the Amazon VPC User Guide.
You can configure your Amazon Glue ETL jobs to run within a VPC when using connectors. You must configure your VPC for the following, as needed:
-
Public network access for data stores not in Amazon. All data stores that are accessed by the job must be available from the VPC subnet.
-
If your job needs to access both VPC resources and the public internet, the VPC needs to have a network address translation (NAT) gateway inside the VPC.
For more information, see Setting Up Your Environment to Access Data Stores in the Amazon Glue Developer Guide.
Populate the Amazon Glue Data Catalog
Amazon Glue Studio can use datasets that are defined in the Amazon Glue Data Catalog. These datasets are used as sources and targets for ETL workflows in Amazon Glue Studio. If you choose the Data Catalog for your data source or target, then the Data Catalog tables related to your data source or data target must exist prior to creating a job.
When reading from or writing to a data source, your ETL job needs to know the schema of the data. The ETL job can get this information from a table in the Amazon Glue Data Catalog. You can use a crawler, the Amazon Glue console, Amazon CLI, or an Amazon CloudFormation template file to add databases and tables to the Data Catalog. For more information about populating the Data Catalog, see Data Catalog in the Amazon Glue Developer Guide.
When using connectors, you can use the schema builder to enter the schema information when you configure the data source node of your ETL job in Amazon Glue Studio. For more information, see Authoring jobs with custom connectors.
For some data sources, Amazon Glue Studio can automatically infer the schema of the data it reads from the files at the specified location.
-
For Amazon S3 data sources, you can find more information at Using files in Amazon S3 for the data source.
-
For streaming data sources, you can find more information at Using a streaming data source.