Tutorial: Getting started with Amazon Glue Studio
You can use Amazon Glue Studio to create jobs that extract structured or semi-structured data from a data source, perform a transformation of that data, and save the result set in a data target.
In this tutorial, you will create a job in Amazon Glue Studio using Amazon S3 as the Source and Target. By completing these steps, you will learn how visual jobs are created and how to edit nodes, the component building blocks in the visual job editor.
You will learn how to:
- Configure the data source node. In this tutorial, you will set the data source to Amazon S3.
- Apply and edit a transform node. In this tutorial, you will apply the ApplyMapping transform to the job.
- Configure the data target node. In this tutorial, you will set the data target to Amazon S3.
- View and edit the job script.
- Run the job and view run details for the job.
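Behind the scenes, Amazon Glue Studio generates a PySpark script for the visual job, which you can inspect on the Script tab later in this tutorial. As a rough sketch (not the exact script this tutorial produces), a generated script is built around the following standard boilerplate:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard setup that appears at the top of a generated Amazon Glue job script
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# The source, transform, and target nodes you configure in the visual editor
# are generated here as code (sketches appear in the later steps).

job.commit()
```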
Prerequisites
This tutorial has the following prerequisites:
- You have an Amazon account.
- You have access to Amazon Glue Studio.
- Your account has all the necessary permissions for creating and running a job for an Amazon S3 data source and data target. For more information, see Setting up for Amazon Glue Studio.
Launch the Amazon CloudFormation stack
The Amazon CloudFormation stack has all the resources you need to complete this tutorial.
- Launch the following Amazon CloudFormation stack to create resources for this tutorial by clicking on the button, then follow the steps to complete the process.
- Name the Amazon CloudFormation stack CreateJob-Tutorial.
- Then, select the I acknowledge that Amazon CloudFormation might create IAM resources with custom names option.
- Choose Create stack.
Launching this stack creates Amazon resources. The following resources shown in the Amazon CloudFormation output are the ones you need in the next steps (listed as Key – Description):
- AWSGlueStudioRole – IAM role used to run Amazon Glue jobs
- AWSGlueStudioS3Bucket – Name of the Amazon S3 bucket to store blog-related files
- AWSGlueStudioTicketsYYZDB – Amazon Glue Data Catalog database
- AWSGlueStudioTableTickets – Data Catalog table to use as a source
- AWSGlueStudioTableTrials – Data Catalog table to use as a source
- AWSGlueStudioParkingTicketCount – Data Catalog table to use as the destination
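If you prefer to read these output values programmatically instead of from the console, a minimal boto3 sketch such as the following can print them (it assumes your credentials are configured and uses the stack name from the previous step):

```python
import boto3

# List the outputs of the tutorial stack, such as the IAM role and bucket names
cloudformation = boto3.client("cloudformation")
stack = cloudformation.describe_stacks(StackName="CreateJob-Tutorial")["Stacks"][0]

for output in stack.get("Outputs", []):
    print(f'{output["OutputKey"]}: {output["OutputValue"]}')
```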
Step 1: Start the job creation process
In this task, you start the job creation process by using a template.
To create a job, starting with a template
- Sign in to the Amazon Web Services Management Console and open the Amazon Glue Studio console at https://console.amazonaws.cn/gluestudio/.
- On the Amazon Glue Studio landing page, choose View jobs under the heading Create and manage jobs.
- On the Jobs page, under the heading Create job, the following options will be selected by default:
  - Visual with a source and target
  - For the Source: Amazon Simple Storage Service
  - For the Target: Amazon Simple Storage Service
- Choose the Create button to start the job creation process.
The job editing page opens with a simple three-node job diagram displayed.

- A – The visual job editor canvas. This is where you can add nodes to create a job.
- B – A visual job is represented by nodes on the canvas. When a node is selected, it is highlighted by a blue line.
- C – The node panel contains several tabs: Node properties, Output schema, and Data preview. When a node is selected, the node panel is displayed, along with an additional tab unique to that node for further configuration. For more information, see Job editor features.
- D – The job editor tab ribbon. By default, Visual is selected. You can also choose Script, Job details, Runs, and Schedules. Runs and Schedules are available after the job has been run. For more information, see Editing ETL jobs in Amazon Glue Studio.
- E – The node toolbar provides actions to add Source, Transform, and Target nodes, undo and redo actions, remove nodes, and zoom in and out across the job editing canvas. For more information, see Editing ETL jobs in Amazon Glue Studio.
- F – By default, the job is named 'Untitled job'. Click the text box to change the job name to a unique name.
- G – The job editor action menus allow you to save, run, and delete the job. The Actions drop-down menu also provides additional options when running the job.
Step 2: Edit the data source node in the job diagram
Choose the Data source - S3 bucket node in the job diagram to edit the data source properties.
To edit the data source node
- By default, the Data source properties - Amazon S3 tab is displayed.
- By default, the Data Catalog table option for the Amazon S3 source type is already selected. The source type is determined by the Node type setting on the Node properties tab, which is Amazon S3 by default.
- For Database, choose the yyz-tickets database from the list of available databases in your Amazon Glue Data Catalog. This database was already created for you when you launched the Amazon CloudFormation stack earlier in this tutorial.
- For Table, click the drop-down menu and then choose the tickets table from your Amazon Glue Data Catalog. This table was already created for you when you launched the Amazon CloudFormation stack earlier in this tutorial.
After you have provided the required information for the data source node, a green check mark appears on the node in the job diagram.
- (Optional) Choose the Output schema tab in the node details panel to view the data schema.
- (Optional) On the Node properties tab in the node details panel, for Name, enter a name that is unique for this job.
The value you enter is used as the label for the data source node in the job diagram. If you use unique names for the nodes in your job, then it's easier to identify each node in the job diagram, and also to select parent nodes.
You can also set the node type. Changing the node type will change the fields in the data source properties tab.
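In the script that Amazon Glue Studio generates for this job, the data source node corresponds to a Data Catalog read. A minimal sketch of that call, assuming the glueContext object from the job boilerplate shown earlier and the database and table chosen above (the exact variable names in the generated script will differ):

```python
# Read the tickets table from the yyz-tickets Data Catalog database into a DynamicFrame
tickets_source = glueContext.create_dynamic_frame.from_catalog(
    database="yyz-tickets",
    table_name="tickets",
    transformation_ctx="tickets_source",  # identifier used for job bookmarks
)
```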
Step 3: Edit the transform node of the job
The transform node is where you specify how you want to modify the data from its original format. An ApplyMapping transform enables you to rename data property keys, change the data types, and drop columns from the dataset.
When you edit the Transform - ApplyMapping node, the original schema for your data is shown in the Source key column in the node details panel. This is the data property key name (column name) that is obtained from the source data and stored in the table in the Amazon Glue Data Catalog.
The Target key column shows the key name that will appear in the data target. You can use this field to change the data property key name in the output. The Data type column shows the data type of the key and allows you to change it to a different data type for the target. The Drop column contains a check box. Select this check box to drop a field from the target schema.
To edit the transform node
- Choose the Transform - ApplyMapping node in the job diagram to edit the data transformation properties.
- In the node details panel, on the Node properties tab, review the information. Change the name of the node to Ticket_Mapping.
- Choose the Transform tab in the node details panel.
- Drop the following keys by selecting the check box in the Drop column for each one:
  - location1
  - location2
  - location3
  - location4
  - province
- For the source key officer, change the Target key value to officer_name.
- Change the data type for the ticket_number and set_fine_amount keys to float. When changing the data type, you must verify that the data type is supported by your target.
- (Optional) Choose the Output schema tab in the node details panel to view the modified schema.
Notice that the Transform - ApplyMapping node in the job diagram now has a green check mark, indicating that the node has been edited and has all the required information.
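In the generated script, this node becomes a call to the ApplyMapping transform. The sketch below mirrors the edits above; the source data types shown are assumptions, because the real types come from the Data Catalog table schema:

```python
from awsglue.transforms import ApplyMapping

# Keys checked in the Drop column (location1-location4 and province) are dropped
# simply by omitting them from the mappings list. Each mapping is
# (source key, source type, target key, target type).
ticket_mapping = ApplyMapping.apply(
    frame=tickets_source,
    mappings=[
        ("ticket_number", "int", "ticket_number", "float"),      # cast to float
        ("set_fine_amount", "int", "set_fine_amount", "float"),  # cast to float
        ("officer", "string", "officer_name", "string"),         # rename the key
        # ...any other keys you keep are passed through unchanged here...
    ],
    transformation_ctx="ticket_mapping",
)
```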
Step 4: Edit the data target node of the job
A data target node determines where the transformed output is sent. The location can be an Amazon S3 bucket, a Data Catalog table, or a connector and connection. If you choose a Data Catalog table, the data is written to the location associated with that table. For example, if you use a crawler to create a table in the Data Catalog for a JDBC target, the data is written to that JDBC table.
To edit the data target node
- Choose the Data target - S3 bucket node in the job diagram to edit the data target properties.
- In the node details panel on the right, choose the Node properties tab. For Name, enter a unique name for the node.
- Choose the Data target properties - S3 tab.
- For each field, make the following selections. For more information about the available options, see Overview of data target options.
  - Format: Parquet
  - Compression Type: GZIP
  - S3 Target Location: Choose the Browse S3 button to see the Amazon S3 buckets that you have access to. Choose an Amazon S3 bucket as the target destination.
  - Data Catalog update options: Do not update the Data Catalog
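In the generated script, these target settings correspond to an Amazon S3 write. A minimal sketch, assuming the ticket_mapping frame from the previous step and a placeholder bucket path that you would replace with the bucket you chose:

```python
# Write the transformed data to Amazon S3 as gzip-compressed Parquet,
# without updating the Data Catalog.
glueContext.write_dynamic_frame.from_options(
    frame=ticket_mapping,
    connection_type="s3",
    connection_options={"path": "s3://your-target-bucket/parking-tickets/"},  # placeholder path
    format="glueparquet",
    format_options={"compression": "gzip"},
    transformation_ctx="s3_target",
)
```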
Step 5: Specify the job details and save the job
Before you can save and run your extract, transform, and load (ETL) job, you must first enter additional information about the job itself.
To specify the job details and save the job
- Choose the Job details tab.
- Enter a name for the job. Provide a UTF-8 string with a maximum length of 255 characters.
- (Optional) Enter a description of the job. Descriptions can be up to 2048 characters long.
- For the IAM role, choose AWSGlueStudioRole from the list of available roles.
  Note: The Amazon Identity and Access Management (IAM) role is used to authorize access to resources that are used to run the job. You can only choose roles that already exist in your account. The role you choose must have permission to access your Amazon S3 sources, targets, temporary directory, scripts, and any libraries used by the job, as well as access to Amazon Glue service resources. For the steps to create a role, see Create an IAM Role for Amazon Glue in the Amazon Glue Developer Guide. You might have to add access to the target Amazon S3 bucket to this role.
  If you have many roles to choose from, you can start entering part of the role name in the IAM role search field, and the roles with the matching text string will be displayed. For example, you can enter 'tutorial' in the search field to find all roles with tutorial (case-insensitive) in the name.
- For the remaining fields, use the default values.
- Choose Save in the top-right corner of the page.
You should see a notification at the top of the page that the job was successfully saved.
If you don't see a notification that your job was successfully saved, then there is most likely information missing that prevents the job from being saved.
- Review the job in the visual editor, and choose any node that doesn't have a green check mark.
- If any of the tabs above the visual editor pane have a callout, choose that tab and look for any fields that are highlighted in red.
Step 6: Run the job
Now that the job has been saved, you can run the job.
- Choose the Run button at the top of the page. You should then see a notification that the job was successfully started. You can also choose the Runs tab and choose Run jobs.
- To view the job run details, click the Run Details link in the notification, or choose the Runs tab to view the run status of the job.
- To view the job run details on the Runs tab, view the job run detail card for the most recent job run. For more information about job run information, see View information for recent job runs.
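You can also start and check a job run programmatically. A short boto3 sketch, assuming the job name you entered in Step 5 (shown here as a placeholder):

```python
import boto3

# Start a run of the saved job and check its status (replace the placeholder job name)
glue = boto3.client("glue")
run_id = glue.start_job_run(JobName="your-tutorial-job")["JobRunId"]

state = glue.get_job_run(JobName="your-tutorial-job", RunId=run_id)["JobRun"]["JobRunState"]
print(f"Job run {run_id} is {state}")
```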
Congratulations on completing this tutorial! You have learned how to create a visual job, edit nodes, inspect the job script, save and run a job, and view run details.
Next steps
After you start the job run, you might want to try some of the following tasks:
- View the job monitoring dashboard – Accessing the job monitoring dashboard.
- Try a different transform on the data – Editing Amazon Glue managed data transform nodes.
- View the jobs that exist in your account – View your jobs.
- Run the job using a time-based schedule – Schedule job runs.