

# Building visual ETL jobs

## Build visual ETL jobs with Amazon Glue Studio


 Amazon Glue Studio provides a visual interface for creating, running, and monitoring Extract/Transform/Load (ETL) jobs in Amazon Glue. A job in Amazon Glue consists of the business logic that performs extract, transform, and load (ETL) work. With Amazon Glue Studio, you can visually compose data transformation workflows and seamlessly run them on Amazon Glue's Apache Spark-based serverless ETL engine. You can create jobs that move and transform data between various data stores and streams using a drag-and-drop interface without having to learn Spark or write code. 

An Amazon Glue job encapsulates a script that connects to your source data, processes it, and then writes it out to your data target. Typically, a job runs extract, transform, and load (ETL) scripts. Jobs can run scripts designed for Apache Spark and Ray runtime environments. Jobs can also run general-purpose Python scripts (Python shell jobs). Amazon Glue triggers can start jobs based on a schedule or event, or on demand. You can monitor job runs to understand runtime metrics such as completion status, duration, and start time.
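The same job lifecycle can also be driven programmatically. The following is a minimal sketch using the boto3 Glue client (the region is illustrative, and the calls require AWS credentials at runtime):

```python
def start_job(job_name, arguments=None, region="us-east-1"):
    """Start a Glue job run and return its run ID (requires AWS credentials)."""
    import boto3  # imported here so the sketch can be read without boto3 installed
    glue = boto3.client("glue", region_name=region)
    resp = glue.start_job_run(JobName=job_name, Arguments=arguments or {})
    return resp["JobRunId"]

def run_summary(job_run):
    """Extract the runtime metrics mentioned above from a get_job_run result."""
    return {
        "status": job_run.get("JobRunState"),        # e.g. SUCCEEDED, FAILED
        "started": job_run.get("StartedOn"),         # start time
        "duration_s": job_run.get("ExecutionTime"),  # duration in seconds
    }
```

To monitor a run, you would poll `glue.get_job_run(JobName=..., RunId=...)` and pass its `JobRun` field to `run_summary`.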

You can use scripts that Amazon Glue generates or you can provide your own. With a source schema and target location or schema, the Amazon Glue Studio code generator can automatically create an Apache Spark API (PySpark) script. You can use this script as a starting point and edit it to meet your goals.
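As a rough illustration of the kind of script the code generator produces, the sketch below wraps the typical structure in a function. The catalog database, table, column names, and S3 path are hypothetical, and the `awsglue` modules are only available inside the Glue runtime:

```python
def main():
    # These modules exist in the Glue Spark runtime, not in a local Python install.
    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source node: read a (hypothetical) Data Catalog table.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_db", table_name="example_table")

    # Transform node: mapping tuples are (source, source_type, target, target_type).
    mapped = ApplyMapping.apply(frame=source, mappings=[
        ("id", "long", "id", "long"),
        ("name", "string", "customer_name", "string"),
    ])

    # Target node: write Parquet to a (hypothetical) S3 location.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped, connection_type="s3",
        connection_options={"path": "s3://example-bucket/output/"},
        format="parquet")
    job.commit()
```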

Amazon Glue can write output files in several data formats. Each job type may support different output formats. For some data formats, common compression formats can be written. 

### Managing Amazon Glue Jobs in the Amazon Console

To view existing jobs, sign in to the Amazon Web Services Management Console and open the Amazon Glue console at [https://console.amazonaws.cn/glue/](https://console.amazonaws.cn/glue/). Then choose the **Jobs** tab in Amazon Glue. The **Jobs** list displays the location of the script that is associated with each job, when the job was last modified, and the current job bookmark option. 

 You can create jobs in the **ETL** section of the Amazon Glue console. While creating a new job, or after you have saved your job, you can use Amazon Glue Studio to modify your ETL jobs. You can do this by editing the nodes in the visual editor or by editing the job script in developer mode. You can also add and remove nodes in the visual editor to create more complex ETL jobs. 

### Next steps for creating a job in Amazon Glue Studio


You use the visual job editor to configure nodes for your job. Each node represents an action, such as reading data from the source location or applying a transform to the data. Each node you add to your job has properties that provide information about either the data location or the transform.

The next steps for creating and managing your jobs are:
+ [Starting visual ETL jobs in Amazon Glue Studio](edit-nodes-chapter.md)
+ [View the job script](managing-jobs-chapter.md#view-job-script)
+ [Modify the job properties](managing-jobs-chapter.md#edit-jobs-properties)
+ [Save the job](managing-jobs-chapter.md#save-job)
+ [Start a job run](managing-jobs-chapter.md#start-jobs)
+ [View information for recent job runs](managing-jobs-chapter.md#view-job-run-details)
+ [Accessing the job monitoring dashboard](view-job-runs.md#monitoring-accessing-dashboard)

## Build visual ETL flows with Amazon SageMaker


 With an Amazon SageMaker Unified Studio workflow, you can set up and run a series of tasks in Amazon SageMaker Unified Studio. Amazon SageMaker Unified Studio workflows use Apache Airflow to model data processing procedures and orchestrate your Amazon SageMaker Unified Studio code artifacts. For more information, see [Using workflows in Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/workflow-orchestration.html). 

# Starting visual ETL jobs in Amazon Glue Studio

You can use the simple visual interface in Amazon Glue Studio to create your ETL jobs. You use the **Jobs** page to create new jobs. You can also use a script editor or notebook to work directly with code in the Amazon Glue Studio ETL job script.

On the **Jobs** page, you can see all the jobs that you have created either with Amazon Glue Studio or Amazon Glue. You can view, manage, and run your jobs on this page. 

 For another example of how to create ETL jobs with Amazon Glue Studio, see the [blog tutorial](https://aws.amazon.com/blogs/big-data/making-etl-easier-with-aws-glue-studio/). 

## Starting jobs in Amazon Glue Studio


 Amazon Glue allows you to create a job through a visual interface, an interactive code notebook, or a script editor. You can start a job by choosing any of these options, or create a new job based on a sample job. 

 Sample jobs create a job with the tool of your choice. For example, you can create a visual ETL job that joins CSV files into a catalog table, create a job in an interactive code notebook with Amazon Glue for Ray or Amazon Glue for Spark when working with pandas, or create a job in an interactive code notebook with SparkSQL. 

### Creating a job in Amazon Glue Studio from scratch


1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue Studio console at [https://console.amazonaws.cn/gluestudio/](https://console.amazonaws.cn/gluestudio/).

1.  Choose **ETL jobs** from the navigation pane. 

1.  In the **Create job** section, select a configuration option for your job. 

    Options to create a job from scratch: 
   +  **Visual ETL** – author in a visual interface focused on data flow 
   +  **Author using an interactive code notebook** – interactively author jobs in a notebook interface based on Jupyter Notebooks 

      When you select this option, you must provide additional information before creating a notebook authoring session. For more information about how to specify this information, see [Getting started with notebooks in Amazon Glue Studio](notebook-getting-started.md). 
   + **Author code with a script editor** – For those familiar with programming and writing ETL scripts, choose this option to create a new ETL job. Choose the engine (Python shell, Ray, Spark (Python), or Spark (Scala)). Then, choose **Start fresh**, or choose **Upload script** to upload an existing script from a local file. If you choose to use the script editor, you can't use the visual job editor to design or edit your job. 

     A Spark job is run in an Apache Spark environment managed by Amazon Glue. By default, new scripts are coded in Python. To write a new Scala script, see [Creating and editing Scala scripts in Amazon Glue Studio](edit-nodes-script.md#edit-job-scala-script).

### Creating a job in Amazon Glue Studio from an example job


 You can choose to create a job from an example job. In the **Example jobs** section, choose a sample job, then choose **Create sample job**. Creating a sample job from one of the options provides a quick template you can work from. 

1. Sign in to the Amazon Web Services Management Console and open the Amazon Glue Studio console at [https://console.amazonaws.cn/gluestudio/](https://console.amazonaws.cn/gluestudio/).

1.  Choose **ETL jobs** from the navigation pane. 

1.  Select an option to create a job from a sample job: 
   +  **Visual ETL job to join multiple sources** – Read three CSV files, combine the data, change the data types, then write the data to Amazon S3 and catalog it for querying later. 
   +  **Spark notebook using Pandas** – Explore and visualize data using the popular Pandas framework combined with Spark. 
   +  **Spark notebook using SQL** – Use SQL to get started quickly with Apache Spark. Access data through the Amazon Glue Data Catalog and transform it using familiar commands. 

1. Choose **Create sample job**.

# Job editor features


The job editor provides the following features for creating and editing jobs.
+ A visual diagram of your job, with a node for each job task: Data source nodes for reading the data; transform nodes for modifying the data; data target nodes for writing the data.

  You can view and configure the properties of each node in the job diagram. You can also view the schema and sample data for each node in the job diagram. These features help you to verify that your job is modifying and transforming the data in the right way, without having to run the job.
+ A Script viewing and editing tab, where you can modify the code generated for your job.
+ A Job details tab, where you can configure a variety of settings to customize the environment in which your Amazon Glue ETL job runs.
+ A Runs tab, where you can view the current and previous runs of the job, view the status of the job run, and access the logs for the job run.
+ A Data quality tab, where you can apply data quality rules to your job.
+ A Schedules tab, where you can configure the start time for your job, or set up recurring job runs.
+ A Version Control tab, where you can configure a Git service to use with your job.
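
Schedules configured on the **Schedules** tab correspond to Amazon Glue triggers. For reference, here is a hedged boto3 sketch of creating an equivalent scheduled trigger outside the console (the names are hypothetical, and the call requires AWS credentials):

```python
def create_daily_trigger(trigger_name, job_name, region="us-east-1"):
    """Create a scheduled trigger that starts a job daily at 12:00 UTC."""
    import boto3  # imported here so the sketch is readable without boto3 installed
    glue = boto3.client("glue", region_name=region)
    return glue.create_trigger(
        Name=trigger_name,
        Type="SCHEDULED",
        Schedule="cron(0 12 * * ? *)",  # Glue schedules use cron expressions
        Actions=[{"JobName": job_name}],
        StartOnCreation=True,
    )
```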

## Using schema previews in the visual job editor


While creating or editing your job, you can use the **Output schema** tab to view the schema for your data. 

Before you can see the schema, the job editor needs permissions to access the data source. You can specify an IAM role on the Job details tab of the editor or on the **Output schema** tab for a node. If the IAM role has all the necessary permissions to access the data source, you can then view the schema on the **Output schema** tab for a node.

## Using data previews in the visual job editor




[![AWS Videos](http://img.youtube.com/vi/EqmljEWlp0c/0.jpg)](http://www.youtube.com/watch?v=EqmljEWlp0c)




Data previews help you create and test your job using a sample of your data without having to repeatedly run the job. By using data preview, you can:
+ Test an IAM role to make sure you have access to your data sources or data targets.
+ Check that the transform is modifying the data in the intended way. For example, if you use a Filter transform, you can make sure that the filter is selecting the right subset of data.
+ Check your data. If your dataset contains columns with values of multiple types, the data preview shows a list of tuples for these columns. Each tuple contains the data type and its value.
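
To make the last point concrete, here is a plain-Python illustration (not Glue API code) of how a mixed-type column can be rendered as (type, value) tuples:

```python
def choice_view(values):
    """Render a mixed-type column the way data preview does: (type, value) pairs."""
    return [(type(v).__name__, v) for v in values]

# A column whose values are sometimes numbers and sometimes strings:
print(choice_view([7, "7", 7.5]))  # [('int', 7), ('str', '7'), ('float', 7.5)]
```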

**Note**  
 If you use a data preview session and a custom SQL or custom code node, the data preview session will execute the SQL or code block as-is for the entire dataset. 

 While creating or editing your job, you can use the **Data preview** tab beneath the job canvas to view a sample of your data. A new data preview session will start automatically when the role is already configured on the job or a default IAM role has been set up in the account. If a role has not been previously configured, you can start a session by selecting the role. 

**Note**  
 The role you choose for the data preview session will also be used for the job. 



 You can see the status and progress of your session, as well as the session details, by choosing the info icon. 

 When the session is ready, Amazon Glue Studio will load the data for the node you selected. You can view the **% complete** as it progresses. 

![\[The screenshot shows the Data preview tab for a node that has started.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preview-progress.png)


 As you author your visual job, Amazon Glue Studio will automatically update the schema for the selected node when you toggle **Infer schema from session** in the **Output schema** tab. 

![\[The screenshot shows the Data preview tab for a node that has started.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preview-output-schema.png)


 To configure your data preview preferences: 

Choose the settings icon (a gear symbol) to configure your preferences for data previews. These settings apply to all nodes in the job diagram. You can: 
+ Choose to wrap the text from one line to the next. This option is enabled by default.
+ Change the number of rows (the default is 200).
+ Choose an IAM role, or create an IAM role if needed.
+ Choose to automatically start a new session when you author a job. This provisions a new interactive session when authoring jobs. **This setting applies at the account level.** Once set, it applies to all users in your account when editing any job.
+ Choose to automatically infer schema. Output schemas are automatically inferred for the selected node.
+ Choose to automatically import Amazon Glue libraries. This prevents data preview from restarting sessions when you add new transforms that would otherwise require a session restart.

 Additional features include the ability to: 
+ Choose the **Previewing x of y fields** button to select which columns (fields) to preview. When you preview your data using the default settings, the job editor shows the first 5 columns of your dataset. You can change this to show all or none (not recommended). 
+ Scroll through the data preview window both horizontally and vertically. 
+ Use the maximize button to expand the **Data preview** tab to overlay the job graph so you can better view the data and data structures. Similarly, use the minimize button to minimize the **Data preview** tab. You can also grab the handle pane and drag up to expand the **Data preview** tab.  
![\[The screenshot shows the data preview pane with the minimize and maximize buttons highlighted, as well as the handle pane that you can use to extend the data preview pane vertically.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preview-maximize-handle.png)
+ Use **End session** to stop the data preview. When you stop the session, you can choose a new IAM role, change additional settings (such as turning on or off automatically starting a new session, inferring schema, or importing Amazon Glue libraries), and start the session again.

## Restrictions when using data previews


When using data previews, you might encounter the following restrictions or limitations. 
+ The first time you choose the Data preview tab, you must choose an IAM role. This role must have the necessary permissions to access the data and other resources needed to create the data previews.
+ After you provide an IAM role, it takes a while before the data is available for viewing. For datasets with less than 1 GB of data, it can take up to one minute. If you have a large dataset, you should use partitions to improve the loading time. Loading data directly from Amazon S3 has the best performance.
+ If you have a very large dataset and it takes more than 15 minutes to query the data for the data preview, the request times out. Data previews also have a 30-minute idle timeout. To avoid these timeouts, reduce the dataset size before using data previews.
+ By default, you see the first 50 columns in the Data preview tab. If the columns have no data values, you will get a message that there is no data to display. You can increase the number of rows sampled, or select different columns, to see data values.
+ Data previews are currently not supported for streaming data sources, or for data sources that use custom connectors.
+ Errors on one node affect the entire job. If any one node has an error with data previews, the error shows up on all nodes until you correct it.
+ If you change a data source for the job, then the child nodes of that data source might need to be updated to match the new schema. For example, if you have an ApplyMapping node that modifies a column, and the column does not exist in the replacement data source, you will need to update the ApplyMapping transform node.
+ If you view the Data preview tab for a SQL query transform node, and the SQL query uses an incorrect field name, the Data preview tab shows an error. 
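
The ApplyMapping pitfall above can be checked mechanically. As a hedged, plain-Python sketch using the (source, source_type, target, target_type) mapping tuples that ApplyMapping takes (the column names are hypothetical), this helper reports mappings whose source column no longer exists after a data source swap:

```python
def missing_sources(mappings, schema_columns):
    """Return mapping source columns that are absent from the source schema."""
    cols = set(schema_columns)
    return [src for src, _stype, _dst, _dtype in mappings if src not in cols]

mappings = [("id", "long", "id", "long"),
            ("state", "string", "region", "string")]
# After swapping the data source, the new schema lacks the "state" column:
print(missing_sources(mappings, ["id", "name"]))  # ['state']
```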

## Script code generation


When you use the visual editor to create a job, the ETL code is automatically generated for you. Amazon Glue Studio creates a functional and complete job script, and saves it in an Amazon S3 location.

There are two forms of code generated by Amazon Glue Studio: the original, or Classic version, and a newer, streamlined version. By default, the new code generator is used to create the job script. You can generate a job script using the classic code generator on the **Script** tab by choosing the **Generate classic script** toggle button.

Some of the differences in the new version of the generated code include:
+ Large comment blocks are no longer added to the script
+ Output structures in the code use the node name that you specify in the visual editor. In the classic script, the output structures are simply named `DataSource0`, `DataSource1`, `Transform0`, `Transform1`, `DataSink0`, `DataSink1`, and so on.
+ Long commands are split across multiple lines to remove the need to scroll across the page to see the entire command.

New features in Amazon Glue Studio require the new version of code generation, and will not work with the classic code script. You are prompted to update these jobs when you attempt to run them.

# Transform data with Amazon Glue managed transforms


 Amazon Glue Studio provides two types of transforms: 
+  **Amazon Glue-native transforms** – Available to all users and managed by Amazon Glue. 
+  **Custom visual transforms** – Allow you to upload your own transforms to use in Amazon Glue Studio. 

## Amazon Glue managed data transform nodes


Amazon Glue Studio provides a set of built-in transforms that you can use to process your data. Your data passes from one node in the job diagram to another in a data structure called a `DynamicFrame`, which is an extension to an Apache Spark SQL `DataFrame`.
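
Because a `DynamicFrame` extends a Spark `DataFrame`, you can convert between the two to mix Glue transforms with native Spark operations. A minimal sketch (runnable only inside the Glue runtime, where the `awsglue` module is available):

```python
def to_dataframe_and_back(dyf, glue_context, name="converted"):
    """Round-trip a DynamicFrame through a Spark DataFrame."""
    from awsglue.dynamicframe import DynamicFrame  # Glue runtime only
    df = dyf.toDF()           # DynamicFrame -> Spark SQL DataFrame
    df = df.dropDuplicates()  # any DataFrame operation works here
    return DynamicFrame.fromDF(df, glue_context, name)
```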

In the pre-populated diagram for a job, between the data source and data target nodes is the **Change Schema** transform node. You can configure this transform node to modify your data, or you can use additional transforms. 

The following built-in transforms are available with Amazon Glue Studio:
+ **[ChangeSchema](transforms-configure-applymapping.md)**: Map data property keys in the data source to data property keys in the data target. You can rename keys, modify the data types for keys, and choose which keys to drop from the dataset.
+ **[SelectFields](transforms-configure-select-fields.md)**: Choose the data property keys that you want to keep.
+ **[DropFields](transforms-configure-drop-fields.md)**: Choose the data property keys that you want to drop.
+ **[RenameField](transforms-configure-rename-field.md)**: Rename a single data property key.
+ **[Spigot](transforms-configure-spigot.md)**: Write samples of the data to an Amazon S3 bucket.
+ **[Join](transforms-configure-join.md)**: Join two datasets into one dataset using a comparison phrase on the specified data property keys. You can use inner, outer, left, right, left semi, and left anti joins.
+ **[Union](transforms-configure-union.md)**: Combine rows from more than one data source that have the same schema.
+ **[SplitFields](transforms-configure-split-fields.md)**: Split data property keys into two `DynamicFrames`. Output is a collection of `DynamicFrames`: one with selected data property keys, and one with the remaining data property keys. 
+ **[SelectFromCollection](transforms-selectfromcollection-overview.md)**: Choose one `DynamicFrame` from a collection of `DynamicFrames`. The output is the selected `DynamicFrame`.
+ **[FillMissingValues](transforms-configure-fmv.md)**: Locate records in the dataset that have missing values and add a new field with a suggested value that is determined by imputation.
+ **[Filter](transforms-filter.md)**: Split a dataset into two, based on a filter condition.
+  **[Drop Null Fields](transforms-dropnull-fields.md)**: Removes columns from the dataset if all values in the column are ‘null’. 
+  **[Drop Duplicates](transforms-drop-duplicates.md)**: Removes rows from your data source by choosing to match entire rows or specify keys. 
+ **[SQL](transforms-sql.md)**: Enter SparkSQL code into a text entry field to use a SQL query to transform the data. The output is a single `DynamicFrame`. 
+  **[Aggregate](transforms-aggregate-fields.md)**: Performs a calculation (such as average, sum, min, max) on selected fields and rows, and creates a new field with the newly calculated value(s). 
+ **[Flatten](transforms-flatten.md)**: Extract fields inside structs into top level fields.
+ **[UUID](transforms-uuid.md)**: Add a column with a Universally Unique Identifier for each row.
+ **[Identifier](transforms-identifier.md)**: Add a column with a numeric identifier for each row.
+ **[To timestamp](transforms-to-timestamp.md)**: Convert a column to timestamp type.
+ **[Format timestamp](transforms-format-timestamp.md)**: Convert a timestamp column to a formatted string.
+ **[Conditional Router transform](transforms-conditional-router.md)**: Apply multiple conditions to incoming data. Each row of the incoming data is evaluated by a group filter condition and processed into its corresponding group. 
+  **[Concatenate Columns transform](transforms-concatenate-columns.md)**: Build a new string column using the values of other columns with an optional spacer. 
+  **[Split String transform](transforms-split-string.md)**: Break up a string into an array of tokens using a regular expression to define how the split is done. 
+  **[Array To Columns transform](transforms-array-to-columns.md)**: Extract some or all the elements of a column of type array into new columns. 
+  **[Add Current Timestamp transform](transforms-add-current-timestamp.md)**: Mark the rows with the time on which the data was processed. This is useful for auditing purposes or to track latency in the data pipeline. 
+  **[Pivot Rows to Columns transform](transforms-pivot-rows-to-columns.md)**: Aggregate a numeric column by rotating unique values on selected columns which become new columns. If multiple columns are selected, the values are concatenated to name the new columns. 
+  **[Unpivot Columns To Rows transform](transforms-unpivot-columns-to-rows.md)**: Convert columns into values of new columns generating a row for each unique value. 
+  **[Autobalance Processing transform](transforms-autobalance-processing.md)**: Redistribute the data more evenly among the workers. This is useful when the data is unbalanced, or when the way it arrives from the source doesn't allow enough parallel processing. 
+  **[Derived Column transform](transforms-derived-column.md)**: Define a new column based on a math formula or SQL expression in which you can use other columns in the data, as well as constants and literals. 
+  **[Lookup transform](transforms-lookup.md)**: Add columns from a defined catalog table when the keys match the defined lookup columns in the data. 
+  **[Explode Array or Map Into Rows transform](transforms-explode-array.md)**: Extract values from a nested structure into individual rows that are easier to manipulate. 
+  **[Record matching transform](transforms-record-matching.md)**: Invoke an existing Record Matching machine learning data classification transform. 
+  **[Remove null rows transform](transforms-remove-null-rows.md)**: Remove from the dataset rows that have all columns as null, or empty. 
+  **[Parse JSON column transform](transforms-parse-json-column.md)**: Parse a string column containing JSON data and convert it to a struct or an array column, depending on whether the JSON is an object or an array, respectively. 
+  **[Extract JSON path transform](transforms-extract-json-path.md)**: Extract new columns from a JSON string column. 
+  **[Extract string fragments from a regular expression](transforms-regex-extractor.md)**: Extract string fragments using a regular expression and create a new column from them, or multiple columns if using regex groups. 
+ **[Custom transform](transforms-custom.md)**: Enter code into a text entry field to use custom transforms. The output is a collection of `DynamicFrames`. 
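
Several of the transforms above have direct counterparts in the `awsglue.transforms` module. As a hedged sketch of how a Join followed by a Filter might look in generated code (the key and field names are hypothetical; runnable only inside the Glue runtime):

```python
def join_and_filter(orders, customers):
    """Join two DynamicFrames on a key, then keep only completed orders."""
    from awsglue.transforms import Join, Filter  # Glue runtime only
    joined = Join.apply(orders, customers, "customer_id", "id")
    # Filter evaluates the predicate once per row of the joined DynamicFrame.
    return Filter.apply(frame=joined, f=lambda row: row["status"] == "COMPLETED")
```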

# Using a data preparation recipe in Amazon Glue Studio

 The **Data preparation recipe** transform allows you to author a data preparation recipe from scratch using an interactive grid style authoring interface. It also allows you to import an existing Amazon Glue DataBrew recipe and then edit it in Amazon Glue Studio. 

 The **Data Preparation Recipe** node is available from the Resource panel. You can connect the **Data Preparation Recipe** node to another node in the visual workflow, whether it is a Data source node or another transformation node. After choosing an Amazon Glue DataBrew recipe and version, the applied steps in the recipe are visible in the node properties tab. 

## Prerequisites

+  If you are importing an Amazon Glue DataBrew recipe, you must have the required IAM permissions as described in [Import an Amazon Glue DataBrew recipe in Amazon Glue Studio](glue-studio-data-preparation-import-recipe.md). 
+  A data preview session must be created. 

## Limitations

+  Amazon Glue DataBrew recipes are only supported in [commercial DataBrew regions](https://docs.aws.amazon.com/general/latest/gr/databrew.html). 
+  Not all Amazon Glue DataBrew recipes are supported by Amazon Glue. Some recipes will not be able to be run in Amazon Glue Studio. 
  +  Recipes with `UNION` and `JOIN` transforms are not supported. However, Amazon Glue Studio already has "Join" and "Union" transform nodes, which can be used before or after a **Data Preparation Recipe** node. 
+  **Data Preparation Recipe** nodes are supported for jobs starting with Amazon Glue version 4.0. This version will be auto-selected after a **Data Preparation Recipe** node is added to the job. 
+  **Data Preparation Recipe** nodes require Python. This is automatically set when the **Data Preparation Recipe** node is added to the job. 
+  Adding a new **Data Preparation Recipe** node to the visual graph will automatically restart your Data Preview session with the correct libraries to use the **Data Preparation Recipe** node. 
+  The following transforms are not supported for import or editing in a **Data Preparation Recipe** node: `GROUP_BY`, `PIVOT`, `UNPIVOT`, and `TRANSPOSE`. 

## Additional features


 When you've selected the **Data Preparation Recipe** transform, you can take additional actions after choosing **Author recipe**. 
+  Add step – you can add additional steps to a recipe as needed by choosing the add step icon, or use the toolbar in the Preview pane by choosing an action.   
![\[The screenshot shows the add recipe icon.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/add-recipe-icon.png)  
![\[The screenshot shows the add recipe icon.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-toolbar.png)
+  Import recipe – choose **More** then **Import recipe** to use in your Amazon Glue Studio job.   
![\[The screenshot shows the more icon.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preparation-recipe-node-more-icon.png)  
![\[The screenshot shows the more icon.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preparation-recipe-node-more-features.png)
+  Download as YAML – choose **More** then **Download as YAML** to download your recipe to save outside of Amazon Glue Studio. 
+  Download as JSON – choose **More** then **Download as JSON** to download your recipe to save outside of Amazon Glue Studio. 
+  Undo and redo recipe steps – You can undo and redo recipe steps in the Preview pane when working with data in the grid.   
![\[The screenshot shows the more icon.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-toolbar-undo-redo.png)

# Author and run data preparation recipes in a visual ETL Amazon Glue job


 In this scenario, you can author data preparation recipes without having to first create them in DataBrew. Before you can start authoring recipes, you must: 
+  Have an active Data Preview session running. When the data preview session is READY, then **Author Recipe** will become active and you can begin authoring or editing your recipe.   
![\[The screenshot shows the Data Preview session as complete.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preparation-recipe-data-preview-complete.png)
+  Ensure that the toggle for **Automatically import glue libraries** is enabled.   
![\[The screenshot shows the option for Automatically import glue libraries toggled on.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preparation-recipe-automatically-import-glue-libraries.png)

   You can do this by choosing the gear icon in the Data Preview pane.   
![\[The screenshot shows the option for Automatically import glue libraries toggled on.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preview-preferences.png)

**To author a data preparation recipe in Amazon Glue Studio:**

1.  Add the **Data Preparation Recipe** transform to your job canvas. Your transform should be connected to a data source node parent. When you add the **Data Preparation Recipe** node, the data preview session restarts with the proper libraries and you will see the data frame being prepared.   
![\[The screenshot shows the data frame loading after adding the Data Preparation Recipe.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preparation-preparing-dataframe.png)

1.  Once the Data Preview session is ready, the data with any previously applied steps appears at the bottom of the screen. 

1.  Choose **Author Recipe**. This will allow you to start a new recipe in Amazon Glue Studio.   
![\[The screenshot shows the Transform panel with the fields for Name and Node parents, as well as option to Author Recipe.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data-preparation-recipe-transform-tab-new.png)

1.  In the **Transform** panel to the right of the job canvas, enter a name for your data preparation recipe. 

1.  On the left-side, the canvas will be replaced with a grid view of your data. To the right, the **Transform** panel will change to show you your recipe steps. Choose **Add step** to add the first step in your recipe.   
![\[The screenshot shows the Transform panel after choosing Add Step. When you choose a column, the options will change dynamically. You can choose to sort, take an action on the column, and filter values.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-preview-data-transform-panel.png)

1.  In the **Transform** panel, choose to sort, take an action on the column, and filter values. For example, choose **Rename column**.   
![\[The screenshot shows the Transform panel after choosing Add Step. When you choose a column, the options will change dynamically. You can choose to sort, take an action on the column, and filter values.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-add-step.png)

1.  In the Transform panel on the right-side, options for renaming a column allow you to choose the source column to rename, and to enter the new column name. Once you have done so, choose **Apply**. 

    You can preview each step, undo a step, and re-order steps and use any of the action icons, such as Filter, Sort, Split, Merge, etc. When you perform actions in the data grid, the steps are added to the recipe in the Transform panel.   
![\[The screenshot shows the Preview data grid with the toolbar highlighted. You can apply an action by using any of the tools and it will be added to the recipe in the Transform panel on the right.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-preview-data-grid.png)

    If you need to make a change, you can do this in the Preview pane by previewing the result of each step, undoing a step, and re-ordering steps. For example: 
   +  Undo/redo step – undo a step by choosing the **undo** icon, or redo a step by choosing the **redo** icon.   
![\[The screenshot shows the more icon.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-toolbar-undo-redo.png)
   +  Reorder step – when you reorder a step, Amazon Glue Studio will validate each step and let you know if the step is invalid. 

1.  Once you've applied a step, the Transform panel will show you all the steps in your recipe. You can clear all the steps to start over, add more steps by choosing the add icon, or choose **Done Authoring Recipe**.   
![\[The screenshot shows the Transform panel with steps added to your recipe. When done, choose Done Authoring Recipe or choose the add icon to add more steps to the recipe.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/author-recipe-done-authoring-recipe.png)

1.  Choose **Save** at the top right side of your screen. Your recipe steps will not be saved until you save your job. 

# Import an Amazon Glue DataBrew recipe in Amazon Glue Studio


 In Amazon Glue DataBrew, a recipe is a set of data transformation steps. A DataBrew recipe prescribes how to transform data that has already been read; it doesn't describe where or how to read the data, or how and where to write it. That is configured in the Source and Target nodes in Amazon Glue Studio. For more information on recipes, see [ Creating and using Amazon Glue DataBrew recipes ](https://docs.amazonaws.cn/databrew/latest/dg/recipes.html). 

 To use Amazon Glue DataBrew recipes in Amazon Glue Studio, begin by creating recipes in Amazon Glue DataBrew. If you already have recipes you want to use, you can skip this step. 

## IAM permissions for Amazon Glue DataBrew


 This topic provides information to help you understand the actions and resources that an IAM administrator can use in an Amazon Identity and Access Management (IAM) policy for the Data Preparation Recipe transform. 

 For additional information about security in Amazon Glue, see [Access Management](https://docs.amazonaws.cn/glue/latest/dg/security.html). 

**Note**  
 The following table lists the permissions that a user needs if importing an existing Amazon Glue DataBrew recipe. 


**Data Preparation Recipe transform actions**  

| Action | Description | 
| --- | --- | 
| databrew:ListRecipes | Grants permission to retrieve Amazon Glue DataBrew recipes. | 
| databrew:ListRecipeVersions | Grants permission to retrieve Amazon Glue DataBrew recipe versions. | 
| databrew:DescribeRecipe | Grants permission to retrieve Amazon Glue DataBrew recipe description. | 



 The role that you use to access this functionality should have a policy that allows several Amazon Glue DataBrew actions. You can achieve this either by using the `AWSGlueConsoleFullAccess` managed policy, which includes the necessary actions, or by adding the following inline policy to your role: 

------
#### [ JSON ]

****  

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "databrew:ListRecipes",
        "databrew:ListRecipeVersions",
        "databrew:DescribeRecipe"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------



 To use the Data Preparation Recipe transform, you must add the `iam:PassRole` action to the permissions policy. 


**Additional required permissions**  

| Action | Description | 
| --- | --- | 
| iam:PassRole | Grants permission for the user to pass approved IAM roles to the service. | 

Without these permissions, the following error occurs:

```
"errorCode": "AccessDenied"
"errorMessage": "User: arn:aws:sts::account_id:assumed-role/AWSGlueServiceRole is not 
authorized to perform: iam:PassRole on resource: arn:aws:iam::account_id:role/service-role/AWSGlueServiceRole 
because no identity-based policy allows the iam:PassRole action"
```



## Importing an Amazon Glue DataBrew recipe


**To import an Amazon Glue DataBrew recipe and use in Amazon Glue Studio:**

 If you have an existing Amazon Glue DataBrew recipe and you want to edit the recipe steps directly in Amazon Glue Studio, you must import the recipe steps into your Amazon Glue Studio job. 

1.  Start an Amazon Glue job in Amazon Glue Studio with a data source. 

1.  Add the **Data Preparation Recipe** node to the job canvas.   
![\[The screenshot shows the Add node modal with data preparation recipe available for selection.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/glue-add-node-data-preparation-recipe.png)

1.  In the Transform panel, enter a name for your recipe. 

1.  Choose one or more parent nodes by selecting the available nodes on the canvas from the drop-down list. 

1.  Choose **Author Recipe**. If **Author Recipe** is greyed out, it is unavailable until node parents have been selected and a data preview session has finished.   
![\[Author Data Preparation Recipe form with name field and node parents selection dropdown.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/glue-author-data-preparation-recipe.png)

1.  The data frame loads and shows you detailed information about your source data. 

    Select the **more actions** icon and choose **Import recipe**.   
![\[Data preparation interface showing "Build your Recipe" with an "Add step" button.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/glue-dataframe-import-recipe.png)

1.  Use the Import recipe wizard to complete the steps. In step 1, search for your recipe, select it, and choose **Next**.   
![\[Import recipe interface showing two recipes, with one selected for import.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/import-recipe-step-1.png)

1.  In step 2, choose your import options. You can choose to append the imported recipe to an existing recipe or to overwrite an existing recipe. Choose **Next**.   
![\[Import recipe interface showing selected recipe, version, and two imported steps.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/import-recipe-step-2.png)

1.  In step 3, validate the recipe steps. Once you import your Amazon Glue DataBrew recipe, you can edit this recipe directly in Amazon Glue Studio.   
![\[Recipe import interface showing two steps and a validation progress indicator.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/import-recipe-step-3.png)  
![\[Import recipe interface showing validated steps for sorting and formatting data.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/import-recipe-step-3-validated-2.png)

1.  After this, the steps are imported as part of your Amazon Glue job. Make any necessary configuration changes in the **Job details** tab, such as naming your job and adjusting the allocated capacity as needed. Choose **Save** to save your job and recipe. 
**Note**  
 JOIN, UNION, GROUP BY, PIVOT, UNPIVOT, and TRANSPOSE are not supported for recipe import, nor are they available in recipe authoring mode. 

1.  Optionally, you can finish authoring the job by adding other transform nodes as needed and adding data target node(s). 

    If you reorder steps after you import a recipe, Amazon Glue performs validation on those steps. For example, if you renamed and then deleted a column, and you moved the delete step above the rename step, the rename step would become invalid. You can then edit the steps to fix the validation error. 

# Migrating from Amazon Glue DataBrew to Amazon Glue Studio
Migrating from DataBrew

 If you have recipes in Amazon Glue DataBrew, use the following checklist to migrate your recipes to Amazon Glue Studio. 


| If you want to | Then do this | 
| --- | --- | 
|  Allow users to retrieve Amazon Glue DataBrew recipes, recipe versions, and recipe descriptions.  |  Add IAM permissions to a policy that allows your role to access the necessary actions. See [IAM permissions for Amazon Glue DataBrew](glue-studio-data-preparation-import-recipe.md#glue-studio-databrew-permissions).  | 
|  Import an existing Amazon Glue DataBrew recipe into Amazon Glue Studio.  |  Follow the steps in [Importing an Amazon Glue DataBrew recipe](glue-studio-data-preparation-import-recipe.md#glue-studio-databrew-import-steps).  | 
|  Import a recipe with JOIN and UNION.  |  Recipes with UNION and JOIN transforms are not supported. Use the Join and Union transforms in Amazon Glue Studio before or after a Data Preparation Recipe node.  | 

# Using Change Schema to remap data property keys


A *Change Schema* transform remaps the source data property keys to the keys configured for the target data. In a Change Schema transform node, you can:
+ Change the name of multiple data property keys.
+ Change the data type of the data property keys, if the new data type is supported and there is a transformation path between the two data types.
+ Choose a subset of data property keys by indicating which data property keys you want to drop.

You can also add additional *Change Schema* nodes to the job diagram as needed – for example, to modify additional data sources or following a *Join* transform. 

## Using Change Schema with decimal datatype


 When using the **Change Schema** transform with the decimal data type, the **Change Schema** transform sets the precision to the default value of (10,2). To set the precision for your use case instead, you can use the **SQL Query** transform and cast the columns with a specific precision. 

 For example, if you have an input column named "DecimalCol" of type Decimal, and you want to remap it to an output column named "OutputDecimalCol" with a specific precision of (18,6), you would: 

1.  Add a subsequent **SQL Query** transform after the **Change Schema** transform. 

1.  In the **SQL Query** transform, use an SQL query to cast the remapped column to the desired precision. The SQL query would look like this: 

   ```
   SELECT col1, col2, CAST(DecimalCol AS DECIMAL(18,6)) AS OutputDecimalCol
   FROM __THIS__
   ```

    In the above SQL query: 
   +  `col1` and `col2` are other columns in your data that you want to pass through without modification. 
   +  `DecimalCol` is the original column name from the input data. 
   +  `CAST(DecimalCol AS DECIMAL(18,6))` casts the `DecimalCol` to a Decimal type with a precision of 18 digits and 6 decimal places. 
   +  `AS OutputDecimalCol` renames the casted column to `OutputDecimalCol`. 

 By using the **SQL Query** transform, you can override the default precision set by the **Change Schema** transform and explicitly cast the Decimal columns to the desired precision. This approach allows you to leverage the **Change Schema** transform for renaming and restructuring your data while handling the precision requirements for Decimal columns through the subsequent **SQL Query** transformation. 
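To illustrate what a precision of (18,6) means, here is a plain-Python sketch using the standard `decimal` module; the function name and sample values are hypothetical, and the actual cast in the job is done by the SQL Query transform:

```python
from decimal import Decimal

# DECIMAL(18,6) means up to 18 significant digits in total,
# 6 of them after the decimal point.
def to_decimal_18_6(value: str) -> Decimal:
    d = Decimal(value).quantize(Decimal("0.000001"))  # scale to 6 decimal places
    assert len(d.as_tuple().digits) <= 18, "value exceeds DECIMAL(18,6)"
    return d

print(to_decimal_18_6("1234.5"))     # 1234.500000
print(to_decimal_18_6("0.1234567"))  # 0.123457 (rounded to 6 places)
```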

## Adding a Change Schema transform to your job


**Note**  
The **Change Schema** transform is not case-sensitive.

**To add a Change Schema transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Change Schema** to add a new transform to your job diagram, if needed. 

1. In the node properties panel, enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node properties panel.

1. Modify the input schema:
   + To rename a data property key, enter the new name of the key in the **Target key** field.
   + To change the data type for a data property key, choose the new data type for the key from the **Data type** list.
   + To remove a data property key from the target schema, choose the **Drop** check box for that key.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
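Conceptually, the three schema edits above (rename a key, change its data type, drop a key) behave like this plain-Python sketch over a single record; the field names, types, and values are hypothetical:

```python
# One source record; keys, types, and values are hypothetical.
record = {"emp_id": "101", "start_date": "2020-01-15", "temp_flag": "Y"}

# Target key and data type for each source key; omitted keys are dropped.
mapping = {
    "emp_id":     ("employee_id", int),  # rename and change data type
    "start_date": ("hire_date", str),    # rename only
    # "temp_flag" is omitted -> dropped from the target schema
}

target = {new_key: cast(record[old_key])
          for old_key, (new_key, cast) in mapping.items()}
print(target)  # {'employee_id': 101, 'hire_date': '2020-01-15'}
```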

# Using Drop Duplicates


 The Drop Duplicates transform removes duplicate rows from your data source. It gives you two options: you can remove rows that are completely identical, or you can choose the fields to match and remove only the rows that duplicate those fields. 

 For example, in this data set, some rows are exact duplicates of another row across all of their values, while other rows match another row on only some values. 


| Row | Name | Email | Age | State | Note | 
| --- | --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY |  | 
| 2 | Tim | tim@gmail | 45 | OH |  | 
| 3 | Rose | rose@gmail | 23 | NJ |  | 
| 4 | Tim | tim@gmail | 42 | OH |  | 
| 5 | Rose | rose@gmail | 23 | NJ |  | 
| 6 | Tim | tim@gmail | 42 | OH | This is a duplicate row that matches row 4 on all values | 
| 7 | Rose | rose@gmail | 23 | NJ | This is a duplicate row that matches row 5 on all values | 

 If you choose to match entire rows, rows 6 and 7 will be removed from the data set. The data set is now: 


| Row | Name | Email | Age | State | 
| --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY | 
| 2 | Tim | tim@gmail | 45 | OH | 
| 3 | Rose | rose@gmail | 23 | NJ | 
| 4 | Tim | tim@gmail | 42 | OH | 
| 5 | Rose | rose@gmail | 23 | NJ | 

 If you choose to specify keys, you can remove rows that match on ‘name’ and ‘email’. This gives you finer control over what counts as a ‘duplicate row’ for your data set. By specifying ‘name’ and ‘email’, the data set is now: 


| Row | Name | Email | Age | State | 
| --- | --- | --- | --- | --- | 
| 1 | Joy | joy@gmail | 33 | NY | 
| 2 | Tim | tim@gmail | 45 | OH | 
| 3 | Rose | rose@gmail | 23 | NJ | 



 Some things to keep in mind: 
+  Values are case-sensitive: for rows to be recognized as duplicates, all values must have the same casing. This applies to either option you choose (Match entire rows or Specify keys). 
+  All values are read in as strings. 
+  The **Drop Duplicates** transform utilizes the Spark dropDuplicates command. 
+  When using the **Drop Duplicates** transform, the first row is kept and other rows are dropped. 
+  The **Drop Duplicates** transform does not change the schema of the dataframe. If you choose to specify keys, all fields are kept in the resulting dataframe. 
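The two options can be sketched in plain Python (the rows and helper function below are hypothetical; in the job itself the work is done by Spark's `dropDuplicates`). As in the transform, the first occurrence is kept:

```python
# Hypothetical rows; in the transform, all values are read in as strings.
rows = [
    {"name": "Joy", "email": "joy@gmail", "age": "33"},
    {"name": "Tim", "email": "tim@gmail", "age": "45"},
    {"name": "Tim", "email": "tim@gmail", "age": "45"},  # exact duplicate of the row above
    {"name": "Tim", "email": "tim@gmail", "age": "42"},  # same name/email, different age
]

def drop_duplicates(rows, keys=None):
    """Keep the first row seen for each distinct key tuple (all fields if keys is None)."""
    seen, kept = set(), []
    for row in rows:
        key = tuple(row[k] for k in (keys or row))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

print(len(drop_duplicates(rows)))                     # 3 -- match entire rows
print(len(drop_duplicates(rows, ["name", "email"])))  # 2 -- specify keys
```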

# Using SelectFields to remove most data property keys


You can create a subset of data property keys from the dataset using the *SelectFields* transform. You indicate which data property keys you want to keep and the rest are removed from the dataset.

**Note**  
The *SelectFields* transform is case sensitive. Use *ApplyMapping* if you need a case-insensitive way to select fields.

**To add a SelectFields transform node to your job diagram**

1. (Optional) Open the Resource panel, and then choose **SelectFields** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node details panel.

1. Under the heading **SelectFields**, choose the data property keys in the dataset that you want to keep. Any data property keys not selected are dropped from the dataset.

   You can also choose the check box next to the column heading **Field** to automatically choose all the data property keys in the dataset. Then you can deselect individual data property keys to remove them from the dataset.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
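The effect of the transform can be sketched in plain Python (the record and key names are hypothetical): only the keys you select are kept.

```python
record = {"name": "Joy", "email": "joy@gmail", "age": 33, "state": "NY"}

keep = ["name", "email"]  # the keys you chose; case-sensitive, so "Name" would not match
selected = {k: v for k, v in record.items() if k in keep}

print(selected)  # {'name': 'Joy', 'email': 'joy@gmail'}
```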

# Using DropFields to keep most data property keys


You can create a subset of data property keys from the dataset using the *DropFields* transform. You indicate which data property keys you want to remove from the dataset and the rest of the keys are retained.

**Note**  
The *DropFields* transform is case sensitive. Use *Change Schema* if you need a case-insensitive way to select fields.

**To add a DropFields transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **DropFields** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node details panel.

1. Under the heading **DropFields**, choose the data property keys to drop from the data source.

   You can also choose the check box next to the column heading **Field** to automatically choose all the data property keys in the dataset. Then you can deselect individual data property keys so they are retained in the dataset.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
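The transform's behavior, including its case sensitivity, can be sketched in plain Python (the record and key names are hypothetical):

```python
# "Email" and "email" are distinct keys because the transform is case-sensitive.
record = {"name": "Joy", "Email": "joy@gmail", "email": "joy@work", "age": 33}

drop = ["email"]  # drops "email" but keeps "Email"
kept = {k: v for k, v in record.items() if k not in drop}

print(kept)  # {'name': 'Joy', 'Email': 'joy@gmail', 'age': 33}
```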

# Renaming a field in the dataset


You can use the *RenameField* transform to change the name for an individual property key in the dataset. 

**Note**  
The *RenameField* transform is case sensitive. Use *ApplyMapping* if you need a case-insensitive transform.

**Tip**  
If you use the *Change Schema* transform, you can rename multiple data property keys in the dataset with a single transform.

**To add a RenameField transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **RenameField** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Under the heading **Data field**, choose a property key from the source data and then enter a new name in the **New field name** field. 

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
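Renaming a single property key can be sketched in plain Python (the record and field names are hypothetical); only the chosen key changes, and all values pass through unchanged:

```python
record = {"emp_id": 101, "dept": "Sales"}

old_name, new_name = "emp_id", "employee_id"  # hypothetical field names
renamed = {(new_name if k == old_name else k): v for k, v in record.items()}

print(renamed)  # {'employee_id': 101, 'dept': 'Sales'}
```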

# Using Spigot to sample your dataset


To test the transformations performed by your job, you might want to get a sample of the data to check that the transformation works as intended. The *Spigot* transform writes a subset of records from the dataset to a JSON file in an Amazon S3 bucket. The data sampling method can be either a specific number of records from the beginning of the file or a probability factor used to pick records.

**To add a Spigot transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Spigot** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab in the node details panel.

1. Enter an Amazon S3 path or choose **Browse S3** to choose a location in Amazon S3. This is the location where the job writes the JSON file that contains the data sample.

1. Enter information for the sampling method. You can specify a value for **Number of records** to write starting from the beginning of the dataset and a **Probability threshold** (entered as a decimal value with a maximum value of 1) of picking any given record. 

   For example, to write the first 50 records from the dataset, you would set **Number of records** to 50 and **Probability threshold** to 1 (100%).
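The two sampling methods can be sketched in plain Python (the dataset and probability value are hypothetical; the actual transform writes its sample as JSON to the Amazon S3 location you chose):

```python
import random

dataset = list(range(1000))  # stand-in for the job's records

# Option 1: a specific number of records from the beginning.
first_50 = dataset[:50]

# Option 2: a probability threshold -- each record is kept
# with the given probability (0.1 here, i.e. roughly 10% of records).
random.seed(0)  # deterministic for this sketch
sampled = [rec for rec in dataset if random.random() < 0.1]

print(len(first_50))  # 50
```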

# Joining datasets


The *Join* transform allows you to combine two datasets into one. You specify the key names in the schema of each dataset to compare. The output `DynamicFrame` contains rows where keys meet the join condition. The rows in each dataset that meet the join condition are combined into a single row in the output `DynamicFrame` that contains all the columns found in either dataset.

**To add a Join transform node to your job diagram**

1. If there is only one data source available, you must add a new data source node to the job diagram.

1. Choose one of the source nodes for the join. Open the Resource panel and then choose **Join** to add a new transform to your job diagram.

1. On the **Node properties** tab, enter a name for the node in the job diagram.

1. In the **Node properties** tab, under the heading **Node parents**, add a parent node so that there are two datasets providing inputs for the join. The parent can be a data source node or a transform node. 
**Note**  
A join can have only two parent nodes.

1. Choose the **Transform** tab.

   If you see a message indicating that there are conflicting key names, you can either:
   + Choose **Resolve it** to automatically add an *ApplyMapping* transform node to your job diagram. The ApplyMapping node adds a prefix to any keys in the dataset that have the same name as a key in the other dataset. For example, if you use the default value of **right**, then any keys in the right dataset that have the same name as a key in the left dataset will be renamed to `(right)key name`.
   + Manually add a transform node earlier in the job diagram to remove or rename the conflicting keys.

1. Choose the type of join in the **Join type** list. 
   + **Inner join**: Returns a row with columns from both datasets for every match based on the join condition. Rows that don't satisfy the join condition aren't returned.
   + **Left join**: Returns all rows from the left dataset and only the rows from the right dataset that satisfy the join condition. 
   + **Right join**: Returns all rows from the right dataset and only the rows from the left dataset that satisfy the join condition.
   + **Outer join**: Returns all rows from both datasets.
   + **Left semi join**: Returns all rows from the left dataset that have a match in the right dataset based on the join condition. 
   + **Left anti join**: Returns all rows in the left dataset that don't have a match in the right dataset based on the join condition. 

1. On the **Transform** tab, under the heading **Join conditions**, choose **Add condition**. Choose a property key from each dataset to compare. Property keys on the left side of the comparison operator are referred to as the left dataset and property keys on the right are referred to as the right dataset. 

   For more complex join conditions, you can add additional matching keys by choosing **Add condition** more than once. If you accidentally add a condition, you can choose the delete icon (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/delete-icon-black.png)) to remove it.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

For an example of the join output schema, consider a join between two datasets with the following property keys:

```
Left: {id, dept, hire_date, salary, employment_status}
Right: {id, first_name, last_name, hire_date, title}
```

The join is configured to match on the `id` and `hire_date` keys using the `=` comparison operator. 

Because both datasets contain `id` and `hire_date` keys, you chose **Resolve it** to automatically add the prefix **right** to the keys in the right dataset. 

The keys in the output schema would be:

```
{id, dept, hire_date, salary, employment_status, 
(right)id, first_name, last_name, (right)hire_date, title}
```
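The key-resolution step in this example can be sketched in plain Python: conflicting keys in the right dataset get the **right** prefix, and the output schema is the left keys followed by the resolved right keys.

```python
left  = ["id", "dept", "hire_date", "salary", "employment_status"]
right = ["id", "first_name", "last_name", "hire_date", "title"]

# "Resolve it" adds an ApplyMapping node that prefixes keys in the
# right dataset that conflict with a left key.
resolved_right = [f"(right){k}" if k in left else k for k in right]
output_schema = left + resolved_right

print(output_schema)
# ['id', 'dept', 'hire_date', 'salary', 'employment_status',
#  '(right)id', 'first_name', 'last_name', '(right)hire_date', 'title']
```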

# Using Union to combine rows


 You use the Union transform node when you want to combine rows from more than one data source that have the same schema. 

 There are two types of Union transformations: 

1. ALL – when applying ALL, the resulting union does not remove duplicate rows.

1. DISTINCT – when applying DISTINCT, the resulting union removes duplicate rows.

 **Unions vs. Joins** 

 You use Union to combine rows. You use Join to combine columns. 

**Using the Union transform in the Visual ETL canvas**

1.  Add more than one data source to perform a union transform. To add a data source, open the Resource Panel, then choose the data source from the Sources tab. Before using the Union transformation, you must ensure that all data sources involved in the union have the same schema and structure. 

1.  When you have at least two data sources that you want to combine using the Union transform, create the Union transform by adding it to the canvas. Open the Resource Panel on the canvas and search for 'Union'. You can also choose the Transforms tab in the Resource Panel and scroll down until you find the Union transform, then choose **Union**. 

1. Select the Union node on the job canvas. In the Node properties window, choose the parent nodes to connect to the Union transform.

1. Amazon Glue checks for compatibility to make sure that the Union transform can be applied to all data sources. If the schemas of the data sources are the same, the operation is allowed. If the data sources do not have the same schema, an error message is displayed: “The input schemas of this union are not the same. Consider using ApplyMapping to match the schemas.” To fix this, choose **Use ApplyMapping**. 

1. Choose the Union type.

   1. All – By default, the All union type is selected; the result retains duplicate rows if there are any in the combined data.

   1. Distinct – Choose Distinct if you want duplicate rows to be removed from the resulting data combination.
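The difference between the two union types can be sketched in plain Python (the two sources and their rows are hypothetical):

```python
source_a = [("Joy", 33), ("Tim", 45)]
source_b = [("Tim", 45), ("Rose", 23)]  # ("Tim", 45) also appears in source_a

union_all = source_a + source_b  # ALL: duplicate rows are retained

union_distinct = []              # DISTINCT: duplicate rows are removed
for row in union_all:
    if row not in union_distinct:
        union_distinct.append(row)

print(len(union_all))       # 4
print(len(union_distinct))  # 3
```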

# Using SplitFields to split a dataset into two


The *SplitFields* transform allows you to choose some of the data property keys in the input dataset and put them into one dataset and the unselected keys into a separate dataset. The output from this transform is a collection of `DynamicFrames`.

**Note**  
You must use a *SelectFromCollection* transform to convert the collection of `DynamicFrames` into a single `DynamicFrame` before you can send the output to a target location.

The *SplitFields* transform is case sensitive. Add an *ApplyMapping* transform as a parent node if you need case-insensitive property key names.

**To add a SplitFields transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **SplitFields** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Choose which property keys you want to put into the first dataset. The keys that you do not choose are placed in the second dataset.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

1. Configure a *SelectFromCollection* transform node to process the resulting datasets.
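The split, and the index-based selection that follows it, can be sketched in plain Python (the record and key names are hypothetical; in the job, the collection holds `DynamicFrames` rather than dictionaries):

```python
record = {"id": 1, "name": "Joy", "salary": 70000, "dept": "Sales"}

first_keys = ["id", "name"]  # keys chosen for the first dataset
frame_0 = {k: v for k, v in record.items() if k in first_keys}
frame_1 = {k: v for k, v in record.items() if k not in first_keys}

# The transform outputs a collection; a SelectFromCollection
# node then picks one frame by its index.
collection = [frame_0, frame_1]
print(collection[1])  # {'salary': 70000, 'dept': 'Sales'}
```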

# Overview of *SelectFromCollection* transform


Certain transforms have multiple datasets as their output instead of a single dataset, for example, *SplitFields*. The *SelectFromCollection* transform selects one dataset (`DynamicFrame`) from a collection of datasets (an array of `DynamicFrames`). The output for the transform is the selected `DynamicFrame`. 

You must use this transform after you use a transform that creates a collection of `DynamicFrames`, such as:
+ Custom code transforms
+ *SplitFields*

If you don't add a *SelectFromCollection* transform node to your job diagram after any of these transforms, you will get an error for your job. 

The parent node for this transform must be a node that returns a collection of `DynamicFrames`. If you choose a parent for this transform node that returns a single `DynamicFrame`, such as a *Join* transform, your job returns an error. 

Similarly, if you use a *SelectFromCollection* node in your job diagram as the parent for a transform that expects a single `DynamicFrame` as input, your job returns an error.

![\[The screenshot shows the Node parents field on the Node properties tab of the node details panel. The selected node parent is SplitFields and the error message displayed reads "Parent node Split Fields outputs a collection, but node Drop Fields does not accept a collection."\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/screenshot-edit-splitfields-wrong-parent.png)


# Using SelectFromCollection to choose which dataset to keep


Use the *SelectFromCollection* transform to convert a collection of `DynamicFrames` into a single `DynamicFrame`.

**To add a SelectFromCollection transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **SelectFromCollection** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Under the heading **Frame index**, choose the array index number that corresponds to the `DynamicFrame` you want to select from the collection of `DynamicFrames`.

   For example, if the parent node for this transform is a *SplitFields* transform, on the **Output schema** tab of that node you can see the schema for each `DynamicFrame`. If you want to keep the `DynamicFrame` associated with the schema for **Output 2**, you would select **1** for the value of **Frame index**, which is the second value in the list.

   Only the `DynamicFrame` that you choose is included in the output.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Find and fill missing values in a dataset


You can use the *FillMissingValues* transform to locate records in the dataset that have missing values and add a new field with a value determined by imputation. The input dataset is used to train the machine learning (ML) model that determines what the missing value should be. If you use incremental datasets, each incremental dataset is used as the training data for the ML model, so the results might not be as accurate.

**To use a FillMissingValues transform node in your job diagram**

1. (Optional) Open the Resource panel and then choose **FillMissingValues** to add a new transform to your job diagram, if needed.

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform. 

1. Choose the **Transform** tab.

1. For **Data field**, choose the column or field name from the source data that you want to analyze for missing values.

1. (Optional) In the **New field name** field, enter a name for the field added to each record that will hold the estimated replacement value for the analyzed field. If the analyzed field doesn't have a missing value, the value in the analyzed field is copied into the new field. 

   If you don't specify a name for the new field, the default name is the name of the analyzed column with `_filled` appended. For example, if you enter **Age** for **Data field** and don't specify a value for **New field name**, a new field named **Age_filled** is added to each record.

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.
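Amazon Glue uses a trained ML model for the imputation itself; the sketch below substitutes a simple mean imputation purely to show the shape of the output (a new `<field>_filled` column), not the actual algorithm. The records and field names are hypothetical:

```python
# Simplified stand-in for FillMissingValues: Glue trains an ML model to
# estimate missing values; here we use the column mean only to illustrate
# the output shape -- every record gains an "Age_filled" field.
records = [{"Age": 30}, {"Age": None}, {"Age": 50}]

def fill_missing(rows, field, new_field=None):
    new_field = new_field or f"{field}_filled"
    present = [r[field] for r in rows if r[field] is not None]
    imputed = sum(present) / len(present)  # stand-in for the ML estimate
    for r in rows:
        # Copy existing values; fill only where the analyzed field is missing.
        r[new_field] = r[field] if r[field] is not None else imputed
    return rows

filled = fill_missing(records, "Age")
```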

# Filtering keys within a dataset


Use the *Filter* transform to create a new dataset by filtering records from the input dataset based on a regular expression. Rows that don't satisfy the filter condition are removed from the output.
+ For string data types, you can filter rows where the key value matches a specified string.
+ For numeric data types, you can filter rows by comparing the key value to a specified value using the comparison operators `<`, `>`, `=`, `!=`, `<=`, and `>=`.

If you specify multiple filter conditions, the results are combined using an `AND` operator by default, but you can choose `OR` instead.

The *Filter* transform is case sensitive. Add an *ApplyMapping* transform as a parent node if you need case-insensitive property key names.
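The filtering logic can be sketched in plain Python: each condition is a (key, operation, value) triple, string comparisons use regular expressions, and conditions combine under a global `AND` or `OR`. The rows, keys, and conditions below are illustrative only:

```python
import re

# Sketch of the Filter transform's row-keeping logic.
rows = [
    {"State": "CA-North", "year": 2019},
    {"State": "NY", "year": 2018},
    {"State": "CA-South", "year": 2015},
]

def keep(row, conditions, combine="AND"):
    results = []
    for key, op, value in conditions:
        if op == "matches":       # string keys: regular-expression match
            results.append(re.match(value, str(row[key])) is not None)
        elif op == ">=":          # numeric keys: comparison operator
            results.append(row[key] >= value)
    return all(results) if combine == "AND" else any(results)

conditions = [("year", ">=", 2018), ("State", "matches", "CA.*")]
filtered = [r for r in rows if keep(r, conditions, "AND")]
```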

**To add a Filter transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Filter** to add a new transform to your job diagram, if needed. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent isn't already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. Choose the **Transform** tab.

1. Choose either **Global AND** or **Global OR**. This determines how multiple filter conditions are combined. All conditions are combined using either `AND` or `OR` operations. If you have only a single filter condition, you can choose either one.

1. Choose the **Add condition** button in the **Filter condition** section to add a filter condition. 

   In the **Key** field, choose a property key name from the dataset. In the **Operation** field, choose the comparison operator. In the **Value** field, enter the comparison value. Here are some examples of filter conditions:
   + `year >= 2018`
   + `State matches 'CA*'`

   When you filter on string values, make sure that the comparison value uses a regular expression format that matches the script language selected in the job properties (Python or Scala).

1. Add additional filter conditions, as needed. 

1. (Optional) After configuring the transform node properties, you can view the modified schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Using DropNullFields to remove fields with null values


 Use the *DropNullFields* transform to remove fields from the dataset if all values in the field are null. By default, Amazon Glue Studio recognizes null objects, but some values, such as empty strings, the string "null", -1 integers, or other placeholders such as zeros, are not automatically recognized as null. 

**To use the DropNullFields**

1.  Add a DropNullFields node to the job diagram. 

1.  On the **Node properties** tab, choose additional values that represent a null value. You can select none, some, or all of the values:   
![\[The screenshot shows the Transform tab for the DropNullFields node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/DropNullFields-transform-tab.png)
   +  Empty String ("" or '') - fields that contain empty strings will be removed 
   +  "null string" - fields that contain the string 'null' will be removed 
   +  -1 integer - fields that contain a -1 (negative one) integer will be removed 

1.  If needed, you can also specify custom null values. These are null values that may be unique to your dataset. To add a custom null value, choose **Add new value**. 

1.  Enter the custom null value. For example, this can be zero, or any value that is being used to represent a null in the dataset. 

1.  Choose the data type in the drop-down field. Data types can either be String or Integer. 
**Note**  
 Custom null values and their data types must match exactly in order for the fields to be recognized as null values and the fields removed. Partial matches where only the custom null value matches but the data type does not will not result in the fields being removed. 
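The drop rule can be sketched in plain Python: a field is removed only when *every* one of its values is in the configured null set (None plus any extra values you selected, such as the empty string or -1). The rows below are hypothetical:

```python
# Sketch of DropNullFields: a column is dropped when all of its values are
# in the null set. Custom null values must match in both value and type.
rows = [
    {"a": 1, "b": "", "c": None},
    {"a": 2, "b": "", "c": None},
]
null_values = {None, "", -1}

def drop_null_fields(data, nulls):
    keys = data[0].keys()
    all_null = {k for k in keys if all(r[k] in nulls for r in data)}
    return [{k: v for k, v in r.items() if k not in all_null} for r in data]

cleaned = drop_null_fields(rows, null_values)
```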

# Using a SQL query to transform data


You can use a **SQL** transform to write your own transform in the form of a SQL query.

A SQL transform node can have multiple datasets as inputs, but produces only a single dataset as output. It contains a text field, where you enter the Apache Spark SQL query. You can assign aliases to each dataset used as input, to help simplify the SQL query. For more information about the SQL syntax, see the [Spark SQL documentation](https://spark.apache.org/docs/latest/sql-ref.html).

**Note**  
If you use a Spark SQL transform with a data source located in a VPC, add an Amazon Glue VPC endpoint to the VPC that contains the data source. For more information about configuring development endpoints, see [Adding a Development Endpoint](https://docs.amazonaws.cn/glue/latest/dg/add-dev-endpoint.html), [Setting Up Your Environment for Development Endpoints](https://docs.amazonaws.cn/glue/latest/dg/start-development-endpoint.html), and [Accessing Your Development Endpoint](https://docs.amazonaws.cn/glue/latest/dg/dev-endpoint-elastic-ip.html) in the *Amazon Glue Developer Guide*.
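The query itself refers to each input dataset by its alias rather than its node name. As an illustration only (Spark SQL, not SQLite, runs in the job), this sqlite3 sketch shows the shape of querying an aliased dataset; the alias `org_table` and its columns are hypothetical:

```python
import sqlite3

# Illustration of referencing an aliased input dataset from the query text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE org_table (org_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO org_table VALUES (?, ?)",
                 [(1, "engineering"), (2, "sales")])

# The SQL refers to the dataset by its alias, not by the node name.
result = conn.execute(
    "SELECT name FROM org_table WHERE org_id = 1"
).fetchall()
```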

**To use a SQL transform node in your job diagram**

1. (Optional) Add a transform node to the job diagram, if needed. Choose **SQL Query** for the node type.
**Note**  
 If you use a data preview session and a custom SQL or custom code node, the data preview session will execute the SQL or code block as-is for the entire dataset. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, or if you want multiple inputs for the SQL transform, choose a node from the **Node parents** list to use as the input source for the transform. Add additional parent nodes as needed.

1. Choose the **Transform** tab in the node details panel. 

1. The source datasets for the SQL query are identified by the names you specified in the **Name** field for each node. If you do not want to use these names, or if the names are not suitable for a SQL query, you can associate a name with each dataset. The console provides default aliases, such as `MyDataSource`.

   For example, if a parent node for the SQL transform node is named `Rename Org PK field`, you might associate the name `org_table` with this dataset. This alias can then be used in the SQL query in place of the node name. 

1. In the text entry field under the heading **Code block**, paste or enter the SQL query. The text field displays SQL syntax highlighting and keyword suggestions.

1. With the SQL transform node selected, choose the **Output schema** tab, and then choose **Edit**. Provide the columns and data types that describe the output fields of the SQL query.

   Specify the schema using the following actions in the **Output schema** section of the page:
   + To rename a column, place the cursor in the **Key** text box for the column (also referred to as a *field* or *property key*) and enter the new name.
   + To change the data type for a column, select the new data type for the column from the drop-down list.
   + To add a new top-level column to the schema, choose the Overflow (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-schema-actions-button.png)) button, and then choose **Add root key**. New columns are added at the top of the schema.
   + To remove a column from the schema, choose the delete icon (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/delete-icon-black.png)) to the far right of the Key name. 

1. When you finish specifying the output schema, choose **Apply** to save your changes and exit the schema editor. If you do not want to save your changes, choose **Cancel** to exit the schema editor.

1. (Optional) After configuring the node properties and transform properties, you can preview the modified dataset by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Using Aggregate to perform summary calculations on selected fields


**To use the Aggregate transform**

1.  Add the Aggregate node to the job diagram. 

1.  On the **Node properties** tab, optionally choose the fields to group together by selecting them in the drop-down field. You can select more than one field at a time, or search for a field name by typing in the search bar. 

    When fields are selected, the name and datatype are shown. To remove a field, choose 'X' on the field.   
![\[The screenshot shows the Transform tab for the Aggregate node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/Aggregate-transform-tab.png)

1.  Choose **Aggregate another column**. You must select at least one field.   
![\[The screenshot shows the fields when choosing Aggregate another column.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/Aggregate-fieldtoaggregate.png)

1.  Choose a field in the **Field to aggregate** drop-down. 

1.  Choose the aggregation function to apply to the chosen field: 
   +  avg - calculates the average 
   +  countDistinct - calculates the number of unique non-null values 
   +  count - calculates the number of non-null values 
   +  first - returns the first value that satisfies the 'group by' criteria 
   +  last - returns the last value that satisfies the 'group by' criteria 
   +  kurtosis - calculates the sharpness of the peak of a frequency-distribution curve 
   +  max - returns the highest value that satisfies the 'group by' criteria 
   +  min - returns the lowest value that satisfies the 'group by' criteria 
   +  skewness - measures the asymmetry of the probability distribution about its mean 
   +  stddev_pop - calculates the population standard deviation (the square root of the population variance) 
   +  sum - the sum of all values in the group 
   +  sumDistinct - the sum of distinct values in the group 
   +  var_samp - the sample variance of the group (ignores nulls) 
   +  var_pop - the population variance of the group (ignores nulls) 
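The group-then-aggregate behavior can be sketched in plain Python; the rows, group key, and function names below are illustrative, with only a few of the functions listed above implemented:

```python
from collections import defaultdict

# Sketch of the Aggregate transform: group rows by a field, then apply an
# aggregation function to a chosen field within each group.
rows = [
    {"country": "uk", "amount": 32},
    {"country": "uk", "amount": 67},
    {"country": "de", "amount": 42},
]

def aggregate(data, group_key, field, func):
    groups = defaultdict(list)
    for r in data:
        groups[r[group_key]].append(r[field])
    agg = {"sum": sum, "avg": lambda v: sum(v) / len(v), "count": len,
           "min": min, "max": max}[func]
    return {k: agg(v) for k, v in groups.items()}

totals = aggregate(rows, "country", "amount", "sum")
```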

# Flatten nested structs


*Flatten* the fields of nested structs in the data, so they become top level fields. The new fields are named using the field name prefixed with the names of the struct fields to reach it, separated by dots. 

For example, suppose the data has a field of type Struct named “phone_numbers”, which among other fields has one of type Struct named “home_phone” with two fields: “country_code” and “number”. Once flattened, these two fields become top-level fields named “phone_numbers.home_phone.country_code” and “phone_numbers.home_phone.number”, respectively.
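The naming scheme can be sketched in plain Python, with nested dicts standing in for struct fields and `max_depth` standing in for the transform's nesting-level limit:

```python
# Sketch of Flatten: nested struct fields become top-level fields whose
# names join the path with dots. Nested dicts stand in for structs.
record = {
    "phone_numbers": {
        "home_phone": {"country_code": "44", "number": "5550100"}
    }
}

def flatten(obj, prefix="", max_depth=None, depth=0):
    out = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict) and (max_depth is None or depth < max_depth):
            out.update(flatten(value, name + ".", max_depth, depth + 1))
        else:
            out[name] = value  # leaf (or depth limit reached): keep as-is
    return out

flat = flatten(record)
```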

**To add a *Flatten* transform node in your job diagram**

1. Open the Resource panel and then choose the **Transforms** tab, then **Flatten** to add a new transform to your job diagram. You can also use the search bar by entering 'Flatten', then choosing the Flatten node. The node selected at the time of adding the node will be its parent.  
![\[The screenshot shows the Resource Panel and the search bar populated with the word 'Flatten'. The search result shows the Flatten transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transform-flatten.png)

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can limit the maximum nesting level to flatten. For instance, setting that value to 1 means that only top-level structs will be flattened. Setting the max to 2 will flatten the top level and the structs directly under it.

# Add a UUID column


When you add a *UUID* (Universally Unique Identifier) column, each row is assigned a unique 36-character string.
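The value assigned to each row behaves like a standard version-4 UUID; a minimal sketch using Python's `uuid` module (the rows are hypothetical):

```python
import uuid

# Each row gains a unique identifier; the canonical UUID string form is
# always 36 characters (32 hex digits plus 4 hyphens).
rows = [{"name": "a"}, {"name": "b"}]
for row in rows:
    row["uuid"] = str(uuid.uuid4())  # "uuid" is the default column name
```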

**To add a *UUID* transform node in your job diagram**

1. Open the Resource panel and then choose **UUID** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can customize the name of the new column. By default it will be named "uuid".

# Add an identifier column


Assign a numeric *Identifier* for each row in the dataset.

**To add an *Identifier* transform node in your job diagram**

1. Open the Resource panel and then choose **Identifier** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can customize the name of the new column. By default, it will be named "id".

1. (Optional) If the job processes and stores data incrementally, you might want to avoid reusing the same IDs between job runs.

   On the **Transform** tab, select the **unique** checkbox option. It includes the job timestamp in the identifier, making it unique between multiple runs. To allow for the larger numbers, the column will be of type decimal instead of long.

# Convert a column to timestamp type


You can use the transform *To timestamp* to change the data type of a numeric or string column into timestamp, so that it can be stored with that data type or applied to other transforms that require a timestamp.

**To add a *To timestamp* transform node in your job diagram**

1. Open the Resource panel and then choose **To timestamp** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column to be converted.

1. On the **Transform** tab, define how to parse the column selected by choosing the type.

   If the value is a number, it can be expressed in seconds (Unix epoch time), milliseconds, or microseconds; choose the corresponding option.

   If the value is a formatted string, choose the "iso" type; the string needs to conform to one of the variants of the ISO format, for example: “2022-11-02T14:40:59.915Z“.

   If you don’t know the type at this point, or if different rows use different types, you can choose ”autodetect“ and the system will make its best guess, at a small performance cost.

1. (Optional) On the **Transform** tab, instead of converting the selected column, you can create a new one and keep the original by entering a name for the new column.
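The parsing options map naturally onto Python's `datetime` module; a sketch of the conversions described above (without the autodetect option):

```python
from datetime import datetime, timezone

# Sketch of the To timestamp options: numeric values interpreted as
# seconds, milliseconds, or microseconds since the Unix epoch, and
# ISO-formatted strings parsed directly.
def to_timestamp(value, kind):
    if kind == "seconds":
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if kind == "milliseconds":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if kind == "microseconds":
        return datetime.fromtimestamp(value / 1_000_000, tz=timezone.utc)
    if kind == "iso":
        # fromisoformat on older Pythons needs "+00:00" instead of "Z".
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    raise ValueError(kind)

ts = to_timestamp("2022-11-02T14:40:59.915Z", "iso")
```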

# Convert a timestamp column to a formatted string


Format a timestamp column into a string based on a pattern. You can use *Format timestamp* to get date and time as a string with the desired format. You can define the format using [Spark date syntax](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) as well as most of the [Python date codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

For example, if you want your date string to be formatted like “2023-01-01 00:00”, you can define that format using the Spark syntax as “yyyy-MM-dd HH:mm” or the equivalent Python date codes as “%Y-%m-%d %H:%M”.
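The Python form of that example can be checked directly with `strftime`:

```python
from datetime import datetime

# Formatting 2023-01-01 00:00 with the Python date codes "%Y-%m-%d %H:%M".
formatted = datetime(2023, 1, 1, 0, 0).strftime("%Y-%m-%d %H:%M")
```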

**To add a *Format timestamp* transform node in your job diagram**

1. Open the Resource panel and then choose **Format timestamp** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column to be converted.

1. On the **Transform** tab, enter the **Timestamp format** pattern to use, expressed using [Spark date syntax](https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html) or [Python date codes](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes).

1. (Optional) On the **Transform** tab, instead of converting the selected column, you can create a new one and keep the original by entering a name for the new column.

# Creating a Conditional Router transformation


 The Conditional Router transform allows you to apply multiple conditions to incoming data. Each row of the incoming data is evaluated by a group filter condition and processed into its corresponding group. If a row meets more than one group filter condition, the transform passes the row to multiple groups. If a row does not meet any condition, it can either be dropped or routed to a default output group. 

 This transform is similar to the filter transform, but useful for users who want to test the same input data on multiple conditions. 
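The routing behavior can be sketched in plain Python: each row is tested against every group's condition, a row can land in several groups, and unmatched rows go to the default group. The group names and conditions below are illustrative only:

```python
# Sketch of the Conditional Router's row-routing logic.
rows = [{"amount": 5}, {"amount": 50}, {"amount": 500}]
groups = {
    "small": lambda r: r["amount"] < 10,
    "large": lambda r: r["amount"] > 100,
}

routed = {name: [] for name in groups}
routed["default_group"] = []
for row in rows:
    matched = False
    for name, condition in groups.items():
        if condition(row):
            routed[name].append(row)  # a row may match several groups
            matched = True
    if not matched:
        routed["default_group"].append(row)
```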

**To add a Conditional Router transform:**

1.  Choose a node where you will perform the conditional router transformation. This can be a source node or another transform. 

1.  Choose **Action**, then use the search bar to find and choose 'Conditional Router'. A **Conditional Router** transform is added along with two output nodes. One output node, 'Default group', contains records which do not meet any of the conditions defined in the other output node(s). The default group cannot be edited.   
![\[The screenshot shows the conditional router transform node connected to a source node. Output nodes are shown branching from the conditional router node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transform-conditional-router-node.png)

    You can add additional output groups by choosing **Add group**. For each output group, you can name the group and add filter conditions and a logical operator.   
![\[The screenshot shows the conditional router transform tab with options to name the output group, logical operator and conditional filter(s).\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transform-conditional-router-tab.png)

1.  Rename the output group by entering a new name for it. Amazon Glue Studio automatically names your groups for you (for example, 'output_group_1'). 

1.  Choose a logical operator (**AND**, **OR**) and add a **Filter condition** by specifying the **Key**, **Operation**, and **Value**. Logical operators allow you to implement more than one filter condition and perform the logical operator on each filter condition you specify. 

    When specifying the key, you can choose from available keys in your schema. You can then choose the available operation depending on the type of key you selected. For example, if the key type is 'string', then the available operation to choose from is 'matches'.   
![\[The screenshot shows the conditional router transform tab with the filter condition fields for key, operation and value.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transform-conditional-router-filter-condition.png)

1.  Enter the value in the **Value** field. To add additional filter conditions, choose **Add condition**. To remove filter conditions, choose the trash can icon. 

# Using the Concatenate Columns transform to append columns


 The Concatenate transform allows you to build a new string column using the values of other columns with an optional spacer. For example, if we define a concatenated column “date” as the concatenation of “year”, “month” and “day” (in that order) with “-” as the spacer, we would get: 


| day | month | year | date | 
| --- | --- | --- | --- | 
| 01 | 01 | 2020 | 2020-01-01 | 
| 02 | 01 | 2020 | 2020-01-02 | 
| 03 | 01 | 2020 | 2020-01-03 | 
| 04 | 01 | 2020 | 2020-01-04 | 
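The example above reduces to a join with an optional spacer and a replacement for nulls; a plain-Python sketch:

```python
# Sketch of Concatenate Columns: join the chosen columns in order, with an
# optional spacer, substituting a replacement string for null values.
row = {"year": "2020", "month": "01", "day": "02"}

def concatenate(record, columns, spacer="", null_value=""):
    parts = [record.get(c) if record.get(c) is not None else null_value
             for c in columns]
    return spacer.join(parts)

row["date"] = concatenate(row, ["year", "month", "day"], spacer="-")
```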

**To add a Concatenate transform:**

1. Open the Resource panel. Then choose **Concatenate Columns** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column that will hold the concatenated string as well as the columns to concatenate. The order in which you check the columns in the dropdown will be the order used.  
![\[The screenshot shows the Transform tab for the Concatenate transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-concatenate-transform-tab.png)

1. **Spacer - optional** – Enter a string to place between the concatenated fields. By default, there is no spacer.

1. **Null value - optional** – Enter a string to use when a column value is null. By default, in the cases where columns have the value 'NULL' or 'NA', an empty string is used.

# Using the Split String transform to break up a string column


 The Split String transform allows you to break up a string into an array of tokens using a regular expression to define how the split is done. You can then keep the column as an array type or apply an **Array To Columns** transform after this one, to extract the array values onto top level fields, assuming that each token has a meaning we know beforehand. Also, if the order of the tokens is irrelevant (for instance, a set of categories), you can use the **Explode** transform to generate a separate row for each value. 

 For example, you can split the column “categories” using a comma as the pattern to add a column “categories_arr”. 


| product_id | categories | categories_arr | 
| --- | --- | --- | 
| 1 | sports,winter | [sports, winter] | 
| 2 | garden,tools | [garden, tools] | 
| 3 | videogames | [videogames] | 
| 4 | game,boardgame,social | [game, boardgame, social] | 
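Because the pattern is a regular expression, the example above is a one-line `re.split`; a dot, by contrast, must be escaped:

```python
import re

# Sketch of Split String: a plain comma works as-is as the pattern, while
# a dot has special meaning in a regular expression and must be escaped.
row = {"product_id": 1, "categories": "sports,winter"}
row["categories_arr"] = re.split(",", row["categories"])

ip_parts = re.split(r"\.", "192.168.0.1")  # escaped dot pattern
```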

**To add a Split String transform:**

1. Open the Resource panel and then choose Split String to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the Node properties tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the column to split and enter the pattern to use to split the string. In most cases you can just enter the character(s), unless a character has a special meaning as a regular expression and needs to be escaped by adding a backslash in front of it. The characters that need escaping are: `\.[]{}()<>*+-=!?^$|`. For instance, if you want to separate by a dot ('.'), you need to enter `\.`. However, a comma doesn’t have a special meaning and can be specified as is: `,`.  
![\[The screenshot shows the Transform tab for the Split String transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-split-string-transform-tab.png)

1. (Optional) If you want to keep the original string column, you can enter a name for a new array column; that way you keep both the original string column and the new tokenized array column.

# Using the Array To Columns transform to extract the elements of an array into top level columns


 The Array To Columns transform allows you to extract some or all of the elements of a column of type array into new columns. The transform fills as many of the new columns as the array has values to extract, optionally taking the elements from the positions you specify. 

 For instance, if you have an array column “subnet”, which was the result of applying the Split String transform on an IPv4 address, you can extract the first and fourth positions into new columns “first_octet” and “fourth_octet”. The output of the transform in this example would be (notice the last two rows have shorter arrays than expected): 


| subnet | first_octet | fourth_octet | 
| --- | --- | --- | 
| [54, 240, 197, 238] | 54 | 238 | 
| [192, 168, 0, 1] | 192 | 1 | 
| [192, 168] | 192 |  | 
| [] |  |  | 
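The fill-as-much-as-possible behavior for the short rows above can be sketched in plain Python, using 1-based positions to match the "first and fourth" wording:

```python
# Sketch of Array To Columns: copy array elements into new top-level
# columns, filling with None where the array is too short. Indexes are
# 1-based positions, matching the first/fourth octet example.
def array_to_columns(array, new_columns, indexes=None):
    indexes = indexes or list(range(1, len(new_columns) + 1))
    return {
        col: (array[i - 1] if i <= len(array) else None)
        for col, i in zip(new_columns, indexes)
    }

octets = array_to_columns([192, 168], ["first_octet", "fourth_octet"], [1, 4])
```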

**To add an Array To Columns transform:**

1. Open the Resource panel and then choose **Array To Columns** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the array column to extract and enter the list of new columns for the tokens extracted.  
![\[The screenshot shows the Transform tab for the Array To Columns transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-array-to-columns-transform-tab.png)

1. (Optional) If you don’t want to take the array elements in order when assigning them to columns, you can specify the indexes to take; they are assigned to the list of columns in the same order specified. For instance, if the output columns are “column1, column2, column3” and the indexes are “4, 1, 3”, the fourth element of the array goes to column1, the first to column2, and the third to column3 (if the array is shorter than an index, a NULL value is set).

# Using the Add Current Timestamp transform


 The **Add Current Timestamp** transform marks the rows with the time at which the data was processed. This is useful for auditing purposes or for tracking latency in the data pipeline. You can add this new column as a timestamp data type or a formatted string. 
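The formatted-string variant reduces to stamping each row with the processing time; a sketch with hypothetical rows and column name:

```python
from datetime import datetime, timezone

# Sketch of Add Current Timestamp: every row gains a column recording when
# it was processed, here as a formatted string.
rows = [{"id": 1}, {"id": 2}]
now = datetime.now(timezone.utc)
for row in rows:
    row["processing_time"] = now.strftime("%Y-%m-%d %H:%M:%S")
```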

**To add an Add Current Timestamp transform:**

1. Open the Resource panel and then choose **Add Current Timestamp** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent. 

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.  
![\[The screenshot shows the Transform tab for the Add Current Timestamp transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-add-current-timestamp-transform-tab.png)

1. (Optional) On the **Transform** tab, enter a custom name for the new column, and a format if you would rather the column be a formatted date string.

# Using the Pivot Rows to Columns transform


 The **Pivot Rows to Columns** transform allows you to aggregate a numeric column by rotating unique values in selected columns into new columns (if multiple columns are selected, their values are concatenated to name the new columns). This consolidates rows while adding columns with partial aggregations for each unique value. For example, if you have this dataset of sales by month and country (sorted to make it easier to illustrate): 


| year | month | country | amount | 
| --- | --- | --- | --- | 
| 2020 | Jan | uk | 32 | 
| 2020 | Jan | de | 42 | 
| 2020 | Jan | us | 64 | 
| 2020 | Feb | uk | 67 | 
| 2020 | Feb | de | 4 | 
| 2020 | Feb | de | 7 | 
| 2020 | Feb | us | 6 | 
| 2020 | Feb | us | 12 | 
| 2021 | Jan | us | 90 | 

 If you pivot **amount** as the aggregated column and **country** as the pivot column, new columns are created from the original **country** column. In the table below, you have new columns for **de**, **uk**, and **us** instead of the **country** column. 


| year | month | de | uk | us | 
| --- | --- | --- | --- | --- | 
| 2020 | Jan | 42 | 32 | 64 | 
| 2020 | Feb | 11 | 67 | 18 | 
| 2021 | Jan |  |  | 90 | 

 If instead you pivot both the month and the country, you get a column for each combination of the values of those columns: 


| year | Jan\_de | Jan\_uk | Jan\_us | Feb\_de | Feb\_uk | Feb\_us | 
| --- | --- | --- | --- | --- | --- | --- | 
| 2020 | 42 | 32 | 64 | 11 | 67 | 18 | 
| 2021 |  |  | 90 |  |  |  | 
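The aggregation above can be sketched in plain Python. This is a minimal sketch of the logic, not the Glue API; the `_` separator for combined column names is an illustrative assumption:

```python
from collections import defaultdict

# Pure-Python sketch of the pivot: unique values of the pivot column(s)
# become new columns holding the summed amounts for each group.
def pivot(rows, group_keys, pivot_keys, value_key):
    out = defaultdict(dict)
    for row in rows:
        group = tuple(row[k] for k in group_keys)
        # multiple pivot columns are concatenated to name the new column
        new_col = "_".join(str(row[k]) for k in pivot_keys)
        out[group][new_col] = out[group].get(new_col, 0) + row[value_key]
    return dict(out)

data = [
    {"year": 2020, "month": "Jan", "country": "uk", "amount": 32},
    {"year": 2020, "month": "Feb", "country": "de", "amount": 4},
    {"year": 2020, "month": "Feb", "country": "de", "amount": 7},
]
result = pivot(data, ["year"], ["month", "country"], "amount")
# the two Feb/de rows are consolidated into a single Feb_de value of 11
```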

**To add a Pivot Rows To Columns transform:**

1. Open the Resource panel and then choose **Pivot Rows To Columns** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the numeric column to aggregate to produce the values for the new columns, the aggregation function to apply, and the column(s) whose unique values to convert into new columns.  
![\[The screenshot shows the Transform tab for the Pivot Rows To Columns transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-pivot-rows-to-columns-transform-tab.png)

# Using the Unpivot Columns To Rows transform


 The **Unpivot** transform allows you to convert columns into values of new columns, generating a row for each unique value. It’s the opposite of pivot, but note that it’s not equivalent, since it cannot separate rows with identical values that were aggregated, or split combined column names back into the original columns (you can do that later using a Split transform). For example, if you have the following table: 


| year | month | de | uk | us | 
| --- | --- | --- | --- | --- | 
| 2020 | Jan | 42 | 32 | 64 | 
| 2020 | Feb | 11 | 67 | 18 | 
| 2021 | Jan |  |  | 90 | 

 You can unpivot the columns “de”, “uk”, and “us” into a column “country”, with the values in a column “amount”, and get the following (sorted here for illustration purposes): 


| year | month | country | amount | 
| --- | --- | --- | --- | 
| 2020 | Jan | uk | 32 | 
| 2020 | Jan | de | 42 | 
| 2020 | Jan | us | 64 | 
| 2020 | Feb | uk | 67 | 
| 2020 | Feb | de | 11 | 
| 2020 | Feb | us | 18 | 
| 2021 | Jan | us | 90 | 

 Notice that the columns that have a NULL value (“de” and “uk” of Jan 2021) don’t get rows generated by default. You can enable that option to get: 


| year | month | country | amount | 
| --- | --- | --- | --- | 
| 2020 | Jan | uk | 32 | 
| 2020 | Jan | de | 42 | 
| 2020 | Jan | us | 64 | 
| 2020 | Feb | uk | 67 | 
| 2020 | Feb | de | 11 | 
| 2020 | Feb | us | 18 | 
| 2021 | Jan | us | 90 | 
| 2021 | Jan | de |  | 
| 2021 | Jan | uk |  | 
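The behavior shown in the two tables above can be sketched in plain Python (a minimal sketch, not the Glue API):

```python
# Pure-Python sketch of the unpivot: each selected column becomes a row
# carrying its name and value in two new columns; NULL cells are skipped
# unless include_nulls is set.
def unpivot(rows, value_cols, name_col, value_name, include_nulls=False):
    out = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in value_cols}
        for col in value_cols:
            value = row.get(col)
            if value is None and not include_nulls:
                continue
            out.append({**kept, name_col: col, value_name: value})
    return out

table = [{"year": 2021, "month": "Jan", "de": None, "uk": None, "us": 90}]
rows = unpivot(table, ["de", "uk", "us"], "country", "amount")
# only the "us" row is generated; with include_nulls=True all three are
```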

**To add an Unpivot Columns to Rows transform:**

1. Open the Resource panel and then choose **Unpivot Columns to Rows** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the new columns to be created to hold the names and values of the columns chosen to unpivot.  
![\[The screenshot shows the Transform tab for the Unpivot Columns To Rows transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-unpivot-columns-to-rows-transform-tab.png)

# Using the Autobalance Processing transform to optimize your runtime


 The **Autobalance Processing** transform redistributes the data among the workers for better performance. This helps in cases where the data is unbalanced, or where the way it comes from the source doesn’t allow enough parallel processing on it (which is common when the source is gzipped or JDBC). The redistribution of data has a modest performance cost, so the optimization might not compensate for that effort if the data was already well balanced. Underneath, the transform uses the Apache Spark repartition operation to randomly reassign data among a number of partitions optimal for the cluster capacity. Advanced users can enter a number of partitions manually. In addition, the transform can be used to optimize the writing of partitioned tables by reorganizing the data based on specified columns, which results in more consolidated output files. 

**To add an Autobalance Processing transform:**

1. Open the Resource panel and then choose **Autobalance Processing** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, you can enter a number of partitions. In general, it’s recommended that you let the system decide this value; however, you can tune the multiplier or enter a specific value if you need to control this. If you are going to save the data partitioned by columns, you can choose the same columns as repartition columns. This minimizes the number of files in each partition and avoids having many files per partition, which would hinder the performance of the tools querying that data.  
![\[The screenshot shows the Transform tab for the Autobalance Processing transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-autobalance-processing-transform-tab.png)

# Using the Derived Column transform to combine other columns


 The **Derived Column** transform allows you to define a new column based on a math formula or SQL expression in which you can use other columns in the data, as well as constants and literals. For instance, to derive a “percentage” column from the columns "success" and "count", you can enter the SQL expression: "success * 100 / count || '%'". 

 Example result: 


| success | count | percentage | 
| --- | --- | --- | 
| 14 | 100 | 14% | 
| 6 | 20 | 30% | 
| 3 | 40 | 7.5% | 
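The example above can be reproduced in plain Python. This is a sketch of the expression's arithmetic only, assuming `||` is SQL string concatenation; it is not the Glue expression engine:

```python
# Sketch of the example expression success * 100 / count || '%'
# evaluated per row, then formatted like the SQL string concatenation.
def derive_percentage(rows):
    for row in rows:
        pct = row["success"] * 100 / row["count"]
        row["percentage"] = f"{pct:g}%"  # %g drops the trailing .0
    return rows

rows = derive_percentage([{"success": 14, "count": 100},
                          {"success": 3, "count": 40}])
```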

**To add a Derived Column transform:**

1. Open the Resource panel and then choose **Derived Column** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the name of the column and the expression for its content.  
![\[The screenshot shows the Transform tab for the Derived Column transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-dervied-column-transform-tab.png)

# Using the Lookup transform to add matching data from a catalog table


 The **Lookup** transform allows you to add columns from a defined catalog table when the keys match the defined lookup columns in the data. This is equivalent to doing a left outer join between the data and the lookup table, using the matching columns as the join condition. 
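A minimal pure-Python sketch of this left-outer-join behavior (not the Glue API; the column names `user_id`, `id`, and `name` are illustrative):

```python
# Sketch of the lookup as a left outer join: rows keep their columns and
# gain the selected lookup columns, or None when no match is found.
def lookup(rows, lookup_rows, match, bring):
    # match maps a data column to a lookup-table column, e.g. {"user_id": "id"}
    index = {tuple(r[c] for c in match.values()): r for r in lookup_rows}
    out = []
    for row in rows:
        key = tuple(row[c] for c in match)
        found = index.get(key, {})
        out.append({**row, **{c: found.get(c) for c in bring}})
    return out

users = [{"id": 7, "region": "eu", "name": "Ana"}]
joined = lookup([{"user_id": 7, "region": "eu"}], users,
                {"user_id": "id", "region": "region"}, ["name"])
```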

**To add a Lookup transform:**

1. Open the Resource panel and then choose **Lookup** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, enter the fully qualified catalog table name to use to perform the lookups. For example, if your database is “mydb” and your table is “mytable”, then enter “mydb.mytable”. Then enter the criteria to find a match in the lookup table; if the lookup key is composite, enter the list of key columns separated by commas. If one or more of the key columns don’t have the same name, then you need to define the match mapping. 

   For example, if the data columns are “user\_id” and “region” and in the users table the corresponding columns are named “id” and “region“, then in the **Columns to match** field, enter: ”user\_id=id, region“. You could write region=region, but it’s not needed because the names are the same.

1. Finally, enter the columns to bring over from the matched row in the lookup table and incorporate into the data. If no match is found, those columns will be set to NULL.
**Note**  
Underneath, the **Lookup** transform uses a left join in order to be efficient. If the lookup table has a composite key, ensure the columns to match are set up to match all the key columns so that only one match can occur. Otherwise, multiple lookup rows will match, and an extra row will be added for each of those matches.  
![\[The screenshot shows the Transform tab for the Lookup transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-lookup-transform-tab.png)

# Using the Explode Array or Map Into Rows transform


 The **Explode** transform allows you to extract values from a nested structure into individual rows that are easier to manipulate. In the case of an array, the transform will generate a row for each value of the array, replicating the values for the other columns in the row. In the case of a map, the transform will generate a row for each entry with the key and value as columns plus any other columns in the row. 

 For example, consider this dataset, which has a “category” array column with multiple values: 


| product\_id | category | 
| --- | --- | 
| 1 | [sports, winter] | 
| 2 | [garden, tools] | 
| 3 | [videogames] | 
| 4 | [game, boardgame, social] | 
| 5 | [] | 

 If you explode the 'category' column into a column with the same name, you will overwrite the original column. You can select the option to include NULLs to get the following (ordered for illustration purposes): 


| product\_id | category | 
| --- | --- | 
| 1 | sports | 
| 1 | winter | 
| 2 | garden | 
| 2 | tools | 
| 3 | videogames | 
| 4 | game | 
| 4 | boardgame | 
| 4 | social | 
| 5 |  | 
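For the array case shown above, the logic can be sketched in plain Python (a minimal sketch covering arrays only, not maps, and not the Glue API):

```python
# Sketch of exploding an array column: one output row per element,
# replicating the other columns; empty or NULL arrays are dropped
# unless NULLs are explicitly included.
def explode(rows, column, include_nulls=False):
    out = []
    for row in rows:
        values = row.get(column) or []  # treat None like an empty array
        if not values:
            if include_nulls:
                out.append({**row, column: None})
            continue
        for value in values:
            out.append({**row, column: value})
    return out

rows = explode([{"product_id": 4, "category": ["game", "boardgame"]},
                {"product_id": 5, "category": []}],
               "category", include_nulls=True)
```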

**To add an Explode Array Or Map Into Rows transform:**

1. Open the Resource panel and then choose **Explode Array Or Map Into Rows** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. (Optional) On the **Node properties** tab, you can enter a name for the node in the job diagram. If a node parent is not already selected, then choose a node from the Node parents list to use as the input source for the transform.

1. On the **Transform** tab, choose the column to explode (it must be an array or map type). Then enter a name for the column for the items of the array or the names of the columns for the keys and values if you are exploding a map.

1. (Optional) On the **Transform** tab, by default, if the column to explode is NULL or has an empty structure, the row is omitted from the exploded dataset. If you want to keep the row (with the new columns as NULL), check “Include NULLs”.  
![\[The screenshot shows the Transform tab for the Explode Array or Map Into Rows transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/transforms-explode-array-transform-tab.png)

# Using the Record Matching transform to invoke an existing data classification transform


This transform invokes an existing Record Matching machine learning data classification transform.

The transform evaluates the current data against the trained model based on labels. A column "match\_id" is added to assign each row to a group of items that are considered equivalent based on the algorithm training. For more information, see [Record matching with Lake Formation FindMatches](https://docs.amazonaws.cn/glue/latest/dg/machine-learning.html).

**Note**  
The version of Amazon Glue used by the visual job must match the version that Amazon Glue used to create the Record Matching transform.

![\[The screenshot shows a data preview for the transform.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/recording-matching-transform-1.png)


**To add a Record Matching transform node to your job diagram**

1. Open the Resource panel, and then choose **Record Matching** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the ID taken from the **Machine learning transforms** page:  
![\[The screenshot shows the ID from the Machine learning transforms page.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/recording-matching-transform-2.png)

1. (Optional) On the **Transform** tab, you can check the option to add the confidence scores. At the cost of extra computing, the model will estimate a confidence score for each match as an additional column.

# Removing null rows


This transform removes from the dataset the rows that have all columns as null. In addition, you can extend this criterion to include empty fields, keeping only rows where at least one column is non-empty.
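The rule can be sketched in plain Python (a minimal sketch, not the Glue API):

```python
# Sketch: drop a row only when every column is null (or, with the
# extended option, null or empty).
def remove_null_rows(rows, extended=False):
    def is_null(value):
        return value is None or (extended and value in ("", [], {}))
    return [r for r in rows if not all(is_null(v) for v in r.values())]

kept = remove_null_rows([{"a": None, "b": ""}, {"a": 1, "b": None}],
                        extended=True)
# only the second row survives: the first is all null-or-empty
```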

**To add a Remove Null Rows transform node to your job diagram**

1. Open the Resource panel, and then choose **Remove Null Rows** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. (Optional) On the **Transform** tab, check the **Extended** option if you want rows to be removed not only when all columns are null but also when they are empty. With this option, empty strings, arrays, and maps are considered null for the purposes of this transform.

# Parsing a string column containing JSON data


This transform parses a string column containing JSON data and converts it to a struct or an array column, depending on whether the JSON is an object or an array, respectively. Optionally, you can keep both the parsed and the original column.

The JSON schema can be provided or inferred (in the case of JSON objects), with optional sampling.

**To add a Parse JSON Column transform node to your job diagram**

1. Open the Resource panel, and then choose **Parse JSON Column** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, select the column containing the JSON string.

1. (Optional) On the **Transform** tab, enter the schema that the JSON data follows using SQL syntax, for instance: "field1 STRING, field2 INT" in the case of an object or "ARRAY<STRING>" in the case of an array.

   In the case of an array, the schema is required, but in the case of an object, if the schema is not specified, it will be inferred from the data. To reduce the impact of inferring the schema (especially on a large dataset), you can avoid reading the whole data twice by entering a **Ratio of samples to use to infer schema**. If the value is lower than 1, the corresponding ratio of random samples is used to infer the schema. If the data is reliable and the object is consistent between rows, you can use a small ratio such as 0.1 to improve performance.

1. (Optional) On the **Transform** tab, you can enter a new column name if you want to keep both the original string column and the parsed column.
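A minimal standard-library sketch of the parsing behavior (not the Glue API; schema declaration and sampling-based inference are omitted, and the column names are illustrative):

```python
import json

# Sketch: parse the JSON string column into a structured value
# (a dict for an object, a list for an array), optionally keeping
# the original string column by writing to a new column.
def parse_json_column(rows, column, new_column=None):
    target = new_column or column
    for row in rows:
        row[target] = json.loads(row[column])
    return rows

rows = parse_json_column([{"payload": '{"field1": "a", "field2": 2}'}],
                         "payload", "parsed")
```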

# Extracting a JSON path


This transform extracts new columns from a JSON string column. This transform is useful when you only need a few data elements and don't want to import the entire JSON content into the table schema.

**To add an Extract JSON Path transform node to your job diagram**

1. Open the Resource panel, and then choose **Extract JSON Path** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, select the column containing the JSON string. Enter one or more JSON path expressions separated by commas, each one referencing how to extract a value out of the JSON array or object. For instance, if the JSON column contains objects with the properties "prop\_1" and "prop\_2", you can extract both by specifying their names: "prop\_1, prop\_2".

   If the JSON field name has special characters, for instance to extract the property from the JSON `{"a. a": 1}`, you can use the path `$['a. a']`. The exception is the comma, because it is reserved to separate paths. Then enter the corresponding column names for each path, separated by commas.

1. (Optional) On the **Transform** tab, you can check the option to drop the JSON column once extracted. This makes sense when you don't need the rest of the JSON data after you have extracted the parts you need.
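The steps above can be sketched with the standard library. This is a simplified illustration, not the Glue API: it supports only dotted property lookups, whereas real JSON path syntax (such as `$['a. a']`) is richer:

```python
import json

# Sketch: pull a few properties out of a JSON string column into new
# columns, optionally dropping the original column afterwards.
def extract_json_paths(rows, column, paths, new_columns, drop=False):
    for row in rows:
        data = json.loads(row[column])
        for path, name in zip(paths, new_columns):
            value = data
            for part in path.split("."):  # simplified dotted lookup only
                value = value.get(part) if isinstance(value, dict) else None
            row[name] = value
        if drop:
            del row[column]
    return rows

rows = extract_json_paths([{"doc": '{"prop_1": 1, "prop_2": 2}'}],
                          "doc", ["prop_1", "prop_2"],
                          ["prop_1", "prop_2"], drop=True)
```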

# Extracting string fragments using a regular expression


This transform extracts string fragments using a regular expression and creates a new column out of them, or multiple columns if the regex uses groups.

**To add a Regex Extractor transform node to your job diagram**

1. Open the Resource panel, and then choose **Regex Extractor** to add a new transform to your job diagram. The node selected at the time of adding the node will be its parent.

1. In the node properties panel, you can enter a name for the node in the job diagram. If a node parent isn't already selected, choose a node from the **Node parents** list to use as the input source for the transform.

1. On the **Transform** tab, enter the regular expression and the column on which it needs to be applied. Then enter the name of the new column in which to store the matching string. The new column will be null only if the source column is null; if the regex doesn’t match, the column will be empty.

   If the regex uses groups, there has to be a corresponding column name for each group, separated by commas, but you can skip a group by leaving its column name empty.

   For example, suppose you have a column "purchase\_date" with strings using both long and short ISO date formats, and you want to extract the year, month, day, and hour, when available. Notice the hour group is optional; otherwise, in the rows where it is not available, all the extracted groups would be empty strings (because the regex wouldn’t match). In this case, we don’t want to name the outer group that makes the time optional but the inner one, so we leave the outer group's name empty and it doesn’t get extracted (that group would include the T character).  
![\[The screenshot shows configuring a regular expression for the Regex extractor.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/regex-extractor-1.png)

   Resulting in the data preview:  
![\[The screenshot shows configuring a data preview for the Regex extractor.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/regex-extractor-2.png)
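The group behavior can be sketched with Python's `re` module. The pattern below is an illustrative assumption matching the purchase-date scenario described above (here the optional part uses a non-capturing group, so the four named columns line up with the four capturing groups); it is not the exact expression from the screenshot:

```python
import re

# Sketch: each output column receives one regex group; rows where the
# pattern does not match (or a group is absent) get empty strings.
def regex_extract(rows, column, pattern, new_columns):
    compiled = re.compile(pattern)
    for row in rows:
        m = compiled.search(row[column] or "")
        groups = m.groups() if m else ("",) * len(new_columns)
        for name, value in zip(new_columns, groups):
            if name:  # an empty column name skips that group
                row[name] = value if value is not None else ""
    return rows

rows = regex_extract([{"purchase_date": "2023-05-17T09:15"}],
                     "purchase_date",
                     r"(\d{4})-(\d{2})-(\d{2})(?:T(\d{2}))?",
                     ["year", "month", "day", "hour"])
```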

# Creating a custom transformation


If you need to perform more complicated transformations on your data, or want to add data property keys to the dataset, you can add a **Custom code** transform to your job diagram. The Custom code node allows you to enter a script that performs the transformation. 

When using custom code, you must use a schema editor to indicate the changes made to the output through the custom code. When editing the schema, you can perform the following actions:
+ Add or remove data property keys
+ Change the data type of data property keys
+ Change the name of data property keys
+ Restructure a nested property key

You must use a *SelectFromCollection* transform to choose a single `DynamicFrame` from the result of your Custom transform node before you can send the output to a target location. 

Use the following tasks to add a custom transform node to your job diagram.

## Adding a custom code transform node to the job diagram


**To add a custom transform node to your job diagram**

1. (Optional) Open the Resource panel and then choose **Custom transform** to add a custom transform to your job diagram. 

1. On the **Node properties** tab, enter a name for the node in the job diagram. If a node parent is not already selected, or if you want multiple inputs for the custom transform, then choose a node from the **Node parents** list to use as the input source for the transform.

## Entering code for the custom transform node


You can type or copy code into an input field. The job uses this code to perform the data transformation. You can provide a code snippet in either Python or Scala. The code should take one or more `DynamicFrames` as input and return a collection of `DynamicFrames`. 

**To enter the script for a custom transform node**

1. With the custom transform node selected in the job diagram, choose the **Transform** tab. 

1. In the text entry field under the heading **Code block**, paste or enter the code for the transformation. The code that you use must match the language specified for the job on the **Job details** tab.

   When referring to the input nodes in your code, Amazon Glue Studio names the `DynamicFrames` returned by the job diagram nodes sequentially based on the order of creation. Use one of the following naming methods in your code:
   + Classic code generation – Use functional names to refer to the nodes in your job diagram.
     + Data source nodes: `DataSource0`, `DataSource1`, `DataSource2`, and so on.
     + Transform nodes: `Transform0`, `Transform1`, `Transform2`, and so on.
   + New code generation – Use the name specified on the **Node properties** tab of a node, appended with '`_node1`', '`_node2`', and so on. For example, `S3bucket_node1`, `ApplyMapping_node2`, `S3bucket_node2`, `MyCustomNodeName_node1`.

   For more information about the new code generator, see [Script code generation](job-editor-features.md#code-gen).

The following examples show the format of the code to enter in the code box:

------
#### [ Python ]

The following example takes the first `DynamicFrame` received, converts it to a `DataFrame` to apply the native filter method (keeping only records that have over 1000 votes), then converts it back to a `DynamicFrame` before returning it.

```
def FilterHighVoteCounts (glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0]).toDF()
    df_filtered = df.filter(df["vote_count"] > 1000)
    dyf_filtered = DynamicFrame.fromDF(df_filtered, glueContext, "filter_votes")
    return(DynamicFrameCollection({"CustomTransform0": dyf_filtered}, glueContext))
```

------
#### [ Scala ]

The following example takes the first `DynamicFrame` received, converts it to a `DataFrame` to apply the native filter method (keeping only records that have over 1000 votes), then converts it back to a `DynamicFrame` before returning it.

```
object FilterHighVoteCounts {
  def execute(glueContext : GlueContext, input : Seq[DynamicFrame]) : Seq[DynamicFrame] = {
    val frame = input(0).toDF()
    val filtered = DynamicFrame(frame.filter(frame("vote_count") > 1000), glueContext)
    Seq(filtered)
  }
}
```

------

## Editing the schema in a custom transform node


When you use a custom transform node, Amazon Glue Studio cannot automatically infer the output schemas created by the transform. You use the schema editor to describe the schema changes implemented by the custom transform code.

A custom code node can have any number of parent nodes, each providing a `DynamicFrame` as input for your custom code. A custom code node returns a collection of `DynamicFrames`. Each `DynamicFrame` that is used as input has an associated schema. You must add a schema that describes each `DynamicFrame` returned by the custom code node. 

**Note**  
 When you set your own schema on a custom transform, Amazon Glue Studio does not inherit schemas from previous nodes. To update the schema, select the Custom transform node, then choose the **Data preview** tab. Once the preview is generated, choose **Use Preview Schema**. The schema will then be replaced by the schema derived from the preview data. 

**To edit the output schemas for a custom transform node**

1. With the custom transform node selected in the job diagram, in the node details panel, choose the **Output schema** tab. 

1. Choose **Edit** to make changes to the schema. 

   If you have nested data property keys, such as an array or object, you can choose the **Expand-Rows** icon (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/expand-rows-icon.png)) on the top right of each schema panel to expand the list of child data property keys. After you choose this icon, it changes to the **Collapse-Rows** icon (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/collapse-rows-icon.png)), which you can choose to collapse the list of child property keys.

1. Modify the schema using the following actions in the section on the right side of the page:
   + To rename a property key, place the cursor in the **Key** text box for the property key, then enter the new name.
   + To change the data type for a property key, use the list to choose the new data type for the property key.
   + To add a new top-level property key to the schema, choose the **Overflow** (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-schema-actions-button.png)) icon to the left of the **Cancel** button, and then choose **Add root key**.
   + To add a child property key to the schema, choose the **Add-Key** icon ![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/filter-add-icon.png)associated with the parent key. Enter a name for the child key and choose the data type.
   + To remove a property key from the schema, choose the **Remove** icon (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/delete-icon-black.png)) to the far right of the key name. 

1. If your custom transform code uses multiple `DynamicFrames`, you can add additional output schemas. 
   + To add a new, empty schema, choose the **Overflow** (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-schema-actions-button.png)) icon, and then choose **Add output schema**.
   + To copy an existing schema to a new output schema, make sure the schema you want to copy is displayed in the schema selector. Choose the **Overflow** (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-schema-actions-button.png)) icon, and then choose **Duplicate**.

   If you want to remove an output schema, make sure the schema you want to remove is displayed in the schema selector. Choose the **Overflow** (![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-schema-actions-button.png)) icon, and then choose **Delete**.

1. Add new root keys to the new schema or edit the duplicated keys. 

1. When you are modifying the output schemas, choose the **Apply** button to save your changes and exit the schema editor.

   If you do not want to save your changes, choose the **Cancel** button.

## Configure the custom transform output


A custom code transform returns a collection of `DynamicFrames`, even if there is only one `DynamicFrame` in the result set. 

**To process the output from a custom transform node**

1. Add a *SelectFromCollection* transform node, which has the custom transform node as its parent node. Update this transform to indicate which dataset you want to use. See [Using SelectFromCollection to choose which dataset to keep](transforms-configure-select-collection.md) for more information.

1. Add additional *SelectFromCollection* transforms to the job diagram if you want to use additional `DynamicFrames` produced by the custom transform node. 

   Consider a scenario in which you add a custom transform node to split a flight dataset into multiple datasets, but duplicate some of the identifying property keys in each output schema, such as the flight date or flight number. You add a *SelectFromCollection* transform node for each output schema, with the custom transform node as its parent.

1. (Optional) You can then use each *SelectFromCollection* transform node as input for other nodes in the job, or as a parent for a data target node.

# Transform data with custom visual transforms

 Custom visual transforms allow you to create transforms and make them available for use in Amazon Glue Studio jobs. They enable ETL developers, who may not be familiar with coding, to search for and use a growing library of transforms through the Amazon Glue Studio interface. 

 You can create a custom visual transform, then upload it to Amazon S3 to make it available for use in the visual editor in Amazon Glue Studio. 

**Topics**
+ [Getting started with custom visual transforms](custom-visual-transform-getting-started.md)
+ [Step 1. Create a JSON config file](custom-visual-transform-json-config-file.md)
+ [Step 2. Implement the transform logic](custom-visual-transform-implementation.md)
+ [Step 3. Validate and troubleshoot custom visual transforms in Amazon Glue Studio](custom-visual-transform-validation.md)
+ [Step 4. Update custom visual transforms as needed](custom-visual-transform-updating-transforms.md)
+ [Step 5. Use custom visual transforms in Amazon Glue Studio](custom-visual-transform-create-gs.md)
+ [Usage examples](custom-visual-transform-example-json.md)
+ [Examples of custom visual scripts](custom-visual-transform-example-scripts.md)
+ [Video](#custom-visual-transform-video)

# Getting started with custom visual transforms


 To create a custom visual transform, you go through the following steps. 
+  Step 1. Create a JSON config file 
+  Step 2. Implement the transform logic 
+  Step 3. Validate the custom visual transform 
+  Step 4. Update the custom visual transform as needed 
+  Step 5. Use the custom visual transform in Amazon Glue Studio 

 Get started by setting up the Amazon S3 bucket and continue to **Step 1. Create a JSON config file.** 

## Prerequisites


 Customer-supplied transforms reside within a customer Amazon account. That account owns the transforms and therefore has all permissions to view (search and use), edit, or delete them. 

 In order to use a custom transform in Amazon Glue Studio, you will need to create and upload two files to the Amazon S3 assets bucket in that Amazon account: 
+  **Python file** – contains the transform function 
+  **JSON file** – describes the transform. This is also known as the config file that is required to define the transform. 

 In order to pair the files together, use the same name for both. For example: 
+  myTransform.json 
+  myTransform.py 

 Optionally, you can give your custom visual transform a custom icon by providing an **SVG file** containing the icon. To pair the files together, use the same base name for the icon: 
+  myTransform.svg 

 Amazon Glue Studio automatically matches the files using their file names. The file name cannot be the same as that of any existing Python module. 

## Recommended convention for transform file name


 Amazon Glue Studio imports your file as a module (for example, `import myTransform`) in your job script. Therefore, your file name must follow the naming rules for Python variable names (identifiers): it must start with either a letter or an underscore and be composed entirely of letters, digits, and underscores. 

**Note**  
 Ensure that your transform file name does not conflict with an already loaded Python module (for example, `sys`, `array`, `copy`) to avoid unexpected runtime issues. 
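Because these are the standard Python identifier rules, you can sanity-check a candidate file name locally with `str.isidentifier()`. This is an informal check, not something Glue Studio requires; the names below are illustrative:

```
# Which candidate transform file names are valid Python identifiers?
candidates = ["myTransform", "my_transform2", "_helper", "2transform", "my-transform"]
for name in candidates:
    print(name, name.isidentifier())
# "2transform" (starts with a digit) and "my-transform" (contains a hyphen)
# are not valid module names
```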

## Setting up the Amazon S3 bucket


 Transforms you create are stored in Amazon S3 and owned by your Amazon account. You create a new custom visual transform by uploading its files (.json and .py) to the Amazon S3 assets folder where all job scripts are stored (for example, `s3://aws-glue-assets-<accountid>-<region>/transforms`). If you are using a custom icon, upload it as well. By default, Amazon Glue Studio reads all .json files from the /transforms folder in the same S3 bucket. 

# Step 1. Create a JSON config file


 A JSON config file is required to define and describe your custom visual transform. The schema for the config file is as follows. 

## JSON file structure


 **Fields** 
+  `name: string` – (required) the transform system name used to identify transforms. Follow the same naming rules set for python variable names (identifiers). Specifically, they must start with either a letter or an underscore and then be composed entirely of letters, digits, and/or underscores. 
+  `displayName: string` – (optional) the name of the transform displayed in the Amazon Glue Studio visual job editor. If no `displayName` is specified, the `name` is used as the name of the transform in Amazon Glue Studio. 
+  `description: string` – (optional) the transform description is displayed in Amazon Glue Studio and is searchable. 
+  `functionName: string` – (required) the Python function name is used to identify the function to call in the Python script. 
+  `path: string` – (optional) the full Amazon S3 path to the Python source file. If not specified, Amazon Glue uses file name matching to pair the .json and .py files together. For example, the name of the JSON file, `myTransform.json`, will be paired to the Python file, `myTransform.py`, on the same Amazon S3 location. 
+  `parameters: Array of TransformParameter object` – (optional) the list of parameters to be displayed when you configure them in the Amazon Glue Studio visual editor. 

<a name="transformparameter-fields"></a> **TransformParameter fields** 
+  `name: string` – (required) the parameter name that will be passed to the python function as a named argument in the job script. Follow the same naming rules set for python variable names (identifiers). Specifically, they must start with either a letter or an underscore and then be composed entirely of letters, digits, and/or underscores. 
+  `displayName: string` – (optional) the name of the transform displayed in the Amazon Glue Studio visual job editor. If no `displayName` is specified, the `name` is used as the name of the transform in Amazon Glue Studio. 
+  `type: string` – (required) the parameter type accepting common Python data types. Valid values: 'str' | 'int' | 'float' | 'list' | 'bool'. 
+  `isOptional: boolean` – (optional) determines whether the parameter is optional. By default all parameters are required. 
+  `description: string` — (optional) description is displayed in Amazon Glue Studio to help the user configure the transform parameter. 
+  `validationType: string` – (optional) defines the way this parameter is validated. Currently, it only supports regular expressions. By default, the validation type is set to `RegularExpression`. 
+  `validationRule: string` – (optional) regular expression used to validate form input before submit when `validationType` is set to `RegularExpression`. Regular expression syntax must be compatible with [ RegExp Ecmascript specifications](https://tc39.es/ecma262/multipage/text-processing.html#sec-regexp-regular-expression-objects). 
+  `validationMessage: string` – (optional) the message to display when validation fails. 
+  `listOptions: An array of TransformParameterListOption objects` OR a `string` of comma-separated values OR the string value 'column' – (optional) options to display in a Select or Multiselect UI control. Accepts a list of comma-separated values or a strongly typed JSON array of `TransformParameterListOption` objects. It can also dynamically populate the list of columns from the parent node schema when set to the string value "column". 
+  `listType: string` – (optional) defines the option type when `type` is 'list'. Valid values: 'str' | 'int' | 'float' | 'list' | 'bool'. The option type accepts common Python data types. 

 **TransformParameterListOption fields** 
+  `value: string | int | float | bool` – (required) option value. 
+  `label: string` – (optional) option label displayed in the select dropdown. 
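Putting the field descriptions above together, a minimal config file using only the required fields might look like the following sketch (the transform and parameter names are illustrative):

```
{
  "name": "myTransform",
  "functionName": "myTransform",
  "parameters": [
    {
      "name": "colName",
      "type": "str"
    }
  ]
}
```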

## Transform parameters in Amazon Glue Studio


 By default, parameters are required unless marked as `isOptional` in the .json file. In Amazon Glue Studio, parameters are displayed on the **Transform** tab. The example shows user-defined parameters such as Email Address, Phone Number, Your age, Your gender, and Your origin country. 

![\[The screenshot shows a custom visual transform selected and the Transform tab with user-defined parameters.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-parameters.png)


 You can enforce validation in Amazon Glue Studio using regular expressions in the .json file by specifying the `validationRule` parameter and a validation message in `validationMessage`. 

```
      "validationRule": "^\\(?(\\d{3})\\)?[- ]?(\\d{3})[- ]?(\\d{4})$",
      "validationMessage": "Please enter a valid US number"
```
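This particular rule happens to use only syntax shared by ECMAScript and Python regular expressions, so it can be smoke-tested locally with Python's `re` module (keeping in mind that Glue Studio evaluates the rule in the browser as ECMAScript):

```
import re

# The US phone number rule from the validationRule example above
pattern = r"^\(?(\d{3})\)?[- ]?(\d{3})[- ]?(\d{4})$"

print(bool(re.match(pattern, "(555) 123-4567")))  # True
print(bool(re.match(pattern, "555-123-4567")))    # True
print(bool(re.match(pattern, "12345")))           # False
```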

**Note**  
 Since validation occurs in the browser, your regular expression syntax must be compatible with [ RegExp Ecmascript specifications](https://tc39.es/ecma262/multipage/text-processing.html#sec-regexp-regular-expression-objects). Python syntax is not supported for these regular expressions. 

 Adding validation will prevent the user from saving the job with incorrect user input. Amazon Glue Studio displays the validation message as displayed in the example: 

![\[The screenshot shows a custom visual transform parameter with a validation error message: Please enter a valid email address.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-validation-message.png)


 Parameters are displayed in Amazon Glue Studio based on the parameter configuration. 
+  When `type` is any of the following: `str`, `int` or `float`, a text input field is displayed. For example, the screenshot shows input fields for 'Email Address' and 'Your age' parameters.   
![\[The screenshot shows a custom visual transform parameter with text input field.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-email-address.png)  
![\[The screenshot shows a custom visual transform parameter with text input field.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-your-age.png)
+  When `type` is `bool`, a checkbox is displayed.   
![\[The screenshot shows a custom visual transform parameter with text input field.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-bool.png)
+  When `type` is `str` and `listOptions` is provided, a single select list is displayed.   
![\[The screenshot shows a custom visual transform parameter with a single select list drop-down.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-single-list.png)
+  When `type` is `list` and `listOptions` and `listType` are provided, a multi-select list is displayed.   
![\[The screenshot shows a custom visual transform parameter with a list drop-down.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-list-options.png)

### Displaying a column selector as parameter


 If the configuration requires the user to choose a column from the schema, you can display a column selector so the user isn't required to type the column name. By setting the `listOptions` field to `"column"`, Amazon Glue Studio dynamically displays a column selector based on the parent node output schema. Amazon Glue Studio can display either a single or a multiple column selector. 

 The following example uses the schema: 

![\[The screenshot shows a sample output schema.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/custom-visual-transform-example-schema.png)


**To define your Custom Visual Transform parameter to display a single column:**

1.  In your JSON file, for the `parameters` object, set the `listOptions` value to "column". This allows a user to choose a column from a pick list in Amazon Glue Studio.   
![\[The screenshot shows a sample JSON file with the listOptions parameter set to "column" and the resulting user interface in Amazon Glue Studio.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/custom-visual-transform-example-listoptions-column.png)

1.  You can also allow multiple columns selection by defining the parameter as: 
   +  `listOptions: "column"` 
   +  `type: "list"`   
![\[The screenshot shows a sample JSON file with the listOptions parameter set to "column" and the type set to "list", and resulting user interface in Amazon Glue Studio.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/custom-visual-transform-example-listoptions-column-type-list.png)
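In JSON, the multiple-column variant from step 2 corresponds to an entry in `parameters` along these lines (the `name` and `displayName` values are illustrative):

```
{
  "name": "sourceColumns",
  "displayName": "Columns to process",
  "type": "list",
  "listOptions": "column"
}
```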

# Step 2. Implement the transform logic


**Note**  
 Custom visual transforms only support Python scripts. Scala is not supported. 

 To add the code that implements the function defined by the .json config file, it is recommended to place the Python file in the same location as the .json file, with the same name but with the “.py” extension. Amazon Glue Studio automatically pairs the .json and .py files so that you don’t need to specify the path of the Python file in the config file. 

 In the Python file, add the declared function with the named parameters configured, and register it for use on `DynamicFrame`. The following is an example of a Python file: 

```
from awsglue import DynamicFrame

# self refers to the DynamicFrame to transform;
# the parameter names must match the ones defined in the config.
# If a parameter is optional, you need to provide a default value.
def myTransform(self, email, phone, age=None, gender="",
                      country="", promotion=False):
    resulting_dynf = self  # do some transformation on self here
    return resulting_dynf

DynamicFrame.myTransform = myTransform
```
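The registration line at the end is plain Python monkey-patching: assigning a function to a class attribute makes it callable as a method. A minimal stand-in class (hypothetical, no `awsglue` required) demonstrates the pattern:

```
class Frame:
    """Tiny stand-in for DynamicFrame, just to show the registration pattern."""
    def __init__(self, rows):
        self.rows = rows

    def filter(self, predicate):
        return Frame([row for row in self.rows if predicate(row)])

def keep_state(self, colName, state):
    # Same shape as a custom transform: self first, then named parameters
    return self.filter(lambda row: row[colName] == state)

# Equivalent to DynamicFrame.myTransform = myTransform in the real script
Frame.keep_state = keep_state

frame = Frame([{"state": "CA"}, {"state": "NY"}, {"state": "CA"}])
print(len(frame.keep_state("state", "CA").rows))  # 2
```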

 It is recommended to use an Amazon Glue notebook for the quickest way to develop and test the python code. See [Getting started with notebooks in Amazon Glue Studio](https://docs.amazonaws.cn/glue/latest/ug/notebook-getting-started.html). 

 To illustrate how to implement the transform logic, the custom visual transform in the following example filters incoming data to keep only the data related to a specific US state. The .json file sets `functionName` to `custom_filter_state` and declares two arguments, "state" and "colName", both of type "str". 

 The example config .json file is: 

```
{
"name": "custom_filter_state",
"displayName": "Filter State",
"description": "A simple example to filter the data to keep only the state indicated.",
"functionName": "custom_filter_state",
"parameters": [
   {
    "name": "colName",
    "displayName": "Column name",
    "type": "str",
    "description": "Name of the column in the data that holds the state postal code"
   },
   {
    "name": "state",
    "displayName": "State postal code",
    "type": "str",
    "description": "The postal code of the state whose rows to keep"
   }   
  ]
}
```

**To implement the companion script in Python**

1.  Start an Amazon Glue notebook and run the initial cell provided so that the session starts. Running the initial cell creates the basic components required. 

1.  Create a function that performs the filtering, as described in the example, and register it on `DynamicFrame`. Copy the following code and paste it into a cell in the Amazon Glue notebook. 

   ```
   from awsglue import DynamicFrame
   
   def custom_filter_state(self, colName, state):
       return self.filter(lambda row: row[colName] == state)
   
   DynamicFrame.custom_filter_state = custom_filter_state
   ```

1.  Create or load sample data to test the code in the same cell or a new cell. If you add the sample data in a new cell, don't forget to run the cell. For example: 

   ```
   # A few rows of sample data to test
   data_sample = [
       {"state": "CA", "count": 4},
       {"state": "NY", "count": 2},
       {"state": "WA", "count": 3}    
   ]
   df1 = glueContext.sparkSession.sparkContext.parallelize(data_sample).toDF()
   dynf1 = DynamicFrame.fromDF(df1, glueContext, None)
   ```

1.  Test and validate `custom_filter_state` with different arguments:   
![\[The screenshot shows a cell in a Amazon Glue notebook with the arguments passed to the dynamicFrame.show function.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-notebook-test-python.png)

1.  After running several tests, save the code as a .py file whose name mirrors the .json file name. The .py and .json files should be in the same transforms folder. 

    Copy the following code, paste it into a file, and save the file with a .py extension. 

   ```
   from awsglue import DynamicFrame
   
   def custom_filter_state(self, colName, state):
       return self.filter(lambda row: row[colName] == state)
   
   DynamicFrame.custom_filter_state = custom_filter_state
   ```

1.  In Amazon Glue Studio, open a visual job and add the transform to the job by selecting it from the list of available **Transforms**. 

     To reuse this transform in Python script code, add the Amazon S3 path of the .py file to the job under **Referenced files path**, and in the script import the name of the Python file (without the extension) at the top of the file. For example: `import` <name of the file (without the extension)> 
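The import works like any other Python module import: the file name (minus `.py`) becomes the module name, and importing it runs the registration code as a side effect. A local simulation with a throwaway file (hypothetical names; in a real job the file comes from **Referenced files path**) illustrates the mechanism:

```
import importlib
import os
import sys
import tempfile

# Write a stand-in transform module to a temporary directory
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "myTransform.py"), "w") as f:
    f.write("REGISTERED = True  # a real file would do DynamicFrame.x = x here\n")

sys.path.insert(0, tmpdir)
mod = importlib.import_module("myTransform")  # same effect as: import myTransform
print(mod.REGISTERED)  # True
```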

# Step 3. Validate and troubleshoot custom visual transforms in Amazon Glue Studio


 Amazon Glue Studio validates the JSON config file before custom visual transforms are loaded into Amazon Glue Studio. Validation includes: 
+  Presence of required fields 
+  JSON format validation 
+  Incorrect or invalid parameters 
+  Presence of both the .py and .json files in the same Amazon S3 path 
+  Matching filenames for the .py and .json 

 If validation succeeds, the transform is listed in the list of available **Actions** in the visual editor. If a custom icon has been provided, it should be visible beside the **Action**. 

 If validation fails, Amazon Glue Studio does not load the custom visual transform. 
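You can catch the most common of these problems before uploading with a small local script. This is an informal pre-check that mirrors only part of the validation described above, not Glue Studio's actual validator; the function name is illustrative:

```
import json
from pathlib import Path

def precheck_config(json_path):
    """Informal local checks for a custom visual transform config file."""
    problems = []
    path = Path(json_path)
    try:
        cfg = json.loads(path.read_text())
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err}"]
    for field in ("name", "functionName"):
        if field not in cfg:
            problems.append(f"missing required field: {field}")
    if not str(cfg.get("name", "")).isidentifier():
        problems.append("name is not a valid Python identifier")
    if not path.with_suffix(".py").exists():
        problems.append("no matching .py file next to the .json file")
    return problems
```

Running it against a config with a missing `functionName` or an unpaired .py file returns a list describing each problem; an empty list means the basic checks passed.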

# Step 4. Update custom visual transforms as needed


 Once created and used, the transform script can be updated as long as it continues to follow the corresponding .json definition: 
+  The name used when assigning to DynamicFrame must match the .json `functionName`. 
+  The function arguments must be defined in the json file as described in [Step 1. Create a JSON config file](custom-visual-transform-json-config-file.md). 
+  The Amazon S3 path of the Python file cannot change, since the jobs depend directly on it. 

**Note**  
 If any updates are needed, ensure the script and the .json file are updated consistently, and that any visual jobs using the transform are saved again with the new transform. If visual jobs are not saved after the updates are made, the updates are not applied or validated. If the Python script file is renamed or not placed next to the .json file, you need to specify its full path in the .json file. 

**Custom icon**

If you determine the default icon for your **Action** does not visually distinguish it as part of your workflows, you can provide a custom icon, as described in [Getting started with custom visual transforms](custom-visual-transform-getting-started.md). You can update the icon by updating the corresponding SVG hosted in Amazon S3.

For best results, design your image to be viewed at 32x32px, following the guidelines from the Cloudscape Design System. For more information about Cloudscape guidelines, see the [Cloudscape documentation](https://cloudscape.design/foundation/visual-foundation/iconography/#custom-icons).

# Step 5. Use custom visual transforms in Amazon Glue Studio


 To use a custom visual transform in Amazon Glue Studio, you upload the config and source files, then select the transform from the **Action** menu. Any parameters that need values or input are available to you in the **Transform** tab. 

1.  Upload the two files (Python source file and JSON config file) to the Amazon S3 assets folder where the job scripts are stored. By default, Amazon Glue pulls all JSON files from the **/transforms** folder within the same Amazon S3 bucket. 

1.  From the **Action** menu, choose the custom visual transform. It is named with the transform `displayName` or name that you specified in the .json config file. 

1.  Enter values for any parameters that were configured in the config file.   
![\[The screenshot shows a custom visual transform with parameters for the user to complete in the Transform tab.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/dynamic-transform-parameters.png)

# Usage examples


 The following is an example of all possible parameters in a .json config file. 

```
{
  "name": "MyTransform",
  "displayName": "My Transform",
  "description": "This transform description will be displayed in UI",
  "functionName": "myTransform",
  "parameters": [
      {
      "name": "email",
      "displayName": "Email Address",
      "type": "str",
      "description": "Enter your work email address below",
      "validationType": "RegularExpression",
      "validationRule": "^\\w+([\\.-]?\\w+)*@\\w+([\\.-]?\\w+)*(\\.\\w{2,3})+$",
      "validationMessage": "Please enter a valid email address"
    },
    {
      "name": "phone",
      "displayName": "Phone Number",
      "type": "str",
      "description": "Enter your mobile phone number below",
      "validationRule": "^\\(?(\\d{3})\\)?[- ]?(\\d{3})[- ]?(\\d{4})$",
      "validationMessage": "Please enter a valid US number"
    },
    {
      "name": "age",
      "displayName": "Your age",
      "type": "int",
      "isOptional": true
    },
    {
      "name": "gender",
      "displayName": "Your gender",
      "type": "str",
      "listOptions": [
            {"label": "Male", "value": "male"},
            {"label": "Female", "value": "female"},
            {"label": "Other", "value": "other"}
        ],
      "isOptional": true
    },
    {
      "name": "country",
      "displayName": "Your origin country ?",
      "type": "list",
      "listOptions": "Afghanistan,Albania,Algeria,American Samoa,Andorra,Angola,Anguilla,Antarctica,Antigua and Barbuda,Argentina,Armenia,Aruba,Australia,Austria,Azerbaijan,Bahamas,Bahrain,Bangladesh,Barbados,Belarus,Belgium,Belize,Benin,Bermuda,Bhutan,Bolivia,Bosnia and Herzegovina,Botswana,Bouvet Island,Brazil,British Indian Ocean Territory,Brunei Darussalam,Bulgaria,Burkina Faso,Burundi,Cambodia,Cameroon,Canada,Cape Verde,Cayman Islands,Central African Republic,Chad,Chile,China,Christmas Island,Cocos (Keeling Islands),Colombia,Comoros,Congo,Cook Islands,Costa Rica,Cote D'Ivoire (Ivory Coast),Croatia (Hrvatska,Cuba,Cyprus,Czech Republic,Denmark,Djibouti,Dominica,Dominican Republic,East Timor,Ecuador,Egypt,El Salvador,Equatorial Guinea,Eritrea,Estonia,Ethiopia,Falkland Islands (Malvinas),Faroe Islands,Fiji,Finland,France,France,Metropolitan,French Guiana,French Polynesia,French Southern Territories,Gabon,Gambia,Georgia,Germany,Ghana,Gibraltar,Greece,Greenland,Grenada,Guadeloupe,Guam,Guatemala,Guinea,Guinea-Bissau,Guyana,Haiti,Heard and McDonald Islands,Honduras,Hong Kong,Hungary,Iceland,India,Indonesia,Iran,Iraq,Ireland,Israel,Italy,Jamaica,Japan,Jordan,Kazakhstan,Kenya,Kiribati,Korea (North),Korea (South),Kuwait,Kyrgyzstan,Laos,Latvia,Lebanon,Lesotho,Liberia,Libya,Liechtenstein,Lithuania,Luxembourg,Macau,Macedonia,Madagascar,Malawi,Malaysia,Maldives,Mali,Malta,Marshall Islands,Martinique,Mauritania,Mauritius,Mayotte,Mexico,Micronesia,Moldova,Monaco,Mongolia,Montserrat,Morocco,Mozambique,Myanmar,Namibia,Nauru,Nepal,Netherlands,Netherlands Antilles,New Caledonia,New Zealand,Nicaragua,Niger,Nigeria,Niue,Norfolk Island,Northern Mariana Islands,Norway,Oman,Pakistan,Palau,Panama,Papua New Guinea,Paraguay,Peru,Philippines,Pitcairn,Poland,Portugal,Puerto Rico,Qatar,Reunion,Romania,Russian Federation,Rwanda,Saint Kitts and Nevis,Saint Lucia,Saint Vincent and The Grenadines,Samoa,San Marino,Sao Tome and Principe,Saudi Arabia,Senegal,Seychelles,Sierra Leone,Singapore,Slovak Republic,Slovenia,Solomon Islands,Somalia,South Africa,S. Georgia and S. Sandwich Isls.,Spain,Sri Lanka,St. Helena,St. Pierre and Miquelon,Sudan,Suriname,Svalbard and Jan Mayen Islands,Swaziland,Sweden,Switzerland,Syria,Tajikistan,Tanzania,Thailand,Togo,Tokelau,Tonga,Trinidad and Tobago,Tunisia,Turkey,Turkmenistan,Turks and Caicos Islands,Tuvalu,Uganda,Ukraine,United Arab Emirates,United Kingdom (Britain / UK),United States of America (USA),US Minor Outlying Islands,Uruguay,Uzbekistan,Vanuatu,Vatican City State (Holy See),Venezuela,Viet Nam,Virgin Islands (British),Virgin Islands (US),Wallis and Futuna Islands,Western Sahara,Yemen,Yugoslavia,Zaire,Zambia,Zimbabwe",
      "description": "What country were you born in?",
      "listType": "str",
      "isOptional": true
    },
    {
      "name": "promotion",
      "displayName": "Do you want to receive promotional newsletter from us?",
      "type": "bool",
      "isOptional": true
    }
  ]
}
```

# Examples of custom visual scripts


 The following examples perform equivalent transformations. However, the second example (SparkSQL) is the cleanest and most efficient, followed by the pandas UDF, and finally the low-level mapping in the first example. The following is a complete example of a simple transformation that adds up two columns: 

```
from awsglue import DynamicFrame
 
# You can have other auxiliary variables, functions, or classes in this file; they won't affect the runtime
def record_sum(rec, col1, col2, resultCol):
    rec[resultCol] = rec[col1] + rec[col2]
    return rec
 
 
# The number and name of arguments must match the definition in the json config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
#  (resultCol in this example is an optional argument)
def custom_add_columns(self, col1, col2, resultCol="result"):
    # The mapping can alter the column order, which could be important
    fields = [field.name for field in self.schema()]
    if resultCol not in fields:
        # If it's a new column put it at the end
        fields.append(resultCol)
    return self.map(lambda record: record_sum(record, col1, col2, resultCol)).select_fields(paths=fields)
 
 
# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
```
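The row-level helper `record_sum` is plain Python and can be exercised on a dict, independent of any Glue session:

```
# Standalone copy of the row-level helper from the script above
def record_sum(rec, col1, col2, resultCol):
    rec[resultCol] = rec[col1] + rec[col2]
    return rec

row = {"a": 2, "b": 3}
print(record_sum(row, "a", "b", "result"))  # {'a': 2, 'b': 3, 'result': 5}
```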

 The following example is an equivalent transform leveraging the SparkSQL API. 

```
from awsglue import DynamicFrame
 
# The number and name of arguments must match the definition in the json config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
#  (resultCol in this example is an optional argument)
def custom_add_columns(self, col1, col2, resultCol="result"):
    df = self.toDF()
    return DynamicFrame.fromDF(
        df.withColumn(resultCol, df[col1] + df[col2]) # This is the conversion logic
        , self.glue_ctx, self.name) 
 
 
# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
```

 The following example performs the same transformation but uses a pandas UDF, which is more efficient than using a plain UDF. For more information about writing pandas UDFs, see the [Apache Spark SQL documentation](https://spark.apache.org/docs/3.1.1/api/python/reference/api/pyspark.sql.functions.pandas_udf.html). 

```
from awsglue import DynamicFrame
import pandas as pd
from pyspark.sql.functions import pandas_udf
 
# The number and name of arguments must match the definition in the json config file
# (except self, which is the current DynamicFrame to transform).
# If an argument is optional, you need to define a default value here
#  (resultCol in this example is an optional argument)
def custom_add_columns(self, col1, col2, resultCol="result"):
    @pandas_udf("integer")  # We need to declare the type of the result column
    def add_columns(value1: pd.Series, value2: pd.Series) -> pd.Series:
        return value1 + value2
 
    df = self.toDF()
    return DynamicFrame.fromDF(
        df.withColumn(resultCol, add_columns(col1, col2)) # This is the conversion logic
        , self.glue_ctx, self.name) 
 
# The name we assign on DynamicFrame must match the configured "functionName"
DynamicFrame.custom_add_columns = custom_add_columns
```

## Video


The following video provides an introduction to visual custom transforms and demonstrates how to use them.

[![AWS Videos](http://img.youtube.com/vi/xFpAhANcVcg/0.jpg)](http://www.youtube.com/watch?v=xFpAhANcVcg)


# Using Data Lake frameworks with Amazon Glue Studio

## Overview


 Open source data lake frameworks simplify incremental data processing for files stored in data lakes built on Amazon S3. Amazon Glue 3.0 and later supports the following open-source data lake storage frameworks: 
+  Apache Hudi 
+  Linux Foundation Delta Lake 
+  Apache Iceberg 

 As of Amazon Glue 4.0, Amazon Glue provides native support for these frameworks so that you can read and write data that you store in Amazon S3 in a transactionally consistent manner. There's no need to install a separate connector or complete extra configuration steps in order to use these frameworks in Amazon Glue jobs. 

 Data Lake frameworks can be used as a source or a target within Amazon Glue Studio through Spark Script Editor jobs. For more information on using Apache Hudi, Apache Iceberg and Delta Lake see: [Using data lake frameworks with Amazon Glue ETL jobs](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-datalake-native-frameworks.html). 

## Creating open table formats from an Amazon Glue Streaming source


Amazon Glue streaming ETL jobs continuously consume data from streaming sources, clean and transform the data in-flight, and make it available for analysis in seconds.

Amazon offers a broad selection of services to support your needs. A database replication service such as Amazon Database Migration Service can replicate the data from your source systems to Amazon S3, which commonly hosts the storage layer of the data lake. Although it's straightforward to apply updates on a relational database management system (RDBMS) that backs an online source application, it's difficult to apply this change data capture (CDC) process on your data lakes. Open-source data management frameworks simplify incremental data processing and data pipeline development, and are a good option to solve this problem.

For more information, see:
+ [Create an Apache Hudi-based near-real-time transactional data lake using Amazon Glue Streaming](https://aws.amazon.com/blogs/big-data/create-an-apache-hudi-based-near-real-time-transactional-data-lake-using-aws-dms-amazon-kinesis-aws-glue-streaming-etl-and-data-visualization-using-amazon-quicksight/)
+ [Build a real-time GDPR-aligned Apache Iceberg data lake](https://aws.amazon.com/blogs/big-data/build-a-real-time-gdpr-aligned-apache-iceberg-data-lake/)

# Using Hudi framework in Amazon Glue Studio


 When creating or editing a job, Amazon Glue Studio automatically adds the corresponding Hudi libraries for you depending on the version of Amazon Glue you are using. For more information, see [Using the Hudi framework in Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-format-hudi.html). 

## Using Apache Hudi framework in Data Catalog data sources


**To add a Hudi data source format to a job:**

1.  From the Source menu, choose Amazon Glue Studio Data Catalog. 

1.  In the **Data source properties** tab, choose a database and table. 

1.  Amazon Glue Studio displays the format type as Apache Hudi and the Amazon S3 URL.   
![\[The screenshot shows the data source properties tab for the Data Catalog source node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data_lake_formats_data_catalog_hudi.png)

## Using Hudi framework in Amazon S3 data sources


1.  From the Source menu, choose Amazon S3. 

1.  If you choose Data Catalog table as the Amazon S3 source type, choose a database and table. 

1.  Amazon Glue Studio displays the format as Apache Hudi and the Amazon S3 URL. 

1.  If you choose Amazon S3 location as the **Amazon S3 source type**, choose the Amazon S3 URL by clicking **Browse Amazon S3**. 

1.  In **Data format**, select Apache Hudi. 
**Note**  
 If Amazon Glue Studio is unable to infer the schema from the Amazon S3 folder or file you selected, choose **Additional options** to select a new folder or file.   
 In **Additional options**, choose from the following options under **Schema inference**:   
+  **Let Amazon Glue Studio automatically choose a sample file** — Amazon Glue Studio chooses a sample file in the Amazon S3 location so that the schema can be inferred. In the **Auto-sampled file** field, you can view the file that was automatically selected. 
+  **Choose a sample file from Amazon S3** — choose the Amazon S3 file to use by clicking **Browse Amazon S3**. 

1.  Click **Infer schema**. You can then view the output schema by clicking on the **Output schema** tab. 

1.  Choose **Additional options** to enter a key-value pair.   
![\[The screenshot shows the Additional options section in the Data source properties tab for an Amazon S3 data source node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data_lake_formats_additional_options.png)

## Using Apache Hudi framework in data targets


### Using Apache Hudi framework in Data Catalog data targets


1.  From the **Target** menu, choose Amazon Glue Studio Data Catalog. 

1.  In the **Data target properties** tab, choose a database and table. 

1.  Amazon Glue Studio displays the format type as Apache Hudi and the Amazon S3 URL. 

#### Using Apache Hudi framework in Amazon S3 data targets


 Enter values or select from the available options to configure Apache Hudi format. For more information on Apache Hudi, see [Apache Hudi documentation](https://hudi.apache.org/docs/overview). 

![\[The screenshot shows the Additional options section in the Data source properties tab for an Amazon S3 data source node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/hudi_s3_target_properties.png)

+  **Hudi Table Name** — this is the name of your Hudi table. 
+  **Hudi Storage Type** — choose from two options: 
  +  **Copy on write** — recommended for optimizing read performance. This is the default Hudi storage type. Each update creates a new version of files during a write. 
  +  **Merge on read** — recommended for minimizing write latency. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files. 
+  **Hudi Write Operation** — choose from the following options: 
  +  **Upsert** — this is the default operation where the input records are first tagged as inserts or updates by looking up the index. Recommended where you are updating existing data. 
  +  **Insert** — this inserts records but doesn't check for existing records and may result in duplicates. 
  +  **Bulk Insert** — this inserts records and is recommended for large amounts of data. 
+  **Hudi Record Key Fields** — use the search bar to search for and choose primary record keys. Records in Hudi are identified by a primary key, which is the pair of a record key and the partition path the record belongs to. 
+  **Hudi Precombine Field** — this is the field used in pre-combining before the actual write. When two records have the same key value, Amazon Glue Studio picks the one with the largest value for the precombine field. Set this to a field with an incrementing value, such as `updated_at`. 
+  **Compression Type** — choose from one of the compression type options: Uncompressed, GZIP, LZO, or Snappy. 
+  **Amazon S3 Target Location** — choose the Amazon S3 target location by clicking **Browse S3**. 
+  **Data Catalog update options** — choose from the following options: 
  +  Do not update the Data Catalog: (Default) Choose this option if you don't want the job to update the Data Catalog, even if the schema changes or new partitions are added. 
  +  Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions: If you choose this option, the job creates the table in the Data Catalog on the first run of the job. On subsequent job runs, the job updates the Data Catalog table if the schema changes or new partitions are added. 

     You must also select a database from the Data Catalog and enter a table name. 
  +  Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions: If you choose this option, the job creates the table in the Data Catalog on the first run of the job. On subsequent job runs, the job updates the Data Catalog table only to add new partitions. 

     You must also select a database from the Data Catalog and enter a table name. 
+  **Partition keys** — choose which columns to use as partitioning keys in the output. To add more partition keys, choose **Add a partition key**. 
+  **Additional options** — enter a key-value pair as needed. 
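The settings above map onto standard Hudi writer configuration keys in the script that Amazon Glue Studio generates. The following is a minimal sketch with illustrative table, key, and field names; the option keys are standard Apache Hudi configuration, and the write call is shown only as a comment because it requires a Spark session with the Hudi libraries loaded:

```python
# Hudi writer options corresponding to the visual settings above.
# The table name, record key, precombine field, and S3 path are placeholders.
hudi_options = {
    "hoodie.table.name": "my_hudi_table",                     # Hudi Table Name
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",    # Hudi Storage Type
    "hoodie.datasource.write.operation": "upsert",            # Hudi Write Operation
    "hoodie.datasource.write.recordkey.field": "record_id",   # Hudi Record Key Fields
    "hoodie.datasource.write.precombine.field": "updated_at", # Hudi Precombine Field
}

# In a generated script, these options feed a Spark write such as:
# df.write.format("hudi").options(**hudi_options).mode("append").save("s3://bucket/prefix/")
```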

## Generating code through Amazon Glue Studio


 When the job is saved, the following job parameters are added to the job if a Hudi source or target are detected: 
+  `--datalake-formats` – a distinct list of data lake formats detected in the visual job (either directly by choosing a “Format” or indirectly by selecting a catalog table that is backed by a data lake). 
+  `--conf` – generated based on the value of `--datalake-formats`. For example, if the value for `--datalake-formats` is 'hudi', Amazon Glue generates a value of `spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.sql.hive.convertMetastoreParquet=false` for this parameter. 
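For example, if you inspect a saved job's parameters (in the console or through the `get_job` API), you would expect arguments along these lines; the exact generated string can vary by Amazon Glue version:

```python
# Job parameters Amazon Glue Studio adds when a Hudi source or target is detected,
# shown as the DefaultArguments dictionary a job definition would carry.
default_arguments = {
    "--datalake-formats": "hudi",
    "--conf": (
        "spark.serializer=org.apache.spark.serializer.KryoSerializer "
        "--conf spark.sql.hive.convertMetastoreParquet=false"
    ),
}
```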

## Overriding Amazon Glue-provided libraries


 To use a version of Hudi that Amazon Glue doesn't support, you can specify your own Hudi library JAR files. To use your own JAR file: 
+  use the `--extra-jars` job parameter. For example, `'--extra-jars': 's3://path/to/jarfile.jar'`. For more information, see [Amazon Glue job parameters](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html). 
+  do not include `hudi` as a value for the `--datalake-formats` job parameter. Entering a blank string as a value ensures that no data lake libraries are provided for you by Amazon Glue automatically. For more information, see [Using the Hudi framework in Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-format-hudi.html). 
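Putting both points together, a job that brings its own Hudi JAR might carry arguments like the following sketch (the S3 path is a placeholder):

```python
# Arguments for a job that supplies its own Hudi library JAR.
# The JAR path is illustrative; --datalake-formats is left blank so that
# Amazon Glue does not inject its bundled data lake libraries.
default_arguments = {
    "--extra-jars": "s3://my-bucket/jars/hudi-spark-bundle.jar",  # placeholder path
    "--datalake-formats": "",
}
```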

# Using Delta Lake framework in Amazon Glue Studio


## Using Delta Lake framework in data sources


### Using Delta Lake framework in Amazon S3 data sources


1.  From the Source menu, choose Amazon S3. 

1.  If you choose Data Catalog table as the Amazon S3 source type, choose a database and table. 

1.  Amazon Glue Studio displays the format as Delta Lake and the Amazon S3 URL. 

1.  Choose **Additional options** to enter a key-value pair. For example, a key-value pair could be: **key**: timestampAsOf and **value**: 2023-02-24 14:16:18.   
![\[The screenshot shows the Additional options section in the Data source properties tab for an Amazon S3 data source node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/data_lake_formats_additional_options.png)

1.  If you choose Amazon S3 location as the **Amazon S3 source type**, choose the Amazon S3 URL by clicking **Browse Amazon S3**. 

1.  In **Data format**, choose Delta Lake. 
**Note**  
 If Amazon Glue Studio is unable to infer the schema from the Amazon S3 folder or file you selected, choose **Additional options** to select a new folder or file.   
 In **Additional options**, choose from the following options under **Schema inference**:   
+  **Let Amazon Glue Studio automatically choose a sample file** — Amazon Glue Studio chooses a sample file in the Amazon S3 location so that the schema can be inferred. In the **Auto-sampled file** field, you can view the file that was automatically selected. 
+  **Choose a sample file from Amazon S3** — choose the Amazon S3 file to use by clicking **Browse Amazon S3**. 

1.  Click **Infer schema**. You can then view the output schema by clicking on the **Output schema** tab. 
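The `timestampAsOf` key-value pair shown above corresponds to Delta Lake's time-travel read option. As a sketch, in a Spark-based script it would surface like this (the path and timestamp are placeholders, and the read call is shown only as a comment because it needs a Spark session with the Delta Lake libraries):

```python
# Delta Lake time-travel read option matching the key-value pair above.
read_options = {"timestampAsOf": "2023-02-24 14:16:18"}

# In a Spark script this would be applied as, for example:
# df = spark.read.format("delta").options(**read_options).load("s3://bucket/delta-table/")
```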

### Using Delta Lake framework in Data Catalog data sources


1.  From the **Source** menu, choose Amazon Glue Studio Data Catalog. 

1.  In the **Data source properties** tab, choose a database and table. 

1.  Amazon Glue Studio displays the format type as Delta Lake and the Amazon S3 URL. 
**Note**  
 If your Delta Lake source is not yet registered as an Amazon Glue Data Catalog table, you have two options:   
+  Create an Amazon Glue crawler for the Delta Lake data store. For more information, see [How to specify configuration options for a Delta Lake data store](https://docs.amazonaws.cn/glue/latest/dg/crawler-configuration.html#crawler-delta-lake). 
+  Use an Amazon S3 data source to select your Delta Lake data source. See [Using Delta Lake framework in Amazon S3 data sources](#gs-data-lake-formats-delta-lake-s3-data-source). 

## Using Delta Lake formats in data targets


### Using Delta Lake formats in Data Catalog data targets


1.  From the **Target** menu, choose Amazon Glue Studio Data Catalog. 

1.  In the **Data target properties** tab, choose a database and table. 

1.  Amazon Glue Studio displays the format type as Delta Lake and the Amazon S3 URL. 

### Using Delta Lake formats in Amazon S3 data targets


 Enter values or select from the available options to configure Delta Lake format. 
+  **Compression Type** — choose from one of the compression type options: Uncompressed or Snappy. 
+  **Amazon S3 Target Location** — choose the Amazon S3 target location by clicking **Browse S3**. 
+  **Data Catalog update options** — updating the Data Catalog is not supported for this format in the Glue Studio visual editor. 
  +  Do not update the Data Catalog: (Default) Choose this option if you don't want the job to update the Data Catalog, even if the schema changes or new partitions are added. 
  +  To update the Data Catalog after the Amazon Glue job runs, run or schedule an Amazon Glue crawler. For more information, see [How to specify configuration options for a Delta Lake data store](https://docs.amazonaws.cn/glue/latest/dg/crawler-configuration.html#crawler-delta-lake). 
+  **Partition keys** — Choose which columns to use as partitioning keys in the output. To add more partition keys, choose **Add a partition key**. 
+  Optionally, choose **Additional options** to enter a key-value pair. For example, a key-value pair could be: **key**: timestampAsOf and **value**: 2023-02-24 14:16:18. 

# Using Apache Iceberg framework in Amazon Glue Studio


## Using Apache Iceberg framework in data targets


### Using Apache Iceberg framework in Data Catalog data targets


1.  From the **Target** menu, choose Amazon Glue Studio Data Catalog. 

1.  In the **Data target properties** tab, choose a database and table. 

1.  Amazon Glue Studio displays the format type as Apache Iceberg and the Amazon S3 URL. 

### Using Apache Iceberg framework in Amazon S3 data targets


 Enter values or select from the available options to configure Apache Iceberg format. 
+  **Format** – choose **Apache Iceberg** from the drop-down menu. 
+  **Amazon S3 Target Location** – choose the Amazon S3 target location by clicking **Browse S3**. 
+  **Data Catalog update options** – **Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions** must be selected to proceed. Writing a new Iceberg table using Amazon Glue requires the Data Catalog to be configured as the catalog for the Iceberg table. To update an existing Iceberg table that has been registered in the Data Catalog, choose Data Catalog as the target. 
  +  **Database** – Choose the database from the Data Catalog. 
  +  **Table Name** – Enter a name for your table. Apache Iceberg table names must be all lower case. Use underscores if needed, since spaces are not allowed. For example, `data_lake_format_tables`. 

![\[The screenshot shows the Data target properties when using Apache Iceberg framework in Amazon S3 data targets.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/apache-iceberg-data-target-properties.png)
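The table-name rule above (all lower case, underscores instead of spaces) can be enforced before you submit a name. A minimal helper, assuming simple ASCII names; the function name is illustrative:

```python
import re

def to_iceberg_table_name(name: str) -> str:
    """Lower-case a proposed table name and replace runs of whitespace
    with underscores, per the Apache Iceberg naming rule described above."""
    return re.sub(r"\s+", "_", name.strip().lower())

print(to_iceberg_table_name("Data Lake Format Tables"))  # data_lake_format_tables
```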


## Using Apache Iceberg framework in data sources


### Using Apache Iceberg framework in Data Catalog data sources


1.  From the **Source** menu, choose Amazon Glue Studio Data Catalog. 

1.  In the **Data source properties** tab, choose a database and table. 

1.  Amazon Glue Studio displays the format type as Apache Iceberg and the Amazon S3 URL. 

![\[The screenshot shows the Data source properties when using Apache Iceberg framework in Data Catalog data sources.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/apache-iceberg-data-source-properties.png)


### Using Apache Iceberg framework in Amazon S3 data sources


 Apache Iceberg is not available as a data option for Amazon S3 source nodes in Amazon Glue Studio. 

# Connecting to data sources using Visual ETL jobs
Connecting to data sources

 When you create or edit a visual ETL job in Amazon Glue, you can use connections to connect to your data. You do this by adding source nodes that use connectors to read in data, and target nodes to specify the location for writing out data. 

**Topics**
+ [Modifying properties of a data source node](edit-jobs-source.md)
+ [Using Data Catalog tables for the data source](edit-jobs-source-catalog-tables.md)
+ [Using a connector for the data source](edit-jobs-source-connectors.md)
+ [Using files in Amazon S3 for the data source](edit-jobs-source-s3-files.md)
+ [Using a streaming data source](edit-jobs-source-streaming.md)
+ [References](edit-jobs-source-references.md)

# Modifying properties of a data source node


To specify the data source properties, you first choose a data source node in the job diagram. Then, on the right side in the node details panel, you configure the node properties.

**To modify the properties of a data source node**

1. Go to the visual editor for a new or saved job.

1. Choose a data source node in the job diagram.

1. Choose the **Node properties** tab in the node details panel, and then enter the following information:
   + **Name**: (Optional) Enter a name to associate with the node in the job diagram. This name should be unique among all the nodes for this job.
   + **Node type**: The node type determines the action that is performed by the node. In the list of options for **Node type**, choose one of the values listed under the heading **Data source**.

1. Configure the **Data source properties** information. For more information, see the following sections:
   + [Using Data Catalog tables for the data source](edit-jobs-source-catalog-tables.md)
   + [Using a connector for the data source](edit-jobs-source-connectors.md)
   + [Using files in Amazon S3 for the data source](edit-jobs-source-s3-files.md)
   + [Using a streaming data source](edit-jobs-source-streaming.md)

1. (Optional) After configuring the node properties and data source properties, you can view the schema for your data source by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and data source properties, you can preview the dataset from your data source by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Using Data Catalog tables for the data source


For all data sources except Amazon S3 and connectors, a table must exist in the Amazon Glue Data Catalog for the source type that you choose. Amazon Glue does not create the Data Catalog table.

**To configure a data source node based on a Data Catalog table**

1. Go to the visual editor for a new or saved job.

1. Choose a data source node in the job diagram.

1. Choose the **Data source properties** tab, and then enter the following information:
   + **S3 source type**: (For Amazon S3 data sources only) Choose the option **Select a Catalog table** to use an existing Amazon Glue Data Catalog table.
   + **Database**: Choose the database in the Data Catalog that contains the source table you want to use for this job. You can use the search field to search for a database by its name.
   + **Table**: Choose the table associated with the source data from the list. This table must already exist in the Amazon Glue Data Catalog. You can use the search field to search for a table by its name.
   + **Partition predicate**: (For Amazon S3 data sources only) Enter a Boolean expression based on Spark SQL that includes only the partitioning columns. For example: `"(year=='2020' and month=='04')"`
   + **Temporary directory**: (For Amazon Redshift data sources only) Enter a path for the location of a working directory in Amazon S3 where your ETL job can write temporary intermediate results.
   + **Role associated with the cluster**: (For Amazon Redshift data sources only) Enter a role for your ETL job to use that contains permissions for Amazon Redshift clusters. For more information, see [Data source and data target permissions](getting-started-min-privs-job.md#getting-started-min-privs-data).
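The partition predicate above is a plain Spark SQL Boolean string over the partitioning columns, so it can be built programmatically. A small sketch (the column names are illustrative):

```python
def partition_predicate(year: str, month: str) -> str:
    """Build a Spark SQL push-down predicate over partition columns,
    matching the example format shown above."""
    return f"(year=='{year}' and month=='{month}')"

print(partition_predicate("2020", "04"))  # (year=='2020' and month=='04')
```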

# Using a connector for the data source


If you select a connector for the **Node type**, follow the instructions at [Authoring jobs with custom connectors](job-authoring-custom-connectors.md) to finish configuring the data source properties.

# Using files in Amazon S3 for the data source


If you choose Amazon S3 as your data source, then you can choose either:
+ A Data Catalog database and table.
+ A bucket, folder, or file in Amazon S3.

If you use an Amazon S3 bucket as your data source, Amazon Glue detects the schema of the data at the specified location from one of the files, or by using the file you specify as a sample file. Schema detection occurs when you use the **Infer schema** button. If you change the Amazon S3 location or the sample file, then you must choose **Infer schema** again to perform the schema detection using the new information.

**To configure a data source node that reads directly from files in Amazon S3**

1. Go to the visual editor for a new or saved job.

1. Choose a data source node in the job diagram for an Amazon S3 source.

1. Choose the **Data source properties** tab, and then enter the following information:
   + **S3 source type**: (For Amazon S3 data sources only) Choose the option **S3 location**.
   + **S3 URL**: Enter the path to the Amazon S3 bucket, folder, or file that contains the data for your job. You can choose **Browse S3** to select the path from the locations available to your account. 
   + **Recursive**: Choose this option if you want Amazon Glue to read data from files in child folders at the S3 location. 

     If the child folders contain partitioned data, Amazon Glue doesn't add any partition information that's specified in the folder names to the Data Catalog. For example, consider the following folders in Amazon S3:

     ```
     S3://sales/year=2019/month=Jan/day=1
     S3://sales/year=2019/month=Jan/day=2
     ```

     If you choose **Recursive** and select the `sales` folder as your S3 location, then Amazon Glue reads the data in all the child folders, but doesn't create partitions for year, month or day.
   + **Data format**: Choose the format that the data is stored in. You can choose JSON, CSV, or Parquet. The value you select tells the Amazon Glue job how to read the data from the source file.
**Note**  
If you don't select the correct format for your data, Amazon Glue might infer the schema correctly, but the job won't be able to correctly parse the data from the source file.

     You can enter additional configuration options, depending on the format you choose. 
     + **JSON** (JavaScript Object Notation)
       + **JsonPath**: Enter a JSON path that points to an object that is used to define a table schema. JSON path expressions always refer to a JSON structure in the same way that XPath expressions are used with an XML document. The "root member object" in the JSON path is always referred to as `$`, even if it's an object or array. The JSON path can be written in dot notation or bracket notation.

         For more information about the JSON path, see [JsonPath](https://github.com/json-path/JsonPath) on the GitHub website.
       + **Records in source files can span multiple lines**: Choose this option if a single record can span multiple lines in the JSON file.
     + **CSV** (comma-separated values)
       + **Delimiter**: Enter a character to denote what separates each column entry in the row, for example, `;` or `,`.
       + **Escape character**: Enter a character that is used as an escape character. This character indicates that the character that immediately follows the escape character should be taken literally, and should not be interpreted as a delimiter.
       + **Quote character**: Enter the character that is used to group separate strings into a single value. For example, you would choose **Double quote (")** if you have values such as `"This is a single value"` in your CSV file.
       + **Records in source files can span multiple lines**: Choose this option if a single record can span multiple lines in the CSV file.
       + **First line of source file contains column headers**: Choose this option if the first row in the CSV file contains column headers instead of data.
     + **Parquet** (Apache Parquet columnar storage)

       There are no additional settings to configure for data stored in Parquet format.
     + **Apache Hudi**

       There are no additional settings to configure for data stored in Apache Hudi format.
     + **Delta Lake**

       There are no additional settings to configure for data stored in Delta Lake format.
     + **Excel**

       There are no additional settings to configure for data stored in Excel format.
   + **Partition predicate**: To partition the data that is read from the data source, enter a Boolean expression based on Spark SQL that includes only the partitioning columns. For example: `"(year=='2020' and month=='04')"`
   + **Advanced options**: Expand this section if you want Amazon Glue to detect the schema of your data based on a specific file. 
     + **Schema inference**: Choose the option **Choose a sample file from S3** if you want to use a specific file instead of letting Amazon Glue choose a file. Schema inference is not available for the Excel source.
     + **Auto-sampled file**: Enter the path to the file in Amazon S3 to use for inferring the schema.

     If you're editing a data source node and change the selected sample file, choose **Reload schema** to detect the schema by using the new sample file.

1. Choose the **Infer schema** button to detect the schema from the sources files in Amazon S3. If you change the Amazon S3 location or the sample file, you must choose **Infer schema** again to infer the schema using the new information.
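When **Recursive** is chosen, as in the `S3://sales/year=2019/month=Jan/day=1` example above, the data in the child folders is read but the Hive-style `key=value` folder names are not registered as partitions. For reference, parsing such a path into its partition key-value pairs looks like this minimal sketch:

```python
def parse_partition_path(s3_path: str) -> dict:
    """Extract Hive-style key=value partition segments from an S3 path."""
    segments = s3_path.split("/")
    return dict(seg.split("=", 1) for seg in segments if "=" in seg)

print(parse_partition_path("s3://sales/year=2019/month=Jan/day=1"))
# {'year': '2019', 'month': 'Jan', 'day': '1'}
```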

# Using a streaming data source


You can create streaming extract, transform, and load (ETL) jobs that run continuously and consume data from streaming sources in Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

**To configure properties for a streaming data source**

1. Go to the visual graph editor for a new or saved job.

1. Choose a data source node in the graph for Kafka or Kinesis Data Streams.

1. Choose the **Data source properties** tab, and then enter the following information:

------
#### [ Kinesis ]
   + **Kinesis source type**: Choose the option **Stream details** to use direct access to the streaming source or choose **Data Catalog table** to use the information stored there instead.

     If you choose **Stream details**, specify the following additional information.
     + **Location of data stream**: Choose whether the stream is associated with the current user, or if it is associated with a different user.
     + **Region**: Choose the Amazon Web Services Region where the stream exists. This information is used to construct the ARN for accessing the data stream.
     + **Stream ARN**: Enter the Amazon Resource Name (ARN) for the Kinesis data stream. If the stream is located within the current account, you can choose the stream name from the drop-down list. You can use the search field to search for a data stream by its name or ARN.
     + **Data format**: Choose the format used by the data stream from the list. 

       Amazon Glue automatically detects the schema from the streaming data.

     If you choose **Data Catalog table**, specify the following additional information.
     + **Database**: (Optional) Choose the database in the Amazon Glue Data Catalog that contains the table associated with your streaming data source. You can use the search field to search for a database by its name. 
     + **Table**: (Optional) Choose the table associated with the source data from the list. This table must already exist in the Amazon Glue Data Catalog. You can use the search field to search for a table by its name. 
     + **Detect schema**: Choose this option to have Amazon Glue detect the schema from the streaming data, rather than using the schema information in a Data Catalog table. This option is enabled automatically if you choose the **Stream details** option.
   + **Starting position**: By default, the ETL job uses the **Earliest** option, which means it reads data starting with the oldest available record in the stream. You can instead choose **Latest**, which indicates the ETL job should start reading from just after the most recent record in the stream.
   + **Window size**: By default, your ETL job processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data that arrives later than expected. You can modify this window size to increase timeliness or aggregation accuracy. 

     Amazon Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read. 
   + **Connection options**: Expand this section to add key-value pairs to specify additional connection options. For information about what options you can specify here, see ["connectionType": "kinesis"](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-kinesis) in the *Amazon Glue Developer Guide*.

------
#### [ Kafka ]
   + **Apache Kafka source**: Choose the option **Stream details** to use direct access to the streaming source or choose **Data Catalog table** to use the information stored there instead.

     If you choose **Data Catalog table**, specify the following additional information.
     + **Database**: (Optional) Choose the database in the Amazon Glue Data Catalog that contains the table associated with your streaming data source. You can use the search field to search for a database by its name. 
     + **Table**: (Optional) Choose the table associated with the source data from the list. This table must already exist in the Amazon Glue Data Catalog. You can use the search field to search for a table by its name. 
      + **Detect schema**: Choose this option to have Amazon Glue detect the schema from the streaming data, rather than using the schema information in a Data Catalog table. This option is enabled automatically if you choose the **Stream details** option.

     If you choose **Stream details**, specify the following additional information.
     + **Connection name**: Choose the Amazon Glue connection that contains the access and authentication information for the Kafka data stream. You must use a connection with Kafka streaming data sources. If a connection doesn't exist, you can use the Amazon Glue console to create a connection for your Kafka data stream.
     + **Topic name**: Enter the name of the topic to read from.
     + **Data format**: Choose the format to use when reading data from the Kafka event stream. 
   + **Starting position**: By default, the ETL job uses the **Earliest** option, which means it reads data starting with the oldest available record in the stream. You can instead choose **Latest**, which indicates the ETL job should start reading from just after the most recent record in the stream.
   + **Window size**: By default, your ETL job processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data that arrives later than expected. You can modify this window size to increase timeliness or aggregation accuracy. 

     Amazon Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read. 
   + **Connection options**: Expand this section to add key-value pairs to specify additional connection options. For information about what options you can specify here, see ["connectionType": "kafka"](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-kafka) in the *Amazon Glue Developer Guide*.

------

**Note**  
Data previews are not currently supported for streaming data sources.
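The console fields above correspond to connection options in the generated script. The following is a hedged sketch of a Kinesis option set; the option names follow the `"connectionType": "kinesis"` reference linked above, and the stream ARN is a placeholder:

```python
# Connection options for a Kinesis streaming source, mirroring the console fields.
# The ARN is illustrative; option names follow the "connectionType": "kinesis" reference.
kinesis_options = {
    "streamARN": "arn:aws-cn:kinesis:cn-north-1:111122223333:stream/example-stream",
    "startingPosition": "earliest",  # console "Starting position"
    "classification": "json",        # console "Data format"
}

# In a generated script, these options are passed when creating the streaming
# data frame, and the default 100-second window surfaces as a windowSize such
# as "100 seconds" when batches are processed.
```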

# References


 **Best Practices** 
+  [ Build an ETL service pipeline to load data incrementally from Amazon S3 to Amazon Redshift using Amazon Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html) 

 **ETL programming** 
+  [Connection types and options for ETL in Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-connections.html) 
+  [ JDBC connectionType values ](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) 
+  [ Advanced options for moving data to and from Amazon Redshift](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-redshift.html) 

# Configuring data target nodes


The data target is where the job writes the transformed data. 

## Overview of data target options


Your data target (also called a *data sink*) can be:
+ **S3** – The job writes the data in a file in the Amazon S3 location you choose and in the format you specify.

  If you configure partition columns for the data target, then the job writes the dataset to Amazon S3 into directories based on the partition key.
+ **Amazon Glue Data Catalog** – The job uses the information associated with the table in the Data Catalog to write the output data to a target location. 

  You can create the table manually or with the crawler. You can also use Amazon CloudFormation templates to create tables in the Data Catalog. 
+ **A connector** – A connector is a piece of code that facilitates communication between your data store and Amazon Glue. The job uses the connector and associated connection to write the output data to a target location. You can either subscribe to a connector offered in Amazon Web Services Marketplace, or you can create your own custom connector. For more information, see [Adding connectors to Amazon Glue Studio](creating-custom-connectors.md#creating-connectors).

You can choose to update the Data Catalog when your job writes to an Amazon S3 data target. Instead of requiring a crawler to update the Data Catalog when the schema or partitions change, this option makes it easy to keep your tables up to date. This option simplifies the process of making your data available for analytics by optionally adding new tables to the Data Catalog, updating table partitions, and updating the schema of your tables directly from the job.
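In generated scripts, the Data Catalog update behavior corresponds to sink settings such as `enableUpdateCatalog` and `updateBehavior`. A minimal sketch of the values involved; the database, table, path, and partition keys are placeholders, and the sink calls are shown as comments because they require a GlueContext:

```python
# Sink settings corresponding to "Create a table in the Data Catalog and on
# subsequent runs, update the schema and add new partitions". Names are illustrative.
sink_settings = {
    "enableUpdateCatalog": True,
    "updateBehavior": "UPDATE_IN_DATABASE",  # or "LOG" to keep the existing schema
    "partitionKeys": ["year", "month"],
}

# In a script this is applied roughly as:
# sink = glueContext.getSink(connection_type="s3", path="s3://bucket/prefix/",
#                            enableUpdateCatalog=True,
#                            updateBehavior="UPDATE_IN_DATABASE",
#                            partitionKeys=["year", "month"])
# sink.setCatalogInfo(catalogDatabase="my_db", catalogTableName="my_table")
```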

## Editing the data target node


The data target is where the job writes the transformed data. 

**To add or configure a data target node in your job diagram**

1. (Optional) If you need to add a target node, choose **Target** in the toolbar at the top of the visual editor, and then choose either **S3** or **Glue Data Catalog**. 
   + If you choose **S3** for the target, then the job writes the dataset to one or more files in the Amazon S3 location you specify.
   + If you choose **Amazon Glue Data Catalog** for the target, then the job writes to a location described by the table selected from the Data Catalog.

1. Choose a data target node in the job diagram. When you choose a node, the node details panel appears on the right-side of the page.

1. Choose the **Node properties** tab, and then enter the following information:
   + **Name**: Enter a name to associate with the node in the job diagram.
   + **Node type**: A value should already be selected, but you can change it as needed.
   + **Node parents**: The parent node is the node in the job diagram that provides the output data you want to write to the target location. For a pre-populated job diagram, the target node should already have the parent node selected. If there is no parent node displayed, then choose a parent node from the list. 

     A target node has a single parent node.

1. Configure the **Data target properties** information. For more information, see the following sections:
   + [Using Amazon S3 for the data target](#edit-job-target-S3)
   + [Using Data Catalog tables for the data target](#edit-job-target-catalog)
   + [Using a connector for the data target](#edit-job-target-connector)

1. (Optional) After configuring the data target node properties, you can view the output schema for your data by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

### Using Amazon S3 for the data target


For all data sources except Amazon S3 and connectors, a table must exist in the Amazon Glue Data Catalog for the target type that you choose. Amazon Glue Studio does not create the Data Catalog table.

**To configure a data target node that writes to Amazon S3**

1. Go to the visual editor for a new or saved job.

1. Choose a data target node in the job diagram.

1. Choose the **Data target properties** tab, and then enter the following information:
   + **Format**: Choose a format from the list. The available format types for the data results are:
     + **JSON**: JavaScript Object Notation. 
     + **CSV**: Comma-separated values. 
     + **Avro**: Apache Avro JSON binary. 
     + **Parquet**: A custom Parquet writer type that is optimized for `DynamicFrames` as the data format. Instead of requiring a precomputed schema for the data, it computes and modifies the schema dynamically.
     + **ORC**: Apache Optimized Row Columnar (ORC) format. 
     + **Apache Hudi**: An open-source data lake storage framework that simplifies incremental data processing and data pipeline development. 
     + **Apache Iceberg**: A high-performance table format that works just like an SQL table. 
     + **Delta Lake**: An open-source data lake storage framework that helps you perform ACID transactions, scale metadata handling, and unify streaming and batch data processing. 
     + **XML**: Extensible Markup Language (XML). 
     + **Tableau Hyper**: Tableau’s in-memory data engine technology.

     To learn more about these format options, see [Format Options for ETL Inputs and Outputs in Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-etl-format.html) in the *Amazon Glue Developer Guide*.
   + **Compression Type**: You can optionally compress the data. Compression is available for the `CSV`, `JSON`, and `Parquet` formats. The default is no compression, or **None**.
   + **S3 Target Location**: The Amazon S3 bucket and location for the data output. You can choose the **Browse S3** button to see the Amazon S3 buckets that you have access to and choose one as the target destination. 
   + **Data catalog update options**
     + **Do not update the Data Catalog**: (Default) Choose this option if you don't want the job to update the Data Catalog, even if the schema changes or new partitions are added.
     + **Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions**: If you choose this option, the job creates the table in the Data Catalog on the first run of the job. On subsequent job runs, the job updates the Data Catalog table if the schema changes or new partitions are added.

       You must also select a database from the Data Catalog and enter a table name.
     + **Create a table in the Data Catalog and on subsequent runs, keep existing schema and add new partitions**: If you choose this option, the job creates the table in the Data Catalog on the first run of the job. On subsequent job runs, the job updates the Data Catalog table only to add new partitions.

       You must also select a database from the Data Catalog and enter a table name.
   + **File Partitioning**: Choose which type of partitioning you want to save the output in.
     + **Autogenerate files (Recommended)**: (Default) Amazon Glue determines the number of output files automatically.
     + **Multiple file output**: Specify the number of output files you want. For optimal performance, use the default autogenerated number of files.
   + **Partition keys**: Choose which columns to use as partitioning keys in the output. To add more partition keys, choose **Add a partition key**.

   File partitioning is not supported for Tableau Hyper as a target format.
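When you choose partition keys, the output is written under Hive-style `key=value` prefixes in the S3 target location. Amazon Glue performs this layout itself at run time; the following sketch (illustrative only, with a hypothetical bucket name) shows how those prefixes are composed:

```python
# Illustrative sketch only: how partition keys map to Hive-style key=value
# prefixes in the S3 output path. Amazon Glue does this for you when the
# job runs; this function just demonstrates the resulting layout.
def partitioned_path(base_path, partition_values):
    """Build an S3-style output prefix from ordered partition key/value pairs."""
    parts = [f"{key}={value}" for key, value in partition_values]
    return "/".join([base_path.rstrip("/")] + parts)

# A job writing with partition keys "year" and "month" produces prefixes like:
prefix = partitioned_path("s3://my-bucket/output/", [("year", "2023"), ("month", "01")])
# → "s3://my-bucket/output/year=2023/month=01"
```

Each distinct combination of partition key values gets its own prefix, which later lets queries prune partitions instead of scanning the whole dataset.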

### Using Data Catalog tables for the data target


For all data sources except Amazon S3 and connectors, a table must exist in the Amazon Glue Data Catalog for the target type that you choose. Amazon Glue Studio does not create the Data Catalog table.

**To configure the data properties for a target that uses a Data Catalog table**

1. Go to the visual editor for a new or saved job.

1. Choose a data target node in the job diagram.

1. Choose the **Data target properties** tab, and then enter the following information:
   + **Database**: Choose the database that contains the table you want to use as the target from the list. This database must already exist in the Data Catalog.
   + **Table**: Choose the table that defines the schema of your output data from the list. This table must already exist in the Data Catalog.

     A table in the Data Catalog consists of the names of columns, data type definitions, partition information, and other metadata about the target dataset. Your job writes to a location described by this table in the Data Catalog.

     For more information about creating tables in the Data Catalog, see [Defining Tables in the Data Catalog](https://docs.amazonaws.cn/glue/latest/dg/tables-described.html) in the *Amazon Glue Developer Guide*.
   + **Data catalog update options**
     + **Do not change table definition**: (Default) Choose this option if you don't want the job to update the Data Catalog, even if the schema changes, or new partitions are added.
     + **Update schema and add new partitions**: If you choose this option, the job updates the Data Catalog table if the schema changes or new partitions are added.
     + **Keep existing schema and add new partitions**: If you choose this option, the job updates the Data Catalog table only to add new partitions.
     + **Partition keys**: Choose which columns to use as partitioning keys in the output. To add more partition keys, choose **Add a partition key**.

### Using a connector for the data target


If you select a connector for the **Node type**, follow the instructions at [Authoring jobs with custom connectors](job-authoring-custom-connectors.md) to finish configuring the data target properties.

# Editing or uploading a job script


Use the Amazon Glue Studio visual editor to edit the job script or upload your own script.

You can use the visual editor to edit job nodes only if the jobs were created with Amazon Glue Studio. If the job was created using the Amazon Glue console, through API commands, or with the command line interface (CLI), you can use the script editor in Amazon Glue Studio to edit the job script, parameters, and schedule. You can also edit the script for a job created in Amazon Glue Studio by converting the job to script-only mode.

**To edit the job script or upload your own script**

1. If creating a new job, on the **Jobs** page, choose the **Spark script editor** option to create a Spark job or choose the **Python Shell script editor** to create a Python shell job. You can either write a new script, or upload an existing script. If you choose **Spark script editor**, you can write or upload either a Scala or Python script. If you choose **Python Shell script editor**, you can only write or upload a Python script.

   After choosing the option to create a new job, in the **Options** section that appears, you can choose to either start with a starter script (**Create a new script with boilerplate code**), or you can upload a local file to use as the job script.

   If you chose **Spark script editor**, you can upload either Python or Scala script files. Scala scripts must have the file extension `.scala`. Python scripts must be recognized as files of type Python. If you chose **Python Shell script editor**, you can upload only Python script files.

   When you are finished making your choices, choose **Create** to create the job and open the visual editor.

1. Go to the visual job editor for the new or saved job, and then choose the **Script** tab.

1. If you didn't create a new job using one of the script editor options, and you have never edited the script for an existing job, the **Script** tab displays the heading **Script (Locked)**. This means the script editor is in read-only mode. Choose **Edit script** to unlock the script for editing.

   To make the script editable, Amazon Glue Studio converts your job from a visual job to a script-only job. If you unlock the script for editing, you can't use the visual editor anymore for this job after you save it.

   In the confirmation window, choose **Confirm** to continue or **Cancel** to keep the job available for visual editing.

   If you choose **Confirm**, the **Visual** tab no longer appears in the editor. You can use Amazon Glue Studio to modify the script using the script editor, modify the job details or schedule, or view job runs.
**Note**  
Until you save the job, the conversion to a script-only job is not permanent. If you refresh the console web page, or close the job before saving it and reopen it in the visual editor, you will still be able to edit the individual nodes in the visual editor.

1. Edit the script as needed. 

   When you are done editing the script, choose **Save** to save the job and permanently convert the job from visual to script-only.

1. (Optional) You can download the script from the Amazon Glue Studio console by choosing the **Download** button on the **Script** tab. When you choose this button, a new browser window opens, displaying the script from its location in Amazon S3. The **Script filename** and **Script path** parameters in the **Job details** tab of the job determine the name and location of the script file in Amazon S3.   
![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/job-details-script-location-params-screenshot.png)

   When you save the job, Amazon Glue saves the job script at the location specified by these fields. If you modify the script file at this location in Amazon S3, Amazon Glue Studio loads the modified script the next time you edit the job.

## Creating and editing Scala scripts in Amazon Glue Studio


When you choose the script editor for creating a job, by default, the job programming language is set to `Python 3`. If you choose to write a new script instead of uploading a script, Amazon Glue Studio starts a new script with boilerplate text written in Python. If you want to write a Scala script instead, you must first configure the script editor to use Scala.

**Note**  
If you choose Scala as the programming language for the job and use the visual editor to design your job, the generated job script is written in Scala, and no further actions are needed.

**To write a new Scala script in Amazon Glue Studio**

1. Create a new job by choosing the **Spark script editor** option.

1. Under **Options**, choose **Create a new script with boilerplate code**.

1. Choose the **Job details** tab and set **Language** to `Scala` (instead of `Python 3`).
**Note**  
The **Type** property for the job is automatically set to `Spark` when you choose the **Spark script editor** option to create a job. 

1. Choose the **Script** tab.

1. Remove the Python boilerplate text. You can replace it with the following Scala boilerplate text.

   ```
   import com.amazonaws.services.glue.{DynamicRecord, GlueContext}
   import org.apache.spark.SparkContext
   import com.amazonaws.services.glue.util.JsonOptions
   import com.amazonaws.services.glue.util.GlueArgParser
   import com.amazonaws.services.glue.util.Job
   
   object MyScript {
     def main(args: Array[String]): Unit = {
       val sc: SparkContext = new SparkContext()
       val glueContext: GlueContext = new GlueContext(sc)
   
      }
   }
   ```

1. Write your Scala job script in the editor. Add additional `import` statements as needed.

## Creating and editing Python shell jobs in Amazon Glue Studio


When you choose the Python shell script editor for creating a job, you can upload an existing Python script, or write a new one. If you choose to write a new script, boilerplate code is added to the new Python job script. 

**To create a new Python shell job**  
Refer to the instructions at [Starting jobs in Amazon Glue Studio](edit-nodes-chapter.md#create-jobs-start).

The job properties that are supported for Python shell jobs are not the same as those supported for Spark jobs. The following list describes the changes to the available job parameters for Python shell jobs on the **Job details** tab.
+ The **Type** property for the job is automatically set to `Python Shell` and can't be changed. 
+ Instead of **Language**, there is a **Python version** property for the job. Currently, Python shell jobs created in Amazon Glue Studio use Python 3.6.
+ The **Glue version** property is not available, because it does not apply to Python shell jobs.
+ Instead of **Worker type** and **Number of workers**, a **Data processing units** property is shown. This job property determines how many data processing units (DPUs) are consumed by the Python shell when running the job.
+ The **Job bookmark** property is not available, because it is not supported for Python shell jobs.
+ Under **Advanced properties**, the following properties are not available for Python shell jobs.
  + **Job metrics**
  + **Continuous logging**
  + **Spark UI** and **Spark UI logs path**
  + **Dependent jars path**, under the heading **Libraries**

# Changing the parent nodes for a node in the job diagram


You can change a node's parents to move nodes within the job diagram or to change a data source for a node.

**To change the parent node**

1. Choose the node in the job diagram that you want to modify.

1. In the node details panel, on the **Node properties** tab, under the heading **Node parents** remove the current parent for the node.

1. Choose a new parent node from the list.

1. Modify the other properties of the node as needed to match the newly selected parent node.

If you modified a node by mistake, you can use the **Undo** button on the toolbar to reverse the action.

# Deleting nodes from the job diagram


 When working with Visual ETL jobs, you can remove nodes from the canvas without having to re-add or restructure any nodes that are connected to the removed node. 

 In the example below, you can follow along by choosing **ETL jobs > Visual ETL**, then in **Example jobs**, choosing **Visual ETL job to join multiple sources**. Choose **Create example job** to create the job used in the steps below. 

![\[The screenshot shows the Example jobs panel with the Visual ETL job to join multiple sources example job selected.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/visual-etl-example-jobs-create.png)


**To remove a node from the canvas**

1.  From the Amazon Glue console, choose **Visual ETL** from the navigation menu and choose an existing job. The job canvas displays the example job as depicted below.   
![\[The screenshot shows a job diagram generated from the Example job.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/visual-job-example-job-nodes.png)

1.  Choose the node you want to remove. The canvas will zoom in to the node. In the toolbar on the right side of the canvas, choose the **Trash** icon. This removes the node, and any node connected to it moves up to take its place in the workflow. In this example, the first **Join** node was deleted from the canvas. 

    If you delete a node in the workflow, Amazon Glue will re-arrange the nodes so that they are organized in a way that does not result in an invalid workflow. You may still need to correct a node's configuration. 

    In the example, the **Join** node beneath the **Subscribers** node was removed. As a result, the **Plans** source node has been moved to the top level and is still connected to the child **Join** node. The **Join** node now requires additional configuration, since **Join** requires two parent source nodes with selected tables. The **Transform** tab to the right of the canvas displays the missing requirement under **Join conditions**.   
![\[The screenshot shows a job diagram where parent nodes are two source nodes - Plan assignment and Subscribers. They are connected to a Join node. A Plans source node and Join node are connected to the Change Schema node. The Catalog node is connected to the Change schema node.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/visual-job-delete-node-join-plans.png)

1.  Delete the second **Join** node and **Select Fields** node. When the nodes have been deleted, the workflow will look like the example below.   
![\[The screenshot shows a job diagram where the Join nodes and Select Fields have been removed and the node connected to it, the Change Schema node, has moved up to take its place in the job flow.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/visual-job-three-data-sources-rearranged.png)

1.  To modify the node connections, click on the node's handle and drag the connection to a new node. This will allow you to delete nodes and rearrange the nodes in a logical flow. In the example, a new connection is being made by clicking the handle on the Plans node and dragging the connection to the Join node as depicted by the red arrow.   
![\[The screenshot shows a job diagram where the handle is enclosed in a red circle and a red arrow is joining the Plans node and Join node to demonstrate the action of clicking and dragging to connect the nodes together.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/visual-job-plans-node-handle-selected.png)

1.  If you need to undo any action, choose the **Undo** icon directly beneath the **Trash** icon in the toolbar on the right side of the canvas. 

# Adding source and target parameters to the Amazon Glue Data Catalog node


 Amazon Glue Studio allows you to parameterize visual jobs. Because catalog table names may differ between production and development environments, you can define and select runtime parameters for the databases and tables that your job uses when it runs. 

 Job parameterization allows you to parameterize sources and targets, and save those parameters to the job when using the Amazon Glue Data Catalog node. When you specify sources and targets as parameters, you enable the reusability of jobs, particularly when using the same job in multiple environments. This saves time and effort in managing your sources and targets when promoting code across deployment environments. In addition, the custom parameters you specify override any default arguments for specific runs of Amazon Glue jobs. 

 **To add source and target parameters** 

 Whether you are using the Amazon Glue Data Catalog node as a source or a target, you can define runtime parameters in the **Advanced properties** section on the **Job details** tab. 

1.  Choose the Amazon Glue Data Catalog node as either the source node or the target node. 

1.  Choose the **Job details** tab. 

1.  Choose **Advanced properties**. 

1.  In the **Job parameters** section, enter a key. For example, `--db.source` would be the parameter for a database source. You can enter any name for the key, as long as the key name is prefixed with two dashes (`--`).   
![\[The screenshot shows the job parameters section in the Job details tab. You can define parameters to use during runtime for the Database and Table.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/Data_Catalog_node_job_details_job_parameters.png)

1.  Enter the value. For example, `databasename` would be the value for the database being parameterized. 

1.  Choose **Add new parameter** if you want to add more parameters. A maximum of 50 parameters is allowed. Once the key-value pair has been defined, you can use the parameter in the Amazon Glue Data Catalog node. 
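Conceptually, each `--key value` pair you define becomes available to the job at run time. The following sketch is not the `awsglue` library itself, just an illustration of how dash-dash keys pair with their values (the parameter names are the hypothetical ones from the example above):

```python
# Conceptual sketch of runtime parameter resolution; NOT the awsglue
# implementation, only an illustration of the --key value convention.
def resolve_job_parameters(argv):
    """Collect --key value pairs from a job's argument list into a dict."""
    params = {}
    i = 0
    while i < len(argv):
        if argv[i].startswith("--") and i + 1 < len(argv):
            # Strip the leading dashes so "--db.source" is looked up as "db.source".
            params[argv[i][2:]] = argv[i + 1]
            i += 2
        else:
            i += 1
    return params

# A run started with these parameters would see:
args = resolve_job_parameters(["--db.source", "databasename", "--table.source", "tablename"])
# → {"db.source": "databasename", "table.source": "tablename"}
```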

 **To select a runtime parameter** 

**Note**  
 The process to select runtime parameters for databases and tables is the same whether the Amazon Glue Data Catalog node is the source or the target. 

1.  Choose the Amazon Glue Data Catalog node as either the source node or the target node. 

1.  In the **Data source properties - Data Catalog** tab, under **Database**, choose **Use runtime parameters**.   
![\[The screenshot shows the runtime parameter drop-down menu. You can select any defined parameters to use during runtime for the Database and Table.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/Data_Catalog_node_select_runtime_parameters.png)

1.  Choose a parameter from the drop-down menu. For example, when you select a parameter you defined for a source database, the database will automatically populate in the database drop-down menu when you choose **Apply**. 

1.  In the Table section, choose a parameter you already defined as a source table. When you choose **Apply**, the table is automatically populated as the table to use. 

1.  When you save and run the job, Amazon Glue Studio will reference the selected parameters during the job run. 

# Using Git version control systems in Amazon Glue


**Note**  
 Notebooks are not currently supported for version control in Amazon Glue Studio. However, version control for Amazon Glue job scripts and visual ETL jobs is supported. 

 If you have remote repositories and want to manage your Amazon Glue jobs using your repositories, you can use Amazon Glue Studio or the Amazon CLI to sync changes to your repositories and your jobs in Amazon Glue. When you sync changes this way, you're pushing the job from Amazon Glue Studio to your repository, or pulling from the repository to Amazon Glue Studio. 

 With Git integration in Amazon Glue Studio, you can: 
+  Integrate with Git version control systems, such as Amazon CodeCommit, GitHub, GitLab, and Bitbucket 
+  Edit Amazon Glue jobs in Amazon Glue Studio whether you use visual jobs or script jobs and sync them to a repository 
+  Parameterize sources and targets in jobs 
+  Pull jobs from a repository and edit them in Amazon Glue Studio 
+  Test jobs by pulling from branches and/or pushing to branches utilizing multi-branch workflows in Amazon Glue Studio 
+  Download files from a repository and upload jobs into Amazon Glue Studio for cross-account job creation 
+  Use your automation tool of choice (for example, Jenkins, Amazon CodeDeploy, etc.) 


## IAM permissions


 Ensure the job has one of the following IAM permissions. For more information on how to set up IAM permissions, see [Set up IAM permissions for Amazon Glue Studio](https://docs.amazonaws.cn/glue/latest/ug/setting-up.html?icmpid=docs_glue_studio_helppanel#getting-started-iam-permissions). 
+ `AWSGlueServiceRole`
+ `AWSGlueConsoleFullAccess`

 At minimum, the following actions are needed for Git integration: 
+  `glue:UpdateJobFromSourceControl` — to be able to update Amazon Glue with a job present in a version control system 
+  `glue:UpdateSourceControlFromJob` — to be able to update the version control system with a job stored in Amazon Glue 
+  `s3:GetObject` — to be able to retrieve the script for the job while pushing to version control system 
+  `s3:PutObject` — to be able to update the script when pulling a job from a source control system 
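For illustration, an identity-based policy granting just these actions might look like the following sketch. The S3 `Resource` ARN is a placeholder; scope it to the bucket and prefix that store your job scripts:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueGitSync",
      "Effect": "Allow",
      "Action": [
        "glue:UpdateJobFromSourceControl",
        "glue:UpdateSourceControlFromJob"
      ],
      "Resource": "*"
    },
    {
      "Sid": "JobScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::your-script-bucket/scripts/*"
    }
  ]
}
```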

## Prerequisites


 In order to push jobs to a source control repository, you will need: 
+  a repository that has already been created by your administrator 
+  a branch in the repository 
+  a personal access token (for Bitbucket, this is the Repository Access Token) 
+  the username of the repository owner 
+  set permissions in the repository to allow Amazon Glue Studio to read and write to the repository 
  +  **GitLab** – set token scopes to `api`, `read_repository`, and `write_repository` 
  +  **Bitbucket** – set permissions to: 
    + **Workspace membership** – read, write
    + **Projects** – write, admin
    + **Repositories** – read, write, admin, delete

**Note**  
 When using Amazon CodeCommit, personal access token and repository owner are not needed. See [ Getting started with Git and Amazon CodeCommit](https://docs.aws.amazon.com/codecommit/latest/userguide/getting-started.html). 

 **Using jobs from your source control repository in Amazon Glue Studio** 

 In order to pull a job from your source control repository that is not in Amazon Glue Studio, and to use that job in Amazon Glue Studio, the prerequisites will depend on the type of job. 

 **For visual jobs:** 
+  you need a folder and a JSON file of the job definition that matches the job name 

   For example, see the job definition below. The branch in your repository should contain a path `my-visual-job/my-visual-job.json`, where both the folder and the JSON file match the job name. 

  ```
  {
    "name" : "my-visual-job",
    "description" : "",
    "role" : "arn:aws:iam::aws_account_id:role/Rolename",
    "command" : {
      "name" : "glueetl",
      "scriptLocation" : "s3://foldername/scripts/my-visual-job.py",
      "pythonVersion" : "3"
    },
    "codeGenConfigurationNodes" : "{\"node-nodeID\":{\"S3CsvSource\":{\"AdditionalOptions\":{\"EnableSamplePath\":false,\"SamplePath\":\"s3://notebook-test-input/netflix_titles.csv\"},\"Escaper\":\"\",\"Exclusions\":[],\"Name\":\"Amazon S3\",\"OptimizePerformance\":false,\"OutputSchemas\":[{\"Columns\":[{\"Name\":\"show_id\",\"Type\":\"string\"},{\"Name\":\"type\",\"Type\":\"string\"},{\"Name\":\"title\",\"Type\":\"choice\"},{\"Name\":\"director\",\"Type\":\"string\"},{\"Name\":\"cast\",\"Type\":\"string\"},{\"Name\":\"country\",\"Type\":\"string\"},{\"Name\":\"date_added\",\"Type\":\"string\"},{\"Name\":\"release_year\",\"Type\":\"bigint\"},{\"Name\":\"rating\",\"Type\":\"string\"},{\"Name\":\"duration\",\"Type\":\"string\"},{\"Name\":\"listed_in\",\"Type\":\"string\"},{\"Name\":\"description\",\"Type\":\"string\"}]}],\"Paths\":[\"s3://dalamgir-notebook-test-input/netflix_titles.csv\"],\"QuoteChar\":\"quote\",\"Recurse\":true,\"Separator\":\"comma\",\"WithHeader\":true}}}"
  }
  ```

 **For script jobs:** 
+  you need a folder, a JSON file of the job definition, and the script 
+  the folder and JSON file should match the job name. The script name needs to match the `scriptLocation` in the job definition along with the file extension 

   For example, in the job definition below, the branch in your repository should contain the paths `my-script-job/my-script-job.json` and `my-script-job/my-script-job.py`. The script name must match the name in `scriptLocation`, including the extension of the script. 

  ```
  {
    "name" : "my-script-job",
    "description" : "",
    "role" : "arn:aws:iam::aws_account_id:role/Rolename",
    "command" : {
      "name" : "glueetl",
      "scriptLocation" : "s3://foldername/scripts/my-script-job.py",
      "pythonVersion" : "3"
    }
  }
  ```
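As an illustration (this helper is not part of Amazon Glue), the repository paths that a branch must contain can be derived from the job definition like this:

```python
# Illustrative check, not part of Amazon Glue: derive the repository layout
# that Amazon Glue Studio expects for a script job. The folder and the JSON
# file match the job name, and the script file name matches the
# scriptLocation in the job definition, extension included.
import os

def expected_repo_paths(job_definition):
    name = job_definition["name"]
    script_file = os.path.basename(job_definition["command"]["scriptLocation"])
    return [f"{name}/{name}.json", f"{name}/{script_file}"]

paths = expected_repo_paths({
    "name": "my-script-job",
    "command": {"name": "glueetl",
                "scriptLocation": "s3://foldername/scripts/my-script-job.py"},
})
# → ["my-script-job/my-script-job.json", "my-script-job/my-script-job.py"]
```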

## Limitations

+  Amazon Glue currently does not support pushing/pulling from [GitLab-Groups](https://docs.gitlab.com/ee/user/group). 

## Connecting version control repositories with Amazon Glue


 You can enter your version control repository details and manage them in the **Version Control** tab in the Amazon Glue Studio job editor. To integrate with your Git repository, you must connect to your repository every time you log in to Amazon Glue Studio. 

 To connect a Git version control system: 

1.  In Amazon Glue Studio, start a new job and choose the **Version Control** tab.   
![\[The screenshot shows a job with the Version Control tab selected.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/editing-nodes-version-control-tab.png)

1.  In **Version control system**, choose the Git service from the available options in the drop-down menu. 
   +  Amazon CodeCommit 
   +  GitHub 
   + GitLab
   + Bitbucket

1.  Depending on the Git version control system you choose, you will have different fields to complete. 

   

    **For Amazon CodeCommit**: 

    Complete the repository configuration by selecting the repository and branch for your job: 
   +  **Repository** — if you have set up repositories in Amazon CodeCommit, select the repository from the drop-down menu. Your repositories will automatically populate in the list 
   +  **Branch** — select the branch from the drop-down menu 
   +  **Folder** — *optional* - enter the name of the folder in which to save your job. If left empty, a folder is automatically created. The folder name defaults to the job name 

   

    **For GitHub**: 

    Complete the GitHub configuration by completing the fields: 
   +  **Personal access token** — this is the token provided by the GitHub repository. For more information on personal access tokens, see [ GitHub Docs ](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token) 
   +  **Repository owner** — this is the owner of the GitHub repository. 

    Complete the repository configuration by selecting the repository and branch from GitHub. 
   +  **Repository** — if you have set up repositories in GitHub, select the repository from the drop-down menu. Your repositories will automatically populate in the list 
   +  **Branch** — select the branch from the drop-down menu 
   +  **Folder** — *optional* - enter the name of the folder in which to save your job. If left empty, a folder is automatically created. The folder name defaults to the job name 

   

    **For GitLab**: 
**Note**  
 Amazon Glue currently does not support pushing/pulling from [GitLab-Groups](https://docs.gitlab.com/ee/user/group). 
   +  **Personal access token** — this is the token provided by the GitLab repository. For more information on personal access tokens, see [ GitLab Personal access tokens ](https://docs.gitlab.com/ee/user/profile/personal_access_tokens.html) 
   +  **Repository owner** — this is the owner of the GitLab repository. 

    Complete the repository configuration by selecting the repository and branch from GitLab. 
   +  **Repository** — if you have set up repositories in GitLab, select the repository from the drop-down menu. Your repositories will automatically populate in the list 
   +  **Branch** — select the branch from the drop-down menu 
   +  **Folder** — *optional* - enter the name of the folder in which to save your job. If left empty, a folder is automatically created. The folder name defaults to the job name 

    **For Bitbucket**: 
   +  **App password** — Bitbucket uses App passwords and not Repository Access Tokens. For more information on App passwords, see [ App passwords ](https://support.atlassian.com/bitbucket-cloud/docs/app-passwords/). 
   +  **Repository owner** — this is the owner of the Bitbucket repository. In Bitbucket, the owner is the creator of the repository. 

    Complete the repository configuration by selecting the workspace, repository, branch, and folder from Bitbucket. 
   +  **Workspace** – if you have workspaces set up in Bitbucket, select the workspace from the drop-down menu. Your workspaces are automatically populated 
   +  **Repository** — if you have set up repositories in Bitbucket, select the repository from the drop-down menu. Your repositories are automatically populated 
   +  **Branch** — select the branch from the drop-down menu. Your branches are automatically populated 
   +  **Folder** — *optional* - enter the name of the folder in which to save your job. If left empty, a folder is automatically created with the job name. 

1.  Choose **Save** at the top of the Amazon Glue Studio job 

## Pushing Amazon Glue jobs to the source repository


 Once you've entered the details of your version control system, you can edit jobs in Amazon Glue Studio and push the jobs to your source repository. If you're unfamiliar with Git concepts such as pushing and pulling, see this tutorial on [Getting started with Git and Amazon CodeCommit](https://docs.aws.amazon.com/codecommit/latest/userguide/getting-started.html). 

 In order to push your job to a repository, you need to enter the details of your version control system and save your job. 

1.  In the Amazon Glue Studio job, choose **Actions**. This opens additional menu options.   
![\[The screenshot shows a job with the Actions menu opened. The Push to repository option is visible.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/editing-nodes-actions-push-to-repository.png)

1.  Choose **Push to repository**. 

    This action saves the job. When you push to the repository, Amazon Glue Studio pushes the last saved change. If the job in the repository was modified by you or another user and is out of sync with the job in Amazon Glue Studio, pushing overwrites the job in the repository with the version saved in Amazon Glue Studio. 

1.  Choose **Confirm** to complete the action. This creates a new commit in the repository. If you are using Amazon CodeCommit, a confirmation message will display a link to the latest commit on Amazon CodeCommit. 

## Pulling Amazon Glue jobs from the source repository


 Once you've entered details of your Git repository into the **Version control** tab, you can also pull jobs from your repository and edit them in Amazon Glue Studio. 

1.  In the Amazon Glue Studio job, choose **Actions**. This will open additional menu options.   
![\[The screenshot shows a job with the Actions menu opened. The Push to repository option is visible.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/editing-nodes-actions-push-to-repository.png)

1.  Choose **Pull from repository**. 

1.  Choose **Confirm**. This takes the latest commit from the repository and updates your job in Amazon Glue Studio. 

1.  Edit your job in Amazon Glue Studio. If you make changes, you can sync your job to your repository by choosing **Push to repository** from the **Actions** drop-down menu. 
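 Jobs can also be pushed and pulled programmatically through the Amazon Glue API. The sketch below is a minimal illustration using boto3, assuming the `UpdateSourceControlFromJob` (push) and `UpdateJobFromSourceControl` (pull) operations; the job and repository names are placeholders, so verify the parameters against the current boto3 documentation before use. 

```python
# Sketch: building the shared parameters for the Glue source-control API
# calls. Only source_control_params is plain Python; the API calls are
# shown commented out because they require AWS credentials.
def source_control_params(job_name, provider, repository, owner, branch, folder=None):
    """Build the keyword arguments shared by the push and pull calls."""
    params = {
        "JobName": job_name,
        "Provider": provider,          # e.g. "GITHUB", "GITLAB", "BITBUCKET"
        "RepositoryName": repository,
        "RepositoryOwner": owner,
        "BranchName": branch,
    }
    if folder:
        params["Folder"] = folder
    return params

# Usage (assumed API names -- verify in the boto3 Glue reference):
# import boto3
# glue = boto3.client("glue")
# glue.update_source_control_from_job(**source_control_params(
#     "my-etl-job", "GITHUB", "my-repo", "my-org", "main"))   # push
# glue.update_job_from_source_control(**source_control_params(
#     "my-etl-job", "GITHUB", "my-repo", "my-org", "main"))   # pull
```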

# Authoring code with Amazon Glue Studio notebooks

 Data engineers can author Amazon Glue jobs faster and more easily using the interactive notebook interface in Amazon Glue Studio or interactive sessions in Amazon Glue. 

## Limitations

+  Amazon Glue Studio notebooks do not support Scala. 

**Topics**
+ [Limitations](#notebooks-chapter-limitations)
+ [Overview of using notebooks](using-notebooks-overview.md)
+ [Creating an ETL job using notebooks in Amazon Glue Studio](create-notebook-job.md)
+ [Notebook editor components](notebook-components.md)
+ [Saving your notebook and job script](save-notebook.md)
+ [Managing notebook sessions](manage-notebook-sessions.md)
+ [Using Amazon Q Developer with Amazon Glue Studio notebooks](glue-studio-notebooks-amazon-q-developer.md)

# Overview of using notebooks


 Amazon Glue Studio allows you to interactively author jobs in a notebook interface based on Jupyter Notebooks. Through notebooks in Amazon Glue Studio, you can edit data integration code and view the output without having to run a full job, add markdown, and save notebooks as `.ipynb` files and job scripts. You can start a notebook without installing software locally or managing servers. When you are satisfied with your code, Amazon Glue Studio can convert your notebook to a Glue job with the click of a button. 

 Some benefits of using notebooks include: 
+  No cluster to provision or manage 
+  No idle clusters to pay for 
+  No up-front configuration required 
+  No installation of Jupyter notebooks required 
+  The same runtime/platform as Amazon Glue ETL 

 When you start a notebook through Amazon Glue Studio, all the configuration steps are done for you so that you can explore your data and start developing your job script after only a few seconds. Amazon Glue Studio configures a Jupyter notebook with the Amazon Glue Jupyter kernel. You don’t have to configure VPCs, network connections, or development endpoints to use this notebook. 

 To create jobs using the notebook interface: 
+  Configure the necessary IAM permissions. 
+  Start a notebook session to create a job. 
+  Write code in the cells in the notebook. 
+  Run and test the code to view the output. 
+  Save the job. 

 After your notebook is saved, your notebook is a full Amazon Glue job. You can manage all aspects of the job, such as scheduling job runs, setting job parameters, and viewing the job run history, right alongside your notebook. 

# Creating an ETL job using notebooks in Amazon Glue Studio


**To start using notebooks in the Amazon Glue Studio console**

1.  Attach Amazon Identity and Access Management policies to the Amazon Glue Studio user and create an IAM role for your ETL job and notebook. 

1.  Configure additional IAM security for notebooks, as described in [Granting permissions for the IAM role](notebook-getting-started.md#studio-notebook-permissions). 

1.  Open the Amazon Glue Studio console at [https://console.amazonaws.cn/gluestudio/](https://console.amazonaws.cn/gluestudio/). 
**Note**  
Check that your browser does not block third-party cookies. Any browser that blocks third-party cookies either by default or as a user-enabled setting will prevent notebooks from launching. For more information on managing cookies, see:
   + [Chrome](https://support.alertlogic.com/hc/en-us/articles/360018127132-Turn-Off-Block-Third-Party-Cookies-in-Chrome-for-Windows)
   + [Firefox](https://support.mozilla.org/en-US/kb/third-party-cookies-firefox-tracking-protection)
   + [Safari](https://support.apple.com/guide/safari/manage-cookies-sfri11471/mac)

1. Choose the **Jobs** link in the left-side navigation menu. 

1.  Choose **Jupyter notebook** and then choose **Create** to start a new notebook session. 

1.  On the **Create job in Jupyter notebook** page, provide the job name, and choose the IAM role to use. Choose **Create job**. 

    After a short time period, the notebook editor appears. 

1.  After you add the code, you must run the cell to initiate a session. There are multiple ways to run the cell: 
   + Press the play button.
   +  Use a keyboard shortcut: 
     +  On macOS, press **Command**+**Enter** to run the cell. 
     +  On Windows, press **Shift**+**Enter** to run the cell. 

    For information about writing code using a Jupyter notebook interface, see [The Jupyter Notebook User Documentation](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html). 

1.  To test your script, run the entire script, or individual cells. Any command output will be displayed in the area beneath the cell. 

1.  After you have finished developing your notebook, you can save the job and then run it. You can find the script on the **Script** tab. Any magics you added to the notebook are stripped away and aren't saved as part of the script of the generated Amazon Glue job. Amazon Glue Studio automatically appends a `job.commit()` to the end of the script it generates from the notebook contents.

   For more information about running jobs, see [Start a job run](managing-jobs-chapter.md#start-jobs). 
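 The script generated from a notebook follows the standard Amazon Glue job scaffolding. A representative sketch, for illustration only (the exact preamble depends on your Glue version and session configuration, and the code runs only inside the Glue runtime): 

```
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... code cells from your notebook appear here ...

job.commit()  # appended automatically by Amazon Glue Studio
```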

   

# Notebook editor components


 The notebook editor interface has the following main sections. 
+  Notebook interface (main panel) and toolbar 
+  Job editing tabs 

## The notebook editor


 The Amazon Glue Studio notebook editor is based on the Jupyter Notebook Application. The Amazon Glue Studio notebook interface is similar to the one provided by Jupyter Notebooks, which is described in [Notebook user interface](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html?highlight=toolbar#notebook-user-interface). The notebook used by interactive sessions is a Jupyter Notebook. 

 Although the Amazon Glue Studio notebook is similar to Jupyter Notebooks, it differs in a few key ways: 
+  Currently, the Amazon Glue Studio notebook cannot install extensions. 
+  You cannot use multiple tabs; there is a 1:1 relationship between a job and a notebook. 
+  The Amazon Glue Studio notebook does not have the same top file menu that exists in Jupyter Notebooks. 
+  Currently, the Amazon Glue Studio notebook only runs with the Amazon Glue kernel. You cannot update the kernel on your own. 

## Amazon Glue Studio job editing tabs


 The tabs that you use to interact with the ETL job are at the top of the notebook page. They are similar to tabs that appear in the visual job editor of Amazon Glue Studio, and they perform the same actions. 
+  **Notebook** – Use this tab to view the job script using the notebook interface. 
+  **Job details** – Configure the environment and properties for the job runs. 
+  **Runs** – View information about previous runs of this job. 
+  **Schedules** – Configure a schedule for running your job at specific times. 

# Saving your notebook and job script


 You can save your notebook and the job script you are creating at any time. Simply choose the **Save** button in the upper right corner, the same as if you were using the visual or script editor. 

 When you choose **Save**, the notebook file is saved in the default locations: 
+  By default, the job script is saved to the Amazon S3 location indicated in the **Job Details** tab, under **Advanced properties**, in the Job details property **Script path**. Job scripts are saved in a subfolder named `Scripts`. 
+  By default, the notebook file (`.ipynb`) is saved to the Amazon S3 location indicated in the **Job Details** tab, under **Advanced properties**, in the Job details **Script path**. Notebook files are saved in a subfolder named `Notebooks`. 

**Note**  
 When you save the job, the job script contains only the code cells from the notebook. The Markdown cells and magics aren't included in the job script. However, the `.ipynb` file will contain any markdown and magics. 

 After you save the job, you can then run the job using the script that you created in the notebook. 

# Managing notebook sessions


 Notebooks in Amazon Glue Studio are based on the interactive sessions feature of Amazon Glue. There is a cost for using interactive sessions. To help manage your costs, you can monitor the sessions created for your account, and configure the default settings for all sessions. 

## Change the default timeout for all notebook sessions


 By default, the provisioned Amazon Glue Studio notebook times out after 12 hours if the notebook was launched and no cells have been executed. There is no cost associated with this, and the timeout is not configurable. 

 Executing a cell starts an interactive session. This session has a default timeout of 48 hours, which you can configure by running the `%idle_timeout` magic before executing a cell. 

**To modify the default session timeout for notebooks in Amazon Glue Studio**

1.  In the notebook, enter the `%idle_timeout` magic in a cell and specify the timeout value in minutes. 

    For example, `%idle_timeout 15` changes the default timeout to 15 minutes. If the session is not used for 15 minutes, the session is automatically stopped. 

## Installing additional Python modules


 To install additional Python modules in your session using pip, add them with the `%additional_python_modules` magic: 

```
%additional_python_modules awswrangler, s3://amzn-s3-demo-bucket/mymodule.whl
```

 All arguments to `%additional_python_modules` are passed to `pip3 install`. 

 For a list of available Python modules, see [Using Python libraries with Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-python-libraries.html). 

## Changing Amazon Glue Configuration


 You can use magics to control Amazon Glue job configuration values. To change a job configuration value, use the corresponding magic in the notebook. See [Magics supported by Amazon Glue interactive sessions for Jupyter](https://docs.amazonaws.cn/glue/latest/dg/interactive-sessions-magics.html). 

**Note**  
 Overriding properties for a running session is no longer available. To change a session's configuration, stop the session, set the new configuration values, and then start a new session. 

 Amazon Glue supports various worker types. You can set the worker type with `%worker_type`. For example: `%worker_type G.2X`. Available worker types include G.1X, G.2X, G.4X, G.8X, G.12X, G.16X, R.1X, R.2X, R.4X, and R.8X. The default is G.1X. 

 You can also specify the number of workers with `%number_of_workers`. For example, to specify 40 workers: `%number_of_workers 40`. 

 For more information, see [Defining Job Properties](https://docs.amazonaws.cn/glue/latest/dg/add-job.html). 
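 Because configuration magics apply when the session starts, a common pattern is to group them in the first cell of the notebook, before any code runs. An illustrative first cell (the values shown are arbitrary examples): 

```
%idle_timeout 30
%worker_type G.2X
%number_of_workers 10
```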

## Stop a notebook session


 To stop a notebook session, use the magic `%stop_session`. 

 If you navigate away from the notebook in the Amazon console, you will receive a warning message where you can choose to stop the session. 

# Using Amazon Q Developer with Amazon Glue Studio notebooks

 Amazon Glue Studio allows you to interactively author jobs in a notebook interface based on Jupyter Notebooks. Using Amazon Q Developer improves the authoring experience within Amazon Glue Studio notebooks. 

 The Amazon Q Developer extension supports writing code by generating code recommendations and suggesting improvements related to code issues. Amazon Q Developer supports both Python and Scala, the two languages used for coding ETL scripts for Spark jobs in Amazon Glue Studio notebooks. 

## What is Amazon Q Developer?


 Amazon Q Developer is a service powered by machine learning that helps improve developer productivity. Amazon Q Developer achieves this by generating code recommendations based on developers’ comments in natural language and their code in the IDE. The service integrates with JupyterLab, Amazon SageMaker AI Studio, Amazon SageMaker AI notebook instances, and other integrated development environments (IDEs). 

 For more information, see [Using Amazon Q Developer with Amazon Glue Studio](https://docs.amazonaws.cn/amazonq/latest/qdeveloper-ug/glue-setup.html). 

# Amazon Glue job run statuses on the console

You can view the status of an Amazon Glue extract, transform, and load (ETL) job while it is running or after it has stopped. You can view the status using the Amazon Glue console. 
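 You can also retrieve run statuses programmatically with the `GetJobRuns` API. The sketch below uses boto3 (the job name is a placeholder); the summarizer itself is plain Python. 

```python
# Sketch: summarizing Glue job run statuses fetched via the GetJobRuns API.
from collections import Counter

def summarize_run_states(job_runs):
    """Count job runs by their JobRunState (e.g. SUCCEEDED, FAILED, RUNNING)."""
    return Counter(run["JobRunState"] for run in job_runs)

# Usage (requires boto3 and AWS credentials):
# import boto3
# glue = boto3.client("glue")
# runs = glue.get_job_runs(JobName="my-etl-job")["JobRuns"]
# print(summarize_run_states(runs))
```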

## Accessing the job monitoring dashboard


You access the job monitoring dashboard by choosing the **Job run monitoring** link in the Amazon Glue navigation pane under **ETL jobs**.

## Overview of the job monitoring dashboard


The job monitoring dashboard provides an overall summary of the job runs, with totals for the jobs with a status of **Running**, **Canceled**, **Success**, or **Failed**. Additional tiles provide the overall job run success rate, the estimated DPU usage for jobs, a breakdown of the job status counts by job type, worker type, and by day. 

The graphs in the tiles are interactive. You can choose any block in a graph to run a filter that displays only those jobs in the **Job runs** table at the bottom of the page.

You can change the date range for the information displayed on this page by using the **Date range** selector. When you change the date range, the information tiles adjust to show the values for the specified number of days before the current date. You can also use a specific date range if you choose **Custom** from the date range selector. 

## Job runs view


**Note**  
 Job run history is accessible for 90 days for your workflows and job runs. 

The **Job runs** resource list shows the jobs for the specified date range and filters.

You can filter the jobs on additional criteria, such as status, worker type, job type, and the job name. In the filter box at the top of the table, you can enter the text to use as a filter. The table results are updated with rows that contain matching text as you enter the text.

You can view a subset of the jobs by choosing elements from the graphs on the job monitoring dashboard. For example, if you choose the number of running jobs in the **Job runs summary** tile, then the **Job runs** list displays only the jobs that currently have a status of `Running`. If you choose one of the bars in the **Worker type breakdown** bar chart, then only job runs with the matching worker type and status are shown in the **Job runs** list.

The **Job runs** resource list displays the details for the job runs. You can sort the rows in the table by choosing a column heading. The table contains the following information:


| Property | Description | 
| --- | --- | 
| Job name | The name of the job. | 
| Type |  The type of job environment: [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/view-job-runs.html)  | 
| Start time |  The date and time at which this job run was started.  | 
| End time |  The date and time that this job run completed.  | 
| Run status |  The current state of the job run. Values can be: [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/view-job-runs.html)  | 
| Run time | The amount of time that the job run consumed resources. | 
| Capacity |  The number of Amazon Glue data processing units (DPUs) that were allocated for this job run. For more information about capacity planning, see [Monitoring for DPU Capacity Planning](https://docs.amazonaws.cn/glue/latest/dg/monitor-debug-capacity.html) in the *Amazon Glue Developer Guide*.  | 
| Worker type |  The type of predefined worker that was allocated when the job ran. Values can be `G.1X`, `G.2X`, `G.4X` or `G.8X`.  [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/view-job-runs.html)  | 
| DPU hours |  The estimated number of DPUs used for the job run. A DPU is a relative measure of processing power. DPUs are used to determine the cost of running your job. For more information, see the [Amazon Glue pricing](https://aws.amazon.com/glue/pricing/) page.  | 

You can choose any job run in the list and view additional information. Choose a job run, and then do one of the following:
+ Choose the **Actions** menu and the **View job** option to view the job in the visual editor.
+ Choose the **Actions** menu and the **Stop run** option to stop the current run of the job.
+ Choose the **View CloudWatch logs** button to view the job run logs for that job. 
+ Choose **View details** to view the job run details page.

## Viewing the job run logs


You can view the job logs in a variety of ways:
+ On the **Monitoring** page, in the **Job runs** table, choose a job run, and then choose **View CloudWatch logs**.
+ In the visual job editor, on the **Runs** tab for a job, choose the hyperlinks to view the logs:
  + **Logs** – Links to the Apache Spark job logs written when continuous logging is enabled for a job run. When you choose this link, it takes you to the Amazon CloudWatch logs in the `/aws-glue/jobs/logs-v2` log group. By default, the logs exclude non-useful Apache Hadoop YARN heartbeat and Apache Spark driver or executor log messages. For more information about continuous logging, see [Continuous Logging for Amazon Glue Jobs](https://docs.amazonaws.cn/glue/latest/dg/monitor-continuous-logging.html) in the *Amazon Glue Developer Guide*.
  + **Error logs** – Links to the logs written to `stderr` for this job run. When you choose this link, it takes you to the Amazon CloudWatch logs in the `/aws-glue/jobs/error` log group. You can use these logs to view details about any errors that were encountered during the job run.
  + **Output logs** – Links to the logs written to `stdout` for this job run. When you choose this link, it takes you to the Amazon CloudWatch logs in the `/aws-glue/jobs/output` log group. You can use these logs to see all the details about the tables that were created in the Amazon Glue Data Catalog and any errors that were encountered.

## Viewing the details of a job run


You can choose a job in the **Job runs** list on the **Monitoring** page, and then choose **View run details** to see detailed information for that run of the job. 

The information displayed on the job run detail page includes:


| Property | Description | 
| --- | --- | 
| Job name | The name of the job. | 
| Run Status |  The current state of the job run. Values can be: [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/view-job-runs.html)  | 
| Glue version | The Amazon Glue version used by the job run. | 
| Recent attempt | The number of automatic retry attempts for this job run. | 
| Start time |  The date and time at which this job run was started.  | 
| End time |  The date and time that this job run completed.  | 
| Start-up time |  The amount of time spent preparing to run the job.  | 
| Execution time |  The amount of time spent running the job script.  | 
| Trigger name |  The name of the trigger associated with the job.  | 
| Last modified on |  The date when the job was last modified.  | 
| Security configuration |  The security configuration for the job, which includes Amazon S3 encryption, CloudWatch encryption, and job bookmarks encryption settings.  | 
| Timeout | The job run timeout threshold value. | 
| Allocated capacity |  The number of Amazon Glue data processing units (DPUs) that were allocated for this job run. For more information about capacity planning, see [Monitoring for DPU Capacity Planning](https://docs.amazonaws.cn/glue/latest/dg/monitor-debug-capacity.html) in the *Amazon Glue Developer Guide*.  | 
| Max capacity |  The maximum capacity available to the job run.  | 
| Number of workers | The number of workers used for the job run.  | 
| Worker type |  The type of predefined workers allocated for the job run. Values can be `G.1X` or `G.2X`. [\[See the AWS documentation website for more details\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/view-job-runs.html)  | 
| Logs | A link to the job logs for continuous logging (/aws-glue/jobs/logs-v2).  | 
| Output Logs | A link to the job output log files (/aws-glue/jobs/output). | 
| Error logs | A link to the job error log files (/aws-glue/jobs/error). | 

You can also view the following additional items, which are available when you view information for recent job runs. For more information, see [View information for recent job runs](managing-jobs-chapter.md#view-job-run-details).
+ **Input arguments**
+ **Continuous logs**
+ **Metrics** – You can see visualizations of basic metrics. For more information on included metrics, see [Viewing Amazon CloudWatch metrics for a Spark job run](#monitoring-job-run-metrics).
+ **Spark UI** – You can visualize Spark logs for your job in the Spark UI. For more information about using the Spark Web UI, see [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md). Enable this feature by following the procedure in [Enabling the Apache Spark web UI for Amazon Glue jobs](monitor-spark-ui-jobs.md).

## Viewing Amazon CloudWatch metrics for a Spark job run


On the details page for a job run, below the **Run details** section, you can view the job metrics. Amazon Glue Studio sends job metrics to Amazon CloudWatch for every job run. 

Amazon Glue reports metrics to Amazon CloudWatch every 30 seconds. The Amazon Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute. However, the Apache Spark metrics that Amazon Glue passes on to Amazon CloudWatch are generally absolute values that represent the current state at the time they are reported. 

**Note**  
You must configure your account to access Amazon CloudWatch.

The metrics provide information about your job run, such as:
+ **ETL Data Movement** – The number of bytes read from or written to Amazon S3.
+ **Memory Profile: Heap used** – The number of memory bytes used by the Java virtual machine (JVM) heap.
+ **Memory Profile: heap usage** – The fraction of memory (scale: 0–1), shown as a percentage, used by the JVM heap.
+ **CPU Load** – The fraction of CPU system load used (scale: 0–1), shown as a percentage.
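The same metrics can be queried from CloudWatch with the API. The sketch below is illustrative: the namespace (`Glue`), metric name, and dimensions are assumptions based on the Glue metric naming scheme, so confirm them in the CloudWatch console for your job before relying on them.

```python
# Sketch: totaling CloudWatch datapoints for a Glue job metric.
def total_datapoints(datapoints, stat="Sum"):
    """Add up the requested statistic across CloudWatch datapoints."""
    return sum(dp[stat] for dp in datapoints)

# Usage (requires boto3 and AWS credentials; metric details are assumptions):
# import boto3
# from datetime import datetime, timedelta, timezone
# cw = boto3.client("cloudwatch")
# resp = cw.get_metric_statistics(
#     Namespace="Glue",
#     MetricName="glue.driver.aggregate.bytesRead",  # ETL data movement (read)
#     Dimensions=[{"Name": "JobName", "Value": "my-etl-job"},
#                 {"Name": "JobRunId", "Value": "ALL"},
#                 {"Name": "Type", "Value": "gauge"}],
#     StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
#     EndTime=datetime.now(timezone.utc),
#     Period=60,
#     Statistics=["Sum"],
# )
# print(total_datapoints(resp["Datapoints"]))
```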

## Viewing Amazon CloudWatch metrics for a Ray job run


On the details page for a job run, below the **Run details** section, you can view the job metrics. Amazon Glue Studio sends job metrics to Amazon CloudWatch for every job run. 

Amazon Glue reports metrics to Amazon CloudWatch every 30 seconds. The Amazon Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute. However, the Apache Spark metrics that Amazon Glue passes on to Amazon CloudWatch are generally absolute values that represent the current state at the time they are reported. 

**Note**  
You must configure your account to access Amazon CloudWatch.

In Ray jobs, you can view the following aggregated metric graphs. With these, you can build a profile of your cluster and tasks, and can access detailed information about each node. The time-series data that back these graphs is available in CloudWatch for further analysis.

**Task Profile: Task State**  
Shows the number of Ray tasks in the system. Each task lifecycle is given its own time series.

**Task Profile: Task Name**  
Shows the number of Ray tasks in the system. Only pending and active tasks are shown. Each type of task (by name) is given its own time series.

**Cluster Profile: CPUs in use**  
Shows the number of CPU cores that are used. Each node is given its own time series. Nodes are identified by IP addresses, which are ephemeral and only used for identification.

**Cluster Profile: Object store memory use**  
Shows memory use by the Ray object cache. Each memory location (physical memory, cached on disk, and spilled in Amazon S3) is given its own time series. The object store manages data storage across all nodes in the cluster. For more information, see [Objects](https://docs.ray.io/en/latest/ray-core/objects.html) in the Ray documentation.

**Cluster Profile: Node count**  
Shows the number of nodes provisioned for the cluster.

**Node Detail: CPU use**  
Shows CPU utilization on each node as a percentage. Each series shows an aggregated percentage of CPU usage across all cores on the node.

**Node Detail: Memory use**  
Shows memory use on each node in GB. Each series shows memory aggregated between all processes on the node, including Ray tasks and the Plasma store process. This will not reflect objects stored to disk or spilled to Amazon S3.

**Node Detail: Disk use**  
Shows disk use on each node in GB.

**Node Detail: Disk I/O speed**  
Shows disk I/O on each node in KB/s.

**Node Detail: Network I/O throughput**  
Shows network I/O on each node in KB/s.

**Node Detail: CPU use by Ray component**  
Shows CPU use in fractions of a core. Each Ray component on each node is given its own time series.

**Node Detail: Memory use by Ray component**  
Shows memory use in GiB. Each Ray component on each node is given its own time series.

# Detect and process sensitive data
Detect and process sensitive data

 The Detect PII transform identifies personally identifiable information (PII) in your data source. You choose the PII entities to identify, how you want the data to be scanned, and what to do with the PII entities that the Detect PII transform identifies. 

 The Detect PII transform provides the ability to detect, mask, or remove entities that you define, or that are pre-defined by Amazon. This enables you to increase compliance and reduce liability. For example, you may want to ensure that no readable personally identifiable information exists in your data, and to mask Social Security numbers with a fixed string (such as xxx-xx-xxxx), or to mask phone numbers or addresses. 
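 As a simplified illustration of the masking idea (the transform itself uses managed entity detectors, not a single regular expression), a regex-based sketch: 

```python
# Illustrative sketch: replace any US Social Security number matching a
# simple pattern with a fixed string. The real Detect PII transform uses
# managed entity detectors, not just a regex.
import re

SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssn(text, replacement="xxx-xx-xxxx"):
    """Mask SSN-shaped substrings with a fixed replacement string."""
    return SSN_PATTERN.sub(replacement, text)
```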

 To work with sensitive data outside of Amazon Glue Studio, see [Using Sensitive Data Detection outside Amazon Glue Studio](aws-glue-api-sensitive-data-example.md). 

**Topics**
+ [Choosing how you want the data to be scanned](#choose-datascan-pii)
+ [Choosing the PII entities to detect](#choose-pii-entities)
+ [Specifying the level of detection sensitivity](#sensitive-data-sensitivity)
+ [Choosing what to do with identified PII data](#choose-action-pii)
+ [Adding fine-grained action overrides](#sensitive-data-fine-grained-actions-override)

## Choosing how you want the data to be scanned


 When you scan your dataset for sensitive data like personally identifiable information (PII), you can choose to detect PII in each row or detect the columns that contain PII data. 

![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/detect-fields-PII.png)


 When you choose **Detect PII in each cell**, you’re choosing to scan all rows in the data source. This is a comprehensive scan to ensure that PII entities are identified. 

 When you choose **Detect fields containing PII**, you’re choosing to scan a sample of rows for PII entities. This is a way to keep costs and resources low while also identifying the fields where PII entities are found. 

 When you choose to detect fields that contain PII, you can reduce costs and improve performance by sampling a portion of rows. Choosing this option allows you to specify additional options: 
+  **Sample portion:** This allows you to specify the percentage of rows to sample. For example, if you enter ‘50’, you're specifying that 50 percent of the rows are scanned for the PII entity. 
+  **Detection threshold:** This allows you to specify the percentage of sampled rows that must contain the PII entity for the entire column to be identified as having it. For example, if you enter ‘10’, the PII entity, US Phone, must appear in 10 percent or more of the scanned rows for the field to be identified as having the PII entity, US Phone. If the percentage of rows that contain the PII entity is less than 10 percent, the field is not labeled as having the PII entity, US Phone. 
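 The sample portion and detection threshold combine as a simple percentage check. A plain-Python sketch of that arithmetic (illustrative only; the transform's actual sampling strategy is not specified here): 

```python
# Sketch of the sample-portion / detection-threshold decision: a column is
# flagged when the fraction of sampled rows matching the entity meets the
# threshold. Taking the first rows as the sample is a simplification.
def column_has_entity(values, matches_entity, sample_portion=0.5, threshold=0.1):
    """Sample a portion of rows, then apply the detection threshold."""
    sample = values[: max(1, int(len(values) * sample_portion))]
    hits = sum(1 for v in sample if matches_entity(v))
    return hits / len(sample) >= threshold
```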

## Choosing the PII entities to detect


 If you chose **Detect PII in each cell**, you can choose from one of three options: 
+ All available PII patterns — this includes Amazon entities.
+ Select categories — when you select categories, PII patterns automatically include patterns in the categories that you select.
+ Select specific patterns — only the patterns that you select are detected.

 For a full list of managed sensitive data types, see [Managed data types](https://docs.amazonaws.cn/glue/latest/dg/sensitive-data-managed-data-types.html). 

### Choose from all available PII patterns


 If you choose **All available PII patterns**, you can select from the entities pre-defined by Amazon. You can select one, several, or all entities. 

![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/pii-select-entities-to-detect.png)


### Select categories


 If you chose **Select categories** as the PII patterns to detect, you can select from the options in the drop-down menu. Note that some entities can belong to more than one category. For example, *Person's name* is an entity that belongs to the *Universal* and *HIPAA* categories. 
+  Universal (examples: Email, Credit Card) 
+  HIPAA (examples: US Driving License, Healthcare Common Procedure Coding System (HCPCS) code) 
+  Networking (examples: IP Address, MAC Address) 
+ Argentina
+ Australia
+ Austria
+ Belgium
+ Bosnia
+ Bulgaria
+ Canada
+ Chile
+ Colombia
+ Croatia
+ Cyprus
+ Czechia
+ Denmark
+ Estonia
+ Finland
+ France
+ Germany
+ Greece
+ Hungary
+ Ireland
+ Korea
+ Japan
+ Mexico
+ Netherlands
+ New Zealand
+ Norway
+ Portugal
+ Romania
+ Singapore
+ Slovakia
+ Slovenia
+ Spain
+ Sweden
+ Switzerland
+ Turkey
+ Ukraine
+ United States
+ United Kingdom
+ Venezuela

### Select specific patterns


 If you choose **Select specific patterns** as the PII patterns to detect, you can search or browse from a list of patterns you've already created, or create a new detection entity pattern. 

 The steps below describe how to create a new custom pattern for detecting sensitive data. You create the custom pattern by entering a name, adding a regular expression, and optionally defining context words. 

 

 

1.  To create a new pattern, choose the **Create new** button.   
![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/detectpii-create-new.png)

1.  On the **Create detection entity** page, enter the entity name and a regular expression. The regular expression (regex) is what Amazon Glue uses to match entities. 

1.  Choose **Validate**. If validation succeeds, a confirmation message states that the string is a valid regular expression. If validation fails, a message states that the string does not conform to proper formatting or to the accepted character literals, operators, or constructs. 

1.  You can optionally add context words in addition to the regular expression. Context words increase the likelihood of a match and are useful when field names are not descriptive of the entity. For example, social security numbers may be stored in fields named 'SSN' or 'SS'. Adding these context words helps match the entity. 

1.  Choose **Create** to create the detection entity. Created entities are visible in the Amazon Glue Studio console under **Detection entities** in the left navigation menu. 

    You can edit, delete, or create detection entities from the **Detection entities** page. You can also search for a pattern using the search field. 
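A rough sketch of how a custom pattern with context words might be evaluated is shown below. This is purely illustrative, with a hypothetical SSN regex and context-word list; the actual matching logic is managed by the service.

```python
import re

# Hypothetical custom detection entity: regex plus optional context words.
SSN_REGEX = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CONTEXT_WORDS = {"ssn", "ss", "social"}

def field_matches(field_name, value):
    """Sketch: the regex must match; a context word in the field name
    raises confidence when the name is not descriptive of the entity."""
    if not SSN_REGEX.search(value):
        return False
    has_context = any(w in field_name.lower() for w in CONTEXT_WORDS)
    # Without context, require the whole value to be the entity.
    return True if has_context else bool(SSN_REGEX.fullmatch(value))

print(field_matches("ssn", "123-45-6789"))             # True (regex + context word)
print(field_matches("notes", "123-45-6789"))           # True (exact regex match)
print(field_matches("notes", "call 123-45-6789 now"))  # False (no context, partial match)
```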

## Specifying the level of detection sensitivity


 You can set the level of sensitivity used when detecting sensitive data. 
+  **High** – (Default) Detects more entities for use cases that require a higher level of sensitivity. All Amazon Glue jobs created after November 2023 are automatically opted-in to this setting. 
+  **Low** – Detects fewer entities and reduces false positives. 

![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/detect-sensitve-data-sensitvity-new.png)


## Choosing what to do with identified PII data


 If you chose to detect PII in the entire data source, you can select a global action to apply: 
+  **Enrich data with detection results:** If you chose Detect PII in each cell, you can store the detected entities into a new column. 
+  **Redact detected text:** You can replace the detected PII value with a string that you specify in the optional Replacing text input field. If no string is specified, the detected PII entity is replaced with '\*\*\*\*\*\*\*'. 
+  **Partially redact detected text:** You can replace part of the detected PII value with a string you choose. There are two possible options: to either leave the ends unmasked or to mask by providing an explicit regex pattern. This feature is not available in Amazon Glue 2.0. 
+  **Apply cryptographic hash:** You can pass the detected PII value to a SHA-256 cryptographic hash function and replace the value with the function’s output. 

![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/detect-sensitive-data-global-action.png)
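The three masking actions can be sketched as plain Python for clarity. This is a hedged illustration of the assumed behavior, not the transform code Glue Studio generates; the default mask string and "keep last 4" partial redaction are example choices.

```python
import hashlib

def redact(value, replacement="*******"):
    # Redact detected text: the whole value is replaced.
    return replacement

def partial_redact(value, keep_last=4, mask_char="*"):
    # Partially redact detected text: leave the last few characters unmasked.
    return mask_char * (len(value) - keep_last) + value[-keep_last:]

def sha256_hash(value):
    # Apply cryptographic hash: replace the value with its SHA-256 digest.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

card = "4111111111111111"
print(redact(card))            # *******
print(partial_redact(card))    # ************1111
print(len(sha256_hash(card)))  # 64 (hex characters)
```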


### Differences between Amazon Glue version 2.0 and versions 3.0 and later


 Amazon Glue 2.0 jobs will return a new DataFrame with the detected PII information for each column in a supplementary column. Any redaction or hash work is visible within the Amazon Glue script in the visual tab. 

 Amazon Glue 3.0 and 4.0 jobs will return a new DataFrame with this same supplementary column. A new key for “actionUsed” is present and can be one of `DETECT`, `REDACT`, `PARTIAL_REDACT`, or `SHA256_HASH`. If a masking action is selected, the DataFrame will return data with sensitive data masked. 

## Adding fine-grained action overrides


 Additional detection and action settings can be added to the fine-grained actions overrides table. This allows you to: 
+  **Include or exclude certain columns from detection** – An inferred schema on the data source will populate the table with available columns. 
+  **Specify specific settings that are more fine-grained than using global actions** – For example, you can specify different redaction text settings for different entity types. 
+  **Specify a different action than the global action** – You can apply a different action to a particular sensitive data type here. Note that the two edit-in-place actions (redaction and hashing) cannot both be used on the same column, but detect can always be used. 

![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/detect-sensitive-data-fga-overrides.png)


# Managing ETL jobs with Amazon Glue Studio

You can use the simple graphical interface in Amazon Glue Studio to manage your ETL jobs. Using the navigation menu, choose **Jobs** to view the **Jobs** page. On this page, you can see all the jobs that you have created either with Amazon Glue Studio or the Amazon Glue console. You can view, manage, and run your jobs on this page. 

**Topics**
+ [Start a job run](#start-jobs)
+ [Schedule job runs](#schedule-jobs)
+ [Manage job schedules](#manage-schedules)
+ [Stop job runs](#stop-jobs)
+ [View your jobs](#view-jobs)
+ [View information for recent job runs](#view-job-run-details)
+ [View the job script](#view-job-script)
+ [Modify the job properties](#edit-jobs-properties)
+ [Save the job](#save-job)
+ [Clone a job](#clone-jobs)
+ [Delete jobs](#delete-jobs)

## Start a job run


In Amazon Glue Studio, you can run your jobs on demand. A job can run multiple times, and each time you run the job, Amazon Glue collects information about the job activities and performance. This information is referred to as a *job run* and is identified by a job run ID.

You can initiate a job run in the following ways in Amazon Glue Studio:
+ On the **Jobs** page, choose the job you want to start, and then choose the **Run job** button.
+ If you're viewing a job in the visual editor and the job has been saved, you can choose the **Run** button to start a job run.

For more information about job runs, see [Working with Jobs on the Amazon Glue Console](https://docs.amazonaws.cn/glue/latest/dg/console-jobs.html) in the *Amazon Glue Developer Guide*.

## Schedule job runs


In Amazon Glue Studio, you can create a schedule to have your jobs run at specific times. You can specify constraints, such as the number of times that the jobs run, which days of the week they run, and at what time. These constraints are based on `cron` and have the same limitations as `cron`. For example, if you choose to run your job on day 31 of each month, keep in mind that some months don't have 31 days. For more information about `cron`, see [Cron Expressions](https://docs.amazonaws.cn/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions) in the *Amazon Glue Developer Guide*. 

**To run jobs according to a schedule**

1. Create a job schedule using one of the following methods:
   + On the **Jobs** page, choose the job you want to create a schedule for, choose **Actions**, and then choose **Schedule job**.
   + If you're viewing a job in the visual editor and the job has been saved, choose the **Schedules** tab. Then choose **Create Schedule**.

1. On the **Schedule job run** page, enter the following information:
   + **Name**: Enter a name for your job schedule. 
   + **Frequency**: Enter the frequency for the job schedule. You can choose the following: 
     + **Hourly**: The job will run every hour, starting at a specific minute. You can specify the **Minute** of the hour that the job should run. By default, when you choose hourly, the job runs at the beginning of the hour (minute 0).
     + **Daily**: The job will run every day, starting at a specified time. You can specify the **Minute** of the hour that the job should run and the **Start hour** for the job. Hours are specified using a 24-hour clock, where you use the numbers 13 to 23 for the afternoon hours. The default values for minute and hour are 0, which means that if you select **Daily**, the job by default will run at midnight.
     + **Weekly**: The job will run every week on one or more days. In addition to the settings described previously for **Daily**, you can choose the days of the week on which the job will run. You can choose one or more days.
     + **Monthly**: The job will run every month on a specific day. In addition to the settings described previously for **Daily**, you can choose the day of the month on which the job will run. Specify the day as a numeric value from 1 to 31. If you select a day that does not exist in a month (for example, the 30th day of February), the job does not run that month.
     + **Custom**: Enter an expression for your job schedule using the `cron` syntax. Cron expressions allow you to create more complicated schedules, such as the last day of the month (instead of a specific day of the month) or every third month on the 7th and 21st days of the month. 

       See [Cron Expressions](https://docs.amazonaws.cn/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions) in the *Amazon Glue Developer Guide*
   + **Description**: You can optionally enter a description for your job schedule. If you plan to use the same schedule for multiple jobs, having a description makes it easier to determine what the job schedule does.

1. Choose **Create schedule** to save the job schedule.

1. After you create the schedule, a success message appears at the top of the console page. You can choose **Job details** in this banner to view the job details. This opens the visual job editor page, with the **Schedules** tab selected.
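The frequency options above map onto cron expressions. The sketch below composes expressions in the six-field format that Glue schedules accept, `cron(minutes hours day-of-month month day-of-week year)`; it is an illustration of the mapping, not console code.

```python
def build_cron(frequency, minute=0, hour=0, days_of_week=None, day_of_month=1):
    """Sketch: translate console Frequency settings into a cron expression."""
    if frequency == "hourly":
        return f"cron({minute} * * * ? *)"
    if frequency == "daily":
        return f"cron({minute} {hour} * * ? *)"
    if frequency == "weekly":
        dow = ",".join(days_of_week)  # e.g. ["MON", "FRI"]
        return f"cron({minute} {hour} ? * {dow} *)"
    if frequency == "monthly":
        return f"cron({minute} {hour} {day_of_month} * ? *)"
    raise ValueError(f"unknown frequency: {frequency}")

print(build_cron("daily"))                           # cron(0 0 * * ? *)  — midnight default
print(build_cron("weekly", 30, 14, ["MON", "FRI"]))  # cron(30 14 ? * MON,FRI *)
```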

## Manage job schedules


After you have created schedules for a job, you can open the job in the visual editor and choose the **Schedules** tab to manage the schedules.

On the **Schedules** tab of the visual editor, you can perform the following tasks:
+ Create a new schedule.

  Choose **Create schedule**, then enter the information for your schedule as described in [Schedule job runs](#schedule-jobs).
+ Edit an existing schedule.

  Choose the schedule you want to edit, then choose **Action** followed by **Edit schedule**. When you choose to edit an existing schedule, the **Frequency** shows as **Custom**, and the schedule is displayed as a `cron` expression. You can either modify the `cron` expression, or specify a new schedule using the **Frequency** button. When you finish with your changes, choose **Update schedule**.
+ Pause an active schedule.

  Choose an active schedule, and then choose **Action** followed by **Pause schedule**. The schedule is instantly deactivated. Choose the refresh (reload) button to see the updated job schedule status.
+ Resume a paused schedule.

  Choose a deactivated schedule, and then choose **Action** followed by **Resume schedule**. The schedule is instantly activated. Choose the refresh (reload) button to see the updated job schedule status.
+ Delete a schedule.

  Choose the schedule you want to remove, and then choose **Action** followed by **Delete schedule**. The schedule is instantly deleted. Choose the refresh (reload) button to see the updated job schedule list. The schedule will show a status of **Deleting** until it has been completely removed.

## Stop job runs


You can stop a job before it has completed its job run. You might choose this option if you know that the job isn't configured correctly, or if the job is taking too long to complete.

On the **Monitoring** page, in the **Job runs** list, choose the job that you want to stop, choose **Actions**, and then choose **Stop run**. 

## View your jobs


You can view all your jobs on the **Jobs** page. You can access this page by choosing **Jobs** in the navigation pane.

On the **Jobs** page, you can see all the jobs that were created in your account. The **Your jobs** list shows the job name, its type, the status of the last run of that job, and the dates on which the job was created and last modified. You can choose the name of a job to see detailed information for that job.

You can also use the Monitoring dashboard to view all your jobs. You can access the dashboard by choosing **Monitoring** in the navigation pane. 

### Customize the job display


You can customize how the jobs are displayed in the **Your jobs** section of the **Jobs** page. Also, you can enter text in the search text field to display only jobs with a name that contains that text.

If you choose the settings icon ![\[\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/manage-console-icon-settings.png) in the **Your jobs** section, you can customize how Amazon Glue Studio displays the information in the table. You can choose to wrap the lines of text in the display, change the number of jobs displayed on the page, and specify which columns to display.

## View information for recent job runs


A job can run multiple times as new data is added at the source location. Each time a job runs, the job run is assigned a unique ID, and information about that job run is collected. You can view this information by using the following methods:
+ Choose the **Runs** tab of the visual editor to view the job run information for the currently displayed job.

  On the **Runs** tab (the **Recent job runs** page), there is a card for each job run. The information displayed on the **Runs** tab includes:
  + Job run ID
  + Number of attempts to run this job
  + Status of the job run
  + Start and end time for the job run
  + The runtime for the job run
  + A link to the job log files
  + A link to the job error log files
  + The error returned for failed jobs
+ You can select a job run to view additional information about the job, including the following:
  + **Input arguments**
  + **Continuous logs**
  + **Metrics** – You can see visualizations of basic metrics. For more information on included metrics, see [Viewing Amazon CloudWatch metrics for a Spark job run](view-job-runs.md#monitoring-job-run-metrics).
  + **Spark UI** – You can visualize Spark logs for your job in the Spark UI. For more information about using the Spark Web UI, see [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md). Enable this feature by following the procedure in [Enabling the Apache Spark web UI for Amazon Glue jobs](monitor-spark-ui-jobs.md).

You can select **View details** to view similar information on the job run details page. Alternatively, you can navigate to the job run details page through the **Monitoring** page. In the navigation pane, choose **Monitoring**. Scroll down to the **Job runs** list. Choose the job and then choose **View run details**. The contents are described in [Viewing the details of a job run](view-job-runs.md#monitoring-job-run-details).

For more information about the job logs, see [Viewing the job run logs](view-job-runs.md#monitoring-job-run-logs).

## View the job script


After you provide information for all the nodes in the job, Amazon Glue Studio generates a script that is used by the job to read the data from the source, transform the data, and write the data in the target location. If you save the job, you can view this script at any time.

**To view the generated script for your job**

1. Choose **Jobs** in the navigation pane.

1. On the **Jobs** page, in the **Your jobs** list, choose the name of the job you want to review. Alternatively, you can select a job in the list, choose the **Actions** menu, and then choose **Edit job**.

1. On the visual editor page, choose the **Script** tab at the top to view the job script. 

   If you want to edit the job script, see [Amazon Glue programming guide](edit-script.md).

## Modify the job properties


The nodes in the job diagram define the actions performed by the job, but there are several properties that you can configure for the job as well. These properties determine the environment that the job runs in, the resources it uses, the threshold settings, the security settings, and more.

**To customize the job run environment**

1. Choose **Jobs** in the navigation pane.

1. On the **Jobs** page, in the **Your jobs** list, choose the name of the job you want to review.

1. On the visual editor page, choose the **Job details** tab at the top of the job editing pane. 

1. Modify the job properties, as needed. 

   For more information about the job properties, see [Defining Job Properties](https://docs.amazonaws.cn/glue/latest/dg/add-job.html#create-job) in the *Amazon Glue Developer Guide*. 

1. Expand the **Advanced properties** section if you need to specify these additional job properties:
   + **Script filename** – The name of the file that stores the job script in Amazon S3.
   + **Script path** – The Amazon S3 location where the job script is stored.
   + **Job metrics** – (Not available for Python shell jobs) Turns on the creation of Amazon CloudWatch metrics when this job runs.
   + **Continuous logging** – (Not available for Python shell jobs) Turns on continuous logging to CloudWatch, so the logs are available for viewing before the job completes.
   + **Spark UI** and **Spark UI logs path** – (Not available for Python shell jobs) Turns on the use of Spark UI for monitoring this job and specifies the location for the Spark UI logs.
   + **Maximum concurrency** – Sets the maximum number of concurrent runs that are allowed for this job.
   + **Temporary path** – The location of a working directory in Amazon S3 where temporary intermediate results are written when Amazon Glue runs the job script. 
   + **Delay notification threshold (minutes)** – Specify a delay threshold for the job. If the job runs for a longer time than that specified by the threshold, then Amazon Glue sends a delay notification for the job to CloudWatch.
   + **Security configuration** and **Server-side encryption** – Use these fields to choose the encryption options for the job.
   + **Use Glue Data Catalog as the Hive metastore** – Choose this option if you want to use the Amazon Glue Data Catalog as an alternative to Apache Hive Metastore.
   + **Additional network connection** – For a data source in a VPC, you can specify a connection of type `Network` to ensure your job accesses your data through the VPC.
   + **Python library path**, **Dependent jars path** (Not available for Python shell jobs), or **Referenced files path** – Use these fields to specify the location of additional files used by the job when it runs the script.
   + **Job Parameters** – You can add a set of key-value pairs that are passed as named parameters to the job script. In Python calls to Amazon Glue APIs, it's best to pass parameters explicitly by name. For more information about using parameters in a job script, see [Passing and Accessing Python Parameters in Amazon Glue](https://docs.amazonaws.cn/glue/latest/dg/aws-glue-programming-python-calling.html#aws-glue-programming-python-calling-parameters) in the *Amazon Glue Developer Guide*. 
   + **Tags** – You can add tags to the job to help you organize and identify them.

1. After you have modified the job properties, save the job.
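The **Job Parameters** above reach the script as command-line arguments. Inside a real Glue job you would read them with `awsglue.utils.getResolvedOptions(sys.argv, [...])`; the stand-in below only mimics that behavior with `argparse` so the idea is runnable anywhere. The job name and S3 path are placeholder values.

```python
import argparse

def resolve_options(argv, names):
    """Minimal stand-in (not the Glue library) for reading named job parameters."""
    parser = argparse.ArgumentParser()
    for name in names:
        parser.add_argument(f"--{name}", required=True)
    args, _ = parser.parse_known_args(argv)
    return vars(args)

# Simulated argv for a run started with two job parameters:
argv = ["--JOB_NAME", "my-job", "--source_path", "s3://amzn-s3-demo-bucket/in/"]
opts = resolve_options(argv, ["JOB_NAME", "source_path"])
print(opts["source_path"])  # s3://amzn-s3-demo-bucket/in/
```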

### Store Spark shuffle files on Amazon S3


Some ETL jobs require reading and combining information from multiple partitions, for example, when using a join transform. This operation is referred to as *shuffling*. During a shuffle, data is written to disk and transferred across the network. With Amazon Glue version 3.0, you can configure Amazon S3 as a storage location for these files. Amazon Glue provides a shuffle manager which writes and reads shuffle files to and from Amazon S3. Writing and reading shuffle files from Amazon S3 is slower (by 5%-20%) compared to local disk (or Amazon EBS which is heavily optimized for Amazon EC2). However, Amazon S3 provides unlimited storage capacity, so you don't have to worry about "`No space left on device`" errors when running your job.

**To configure your job to use Amazon S3 for shuffle files**

1. On the **Jobs** page, in the **Your Jobs** list, choose the name of the job you want to modify.

1. On the visual editor page, choose the **Job details** tab at the top of the job editing pane. 

   Scroll down to the **Job parameters** section.

1. Specify the following key-value pairs.
   + `--write-shuffle-files-to-s3` — `true`

     This is the main parameter that configures the shuffle manager in Amazon Glue to use Amazon S3 buckets for writing and reading shuffle data. By default, this parameter has a value of `false`.
   + (Optional) `--write-shuffle-spills-to-s3` — `true`

     This parameter allows you to offload spill files to Amazon S3 buckets, which provides additional resiliency to your Spark job in Amazon Glue. This is only required for large workloads that spill a lot of data to disk. By default, this parameter has a value of `false`.
   + (Optional) `--conf spark.shuffle.glue.s3ShuffleBucket` — `s3://<shuffle-bucket>`

     This parameter specifies the Amazon S3 bucket to use when writing the shuffle files. If you do not set this parameter, the location is the `shuffle-data` folder in the location specified for **Temporary path** (`--TempDir`).
**Note**  
Make sure the location of the shuffle bucket is in the same Amazon Web Services Region in which the job runs.   
Also, the shuffle service does not clean the files after the job finishes running, so you should configure the Amazon S3 storage life cycle policies on the shuffle bucket location. For more information, see [Managing your storage lifecycle](https://docs.amazonaws.cn/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) in the *Amazon S3 User Guide*.

## Save the job


A red **Job has not been saved** callout is displayed to the left of the **Save** button until you save the job. 

![\[A red oval with the label "Job has not been saved" to the left of the Save button.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-graph-callout-not-saved_GA.png)


**To save your job**

1. Provide all the required information in the **Visual** and **Job details** tabs.

1. Choose the **Save** button.

   After you save the job, the 'not saved' callout changes to display the time and date that the job was last saved.

If you exit Amazon Glue Studio before saving your job, the next time you sign in to Amazon Glue Studio, a notification appears. The notification indicates that there is an unsaved job, and asks if you want to restore it. If you choose to restore the job, you can continue to edit it.

### Troubleshooting errors when saving a job


If you choose the **Save** button, but your job is missing some required information, then a red callout appears on the tab where the information is missing. The number in the callout indicates how many missing fields were detected.

![\[A screenshot showing the tabs for the visual editor pane for a job named "Untitled job" with a callout labeled 2 on the Visual tab and a callout labeled 1 on the Job details tab.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/screenshot-save-job-error-in-graph-GA2.png)

+ If a node in the visual editor isn't configured correctly, the **Visual** tab shows a red callout, and the node with the error displays a warning symbol ![\[A red triangle with an exclamation point in the center\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/edit-graph-warning_icon.png).

  1. Choose the node. In the node details panel, a red callout appears on the tab where the missing or incorrect information is located. 

  1. Choose the tab in the node details panel that shows a red callout, and then locate the problem fields, which are highlighted. An error message below the fields provides additional information about the problem.  
![\[A screenshot showing the Visual tab in the job editor, which is marked with a callout labeled 2. The data source node, which is marked with a warning label, is selected. In the node details panel, the Data source properties tab has a callout labeled 2, and is selected. Two fields, Database and Table are outlined in red and have messages beneath them indicating a value is required in those fields.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/screenshot-save-job-error-in-graph2.png)
+ If there is a problem with the job properties, the **Job details** tab shows a red callout. Choose that tab and locate the problem fields, which are highlighted. The error messages below the fields provide additional information about the problem.  
![\[A screenshot showing the Job details tab in the job editor, which is marked with a callout labeled 1. The "IAM Role" field is outlined in red and has a message beneath it indicating a value is required.\]](http://docs.amazonaws.cn/en_us/glue/latest/dg/images/screenshot-save-job-error-in-job-details.png)

## Clone a job


You can use the **Clone job** action to copy an existing job into a new job.

**To create a new job by copying an existing job**

1. On the **Jobs** page, in the **Your jobs** list, choose the job that you want to duplicate.

1. From the **Actions** menu, choose **Clone job**.

1. Enter a name for the new job. You can then save or edit the job.

## Delete jobs


You can remove jobs that are no longer needed. You can delete one or more jobs in a single operation.

**To remove jobs from Amazon Glue Studio**

1. On the **Jobs** page, in the **Your jobs** list, choose the jobs that you want to delete.

1. From the **Actions ** menu, choose **Delete job**.

1. Verify that you want to delete the job by entering **delete**.

You can also delete a saved job when you're viewing the **Job details** tab for that job in the visual editor.