Transform data with Amazon Glue managed transforms
Amazon Glue Studio provides two types of transforms:
-
Amazon Glue-native transforms - available to all users and managed by Amazon Glue.
-
Custom visual transforms - allow you to upload your own transforms to use in Amazon Glue Studio.
Amazon Glue managed data transform nodes
Amazon Glue Studio provides a set of built-in transforms that you can use to process your data. Your data passes from one node in the job diagram to another in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame.
In the pre-populated diagram for a job, between the data source and data target nodes is the Change Schema transform node. You can configure this transform node to modify your data, or you can use additional transforms.
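To make the role of the Change Schema node concrete, the following is a minimal plain-Python sketch of the same idea: records enter the node, mapped keys are renamed and cast, and unmapped keys are dropped. It is a conceptual model only and does not use the Amazon Glue libraries. The `change_schema` helper, the `(source key, target key, cast)` mapping format, and the sample records are assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
# Hypothetical mapping format for this sketch: (source key, target key, cast function).
Mapping = Tuple[str, str, Callable[[Any], Any]]


def change_schema(records: List[Record], mappings: List[Mapping]) -> List[Record]:
    """Rename and cast the mapped keys; keys without a mapping are dropped."""
    output = []
    for record in records:
        new_record = {}
        for source_key, target_key, cast in mappings:
            if source_key in record:
                new_record[target_key] = cast(record[source_key])
        output.append(new_record)
    return output


# Sample data standing in for the output of a data source node.
source = [
    {"id": "1", "name": "Ana", "tmp": "x"},
    {"id": "2", "name": "Bo", "tmp": "y"},
]

# Rename "id" to "customer_id" and cast it to int, keep "name", drop "tmp".
target = change_schema(
    source,
    [("id", "customer_id", int), ("name", "name", str)],
)
print(target)
```

In an actual job the node operates on a DynamicFrame rather than a list of dictionaries, but the three operations shown here (rename, retype, drop) are the ones the node exposes.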
The following built-in transforms are available with Amazon Glue Studio:
-
ChangeSchema: Map data property keys in the data source to data property keys in the data target. You can rename keys, modify the data types for keys, and choose which keys to drop from the dataset.
-
SelectFields: Choose the data property keys that you want to keep.
-
DropFields: Choose the data property keys that you want to drop.
-
RenameField: Rename a single data property key.
-
Spigot: Write samples of the data to an Amazon S3 bucket.
-
Join: Join two datasets into one dataset using a comparison phrase on the specified data property keys. You can use inner, outer, left, right, left semi, and left anti joins.
-
Union: Combine rows from more than one data source that have the same schema.
-
SplitFields: Split data property keys into two DynamicFrames. Output is a collection of DynamicFrames: one with selected data property keys, and one with the remaining data property keys.
-
SelectFromCollection: Choose one DynamicFrame from a collection of DynamicFrames. The output is the selected DynamicFrame.
-
FillMissingValues: Locate records in the dataset that have missing values and add a new field with a suggested value that is determined by imputation.
-
Filter: Split a dataset into two, based on a filter condition.
-
Drop Null Fields: Removes columns from the dataset if all values in the column are ‘null’.
-
Drop Duplicates: Removes duplicate rows from your data source, matching either on entire rows or on the keys that you specify.
-
SQL: Enter SparkSQL code into a text entry field to use a SQL query to transform the data. The output is a single DynamicFrame.
-
Aggregate: Performs a calculation (such as average, sum, min, max) on selected fields and rows, and creates a new field with the newly calculated value(s).
-
Flatten: Extract fields inside structs into top level fields.
-
UUID: Add a column with a Universally Unique Identifier for each row.
-
Identifier: Add a column with a numeric identifier for each row.
-
To timestamp: Convert a column to timestamp type.
-
Format timestamp: Convert a timestamp column to a formatted string.
-
Conditional Router transform: Apply multiple conditions to incoming data. Each row of the incoming data is evaluated by a group filter condition and processed into its corresponding group.
-
Concatenate Columns transform: Build a new string column using the values of other columns with an optional spacer.
-
Split String transform: Break up a string into an array of tokens using a regular expression to define how the split is done.
-
Array To Columns transform: Extract some or all the elements of a column of type array into new columns.
-
Add Current Timestamp transform: Mark the rows with the time on which the data was processed. This is useful for auditing purposes or to track latency in the data pipeline.
-
Pivot Rows to Columns transform: Aggregate a numeric column by rotating unique values on selected columns which become new columns. If multiple columns are selected, the values are concatenated to name the new columns.
-
Unpivot Columns To Rows transform: Convert columns into values of new columns generating a row for each unique value.
-
Autobalance Processing transform: Redistribute the data more evenly among the workers. This is useful when the data is unbalanced or when the way it arrives from the source doesn’t allow enough parallel processing.
-
Derived Column transform: Define a new column based on a math formula or SQL expression in which you can use other columns in the data, as well as constants and literals.
-
Lookup transform: Add columns from a defined catalog table when the keys match the defined lookup columns in the data.
-
Explode Array or Map Into Rows transform: Extract values from a nested structure into individual rows that are easier to manipulate.
-
Record matching transform: Invoke an existing Record Matching machine learning data classification transform.
-
Remove null rows transform: Remove rows from the dataset in which all columns are null or empty.
-
Parse JSON column transform: Parse a string column containing JSON data and convert it to a struct or an array column, depending on whether the JSON is an object or an array, respectively.
-
Extract JSON path transform: Extract new columns from a JSON string column.
-
Extract string fragments from a regular expression: Extract string fragments using a regular expression and create a new column from them, or multiple columns if using regex groups.
-
Custom transform: Enter code into a text entry field to use custom transforms. The output is a collection of DynamicFrames.
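Some transforms in the list above, such as SplitFields and the Custom transform, output a collection of DynamicFrames rather than a single one, and SelectFromCollection is the transform that picks one member of that collection for downstream nodes. The sketch below models that hand-off in plain Python, with lists of dictionaries standing in for DynamicFrames. It does not call the Amazon Glue libraries. The function names, the `"selected"` and `"remaining"` collection keys, and the sample records are assumptions chosen for illustration.

```python
from typing import Any, Dict, List, Set

Record = Dict[str, Any]
Frame = List[Record]            # stand-in for a DynamicFrame in this sketch
Collection = Dict[str, Frame]   # stand-in for a collection of DynamicFrames


def split_fields(records: Frame, selected: Set[str]) -> Collection:
    """Return two frames: one with the selected keys, one with the remaining keys."""
    chosen = [{k: v for k, v in r.items() if k in selected} for r in records]
    rest = [{k: v for k, v in r.items() if k not in selected} for r in records]
    # The key names below are hypothetical labels for the two outputs.
    return {"selected": chosen, "remaining": rest}


def select_from_collection(collection: Collection, key: str) -> Frame:
    """Pick a single frame out of the collection by its key."""
    return collection[key]


records = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "Bo", "email": "bo@example.com"},
]

collection = split_fields(records, {"id", "name"})
chosen = select_from_collection(collection, "selected")
print(chosen)
```

The point of the model is the pairing: a node that emits a collection is typically followed by a SelectFromCollection node, because most other transforms expect a single DynamicFrame as input.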