Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Editing Amazon Glue managed data transform nodes

Amazon Glue Studio provides two types of transforms:

  • Amazon Glue-native transforms - available to all users and managed by Amazon Glue.

  • Custom visual transforms - allow you to upload your own transforms for use in Amazon Glue Studio.

Amazon Glue managed data transform nodes

Amazon Glue Studio provides a set of built-in transforms that you can use to process your data. Your data passes from one node in the job diagram to another in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame.

In the pre-populated job diagram, the Transform - ApplyMapping node appears between the data source and data target nodes. You can configure this transform node to modify your data, or you can add other transforms.
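To illustrate what the ApplyMapping transform does (rename property keys, change their data types, and drop unmapped keys), here is a plain-Python sketch of the behavior. This is not the awsglue API; the record list, mapping tuples, and type names are invented for the example:

```python
# Conceptual sketch of an ApplyMapping transform.
# Each mapping is (source_key, source_type, target_key, target_type).

CASTS = {"int": int, "string": str, "double": float}

def apply_mapping(records, mappings):
    """Rename keys, cast values to the target type, and drop unmapped keys."""
    out = []
    for rec in records:
        new_rec = {}
        for src, _src_type, dst, dst_type in mappings:
            if src in rec:
                new_rec[dst] = CASTS[dst_type](rec[src])
        out.append(new_rec)
    return out

rows = [{"id": "1", "name": "a", "tmp": "x"}]
mapped = apply_mapping(rows, [("id", "string", "user_id", "int"),
                              ("name", "string", "name", "string")])
# "id" is renamed to "user_id" and cast to int; "tmp" has no mapping, so it is dropped.
# mapped == [{"user_id": 1, "name": "a"}]
```

In a real job, the equivalent mapping is configured visually on the Transform - ApplyMapping node rather than written by hand.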

The following built-in transforms are available with Amazon Glue Studio:

  • ApplyMapping: Map data property keys in the data source to data property keys in the data target. You can rename keys, modify the data types for keys, and choose which keys to drop from the dataset.

  • SelectFields: Choose the data property keys that you want to keep.

  • DropFields: Choose the data property keys that you want to drop.

  • RenameField: Rename a single data property key.

  • Spigot: Write samples of the data to an Amazon S3 bucket.

  • Join: Join two datasets into one dataset using a comparison phrase on the specified data property keys. You can use inner, outer, left, right, left semi, and left anti joins.

  • SplitFields: Split data property keys into two DynamicFrames. Output is a collection of DynamicFrames: one with selected data property keys, and one with the remaining data property keys.

  • SelectFromCollection: Choose one DynamicFrame from a collection of DynamicFrames. The output is the selected DynamicFrame.

  • FillMissingValues: Locate records in the dataset that have missing values and add a new field with a suggested value that is determined by imputation.

  • Filter: Split a dataset into two, based on a filter condition.

  • DropNullFields: Remove columns from the dataset in which all values are null.

  • SQL: Enter Spark SQL code into a text entry field to use a SQL query to transform the data. The output is a single DynamicFrame.

  • Aggregate: Perform a calculation (such as average, sum, min, or max) on selected fields and rows, and create a new field with the calculated value(s).

  • Flatten: Extract fields inside structs into top level fields.

  • UUID: Add a column with a Universally Unique Identifier for each row.

  • Identifier: Add a column with a numeric identifier for each row.

  • To timestamp: Convert a column to timestamp type.

  • Format timestamp: Convert a timestamp column to a formatted string.

  • Conditional Router: Apply multiple conditions to incoming data. Each row of the incoming data is evaluated by a group filter condition and processed into its corresponding group.

  • Custom transform: Enter your own transformation code into a text entry field. The output is a collection of DynamicFrames.
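Two of the transforms above can be made concrete with a small sketch. The following plain-Python illustration (not the Glue API; the datasets, keys, and predicate are invented) shows the idea behind Filter, which splits a dataset into two based on a condition, and an inner Join on a shared property key:

```python
def filter_split(records, predicate):
    """Split records into (matching, non_matching), as the Filter transform does."""
    matching = [r for r in records if predicate(r)]
    non_matching = [r for r in records if not predicate(r)]
    return matching, non_matching

def inner_join(left, right, key):
    """Inner-join two record lists on a shared key, as the Join transform does."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

orders = [{"user_id": 1, "total": 30}, {"user_id": 2, "total": 5}]
users = [{"user_id": 1, "name": "a"}]

big, small = filter_split(orders, lambda r: r["total"] >= 10)
joined = inner_join(big, users, "user_id")
# joined == [{"user_id": 1, "total": 30, "name": "a"}]
```

In Amazon Glue Studio, both transforms operate on DynamicFrames and are configured visually; the other join types listed above (outer, left, right, left semi, left anti) change which unmatched rows survive the join.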