Amazon Glue PySpark transforms reference - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Amazon Glue PySpark transforms reference

Amazon Glue provides the following built-in transforms that you can use in PySpark ETL operations. Your data passes from transform to transform in a data structure called a DynamicFrame, which is an extension to an Apache Spark SQL DataFrame. The DynamicFrame contains your data, and you reference its schema to process your data.

Most of these transforms also exist as methods of the DynamicFrame class. For more information, see DynamicFrame transforms .

Data integration transforms

For Amazon Glue 4.0 and above, create or update job arguments with key: --enable-glue-di-transforms, value: true.

Example job script:

from pyspark.context import SparkContext from awsgluedi.transforms import * sc = SparkContext() input_df = spark.createDataFrame( [(5,), (0,), (-1,), (2,), (None,)], ["source_column"], ) try: df_output = math_functions.IsEven.apply( data_frame=input_df, spark_context=sc, source_column="source_column", target_column="target_column", value=None, true_string="Even", false_string="Not even", ) df_output.show() except: print("Unexpected Error happened ") raise

Example Sessions using Notebooks

%idle_timeout 2880 %glue_version 4.0 %worker_type G.1X %number_of_workers 5 %region eu-west-1
%%configure { "--enable-glue-di-transforms": "true" }
from pyspark.context import SparkContext from awsgluedi.transforms import * sc = SparkContext() input_df = spark.createDataFrame( [(5,), (0,), (-1,), (2,), (None,)], ["source_column"], ) try: df_output = math_functions.IsEven.apply( data_frame=input_df, spark_context=sc, source_column="source_column", target_column="target_column", value=None, true_string="Even", false_string="Not even", ) df_output.show() except: print("Unexpected Error happened ") raise

Example Sessions using Amazon CLI

aws glue create-session --default-arguments "--enable-glue-di-transforms=true"

DI transforms:

Maven: Bundle the plugin with your Spark applications

You can bundle the transforms dependency with your Spark applications and Spark distributions (version 3.3) by adding the plugin dependency in your Maven pom.xml while developing your Spark applications locally.

<repositories> ... <repository> <id>aws-glue-etl-artifacts</id> <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/ </url> </repository> </repositories> ... <dependency> <groupId>com.amazonaws</groupId> <artifactId>AWSGlueTransforms</artifactId> <version>4.0.0</version> </dependency>

You can alternatively download the binaries from Amazon Glue Maven artifacts directly and include them in your Spark application as follows.

#!/bin/bash sudo wget -v https://aws-glue-etl-artifacts.s3.amazonaws.com/release/com/amazonaws/AWSGlueTransforms/4.0.0/AWSGlueTransforms-4.0.0.jar -P /usr/lib/spark/jars/