Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China.

Using the Avro format in Amazon Glue

Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Avro data format, this document introduces the features available for using your data in Amazon Glue.

Amazon Glue supports using the Avro format. This format is a performance-oriented, row-based data format. For an introduction to the format by the standard authority, see Apache Avro 1.8.2 Documentation.

You can use Amazon Glue to read Avro files from Amazon S3 and from streaming sources, and to write Avro files to Amazon S3. You can read and write bzip and gzip archives containing Avro files from S3. You configure compression behavior on the Amazon S3 connection rather than in the format configuration discussed on this page.

The following table shows which common Amazon Glue operations support the Avro format option.

Read       Write      Streaming read  Group small files  Job bookmarks
Supported  Supported  Supported       Unsupported        Supported

Example: Read Avro files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the Avro files or folders that you want to read.

Configuration: In your function options, specify format="avro". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: "connectionType": "s3". You can configure how the reader interprets Avro files in your format_options. For details, see Avro Configuration Reference.

The following Amazon Glue ETL script shows the process of reading Avro files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)

    # Read the Avro files under the given S3 path into a DynamicFrame.
    dynamicFrame = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://s3path"]},
        format="avro"
    )

Scala

For this example, use the getSourceWithFormat operation.

    import com.amazonaws.services.glue.util.JsonOptions
    import com.amazonaws.services.glue.GlueContext
    import org.apache.spark.SparkContext

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)

        // Read the Avro files under the given S3 path into a DynamicFrame.
        val dynamicFrame = glueContext.getSourceWithFormat(
          connectionType="s3",
          format="avro",
          options=JsonOptions("""{"paths": ["s3://s3path"]}""")
        ).getDynamicFrame()
      }
    }
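Because the paths key takes a list, a single call can read several files or folders. The following is a minimal sketch rather than part of the official examples: the helper names are hypothetical, and it assumes the S3 connection's recurse option is what makes the reader descend into subfolders. The Glue call sits in a function that receives an existing GlueContext, so the option-building logic can be checked without a Glue runtime.

```python
def avro_source_kwargs(s3_paths, recurse=False):
    """Build keyword arguments for create_dynamic_frame.from_options.

    Hypothetical helper for illustration; it is not part of the Glue API.
    """
    if isinstance(s3_paths, str):
        s3_paths = [s3_paths]  # "paths" expects a list, even for one entry
    connection_options = {"paths": list(s3_paths)}
    if recurse:
        # Assumption: "recurse" is the S3 connection option that makes the
        # reader include files in subfolders of each path.
        connection_options["recurse"] = True
    return {
        "connection_type": "s3",
        "connection_options": connection_options,
        "format": "avro",
    }


def read_avro(glue_context, s3_paths, recurse=False):
    """Read Avro data into a DynamicFrame using an existing GlueContext."""
    return glue_context.create_dynamic_frame.from_options(
        **avro_source_kwargs(s3_paths, recurse)
    )
```

In a job script this might be called as read_avro(glueContext, ["s3://s3path"], recurse=True), where glueContext is the object created in the Python example above.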

Example: Write Avro files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="avro". In your connection_options, use the path key to specify your s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: "connectionType": "s3". You can alter how the writer produces Avro files in your format_options. For details, see Avro Configuration Reference.

The following Amazon Glue ETL script shows the process of writing Avro files or folders to S3.

Python

For this example, use the write_dynamic_frame.from_options method.

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    sc = SparkContext.getOrCreate()
    glueContext = GlueContext(sc)

    # dynamicFrame is the initialized DynamicFrame from the prerequisites.
    glueContext.write_dynamic_frame.from_options(
        frame=dynamicFrame,
        connection_type="s3",
        format="avro",
        connection_options={
            "path": "s3://s3path"
        }
    )

Scala

For this example, use the getSinkWithFormat method.

    import com.amazonaws.services.glue.util.JsonOptions
    import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
    import org.apache.spark.SparkContext

    object GlueApp {
      def main(sysArgs: Array[String]): Unit = {
        val spark: SparkContext = new SparkContext()
        val glueContext: GlueContext = new GlueContext(spark)

        // dynamicFrame is the initialized DynamicFrame from the prerequisites.
        glueContext.getSinkWithFormat(
          connectionType="s3",
          options=JsonOptions("""{"path": "s3://s3path"}"""),
          format="avro"
        ).writeDynamicFrame(dynamicFrame)
      }
    }
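As noted in the introduction, compression is configured on the S3 connection rather than in format_options. The sketch below shows one way that could look for a write. It assumes the S3 connection accepts a compression key with a value such as "gzip"; confirm the exact key and values in the S3 connection reference for your Amazon Glue version. The helper names are hypothetical.

```python
def avro_sink_kwargs(s3_path, compression=None):
    """Build keyword arguments for write_dynamic_frame.from_options.

    Hypothetical helper for illustration; it is not part of the Glue API.
    """
    # The write examples use the singular "path" key for the output location.
    connection_options = {"path": s3_path}
    if compression is not None:
        # Assumption: compression is an S3 connection option, for example "gzip".
        connection_options["compression"] = compression
    return {
        "connection_type": "s3",
        "connection_options": connection_options,
        "format": "avro",
    }


def write_avro(glue_context, frame, s3_path, compression=None):
    """Write a DynamicFrame as Avro using an existing GlueContext."""
    return glue_context.write_dynamic_frame.from_options(
        frame=frame, **avro_sink_kwargs(s3_path, compression)
    )
```

A call such as write_avro(glueContext, dynamicFrame, "s3://s3path", compression="gzip") would then request gzip-compressed output, provided the assumption about the compression key holds.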

Avro configuration reference

You can use the following format_options values wherever Amazon Glue libraries specify format="avro":

  • version: Specifies the version of Apache Avro reader/writer format to support. The default is "1.7". You can specify format_options={"version": "1.8"} to enable Avro logical type reading and writing. For more information, see the Apache Avro 1.7.7 Specification and Apache Avro 1.8.2 Specification.

    The Apache Avro 1.8 connector supports the following logical type conversions:

For the reader: this table shows the conversion between Avro data type (logical type and Avro primitive type) and Amazon Glue DynamicFrame data type for Avro reader 1.7 and 1.8.

Avro logical type              Avro primitive type  DynamicFrame type (reader 1.7)  DynamicFrame type (reader 1.8)
Decimal                        bytes                BINARY                          Decimal
Decimal                        fixed                BINARY                          Decimal
Date                           int                  INT                             Date
Time (millisecond)             int                  INT                             INT
Time (microsecond)             long                 LONG                            LONG
Timestamp (millisecond)        long                 LONG                            Timestamp
Timestamp (microsecond)        long                 LONG                            LONG
Duration (not a logical type)  fixed of 12          BINARY                          BINARY

For the writer: this table shows the conversion between Amazon Glue DynamicFrame data type and Avro data type for Avro writer 1.7 and 1.8.

DynamicFrame data type  Avro data type (writer 1.7)  Avro data type (writer 1.8)
Decimal                 String                       decimal
Date                    String                       date
Timestamp               String                       timestamp-micros
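To make the version option concrete, here is a minimal sketch of passing format_options={"version": "1.8"} on both a read and a write so that the logical type conversions in the tables above apply. The function names are hypothetical, the sketch only accepts the two version strings named in this section, and glue_context is assumed to be an existing GlueContext.

```python
def avro_format_options(version="1.7"):
    """Return format_options for format="avro".

    Hypothetical helper; it only accepts the two versions this page names.
    """
    if version not in ("1.7", "1.8"):
        raise ValueError("version must be '1.7' or '1.8'")
    return {"version": version}


def read_avro_with_logical_types(glue_context, s3_path):
    """Read Avro so that logical types use the reader 1.8 conversions."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="avro",
        format_options=avro_format_options("1.8"),
    )


def write_avro_with_logical_types(glue_context, frame, s3_path):
    """Write Avro so that Decimal, Date, and Timestamp use writer 1.8 types."""
    return glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": s3_path},
        format="avro",
        format_options=avro_format_options("1.8"),
    )
```

Leaving format_options out keeps the default of "1.7", in which case the writer stores Decimal, Date, and Timestamp values as strings.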

Avro Spark DataFrame support

To use Avro from the Spark DataFrame API, you need to install the Spark Avro plugin for the corresponding Spark version. The version of Spark available in your job is determined by your Amazon Glue version. For more information about Spark versions, see Amazon Glue versions. This plugin is maintained by Apache; we do not make specific guarantees of support.

  • In Amazon Glue 2.0, use version 2.4.3 of the Spark Avro plugin. You can find this JAR on Maven Central; see org.apache.spark:spark-avro_2.12:2.4.3.

  • In Amazon Glue 3.0, use version 3.1.1 of the Spark Avro plugin. You can find this JAR on Maven Central; see org.apache.spark:spark-avro_2.12:3.1.1.
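Once the plugin JAR is available to the job, the DataFrame reader and writer can address Avro through the short format name avro. The following is a minimal sketch with hypothetical function names; spark is assumed to be an existing SparkSession (for example, glueContext.spark_session), and the overwrite save mode is an illustrative choice rather than a recommendation.

```python
def read_avro_dataframe(spark, s3_path):
    """Load Avro files into a Spark DataFrame (requires the Spark Avro plugin)."""
    return spark.read.format("avro").load(s3_path)


def write_avro_dataframe(data_frame, s3_path, mode="overwrite"):
    """Save a Spark DataFrame as Avro files (requires the Spark Avro plugin)."""
    data_frame.write.format("avro").mode(mode).save(s3_path)
```

If the plugin is missing from the job, calls like these typically fail when Spark tries to resolve the avro data source.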

To include extra JARs in an Amazon Glue ETL job, use the --extra-jars job parameter. For more information about job parameters, see Amazon Glue job parameters. You can also configure this parameter in the Amazon Web Services Management Console.
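As a sketch of how the --extra-jars parameter might be assembled when a job is defined programmatically, the helper below builds a job-arguments mapping with the JAR locations joined by commas. The helper name and the S3 locations are hypothetical; the JAR would first need to be uploaded to an S3 path that your job role can read.

```python
def extra_jars_arguments(jar_paths, existing=None):
    """Return job arguments with the given JARs added to --extra-jars.

    Hypothetical helper; existing arguments are copied, not modified.
    """
    arguments = dict(existing or {})
    # --extra-jars holds a comma-separated list of S3 paths.
    jars = [p for p in arguments.get("--extra-jars", "").split(",") if p]
    for path in jar_paths:
        if path not in jars:  # avoid listing the same JAR twice
            jars.append(path)
    arguments["--extra-jars"] = ",".join(jars)
    return arguments
```

The resulting mapping could then be supplied as the job's default arguments, for example through the DefaultArguments parameter of the boto3 Glue client's create_job call.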