Using the Avro format in Amazon Glue
Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Avro data format, this document introduces the features available for using your data in Amazon Glue.
Amazon Glue supports using the Avro format. This format is a performance-oriented, row-based data format. For an introduction to the format by the standard authority, see Apache Avro 1.8.2 Documentation.
You can use Amazon Glue to read Avro files from Amazon S3 and from streaming sources as well as write Avro files to Amazon S3.
You can read and write bzip2 and gzip archives containing Avro files from S3. Additionally, you can write deflate, snappy, and xz archives containing Avro files. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.
The following table shows which common Amazon Glue operations support the Avro format option.
Read | Write | Streaming read | Group small files | Job bookmarks |
---|---|---|---|---|
Supported | Supported | Supported* | Unsupported | Supported |
*Supported with restrictions. For more information, see Notes and restrictions for Avro streaming sources.
Example: Read Avro files or folders from S3
Prerequisites: You will need the S3 paths (s3path) to the Avro files or folders that you want to read.
Configuration:
In your function options, specify format="avro". In your connection_options, use the paths key to specify s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: Amazon S3 connection option reference. You can configure how the reader interprets Avro files in your format_options. For details, see Avro configuration reference.
The following Amazon Glue ETL script shows the process of reading Avro files or folders from S3:
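A minimal sketch of such a script, assuming it runs inside a Glue job where the awsglue and pyspark libraries are available. The helper names avro_read_options and read_avro and the example bucket path are illustrative and not part of the Glue API; the only Glue call used is create_dynamic_frame.from_options.

```python
def avro_read_options(s3path):
    """Build the arguments for reading Avro files or folders from S3.

    Hypothetical helper: s3path is the S3 path named in the prerequisites.
    """
    return {
        "connection_type": "s3",
        # "paths" takes a list, so several files or folders can be read at once.
        "connection_options": {"paths": [s3path]},
        "format": "avro",
    }


def read_avro(glue_context, s3path):
    """Return a DynamicFrame read from the Avro data at s3path."""
    return glue_context.create_dynamic_frame.from_options(
        **avro_read_options(s3path)
    )


# Inside a Glue job (requires the awsglue and pyspark libraries):
#   from pyspark.context import SparkContext
#   from awsglue.context import GlueContext
#   glue_context = GlueContext(SparkContext.getOrCreate())
#   dynamicFrame = read_avro(glue_context, "s3://example-bucket/avro-input/")
```

Keeping the options in a plain dictionary makes it easy to add format_options or further S3 connection options later without touching the call site.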
Example: Write Avro files and folders to S3
Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.
Configuration:
In your function options, specify format="avro". In your connection_options, use the paths key to specify your s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: Amazon S3 connection option reference. You can alter how the writer interprets Avro files in your format_options. For details, see Avro configuration reference.
The following Amazon Glue ETL script shows the process of writing Avro files or folders to S3.
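A minimal sketch of such a script, assuming a Glue job environment and an already initialized DynamicFrame. The helper names avro_write_options and write_avro and the example bucket path are illustrative, not part of the Glue API. The sketch passes the output location under a singular path key, on the assumption that S3 sinks expect a single string there; verify this against the Amazon S3 connection option reference.

```python
def avro_write_options(s3path):
    """Build the arguments for writing Avro files to S3.

    Hypothetical helper: s3path is the expected S3 output path.
    """
    return {
        "connection_type": "s3",
        # Assumption: S3 sinks take one output location under "path".
        "connection_options": {"path": s3path},
        "format": "avro",
    }


def write_avro(glue_context, dynamic_frame, s3path):
    """Write dynamic_frame as Avro files under s3path."""
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        **avro_write_options(s3path)
    )


# Inside a Glue job, with glue_context and dynamicFrame already created:
#   write_avro(glue_context, dynamicFrame, "s3://example-bucket/avro-output/")
```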
Avro configuration reference
You can use the following format_options values wherever Amazon Glue libraries specify format="avro":
version: Specifies the version of the Apache Avro reader/writer format to support. The default is "1.7". You can specify format_options={"version": "1.8"} to enable Avro logical type reading and writing. For more information, see the Apache Avro 1.7.7 Specification and Apache Avro 1.8.2 Specification. The Apache Avro 1.8 connector supports the following logical type conversions:
For the reader: this table shows the conversion between Avro data type (logical type and Avro primitive type) and Amazon Glue DynamicFrame data type for Avro reader 1.7 and 1.8.
Avro Data Type: Logical Type | Avro Data Type: Avro Primitive Type | Glue DynamicFrame Data Type: Avro Reader 1.7 | Glue DynamicFrame Data Type: Avro Reader 1.8 |
---|---|---|---|
Decimal | bytes | BINARY | Decimal |
Decimal | fixed | BINARY | Decimal |
Date | int | INT | Date |
Time (millisecond) | int | INT | INT |
Time (microsecond) | long | LONG | LONG |
Timestamp (millisecond) | long | LONG | Timestamp |
Timestamp (microsecond) | long | LONG | LONG |
Duration (not a logical type) | fixed of 12 | BINARY | BINARY |
For the writer: this table shows the conversion between Amazon Glue DynamicFrame data type and Avro data type for Avro writer 1.7 and 1.8.
Amazon Glue DynamicFrame Data Type | Avro Data Type: Avro Writer 1.7 | Avro Data Type: Avro Writer 1.8 |
---|---|---|
Decimal | String | decimal |
Date | String | date |
Timestamp | String | timestamp-micros |
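To see what switching the version option changes on the read side, the reader table can be expressed as a small lookup. This is a sketch for illustration only: READER_TYPES, reader_type, and changed_by_upgrade are hypothetical names, the keys use the logical type names from the Apache Avro specification, and the Duration row is omitted because the table treats it as not being a logical type.

```python
# format_options value that enables logical type reading and writing.
AVRO_1_8_FORMAT_OPTIONS = {"version": "1.8"}

# Reader conversions from the table, keyed by (logical type, primitive type).
# Each value is (DynamicFrame type with reader 1.7, with reader 1.8).
READER_TYPES = {
    ("decimal", "bytes"): ("BINARY", "Decimal"),
    ("decimal", "fixed"): ("BINARY", "Decimal"),
    ("date", "int"): ("INT", "Date"),
    ("time-millis", "int"): ("INT", "INT"),
    ("time-micros", "long"): ("LONG", "LONG"),
    ("timestamp-millis", "long"): ("LONG", "Timestamp"),
    ("timestamp-micros", "long"): ("LONG", "LONG"),
}


def reader_type(logical_type, primitive_type, version="1.7"):
    """Look up the DynamicFrame type the reader produces for an Avro field."""
    type_1_7, type_1_8 = READER_TYPES[(logical_type, primitive_type)]
    return type_1_8 if version == "1.8" else type_1_7


def changed_by_upgrade():
    """List the Avro types whose DynamicFrame type differs between 1.7 and 1.8."""
    return sorted(key for key, (old, new) in READER_TYPES.items() if old != new)
```

According to the table, only decimal, date, and millisecond timestamps are read differently under version 1.8; the time types and microsecond timestamps keep their primitive representation in both versions.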
Avro Spark DataFrame support
To use Avro from the Spark DataFrame API, you need to install the Spark Avro plugin for the corresponding Spark version. The version of Spark available in your job is determined by your Amazon Glue version. For more information about Spark versions, see Amazon Glue versions. This plugin is maintained by Apache; we do not make specific guarantees of support.
In Amazon Glue 2.0, use version 2.4.3 of the Spark Avro plugin. You can find this JAR on Maven Central; see org.apache.spark:spark-avro_2.12:2.4.3.
In Amazon Glue 3.0, use version 3.1.1 of the Spark Avro plugin. You can find this JAR on Maven Central; see org.apache.spark:spark-avro_2.12:3.1.1.
To include extra JARs in an Amazon Glue ETL job, use the --extra-jars job parameter. For more information about job parameters, see Using job parameters in Amazon Glue jobs. You can also configure this parameter in the Amazon Web Services Management Console.
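Once the plugin JAR is on the job's classpath, Avro can be used through the regular Spark DataFrame reader and writer. The following is a minimal sketch under that assumption; SPARK_AVRO_PLUGIN, read_avro_dataframe, and write_avro_dataframe are hypothetical names, and the S3 locations in the comments are placeholders.

```python
# Maven coordinates of the Spark Avro plugin for each Amazon Glue version,
# as listed above.
SPARK_AVRO_PLUGIN = {
    "2.0": "org.apache.spark:spark-avro_2.12:2.4.3",
    "3.0": "org.apache.spark:spark-avro_2.12:3.1.1",
}


def read_avro_dataframe(spark, path):
    """Read Avro data into a Spark DataFrame (requires the Spark Avro plugin)."""
    return spark.read.format("avro").load(path)


def write_avro_dataframe(data_frame, path):
    """Write a Spark DataFrame as Avro files (requires the Spark Avro plugin)."""
    data_frame.write.format("avro").save(path)


# Inside a Glue job started with a job parameter such as
#   --extra-jars s3://example-bucket/jars/spark-avro_2.12-3.1.1.jar
# (hypothetical JAR location), with spark being the job's SparkSession:
#   dataFrame = read_avro_dataframe(spark, "s3://example-bucket/avro-input/")
#   write_avro_dataframe(dataFrame, "s3://example-bucket/avro-output/")
```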