Using the Parquet format in Amazon Glue
Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Parquet data format, this document introduces you to the available features for using your data in Amazon Glue.
Amazon Glue supports using the Parquet format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see Apache Parquet Documentation Overview.
You can use Amazon Glue to read Parquet files from Amazon S3 and from streaming sources, as well as write Parquet files to Amazon S3. You can read and write bzip and gzip archives containing Parquet files from S3. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.
The following table shows which common Amazon Glue features support the Parquet format option.
| Read | Write | Streaming read | Group small files | Job bookmarks |
| --- | --- | --- | --- | --- |
| Supported | Supported | Supported | Unsupported | Supported* |
* Supported in Amazon Glue version 1.0+
Example: Read Parquet files or folders from S3
Prerequisites: You will need the S3 paths (s3path) to the Parquet files or folders that you want to read.

Configuration: In your function options, specify format="parquet". In your connection_options, use the paths key to specify your s3path.

You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters.

You can configure how the reader interprets Parquet files in your format_options. For details, see Parquet Configuration Reference.
The following Amazon Glue ETL script shows the process of reading Parquet files or folders from S3:
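The script below is a minimal sketch, assuming a Python (PySpark) Glue job and a placeholder s3://s3path prefix; substitute your own path.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Boilerplate: create the GlueContext that the read call is made against.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read Parquet files or folders from S3 into a DynamicFrame.
# "paths" lists the files or folders to read; "s3://s3path" is a placeholder.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="parquet",
)
```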
Example: Write Parquet files and folders to S3
Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="parquet". In your connection_options, use the paths key to specify s3path.

You can further alter how the writer interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters. You can configure how your operation writes the contents of your files in format_options. For details, see Parquet Configuration Reference.
The following Amazon Glue ETL script shows the process of writing Parquet files and folders to S3.
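The script below is a minimal sketch, assuming the GlueContext from the previous example, an existing DynamicFrame named dynamicFrame, and a placeholder s3://s3path output prefix; the optional useGlueParquetWriter setting is discussed below.

```python
# Write the DynamicFrame out as Parquet files under the given S3 prefix.
# The S3 sink takes its output location through the "path" connection option;
# "s3://s3path" is a placeholder for your own prefix.
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="parquet",
    format_options={"useGlueParquetWriter": True},  # optional performance setting
)
```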
We provide a custom Parquet writer with performance optimizations for DynamicFrames, through the useGlueParquetWriter configuration key. To determine if this writer is right for your workload, see Glue Parquet Writer.
Parquet configuration reference
You can use the following format_options wherever Amazon Glue libraries specify format="parquet" (a combined example follows the list):
- useGlueParquetWriter – Specifies the use of a custom Parquet writer that has performance optimizations for DynamicFrame workflows. For usage details, see Glue Parquet Writer.
  - Type: Boolean, Default: false
- compression – Specifies the compression codec used. Values are fully compatible with org.apache.parquet.hadoop.metadata.CompressionCodecName.
  - Type: Enumerated Text, Default: "snappy"
  - Values: "uncompressed", "snappy", "gzip", and "lzo"
- blockSize – Specifies the size in bytes of a row group being buffered in memory. You use this for tuning performance. The size should divide exactly into a whole number of megabytes.
  - Type: Numerical, Default: 134217728
  - The default value is equal to 128 MB.
- pageSize – Specifies the size in bytes of a page. You use this for tuning performance. A page is the smallest unit that must be read fully to access a single record.
  - Type: Numerical, Default: 1048576
  - The default value is equal to 1 MB.
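As an illustration only, the following format_options map combines the keys above; all values are assumptions shown at or near their defaults, so adjust them for your workload.

```python
# Illustrative format_options for a Parquet write; values are examples only.
format_options = {
    "useGlueParquetWriter": True,   # default is False
    "compression": "snappy",        # "uncompressed", "snappy", "gzip", or "lzo"
    "blockSize": 134217728,         # 128 MB row groups
    "pageSize": 1048576,            # 1 MB pages
}
```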
Note
Additionally, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the connection_options map parameter. For example, you can set a Spark configuration such as mergeSchema to merge the schemas of all the Parquet files that are read.
Optimize write performance with Amazon Glue Parquet writer
Note
The Amazon Glue Parquet writer has historically been accessed through the glueparquet format type. This access pattern is no longer advocated. Instead, use the parquet type with useGlueParquetWriter enabled.
The Amazon Glue Parquet writer has performance enhancements that allow faster Parquet file writes. The traditional writer computes a schema before writing. The Parquet format doesn't store the schema in a quickly retrievable fashion, so this might take some time. With the Amazon Glue Parquet writer, a pre-computed schema isn't required. The writer computes and modifies the schema dynamically, as data comes in.
Note the following limitations when you specify useGlueParquetWriter:

- The writer supports schema evolution (such as adding or removing columns), but not changing column types, such as with ResolveChoice.
- The writer doesn't support writing empty DataFrames, for example, to write a schema-only file. When integrating with the Amazon Glue Data Catalog by setting enableUpdateCatalog=True, attempting to write an empty DataFrame will not update the Data Catalog. This results in creating a table in the Data Catalog without a schema.
If your transform isn't affected by these limitations, turning on the Amazon Glue Parquet writer should increase performance.