Using the Parquet format in Amazon Glue
Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Parquet data format, this document introduces you to the available features for using your data in Amazon Glue.
Amazon Glue supports using the Parquet format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see Apache Parquet Documentation Overview.
You can use Amazon Glue to read Parquet files from Amazon S3 and from streaming sources, as well as write Parquet files to Amazon S3. You can read and write bzip and gzip archives containing Parquet files from S3. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.
The following table shows which common Amazon Glue features support the Parquet format option.
| Read | Write | Streaming read | Group small files | Job bookmarks |
| --- | --- | --- | --- | --- |
| Supported | Supported | Supported | Unsupported | Supported* |
* Supported in Amazon Glue version 1.0+
Example: Read Parquet files or folders from S3
Prerequisites: You will need the S3 paths (s3path) to the Parquet files or folders that you want to read.

Configuration: In your function options, specify format="parquet". In your connection_options, use the paths key to specify your s3path.

You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters.

You can configure how the reader interprets Parquet files in your format_options. For details, see Parquet Configuration Reference.
The following Amazon Glue ETL script shows the process of reading Parquet files or folders from S3:
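The script below is a minimal sketch, assuming a Python (PySpark) Glue job and a placeholder s3://s3path prefix; substitute your own path.

```python
from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Boilerplate: create the GlueContext that the read call is made against.
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read Parquet files or folders from S3 into a DynamicFrame.
# "paths" lists the files or folders to read; "s3://s3path" is a placeholder.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="parquet",
)
```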
Example: Write Parquet files and folders to S3
Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="parquet". In your connection_options, use the paths key to specify s3path.

You can further alter how the writer interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters. You can configure how your operation writes the contents of your files in format_options. For details, see Parquet Configuration Reference.
The following Amazon Glue ETL script shows the process of writing Parquet files and folders to S3.
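The script below is a minimal sketch, assuming the GlueContext from the previous example, an existing DynamicFrame named dynamicFrame, and a placeholder s3://s3path output prefix; the optional useGlueParquetWriter setting is discussed below.

```python
# Write the DynamicFrame out as Parquet files under the given S3 prefix.
# The S3 sink takes its output location through the "path" connection option;
# "s3://s3path" is a placeholder for your own prefix.
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    connection_options={"path": "s3://s3path"},
    format="parquet",
    format_options={"useGlueParquetWriter": True},  # optional performance setting
)
```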
We provide a custom Parquet writer with performance optimizations for DynamicFrames, through the useGlueParquetWriter configuration key. To determine if this writer is right for your workload, see Glue Parquet Writer.
Parquet configuration reference
You can use the following format_options wherever Amazon Glue libraries specify format="parquet" (a combined example follows the list):
- useGlueParquetWriter – Specifies the use of a custom Parquet writer that has performance optimizations for DynamicFrame workflows. For usage details, see Glue Parquet Writer.
  - Type: Boolean, Default: false
- compression – Specifies the compression codec used. Values are fully compatible with org.apache.parquet.hadoop.metadata.CompressionCodecName.
  - Type: Enumerated Text, Default: "snappy"
  - Values: "uncompressed", "snappy", "gzip", and "lzo"
- blockSize – Specifies the size in bytes of a row group being buffered in memory. You use this for tuning performance. The size should divide exactly into a whole number of megabytes.
  - Type: Numerical, Default: 134217728
  - The default value is equal to 128 MB.
- pageSize – Specifies the size in bytes of a page. You use this for tuning performance. A page is the smallest unit that must be read fully to access a single record.
  - Type: Numerical, Default: 1048576
  - The default value is equal to 1 MB.
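As an illustration only, the following format_options map combines the keys above; all values are assumptions shown at or near their defaults, so adjust them for your workload.

```python
# Illustrative format_options for a Parquet write; values are examples only.
format_options = {
    "useGlueParquetWriter": True,   # default is False
    "compression": "snappy",        # "uncompressed", "snappy", "gzip", or "lzo"
    "blockSize": 134217728,         # 128 MB row groups
    "pageSize": 1048576,            # 1 MB pages
}
```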
Note
Additionally, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the connection_options map parameter. For example, you can set a Spark configuration such as mergeSchema to merge the schemas of all the Parquet files that are read.
Optimize write performance with Amazon Glue Parquet writer
Note
The Amazon Glue Parquet writer has historically been accessed through the glueparquet format type. This access pattern is no longer advocated. Instead, use the parquet type with useGlueParquetWriter enabled.
The Amazon Glue Parquet writer has performance enhancements that allow faster Parquet file writes. The traditional writer computes a schema before writing. The Parquet format doesn't store the schema in a quickly retrievable fashion, so this might take some time. With the Amazon Glue Parquet writer, a pre-computed schema isn't required. The writer computes and modifies the schema dynamically, as data comes in.
Note the following limitations when you specify useGlueParquetWriter:

- The writer supports schema evolution (such as adding or removing columns), but not changing column types, such as with ResolveChoice.
- The writer doesn't support writing empty DataFrames, for example, to write a schema-only file. When integrating with the Amazon Glue Data Catalog by setting enableUpdateCatalog=True, attempting to write an empty DataFrame will not update the Data Catalog. This results in creating a table in the Data Catalog without a schema.
If your transform isn't affected by these limitations, turning on the Amazon Glue Parquet writer should increase performance.