Using the Parquet format in Amazon Glue

Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the Parquet data format, this document introduces the features available for using your data in Amazon Glue.

Amazon Glue supports using the Parquet format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see the Apache Parquet Documentation Overview.

You can use Amazon Glue to read Parquet files from Amazon S3 and from streaming sources, as well as write Parquet files to Amazon S3. You can read and write bzip and gzip archives containing Parquet files from S3. You configure compression behavior in the S3 connection parameters rather than in the configuration discussed on this page.
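For example, a minimal sketch of reading a gzip archive of Parquet files might set a compression key in the connection options. This is a hedged illustration: the exact key name and accepted values are defined in the S3 connection parameters reference, not on this page, and the setup code mirrors the read example later on this page.

# Sketch: compression is configured on the S3 connection options,
# not in the Parquet format_options discussed on this page.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://s3path/"],
        "compression": "gzip",  # S3 connection parameter; see its reference for accepted values
    },
    format="parquet",
)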

The following table shows which common Amazon Glue features support the Parquet format option.

Read        Write       Streaming read   Group small files   Job bookmarks
Supported   Supported   Supported        Unsupported         Supported*

* Supported in Amazon Glue version 1.0+

Example: Read Parquet files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the Parquet files or folders that you want to read.

Configuration: In your function options, specify format="parquet". In your connection_options, use the paths key to specify your s3path.

You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters.

You can configure how the reader interprets Parquet files in your format_options. For details, see Parquet Configuration Reference.
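For illustration, a minimal sketch of a read that also sets the recurse connection option (one of the S3 connection parameters) might look like the following. The option name comes from the S3 connection parameters reference, and glueContext is assumed to be initialized as in the full example below.

# Sketch: tuning the S3 connection options on a Parquet read.
# Assumes glueContext is initialized as in the full example below.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://s3path/"],
        "recurse": True,  # also read Parquet files in subfolders (S3 connection parameter)
    },
    format="parquet",
)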

The following Amazon Glue ETL script shows the process of reading Parquet files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.

# Example: Read Parquet from S3
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path/"]},
    format="parquet"
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame = spark.read.parquet("s3://s3path/")
Scala

For this example, use the getSourceWithFormat method.

// Example: Read Parquet from S3
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType = "s3",
      format = "parquet",
      options = JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

spark.read.parquet("s3://s3path/")

Example: Write Parquet files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="parquet". In your connection_options, use the path key to specify your s3path, as shown in the following examples.

You can further alter how the writer interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: S3 connection parameters. You can configure how your operation writes the contents of your files in format_options. For details, see Parquet Configuration Reference.

The following Amazon Glue ETL script shows the process of writing Parquet files and folders to S3.

We provide a custom Parquet writer with performance optimizations for DynamicFrames, through the useGlueParquetWriter configuration key. To determine if this writer is right for your workload, see Glue Parquet Writer.

Python

For this example, use the write_dynamic_frame.from_options method.

# Example: Write Parquet to S3
# Consider whether useGlueParquetWriter is right for your workflow.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={
        "path": "s3://s3path",
    },
    format_options={
        # "useGlueParquetWriter": True,
    },
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame.write.parquet("s3://s3path/")
Scala

For this example, use the getSinkWithFormat method.

// Example: Write Parquet to S3
// Consider whether useGlueParquetWriter is right for your workflow.
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)
    glueContext.getSinkWithFormat(
      connectionType = "s3",
      options = JsonOptions("""{"path": "s3://s3path"}"""),
      format = "parquet"
    ).writeDynamicFrame(dynamicFrame)
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

dataFrame.write.parquet("s3://s3path/")

Parquet configuration reference

You can use the following format_options wherever Amazon Glue libraries specify format="parquet":

  • useGlueParquetWriter – Specifies the use of a custom Parquet writer that has performance optimizations for DynamicFrame workflows. For usage details, see Glue Parquet Writer.

    • Type: Boolean, Default: false

  • compression – Specifies the compression codec used. Values are fully compatible with org.apache.parquet.hadoop.metadata.CompressionCodecName.

    • Type: Enumerated Text, Default: "snappy"

    • Values: "uncompressed", "snappy", "gzip", and "lzo"

  • blockSize – Specifies the size in bytes of a row group being buffered in memory. You use this for tuning performance. The size should be an exact number of megabytes.

    • Type: Numerical, Default: 134217728

    • The default value is equal to 128 MB.

  • pageSize – Specifies the size in bytes of a page. You use this for tuning performance. A page is the smallest unit that must be read fully to access a single record.

    • Type: Numerical, Default: 1048576

    • The default value is equal to 1 MB.
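For illustration, a DynamicFrame write that tunes these options might look like the following sketch. The values shown are arbitrary examples, and glueContext and dynamicFrame are assumed to be initialized as in the earlier write example.

# Sketch: passing Parquet format_options on a DynamicFrame write.
# Values are illustrative; glueContext and dynamicFrame are assumed
# to be initialized as in the earlier write example.
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://s3path"},
    format_options={
        "compression": "gzip",    # "uncompressed", "snappy", "gzip", or "lzo"
        "blockSize": 134217728,   # 128 MB row groups
        "pageSize": 1048576,      # 1 MB pages
    },
)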

Note

Additionally, any options that are accepted by the underlying SparkSQL code can be passed to this format by way of the connection_options map parameter. For example, you can set a Spark configuration such as mergeSchema for the Amazon Glue Spark reader to merge the schema for all files.
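For example, a sketch of forwarding the Spark mergeSchema option through connection_options might look like the following; glueContext is assumed to be initialized as in the earlier read example, and the string value shown is an assumption about how the option is expressed.

# Sketch: passing a SparkSQL Parquet option (mergeSchema) through connection_options.
# Assumes glueContext is initialized as in the earlier read example.
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://s3path/"],
        "mergeSchema": "true",  # forwarded to the underlying Spark Parquet reader
    },
    format="parquet",
)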

Optimize write performance with Amazon Glue Parquet writer

Note

The Amazon Glue Parquet writer was historically accessed through the glueparquet format type. This access pattern is no longer recommended. Instead, use the parquet type with useGlueParquetWriter enabled.

The Amazon Glue Parquet writer has performance enhancements that allow faster Parquet file writes. The traditional writer computes a schema before writing. The Parquet format doesn't store the schema in a quickly retrievable fashion, so this might take some time. With the Amazon Glue Parquet writer, a pre-computed schema isn't required. The writer computes and modifies the schema dynamically, as data comes in.
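As a minimal sketch, enabling the writer on a DynamicFrame write looks like the following; glueContext and dynamicFrame are assumed to be initialized as in the earlier write example.

# Sketch: enabling the Amazon Glue Parquet writer on a DynamicFrame write.
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://s3path"},
    format_options={"useGlueParquetWriter": True},
)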

Note the following limitations when you specify useGlueParquetWriter:

  • The writer supports schema evolution (such as adding or removing columns), but it doesn't support changing column types, such as with ResolveChoice.

  • The writer doesn't support writing empty DataFrames, for example, to write a schema-only file. When you integrate with the Amazon Glue Data Catalog by setting enableUpdateCatalog=True, attempting to write an empty DataFrame doesn't update the Data Catalog; this results in a table in the Data Catalog without a schema.

If your workload isn't affected by these limitations, turning on the Amazon Glue Parquet writer should increase performance.