Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China.

Using the ORC format in Amazon Glue

Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the ORC data format, this document introduces the features available for using your data in Amazon Glue.

Amazon Glue supports using the ORC format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see Apache ORC.

You can use Amazon Glue to read ORC files from Amazon S3 and from streaming sources as well as write ORC files to Amazon S3. You can read and write bzip and gzip archives containing ORC files from S3. You configure compression behavior on the Amazon S3 connection instead of in the configuration discussed on this page.

The following table shows which common Amazon Glue operations support the ORC format option.

Read      | Write     | Streaming read | Group small files | Job bookmarks
Supported | Supported | Supported      | Unsupported       | Supported*

*Supported in Amazon Glue version 1.0+

Example: Read ORC files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the ORC files or folders that you want to read.

Configuration: In your function options, specify format="orc". In your connection_options, use the paths key to specify your s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: "connectionType": "s3".

The following Amazon Glue ETL script shows the process of reading ORC files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Read every ORC file under the given S3 path into a DynamicFrame
dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="orc"
)

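You can also use DataFrames in a script (pyspark.sql.DataFrame). The snippet uses a SparkSession named spark, taken here from the glueContext created above.

spark = glueContext.spark_session
dataFrame = spark.read.orc("s3://s3path")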
Scala

For this example, use the getSourceWithFormat operation.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.GlueContext
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    // Read every ORC file under the given S3 path into a DynamicFrame
    val dynamicFrame = glueContext.getSourceWithFormat(
      connectionType="s3",
      format="orc",
      options=JsonOptions("""{"paths": ["s3://s3path"]}""")
    ).getDynamicFrame()
  }
}

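You can also use DataFrames in a script (org.apache.spark.sql.DataFrame). This snippet assumes that spark is a SparkSession rather than the SparkContext created in the example above.

val dataFrame = spark.read.orc("s3://s3path")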
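
The read configuration described above can be captured in a small helper. This is a minimal sketch, not part of the Amazon Glue API: the function name orc_read_args and the bucket names are hypothetical. It only builds the keyword arguments, so it runs without a Glue environment.

```python
def orc_read_args(s3_paths):
    """Build keyword arguments for an ORC read from S3 (hypothetical helper)."""
    # Accept a single path string or any iterable of path strings.
    paths = [s3_paths] if isinstance(s3_paths, str) else list(s3_paths)
    if not paths:
        raise ValueError("at least one S3 path is required")
    for p in paths:
        if not p.startswith("s3://"):
            raise ValueError(f"not an S3 path: {p!r}")
    return {
        "connection_type": "s3",
        "connection_options": {"paths": paths},  # reads take a list under "paths"
        "format": "orc",
    }


# Example bucket and prefixes are placeholders, not real locations.
read_args = orc_read_args(
    ["s3://example-bucket/orders/", "s3://example-bucket/returns/"]
)
```

In a Glue job, the result could then be passed along as glueContext.create_dynamic_frame.from_options(**read_args).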

Example: Write ORC files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="orc". In your connection_options, use the path key to specify s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: "connectionType": "s3". The following code example shows the process:

Python

For this example, use the write_dynamic_frame.from_options method.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# Write the DynamicFrame to the given S3 path as ORC
glueContext.write_dynamic_frame.from_options(
    frame=dynamicFrame,
    connection_type="s3",
    format="orc",
    connection_options={
        "path": "s3://s3path"
    }
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).

dataFrame.write.orc("s3://s3path/")
Scala

For this example, use the getSinkWithFormat method.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    // Write the DynamicFrame to the given S3 path as ORC
    glueContext.getSinkWithFormat(
      connectionType="s3",
      options=JsonOptions("""{"path": "s3://s3path"}"""),
      format="orc"
    ).writeDynamicFrame(dynamicFrame)
  }
}

You can also use DataFrames in a script (org.apache.spark.sql.DataFrame).

dataFrame.write.orc("s3://s3path/")
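
One detail that is easy to miss in the write examples is that the sink takes a single location under the path key, whereas the read examples pass a list under paths. The following sketch makes that explicit. The helper name orc_write_args and the bucket name are hypothetical, and a placeholder object stands in for a real DynamicFrame so that the snippet runs outside Glue.

```python
def orc_write_args(frame, s3_path):
    """Build keyword arguments for an ORC write to S3 (hypothetical helper)."""
    # Writes target one location, so reject lists and non-S3 strings early.
    if not isinstance(s3_path, str) or not s3_path.startswith("s3://"):
        raise ValueError(f"expected a single s3:// path string, got {s3_path!r}")
    return {
        "frame": frame,
        "connection_type": "s3",
        "connection_options": {"path": s3_path},  # singular "path" for writes
        "format": "orc",
    }


# Placeholder only: in a Glue job this would be an initialized DynamicFrame.
placeholder_frame = object()
write_args = orc_write_args(placeholder_frame, "s3://example-bucket/output/")
```

In a Glue job, the result could then be passed along as glueContext.write_dynamic_frame.from_options(**write_args).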

ORC configuration reference

There are no format_options values for format="orc". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.
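
As an illustration of passing Spark SQL options through connection_options, the sketch below merges extra options into an existing options map. The helper name with_spark_options and the bucket name are hypothetical. The compression option is taken from Spark's ORC data source, and whether a given Amazon Glue version honors it is an assumption that should be checked against your job's Spark version.

```python
def with_spark_options(connection_options, spark_options):
    """Return a new connection_options dict with extra Spark SQL options merged in."""
    # Guard the location keys so an option map cannot silently redirect the job.
    reserved = {"path", "paths"}
    clash = reserved.intersection(spark_options)
    if clash:
        raise ValueError(f"options would overwrite location keys: {sorted(clash)}")
    merged = dict(connection_options)  # copy, so the caller's dict is untouched
    merged.update(spark_options)
    return merged


base_options = {"path": "s3://example-bucket/output/"}
write_options = with_spark_options(
    base_options,
    {"compression": "snappy"},  # Spark ORC writer option; pass-through is assumed
)
```

The merged map would then be supplied as the connection_options argument, with format="orc" set as in the examples above.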