Using the ORC format in Amazon Glue

Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the ORC data format, this document introduces you to the available features for using your data in Amazon Glue.

Amazon Glue supports using the ORC format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see Apache ORC.

You can use Amazon Glue to read ORC files from Amazon S3 and from streaming sources as well as write ORC files to Amazon S3. You can read and write bzip and gzip archives containing ORC files from S3. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.

The following table shows which common Amazon Glue operations support the ORC format option.

Read	Write	Streaming read	Group small files	Job bookmarks
Supported	Supported	Supported	Unsupported	Supported^*

^*Supported in Amazon Glue version 1.0+

Example: Read ORC files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the ORC files or folders that you want to read.

Configuration: In your function options, specify format="orc". In your connection_options, use the paths key to specify your s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: Amazon S3 connection option reference.

The following Amazon Glue ETL script shows the process of reading ORC files or folders from S3:
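The script itself did not survive on this page, so the following is a minimal sketch of the configuration described above rather than the original example. It assumes a Python Glue job; `make_glue_context` and `read_orc` are hypothetical helper names, and `s3://s3path` stands in for your own path. The Glue imports sit inside a function because the `awsglue` module is only available in the Glue runtime.

```python
def make_glue_context():
    """Create a GlueContext. These imports resolve only inside a Glue job."""
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    return GlueContext(SparkContext.getOrCreate())


def read_orc(glue_context, s3_paths):
    """Read ORC files or folders from S3 into a DynamicFrame."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        # "paths" takes a list of S3 files or folders to read.
        connection_options={"paths": list(s3_paths)},
        format="orc",
    )


# Inside a Glue job (s3path is a placeholder):
# dynamic_frame = read_orc(make_glue_context(), ["s3://s3path"])
```

Passing the context in as a parameter keeps the read call easy to exercise with a stand-in object outside of Glue.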

Example: Write ORC files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="orc". In your connection_options, use the path key to specify s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: Amazon S3 connection option reference. The following code example shows the process:
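The example did not survive on this page, so what follows is a hedged sketch of the write configuration, not the original code. It assumes a Python Glue job and an existing DynamicFrame; `write_orc` is a hypothetical helper name and `s3://s3path` is a placeholder. As far as the S3 connection option reference describes it, an S3 sink takes a single `path` value rather than the `paths` list used for reading.

```python
def write_orc(glue_context, dynamic_frame, s3_path):
    """Write a DynamicFrame to S3 as ORC files."""
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        # A sink takes one output location under the "path" key.
        connection_options={"path": s3_path},
        format="orc",
    )


# Inside a Glue job, with glue_context and dynamic_frame already created
# (s3path is a placeholder):
# write_orc(glue_context, dynamic_frame, "s3://s3path")
```

If you start from a Spark DataFrame instead, one option is to convert it first with `DynamicFrame.fromDF(dataFrame, glue_context, "name")` from `awsglue.dynamicframe`.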

ORC configuration reference

There are no format_options values for format="orc". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.
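As an illustration of that pass-through, the sketch below merges extra Spark options into the connection_options map used for a read. It is an assumption-laden example: `read_orc_with_spark_options` is a hypothetical helper, and `mergeSchema` is an option of Spark's ORC data source (Spark 3.0 and later), so whether it is honored depends on the Spark version behind your Glue version.

```python
def read_orc_with_spark_options(glue_context, s3_paths, spark_options):
    """Read ORC from S3, forwarding extra Spark options via connection_options."""
    # Copy the caller's options first so that "paths" below always wins.
    connection_options = dict(spark_options)
    connection_options["paths"] = list(s3_paths)
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options=connection_options,
        format="orc",
        # No format_options: none are defined for format="orc".
    )


# Inside a Glue job (s3path is a placeholder; mergeSchema support is assumed):
# frame = read_orc_with_spark_options(
#     glue_context, ["s3://s3path"], {"mergeSchema": "true"}
# )
```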
