Using the ORC format in Amazon Glue
Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the ORC data format, this document introduces you to the features available for using your data in Amazon Glue.
Amazon Glue supports using the ORC format. This format is a performance-oriented, column-based data format. For an introduction to the format by the standard authority, see Apache ORC.
You can use Amazon Glue to read ORC files from Amazon S3 and from streaming sources as well as write ORC files to Amazon S3.
You can read and write bzip and gzip archives containing ORC files from S3. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.
The following table shows which common Amazon Glue operations support the ORC format option.
Read | Write | Streaming read | Group small files | Job bookmarks |
---|---|---|---|---|
Supported | Supported | Supported | Unsupported | Supported* |
*Supported in Amazon Glue version 1.0+
Example: Read ORC files or folders from S3
Prerequisites: You will need the S3 paths (s3path) to the ORC files or folders that you want to read.
Configuration: In your function options, specify format="orc". In your connection_options, use the paths key to specify your s3path. You can configure how the reader interacts with S3 in the connection_options. For details, see Connection types and options for ETL in Amazon Glue: Amazon S3 connection option reference.
The following Amazon Glue ETL script shows the process of reading ORC files or folders from S3:
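A minimal sketch of such a read, assuming it runs inside a Glue job where the awsglue and pyspark libraries are available. The helper names (orc_read_options, read_orc, main) and the bucket path are hypothetical placeholders, not part of the Glue API; only create_dynamic_frame.from_options and its connection_type, connection_options, and format arguments come from Glue itself.

```python
def orc_read_options(s3path):
    """Build the keyword arguments for an ORC read from S3 (hypothetical helper)."""
    return {
        "connection_type": "s3",
        "connection_options": {"paths": [s3path]},  # "paths" takes a list of S3 locations
        "format": "orc",
    }


def read_orc(glue_context, s3path):
    """Return a DynamicFrame over the ORC files or folders at s3path."""
    return glue_context.create_dynamic_frame.from_options(**orc_read_options(s3path))


def main():
    # Imports are deferred so the helpers above can be loaded outside a Glue job.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glue_context = GlueContext(SparkContext.getOrCreate())
    # Placeholder path: replace with your own s3path.
    dynamic_frame = read_orc(glue_context, "s3://amzn-s3-demo-bucket/orc-input/")
    dynamic_frame.printSchema()

# In a Glue job script, call main() as the last statement.
```

Keeping the option dictionary in its own function makes it easy to inspect or extend the connection_options before the read is issued.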
Example: Write ORC files and folders to S3
Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.
Configuration: In your function options, specify format="orc". In your connection options, use the path key to specify s3path.
You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: Amazon S3 connection option reference.
The following code example shows the process:
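A minimal sketch of such a write, assuming a Glue job environment. The helper names and any path passed in are hypothetical. The DynamicFrame variant uses write_dynamic_frame.from_options with a single output location under the path key, which is the usual shape for an S3 sink; the DataFrame variant uses Spark's DataFrameWriter.orc. Check the key name against the S3 connection option reference for your Glue version.

```python
def orc_write_options(s3path):
    """Build the keyword arguments for an ORC write to S3 (hypothetical helper)."""
    return {
        "connection_type": "s3",
        "connection_options": {"path": s3path},  # assumption: S3 sinks take a single "path"
        "format": "orc",
    }


def write_dynamic_frame_as_orc(glue_context, dynamic_frame, s3path):
    """Write a DynamicFrame to s3path as ORC."""
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame, **orc_write_options(s3path)
    )


def write_data_frame_as_orc(data_frame, s3path):
    """Write a Spark DataFrame to s3path as ORC using the Spark writer directly."""
    data_frame.write.orc(s3path)
```

In a job script, one of these would be called with the frame produced by earlier steps, for example write_dynamic_frame_as_orc(glue_context, dynamicFrame, s3path).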
ORC configuration reference
There are no format_options values for format="orc". However, any options that are accepted by the underlying SparkSQL code can be passed to it by way of the connection_options map parameter.
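As an illustration of passing a SparkSQL option through connection_options, the sketch below merges an extra option into the map before a read. The helper with_spark_options and the bucket path are hypothetical. mergeSchema is a Spark ORC data source option in recent Spark releases; whether a particular Glue version forwards it is an assumption to verify before relying on it.

```python
def with_spark_options(connection_options, **spark_options):
    """Return a copy of connection_options with extra SparkSQL options merged in."""
    merged = dict(connection_options)
    merged.update(spark_options)
    return merged


read_args = {
    "connection_type": "s3",
    "format": "orc",
    # No format_options entry: format="orc" defines none.
    "connection_options": with_spark_options(
        {"paths": ["s3://amzn-s3-demo-bucket/orc-input/"]},  # placeholder path
        mergeSchema="true",  # assumption: forwarded to the Spark ORC reader
    ),
}

# Inside a Glue job:
# dynamic_frame = glue_context.create_dynamic_frame.from_options(**read_args)
```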