Using the JSON format in Amazon Glue
Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the JSON data format, this document introduces you to available features for using your data in Amazon Glue.
Amazon Glue supports using the JSON format. This format represents data structures with a consistent shape but flexible contents that aren't row or column based. JSON is defined by parallel standards issued by several authorities, one of which is ECMA-404. For an introduction to the format by a commonly referenced source, see Introducing JSON.
You can use Amazon Glue to read JSON files from Amazon S3, as well as bzip and gzip compressed JSON files. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.
Read | Write | Streaming read | Group small files | Job bookmarks |
---|---|---|---|---|
Supported | Supported | Supported | Supported | Supported |
Example: Read JSON files or folders from S3
Prerequisites: You will need the S3 paths (s3path) to the JSON files or folders you would like to read.
Configuration: In your function options, specify format="json". In your connection_options, use the paths key to specify your s3path. You can further alter how your read operation traverses S3 in the connection options; consult Amazon S3 connection option reference for details. You can configure how the reader interprets JSON files in your format_options. For details, see JSON Configuration Reference.
The following Amazon Glue ETL script shows the process of reading JSON files or folders from S3:
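A minimal sketch of such a read in Python, assuming a job in which a GlueContext has already been created. The helper name read_json_from_s3 is hypothetical and not part of the Amazon Glue library; the call to create_dynamic_frame.from_options and the paths, format, and format_options arguments follow the configuration described above.

```python
def read_json_from_s3(glue_context, s3path, format_options=None):
    """Read JSON files or folders under s3path into a DynamicFrame.

    glue_context is assumed to be an initialized awsglue.context.GlueContext.
    read_json_from_s3 is a hypothetical helper name used for illustration.
    """
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        # "paths" takes a list, so several files or folders can be read at once.
        connection_options={"paths": [s3path]},
        format="json",
        # Reader settings such as jsonPath or multiLine go here.
        format_options=format_options or {},
    )
```

Inside a Glue job this would typically be called as read_json_from_s3(glueContext, "s3://s3path"), where s3path stands for your own bucket and prefix.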
Example: Write JSON files and folders to S3
Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.
Configuration: In your function options, specify format="json". In your connection_options, use the path key to specify s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Data format options for ETL inputs and outputs in Amazon Glue: Amazon S3 connection option reference. You can configure how the writer interprets JSON files in your format_options. For details, see JSON Configuration Reference.
The following Amazon Glue ETL script shows the process of writing JSON files or folders to S3:
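A minimal sketch of such a write in Python, assuming an initialized GlueContext and a DynamicFrame to write. The helper name write_json_to_s3 is hypothetical. The sketch passes a single path key in connection_options, on the assumption that the S3 writer takes one output location rather than the list used for reads; check the Amazon S3 connection option reference if in doubt.

```python
def write_json_to_s3(glue_context, dynamic_frame, s3path):
    """Write a DynamicFrame to s3path as JSON.

    glue_context is assumed to be an initialized awsglue.context.GlueContext.
    write_json_to_s3 is a hypothetical helper name used for illustration.
    """
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        # Assumption: writes use a single "path" string, not the "paths" list used for reads.
        connection_options={"path": s3path},
        format="json",
    )
```

If you start from a Spark DataFrame instead, one option is to convert it to a DynamicFrame first and then call the helper as write_json_to_s3(glueContext, dynamicFrame, "s3://s3path").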
JSON configuration reference
You can use the following format_options values with format="json":
- jsonPath: A JsonPath expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the id field of a JSON object.
  format="json", format_options={"jsonPath": "$.id"}
- multiLine: A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.
- optimizePerformance: A Boolean value that specifies whether to use the advanced SIMD JSON reader along with Apache Arrow based columnar memory formats. Only available in Amazon Glue 3.0. Not compatible with multiLine or jsonPath. Providing either of those options will instruct Amazon Glue to fall back to the standard reader.
- withSchema: A String value that specifies a table schema in the format described in Manually specify the XML schema. Only used with optimizePerformance when reading from non-Catalog connections.
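To make the multiLine distinction concrete, the following plain-Python sketch contrasts a file with one record per line against a file whose single record spans several lines. It uses only the standard json module and does not involve Amazon Glue; the sample strings are made up for illustration. Parsing line by line, which is what allows a file to be split at newlines, works only for the first layout.

```python
import json

# One record per line: each line parses on its own, so the text can be split at any newline.
line_delimited = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# One record pretty-printed across several lines: no single line is valid JSON.
multi_line = '{\n  "id": 1,\n  "name": "a"\n}\n'


def parse_per_line(text):
    """Parse each non-empty line as an independent JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = parse_per_line(line_delimited)

try:
    parse_per_line(multi_line)
    per_line_ok = True
except json.JSONDecodeError:
    # The first line is just "{", which is not a complete JSON value.
    per_line_ok = False

# Parsing the whole text at once recovers the multi-line record.
whole_record = json.loads(multi_line)
```

Data shaped like the second string is the case in which multiLine must be set to "true".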
Using vectorized SIMD JSON reader with Apache Arrow columnar format
Amazon Glue version 3.0 adds a vectorized reader for JSON data. It performs 2x faster under certain conditions, compared to the standard reader. This reader comes with certain limitations users should be aware of before use, documented in this section.
To use the optimized reader, set "optimizePerformance" to True in the format_options or table property. You will also need to provide withSchema unless reading from the catalog. withSchema expects an input as described in Manually specify the XML schema.
# Read from S3 data source
glueContext.create_dynamic_frame.from_options(
    connection_type = "s3",
    connection_options = {"paths": ["s3://s3path"]},
    format = "json",
    format_options = {
        "optimizePerformance": True,
        "withSchema": SchemaString
    })

# Read from catalog table
glueContext.create_dynamic_frame.from_catalog(
    database = database,
    table_name = table,
    additional_options = {
        # The vectorized reader for JSON can read your schema from a catalog table property.
        "optimizePerformance": True,
    })
For more information about building a SchemaString in the Amazon Glue library, see PySpark extension types.
Limitations for the vectorized JSON reader
Note the following limitations:
- JSON elements with nested objects or array values are not supported. If provided, Amazon Glue will fall back to the standard reader.
- A schema must be provided, either from the Catalog or with withSchema.
- Not compatible with multiLine or jsonPath. Providing either of those options will instruct Amazon Glue to fall back to the standard reader.
- Providing input records that do not match the input schema will cause the reader to fail.
- Error records will not be created.
- JSON files with multi-byte characters (such as Japanese or Chinese characters) are not supported.
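These limitations can be checked before enabling the option. The following sketch is a hypothetical pre-flight helper in plain Python, not part of the Amazon Glue library: given the intended format_options and a few sample records, it lists the reasons the vectorized reader may be unsuitable according to the list above. The function names, the from_catalog parameter, and the message wording are all assumptions made for illustration.

```python
def _is_enabled(value):
    """Treat None, False, and the string "false" as not set."""
    return value not in (None, False) and str(value).lower() != "false"


def vectorized_reader_issues(format_options, sample_records, from_catalog=False):
    """Return a list of reasons the vectorized JSON reader may not apply.

    Hypothetical helper: it mirrors the documented limitations and never calls Glue.
    An empty list means none of the checked limitations was hit by the samples.
    """
    issues = []

    # multiLine and jsonPath cause a fallback to the standard reader.
    for option in ("multiLine", "jsonPath"):
        if _is_enabled(format_options.get(option)):
            issues.append(f"{option} is set: falls back to the standard reader")

    # A schema must come from the Catalog or from withSchema.
    if not from_catalog and not format_options.get("withSchema"):
        issues.append("no schema: provide withSchema or read from the Catalog")

    for record in sample_records:
        for key, value in record.items():
            # Nested objects or arrays cause a fallback to the standard reader.
            if isinstance(value, (dict, list)):
                issues.append(f"field {key!r} is nested: falls back to the standard reader")
            # Multi-byte characters are listed as unsupported.
            elif isinstance(value, str) and not value.isascii():
                issues.append(f"field {key!r} has multi-byte characters: not supported")

    return issues
```

A check like this only inspects the samples it is given, so an empty result does not guarantee that every record in the data set meets the limitations.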