Using the JSON format in Amazon Glue

Amazon Glue retrieves data from sources and writes data to targets stored and transported in various data formats. If your data is stored or transported in the JSON data format, this document introduces you to available features for using your data in Amazon Glue.

Amazon Glue supports using the JSON format. This format represents data structures that have a consistent shape but flexible contents and that aren't row- or column-based. JSON is defined by parallel standards issued by several authorities, one of which is ECMA-404. For an introduction to the format by a commonly referenced source, see Introducing JSON.

You can use Amazon Glue to read JSON files from Amazon S3, as well as bzip and gzip compressed JSON files. You configure compression behavior on the S3 connection parameters instead of in the configuration discussed on this page.

Read	Write	Streaming read	Group small files	Job bookmarks
Supported	Supported	Supported	Supported	Supported

Example: Read JSON files or folders from S3

Prerequisites: You will need the S3 paths (s3path) to the JSON files or folders you would like to read.

Configuration: In your function options, specify format="json". In your connection_options, use the paths key to specify your s3path. You can further alter how your read operation traverses S3 in the connection_options. For details, see Amazon S3 connection option reference. You can configure how the reader interprets JSON files in your format_options. For details, see JSON Configuration Reference.

The following Amazon Glue ETL script shows the process of reading JSON files or folders from S3:

Python

For this example, use the create_dynamic_frame.from_options method.


# Example: Read JSON from S3
# To illustrate, we read a nested JSON file and limit it with the jsonPath parameter
# We also handle JSON where a single record spans multiple lines
# Consider whether optimizePerformance is right for your workflow.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

dynamicFrame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "jsonPath": "$.id",
        "multiline": True,
        # "optimizePerformance": True, -> not compatible with jsonPath, multiline
    }
)

You can also use DataFrames in a script (pyspark.sql.DataFrame).


dataFrame = spark.read\
    .option("multiline", "true")\
    .json("s3://s3path")
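
The effect of the multiline option can be illustrated locally with Python's standard json module. This is only a sketch of the idea, not the parser that Amazon Glue or Spark uses, and the function names are hypothetical. It shows that a record spanning several lines cannot be parsed one line at a time, which is the situation multiline is meant for.

```python
import json

# One record per line: each line parses on its own.
json_lines = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'

# A single pretty-printed record that spans several lines.
multiline_record = '{\n  "id": 1,\n  "name": "a"\n}\n'


def parse_line_by_line(text):
    """Parse each non-empty line as an independent JSON record."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def line_by_line_ok(text):
    """Return True if every non-empty line is a complete JSON record."""
    try:
        parse_line_by_line(text)
        return True
    except json.JSONDecodeError:
        return False


print(line_by_line_ok(json_lines))        # True
print(line_by_line_ok(multiline_record))  # False: "{" alone is not a record
print(json.loads(multiline_record))       # parsing the whole text succeeds
```

Data shaped like json_lines can be read with the default settings, while data shaped like multiline_record needs multiline enabled.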

Scala

For this example, use the getSourceWithFormat operation.


// Example: Read JSON from S3
// To illustrate, we read a nested JSON file and limit it with the jsonPath parameter
// We also handle JSON where a single record spans multiple lines
// Consider whether optimizePerformance is right for your workflow.

import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.{DynamicFrame, GlueContext}
import org.apache.spark.SparkContext

object GlueApp {
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = new SparkContext()
    val glueContext: GlueContext = new GlueContext(spark)

    val dynamicFrame = glueContext.getSourceWithFormat(
      formatOptions=JsonOptions("""{"jsonPath": "$.id", "multiline": true, "optimizePerformance":false}"""),
      connectionType="s3",
      format="json",
      options=JsonOptions("""{"paths": ["s3://s3path"], "recurse": true}""")
    ).getDynamicFrame()
  }
}

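You can also use DataFrames in a script (org.apache.spark.sql.DataFrame). Here spark must be a SparkSession (for example, the one returned by glueContext.getSparkSession), not the SparkContext that is named spark in the example above.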


val dataFrame = spark.read
    .option("multiline", "true")
    .json("s3://s3path")

Example: Write JSON files and folders to S3

Prerequisites: You will need an initialized DataFrame (dataFrame) or DynamicFrame (dynamicFrame). You will also need your expected S3 output path, s3path.

Configuration: In your function options, specify format="json". In your connection_options, use the path key to specify s3path. You can further alter how the writer interacts with S3 in the connection_options. For details, see Amazon S3 connection option reference. You can configure how the writer produces JSON files in your format_options. For details, see JSON Configuration Reference.

The following Amazon Glue ETL script shows the process of writing JSON files or folders to S3:
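
A minimal sketch of such a write, assuming an initialized GlueContext and DynamicFrame as described in the prerequisites. The helper name write_json_to_s3 is hypothetical and only wraps the write_dynamic_frame.from_options method; the context and frame are passed in as arguments so that the function itself needs no imports.

```python
def write_json_to_s3(glue_context, dynamic_frame, s3_path):
    """Write a DynamicFrame to S3 as JSON (hypothetical helper name)."""
    return glue_context.write_dynamic_frame.from_options(
        frame=dynamic_frame,
        connection_type="s3",
        # Writes take a single output location under the "path" key.
        connection_options={"path": s3_path},
        format="json",
    )
```

In a job script this would be called as write_json_to_s3(glueContext, dynamicFrame, "s3://s3path"). With a DataFrame, the equivalent write is dataFrame.write.json("s3://s3path").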

JSON configuration reference

You can use the following format_options values with format="json":

jsonPath — A JsonPath expression that identifies an object to be read into records. This is particularly useful when a file contains records nested inside an outer array. For example, the following JsonPath expression targets the id field of a JSON object.
```
format="json", format_options={"jsonPath": "$.id"}
```
multiline — A Boolean value that specifies whether a single record can span multiple lines. This can occur when a field contains a quoted new-line character. You must set this option to "true" if any record spans multiple lines. The default value is "false", which allows for more aggressive file-splitting during parsing.
optimizePerformance — A Boolean value that specifies whether to use the advanced SIMD JSON reader along with Apache Arrow based columnar memory formats. Only available in Amazon Glue 3.0. Not compatible with multiline or jsonPath. Providing either of those options will instruct Amazon Glue to fall back to the standard reader.
withSchema — A String value that specifies a table schema in the format described in Manually specify the XML schema. Only used with optimizePerformance when reading from non-Catalog connections.
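
The compatibility rules above can be checked before a job runs. The following is a sketch only: json_format_options is a hypothetical helper, not part of the Amazon Glue library, and it assumes you would rather fail early than have the job fall back to the standard reader without notice.

```python
def json_format_options(json_path=None, multiline=False,
                        optimize_performance=False, with_schema=None):
    """Build a format_options dict for format="json" (hypothetical helper).

    Raises ValueError for combinations the reference lists as incompatible,
    instead of letting the job fall back to the standard reader.
    """
    if optimize_performance and (json_path is not None or multiline):
        raise ValueError(
            "optimizePerformance is not compatible with jsonPath or multiline"
        )
    if with_schema is not None and not optimize_performance:
        raise ValueError("withSchema is only used with optimizePerformance")

    options = {}
    if json_path is not None:
        options["jsonPath"] = json_path
    if multiline:
        options["multiline"] = True
    if optimize_performance:
        options["optimizePerformance"] = True
    if with_schema is not None:
        options["withSchema"] = with_schema
    return options


print(json_format_options(json_path="$.id", multiline=True))
# {'jsonPath': '$.id', 'multiline': True}
```

The returned dictionary can be passed as format_options to create_dynamic_frame.from_options. Raising an error is a design choice of this sketch; Amazon Glue itself falls back to the standard reader in these cases.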

Using vectorized SIMD JSON reader with Apache Arrow columnar format

Amazon Glue version 3.0 adds a vectorized reader for JSON data. It performs 2x faster under certain conditions, compared to the standard reader. This reader comes with certain limitations users should be aware of before use, documented in this section.

To use the optimized reader, set "optimizePerformance" to True in the format_options or table property. You must also provide withSchema unless you are reading from the Catalog. withSchema expects input in the format described in Manually specify the XML schema.



# Read from S3 data source
glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://s3path"]},
    format="json",
    format_options={
        "optimizePerformance": True,
        "withSchema": SchemaString,
    },
)

# Read from catalog table
glueContext.create_dynamic_frame.from_catalog(
    database=database,
    table_name=table,
    additional_options={
        # The vectorized reader for JSON can read your schema from a catalog table property.
        "optimizePerformance": True,
    },
)

For more information about building a SchemaString in the Amazon Glue library, see PySpark extension types.

Limitations for the vectorized JSON reader

Note the following limitations:

JSON elements with nested objects or array values are not supported. If provided, Amazon Glue will fall back to the standard reader.
A schema must be provided, either from the Catalog or with withSchema.
Not compatible with multiline or jsonPath. Providing either of those options will instruct Amazon Glue to fall back to the standard reader.
Providing input records that do not match the input schema will cause the reader to fail.
Error records will not be created.
JSON files with multi-byte characters (such as Japanese or Chinese characters) are not supported.
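
Two of these limitations can be spotted in a sample of your data before you enable the reader. The sketch below is a local heuristic that does not call Amazon Glue; the function name vectorized_reader_concerns is hypothetical, and it assumes the sample holds one JSON record per line.

```python
import json


def vectorized_reader_concerns(json_lines_text):
    """List reasons a one-record-per-line sample may not suit the vectorized reader."""
    concerns = []
    # Any non-ASCII character takes more than one byte in UTF-8.
    if not json_lines_text.isascii():
        concerns.append("multi-byte characters")
    for line in json_lines_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Inspect the field values of an object; treat other top-level values as is.
        values = record.values() if isinstance(record, dict) else [record]
        if any(isinstance(value, (dict, list)) for value in values):
            concerns.append("nested object or array value")
            break
    return concerns


flat = '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n'
nested = '{"id": 1, "tags": ["x", "y"]}\n'
print(vectorized_reader_concerns(flat))    # []
print(vectorized_reader_concerns(nested))  # ['nested object or array value']
```

An empty list does not guarantee that the vectorized reader will accept the data. Schema mismatches, for example, are not checked here.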
