Extracting data from JSON - Amazon Athena
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Extracting data from JSON

You may have source data containing JSON-encoded strings that you do not necessarily want to deserialize into a table in Athena. In this case, you can still run SQL operations on this data, using the JSON functions available in Presto.

Consider this JSON string as an example dataset.

{"name": "Susan Smith", "org": "engineering", "projects": [ {"name":"project1", "completed":false}, {"name":"project2", "completed":true} ] }

Examples: Extracting properties

To extract the name and projects properties from the JSON string, use the json_extract function as in the following example. The json_extract function takes the column containing the JSON string, and searches it using a JSONPath-like expression with the dot . notation.

Note

JSONPath performs a simple tree traversal. It uses the $ sign to denote the root of the JSON document, followed by a period and an element nested directly under the root, such as $.name.

WITH dataset AS ( SELECT '{"name": "Susan Smith", "org": "engineering", "projects": [{"name":"project1", "completed":false}, {"name":"project2", "completed":true}]}' AS myblob ) SELECT json_extract(myblob, '$.name') AS name, json_extract(myblob, '$.projects') AS projects FROM dataset

The returned value is a JSON-encoded string, and not a native Athena data type.

+-----------------------------------------------------------------------------------------------+ | name | projects | +-----------------------------------------------------------------------------------------------+ | "Susan Smith" | [{"name":"project1","completed":false},{"name":"project2","completed":true}] | +-----------------------------------------------------------------------------------------------+

To extract the scalar value from the JSON string, use the json_extract_scalar function. It is similar to json_extract, but returns only scalar values (Boolean, number, or string).

Note

Do not use the json_extract_scalar function on arrays, maps, or structs.

WITH dataset AS ( SELECT '{"name": "Susan Smith", "org": "engineering", "projects": [{"name":"project1", "completed":false},{"name":"project2", "completed":true}]}' AS myblob ) SELECT json_extract_scalar(myblob, '$.name') AS name, json_extract_scalar(myblob, '$.projects') AS projects FROM dataset

This query returns:

+---------------------------+ | name | projects | +---------------------------+ | Susan Smith | | +---------------------------+

To obtain the first element of the projects property in the example array, use the json_array_get function and specify the index position.

WITH dataset AS ( SELECT '{"name": "Bob Smith", "org": "engineering", "projects": [{"name":"project1", "completed":false},{"name":"project2", "completed":true}]}' AS myblob ) SELECT json_array_get(json_extract(myblob, '$.projects'), 0) AS item FROM dataset

It returns the value at the specified index position in the JSON-encoded array.

+---------------------------------------+ | item | +---------------------------------------+ | {"name":"project1","completed":false} | +---------------------------------------+

To return an Athena string type, use the [] operator inside a JSONPath expression, then Use the json_extract_scalar function. For more information about [], see Accessing array elements.

WITH dataset AS ( SELECT '{"name": "Bob Smith", "org": "engineering", "projects": [{"name":"project1", "completed":false},{"name":"project2", "completed":true}]}' AS myblob ) SELECT json_extract_scalar(myblob, '$.projects[0].name') AS project_name FROM dataset

It returns this result:

+--------------+ | project_name | +--------------+ | project1 | +--------------+