Copying data from JSON and Avro Copying data from text and CSV Copying data from columnar-format Parquet and ORC

Using COPY to load data into SUPER columns

In the following sections, you can learn about different ways to use the COPY command to load JSON data into Amazon Redshift. For information about the data format parameters that Amazon Redshift uses to parse JSON in COPY commands, read the JSON format for COPY parameter description in Data format parameters.

Topics

Copying data from JSON and Avro
Copying data from text and CSV
Copying data from columnar-format Parquet and ORC

Copying data from JSON and Avro

Amazon Redshift provides the following methods to ingest a JSON document using COPY, even with a JSON structure that is fully or partially unknown:

Store the data deriving from a JSON document into a single SUPER data column using the noshred option. This method is useful when the schema isn't known or is expected to change. Thus, this method makes it easier to store the entire tuple in a single SUPER column.
Shred the JSON document into multiple Amazon Redshift columns using the auto or jsonpaths option. Attributes can be Amazon Redshift scalars or SUPER values.

You can use these options with the JSON or Avro formats. For more information on JSON options such as noshred, auto, and jsonpaths, see JSON format for COPY.

The maximum size for a JSON object in Amazon Redshift is 4 MB, which applies before any shredding or parsing.

Method 1: Copying a JSON document into a single SUPER data column using `noshred`

You can copy entire JSON documents into single SUPER data columns using the noshred option in the COPY command. Consider the following example:

Create a table with a single SUPER data column.
```
CREATE TABLE region_nations_noshred (rdata SUPER);
```
Copy the data from Amazon S3 into the single SUPER data column. To ingest the JSON source data into a single SUPER data column, specify the noshred option in the FORMAT JSON clause.
```
COPY region_nations_noshred FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 'noshred';
```
After COPY has successfully ingested the JSON, your table has a rdata SUPER data column that contains the data of the entire JSON object. The ingested data maintains all the properties of the JSON hierarchy. However, the leaves are converted to Amazon Redshift scalar types for efficient query processing.
Use the following query to retrieve the original JSON string.
```
SELECT rdata FROM region_nations_noshred;
```
When Amazon Redshift generates a SUPER data column, it becomes accessible using JDBC as a string through JSON serialization. For more information, see Serializing complex nested JSON.

Method 2: Copying a JSON document into multiple SUPER data columns

You can shred a JSON document into multiple columns that can be either SUPER data columns or Amazon Redshift scalar types. Amazon Redshift spreads different portions of the JSON object to different columns. Consider the following example:

Create a table with multiple columns to hold the shredded JSON.


CREATE TABLE region_nations
(
 r_regionkey smallint
 ,r_name varchar
 ,r_comment varchar
 ,r_nations super
 );

To copy JSON into the region_nations table, specify the AUTO option in the FORMAT JSON clause to split the JSON value across multiple columns. COPY matches the top-level JSON attributes with column names and allows nested values to be ingested as SUPER values, such as JSON arrays and objects.
```
COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 'auto';
```
When the JSON attribute names are in mixed upper and lower cases, specify the auto ignorecase option in the FORMAT JSON clause. For more information about the COPY command, see Load from JSON data using the 'auto ignorecase' option.

In some cases, there is a mismatch between column names and JSON attributes or the attribute to load is nested more than a level deep. If so, use a jsonpaths file to manually map JSON attributes to Amazon Redshift columns. Consider the following example:

Create a table with multiple columns to hold the shredded JSON.


CREATE TABLE region_nations
(
 r_regionkey smallint
 ,r_name varchar
 ,r_comment varchar
 ,r_nations super
 );

In this case, the column names don't match the JSON attributes. You can create a jsonpaths file that maps the paths of attributes to the table columns by their position in the jsonpaths array, like the following:
```
{"jsonpaths": [
       "$.r_regionkey",
       "$.r_name",
       "$.r_comment",
       "$.r_nations
    ]
}
```

Use the location of the jsonpaths file as the argument to the FORMAT JSON option in COPY.


COPY nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 's3://redshift-downloads/semistructured/tpch-nested/data/jsonpaths/nations_jsonpaths.json';

Use the following query to access the table that shows data spread to multiple columns. The SUPER data columns are printed using the JSON format.
```
SELECT r_regionkey,r_name,r_comment,r_nations[0].n_nationkey FROM region_nations ORDER BY 1,2,3 LIMIT 1;
```

jsonpaths files map fields in the JSON document to table columns. You can extract additional columns, such as distribution and sort keys, while still loading the complete document as a SUPER column. The following query loads the complete document to the nations column. The name column is the sort key and the regionkey column is the distribution key. Consider the following example:

Create a table with multiple columns to hold the shredded JSON.


CREATE TABLE nations_sorted (
    regionkey smallint,
    name varchar,
    nations super
) DISTKEY(regionkey) SORTKEY(name);

Map the root jsonpath "$" to the root of the document as follows:


{"jsonpaths": [
       "$.r_regionkey",
       "$.r_name",
       "$"
    ]
}

Use the location of the jsonpaths file as the argument to the FORMAT JSON option in COPY.


COPY nations_sorted FROM 's3://redshift-downloads/semistructured/tpch-nested/data/json/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT JSON 's3://redshift-downloads/semistructured/tpch-nested/data/jsonpaths/nations_sorted_jsonpaths.json';

For more information on using jsonpaths, see COPY from JSON format.

Copying data from text and CSV

Amazon Redshift represents SUPER columns in text and CSV formats as serialized JSON. Valid JSON formatting is required for SUPER columns to load with the correct type information. Unquote objects, arrays, numbers, booleans, and null values. Wrap string values in double quotes. SUPER columns use standard escaping rules for text and CSV formats.

When copying from CSV, delimiters are escaped according to the CSV standard. Consider the following example:


CREATE TABLE region_nations
(
 r_regionkey smallint
 ,r_name varchar
 ,r_comment varchar
 ,r_nations super
 );
 
 COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/csv/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT CSV;

When copying from text, if the chosen delimiter might also appear in a SUPER field, use the ESCAPE option during COPY and UNLOAD. Consider the following example:


CREATE TABLE region_nations
(
 r_regionkey smallint
 ,r_name varchar
 ,r_comment varchar
 ,r_nations super
 );
 
COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/text/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
DELIMITER ','
ESCAPE;

Copying data from columnar-format Parquet and ORC

If your semi-structured or nested data is already available in either Apache Parquet or Apache ORC format, you can use the COPY command to ingest data into Amazon Redshift.

The Amazon Redshift table structure should match the number of columns and the column data types of the Parquet or ORC files. By specifying SERIALIZETOJSON in the COPY command, you can load any column type in the file that aligns with a SUPER column in the table as SUPER. This includes structure and array types.

The following example uses a Parquet format:


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/parquet/region_nation'
REGION 'us-east-1' IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT PARQUET SERIALIZETOJSON;

The following example uses an ORC format.


COPY region_nations FROM 's3://redshift-downloads/semistructured/tpch-nested/data/orc/region_nation'
IAM_ROLE 'arn:aws:iam::xxxxxxxxxxxx:role/Redshift-S3'
FORMAT ORC SERIALIZETOJSON;

When the attributes of the date or time data types are in ORC, Amazon Redshift converts them to varchar upon encoding them in SUPER.

Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Using JSON_PARSE to insert JSON data

Unloading semi-structured data

Using COPY to load data into SUPER columns

Topics

Copying data from JSON and Avro

Method 1: Copying a JSON document into a single SUPER data column using noshred

Method 2: Copying a JSON document into multiple SUPER data columns

Copying data from text and CSV

Copying data from columnar-format Parquet and ORC

Method 1: Copying a JSON document into a single SUPER data column using `noshred`