
Supported data types

The Spark connector supports the following Amazon Redshift data types. For a complete list of supported data types in Amazon Redshift, see Data types. If a data type is not in the table below, it is not supported by the Spark connector.

Data type          Aliases
SMALLINT           INT2
INTEGER            INT, INT4
BIGINT             INT8
DECIMAL            NUMERIC
REAL               FLOAT4
DOUBLE PRECISION   FLOAT8, FLOAT
BOOLEAN            BOOL
CHAR               CHARACTER, NCHAR, BPCHAR
VARCHAR            CHARACTER VARYING, NVARCHAR, TEXT
DATE
TIMESTAMP          TIMESTAMP WITHOUT TIME ZONE
TIMESTAMPTZ        TIMESTAMP WITH TIME ZONE
SUPER
TIME               TIME WITHOUT TIME ZONE
TIMETZ             TIME WITH TIME ZONE
VARBYTE            VARBINARY, BINARY VARYING
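
As a quick check of how these types surface in Spark, you can read a table without supplying a schema and inspect the schema the connector infers. The following is a minimal sketch; my_redshift_table, jdbcURL, and tempS3Dir are placeholder assumptions that follow the connection options used in the examples later in this topic.

import org.apache.spark.sql.SQLContext

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

// Read the table and let the connector map Redshift column types to Spark types.
val df = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "my_redshift_table")
  .load()

// Print the Spark schema inferred for the table's columns.
df.printSchema()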

Complex data types

You can use the Spark connector to read and write Spark complex data types such as ArrayType, MapType, and StructType to and from Redshift SUPER data type columns. If you provide a schema during a read operation, the data in the column is converted to the corresponding complex Spark types, including any nested types. Additionally, if autopushdown is enabled, projection of nested attributes, map values, and array indices is pushed down to Redshift, so the entire nested data structure no longer needs to be unloaded when you access only a portion of the data.

When you write DataFrames with the connector, any column of type MapType (with StringType keys), StructType, or ArrayType is written to a Redshift SUPER data type column. When writing these nested data structures, the tempformat parameter must be CSV, CSV GZIP, or PARQUET. Using AVRO causes an exception. Writing a MapType data structure that has a key type other than StringType also causes an exception.

StructType

The following example demonstrates how to create a table with a SUPER data type column that contains a struct:

create table contains_super (a super);

You can then use the connector to query the StringType field hello from the SUPER column a in the table by supplying a schema, as in the following example.

import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", StructType(StructField("hello", StringType) :: Nil)) :: Nil)

val helloDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .schema(schema)
  .load()
  .selectExpr("a.hello")

The following example demonstrates how to write a struct to the column a.

import org.apache.spark.sql.types._
import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", StructType(StructField("hello", StringType) :: Nil)) :: Nil)
val data = sc.parallelize(Seq(Row(Row("world"))))
val mydf = sqlContext.createDataFrame(data, schema)

mydf.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcUrl)
  .option("dbtable", tableName)
  .option("tempdir", tempS3Dir)
  .option("tempformat", "CSV")
  .mode(SaveMode.Append)
  .save()

MapType

If you prefer to use a MapType to represent your data, you can use a MapType data structure in your schema and retrieve the value corresponding to a key in the map. Note that all keys in your MapType data structure must be of type String, and all of the values must be of the same type, such as integer.

The following example demonstrates how to get the value of the key hello in the column a.

import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", MapType(StringType, IntegerType)) :: Nil)

val helloDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .schema(schema)
  .load()
  .selectExpr("a['hello']")
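
For the write direction, the following is a minimal sketch of writing a MapType column to a SUPER column, following the same pattern as the StructType write example above. The jdbcURL and tempS3Dir placeholders and the contains_super table name are assumptions carried over from the read example.

import org.apache.spark.sql.types._
import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

// Map keys must be StringType; the values in this sketch are integers.
val schema = StructType(StructField("a", MapType(StringType, IntegerType)) :: Nil)
val data = sc.parallelize(Seq(Row(Map("hello" -> 1))))
val mapDF = sqlContext.createDataFrame(data, schema)

// tempformat must be CSV, CSV GZIP, or PARQUET when writing complex types.
mapDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("dbtable", "contains_super")
  .option("tempdir", tempS3Dir)
  .option("tempformat", "CSV")
  .mode(SaveMode.Append)
  .save()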

ArrayType

If the column contains an array instead of a struct, you can use the connector to query the first element in the array.

import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", ArrayType(IntegerType)) :: Nil)

val helloDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .schema(schema)
  .load()
  .selectExpr("a[0]")
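
A write in the other direction follows the same pattern. The following is a minimal sketch of writing an ArrayType column to a SUPER column; the connection placeholders and the contains_super table name are the same assumptions used in the examples above.

import org.apache.spark.sql.types._
import org.apache.spark.sql._

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

val schema = StructType(StructField("a", ArrayType(IntegerType)) :: Nil)
val data = sc.parallelize(Seq(Row(Seq(1, 2, 3))))
val arrayDF = sqlContext.createDataFrame(data, schema)

// tempformat must be CSV, CSV GZIP, or PARQUET when writing complex types.
arrayDF.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("dbtable", "contains_super")
  .option("tempdir", tempS3Dir)
  .option("tempformat", "CSV")
  .mode(SaveMode.Append)
  .save()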

Limitations

Using complex data types with the Spark connector has the following limitations:

  • All nested struct field names and map keys must be lowercase. If you need to query complex field names that contain uppercase letters, as a workaround you can omit the schema and use the Spark from_json function to convert the returned string locally (see the sketch after this list).

  • Any map fields used in read or write operations must have only StringType keys.

  • Only CSV, CSV GZIP, and PARQUET are supported tempformat values for writing complex types to Redshift. Attempting to use AVRO will throw an exception.
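
The following is a minimal sketch of the from_json workaround mentioned in the first limitation above. It assumes the contains_super table and connection placeholders from the earlier examples, and a hypothetical nested field named Hello that contains an uppercase letter. Reading without a schema returns the SUPER column as a string, which from_json then parses locally in Spark.

import org.apache.spark.sql.types._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.from_json

val sc = // existing SparkContext
val sqlContext = new SQLContext(sc)

// Read without a schema so the SUPER column comes back as a JSON-formatted string.
val rawDF = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("dbtable", "contains_super")
  .load()

// Parse the string locally with a schema that uses the uppercase field name.
val nestedSchema = StructType(StructField("Hello", StringType) :: Nil)
val helloDF = rawDF
  .select(from_json(rawDF("a"), nestedSchema).alias("a"))
  .selectExpr("a.Hello")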