Other configuration options - Amazon Redshift

Other configuration options

Change the maximum size of string columns

When creating tables, Redshift creates string columns as text columns, which are stored as VARCHAR(256). If you want columns that support larger sizes, you can use maxlength to specify the maximum length of string columns. The following is an example of how to specify maxlength.

import org.apache.spark.sql.types.MetadataBuilder

columnLengthMap.foreach { case (colName, length) =>
  val metadata = new MetadataBuilder().putLong("maxlength", length).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}

Setting a column type

To set a column type, use the redshift_type field.

columnTypeMap.foreach { case (colName, colType) =>
  val metadata = new MetadataBuilder().putString("redshift_type", colType).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}

Setting a compression encoding on a column

To use a specific compression encoding on a column, use the encoding field. For a full list of supported compression encodings, see Compression encodings.
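For example, a minimal sketch that follows the same metadata pattern as the snippets above; the column name and the ZSTD encoding are illustrative choices, not requirements:

// Illustrative sketch: "quantity" and ZSTD are example values only.
val encodingMetadata = new MetadataBuilder().putString("encoding", "ZSTD").build()
df = df.withColumn("quantity", df("quantity").as("quantity", encodingMetadata))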

Setting a description for a column

To set a description, use the description field.
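For example, a minimal sketch that follows the same metadata pattern; the column name and description text are placeholders:

// Illustrative sketch: the column name and comment text are placeholders.
val descriptionMetadata = new MetadataBuilder().putString("description", "Total quantity ordered").build()
df = df.withColumn("quantity", df("quantity").as("quantity", descriptionMetadata))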

Authentication between Redshift and Amazon S3

By default, the result is unloaded to Amazon S3 in the parquet format. To unload the result as a pipe-delimited text file, specify the following option.

.option("unload_s3_format", "TEXT")

Lazily run pushdown statements

Parameter Required Default Description
spark.datasource.redshift.community.autopushdown.lazyMode No True

Specifies whether the connector should lazily run pushdown statements in Redshift.

If true, the Spark connector retrieves all of the related models and information before running the query, which generally yields better performance.

If false, the Spark connector runs pushdown statements immediately in the main Spark driver thread, and they are serialized across expressions.
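One way to set this property is through the Spark session configuration; a minimal sketch (the value shown is the default):

// Enable lazy pushdown mode for the current Spark session.
spark.conf.set("spark.datasource.redshift.community.autopushdown.lazyMode", "true")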

Connector parameters

The parameter map or OPTIONS in Spark SQL supports the following settings.

Parameter Required Default Description
dbtable Yes, unless query is specified N/A The table to create or read from in Redshift. This parameter is required when saving data back to Redshift.
query Yes, unless dbtable is specified N/A The query to read from in Redshift.
user No N/A The Redshift username. Must be used with the password parameter. Valid only if the user and password are not parameters in the URL. Using both will cause an error.
password No N/A The Redshift password. Must be used with the user parameter. Valid only if the user and password are not parameters in the URL. Using both will cause an error.
url No N/A

A JDBC URL. The format is jdbc:subprotocol://host:port/database?user=username&password=password.

The subprotocol can be postgresql or redshift, depending on which JDBC driver you have loaded. Note that a Redshift-compatible driver must be on the classpath and match this URL.

Host and port should point to the Redshift master node, so you must configure security groups and/or VPC to allow access from your driver application.

Database is the Redshift database name.

User and password are credentials to access the database, which must be embedded in this URL for JDBC, and the database user must have the necessary permissions to access the table.

aws_iam_role Only if using IAM roles to authorize Redshift COPY/UNLOAD operations N/A Fully specified ARN of the IAM Role attached to the Redshift cluster.
forward_spark_s3_credentials No False Indicates whether this library should automatically discover the credentials that Spark uses to connect to Amazon S3, and whether to forward those credentials to Redshift over the JDBC driver. These credentials are sent as part of the JDBC query. Therefore, we recommend that you enable SSL encryption on the JDBC connection when using this option.
temporary_aws_access_key_id No N/A Amazon access key. Must have write permissions to the S3 bucket.
temporary_aws_secret_access_key No N/A Amazon secret access key corresponding to the access key.
temporary_aws_session_token No N/A Amazon session token corresponding to provided access key.
tempdir No N/A A writeable location in Amazon S3. Used for unloaded data when reading and for Avro data to be loaded into Redshift when writing. If you're using a Redshift data source for Spark as part of a regular ETL pipeline, it can be useful to set a lifecycle policy on a bucket and use that as a temp location for this data.
jdbcdriver No Determined by the JDBC URL's subprotocol The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL's subprotocol.
diststyle No Even The Redshift Distribution Style to use when creating a table. Valid options are EVEN, KEY or ALL. When using KEY, you must also set a distribution key with the distkey option.
distkey No, unless using DISTSTYLE_KEY N/A The name of a column in the table to use as the distribution key when creating a table.
sortkeyspec No N/A A full Redshift Sort Key definition.
include_column_list No False Indicates whether this library should automatically extract the columns from the schema and add them to the COPY command according to the Column mapping options.
description No N/A A description for the table. The description is set with the SQL COMMENT command, and appears in most query tools. See the description metadata to set descriptions on individual columns.
preactions No N/A A semicolon-delimited list of SQL commands to be run before the COPY command loads data. It might be useful to run DELETE commands or similar before loading new data. If the command contains %s, the table name is substituted in before it runs (in case you're using a staging table). If this command fails, it is treated as an exception. If you're using a staging table and the preactions fail, the changes are reverted and the backup table is restored.
extracopyoptions No N/A

A list of extra options to append to the Redshift COPY command when loading data (such as TRUNCATECOLUMNS or MAXERROR n). See Optional parameters for a full list of available parameters.

Note that since these options are appended to the end of the COPY command, you can only use options that make sense at the end of the command. That should cover most possible use cases.

sse_kms_key No N/A The Amazon KMS key ID to use for server-side encryption in S3 during the Redshift UNLOAD operation rather than the Amazon default encryption. The Redshift IAM role must have access to the KMS key for writing with it, and the Spark IAM role must have access to the key for read operations. Reading the encrypted data requires no changes (Amazon handles this) as long as Spark's IAM role has the proper access.
tempformat No AVRO The format in which to save temporary files in Amazon S3 when writing to Redshift. Valid values are AVRO, CSV, and CSV GZIP (compressed CSV).
csvnullstring (experimental) No Null The string value to write for nulls when using the CSV tempformat. This should be a value that does not appear in your actual data.
autopushdown No True Indicates whether to apply predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. The operations are translated into a SQL query and then run in Redshift to improve performance.
autopushdown.s3_result_cache No False Caches the mapping from the query SQL to the Amazon S3 path of the unloaded data in memory, so that the same query doesn't need to run again in the same Spark session. Only supported when autopushdown is turned on. We do not recommend using this parameter when mixing read and write operations, because cached results might contain stale information.
unload_s3_format No Parquet The format with which to unload query results. Valid options are Parquet and Text, which specifies to unload query results in the pipe-delimited text format.
extraunloadoptions No N/A Extra options to append to the Redshift UNLOAD command. Not all options are guaranteed to work as some options might conflict with other options set within the connector.
copydelay No 30000 The delay (in ms) between retries for Redshift COPY operations.
copyretrycount No 2 The number of times to retry Redshift COPY operations.
tempdir_region No N/A

The Amazon Region where tempdir is located. Setting this option improves connector performance for interactions with tempdir and automatically supplies this value as part of the COPY and UNLOAD operations during the connector's read and write operations.

This setting is recommended in the following situations:

1) When the connector is running outside of Amazon, as automatic Region discovery will fail and negatively affect connector performance.

2) When tempdir is in a different Region than the Redshift cluster, as using this setting alleviates the need to supply the Region manually using the extracopyoptions and extraunloadoptions parameters. tempdir can't be in a different Region than the Redshift cluster when using PARQUET as the tempformat, even when using this parameter.

3) When the connector is running in a different Region than tempdir, as it improves the connector's access performance of tempdir.

secret.id No N/A The name or ARN of your secret stored in Amazon Secrets Manager. You can use this parameter to automatically supply Redshift credentials, but only if the user, password, and DbUser credentials are not passed into the JDBC URL or as other options.
secret.region No N/A

The primary Amazon Region, such as US East (N. Virginia), to search for the secret.id value.

If you don't specify this Region, the connector will try to use the Default credential provider chain to resolve the Region of the secret.id. In some cases, such as if you're using the connector outside of Amazon, the connector will not be able to find the Region. We recommend using this setting in the following situations:

1) When the connector is running outside of Amazon, as automatic Region discovery will fail and prevent authentication with Redshift.

2) When the connector is running in a different Region than secret.id, as it improves the connector's access performance of the secret.

secret.vpcEndpointUrl No N/A The PrivateLink DNS endpoint URL for Amazon Secrets Manager when overriding the Default credential provider chain.
secret.vpcEndpointRegion No N/A The PrivateLink DNS endpoint Region for Amazon Secrets Manager when overriding the Default credential provider chain.
jdbc.* No N/A Additional parameters to pass to the underlying JDBC driver, where the wildcard is the name of the JDBC parameter, such as jdbc.ssl. Note that the jdbc prefix is removed before the parameter is passed to the JDBC driver. To see all of the possible options for the Redshift JDBC driver, see Options for JDBC driver version 2.1 configuration.
label No " " An identifier to include in the query group set when running queries with the connector. Must be 100 or fewer characters, and all characters must be valid unicodeIdentifierParts. If your identifier has more than 100 characters, the excess will be removed. When running a query with the connector, the query group will be set as a JSON format string, such as {"spark-redshift-connector":{"svc":"","ver":"5.1.0-amzn-1-spark_3.3","op":"Read","lbl":""}}. This option replaces the value of the lbl key.
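To show how several of these parameters fit together, the following is a minimal write sketch. The data source name, JDBC URL, credentials, IAM role ARN, bucket, and table names are placeholders and assumptions for illustration, not values from this documentation:

// Sketch only: every identifier below is a placeholder.
df.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", "public.sales_staging")
  .option("tempdir", "s3://amzn-s3-demo-bucket/temp/")
  .option("tempdir_region", "us-east-1")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy-role")
  .option("tempformat", "CSV GZIP")
  .option("preactions", "DELETE FROM public.sales_staging")
  .option("jdbc.ssl", "true")
  .mode("error")
  .save()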
Note

Acknowledgement: This documentation contains sample code and language developed by the Apache Software Foundation licensed under the Apache 2.0 license.