Other configuration options - Amazon Redshift

Other configuration options

Change the maximum size of string columns

When creating tables, Redshift creates string columns as text columns, which are stored as VARCHAR(256). If you want columns that support larger sizes, you can use maxlength to specify the maximum length of string columns. The following is an example of how to specify maxlength.

import org.apache.spark.sql.types.MetadataBuilder

columnLengthMap.foreach { case (colName, length) =>
  val metadata = new MetadataBuilder().putLong("maxlength", length).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}

Setting a column type

To set a column type, use the redshift_type field.

columnTypeMap.foreach { case (colName, colType) =>
  val metadata = new MetadataBuilder().putString("redshift_type", colType).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}

Setting a compression encoding on a column

To use a specific compression encoding on a column, use the encoding field. For a full list of supported compression encodings, see Compression encodings.
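For example, a minimal sketch that follows the same metadata pattern as the snippets above; the column name and the ZSTD encoding are illustrative choices, not requirements:

// Illustrative sketch: "quantity" and ZSTD are example values only.
val encodingMetadata = new MetadataBuilder().putString("encoding", "ZSTD").build()
df = df.withColumn("quantity", df("quantity").as("quantity", encodingMetadata))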

Setting a description for a column

To set a description, use the description field.
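For example, a minimal sketch that follows the same metadata pattern; the column name and description text are placeholders:

// Illustrative sketch: the column name and comment text are placeholders.
val descriptionMetadata = new MetadataBuilder().putString("description", "Total quantity ordered").build()
df = df.withColumn("quantity", df("quantity").as("quantity", descriptionMetadata))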

Authentication between Redshift and Amazon S3

By default, the result is unloaded to Amazon S3 in the parquet format. To unload the result as a pipe-delimited text file, specify the following option.

.option("unload_s3_format", "TEXT")

Lazily run pushdown statements

Parameter Required Default Description
spark.datasource.redshift.community.autopushdown.lazyMode No True

Specifies whether the connector should lazily run pushdown statements in Redshift.

If true, the Spark connector retrieves all of the related models and information before running the query, which generally yields better performance.

If false, the Spark connector runs pushdown statements immediately in the main Spark driver thread, and they are serialized across expressions.
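One way to set this property is through the Spark session configuration; a minimal sketch (the value shown is the default):

// Enable lazy pushdown mode for the current Spark session.
spark.conf.set("spark.datasource.redshift.community.autopushdown.lazyMode", "true")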

Connector parameters

The parameter map or OPTIONS in Spark SQL supports the following settings.

Parameter Required Default Description
dbtable Yes, unless query is specified N/A The table to create or read from in Redshift. This parameter is required when saving data back to Redshift.
query Yes, unless dbtable is specified N/A The query to read from in Redshift.
user No N/A The Redshift username. Must be used with the password parameter. Valid only if the user and password are not parameters in the URL. Using both will cause an error.
password No N/A The Redshift password. Must be used with the user parameter. Valid only if the user and password are not parameters in the URL. Using both will cause an error.
url No N/A

A JDBC URL. The format is jdbc:subprotocol://host:port/database?user=username&password=password.

The subprotocol can be postgresql or redshift, depending on which JDBC driver you have loaded. Note that a Redshift-compatible driver must be on the classpath and match this URL.

Host and port should point to the Redshift master node, so you must configure security groups and/or VPC to allow access from your driver application.

Database is the Redshift database name.

User and password are credentials to access the database, which must be embedded in this URL for JDBC, and the database user must have the necessary permissions to access the table.

aws_iam_role Only if using IAM roles to authorize Redshift COPY/UNLOAD operations N/A Fully specified ARN of the IAM Role attached to the Redshift cluster.
forward_spark_s3_credentials No False Indicates whether this library should automatically discover the credentials that Spark uses to connect to Amazon S3, and whether to forward those credentials to Redshift over the JDBC driver. These credentials are sent as part of the JDBC query. Therefore, we recommend that you enable SSL encryption on the JDBC connection when using this option.
temporary_aws_access_key_id No N/A Amazon access key. Must have write permissions to the S3 bucket.
temporary_aws_secret_access_key No N/A Amazon secret access key corresponding to the access key.
temporary_aws_session_token No N/A Amazon session token corresponding to provided access key.
tempdir No N/A A writeable location in Amazon S3. Used for unloaded data when reading and for Avro data to be loaded into Redshift when writing. If you're using a Redshift data source for Spark as part of a regular ETL pipeline, it can be useful to set a lifecycle policy on a bucket and use that as a temp location for this data.
jdbcdriver No Determined by the JDBC URL's subprotocol The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL's subprotocol.
diststyle No Even The Redshift Distribution Style to use when creating a table. Valid options are EVEN, KEY or ALL. When using KEY, you must also set a distribution key with the distkey option.
distkey No, unless using DISTSTYLE_KEY N/A The name of a column in the table to use as the distribution key when creating a table.
sortkeyspec No N/A A full Redshift Sort Key definition.
include_column_list No False Indicates whether this library should automatically extract the columns from the schema and add them to the COPY command according to the Column mapping options.
description No N/A A description for the table. The description is set with the SQL COMMENT command, and appears in most query tools. See the description metadata to set descriptions on individual columns.
preactions No N/A A semicolon-delimited list of SQL commands to be run before the COPY command loads data. It might be useful to run DELETE commands or similar before loading new data. If the command contains %s, the table name is substituted in before it runs (in case you're using a staging table). If this command fails, it is treated as an exception. If you're using a staging table and the preactions fail, the changes are reverted and the backup table is restored.
extracopyoptions No N/A

A list of extra options to append to the Redshift COPY command when loading data (such as TRUNCATECOLUMNS or MAXERROR n). See Optional parameters for a full list of available parameters.

Note that since these options are appended to the end of the COPY command, you can only use options that make sense at the end of the command. That should cover most possible use cases.

sse_kms_key No N/A The Amazon KMS key ID to use for server-side encryption in S3 during the Redshift UNLOAD operation rather than the Amazon default encryption. The Redshift IAM role must have access to the KMS key for writing with it, and the Spark IAM role must have access to the key for read operations. Reading the encrypted data requires no changes (Amazon handles this) as long as Spark's IAM role has the proper access.
tempformat No AVRO The format in which to save temporary files in Amazon S3 when writing to Redshift. Valid values are AVRO, CSV, and CSV GZIP (compressed CSV).
csvnullstring (experimental) No Null The string value to write for nulls when using the CSV tempformat. This should be a value that does not appear in your actual data.
autopushdown No True Indicates whether to apply predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. The operations are translated into a SQL query and then run in Redshift to improve performance.
autopushdown.s3_result_cache No False Caches the mapping from the query SQL to the Amazon S3 path of the unloaded data in memory, so that the same query doesn't need to run again in the same Spark session. Only supported when autopushdown is turned on. We do not recommend using this parameter when mixing read and write operations, because cached results might contain stale information.
unload_s3_format No Parquet The format with which to unload query results. Valid options are Parquet and Text, which specifies to unload query results in the pipe-delimited text format.
extraunloadoptions No N/A Extra options to append to the Redshift UNLOAD command. Not all options are guaranteed to work as some options might conflict with other options set within the connector.
copydelay No 30000 The delay (in ms) between retries for Redshift COPY operations.
copyretrycount No 2 The number of times to retry Redshift COPY operations.
tempdir_region No N/A

The Amazon Region where tempdir is located. Setting this option improves connector performance for interactions with tempdir and automatically supplies this value as part of the COPY and UNLOAD operations during the connector's read and write operations.

This setting is recommended in the following situations:

1) When the connector is running outside of Amazon, as automatic Region discovery will fail and negatively affect connector performance.

2) When tempdir is in a different Region than the Redshift cluster, as using this setting alleviates the need to supply the Region manually using the extracopyoptions and extraunloadoptions parameters. tempdir can't be in a different Region than the Redshift cluster when using PARQUET as the tempformat, even when using this parameter.

3) When the connector is running in a different Region than tempdir, as it improves the connector's access performance of tempdir.

secret.id No N/A The name or ARN of your secret stored in Amazon Secrets Manager. You can use this parameter to automatically supply Redshift credentials, but only if the user, password, and DbUser credentials are not passed into the JDBC URL or as other options.
secret.region No N/A

The primary Amazon Region, such as US East (N. Virginia), to search for the secret.id value.

If you don't specify this Region, the connector will try to use the Default credential provider chain to resolve the Region of the secret.id. In some cases, such as if you're using the connector outside of Amazon, the connector will not be able to find the Region. We recommend using this setting in the following situations:

1) When the connector is running outside of Amazon, as automatic Region discovery will fail and prevent authentication with Redshift.

2) When the connector is running in a different Region than secret.id, as it improves the connector's access performance of the secret.

secret.vpcEndpointUrl No N/A The PrivateLink DNS endpoint URL for Amazon Secrets Manager when overriding the Default credential provider chain.
secret.vpcEndpointRegion No N/A The PrivateLink DNS endpoint Region for Amazon Secrets Manager when overriding the Default credential provider chain.
jdbc.* No N/A Additional parameters to pass to the underlying JDBC driver, where the wildcard is the name of the JDBC parameter, such as jdbc.ssl. Note that the jdbc prefix is removed before the parameter is passed to the JDBC driver. To see all of the possible options for the Redshift JDBC driver, see Options for JDBC driver version 2.1 configuration.
label No " " An identifier to include in the query group set when running queries with the connector. Must be 100 or fewer characters, and all characters must be valid unicodeIdentifierParts. If your identifier has more than 100 characters, the excess will be removed. When running a query with the connector, the query group will be set as a JSON format string, such as {"spark-redshift-connector":{"svc":"","ver":"5.1.0-amzn-1-spark_3.3","op":"Read","lbl":""}}. This option replaces the value of the lbl key.
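To show how several of these parameters fit together, the following is a minimal write sketch. The data source name, JDBC URL, credentials, IAM role ARN, bucket, and table names are placeholders and assumptions for illustration, not values from this documentation:

// Sketch only: every identifier below is a placeholder.
df.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev")
  .option("user", "username")
  .option("password", "password")
  .option("dbtable", "public.sales_staging")
  .option("tempdir", "s3://amzn-s3-demo-bucket/temp/")
  .option("tempdir_region", "us-east-1")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-copy-role")
  .option("tempformat", "CSV GZIP")
  .option("preactions", "DELETE FROM public.sales_staging")
  .option("jdbc.ssl", "true")
  .mode("error")
  .save()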
Note

Acknowledgement: This documentation contains sample code and language developed by the Apache Software Foundation licensed under the Apache 2.0 license.