Other configuration options
On this page, you can find descriptions for the options that you can specify for the Amazon Redshift Spark connector.
Maximum size of string columns
Redshift creates string columns as text columns when creating tables, which are stored as VARCHAR(256). If you want columns that support larger sizes, you can use maxlength to specify the maximum length of string columns. The following is an example of how to specify maxlength.
import org.apache.spark.sql.types.MetadataBuilder

// columnLengthMap maps each column name to its maximum string length.
// df must be declared as a var so that it can be reassigned.
columnLengthMap.foreach { case (colName, length) =>
  val metadata = new MetadataBuilder().putLong("maxlength", length).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}
Column type
To set a column type, use the redshift_type field.
// columnTypeMap maps each column name to the desired Redshift column type.
columnTypeMap.foreach { case (colName, colType) =>
  val metadata = new MetadataBuilder().putString("redshift_type", colType).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}
Compression encoding on a column
To use a specific compression encoding on a column, use the encoding field. For a full list of supported compression encodings, see Compression encodings.
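The following sketch follows the same column-metadata pattern as the examples above. The encodingMap name is a hypothetical map from column names to encoding names, such as lzo or zstd.

// Hypothetical encodingMap: column name -> compression encoding (for example, "lzo" or "zstd").
encodingMap.foreach { case (colName, colEncoding) =>
  val metadata = new MetadataBuilder().putString("encoding", colEncoding).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}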
Description for a column
To set a description, use the description field.
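A minimal sketch, again following the column-metadata pattern above. The descriptionMap name is a hypothetical map from column names to description strings.

// Hypothetical descriptionMap: column name -> description text.
descriptionMap.foreach { case (colName, colDescription) =>
  val metadata = new MetadataBuilder().putString("description", colDescription).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}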
Unload data format
By default, the result is unloaded to Amazon S3 in the Parquet format. To unload the result as a pipe-delimited text file, specify the following option.
.option("unload_s3_format", "TEXT")
Pushdown statements
Parameter | Required | Default | Description |
---|---|---|---|
spark.datasource.redshift.community.autopushdown.lazyMode | No | True | Specifies whether the connector should lazily run pushdown statements in Redshift. If true, the Spark connector retrieves all of the related models and information before running the query, which generally yields better performance. If false, the Spark connector runs pushdown statements immediately in the main Spark driver thread and is serialized across expressions. |
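Because this parameter is a Spark configuration property, one way to set it (a sketch with an assumed SparkSession setup) is on the SparkSession builder. The application name is a placeholder, and the property can also be passed with --conf at submit time.

import org.apache.spark.sql.SparkSession

// Sketch: turn off lazy pushdown for the whole session.
val spark = SparkSession.builder()
  .appName("redshift-connector-example")
  .config("spark.datasource.redshift.community.autopushdown.lazyMode", "false")
  .getOrCreate()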
Connector parameters
The parameter map or OPTIONS in Spark SQL supports the settings in the following table.
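Before the table, for orientation, here is a sketch of where these parameters go: either with .option() on a DataFrame reader, or through an OPTIONS clause in Spark SQL. The data source name (io.github.spark_redshift_community.spark.redshift) and all connection values are placeholder assumptions, not values from this page.

// Sketch: passing connector parameters with .option(); assumes an existing SparkSession named spark.
val df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev?user=username&password=password")
  .option("dbtable", "my_table")
  .option("tempdir", "s3://amzn-s3-demo-bucket/temp/")
  .load()

// Sketch: the same parameters supplied through an OPTIONS clause in Spark SQL.
spark.sql("""
  CREATE TABLE my_table_spark
  USING io.github.spark_redshift_community.spark.redshift
  OPTIONS (
    url 'jdbc:redshift://redshift-host:5439/dev?user=username&password=password',
    dbtable 'my_table',
    tempdir 's3://amzn-s3-demo-bucket/temp/'
  )
""")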
Parameter | Required | Default | Description |
---|---|---|---|
dbtable | Yes, unless query is specified | N/A | The table to create or read from in Redshift. This parameter is required when saving data back to Redshift. |
query | Yes, unless dbtable is specified | N/A | The query to read from in Redshift. |
user | No | N/A | The Redshift username. Must be used with the password parameter. Valid only if the user and password are not parameters in the URL. Using both will cause an error. |
password | No | N/A | The Redshift password. Must be used with the user parameter. Valid only if the user and password are not parameters in the URL. Using both will cause an error. |
url | No | N/A | A JDBC URL. The format is jdbc:subprotocol://host:port/database?user=username&password=password. Subprotocol can be postgresql or redshift, depending on which JDBC driver you have loaded. Note that one Redshift-compatible driver must be in the classpath and match this URL. Host and port should point to the Redshift master node, so you must configure security groups and/or VPC to allow access from your driver application. Database is the Redshift database name. User and password are credentials to access the database, which must be embedded in this URL for JDBC, and the database user must have the necessary permissions to access the table. |
aws_iam_role | Only if using IAM roles to authorize Redshift COPY/UNLOAD operations | N/A | Fully specified ARN of the IAM Role attached to the Redshift cluster. |
forward_spark_s3_credentials | No | False | Indicates whether this library should automatically discover the credentials that Spark uses to connect to Amazon S3, and whether to forward those credentials to Redshift over the JDBC driver. These credentials are sent as part of the JDBC query. Therefore, we recommend that you enable SSL encryption on the JDBC connection when using this option. |
temporary_aws_access_key_id | No | N/A | Amazon access key. Must have write permissions to the S3 bucket. |
temporary_aws_secret_access_key | No | N/A | Amazon secret access key corresponding to the access key. |
temporary_aws_session_token | No | N/A | Amazon session token corresponding to provided access key. |
tempdir | No | N/A | A writeable location in Amazon S3. Used for unloading data when reading, and for Avro data to be loaded into Redshift when writing. If you're using a Redshift data source for Spark as part of a regular ETL pipeline, it can be useful to set a lifecycle policy on a bucket and use that as a temp location for this data. |
jdbcdriver | No | Determined by the JDBC URL's subprotocol | The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL's subprotocol. |
diststyle | No | Even | The Redshift Distribution Style to use when creating a table. Valid options are EVEN, KEY or ALL. When using KEY, you must also set a distribution key with the distkey option. |
distkey | No, unless using DISTSTYLE KEY | N/A | The name of a column in the table to use as the distribution key when creating a table. |
sortkeyspec | No | N/A | A full Redshift Sort Key definition. |
include_column_list | No | False | Indicates whether this library should automatically extract the columns from the schema and add them to the COPY command according to the Column mapping options. |
description | No | N/A | A description for the table. The description is set with the SQL COMMENT command and appears in most query tools. See the description metadata to set descriptions on individual columns. |
preactions | No | N/A | A semicolon-delimited list of SQL commands to run before the COPY command loads data. It might be useful to run DELETE commands or similar before loading new data. If a command contains %s, the table name is substituted in before it runs (in case you're using a staging table). If a command fails, it is treated as an exception. If you're using a staging table and the preactions fail, the changes are reverted and the backup table is restored. |
extracopyoptions | No | N/A | A list of extra options to append to the Redshift COPY command when loading data. Note that because these options are appended to the end of the COPY command, only options that make sense at the end of the command can be used. |
sse_kms_key | No | N/A | The Amazon KMS key ID to use for server-side encryption in S3 during the Redshift UNLOAD operation rather than the Amazon default encryption. The Redshift IAM role must have access to the KMS key for writing with it, and the Spark IAM role must have access to the key for read operations. Reading the encrypted data requires no changes (Amazon handles this) as long as Spark's IAM role has the proper access. |
tempformat | No | AVRO | The format in which to save temporary files in Amazon S3 when writing to Redshift. Valid values are AVRO, CSV, and CSV GZIP (compressed CSV). |
csvnullstring (experimental) | No | Null | The string value to write for nulls when using the CSV tempformat. This should be a value that does not appear in your actual data. |
autopushdown | No | True | Indicates whether to apply predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. The operations are translated into a SQL query and then run in Redshift to improve performance. |
autopushdown.s3_result_cache | No | False | Caches the mapping from the query SQL to the Amazon S3 path of the unloaded data in memory, so that the same query doesn't need to run again in the same Spark session. Only supported when autopushdown is turned on. We don't recommend using this parameter when mixing read and write operations, because cached results might contain stale information. |
unload_s3_format | No | Parquet | The format with which to unload query results. Valid options are Parquet and Text; Text unloads query results in the pipe-delimited text format. |
extraunloadoptions | No | N/A | Extra options to append to the Redshift UNLOAD command. Not all options are guaranteed to work as some options might conflict with other options set within the connector. |
copydelay | No | 30000 | The delay (in ms) between retries for Redshift COPY operations. |
copyretrycount | No | 2 | The number of times to retry Redshift COPY operations. |
tempdir_region | No | N/A | The Amazon Region where tempdir is located. This setting is recommended in the following situations: 1) When the connector is running outside of Amazon, as automatic Region discovery will fail and negatively affect connector performance. 2) When tempdir is in a different Region than the Redshift cluster. 3) When the connector is running in a different Region than tempdir. |
secret.id | No | N/A | The name or ARN of your secret stored in Amazon Secrets Manager. You can use this parameter to automatically supply Redshift credentials, but only if the user, password, and DbUser credentials are not passed into the JDBC URL or as other options. |
secret.region | No | N/A | The primary Amazon Region, such as US East (N. Virginia), to search for the secret.id value. If you don't specify this Region, the connector will try to use the Default credential provider chain to resolve the Region of the secret.id. This setting is recommended in the following situations: 1) When the connector is running outside of Amazon, as automatic Region discovery will fail and prevent authentication with Redshift. 2) When the connector is running in a different Region than secret.id. |
secret.vpcEndpointUrl | No | N/A | The PrivateLink DNS endpoint URL for Amazon Secrets Manager when overriding the Default credential provider chain. |
secret.vpcEndpointRegion | No | N/A | The PrivateLink DNS endpoint Region for Amazon Secrets Manager when overriding the Default credential provider chain. |
jdbc.* | No | N/A | Additional parameters to pass to the underlying JDBC driver, where the wildcard is the name of the JDBC parameter, such as jdbc.ssl. Note that the jdbc prefix will be removed before it is passed to the JDBC driver. To see all of the possible options for the Redshift JDBC driver, see Options for JDBC driver version 2.1 configuration. |
label | No | " " | An identifier to include in the query group set when running queries with the connector. Must be 100 or fewer characters, and all characters must be valid unicodeIdentifierParts. If your identifier has more than 100 characters, the excess will be removed. When running a query with the connector, the query group is set as a JSON format string. This option replaces the value of the lbl key. |
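To tie several of these parameters together, the following write sketch combines a few of them. Every value shown (the data source name, the JDBC URL, the IAM role ARN, the bucket, and the table and column names) is a placeholder assumption rather than a value from this page.

// Sketch: writing a DataFrame to Redshift using several of the parameters described above.
// Assumes an existing DataFrame df and SparkSession; all identifiers are placeholders.
df.write
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-host:5439/dev?user=username&password=password")
  .option("dbtable", "sales_by_region")
  .option("tempdir", "s3://amzn-s3-demo-bucket/temp/")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-copy-unload-role")
  .option("diststyle", "KEY")
  .option("distkey", "region_id")
  .option("preactions", "DELETE FROM sales_by_region WHERE region_id = 5")
  .mode("append")
  .save()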
Note
Acknowledgement: This documentation contains sample code and language developed by the Apache Software Foundation.