Other configuration options
Change the maximum size of string columns
When creating tables, Redshift creates string columns as text columns, which are stored as VARCHAR(256). If you want columns that support larger sizes, you can use maxlength to specify the maximum length of string columns. The following is an example of how to specify maxlength.
columnLengthMap.foreach { case (colName, length) =>
  val metadata = new MetadataBuilder().putLong("maxlength", length).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}
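For context, here is a minimal, self-contained sketch of the same pattern. The contents of columnLengthMap, the source table name, and the active SparkSession named spark are assumptions for illustration, not values required by the connector.

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical mapping of column names to the VARCHAR length each one needs.
val columnLengthMap: Map[String, Long] = Map(
  "comment"  -> 2048L,
  "raw_json" -> 8192L
)

// Assumes an active SparkSession named `spark` (for example, in spark-shell);
// the table name is a placeholder source.
var df = spark.table("staging_events")

// Attach maxlength metadata so the connector creates VARCHAR(n) instead of VARCHAR(256).
columnLengthMap.foreach { case (colName, length) =>
  val metadata = new MetadataBuilder().putLong("maxlength", length).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}
// `df` can now be written to Redshift as usual; see the end-to-end sketch
// after the parameter table below.
```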
Setting a column type
To set a column type, use the redshift_type field.
columnTypeMap.foreach { case (colName, colType) =>
  val metadata = new MetadataBuilder().putString("redshift_type", colType).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}
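The values are plain Redshift column type strings. For illustration, a hypothetical columnTypeMap used with the snippet above might look like the following; the column names and types are assumptions.

```scala
// Hypothetical Redshift types keyed by column name, for use with the snippet above.
val columnTypeMap: Map[String, String] = Map(
  "price"      -> "DECIMAL(18,4)",
  "event_time" -> "TIMESTAMP",
  "country"    -> "CHAR(2)"
)
```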
Setting a compression encoding on a column
To use a specific compression encoding on a column, use the encoding field. For a full list of supported compression encodings, see Compression encodings.
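No snippet is shown above, but the pattern mirrors the maxlength and redshift_type examples. The sketch below assumes a hypothetical columnEncodingMap, an active SparkSession named spark, and ZSTD and LZO purely as illustrative encodings.

```scala
import org.apache.spark.sql.types.MetadataBuilder

// Hypothetical mapping of column names to Redshift compression encodings.
val columnEncodingMap: Map[String, String] = Map(
  "comment"    -> "ZSTD",
  "event_type" -> "LZO"
)

var df = spark.table("staging_events") // placeholder source

// Attach encoding metadata so the connector applies a compression encoding per column.
columnEncodingMap.foreach { case (colName, enc) =>
  val metadata = new MetadataBuilder().putString("encoding", enc).build()
  df = df.withColumn(colName, df(colName).as(colName, metadata))
}
```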
Setting a description for a column
To set a description, use the description field.
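As with the other fields, the description is attached as column metadata. The sketch below assumes an existing DataFrame, an active SparkSession named spark, and a hypothetical comment column.

```scala
import org.apache.spark.sql.types.MetadataBuilder

var df = spark.table("staging_events") // placeholder source

// Attach description metadata so the connector sets a description on this column in Redshift.
val metadata = new MetadataBuilder()
  .putString("description", "Free-text comment entered by the user")
  .build()
df = df.withColumn("comment", df("comment").as("comment", metadata))
```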
Authentication between Redshift and Amazon S3
By default, the result is unloaded to Amazon S3 in the Parquet format. To unload the result as a pipe-delimited text file, specify the following option.
.option("unload_s3_format", "TEXT")
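For context, a read that applies this option might look like the following sketch. The data source name, JDBC URL, S3 path, and IAM role ARN are placeholders, not values prescribed by the connector.

```scala
// Assumes an active SparkSession named `spark`; all identifiers below are placeholders.
val df = spark.read
  .format("io.github.spark_redshift_community.spark.redshift") // assumed data source name
  .option("url", "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev?user=admin&password=Password1")
  .option("dbtable", "public.events")
  .option("tempdir", "s3://amzn-s3-demo-bucket/redshift-temp/")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/MyRedshiftRole")
  .option("unload_s3_format", "TEXT") // unload as pipe-delimited text instead of Parquet
  .load()
```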
Connector parameters
The parameter map or OPTIONS in Spark SQL supports the following settings.
Parameter | Required | Default | Description |
---|---|---|---|
dbtable | Yes, unless query is specified | N/A | The table to create or read from in Redshift. This parameter is required when saving data back to Redshift. |
query | Yes, unless dbtable is specified | N/A | The query to read from in Redshift. |
user | No | N/A | The Redshift username. Must be used with the password parameter. Valid only if the user and password are not passed in the URL; using both causes an error. |
password | No | N/A | The Redshift password. Must be used with the user parameter. Valid only if the user and password are not passed in the URL; using both causes an error. |
url | No | N/A | A JDBC URL, in the format jdbc:subprotocol://host:port/database?user=username&password=password. The subprotocol can be postgresql or redshift, depending on which JDBC driver you have loaded. Note that one Redshift-compatible driver must be on the classpath and match this URL. The host and port should point to the Redshift master node, so you must configure security groups and/or VPC to allow access from your driver application. The database is the Redshift database name. The user and password are credentials to access the database; they must be embedded in this URL for JDBC, and the database user must have the necessary permissions to access the table. |
aws_iam_role | Only if using IAM roles to authorize Redshift COPY/UNLOAD operations | N/A | Fully specified ARN of the IAM Role attached to the Redshift cluster. |
forward_spark_s3_credentials | No | False | Indicates whether this library should automatically discover the credentials that Spark uses to connect to Amazon S3, and whether to forward those credentials to Redshift over the JDBC driver. These credentials are sent as part of the JDBC query. Therefore, we recommend that you enable SSL encryption on the JDBC connection when using this option. |
temporary_aws_access_key_id | No | N/A | Amazon access key. Must have write permissions to the S3 bucket. |
temporary_aws_secret_access_key | No | N/A | Amazon secret access key corresponding to the access key. |
temporary_aws_session_token | No | N/A | Amazon session token corresponding to provided access key. |
tempdir | No | N/A | A writeable location in Amazon S3. Used for unloaded data when reading, and for Avro data to be loaded into Redshift when writing. If you're using a Redshift data source for Spark as part of a regular ETL pipeline, it can be useful to set a Lifecycle Policy on a bucket and use that as a temp location for this data. |
jdbcdriver | No | Determined by the JDBC URL's subprotocol | The class name of the JDBC driver to use. This class must be on the classpath. In most cases, it should not be necessary to specify this option, as the appropriate driver classname should automatically be determined by the JDBC URL's subprotocol. |
diststyle | No | EVEN | The Redshift distribution style to use when creating a table. Valid options are EVEN, KEY, or ALL. When using KEY, you must also set a distribution key with the distkey option. |
distkey | No, unless using DISTSTYLE KEY | N/A | The name of a column in the table to use as the distribution key when creating a table. |
sortkeyspec | No | N/A | A full Redshift Sort Key definition. |
include_column_list | No | False | Indicates whether this library should automatically extract the columns from the schema and add them to the COPY command according to the Column mapping options. |
description | No | N/A | A description for the table. The description is set with the SQL COMMENT command and appears in most query tools. See the description metadata to set descriptions on individual columns. |
preactions | No | N/A | A semicolon-delimited list of SQL commands to run before the COPY command. It might be useful to run DELETE commands or similar before loading new data. If the command contains %s, the table name is formatted in before it runs (in case you're using a staging table). If a command fails, it is treated as an exception. If you're using a staging table and the preactions fail, the changes are reverted and the backup table is restored. |
extracopyoptions | No | N/A | A list of extra options to append to the Redshift COPY command when loading data (such as TRUNCATECOLUMNS or MAXERROR n). Note that since these options are appended to the end of the COPY command, only options that make sense at the end of the command can be used; that should cover most possible use cases. |
sse_kms_key | No | N/A | The Amazon KMS key ID to use for server-side encryption in S3 during the Redshift UNLOAD operation rather than the Amazon default encryption. The Redshift IAM role must have access to the KMS key for writing with it, and the Spark IAM role must have access to the key for read operations. Reading the encrypted data requires no changes (Amazon handles this) as long as Spark's IAM role has the proper access. |
tempformat (experimental) | No | AVRO | The format in which to save temporary files in Amazon S3 when writing to Redshift. Valid values are AVRO, CSV, and CSV GZIP. Redshift is faster when loading CSV than when loading Avro files, so using that tempformat might yield better performance when writing to Redshift. |
csvnullstring (experimental) | No | @NULL@ | The string value to write for nulls when using the CSV tempformat. This should be a value that does not appear in your actual data. |
autopushdown | No | True | Indicates whether to apply predicate and query pushdown by capturing and analyzing the Spark logical plans for SQL operations. The operations are translated into a SQL query and then run in Redshift to improve performance. |
autopushdown.s3_result_cache | No | False | Caches, in memory, the mapping from the query SQL to the Amazon S3 path of the unloaded data, so that the same query doesn't need to run again in the same Spark session. Only supported when autopushdown is turned on. We do not recommend using this parameter when mixing read and write operations, because cached results might contain stale information. |
unload_s3_format | No | Parquet | The format in which to unload query results. Valid options are Parquet and Text; the Text option unloads query results in the pipe-delimited text format. |
extraunloadoptions | No | N/A | Extra options to append to the Redshift UNLOAD command. Not all options are guaranteed to work as some options might conflict with other options set within the connector. |
copydelay | No | 30000 | The delay (in ms) between retries for Redshift COPY operations. |
copyretrycount | No | 2 | The number of times to retry Redshift COPY operations. |
tempdir_region | No | N/A | The Amazon Region where tempdir is located. This setting is recommended in the following situations: 1) When the connector is running outside of Amazon, because automatic Region discovery will fail and negatively affect connector performance. 2) When tempdir is in a different Region than the Redshift cluster, because using this setting avoids the need to supply the Region manually with the extracopyoptions and extraunloadoptions parameters. 3) When the connector is running in a different Region than tempdir, because it improves the connector's S3 access performance. |
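To tie the parameters together, here is a hedged end-to-end sketch that reads the result of a query and writes a new table back. The data source name, cluster endpoint, credentials, S3 paths, and IAM role ARN are placeholders to adapt; only options listed in the table above are used.

```scala
// Assumes an active SparkSession named `spark`; every identifier below is a placeholder.
val dataSource = "io.github.spark_redshift_community.spark.redshift" // assumed data source name
val jdbcUrl    = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev"
val tempDir    = "s3://amzn-s3-demo-bucket/redshift-temp/"
val iamRole    = "arn:aws:iam::123456789012:role/MyRedshiftRole"

// Read the result of a query (rather than a whole table).
val salesByRegion = spark.read
  .format(dataSource)
  .option("url", jdbcUrl)
  .option("user", "admin")          // supply user/password here OR in the URL, never both
  .option("password", "Password1")
  .option("query", "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
  .option("tempdir", tempDir)
  .option("tempdir_region", "us-east-1")
  .option("aws_iam_role", iamRole)
  .load()

// Write it back with an explicit distribution style, sort key, and a preaction
// (the DELETE preaction assumes public.sales_by_region already exists).
salesByRegion.write
  .format(dataSource)
  .option("url", jdbcUrl)
  .option("user", "admin")
  .option("password", "Password1")
  .option("dbtable", "public.sales_by_region")
  .option("tempdir", tempDir)
  .option("tempdir_region", "us-east-1")
  .option("aws_iam_role", iamRole)
  .option("diststyle", "KEY")
  .option("distkey", "region")
  .option("sortkeyspec", "SORTKEY(region)")
  .option("preactions", "DELETE FROM public.sales_by_region")
  .mode("append")
  .save()
```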
Acknowledgement: This documentation contains sample code and language developed by the Apache Software Foundation.