Use Spark on Amazon Redshift with a connector
With Amazon EMR release versions 6.4.0 and later, every Amazon EMR cluster created with Apache
Spark includes a connector between Spark and Amazon Redshift. This connector is
based on the spark-redshift
open-source connector and allows you to use
Spark on Amazon EMR to process data stored in Amazon Redshift.
Starting in Amazon EMR release version 6.6.0, you must use the --jars or --packages option to specify which of the following JAR files you want to use. The --jars option specifies dependencies stored locally, in HDFS, or over HTTP/S. To see other file locations supported by the --jars option, see Advanced Dependency Management in the Apache Spark documentation. The --packages option specifies dependencies stored in the public Maven repository.
- spark-redshift.jar
- spark-avro.jar
- RedshiftJDBC.jar
- minimal-json.jar
These JARs are installed by default on each cluster in all Amazon EMR release versions 6.4.0 and higher, but you don't need to specify them in versions 6.4.0 and 6.5.0. The following example shows how to launch a Spark application with the spark-redshift connector on versions 6.4.0 and 6.5.0.
spark-submit my_script.py
To launch a Spark application with the spark-redshift connector on Amazon EMR release version 6.6.0 or higher, you must use the --jars or --packages option, as the following example shows. Note that the paths listed with the --jars option are the default paths for the JAR files.
spark-submit \
  --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar,/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar \
  my_script.py
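If you prefer to resolve the dependencies from Maven instead of the local JAR paths, the --packages option takes Maven coordinates. The coordinates below are a sketch only: the group and artifact names and the version placeholders are assumptions, not values from this page, so check Maven Central for the artifacts that match your Spark and Scala versions.

spark-submit \
  --packages io.github.spark-redshift-community:spark-redshift_2.12:<version>,com.amazon.redshift:redshift-jdbc42:<version> \
  my_script.py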
To get started with this connector and learn about the supported parameters, see the README file in the spark-redshift GitHub repository. The repository also includes a tutorial.
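As a quick illustration of the data source API that the README documents, the following sketch writes an existing DataFrame (df) to an Amazon Redshift table. The format name and option names match the community connector used elsewhere on this page; the JDBC URL, table name, S3 path, and IAM role ARN are placeholder assumptions.

# Illustrative write sketch; the URL, table name, S3 path, and role ARN are placeholders.
df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=password") \
    .option("dbtable", "my_table_copy") \
    .option("tempdir", "s3://path/for/temp/data") \
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role") \
    .mode("error") \
    .save()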
Amazon EMR always reviews open-source code when importing it into your cluster. Due to security concerns, we don't support the following authentication methods from Spark to Amazon S3:
- Setting Amazon access keys in the hadoop-env configuration classification
- Encoding Amazon access keys in the tempdir URI
Considerations and limitations
- The parameter tempformat currently doesn't support the Parquet format.
- The tempdir URI points to an Amazon S3 location. This temp directory isn't cleaned up automatically, so it could add additional cost. We recommend using Amazon S3 lifecycle policies to define the retention rules for the Amazon S3 bucket.
- We recommend using Amazon S3 server-side encryption to encrypt the Amazon S3 buckets that you use.
- We recommend blocking public access to Amazon S3 buckets.
- We recommend that the Amazon Redshift cluster not be publicly accessible.
- We recommend enabling Amazon Redshift audit logging.
- We recommend enabling Amazon Redshift at-rest encryption.
- We recommend enabling SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.
- We recommend passing an IAM role with the parameter aws_iam_role for the Amazon Redshift authentication parameter.
- We recommend managing Amazon Redshift credentials (the username and password for the Amazon Redshift cluster) in Amazon Secrets Manager as a best practice. The following code sample shows how you can use Amazon Secrets Manager to retrieve credentials and connect to an Amazon Redshift cluster with pyspark:
from pyspark.sql import SQLContext
import boto3

sc = # existing SparkContext
sql_context = SQLContext(sc)

secretsmanager_client = boto3.client('secretsmanager')
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='string',
    VersionId='string',
    VersionStage='string'
)
username = # get username from secret_manager_response
password = # get password from secret_manager_response

url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

# Read data from a table
df = sql_context.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "my_table") \
    .option("tempdir", "s3://path/for/temp/data") \
    .load()
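How you fill in the username and password placeholders depends on how your secret is stored. If the secret value is a JSON string with username and password keys, which is a common convention but an assumption here rather than something this page specifies, the lookup could be sketched like this:

import json

# Assumes the secret's SecretString is JSON, for example:
# {"username": "awsuser", "password": "my_password"}
secret = json.loads(secret_manager_response['SecretString'])
username = secret['username']
password = secret['password']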