
Use Spark on Amazon Redshift with a connector

With Amazon EMR release versions 6.4.0 and later, every Amazon EMR cluster created with Apache Spark includes a connector between Spark and Amazon Redshift. This connector is based on the spark-redshift open-source connector and allows you to use Spark on Amazon EMR to process data stored in Amazon Redshift.

Starting in Amazon EMR release version 6.6.0, you must use the --jars or --packages option to specify which of the following JAR files you want to use. The --jars option specifies dependencies stored locally, in HDFS, or over HTTP/S. To see other file locations supported by the --jars option, see Advanced Dependency Management in the Spark documentation. The --packages option specifies dependencies stored in the public Maven repository.

  • spark-redshift.jar

  • spark-avro.jar

  • RedshiftJDBC.jar

  • minimal-json.jar

These JAR files are already installed on each cluster by default in Amazon EMR release versions 6.4.0 and higher, and in versions 6.4.0 and 6.5.0 you don't need to specify them. The following example shows how to launch a Spark application with the spark-redshift connector on versions 6.4.0 and 6.5.0.

spark-submit my_script.py

To launch a Spark application with the spark-redshift connector on Amazon EMR release version 6.6.0 or higher, you must use the --jars or --packages option, as the following example shows. Note that the paths listed with the --jars option are the default paths for the JAR files.

spark-submit \
  --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar,/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar \
  my_script.py
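
If you prefer to pull the connector from Maven instead of referencing the local JAR files, you can use the --packages option. The following is a sketch only; the Maven coordinates shown (io.github.spark-redshift-community:spark-redshift_2.12) and the version placeholder are assumptions, so check the spark-redshift repository for the coordinates and version that match your Amazon EMR release.

spark-submit \
  --packages io.github.spark-redshift-community:spark-redshift_2.12:<version> \
  my_script.py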

To get started with this connector and learn about the supported parameters, see the README file in the spark-redshift GitHub repository. The repository also includes a tutorial for users who are new to Amazon Redshift.

Amazon EMR always reviews open-source code when importing it into your cluster. Due to security concerns, we don't support the following authentication methods from Spark to Amazon S3:

  • Setting Amazon access keys in the hadoop-env configuration classification

  • Encoding Amazon access keys in the tempdir URI

Considerations and limitations

  • The parameter tempformat currently doesn't support the Parquet format.

  • The tempdir URI points to an Amazon S3 location. This temporary directory isn't cleaned up automatically, so it can add storage costs. We recommend using Amazon S3 lifecycle policies to define the retention rules for the Amazon S3 bucket (see the sketch after this list).

  • We recommend using Amazon S3 server-side encryption to encrypt the Amazon S3 buckets used.

  • We recommend blocking public access to Amazon S3 buckets.

  • We recommend that the Amazon Redshift cluster not be publicly accessible.

  • We recommend enabling Amazon Redshift audit logging.

  • We recommend enabling Amazon Redshift at-rest encryption.

  • We recommend enabling SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.

  • We recommend passing an IAM role with the aws_iam_role parameter for Amazon Redshift authentication (see the example after this list).

  • We recommend managing Amazon Redshift credentials (the username and password for the Amazon Redshift cluster) in Amazon Secrets Manager as a best practice. The following code sample shows how you can use Amazon Secrets Manager to retrieve credentials and connect to an Amazon Redshift cluster with PySpark. The sample assumes that the secret value is a JSON string with username and password keys; replace the placeholder secret ID, host, database, table name, and Amazon S3 path with your own values:

    import json

    import boto3
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext.getOrCreate()  # get or reuse the existing SparkContext
    sql_context = SQLContext(sc)

    # Retrieve the Amazon Redshift credentials stored in Amazon Secrets Manager.
    # Replace the placeholder 'string' values with your secret ID and, optionally,
    # a specific version ID or version stage.
    secretsmanager_client = boto3.client('secretsmanager')
    secret_manager_response = secretsmanager_client.get_secret_value(
        SecretId='string',
        VersionId='string',
        VersionStage='string'
    )

    # Assumes the secret value is a JSON string with "username" and "password" keys.
    secret = json.loads(secret_manager_response['SecretString'])
    username = secret['username']
    password = secret['password']

    url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

    # Read data from a table
    df = sql_context.read \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", url) \
        .option("dbtable", "my_table") \
        .option("tempdir", "s3://path/for/temp/data") \
        .load()
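
Building on the read example above, the following sketch shows a write back to Amazon Redshift that applies several of the recommendations in this list: SSL is enabled on the JDBC connection, Amazon S3 authentication uses the aws_iam_role parameter, and tempformat is set to CSV (Parquet is not supported). It reuses the username, password, and df variables from the previous sample; the host, database, table name, IAM role ARN, and Amazon S3 path are placeholders.

    # Enable SSL on the JDBC connection by adding ssl=true to the URL.
    url = "jdbc:redshift://redshifthost:5439/database?ssl=true&user=" + username + "&password=" + password

    # Write the DataFrame to Amazon Redshift, authenticating to Amazon S3 with an IAM role.
    df.write \
        .format("io.github.spark_redshift_community.spark.redshift") \
        .option("url", url) \
        .option("dbtable", "my_table_copy") \
        .option("tempdir", "s3://path/for/temp/data") \
        .option("tempformat", "CSV") \
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-role") \
        .mode("error") \
        .save()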
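
As noted in the tempdir consideration above, the connector doesn't clean up the intermediate data it writes to Amazon S3. The following is a minimal sketch of a lifecycle rule that expires objects under the temp prefix; the bucket name, prefix, and seven-day retention period are placeholder assumptions to adjust for your own retention requirements.

    import boto3

    s3_client = boto3.client('s3')

    # Expire objects under the connector's temp prefix after 7 days (placeholder values).
    s3_client.put_bucket_lifecycle_configuration(
        Bucket='my-bucket',
        LifecycleConfiguration={
            'Rules': [
                {
                    'ID': 'expire-spark-redshift-tempdir',
                    'Filter': {'Prefix': 'path/for/temp/data/'},
                    'Status': 'Enabled',
                    'Expiration': {'Days': 7}
                }
            ]
        }
    )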