Athena data source connectors for Apache Spark

Some Athena data source connectors are available as Spark DSV2 connectors. The Spark DSV2 connector names have a -dsv2 suffix (for example, athena-dynamodb-dsv2).

The following are the currently available DSV2 connectors, their Spark .format() class names, and links to the corresponding Amazon Athena Federated Query documentation:

DSV2 connector | Spark .format() class name | Documentation
athena-cloudwatch-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.CloudwatchTableProvider | CloudWatch
athena-cloudwatch-metrics-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.metrics.CloudwatchMetricsTableProvider | CloudWatch metrics
athena-aws-cmdb-dsv2 | | CMDB
athena-dynamodb-dsv2 | com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider | DynamoDB
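
For scripts that load more than one connector, it can help to keep the class names listed above in one place. The following is a minimal sketch and not part of the connector library: the names DSV2_FORMAT_CLASSES and format_class_for are hypothetical, and athena-aws-cmdb-dsv2 is left out because its class name is not given above.

```python
# Class names copied from the list above; the mapping itself is illustrative.
DSV2_FORMAT_CLASSES = {
    "athena-cloudwatch-dsv2":
        "com.amazonaws.athena.connectors.dsv2.cloudwatch.CloudwatchTableProvider",
    "athena-cloudwatch-metrics-dsv2":
        "com.amazonaws.athena.connectors.dsv2.cloudwatch.metrics.CloudwatchMetricsTableProvider",
    "athena-dynamodb-dsv2":
        "com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider",
}


def format_class_for(connector_name: str) -> str:
    """Return the Spark .format() class name for a DSV2 connector name."""
    try:
        return DSV2_FORMAT_CLASSES[connector_name]
    except KeyError:
        known = ", ".join(sorted(DSV2_FORMAT_CLASSES))
        raise ValueError(
            f"Unknown connector {connector_name!r}; known connectors: {known}"
        ) from None
```

The returned string is the value you would pass to .format() when reading from the connector.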

To download .jar files for the DSV2 connectors, visit the Amazon Athena Query Federation DSV2 GitHub page and see the Releases, Release <version>, Assets section.

Specifying the jar to Spark

To use the Athena DSV2 connectors with Spark, you submit the .jar file for the connector to the Spark environment that you are using. The following sections describe specific cases.

Athena for Spark

For information on adding custom .jar files and custom configuration to Amazon Athena for Apache Spark, see Adding JAR files and custom Spark configuration.

General Spark

To pass the connector .jar file to Spark, use the spark-submit command and specify the .jar file in the --jars option, as in the following example:

spark-submit \
  --deploy-mode cluster \
  --jars
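
If you launch jobs from a script, the same command can be assembled as an argument list. This is a sketch under stated assumptions: the helper build_spark_submit, the .jar path, and the application script name are all hypothetical placeholders, and the .jar location must be one that your Spark environment can read.

```python
import shlex


def build_spark_submit(jar_location: str, app_script: str) -> list[str]:
    """Build a spark-submit argument list that ships the connector .jar."""
    return [
        "spark-submit",
        "--deploy-mode", "cluster",
        "--jars", jar_location,
        app_script,
    ]


# Placeholder values: substitute the .jar you downloaded and your own script.
command = build_spark_submit(
    jar_location="/path/to/athena-dynamodb-dsv2.jar",  # hypothetical path
    app_script="my_job.py",                            # hypothetical script
)

# Print the command as a single shell-safe line for inspection.
print(shlex.join(command))

# To actually run it: subprocess.run(command, check=True)
```

Building a list rather than a single string avoids shell-quoting problems when a path contains spaces.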

Amazon EMR Spark

To run a spark-submit command with the --jars parameter on Amazon EMR, you must add a step to your Amazon EMR Spark cluster. For details on how to use spark-submit on Amazon EMR, see Add a Spark step in the Amazon EMR Release Guide.
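
As an illustration of what such a step might look like, the following sketch builds a step definition and submits it with the boto3 add_job_flow_steps call. The helper names, step name, bucket, and object keys are hypothetical; only the shape of the step (command-runner.jar running spark-submit) follows the usual Amazon EMR pattern.

```python
def build_emr_spark_step(jar_location: str, app_script: str) -> dict:
    """Describe an EMR step that runs spark-submit with the connector .jar."""
    return {
        "Name": "Spark job with Athena DSV2 connector",  # illustrative name
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--jars", jar_location,
                app_script,
            ],
        },
    }


def add_step(cluster_id: str, step: dict) -> str:
    """Submit the step with boto3 and return the new step ID."""
    import boto3  # imported here so the builder above works without boto3

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


step = build_emr_spark_step(
    jar_location="s3://amzn-s3-demo-bucket/jars/athena-dynamodb-dsv2.jar",  # hypothetical
    app_script="s3://amzn-s3-demo-bucket/scripts/my_job.py",               # hypothetical
)
```

Calling add_step with your cluster ID and this step would queue the job; it requires AWS credentials and an existing cluster, so it is not invoked here.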

Amazon Glue ETL Spark

For Amazon Glue ETL, you can pass the URL of the .jar file in the --extra-jars argument of the aws glue start-job-run command. The Amazon Glue documentation describes the --extra-jars parameter as taking an Amazon S3 path, but the parameter can also take an HTTPS URL. For more information, see Job parameter reference in the Amazon Glue Developer Guide.
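
The same job run can be started from Python with the boto3 start_job_run call. The following is a sketch only: the helper names, the job name, and the .jar location are hypothetical placeholders.

```python
def build_glue_run_request(job_name: str, jar_url: str) -> dict:
    """Build start_job_run keyword arguments that add the connector .jar."""
    return {
        "JobName": job_name,
        # Job arguments are passed as a map; --extra-jars names the .jar to load.
        "Arguments": {"--extra-jars": jar_url},
    }


def start_run(request: dict) -> str:
    """Start the job run with boto3 and return its run ID."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    return glue.start_job_run(**request)["JobRunId"]


request = build_glue_run_request(
    job_name="my-glue-etl-job",                                       # hypothetical
    jar_url="s3://amzn-s3-demo-bucket/jars/athena-dynamodb-dsv2.jar",  # hypothetical
)
```

Calling start_run(request) requires AWS credentials and an existing job, so it is not invoked here.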

Querying the connector on Spark

To submit the equivalent of your existing Athena federated query on Apache Spark, use the spark.sql() function. For example, suppose you have the following Athena query that you want to use on Apache Spark.

SELECT somecola, somecolb, somecolc
FROM ddb_datasource.some_schema_or_glue_database.some_ddb_or_glue_table
WHERE somecola > 1

To perform the same query on Spark using the Amazon Athena DynamoDB DSV2 connector, use the following code:

dynamoDf = (spark.read
    .option("athena.connectors.schema", "some_schema_or_glue_database")
    .option("athena.connectors.table", "some_ddb_or_glue_table")
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider")
    .load())

dynamoDf.createOrReplaceTempView("ddb_spark_table")

spark.sql('''
SELECT somecola, somecolb, somecolc
FROM ddb_spark_table
WHERE somecola > 1
''')
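
When porting several Athena queries, the mapping from a three-part Athena table name to the connector options can be wrapped in small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the split does not handle quoted identifiers that contain dots, and the data source part of the name is unused because, as in the example above, the connector is selected by the .format() class.

```python
def split_athena_table_name(qualified_name: str) -> tuple[str, str, str]:
    """Split 'datasource.schema.table' into its three parts."""
    parts = qualified_name.split(".")
    if len(parts) != 3:
        raise ValueError(
            f"Expected 'datasource.schema.table', got {qualified_name!r}"
        )
    datasource, schema, table = parts
    return datasource, schema, table


def connector_options(schema: str, table: str) -> dict[str, str]:
    """Build the schema and table options used in the example above."""
    return {
        "athena.connectors.schema": schema,
        "athena.connectors.table": table,
    }


def register_connector_view(spark, format_class: str,
                            qualified_name: str, view_name: str):
    """Load the table through the connector and expose it as a temp view."""
    _, schema, table = split_athena_table_name(qualified_name)
    df = (spark.read
          .format(format_class)
          .options(**connector_options(schema, table))
          .load())
    df.createOrReplaceTempView(view_name)
    return df


datasource, schema, table = split_athena_table_name(
    "ddb_datasource.some_schema_or_glue_database.some_ddb_or_glue_table"
)
options = connector_options(schema, table)
```

With a live SparkSession, register_connector_view(spark, "com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider", "ddb_datasource.some_schema_or_glue_database.some_ddb_or_glue_table", "ddb_spark_table") would make the table available to spark.sql() under the view name.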

Specifying parameters

The DSV2 versions of the Athena data source connectors use the same parameters as the corresponding Athena data source connectors. For parameter information, refer to the documentation for the corresponding Athena data source connector.

In your PySpark code, use the following syntax to configure your parameters.

spark.read.option("athena.connectors.conf.parameter", "value")

For example, the following code sets the Amazon Athena DynamoDB connector disable_projection_and_casing parameter to always.

dynamoDf = (spark.read
    .option("athena.connectors.schema", "some_schema_or_glue_database")
    .option("athena.connectors.table", "some_ddb_or_glue_table")
    .option("athena.connectors.conf.disable_projection_and_casing", "always")
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider")
    .load())
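
If you set several connector parameters, the athena.connectors.conf. prefix can be applied in one place. The following is a minimal sketch: CONF_PREFIX and connector_conf_options are hypothetical names, and the only parameter shown is the one from the example above.

```python
CONF_PREFIX = "athena.connectors.conf."


def connector_conf_options(params: dict[str, str]) -> dict[str, str]:
    """Prefix each connector parameter name for use as a Spark read option."""
    return {f"{CONF_PREFIX}{name}": str(value) for name, value in params.items()}


options = connector_conf_options({"disable_projection_and_casing": "always"})

# With a live SparkSession, the result can be applied in one call:
#   spark.read.options(**options)
```

Passing the dictionary through .options() is equivalent to chaining one .option() call per parameter.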