Release versions - Amazon Athena
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Release versions

Amazon Athena for Apache Spark offers the following release versions:

PySpark engine version 3

PySpark version 3 includes Apache Spark version 3.2.1. With this version, you can execute Spark code in Athena in-console notebooks.
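
For example, in a notebook session you might run a short PySpark snippet such as the following. This is a minimal sketch that assumes the notebook provides a preinitialized SparkSession named spark; the sample data is purely illustrative.

```python
# Minimal sketch for an Athena in-console notebook.
# Assumes the notebook provides a preinitialized SparkSession named `spark`.
from pyspark.sql import Row

rows = [Row(city="Seattle", temp_c=12.5), Row(city="Phoenix", temp_c=31.0)]
df = spark.createDataFrame(rows)

df.createOrReplaceTempView("readings")
spark.sql("SELECT city, temp_c FROM readings WHERE temp_c > 20").show()
```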

Apache Spark version 3.5

Apache Spark version 3.5 is based on Amazon EMR 7.12 and packages Apache Spark version 3.5.6. With this version, you can run Spark code from Amazon SageMaker AI Unified Studio notebooks or from your preferred compatible Spark clients. This version adds the following key features to deliver an improved experience for interactive workloads:

  • Secure Spark Connect – Adds Spark Connect as an authenticated and authorized Amazon endpoint (a connection sketch follows this list).

  • Session level cost attribution – You can track costs for each interactive session in Amazon Cost Explorer or in Cost and Usage Reports. For more information, see Session level cost attribution.

  • Advanced debugging capabilities – Adds live Spark UI and Spark History Server support for debugging workloads from both the APIs and notebooks. For more information, see Accessing the Spark UI.

  • Unfiltered access support – Access protected Amazon Glue Data Catalog tables where you have full table permissions. For more information, see Using Lake Formation with Athena Spark workgroups.
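
The following sketch shows roughly how a compatible Spark client could attach to a Spark Connect endpoint by using PySpark's SparkSession.builder.remote(). The endpoint URL below is a hypothetical placeholder, and the authentication and authorization flow for the Athena endpoint is not shown; see the Athena documentation for the actual connection and authentication details.

```python
# Hedged sketch: connecting a Spark Connect client with PySpark.
# "sc://example-spark-connect-endpoint:443" is a hypothetical placeholder;
# obtain the real Athena endpoint and authentication details from the docs.
from pyspark.sql import SparkSession  # requires pyspark[connect]

spark = (
    SparkSession.builder
    .remote("sc://example-spark-connect-endpoint:443")  # placeholder URL
    .getOrCreate()
)

spark.range(10).selectExpr("sum(id) AS total").show()
```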

Spark default properties

The following list shows the Spark properties and their default values that apply to Athena Spark Connect sessions. Each entry shows the property key, followed by its default value and, where applicable, a description.

spark.app.id

<Athena SessionId>

This is not modifiable.

spark.app.name

default

spark.driver.cores

4

The number of cores that the driver uses. This is not modifiable during the initial launch.

spark.driver.memory

10g

The amount of memory that the driver uses. This is not modifiable during the initial launch.

spark.driver.memoryOverhead

6g

The amount of memory overhead assigned to Python workloads and other processes running on the driver. This is not modifiable during the initial launch.

spark.cortex.driver.disk

64g

The Spark driver disk. This is not modifiable during the initial launch.

spark.executor.cores

4

The number of cores that each executor uses. This is not modifiable during the initial launch.

spark.executor.memory

10g

The amount of memory that each executor uses.

spark.executor.memoryOverhead

6g

The amount of memory overhead assigned to Python workloads and other processes running on each executor. This is not modifiable during the initial launch.

spark.cortex.executor.disk

64g

The Spark executor disk. This is not modifiable during the initial launch.

spark.cortex.executor.architecture

AARCH_64

The architecture of the executor.

spark.driver.extraJavaOptions

-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false

Extra Java options for the Spark driver. This is not modifiable during the initial launch.

spark.executor.extraJavaOptions

-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false

Extra Java options for the Spark executor. This is not modifiable during the initial launch.

spark.executor.instances

1

The number of Spark executor containers to allocate.

spark.dynamicAllocation.enabled

TRUE

Option that turns on dynamic resource allocation. This option scales up or down the number of executors registered with the application, based on the workload.

spark.dynamicAllocation.minExecutors

0

The lower bound for the number of executors if you turn on dynamic allocation.

spark.dynamicAllocation.maxExecutors

59

The upper bound for the number of executors if you turn on dynamic allocation.

spark.dynamicAllocation.initialExecutors

1

The initial number of executors to run if you turn on dynamic allocation.

spark.dynamicAllocation.executorIdleTimeout

60s

The length of time that an executor can remain idle before Spark removes it. This only applies if you turn on dynamic allocation.

spark.dynamicAllocation.shuffleTracking.enabled

TRUE

Dynamic resource allocation requires shuffle tracking to be enabled.

spark.dynamicAllocation.sustainedSchedulerBacklogTimeout

1s

Defines how long the Spark scheduler must observe a sustained backlog of pending tasks before it requests additional executors from the cluster manager.
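
As a quick sanity check, you can read the effective values of these allocation and sizing properties from a running session with spark.conf.get(). The following sketch only inspects configuration; it assumes an existing SparkSession named spark and does not attempt to change any defaults.

```python
# Inspect the effective dynamic allocation and sizing settings of a session.
# Assumes an existing SparkSession named `spark`; unset keys print "unset".
keys = [
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.minExecutors",
    "spark.dynamicAllocation.maxExecutors",
    "spark.dynamicAllocation.executorIdleTimeout",
    "spark.executor.cores",
    "spark.executor.memory",
]
for key in keys:
    print(key, "=", spark.conf.get(key, "unset"))
```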

spark.sql.catalogImplementation

hive

spark.hadoop.hive.metastore.client.factory.class

com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

The Amazon Glue metastore implementation class.

spark.hadoop.hive.metastore.glue.catalogid

<accountId>

The account ID of the Amazon Glue Data Catalog.

spark.sql.hive.metastore.sharedPrefixes

software.amazon.awssdk.services.dynamodb

Specifies a comma-separated list of package prefixes for classes that should be loaded by the application class loader rather than by the isolated class loader created for the Hive metastore client code.
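
Because the session uses the Hive catalog implementation backed by the Amazon Glue Data Catalog, tables that are registered in the Data Catalog can be queried directly with Spark SQL. In the following sketch, my_database and my_table are hypothetical names used only for illustration.

```python
# Query an Amazon Glue Data Catalog table through Spark SQL.
# `my_database` and `my_table` are hypothetical placeholder names.
spark.sql("SHOW DATABASES").show(truncate=False)

df = spark.sql("SELECT * FROM my_database.my_table LIMIT 10")
df.printSchema()
df.show()
```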

spark.hadoop.fs.s3.impl

org.apache.hadoop.fs.s3a.S3AFileSystem

Maps the s3:// scheme to the S3A filesystem implementation.

spark.hadoop.fs.s3a.impl

org.apache.hadoop.fs.s3a.S3AFileSystem

Maps the s3a:// scheme to the S3A filesystem implementation.

spark.hadoop.fs.s3n.impl

org.apache.hadoop.fs.s3a.S3AFileSystem

Maps the s3n:// scheme (the native S3 client) to the S3A filesystem implementation.

spark.hadoop.fs.AbstractFileSystem.s3.impl

org.apache.hadoop.fs.s3a.S3A

spark.hadoop.fs.s3a.aws.credentials.provider

software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider

spark.hadoop.fs.s3.customAWSCredentialsProvider

com.amazonaws.auth.DefaultAWSCredentialsProviderChain
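
With the s3://, s3a://, and s3n:// schemes all mapped to S3A and credentials resolved through the default provider chain, reading data from Amazon S3 is a standard DataFrame read. The bucket and prefix in the following sketch are hypothetical placeholders.

```python
# Read a Parquet dataset from Amazon S3 through the S3A filesystem client.
# s3://amzn-s3-demo-bucket/input/ is a hypothetical placeholder path.
df = spark.read.parquet("s3://amzn-s3-demo-bucket/input/")
df.printSchema()
print("row count:", df.count())
```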

spark.hadoop.mapreduce.output.fs.optimized.committer.enabled

TRUE

This property enables an optimized commit protocol for Spark jobs when writing data to Amazon S3. When set to true, it helps Spark avoid costly file rename operations, resulting in faster and more reliable atomic writes compared to the default Hadoop committer.

spark.hadoop.fs.s3a.endpoint.region

<REGION>

Explicitly sets the Amazon Web Services Region of the Amazon S3 bucket accessed through the S3A client.

spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds

2000

Specifies the initial socket timeout, in milliseconds, for Amazon S3 GetObject requests.

spark.hadoop.fs.s3a.committer.magic.enabled

TRUE

Enables the S3A magic committer, a high-performance commit protocol that relies on the S3A filesystem's support for special "magic" paths.

spark.hadoop.fs.s3a.committer.magic.track.commits.in.memory.enabled

TRUE

Relevant only when the Magic Committer is enabled, this specifies whether the list of files committed by a task should be tracked in memory instead of being written to temporary disk files.

spark.hadoop.fs.s3a.committer.name

magicv2

This setting explicitly selects the specific S3A Output Committer algorithm to be used (e.g., directory, partitioned, or magic). By specifying the name, you choose the strategy that manages temporary data, handles task failures, and performs the final atomic commit to the target Amazon S3 path.

spark.hadoop.fs.s3.s3AccessGrants.enabled

FALSE

Enables support for Amazon S3 Access Grants when accessing Amazon S3 data through the S3A/EMRFS filesystem client.

spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM

FALSE

When Amazon S3 Access Grants are enabled, this property controls whether the Amazon S3 client should fall back to traditional IAM credentials if the Access Grants lookup fails or does not provide sufficient permissions.
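
Writes to Amazon S3 pick up the optimized committer settings above without any additional code; a standard DataFrame write is sufficient. The output location in the following sketch is a hypothetical placeholder.

```python
# Write a DataFrame to Amazon S3 as Parquet. The magic committer configured
# above performs the commit without costly rename operations.
# s3://amzn-s3-demo-bucket/output/ is a hypothetical placeholder path.
df = spark.range(1000).withColumnRenamed("id", "event_id")
df.write.mode("overwrite").parquet("s3://amzn-s3-demo-bucket/output/")
```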

spark.pyspark.driver.python

/usr/bin/python3.11

The Python path for the driver.

spark.pyspark.python

/usr/bin/python3.11

The Python path for the executors.

spark.python.use.daemon

TRUE

Controls whether Spark uses a Python worker daemon process on each executor. When enabled (true, the default), the executor keeps the Python worker alive between tasks to avoid the overhead of repeatedly launching and initializing a new Python interpreter for every task, which significantly improves the performance of PySpark applications.

spark.sql.execution.arrow.pyspark.enabled

TRUE

Enables the use of Apache Arrow to optimize data transfer between the JVM and Python processes in PySpark.

spark.sql.execution.arrow.pyspark.fallback.enabled

TRUE

Configuration property that controls Spark's behavior when an error occurs during the data transfer between the JVM and Python using the Apache Arrow optimization.
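
With Arrow enabled, columnar exchanges such as toPandas() and pandas UDFs avoid row-by-row serialization between the JVM and Python. The following sketch assumes that pandas and PyArrow are available in the session.

```python
# Arrow-accelerated exchange between the JVM and Python in PySpark.
import pandas as pd
from pyspark.sql.functions import col, pandas_udf

@pandas_udf("double")
def to_fahrenheit(celsius: pd.Series) -> pd.Series:
    return celsius * 9.0 / 5.0 + 32.0

df = spark.range(5).withColumn("celsius", col("id") * 10.0)
df.withColumn("fahrenheit", to_fahrenheit(col("celsius"))).show()

# toPandas() also uses Arrow when spark.sql.execution.arrow.pyspark.enabled is true.
print(df.toPandas().head())
```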

spark.sql.parquet.fs.optimized.committer.optimization-enabled

TRUE

Configuration property that controls whether Spark uses an optimized file committer when writing Parquet files to certain file systems, specifically cloud storage systems like Amazon S3.

spark.sql.parquet.output.committer.class

com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter

Spark configuration property that specifies the fully qualified class name of the Hadoop OutputCommitter to be used when writing Parquet files.

spark.resourceManager.cleanupExpiredHost

TRUE

This property controls whether the Driver actively cleans up Spark application resources associated with executors that were running on nodes that have been deleted or expired.

spark.blacklist.decommissioning.enabled

TRUE

Enables Spark's logic to automatically blacklist executors that are undergoing decommissioning (graceful shutdown) by the cluster manager. This prevents the scheduler from sending new tasks to executors that are about to exit, which improves job stability when the cluster scales down.

spark.blacklist.decommissioning.timeout

1h

Maximum time Spark will wait for a task to be successfully migrated off a decommissioning executor before blacklisting the host.

spark.stage.attempt.ignoreOnDecommissionFetchFailure

TRUE

Tells Spark to be lenient and not fail an entire stage attempt if a fetch failure occurs when reading shuffle data from a decommissioning executor. The fetch failure is considered recoverable, and Spark will re-fetch the data from a different location (potentially requiring re-computation), prioritizing job completion over strict error handling during graceful shutdowns.

spark.decommissioning.timeout.threshold

20

This property is typically used internally or in specific cluster manager setups to define the maximum total duration Spark expects a host's decommissioning process to take. If the actual decommissioning time exceeds this threshold, Spark may take aggressive action, like blacklisting the host or requesting forced termination, to free up the resource.

spark.files.fetchFailure.unRegisterOutputOnHost

TRUE

When a task fails to fetch shuffle or RDD data from a specific host, setting this to true instructs Spark to unregister all output blocks associated with the failing application on that host. This prevents future tasks from attempting to fetch data from the unreliable host, forcing Spark to re-calculate the necessary blocks elsewhere and increasing job robustness against intermittent network issues.