Release versions
Amazon Athena for Apache Spark offers the following release versions:
PySpark engine version 3
PySpark engine version 3 includes Apache Spark version 3.2.1. With this version, you can run Spark code in notebooks in the Athena console.
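For example, a notebook cell in an Athena for Apache Spark session can run standard PySpark code against the session that the notebook provides. The following is a minimal sketch that assumes a preconfigured `spark` session is available in the notebook; the database and table names are placeholders.

```python
# Minimal sketch for an Athena in-console notebook cell.
# Assumes the notebook provides a preconfigured `spark` session.
# `example_db.example_table` is a placeholder -- replace it with a table
# from your own catalog.
df = spark.sql("SELECT * FROM example_db.example_table LIMIT 10")
df.show()
```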
Apache Spark version 3.5
Apache Spark version 3.5 is based on Amazon EMR 7.12 and packages Apache Spark version 3.5.6. With this version, you can run Spark code from Amazon SageMaker AI Unified Studio notebooks or from your preferred compatible Spark client. This version adds key features that deliver an improved experience for interactive workloads:
- Secure Spark Connect – Adds Spark Connect as an authenticated and authorized Amazon endpoint. For a connection sketch, see the example after this list.
- Session level cost attribution – You can track costs per interactive session in Amazon Cost Explorer or in Cost and Usage Reports. For more information, see Session level cost attribution.
- Advanced debugging capabilities – Adds live Spark UI and Spark History Server support for debugging workloads both from the APIs and from notebooks. For more information, see Accessing the Spark UI.
- Unfiltered access support – Access protected Amazon Glue Data Catalog tables where you have full table permissions. For more information, see Using Lake Formation with Athena Spark workgroups.
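The following is a minimal sketch of connecting to a Spark Connect endpoint from a PySpark client. It assumes that a PySpark 3.5 (or later) client with the Spark Connect module is installed; the endpoint URL is a placeholder, and the authentication and authorization details for an Athena workgroup session are omitted.

```python
from pyspark.sql import SparkSession

# Placeholder Spark Connect endpoint -- the actual host, port, and
# authentication for an Athena session come from your workgroup and
# session configuration.
spark = (
    SparkSession.builder
    .remote("sc://spark-connect-endpoint.example.com:443")
    .getOrCreate()
)

# Run a trivial query to confirm that the remote session works.
spark.range(10).show()
```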
Spark default properties
The following table lists the Spark properties and their default values that are applied to Athena Spark Connect sessions.
| Key | Default value | Description |
|---|---|---|
| | | This is not modifiable. |
| | | The number of cores that the driver uses. This is not modifiable during the initial launch. |
| | | The amount of memory that the driver uses. This is not modifiable during the initial launch. |
| | | The amount of memory overhead assigned for Python workloads and other processes running on the driver. This is not modifiable during the initial launch. |
| | | The Spark driver disk. This is not modifiable during the initial launch. |
| | | The number of cores that each executor uses. This is not modifiable during the initial launch. |
| | | The amount of memory that each executor uses. |
| | | The amount of memory overhead assigned for Python workloads and other processes running on each executor. This is not modifiable during the initial launch. |
| | | The Spark executor disk. This is not modifiable during the initial launch. |
| | | The architecture of the executor. |
| | | Extra Java options for the Spark driver. This is not modifiable during the initial launch. |
| | | Extra Java options for the Spark executor. This is not modifiable during the initial launch. |
| | | The number of Spark executor containers to allocate. |
| | | Turns on dynamic resource allocation, which scales the number of executors registered with the application up or down based on the workload. |
| | | The lower bound for the number of executors if you turn on dynamic allocation. |
| | | The upper bound for the number of executors if you turn on dynamic allocation. |
| | | The initial number of executors to run if you turn on dynamic allocation. |
| | | The length of time that an executor can remain idle before Spark removes it. This applies only if you turn on dynamic allocation. |
| | | Dynamic resource allocation requires shuffle tracking to be enabled. |
| | | Defines how long the Spark scheduler must observe a sustained backlog of pending tasks before it asks the cluster manager to launch new executors. |
| | | The Amazon Glue metastore implementation class. |
| | | The Amazon Glue catalog account ID. |
| | | Specifies a comma-separated list of package prefixes for classes that should be loaded by the application class loader rather than by the isolated class loader created for the Hive metastore client code. |
| | | Defines the implementation for the S3 client to use S3A. |
| | | Defines the implementation for the S3A client (S3A). |
| | | Defines the implementation for the native S3 client (S3N) to use S3A. |
| | | Enables an optimized commit protocol for Spark jobs when writing data to Amazon S3. When set to true, it helps Spark avoid costly file rename operations, resulting in faster and more reliable atomic writes compared to the default Hadoop committer. |
| | | Explicitly sets the Region of the Amazon S3 bucket accessed through the S3A client. |
| | | Specifies the socket connection timeout in milliseconds. |
| | | Enables the S3A "Magic" committer, a high-performance commit protocol that relies on the underlying file system's support for special paths. |
| | | Relevant only when the Magic committer is enabled. Specifies whether the list of files committed by a task should be tracked in memory instead of being written to temporary disk files. |
| | | Explicitly selects the S3A output committer algorithm to use (for example, directory, partitioned, or magic). The committer you name determines how temporary data is managed, how task failures are handled, and how the final atomic commit to the target Amazon S3 path is performed. |
| | | Enables support for Amazon S3 Access Grants when accessing Amazon S3 data through the S3A/EMRFS file system client. |
| | | When Amazon S3 Access Grants are enabled, controls whether the Amazon S3 client falls back to traditional IAM credentials if the Access Grants lookup fails or does not provide sufficient permissions. |
| | | The Python path for the driver. |
| | | The Python path for the executor. |
| | | Controls whether Spark uses a Python worker daemon process on each executor. When enabled (true, the default), the executor keeps the Python worker alive between tasks to avoid the overhead of repeatedly launching and initializing a new Python interpreter for every task, which significantly improves the performance of PySpark applications. |
| | | Enables the use of Apache Arrow to optimize data transfer between the JVM and Python processes in PySpark. |
| | | Controls Spark's behavior when an error occurs during Arrow-optimized data transfer between the JVM and Python. |
| | | Controls whether Spark uses an optimized file committer when writing Parquet files to certain file systems, specifically cloud storage systems such as Amazon S3. |
| | | Specifies the fully qualified class name of the Hadoop OutputCommitter to use when writing Parquet files. |
| | | Controls whether the driver actively cleans up Spark application resources associated with executors that were running on nodes that have been deleted or have expired. |
| | | Enables Spark's logic to automatically blacklist executors that are undergoing decommissioning (graceful shutdown) by the cluster manager. This prevents the scheduler from sending new tasks to executors that are about to exit, improving job stability when resources scale down. |
| | | The maximum time that Spark waits for a task to be successfully migrated off a decommissioning executor before it blacklists the host. |
| | | Tells Spark not to fail an entire stage attempt if a fetch failure occurs when reading shuffle data from a decommissioning executor. The fetch failure is considered recoverable, and Spark re-fetches the data from a different location (potentially requiring recomputation), prioritizing job completion over strict error handling during graceful shutdowns. |
| | | Defines the maximum total duration that Spark expects a host's decommissioning process to take. If the actual decommissioning time exceeds this threshold, Spark may take more aggressive action, such as blacklisting the host or requesting forced termination, to free up the resource. This property is typically used internally or in specific cluster manager setups. |
| | | When a task fails to fetch shuffle or RDD data from a specific host, setting this to true instructs Spark to unregister all output blocks associated with the failing application on that host. This prevents future tasks from attempting to fetch data from the unreliable host, forcing Spark to recalculate the necessary blocks elsewhere and increasing job robustness against intermittent network issues. |
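To confirm which defaults are in effect for a particular session, you can read them back through the standard Spark configuration API. The following minimal sketch assumes an existing `spark` session (for example, one created through Spark Connect as shown earlier); the keys queried are standard Spark property names used for illustration and may not cover every entry in the table above.

```python
# Assumes an existing SparkSession named `spark`.
# The keys below are standard Spark property names used for illustration.
keys = [
    "spark.driver.memory",
    "spark.executor.memory",
    "spark.dynamicAllocation.enabled",
    "spark.dynamicAllocation.maxExecutors",
]

for key in keys:
    # conf.get returns the supplied default when the key is not set.
    print(key, "=", spark.conf.get(key, "<not set>"))
```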