Using job retry policies
In Amazon EMR on EKS versions 6.9.0 and later, you can set a retry policy for your job runs. Retry policies cause a job driver pod to be restarted automatically if it fails or is deleted. This makes long-running Spark streaming jobs more resilient to failures.
Setting a retry policy for a job
To configure a retry policy, you provide a RetryPolicyConfiguration field
using the StartJobRun API. An
example retryPolicyConfiguration is shown here:
aws emr-containers start-job-run \
  --virtual-cluster-id cluster_id \
  --name sample-job-name \
  --execution-role-arn execution-role-arn \
  --release-label emr-6.9.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "local:///usr/lib/spark/examples/src/main/python/pi.py",
      "entryPointArguments": [ "2" ],
      "sparkSubmitParameters": "--conf spark.executor.instances=2 --conf spark.executor.memory=2G --conf spark.executor.cores=2 --conf spark.driver.cores=1"
    }
  }' \
  --retry-policy-configuration '{ "maxAttempts": 5 }' \
  --configuration-overrides '{
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "my_log_group_name",
        "logStreamNamePrefix": "my_log_stream_prefix"
      },
      "s3MonitoringConfiguration": {
        "logUri": "s3://amzn-s3-demo-logging-bucket"
      }
    }
  }'
Note
retryPolicyConfiguration is only available with Amazon CLI version 1.27.68 and later. To update the Amazon CLI to the latest version, see Installing or updating the latest version of the Amazon CLI.
Configure the maxAttempts field with the maximum number of times you want the job driver pod to be restarted if it fails or is deleted. The interval between two job driver retry attempts grows exponentially (10 seconds, 20 seconds, 40 seconds, and so on) and is capped at 6 minutes, as described in the Kubernetes documentation.
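The backoff schedule can be sketched as follows. This is an illustrative calculation only; the actual scheduling is performed by Kubernetes, not by your client code.

```python
# Illustrative sketch of the driver retry backoff schedule described above:
# delays double from 10 seconds and are capped at 6 minutes (360 seconds).
def retry_delay_seconds(attempt: int) -> int:
    """Delay before retry number `attempt` (0-based)."""
    return min(10 * 2 ** attempt, 360)

print([retry_delay_seconds(n) for n in range(8)])
# [10, 20, 40, 80, 160, 320, 360, 360]
```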
Note
Every additional job driver execution is billed as another job run and is subject to Amazon EMR on EKS pricing.
Retry policy configuration values
-
Default retry policy for a job: StartJobRun includes a retry policy set to 1 maximum attempt by default. You can configure the retry policy as desired.
Note
If maxAttempts of the retryPolicyConfiguration is set to 1, no retries are performed to bring up the driver pod on failure.
-
Disabling retry policy for a job: To disable a retry policy, set the maxAttempts value in retryPolicyConfiguration to 1.
"retryPolicyConfiguration": {
    "maxAttempts": 1
}
-
Set maxAttempts for a job within the valid range: A StartJobRun call fails if the maxAttempts value is outside the valid range. The valid maxAttempts range is from 1 to 2,147,483,647 (32-bit integer), the range supported for the Kubernetes backOffLimit configuration setting. For more information, see Pod backoff failure policy in the Kubernetes documentation. If the maxAttempts value is invalid, the following error message is returned:
{
    "message": "Retry policy configuration's parameter value of maxAttempts is invalid"
}
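The range check above can be mirrored on the client side before calling StartJobRun. The helper below is hypothetical and not part of any Amazon SDK; it only reproduces the documented validation rule.

```python
# Hypothetical client-side check mirroring the documented service-side
# validation of maxAttempts (valid range: 1 to 2,147,483,647).
MAX_ATTEMPTS_LIMIT = 2_147_483_647  # 32-bit signed integer maximum

def validate_max_attempts(max_attempts: int) -> None:
    """Raise ValueError if maxAttempts is outside the valid range."""
    if not 1 <= max_attempts <= MAX_ATTEMPTS_LIMIT:
        raise ValueError(
            "Retry policy configuration's parameter value of maxAttempts is invalid"
        )

validate_max_attempts(5)  # valid, no exception raised
```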
Retrieving a retry policy status for a job
You can view the status of the retry attempts for a job with the ListJobRuns and DescribeJobRun APIs. After you submit a job with a retry policy enabled, the ListJobRuns and DescribeJobRun responses contain the status of the retry policy in the RetryPolicyExecution field. In addition, the DescribeJobRun response contains the RetryPolicyConfiguration that was passed in the StartJobRun request for the job.
Sample responses
These fields are not visible when the retry policy is disabled for the job, as described in Retry policy configuration values.
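As a hedged illustration of the fields named above (the surrounding response structure and the values shown are assumptions, not verbatim API output), a DescribeJobRun response might include an excerpt like:

```
"retryPolicyConfiguration": {
    "maxAttempts": 5
},
"retryPolicyExecution": {
    "currentAttemptCount": 2
}
```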
Monitoring a job with a retry policy
When you enable a retry policy, a CloudWatch event is generated for every job driver that is created. To subscribe to these events, set up a CloudWatch event rule using the following command:
aws events put-rule \
  --name cwe-test \
  --event-pattern '{"detail-type": ["EMR Job Run New Driver Attempt"]}'
The event will return information on the newDriverPodName,
newDriverCreatedAt timestamp, previousDriverFailureMessage, and
currentAttemptCount of the job drivers. These events will not be created if
the retry policy is disabled.
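As a hedged sketch of such an event (the field nesting, source value, and sample values are assumptions, not verbatim event output), the payload might look like:

```
{
    "detail-type": "EMR Job Run New Driver Attempt",
    "source": "aws.emr-containers",
    "detail": {
        "newDriverPodName": "spark-000000030tk7nprkfa1-driver-x7k2q",
        "newDriverCreatedAt": "2023-01-01T00:00:00Z",
        "previousDriverFailureMessage": "Driver pod was deleted",
        "currentAttemptCount": "2"
    }
}
```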
For more information on how to monitor your job with CloudWatch events, see Monitor jobs with Amazon CloudWatch Events.
Finding logs for drivers and executors
Driver pod names follow the format spark-<job id>-driver-<random-suffix>. The same random-suffix is added to the executor pod names that the driver spawns, so you can use it to find the logs for a driver and its associated executors. The random-suffix is present only when a retry policy is enabled for the job; otherwise, it is absent.
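Correlating pods by the shared random-suffix can be sketched as below. The helpers and pod names are hypothetical, and the exact character set of the job ID and suffix is an assumption.

```python
import re

# Hypothetical helpers for correlating driver and executor logs by the
# shared random-suffix; the [a-z0-9]+ character classes are assumptions.
DRIVER_POD_RE = re.compile(r"spark-(?P<job_id>[a-z0-9]+)-driver-(?P<suffix>[a-z0-9]+)$")

def driver_suffix(driver_pod_name: str):
    """Extract the random-suffix from a driver pod name, or None if absent."""
    m = DRIVER_POD_RE.match(driver_pod_name)
    return m.group("suffix") if m else None

def related_pods(pod_names, suffix):
    """Return every pod whose name carries the given random-suffix."""
    return [name for name in pod_names if suffix in name]

# Hypothetical pod names for illustration.
suffix = driver_suffix("spark-000000030tk7nprkfa1-driver-x7k2q")
print(suffix)  # x7k2q
```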
For more information on how to configure jobs with monitoring configuration for logging, see Run a Spark application.