Tracking processed data using job bookmarks
Amazon Glue tracks data that has already been processed during a previous run of an ETL job by persisting state information from the job run. This persisted state information is called a job bookmark. Job bookmarks help Amazon Glue maintain state information and prevent the reprocessing of old data. With job bookmarks, you can process only the new data when rerunning a job on a scheduled interval. A job bookmark is composed of the states for various elements of jobs, such as sources, transformations, and targets. For example, your ETL job might read new partitions of an Amazon S3-backed table. Amazon Glue tracks which partitions the job has processed successfully to prevent duplicate processing and duplicate data in the job's target data store.
Job bookmarks are implemented for JDBC data sources, the Relationalize transform, and some Amazon Simple Storage Service (Amazon S3) sources. The following table lists the Amazon S3 source formats that Amazon Glue supports for job bookmarks.
| Amazon Glue version | Amazon S3 source formats |
|---|---|
| Version 0.9 | JSON, CSV, Apache Avro, XML |
| Version 1.0 and later | JSON, CSV, Apache Avro, XML, Parquet, ORC |
For information about Amazon Glue versions, see Defining job properties for Spark jobs.
The job bookmarks feature has additional functionalities when accessed through Amazon Glue scripts. When browsing your generated script, you may see transformation contexts, which are related to this feature. For more information, see Using job bookmarks.
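To make the transformation context concrete, the following is a minimal sketch of a bookmark-aware script; the database, table, and S3 path names are placeholders:

```python
# Minimal sketch of a Glue ETL script that participates in job bookmarks.
# The database, table, and output path names are placeholders.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)  # retrieves the saved bookmark state

# The transformation_ctx string keys this source's state element in the bookmark.
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",           # placeholder
    table_name="my_table",            # placeholder
    transformation_ctx="datasource0",
)

glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/output/"},  # placeholder
    format="parquet",
    transformation_ctx="datasink0",
)

job.commit()  # atomically saves the updated bookmark state
```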
Using job bookmarks in Amazon Glue
The job bookmark option is passed as a parameter when the job is started. The following table describes the options for setting job bookmarks on the Amazon Glue console.
| Job bookmark | Description |
|---|---|
| Enable | Causes the job to update the state after a run to keep track of previously processed data. If your job has a source with job bookmark support, the job keeps track of processed data, and each run processes only the new data since the last checkpoint. |
| Disable | Job bookmarks are not used, and the job always processes the entire dataset. You are responsible for managing the output from previous job runs. This is the default. |
| Pause | Processes incremental data since the last successful run, or the data in the range identified by the sub-options job-bookmark-from <from-value> and job-bookmark-to <to-value>, without updating the state of the last bookmark. You are responsible for managing the output from previous job runs. The job bookmark state is not updated when this option set is specified. The sub-options are optional; however, when used, both sub-options must be provided. |
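As a sketch, a run with bookmarks enabled can be started through boto3 by passing the job-bookmark-option argument; the job name is a placeholder:

```python
# Sketch: start a Glue job run with bookmarks enabled (job name is a placeholder).
import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-job-name",  # placeholder
    Arguments={"--job-bookmark-option": "job-bookmark-enable"},
)
```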
For details about the parameters passed to a job on the command line, and specifically for job bookmarks, see Using job parameters in Amazon Glue jobs.
For Amazon S3 input sources, Amazon Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
For JDBC sources, the following rules apply:

- For each table, Amazon Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
- By default, Amazon Glue uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).
- You can specify the columns to use as bookmark keys in your Amazon Glue script, as shown in the sketch after this list. For more information about using job bookmarks in Amazon Glue scripts, see Using job bookmarks.
- Amazon Glue doesn't support using columns with case-sensitive names as job bookmark keys.
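Assuming a glueContext set up as in the earlier sketch, you might override the bookmark keys like this; the table and column names are placeholders:

```python
# Sketch: choosing the bookmark keys for a JDBC source (table and column
# names are placeholders). jobBookmarkKeysSortOrder may be "asc" or "desc".
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",       # placeholder
    table_name="my_jdbc_table",   # placeholder
    additional_options={
        "jobBookmarkKeys": ["updated_at"],   # placeholder column
        "jobBookmarkKeysSortOrder": "asc",
    },
    transformation_ctx="datasource0",
)
```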
You can rewind your job bookmarks for your Amazon Glue Spark ETL jobs to any previous job run. This better supports data backfilling scenarios: after you rewind to a previous job run, the subsequent job run reprocesses data only from the bookmarked job run onward.
If you intend to reprocess all the data using the same job, reset the job bookmark. To reset the job bookmark state, use the Amazon Glue console, the ResetJobBookmark action (Python: reset_job_bookmark) API operation, or the Amazon CLI. For example, enter the following command using the Amazon CLI:
aws glue reset-job-bookmark --job-name my-job-name
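The same reset, and a rewind to the state of a specific earlier run, can be sketched with boto3; the job name and run ID are placeholders:

```python
# Sketch: reset a job bookmark entirely, or rewind it to the state recorded
# for a specific earlier run (job name and run ID are placeholders).
import boto3

glue = boto3.client("glue")

# Full reset: the next run reprocesses everything.
glue.reset_job_bookmark(JobName="my-job-name")

# Rewind to a previous run's bookmark state.
glue.reset_job_bookmark(JobName="my-job-name", RunId="jr_0123456789abcdef")
```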
When you rewind or reset a bookmark, Amazon Glue does not clean the target files because there could be multiple targets and targets are not tracked with job bookmarks. Only source files are tracked with job bookmarks. You can create different output targets when rewinding and reprocessing the source files to avoid duplicate data in your output.
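One way to keep reprocessed output separate, assuming the glueContext and datasource0 from the earlier sketch, is to write each run to a distinct output prefix; the bucket name is a placeholder:

```python
# Sketch: write each run to a distinct output prefix so a rewind does not
# mix new output with files from earlier runs (bucket name is a placeholder).
from datetime import datetime, timezone

run_suffix = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
output_path = f"s3://my-output-bucket/output/run={run_suffix}/"

# datasource0: a DynamicFrame produced earlier in the script (see the first sketch).
glueContext.write_dynamic_frame.from_options(
    frame=datasource0,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
    transformation_ctx="datasink0",
)
```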
Amazon Glue keeps track of job bookmarks by job. If you delete a job, the job bookmark is deleted.
In some cases, you might have enabled Amazon Glue job bookmarks but your ETL job is reprocessing data that was already processed in an earlier run. For information about resolving common causes of this error, see Troubleshooting Spark errors.
Operational details of the job bookmarks feature
This section describes more of the operational details of using job bookmarks.
Job bookmarks store the states for a job. Each instance of the state is keyed by a job name and a version number. When a script invokes job.init, it retrieves its state and always gets the latest version. Within a state, there are multiple state elements, which are specific to each source, transformation, and sink instance in the script. These state elements are identified by a transformation context that is attached to the corresponding element (source, transformation, or sink) in the script. The state elements are saved atomically when job.commit is invoked from the user script. The script gets the job name and the control option for the job bookmarks from the arguments.
The state elements in the job bookmark are source, transformation, or sink-specific data. For example, suppose that you want to read incremental data from an Amazon S3 location that is being constantly written to by an upstream job or process. In this case, the script must determine what has been processed so far. The job bookmark implementation for the Amazon S3 source saves information so that when the job runs again, it can filter only the new objects using the saved information and recompute the state for the next run of the job. A timestamp is used to filter the new files.
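As a simplified illustration of that idea only (this is not Amazon Glue's internal implementation; the bucket, prefix, and saved timestamp are placeholders):

```python
# Illustration: keep only the S3 objects modified after the timestamp
# recorded in the last saved state (bucket, prefix, timestamp are placeholders).
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
last_processed = datetime(2024, 1, 1, tzinfo=timezone.utc)  # from saved state

new_keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-input-bucket", Prefix="input/"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] > last_processed:
            new_keys.append(obj["Key"])
```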
In addition to the state elements, job bookmarks have a run number, an attempt number, and a version number. The run number is a monotonically increasing number that is incremented for every successful run of the job. The attempt number tracks the attempts for each run and is incremented only when there is a run after a failed attempt. The version number increases monotonically and tracks the updates to a job bookmark.
In the Amazon Glue service database, the bookmark states for all the transformations are stored together as key-value pairs:
{ "job_name" : ..., "run_id": ..., "run_number": .., "attempt_number": ... "states": { "transformation_ctx1" : { bookmark_state1 }, "transformation_ctx2" : { bookmark_state2 } } }
Best practices
The following are best practices for using job bookmarks.
- Do not change the data source property with the bookmark enabled. For example, suppose datasource0 points to Amazon S3 input path A, and the job has been reading from that source for several runs with the bookmark enabled. If you change the input path of datasource0 to Amazon S3 path B without changing the transformation_ctx, the Amazon Glue job uses the old bookmark state. As a result, files in input path B are missed or skipped, because Amazon Glue assumes they were processed in previous runs.
- Use a catalog table with bookmarks for better partition management. Bookmarks work for data sources both from the Data Catalog and from options. However, it's difficult to remove or add new partitions with the from-options approach. Using a catalog table with crawlers can provide better automation to track newly added partitions, and gives you the flexibility to select particular partitions with a pushdown predicate.
- Use the Amazon Glue Amazon S3 file lister for large datasets. A bookmark lists all files under each input partition and does the filtering, so if there are too many files under a single partition the driver can run out of memory. Use the Amazon Glue Amazon S3 file lister to avoid listing all the files in memory at once. A sketch of the last two practices follows this list.
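The following sketch combines the last two practices, assuming a glueContext set up as in the earlier example; the database, table, and partition values are placeholders:

```python
# Sketch: read from a catalog table, prune partitions with a pushdown
# predicate, and enable the S3 file lister for large datasets.
# Database, table, and partition values are placeholders.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="my_database",   # placeholder
    table_name="my_table",    # placeholder
    push_down_predicate="region = 'us-east-1' and year = '2024'",
    additional_options={"useS3ListImplementation": True},
    transformation_ctx="datasource0",
)
```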