
Customizing crawler behavior

When a crawler runs, it might encounter changes to your data store that result in a schema or partition that is different from a previous crawl. You can use the Amazon Web Services Management Console or the Amazon Glue API to configure how your crawler processes certain types of changes.

Console

When you define a crawler using the Amazon Glue console, you have several options for configuring the behavior of your crawler. For more information about using the Amazon Glue console to add a crawler, see Configuring a crawler.

When a crawler runs against a previously crawled data store, it might discover that a schema has changed or that some objects in the data store have been deleted. The crawler logs changes to a schema. Depending on the source type for the crawler, new tables and partitions might be created regardless of the schema change policy.

To specify what the crawler does when it finds changes in the schema, you can choose one of the following actions on the console:

  • Update the table definition in the Data Catalog – Add new columns, remove missing columns, and modify the definitions of existing columns in the Amazon Glue Data Catalog. Remove any metadata that is not set by the crawler. This is the default setting.

  • Add new columns only – For tables that map to an Amazon S3 data store, add new columns as they are discovered, but don't remove or change the type of existing columns in the Data Catalog. Choose this option when the current columns in the Data Catalog are correct and you don't want the crawler to remove or change the type of the existing columns. If a fundamental Amazon S3 table attribute changes, such as classification, compression type, or CSV delimiter, mark the table as deprecated. Maintain input format and output format as they exist in the Data Catalog. Update SerDe parameters only if the parameter is one that is set by the crawler. For all other data stores, modify existing column definitions.

  • Ignore the change and don't update the table in the Data Catalog – Only new tables and partitions are created.

    This is the default setting for incremental crawls.

A crawler might also discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to Update all new and existing partitions with metadata from the table on the Amazon Glue console. When this option is set, partitions inherit metadata properties—such as their classification, input format, output format, SerDe information, and schema—from their parent table. Any changes to these properties in a table are propagated to its partitions. When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs.

To specify what the crawler does when it finds a deleted object in the data store, choose one of the following actions:

  • Delete tables and partitions from the Data Catalog

  • Ignore the change and don't update the table in the Data Catalog

    This is the default setting for incremental crawls.

  • Mark the table as deprecated in the Data Catalog – This is the default setting.

Amazon CLI
aws glue create-crawler \
    --name "your-crawler-name" \
    --role "your-iam-role-arn" \
    --database-name "your-database-name" \
    --targets 'S3Targets=[{Path="s3://your-bucket-name/path-to-data"}]' \
    --configuration '{"Version": 1.0, "CrawlerOutput": {"Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}, "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}}}'
API

When you define a crawler using the Amazon Glue API, you can choose from several fields to configure your crawler. The SchemaChangePolicy in the crawler API determines what the crawler does when it discovers a changed schema or a deleted object. The crawler logs schema changes as it runs.

Sample Python code showing the crawler configuration options

import boto3
import json

# Initialize a boto3 client for AWS Glue
glue_client = boto3.client('glue', region_name='us-east-1')  # Replace 'us-east-1' with your desired Amazon Region

# Define the crawler configuration
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {
            "AddOrUpdateBehavior": "InheritFromTable"
        },
        "Tables": {
            "AddOrUpdateBehavior": "MergeNewColumns"
        }
    }
}
configuration_json = json.dumps(crawler_configuration)

# Create the crawler with the specified configuration
response = glue_client.create_crawler(
    Name='your-crawler-name',       # Replace with your desired crawler name
    Role='crawler-test-role',       # Replace with the ARN of your IAM role for Glue
    DatabaseName='default',         # Replace with your target Glue database name
    Targets={
        'S3Targets': [
            {
                'Path': "s3://your-bucket-name/path/",  # Replace with your S3 path to the data
            },
        ],
        # Include other target types like 'JdbcTargets' if needed
    },
    Configuration=configuration_json,
    # Include other parameters like Schedule, Classifiers, TablePrefix, SchemaChangePolicy, etc., as needed
)
print(response)
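To confirm that the configuration string was stored as intended, you can read the crawler definition back with get_crawler. The following is a minimal sketch that assumes the hypothetical crawler name used above and that a Configuration string has been set:

import boto3
import json

glue_client = boto3.client('glue', region_name='us-east-1')  # Replace with your Amazon Region

# Retrieve the crawler definition; 'Configuration' is returned as a JSON string when it has been set
response = glue_client.get_crawler(Name='your-crawler-name')  # Hypothetical crawler name from the example above
configuration = json.loads(response['Crawler']['Configuration'])

print(configuration['CrawlerOutput'])
# Prints the Partitions and Tables behaviors supplied at creation time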

When a crawler runs, new tables and partitions are always created regardless of the schema change policy. You can choose one of the following actions in the UpdateBehavior field in the SchemaChangePolicy structure to determine what the crawler does when it finds a changed table schema:

  • UPDATE_IN_DATABASE – Update the table in the Amazon Glue Data Catalog. Add new columns, remove missing columns, and modify the definitions of existing columns. Remove any metadata that is not set by the crawler.

  • LOG – Ignore the changes, and don't update the table in the Data Catalog.

    This is the default setting for incremental crawls.
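For example, a minimal sketch of supplying the UpdateBehavior through the SchemaChangePolicy parameter when creating a crawler with boto3 might look like the following (the crawler name, role, database, and S3 path are placeholders):

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')  # Replace with your Amazon Region

# Create a crawler that updates table definitions in the Data Catalog when schemas change
glue_client.create_crawler(
    Name='your-crawler-name',            # Placeholder crawler name
    Role='your-iam-role-arn',            # Placeholder IAM role for Glue
    DatabaseName='your-database-name',   # Placeholder target database
    Targets={'S3Targets': [{'Path': 's3://your-bucket-name/path-to-data'}]},
    SchemaChangePolicy={
        'UpdateBehavior': 'UPDATE_IN_DATABASE'  # Or 'LOG' to leave existing table definitions unchanged
    }
)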

You can also override the SchemaChangePolicy structure using a JSON object supplied in the crawler API Configuration field. This JSON object can contain a key-value pair to set the policy to not update existing columns and only add new columns. For example, provide the following JSON object as a string:

{ "Version": 1.0, "CrawlerOutput": { "Tables": { "AddOrUpdateBehavior": "MergeNewColumns" } } }

This option corresponds to the Add new columns only option on the Amazon Glue console. It overrides the SchemaChangePolicy structure for tables that result from crawling Amazon S3 data stores only. Choose this option if you want to maintain the metadata as it exists in the Data Catalog (the source of truth). New columns are added as they are encountered, including nested data types. But existing columns are not removed, and their type is not changed. If an Amazon S3 table attribute changes significantly, mark the table as deprecated, and log a warning that an incompatible attribute needs to be resolved. This option is not applicable to incremental crawls.

When a crawler runs against a previously crawled data store, it might discover new or changed partitions. By default, new partitions are added and existing partitions are updated if they have changed. In addition, you can set a crawler configuration option to InheritFromTable (corresponding to the Update all new and existing partitions with metadata from the table option on the Amazon Glue console). When this option is set, partitions inherit metadata properties from their parent table, such as their classification, input format, output format, SerDe information, and schema. Any property changes to the parent table are propagated to its partitions.

When this configuration option is set on an existing crawler, existing partitions are updated to match the properties of their parent table the next time the crawler runs. This behavior is set in the crawler API Configuration field. For example, provide the following JSON object as a string:

{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" } } }

The crawler API Configuration field can set multiple configuration options. For example, to configure the crawler output for both partitions and tables, you can provide a string representation of the following JSON object:

{ "Version": 1.0, "CrawlerOutput": { "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }, "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" } } }

You can choose one of the following actions to determine what the crawler does when it finds a deleted object in the data store. The DeleteBehavior field in the SchemaChangePolicy structure in the crawler API sets the behavior of the crawler when it discovers a deleted object.

  • DELETE_FROM_DATABASE – Delete tables and partitions from the Data Catalog.

  • LOG – Ignore the change. Don't update the Data Catalog. Write a log message instead.

  • DEPRECATE_IN_DATABASE – Mark the table as deprecated in the Data Catalog. This is the default setting.
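As an illustrative sketch (the crawler name is a placeholder), the DeleteBehavior can be set on an existing crawler through the SchemaChangePolicy parameter of update_crawler:

import boto3

glue_client = boto3.client('glue', region_name='us-east-1')  # Replace with your Amazon Region

# Remove tables and partitions from the Data Catalog when their source objects are deleted
glue_client.update_crawler(
    Name='your-crawler-name',  # Placeholder name of an existing crawler
    SchemaChangePolicy={
        'DeleteBehavior': 'DELETE_FROM_DATABASE'  # Or 'LOG', or the default 'DEPRECATE_IN_DATABASE'
    }
)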