Using crawlers to populate the Data Catalog - Amazon Glue
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Using crawlers to populate the Data Catalog

You can use an Amazon Glue crawler to populate the Amazon Glue Data Catalog with databases and tables. This is the primary method used by most Amazon Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in Amazon Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.

Workflow

The following workflow diagram shows how Amazon Glue crawlers interact with data stores and other elements to populate the Data Catalog.

Workflow showing how Amazon Glue crawler populates the Data Catalog in 5 basic steps.

The following is the general workflow for how a crawler populates the Amazon Glue Data Catalog:

  1. A crawler runs any custom classifiers that you choose to infer the format and schema of your data. You provide the code for custom classifiers, and they run in the order that you specify.

    The first custom classifier to successfully recognize the structure of your data is used to create a schema. Custom classifiers lower in the list are skipped.

  2. If no custom classifier matches your data's schema, built-in classifiers try to recognize your data's schema. An example of a built-in classifier is one that recognizes JSON.

  3. The crawler connects to the data store. Some data stores require connection properties for crawler access.

  4. The inferred schema is created for your data.

  5. The crawler writes metadata to the Data Catalog. A table definition contains metadata about the data in your data store. The table is written to a database, which is a container of tables in the Data Catalog. Attributes of a table include classification, which is a label created by the classifier that inferred the table schema.