Overview of using Amazon Glue

With Amazon Glue, you store metadata in the Amazon Glue Data Catalog. You use this metadata to orchestrate ETL jobs that transform data sources and load your data warehouse or data lake. The following steps describe the general workflow and some of the choices that you make when working with Amazon Glue.

Note

You can use the following steps, or you can create a workflow that automatically performs steps 1 through 3. For more information, see Performing complex ETL activities using blueprints and workflows in Amazon Glue.

  1. Populate the Amazon Glue Data Catalog with table definitions.

    In the console, for persistent data stores, you can add a crawler to populate the Amazon Glue Data Catalog. You can start the Add crawler wizard from the list of tables or the list of crawlers. You choose one or more data stores for your crawler to access. You can also create a schedule that determines how often your crawler runs. For data streams, you can manually create the table definition and define stream properties.

    Optionally, you can provide a custom classifier that infers the schema of your data. You can create custom classifiers using a grok pattern. You don't have to select a classifier when you define a crawler, because Amazon Glue provides built-in classifiers that crawlers use automatically when no custom classifier recognizes your data. For more information about classifiers in Amazon Glue, see Adding classifiers to a crawler in Amazon Glue.
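
    As a rough sketch of what a grok-based custom classifier might look like outside the console, the following builds the request for the CreateClassifier API as exposed by boto3 (`create_classifier` with a `GrokClassifier` structure). The helper name, the classifier name, the classification label, and the log format are hypothetical and chosen only for illustration.

```python
def build_grok_classifier_request(name, classification, grok_pattern,
                                  custom_patterns=None):
    """Build keyword arguments for the Glue CreateClassifier call.

    This helper is hypothetical; only the dictionary keys follow the API.
    """
    grok = {
        "Name": name,
        "Classification": classification,  # label written to matching tables
        "GrokPattern": grok_pattern,
    }
    if custom_patterns:
        grok["CustomPatterns"] = custom_patterns
    return {"GrokClassifier": grok}


# Assumed example: lines such as "2024-01-01T00:00:00Z INFO started"
request = build_grok_classifier_request(
    name="app-log-classifier",
    classification="app_logs",
    grok_pattern="%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}",
)

# With boto3 installed and credentials configured, the call would be:
#   import boto3
#   boto3.client("glue").create_classifier(**request)
```

    Building the request separately from the call keeps the pattern easy to review before anything is created in your account.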

    Crawling some types of data stores requires a connection that provides authentication and location information. If needed, you can create a connection that provides this required information in the Amazon Glue console.

    The crawler reads your data store and creates data definitions and named tables in the Amazon Glue Data Catalog. These tables are organized into a database of your choosing. You can also populate the Data Catalog with manually created tables. With this method, you provide the schema and other metadata to create table definitions in the Data Catalog. Because this method can be tedious and error-prone, it's often better to have a crawler create the table definitions.

    For more information about populating the Amazon Glue Data Catalog with table definitions, see Amazon Glue tables.
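
    Step 1 can also be scripted instead of using the console wizard. The following is a minimal sketch, assuming an Amazon S3 data store and the boto3 Glue client methods `create_crawler` and `start_crawler`. The helper functions, crawler name, role ARN, database name, bucket path, and schedule are all hypothetical placeholders.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the Glue CreateCrawler call (hypothetical helper)."""
    request = {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule  # for example "cron(0 2 * * ? *)"
    return request


def create_and_start_crawler(glue_client, request):
    """Create the crawler, then start a first run immediately."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   glue = boto3.client("glue")
#   create_and_start_crawler(glue, build_crawler_request(
#       "sales-crawler",
#       "arn:aws:iam::123456789012:role/GlueCrawlerRole",
#       "sales_db",
#       "s3://example-bucket/sales/",
#       schedule="cron(0 2 * * ? *)",
#   ))
```

    Data stores that need a connection would use a different entry under `Targets`; this sketch covers only the S3 case.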

  2. Define a job that describes the transformation of data from source to target.

    Generally, to create a job, you have to make the following choices:

    • Choose a table from the Amazon Glue Data Catalog to be the source of the job. Your job uses this table definition to access your data source and interpret the format of your data.

    • Choose a table or location from the Amazon Glue Data Catalog to be the target of the job. Your job uses this information to access your data store.

    • Tell Amazon Glue to generate a PySpark script to transform your source to target. Amazon Glue generates the code to call built-in transforms to convert data from its source schema to target schema format. These transforms perform operations such as copying data, renaming columns, and filtering data to transform data as necessary. You can modify this script in the Amazon Glue console.

    For more information about defining jobs in Amazon Glue, see Authoring jobs in Amazon Glue.
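
    To make the copy, rename, and filter operations concrete, here is a plain-Python illustration that runs without Glue or Spark. It is not the script that Amazon Glue generates. It only mimics the idea, borrowing the four-field mapping tuples `(source_name, source_type, target_name, target_type)` that the `ApplyMapping` transform in the `awsglue` library accepts. The column names, the `CASTS` table, and both functions are assumptions made for this sketch.

```python
# Each mapping: (source_name, source_type, target_name, target_type)
MAPPINGS = [
    ("id", "string", "order_id", "string"),    # rename a column
    ("amount", "string", "amount", "double"),  # copy a column and cast its type
]

# Hypothetical type table for this sketch; Glue has its own type system.
CASTS = {"string": str, "int": int, "double": float}


def apply_mapping(rows, mappings):
    """Return new rows that contain only the mapped columns, renamed and cast."""
    mapped = []
    for row in rows:
        mapped.append({
            target: CASTS[target_type](row[source])
            for source, _source_type, target, target_type in mappings
        })
    return mapped


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate returns True."""
    return [row for row in rows if predicate(row)]


source_rows = [
    {"id": "a1", "amount": "19.5", "note": "gift"},
    {"id": "a2", "amount": "0", "note": ""},
]
target_rows = filter_rows(apply_mapping(source_rows, MAPPINGS),
                          lambda row: row["amount"] > 0)
print(target_rows)
```

    In this sketch, columns that are not listed in `MAPPINGS` (here `note`) are dropped from the output.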

  3. Run your job to transform your data.

    You can run your job on demand, or start it based on one of these trigger types:

    • A trigger that is based on a cron schedule.

    • A trigger that is event-based; for example, the successful completion of another job can start an Amazon Glue job.

    • A trigger that starts a job on demand.

    For more information about triggers in Amazon Glue, see Starting jobs and crawlers using triggers.
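
    The three trigger types above map onto the `Type` values `SCHEDULED`, `CONDITIONAL`, and `ON_DEMAND` of the CreateTrigger API. The following sketch builds one request of each kind for boto3's `create_trigger`. The helper functions and the trigger and job names are hypothetical.

```python
def scheduled_trigger(name, job_name, cron_fields):
    """Trigger that starts a job on a cron schedule, e.g. "0 12 * * ? *"."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron_fields})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, upstream_job):
    """Trigger that starts a job after another job succeeds."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [{
                "LogicalOperator": "EQUALS",
                "JobName": upstream_job,
                "State": "SUCCEEDED",
            }]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_demand_trigger(name, job_name):
    """Trigger that runs only when you start it explicitly."""
    return {"Name": name, "Type": "ON_DEMAND", "Actions": [{"JobName": job_name}]}


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   glue = boto3.client("glue")
#   glue.create_trigger(**scheduled_trigger("nightly", "load-sales", "0 2 * * ? *"))
#   glue.start_job_run(JobName="load-sales")  # run the job directly, no trigger
```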

  4. Monitor your scheduled crawlers and triggered jobs.

    Use the Amazon Glue console to view the following:

    • Job run details and errors.

    • Crawler run details and errors.

    • Any notifications about Amazon Glue activities.

    For more information about monitoring your crawlers and jobs in Amazon Glue, see Monitoring Amazon Glue.
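
    The same run details can be read programmatically. The following sketch assumes the boto3 Glue client methods `get_job_runs` and `get_crawler` and summarizes failures from their responses. The two helper functions are hypothetical, and only the first page of job runs is inspected.

```python
FAILURE_STATES = {"FAILED", "ERROR", "TIMEOUT"}


def failed_job_runs(glue_client, job_name):
    """Return (run_id, error_message) pairs for unsuccessful runs of a job."""
    response = glue_client.get_job_runs(JobName=job_name)
    return [
        (run["Id"], run.get("ErrorMessage", ""))
        for run in response["JobRuns"]
        if run["JobRunState"] in FAILURE_STATES
    ]


def last_crawl_status(glue_client, crawler_name):
    """Return (status, error_message) for the crawler's most recent crawl."""
    crawler = glue_client.get_crawler(Name=crawler_name)["Crawler"]
    last_crawl = crawler.get("LastCrawl", {})
    return last_crawl.get("Status"), last_crawl.get("ErrorMessage")


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   glue = boto3.client("glue")
#   print(failed_job_runs(glue, "load-sales"))
#   print(last_crawl_status(glue, "sales-crawler"))
```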