Overview of blueprints in Amazon Glue

Amazon Glue blueprints provide a way to create and share Amazon Glue workflows. When a complex ETL process applies to several similar use cases, you can create a single blueprint instead of creating a separate Amazon Glue workflow for each use case.

The blueprint specifies the jobs and crawlers to include in a workflow, and specifies parameters that the workflow user supplies when they run the blueprint to create a workflow. The use of parameters enables a single blueprint to generate workflows for the various similar use cases. For more information about workflows, see Overview of workflows in Amazon Glue.

The following are example use cases for blueprints:

  • You want to partition an existing dataset. The input parameters to the blueprint are Amazon Simple Storage Service (Amazon S3) source and target paths and a list of partition columns.

  • You want to snapshot an Amazon DynamoDB table into a SQL data store like Amazon Redshift. The input parameters to the blueprint are the DynamoDB table name and an Amazon Glue connection, which designates an Amazon Redshift cluster and destination database.

  • You want to convert CSV data in multiple Amazon S3 paths to Parquet. You want the Amazon Glue workflow to include a separate crawler and job for each path. The input parameters are the destination database in the Amazon Glue Data Catalog and a comma-delimited list of Amazon S3 paths. Note that in this case, the number of crawlers and jobs that the workflow creates is variable.
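The third use case can be sketched in plain Python to show why the number of crawlers and jobs depends on the input. The function name `plan_entities`, the naming scheme, and the dictionary keys are hypothetical and are not the Amazon Glue blueprint API. A real layout script would build the equivalent objects with the libraries that Amazon Glue provides.

```python
def plan_entities(s3_paths: str, destination_database: str) -> dict:
    """Plan one crawler and one job per S3 path (illustrative structure only)."""
    # Split the comma-delimited parameter and drop empty entries.
    paths = [p.strip() for p in s3_paths.split(",") if p.strip()]
    crawlers, jobs = [], []
    for index, path in enumerate(paths):
        crawler_name = f"csv-crawler-{index}"
        crawlers.append({
            "name": crawler_name,
            "target": path,
            "database": destination_database,
        })
        jobs.append({
            "name": f"csv-to-parquet-{index}",
            "source": path,
            # Each job waits for the crawler that catalogs its path.
            "depends_on": crawler_name,
        })
    return {"crawlers": crawlers, "jobs": jobs}


# Two input paths produce two crawlers and two jobs.
plan = plan_entities("s3://bucket-a/csv/, s3://bucket-b/csv/", "analytics_db")
print(len(plan["crawlers"]), len(plan["jobs"]))
```

Passing a longer list of paths produces proportionally more crawlers and jobs without any change to the blueprint itself.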

Blueprint components

A blueprint is a ZIP archive that contains the following components:

  • A Python layout generator script

    Contains a function that specifies the workflow layout—the crawlers and jobs to create for the workflow, the job and crawler properties, and the dependencies between the jobs and crawlers. The function accepts blueprint parameters and returns a workflow structure (JSON object) that Amazon Glue uses to generate the workflow. Because you use a Python script to generate the workflow, you can add your own logic that is suitable for your use cases.

  • A configuration file

    Specifies the fully qualified name of the Python function that generates the workflow layout. Also specifies the names, data types, and other properties of all blueprint parameters used by the script.

  • (Optional) ETL scripts and supporting files

    As an advanced use case, you can parameterize the location of the ETL scripts that your jobs use. You can include job script files in the ZIP archive and specify a blueprint parameter for an Amazon S3 location where the scripts are to be copied to. The layout generator script can copy the ETL scripts to the designated location and specify that location as the job script location property. You can also include any libraries or other supporting files, provided that your script handles them.
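As a rough illustration of how the configuration file ties the components together, the following sketch parses a small configuration, splits the fully qualified function name into a module and a function, and reports which parameters a user has not yet supplied. The key names `layoutGenerator` and `parameterSpec` are modeled on the blueprint configuration format and should be checked against the Amazon Glue developer documentation. The project name `demo_project`, the parameter names, the type names, and the helper functions are assumptions made for this example.

```python
import json

# Hypothetical configuration content; names and types are illustrative.
CONFIG_TEXT = """
{
  "layoutGenerator": "demo_project.layout.generate_layout",
  "parameterSpec": {
    "WorkflowName": {"type": "String", "collection": false},
    "InputS3Paths": {"type": "S3Uri", "collection": true}
  }
}
"""


def load_config(text: str) -> dict:
    """Parse the configuration and split the fully qualified function name."""
    config = json.loads(text)
    # "pkg.module.func" -> module "pkg.module", function "func"
    module_path, _, function_name = config["layoutGenerator"].rpartition(".")
    return {
        "module": module_path,
        "function": function_name,
        "parameters": config["parameterSpec"],
    }


def missing_parameters(config: dict, supplied: dict) -> list:
    """Return the declared parameter names that the user did not supply."""
    return sorted(name for name in config["parameters"] if name not in supplied)


config = load_config(CONFIG_TEXT)
print(config["module"], config["function"])
print(missing_parameters(config, {"WorkflowName": "nightly-conversion"}))
```

Declaring parameters in one place lets the same layout function serve every workflow generated from the blueprint, with only the supplied values changing.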


[Figure: A box labeled Blueprint contains two smaller boxes, one labeled Python Script and the other labeled Config File.]

Blueprint runs

When you create a workflow from a blueprint, Amazon Glue runs the blueprint, which starts an asynchronous process to create the workflow and the jobs, crawlers, and triggers that the workflow encapsulates. Amazon Glue uses the blueprint run to orchestrate the creation of the workflow and its components. You can track the creation process by checking the blueprint run status. The blueprint run also stores the values that you supplied for the blueprint parameters.


[Figure: A box labeled Blueprint run contains icons labeled Workflow and Parameter Values.]

You can view blueprint runs using the Amazon Glue console or Amazon Command Line Interface (Amazon CLI). When viewing or troubleshooting a workflow, you can always return to the blueprint run to view the blueprint parameter values that were used to create the workflow.
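Because workflow creation is asynchronous, a script typically has to poll the blueprint run until it finishes. The following is a minimal sketch, assuming a client object that exposes `get_blueprint_run` in the shape of the boto3 Glue client, and assuming that `SUCCEEDED` and `FAILED` are the terminal states. The function name `wait_for_blueprint_run` and its polling parameters are hypothetical.

```python
import time


def wait_for_blueprint_run(glue_client, blueprint_name, run_id,
                           poll_seconds=5.0, max_polls=60):
    """Poll a blueprint run until it reaches an assumed terminal state."""
    for _ in range(max_polls):
        response = glue_client.get_blueprint_run(
            BlueprintName=blueprint_name, RunId=run_id
        )
        run = response["BlueprintRun"]
        # Assumption: any other state means the run is still in progress.
        if run["State"] in ("SUCCEEDED", "FAILED"):
            return run
        time.sleep(poll_seconds)
    raise TimeoutError(
        f"Blueprint run {run_id} did not finish after {max_polls} polls"
    )
```

With boto3 installed and credentials configured, the client would presumably come from `boto3.client("glue")`, and the run ID from the response to `start_blueprint_run`. The returned run description is also where the stored parameter values can be read back when troubleshooting a workflow.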

Lifecycle of a blueprint

Blueprints are developed, tested, registered with Amazon Glue, and run to create workflows. There are typically three personas involved in the blueprint lifecycle.

Personas and their tasks:
Amazon Glue developer
  • Writes the workflow layout script and creates the configuration file.

  • Tests the blueprint locally using libraries provided by the Amazon Glue service.

  • Creates a ZIP archive of the script, configuration file, and supporting files and publishes the archive to a location in Amazon S3.

  • Adds a bucket policy to the Amazon S3 bucket that grants read permissions on bucket objects to the Amazon Glue administrator's Amazon Web Services account.

  • Grants IAM read permissions on the ZIP archive in Amazon S3 to the Amazon Glue administrator.

Amazon Glue administrator
  • Registers the blueprint with Amazon Glue. Amazon Glue makes a copy of the ZIP archive into a reserved Amazon S3 location.

  • Grants IAM permissions on the blueprint to data analysts.

Data analyst
  • Runs the blueprint to create a workflow, and provides blueprint parameter values. Checks the blueprint run status to ensure that the workflow and workflow components were successfully generated.

  • Runs and troubleshoots the workflow. Before running the workflow, can verify the workflow by viewing the workflow design graph on the Amazon Glue console.
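
The developer's packaging step can be sketched with the Python standard library. The example builds the ZIP archive in memory so that it runs without touching the file system. The file names, the `demo_project` directory, and the placeholder file contents are assumptions for illustration, not a required layout.

```python
import io
import zipfile


def build_blueprint_archive(files: dict) -> bytes:
    """Return ZIP archive bytes containing the given {name: content} files."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# Hypothetical blueprint contents: a layout script and a configuration file.
archive_bytes = build_blueprint_archive({
    "demo_project/layout.py": (
        "def generate_layout(user_params, system_params):\n"
        "    pass  # placeholder body\n"
    ),
    "demo_project/blueprint.cfg": (
        '{"layoutGenerator": "demo_project.layout.generate_layout"}'
    ),
})

# List the archive members to confirm what would be published.
with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
    print(archive.namelist())
```

From here, the developer would upload the bytes to Amazon S3, and the administrator would register that location as a blueprint, for example with the boto3 Glue client's `create_blueprint` call. Those steps need credentials and a bucket, so they are left out of the sketch.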