Adding jobs in Amazon Glue

An Amazon Glue job encapsulates a script that connects to your source data, processes it, and then writes it out to your data target. Typically, a job runs extract, transform, and load (ETL) scripts. Jobs can also run general-purpose Python scripts (Python shell jobs). Amazon Glue triggers can start jobs based on a schedule or event, or on demand. You can monitor job runs to understand runtime metrics such as completion status, duration, and start time.

You can use scripts that Amazon Glue generates or you can provide your own. With a source schema and target location or schema, the Amazon Glue code generator can automatically create an Apache Spark API (PySpark) script. You can use this script as a starting point and edit it to meet your goals.

Amazon Glue can write output files in several data formats, including JSON, CSV, ORC (Optimized Row Columnar), Apache Parquet, and Apache Avro. For some data formats, it can also write output in common compression formats.

There are three types of jobs in Amazon Glue: Spark, Streaming ETL, and Python shell.

  • A Spark job is run in an Apache Spark environment managed by Amazon Glue. It processes data in batches.

  • A streaming ETL job is similar to a Spark job, except that it performs ETL on data streams. It uses the Apache Spark Structured Streaming framework. Some Spark job features are not available to streaming ETL jobs.

  • A Python shell job runs Python scripts as a shell and supports a Python version that depends on the Amazon Glue version you are using. You can use these jobs to schedule and run tasks that don't require an Apache Spark environment.

Defining job properties for Spark jobs

When you define your job on the Amazon Glue console, you provide values for properties to control the Amazon Glue runtime environment.

The following list describes the properties of a Spark job. For the properties of a Python shell job, see Defining job properties for Python shell jobs. For properties of a streaming ETL job, see Defining job properties for a streaming ETL job.

The properties are listed in the order in which they appear on the Add job wizard on Amazon Glue console.

Name

Provide a UTF-8 string with a maximum length of 255 characters.

Description

Provide an optional description of up to 2048 characters.

IAM Role

Specify the IAM role that is used for authorization to resources used to run the job and access data stores. For more information about permissions for running jobs in Amazon Glue, see Identity and access management for Amazon Glue.

Type

The type of ETL job. This is set automatically based on the type of data sources you select.

  • Spark runs an Apache Spark ETL script with the job command glueetl.

  • Spark Streaming runs an Apache Spark streaming ETL script with the job command gluestreaming. For more information, see Streaming ETL jobs in Amazon Glue.

  • Python shell runs a Python script with the job command pythonshell. For more information, see Python shell jobs in Amazon Glue.
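
The mapping from job type to job command described above can be sketched as a small lookup. This is an illustration only: the `JOB_COMMANDS` name, the `build_command` helper, and the example bucket are hypothetical, and the `Name`/`ScriptLocation` keys are assumed to mirror the command structure that the job creation API accepts.

```python
# Job type labels and command names are taken from the list above.
JOB_COMMANDS = {
    "Spark": "glueetl",
    "Spark Streaming": "gluestreaming",
    "Python shell": "pythonshell",
}


def build_command(job_type, script_location):
    """Return a command-style dict for the given job type (hypothetical helper)."""
    if job_type not in JOB_COMMANDS:
        raise ValueError(f"Unknown job type: {job_type!r}")
    return {"Name": JOB_COMMANDS[job_type], "ScriptLocation": script_location}


# Example bucket and script name are placeholders, not real resources.
command = build_command("Spark", "s3://example-bucket/scripts/etl.py")
print(command)
```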

Amazon Glue version

Amazon Glue version determines the versions of Apache Spark and Python that are available to the job, as specified in the following table.

Amazon Glue version    Supported Spark and Python versions
4.0                    Spark 3.3.0; Python 3.10
3.0                    Spark 3.1.1; Python 3.7
2.0                    Spark 2.4.3; Python 3.7
1.0                    Spark 2.4.3; Python 2.7 and Python 3.6
0.9                    Spark 2.2.1; Python 2.7
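
The version table can also be captured as a lookup, for example to find which Amazon Glue versions match the Python version a script was written for. This is a minimal sketch; `GLUE_RUNTIMES` and `glue_versions_for_python` are hypothetical names and not part of any Amazon Glue API.

```python
# Hypothetical lookup built from the version table above.
GLUE_RUNTIMES = {
    "4.0": {"spark": "3.3.0", "python": ["3.10"]},
    "3.0": {"spark": "3.1.1", "python": ["3.7"]},
    "2.0": {"spark": "2.4.3", "python": ["3.7"]},
    "1.0": {"spark": "2.4.3", "python": ["2.7", "3.6"]},
    "0.9": {"spark": "2.2.1", "python": ["2.7"]},
}


def glue_versions_for_python(python_version):
    """Return the Glue versions (newest first) that list the given Python version."""
    return [
        glue_version
        for glue_version, runtime in GLUE_RUNTIMES.items()
        if python_version in runtime["python"]
    ]


print(glue_versions_for_python("3.7"))
```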

Worker type

The resources available on Amazon Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

The following worker types are available:

  • G.1X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 1 DPU (4 vCPUs, 16 GB of memory) with 84 GB disk (approximately 34 GB free). We recommend this worker type for workloads such as data transforms, joins, and queries. It offers a scalable and cost-effective way to run most jobs.

  • G.2X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 2 DPUs (8 vCPUs, 32 GB of memory) with 128 GB disk (approximately 77 GB free). We recommend this worker type for workloads such as data transforms, joins, and queries. It offers a scalable and cost-effective way to run most jobs.

  • G.4X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 4 DPUs (16 vCPUs, 64 GB of memory) with 256 GB disk (approximately 235 GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries. This worker type is available only for Amazon Glue version 3.0 or later Spark ETL jobs in the following Amazon Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), and Europe (Stockholm).

  • G.8X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 8 DPUs (32 vCPUs, 128 GB of memory) with 512 GB disk (approximately 487 GB free). We recommend this worker type for jobs whose workloads contain your most demanding transforms, aggregations, joins, and queries. This worker type is available only for Amazon Glue version 3.0 or later Spark ETL jobs, in the same Amazon Regions as supported for the G.4X worker type.

  • G.025X – When you choose this type, you also provide a value for Number of workers. Each worker maps to 0.25 DPU (2 vCPUs, 4 GB of memory) with 84 GB disk (approximately 34 GB free). We recommend this worker type for low-volume streaming jobs. This worker type is available only for Amazon Glue version 3.0 streaming jobs.

You are charged an hourly rate based on the number of DPUs used to run your ETL jobs. For more information, see the Amazon Glue pricing page.

For Amazon Glue version 1.0 or earlier jobs, when you configure a job using the console and specify a Worker type of Standard, the Maximum capacity is set and the Number of workers becomes the value of Maximum capacity - 1. If you use the Amazon Command Line Interface (Amazon CLI) or Amazon SDK, you can specify the Max capacity parameter, or you can specify both Worker type and the Number of workers.

For Amazon Glue version 2.0 or later jobs, you cannot specify a Maximum capacity. Instead, you should specify a Worker type and the Number of workers.
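
Because billing is based on DPUs, it can help to work out how many DPUs a given worker configuration represents. The following sketch uses the per-worker DPU values listed above; the function and variable names are hypothetical and are not part of any Amazon Glue API.

```python
# DPUs per worker, taken from the worker type descriptions above.
DPUS_PER_WORKER = {
    "G.025X": 0.25,
    "G.1X": 1,
    "G.2X": 2,
    "G.4X": 4,
    "G.8X": 8,
}


def total_dpus(worker_type, number_of_workers):
    """Total DPUs for a job that uses Worker type and Number of workers."""
    if worker_type not in DPUS_PER_WORKER:
        raise ValueError(f"Unknown worker type: {worker_type!r}")
    if number_of_workers < 1:
        raise ValueError("Number of workers must be at least 1")
    return DPUS_PER_WORKER[worker_type] * number_of_workers


def standard_workers_from_max_capacity(max_capacity):
    """For version 1.0 or earlier jobs with the Standard worker type,
    Number of workers is Maximum capacity - 1."""
    return max_capacity - 1


print(total_dpus("G.2X", 10))
print(standard_workers_from_max_capacity(10))
```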

Language

The code in the ETL script defines your job's logic. The script can be coded in Python or Scala. You can choose whether the script that the job runs is generated by Amazon Glue or provided by you. You provide the script name and location in Amazon Simple Storage Service (Amazon S3). Confirm that there isn't a file with the same name as the script directory in the path. To learn more about writing scripts, see Amazon Glue programming guide.

Requested number of workers

For most worker types, you must specify the number of workers that are allocated when the job runs.

Job bookmark

Specify how Amazon Glue processes state information when the job runs. You can have it remember previously processed data, update state information, or ignore state information. For more information, see Tracking processed data using job bookmarks.

Flex execution

When you configure a job using Amazon Glue Studio or the API, you can specify a standard or flexible job execution class. Your jobs may have varying degrees of priority and time sensitivity. The standard execution class is ideal for time-sensitive workloads that require fast job startup and dedicated resources.

The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible job runs are supported for jobs using Amazon Glue version 3.0 or later and G.1X or G.2X worker types.

Flex job runs are billed based on the number of workers running at any point in time. Workers may be added or removed during a flexible job run. Instead of billing as a simple calculation of Max Capacity * Execution Time, each worker contributes for the time it ran during the job run. The bill is the sum of (Number of DPUs per worker * time each worker ran).

For more information, see the help panel in Amazon Glue Studio, or Jobs and Job runs.
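
The difference between the two billing calculations can be illustrated with a short worked example. This sketch computes DPU-hours only, not charges; the function names and the sample worker run times are hypothetical.

```python
def flex_dpu_hours(dpus_per_worker, worker_hours):
    """Sum of (Number of DPUs per worker * time each worker ran), in DPU-hours.

    worker_hours holds one entry per worker: the hours that worker actually ran.
    """
    return sum(dpus_per_worker * hours for hours in worker_hours)


def max_capacity_dpu_hours(dpus_per_worker, number_of_workers, execution_hours):
    """The simple calculation: Max Capacity * Execution Time, in DPU-hours."""
    return dpus_per_worker * number_of_workers * execution_hours


# Hypothetical run: three G.1X workers (1 DPU each) in a job run lasting 1 hour,
# where two of the workers were only present for part of the run.
hours_per_worker = [1.0, 0.5, 0.25]
print(flex_dpu_hours(1, hours_per_worker))
print(max_capacity_dpu_hours(1, 3, 1.0))
```

In this example, the per-worker sum is smaller than the Max Capacity * Execution Time figure because two workers ran for only part of the job run.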

Number of retries

Specify the number of times, from 0 to 10, that Amazon Glue should automatically restart the job if it fails. Jobs that reach the timeout limit are not restarted.

Job timeout

Sets the maximum execution time in minutes. The default is 2880 minutes (48 hours) for batch jobs. When the job execution time exceeds this limit, the job run state changes to TIMEOUT.

For streaming jobs that run indefinitely, leave the value blank, which is the default value for streaming jobs.

Best practices for job timeouts

Jobs are billed based on execution time. To avoid unexpected charges, configure timeout values appropriate for the expected execution time of your job.
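
As a sketch of how the retry and timeout limits above might be checked before a job definition is submitted, the following helper validates both values. The helper name is hypothetical, the `MaxRetries` and `Timeout` keys are assumed to match the corresponding job creation API parameters, and the retry default of 0 is an assumption made for this example.

```python
def retry_and_timeout_settings(max_retries=0, timeout_minutes=2880):
    """Validate and return retry and timeout settings (hypothetical helper).

    Pass timeout_minutes=None to leave the timeout blank, as described
    above for streaming jobs that run indefinitely.
    """
    if not 0 <= max_retries <= 10:
        raise ValueError("Number of retries must be from 0 to 10")
    settings = {"MaxRetries": max_retries}
    if timeout_minutes is not None:
        if timeout_minutes <= 0:
            raise ValueError("Job timeout must be a positive number of minutes")
        settings["Timeout"] = timeout_minutes
    return settings


# A batch job expected to finish in about 40 minutes, capped at 60.
print(retry_and_timeout_settings(max_retries=2, timeout_minutes=60))
# A streaming job with the timeout left blank.
print(retry_and_timeout_settings(timeout_minutes=None))
```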

Advanced Properties
Script filename

A unique script name for your job. Cannot be named Untitled job.

Script path

The Amazon S3 location of the script. The path must be in the form s3://bucket/prefix/path/. It must end with a slash (/) and not include any files.
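
The path rule above can be checked with a few lines of standard-library Python. This is a minimal sketch: `is_valid_script_path` is a hypothetical helper, and it checks only the form of the path (an `s3` scheme, a bucket name, and a trailing slash), not whether the location exists.

```python
from urllib.parse import urlparse


def is_valid_script_path(path):
    """Return True if path looks like s3://bucket/prefix/path/ (ends with a slash)."""
    parsed = urlparse(path)
    return (
        parsed.scheme == "s3"
        and bool(parsed.netloc)           # bucket name is present
        and parsed.path.endswith("/")     # a directory, not a file
    )


# Bucket and prefix names are placeholders.
print(is_valid_script_path("s3://example-bucket/scripts/"))
print(is_valid_script_path("s3://example-bucket/scripts/etl.py"))
```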

Job metrics

Turn on or turn off the creation of Amazon CloudWatch metrics when this job runs. To see profiling data, you must enable this option. For more information about how to turn on and visualize metrics, see Job monitoring and debugging.

Job observability metrics

Turn on the creation of additional observability CloudWatch metrics when this job runs. For more information, see Monitoring with Amazon Glue Observability metrics.

Continuous logging

Turn on continuous logging to Amazon CloudWatch. If this option is not enabled, logs are available only after the job completes. For more information, see Continuous logging for Amazon Glue jobs.

Spark UI

Turn on the use of Spark UI for monitoring this job. For more information, see Enabling the Apache Spark web UI for Amazon Glue jobs.

Spark UI logs path

The path to write logs when Spark UI is enabled.

Spark UI logging and monitoring configuration

Choose one of the following options:

  • Standard: write logs using the Amazon Glue job run ID as the filename. Turn on Spark UI monitoring in the Amazon Glue console.

  • Legacy: write logs using 'spark-application-{timestamp}' as the filename. Do not turn on Spark UI monitoring.

  • Standard and legacy: write logs to both the standard and legacy locations. Turn on Spark UI monitoring in the Amazon Glue console.

Maximum concurrency

Sets the maximum number of concurrent runs that are allowed for this job. The default is 1. An error is returned when this threshold is reached. For example, if a previous run of a job is still running when a new instance is started, you might want to return an error to prevent two instances of the same job from running concurrently. The maximum value you can specify is controlled by a service limit.

Temporary path

Provide the location of a working directory in Amazon S3 where temporary intermediate results are written when Amazon Glue runs the script. Confirm that there isn't a file with the same name as the temporary directory in the path. This directory is used when Amazon Glue reads and writes to Amazon Redshift and by certain Amazon Glue transforms.

Note

Amazon Glue creates a temporary bucket for jobs if a bucket doesn't already exist in a region. This bucket might permit public access. You can either modify the bucket in Amazon S3 to set the public access block, or delete the bucket later after all jobs in that region have completed.

Delay notification threshold (minutes)

Sets the threshold (in minutes) before a delay notification is sent. You can set this threshold to send notifications when a RUNNING, STARTING, or STOPPING job run takes more than an expected number of minutes.

Security configuration

Choose a security configuration from the list. A security configuration specifies how the data at the Amazon S3 target is encrypted: no encryption, server-side encryption with Amazon KMS-managed keys (SSE-KMS), or Amazon S3-managed encryption keys (SSE-S3).

Server-side encryption

If you select this option, when the ETL job writes to Amazon S3, the data is encrypted at rest using SSE-S3 encryption. Both your Amazon S3 data target and any data that is written to an Amazon S3 temporary directory are encrypted. This option is passed as a job parameter. For more information, see Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3) in the Amazon Simple Storage Service User Guide.

Important

This option is ignored if a security configuration is specified.

Use Glue data catalog as the Hive metastore

Select to use the Amazon Glue Data Catalog as the Hive metastore. The IAM role used for the job must have the glue:CreateDatabase permission. A database called “default” is created in the Data Catalog if it does not exist.

Connections

Choose a VPC configuration to access Amazon S3 data sources located in your virtual private cloud (VPC). You can create and manage Network connections in Amazon Glue. For more information, see Connecting to data.

Libraries
Python library path, Dependent JARs path, and Referenced files path

Specify these options if your script requires them. You can define the comma-separated Amazon S3 paths for these options when you define the job. You can override these paths when you run the job. For more information, see Providing your own custom scripts.

Job parameters

A set of key-value pairs that are passed as named parameters to the script. These are default values that are used when the script is run, but you can override them in triggers or when you run the job. You must prefix the key name with --; for example: --myKey. You pass job parameters as a map when using the Amazon Command Line Interface.

For examples, see Python parameters in Passing and accessing Python parameters in Amazon Glue.
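
As an illustration of the key-prefix rule and of how run-time values override defaults, the following sketch builds the parameter map in plain Python. Both helpers and the sample keys are hypothetical; converting values to strings is an assumption made here so that the map can be passed on as simple key-value text.

```python
def to_job_arguments(params):
    """Return a copy of params with every key prefixed by -- and string values."""
    arguments = {}
    for key, value in params.items():
        name = key if key.startswith("--") else "--" + key
        arguments[name] = str(value)
    return arguments


def merge_arguments(defaults, overrides):
    """Apply run-time overrides on top of the job's default parameters."""
    merged = dict(defaults)
    merged.update(overrides)
    return merged


# Default parameters defined on the job (sample keys are placeholders).
defaults = to_job_arguments({"myKey": "default-value", "batch_size": 100})
# Values supplied by a trigger or when the job is run.
run_arguments = merge_arguments(defaults, to_job_arguments({"myKey": "run-value"}))
print(run_arguments)
```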

Tags

Tag your job with a Tag key and an optional Tag value. After tag keys are created, they are read-only. Use tags on some resources to help you organize and identify them. For more information, see Amazon tags in Amazon Glue.

Restrictions for jobs that access Lake Formation managed tables

Keep in mind the following notes and restrictions when creating jobs that read from or write to tables managed by Amazon Lake Formation: