
Amazon Glue worker types

Overview

Amazon Glue provides multiple worker types to accommodate different workload requirements, from small streaming jobs to large-scale, memory-intensive data processing tasks. This section provides comprehensive information about all available worker types, their specifications, and usage recommendations.

Worker type categories

Amazon Glue offers two main categories of worker types:

G Worker Types: General-purpose compute workers optimized for standard ETL workloads
R Worker Types: Memory-optimized workers designed for memory-intensive Spark applications

Data Processing Units (DPUs)

The resources available on Amazon Glue workers are measured in DPUs. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.

Memory-Optimized DPUs (M-DPUs): R type workers use M-DPUs, which provide twice the memory of a standard DPU at the same worker size. Where a standard DPU provides 16 GB of memory, an M-DPU provides 32 GB, which suits memory-intensive Spark applications.
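The DPU arithmetic above can be sketched as a small helper. This is only an illustration of the calculation: the function name `worker_resources` and the way it parses the worker type name are assumptions made for this example, not part of any Glue API.

```python
def worker_resources(worker_type: str) -> dict:
    """Derive vCPUs and memory from a worker type name such as 'G.4X' or 'R.2X'.

    Hypothetical helper: it applies the per-DPU figures quoted in this section
    (4 vCPUs per unit, 16 GB per DPU, 32 GB per M-DPU).
    """
    family, size = worker_type.split(".")
    units = int(size.rstrip("X"))                  # 'G.4X' -> 4 DPUs (M-DPUs for R)
    memory_per_unit = {"G": 16, "R": 32}[family]   # GB per DPU / M-DPU
    return {"vcpus": 4 * units, "memory_gb": memory_per_unit * units}


print(worker_resources("G.4X"))  # {'vcpus': 16, 'memory_gb': 64}
print(worker_resources("R.2X"))  # {'vcpus': 8, 'memory_gb': 64}
```

The two calls show that a G.4X and an R.2X worker offer the same memory, while the R.2X worker has half the vCPUs.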

Available worker types

G.1X - Standard Worker

DPU: 1 DPU (4 vCPUs, 16 GB memory)
Storage: 94GB disk (approximately 44GB free)
Use Case: Data transforms, joins, and queries - scalable and cost-effective for most jobs

G.2X - Standard Worker

DPU: 2 DPU (8 vCPUs, 32 GB memory)
Storage: 138GB disk (approximately 78GB free)
Use Case: Data transforms, joins, and queries - scalable and cost-effective for most jobs

G.4X - Large Worker

DPU: 4 DPU (16 vCPUs, 64 GB memory)
Storage: 256GB disk (approximately 230GB free)
Use Case: Demanding transforms, aggregations, joins, and queries

G.8X - Extra Large Worker

DPU: 8 DPU (32 vCPUs, 128 GB memory)
Storage: 512GB disk (approximately 485GB free)
Use Case: Most demanding transforms, aggregations, joins, and queries

G.12X - Very Large Worker*

DPU: 12 DPU (48 vCPUs, 192 GB memory)
Storage: 768GB disk (approximately 741GB free)
Use Case: Very large and resource-intensive workloads requiring significant compute capacity

G.16X - Maximum Worker*

DPU: 16 DPU (64 vCPUs, 256 GB memory)
Storage: 1024GB disk (approximately 996GB free)
Use Case: Largest and most resource-intensive workloads requiring maximum compute capacity

R.1X - Memory-Optimized Small*

DPU: 1 M-DPU (4 vCPUs, 32 GB memory)
Use Case: Memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

R.2X - Memory-Optimized Medium*

DPU: 2 M-DPU (8 vCPUs, 64 GB memory)
Use Case: Memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

R.4X - Memory-Optimized Large*

DPU: 4 M-DPU (16 vCPUs, 128 GB memory)
Use Case: Large memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

R.8X - Memory-Optimized Extra Large*

DPU: 8 M-DPU (32 vCPUs, 256 GB memory)
Use Case: Very large memory-intensive workloads with frequent out-of-memory errors or high memory-to-CPU ratio requirements

* You may encounter higher startup latency with these workers. To resolve the issue, try the following:

Wait a few minutes and then submit your job again.
Submit a new job with a reduced number of workers.
Submit a new job using a different worker type or size.

Worker type specifications table

Worker Type Specifications
Worker Type	DPU per Node	vCPU	Memory (GB)	Disk (GB)	Free Disk Space (GB)	Spark Executors per Node
G.1X	1	4	16	94	44	1
G.2X	2	8	32	138	78	1
G.4X	4	16	64	256	230	1
G.8X	8	32	128	512	485	1
G.12X	12	48	192	768	741	1
G.16X	16	64	256	1024	996	1

Note: The table lists G worker types only. R worker types use M-DPUs, so each provides twice the memory of the G worker of the same size: R.1X has 32 GB, R.2X has 64 GB, R.4X has 128 GB, and R.8X has 256 GB.
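As a rough illustration of how the table might be used, the sketch below copies the G worker rows into a dictionary and returns the smallest worker type that meets a memory and free-disk requirement. The names `G_WORKER_SPECS` and `smallest_g_worker` are hypothetical, and the thresholds you pass in are your own estimates, not values Glue reports.

```python
# (DPU, vCPU, memory GB, disk GB, free disk GB), copied from the table above.
G_WORKER_SPECS = {
    "G.1X": (1, 4, 16, 94, 44),
    "G.2X": (2, 8, 32, 138, 78),
    "G.4X": (4, 16, 64, 256, 230),
    "G.8X": (8, 32, 128, 512, 485),
    "G.12X": (12, 48, 192, 768, 741),
    "G.16X": (16, 64, 256, 1024, 996),
}


def smallest_g_worker(min_memory_gb, min_free_disk_gb=0):
    """Return the smallest G worker type meeting both requirements, or None."""
    # Dictionaries keep insertion order, so entries are scanned smallest first.
    for name, (_dpu, _vcpu, memory_gb, _disk_gb, free_gb) in G_WORKER_SPECS.items():
        if memory_gb >= min_memory_gb and free_gb >= min_free_disk_gb:
            return name
    return None


print(smallest_g_worker(48))       # G.4X
print(smallest_g_worker(16, 100))  # G.4X (G.1X and G.2X lack the free disk)
```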

Important considerations

Startup latency

Important

G.12X and G.16X worker types, as well as all R worker types (R.1X through R.8X), may encounter higher startup latency. To resolve the issue, try the following:

Wait a few minutes and then submit your job again.
Submit a new job with a reduced number of workers.
Submit a new job using a different worker type or size.
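The three remedies above can be arranged into an ordered retry plan. The following is a minimal sketch under stated assumptions: the function name `fallback_plan`, the 300-second wait (standing in for "a few minutes"), the halving of the worker count, and the `min_workers` floor are all illustrative choices, not values documented for Glue.

```python
def fallback_plan(worker_type, num_workers, alternatives, wait_seconds=300, min_workers=2):
    """Return ordered (worker_type, num_workers, wait_before_submit) attempts.

    Hypothetical helper: it only builds the plan and does not submit any job.
    """
    plan = [
        (worker_type, num_workers, 0),             # original request
        (worker_type, num_workers, wait_seconds),  # wait, then submit again
        # Reduced number of workers (halving is an assumption of this sketch).
        (worker_type, max(min_workers, num_workers // 2), 0),
    ]
    # Finally, try a different worker type or size with the original count.
    plan += [(alt, num_workers, 0) for alt in alternatives]
    return plan


for attempt in fallback_plan("G.16X", 20, ["G.12X", "G.8X"]):
    print(attempt)
```

Keeping the plan as plain data makes it easy to log or review before any job is submitted.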

Choosing the right worker type

For standard ETL workloads

G.1X or G.2X: Most cost-effective for typical data transforms, joins, and queries
G.4X or G.8X: For more demanding workloads with larger datasets

For large-scale workloads

G.12X: Very large datasets requiring significant compute resources
G.16X: Maximum compute capacity for the most demanding workloads

For memory-intensive workloads

R.1X or R.2X: Small to medium memory-intensive jobs
R.4X or R.8X: Large memory-intensive workloads with frequent OOM errors
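The guidance above amounts to a lookup from workload category to candidate worker types, sketched below. The category labels (`"standard_etl"`, `"memory_large"`, and so on) and the function name `recommend_worker_types` are invented for this example; only the worker type names come from this section.

```python
# Candidate worker types per workload category, following the lists above.
# The category keys are hypothetical labels chosen for this sketch.
_RECOMMENDATIONS = {
    "standard_etl": ["G.1X", "G.2X"],
    "demanding_etl": ["G.4X", "G.8X"],
    "very_large": ["G.12X"],
    "maximum_compute": ["G.16X"],
    "memory_small_medium": ["R.1X", "R.2X"],
    "memory_large": ["R.4X", "R.8X"],
}


def recommend_worker_types(workload: str) -> list:
    """Return candidate worker types for a workload category."""
    if workload not in _RECOMMENDATIONS:
        raise ValueError(f"unknown workload category: {workload!r}")
    return list(_RECOMMENDATIONS[workload])  # copy so callers cannot mutate the table


print(recommend_worker_types("standard_etl"))  # ['G.1X', 'G.2X']
print(recommend_worker_types("memory_large"))  # ['R.4X', 'R.8X']
```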

Cost optimization considerations

Standard G workers: Provide a balance of compute, memory, and networking resources, and suit a wide variety of workloads at lower cost
R workers: Specialized for memory-intensive tasks with fast performance for workloads that process large data sets in memory

Best practices

Worker selection guidelines

Start with standard workers (G.1X, G.2X) for most workloads
Use R workers when jobs hit frequent out-of-memory errors or rely on memory-intensive operations such as caching, shuffling, and aggregating
Consider G.12X/G.16X for compute-intensive workloads requiring maximum resources
Account for capacity constraints and higher startup latency when using G.12X, G.16X, or R worker types in time-sensitive workflows
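One way to apply these guidelines in practice is to validate the chosen worker type before starting a job run. The sketch below builds a plain dictionary whose keys (`JobName`, `WorkerType`, `NumberOfWorkers`) mirror parameters of the boto3 Glue client's `start_job_run` call; the helper name `build_job_run_args` and the job name `my-etl-job` are hypothetical, and nothing here contacts AWS.

```python
# Worker type names listed in this section.
VALID_WORKER_TYPES = {
    "G.1X", "G.2X", "G.4X", "G.8X", "G.12X", "G.16X",
    "R.1X", "R.2X", "R.4X", "R.8X",
}


def build_job_run_args(job_name: str, worker_type: str, num_workers: int) -> dict:
    """Validate inputs and return keyword arguments for a job run request."""
    if worker_type not in VALID_WORKER_TYPES:
        raise ValueError(f"unsupported worker type: {worker_type!r}")
    if num_workers < 1:
        raise ValueError("num_workers must be a positive integer")
    return {
        "JobName": job_name,
        "WorkerType": worker_type,
        "NumberOfWorkers": num_workers,
    }


# Start with a standard worker, as the first guideline suggests.
args = build_job_run_args("my-etl-job", "G.1X", 10)  # hypothetical job name
print(args)
# With boto3 installed and credentials configured, the request could be sent as:
#   boto3.client("glue").start_job_run(**args)
```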

Performance optimization

Monitor CloudWatch metrics to understand resource utilization
Use appropriate worker counts based on data size and complexity
Consider data partitioning strategies to optimize worker efficiency
