Using the Delta Lake framework in Amazon Glue
Amazon Glue 3.0 and later supports the Linux Foundation Delta Lake framework. Delta Lake is an
open-source data lake storage framework that helps you perform ACID transactions, scale
metadata handling, and unify streaming and batch data processing. This topic covers
available features for using your data in Amazon Glue when you transport or store your data in a
Delta Lake table. To learn more about Delta Lake, see the official Delta Lake documentation.
You can use Amazon Glue to perform read and write operations on Delta Lake tables in Amazon S3, or
work with Delta Lake tables using the Amazon Glue Data Catalog. Additional operations such as insert,
update, and Table batch reads and writes are also supported. When you use Delta Lake tables, you also
have the option to use methods from the Delta Lake Python library such as `DeltaTable.forPath`.
For more information about the Delta Lake Python library, see Delta Lake's Python documentation.
The following table lists the version of Delta Lake included in each Amazon Glue version.
| Amazon Glue version | Supported Delta Lake version |
| --- | --- |
| 4.0 | 2.1.0 |
| 3.0 | 1.0.0 |
To learn more about the data lake frameworks that Amazon Glue supports, see Using data lake frameworks with Amazon Glue ETL jobs.
Enabling Delta Lake for Amazon Glue
To enable Delta Lake for Amazon Glue, complete the following tasks:

- Specify `delta` as a value for the `--datalake-formats` job parameter. For more information, see Using job parameters in Amazon Glue jobs.

- Create a key named `--conf` for your Amazon Glue job, and set it to the following value. Alternatively, you can set the following configuration using `SparkConf` in your script. These settings help Apache Spark correctly handle Delta Lake tables.

  ```
  spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
  ```
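If you prefer the `SparkConf` route, the same three settings can be applied in the script before the Glue context is created. The following is a minimal sketch, not the only valid wiring; the session-building boilerplate assumes a standard PySpark Glue job:

```python
# Sketch: apply the Delta Lake Spark settings via SparkConf instead of --conf.
from pyspark import SparkConf
from pyspark.context import SparkContext
from awsglue.context import GlueContext

conf = SparkConf()
conf.set("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
conf.set("spark.sql.catalog.spark_catalog",
         "org.apache.spark.sql.delta.catalog.DeltaCatalog")
conf.set("spark.delta.logStore.class",
         "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")

# The conf must be set before the SparkContext is created.
sc = SparkContext(conf=conf)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
```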
Lake Formation permission support for Delta tables is enabled by default for Amazon Glue 4.0. No additional configuration is needed for reading from or writing to Lake Formation-registered Delta tables. To read a registered Delta table, the Amazon Glue job IAM role must have the SELECT permission. To write to a registered Delta table, the Amazon Glue job IAM role must have the SUPER permission. To learn more about managing Lake Formation permissions, see Granting and revoking permissions on Data Catalog resources.
Using a different Delta Lake version
To use a version of Delta Lake that Amazon Glue doesn't support, specify your own Delta Lake
JAR files using the `--extra-jars` job parameter. Do not include `delta` as a value for the
`--datalake-formats` job parameter. To use the Delta Lake Python library in this case, you must
specify the library JAR files using the `--extra-py-files` job parameter. The Python library comes
packaged in the Delta Lake JAR files.
Example: Write a Delta Lake table to Amazon S3 and register it to the Amazon Glue Data Catalog
The following Amazon Glue ETL script demonstrates how to write a Delta Lake table to Amazon S3 and register the table to the Amazon Glue Data Catalog.
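A sketch of such a script follows. It assumes `dataFrame` is a Spark DataFrame built earlier in the job, and the angle-bracket placeholders (`<s3Path>`, `<your_database_name>`, and so on) stand in for your own values:

```python
# Sketch: write a Delta Lake table to Amazon S3 and register it in the
# Glue Data Catalog. dataFrame is assumed to exist already.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

additional_options = {"path": "s3://<s3Path>"}

# saveAsTable writes the Delta files to the S3 path and registers the
# table in the catalog database configured as the Hive metastore.
dataFrame.write \
    .format("delta") \
    .options(**additional_options) \
    .mode("append") \
    .partitionBy("<your_partitionkey_field>") \
    .saveAsTable("<your_database_name>.<your_table_name>")
```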
Example: Read a Delta Lake table from Amazon S3 using the Amazon Glue Data Catalog
The following Amazon Glue ETL script reads the Delta Lake table that you created in Example: Write a Delta Lake table to Amazon S3 and register it to the Amazon Glue Data Catalog.
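A sketch of the read, again with the angle-bracket placeholders standing in for your own database and table names:

```python
# Sketch: read a catalog-registered Delta Lake table from S3 through
# the Glue Data Catalog.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

df = glueContext.create_data_frame.from_catalog(
    database="<your_database_name>",
    table_name="<your_table_name>"
)
df.show()
```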
Example: Insert a DataFrame into a Delta Lake table in Amazon S3 using the Amazon Glue Data Catalog
This example inserts data into the Delta Lake table that you created in Example: Write a Delta Lake table to Amazon S3 and register it to the Amazon Glue Data Catalog.
Note
This example requires you to set the `--enable-glue-datacatalog` job parameter in order to use the Amazon Glue Data Catalog as an Apache Spark Hive metastore.
To learn more, see Using job parameters in Amazon Glue jobs.
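A sketch of the insert, assuming `dataFrame` holds the rows to append and the placeholders are replaced with your own values:

```python
# Sketch: insert a DataFrame into an existing catalog-registered Delta table.
# Requires the --enable-glue-datacatalog job parameter described in the note.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session

glueContext.write_data_frame.from_catalog(
    frame=dataFrame,  # assumed: a Spark DataFrame built earlier in the job
    database="<your_database_name>",
    table_name="<your_table_name>"
)
```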
Example: Read a Delta Lake table from Amazon S3 using the Spark API
This example reads a Delta Lake table from Amazon S3 using the Spark API.
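A sketch of the direct read, assuming a `spark` session configured for Delta Lake as described in Enabling Delta Lake for Amazon Glue:

```python
# Sketch: read a Delta Lake table directly from its S3 path with the Spark API,
# bypassing the Glue Data Catalog.
df = spark.read.format("delta").load("s3://<s3Path>")
df.show()
```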
Example: Write a Delta Lake table to Amazon S3 using Spark
This example writes a Delta Lake table to Amazon S3 using Spark.
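A sketch of the direct write, assuming `dataFrame` is a Spark DataFrame built earlier in the job:

```python
# Sketch: write a Delta Lake table directly to an S3 path with the Spark API,
# without registering it in the Glue Data Catalog.
dataFrame.write \
    .format("delta") \
    .mode("overwrite") \
    .partitionBy("<your_partitionkey_field>") \
    .save("s3://<s3Path>")
```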
Example: Read and write Delta Lake table with Lake Formation permission control
This example reads and writes a Delta Lake table with Lake Formation permission control.
- Create a Delta table and register it in Lake Formation

  - To enable Lake Formation permission control, you'll first need to register the table's Amazon S3 path in Lake Formation. For more information, see Registering an Amazon S3 location. You can register it either from the Lake Formation console or by using the Amazon CLI:

    ```
    aws lakeformation register-resource --resource-arn arn:aws:s3:::<s3-bucket>/<s3-folder> --use-service-linked-role --region <REGION>
    ```

    Once you register an Amazon S3 location, any Amazon Glue table pointing to the location (or any of its child locations) will return the value for the `IsRegisteredWithLakeFormation` parameter as true in the `GetTable` call.

  - Create a Delta table that points to the registered Amazon S3 path through Spark:

    Note

    The following are Python examples.

    ```
    dataFrame.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy("<your_partitionkey_field>") \
        .save("s3://<the_s3_path>")
    ```

    After the data has been written to Amazon S3, use the Amazon Glue crawler to create a new Delta catalog table. For more information, see Introducing native Delta Lake table support with Amazon Glue crawlers. You can also create the table manually through the Amazon Glue `CreateTable` API.
- Grant Lake Formation permission to the Amazon Glue job IAM role. You can grant permissions either from the Lake Formation console or by using the Amazon CLI. For more information, see Granting table permissions using the Lake Formation console and the named resource method.

- Read the Delta table registered in Lake Formation. The code is the same as reading a non-registered Delta table. Note that the Amazon Glue job IAM role needs the SELECT permission for the read to succeed.

  ```
  # Example: Read a Delta Lake table from Glue Data Catalog
  df = glueContext.create_data_frame.from_catalog(
      database="<your_database_name>",
      table_name="<your_table_name>",
      additional_options=additional_options
  )
  ```
- Write to a Delta table registered in Lake Formation. The code is the same as writing to a non-registered Delta table. Note that the Amazon Glue job IAM role needs the SUPER permission for the write to succeed.

  By default, Amazon Glue uses `Append` as the saveMode. You can change it by setting the saveMode option in `additional_options`. For information about saveMode support in Delta tables, see Write to a table.

  ```
  glueContext.write_data_frame.from_catalog(
      frame=dataFrame,
      database="<your_database_name>",
      table_name="<your_table_name>",
      additional_options=additional_options
  )
  ```