Using data lake frameworks with Amazon Glue ETL jobs
Open-source data lake frameworks simplify incremental data processing for files that you store in data lakes built on Amazon S3. Amazon Glue 3.0 and later supports the following open-source data lake frameworks:
- Apache Hudi
- Linux Foundation Delta Lake
- Apache Iceberg
We provide native support for these frameworks so that you can read and write data that you store in Amazon S3 in a transactionally consistent manner. There's no need to install a separate connector or complete extra configuration steps to use these frameworks in Amazon Glue ETL jobs.
When you manage datasets through the Amazon Glue Data Catalog, you can use Amazon Glue methods to read and write data lake tables with Spark DataFrames. You can also read and write Amazon S3 data using the Spark DataFrame API.
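As a rough illustration of the two access paths described above, the following Python sketch reads a table through the Data Catalog and reads and appends Delta Lake data with the Spark DataFrame API. It is a minimal sketch, not a complete job: the helper function names, and any database, table, and path values passed to them, are hypothetical, and it assumes the job already has a `GlueContext`, a Spark session, and Delta Lake enabled.

```python
# Minimal sketch. The helper names below are illustrative, not part of any
# Amazon Glue or Spark API. The objects passed in are assumed to be created
# by the job itself (GlueContext, SparkSession, DataFrame).


def read_catalog_table(glue_context, database, table_name):
    """Read a table registered in the Data Catalog as a Spark DataFrame."""
    return glue_context.create_data_frame.from_catalog(
        database=database,
        table_name=table_name,
    )


def read_delta_path(spark, s3_path):
    """Read a Delta Lake table directly from an Amazon S3 path."""
    return spark.read.format("delta").load(s3_path)


def append_delta_path(df, s3_path):
    """Append the rows of a DataFrame to a Delta Lake table at an S3 path."""
    df.write.format("delta").mode("append").save(s3_path)
```

In a job script, `spark` would typically be taken from `glueContext.spark_session`, and the DataFrame returned by either read helper could be transformed and then passed to `append_delta_path`. Hudi and Iceberg tables follow the same general pattern with their own format names and options.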
In this video, you can learn the basics of how Apache Hudi, Apache Iceberg, and Delta Lake work, and see how to insert, update, and delete data in your data lake with each framework.