Using Delta Lake OSS with EMR Serverless
How to use Delta Lake OSS with EMR Serverless applications
1. Build an open source version of Delta Lake that's compatible with the version of Spark on your Amazon EMR Serverless application. To do so, navigate to the Delta GitHub repository and follow the instructions there.

2. Upload the Delta Lake libraries to an Amazon S3 bucket in your Amazon Web Services account.

3. When you submit EMR Serverless jobs, include the Delta Lake JAR files that are now in your bucket in the application configuration:
   --conf spark.jars=s3://DOC-EXAMPLE-BUCKET/jars/delta-core_2.12-1.1.0.jar
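Step 3 above can be sketched as an AWS CLI job submission. This is an illustration, not a complete recipe: the application ID, the job runtime role ARN, and the entry-point script path are placeholders you would replace with your own values.

```shell
# Submit a Spark job to an EMR Serverless application, adding the
# Delta Lake JAR uploaded earlier via --conf spark.jars.
# <application-id>, <job-runtime-role-arn>, and the entryPoint script
# are placeholders for your own resources.
aws emr-serverless start-job-run \
    --application-id <application-id> \
    --execution-role-arn <job-runtime-role-arn> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://DOC-EXAMPLE-BUCKET/scripts/delta-test.py",
            "sparkSubmitParameters": "--conf spark.jars=s3://DOC-EXAMPLE-BUCKET/jars/delta-core_2.12-1.1.0.jar"
        }
    }'
```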
4. To confirm that you can write to and read from a Delta table, run a sample PySpark test.
   from pyspark import SparkConf, SparkContext
   from pyspark.sql import HiveContext, SparkSession

   import uuid

   conf = SparkConf()
   sc = SparkContext(conf=conf)
   sqlContext = HiveContext(sc)
   session = SparkSession(sc)

   url = "s3://DOC-EXAMPLE-BUCKET/delta-lake/output/1.0.1/%s/" % str(uuid.uuid4())

   ## creates a Delta table and outputs to target S3 bucket
   session.range(5).write.format("delta").save(url)

   ## reads a Delta table and outputs to target S3 bucket
   session.read.format("delta").load(url).show()
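The test writes to a uuid-suffixed path so that each run targets a fresh S3 location and a leftover table from an earlier run can't mask a failure. That path construction can be factored out into a small helper; this is a sketch, and `unique_output_path` is a name invented here, not part of any library.

```python
import uuid


def unique_output_path(bucket: str, prefix: str) -> str:
    """Build a unique S3 output path so repeated test runs
    never collide with an existing Delta table."""
    return "s3://%s/%s/%s/" % (bucket, prefix.strip("/"), uuid.uuid4())


# Each call yields a distinct path under the same prefix.
path = unique_output_path("DOC-EXAMPLE-BUCKET", "delta-lake/output/1.0.1")
print(path)
```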