Use an Iceberg cluster with Spark
Starting with Amazon EMR version 6.5.0, you can use Iceberg with your Spark cluster without bootstrap actions. For Amazon EMR versions 6.4.0 and earlier, you can use a bootstrap action to pre-install all necessary dependencies.
In this tutorial, you use the Amazon CLI to work with Iceberg on an Amazon EMR Spark cluster. To use the console to create a cluster with Iceberg installed, follow the steps in Build an Apache Iceberg data lake using Amazon Athena, Amazon EMR, and Amazon Glue.
Create an Iceberg cluster
You can create a cluster with Iceberg installed using the Amazon Web Services Management Console, the Amazon CLI, or the Amazon EMR API.
To use Iceberg on Amazon EMR with the Amazon CLI, first create a cluster with the following steps. For information on specifying the Iceberg classification using the Amazon CLI, see Supply a configuration using the Amazon CLI when you create a cluster or Supply a configuration using the Java SDK when you create a cluster.
-
Create a configurations.json file with the following content:
[
  {
    "Classification": "iceberg-defaults",
    "Properties": {
      "iceberg.enabled": "true"
    }
  }
]
-
Next, create a cluster with the following configuration. Replace the example Amazon S3 bucket path and the subnet ID with your own.
aws emr create-cluster --release-label emr-6.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name My_Spark_Iceberg_Cluster \
--log-uri s3://DOC-EXAMPLE-BUCKET/ \
--instance-type m5.xlarge \
--instance-count 2 \
--service-role EMR_DefaultRole_V2 \
--ec2-attributes InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0
Alternatively, you can create an Amazon EMR cluster with the Spark application and include the file
/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar
as a JAR dependency in a Spark job. For more information, see Submitting Applications.
To include the JAR as a dependency in a Spark job, add the following configuration property to the Spark application:
--conf "spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar"
For more information about Spark job dependencies, see Dependency Management.
Initialize a Spark session for Iceberg
The following examples demonstrate how to launch the interactive Spark shell, use Spark submit, or use Amazon EMR Notebooks to work with Iceberg on Amazon EMR.
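As a sketch, a Spark session can be initialized with a Hadoop-type Iceberg catalog as shown below. The catalog name dev and the warehouse location are placeholders, not values defined by Amazon EMR; replace them with your own.

```scala
import org.apache.spark.sql.SparkSession

// The catalog name "dev" and the S3 warehouse path are illustrative placeholders.
val spark = SparkSession.builder()
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.dev", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.dev.type", "hadoop")
  .config("spark.sql.catalog.dev.warehouse", "s3://DOC-EXAMPLE-BUCKET/example-prefix/")
  .getOrCreate()
```

The same properties can instead be passed as --conf arguments to spark-shell or spark-submit when launching a session on the cluster.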
Write to an Iceberg table
The following example shows how to create a DataFrame and write it as an Iceberg dataset. The example demonstrates working with datasets in the Spark shell while connected to the master node over SSH as the default hadoop user.
Note
To paste code samples into the Spark shell, type :paste at the prompt, paste the example, and then press CTRL+D.
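A minimal write sketch, assuming a Spark session configured with an Iceberg catalog named dev; the catalog, database, and table names are placeholders:

```scala
// Sample rows; the schema and the table name dev.db.iceberg_table are illustrative.
val data = spark.createDataFrame(Seq(
  ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
  ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
  ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z")
)).toDF("id", "creation_date", "last_update_time")

// createOrReplace() creates the Iceberg table if it does not exist,
// or replaces its contents if it does.
data.writeTo("dev.db.iceberg_table").createOrReplace()
```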
Read from an Iceberg table
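A minimal read sketch, assuming an Iceberg catalog named dev and an existing table dev.db.iceberg_table (both placeholder names):

```scala
// Load the Iceberg table as a DataFrame; the table name is a placeholder.
val df = spark.read.format("iceberg").load("dev.db.iceberg_table")
df.show()

// Equivalently, query the table with Spark SQL:
spark.sql("SELECT * FROM dev.db.iceberg_table LIMIT 10").show()
```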
Configure Spark properties to use the Amazon Glue Data Catalog as the metastore for Iceberg tables
To use the Amazon Glue Data Catalog as the metastore for Iceberg tables, set the following Spark configuration properties:
spark-submit \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://<bucket>/<prefix> \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.my_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
--conf spark.sql.catalog.my_catalog.lock.table=myGlueLockTable