

# Using EMR Serverless with Amazon Lake Formation for fine-grained access control
<a name="emr-serverless-lf-enable"></a>

## Overview
<a name="emr-serverless-lf-enable-overview"></a>

With Amazon EMR releases 7.2.0 and higher, you can use Amazon Lake Formation to apply fine-grained access controls on Data Catalog tables that are backed by Amazon S3. This capability lets you configure table-, row-, column-, and cell-level access controls for read queries within your Amazon EMR Serverless Spark jobs. To configure fine-grained access control for Apache Spark batch jobs and interactive sessions, use EMR Studio. See the following sections to learn more about Lake Formation and how to use it with EMR Serverless.

Using Amazon EMR Serverless with Amazon Lake Formation incurs additional charges. For more information, refer to [Amazon EMR pricing](https://www.amazonaws.cn/emr/pricing/).

## How EMR Serverless works with Amazon Lake Formation
<a name="emr-serverless-lf-enable-how-it-works"></a>

Using EMR Serverless with Lake Formation lets you enforce a layer of permissions on each Spark job, so that Lake Formation permission controls apply whenever EMR Serverless runs the job. EMR Serverless uses [Spark resource profiles](https://spark.apache.org/docs/latest/api/java/org/apache/spark/resource/ResourceProfile.html) to create two profiles for each job: the user profile runs user-supplied code, while the system profile enforces Lake Formation policies. For more information, refer to [What is Amazon Lake Formation](https://docs.amazonaws.cn/lake-formation/latest/dg/what-is-lake-formation.html) and [Considerations and limitations](https://docs.amazonaws.cn/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html).

When you use pre-initialized capacity with Lake Formation, we suggest that you have a minimum of two Spark drivers. Each Lake Formation-enabled job uses two Spark drivers, one for the user profile and one for the system profile. For the best performance, provision double the number of drivers that you would use for the same jobs without Lake Formation.

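As a rough sizing aid, the doubling guidance above can be expressed as a small helper. This is a minimal sketch only: the function name, the baseline count, and the worker sizes (`4vCPU`, `16GB`) are assumptions for illustration, and the returned dictionary simply follows the shape of the `initialCapacity` request parameter.

```python
def lake_formation_initial_capacity(baseline_drivers, cpu="4vCPU", memory="16GB"):
    """Return an initialCapacity-style dict sized for Lake Formation-enabled jobs.

    baseline_drivers is the (hypothetical) number of drivers you would
    pre-initialize without Lake Formation. Each Lake Formation-enabled job
    uses a user driver and a system driver, so the count is doubled and
    never drops below two.
    """
    if baseline_drivers < 0:
        raise ValueError("baseline_drivers must not be negative")
    driver_count = max(2, 2 * baseline_drivers)
    return {
        "DRIVER": {
            "workerCount": driver_count,
            # Worker sizes are placeholder values, not a recommendation.
            "workerConfiguration": {"cpu": cpu, "memory": memory},
        }
    }
```
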
When you run Spark jobs on EMR Serverless, also consider the impact of dynamic allocation on resource management and cluster performance. The `spark.dynamicAllocation.maxExecutors` configuration, which sets the maximum number of executors per resource profile, applies to both user and system executors. If you set that number equal to the maximum allowed number of executors, your job run might get stuck because one type of executor uses all available resources, which prevents the other type of executor from starting.

To keep a job from running out of resources, EMR Serverless sets the default maximum number of executors per resource profile to 90% of the `spark.dynamicAllocation.maxExecutors` value. You can override this default by setting `spark.dynamicAllocation.maxExecutorsRatio` to a value between 0 and 1. Also configure the following properties to optimize resource allocation and overall performance:
+ `spark.dynamicAllocation.cachedExecutorIdleTimeout`
+ `spark.dynamicAllocation.shuffleTracking.timeout`
+ `spark.cleaner.periodicGC.interval`

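To see how the ratio plays out, the following sketch estimates the per-profile executor cap for a given `spark.dynamicAllocation.maxExecutors` value and builds the matching Spark properties. The function names are hypothetical, and rounding down is an assumption: this guide does not state how EMR Serverless rounds the result.

```python
import math


def per_profile_executor_cap(max_executors, ratio=0.9):
    """Estimate the executor cap that applies to each resource profile.

    Rounding down is an assumption made for this sketch.
    """
    if max_executors < 1:
        raise ValueError("max_executors must be at least 1")
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be greater than 0 and at most 1")
    return math.floor(max_executors * ratio)


def dynamic_allocation_conf(max_executors, ratio=0.9):
    """Build the two Spark properties discussed above as string values."""
    return {
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.maxExecutorsRatio": str(ratio),
    }
```
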
The following is a high-level overview of how EMR Serverless gets access to data protected by Lake Formation security policies.

![How Amazon EMR accesses data protected by Lake Formation security policies.](http://docs.amazonaws.cn/en_us/emr/latest/EMR-Serverless-UserGuide/images/lf-emr-s-architecture.png)


1. A user submits a Spark job to an Amazon Lake Formation-enabled EMR Serverless application.

1. EMR Serverless sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that cannot launch tasks, request executors, or access S3 or the Amazon Glue Data Catalog. It builds a job plan.

1. EMR Serverless sets up a second driver called the system driver and runs it in the system profile (with a privileged identity). EMR Serverless sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plan to the system driver. The system driver does not run user-submitted code. It runs full Spark and communicates with S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.

1. EMR Serverless then runs the stages on executors with the user driver or system driver. User code in any stage is run exclusively on user profile executors.

1. Stages that read data from Data Catalog tables protected by Amazon Lake Formation or those that apply security filters are delegated to system executors.

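To make the division of labor concrete, here is a purely conceptual model of the routing rule described in the last two steps. It is not EMR Serverless code: the `Stage` class, its fields, and the `executor_profile` function are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """A simplified stand-in for one execution stage of a job plan."""
    name: str
    reads_protected_table: bool = False
    applies_security_filter: bool = False


def executor_profile(stage):
    """Return which profile's executors would run the stage in this model."""
    # Reads of Lake Formation-protected tables and security filters are
    # delegated to system executors; everything else, including user code,
    # stays on user executors.
    if stage.reads_protected_table or stage.applies_security_filter:
        return "system"
    return "user"


plan = [
    Stage("scan_orders", reads_protected_table=True),
    Stage("row_filter", applies_security_filter=True),
    Stage("user_udf"),
]
routing = {stage.name: executor_profile(stage) for stage in plan}
```
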
## Enabling Lake Formation in Amazon EMR
<a name="emr-serverless-lf-enable-config"></a>

To enable Lake Formation, set `spark.emr-serverless.lakeformation.enabled` to `true` under the `spark-defaults` classification for the `--runtime-configuration` parameter when [creating an EMR Serverless application](https://docs.amazonaws.cn/emr/latest/EMR-Serverless-UserGuide/getting-started.html#gs-application-console).

```
aws emr-serverless create-application \
    --release-label emr-7.13.0 \
    --runtime-configuration '[{
     "classification": "spark-defaults",
     "properties": {
      "spark.emr-serverless.lakeformation.enabled": "true"
      }
    }]' \
    --type "SPARK"
```
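
If you create applications from code rather than from the CLI, the same setting can be passed through the AWS SDK for Python (Boto3). The following is a minimal sketch, not a complete provisioning script: the helper names are hypothetical, and the release label simply mirrors the CLI example above. The request is built separately from the API call so that you can inspect it before sending it.

```python
def build_create_application_request(release_label, name=None):
    """Build keyword arguments for the EMR Serverless CreateApplication call."""
    request = {
        "releaseLabel": release_label,
        "type": "SPARK",
        # runtimeConfiguration is a list of classification objects.
        "runtimeConfiguration": [
            {
                "classification": "spark-defaults",
                "properties": {
                    "spark.emr-serverless.lakeformation.enabled": "true"
                },
            }
        ],
    }
    if name:
        request["name"] = name
    return request


def create_application(request):
    """Send the request with Boto3 and return the new application ID."""
    import boto3  # imported lazily so the builder works without the SDK

    client = boto3.client("emr-serverless")
    return client.create_application(**request)["applicationId"]
```

For example, `create_application(build_create_application_request("emr-7.13.0"))` would create a Lake Formation-enabled Spark application, assuming your credentials and Region are already configured.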

You can also enable Lake Formation when you create a new application in EMR Studio. Choose **Use Lake Formation for fine-grained access control**, available under **Additional configurations**.

[Inter-worker encryption](https://docs.amazonaws.cn/emr/latest/EMR-Serverless-UserGuide/interworker-encryption.html) is enabled by default when you use Lake Formation with EMR Serverless, so you do not need to explicitly enable inter-worker encryption again.

**Enabling Lake Formation for Spark jobs**

To enable Lake Formation for individual Spark jobs, set `spark.emr-serverless.lakeformation.enabled` to `true` when using `spark-submit`.

```
--conf spark.emr-serverless.lakeformation.enabled=true
```
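
When you submit a job run through the API, this flag travels in the job driver's Spark submit parameters. The sketch below assembles a `jobDriver` structure of the kind accepted by the `StartJobRun` operation. The helper name and the script path used in the usage note are hypothetical.

```python
def build_job_driver(entry_point, extra_conf=None):
    """Build a jobDriver dict with Lake Formation enabled for this job."""
    conf = {"spark.emr-serverless.lakeformation.enabled": "true"}
    # Additional Spark properties (hypothetical input) are appended in order.
    conf.update(extra_conf or {})
    parameters = " ".join(f"--conf {key}={value}" for key, value in conf.items())
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": parameters,
        }
    }
```

You would pass the result as `jobDriver` to the Boto3 `start_job_run` call, together with your `applicationId` and `executionRoleArn`, for example `build_job_driver("s3://amzn-s3-demo-bucket/scripts/job.py")`.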

## Job runtime role IAM permissions
<a name="emr-serverless-lf-enable-permissions"></a>

Lake Formation permissions control access to Amazon Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and Amazon Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the `glue:Get*` API operation. 

The following example policy grants the IAM permissions needed to read a script from S3, upload logs to S3, call Amazon Glue API operations, and access Lake Formation.

------
#### [ JSON ]

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws-cn:s3:::*.amzn-s3-demo-bucket/scripts",
        "arn:aws-cn:s3:::*.amzn-s3-demo-bucket/*"
      ]
    },
    {
      "Sid": "LoggingAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws-cn:s3:::amzn-s3-demo-bucket/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------
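
Before you attach a policy like this to the job runtime role, a quick local check can confirm that it names the actions the job needs. The following is a deliberately simplified sketch: it matches only `Allow` statements by action pattern and ignores `Resource`, `Condition`, explicit `Deny`, and every other policy type, so it is no substitute for IAM policy evaluation. The function name is hypothetical, and the policy literal is abbreviated from the example above.

```python
from fnmatch import fnmatchcase


def allowed_by_policy(policy, action):
    """Return True if any Allow statement lists a pattern matching the action."""
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        # IAM action names are case-insensitive, so compare in lowercase.
        if any(fnmatchcase(action.lower(), pattern.lower()) for pattern in actions):
            return True
    return False


policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow",
         "Action": ["glue:Get*", "glue:Create*", "glue:Update*"],
         "Resource": ["*"]},
        {"Effect": "Allow",
         "Action": ["lakeformation:GetDataAccess"],
         "Resource": ["*"]},
    ],
}
```

With this policy, `allowed_by_policy(policy, "glue:GetTable")` is true, while an action such as `glue:DeleteTable` is not covered.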

## Setting up Lake Formation permissions for job runtime role
<a name="emr-serverless-lf-enable-set-up-grants-for-role"></a>

First, register the location of your Hive table with Lake Formation. Then create permissions for your job runtime role on your desired table. For more details about Lake Formation, refer to [What is Amazon Lake Formation?](https://docs.amazonaws.cn/lake-formation/latest/dg/what-is-lake-formation.html) in the *Amazon Lake Formation Developer Guide*.

After you set up the Lake Formation permissions, submit Spark jobs on Amazon EMR Serverless. For more information about Spark jobs, refer to [Spark examples](https://docs.amazonaws.cn/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-examples).

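The two setup steps can also be scripted with Boto3. The sketch below builds the request payloads for the Lake Formation `RegisterResource` and `GrantPermissions` operations and keeps the API calls in a separate function. The helper names are hypothetical, and registering with the service-linked role is an assumption; you might register the location with a custom role instead.

```python
def build_register_request(s3_location_arn):
    """Build the RegisterResource request for the table's S3 location."""
    # Using the service-linked role is an assumption made for this sketch.
    return {"ResourceArn": s3_location_arn, "UseServiceLinkedRole": True}


def build_grant_request(role_arn, database, table, permissions):
    """Build a GrantPermissions request for the job runtime role on one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {"Table": {"DatabaseName": database, "Name": table}},
        "Permissions": list(permissions),
    }


def apply_grants(register_request, grant_request):
    """Send both requests to Lake Formation with Boto3."""
    import boto3  # imported lazily so the builders work without the SDK

    client = boto3.client("lakeformation")
    client.register_resource(**register_request)
    client.grant_permissions(**grant_request)
```
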
## Submitting a job run
<a name="emr-serverless-lf-enable-submit-job"></a>

After you finish setting up the Lake Formation grants, you can [submit Spark jobs on EMR Serverless](https://docs.amazonaws.cn/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-examples). The sections that follow list the permissions that each operation requires and show sample Spark configurations for each table format.

## Permission requirements
<a name="emr-serverless-lf-enable-otf-permissions"></a>

### Tables not registered in Amazon Lake Formation
<a name="emr-s-lf-otf-permissions"></a>

For tables not registered with Amazon Lake Formation, the job runtime role accesses both the Amazon Glue Data Catalog and the underlying table data in Amazon S3. This requires the job runtime role to have appropriate IAM permissions for both Amazon Glue and Amazon S3 operations. 

### Tables registered in Amazon Lake Formation
<a name="emr-s-lf-otf-permissions-tables-lf-registered"></a>

For tables registered with Amazon Lake Formation, the job runtime role accesses the Amazon Glue Data Catalog metadata, while temporary credentials vended by Lake Formation access the underlying table data in Amazon S3. The Lake Formation permissions required to execute an operation depend on the Amazon Glue Data Catalog and Amazon S3 API calls that the Spark job initiates and can be summarized as follows:
+ **DESCRIBE** permission allows the runtime role to read table or database metadata in the Data Catalog
+ **ALTER** permission allows the runtime role to modify table or database metadata in the Data Catalog
+ **DROP** permission allows the runtime role to delete table or database metadata from the Data Catalog
+ **SELECT** permission allows the runtime role to read table data from Amazon S3
+ **INSERT** permission allows the runtime role to write table data to Amazon S3
+ **DELETE** permission allows the runtime role to delete table data from Amazon S3

**Note**  
Lake Formation evaluates permissions lazily when a Spark job calls Amazon Glue to retrieve table metadata and Amazon S3 to retrieve table data. Jobs that use a runtime role with insufficient permissions will not fail until Spark makes an Amazon Glue or Amazon S3 call that requires the missing permission.

**Note**  
In the following matrix of supported operations:
+ Operations marked as **Supported** exclusively use Lake Formation credentials to access table data for tables registered with Lake Formation. If Lake Formation permissions are insufficient, the operation will not fall back to runtime role credentials. For tables not registered with Lake Formation, the job runtime role credentials access the table data.
+ Operations marked as **Supported with IAM permissions on Amazon S3 location** do not use Lake Formation credentials to access underlying table data in Amazon S3. To run these operations, the job runtime role must have the necessary Amazon S3 IAM permissions to access the table data, regardless of whether the table is registered with Lake Formation.

------
#### [ Hive ]


| Operation | Amazon Lake Formation permissions | Support status | 
| --- | --- | --- | 
| SELECT | SELECT | Supported | 
| CREATE TABLE | CREATE\_TABLE | Supported | 
| CREATE TABLE LIKE | CREATE\_TABLE | Supported with IAM permissions on Amazon S3 location | 
| CREATE TABLE AS SELECT | CREATE\_TABLE | Supported with IAM permissions on Amazon S3 location | 
| DESCRIBE TABLE | DESCRIBE | Supported | 
| SHOW TBLPROPERTIES | DESCRIBE | Supported | 
| SHOW COLUMNS | DESCRIBE | Supported | 
| SHOW PARTITIONS | DESCRIBE | Supported | 
| SHOW CREATE TABLE | DESCRIBE | Supported | 
| ALTER TABLE tablename | SELECT and ALTER | Supported | 
| ALTER TABLE tablename SET LOCATION | - | Not supported | 
| ALTER TABLE tablename ADD PARTITION | SELECT, INSERT and ALTER | Supported | 
| REPAIR TABLE | SELECT and ALTER | Supported | 
| LOAD DATA | - | Not supported | 
| INSERT | INSERT and ALTER | Supported | 
| INSERT OVERWRITE | SELECT, INSERT, DELETE and ALTER | Supported | 
| DROP TABLE | SELECT, DROP, DELETE and ALTER | Supported | 
| TRUNCATE TABLE | SELECT, INSERT, DELETE and ALTER | Supported | 
| DataFrame Writer V1 | Same as corresponding SQL operation | Supported when appending data to an existing table. Refer to [considerations and limitations](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html) for more information. | 
| DataFrame Writer V2 | Same as corresponding SQL operation | Supported when appending data to an existing table. Refer to [considerations and limitations](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html) for more information. | 

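Because permissions are evaluated lazily, it can help to compare the grants on the runtime role against this matrix before you submit a job. The sketch below transcribes a few rows of the Hive table into a dictionary. The helper name `missing_permissions` and the idea of passing the granted permissions as a set are assumptions for illustration, and only the rows shown are covered.

```python
# A partial transcription of the Hive matrix above (not the full table).
HIVE_REQUIRED_PERMISSIONS = {
    "SELECT": {"SELECT"},
    "CREATE TABLE": {"CREATE_TABLE"},
    "DESCRIBE TABLE": {"DESCRIBE"},
    "INSERT": {"INSERT", "ALTER"},
    "INSERT OVERWRITE": {"SELECT", "INSERT", "DELETE", "ALTER"},
    "DROP TABLE": {"SELECT", "DROP", "DELETE", "ALTER"},
}


def missing_permissions(operation, granted):
    """Return the Lake Formation permissions the operation still needs."""
    return HIVE_REQUIRED_PERMISSIONS[operation] - set(granted)
```
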
------
#### [ Iceberg ]


| Operation | Amazon Lake Formation permissions | Support status | 
| --- | --- | --- | 
| SELECT | SELECT | Supported | 
| CREATE TABLE | CREATE\_TABLE | Supported | 
| CREATE TABLE LIKE | CREATE\_TABLE | Supported with IAM permissions on Amazon S3 location | 
| CREATE TABLE AS SELECT | CREATE\_TABLE | Supported with IAM permissions on Amazon S3 location | 
| REPLACE TABLE AS SELECT | SELECT, INSERT and ALTER | Supported | 
| DESCRIBE TABLE | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW TBLPROPERTIES | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW CREATE TABLE | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE | SELECT, INSERT and ALTER | Supported  | 
| ALTER TABLE SET LOCATION | SELECT, INSERT and ALTER | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE WRITE ORDERED BY | SELECT, INSERT and ALTER | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE WRITE DISTRIBUTED BY | SELECT, INSERT and ALTER | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE RENAME TABLE | CREATE\_TABLE and DROP | Supported | 
| INSERT INTO | SELECT, INSERT and ALTER | Supported | 
| INSERT OVERWRITE | SELECT, INSERT and ALTER | Supported | 
| DELETE | SELECT, INSERT and ALTER | Supported | 
| UPDATE | SELECT, INSERT and ALTER | Supported | 
| MERGE INTO | SELECT, INSERT and ALTER | Supported | 
| DROP TABLE | SELECT, DELETE and DROP | Supported | 
| DataFrame Writer V1 | - | Not supported | 
| DataFrame Writer V2 | Same as corresponding SQL operation | Supported when appending data to an existing table. Refer to [considerations and limitations](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html) for more information. | 
| Metadata tables | SELECT | Supported. Certain tables are hidden. Refer to [considerations and limitations](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html) for more information. | 
| Stored procedures | - | Supported for tables that meet the following conditions: [See the AWS documentation website for more details](http://docs.amazonaws.cn/en_us/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html) | 

**Spark configuration for Iceberg:** The following sample shows how to configure Spark with Iceberg. To run Iceberg jobs, provide the following `spark-submit` properties.

```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=<{{S3_DATA_LOCATION}}>
--conf spark.sql.catalog.spark_catalog.glue.account-id=<{{ACCOUNT_ID}}>
--conf spark.sql.catalog.spark_catalog.client.region=<{{REGION}}>
--conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<{{REGION}}>.amazonaws.com
```
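
If you assemble these properties programmatically, a small helper can fill in the placeholders and return a single parameter string. This is a sketch under assumptions: the function name is hypothetical, and the default Glue endpoint follows the pattern shown above. The `glue_endpoint` argument lets you override it, for example in Regions whose endpoints use a different domain suffix such as `amazonaws.com.cn`.

```python
def iceberg_spark_submit_parameters(warehouse, account_id, region, glue_endpoint=None):
    """Return the Iceberg catalog properties as one spark-submit parameter string."""
    # Default endpoint pattern taken from the sample above; override as needed.
    endpoint = glue_endpoint or f"https://glue.{region}.amazonaws.com"
    prefix = "spark.sql.catalog.spark_catalog"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkSessionCatalog",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.glue.account-id": account_id,
        f"{prefix}.client.region": region,
        f"{prefix}.glue.endpoint": endpoint,
    }
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())
```

The returned string can be used as the `sparkSubmitParameters` value of a job run, alongside the Lake Formation flag.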

------
#### [ Hudi ]


| Operation | Amazon Lake Formation permissions | Support status | 
| --- | --- | --- | 
| SELECT | SELECT | Supported | 
| CREATE TABLE | CREATE\_TABLE | Supported with IAM permissions on Amazon S3 location | 
| CREATE TABLE LIKE | CREATE\_TABLE | Supported with IAM permissions on Amazon S3 location | 
| CREATE TABLE AS SELECT | - | Not supported | 
| DESCRIBE TABLE | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW TBLPROPERTIES | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW COLUMNS | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW CREATE TABLE | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE | SELECT | Supported with IAM permissions on Amazon S3 location | 
| INSERT INTO | SELECT and ALTER | Supported with IAM permissions on Amazon S3 location | 
| INSERT OVERWRITE | SELECT and ALTER | Supported with IAM permissions on Amazon S3 location | 
| DELETE | - | Not supported | 
| UPDATE | - | Not supported | 
| MERGE INTO | - | Not supported | 
| DROP TABLE | SELECT and DROP | Supported with IAM permissions on Amazon S3 location | 
| DataFrame Writer V1 | - | Not supported | 
| DataFrame Writer V2 | Same as corresponding SQL operation | Supported with IAM permissions on Amazon S3 location | 
| Metadata tables | - | Not supported | 
| Table maintenance and utility features | - | Not supported | 

The following samples configure Spark with Hudi, specifying file locations and other properties necessary for use.

**Spark config for Hudi:** When used in a notebook, this snippet specifies the path to the Hudi Spark bundle JAR file, which enables Hudi functionality in Spark. It also configures Spark to use the Amazon Glue Data Catalog as the metastore.

```
%%configure -f
{
    "conf": {
        "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
    }
}
```

**Spark config for Hudi with Amazon Glue:** When used in a notebook, this snippet enables Hudi as a supported data-lake format and ensures that Hudi libraries and dependencies are available.

```
%%configure
{
    "--conf": "spark.serializer=org.apache.spark.serializer.JavaSerializer --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "--datalake-formats": "hudi",
    "--enable-glue-datacatalog": "true",
    "--enable-lakeformation-fine-grained-access": "true"
}
```

------
#### [ Delta Lake ]


| Operation | Amazon Lake Formation permissions | Support status | 
| --- | --- | --- | 
| SELECT | SELECT | Supported | 
| CREATE TABLE | CREATE\_TABLE | Supported | 
| CREATE TABLE LIKE | - | Not supported | 
| CREATE TABLE AS SELECT | CREATE\_TABLE | Supported  | 
| REPLACE TABLE AS SELECT | SELECT, INSERT and ALTER | Supported | 
| DESCRIBE TABLE | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW TBLPROPERTIES | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW COLUMNS | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| SHOW CREATE TABLE | DESCRIBE | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE | SELECT and INSERT  | Supported  | 
| ALTER TABLE SET LOCATION | SELECT and INSERT  | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE tablename CLUSTER BY | SELECT and INSERT | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE tablename ADD CONSTRAINT | SELECT and INSERT | Supported with IAM permissions on Amazon S3 location | 
| ALTER TABLE tablename DROP CONSTRAINT | SELECT and INSERT | Supported with IAM permissions on Amazon S3 location | 
| INSERT INTO | SELECT and INSERT | Supported | 
| INSERT OVERWRITE | SELECT and INSERT | Supported | 
| DELETE | SELECT and INSERT | Supported | 
| UPDATE | SELECT and INSERT | Supported | 
| MERGE INTO | SELECT and INSERT | Supported | 
| DROP TABLE | SELECT, DELETE and DROP | Supported | 
| DataFrame Writer V1 | - | Not supported | 
| DataFrame Writer V2 | Same as corresponding SQL operation | Supported  | 
| Table maintenance and utility features | - | Not supported | 

**EMR Serverless with Delta Lake:** To use Delta Lake with Lake Formation on EMR Serverless, run the following command:

```
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```

------