Working with Amazon Glue Data Catalog views in Amazon EMR (preview) - Amazon EMR

Note

Amazon Glue Data Catalog views in Amazon EMR are in preview release and are subject to change. The feature is provided as a Preview service as defined in the Amazon Service Terms.

You can create and manage single common views in the Amazon Glue Data Catalog. Single common views are useful because they support multiple SQL query engines, so you can access the same view across different Amazon Web Services services, such as Amazon EMR, Amazon Athena, and Amazon Redshift.

By creating a view in the Data Catalog, you can use resource grants and tag-based access controls in Amazon Lake Formation to grant access to it. With this method of access control, you don't have to configure additional access to the tables that the view references. This method of granting permissions is called definer semantics, and these views are called definer views. For more information about access control in Lake Formation, see Granting and revoking permissions on Data Catalog resources in the Amazon Lake Formation Developer Guide.

Data Catalog views are useful for the following use cases:

  • Granular access control – create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don’t work in the HR department from seeing personally identifiable information (PII).

  • Complete view definition – by applying certain filters to your view in the Data Catalog, you make sure that the data records available in the view are always complete.

  • Enhanced security – the query definition used to create the view must be complete. This requirement makes views in the Data Catalog less susceptible to SQL commands from malicious actors.

  • Simple data sharing – share data with other Amazon Web Services accounts without moving any data. For more information, see Cross-account data sharing in Lake Formation.

Creating a Data Catalog view

Important

During this preview release, Amazon EMR doesn't validate the Spark-SQL that you use when you create the view. To lower risks, we recommend that you limit the users to whom you grant view creation permissions.

To create a Data Catalog view, you must use an IAM role that has the full SELECT permission with Grantable options on all of the tables you want to reference when creating the view. This role is called the definer role. For a full list of permissions and prerequisites required to create a Data Catalog view, see Working with views in the Amazon Lake Formation Developer Guide. You must use the Amazon CLI to configure your IAM role. See Use an IAM role in the Amazon CLI for more information.

Follow these steps to create a Data Catalog view.

Note

To access a Data Catalog view from Apache Spark on Amazon EMR, you must set the dialect to SPARK and the DialectVersion to 3.4.1-amzn-2.

  1. First download the preview model.

    aws s3 cp s3://emr-data-access-control-us-east-1/beta/glue-views/model/service-2.json <path-to-preview-model>/service-2.json
  2. Configure the Amazon CLI to use the preview model.

    aws configure add-model --service-model file:///<path-to-preview-model>/service-2.json --service-name glue-views
  3. Create the view.

    aws glue-views create-table --cli-input-json '{
      "DatabaseName": "<database>",
      "TableInput": {
        "Name": "<view>",
        "StorageDescriptor": {
          "Columns": [
            { "Name": "<col1>", "Type": "<data-type>" },
            ...
            { "Name": "<colN>", "Type": "<data-type>" }
          ]
        },
        "ViewDefinition": {
          "SubObjects": [
            "arn:aws:glue:<aws-region>:<aws-account-id>:table/<database>/<referenced-table1>",
            ...
            "arn:aws:glue:<aws-region>:<aws-account-id>:table/<database>/<referenced-tableN>"
          ],
          "IsProtected": true,
          "Representations": [
            {
              "Dialect": "SPARK",
              "DialectVersion": "3.4.1-amzn-2",
              "ViewOriginalText": "<Spark-SQL>",
              "ViewExpandedText": "<Spark-SQL>"
            }
          ]
        }
      }
    }'
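
Because the request body is long, it can be easier to generate it into a file and pass that file to the CLI. The following is a minimal sketch of that approach, not the documented procedure. The Region, account ID, database, table, view, and column names are all hypothetical placeholders chosen for illustration, and the final command (shown as a comment) assumes the preview model from step 2 is already configured.

```shell
#!/usr/bin/env bash
# Sketch only: every concrete value below (Region, account ID, database,
# table, view, and column names) is a hypothetical placeholder.
set -eu

REGION="us-east-1"
ACCOUNT_ID="111122223333"
DATABASE="sales_db"
VIEW="orders_view"
TABLE="orders"
VIEW_SQL="SELECT order_id, amount FROM ${DATABASE}.${TABLE}"
PAYLOAD="${TMPDIR:-/tmp}/create-view.json"

# Write the request body to a file so the JSON stays readable.
# The heredoc delimiter is unquoted, so the shell expands the variables.
cat > "$PAYLOAD" <<EOF
{
  "DatabaseName": "${DATABASE}",
  "TableInput": {
    "Name": "${VIEW}",
    "StorageDescriptor": {
      "Columns": [
        { "Name": "order_id", "Type": "string" },
        { "Name": "amount", "Type": "double" }
      ]
    },
    "ViewDefinition": {
      "SubObjects": [
        "arn:aws:glue:${REGION}:${ACCOUNT_ID}:table/${DATABASE}/${TABLE}"
      ],
      "IsProtected": true,
      "Representations": [
        {
          "Dialect": "SPARK",
          "DialectVersion": "3.4.1-amzn-2",
          "ViewOriginalText": "${VIEW_SQL}",
          "ViewExpandedText": "${VIEW_SQL}"
        }
      ]
    }
  }
}
EOF

echo "Wrote $PAYLOAD"
# Then, with the preview model from step 2 configured:
#   aws glue-views create-table --cli-input-json "file://$PAYLOAD"
```

Keeping the SQL in a single variable makes it harder for ViewOriginalText and ViewExpandedText to drift apart. A query that contains double quotes would need JSON escaping before it is embedded this way.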

Enabling access to a Data Catalog view

Important

We recommend that you enable access to Data Catalog views only with EMR clusters in testing environments and not production environments.

To access the Data Catalog view from Apache Spark on Amazon EMR, you must first enable support for Lake Formation and use the script below to enable support for views with Spark on Amazon EMR. For more information about enabling support, see Enable Lake Formation with Amazon EMR and Use custom bootstrap actions.

# Download the script and upload it to Amazon S3
# (-O saves the download under the file name used for the upload)
wget https://emr-data-access-control-us-east-1.s3.amazonaws.com/beta/glue-views/ba/enable-mdv.sh -O /Users/$USER/enable-views.sh
aws s3 cp /Users/$USER/enable-views.sh s3://<bucket>/<prefix>/enable-views.sh

# EMR Security Configuration
cat <<EOT > /Users/$USER/lakeformation-protection.json
{
  "AuthorizationConfiguration": {
    "IAMConfiguration": {
      "EnableApplicationScopedIAMRole": true
    },
    "LakeFormationConfiguration": {
      "AuthorizedSessionTagValue": "Amazon EMR"
    }
  },
  "EncryptionConfiguration": {
    "EnableInTransitEncryption": true,
    "InTransitEncryptionConfiguration": {
      "TLSCertificateConfiguration": {
        "CertificateProviderType": "PEM",
        "S3Object": "s3://<bucket>/<prefix>/certificates.zip"
      }
    }
  }
}
EOT

SECURITY_CONFIG="RuntimeRolesWithAWSLakeFormation"
aws emr create-security-configuration \
  --name $SECURITY_CONFIG \
  --security-configuration file:///Users/$USER/lakeformation-protection.json

# EMR Cluster version
RELEASE_LABEL="emr-6.15.0"

Then run the following Amazon CLI command, which uses the bootstrap action to create an EMR cluster that supports Data Catalog views.

aws emr create-cluster \
  ...
  --release-label $RELEASE_LABEL \
  --security-configuration $SECURITY_CONFIG \
  --bootstrap-actions \
    Name='Enable Views',Path="s3://<bucket>/<prefix>/enable-views.sh"

Querying a Data Catalog view

Important

During this preview release, we recommend that you access views only from trusted sources. In preview, Amazon EMR performs only a limited number of validations to protect your EMR cluster.

After you create a Data Catalog view, you can use an IAM role to query it. The IAM role must have the SELECT permission on the Data Catalog view. You don't need to grant access to the underlying tables referenced in the view. You must use this IAM role as a runtime role. You can access the view from an EMR cluster using a runtime role from Amazon EMR steps, EMR Studio, and SageMaker Studio. For more information about runtime roles, see Runtime roles for Amazon EMR steps.

Once you have everything set up, you can query your view. For example, after attaching the EMR cluster to your Workspace in EMR Studio, you can run the following query to access a view.

SELECT * FROM <database>.<glue-data-catalog-view> LIMIT 10

Limitations

Consider the following limitations when you use Data Catalog views.

  • You can only create Data Catalog views with Amazon EMR 6.15.0.

  • You can reference at most 10 tables in the view definition.

  • You can only create PROTECTED Data Catalog views. UNPROTECTED views aren't supported.

  • You can't reference tables in another Amazon Web Services account in Data Catalog views.

  • User-defined functions (UDFs) aren't supported.

  • You can't reference open table formats such as Apache Hudi or Apache Iceberg in Data Catalog views.

  • You can't reference other views in Data Catalog views.
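
Two of these limits, the 10-table cap and the ban on cross-account tables, can be checked from the table ARNs alone before you call create-table. The following is a minimal sketch of such a pre-flight check. The check_view_tables function is a hypothetical helper and is not part of the Amazon CLI, and the account ID and ARNs are placeholders. The check does not cover the other limitations (UDFs, open table formats, nested views), which can't be inferred from an ARN.

```shell
#!/usr/bin/env bash
# Sketch only: check_view_tables is a hypothetical helper, not part of the CLI.
# Usage: check_view_tables <own-account-id> <table-arn>...
check_view_tables() {
  local account="$1"; shift

  # Limit: a view definition can reference at most 10 tables.
  if [ "$#" -gt 10 ]; then
    echo "too many tables: $# (limit is 10)" >&2
    return 1
  fi

  local arn arn_account
  for arn in "$@"; do
    # ARN layout: arn:partition:service:region:account-id:resource
    arn_account="$(printf '%s' "$arn" | cut -d: -f5)"
    # Limit: tables in another account can't be referenced.
    if [ "$arn_account" != "$account" ]; then
      echo "cross-account table not allowed: $arn" >&2
      return 1
    fi
  done
  return 0
}

# Example: one table in the caller's own account passes the check.
check_view_tables 111122223333 \
  "arn:aws:glue:us-east-1:111122223333:table/sales_db/orders" \
  && echo "ok to create view"
```

Passing the same ARNs that go into SubObjects keeps the check aligned with the request that is actually sent.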