Transparent encryption in HDFS on Amazon EMR
Transparent encryption is implemented through the use of HDFS encryption
zones, which are HDFS paths that you define. Each encryption zone has its
own key, which is stored in the key server specified using the hdfs-site
configuration classification.
Beginning with Amazon EMR release version 4.8.0, you can use Amazon EMR security configurations to configure data encryption settings for clusters more easily. Security configurations offer settings to enable security for data in-transit and data at-rest in Amazon Elastic Block Store (Amazon EBS) storage volumes and EMRFS data in Amazon S3. For more information, see Encrypt data in transit and at rest in the Amazon EMR Management Guide.
Amazon EMR uses the Hadoop KMS by default; however, you can use another KMS that implements the KeyProvider API operation. Each file in an HDFS encryption zone has its own unique data encryption key, which is encrypted by the encryption zone key. HDFS data is encrypted end-to-end (at-rest and in-transit) when data is written to an encryption zone because encryption and decryption activities only occur in the client.
You cannot move files between encryption zones or from an encryption zone to unencrypted paths.
The NameNode and HDFS client interact with the Hadoop KMS (or an alternate KMS you configured) through the KeyProvider API operation. The KMS is responsible for storing encryption keys in the backing keystore. Also, Amazon EMR includes the JCE unlimited strength policy, so you can create keys at a desired length.
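The two-tier key model described above can be sketched outside HDFS with OpenSSL: a per-file data encryption key (DEK) is wrapped by the encryption zone key to produce an encrypted DEK (EDEK), and only the EDEK is stored with the file metadata. The file names and the use of `openssl enc` here are illustrative assumptions, not HDFS internals.

```shell
# Illustrative sketch of the EZ-key/DEK relationship (not actual HDFS code).
set -e
# Zone key: in HDFS this lives only in the KMS backing keystore.
openssl rand -hex 32 > zone.key
# Per-file DEK: generated for each new file in the encryption zone.
openssl rand -hex 32 > file.dek
# The KMS wraps the DEK with the zone key; the NameNode stores only this EDEK.
openssl enc -aes-256-ctr -pbkdf2 -pass file:zone.key -in file.dek -out file.edek
# At read time the client asks the KMS to unwrap the EDEK back into the DEK,
# then performs all encryption and decryption itself (end-to-end).
openssl enc -d -aes-256-ctr -pbkdf2 -pass file:zone.key -in file.edek -out file.dek.unwrapped
cmp -s file.dek file.dek.unwrapped && echo "DEK unwrap OK"
```

Because only the client holds the plaintext DEK, neither the NameNode nor the DataNodes ever see unencrypted key material.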
For more information, see Transparent encryption in HDFS in the Apache Hadoop documentation.
Note
In Amazon EMR, KMS over HTTPS is not enabled by default with Hadoop KMS. For more information about how to enable KMS over HTTPS, see the Hadoop KMS documentation.
Configuring HDFS transparent encryption
You can configure transparent encryption in Amazon EMR by creating keys and adding encryption zones. You can do this in several ways:

- Using the Amazon EMR configuration API operation when you create a cluster
- Using a Hadoop JAR step with command-runner.jar
- Logging in to the master node of the Hadoop cluster and using the hadoop key and hdfs crypto command line clients
- Using the REST APIs for Hadoop KMS and HDFS

For more information about the REST APIs, see the respective documentation for Hadoop KMS and HDFS.
To create encryption zones and their keys at cluster creation using the CLI
The hdfs-encryption-zones classification in the configuration API operation allows you to specify a key name and an encryption zone when you create a cluster. Amazon EMR creates this key in Hadoop KMS on your cluster and configures the encryption zone.
1. Create a cluster with the following command.

```
aws emr create-cluster --release-label emr-7.5.0 --instance-type m5.xlarge --instance-count 2 \
--applications Name=App1 Name=App2 \
--configurations https://s3.amazonaws.com/amzn-s3-demo-bucket/myfolder/myConfig.json
```

Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

myConfig.json:

```
[
  {
    "Classification": "hdfs-encryption-zones",
    "Properties": {
      "/myHDFSPath1": "path1_key",
      "/myHDFSPath2": "path2_key"
    }
  }
]
```
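Before uploading myConfig.json to S3, it can be checked locally; the sketch below writes the example classification and validates it (using python3 for validation is an assumption, any JSON checker works).

```shell
# Write the example classification and confirm it parses as JSON before
# uploading it to the S3 location passed to --configurations.
cat > myConfig.json <<'EOF'
[
  {
    "Classification": "hdfs-encryption-zones",
    "Properties": {
      "/myHDFSPath1": "path1_key",
      "/myHDFSPath2": "path2_key"
    }
  }
]
EOF
python3 -m json.tool myConfig.json > /dev/null && echo "myConfig.json OK"
# aws s3 cp myConfig.json s3://amzn-s3-demo-bucket/myfolder/myConfig.json
```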
To create encryption zones and their keys manually on the master node
1. Launch your cluster using an Amazon EMR release greater than 4.1.0.

2. Connect to the master node of the cluster with SSH.

3. Create a key within Hadoop KMS.

```
$ hadoop key create path2_key
path2_key has been successfully created with options Options{cipher='AES/CTR/NoPadding', bitLength=256, description='null', attributes=null}.
KMSClientProvider[http://ip-x-x-x-x.ec2.internal:16000/kms/v1/] has been updated.
```

Important
Hadoop KMS requires your key names to be lowercase. If you use a key that has uppercase characters, then your cluster will fail during launch.

4. Create the encryption zone path in HDFS.

```
$ hadoop fs -mkdir /myHDFSPath2
```

5. Make the HDFS path an encryption zone using the key that you created.

```
$ hdfs crypto -createZone -keyName path2_key -path /myHDFSPath2
Added encryption zone /myHDFSPath2
```
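Because Hadoop KMS rejects uppercase key names at launch, a small guard before creating a key can prevent a failed cluster; the helper below is an illustration, not an EMR or Hadoop tool.

```shell
# Normalize a proposed key name to lowercase before calling hadoop key create.
normalize_key_name() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]'
}
key_name=$(normalize_key_name "Path2_Key")
echo "$key_name"
# On the master node you would then run:
#   hadoop key create "$key_name"
#   hadoop fs -mkdir /myHDFSPath2
#   hdfs crypto -createZone -keyName "$key_name" -path /myHDFSPath2
```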
To create encryption zones and their keys manually using the Amazon CLI
1. Add steps to create the KMS keys and encryption zones manually with the following command.

```
aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps \
Type=CUSTOM_JAR,Name="Create First Hadoop KMS Key",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,"\"hadoop key create path1_key\""] \
Type=CUSTOM_JAR,Name="Create First Hadoop HDFS Path",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,"\"hadoop fs -mkdir /myHDFSPath1\""] \
Type=CUSTOM_JAR,Name="Create First Encryption Zone",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,"\"hdfs crypto -createZone -keyName path1_key -path /myHDFSPath1\""] \
Type=CUSTOM_JAR,Name="Create Second Hadoop KMS Key",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,"\"hadoop key create path2_key\""] \
Type=CUSTOM_JAR,Name="Create Second Hadoop HDFS Path",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,"\"hadoop fs -mkdir /myHDFSPath2\""] \
Type=CUSTOM_JAR,Name="Create Second Encryption Zone",Jar="command-runner.jar",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,"\"hdfs crypto -createZone -keyName path2_key -path /myHDFSPath2\""]
```
Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).
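The six step definitions above differ only in name and command. A helper that emits one Type=CUSTOM_JAR spec (a convenience sketch, not part of the Amazon CLI) keeps the repeated quoting consistent.

```shell
# Emit one command-runner.jar step spec for aws emr add-steps.
mk_step() {  # mk_step <step-name> <shell-command>
  printf '%s\n' "Type=CUSTOM_JAR,Name=\"$1\",Jar=\"command-runner.jar\",ActionOnFailure=CONTINUE,Args=[/bin/bash,-c,\"\\\"$2\\\"\"]"
}
mk_step "Create First Hadoop KMS Key" "hadoop key create path1_key"
mk_step "Create First Encryption Zone" "hdfs crypto -createZone -keyName path1_key -path /myHDFSPath1"
# Pass the emitted specs to: aws emr add-steps --cluster-id j-2AXXXXXXGAPLF --steps ...
```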
Considerations for HDFS transparent encryption
A best practice is to create a separate encryption zone for each application that may write files. You can also encrypt all of HDFS by using the hdfs-encryption-zones classification in the configuration API and specifying the root path (/) as the encryption zone.
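Encrypting all of HDFS then comes down to a one-entry classification; in the sketch below, the key name hdfs_root_key and the file name are hypothetical examples.

```shell
# hdfs-encryption-zones with the root path: every file written to HDFS
# lands in an encryption zone (key name is a placeholder).
cat > rootZoneConfig.json <<'EOF'
[
  {
    "Classification": "hdfs-encryption-zones",
    "Properties": {
      "/": "hdfs_root_key"
    }
  }
]
EOF
python3 -m json.tool rootZoneConfig.json > /dev/null && echo "rootZoneConfig.json OK"
```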
Hadoop key management server
Hadoop KMS provides key management for Hadoop clusters and serves as the key provider for transparent encryption in HDFS on Amazon EMR. To configure Hadoop KMS, use the hadoop-kms-site classification to change settings. To configure ACLs, use the kms-acls classification.
For more information, see the Hadoop KMS documentation.
Note
In Amazon EMR, KMS over HTTPS is not enabled by default with Hadoop KMS. To learn how to enable KMS over HTTPS, see the Hadoop KMS documentation.
Important
Hadoop KMS requires your key names to be lowercase. If you use a key that has uppercase characters, then your cluster will fail during launch.
Configuring Hadoop KMS in Amazon EMR
With Amazon EMR release version 4.6.0 or later, the kms-http-port is 9700 and the kms-admin-port is 9701.
You can configure Hadoop KMS at cluster creation time using the configuration API for Amazon EMR releases. The following are the configuration object classifications available for Hadoop KMS:
| Classification | Filename |
| --- | --- |
| hadoop-kms-site | kms-site.xml |
| hadoop-kms-acls | kms-acls.xml |
| hadoop-kms-env | kms-env.sh |
| hadoop-kms-log4j | kms-log4j.properties |
To set Hadoop KMS ACLs using the CLI
1. Create a cluster with Hadoop KMS with ACLs using the following command:

```
aws emr create-cluster --release-label emr-7.5.0 --instance-type m5.xlarge --instance-count 2 \
--applications Name=App1 Name=App2 \
--configurations https://s3.amazonaws.com/amzn-s3-demo-bucket/myfolder/myConfig.json
```

Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

myConfig.json:

```
[
  {
    "Classification": "hadoop-kms-acls",
    "Properties": {
      "hadoop.kms.blacklist.CREATE": "hdfs,foo,myBannedUser",
      "hadoop.kms.acl.ROLLOVER": "myAllowedUser"
    }
  }
]
```
To disable Hadoop KMS cache using the CLI
1. Create a cluster with the Hadoop KMS property hadoop.kms.cache.enable set to false, using the following command:

```
aws emr create-cluster --release-label emr-7.5.0 --instance-type m5.xlarge --instance-count 2 \
--applications Name=App1 Name=App2 \
--configurations https://s3.amazonaws.com/amzn-s3-demo-bucket/myfolder/myConfig.json
```

Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

myConfig.json:

```
[
  {
    "Classification": "hadoop-kms-site",
    "Properties": {
      "hadoop.kms.cache.enable": "false"
    }
  }
]
```
To set environment variables in the kms-env.sh script using the CLI

1. Change settings in kms-env.sh via the hadoop-kms-env configuration. Create a cluster with Hadoop KMS using the following command:

```
aws emr create-cluster --release-label emr-7.5.0 --instance-type m5.xlarge --instance-count 2 \
--applications Name=App1 Name=App2 \
--configurations https://s3.amazonaws.com/amzn-s3-demo-bucket/myfolder/myConfig.json
```

Note
Linux line continuation characters (\) are included for readability. They can be removed or used in Linux commands. For Windows, remove them or replace with a caret (^).

myConfig.json:

```
[
  {
    "Classification": "hadoop-kms-env",
    "Properties": {},
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "JAVA_LIBRARY_PATH": "/path/to/files",
          "KMS_SSL_KEYSTORE_FILE": "/non/Default/Path/.keystore",
          "KMS_SSL_KEYSTORE_PASS": "myPass"
        },
        "Configurations": []
      }
    ]
  }
]
```
For information about configuring Hadoop KMS, see the Hadoop KMS documentation.
HDFS transparent encryption on EMR clusters with multiple master nodes
On an Amazon EMR cluster with multiple primary nodes, Apache Ranger KMS stores its root key and encryption zone (EZ) keys in your Amazon RDS instance. To enable transparent encryption in HDFS on an Amazon EMR cluster with multiple primary nodes, you must provide the following configurations.
- Amazon RDS or your own MySQL server connection URL to store the Ranger KMS root key and EZ keys
- User name and password for MySQL
- Password for the Ranger KMS root key
- Certificate Authority (CA) PEM file for the SSL connection to the MySQL server. You can download the certificate bundle for your Amazon Web Services Region from Download certificate bundles for Amazon RDS.
You can provide these configurations by using the ranger-kms-dbks-site classification and the ranger-kms-db-ca classification, as the following example demonstrates.

```
[
  {
    "Classification": "ranger-kms-dbks-site",
    "Properties": {
      "ranger.ks.jpa.jdbc.url": "jdbc:log4jdbc:mysql://mysql-host-url.xx-xxx-1.xxx.amazonaws.com:3306/rangerkms",
      "ranger.ks.jpa.jdbc.user": "mysql-user-name",
      "ranger.ks.jpa.jdbc.password": "mysql-password",
      "ranger.db.encrypt.key.password": "password-for-encrypting-a-master-key"
    }
  },
  {
    "Classification": "ranger-kms-db-ca",
    "Properties": {
      "ranger.kms.trust.ca.file.s3.url": "<S3-path-of-downloaded-pem-file>"
    }
  }
]
```
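A typo in ranger.ks.jpa.jdbc.url only surfaces when Ranger KMS fails to start, so it can help to pull the host and port out of the URL and probe them beforehand. The sketch below uses the example URL from above; the availability of nc for the reachability check is an assumption.

```shell
# Extract host and port from the Ranger KMS JDBC URL for a reachability check.
jdbc_url="jdbc:log4jdbc:mysql://mysql-host-url.xx-xxx-1.xxx.amazonaws.com:3306/rangerkms"
host_port=${jdbc_url#*mysql://}   # strip the jdbc:log4jdbc:mysql:// prefix
host_port=${host_port%%/*}        # drop the /rangerkms database suffix
host=${host_port%%:*}
port=${host_port##*:}
echo "host=$host port=$port"
# nc -z "$host" "$port"   # run from the cluster subnet to verify MySQL is reachable
```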
The following are configuration object classifications for Apache Ranger KMS.
| Classification | Description |
| --- | --- |
| ranger-kms-dbks-site | Change values in the dbks-site.xml file of Ranger KMS. |
| ranger-kms-site | Change values in the ranger-kms-site.xml file of Ranger KMS. |
| ranger-kms-env | Change values in the Ranger KMS environment. |
| ranger-kms-log4j | Change values in the kms-log4j.properties file of Ranger KMS. |
| ranger-kms-db-ca | Change values for the CA file on S3 for the MySQL SSL connection with Ranger KMS. |
Considerations
- It is highly recommended that you encrypt your Amazon RDS instance to improve security. For more information, see Overview of encrypting Amazon RDS resources.
- It is highly recommended that you use a separate MySQL database for each Amazon EMR cluster with multiple primary nodes to maintain a high security bar.
- To configure transparent encryption in HDFS on an Amazon EMR cluster with multiple primary nodes, you must specify the hdfs-encryption-zones classification when you create the cluster. Otherwise, Ranger KMS will not be configured or started. Reconfiguring the hdfs-encryption-zones classification or any of the Hadoop KMS configuration classifications on a running cluster is not supported on an Amazon EMR cluster with multiple primary nodes.
- The PEM certificate bundle that you download from Download certificate bundles for Amazon RDS groups multiple certificates into one file. Amazon EMR 7.3.0 and higher supports importing multiple certificates from the PEM file with the ranger.kms.trust.ca.file.s3.url configuration.