Configuring Amazon DataSync transfers with HDFS

To transfer data to or from your Hadoop Distributed File System (HDFS), you must create an Amazon DataSync transfer location. DataSync can use this location as a source or destination for transferring data.

Accessing HDFS clusters

To connect to your HDFS cluster, DataSync uses an agent that you deploy near your HDFS cluster. To learn more about DataSync agents, see Working with Amazon DataSync agents. The DataSync agent acts as an HDFS client and communicates with the NameNodes and DataNodes in your clusters.

When you start a task, DataSync queries the NameNode for the locations of files and folders on the cluster. If the HDFS location is configured as a source, DataSync reads file and folder data from the DataNodes in the cluster and copies it to the destination. If the HDFS location is configured as a destination, DataSync writes files and folders from the source to the DataNodes in the cluster. Before running your DataSync task, verify agent connectivity to the HDFS cluster. For more information, see Testing your agent's connection to your storage.
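
If you script your transfers with the SDK for Python (boto3), you can confirm that an agent is online before starting a task. The following is a minimal sketch; the Region and agent ARN are placeholders:

  import boto3

  # The Region and agent ARN are placeholders; use your own values.
  datasync = boto3.client("datasync", region_name="cn-north-1")

  agent = datasync.describe_agent(
      AgentArn="arn:aws-cn:datasync:cn-north-1:111122223333:agent/agent-0example1234567890"
  )

  # The agent must be ONLINE before DataSync can run tasks against the cluster.
  print(agent["Status"])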

Authentication

When connecting to an HDFS cluster, DataSync supports simple authentication or Kerberos authentication. To use simple authentication, provide the user name of a user with rights to read and write to the HDFS cluster. To use Kerberos authentication, provide a Kerberos configuration file, a Kerberos key table (keytab) file, and a Kerberos principal name. The credentials of the Kerberos principal must be in the provided keytab file.
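
The same choice appears in the CreateLocationHdfs API operation. The following sketch in Python (boto3) shows the authentication-related parameters for both modes; the user name, principal, and file paths are placeholders:

  # Simple authentication requires only a user name.
  simple_auth = {
      "AuthenticationType": "SIMPLE",
      "SimpleUser": "hdfs-user",  # placeholder user name
  }

  # Kerberos authentication requires a principal, a keytab file, and a
  # Kerberos configuration file. The files are passed as raw bytes.
  with open("hdfs.keytab", "rb") as keytab, open("krb5.conf", "rb") as krb5:
      kerberos_auth = {
          "AuthenticationType": "KERBEROS",
          "KerberosPrincipal": "primary/instance@EXAMPLE.COM",  # placeholder
          "KerberosKeytab": keytab.read(),
          "KerberosKrb5Conf": krb5.read(),
      }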

Encryption

When using Kerberos authentication, DataSync supports encryption of data as it's transmitted between the DataSync agent and your HDFS cluster. Encrypt your data by using the Quality of Protection (QOP) configuration settings on your HDFS cluster and by specifying the QOP settings when creating your HDFS location. The QOP configuration includes settings for data transfer protection and Remote Procedure Call (RPC) protection.
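
If you create locations programmatically, these settings map to the QopConfiguration parameter of the CreateLocationHdfs operation. A brief sketch in Python; the values must match your cluster's Hadoop configuration:

  # RpcProtection corresponds to hadoop.rpc.protection, and
  # DataTransferProtection corresponds to dfs.data.transfer.protection.
  # Valid values for each: DISABLED, AUTHENTICATION, INTEGRITY, PRIVACY.
  qop_configuration = {
      "RpcProtection": "PRIVACY",
      "DataTransferProtection": "PRIVACY",
  }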

DataSync supports the following Kerberos encryption types:
  • des-cbc-crc

  • des-cbc-md4

  • des-cbc-md5

  • des3-cbc-sha1

  • arcfour-hmac

  • arcfour-hmac-exp

  • aes128-cts-hmac-sha1-96

  • aes256-cts-hmac-sha1-96

  • aes128-cts-hmac-sha256-128

  • aes256-cts-hmac-sha384-192

  • camellia128-cts-cmac

  • camellia256-cts-cmac

You can also configure HDFS clusters for encryption at rest using Transparent Data Encryption (TDE). When using simple authentication, DataSync reads and writes to TDE-enabled clusters. If you're using DataSync to copy data to a TDE-enabled cluster, first configure the encryption zones on the HDFS cluster. DataSync doesn't create encryption zones.
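
If your TDE-enabled cluster uses a Hadoop Key Management Server (KMS), you can provide its URI when creating the location, either in the console or through the KmsKeyProviderUri parameter of CreateLocationHdfs. A brief sketch; the host name is a placeholder:

  # Placeholder Hadoop KMS endpoint for a TDE-enabled cluster.
  kms_key_provider_uri = "kms://http@kms.example.com:9600/kms"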

Creating your HDFS transfer location

Configure a location that you can use as a source or destination for your DataSync transfer.

Before you begin: Verify network connectivity between your agent and your Hadoop cluster, as described in Testing your agent's connection to your storage.

To create an HDFS location by using the DataSync console
  1. Open the Amazon DataSync console at https://console.amazonaws.cn/datasync/.

  2. In the left navigation pane, expand Data transfer, then choose Locations and Create location.

  3. For Location type, choose Hadoop Distributed File System (HDFS). You can configure this location as a source or destination later.

  4. For Agents, choose one or more agents that you want to use from the list of available agents. The agent connects to your HDFS cluster to securely transfer data between the HDFS cluster and DataSync.

  5. For NameNode, provide the domain name or IP address of the HDFS cluster's primary NameNode.

  6. For Folder, enter a folder on your HDFS cluster that DataSync will use for the data transfer. When the location is used as a source for a task, DataSync copies files in the provided folder. When your location is used as a destination for a task, DataSync writes all files to the provided folder.

  7. To set the Block size or Replication factor, choose Additional settings. The default block size is 128 MiB, and any block size that you provide must be a multiple of 512 bytes. The default replication factor is three DataNodes when transferring data to the HDFS cluster.

  8. In the Security section, choose the Authentication type used on your HDFS cluster.

    • Simple – For User, specify the user name with the following permissions on the HDFS cluster (depending on your use case):

      • If you plan to use this location as a source location, specify a user that has only read permissions.

      • If you plan to use this location as a destination location, specify a user that has read and write permissions.

      Optionally, specify the URI of the Key Management Server (KMS) of the HDFS cluster.

    • Kerberos – Specify the Kerberos Principal with access to your HDFS cluster. Next, provide the KeyTab file that contains the credentials for that principal. Then, provide the Kerberos configuration file. Finally, choose the in-transit encryption protection levels in the RPC protection and Data transfer protection dropdown lists.

  9. (Optional) Choose Add tag to tag your HDFS location.

    Tags are key-value pairs that help you manage, filter, and search for your locations. We recommend creating at least a name tag for your location.

  10. Choose Create location.
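
If you prefer to automate this procedure, the console steps above correspond to the CreateLocationHdfs operation. The following is a minimal sketch in Python (boto3) using simple authentication; the ARNs, host name, folder, and user name are placeholders:

  import boto3

  datasync = boto3.client("datasync", region_name="cn-north-1")

  response = datasync.create_location_hdfs(
      AgentArns=[
          "arn:aws-cn:datasync:cn-north-1:111122223333:agent/agent-0example1234567890"
      ],
      # Domain name or IP address and RPC port of the primary NameNode.
      NameNodes=[{"Hostname": "namenode.example.com", "Port": 8020}],
      Subdirectory="/data/transfer",  # folder used for the transfer
      BlockSize=134217728,            # 128 MiB (the default); a multiple of 512 bytes
      ReplicationFactor=3,            # the default number of DataNodes
      AuthenticationType="SIMPLE",
      SimpleUser="hdfs-user",         # placeholder user name
      Tags=[{"Key": "Name", "Value": "my-hdfs-location"}],
  )

  print(response["LocationArn"])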

Unsupported HDFS features

The following capabilities of HDFS aren't currently supported by DataSync:

  • Transparent Data Encryption (TDE) when using Kerberos authentication

  • Configuring multiple NameNodes

  • Hadoop HDFS over HTTP (HttpFS)

  • POSIX access control lists (ACLs)

  • HDFS extended attributes (xattrs)