Loading data from Amazon EMR
You can use the COPY command to load data in parallel from an Amazon EMR cluster configured to write text files to the cluster's Hadoop Distributed File System (HDFS) as fixed-width files, character-delimited files, CSV files, or JSON-formatted files.
Process for loading data from Amazon EMR
This section walks you through the process of loading data from an Amazon EMR cluster. The following sections provide the details you need to accomplish each step.
-
Step 1: Configure IAM permissions
The users that create the Amazon EMR cluster and run the Amazon Redshift COPY command must have the necessary permissions.
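For example, the user that runs the COPY command needs permission to list the instances in the Amazon EMR cluster. A minimal sketch using the AWS CLI is shown below; the user name `copy-user` and policy name `RedshiftCopyFromEMR` are placeholders, not names from this procedure.

```shell
# Grant the user who will run COPY permission to list the EMR cluster's
# instances. "copy-user" and "RedshiftCopyFromEMR" are placeholder names.
aws iam put-user-policy \
  --user-name copy-user \
  --policy-name RedshiftCopyFromEMR \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": "elasticmapreduce:ListInstances",
      "Resource": "*"
    }]
  }'
```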
-
Step 2: Create an Amazon EMR cluster
Configure the cluster to output text files to the Hadoop Distributed File System (HDFS). You will need the Amazon EMR cluster ID and the cluster's main public DNS (the endpoint for the Amazon EC2 instance that hosts the cluster).
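If you use the AWS CLI, one way to look up the cluster ID and public DNS is sketched below; `j-XXXXXXXXXXXXX` is a placeholder for your cluster ID.

```shell
# List active EMR clusters to find the cluster ID.
aws emr list-clusters --active \
  --query 'Clusters[].{Id:Id,Name:Name}' --output table

# Retrieve the cluster's public DNS name.
aws emr describe-cluster \
  --cluster-id j-XXXXXXXXXXXXX \
  --query 'Cluster.MasterPublicDnsName' --output text
```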
-
Step 3: Retrieve the Amazon Redshift cluster public key and cluster node IP addresses
The public key enables the Amazon Redshift cluster nodes to establish SSH connections to the hosts. You will use the IP address for each cluster node to configure the host security groups to permit access from your Amazon Redshift cluster using these IP addresses.
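One way to retrieve both values with the AWS CLI is sketched below; `my-redshift-cluster` is a placeholder for your cluster identifier.

```shell
# Retrieve the Amazon Redshift cluster public key.
aws redshift describe-clusters \
  --cluster-identifier my-redshift-cluster \
  --query 'Clusters[0].ClusterPublicKey' --output text

# Retrieve the IP addresses of each cluster node.
aws redshift describe-clusters \
  --cluster-identifier my-redshift-cluster \
  --query 'Clusters[0].ClusterNodes[].{Role:NodeRole,PrivateIP:PrivateIPAddress,PublicIP:PublicIPAddress}' \
  --output table
```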
-
Step 4: Add the Amazon Redshift cluster public key to each Amazon EC2 host's authorized keys file
You add the Amazon Redshift cluster public key to the host's authorized keys file so that the host will recognize the Amazon Redshift cluster and accept the SSH connection.
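On each host, this amounts to appending the key to the `authorized_keys` file of the account that accepts the connection. A sketch, assuming the default `hadoop` user on the EMR host and using a truncated placeholder for the key string:

```shell
# On each Amazon EMR host, as the hadoop user, append the Redshift
# cluster public key retrieved in the previous step (placeholder shown).
echo "ssh-rsa AAAA... Amazon-Redshift" >> /home/hadoop/.ssh/authorized_keys

# The authorized_keys file must not be readable by other users.
chmod 600 /home/hadoop/.ssh/authorized_keys
```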
-
Step 5: Configure the hosts to accept all of the Amazon Redshift cluster's IP addresses
Modify the Amazon EMR instance's security groups to add inbound rules that accept the Amazon Redshift IP addresses.
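With the AWS CLI, each inbound rule can be added as sketched below; the security group ID and IP address are placeholders, and you repeat the command for each Amazon Redshift node IP address from the earlier step.

```shell
# Allow SSH (port 22) from one Redshift node's IP address.
# sg-0123456789abcdef0 and 192.0.2.10 are placeholders.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0123456789abcdef0 \
  --protocol tcp --port 22 \
  --cidr 192.0.2.10/32
```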
-
Step 6: Run the COPY command to load the data
From an Amazon Redshift database, run the COPY command to load the data into an Amazon Redshift table.
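A sketch of running the COPY command through `psql` is shown below. The endpoint, database, table name, EMR cluster ID, HDFS path, and IAM role ARN are all placeholders; the `emr://` prefix with a `part-*` wildcard refers to the output files on the cluster's HDFS.

```shell
# Run COPY against the Redshift cluster (all identifiers are placeholders).
psql "host=my-cluster.abc123.us-east-1.redshift.amazonaws.com port=5439 dbname=dev user=awsuser" \
  -c "copy sales
      from 'emr://j-XXXXXXXXXXXXX/myoutput/part-*'
      iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
      delimiter '\t';"
```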