Prerequisites to use database as a source
Note
Firehose supports database as a source in all Amazon Web Services Regions
Before you begin, complete the following prerequisites.
-
Source Database configurations – You need the following source database configurations before you can use the database as a source for your Firehose stream.
-
Create snapshot watermark table with right permissions – For the initial copy (snapshot) of the data in the tables, Firehose uses an incremental copy approach with watermarks to track the progress. This incremental copy approach helps to resume the copy from where it stopped than recapturing the table in case of any interruptions. Firehose uses watermark table in your database to store the required watermarks. Firehose needs one watermark table per Firehose stream. If the table is not already created before Firehose stream creation, then Firehose creates this table as a part of the stream creation. You need to provide right permissions for Firehose to create this table.
-
Create a database user – Firehose requires a database user account with right permissions to make the initial copy of tables, read CDC events from the transaction logs, access watermark table, and create watermark table if it's not already created. You will use this database user name and password as part of Firehose credentials to connect to your database during stream setup.
-
Enable transaction logs – The transaction logs record all database changes such as INSERT, UPDATE, and DELETE in the order it's committed to the database. Firehose reads the transaction logs and replicates the changes to Apache Iceberg Tables. You must enable the transaction logs if it's not enabled.
-
Add an inbound and outbound rule – To allow private connectivity to databases, you must add an inbound rule and outbound rule for HTTPS traffic and an inbound rule for database ( MySQL or PostgreSQL) traffic in the security group of your database VPC. For the source column, use IPv4 CIDR range of your VPC.
To create watermark table, database user, and to enable transaction logs, follow the steps in Set up database connectivity.
-
-
Enable private connectivity to databases – Firehose supports connecting to databases within VPC using Amazon PrivateLink technology. To enable private connectivity to databases, see Access Amazon RDS across VPCs using Amazon PrivateLink and Network Load Balancer
. Here are some points to note for connecting to databases. -
These steps also apply for databases running on EC2.
-
You must increase the timeout of Lambda function used in this example from default 3 seconds to 5 minutes.
-
Before you run the Lambda function to update the primary instance IP address to the Network Load Balancer, you must create a VPC Endpoint with Amazon service name as
com.amazonaws.us-east-1.elasticloadbalancing
within your database VPC, so the Lambda can communicate with Amazon Elastic Load Balancing service. -
You must allow list Firehose service principal
firehose.amazonaws.com
to create Amazon PrivateLink to your VPC. For more information, see Manage permissions. Do not add the ARN of this service role. Only addfirehose.amazonaws.com
to the allow principals. -
You must allow your endpoint service to accept connection requests automatically, by ensuring that you disable the Acceptance Required option through Amazon VPC . This allows Firehose to create the necessary endpoint connection without any manual intervention. For more information on how to disable connection request, see Accept or reject connection requests .
-
-
Store credentials in Amazon Secrets Manager – Firehose uses Amazon Secrets Manager to retrieve credentials that are used to connect to databases. Add the database user credentials that you created in the previous prerequisite, as secrets in the Amazon Secrets Manager. For more information, refer to Authenticate with Amazon Secrets Manager in Amazon Data Firehose.
-
Create an IAM role with required permissions – Firehose needs an IAM role with specific permissions to access Amazon Secrets Manager, Amazon Glue tables and write data to Amazon S3. The same role is used to grant Amazon Glue access to Amazon S3 buckets. You need this IAM role when you create Apache Iceberg Tables and a Firehose. For more information, see Grant Firehose access to replicate database changes to Apache Iceberg Tables.
-
Create Apache Iceberg Tables – Firehose can automatically create Iceberg Tables if you enable the setting during Firehose stream creation. If you don’t want Firehose to create Iceberg Tables, then you must create Iceberg Tables with the same name and schema as the source database tables. For more information on creating Iceberg tables using Glue, refer to Creating Iceberg Tables.
Note
You must create Apache Iceberg Tables with the following mapping.
-
For MySQL source database name maps to Amazon Glue Database name and source table name maps to Amazon Glue table name.
-
For PostgreSQL, source database name maps to Amazon Glue Database and source schema name and table name maps to Amazon Glue Table name in
<SchemaName>_<TableName>
format. If you create a table by yourself, the source and target schemas should exactly match.
-