Set up the Firehose stream - Amazon Data Firehose
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Firehose supports database as a source in all Amazon Web Services Regions except China Regions, Amazon GovCloud (US) Regions, and Asia Pacific (Malaysia). This feature is in preview and is subject to change. Do not use it for your production workloads.

Set up the Firehose stream

To create a Firehose stream with databases as your source, you need to configure the following.

Configure source and destination

To source data from your database, choose the source for your stream. Firehose supports MySQL and PostgreSQL databases as database sources. Next, choose Apache Iceberg Tables as the destination and provide a Firehose stream name.

Configure database connectivity

For Firehose to connect to a database instance, it needs database endpoint, VPC service endpoint, a port, and a valid database user with the right credentials.

  • Database endpoint – Database endpoint of the primary server of your database cluster. For example, the endpoint would either be self-managed xyz.amazonaws.com or RDS databases mydb.123456789012.us-east-1.rds.amazonaws.com.

  • VPC Service Endpoint name – Firehose supports private connectivity to databases. You must provide VPC endpoint service name. For example, the service endpoint can be com.amazonaws.vpce.us-east-1.vpce-svc-XXXXXXXXXXXXXX.

  • Port – For port, you must configure 3306 for MySQL databases and 5432 for PostgreSQL databases.

  • SSL Mode – You can choose to either enable or disable the SSL mode. If enabled, Firehose uses verify_identity for MySQL and verify-full SSL mode for PostgreSQL. The certificate must be signed by a trusted CA. For more information, see Using SSL/TLS to encrypt a connection to a DB instance or cluster. Note that for RDS PostgreSQL and Aurora PostgreSQL, the force_ssl parameter is set to 1, so you must either specify SSL Mode as enabled in Firehose configuration or you change force_ssl parameter to 0 in database parameter group.

  • Authenticate with Amazon Secrets Manager – Select a secret from Amazon Secrets Manager that contains the credentials for connecting to databases. If you do not have an existing secret, create one in Amazon Secrets Manager. For more information, refer to Authenticate with Amazon Secrets Manager in Amazon Data Firehose.

Configure data capture

If you want Firehose to capture changes from specific databases, tables, and columns, then you can configure them as a part of Firehose stream creation. You can specify required databases, tables, and columns by either providing regular expressions to include or exclude them or by explicitly providing comma separated specific database, table, and column names.

Note

Since in PostgreSQL, the schema within each database contains database objects such as tables and views, the fully qualified name or the regular expression must take schemas into consideration.

For MySQL, the fully qualified name is <Sampledatabase>.<SampleTable> and for PostgreSQL the fully qualified name is <SampleSchema>.<SampleTable>.

Here are some examples of each type.

Databases

Example of sample regular expression (for including databases): .* Example of explicit naming of tables: <SampleDatabase>

Tables

Example of sample regular expression for excluding tables: <SampleDatabase>.* Example of explicit naming of tables for MySQL : <SampleDatabase>.<SampleTable1> Example of explicit naming of tables for PostgreSQL : <SampleSchema>.<SampleTable>

Columns

Example of sample regular expression (for excluding columns): <SampleDatabase>.*.* Example of explicit naming of columns for MySQL : <SampleDatabase>.<SampleTable>.<SampleColumn> Example of explicit naming of columns for PostgreSQL : <SampleSchema>.<SampleTable>.<SampleColumn>

Configure surrogate keys

Firehose requires unique keys for configured tables to take the initial copy of data. If you have tables without a primary key in your databases, then you must provide surrogate key for such tables. If all your tables have primary keys, then you need not configure this section. If Firehose finds missing surrogate keys for tables without primary keys, then its snapshot (initial copy) process fails. In such scenarios, Firehose throws an error to CloudWatch Logs. For surrogate keys, you must explicitly configure keys with fully qualified name as shown in the following example.

For MySQL

SampleDatabase.SampleTable:SampleColumn

For PostgreSQL

SampleSchema.SampleTable:SampleColumn

Provide snapshot watermark table

Firehose uses watermark mechanism during incremental snapshot of tables. You must provide this snapshot watermark table that you created as part of the prerequisite. Input the snapshot watermark table in the format as show in the following example.

For MySQL: DatabaseName.TableName For PostgreSQL: SchemaName.TableName
Note

Don't delete the watermark table nor manually insert or delete records from the watermark table. Also, you mustn't revoke the permission for database user created for Firehose to insert or delete records from the watermark table.

Next Step: Configure destination settings