Firehose supports database as a source in all Amazon Web Services Regions.
Configure destination settings
Firehose supports delivery of database changes to Apache Iceberg Tables. Configure the following destination settings to set up the Firehose stream with database as your source.
Connect data catalog
Apache Iceberg requires a data catalog to write to Apache Iceberg Tables. Firehose integrates with the Amazon Glue Data Catalog for Apache Iceberg Tables. You can use the Amazon Glue Data Catalog in the same account as your Firehose stream or in a different account, and in the same Region as your Firehose stream (default) or in a different Region.
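In the CreateDeliveryStream API, this catalog connection is expressed as an Amazon Glue Data Catalog ARN. The following is a minimal sketch, assuming the CatalogConfiguration field of the Iceberg destination configuration; the Region and account ID are placeholder values.

```python
# Sketch of the Iceberg destination's catalog settings. Field names
# assume the Firehose CreateDeliveryStream API; verify against the
# current API reference. Account ID and Region are placeholders.
def catalog_configuration(region: str, account_id: str) -> dict:
    """Build a Glue Data Catalog reference for the Iceberg destination."""
    return {
        "CatalogConfiguration": {
            # The Glue Data Catalog can live in a different account or
            # Region than the Firehose stream itself.
            "CatalogARN": f"arn:aws:glue:{region}:{account_id}:catalog"
        }
    }

config = catalog_configuration("us-east-1", "111122223333")
```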
Enable automatic creation of tables
If you enable this option, Firehose automatically creates required databases, tables, and columns in your target destination with the same name and schema as the source databases. If you enable this option and if Firehose finds some tables with the same name and schema already present, then it will use those existing tables instead and create only missing databases, tables, and columns.
If you do not enable this option, Firehose tries to find the required databases, tables, and columns. If Firehose can't find them, it throws an error and delivers data to the S3 error bucket.
Note
For Firehose to successfully deliver data to Iceberg Tables, the database, table, and column names along with the schema should completely match. If the names of database objects and schemas do not match, then Firehose throws an error and delivers data to an S3 error bucket.
For MySQL databases, the source database maps to an Amazon Glue database and the source table maps to an Amazon Glue table.
For PostgreSQL databases, the source database maps to an Amazon Glue database and the source table maps to an Amazon Glue table named SchemaName_TableName.
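The naming rules above can be captured in a small helper. This is an illustrative sketch only; the function name is hypothetical and not part of any Firehose API.

```python
def glue_table_name(engine: str, schema: str, table: str) -> str:
    """Map a source table to the Amazon Glue table name Firehose expects.

    MySQL source tables map directly to Glue table names; PostgreSQL
    tables are prefixed with their schema as SchemaName_TableName.
    """
    if engine == "postgresql":
        return f"{schema}_{table}"  # SchemaName_TableName
    return table  # MySQL: source table name maps directly

# For example, the PostgreSQL table "orders" in schema "public"
# maps to the Glue table "public_orders".
```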
Enable schema evolution
If you enable this option, Firehose automatically evolves the schema of your Apache Iceberg Tables when the source schema changes. As part of schema evolution, Firehose currently supports only adding new columns. For example, if a new column is added to a table on the source database side, Firehose automatically picks up that change and adds the new column to the corresponding Apache Iceberg Table.
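Both automation options described above (automatic table creation and schema evolution) amount to enabled flags on the destination configuration. A minimal sketch, assuming the TableCreationConfiguration and SchemaEvolutionConfiguration fields of the Firehose CreateDeliveryStream API:

```python
# Minimal sketch of the two automation toggles. Field names assume the
# Firehose CreateDeliveryStream API; verify against the current API
# reference before use.
iceberg_automation = {
    # Let Firehose create missing databases, tables, and columns.
    "TableCreationConfiguration": {"Enabled": True},
    # Let Firehose add new columns when the source schema gains them.
    "SchemaEvolutionConfiguration": {"Enabled": True},
}
```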
Specify retry duration
You can use this configuration to specify the duration in seconds for which Firehose retries if it encounters failures writing to Apache Iceberg Tables in Amazon S3. You can set any value from 0 to 7200 seconds. By default, Firehose retries for 300 seconds.
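The retry setting and its bounds can be sketched as follows, assuming the RetryOptions field of the Iceberg destination configuration:

```python
# Sketch of the retry setting. Field names assume the Firehose
# CreateDeliveryStream API. Valid range is 0-7200 seconds; the
# default is 300 seconds.
def retry_options(duration_seconds: int = 300) -> dict:
    if not 0 <= duration_seconds <= 7200:
        raise ValueError("retry duration must be between 0 and 7200 seconds")
    return {"RetryOptions": {"DurationInSeconds": duration_seconds}}
```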
Handle failed delivery or processing
You must configure Firehose to deliver records to an S3 backup bucket in case it fails to process or deliver a stream after the retry duration expires. To do this, configure the S3 backup bucket and the S3 backup bucket error output prefix.
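A hedged sketch of these failed-delivery settings, assuming an S3 destination configuration shape like the one the Firehose API uses; the bucket name, prefix, and role ARN below are placeholders:

```python
# Sketch of the S3 backup settings for failed records. Field names
# assume the Firehose API's S3 destination configuration; bucket,
# prefix, and role ARN are placeholder values.
s3_backup = {
    "S3Configuration": {
        "BucketARN": "arn:aws:s3:::amzn-s3-demo-bucket",
        # Records that fail processing or delivery land under this prefix.
        "ErrorOutputPrefix": "iceberg-errors/",
        "RoleARN": "arn:aws:iam::111122223333:role/FirehoseIcebergRole",
    }
}
```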
Configure buffer hints
Firehose buffers incoming streaming data in memory to a certain size (Buffering size) and for a certain period of time (Buffering interval) before delivering it to Apache Iceberg Tables. You can choose a buffer size of 1–128 MiB and a buffer interval of 0–900 seconds. Higher buffer hint values result in fewer S3 writes, lower compaction cost due to larger data files, and faster query execution, but with higher latency. Lower buffer hint values deliver the data with lower latency.
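The buffering trade-off and its valid ranges can be sketched as follows, assuming the BufferingHints field of the Firehose CreateDeliveryStream API:

```python
# Sketch of the buffering trade-off. Larger values mean fewer, bigger
# S3 writes (cheaper compaction, faster queries) at the cost of
# delivery latency. Field names assume the Firehose API.
def buffering_hints(size_mib: int, interval_seconds: int) -> dict:
    if not 1 <= size_mib <= 128:
        raise ValueError("buffer size must be 1-128 MiB")
    if not 0 <= interval_seconds <= 900:
        raise ValueError("buffer interval must be 0-900 seconds")
    return {"BufferingHints": {"SizeInMBs": size_mib,
                               "IntervalInSeconds": interval_seconds}}

# Low-latency choice: small buffer, short interval.
low_latency = buffering_hints(1, 60)
```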
Configure advanced settings
For advanced settings, you can configure server-side encryption, error logging, permissions, and tags for your Apache Iceberg Tables. For more information, see Configure advanced settings. To use Apache Iceberg Tables as a destination, you must add the IAM role that you created as part of Grant Firehose access to replicate database changes to Apache Iceberg Tables. Firehose assumes this role to access Amazon Glue tables and write to Amazon S3 buckets.
We highly recommend that you enable CloudWatch Logs. If there is any issue with Firehose connecting to databases or taking a snapshot of the tables, Firehose throws an error and logs it to the configured log group. This is the only mechanism that informs you about these errors.
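Enabling error logging is another destination setting; a minimal sketch, assuming the CloudWatchLoggingOptions field of the Firehose API, with placeholder log group and stream names:

```python
# Sketch of enabling CloudWatch error logging. Field names assume the
# Firehose API's CloudWatchLoggingOptions; the log group and stream
# names below are placeholders.
cloudwatch_logging = {
    "CloudWatchLoggingOptions": {
        "Enabled": True,
        "LogGroupName": "/aws/kinesisfirehose/my-db-stream",  # placeholder
        "LogStreamName": "DestinationDelivery",  # placeholder
    }
}
```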
Firehose stream creation can take several minutes to complete. After you successfully create the Firehose stream, you can start ingesting data into it and can view the data in Apache Iceberg tables.
Note
Configure only one Firehose stream for one database. Having multiple Firehose streams for one database creates multiple connectors to the database, which impacts the database performance.
Once a Firehose stream is created, the initial snapshot status of existing tables is IN_PROGRESS. Do not change the schema of a source table while its snapshot status is IN_PROGRESS. If you change the schema of the table while the snapshot is in progress, Firehose skips the snapshot of that table. When the snapshot process is complete, its status changes to COMPLETE.