Considerations and limitations - Amazon Data Firehose

Firehose supports databases as a source in all Amazon Web Services Regions except China Regions, Amazon GovCloud (US) Regions, and Asia Pacific (Malaysia). This feature is in preview and is subject to change. Do not use it for your production workloads.

Considerations and limitations

Firehose support for Apache Iceberg tables has the following considerations and limitations.

  • Throughput – If you use Direct PUT as the source to deliver data to Apache Iceberg tables, the maximum throughput per stream is 5 MiB/second in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions, and 1 MiB/second in the Asia Pacific (Tokyo), Canada (Central), and Asia Pacific (Sydney) Regions. If you only insert data into Iceberg tables, with no updates or deletes, and you need higher throughput for your stream, you can use the Firehose Limits form to request a throughput limit increase.

  • Columns – For column names and values, Firehose takes only the first level of nodes in a multi-level nested JSON. In the following example, Firehose picks the nodes that are available at the first level, including the position field. The column names and data types of the source data must match those of the target table for Firehose to deliver the data successfully. In this case, Firehose expects a struct or map data type column in your Iceberg table to match the position field. Firehose supports 16 levels of nesting. The following is an example of a nested JSON.

    {
      "version": "2016-04-01",
      "deviceId": "<solution_unique_device_id>",
      "sensorId": "<device_sensor_id>",
      "timestamp": "2024-01-11T20:42:45.000Z",
      "value": "<actual_value>",
      "position": {
        "x": 143.595901,
        "y": 476.399628,
        "z": 0.24234876
      }
    }

    If the column names or data types do not match, Firehose throws an error and delivers the data to the S3 error bucket. If all the column names and data types match those in the Apache Iceberg table, but the source record contains an additional field, Firehose skips that field.

  • One JSON object per record – You can send only one JSON object in each Firehose record. If you aggregate multiple JSON objects and send them inside a single record, Firehose throws an error and delivers the data to the S3 error bucket. If you aggregate records with KPL and ingest data into Firehose with Amazon Kinesis Data Streams as the source, Firehose automatically de-aggregates the records and uses one JSON object per record.

  • Compaction and storage optimization – Every time you write using Firehose, it commits and generates snapshots, small data files, and delete files. Having thousands of small data files increases metadata overhead and degrades read performance. For optimal query performance, consider a solution that periodically takes small data files and rewrites them into fewer, larger data files. This process is called compaction. Amazon Glue Data Catalog supports automatic compaction of your Apache Iceberg tables. For more information, see Compaction management in the Amazon Glue User Guide. For additional information, see Automatic compaction of Apache Iceberg Tables.

    Besides compacting data files, you can also reduce the storage consumption of Iceberg tables with the VACUUM statement, which performs table maintenance on Apache Iceberg tables. Alternatively, you can use Amazon Glue Data Catalog, which also supports managed table optimization of Apache Iceberg tables by automatically removing data files, orphaned files, and expired snapshots that are no longer needed. For more information, see this blog post on Storage optimization of Apache Iceberg Tables.

  • Amazon MSK Serverless – Firehose does not support Amazon MSK Serverless as a source when the destination is Apache Iceberg tables.
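
The Columns and One JSON object per record items above describe conditions under which Firehose sends a record to the S3 error bucket. As a minimal sketch, a producer could check for these conditions before sending. This is plain Python with no Firehose API calls. The `EXPECTED_COLUMNS` mapping and both function names are hypothetical, and the type check compares decoded JSON values with Python types, which only approximates Iceberg column types.

```python
import json

# Hypothetical description of the target table's columns, expressed as the
# Python types produced by JSON decoding. Not a Firehose or Iceberg API.
EXPECTED_COLUMNS = {
    "version": str,
    "deviceId": str,
    "sensorId": str,
    "timestamp": str,
    "value": str,
    "position": dict,  # stands in for a struct or map column
}


def parse_single_object(payload):
    """Return the single JSON object in payload, or raise ValueError."""
    text = payload.strip()
    try:
        # raw_decode parses one JSON value and reports where it ended.
        obj, end = json.JSONDecoder().raw_decode(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"payload is not valid JSON: {exc}") from exc
    if not isinstance(obj, dict):
        raise ValueError("payload must be a JSON object")
    if text[end:].strip():
        # Anything after the first object means the record was aggregated.
        raise ValueError("payload contains more than one JSON value")
    return obj


def check_columns(record, expected):
    """Compare the first-level fields of record with the expected columns.

    Returns (problems, extra_fields). problems lists fields whose decoded
    type differs from the expected one. extra_fields lists fields with no
    matching column, which the text above says Firehose skips.
    """
    problems = []
    for name, value in record.items():
        if name in expected and not isinstance(value, expected[name]):
            problems.append(
                f"{name}: expected {expected[name].__name__}, "
                f"got {type(value).__name__}"
            )
    extra_fields = sorted(set(record) - set(expected))
    return problems, extra_fields
```

A producer might call `parse_single_object` on each outgoing payload, pass the result to `check_columns`, and send the record only when `problems` is empty. Only first-level fields are inspected, mirroring the Columns item. Nested values such as `position` are checked solely for being an object.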