
Deliver data to Apache Iceberg Tables with Amazon Data Firehose

Note

Delivering Firehose streams to Apache Iceberg Tables in Amazon S3 is in preview and is subject to change. Do not use this feature for production workloads.

Apache Iceberg is a high-performance open-source table format for performing big data analytics. Apache Iceberg brings the reliability and simplicity of SQL tables to Amazon S3 data lakes, and makes it possible for open-source analytics engines like Spark, Flink, Trino, Hive, and Impala to concurrently work with the same data. For more information about Apache Iceberg, see https://iceberg.apache.org/.

You can use Firehose to directly deliver streaming data to Apache Iceberg Tables in Amazon S3. With this feature, you can route records from a single stream into different Apache Iceberg Tables, and automatically apply insert, update, and delete operations to records in the Apache Iceberg Tables. This feature requires using the Amazon Glue Data Catalog.
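
For instance, once a Firehose stream has been configured with an Apache Iceberg Tables destination, an application can send JSON records to it. The following is a minimal sketch using the AWS SDK for Python (Boto3); the stream name, Region, and field values are hypothetical, and the routing, update, and delete configuration of the stream itself is not shown.

    import json
    import boto3

    # Hypothetical stream name; the stream is assumed to already be
    # configured with an Apache Iceberg Tables destination.
    STREAM_NAME = "my-iceberg-firehose-stream"

    firehose = boto3.client("firehose", region_name="us-east-1")

    record = {
        "deviceId": "device-0001",
        "sensorId": "sensor-17",
        "timestamp": "2024-01-11T20:42:45.000000Z",
        "value": 42.7,
    }

    # Send a single JSON object as one Firehose record.
    response = firehose.put_record(
        DeliveryStreamName=STREAM_NAME,
        Record={"Data": json.dumps(record).encode("utf-8")},
    )
    print("RecordId:", response["RecordId"])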

Supported Regions and data types

For the preview, delivery to Apache Iceberg Tables is available in the US East (N. Virginia), US West (Oregon), Europe (Ireland), Asia Pacific (Tokyo), Canada (Central), and Asia Pacific (Sydney) Amazon Web Services Regions.

Firehose supports all the primitive and complex data types that Apache Iceberg supports. For more information, see Schemas and Data Types. When you send binary data as a string, you must use one of the Firehose-supported encoding types: basic Base64, MIME Base64, URL- and filename-safe Base64, and hex. For Timestamp data types, you must always send values in microseconds.
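
As an illustration of these rules, the following Python snippet builds a record that carries a binary value as a basic Base64 string and a timestamp with microsecond precision. The field names are hypothetical, and you should confirm the exact timestamp representation your table columns expect.

    import base64
    import json
    from datetime import datetime, timezone

    binary_blob = b"\x00\x01\x02\x03"  # example binary sensor payload

    record = {
        "sensorId": "sensor-17",
        # Binary data sent as a string must use a supported encoding;
        # basic Base64 is shown here.
        "payload": base64.b64encode(binary_blob).decode("ascii"),
        # Timestamp values must carry microsecond precision; an ISO-8601
        # string is shown as one possible representation.
        "eventTime": datetime.now(timezone.utc).isoformat(timespec="microseconds"),
    }

    print(json.dumps(record))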

Considerations and limitations

Firehose support for Apache Iceberg Tables during the preview has the following considerations and limitations.

  • Throughput – If you use Direct PUT as the source to deliver data to Apache Iceberg Tables, the maximum throughput per stream is 5 MiB/second in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) Regions, and 1 MiB/second in the Asia Pacific (Tokyo), Canada (Central), and Asia Pacific (Sydney) Regions. If you only want to insert data into Iceberg tables, with no updates or deletes, and you want to test your stream with higher throughput, you can use the Firehose Limits form to request a throughput limit increase.

  • Columns – For column names and values, Firehose takes only the first level of nodes in a multi-level nested JSON. For example, in the sample record below, Firehose picks the nodes available in the first level, including the position field. For Firehose to deliver successfully, the column names and data types of the source data must match those of the target tables. In this case, Firehose expects your Iceberg tables to have a column of struct or map data type to match the position field. Firehose supports 16 levels of nesting. Following is an example of a nested JSON record.

    { "version":"2016-04-01", "deviceId":"<solution_unique_device_id>", "sensorId":"<device_sensor_id>", "timestamp":"2024-01-11T20:42:45.000Z", "value":"<actual_value>", "position":{ "x":143.595901, "y":476.399628, "z":0.24234876 } }

    If the column names or data types do not match, Firehose throws an error and delivers the data to the S3 error bucket. If all the column names and data types match in the Apache Iceberg tables, but the source record contains an additional new field, Firehose skips the new field. For an illustrative client-side check of source fields against the target table's columns, see the schema-check sketch after this list.

  • One JSON object per record – You can send only one JSON object in a single Firehose record. If you aggregate and send multiple JSON objects inside one record, Firehose throws an error and delivers the data to the S3 error bucket. For an example that keeps one object per record, see the batching sketch after this list.

  • Streaming sources – Firehose currently doesn’t support Amazon Managed Streaming for Apache Kafka as a source for Apache Iceberg Tables.

  • Apache Iceberg isolation levels – Firehose performs idempotent writes to Apache Iceberg Tables. Currently, serializable and snapshot isolation levels are not supported.

  • Compaction – Every write from Firehose generates data files. Having thousands of small data files increases metadata overhead and affects read performance. To get optimal query performance, consider a solution that periodically takes small data files and rewrites them into fewer, larger data files. This process is called compaction. Amazon Glue Data Catalog supports automatic compaction of your Apache Iceberg Tables. For more information, see Compaction management in the Amazon Glue User Guide. For additional information, see Automatic compaction of Apache Iceberg Tables. You can also reduce the storage consumption of Iceberg tables with the VACUUM statement, which performs table maintenance on Apache Iceberg tables.
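
The following schema-check sketch illustrates the Columns consideration above. It compares the first-level field names of a source record with the column names of the target table registered in the Amazon Glue Data Catalog, using Boto3. The database and table names are hypothetical, and this is only a client-side sanity check you might run before sending data, not something Firehose performs on your behalf.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    def table_column_names(database, table):
        """Return the column names of the target Iceberg table from the Glue Data Catalog."""
        response = glue.get_table(DatabaseName=database, Name=table)
        columns = response["Table"]["StorageDescriptor"]["Columns"]
        return {column["Name"] for column in columns}

    source_record = {
        "deviceId": "device-0001",
        "timestamp": "2024-01-11T20:42:45.000000Z",
        "position": {"x": 143.595901, "y": 476.399628, "z": 0.24234876},
    }

    # Firehose matches only first-level field names against table columns.
    missing = set(source_record) - table_column_names("iot_db", "sensor_readings")
    if missing:
        print("Fields with no matching column in the target table:", missing)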
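
The batching sketch below relates to the one-JSON-object-per-record consideration. It sends several events with PutRecordBatch while keeping exactly one JSON object in each record, rather than concatenating objects into a single record. The stream name is again hypothetical.

    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    events = [
        {"deviceId": "device-0001", "value": 20.5},
        {"deviceId": "device-0002", "value": 21.0},
    ]

    # One JSON object per Firehose record; never concatenate several
    # objects into the same record.
    response = firehose.put_record_batch(
        DeliveryStreamName="my-iceberg-firehose-stream",
        Records=[{"Data": json.dumps(event).encode("utf-8")} for event in events],
    )
    print("Failed records:", response["FailedPutCount"])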