
Delivering Amazon Data Firehose streams to Apache Iceberg Tables in Amazon S3 is in preview and is subject to change.

Step 2: Configure source settings

Based on the source you select in step 1, you can configure the source to send information to a Firehose stream from the console.

When you choose Amazon MSK to send information to a Firehose stream, you can choose between MSK provisioned and MSK Serverless clusters. Firehose then reads data from the specified Amazon MSK cluster and topic and loads it into the specified S3 destination.

In the Source settings section of the page, provide values for the following fields.

Amazon MSK cluster connectivity

Choose either the Private bootstrap brokers (recommended) or Public bootstrap brokers option based on your cluster configuration. Bootstrap brokers are what an Apache Kafka client uses as a starting point to connect to the cluster. Public bootstrap brokers are intended for public access from outside of Amazon, while private bootstrap brokers are intended for access from within Amazon. For more information about Amazon MSK, see Amazon Managed Streaming for Apache Kafka.
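
For example, you can retrieve a cluster's bootstrap broker strings with the Amazon MSK GetBootstrapBrokers API. The following is a minimal boto3 sketch, assuming a placeholder cluster ARN; which fields appear in the response depends on the cluster's access-control and public-access configuration.

```python
import boto3

# Placeholder cluster ARN for illustration.
CLUSTER_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/demo-cluster/abcd1234-5678"

kafka = boto3.client("kafka")

# GetBootstrapBrokers returns the broker connection strings a Kafka
# client uses as its starting point to reach the cluster.
response = kafka.get_bootstrap_brokers(ClusterArn=CLUSTER_ARN)

# Private (in-VPC) brokers for the IAM access control method, if configured.
print(response.get("BootstrapBrokerStringSaslIam"))
# Public brokers for the IAM access control method, if public access is enabled.
print(response.get("BootstrapBrokerStringPublicSaslIam"))
```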

To connect to a provisioned or serverless Amazon MSK cluster through private bootstrap brokers, the cluster must meet all of the following requirements.

  • The cluster must be active.

  • The cluster must have IAM as one of its access control methods.

  • Multi-VPC private connectivity must be enabled for the IAM access control method.

  • You must add a resource-based policy to the cluster that grants the Firehose service principal permission to invoke the Amazon MSK CreateVpcConnection API operation, as shown in the sketch after this list.
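
The following is a minimal sketch of that last requirement, using boto3 and the Amazon MSK PutClusterPolicy API. The cluster ARN is a placeholder, and the policy is reduced to the single CreateVpcConnection statement; verify the exact statements your cluster needs against the Firehose documentation for your Region and partition.

```python
import json
import boto3

# Placeholder cluster ARN for illustration.
CLUSTER_ARN = "arn:aws:kafka:us-east-1:111122223333:cluster/demo-cluster/abcd1234-5678"

# Resource-based policy granting the Firehose service principal
# permission to create a VPC connection to this cluster.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "firehose.amazonaws.com"},
            "Action": "kafka:CreateVpcConnection",
            "Resource": CLUSTER_ARN,
        }
    ],
}

kafka = boto3.client("kafka")
kafka.put_cluster_policy(ClusterArn=CLUSTER_ARN, Policy=json.dumps(policy))
```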

To connect to a provisioned Amazon MSK cluster through public bootstrap brokers, the cluster must meet all of the following requirements.

  • The cluster must be active.

  • The cluster must have IAM as one of its access control methods.

  • The cluster must be publicly accessible.

MSK cluster account

You can choose the account where the Amazon MSK cluster resides. This can be one of the following.

  • Current account – Allows you to ingest data from an MSK cluster in the current Amazon account. For this, you must specify the ARN of the Amazon MSK cluster from which your Firehose stream will read data.

  • Cross-account – Allows you to ingest data from an MSK cluster in another Amazon account. For more information, see Cross-account delivery from Amazon MSK.

Topic

Specify the Apache Kafka topic from which you want your Firehose stream to ingest data. You cannot update this topic after the Firehose stream is created.
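
The same source settings can also be supplied programmatically through the Firehose CreateDeliveryStream API. The following is a minimal sketch with placeholder ARNs, assuming an existing IAM role that Firehose can assume to read from the cluster; the S3 destination configuration is reduced to its required fields.

```python
import boto3

firehose = boto3.client("firehose")

# All ARNs below are placeholders for illustration.
firehose.create_delivery_stream(
    DeliveryStreamName="msk-to-s3",
    DeliveryStreamType="MSKAsSource",
    MSKSourceConfiguration={
        "MSKClusterARN": "arn:aws:kafka:us-east-1:111122223333:cluster/demo-cluster/abcd1234-5678",
        "TopicName": "orders",  # cannot be changed after creation
        "AuthenticationConfiguration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-msk-source-role",
            "Connectivity": "PRIVATE",  # or "PUBLIC" for public bootstrap brokers
        },
    },
    S3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-delivery-role",
        "BucketARN": "arn:aws:s3:::demo-destination-bucket",
    },
)
```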

Configure the source settings for Amazon Kinesis Data Streams to send information to a Firehose stream as follows.

Important

If you use the Kinesis Producer Library (KPL) to write data to a Kinesis data stream, you can use aggregation to combine the records that you write to that Kinesis data stream. If you then use that data stream as a source for your Firehose stream, Amazon Data Firehose de-aggregates the records before it delivers them to the destination. If you configure your Firehose stream to transform the data, Amazon Data Firehose de-aggregates the records before it delivers them to Amazon Lambda. For more information, see Developing Amazon Kinesis Data Streams Producers Using the Kinesis Producer Library and Aggregation.

Under the Source settings, choose an existing stream in the Kinesis data stream list, or enter a data stream ARN in the format arn:aws:kinesis:[Region]:[AccountId]:stream/[StreamName].

If you do not have an existing data stream, choose Create to create a new one from the Amazon Kinesis console. After you create the new stream, choose the refresh icon to update the Kinesis stream list. If you have a large number of streams, filter the list using Filter by name.
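
Equivalently, over the API, a Kinesis data stream source is specified with KinesisStreamSourceConfiguration. The following is a minimal boto3 sketch with placeholder ARNs and a reduced S3 destination configuration.

```python
import boto3

firehose = boto3.client("firehose")

# All ARNs below are placeholders for illustration.
firehose.create_delivery_stream(
    DeliveryStreamName="kds-to-s3",
    DeliveryStreamType="KinesisStreamAsSource",
    KinesisStreamSourceConfiguration={
        "KinesisStreamARN": "arn:aws:kinesis:us-east-1:111122223333:stream/demo-stream",
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-kds-source-role",
    },
    S3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-s3-delivery-role",
        "BucketARN": "arn:aws:s3:::demo-destination-bucket",
    },
)
```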

Note

When you configure a Kinesis data stream as the source of a Firehose stream, the Amazon Data Firehose PutRecord and PutRecordBatch operations are disabled. To add data to your Firehose stream in this case, use the Kinesis Data Streams PutRecord and PutRecords operations.
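
In other words, with a Kinesis data stream as the source, you write records to the stream itself rather than to Firehose. The following is a minimal sketch, assuming a hypothetical stream name.

```python
import boto3

kinesis = boto3.client("kinesis")

# Write to the Kinesis data stream; Firehose picks the record up from
# there. Calling firehose.put_record on this stream would be rejected.
kinesis.put_record(
    StreamName="demo-stream",  # hypothetical
    Data=b'{"event": "example"}',
    PartitionKey="example-key",
)
```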

Amazon Data Firehose starts reading data from the LATEST position of your Kinesis stream. For more information about Kinesis Data Streams positions, see GetShardIterator.
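
LATEST means reading begins just after the most recent record in each shard, so only data written after reading starts is delivered. The equivalent request for a custom consumer looks like the following, with hypothetical stream and shard identifiers.

```python
import boto3

kinesis = boto3.client("kinesis")

# LATEST: start just after the most recent record, so only data written
# after this call is returned by subsequent GetRecords calls.
iterator = kinesis.get_shard_iterator(
    StreamName="demo-stream",        # hypothetical
    ShardId="shardId-000000000000",  # hypothetical
    ShardIteratorType="LATEST",
)["ShardIterator"]
```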

Amazon Data Firehose calls the Kinesis Data Streams GetRecords operation once per second for each shard. However, when full backup is enabled, Firehose calls the GetRecords operation twice per second for each shard: once for the primary delivery destination and once for the full backup.

More than one Firehose stream can read from the same Kinesis stream. Other Kinesis applications (consumers) can also read from the same stream. Each call from any Firehose stream or other consumer application counts against the overall throttling limit for the shard. To avoid getting throttled, plan your applications carefully. For more information about Kinesis Data Streams limits, see Amazon Kinesis Streams Limits.
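
As a rough capacity check, each shard supports a fixed number of GetRecords calls per second that all shared-throughput consumers divide among themselves (5 calls per second per shard at the time of writing; verify current limits for your Region). The following is a small sketch of the arithmetic, assuming each Firehose stream makes the one or two calls per second described above.

```python
# Per-shard GetRecords budget for shared-throughput consumers
# (5 calls/second per shard; verify current Kinesis Data Streams limits).
SHARD_CALLS_PER_SECOND = 5

def remaining_getrecords_budget(firehose_streams: int, full_backup: bool) -> int:
    """Calls/second per shard left for other consumers, assuming each
    Firehose stream makes 1 call/second, or 2 with full backup enabled."""
    per_stream = 2 if full_backup else 1
    return SHARD_CALLS_PER_SECOND - firehose_streams * per_stream

# One Firehose stream with full backup leaves 3 calls/second per shard.
print(remaining_getrecords_budget(firehose_streams=1, full_backup=True))
```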

Proceed to the next step to configure record transformation and format conversion.