
Amazon Kinesis Data Firehose Data Delivery

After data is sent to your delivery stream, it is automatically delivered to the destination you choose.

Important

If you use the Kinesis Producer Library (KPL) to write data to a Kinesis data stream, you can use aggregation to combine the records that you write to that Kinesis data stream. If you then use that data stream as a source for your Kinesis Data Firehose delivery stream, Kinesis Data Firehose de-aggregates the records before it delivers them to the destination. If you configure your delivery stream to transform the data, Kinesis Data Firehose de-aggregates the records before it delivers them to Amazon Lambda. For more information, see Developing Amazon Kinesis Data Streams Producers Using the Kinesis Producer Library and Aggregation in the Amazon Kinesis Data Streams Developer Guide.

Data Delivery Format

For data delivery to Amazon Simple Storage Service (Amazon S3), Kinesis Data Firehose concatenates multiple incoming records based on the buffering configuration of your delivery stream. It then delivers the records to Amazon S3 as a single Amazon S3 object. You might want to add a record separator at the end of each record before you send it to Kinesis Data Firehose. You can then split a delivered Amazon S3 object back into individual records.
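
For example, the following minimal sketch (using the boto3 Firehose client; the stream name is a placeholder) appends a newline separator to each record before sending it, so a delivered S3 object can later be split back into individual records:

import boto3

firehose = boto3.client("firehose")

# Append a newline so that records concatenated into one S3 object
# can be split apart again after delivery.
record = '{"ticker": "AMZN", "price": 120.5}'
firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # placeholder name
    Record={"Data": (record + "\n").encode("utf-8")},
)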

For data delivery to Amazon Redshift, Kinesis Data Firehose first delivers incoming data to your S3 bucket in the format described earlier. Kinesis Data Firehose then issues an Amazon Redshift COPY command to load the data from your S3 bucket to your Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup. Ensure that after Kinesis Data Firehose concatenates multiple incoming records to an Amazon S3 object, the Amazon S3 object can be copied to your Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup. For more information, see Amazon Redshift COPY Command Data Format Parameters.
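
As a hedged sketch of how these pieces fit together, the following uses the CreateDeliveryStream API to configure an Amazon Redshift destination whose COPY options match newline-delimited JSON records; all names, ARNs, endpoints, and credentials are placeholders:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-redshift-stream",  # placeholder
    RedshiftDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "ClusterJDBCURL": "jdbc:redshift://example.us-west-2.redshift.amazonaws.com:5439/mydb",
        "CopyCommand": {
            "DataTableName": "my_table",
            # COPY options must match how records were concatenated into
            # the S3 objects (here, newline-delimited JSON).
            "CopyOptions": "JSON 'auto'",
        },
        "Username": "firehose_user",      # placeholder
        "Password": "placeholder-password",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            "BucketARN": "arn:aws:s3:::my-firehose-bucket",  # staging bucket
        },
    },
)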

For data delivery to OpenSearch Service and OpenSearch Serverless, Kinesis Data Firehose buffers incoming records based on the buffering configuration of your delivery stream. It then generates an OpenSearch Service or OpenSearch Serverless bulk request to index multiple records to your OpenSearch Service cluster or OpenSearch Serverless collection. Make sure that your record is UTF-8 encoded and flattened to a single-line JSON object before you send it to Kinesis Data Firehose. Also, the rest.action.multi.allow_explicit_index option for your OpenSearch Service cluster must be set to true (default) to take bulk requests with an explicit index that is set per record. For more information, see OpenSearch Service Configure Advanced Options in the Amazon OpenSearch Service Developer Guide.
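
For instance, json.dumps in Python emits a single-line JSON object by default, which satisfies the flattening requirement; the sketch below (stream name is a placeholder) encodes the result as UTF-8 before sending:

import json

import boto3

firehose = boto3.client("firehose")

# json.dumps produces single-line JSON; encode as UTF-8 before sending.
document = {"user": "alice", "action": "login", "ts": "2016-02-25T13:00:00Z"}
firehose.put_record(
    DeliveryStreamName="my-opensearch-stream",  # placeholder name
    Record={"Data": json.dumps(document).encode("utf-8")},
)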

For data delivery to Splunk, Kinesis Data Firehose concatenates the bytes that you send. If you want delimiters in your data, such as a new line character, you must insert them yourself. Make sure that Splunk is configured to parse any such delimiters.

When delivering data to an HTTP endpoint owned by a supported third-party service provider, you can use the integrated Amazon Lambda service to create a function that transforms the incoming records into the format that the service provider's integration expects. Contact the third-party service provider whose HTTP endpoint you've chosen as your destination to learn more about their accepted record format.
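
The following is a minimal sketch of such a transformation Lambda handler. The records/recordId/result/data contract is the standard Firehose data-transformation interface; the transformation itself (uppercasing the payload) is purely illustrative:

import base64

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        # Incoming record data is base64-encoded.
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = payload.upper()  # placeholder transformation
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}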

Data Delivery Frequency

Each Kinesis Data Firehose destination has its own data delivery frequency.

Amazon S3

The frequency of data delivery to Amazon S3 is determined by the Amazon S3 Buffer size and Buffer interval values that you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before it delivers it to Amazon S3. You can configure the values for Amazon S3 Buffer size (1–128 MB) or Buffer interval (60–900 seconds). The condition that is satisfied first triggers data delivery to Amazon S3. When data delivery to the destination falls behind data writing to the delivery stream, Kinesis Data Firehose raises the buffer size dynamically. It can then catch up and ensure that all data is delivered to the destination.
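
A hedged configuration sketch using the CreateDeliveryStream API follows; the names and ARNs are placeholders. It sets Buffer size to 64 MB and Buffer interval to 300 seconds, so delivery is triggered by whichever threshold is reached first:

import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",  # placeholder
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-firehose-bucket",  # placeholder
        # Delivery triggers on whichever hint is satisfied first.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
    },
)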

Amazon Redshift

The frequency of data COPY operations from Amazon S3 to Amazon Redshift is determined by how fast your Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup can finish the COPY command. If there is still data to copy, Kinesis Data Firehose issues a new COPY command as soon as the previous COPY command is successfully finished by Amazon Redshift.

Amazon OpenSearch Service

The frequency of data delivery to OpenSearch Service is determined by the OpenSearch Service Buffer size and Buffer interval values that you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before delivering it to OpenSearch Service. You can configure the values for OpenSearch Service Buffer size (1–100 MB) or Buffer interval (60–900 seconds). The condition that is satisfied first triggers data delivery to OpenSearch Service.

Amazon OpenSearch Serverless

The frequency of data delivery to OpenSearch Serverless is determined by the OpenSearch Serverless Buffer size and Buffer interval values that you configured for your delivery stream. Kinesis Data Firehose buffers incoming data before delivering it to OpenSearch Serverless. You can configure the values for OpenSearch Serverless Buffer size (1–100 MB) or Buffer interval (60–900 seconds). The condition that is satisfied first triggers data delivery to OpenSearch Serverless.

Splunk

Kinesis Data Firehose buffers incoming data before delivering it to Splunk. The buffer size is 5 MB, and the buffer interval is 60 seconds. The condition satisfied first triggers data delivery to Splunk. The buffer size and interval aren't configurable. These numbers are optimal.

HTTP endpoint destination

Kinesis Data Firehose buffers incoming data before delivering it to the specified HTTP endpoint destination. The recommended buffer size for the destination varies from service provider to service provider. For example, the recommended buffer size for Datadog is 4 MiB, and the recommended buffer size for New Relic and Sumo Logic is 1 MiB. Contact the third-party service provider whose endpoint you've chosen as your data destination for more information about their recommended buffer size.

Data Delivery Failure Handling

Each Kinesis Data Firehose destination has its own data delivery failure handling.

Amazon S3

Data delivery to your S3 bucket might fail for various reasons. For example, the bucket might no longer exist, the IAM role that Kinesis Data Firehose assumes might not have access to the bucket, or a network failure or similar event might occur. Under these conditions, Kinesis Data Firehose keeps retrying for up to 24 hours until the delivery succeeds. The maximum data storage time of Kinesis Data Firehose is 24 hours. If data delivery fails for more than 24 hours, your data is lost.

Amazon Redshift

For an Amazon Redshift destination, you can specify a retry duration (0–7200 seconds) when creating a delivery stream.

Data delivery to your Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup might fail for several reasons. For example, your delivery stream might have an incorrect cluster configuration, the cluster or workgroup might be under maintenance, or a network failure might occur. Under these conditions, Kinesis Data Firehose retries for the specified time duration and skips that particular batch of Amazon S3 objects. The skipped objects' information is delivered to your S3 bucket as a manifest file in the errors/ folder, which you can use for manual backfill. For information about how to COPY data manually with manifest files, see Using a Manifest to Specify Data Files.
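
To locate the manifests for a manual backfill, you can list the errors/ folder of the bucket, as in this sketch (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# List the manifest files that Firehose wrote for skipped COPY batches.
resp = s3.list_objects_v2(Bucket="my-firehose-bucket", Prefix="errors/")
for item in resp.get("Contents", []):
    # Each key points to a manifest usable with COPY ... MANIFEST.
    print(item["Key"])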

Amazon OpenSearch Service and OpenSearch Serverless

For the OpenSearch Service and OpenSearch Serverless destinations, you can specify a retry duration (0–7200 seconds) when creating a delivery stream.

Data delivery to your OpenSearch Service cluster or OpenSearch Serverless collection might fail for several reasons. For example, your delivery stream might have an incorrect OpenSearch Service cluster or OpenSearch Serverless collection configuration, the cluster or collection might be under maintenance, or a network failure or similar event might occur. Under these conditions, Kinesis Data Firehose retries for the specified time duration and then skips that particular index request. The skipped documents are delivered to your S3 bucket in the AmazonOpenSearchService_failed/ folder, which you can use for manual backfill.

For OpenSearch Service, each document has the following JSON format:

{ "attemptsMade": "(number of index requests attempted)", "arrivalTimestamp": "(the time when the document was received by Firehose)", "errorCode": "(http error code returned by OpenSearch Service)", "errorMessage": "(error message returned by OpenSearch Service)", "attemptEndingTimestamp": "(the time when Firehose stopped attempting index request)", "esDocumentId": "(intended OpenSearch Service document ID)", "esIndexName": "(intended OpenSearch Service index name)", "esTypeName": "(intended OpenSearch Service type name)", "rawData": "(base64-encoded document data)" }

For OpenSearch Serverless, each document has the following JSON format:

{ "attemptsMade": "(number of index requests attempted)", "arrivalTimestamp": "(the time when the document was received by Firehose)", "errorCode": "(http error code returned by OpenSearch Serverless)", "errorMessage": "(error message returned by OpenSearch Serverless)", "attemptEndingTimestamp": "(the time when Firehose stopped attempting index request)", "osDocumentId": "(intended OpenSearch Serverless document ID)", "osIndexName": "(intended OpenSearch Serverless index name)", "rawData": "(base64-encoded document data)" }
Splunk

When Kinesis Data Firehose sends data to Splunk, it waits for an acknowledgment from Splunk. If an error occurs, or the acknowledgment doesn’t arrive within the acknowledgment timeout period, Kinesis Data Firehose starts the retry duration counter. It keeps retrying until the retry duration expires. After that, Kinesis Data Firehose considers it a data delivery failure and backs up the data to your Amazon S3 bucket.

Every time Kinesis Data Firehose sends data to Splunk, whether it's the initial attempt or a retry, it restarts the acknowledgment timeout counter. It then waits for an acknowledgment to arrive from Splunk. Even if the retry duration expires, Kinesis Data Firehose still waits for the acknowledgment until it receives it or the acknowledgment timeout is reached. If the acknowledgment times out, Kinesis Data Firehose checks to determine whether there's time left in the retry counter. If there is time left, it retries again and repeats the logic until it receives an acknowledgment or determines that the retry time has expired.

A failure to receive an acknowledgment isn't the only type of data delivery error that can occur. For information about the other types of data delivery errors, see Splunk Data Delivery Errors. Any data delivery error triggers the retry logic if your retry duration is greater than 0.

The following is an example error record.

{ "attemptsMade": 0, "arrivalTimestamp": 1506035354675, "errorCode": "Splunk.AckTimeout", "errorMessage": "Did not receive an acknowledgement from HEC before the HEC acknowledgement timeout expired. Despite the acknowledgement timeout, it's possible the data was indexed successfully in Splunk. Kinesis Firehose backs up in Amazon S3 data for which the acknowledgement timeout expired.", "attemptEndingTimestamp": 13626284715507, "rawData": "MiAyNTE2MjAyNzIyMDkgZW5pLTA1ZjMyMmQ1IDIxOC45Mi4xODguMjE0IDE3Mi4xNi4xLjE2NyAyNTIzMyAxNDMzIDYgMSA0MCAxNTA2MDM0NzM0IDE1MDYwMzQ3OTQgUkVKRUNUIE9LCg==", "EventId": "49577193928114147339600778471082492393164139877200035842.0" }
HTTP endpoint destination

When Kinesis Data Firehose sends data to an HTTP endpoint destination, it waits for a response from this destination. If an error occurs, or the response doesn’t arrive within the response timeout period, Kinesis Data Firehose starts the retry duration counter. It keeps retrying until the retry duration expires. After that, Kinesis Data Firehose considers it a data delivery failure and backs up the data to your Amazon S3 bucket.

Every time Kinesis Data Firehose sends data to an HTTP endpoint destination, whether it's the initial attempt or a retry, it restarts the response timeout counter. It then waits for a response to arrive from the HTTP endpoint destination. Even if the retry duration expires, Kinesis Data Firehose still waits for the response until it receives it or the response timeout is reached. If the response times out, Kinesis Data Firehose checks to determine whether there's time left in the retry counter. If there is time left, it retries again and repeats the logic until it receives a response or determines that the retry time has expired.

A failure to receive a response isn't the only type of data delivery error that can occur. For information about the other types of data delivery errors, see HTTP Endpoint Data Delivery Errors.

The following is an example error record.

{ "attemptsMade":5, "arrivalTimestamp":1594265943615, "errorCode":"HttpEndpoint.DestinationException", "errorMessage":"Received the following response from the endpoint destination. {"requestId": "109777ac-8f9b-4082-8e8d-b4f12b5fc17b", "timestamp": 1594266081268, "errorMessage": "Unauthorized"}", "attemptEndingTimestamp":1594266081318, "rawData":"c2FtcGxlIHJhdyBkYXRh", "subsequenceNumber":0, "dataId":"49607357361271740811418664280693044274821622880012337186.0" }

Amazon S3 Object Name Format

Kinesis Data Firehose adds a UTC time prefix in the format YYYY/MM/dd/HH before writing objects to Amazon S3. This prefix creates a logical hierarchy in the bucket, where each forward slash (/) creates a level in the hierarchy. You can modify this structure by specifying a custom prefix. For information about how to specify a custom prefix, see Custom Prefixes for Amazon S3 Objects.

The Amazon S3 object name follows the pattern DeliveryStreamName-DeliveryStreamVersion-YYYY-MM-dd-HH-MM-SS-RandomString, where DeliveryStreamVersion begins with 1 and increases by 1 for every configuration change of the Kinesis Data Firehose delivery stream. You can change delivery stream configurations (for example, the name of the S3 bucket, buffering hints, compression, and encryption). You can do so by using the Kinesis Data Firehose console or the UpdateDestination API operation.
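
As an illustration only (this is not an official API), the following sketch assembles a key that matches the documented prefix and object name pattern:

from datetime import datetime, timezone

def default_object_key(stream_name, version, random_string, now=None):
    # YYYY/MM/dd/HH UTC prefix followed by
    # DeliveryStreamName-DeliveryStreamVersion-YYYY-MM-dd-HH-MM-SS-RandomString.
    now = now or datetime.now(timezone.utc)
    prefix = now.strftime("%Y/%m/%d/%H")
    name = f"{stream_name}-{version}-{now.strftime('%Y-%m-%d-%H-%M-%S')}-{random_string}"
    return f"{prefix}/{name}"

print(default_object_key("mystream", 1, "RandomString"))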

Index Rotation for the OpenSearch Service Destination

For the OpenSearch Service destination, you can specify one of the following five time-based index rotation options: NoRotation, OneHour, OneDay, OneWeek, or OneMonth.

Depending on the rotation option you choose, Kinesis Data Firehose appends a portion of the UTC arrival timestamp to your specified index name. It rotates the appended timestamp accordingly. The following example shows the resulting index name in OpenSearch Service for each index rotation option, where the specified index name is myindex and the arrival timestamp is 2016-02-25T13:00:00Z.

RotationPeriod    IndexName
NoRotation        myindex
OneHour           myindex-2016-02-25-13
OneDay            myindex-2016-02-25
OneWeek           myindex-2016-w08
OneMonth          myindex-2016-02
Note

With the OneWeek option, Kinesis Data Firehose auto-creates indexes using the format <YEAR>-w<WEEK NUMBER> (for example, 2020-w33), where the week number is calculated using UTC time and according to the following US conventions (see the sketch after this list):

  • A week starts on Sunday

  • The first week of the year is the first week of that year that contains a Saturday
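
As a hedged approximation of this convention, Python's %U directive (weeks start on Sunday; days before the year's first Sunday fall in week 0) reproduces the documented example of 2016-02-25 mapping to week 08:

from datetime import datetime

def one_week_index_name(base_name, arrival):
    # %U: week number of the year, with Sunday as the first day of the week.
    return f"{base_name}-{arrival:%Y}-w{arrival:%U}"

print(one_week_index_name("myindex", datetime(2016, 2, 25, 13)))  # myindex-2016-w08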

Delivery Across Amazon Accounts and Across Amazon Regions for HTTP Endpoint Destinations

Kinesis Data Firehose supports data delivery to HTTP endpoint destinations across Amazon accounts. The Kinesis Data Firehose delivery stream and the HTTP endpoint that you've chosen as your destination can be in different Amazon accounts.

Kinesis Data Firehose also supports data delivery to HTTP endpoint destinations across Amazon Regions. You can deliver data from a delivery stream in one Amazon Region to an HTTP endpoint in another Amazon Region. You can also deliver data from a delivery stream to an HTTP endpoint destination outside of Amazon Regions, for example to your own on-premises server, by setting the HTTP endpoint URL to your desired destination. For these scenarios, additional data transfer charges are added to your delivery costs. For more information, see the Data Transfer section in the "On-Demand Pricing" page.

Duplicated Records

Kinesis Data Firehose uses at-least-once semantics for data delivery. In some circumstances, such as when data delivery times out, delivery retries by Kinesis Data Firehose might introduce duplicates if the original data-delivery request eventually goes through. This applies to all destination types that Kinesis Data Firehose supports.
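
Because delivery is at-least-once, consumers that require exactly-once behavior must deduplicate downstream. A minimal sketch, assuming each record's content serves as a stable identity, keys records by a content hash:

import hashlib

seen: set[str] = set()

def is_duplicate(record_bytes: bytes) -> bool:
    # Derive a stable key from the record content; a real system would
    # persist these keys in a store rather than keep them in memory.
    key = hashlib.sha256(record_bytes).hexdigest()
    if key in seen:
        return True
    seen.add(key)
    return False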

How to Pause and Resume a Kinesis Data Firehose delivery stream

After you set up a delivery stream in Kinesis Data Firehose, data available in the stream source is continuously delivered to the destination. If your stream destination is temporarily unavailable (for example, during planned maintenance operations), you might want to pause data delivery temporarily and resume when the destination becomes available again. The following sections show how you can accomplish this:

Important

When you use the approach described below to pause and resume a stream, after you resume the stream you will see a few records delivered to the error bucket in Amazon S3 while the rest of the stream continues to be delivered to the destination. This is a known limitation of the approach: it occurs because a small number of records that could not previously be delivered to the destination after multiple retries are tracked as failed.

Understanding how Kinesis Data Firehose handles delivery failures

When you set up a delivery stream in Kinesis Data Firehose, for many destinations such as OpenSearch, Splunk, and HTTP endpoints, you also set up an S3 bucket where data that fails to be delivered can be backed up. For more information about how Kinesis Data Firehose backs up data in case of failed deliveries, see Data Delivery Failure Handling. For more information about how to grant access to S3 buckets where data that fails to be delivered can be backed up, see Grant Kinesis Data Firehose Access to an Amazon S3 Destination. When Kinesis Data Firehose (a) fails to deliver data to the stream destination, and (b) fails to write data to the backup S3 bucket for failed deliveries, it effectively pauses stream delivery until the data can either be delivered to the destination or written to the backup S3 location.

Pausing a Kinesis Data Firehose delivery stream

To pause stream delivery in Kinesis Data Firehose, first remove permissions for Kinesis Data Firehose to write to the S3 backup location for failed deliveries. For example, if you want to pause the delivery stream with an OpenSearch destination, you can do this by updating permissions. For more information, see Grant Kinesis Data Firehose Access to a Public OpenSearch Service Destination.

Remove the "Effect": "Allow" permission for the action s3:PutObject, and explicitly add a statement that applies Effect": "Deny" permission on the action s3:PutObject for the S3 bucket used for backing up failed deliveries. Next, turn off the stream destination (for example, turning off the destination OpenSearch domain), or remove permissions for Kinesis Data Firehose to write to the destination. To update permissions for other destinations, check the section for your destination in Controlling Access with Amazon Kinesis Data Firehose. After you complete these two actions, Kinesis Data Firehose will stop delivering streams, and you can monitor this using CloudWatch metrics for Kinesis Data Firehose.

Important

When you pause stream delivery in Kinesis Data Firehose, you need to ensure that the source of the stream (for example, Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka) is configured to retain data until stream delivery is resumed and the data is delivered to the destination. If the source is Direct PUT, Kinesis Data Firehose retains data for 24 hours. Data loss could occur if you do not resume the stream and deliver the data before the data retention period expires.

Resuming a Kinesis Data Firehose delivery stream

To resume delivery, first revert the change made earlier to the stream destination by turning on the destination and ensuring that Kinesis Data Firehose has permissions to deliver the stream to the destination. Next, revert the changes made earlier to permissions applied to the S3 bucket for backing up failed deliveries. That is, apply "Effect": "Allow" permission for the action s3:PutObject, and remove "Effect": "Deny" permission on the action s3:PutObject for the S3 bucket used for backing up failed deliveries. Finally, monitor using CloudWatch metrics for Kinesis Data Firehose to confirm that the stream is being delivered to the destination. To view and troubleshoot errors, use Amazon CloudWatch Logs monitoring for Kinesis Data Firehose.
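
To confirm that delivery has resumed, you can query the delivery stream's CloudWatch metrics, as in this sketch (DeliveryToS3.Success is a real Firehose metric; the stream name is a placeholder):

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# A sustained average of 1 indicates that delivery has resumed.
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="DeliveryToS3.Success",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "my-delivery-stream"}],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=300,
    Statistics=["Average"],
)
print(resp["Datapoints"])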