
Streaming ingestion

Streaming ingestion provides low-latency, high-speed ingestion of stream data from Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka into an Amazon Redshift provisioned or Amazon Redshift Serverless materialized view. It shortens the time it takes to access data and reduces storage costs. You can configure streaming ingestion for your Amazon Redshift cluster or for Amazon Redshift Serverless and create a materialized view using SQL statements, as described in Creating materialized views in Amazon Redshift. After that, using materialized-view refresh, you can ingest hundreds of megabytes of data per second. This results in fast access to external data that is refreshed quickly.
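
As a minimal sketch of that setup, the following SQL creates an external schema over a Kinesis data stream and a materialized view that lands its records. The schema, stream, and view names and the role ARN below are placeholders, not values from this guide:

    -- Map a Kinesis data stream into Redshift through an external schema.
    -- (The role ARN is a placeholder.)
    CREATE EXTERNAL SCHEMA kds
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

    -- Landing area for the stream: JSON payloads are parsed into a
    -- SUPER column as records arrive. (Stream and view names are
    -- placeholders.)
    CREATE MATERIALIZED VIEW my_stream_mv AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           sequence_number,
           JSON_PARSE(kinesis_data) AS payload
    FROM kds."my_stream"
    WHERE CAN_JSON_PARSE(kinesis_data);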

Data flow

An Amazon Redshift provisioned cluster or an Amazon Redshift Serverless workgroup is the stream consumer. A materialized view is the landing area for data read from the stream, which is processed as it arrives. For instance, JSON values can be consumed and mapped to the materialized view's data columns, using familiar SQL. When the materialized view is refreshed, Redshift consumes data from allocated Kinesis data shards or Kafka partitions until the view reaches parity with the SEQUENCE_NUMBER of the Kinesis stream or the last offset of the Kafka topic. Subsequent materialized-view refreshes read data from the last SEQUENCE_NUMBER of the previous refresh until they reach parity with the stream or topic data.
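
To make the refresh cycle concrete, here is what a manual refresh looks like against the hypothetical my_stream_mv view from the earlier sketch; each refresh resumes from the position recorded by the previous one:

    -- Consume records from the last recorded SEQUENCE_NUMBER (Kinesis)
    -- or offset (Kafka) up to the current tip of the stream or topic.
    REFRESH MATERIALIZED VIEW my_stream_mv;

    -- Landed data is then queryable with ordinary SQL.
    SELECT COUNT(*) FROM my_stream_mv;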

Streaming ingestion use cases

Use cases for Amazon Redshift streaming ingestion involve working with data that's generated continually (streamed) and must be processed within a short period (latency) of its generation. This is called near real-time analytics. Sources of data can vary, and include IoT devices, system telemetry data, or clickstream data from a busy website or application.

Streaming ingestion considerations

The following are important considerations and best practices for performance and billing as you set up your streaming ingestion environment.

  • Auto refresh usage and activation - Auto refresh queries for a materialized view or views are treated as any other user workload. Auto refresh loads data from the stream as it arrives.

    Auto refresh can be turned on explicitly for a materialized view created for streaming ingestion. To do this, specify AUTO REFRESH in the materialized view definition. Manual refresh is the default. To turn on auto refresh for an existing materialized view for streaming ingestion, run ALTER MATERIALIZED VIEW. For more information, see CREATE MATERIALIZED VIEW or ALTER MATERIALIZED VIEW. For a sketch showing these statements, see the example following this list.

  • Streaming ingestion and Amazon Redshift Serverless - The same setup and configuration instructions that apply to Amazon Redshift streaming ingestion on a provisioned cluster also apply to streaming ingestion on Amazon Redshift Serverless. It's important to size Amazon Redshift Serverless with the necessary level of RPUs to support streaming ingestion with auto refresh and other workloads. For more information, see Billing for Amazon Redshift Serverless.

  • Amazon Redshift nodes in a different Availability Zone than the Amazon MSK cluster - When you configure streaming ingestion, Amazon Redshift attempts to connect to an Amazon MSK cluster in the same Availability Zone, if rack awareness is enabled for Amazon MSK. If all of your Amazon MSK nodes are in different Availability Zones from your Amazon Redshift cluster, you can incur cross-Availability Zone data-transfer costs. To avoid this, keep at least one Amazon MSK broker node in the same Availability Zone as your Redshift provisioned cluster or workgroup.

  • Refresh start location - After creating a materialized view, its initial refresh starts from the TRIM_HORIZON of a Kinesis stream, or from offset 0 of an Amazon MSK topic.

  • Data formats - Supported data formats are limited to those that can be converted from VARBYTE. For more information, see VARBYTE type and VARBYTE operators.

  • Appending records to a table - You can run ALTER TABLE APPEND to append rows to a target table from an existing source materialized view. This works only if the materialized view is configured for streaming ingestion. For more information, see ALTER TABLE APPEND. This statement is also shown in the example following this list.

  • Running TRUNCATE or DELETE - You can remove records from a materialized view that's used for streaming ingestion, using either of two methods:

    • TRUNCATE – This command deletes all of the rows from a materialized view that's configured for streaming ingestion. It doesn't do a table scan. For more information, see TRUNCATE.

    • DELETE – This command deletes all of the rows from a materialized view that's configured for streaming ingestion. For more information, see DELETE.
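
The following sketch pulls together the statements mentioned in the considerations above. The view and table names (my_stream_mv, events_archive) are placeholders:

    -- Turn on auto refresh for an existing streaming materialized view.
    ALTER MATERIALIZED VIEW my_stream_mv AUTO REFRESH YES;

    -- Move landed rows into a permanent table with a matching structure.
    ALTER TABLE events_archive APPEND FROM my_stream_mv;

    -- Delete all rows from the view without a table scan.
    TRUNCATE my_stream_mv;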

Streaming ingestion best practices and recommendations

In some cases, you have options for how you configure streaming ingestion. We recommend the following best practices, based on our own testing and on our experience helping customers avoid issues that lead to data loss.

  • Extracting values from streamed data – If you use the JSON_EXTRACT_PATH_TEXT function in your materialized view definition to shred incoming streaming JSON, it can significantly impact performance and latency. To explain: for each column extracted using JSON_EXTRACT_PATH_TEXT, the incoming JSON is re-parsed. After that, any data-type conversion, filtering, and business logic occurs. This means, for example, that if you extract 10 columns from your JSON data, each JSON record is parsed 10 times, which includes type conversions and additional logic. This results in higher ingestion latency. The approach we recommend instead is to use the JSON_PARSE function to convert JSON records to Redshift's SUPER data type. After the streamed data lands in the materialized view, use PartiQL to extract individual fields from SUPER's representation of the JSON data. For more information, see Querying semistructured data. A sketch of this approach follows this list.

    It's also important to note that JSON_EXTRACT_PATH_TEXT has a 64KB data-size maximum. Thus, if any JSON record is larger than 64KB, processing it with JSON_EXTRACT_PATH_TEXT results in an error.

  • Mapping an Amazon Kinesis Data Streams stream or Amazon MSK topic to an Amazon Redshift streaming-ingestion materialized view – We don't recommend creating multiple streaming-ingestion materialized views to ingest data from a single Amazon Kinesis Data Streams stream or Amazon MSK topic. This is because each materialized view creates a consumer for each shard in the Kinesis Data Streams stream or partition in the Kafka topic. This can result in throttling or exceeding the throughput of the stream or topic. It also can result in higher cost, since you're ingesting the same data multiple times. We recommend that you create one streaming materialized view for each stream or topic.

    If your use case requires that you land the data from one KDS stream or MSK topic into multiple materialized views, consult the Amazon Big Data blog, specifically Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK, before you do so.
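
Here's a minimal sketch of the recommended pattern. The schema, stream, and field names (device_id, temperature) are placeholders:

    -- Parse each JSON record once, at ingestion, into a SUPER column,
    -- instead of calling JSON_EXTRACT_PATH_TEXT once per extracted column.
    CREATE MATERIALIZED VIEW events_mv AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kds."events_stream"
    WHERE CAN_JSON_PARSE(kinesis_data);

    -- Shred individual fields from SUPER at query time with PartiQL.
    SELECT payload.device_id,
           payload.temperature
    FROM events_mv;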

Using streaming ingestion compared with staging data in Amazon S3

There are several options for streaming data to Amazon Redshift or to Amazon Redshift Serverless. Two well-known options are streaming ingestion, as described in this topic, and setting up a delivery stream to Amazon S3 with Firehose. The following list describes each method:

  1. Streaming ingestion from Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka to Amazon Redshift or Amazon Redshift Serverless involves configuring a materialized view to receive the data.

  2. Delivering data into Amazon Redshift using Kinesis Data Streams and Firehose involves connecting the source stream to Amazon Data Firehose and waiting for Firehose to stage the data in Amazon S3. This process uses various-sized batches at varying-length buffer intervals. After streaming to Amazon S3, Firehose initiates a COPY command to load the data. A sketch of that kind of COPY follows.
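
For illustration only, the COPY that Firehose issues on your behalf looks roughly like the following. The table, bucket, prefix, and role ARN are placeholders:

    -- Load the objects that Firehose staged in Amazon S3.
    COPY public.events
    FROM 's3://my-firehose-bucket/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    JSON 'auto';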

With streaming ingestion, you bypass several steps that are required for the second process:

  • You don't have to send data to an Amazon Data Firehose delivery stream, because with streaming ingestion, data can be sent directly from Kinesis Data Streams to a materialized view in a Redshift database.

  • You don't have to land streamed data in Amazon S3, because streaming ingestion data goes directly to the Redshift materialized view.

  • You don't have to write and run COPY commands because the data in the materialized view is refreshed directly from the stream. Loading data from Amazon S3 to Redshift isn't part of the process.

Note that streaming ingestion is limited to streams from Amazon Kinesis Data Streams and topics from Amazon MSK. For streaming from Kinesis Data Streams to targets other than Amazon Redshift, it's likely that you need a Firehose delivery stream. For more information, see Sending Data to an Amazon Data Firehose Delivery Stream.

Limitations

Kafka topic name length limit

It isn't possible to use a Kafka topic with a name longer than 128 characters (not including quotation marks). For more information, see Names and identifiers.

Incremental refreshes and JOINs on a materialized view

The materialized view must be incrementally maintainable. Full recompute isn't possible for Kinesis or Amazon MSK because, by default, they don't preserve stream or topic history past 24 hours (Kinesis) or 7 days (Amazon MSK). You can set longer data-retention periods in Kinesis or Amazon MSK. However, this can result in more maintenance and cost. Additionally, JOINs are not currently supported on materialized views created on a Kinesis stream, or on an Amazon MSK topic. After creating a materialized view on your stream or topic, you can create another materialized view in order to join your streaming materialized view to other materialized views, tables, or views, as shown in the sketch following this entry.

For more information, see REFRESH MATERIALIZED VIEW.
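
As a sketch of that workaround, the second materialized view below joins the hypothetical streaming view my_stream_mv (with a JSON payload in a SUPER column) to a placeholder dimension table, device_dim:

    -- JOINs aren't allowed in the streaming materialized view itself,
    -- but a second materialized view defined on top of it can join.
    CREATE MATERIALIZED VIEW enriched_mv AS
    SELECT s.approximate_arrival_timestamp,
           s.payload.temperature AS temperature,
           d.site_name
    FROM my_stream_mv s
    JOIN device_dim d
      ON d.device_id = s.payload.device_id::VARCHAR;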

Record parsing

Amazon Redshift streaming ingestion doesn't support parsing records that have been aggregated by the Kinesis Producer Library (KPL Key Concepts - Aggregation). The aggregated records are ingested, but are stored as binary protocol buffer data. (See Protocol buffers for more information.) Depending on how you push data to Kinesis, you might need to turn off KPL aggregation.

Decompression

VARBYTE does not currently support any decompression methods. Because of this, records containing compressed data can't be queried in Redshift. Decompress your data before pushing it into the Kinesis stream or Amazon MSK topic.

Maximum record size

The maximum size of any record field Amazon Redshift can ingest from Kinesis or Amazon MSK is slightly less than 1MB. The following points detail the behavior:

  • Maximum VARBYTE length - The VARBYTE type supports data to a maximum length of 1,024,000 bytes. Because Kinesis limits payloads to 1MB after Base64 encoding, all Kinesis data can be ingested by Amazon Redshift.

  • Message limits - Default Amazon MSK configuration limits messages to 1MB. Additionally, if a message includes headers, the amount of data is limited to 1,048,470 bytes. With default settings, there are no problems with ingestion. However, you can change the maximum message size for Kafka, and therefore Amazon MSK, to a larger value. In this case, it might be possible for the key/value field of a Kafka record, or the header, to exceed the size limit. These records can cause an error and are not ingested.

Error records

When a record can't be ingested to Redshift because the size of its data exceeds the maximum, that record is skipped. The materialized view refresh still succeeds in this case, and a segment of each error record is written to the SYS_STREAM_SCAN_ERRORS system table. Errors that result from business logic, such as an error in a calculation or an error resulting from a type conversion, are not skipped. Test such logic carefully before you add it to your materialized view definition, to avoid these errors.
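
For example, a quick way to inspect skipped records after a refresh is to query the system table directly:

    -- Each row holds a segment of a skipped record and details about
    -- why it couldn't be ingested.
    SELECT *
    FROM sys_stream_scan_errors
    LIMIT 10;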

Amazon MSK Multi-VPC private connectivity

Amazon MSK multi-VPC private connectivity isn't currently supported for Redshift streaming ingestion. Alternatively, you can use VPC peering to connect VPCs or Amazon Transit Gateway to connect VPCs and on-premises networks through a central hub. Either of these can enable Redshift to communicate with an Amazon MSK cluster or with Amazon MSK Serverless in another VPC.