Streaming ingestion
Streaming ingestion provides low-latency, high-speed ingestion of stream data from Amazon Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka (Amazon MSK) into an Amazon Redshift materialized view.
Data flow
An Amazon Redshift provisioned cluster or an Amazon Redshift Serverless workgroup is the stream consumer. A materialized view is the landing area for data read from the stream, which is processed as it arrives. For instance, JSON values can be consumed and mapped to the materialized view's data columns, using familiar SQL. When the materialized view is refreshed, Redshift consumes data from allocated Kinesis data shards or Kafka partitions until the view reaches parity with the SEQUENCE_NUMBER for the Kinesis stream or the last offset for the Kafka topic. Subsequent materialized view refreshes read data from the last SEQUENCE_NUMBER of the previous refresh until the view reaches parity with the stream or topic data.
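The flow above can be sketched in SQL. This is a minimal sketch, not a complete setup: the schema name, IAM role ARN, stream name, and view name are placeholders, and the `kinesis_data` column is the raw VARBYTE payload that Redshift exposes for a Kinesis source.

```sql
-- Placeholder names throughout: adjust the role ARN, schema, and stream.
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

-- The materialized view is the landing area for the stream.
CREATE MATERIALIZED VIEW orders_stream_mv AS
    SELECT approximate_arrival_timestamp,
           partition_key,
           shard_id,
           sequence_number,
           JSON_PARSE(kinesis_data) AS payload   -- map JSON into a column
    FROM kinesis_schema."orders-stream"
    WHERE CAN_JSON_PARSE(kinesis_data);

-- Each refresh consumes from the last SEQUENCE_NUMBER forward.
REFRESH MATERIALIZED VIEW orders_stream_mv;
```

An Amazon MSK source follows the same shape, with `CREATE EXTERNAL SCHEMA ... FROM MSK` and a topic in place of the stream.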
Streaming ingestion use cases
Use cases for Amazon Redshift streaming ingestion involve working with data that's generated continually (streamed) and must be processed within a short period (latency) of its generation. This is called near real-time analytics. Sources of data can vary, and include IoT devices, system telemetry data, or clickstream data from a busy website or application.
Streaming ingestion considerations
The following are important considerations and best practices for performance and billing as you set up your streaming ingestion environment.
Auto refresh usage and activation - Auto refresh queries for a materialized view or views are treated as any other user workload. Auto refresh loads data from the stream as it arrives.
Auto refresh can be turned on explicitly for a materialized view created for streaming ingestion. To do this, specify AUTO REFRESH in the materialized view definition. Manual refresh is the default. To turn on auto refresh for an existing streaming-ingestion materialized view, run ALTER MATERIALIZED VIEW. For more information, see CREATE MATERIALIZED VIEW or ALTER MATERIALIZED VIEW.
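As a sketch, assuming a hypothetical streaming materialized view named `clicks_mv` over an external schema `kinesis_schema`, auto refresh can be set at creation time or added afterward:

```sql
-- AUTO REFRESH YES at creation time (manual refresh is the default).
CREATE MATERIALIZED VIEW clicks_mv
AUTO REFRESH YES
AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."clicks-stream"
    WHERE CAN_JSON_PARSE(kinesis_data);

-- Or turn it on later for an existing streaming materialized view.
ALTER MATERIALIZED VIEW clicks_mv AUTO REFRESH YES;
```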
Streaming ingestion and Amazon Redshift Serverless - The same setup and configuration instructions that apply to Amazon Redshift streaming ingestion on a provisioned cluster also apply to streaming ingestion on Amazon Redshift Serverless. It's important to size Amazon Redshift Serverless with the necessary level of RPUs to support streaming ingestion with auto refresh and other workloads. For more information, see Billing for Amazon Redshift Serverless.
Amazon Redshift nodes in a different Availability Zone than the Amazon MSK cluster - When you configure streaming ingestion, Amazon Redshift attempts to connect to an Amazon MSK cluster in the same Availability Zone, if rack awareness is enabled for Amazon MSK. If all of your Amazon MSK nodes are in different Availability Zones from your Amazon Redshift cluster, you can incur cross-Availability-Zone data-transfer costs. To avoid this, keep at least one Amazon MSK broker cluster node in the same Availability Zone as your Redshift provisioned cluster or workgroup.
Refresh start location - After creating a materialized view, its initial refresh starts from the TRIM_HORIZON of a Kinesis stream, or from offset 0 of an Amazon MSK topic.
Data formats - Supported data formats are limited to those that can be converted from VARBYTE. For more information, see VARBYTE type and VARBYTE operators.
Appending records to a table - You can run ALTER TABLE APPEND to append rows to a target table from an existing source materialized view. This works only if the materialized view is configured for streaming ingestion. For more information, see ALTER TABLE APPEND.
Running TRUNCATE or DELETE - You can remove records from a materialized view that's used for streaming ingestion, using one of two methods:
TRUNCATE – This command deletes all of the rows from a materialized view that's configured for streaming ingestion. It doesn't do a table scan. For more information, see TRUNCATE.
DELETE – This command deletes all of the rows from a materialized view that's configured for streaming ingestion. For more information, see DELETE.
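The append and removal operations above can be sketched as follows, assuming a hypothetical streaming materialized view `orders_stream_mv` and a durable target table `orders_archive`:

```sql
-- Append ingested rows from the streaming materialized view into a
-- durable target table (works only for streaming-ingestion views).
ALTER TABLE orders_archive APPEND FROM orders_stream_mv;

-- Clear the streaming materialized view without a table scan.
TRUNCATE orders_stream_mv;

-- Or remove the rows with DELETE instead.
DELETE FROM orders_stream_mv;
```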
Streaming ingestion best practices and recommendations
There are cases where you have options in how you configure streaming ingestion. We recommend the following best practices. They are based on our own tests and on our experience helping customers avoid issues that can lead to data loss.
Extracting values from streamed data – If you use the JSON_EXTRACT_PATH_TEXT function in your materialized view definition to shred incoming streaming JSON, it can significantly impact performance and latency. To explain, for each column extracted using JSON_EXTRACT_PATH_TEXT, the incoming JSON is re-parsed. After that, any data-type conversion, filtering, and business logic occurs. This means, for example, that if you extract 10 columns from your JSON data, each JSON record is parsed 10 times, which includes type conversions and additional logic. This results in higher ingestion latency. An alternative approach we recommend is to use the JSON_PARSE function to convert JSON records to Redshift's SUPER data type. After the streamed data lands in the materialized view, use PartiQL to extract individual strings from SUPER's representation of the JSON data. For more information, see Querying semistructured data.
It's also important to note that JSON_EXTRACT_PATH_TEXT has a 64KB data-size maximum. Thus, if any JSON record is larger than 64KB, processing it with JSON_EXTRACT_PATH_TEXT results in an error.
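The two approaches can be contrasted in SQL. This is a sketch with placeholder schema, stream, and column names; `FROM_VARBYTE` is used because JSON_EXTRACT_PATH_TEXT operates on character data, not VARBYTE:

```sql
-- Slower pattern: each JSON_EXTRACT_PATH_TEXT call re-parses the record,
-- so extracting N columns parses each JSON record N times.
CREATE MATERIALIZED VIEW orders_shredded_mv AS
    SELECT JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'order_id') AS order_id,
           JSON_EXTRACT_PATH_TEXT(FROM_VARBYTE(kinesis_data, 'utf-8'), 'amount')   AS amount
    FROM kinesis_schema."orders-stream";

-- Recommended pattern: parse once into SUPER, then shred with PartiQL
-- after the data lands in the materialized view.
CREATE MATERIALIZED VIEW orders_super_mv AS
    SELECT JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."orders-stream"
    WHERE CAN_JSON_PARSE(kinesis_data);

SELECT payload.order_id, payload.amount
FROM orders_super_mv;
```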
Mapping an Amazon Kinesis Data Streams stream or Amazon MSK topic to an Amazon Redshift streaming-ingestion materialized view – We don't recommend creating multiple streaming-ingestion materialized views to ingest data from a single Amazon Kinesis Data Streams stream or Amazon MSK topic. This is because each materialized view creates a consumer for each shard in the Kinesis Data Streams stream or partition in the Kafka topic. This can result in throttling or exceeding the throughput of the stream or topic. It also can result in higher cost, since you're ingesting the same data multiple times. We recommend that you create one streaming materialized view for each stream or topic.
If your use case requires that you land the data from one KDS stream or MSK topic into multiple materialized views, consult the AWS Big Data Blog, specifically Best practices to implement near-real-time analytics using Amazon Redshift Streaming Ingestion with Amazon MSK, before you do so.
Using streaming ingestion compared with staging data in Amazon S3
There are several options for streaming data to Amazon Redshift or to Amazon Redshift Serverless. Two well-known options are streaming ingestion, as described in this topic, and setting up a delivery stream to Amazon S3 with Amazon Data Firehose. The following list describes each method:
Streaming ingestion from Kinesis Data Streams or Amazon Managed Streaming for Apache Kafka to Amazon Redshift or Amazon Redshift Serverless involves configuring a materialized view to receive the data.
Delivering data into Amazon Redshift using Kinesis Data Streams and streaming through Firehose involves connecting the source stream to Amazon Data Firehose and waiting for Firehose to stage the data in Amazon S3. This process makes use of various-sized batches at varying-length buffer intervals. After streaming to Amazon S3, Firehose initiates a COPY command to load the data.
With streaming ingestion, you bypass several steps that are required for the second process:
You don't have to send data to an Amazon Data Firehose delivery stream, because with streaming ingestion, data can be sent directly from Kinesis Data Streams to a materialized view in a Redshift database.
You don't have to land streamed data in Amazon S3, because streaming ingestion data goes directly to the Redshift materialized view.
You don't have to write and run COPY commands because the data in the materialized view is refreshed directly from the stream. Loading data from Amazon S3 to Redshift isn't part of the process.
Note that streaming ingestion is limited to streams from Amazon Kinesis Data Streams and topics from Amazon MSK. For streaming from Kinesis Data Streams to targets other than Amazon Redshift, it's likely that you need a Firehose delivery stream. For more information, see Sending Data to an Amazon Data Firehose Delivery Stream.
Limitations
Feature or behavior | Description |
---|---|
Kafka topic length limit | It isn't possible to use a Kafka topic with a name longer than 128 characters (not including quotation marks). For more information, see Names and identifiers. |
Incremental refreshes and JOINs on a materialized view | The materialized view must be incrementally maintainable. Full recompute isn't possible for Kinesis or Amazon MSK because, by default, they don't preserve stream or topic history past 24 hours or 7 days, respectively. You can set longer data retention periods in Kinesis or Amazon MSK. However, this can result in more maintenance and cost. Additionally, JOINs are not currently supported on materialized views created on a Kinesis stream, or on an Amazon MSK topic. After creating a materialized view on your stream or topic, you can create another materialized view in order to join your streaming materialized view to other materialized views, tables, or views. For more information, see REFRESH MATERIALIZED VIEW. |
Record parsing | Amazon Redshift streaming ingestion doesn't support parsing records that have been aggregated by the Kinesis Producer Library (KPL Key Concepts - Aggregation). The aggregated records are ingested, but are stored as binary protocol buffer data. (See Protocol buffers for more information.) |
Decompression | |
Maximum record size | The maximum size of any record field Amazon Redshift can ingest from Kinesis or Amazon MSK is slightly less than 1 MB. |
Error records | In each case where a record can't be ingested to Redshift because the size of the data exceeds the maximum size, that record is skipped. The materialized view refresh still succeeds in this case, and a segment of each error record is written to the SYS_STREAM_SCAN_ERRORS system table. Errors that result from business logic, such as an error in a calculation or an error resulting from a type conversion, are not skipped. Test the logic carefully before you add it to your materialized view definition, to avoid these errors. |
Amazon MSK Multi-VPC private connectivity | Amazon MSK multi-VPC private connectivity isn't currently supported for Redshift streaming ingestion. Alternatively, you can use VPC peering to connect VPCs, or Amazon Transit Gateway to connect VPCs and on-premises networks through a central hub. Either of these can enable Redshift to communicate with an Amazon MSK cluster or with Amazon MSK Serverless in another VPC. |
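The JOIN limitation above allows a common workaround: join in a second materialized view layered on top of the streaming one. A sketch, assuming a hypothetical streaming materialized view `orders_stream_mv` with a SUPER column `payload`, and a `customers` dimension table:

```sql
-- JOINs aren't supported in the streaming materialized view itself,
-- so create a second materialized view that joins it to other relations.
CREATE MATERIALIZED VIEW orders_enriched_mv AS
    SELECT s.payload.order_id::INT  AS order_id,
           s.payload.amount::DECIMAL(12, 2) AS amount,
           c.customer_name
    FROM orders_stream_mv s
    JOIN customers c
      ON s.payload.customer_id::INT = c.customer_id;
```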