WLM query monitoring rules
In Amazon Redshift workload management (WLM), query monitoring rules define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries. For example, for a queue dedicated to short-running queries, you might create a rule that cancels queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.
You define query monitoring rules as part of your workload management (WLM) configuration. You can define up to 25 rules for each queue, with a limit of 25 rules for all queues. Each rule includes up to three conditions, or predicates, and one action. A predicate consists of a metric, a comparison condition (=, <, or >), and a value. If all of the predicates for any rule are met, that rule's action is triggered. Possible rule actions are log, hop, abort, and change priority, as discussed following.
The rules in a given queue apply only to queries running in that queue. A rule is independent of other rules.
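To make the two example rules above concrete, the following is a minimal sketch of how they might look in the WLM configuration JSON (the rule names and the nested-loop threshold are illustrative, not prescribed; the rule syntax is described later in this section):

```json
"rules": [
  {
    "rule_name": "abort_long_short_queue_query",
    "predicate": [
      { "metric_name": "query_execution_time", "operator": ">", "value": 60 }
    ],
    "action": "abort"
  },
  {
    "rule_name": "log_nested_loop_query",
    "predicate": [
      { "metric_name": "nested_loop_join_row_count", "operator": ">", "value": 100 }
    ],
    "action": "log"
  }
]
```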
WLM evaluates metrics every 10 seconds. If more than one rule is triggered during the same period, WLM chooses the rule with the most severe action. If the action for two rules has the same severity, WLM runs the rules in alphabetical order, based on the rule name. If the action is hop or abort, the action is logged and the query is evicted from the queue. If the action is log, the query continues to run in the queue. WLM initiates only one log action per query per rule. If the queue contains other rules, those rules remain in effect. If the action is hop and the query is routed to another queue, the rules for the new queue apply. For more information about query monitoring and tracking actions taken on specific queries, see the collection of samples at Short query acceleration.
When all of a rule's predicates are met, WLM writes a row to the STL_WLM_RULE_ACTION system table. In addition, Amazon Redshift records query metrics for currently running queries to STV_QUERY_METRICS. Metrics for completed queries are stored in STL_QUERY_METRICS.
Defining a query monitoring rule
You create query monitoring rules as part of your WLM configuration, which you define as part of your cluster's parameter group definition.
You can create rules using the Amazon Web Services Management Console or programmatically using JSON.
Note
If you choose to create rules programmatically, we strongly recommend using the console to generate the JSON that you include in the parameter group definition. For more information, see Creating or modifying a query monitoring rule using the console and Configuring Parameter Values Using the Amazon CLI in the Amazon Redshift Management Guide.
To define a query monitoring rule, you specify the following elements:
- A rule name – Rule names must be unique within the WLM configuration. Rule names can be up to 32 alphanumeric characters or underscores, and can't contain spaces or quotation marks. You can have up to 25 rules per queue, and the total limit for all queues is 25 rules.
- One or more predicates – You can have up to three predicates per rule. If all the predicates for any rule are met, the associated action is triggered. A predicate is defined by a metric name, an operator (=, <, or >), and a value. An example is `query_cpu_time > 100000`. For a list of metrics and examples of values for different metrics, see Query monitoring metrics for Amazon Redshift provisioned following in this section. For a rule that combines several predicates, see the sketch following this list.
- An action – If more than one rule is triggered, WLM chooses the rule with the most severe action. Possible actions, in ascending order of severity, are:
  - Log – Record information about the query in the STL_WLM_RULE_ACTION system table. Use the log action when you want to only write a log record. WLM creates at most one log per query, per rule. Following a log action, other rules remain in force and WLM continues to monitor the query.
  - Hop (only available with manual WLM) – Log the action and hop the query to the next matching queue. If there isn't another matching queue, the query is canceled. QMR hops only CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements. For more information, see WLM query queue hopping.
  - Abort – Log the action and cancel the query. QMR doesn't stop COPY statements and maintenance operations, such as ANALYZE and VACUUM.
  - Change priority (only available with automatic WLM) – Change the priority of a query.
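As a sketch of how multiple predicates combine (all of them must be met for the action to fire), the following hypothetical rule aborts a query only when it both runs long and spills heavily to disk; the rule name and thresholds are illustrative:

```json
"rules": [
  {
    "rule_name": "abort_long_spilling_query",
    "predicate": [
      { "metric_name": "query_execution_time", "operator": ">", "value": 120 },
      { "metric_name": "query_temp_blocks_to_disk", "operator": ">", "value": 100000 }
    ],
    "action": "abort"
  }
]
```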
To limit the runtime of queries, we recommend creating a query monitoring rule instead of using WLM timeout. For example, you can set the WLM timeout parameter max_execution_time to 50,000 milliseconds as shown in the following JSON snippet.

```json
"max_execution_time": 50000
```

But we recommend instead that you define an equivalent query monitoring rule that sets query_execution_time to 50 seconds, as shown in the following example:
"rules": [ { "rule_name": "rule_query_execution", "predicate": [ { "metric_name": "query_execution_time", "operator": ">", "value": 50 } ], "action": "abort" } ]
For steps to create or modify a query monitoring rule, see Creating or modifying a query monitoring rule using the console and Properties in the wlm_json_configuration Parameter in the Amazon Redshift Management Guide.
You can find more information about query monitoring rules in the following topics:
Query monitoring metrics for Amazon Redshift provisioned
The following table describes the metrics used in query monitoring rules. (These metrics are distinct from the metrics stored in the STV_QUERY_METRICS and STL_QUERY_METRICS system tables.)
For a given metric, the performance threshold is tracked either at the query level or the segment level. For more information about segments and steps, see Query planning and execution workflow.
Note
The WLM timeout parameter is distinct from query monitoring rules.
| Metric | Name | Description |
| --- | --- | --- |
| Query CPU time | `query_cpu_time` | CPU time used by the query, in seconds. CPU time is distinct from Query execution time. Valid values are 0–999,999. |
| Blocks read | `query_blocks_read` | Number of 1 MB data blocks read by the query. Valid values are 0–1,048,575. |
| Scan row count | `scan_row_count` | The number of rows in a scan step. The row count is the total number of rows emitted before filtering rows marked for deletion (ghost rows) and before applying user-defined query filters. Valid values are 0–999,999,999,999,999. |
| Query execution time | `query_execution_time` | Elapsed execution time for a query, in seconds. Execution time doesn't include time spent waiting in a queue. Valid values are 0–86,399. |
| Query queue time | `query_queue_time` | Time spent waiting in a queue, in seconds. Valid values are 0–86,399. |
| CPU usage | `query_cpu_usage_percent` | Percent of CPU capacity used by the query. Valid values are 0–6,399. |
| Memory to disk | `query_temp_blocks_to_disk` | Temporary disk space used to write intermediate results, in 1 MB blocks. Valid values are 0–319,815,679. |
| CPU skew | `cpu_skew` | The ratio of maximum CPU usage for any slice to average CPU usage for all slices. This metric is defined at the segment level. Valid values are 0–99. |
| I/O skew | `io_skew` | The ratio of maximum blocks read (I/O) for any slice to average blocks read for all slices. This metric is defined at the segment level. Valid values are 0–99. |
| Rows joined | `join_row_count` | The number of rows processed in a join step. Valid values are 0–999,999,999,999,999. |
| Nested loop join row count | `nested_loop_join_row_count` | The number of rows in a nested loop join. Valid values are 0–999,999,999,999,999. |
| Return row count | `return_row_count` | The number of rows returned by the query. Valid values are 0–999,999,999,999,999. |
| Segment execution time | `segment_execution_time` | Elapsed execution time for a single segment, in seconds. To avoid or reduce sampling errors, include `segment_execution_time > 10` in your rules. Valid values are 0–86,388. |
| Spectrum scan row count | `spectrum_scan_row_count` | The number of rows of data in Amazon S3 scanned by an Amazon Redshift Spectrum query. Valid values are 0–999,999,999,999,999. |
| Spectrum scan size | `spectrum_scan_size_mb` | The size of data in Amazon S3, in MB, scanned by an Amazon Redshift Spectrum query. Valid values are 0–999,999,999,999,999. |
| Query priority | `query_priority` | The priority of the query. Valid values are HIGHEST, HIGH, NORMAL, LOW, and LOWEST. |
Note
- The hop action is not supported with the `query_queue_time` predicate. That is, rules defined to hop when a `query_queue_time` predicate is met are ignored.
- Short segment execution times can result in sampling errors with some metrics, such as `io_skew` and `query_cpu_usage_percent`. To avoid or reduce sampling errors, include segment execution time in your rules. A good starting point is `segment_execution_time > 10`.
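For example, a rule that watches for I/O skew might pair the skew predicate with a minimum segment execution time so that very short segments don't trigger it. The following is a sketch only; the rule name and thresholds are illustrative:

```json
"rules": [
  {
    "rule_name": "log_high_io_skew",
    "predicate": [
      { "metric_name": "segment_execution_time", "operator": ">", "value": 10 },
      { "metric_name": "io_skew", "operator": ">", "value": 1.3 }
    ],
    "action": "log"
  }
]
```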
The SVL_QUERY_METRICS view shows the metrics for completed queries. The SVL_QUERY_METRICS_SUMMARY view shows the maximum values of metrics for completed queries. Use the values in these views as an aid to determine threshold values for defining query monitoring rules.
Query monitoring metrics for Amazon Redshift Serverless
The following table describes the metrics used in query monitoring rules for Amazon Redshift Serverless.
| Metric | Name | Description |
| --- | --- | --- |
| Query CPU time | `max_query_cpu_time` | CPU time used by the query, in seconds. CPU time is distinct from Query execution time. Valid values are 0–999,999. |
| Blocks read | `max_query_blocks_read` | Number of 1 MB data blocks read by the query. Valid values are 0–1,048,575. |
| Scan row count | `max_scan_row_count` | The number of rows in a scan step. The row count is the total number of rows emitted before filtering rows marked for deletion (ghost rows) and before applying user-defined query filters. Valid values are 0–999,999,999,999,999. |
| Query execution time | `max_query_execution_time` | Elapsed execution time for a query, in seconds. Execution time doesn't include time spent waiting in a queue. If a query exceeds the set execution time, Amazon Redshift Serverless stops the query. Valid values are 0–86,399. |
| Query queue time | `max_query_queue_time` | Time spent waiting in a queue, in seconds. Valid values are 0–86,399. |
| CPU usage | `max_query_cpu_usage_percent` | Percent of CPU capacity used by the query. Valid values are 0–6,399. |
| Memory to disk | `max_query_temp_blocks_to_disk` | Temporary disk space used to write intermediate results, in 1 MB blocks. Valid values are 0–319,815,679. |
| Rows joined | `max_join_row_count` | The number of rows processed in a join step. Valid values are 0–999,999,999,999,999. |
| Nested loop join row count | `max_nested_loop_join_row_count` | The number of rows in a nested loop join. Valid values are 0–999,999,999,999,999. |
Note
- The hop action is not supported with the `max_query_queue_time` predicate. That is, rules defined to hop when a `max_query_queue_time` predicate is met are ignored.
- Short segment execution times can result in sampling errors with some metrics, such as `max_io_skew` and `max_query_cpu_usage_percent`.
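In Amazon Redshift Serverless, limits based on these metrics are attached to the workgroup rather than defined in a WLM JSON configuration. As a hedged sketch only, assuming the Serverless workgroup API's configParameters shape, capping query runtime at one hour might look like the following (the parameter value is illustrative):

```json
"configParameters": [
  {
    "parameterKey": "max_query_execution_time",
    "parameterValue": "3600"
  }
]
```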
Query monitoring rules templates
When you add a rule using the Amazon Redshift console, you can choose to create a rule from a predefined template. Amazon Redshift creates a new rule with a set of predicates and populates the predicates with default values. The default action is log. You can modify the predicates and action to meet your use case.
The following table lists available templates.
| Template Name | Predicates | Description |
| --- | --- | --- |
| Nested loop join | `nested_loop_join_row_count > 100` | A nested loop join might indicate an incomplete join predicate, which often results in a very large return set (a Cartesian product). Use a low row count to find a potentially runaway query early. |
| Query returns a high number of rows | `return_row_count > 1000000` | If you dedicate a queue to simple, short-running queries, you might include a rule that finds queries returning a high row count. The template uses a default of 1 million rows. In some systems, 1 million rows might be high; in a larger system, a billion or more rows might be the more useful threshold. |
| Join with a high number of rows | `join_row_count > 1000000000` | A join step that involves an unusually high number of rows might indicate a need for more restrictive filters. The template uses a default of 1 billion rows. For an ad hoc (one-time) queue that's intended for quick, simple queries, you might use a lower number. |
| High disk usage when writing intermediate results | `query_temp_blocks_to_disk > 100000` | When currently executing queries use more than the available system RAM, the query execution engine writes intermediate results to disk (spilled memory). Typically, this condition is the result of a rogue query, which usually is also the query that uses the most disk space. The acceptable threshold for disk usage varies based on the cluster node type and number of nodes. The template uses a default of 100,000 blocks, or 100 GB. For a small cluster, you might use a lower number. |
| Long running query with high I/O skew | `segment_execution_time > 120` and `io_skew > 1.30` | I/O skew occurs when one node slice has a much higher I/O rate than the other slices. As a starting point, a skew of 1.30 (1.3 times average) is considered high. High I/O skew is not always a problem, but when combined with a long query runtime, it might indicate a problem with the distribution style or sort key. |
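For instance, choosing the Nested loop join template produces a rule that you can then edit. A sketch of the resulting JSON might look like the following (the generated rule name here is assumed, and the default action is log):

```json
"rules": [
  {
    "rule_name": "nested_loop_join",
    "predicate": [
      { "metric_name": "nested_loop_join_row_count", "operator": ">", "value": 100 }
    ],
    "action": "log"
  }
]
```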
System tables and views for query monitoring rules
When all of a rule's predicates are met, WLM writes a row to the STL_WLM_RULE_ACTION system table. This row contains details for the query that triggered the rule and the resulting action.
In addition, Amazon Redshift records query metrics in the following system tables and views.
- The STV_QUERY_METRICS table displays the metrics for currently running queries.
- The STL_QUERY_METRICS table records the metrics for completed queries.
- The SVL_QUERY_METRICS view shows the metrics for completed queries.
- The SVL_QUERY_METRICS_SUMMARY view shows the maximum values of metrics for completed queries.