WLM query monitoring rules - Amazon Redshift
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

WLM query monitoring rules

In Amazon Redshift workload management (WLM), query monitoring rules define metrics-based performance boundaries for WLM queues and specify what action to take when a query goes beyond those boundaries. For example, for a queue dedicated to short running queries, you might create a rule that cancels queries that run for more than 60 seconds. To track poorly designed queries, you might have another rule that logs queries that contain nested loops.

You define query monitoring rules as part of your workload management (WLM) configuration. You can define up to 25 rules for each queue, with a limit of 25 rules for all queues. Each rule includes up to three conditions, or predicates, and one action. A predicate consists of a metric, a comparison condition (=, <, or >), and a value. If all of the predicates for any rule are met, that rule's action is triggered. Possible rule actions are log, hop, and abort, as discussed following.
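For example, a rule of this shape can be sketched in JSON in the same form used by the wlm_json_configuration parameter later in this section. The rule name and threshold values here are illustrative, not defaults:

```json
"rules": [
  {
    "rule_name": "abort_slow_nested_loops",
    "predicate": [
      { "metric_name": "query_execution_time", "operator": ">", "value": 60 },
      { "metric_name": "nested_loop_join_row_count", "operator": ">", "value": 100 }
    ],
    "action": "abort"
  }
]
```

Because a rule's action is triggered only when all of its predicates are met, this rule aborts only queries that run longer than 60 seconds and also process more than 100 rows in a nested loop join.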

The rules in a given queue apply only to queries running in that queue. A rule is independent of other rules.

WLM evaluates metrics every 10 seconds. If more than one rule is triggered during the same period, WLM initiates the most severe action: abort, then hop, then log. If the action is hop or abort, the action is logged and the query is evicted from the queue. If the action is log, the query continues to run in the queue. WLM initiates only one log action per query per rule. If the queue contains other rules, those rules remain in effect. If the action is hop and the query is routed to another queue, the rules for the new queue apply. For more information about query monitoring and tracking actions taken on specific queries, see System tables and views for query monitoring rules.

When all of a rule's predicates are met, WLM writes a row to the STL_WLM_RULE_ACTION system table. In addition, Amazon Redshift records query metrics for currently running queries to STV_QUERY_METRICS. Metrics for completed queries are stored in STL_QUERY_METRICS.

Defining a query monitoring rule

You create query monitoring rules as part of your WLM configuration, which you define as part of your cluster's parameter group definition.

You can create rules using the Amazon Web Services Management Console or programmatically using JSON.

Note

If you choose to create rules programmatically, we strongly recommend using the console to generate the JSON that you include in the parameter group definition. For more information, see Creating or modifying a query monitoring rule using the console and Configuring Parameter Values Using the Amazon CLI in the Amazon Redshift Management Guide.

To define a query monitoring rule, you specify the following elements:

  • A rule name – Rule names must be unique within the WLM configuration. Rule names can be up to 32 alphanumeric characters or underscores, and can't contain spaces or quotation marks. You can have up to 25 rules per queue, and the total limit for all queues is 25 rules.

  • One or more predicates – You can have up to three predicates per rule. If all the predicates for any rule are met, the associated action is triggered. A predicate is defined by a metric name, an operator (=, <, or >), and a value. An example is query_cpu_time > 100000. For a list of metrics and examples of values for different metrics, see Query monitoring metrics for Amazon Redshift provisioned, following in this section.

  • An action – If more than one rule is triggered, WLM chooses the rule with the most severe action. Possible actions, in ascending order of severity, are:

    • Log – Record information about the query in the STL_WLM_RULE_ACTION system table. Use the Log action when you want to only write a log record. WLM creates at most one log per query, per rule. Following a log action, other rules remain in force and WLM continues to monitor the query.

    • Hop (only available with manual WLM) – Log the action and hop the query to the next matching queue. If there isn't another matching queue, the query is canceled. QMR hops only CREATE TABLE AS (CTAS) statements and read-only queries, such as SELECT statements. For more information, see WLM query queue hopping.

    • Abort – Log the action and cancel the query. QMR doesn't stop COPY statements and maintenance operations, such as ANALYZE and VACUUM.

    • Change priority (only available with automatic WLM) – Change the priority of a query.
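For example, on a cluster using manual WLM, a hop rule might be sketched as follows; the rule name and the 120-second threshold are illustrative:

```json
"rules": [
  {
    "rule_name": "hop_long_queries",
    "predicate": [
      { "metric_name": "query_execution_time", "operator": ">", "value": 120 }
    ],
    "action": "hop"
  }
]
```

If no other matching queue exists for a hopped query, the query is canceled.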

To limit the runtime of queries, we recommend creating a query monitoring rule instead of using WLM timeout. For example, instead of setting the WLM timeout parameter max_execution_time to 50,000 milliseconds, as shown in the following JSON snippet:

"max_execution_time": 50000

define an equivalent query monitoring rule that sets query_execution_time to 50 seconds, as shown in the following JSON snippet.

"rules": [ { "rule_name": "rule_query_execution", "predicate": [ { "metric_name": "query_execution_time", "operator": ">", "value": 50 } ], "action": "abort" } ]

For steps to create or modify a query monitoring rule, see Creating or modifying a query monitoring rule using the console and Properties in the wlm_json_configuration Parameter in the Amazon Redshift Management Guide.

You can find more information about query monitoring rules in the following topics:

Query monitoring metrics for Amazon Redshift provisioned

The following metrics are used in query monitoring rules. (These metrics are distinct from the metrics stored in the STV_QUERY_METRICS and STL_QUERY_METRICS system tables.)

For a given metric, the performance threshold is tracked either at the query level or the segment level. For more information about segments and steps, see Query planning and execution workflow.

Note

The WLM timeout parameter is distinct from query monitoring rules.

  • Query CPU time (query_cpu_time) – CPU time used by the query, in seconds. CPU time is distinct from query execution time. Valid values are 0–999,999.

  • Blocks read (query_blocks_read) – Number of 1 MB data blocks read by the query. Valid values are 0–1,048,575.

  • Scan row count (scan_row_count) – The number of rows in a scan step. The row count is the total number of rows emitted before filtering rows marked for deletion (ghost rows) and before applying user-defined query filters. Valid values are 0–999,999,999,999,999.

  • Query execution time (query_execution_time) – Elapsed execution time for a query, in seconds. Execution time doesn't include time spent waiting in a queue. Valid values are 0–86,399.

  • Query queue time (query_queue_time) – Time spent waiting in a queue, in seconds. Valid values are 0–86,399.

  • CPU usage (query_cpu_usage_percent) – Percent of CPU capacity used by the query. Valid values are 0–6,399.

  • Memory to disk (query_temp_blocks_to_disk) – Temporary disk space used to write intermediate results, in 1 MB blocks. Valid values are 0–319,815,679.

  • CPU skew (cpu_skew) – The ratio of maximum CPU usage for any slice to average CPU usage for all slices. This metric is defined at the segment level. Valid values are 0–99.

  • I/O skew (io_skew) – The ratio of maximum blocks read (I/O) for any slice to average blocks read for all slices. This metric is defined at the segment level. Valid values are 0–99.

  • Rows joined (join_row_count) – The number of rows processed in a join step. Valid values are 0–999,999,999,999,999.

  • Nested loop join row count (nested_loop_join_row_count) – The number of rows in a nested loop join. Valid values are 0–999,999,999,999,999.

  • Return row count (return_row_count) – The number of rows returned by the query. Valid values are 0–999,999,999,999,999.

  • Segment execution time (segment_execution_time) – Elapsed execution time for a single segment, in seconds. To avoid or reduce sampling errors, include segment_execution_time > 10 in your rules. Valid values are 0–86,388.

  • Spectrum scan row count (spectrum_scan_row_count) – The number of rows of data in Amazon S3 scanned by an Amazon Redshift Spectrum query. Valid values are 0–999,999,999,999,999.

  • Spectrum scan size (spectrum_scan_size_mb) – The size of data in Amazon S3, in MB, scanned by an Amazon Redshift Spectrum query. Valid values are 0–999,999,999,999,999.

  • Query priority (query_priority) – The priority of the query. Valid values are HIGHEST, HIGH, NORMAL, LOW, and LOWEST. When comparing query_priority using the greater than (>) and less than (<) operators, HIGHEST is greater than HIGH, HIGH is greater than NORMAL, and so on.

Note
  • The hop action is not supported with the query_queue_time predicate. That is, rules defined to hop when a query_queue_time predicate is met are ignored.

  • Short segment execution times can result in sampling errors with some metrics, such as io_skew and query_cpu_usage_percent. To avoid or reduce sampling errors, include segment execution time in your rules. A good starting point is segment_execution_time > 10.
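Following that guidance, a skew rule guarded by segment execution time might look like the following sketch; the rule name and threshold values are illustrative:

```json
"rules": [
  {
    "rule_name": "log_high_io_skew",
    "predicate": [
      { "metric_name": "segment_execution_time", "operator": ">", "value": 10 },
      { "metric_name": "io_skew", "operator": ">", "value": 1.3 }
    ],
    "action": "log"
  }
]
```

The segment_execution_time predicate keeps the rule from acting on very short segments, where the skew measurement is least reliable.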

The SVL_QUERY_METRICS view shows the metrics for completed queries. The SVL_QUERY_METRICS_SUMMARY view shows the maximum values of metrics for completed queries. Use the values in these views as an aid to determine threshold values for defining query monitoring rules.

Query monitoring metrics for Amazon Redshift Serverless

The following metrics are used in query monitoring rules for Amazon Redshift Serverless.

  • Query CPU time (max_query_cpu_time) – CPU time used by the query, in seconds. CPU time is distinct from query execution time. Valid values are 0–999,999.

  • Blocks read (max_query_blocks_read) – Number of 1 MB data blocks read by the query. Valid values are 0–1,048,575.

  • Scan row count (max_scan_row_count) – The number of rows in a scan step. The row count is the total number of rows emitted before filtering rows marked for deletion (ghost rows) and before applying user-defined query filters. Valid values are 0–999,999,999,999,999.

  • Query execution time (max_query_execution_time) – Elapsed execution time for a query, in seconds. Execution time doesn't include time spent waiting in a queue. If a query exceeds the set execution time, Amazon Redshift Serverless stops the query. Valid values are 0–86,399.

  • Query queue time (max_query_queue_time) – Time spent waiting in a queue, in seconds. Valid values are 0–86,399.

  • CPU usage (max_query_cpu_usage_percent) – Percent of CPU capacity used by the query. Valid values are 0–6,399.

  • Memory to disk (max_query_temp_blocks_to_disk) – Temporary disk space used to write intermediate results, in 1 MB blocks. Valid values are 0–319,815,679.

  • Rows joined (max_join_row_count) – The number of rows processed in a join step. Valid values are 0–999,999,999,999,999.

  • Nested loop join row count (max_nested_loop_join_row_count) – The number of rows in a nested loop join. Valid values are 0–999,999,999,999,999.

Note
  • The hop action is not supported with the max_query_queue_time predicate. That is, rules defined to hop when a max_query_queue_time predicate is met are ignored.

  • Short segment execution times can result in sampling errors with some metrics, such as max_io_skew and max_query_cpu_usage_percent.

Query monitoring rules templates

When you add a rule using the Amazon Redshift console, you can choose to create a rule from a predefined template. Amazon Redshift creates a new rule with a set of predicates and populates the predicates with default values. The default action is log. You can modify the predicates and action to meet your use case.

The following templates are available.

  • Nested loop join – Predicate: nested_loop_join_row_count > 100. A nested loop join might indicate an incomplete join predicate, which often results in a very large return set (a Cartesian product). Use a low row count to find a potentially runaway query early.

  • Query returns a high number of rows – Predicate: return_row_count > 1000000. If you dedicate a queue to simple, short running queries, you might include a rule that finds queries returning a high row count. The template uses a default of 1 million rows. What counts as high depends on your system: one million rows might be high in a smaller system, while in a larger system a billion or more rows might be.

  • Join with a high number of rows – Predicate: join_row_count > 1000000000. A join step that involves an unusually high number of rows might indicate a need for more restrictive filters. The template uses a default of 1 billion rows. For an ad hoc (one-time) queue that's intended for quick, simple queries, you might use a lower number.

  • High disk usage when writing intermediate results – Predicate: query_temp_blocks_to_disk > 100000. When currently executing queries use more than the available system RAM, the query execution engine writes intermediate results to disk (spilled memory). Typically, this condition is the result of a rogue query, which usually is also the query that uses the most disk space. The acceptable threshold for disk usage varies based on the cluster node type and number of nodes. The template uses a default of 100,000 blocks, or 100 GB. For a small cluster, you might use a lower number.

  • Long running query with high I/O skew – Predicates: segment_execution_time > 120 and io_skew > 1.30. I/O skew occurs when one node slice has a much higher I/O rate than the other slices. As a starting point, a skew of 1.30 (1.3 times the average) is considered high. High I/O skew is not always a problem, but when combined with a long running query time, it might indicate a problem with the distribution style or sort key.
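As an illustration, choosing the high disk usage template and keeping its defaults produces a rule equivalent to the following JSON sketch. The rule name is hypothetical; the template supplies the predicate and the default log action:

```json
"rules": [
  {
    "rule_name": "high_disk_usage",
    "predicate": [
      { "metric_name": "query_temp_blocks_to_disk", "operator": ">", "value": 100000 }
    ],
    "action": "log"
  }
]
```

After creating the rule from the template, you can tighten the threshold or change the action to abort once you have observed typical disk usage on your cluster.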

System tables and views for query monitoring rules

When all of a rule's predicates are met, WLM writes a row to the STL_WLM_RULE_ACTION system table. This row contains details for the query that triggered the rule and the resulting action.

In addition, Amazon Redshift records query metrics in the following system tables and views.