Troubleshooting workload issues for Aurora MySQL databases
Database workload can be viewed as reads and writes. With an understanding of "normal" database workload, you can tune queries and the database server to meet demand as it changes. There are a number of different reasons why performance can change, so the first step is to understand what has changed.
-
Has there been a major or minor version upgrade?
A major version upgrade includes changes to the engine code, particularly in the optimizer, that can change the query execution plan. When upgrading database versions, especially major versions, it's very important that you analyze the database workload and tune accordingly. Tuning can involve optimizing and rewriting queries, or adding and updating parameter settings, depending on the results of testing. Understanding what is causing the impact will allow you to start focusing on that specific area.
For more information, see What is new in MySQL 8.0
and Server and status variables and options added, deprecated, or removed in MySQL 8.0 in the MySQL documentation, and Comparison of Aurora MySQL version 2 and Aurora MySQL version 3. -
Has there been an increase in data being processed (row counts)?
-
Are there more queries running concurrently?
-
Are there schema or database changes?
-
Have there been code defects or fixes?
Contents
Instance host metrics
Monitor instance host metrics such as CPU, memory, and network activity to help understand whether there has been a workload change. There are two main concepts for understanding workload changes:
-
Utilization – The usage of a device, such as CPU or disk. It can be time-based or capacity-based.
-
Time-based – The amount of time that a resource is busy over a particular observation period.
-
Capacity-based – The amount of throughput that a system or component can deliver, as a percentage of its capacity.
-
-
Saturation – The degree to which more work is required of a resource than it can process. When capacity-based usage reaches 100%, the extra work can't be processed and must be queued.
CPU usage
You can use the following tools to identify CPU usage and saturation:
-
CloudWatch provides the
CPUUtilization
metric. If this reaches 100%, then the instance is saturated. However, CloudWatch metrics are averaged over 1 minute, and lack granularity.For more information on CloudWatch metrics, see Instance-level metrics for Amazon Aurora.
-
Enhanced Monitoring provides metrics returned by the operating system
top
command. It shows load averages and the following CPU states, with 1-second granularity:-
Idle (%)
= Idle time -
IRQ (%)
= Software interrupts -
Nice (%)
= Nice time for processes with a nicedpriority. -
Steal (%)
= Time spent serving other tenants (virtualization related) -
System (%)
= System time -
User (%)
= User time -
Wait (%)
= I/O wait
For more information on Enhanced Monitoring metrics, see OS metrics for Aurora.
-
Memory usage
If the system is under memory pressure, and resource consumption is reaching saturation, you should be observing a high degree of page scanning, paging, swapping, and out-of-memory errors.
You can use the following tools to identify memory usage and saturation:
CloudWatch provides the FreeableMemory
metric, that shows how much
memory can be reclaimed by flushing some of the OS caches and the current free
memory.
For more information on CloudWatch metrics, see Instance-level metrics for Amazon Aurora.
Enhanced Monitoring provides the following metrics that can help you identify memory usage issues:
-
Buffers (KB)
– The amount of memory used for buffering I/O requests before writing to the storage device, in kilobytes. -
Cached (KB)
– The amount of memory used for caching file system–based I/O. -
Free (KB)
– The amount of unassigned memory, in kilobytes. -
Swap
– Cached, Free, and Total.
For example, if you see that your DB instance uses Swap
memory,
then the total amount of memory for your workload is larger than your instance
currently has available. We recommend increasing the size of your DB instance or
tuning your workload to use less memory.
For more information on Enhanced Monitoring metrics, see OS metrics for Aurora.
For more detailed information on using the Performance Schema and
sys
schema to determine which connections and components are
using memory, see Troubleshooting memory usage issues for
Aurora MySQL databases.
Network throughput
CloudWatch provides the following metrics for total network throughput, all averaged over 1 minute:
-
NetworkReceiveThroughput
– The amount of network throughput received from clients by each instance in the Aurora DB cluster. -
NetworkTransmitThroughput
– The amount of network throughput sent to clients by each instance in the Aurora DB cluster. -
NetworkThroughput
– The amount of network throughput both received from and transmitted to clients by each instance in the Aurora DB cluster. -
StorageNetworkReceiveThroughput
– The amount of network throughput received from the Aurora storage subsystem by each instance in the DB cluster. -
StorageNetworkTransmitThroughput
– The amount of network throughput sent to the Aurora storage subsystem by each instance in the Aurora DB cluster. -
StorageNetworkThroughput
– The amount of network throughput received from and sent to the Aurora storage subsystem by each instance in the Aurora DB cluster.
For more information on CloudWatch metrics, see Instance-level metrics for Amazon Aurora.
Enhanced Monitoring provides the network
received (RX) and transmitted (TX)
graphs, with up to 1-second granularity.
For more information on Enhanced Monitoring metrics, see OS metrics for Aurora.
Database metrics
Examine the following CloudWatch metrics for workload changes:
-
BlockedTransactions
– The average number of transactions in the database that are blocked per second. -
BufferCacheHitRatio
– The percentage of requests that are served by the buffer cache. -
CommitThroughput
– The average number of commit operations per second. -
DatabaseConnections
– The number of client network connections to the database instance. -
Deadlocks
– The average number of deadlocks in the database per second. -
DMLThroughput
– The average number of inserts, updates, and deletes per second. -
ResultSetCacheHitRatio
– The percentage of requests that are served by the query cache. -
RollbackSegmentHistoryListLength
– The undo logs that record committed transactions with delete-marked records. -
RowLockTime
– The total time spent acquiring row locks for InnoDB tables. -
SelectThroughput
– The average number of select queries per second.
For more information on CloudWatch metrics, see Instance-level metrics for Amazon Aurora.
Consider the following questions when examining the workload:
-
Were there recent changes in DB instance class, for example reducing the instance size from 8xlarge to 4xlarge, or changing from db.r5 to db.r6?
-
Can you create a clone and reproduce the issue, or is it happening only on that one instance?
-
Is there server resource exhaustion, high CPU or memory exhaustion? If yes, this could mean that additional hardware is required.
-
Are one or more queries taking longer?
-
Are the changes caused by an upgrade, especially a major version upgrade? If yes, then compare the pre- and post-upgrade metrics.
-
Are there changes in the number of reader DB instances?
-
Have you enabled general, audit, or binary logging? For more information, see Logging for Aurora MySQL databases.
-
Did you enable, disable, or change your use of binary log (binlog) replication?
-
Are there any long-running transactions holding large numbers of row locks? Examine the InnoDB history list length (HLL) for indications of long-running transactions.
For more information, see The InnoDB history list length increased significantly and the blog post Why is my SELECT query running slowly on my Amazon Aurora MySQL DB cluster?
. -
If a large HLL is caused by a write transaction, it means that
UNDO
logs are accumulating (not being cleaned regularly). In a large write transaction, this accumulation can grow quickly. In MySQL,UNDO
is stored in the SYSTEM tablespace. The SYSTEM
tablespace is not shrinkable. TheUNDO
log might cause theSYSTEM
tablespace to grow to several GB, or even TB. After the purge, release the allocated space by taking a logical backup (dump) of the data, then import the dump to a new DB instance. -
If a large HLL is caused by a read transaction (long-running query), it can mean that the query is using a large amount of temporary space. Release the temporary space by rebooting. Examine Performance Insights DB metrics for any changes in the
Temp
section, such ascreated_tmp_tables
. For more information, see Monitoring DB load with Performance Insights on Amazon Aurora.
-
-
Can you split long-running transactions into smaller ones that modify fewer rows?
-
Are there any changes in blocked transactions or increases in deadlocks? Examine Performance Insights DB metrics for any changes in status variables in the
Locks
section, such asinnodb_row_lock_time
,innodb_row_lock_waits
, andinnodb_dead_locks
. Use 1-minute or 5-minute intervals. -
Are there increased wait events? Examine Performance Insights wait events and wait types using 1-minute or 5-minute intervals. Analyze the top wait events and see whether they are correlated to workload changes or database contention. For example,
buf_pool mutex
indicates buffer pool contention. For more information, see Tuning Aurora MySQL with wait events.
Troubleshooting memory usage issues for Aurora MySQL databases
While CloudWatch, Enhanced Monitoring, and Performance Insights provide a good overview of memory usage at the operating system level, such as how much memory the database process is using, they don't allow you to break down what connections or components within the engine might be causing this memory usage.
To troubleshoot this, you can use the Performance Schema and sys
schema. In Aurora MySQL version 3, memory instrumentation is enabled by default when
the Performance Schema is enabled. In Aurora MySQL version 2, only memory
instrumentation for Performance Schema memory usage is enabled by default. For
information on tables available in the Performance Schema to track memory usage and
enabling Performance Schema memory instrumentation, see Memory summary tables
While detailed information is available in the Performance Schema to track current
memory usage, the MySQL sys schema
In the sys
schema, the following views are available to track memory
usage by connection, component, and query.
View | Description |
---|---|
Provides information on engine memory usage by host. This can be useful for identifying which application servers or client hosts are consuming memory. |
|
Provides information on engine memory usage by thread ID. The
thread ID in MySQL can be a client connection or a background
thread. You can map thread IDs to MySQL connection IDs by using
the sys.processlist |
|
Provides information on engine memory usage by user. This can be useful for identifying which user accounts or clients are consuming memory. |
|
Provides information on engine memory usage by engine
component. This can be useful for identifying memory usage
globally by engine buffers or components. For example, you might
see the |
|
Provides an overview of total tracked memory usage in the database engine. |
In Aurora MySQL version 3.05 and higher, you can also track maximum memory usage by
statement digest in the Performance Schema statement summary tablesMAX_TOTAL_MEMORY
column can help you identify maximum
memory used by query digest since the statistics were last reset, or since the
database instance was restarted. This can be useful in identifying specific queries
that might be consuming a lot of memory.
Note
The Performance Schema and sys
schema show you the current memory
usage on the server, and the high-water marks for memory consumed per connection
and engine component. Because the Performance Schema is maintained in memory,
information is reset when the DB instance restarts. To maintain a history over
time, we recommend that you configure retrieval and storage of this data outside
of the Performance Schema.
Example 1: Continuous high memory usage
Looking globally at FreeableMemory
in CloudWatch, we can see that
memory usage greatly increased at 2024-03-26 02:59 UTC.
This doesn't tell us the whole picture. To determine which component is using
the most memory, you can log into the database and look at
sys.memory_global_by_current_bytes
. This table contains a list
of memory events that MySQL tracks, along with information on memory allocation
per event. Each memory tracking event starts with memory/%
,
followed by other information on which engine component/feature the event is
associated with.
For example, memory/performance_schema/%
is for memory events
related to the Performance Schema, memory/innodb/%
is for InnoDB,
and so on. For more information on event naming conventions, see Performance Schema instrument naming conventions
From the following query, we can find the likely culprit based on
current_alloc
, but we can also see many
memory/performance_schema/%
events.
mysql> SELECT * FROM sys.memory_global_by_current_bytes LIMIT 10; +-----------------------------------------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | event_name | current_count | current_alloc | current_avg_alloc | high_count | high_alloc | high_avg_alloc | +-----------------------------------------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | memory/sql/Prepared_statement::main_mem_root | 512817 | 4.91 GiB | 10.04 KiB | 512823 | 4.91 GiB | 10.04 KiB | | memory/performance_schema/prepared_statements_instances | 252 | 488.25 MiB | 1.94 MiB | 252 | 488.25 MiB | 1.94 MiB | | memory/innodb/hash0hash | 4 | 79.07 MiB | 19.77 MiB | 4 | 79.07 MiB | 19.77 MiB | | memory/performance_schema/events_errors_summary_by_thread_by_error | 1028 | 52.27 MiB | 52.06 KiB | 1028 | 52.27 MiB | 52.06 KiB | | memory/performance_schema/events_statements_summary_by_thread_by_event_name | 4 | 47.25 MiB | 11.81 MiB | 4 | 47.25 MiB | 11.81 MiB | | memory/performance_schema/events_statements_summary_by_digest | 1 | 40.28 MiB | 40.28 MiB | 1 | 40.28 MiB | 40.28 MiB | | memory/performance_schema/memory_summary_by_thread_by_event_name | 4 | 31.64 MiB | 7.91 MiB | 4 | 31.64 MiB | 7.91 MiB | | memory/innodb/memory | 15227 | 27.44 MiB | 1.85 KiB | 20619 | 33.33 MiB | 1.66 KiB | | memory/sql/String::value | 74411 | 21.85 MiB | 307 bytes | 76867 | 25.54 MiB | 348 bytes | | memory/sql/TABLE | 8381 | 21.03 MiB | 2.57 KiB | 8381 | 21.03 MiB | 2.57 KiB | +-----------------------------------------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ 10 rows in set (0.02 sec)
We mentioned previously that the Performance Schema is stored in memory, which
means that it's also tracked in the performance_schema
memory
instrumentation.
Note
If you find that the Performance Schema is using a lot of memory, and want
to limit its memory usage, you can tune database parameters based on your
requirements. For more information, see The Performance Schema memory-allocation model
For readability, you can rerun the same query but exclude Performance Schema events. The output shows the following:
-
The main memory consumer is
memory/sql/Prepared_statement::main_mem_root
. -
The
current_alloc
column tells us that MySQL has 4.91 GiB currently allocated to this event. -
The
high_alloc column
tells us that 4.91 GiB is the high-water mark ofcurrent_alloc
since the stats were last reset or since the server restarted. This means thatmemory/sql/Prepared_statement::main_mem_root
is at its highest value.
mysql> SELECT * FROM sys.memory_global_by_current_bytes WHERE event_name NOT LIKE 'memory/performance_schema/%' LIMIT 10; +-----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | event_name | current_count | current_alloc | current_avg_alloc | high_count | high_alloc | high_avg_alloc | +-----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | memory/sql/Prepared_statement::main_mem_root | 512817 | 4.91 GiB | 10.04 KiB | 512823 | 4.91 GiB | 10.04 KiB | | memory/innodb/hash0hash | 4 | 79.07 MiB | 19.77 MiB | 4 | 79.07 MiB | 19.77 MiB | | memory/innodb/memory | 17096 | 31.68 MiB | 1.90 KiB | 22498 | 37.60 MiB | 1.71 KiB | | memory/sql/String::value | 122277 | 27.94 MiB | 239 bytes | 124699 | 29.47 MiB | 247 bytes | | memory/sql/TABLE | 9927 | 24.67 MiB | 2.55 KiB | 9929 | 24.68 MiB | 2.55 KiB | | memory/innodb/lock0lock | 8888 | 19.71 MiB | 2.27 KiB | 8888 | 19.71 MiB | 2.27 KiB | | memory/sql/Prepared_statement::infrastructure | 257623 | 16.24 MiB | 66 bytes | 257631 | 16.24 MiB | 66 bytes | | memory/mysys/KEY_CACHE | 3 | 16.00 MiB | 5.33 MiB | 3 | 16.00 MiB | 5.33 MiB | | memory/innodb/sync0arr | 3 | 7.03 MiB | 2.34 MiB | 3 | 7.03 MiB | 2.34 MiB | | memory/sql/THD::main_mem_root | 815 | 6.56 MiB | 8.24 KiB | 849 | 7.19 MiB | 8.67 KiB | +-----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ 10 rows in set (0.06 sec)
From the name of the event, we can tell that this memory is being used for
prepared statements. If you want to see which connections are using this memory,
you can check memory_by_thread_by_current_bytes
In the following example, each connection has approximately 7 MiB allocated,
with a high-water mark of approximately 6.29 MiB
(current_max_alloc
). This makes sense, because the example is using
sysbench
with 80 tables and 800 connections with prepared
statements. If you want to reduce memory usage in this scenario, you can
optimize your application's usage of prepared statements to reduce memory
consumption.
mysql> SELECT * FROM sys.memory_by_thread_by_current_bytes; +-----------+-------------------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+ | thread_id | user | current_count_used | current_allocated | current_avg_alloc | current_max_alloc | total_allocated | +-----------+-------------------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+ | 46 | rdsadmin@localhost | 405 | 8.47 MiB | 21.42 KiB | 8.00 MiB | 155.86 MiB | | 61 | reinvent@10.0.4.4 | 1749 | 6.72 MiB | 3.93 KiB | 6.29 MiB | 14.24 MiB | | 101 | reinvent@10.0.4.4 | 1845 | 6.71 MiB | 3.72 KiB | 6.29 MiB | 14.50 MiB | | 55 | reinvent@10.0.4.4 | 1674 | 6.68 MiB | 4.09 KiB | 6.29 MiB | 14.13 MiB | | 57 | reinvent@10.0.4.4 | 1416 | 6.66 MiB | 4.82 KiB | 6.29 MiB | 13.52 MiB | | 112 | reinvent@10.0.4.4 | 1759 | 6.66 MiB | 3.88 KiB | 6.29 MiB | 14.17 MiB | | 66 | reinvent@10.0.4.4 | 1428 | 6.64 MiB | 4.76 KiB | 6.29 MiB | 13.47 MiB | | 75 | reinvent@10.0.4.4 | 1389 | 6.62 MiB | 4.88 KiB | 6.29 MiB | 13.40 MiB | | 116 | reinvent@10.0.4.4 | 1333 | 6.61 MiB | 5.08 KiB | 6.29 MiB | 13.21 MiB | | 90 | reinvent@10.0.4.4 | 1448 | 6.59 MiB | 4.66 KiB | 6.29 MiB | 13.58 MiB | | 98 | reinvent@10.0.4.4 | 1440 | 6.57 MiB | 4.67 KiB | 6.29 MiB | 13.52 MiB | | 94 | reinvent@10.0.4.4 | 1433 | 6.57 MiB | 4.69 KiB | 6.29 MiB | 13.49 MiB | | 62 | reinvent@10.0.4.4 | 1323 | 6.55 MiB | 5.07 KiB | 6.29 MiB | 13.48 MiB | | 87 | reinvent@10.0.4.4 | 1323 | 6.55 MiB | 5.07 KiB | 6.29 MiB | 13.25 MiB | | 99 | reinvent@10.0.4.4 | 1346 | 6.54 MiB | 4.98 KiB | 6.29 MiB | 13.24 MiB | | 105 | reinvent@10.0.4.4 | 1347 | 6.54 MiB | 4.97 KiB | 6.29 MiB | 13.34 MiB | | 73 | reinvent@10.0.4.4 | 1335 | 6.54 MiB | 5.02 KiB | 6.29 MiB | 13.23 MiB | | 54 | reinvent@10.0.4.4 | 1510 | 6.53 MiB | 4.43 KiB | 6.29 MiB | 13.49 MiB | . . . . . . | 812 | reinvent@10.0.4.4 | 1259 | 6.38 MiB | 5.19 KiB | 6.29 MiB | 13.05 MiB | | 214 | reinvent@10.0.4.4 | 1279 | 6.38 MiB | 5.10 KiB | 6.29 MiB | 12.90 MiB | | 325 | reinvent@10.0.4.4 | 1254 | 6.38 MiB | 5.21 KiB | 6.29 MiB | 12.99 MiB | | 705 | reinvent@10.0.4.4 | 1273 | 6.37 MiB | 5.13 KiB | 6.29 MiB | 13.03 MiB | | 530 | reinvent@10.0.4.4 | 1268 | 6.37 MiB | 5.15 KiB | 6.29 MiB | 12.92 MiB | | 307 | reinvent@10.0.4.4 | 1263 | 6.37 MiB | 5.17 KiB | 6.29 MiB | 12.87 MiB | | 738 | reinvent@10.0.4.4 | 1260 | 6.37 MiB | 5.18 KiB | 6.29 MiB | 13.00 MiB | | 819 | reinvent@10.0.4.4 | 1252 | 6.37 MiB | 5.21 KiB | 6.29 MiB | 13.01 MiB | | 31 | innodb/srv_purge_thread | 17810 | 3.14 MiB | 184 bytes | 2.40 MiB | 205.69 MiB | | 38 | rdsadmin@localhost | 599 | 1.76 MiB | 3.01 KiB | 1.00 MiB | 25.58 MiB | | 1 | sql/main | 3756 | 1.32 MiB | 367 bytes | 355.78 KiB | 6.19 MiB | | 854 | rdsadmin@localhost | 46 | 1.08 MiB | 23.98 KiB | 1.00 MiB | 5.10 MiB | | 30 | innodb/clone_gtid_thread | 1596 | 573.14 KiB | 367 bytes | 254.91 KiB | 970.69 KiB | | 40 | rdsadmin@localhost | 235 | 245.19 KiB | 1.04 KiB | 128.88 KiB | 808.64 KiB | | 853 | rdsadmin@localhost | 96 | 94.63 KiB | 1009 bytes | 29.73 KiB | 422.45 KiB | | 36 | rdsadmin@localhost | 33 | 36.29 KiB | 1.10 KiB | 16.08 KiB | 74.15 MiB | | 33 | sql/event_scheduler | 3 | 16.27 KiB | 5.42 KiB | 16.04 KiB | 16.27 KiB | | 35 | sql/compress_gtid_table | 8 | 14.20 KiB | 1.77 KiB | 8.05 KiB | 18.62 KiB | | 25 | innodb/fts_optimize_thread | 12 | 1.86 KiB | 158 bytes | 648 bytes | 1.98 KiB | | 23 | innodb/srv_master_thread | 11 | 1.23 KiB | 114 bytes | 361 bytes | 24.40 KiB | | 24 | innodb/dict_stats_thread | 11 | 1.23 KiB | 114 bytes | 361 bytes | 1.35 KiB | | 5 | innodb/io_read_thread | 1 | 144 bytes | 144 bytes | 144 bytes | 144 bytes | | 6 | innodb/io_read_thread | 1 | 144 bytes | 144 bytes | 144 bytes | 144 bytes | | 2 | sql/aws_oscar_log_level_monitor | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 4 | innodb/io_ibuf_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 7 | innodb/io_write_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 8 | innodb/io_write_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 9 | innodb/io_write_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 10 | innodb/io_write_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 11 | innodb/srv_lra_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 12 | innodb/srv_akp_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 18 | innodb/srv_lock_timeout_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 248 bytes | | 19 | innodb/srv_error_monitor_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 20 | innodb/srv_monitor_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 21 | innodb/buf_resize_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 22 | innodb/btr_search_sys_toggle_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 32 | innodb/dict_persist_metadata_table_thread | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | | 34 | sql/signal_handler | 0 | 0 bytes | 0 bytes | 0 bytes | 0 bytes | +-----------+-------------------------------------------+--------------------+-------------------+-------------------+-------------------+-----------------+ 831 rows in set (2.48 sec)
As mentioned earlier, the thread ID (thd_id
) value here can refer
to server background threads or database connections. If you want to map thread
ID values to database connection IDs, you can use the
performance_schema.threads
table or the
sys.processlist
view, where conn_id
is the
connection ID.
mysql> SELECT thd_id,conn_id,user,db,command,state,time,last_wait FROM sys.processlist WHERE user='reinvent@10.0.4.4'; +--------+---------+-------------------+----------+---------+----------------+------+-------------------------------------------------+ | thd_id | conn_id | user | db | command | state | time | last_wait | +--------+---------+-------------------+----------+---------+----------------+------+-------------------------------------------------+ | 590 | 562 | reinvent@10.0.4.4 | sysbench | Execute | closing tables | 0 | wait/io/redo_log_flush | | 578 | 550 | reinvent@10.0.4.4 | sysbench | Sleep | NULL | 0 | idle | | 579 | 551 | reinvent@10.0.4.4 | sysbench | Execute | closing tables | 0 | wait/io/redo_log_flush | | 580 | 552 | reinvent@10.0.4.4 | sysbench | Execute | updating | 0 | wait/io/table/sql/handler | | 581 | 553 | reinvent@10.0.4.4 | sysbench | Execute | updating | 0 | wait/io/table/sql/handler | | 582 | 554 | reinvent@10.0.4.4 | sysbench | Sleep | NULL | 0 | idle | | 583 | 555 | reinvent@10.0.4.4 | sysbench | Sleep | NULL | 0 | idle | | 584 | 556 | reinvent@10.0.4.4 | sysbench | Execute | updating | 0 | wait/io/table/sql/handler | | 585 | 557 | reinvent@10.0.4.4 | sysbench | Execute | closing tables | 0 | wait/io/redo_log_flush | | 586 | 558 | reinvent@10.0.4.4 | sysbench | Execute | updating | 0 | wait/io/table/sql/handler | | 587 | 559 | reinvent@10.0.4.4 | sysbench | Execute | closing tables | 0 | wait/io/redo_log_flush | . . . . . . | 323 | 295 | reinvent@10.0.4.4 | sysbench | Sleep | NULL | 0 | idle | | 324 | 296 | reinvent@10.0.4.4 | sysbench | Execute | updating | 0 | wait/io/table/sql/handler | | 325 | 297 | reinvent@10.0.4.4 | sysbench | Execute | closing tables | 0 | wait/io/redo_log_flush | | 326 | 298 | reinvent@10.0.4.4 | sysbench | Execute | updating | 0 | wait/io/table/sql/handler | | 438 | 410 | reinvent@10.0.4.4 | sysbench | Execute | System lock | 0 | wait/lock/table/sql/handler | | 280 | 252 | reinvent@10.0.4.4 | sysbench | Sleep | starting | 0 | wait/io/socket/sql/client_connection | | 98 | 70 | reinvent@10.0.4.4 | sysbench | Query | freeing items | 0 | NULL | +--------+---------+-------------------+----------+---------+----------------+------+-------------------------------------------------+ 804 rows in set (5.51 sec)
Now we stop the sysbench
workload, which closes the connections
and released the memory. Checking the events again, we can confirm that memory
is released, but high_alloc
still tells us what the high-water mark
is. The high_alloc
column can be very useful in identifying short
spikes in memory usage, where you might not be able to immediately identify
usage from current_alloc
, which shows only currently allocated
memory.
mysql> SELECT * FROM sys.memory_global_by_current_bytes WHERE event_name='memory/sql/Prepared_statement::main_mem_root' LIMIT 10; +----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | event_name | current_count | current_alloc | current_avg_alloc | high_count | high_alloc | high_avg_alloc | +----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | memory/sql/Prepared_statement::main_mem_root | 17 | 253.80 KiB | 14.93 KiB | 512823 | 4.91 GiB | 10.04 KiB | +----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ 1 row in set (0.00 sec)
If you want to reset high_alloc
, you can truncate the
performance_schema
memory summary tables, but this resets all
memory instrumentation. For more information, see Performance Schema general table characteristics
In the following example, we can see that high_alloc
is reset
after truncation.
mysql> TRUNCATE `performance_schema`.`memory_summary_global_by_event_name`; Query OK, 0 rows affected (0.00 sec) mysql> SELECT * FROM sys.memory_global_by_current_bytes WHERE event_name='memory/sql/Prepared_statement::main_mem_root' LIMIT 10; +----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | event_name | current_count | current_alloc | current_avg_alloc | high_count | high_alloc | high_avg_alloc | +----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | memory/sql/Prepared_statement::main_mem_root | 17 | 253.80 KiB | 14.93 KiB | 17 | 253.80 KiB | 14.93 KiB | +----------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ 1 row in set (0.00 sec)
Example 2: Transient memory spikes
Another common occurrence is short spikes in memory usage on a database
server. These can be periodic drops in freeable memory that are difficult to
troubleshoot using current_alloc
in
sys.memory_global_by_current_bytes
, because the memory has
already been freed.
Note
If Performance Schema statistics have been reset, or the database instance
has been restarted, this information won't be available in sys
or performance_schema
. To retain this information, we recommend
that you configure external metrics collection.
The following graph of the os.memory.free
metric in Enhanced Monitoring shows
brief 7-second spikes in memory usage. Enhanced Monitoring allows you to monitor at intervals
as short as 1 second, which is perfect for catching transient spikes like
these.
To help diagnose the cause of the memory usage here, we can use a combination
of high_alloc
in the sys
memory summary views and
Performance Schema statement summary tables
As expected, because memory usage isn't currently high, we can't see any major
offenders in the sys
schema view under
current_alloc.
mysql> SELECT * FROM sys.memory_global_by_current_bytes LIMIT 10; +-----------------------------------------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | event_name | current_count | current_alloc | current_avg_alloc | high_count | high_alloc | high_avg_alloc | +-----------------------------------------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ | memory/innodb/hash0hash | 4 | 79.07 MiB | 19.77 MiB | 4 | 79.07 MiB | 19.77 MiB | | memory/innodb/os0event | 439372 | 60.34 MiB | 144 bytes | 439372 | 60.34 MiB | 144 bytes | | memory/performance_schema/events_statements_summary_by_digest | 1 | 40.28 MiB | 40.28 MiB | 1 | 40.28 MiB | 40.28 MiB | | memory/mysys/KEY_CACHE | 3 | 16.00 MiB | 5.33 MiB | 3 | 16.00 MiB | 5.33 MiB | | memory/performance_schema/events_statements_history_long | 1 | 14.34 MiB | 14.34 MiB | 1 | 14.34 MiB | 14.34 MiB | | memory/performance_schema/events_errors_summary_by_thread_by_error | 257 | 13.07 MiB | 52.06 KiB | 257 | 13.07 MiB | 52.06 KiB | | memory/performance_schema/events_statements_summary_by_thread_by_event_name | 1 | 11.81 MiB | 11.81 MiB | 1 | 11.81 MiB | 11.81 MiB | | memory/performance_schema/events_statements_summary_by_digest.digest_text | 1 | 9.77 MiB | 9.77 MiB | 1 | 9.77 MiB | 9.77 MiB | | memory/performance_schema/events_statements_history_long.digest_text | 1 | 9.77 MiB | 9.77 MiB | 1 | 9.77 MiB | 9.77 MiB | | memory/performance_schema/events_statements_history_long.sql_text | 1 | 9.77 MiB | 9.77 MiB | 1 | 9.77 MiB | 9.77 MiB | +-----------------------------------------------------------------------------+---------------+---------------+-------------------+------------+------------+----------------+ 10 rows in set (0.01 sec)
Expanding the view to order by high_alloc
, we can now see that
the memory/temptable/physical_ram
component is a very good
candidate here. At its highest, it consumed 515.00 MiB.
As its name suggests, memory/temptable/physical_ram
instruments
memory usage for the TEMP
storage engine in MySQL, which was
introduced in MySQL 8.0. For more information on how MySQL uses temporary
tables, see Internal temporary table use in MySQL
Note
We're using the sys.x$memory_global_by_current_bytes
view in
this example.
mysql> SELECT event_name, format_bytes(current_alloc) AS "currently allocated", sys.format_bytes(high_alloc) AS "high-water mark" FROM sys.x$memory_global_by_current_bytes ORDER BY high_alloc DESC LIMIT 10; +-----------------------------------------------------------------------------+---------------------+-----------------+ | event_name | currently allocated | high-water mark | +-----------------------------------------------------------------------------+---------------------+-----------------+ | memory/temptable/physical_ram | 4.00 MiB | 515.00 MiB | | memory/innodb/hash0hash | 79.07 MiB | 79.07 MiB | | memory/innodb/os0event | 63.95 MiB | 63.95 MiB | | memory/performance_schema/events_statements_summary_by_digest | 40.28 MiB | 40.28 MiB | | memory/mysys/KEY_CACHE | 16.00 MiB | 16.00 MiB | | memory/performance_schema/events_statements_history_long | 14.34 MiB | 14.34 MiB | | memory/performance_schema/events_errors_summary_by_thread_by_error | 13.07 MiB | 13.07 MiB | | memory/performance_schema/events_statements_summary_by_thread_by_event_name | 11.81 MiB | 11.81 MiB | | memory/performance_schema/events_statements_summary_by_digest.digest_text | 9.77 MiB | 9.77 MiB | | memory/performance_schema/events_statements_history_long.sql_text | 9.77 MiB | 9.77 MiB | +-----------------------------------------------------------------------------+---------------------+-----------------+ 10 rows in set (0.00 sec)
In Example 1: Continuous high memory usage, we checked the current memory usage for each connection to determine which connection is responsible for using the memory in question. In this example, the memory is already freed, so checking the memory usage for current connections isn't useful.
To dig deeper and find the offending statements, users, and hosts, we use the
Performance Schema. The Performance Schema contains multiple statement summary
tables that are sliced by different dimensions such as event name, statement
digest, host, thread, and user. Each view will allow you dig deeper into where
certain statements are being run and what they are doing. This section is
focused on MAX_TOTAL_MEMORY
, but you can find more information on
all of the columns available in the Performance Schema statement summary tables
mysql> SHOW TABLES IN performance_schema LIKE 'events_statements_summary_%'; +------------------------------------------------------------+ | Tables_in_performance_schema (events_statements_summary_%) | +------------------------------------------------------------+ | events_statements_summary_by_account_by_event_name | | events_statements_summary_by_digest | | events_statements_summary_by_host_by_event_name | | events_statements_summary_by_program | | events_statements_summary_by_thread_by_event_name | | events_statements_summary_by_user_by_event_name | | events_statements_summary_global_by_event_name | +------------------------------------------------------------+ 7 rows in set (0.00 sec)
First we check events_statements_summary_by_digest
to see
MAX_TOTAL_MEMORY
.
From this we can see the following:
-
The query with digest
20676ce4a690592ff05debcffcbc26faeb76f22005e7628364d7a498769d0c4a
seems to be a good candidate for this memory usage. TheMAX_TOTAL_MEMORY
is 537450710, which matches the high-water mark we saw for thememory/temptable/physical_ram
event insys.x$memory_global_by_current_bytes
. -
It has been run four times (
COUNT_STAR
), first at 2024-03-26 04:08:34.943256, and last at 2024-03-26 04:43:06.998310.
mysql> SELECT SCHEMA_NAME,DIGEST,COUNT_STAR,MAX_TOTAL_MEMORY,FIRST_SEEN,LAST_SEEN FROM performance_schema.events_statements_summary_by_digest ORDER BY MAX_TOTAL_MEMORY DESC LIMIT 5; +-------------+------------------------------------------------------------------+------------+------------------+----------------------------+----------------------------+ | SCHEMA_NAME | DIGEST | COUNT_STAR | MAX_TOTAL_MEMORY | FIRST_SEEN | LAST_SEEN | +-------------+------------------------------------------------------------------+------------+------------------+----------------------------+----------------------------+ | sysbench | 20676ce4a690592ff05debcffcbc26faeb76f22005e7628364d7a498769d0c4a | 4 | 537450710 | 2024-03-26 04:08:34.943256 | 2024-03-26 04:43:06.998310 | | NULL | f158282ea0313fefd0a4778f6e9b92fc7d1e839af59ebd8c5eea35e12732c45d | 4 | 3636413 | 2024-03-26 04:29:32.712348 | 2024-03-26 04:36:26.269329 | | NULL | 0046bc5f642c586b8a9afd6ce1ab70612dc5b1fd2408fa8677f370c1b0ca3213 | 2 | 3459965 | 2024-03-26 04:31:37.674008 | 2024-03-26 04:32:09.410718 | | NULL | 8924f01bba3c55324701716c7b50071a60b9ceaf17108c71fd064c20c4ab14db | 1 | 3290981 | 2024-03-26 04:31:49.751506 | 2024-03-26 04:31:49.751506 | | NULL | 90142bbcb50a744fcec03a1aa336b2169761597ea06d85c7f6ab03b5a4e1d841 | 1 | 3131729 | 2024-03-26 04:15:09.719557 | 2024-03-26 04:15:09.719557 | +-------------+------------------------------------------------------------------+------------+------------------+----------------------------+----------------------------+ 5 rows in set (0.00 sec)
Now that we know the offending digest, we can get more details such as the query text, the user who ran it, and where it was run. Based on the digest text returned, we can see that this is a common table expression (CTE) that creates four temporary tables and performs four table scans, which is very inefficient.
mysql> SELECT SCHEMA_NAME,DIGEST_TEXT,QUERY_SAMPLE_TEXT,MAX_TOTAL_MEMORY,SUM_ROWS_SENT,SUM_ROWS_EXAMINED,SUM_CREATED_TMP_TABLES,SUM_NO_INDEX_USED FROM performance_schema.events_statements_summary_by_digest WHERE DIGEST='20676ce4a690592ff05debcffcbc26faeb76f22005e7628364d7a498769d0c4a'\G; *************************** 1. row *************************** SCHEMA_NAME: sysbench DIGEST_TEXT: WITH RECURSIVE `cte` ( `n` ) AS ( SELECT ? FROM `sbtest1` UNION ALL SELECT `id` + ? FROM `sbtest1` ) SELECT * FROM `cte` QUERY_SAMPLE_TEXT: WITH RECURSIVE cte (n) AS ( SELECT 1 from sbtest1 UNION ALL SELECT id + 1 FROM sbtest1) SELECT * FROM cte MAX_TOTAL_MEMORY: 537450710 SUM_ROWS_SENT: 80000000 SUM_ROWS_EXAMINED: 80000000 SUM_CREATED_TMP_TABLES: 4 SUM_NO_INDEX_USED: 4 1 row in set (0.01 sec)
For more information on the events_statements_summary_by_digest
table and other Performance Schema statement summary tables, see Statement summary tables
You can also run an EXPLAIN
Note
EXPLAIN ANALYZE
can provide more information than
EXPLAIN
, but it also runs the query, so be careful.
-- EXPLAIN mysql> EXPLAIN WITH RECURSIVE cte (n) AS (SELECT 1 FROM sbtest1 UNION ALL SELECT id + 1 FROM sbtest1) SELECT * FROM cte; +----+-------------+------------+------------+-------+---------------+------+---------+------+----------+----------+-------------+ | id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra | +----+-------------+------------+------------+-------+---------------+------+---------+------+----------+----------+-------------+ | 1 | PRIMARY | <derived2> | NULL | ALL | NULL | NULL | NULL | NULL | 19221520 | 100.00 | NULL | | 2 | DERIVED | sbtest1 | NULL | index | NULL | k_1 | 4 | NULL | 9610760 | 100.00 | Using index | | 3 | UNION | sbtest1 | NULL | index | NULL | k_1 | 4 | NULL | 9610760 | 100.00 | Using index | +----+-------------+------------+------------+-------+---------------+------+---------+------+----------+----------+-------------+ 3 rows in set, 1 warning (0.00 sec) -- EXPLAIN format=tree mysql> EXPLAIN format=tree WITH RECURSIVE cte (n) AS (SELECT 1 FROM sbtest1 UNION ALL SELECT id + 1 FROM sbtest1) SELECT * FROM cte\G; *************************** 1. row *************************** EXPLAIN: -> Table scan on cte (cost=4.11e+6..4.35e+6 rows=19.2e+6) -> Materialize union CTE cte (cost=4.11e+6..4.11e+6 rows=19.2e+6) -> Index scan on sbtest1 using k_1 (cost=1.09e+6 rows=9.61e+6) -> Index scan on sbtest1 using k_1 (cost=1.09e+6 rows=9.61e+6) 1 row in set (0.00 sec) -- EXPLAIN ANALYZE mysql> EXPLAIN ANALYZE WITH RECURSIVE cte (n) AS (SELECT 1 from sbtest1 UNION ALL SELECT id + 1 FROM sbtest1) SELECT * FROM cte\G; *************************** 1. row *************************** EXPLAIN: -> Table scan on cte (cost=4.11e+6..4.35e+6 rows=19.2e+6) (actual time=6666..9201 rows=20e+6 loops=1) -> Materialize union CTE cte (cost=4.11e+6..4.11e+6 rows=19.2e+6) (actual time=6666..6666 rows=20e+6 loops=1) -> Covering index scan on sbtest1 using k_1 (cost=1.09e+6 rows=9.61e+6) (actual time=0.0365..2006 rows=10e+6 loops=1) -> Covering index scan on sbtest1 using k_1 (cost=1.09e+6 rows=9.61e+6) (actual time=0.0311..2494 rows=10e+6 loops=1) 1 row in set (10.53 sec)
But who ran it? We can see in the Performance Schema that the
destructive_operator
user had MAX_TOTAL_MEMORY
of
537450710, which again matches the previous results.
Note
The Performance Schema is stored in memory, so should not be relied upon as the sole source for auditing. If you need to maintain a history of statements run, and from which users, we recommend that enable audit logging. If you also need to maintain information on memory usage, we recommend that you configure monitoring to export and store these values.
mysql> SELECT USER,EVENT_NAME,COUNT_STAR,MAX_TOTAL_MEMORY FROM performance_schema.events_statements_summary_by_user_by_event_name ORDER BY MAX_CONTROLLED_MEMORY DESC LIMIT 5; +----------------------+---------------------------+------------+------------------+ | USER | EVENT_NAME | COUNT_STAR | MAX_TOTAL_MEMORY | +----------------------+---------------------------+------------+------------------+ | destructive_operator | statement/sql/select | 4 | 537450710 | | rdsadmin | statement/sql/select | 4172 | 3290981 | | rdsadmin | statement/sql/show_tables | 2 | 3615821 | | rdsadmin | statement/sql/show_fields | 2 | 3459965 | | rdsadmin | statement/sql/show_status | 75 | 1914976 | +----------------------+---------------------------+------------+------------------+ 5 rows in set (0.00 sec) mysql> SELECT HOST,EVENT_NAME,COUNT_STAR,MAX_TOTAL_MEMORY FROM performance_schema.events_statements_summary_by_host_by_event_name WHERE HOST != 'localhost' AND COUNT_STAR>0 ORDER BY MAX_CONTROLLED_MEMORY DESC LIMIT 5; +------------+----------------------+------------+------------------+ | HOST | EVENT_NAME | COUNT_STAR | MAX_TOTAL_MEMORY | +------------+----------------------+------------+------------------+ | 10.0.8.231 | statement/sql/select | 4 | 537450710 | +------------+----------------------+------------+------------------+ 1 row in set (0.00 sec)
Troubleshooting out-of-memory issues for Aurora MySQL databases
The Aurora MySQL aurora_oom_response
instance-level parameter can enable the DB instance to monitor the system memory and
estimate the memory consumed by various statements and connections. If the system runs low on memory, it can perform a list of
actions to attempt to release that memory. It does so in an attempt to avoid a database restart due to out-of-memory (OOM)
issues. The instance-level parameter takes a string of comma-separated actions that a DB instance performs when its memory is
low. The aurora_oom_response
parameter is supported for Aurora MySQL versions 2 and 3.
The following values, and combinations of them, can be used for the aurora_oom_response
parameter. An empty string
means that no action is taken, and effectively turns off the feature, leaving the database prone to OOM restarts.
-
decline
– Declines new queries when the DB instance is low on memory. kill_connect
– Closes database connections that are consuming a large amount of memory, and ends current transactions and Data Definition Language (DDL) statements. This response isn't supported for Aurora MySQL version 2.For more information, see KILL statement
in the MySQL documentation. -
kill_query
– Ends queries in descending order of memory consumption until the instance memory surfaces above the low threshold. DDL statements aren't ended.For more information, see KILL statement
in the MySQL documentation. -
print
– Only prints the queries that are consuming a large amount of memory. -
tune
– Tunes the internal table caches to release some memory back to the system. Aurora MySQL decreases the memory used for caches such astable_open_cache
andtable_definition_cache
in low-memory conditions. Eventually, Aurora MySQL sets their memory usage back to normal when the system is no longer low on memory.For more information, see table_open_cache
and table_definition_cache in the MySQL documentation. tune_buffer_pool
– Decreases the size of the buffer pool to release some memory and make it available for the database server to process connections. This response is supported for Aurora MySQL version 3.06 and higher.You must pair
tune_buffer_pool
with eitherkill_query
orkill_connect
in theaurora_oom_response
parameter value. If not, buffer pool resizing won't happen, even when you includetune_buffer_pool
in the parameter value.
In Aurora MySQL versions lower than 3.06, for DB instance classes with memory less than or equal to 4 GiB, when the instance is under memory pressure, the default
actions include print
, tune
, decline
, and kill_query
. For DB instance classes with memory greater than 4 GiB,
the parameter value is empty by default (disabled).
In Aurora MySQL version 3.06 and higher, for DB instance classes with memory less
than or equal to 4 GiB, Aurora MySQL also closes the top memory-consuming connections
(kill_connect
). For DB instance classes with memory greater than 4
GiB, the default parameter value is print
.
If you frequently run into out-of-memory issues, memory usage can be monitored using
memory summary tablesperformance_schema
is enabled.