Production monitoring - Amazon DynamoDB
Services or capabilities described in Amazon Web Services documentation might vary by Region. To see the differences applicable to the China Regions, see Getting Started with Amazon Web Services in China (PDF).

Production monitoring

You should establish a baseline for normal DAX performance in your environment, by measuring performance at various times and under different load conditions. As you monitor DAX, you should consider storing historical monitoring data. This stored data gives you a baseline from which to compare current performance data, identify normal performance patterns and performance anomalies, and devise methods to address issues.

To establish a baseline, you should, at minimum, monitor the following items both during load testing and in production.

  • CPU utilization and throttled requests, so that you can determine whether you might need to use a larger node type in your cluster. The CPU utilization of your cluster is available through the CPUUtilization CloudWatch metric. The average stat on this metric provides an average CPU utilization view across all the nodes in your cluster. For cluster scaling decisions, we recommend that you use the maximum stat which is the maximum utilization across all the nodes.

    Note

    Amazon has improved the CPUUtilization metric's granularity. You might observe changes to the metric starting from 2024-05-17 to 2024-06-22.

  • Operation latency (as measure on the client side) should remain consistent within your application's latency requirements.

  • Error rates should remain low, as seen from the ErrorRequestCount, FaultRequestCount, and FailedRequestCount CloudWatch metrics.

  • Network bytes consumption, so that you can determine whether you will need to use more nodes or a larger node type in your cluster. NetworkBytesIn and NetworkBytesOut metrics are available in CloudWatch, and you should compare them against your instances available baseline bandwidth as documented here.

    Note

    The Amazon EC2 documented available baseline bandwidth is in Gigabits per second (Gbps), where as the NetworkBytesIn and NetworkBytesOut metrics are in Gigabytes per minute (GBpm). To convert Gbps to GBpm and measure utilization, please multiply the baseline bandwidth by 7.5.

  • Cache memory utilization and evicted size, so that you can determine whether the cluster's node type has sufficient memory to hold your working set, and if not, switch to a larger node type.

    Note

    In case of a large number of cache misses and writes, cache memory utilization can increase up to 100% and may cause availability downtime.

  • Client connections, so that you can monitor for any unexplained spikes in connections to the cluster.