Monitor replication
You can use CloudWatch (https://console.amazonaws.cn/cloudwatch/) to monitor ReplicationLatency, MessageLag, and ReplicatorThroughput at a topic and aggregate level for each Amazon MSK Replicator. Metrics are visible under ReplicatorName in the “Amazon/Kafka” namespace. You can also see ReplicatorFailure, AuthError, and ThrottleTime metrics to check for issues.
The MSK console displays a subset of CloudWatch metrics for each MSK Replicator. From the console Replicator list, select the name of a Replicator and select the Monitoring tab.
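The same metrics can also be read programmatically. The following is a minimal sketch, assuming the boto3 SDK and its CloudWatch `get_metric_data` call. The function names, the 60-second period, and the 30-minute window are illustrative choices, not part of the MSK documentation. The namespace default is also an assumption: MSK metrics are generally published under `AWS/Kafka`, while the text above names “Amazon/Kafka”, so pass whichever value your account shows.

```python
from datetime import datetime, timedelta, timezone


def build_replicator_queries(replicator_name, topic=None,
                             namespace="AWS/Kafka", period_seconds=60):
    """Build MetricDataQueries entries for three replicator metrics.

    `namespace` is an assumption; override it if your account lists the
    metrics under a different namespace.
    """
    dimensions = [{"Name": "ReplicatorName", "Value": replicator_name}]
    if topic is not None:
        # Adding the Topic dimension selects the topic-level series.
        dimensions.append({"Name": "Topic", "Value": topic})

    # Statistic per metric follows the aggregation stat in the metrics table.
    metric_stats = {
        "ReplicationLatency": "Maximum",
        "MessageLag": "Sum",
        "ReplicatorThroughput": "Sum",
    }
    queries = []
    for index, (metric_name, stat) in enumerate(metric_stats.items()):
        queries.append({
            "Id": f"m{index}",
            "Label": metric_name,
            "MetricStat": {
                "Metric": {
                    "Namespace": namespace,
                    "MetricName": metric_name,
                    "Dimensions": dimensions,
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        })
    return queries


def fetch_replicator_metrics(replicator_name, topic=None, minutes=30):
    """Fetch recent datapoints; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the query builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_data(
        MetricDataQueries=build_replicator_queries(replicator_name, topic),
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
    )
    return {
        result["Label"]: list(zip(result["Timestamps"], result["Values"]))
        for result in response["MetricDataResults"]
    }
```

Calling `fetch_replicator_metrics("my-replicator", topic="orders")` with a hypothetical Replicator and topic name would return one list of (timestamp, value) pairs per metric. Omitting `topic` requests the aggregate series instead.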
MSK Replicator metrics
The following metrics describe the performance and connection health of the MSK Replicator.
AuthError metrics do not cover topic-level auth errors. To monitor your MSK Replicator’s topic-level auth errors, monitor the Replicator’s ReplicationLatency metrics together with the source cluster’s topic-level MessagesInPerSec metric. If a topic’s ReplicationLatency drops to 0 while data is still being produced to the topic, the Replicator has an auth issue with that topic. Check that the Replicator’s service execution IAM role has sufficient permissions to access the topic.
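The topic-level check described above can be sketched as a small comparison of two metric series. This is only an illustration: the function name, the input shape (a mapping from topic name to a list of recent datapoints, oldest first), and the three-datapoint window are assumptions, and the sample values are made up.

```python
def topics_with_suspected_auth_errors(replication_latency, messages_in_per_sec,
                                      window=3):
    """Flag topics whose ReplicationLatency is 0 while data is still produced.

    Both arguments map a topic name to a list of datapoints, oldest first.
    `window` is how many of the most recent datapoints must agree.
    """
    suspects = []
    for topic, latency_points in replication_latency.items():
        recent_latency = latency_points[-window:]
        recent_ingest = messages_in_per_sec.get(topic, [])[-window:]
        if len(recent_latency) < window or not recent_ingest:
            continue  # not enough data to judge this topic
        latency_is_zero = all(value == 0 for value in recent_latency)
        still_producing = any(value > 0 for value in recent_ingest)
        if latency_is_zero and still_producing:
            suspects.append(topic)
    return sorted(suspects)


# Made-up sample data for three hypothetical topics.
latency = {
    "orders": [120, 0, 0, 0],    # latency fell to 0
    "audit": [80, 90, 85, 88],   # replicating normally
    "idle": [0, 0, 0],           # latency 0, but nothing is being produced
}
ingest = {
    "orders": [50, 48, 52, 47],
    "audit": [10, 12, 9, 11],
    "idle": [0, 0, 0],
}
suspected = topics_with_suspected_auth_errors(latency, ingest)
```

Requiring several consecutive zero datapoints is a design choice to avoid flagging a topic on a single missing or delayed sample. A flagged topic is only a hint to review the service execution role's permissions for that topic.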
Metric type | Metric | Description | Dimensions | Unit | Raw Metric Granularity | Raw Metric Aggregation Stat |
---|---|---|---|---|---|---|
Performance | ReplicationLatency | Time it takes records to replicate from the source to the target cluster, measured as the duration between the record produce time at the source and the time the record is replicated to the target. If ReplicationLatency increases, check whether the clusters have enough partitions to support replication. High replication latency can occur when the partition count is too low for high throughput. | ReplicatorName; or ReplicatorName, Topic | Milliseconds | Partition | Maximum |
Performance | MessageLag | Monitors the sync between the MSK Replicator and the source cluster. MessageLag indicates the lag between the messages produced to the source cluster and the messages consumed by the replicator. It is not the lag between the source and target cluster. Even if the source cluster is unavailable or interrupted, the replicator finishes writing the messages it has already consumed to the target cluster. After an outage, MessageLag increases to show the number of messages the replicator is behind the source cluster. You can monitor this value until it reaches 0, which shows that the replicator has caught up with the source cluster. | ReplicatorName; or ReplicatorName, Topic | Count | Partition | Sum |
Performance | ReplicatorBytesInPerSec | Average number of bytes processed by the replicator per second. Data processed by MSK Replicator consists of all the data that MSK Replicator receives. This includes the data replicated to the target cluster and, only if your Replicator is configured with the "Identical" topic name configuration, the data that MSK Replicator filters out to prevent it from being copied back to the topic it originated from. If your Replicator is configured with the "Prefixed" topic name configuration, ReplicatorBytesInPerSec and ReplicatorThroughput have the same value because MSK Replicator filters no data. | ReplicatorName | BytesPerSecond | ReplicatorName | Sum |
Performance | ReplicatorThroughput | Average number of bytes replicated per second. If ReplicatorThroughput drops for a topic, check the KafkaClusterPingSuccessCount and AuthError metrics to ensure the Replicator can communicate with the clusters, then check cluster metrics to ensure the cluster is not down. | ReplicatorName; or ReplicatorName, Topic | BytesPerSecond | Partition | Sum |
Debug | AuthError | The number of connections with failed authentication per second. If this metric is above 0, check that the service execution role policy for the replicator is valid and make sure that no deny permissions are set for the cluster permissions. Based on the ClusterAlias dimension, you can identify whether the source or target cluster is experiencing auth errors. | ReplicatorName, ClusterAlias | Count | Worker | Sum |
Debug | ThrottleTime | The average time in ms a request was throttled by brokers on the cluster. Set throttling to avoid having the MSK Replicator overwhelm the cluster. If this metric is 0, ReplicationLatency is not high, and ReplicatorThroughput is as expected, then throttling is working as expected. If this metric is above 0, you can adjust throttling accordingly. | ReplicatorName, ClusterAlias | Milliseconds | Worker | Maximum |
Debug | ReplicatorFailure | Number of failures that the replicator is experiencing. | ReplicatorName | Count | | Sum |
Debug | KafkaClusterPingSuccessCount | Indicates the health of the replicator connection to the Kafka cluster. If this value is 1, the connection is healthy. If the value is 0 or there is no datapoint, the connection is unhealthy. If the value is 0, check the network or IAM permission settings for the Kafka cluster. Based on the ClusterAlias dimension, you can identify whether this metric is for the source or target cluster. | ReplicatorName, ClusterAlias | Count | | Sum |
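The guidance for a drop in ReplicatorThroughput (check KafkaClusterPingSuccessCount and AuthError first, then the cluster itself) can be expressed as a short triage routine. This is a hedged sketch rather than an official procedure: the function name, the input shape (latest datapoint per ClusterAlias, with `None` meaning no datapoint), and the alias values `source` and `target` are hypothetical.

```python
def diagnose_throughput_drop(ping_success, auth_errors):
    """Return findings for a ReplicatorThroughput drop.

    ping_success: ClusterAlias -> latest KafkaClusterPingSuccessCount value,
                  or None when no datapoint was reported.
    auth_errors:  ClusterAlias -> latest AuthError value.
    """
    findings = []
    for alias, value in ping_success.items():
        if value is None:
            # The table treats a missing datapoint as an unhealthy connection.
            findings.append(f"{alias}: no ping datapoint, connection unhealthy")
        elif value == 0:
            findings.append(
                f"{alias}: ping is 0, check network or IAM permission settings"
            )
    for alias, value in auth_errors.items():
        if value is not None and value > 0:
            findings.append(
                f"{alias}: auth errors, check the service execution role policy"
            )
    if not findings:
        # Connectivity looks fine, so the next step is the clusters themselves.
        findings.append("connectivity healthy, check cluster metrics")
    return findings


# Hypothetical alias names and values, for illustration only.
report = diagnose_throughput_drop(
    ping_success={"source": 1, "target": 0},
    auth_errors={"source": 0, "target": 2},
)
```

Keying the inputs by ClusterAlias keeps the output tied to the source or target cluster, which mirrors how the AuthError and KafkaClusterPingSuccessCount rows use that dimension.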