Troubleshoot your Amazon MSK cluster
The following information can help you troubleshoot problems that you might have with your
Amazon MSK cluster. You can also post your issue to Amazon Web Services re:Post.
Topics
- Volume replacement causes disk saturation due to replication overload
- Consumer group stuck in PreparingRebalance state
- Error delivering broker logs to Amazon CloudWatch Logs
- No default security group
- Cluster appears stuck in the CREATING state
- Cluster state goes from CREATING to FAILED
- Cluster state is ACTIVE but producers cannot send data or consumers cannot receive data
- Amazon CLI doesn't recognize Amazon MSK
- Partitions go offline or replicas are out of sync
- Disk space is running low
- Memory running low
- Producer gets NotLeaderForPartitionException
- Under-replicated partitions (URP) greater than zero
- Cluster has topics called __amazon_msk_canary and __amazon_msk_canary_state
- Partition replication fails
- Unable to access cluster that has public access turned on
- Unable to access cluster from within Amazon: Networking issues
- Failed authentication: Too many connects
- MSK Serverless: Cluster creation fails
Volume replacement causes disk saturation due to replication overload
During unplanned volume hardware failure, Amazon MSK may replace the volume with a new instance. Kafka repopulates the new volume by replicating partitions from other brokers in the cluster. Once partitions are replicated and caught up, they are eligible for leadership and in-sync replica (ISR) membership.
Problem
In a broker recovering from volume replacement, partitions of varying sizes may come back online at different times. This can be a problem because the partitions that are already online may serve traffic from the same broker that is still catching up (replicating) other partitions. This replication traffic can saturate the underlying volume's throughput limit, which is 250 MiB per second by default. When saturation occurs, any partitions that are already caught up are affected, which causes latency across the cluster for any brokers that share an ISR with those caught-up partitions (not just for leader partitions, because of remote acknowledgments with acks=all). This problem is more common in larger clusters that have many partitions of varying sizes.
Recommendation
To improve replication I/O posture, ensure that best practice thread settings are in place.
To reduce the likelihood of underlying volume saturation, enable provisioned storage with a higher throughput. A minimum throughput value of 500 MiB/s is recommended for high-throughput replication cases, but the actual value needed varies with throughput and use case. For more information, see Provision storage throughput for brokers in an Amazon MSK cluster.
To minimize replication pressure, lower num.replica.fetchers to the default value of 2.
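To get a feel for why the throughput limit matters, the following back-of-the-envelope sketch estimates how long a replaced volume takes to catch up at a given throughput. The 2 TiB data size and the function name are hypothetical values chosen for illustration, and the model assumes that replication can use the whole volume throughput, which is optimistic for a broker that is also serving client traffic.

```python
def catch_up_seconds(data_gib: float, throughput_mib_s: float,
                     replication_share: float = 1.0) -> float:
    """Estimate the seconds needed to copy data_gib onto a new volume.

    replication_share is the fraction of volume throughput assumed to be
    available to replication (1.0 means all of it, an optimistic assumption).
    """
    if throughput_mib_s <= 0:
        raise ValueError("throughput_mib_s must be positive")
    if not 0 < replication_share <= 1:
        raise ValueError("replication_share must be in (0, 1]")
    return data_gib * 1024 / (throughput_mib_s * replication_share)


# Compare the default limit with the recommended provisioned minimum,
# using a hypothetical 2 TiB of partition data on the replaced volume.
for limit_mib_s in (250, 500):
    minutes = catch_up_seconds(2048, limit_mib_s) / 60
    print(f"{limit_mib_s} MiB/s: about {minutes:.0f} minutes to copy 2 TiB")
```

Lowering replication_share in the call shows how quickly the catch-up window grows when replication competes with produce and consume traffic for the same volume.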
Consumer group stuck in PreparingRebalance state
If one or more of your consumer groups is stuck in a perpetual rebalancing state, the cause might be Apache Kafka issue KAFKA-9752.
To resolve this issue, we recommend that you upgrade your cluster to Amazon MSK bug-fix version 2.4.1.1, which contains a fix for this issue. For information about updating an existing cluster to Amazon MSK bug-fix version 2.4.1.1, see Update the Apache Kafka version.
If you can't upgrade the cluster to Amazon MSK bug-fix version 2.4.1.1, there are two workarounds: set the Kafka clients to use Static Membership Protocol, or identify and reboot the coordinating broker node of the stuck consumer group.
Implementing static membership protocol
To implement Static Membership Protocol in your clients, do the following:
- Set the group.instance.id property of your Kafka consumer configuration to a static string that identifies the consumer in the group. Ensure that other instances of the configuration are updated to use the static string.
- Deploy the changes to your Kafka consumers.
Using Static Membership Protocol is more effective if the session timeout in the client configuration is set to a duration that allows the consumer to recover without prematurely triggering a consumer group rebalance. For example, if your consumer application can tolerate 5 minutes of unavailability, a reasonable value for the session timeout would be 4 minutes instead of the default value of 10 seconds.
Note
Using Static Membership Protocol only reduces the probability of encountering this issue. You may still encounter this issue even when using Static Membership Protocol.
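The client settings described above can be sketched as a small helper that derives the two consumer properties from the downtime your application can tolerate. The helper names, the instance ID, and the 60-second safety margin are assumptions made for illustration; group.instance.id and session.timeout.ms are the standard Apache Kafka consumer properties. The broker-side session timeout limits may also cap the value you can use.

```python
def static_member_config(instance_id: str, tolerated_downtime_s: int,
                         margin_s: int = 60) -> dict:
    """Build consumer properties for Static Membership Protocol.

    The session timeout is set below the tolerated downtime so that a
    restarting consumer can rejoin before a rebalance is triggered.
    """
    if not instance_id:
        raise ValueError("instance_id must be a non-empty static string")
    session_timeout_s = tolerated_downtime_s - margin_s
    if session_timeout_s <= 0:
        raise ValueError("tolerated downtime must exceed the margin")
    return {
        "group.instance.id": instance_id,
        "session.timeout.ms": str(session_timeout_s * 1000),
    }


def to_properties(config: dict) -> str:
    """Render the settings in Java .properties format, one key per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(config.items()))


# A consumer that tolerates 5 minutes of unavailability gets a
# 4-minute session timeout, matching the example in the text.
print(to_properties(static_member_config("orders-consumer-1", 300)))
```

Each consumer instance in the group needs its own instance ID, so in practice the ID is usually derived from something stable such as a host name or pod ordinal.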
Rebooting the coordinating broker node
To reboot the coordinating broker node, do the following:
- Identify the group coordinator using the kafka-consumer-groups.sh command.
- Restart the group coordinator of the stuck consumer group using the RebootBroker API action.
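As a minimal sketch of the second step, the following helper calls the RebootBroker action through an SDK client. It assumes that you already know the coordinator's broker ID from the first step and that you pass in a boto3 client created with boto3.client("kafka"); the function name and the placeholder ARN are hypothetical.

```python
def reboot_coordinator(kafka_client, cluster_arn: str, broker_id: int) -> str:
    """Reboot one broker and return the ARN of the resulting cluster operation.

    kafka_client is expected to behave like boto3.client("kafka").
    """
    response = kafka_client.reboot_broker(
        ClusterArn=cluster_arn,
        BrokerIds=[str(broker_id)],  # the API takes broker IDs as strings
    )
    return response["ClusterOperationArn"]


# Example usage (requires boto3 and Amazon credentials; not run here):
#
#   import boto3
#   client = boto3.client("kafka")
#   operation_arn = reboot_coordinator(client, "<your-cluster-arn>", 2)
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before you run it against a real cluster.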
Error delivering broker logs to Amazon CloudWatch Logs
When you try to set up your cluster to send broker logs to Amazon CloudWatch Logs, you might get one of two exceptions.
If you get an InvalidInput.LengthOfCloudWatchResourcePolicyLimitExceeded exception, try again but use log groups that start with /aws/vendedlogs/. For more information, see Enabling Logging from Certain Amazon Web Services.
If you get an InvalidInput.NumberOfCloudWatchResourcePoliciesLimitExceeded exception, choose an existing Amazon CloudWatch Logs policy in your account, and append the following JSON to it.
{"Sid":"AWSLogDeliveryWrite","Effect":"Allow","Principal":{"Service":"delivery.logs.amazonaws.com"},"Action":["logs:CreateLogStream","logs:PutLogEvents"],"Resource":["*"]}
If you try to append the JSON above to an existing policy but get an error that says you've reached the maximum length for the policy you picked, try to append the JSON to another one of your Amazon CloudWatch Logs policies. After you append the JSON to an existing policy, try once again to set up broker-log delivery to Amazon CloudWatch Logs.
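The append step can be sketched as a small function that merges the statement into an existing policy document and checks the result against a length limit. The function name is hypothetical, and the 5,120-character limit is an assumption about the CloudWatch Logs resource policy size; confirm the current limit for your account before relying on it.

```python
import json

# The statement from the text above, expressed as a Python dictionary.
DELIVERY_STATEMENT = {
    "Sid": "AWSLogDeliveryWrite",
    "Effect": "Allow",
    "Principal": {"Service": "delivery.logs.amazonaws.com"},
    "Action": ["logs:CreateLogStream", "logs:PutLogEvents"],
    "Resource": ["*"],
}


def append_delivery_statement(policy_json: str, max_chars: int = 5120) -> str:
    """Return policy_json with the log-delivery statement appended.

    Raises ValueError if the merged policy exceeds max_chars, in which
    case you would try another one of your existing policies.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    if not any(s.get("Sid") == DELIVERY_STATEMENT["Sid"] for s in statements):
        statements = statements + [dict(DELIVERY_STATEMENT)]
    policy["Statement"] = statements
    merged = json.dumps(policy, separators=(",", ":"))
    if len(merged) > max_chars:
        raise ValueError(f"merged policy is {len(merged)} characters; limit is {max_chars}")
    return merged
```

The function skips the append if a statement with the same Sid is already present, so running it twice against the same policy doesn't add a duplicate.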
No default security group
If you try to create a cluster and get an error indicating that there's no default
security group, it might be because you are using a VPC that was shared with you. Ask
your administrator to grant you permission to describe the security groups on this VPC
and try again. For an example of a policy that allows this action, see Amazon EC2: Allows Managing EC2 Security Groups Associated With a Specific VPC, Programmatically and in the Console.
Cluster appears stuck in the CREATING state
Sometimes cluster creation can take up to 30 minutes. Wait for 30 minutes and check the state of the cluster again.
Cluster state goes from CREATING to FAILED
Try creating the cluster again.
Cluster state is ACTIVE but producers cannot send data or consumers cannot receive data
- If the cluster creation succeeds (the cluster state is ACTIVE), but you can't send or receive data, ensure that your producer and consumer applications have access to the cluster. For more information, see the guidance in Step 3: Create a client machine.
- If your producers and consumers have access to the cluster but still experience problems producing and consuming data, the cause might be KAFKA-7697, which affects Apache Kafka version 2.1.0 and can lead to a deadlock in one or more brokers. Consider migrating to Apache Kafka 2.2.1, which is not affected by this bug. For information about how to migrate, see Migrate to an Amazon MSK Cluster.
Amazon CLI doesn't recognize Amazon MSK
If you have the Amazon CLI installed, but it doesn't recognize the Amazon MSK commands, upgrade your Amazon CLI to the latest version. For detailed instructions on how to upgrade the Amazon CLI, see Installing the Amazon Command Line Interface. For information about how to use the Amazon CLI to run Amazon MSK commands, see Amazon MSK: How it works.
Partitions go offline or replicas are out of sync
These can be symptoms of low disk space. See Disk space is running low.
Disk space is running low
See the following best practices for managing disk space: Monitor disk space and Adjust data retention parameters.
Memory running low
If you see the MemoryUsed metric running high or MemoryFree running low, that doesn't mean there's a problem. Apache Kafka is designed to use as much memory as possible, and it manages it optimally.
Producer gets NotLeaderForPartitionException
This is often a transient error. Set the producer's retries configuration parameter to a value that's higher than its current value.
Under-replicated partitions (URP) greater than zero
The UnderReplicatedPartitions metric is an important one to monitor. In a healthy MSK cluster, this metric has the value 0. If it's greater than zero, that might be for one of the following reasons.
- If UnderReplicatedPartitions is spiky, the issue might be that the cluster isn't provisioned at the right size to handle incoming and outgoing traffic. See Best practices.
- If UnderReplicatedPartitions is consistently greater than 0, including during low-traffic periods, the issue might be that you've set restrictive ACLs that don't grant topic access to brokers. To replicate partitions, brokers must be authorized to both READ and DESCRIBE topics. DESCRIBE is granted by default with the READ authorization. For information about setting ACLs, see Authorization and ACLs in the Apache Kafka documentation.
Cluster has topics called __amazon_msk_canary and __amazon_msk_canary_state
You might see that your MSK cluster has a topic with the name __amazon_msk_canary and another with the name __amazon_msk_canary_state. These are internal topics that Amazon MSK creates and uses for cluster health and diagnostic metrics. These topics are negligible in size and can't be deleted.
Partition replication fails
Ensure that you haven't set ACLs on CLUSTER_ACTIONS.
Unable to access cluster that has public access turned on
If your cluster has public access turned on, but you still cannot access it from the internet, follow these steps:
- Ensure that the cluster's security group's inbound rules allow your IP address and the cluster's port. For a list of cluster port numbers, see Port information. Also ensure that the security group's outbound rules allow outbound communications. For more information about security groups and their inbound and outbound rules, see Security groups for your VPC in the Amazon VPC User Guide.
- Make sure that your IP address and the cluster's port are allowed in the inbound rules of the cluster's VPC network ACL. Unlike security groups, network ACLs are stateless. This means that you must configure both inbound and outbound rules. In the outbound rules, allow all traffic (port range: 0-65535) to your IP address. For more information, see Add and delete rules in the Amazon VPC User Guide.
- Make sure that you are using the public-access bootstrap-brokers string to access the cluster. An MSK cluster that has public access turned on has two different bootstrap-brokers strings, one for public access, and one for access from within Amazon. For more information, see Get the bootstrap brokers using the Amazon Web Services Management Console.
Unable to access cluster from within Amazon: Networking issues
If you have an Apache Kafka application that is unable to communicate successfully with an MSK cluster, start by performing the following connectivity test.
- Use any of the methods described in Get the bootstrap brokers for an Amazon MSK cluster to get the addresses of the bootstrap brokers.
- In the following command, replace bootstrap-broker with one of the broker addresses that you obtained in the previous step, and replace port-number with the port that matches your cluster setup. Run the command from the client machine.
  telnet bootstrap-broker port-number
  Where port-number is:
  9094 if the cluster is set up to use TLS authentication.
  9092 if the cluster doesn't use TLS authentication.
  A different port-number is required if public access is enabled.
- Repeat the previous command for all the bootstrap brokers.
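If telnet isn't available on the client machine, the same reachability test can be approximated with a short script. This is a sketch, not a replacement for the documented procedure: the function names are hypothetical, and it only checks that a TCP connection can be opened, which is what the telnet test shows as well.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


def check_brokers(bootstrap_brokers: str, timeout_s: float = 5.0) -> dict:
    """Test every host:port entry in a comma-separated bootstrap string."""
    results = {}
    for entry in bootstrap_brokers.split(","):
        entry = entry.strip()
        host, _, port = entry.rpartition(":")
        results[entry] = can_connect(host, int(port), timeout_s)
    return results


# Example usage with the bootstrap string from the previous step
# (run from the client machine):
#
#   for broker, reachable in check_brokers("<bootstrap-brokers>").items():
#       print(broker, "reachable" if reachable else "NOT reachable")
```

A result of False for every broker usually points to a security group, routing, or firewall problem rather than a problem with an individual broker.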
If the client machine is able to access the brokers, there are no connectivity issues. In this case, run the following command to check whether your Apache Kafka client is set up correctly. To get bootstrap-brokers, use any of the methods described in Get the bootstrap brokers for an Amazon MSK cluster. Replace topic with the name of your topic.
<path-to-your-kafka-installation>/bin/kafka-console-producer.sh --broker-list bootstrap-brokers --producer.config client.properties --topic topic
If the previous command succeeds, this means that your client is set up correctly. If you're still unable to produce and consume from an application, debug the problem at the application level.
If the client machine is unable to access the brokers, see the following subsections for guidance that is based on your client-machine setup.
Amazon EC2 client and MSK cluster in the same VPC
If the client machine is in the same VPC as the MSK cluster, make sure the cluster's security group has an inbound rule that accepts traffic from the client machine's security group. For information about setting up these rules, see Security Group Rules. For an example of how to access a cluster from an Amazon EC2 instance that's in the same VPC as the cluster, see Get started using Amazon MSK.
Amazon EC2 client and MSK cluster in different VPCs
If the client machine and the cluster are in two different VPCs, ensure the following:
-
The two VPCs are peered.
-
The status of the peering connection is active.
-
The route tables of the two VPCs are set up correctly.
For information about VPC peering, see Working with VPC Peering Connections.
On-premises client
In the case of an on-premises client that is set up to connect to the MSK cluster using Amazon VPN, ensure the following:
- The VPN connection status is UP. For information about how to check the VPN connection status, see How do I check the current status of my VPN tunnel?.
- The route table of the cluster's VPC contains the route for an on-premises CIDR whose target has the format Virtual private gateway(vgw-xxxxxxxx).
- The MSK cluster's security group allows traffic on port 2181, port 9092 (if your cluster accepts plaintext traffic), and port 9094 (if your cluster accepts TLS-encrypted traffic).
For more Amazon VPN troubleshooting guidance, see Troubleshooting Client VPN.
Amazon Direct Connect
If the client uses Amazon Direct Connect, see Troubleshooting Amazon Direct Connect.
If the previous troubleshooting guidance doesn't resolve the issue, ensure that no firewall is blocking network traffic. For further debugging, use tools like tcpdump and Wireshark to analyze traffic and to make sure that it is reaching the MSK cluster.
Failed authentication: Too many connects
The Failed authentication ... Too many connects error indicates that a broker is protecting itself because one or more IAM clients are trying to connect to it at an aggressive rate. To help brokers accept a higher rate of new IAM connections, you can increase the reconnect.backoff.ms configuration parameter.
To learn more about the rate limits for new connections per broker, see the Amazon MSK quota page.
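To see why a larger reconnect.backoff.ms eases the pressure on a broker, the following sketch models how the wait between reconnect attempts grows. It is a simplified model that assumes the wait doubles after each consecutive failure until it reaches reconnect.backoff.max.ms, and it ignores the random jitter that Apache Kafka clients add; the function name and the sample values are chosen for illustration.

```python
def reconnect_backoff_schedule(backoff_ms: int, backoff_max_ms: int,
                               failures: int) -> list:
    """Approximate wait (in ms) before each of the next reconnect attempts.

    Simplified model: the wait starts at backoff_ms and doubles per
    consecutive failure, capped at backoff_max_ms. Jitter is ignored.
    """
    if backoff_ms <= 0 or backoff_max_ms < backoff_ms:
        raise ValueError("expected 0 < backoff_ms <= backoff_max_ms")
    return [min(backoff_ms * 2 ** n, backoff_max_ms) for n in range(failures)]


# Compare a short base backoff with a longer one over six failures.
print(reconnect_backoff_schedule(50, 1000, 6))
print(reconnect_backoff_schedule(1000, 8000, 6))
```

With the longer base value, each client spaces its attempts further apart, so a fleet of clients that reconnect at the same time produces fewer new connections per second on each broker.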
MSK Serverless: Cluster creation fails
If you try to create an MSK Serverless cluster and the workflow fails, you may not have permission to create a VPC endpoint. Verify that your administrator has granted you permission to create a VPC endpoint by allowing the ec2:CreateVpcEndpoint action.
For a complete list of permissions required to perform all Amazon MSK actions, see Amazon managed policy: AmazonMSKFullAccess.