Troubleshooting your Application Signals installation
This section contains troubleshooting tips for CloudWatch Application Signals.
Topics
- Application doesn't start after Application Signals is enabled
- Python application doesn't start after Application Signals is enabled
- No application data in Application Signals dashboard
- Service metrics or dependency metrics have Unknown values
- Handling a ConfigurationConflict when managing the Amazon CloudWatch Observability EKS add-on
- I want to filter out unnecessary metrics and traces
- What does InternalOperation mean?
- Can I disable FluentBit?
- Can I filter container logs before exporting to the CloudWatch Logs?
Application doesn't start after Application Signals is enabled
If your application on an Amazon EKS cluster doesn't start after you enable Application Signals on the cluster, check for the following:
- Check if the application has been instrumented by another monitoring solution. Application Signals might not support co-existing with other instrumentation solutions.
- Confirm that your application meets the compatibility requirements to use Application Signals. For more information, see Application Signals supported systems.
- If your application failed to pull the Application Signals artifacts, such as the Amazon Distro for OpenTelemetry Java or Python agent and the CloudWatch agent images, the cause could be a network issue.
To mitigate the issue, remove the annotation instrumentation.opentelemetry.io/inject-java: "true" or instrumentation.opentelemetry.io/inject-python: "true" from your application deployment manifest, and redeploy your application. Then check whether the application is working.
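The mitigation above can be sketched in Python as a function that removes the two injection annotations from a Deployment manifest that has already been loaded into a dictionary. This is a minimal sketch, not an official tool: the function name is hypothetical, and it assumes the annotations sit on the pod template (spec.template.metadata.annotations), which you should verify against your own manifest.

```python
import copy

# Annotation keys named in the mitigation step above.
INJECT_ANNOTATIONS = (
    "instrumentation.opentelemetry.io/inject-java",
    "instrumentation.opentelemetry.io/inject-python",
)


def strip_injection_annotations(deployment: dict) -> dict:
    """Return a copy of a Deployment manifest without the injection annotations.

    Assumption: the annotations live under spec.template.metadata.annotations.
    The input dictionary is left unchanged.
    """
    cleaned = copy.deepcopy(deployment)
    template_meta = cleaned.get("spec", {}).get("template", {}).get("metadata", {})
    annotations = template_meta.get("annotations") or {}
    for key in INJECT_ANNOTATIONS:
        # pop with a default so a missing annotation is not an error
        annotations.pop(key, None)
    return cleaned
```

After writing the cleaned manifest back to a file, you would redeploy it with your usual tooling and check whether the application starts.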
Python application doesn't start after Application Signals is enabled
It is a known issue in OpenTelemetry auto-instrumentation that a missing PYTHONPATH environment variable can sometimes cause the application to fail to start. To resolve this, ensure that you set the PYTHONPATH environment variable to the location of your application’s working directory.
For more information about this issue, see Python autoinstrumentation setting of PYTHONPATH is not compliant with Python's module resolution behavior, breaking Django applications.
For Django applications, there are additional required configurations, which are outlined in the OpenTelemetry Python documentation:
- Use the --noreload flag to prevent automatic reloading.
- Set the DJANGO_SETTINGS_MODULE environment variable to the location of your Django application’s settings.py file. This ensures that OpenTelemetry can correctly access and integrate with your Django settings.
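As an illustration of the settings above, the following sketch builds the environment and the start command for a Django application. The helper names, the /app directory, and the mysite.settings module are hypothetical. Django reads DJANGO_SETTINGS_MODULE as a dotted Python path (for example, mysite.settings for mysite/settings.py), so the sketch uses that form.

```python
import os


def build_instrumented_env(app_dir, settings_module, base_env=None):
    """Return a copy of the environment with PYTHONPATH and
    DJANGO_SETTINGS_MODULE set for an auto-instrumented Django app.

    app_dir: the application's working directory (hypothetical value below).
    settings_module: dotted path to the settings module, e.g. "mysite.settings".
    """
    env = dict(os.environ if base_env is None else base_env)
    # Keep any existing PYTHONPATH entries, but put the app directory first.
    paths = [p for p in env.get("PYTHONPATH", "").split(os.pathsep) if p]
    if app_dir not in paths:
        paths.insert(0, app_dir)
    env["PYTHONPATH"] = os.pathsep.join(paths)
    env["DJANGO_SETTINGS_MODULE"] = settings_module
    return env


def runserver_command(manage_py="manage.py"):
    """Start command for the development server with auto-reload disabled."""
    return ["python", manage_py, "runserver", "--noreload"]
```

In a container image, the same values would more commonly be set as environment variables in the deployment manifest; the function only makes the required values explicit.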
No application data in Application Signals dashboard
If metrics or traces are missing in the Application Signals dashboards, the following might be causes. Investigate these causes only if you have waited 15 minutes for Application Signals to collect and display data since your last update.
- Make sure that the library and framework you are using are supported by the ADOT Java agent. For more information, see Libraries / Frameworks.
- Make sure that the CloudWatch agent is running. First check the status of the CloudWatch agent pods and make sure that they are all in Running status.
  kubectl -n amazon-cloudwatch get pods
  Add the following to the CloudWatch agent configuration file to enable debugging logs, and then restart the agent.
  "agent": { "region": "${REGION}", "debug": true },
  Then check for errors in the CloudWatch agent pods.
- Check for configuration issues with the CloudWatch agent. Confirm that the following is still in the CloudWatch agent configuration file and that the agent has been restarted since it was added.
  "agent": { "region": "${REGION}", "debug": true },
  Then check the OpenTelemetry debugging logs for error messages such as the following, which might indicate the problem:
  ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export ...
  If that doesn't solve the issue, dump and check the environment variables with names that start with OTEL_ by describing the pod with the kubectl describe pod command.
  To enable OpenTelemetry Python debug logging, set the environment variable OTEL_PYTHON_LOG_LEVEL to debug and redeploy the application.
- Check for wrong or insufficient permissions for exporting data from the CloudWatch agent. If you see Access Denied messages in the CloudWatch agent logs, this might be the issue. It is possible that the permissions applied when you installed the CloudWatch agent were later changed or revoked.
- Check for an Amazon Distro for OpenTelemetry (ADOT) issue when generating telemetry data.
  Make sure that the instrumentation annotations instrumentation.opentelemetry.io/inject-java and sidecar.opentelemetry.io/inject-java are applied to the application deployment and that the value is true. Without these, the application pods will not be instrumented even if the ADOT add-on is installed correctly.
  Next, check whether the init container is applied on the application and the Ready state is True. If the init container is not ready, see the status for the reason.
  If the issue persists, enable debug logging on the OpenTelemetry Java SDK by setting the environment variable OTEL_JAVAAGENT_DEBUG to true and redeploying the application. Then look for messages that start with ERROR io.telemetry.
- The metric/span exporter might be dropping data. To find out, check the application log for messages that include Failed to export...
- The CloudWatch agent might be getting throttled when sending metrics or spans to Application Signals. Check for messages indicating throttling in the CloudWatch agent logs.
- Make sure that you've enabled the service discovery setup. You need to do this only once in your Region.
  To confirm this, in the CloudWatch console choose Application Signals, Services. If Step 1 is not marked Complete, choose Start discovering your services. Data should start flowing in within five minutes.
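Several of the checks above come down to scanning agent or application logs for a few telltale strings. The following is a minimal sketch of that triage, assuming you have already saved the logs to text. The function name and category labels are hypothetical. The "Failed to export" and "Access Denied" strings come from the checklist; the "throttl" pattern is only an assumption, because the exact wording of throttling messages is not given here.

```python
import re

# Patterns for the symptoms described in the checklist above.
# "throttl" is a guess at how throttling messages are worded.
LOG_PATTERNS = {
    "export_failure": re.compile(r"Failed to export"),
    "permissions": re.compile(r"Access\s?Denied", re.IGNORECASE),
    "throttling": re.compile(r"throttl", re.IGNORECASE),
}


def triage_log_lines(lines):
    """Group log lines by the symptom they appear to indicate.

    Returns a dict that maps each category name to the matching lines.
    A line can appear under more than one category.
    """
    findings = {name: [] for name in LOG_PATTERNS}
    for line in lines:
        for name, pattern in LOG_PATTERNS.items():
            if pattern.search(line):
                findings[name].append(line.rstrip("\n"))
    return findings
```

An empty result does not prove that the agent is healthy. It only means that none of these specific strings appeared in the lines you supplied.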
Service metrics or dependency metrics have Unknown values
If you see UnknownService, UnknownOperation, UnknownRemoteService, or UnknownRemoteOperation for a dependency name or operation in the Application Signals dashboards, check whether the data points for the unknown remote service and unknown remote operation coincide with their deployments.
- UnknownService means that the name of an instrumented application is unknown. If the OTEL_SERVICE_NAME environment variable is undefined and service.name isn't specified in OTEL_RESOURCE_ATTRIBUTES, the service name is set to UnknownService. To fix this, specify the service name in OTEL_SERVICE_NAME or OTEL_RESOURCE_ATTRIBUTES.
- UnknownOperation means that the name of an invoked operation is unknown. This occurs when Application Signals is unable to discover an operation name that invokes the remote call, or when the extracted operation name contains high cardinality values.
- UnknownRemoteService means that the name of the destination service is unknown. This occurs when the system is unable to extract the destination service name that the remote call accesses.
  One solution is to create a custom span around the function that sends out the request, and add the attribute aws.remote.service with the designated value. Another option is to configure the CloudWatch agent to customize the metric value of RemoteService. For more information about customizations in the CloudWatch agent, see Enable CloudWatch Application Signals.
- UnknownRemoteOperation means that the name of the destination operation is unknown. This occurs when the system is unable to extract the destination operation name that the remote call accesses.
  One solution is to create a custom span around the function that sends out the request, and add the attribute aws.remote.operation with the designated value. Another option is to configure the CloudWatch agent to customize the metric value of RemoteOperation. For more information about customizations in the CloudWatch agent, see Enable CloudWatch Application Signals.
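The UnknownService rule described above can be checked locally before you redeploy. The following sketch mirrors that rule: it is not the code that the agent or SDK runs, and the function name is hypothetical. It assumes the standard OpenTelemetry format for OTEL_RESOURCE_ATTRIBUTES, a comma-separated list of key=value pairs, and gives OTEL_SERVICE_NAME precedence when both are set.

```python
import os


def resolve_service_name(env=None):
    """Return the service name that the given environment would produce.

    Falls back to "UnknownService" when neither OTEL_SERVICE_NAME nor a
    service.name entry in OTEL_RESOURCE_ATTRIBUTES is present.
    """
    env = os.environ if env is None else env
    name = env.get("OTEL_SERVICE_NAME", "").strip()
    if name:
        return name
    # OTEL_RESOURCE_ATTRIBUTES looks like "key1=value1,key2=value2".
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip() == "service.name" and value.strip():
            return value.strip()
    return "UnknownService"
```

Calling resolve_service_name() with no arguments inspects the current process environment, which is useful inside a running container.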
Handling a ConfigurationConflict when managing the Amazon CloudWatch Observability EKS add-on
When you install or update the Amazon CloudWatch Observability EKS add-on, if you notice a failure caused by a Health Issue of type ConfigurationConflict with a description that starts with Conflicts found when trying to apply. Will not continue due to resolve conflicts mode, it is likely because you already have the CloudWatch agent and its associated components, such as the ServiceAccount, the ClusterRole, and the ClusterRoleBinding, installed on the cluster. When the add-on tries to install the CloudWatch agent and its associated components, if it detects any change in the contents, it by default fails the installation or update to avoid overwriting the state of the resources on the cluster.
If you are trying to onboard to the Amazon CloudWatch Observability EKS add-on and you see this failure, we recommend deleting an existing CloudWatch agent setup that you had previously installed on the cluster and then installing the EKS add-on. Be sure to back up any customizations you might have made to the original CloudWatch agent setup such as a custom agent configuration, and provide these to the Amazon CloudWatch Observability EKS add-on when you next install or update it. If you had previously installed the CloudWatch agent for onboarding to Container Insights, see Deleting the CloudWatch agent and Fluent Bit for Container Insights for more information.
Alternatively, the add-on supports a conflict resolution configuration option that lets you specify OVERWRITE. You can use this option to proceed with installing or updating the add-on by overwriting the conflicts on the cluster. If you are using the Amazon EKS console, you'll find the Conflict resolution method when you choose the Optional configuration settings when you create or update the add-on. If you are using the Amazon CLI, you can supply --resolve-conflicts OVERWRITE to your command to create or update the add-on.
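For scripted installs, the CLI option above can be assembled programmatically. The following sketch only builds the argument list and does not run anything. The helper name and the my-cluster value are hypothetical, and the add-on name amazon-cloudwatch-observability is an assumption that you should confirm for your environment.

```python
import shlex

# Assumed add-on name; confirm it against your cluster's add-on list.
ADDON_NAME = "amazon-cloudwatch-observability"


def addon_command(action, cluster_name):
    """Build the CLI arguments to create or update the add-on with
    conflict resolution set to OVERWRITE.

    action: "create-addon" or "update-addon".
    """
    if action not in ("create-addon", "update-addon"):
        raise ValueError(f"unsupported action: {action}")
    return [
        "aws", "eks", action,
        "--cluster-name", cluster_name,
        "--addon-name", ADDON_NAME,
        "--resolve-conflicts", "OVERWRITE",
    ]


# Render the command for review before running it by hand.
print(shlex.join(addon_command("update-addon", "my-cluster")))
```

Because OVERWRITE replaces the conflicting resources on the cluster, back up any custom agent configuration first, as recommended earlier in this section.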
I want to filter out unnecessary metrics and traces
If Application Signals is collecting traces and metrics that you don't want, see Manage high-cardinality operations for information about configuring the CloudWatch agent with custom rules to reduce cardinality.
For information about customizing trace sampling rules, see Configure sampling rules in the X-Ray documentation.
What does InternalOperation mean?
An InternalOperation is an operation that is triggered by the application internally rather than by an external invocation. Seeing InternalOperation is expected, healthy behavior.
Some typical examples where you would see InternalOperation include the following:
- Preloading on start – Your application performs an operation named loadDatafromDB which reads metadata from a database during the warm up phase. Instead of observing loadDatafromDB as a service operation, you'll see it categorized as an InternalOperation.
- Async execution in the background – Your application subscribes to an event queue, and processes streaming data accordingly whenever there’s an update. Each triggered operation will be under InternalOperation as a service operation.
- Retrieving host information from a service registry – Your application talks to a service registry for service discovery. All interactions with the discovery system are classified as an InternalOperation.
Can I disable FluentBit?
You can disable FluentBit by configuring the Amazon CloudWatch Observability EKS add-on. For more information, see (Optional) Additional configuration.
Can I filter container logs before exporting to the CloudWatch Logs?
No, filtering container logs is not yet supported.