Troubleshooting your Application Signals installation - Amazon CloudWatch

Troubleshooting your Application Signals installation

This section contains troubleshooting tips for CloudWatch Application Signals.

Application doesn't start after Application Signals is enabled

If your application on an Amazon EKS cluster doesn't start after you enable Application Signals on the cluster, check for the following:

  • Check if the application has been instrumented by another monitoring solution. Application Signals might not support co-existing with other instrumentation solutions.

  • Confirm that your application meets the compatibility requirements to use Application Signals. For more information, see Application Signals supported systems.

  • If your application failed to pull the Application Signals artifacts, such as the Amazon Distro for OpenTelemetry Java or Python agent and CloudWatch agent images, it could be a network issue.

To mitigate the issue, remove the annotation instrumentation.opentelemetry.io/inject-java: "true" or instrumentation.opentelemetry.io/inject-python: "true" from your application deployment manifest, and re-deploy your application. Then check if the application is working.
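As a minimal sketch of the mitigation above, the following Python helper removes the two injection annotations from a Deployment manifest that has already been loaded into a dictionary. It assumes the annotations sit on the pod template (spec.template.metadata.annotations); the function name and the sample manifest are hypothetical and only for illustration.

```python
import copy

# Annotation keys named in the mitigation step above.
INJECT_ANNOTATIONS = (
    "instrumentation.opentelemetry.io/inject-java",
    "instrumentation.opentelemetry.io/inject-python",
)


def strip_injection_annotations(manifest: dict) -> dict:
    """Return a copy of a Deployment manifest without the injection annotations.

    Assumption: the annotations live on the pod template
    (spec.template.metadata.annotations). The input is not modified.
    """
    cleaned = copy.deepcopy(manifest)
    annotations = (
        cleaned.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations")
    )
    if annotations:
        for key in INJECT_ANNOTATIONS:
            annotations.pop(key, None)
    return cleaned


# Hypothetical manifest used only for illustration.
example = {
    "kind": "Deployment",
    "spec": {"template": {"metadata": {"annotations": {
        "instrumentation.opentelemetry.io/inject-java": "true",
        "team": "payments",
    }}}},
}
cleaned = strip_injection_annotations(example)
print(cleaned["spec"]["template"]["metadata"]["annotations"])
```

Other annotations are left in place, so the cleaned manifest can be redeployed and the application checked without instrumentation.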

Python application doesn't start after Application Signals is enabled

A known issue in OpenTelemetry auto-instrumentation is that a missing PYTHONPATH environment variable can sometimes cause the application to fail to start. To resolve this, set the PYTHONPATH environment variable to the location of your application’s working directory. For more information about this issue, see Python autoinstrumentation setting of PYTHONPATH is not compliant with Python's module resolution behavior, breaking Django applications.

For Django applications, there are additional required configurations, which are outlined in the OpenTelemetry Python documentation.

  • Use the --noreload flag to prevent automatic reloading.

  • Set the DJANGO_SETTINGS_MODULE environment variable to your Django application’s settings module, given as a Python import path (for example, mysite.settings) rather than a file system path to settings.py. This ensures that OpenTelemetry can correctly access and integrate with your Django settings.
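The settings in this section can be checked before you deploy. The following sketch assumes a hypothetical project named mysite deployed under /app, and reports which of the settings described above are missing from an environment and start command. DJANGO_SETTINGS_MODULE is given as a dotted module path (mysite.settings), which is the form Django expects. The helper name is illustrative and is not part of Django or OpenTelemetry.

```python
REQUIRED_VARS = ("PYTHONPATH", "DJANGO_SETTINGS_MODULE")


def missing_settings(env: dict, command: list) -> list:
    """List the settings from this section that are absent.

    env is a mapping of environment variables; command is the argv used to
    start the application. --noreload is only checked for "runserver".
    """
    problems = [name for name in REQUIRED_VARS if not env.get(name)]
    if "runserver" in command and "--noreload" not in command:
        problems.append("--noreload")
    return problems


# Hypothetical values for a project called "mysite" deployed under /app.
env = {
    "PYTHONPATH": "/app",
    "DJANGO_SETTINGS_MODULE": "mysite.settings",
}
command = ["python", "manage.py", "runserver", "0.0.0.0:8000", "--noreload"]

print(missing_settings(env, command))
print(missing_settings({}, ["python", "manage.py", "runserver"]))
```

An empty list means all three settings are present; anything else names what still needs to be added to the container environment or start command.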

No application data in Application Signals dashboard

If metrics or traces are missing in the Application Signals dashboards, the following are possible causes. Investigate them only after you have waited 15 minutes since your last update, so that Application Signals has had time to collect and display data.

  • Make sure that the library and framework you are using are supported by the ADOT Java agent. For more information, see Libraries / Frameworks.

  • Make sure that the CloudWatch agent is running. First check the status of the CloudWatch agent pods and make sure they are all in Running status.

    kubectl -n amazon-cloudwatch get pods

    Add the following to the CloudWatch agent configuration file to enable debugging logs, and then restart the agent.

    "agent": { "region": "${REGION}", "debug": true },

    Then check for errors in the CloudWatch agent pods.

  • Check for configuration issues with the CloudWatch agent. Confirm that the following is still in the CloudWatch agent configuration file and the agent has been restarted since it was added.

    "agent": { "region": "${REGION}", "debug": true },

    Then check the OpenTelemetry debugging logs for error messages such as ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export .... These messages might indicate the problem.

    If that doesn't solve the issue, dump and check the environment variables with names that start with OTEL_ by describing the pod with the kubectl describe pod command.

  • To enable the OpenTelemetry Python debug logging, set the environment variable OTEL_PYTHON_LOG_LEVEL to debug and redeploy the application.

  • Check for wrong or insufficient permissions for exporting data from the CloudWatch agent. If you see Access Denied messages in the CloudWatch agent logs, this might be the issue. It is possible that the permissions applied when you installed the CloudWatch agent were later changed or revoked.

  • Check for an Amazon Distro for OpenTelemetry (ADOT) issue when generating telemetry data.

    Make sure that the instrumentation annotations instrumentation.opentelemetry.io/inject-java and sidecar.opentelemetry.io/inject-java are applied to the application deployment and the value is true. Without these, the application pods will not be instrumented even if the ADOT addon is installed correctly.

    Next, check if the init container is applied on the application and the Ready state is True. If the init container is not ready, see the status for the reason.

    If the issue persists, enable debug logging on the OpenTelemetry Java SDK by setting the environment variable OTEL_JAVAAGENT_DEBUG to true and redeploying the application. Then look for messages that start with ERROR io.opentelemetry.

  • The metric/span exporter might be dropping data. To find out, check the application log for messages that include Failed to export...

  • The CloudWatch agent might be getting throttled when sending metrics or spans to Application Signals. Check for messages indicating throttling in the CloudWatch agent logs.

  • Make sure that you've enabled the service discovery setup. You need to do this only once in your Region.

    To confirm this, in the CloudWatch console choose Application Signals, Services. If Step 1 is not marked Complete, choose Start discovering your services. Data should start flowing in within five minutes.
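Several of the checks above come down to searching logs for a few telltale strings. As a rough sketch, the following Python function scans log text (for example, saved output from kubectl logs) and groups matching lines by likely cause. The export-failure and Access Denied patterns come from the messages quoted above; the throttling pattern is an assumption, because the exact wording of throttling messages is not given here. The sample log lines are hypothetical.

```python
import re

# Patterns keyed by the likely cause they point to.
SIGNATURES = {
    "export failure": re.compile(r"Failed to export"),
    "permissions": re.compile(r"Access\s?Denied", re.IGNORECASE),
    # Assumption: throttling messages contain a word starting with "throttl".
    "throttling": re.compile(r"throttl", re.IGNORECASE),
}


def triage_log(text: str) -> dict:
    """Group log lines by the troubleshooting cause they suggest."""
    findings = {cause: [] for cause in SIGNATURES}
    for line in text.splitlines():
        for cause, pattern in SIGNATURES.items():
            if pattern.search(line):
                findings[cause].append(line.strip())
    return findings


# Hypothetical log excerpt used only to exercise the function.
sample = """\
INFO starting agent
ERROR io.opentelemetry.exporter.internal.grpc.OkHttpGrpcExporter - Failed to export ...
WARN request failed: AccessDenied
"""
for cause, lines in triage_log(sample).items():
    print(cause, len(lines))
```

A non-empty group only suggests where to look next; confirm the cause against the corresponding item in the list above.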

Service metrics or dependency metrics have Unknown values

If you see UnknownService, UnknownOperation, UnknownRemoteService, or UnknownRemoteOperation for a dependency name or operation in the Application Signals dashboards, check whether the data points for the unknown remote service and unknown remote operation coincide with deployments of the affected services.

  • UnknownService means that the name of an instrumented application is unknown. If the OTEL_SERVICE_NAME environment variable is undefined and service.name isn't specified in OTEL_RESOURCE_ATTRIBUTES, the service name is set to UnknownService. To fix this, specify the service name in OTEL_SERVICE_NAME or OTEL_RESOURCE_ATTRIBUTES.

  • UnknownOperation means that the name of an invoked operation is unknown. This occurs when Application Signals is unable to discover the name of the operation that invokes the remote call, or when the extracted operation name contains high-cardinality values.

  • UnknownRemoteService means that the name of the destination service is unknown. This occurs when the system is unable to extract the destination service name that the remote call accesses.

    One solution is to create a custom span around the function that sends out the request, and add the attribute aws.remote.service with the designated value. Another option is to configure the CloudWatch agent to customize the metric value of RemoteService. For more information about customizations in the CloudWatch agent, see Enable CloudWatch Application Signals.

  • UnknownRemoteOperation means that the name of the destination operation is unknown. This occurs when the system is unable to extract the destination operation name that the remote call accesses.

    One solution is to create a custom span around the function that sends out the request, and add the attribute aws.remote.operation with the designated value. Another option is to configure the CloudWatch agent to customize the metric value of RemoteOperation. For more information about customizations in the CloudWatch agent, see Enable CloudWatch Application Signals.
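The UnknownService rule above can be illustrated with a short function that resolves a service name from a set of environment variables. This is a sketch of the rule as described, not the agent's actual implementation. Giving OTEL_SERVICE_NAME precedence over service.name in OTEL_RESOURCE_ATTRIBUTES follows the OpenTelemetry environment variable specification; the function name and sample values are hypothetical.

```python
def resolve_service_name(env: dict) -> str:
    """Mirror the naming rule described above for an environment mapping."""
    name = env.get("OTEL_SERVICE_NAME", "").strip()
    if name:
        return name
    # OTEL_RESOURCE_ATTRIBUTES is a comma-separated list of key=value pairs.
    for pair in env.get("OTEL_RESOURCE_ATTRIBUTES", "").split(","):
        key, sep, value = pair.partition("=")
        if sep and key.strip() == "service.name" and value.strip():
            return value.strip()
    # Neither variable names the service.
    return "UnknownService"


# Hypothetical environments used only for illustration.
print(resolve_service_name({}))
print(resolve_service_name({"OTEL_SERVICE_NAME": "checkout"}))
print(resolve_service_name(
    {"OTEL_RESOURCE_ATTRIBUTES": "deployment.environment=prod,service.name=cart"}
))
```

Running the same check against the OTEL_ variables shown by kubectl describe pod indicates whether a pod would report as UnknownService.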

Handling a ConfigurationConflict when managing the Amazon CloudWatch Observability EKS add-on

When you install or update the Amazon CloudWatch Observability EKS add-on, the operation might fail with a Health Issue of type ConfigurationConflict and a description that starts with Conflicts found when trying to apply. Will not continue due to resolve conflicts mode. This is likely because the CloudWatch agent and its associated components, such as the ServiceAccount, the ClusterRole, and the ClusterRoleBinding, are already installed on the cluster. When the add-on tries to install the CloudWatch agent and its associated components and detects any difference in their contents, by default it fails the installation or update to avoid overwriting the state of the resources on the cluster.

If you are trying to onboard to the Amazon CloudWatch Observability EKS add-on and you see this failure, we recommend deleting the existing CloudWatch agent setup from the cluster and then installing the EKS add-on. Be sure to back up any customizations that you made to the original CloudWatch agent setup, such as a custom agent configuration, and provide these to the Amazon CloudWatch Observability EKS add-on when you next install or update it. If you had previously installed the CloudWatch agent for onboarding to Container Insights, see Deleting the CloudWatch agent and Fluent Bit for Container Insights for more information.

Alternatively, the add-on supports a conflict resolution option that you can set to OVERWRITE. Use this option to proceed with installing or updating the add-on by overwriting the conflicting resources on the cluster. If you are using the Amazon EKS console, you'll find the Conflict resolution method under Optional configuration settings when you create or update the add-on. If you are using the Amazon CLI, add --resolve-conflicts OVERWRITE to the command that creates or updates the add-on.
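As an illustration of the CLI option, the following sketch assembles the argument list for creating or updating the add-on with OVERWRITE. It assumes the CLI executable is aws and that the add-on is named amazon-cloudwatch-observability; my-cluster is a placeholder cluster name, and the helper function is hypothetical. The code only builds and prints the command; it does not run it.

```python
import shlex


def addon_command(cluster_name: str, action: str = "update") -> list:
    """Build the CLI argv for creating or updating the add-on with OVERWRITE."""
    if action not in ("create", "update"):
        raise ValueError("action must be 'create' or 'update'")
    return [
        "aws", "eks", f"{action}-addon",
        "--cluster-name", cluster_name,
        # Assumption: the add-on's registered name.
        "--addon-name", "amazon-cloudwatch-observability",
        "--resolve-conflicts", "OVERWRITE",
    ]


# "my-cluster" is a placeholder cluster name.
print(shlex.join(addon_command("my-cluster")))
```

Because OVERWRITE replaces the conflicting resources on the cluster, back up any customizations first, as recommended above.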

I want to filter out unnecessary metrics and traces

If Application Signals is collecting traces and metrics that you don't want, see Manage high-cardinality operations for information about configuring the CloudWatch agent with custom rules to reduce cardinality.

For information about customizing trace sampling rules, see Configure sampling rules in the X-Ray documentation.

What does InternalOperation mean?

An InternalOperation is an operation that is triggered by the application internally rather than by an external invocation. Seeing InternalOperation is expected, healthy behavior.

Some typical examples where you would see InternalOperation include the following:

  • Preloading on start – Your application performs an operation named loadDatafromDB which reads metadata from a database during the warm-up phase. Instead of observing loadDatafromDB as a service operation, you'll see it categorized as an InternalOperation.

  • Async execution in the background – Your application subscribes to an event queue, and processes streaming data accordingly whenever there’s an update. Each triggered operation will be under InternalOperation as a service operation.

  • Retrieving host information from a service registry – Your application talks to a service registry for service discovery. All interactions with the discovery system are classified as an InternalOperation.

Can I disable Fluent Bit?

You can disable Fluent Bit by configuring the Amazon CloudWatch Observability EKS add-on. For more information, see (Optional) Additional configuration.

Can I filter container logs before exporting to CloudWatch Logs?

No, filtering container logs is not yet supported.