

# Prometheus metrics troubleshooting
<a name="ContainerInsights-Prometheus-troubleshooting"></a>

This section provides help for troubleshooting your Prometheus metrics setup. 

**Topics**
+ [Prometheus metrics troubleshooting on Amazon ECS](ContainerInsights-Prometheus-troubleshooting-ECS.md)
+ [Prometheus metrics troubleshooting on Amazon EKS and Kubernetes clusters](ContainerInsights-Prometheus-troubleshooting-EKS.md)

# Prometheus metrics troubleshooting on Amazon ECS
<a name="ContainerInsights-Prometheus-troubleshooting-ECS"></a>

This section provides help for troubleshooting your Prometheus metrics setup on Amazon ECS clusters. 

## I don't see Prometheus metrics sent to CloudWatch Logs
<a name="ContainerInsights-Prometheus-troubleshooting-ECS-nometrics"></a>

The Prometheus metrics should be ingested as log events in the log group **/aws/ecs/containerinsights/cluster-name/Prometheus**. If the log group is not created or the Prometheus metrics are not sent to it, first check whether the CloudWatch agent has successfully discovered the Prometheus targets. Then check the security group and permission settings of the CloudWatch agent. The following steps guide you through this debugging.

**Step 1: Enable the CloudWatch agent debugging mode**

First, change the CloudWatch agent to debug mode by adding the `"agent": {"debug": true}` section shown in the following example to your Amazon CloudFormation template file, `cwagent-ecs-prometheus-metric-for-bridge-host.yaml` or `cwagent-ecs-prometheus-metric-for-awsvpc.yaml`. Then save the file.

```
cwagentconfig.json: |
    {
      "agent": {
        "debug": true
      },
      "logs": {
        "metrics_collected": {
```

Create a new Amazon CloudFormation changeset against the existing stack. Set other parameters in the changeset to the same values as in your existing Amazon CloudFormation stack. The following example is for a CloudWatch agent installed in an Amazon ECS cluster using the EC2 launch type and the bridge network mode.

```
ECS_CLUSTER_NAME=your_cluster_name
AWS_REGION=your_region
ECS_NETWORK_MODE=bridge
CREATE_IAM_ROLES=True
ECS_TASK_ROLE_NAME=your_selected_ecs_task_role_name
ECS_EXECUTION_ROLE_NAME=your_selected_ecs_execution_role_name
NEW_CHANGESET_NAME=your_selected_changeset_name

aws cloudformation create-change-set --stack-name CWAgent-Prometheus-ECS-${ECS_CLUSTER_NAME}-EC2-${ECS_NETWORK_MODE} \
    --template-body file://cwagent-ecs-prometheus-metric-for-bridge-host.yaml \
    --parameters ParameterKey=ECSClusterName,ParameterValue=$ECS_CLUSTER_NAME \
                 ParameterKey=CreateIAMRoles,ParameterValue=$CREATE_IAM_ROLES \
                 ParameterKey=ECSNetworkMode,ParameterValue=$ECS_NETWORK_MODE \
                 ParameterKey=TaskRoleName,ParameterValue=$ECS_TASK_ROLE_NAME \
                 ParameterKey=ExecutionRoleName,ParameterValue=$ECS_EXECUTION_ROLE_NAME \
    --capabilities CAPABILITY_NAMED_IAM \
    --region $AWS_REGION \
    --change-set-name $NEW_CHANGESET_NAME
```

Go to the Amazon CloudFormation console to review the new changeset, `$NEW_CHANGESET_NAME`. There should be one change applied to the **CWAgentConfigSSMParameter** resource. Execute the changeset, and then restart the CloudWatch agent task by entering the following command.

```
aws ecs update-service --cluster $ECS_CLUSTER_NAME \
--desired-count 0 \
--service your_service_name_here \
--region $AWS_REGION
```

Wait about 10 seconds and then enter the following command.

```
aws ecs update-service --cluster $ECS_CLUSTER_NAME \
--desired-count 1 \
--service your_service_name_here \
--region $AWS_REGION
```
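The console review step and the two restart commands above can also be combined into a single CLI sketch. The following helper is hypothetical, not part of the published setup; it assumes the variables from the earlier changeset example and uses the standard `aws cloudformation execute-change-set` and `aws ecs wait services-stable` commands in place of the fixed 10-second wait.

```
# Hypothetical helper: execute the reviewed changeset, then restart the
# CloudWatch agent task so it picks up the debug configuration.
restart_cwagent_with_changeset() {
  local stack=$1 changeset=$2 cluster=$3 service=$4 region=$5

  # Apply the changeset (equivalent to choosing Execute in the console).
  aws cloudformation execute-change-set \
      --stack-name "$stack" --change-set-name "$changeset" --region "$region"

  # Scale the agent service down to 0 ...
  aws ecs update-service --cluster "$cluster" --service "$service" \
      --desired-count 0 --region "$region"
  # ... wait until the running task has stopped ...
  aws ecs wait services-stable --cluster "$cluster" --services "$service" \
      --region "$region"
  # ... and scale back up to 1 to restart the agent.
  aws ecs update-service --cluster "$cluster" --service "$service" \
      --desired-count 1 --region "$region"
}
```

For example: `restart_cwagent_with_changeset CWAgent-Prometheus-ECS-${ECS_CLUSTER_NAME}-EC2-${ECS_NETWORK_MODE} $NEW_CHANGESET_NAME $ECS_CLUSTER_NAME your_service_name_here $AWS_REGION`.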

**Step 2: Check the ECS service discovery logs**

The ECS task definition of the CloudWatch agent enables logging by default, as shown in the following section. The logs are sent to CloudWatch Logs in the log group **/ecs/ecs-cwagent-prometheus**.

```
LogConfiguration:
  LogDriver: awslogs
  Options:
    awslogs-create-group: 'True'
    awslogs-group: "/ecs/ecs-cwagent-prometheus"
    awslogs-region: !Ref AWS::Region
    awslogs-stream-prefix: !Sub 'ecs-${ECSLaunchType}-awsvpc'
```

Filter the logs by the string `ECS_SD_Stats` to get the metrics related to the ECS service discovery, as shown in the following example.

```
2020-09-01T01:53:14Z D! ECS_SD_Stats: AWSCLI_DescribeContainerInstances: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: AWSCLI_DescribeInstancesRequest: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: AWSCLI_DescribeTaskDefinition: 2
2020-09-01T01:53:14Z D! ECS_SD_Stats: AWSCLI_DescribeTasks: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: AWSCLI_ListTasks: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: Exporter_DiscoveredTargetCount: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: LRUCache_Get_EC2MetaData: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: LRUCache_Get_TaskDefinition: 2
2020-09-01T01:53:14Z D! ECS_SD_Stats: LRUCache_Size_ContainerInstance: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: LRUCache_Size_TaskDefinition: 2
2020-09-01T01:53:14Z D! ECS_SD_Stats: Latency: 43.399783ms
```
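If you save the filtered log events to a file or a shell variable, a short script can reduce them to name and value pairs for quick comparison. The following helper is a sketch, not part of the agent; the sample data mirrors the excerpt above.

```
# Sketch: reduce ECS_SD_Stats debug lines to "name value" pairs.
# sample_log stands in for log events copied from CloudWatch Logs.
sample_log='2020-09-01T01:53:14Z D! ECS_SD_Stats: Exporter_DiscoveredTargetCount: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: AWSCLI_ListTasks: 1
2020-09-01T01:53:14Z D! ECS_SD_Stats: Latency: 43.399783ms'

# Keep ECS_SD_Stats lines, print the last two fields, and drop the colon.
# Prints, for example: Exporter_DiscoveredTargetCount 1
printf '%s\n' "$sample_log" \
  | awk '/ECS_SD_Stats:/ { print $(NF-1), $NF }' \
  | tr -d ':'
```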

The meaning of each metric for a particular ECS service discovery cycle is as follows:
+ **AWSCLI\_DescribeContainerInstances** – the number of `ECS::DescribeContainerInstances` API calls made.
+ **AWSCLI\_DescribeInstancesRequest** – the number of `EC2::DescribeInstancesRequest` API calls made.
+ **AWSCLI\_DescribeTaskDefinition** – the number of `ECS::DescribeTaskDefinition` API calls made.
+ **AWSCLI\_DescribeTasks** – the number of `ECS::DescribeTasks` API calls made.
+ **AWSCLI\_ListTasks** – the number of `ECS::ListTasks` API calls made.
+ **Exporter\_DiscoveredTargetCount** – the number of Prometheus targets that were discovered and successfully exported into the target result file within the container.
+ **LRUCache\_Get\_EC2MetaData** – the number of times that container instance metadata was retrieved from the cache.
+ **LRUCache\_Get\_TaskDefinition** – the number of times that ECS task definition metadata was retrieved from the cache.
+ **LRUCache\_Size\_ContainerInstance** – the number of unique container instances whose metadata is cached in memory.
+ **LRUCache\_Size\_TaskDefinition** – the number of unique ECS task definitions cached in memory.
+ **Latency** – how long the service discovery cycle takes.

Check the value of `Exporter_DiscoveredTargetCount` to see whether the discovered Prometheus targets match your expectations. If they don't, the possible reasons are as follows:
+ The ECS service discovery configuration might not match your application's settings. For Docker label-based service discovery, your target containers might not have the Docker label that the CloudWatch agent is configured to use for autodiscovery. For ECS task definition ARN regular expression-based service discovery, the regular expression setting in the CloudWatch agent might not match your application's task definition.
+ The CloudWatch agent's ECS task role might not have permission to retrieve the metadata of ECS tasks. Check that the CloudWatch agent has been granted the following read-only permissions:
  + `ec2:DescribeInstances`
  + `ecs:ListTasks`
  + `ecs:DescribeContainerInstances`
  + `ecs:DescribeTasks`
  + `ecs:DescribeTaskDefinition`
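As a reference point, these read-only permissions correspond to an IAM policy statement like the following sketch, which you could include in the ECS task role's policy (the `Sid` value is arbitrary):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ECSServiceDiscoveryReadOnly",
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ecs:ListTasks",
        "ecs:DescribeContainerInstances",
        "ecs:DescribeTasks",
        "ecs:DescribeTaskDefinition"
      ],
      "Resource": "*"
    }
  ]
}
```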

**Step 3: Check the network connection and the ECS task role policy**

If there are still no log events sent to the target CloudWatch Logs log group even though the value of `Exporter_DiscoveredTargetCount` indicates that there are discovered Prometheus targets, this could be caused by one of the following:
+ The CloudWatch agent might not be able to connect to the Prometheus target ports. Check the security group settings behind the CloudWatch agent. The security groups of the Prometheus targets should allow the CloudWatch agent's private IP to connect to the Prometheus exporter ports.
+ The CloudWatch agent's ECS task role might not have the **CloudWatchAgentServerPolicy** managed policy. The CloudWatch agent's ECS task role needs this policy to be able to send the Prometheus metrics as log events. If you used the sample Amazon CloudFormation template to create the IAM roles automatically, both the ECS task role and the ECS execution role are granted the least privilege needed to perform Prometheus monitoring.

# Prometheus metrics troubleshooting on Amazon EKS and Kubernetes clusters
<a name="ContainerInsights-Prometheus-troubleshooting-EKS"></a>

This section provides help for troubleshooting your Prometheus metrics setup on Amazon EKS and Kubernetes clusters. 

## General troubleshooting steps on Amazon EKS
<a name="ContainerInsights-Prometheus-troubleshooting-general"></a>

To confirm that the CloudWatch agent is running, enter the following command.

```
kubectl get pod -n amazon-cloudwatch
```

The output should include a row with `cwagent-prometheus-id` in the `NAME` column and `Running` in the `STATUS` column.

To display details about the running pod, enter the following command. Replace *pod-name* with the complete name of your pod, which starts with `cwagent-prometheus`.

```
kubectl describe pod pod-name -n amazon-cloudwatch
```

If you have CloudWatch Container Insights installed, you can use CloudWatch Logs Insights to query the logs from the CloudWatch agent collecting the Prometheus metrics.

**To query the application logs**

1. Open the CloudWatch console at [https://console.amazonaws.cn/cloudwatch/](https://console.amazonaws.cn/cloudwatch/).

1. In the navigation pane, choose **CloudWatch Logs Insights**.

1. Select the log group for the application logs, **/aws/containerinsights/*cluster-name*/application**.

1. Replace the search query expression with the following query, and choose **Run query**.

   ```
   fields ispresent(kubernetes.pod_name) as haskubernetes_pod_name, stream, kubernetes.pod_name, log
   | filter haskubernetes_pod_name and kubernetes.pod_name like /cwagent-prometheus/
   ```
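The same query can also be run outside the console with the AWS CLI commands `aws logs start-query` and `aws logs get-query-results`. The following is a sketch; the cluster name is a placeholder, and a production script would poll until the query status is `Complete`.

```
# Sketch: run the CloudWatch Logs Insights query above from the CLI.
# Replace cluster-name with your cluster name before running.
query_cwagent_application_logs() {
  local log_group="/aws/containerinsights/cluster-name/application"
  local query='fields ispresent(kubernetes.pod_name) as haskubernetes_pod_name, stream, kubernetes.pod_name, log | filter haskubernetes_pod_name and kubernetes.pod_name like /cwagent-prometheus/'
  local now query_id

  now=$(date +%s)
  # Start the query over the last hour and capture the query ID.
  query_id=$(aws logs start-query \
      --log-group-name "$log_group" \
      --start-time "$(( now - 3600 ))" \
      --end-time "$now" \
      --query-string "$query" \
      --output text --query queryId)

  # Fetch the results; rerun this call until the status is Complete.
  aws logs get-query-results --query-id "$query_id"
}
```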

You can also confirm that Prometheus metrics and metadata are being ingested as CloudWatch Logs events.

**To confirm that Prometheus data is being ingested**

1. Open the CloudWatch console at [https://console.amazonaws.cn/cloudwatch/](https://console.amazonaws.cn/cloudwatch/).

1. In the navigation pane, choose **CloudWatch Logs Insights**.

1. Select the log group **/aws/containerinsights/*cluster-name*/prometheus**.

1. Replace the search query expression with the following query, and choose **Run query**.

   ```
   fields @timestamp, @message | sort @timestamp desc | limit 20
   ```

## Logging dropped Prometheus metrics
<a name="ContainerInsights-Prometheus-troubleshooting-droppedmetrics"></a>

This release does not collect Prometheus metrics of the histogram type. You can use the CloudWatch agent to check whether any Prometheus metrics are being dropped because they are histogram metrics. You can also log a list of the first 500 Prometheus metrics that are dropped and not sent to CloudWatch because they are histogram metrics.

To see whether any metrics are being dropped, enter the following command:

```
kubectl logs -l "app=cwagent-prometheus" -n amazon-cloudwatch --tail=-1
```
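To narrow that output down to the drop notice, you can pipe the same command through `grep`. This is a convenience sketch, not part of the documented procedure:

```
# Sketch: filter the agent logs for the dropped-metrics notice.
check_dropped_metrics() {
  kubectl logs -l "app=cwagent-prometheus" -n amazon-cloudwatch --tail=-1 \
    | grep "Drop Prometheus metrics" \
    || echo "No dropped-metrics notice found"
}
```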

If any metrics are being dropped, you will see the following lines in the `/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log` file.

```
I! Drop Prometheus metrics with unsupported types. Only Gauge, Counter and Summary are supported.
I! Please enable CWAgent debug mode to view the first 500 dropped metrics
```

If you see those lines and want to know what metrics are being dropped, use the following steps.

**To log a list of dropped Prometheus metrics**

1. Change the CloudWatch agent to debug mode by adding the following lines to your `prometheus-eks.yaml` or `prometheus-k8s.yaml` file, and save the file.

   ```
   {
         "agent": {
           "debug": true
         },
   ```

   This section of the file should then look like this:

   ```
   cwagentconfig.json: |
       {
         "agent": {
           "debug": true
         },
         "logs": {
           "metrics_collected": {
   ```

1. Reinstall the CloudWatch agent with the file that you edited, to enable debug mode. Enter the following commands, using `prometheus-eks.yaml` or `prometheus-k8s.yaml` as appropriate:

   ```
   kubectl delete deployment cwagent-prometheus -n amazon-cloudwatch
   kubectl apply -f prometheus-eks.yaml
   ```

   The dropped metrics are logged in the CloudWatch agent pod.

1. To retrieve the logs from the CloudWatch agent pod, enter the following command:

   ```
   kubectl logs -l "app=cwagent-prometheus" -n amazon-cloudwatch --tail=-1
   ```

   Or, if you have Container Insights Fluentd logging installed, the logs are also saved in the CloudWatch Logs log group **/aws/containerinsights/*cluster-name*/application**.

   To query these logs, you can follow the steps for querying the application logs in [General troubleshooting steps on Amazon EKS](#ContainerInsights-Prometheus-troubleshooting-general).

## Where are the Prometheus metrics ingested as CloudWatch Logs log events?
<a name="ContainerInsights-Prometheus-troubleshooting-metrics_ingested"></a>

The CloudWatch agent creates a log stream for each Prometheus scrape job configuration. For example, in the `prometheus-eks.yaml` and `prometheus-k8s.yaml` files, the line `job_name: 'kubernetes-pod-appmesh-envoy'` scrapes App Mesh metrics. The Prometheus target is defined as `kubernetes-pod-appmesh-envoy`. So all App Mesh Prometheus metrics are ingested as CloudWatch Logs events in the log stream **kubernetes-pod-appmesh-envoy** under the log group named **/aws/containerinsights/cluster-name/prometheus**.
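To see which log streams the agent has actually created, you can list them with the AWS CLI. The following is a sketch, with the cluster name passed as a parameter:

```
# Sketch: list the log streams (one per scrape job) in the Prometheus log group.
list_prometheus_log_streams() {
  local cluster=$1
  aws logs describe-log-streams \
      --log-group-name "/aws/containerinsights/${cluster}/prometheus" \
      --query 'logStreams[].logStreamName' \
      --output text
}
```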

## I don't see Amazon EKS or Kubernetes Prometheus metrics in CloudWatch metrics
<a name="ContainerInsights-Prometheus-troubleshooting-no-metrics"></a>

First, make sure that the Prometheus metrics are ingested as log events in the log group **/aws/containerinsights/cluster-name/prometheus**. Use the information in [Where are the Prometheus metrics ingested as CloudWatch Logs log events?](#ContainerInsights-Prometheus-troubleshooting-metrics_ingested) to help you check the target log stream. If the log stream is not created or there are no new log events in the log stream, check the following:
+ Check that the Prometheus metrics exporter endpoints are set up correctly.
+ Check that the Prometheus scraping configurations in the `config map: cwagent-prometheus` section of the CloudWatch agent YAML file are correct. The configuration should be the same as it would be in a Prometheus configuration file. For more information, see [<scrape\_config>](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#scrape_config) in the Prometheus documentation.

If the Prometheus metrics are ingested as log events correctly, check that embedded metric format settings like the following are added to the log events to generate the CloudWatch metrics.

```
"CloudWatchMetrics":[
   {
      "Metrics":[
         {
            "Name":"envoy_http_downstream_cx_destroy_remote_active_rq"
         }
      ],
      "Dimensions":[
         [
            "ClusterName",
            "Namespace"
         ]
      ],
      "Namespace":"ContainerInsights/Prometheus"
   }
],
```

For more information about embedded metric format, see [Specification: Embedded metric format](CloudWatch_Embedded_Metric_Format_Specification.md).
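A quick way to check a retrieved log event for these settings is to look for the `CloudWatchMetrics` key. The following sketch tests a local sample event; with real data you would pipe in events fetched from the log group:

```
# Abbreviated sample log event; real events come from the
# /aws/containerinsights/cluster-name/prometheus log group.
event='{"CloudWatchMetrics":[{"Namespace":"ContainerInsights/Prometheus"}],"ClusterName":"my-cluster"}'

# If the event carries embedded metric format, this prints "present".
if printf '%s' "$event" | grep -q '"CloudWatchMetrics"'; then
  echo "embedded metric format present"
else
  echo "embedded metric format missing"
fi
```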

If there is no embedded metric format in the log events, check that the `metric_declaration` section is configured correctly in the `config map: prometheus-cwagentconfig` section of the CloudWatch agent installation YAML file. For more information, see [Tutorial for adding a new Prometheus scrape target: Prometheus API Server metrics](ContainerInsights-Prometheus-Setup-configure.md#ContainerInsights-Prometheus-Setup-new-exporters).