Trace Analytics for Amazon OpenSearch Service
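You can use Trace Analytics, which is part of the OpenSearch Observability plugin, to analyze trace data from distributed applications. Trace Analytics requires OpenSearch or Elasticsearch 7.9 or later.
In a distributed application, a single operation, such as a user clicking a button, can trigger an extended series of events. For example, the application front end might call a backend service, which calls another service, which queries a database, which processes the query and returns the results. Then the first backend service sends a confirmation to the front end, which updates the UI.
You can use Trace Analytics to help you visualize this flow of events and identify performance problems.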
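As a rough illustration of the kind of data involved, the following Python sketch models one such operation as a tree of spans and reports the span with the most time not accounted for by its children. The `Span` class, the service names, and the timings are all hypothetical and are not part of any OpenSearch or OpenTelemetry API. The self-time calculation assumes that child spans do not overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical, simplified span: one unit of work inside a trace."""
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    name: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


# Made-up spans for a single button click (one trace).
spans = [
    Span("a", None, "frontend", "POST /checkout", 0, 120),
    Span("b", "a", "order-service", "create_order", 10, 110),
    Span("c", "b", "inventory-service", "reserve_items", 20, 100),
    Span("d", "c", "database", "SELECT items", 30, 90),
]


def render(spans):
    """Return the span tree as indented lines, one line per span."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)

    lines = []

    def walk(parent_id, depth):
        for s in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            lines.append(f"{'  ' * depth}{s.service}: {s.name} ({s.duration_ms} ms)")
            walk(s.span_id, depth + 1)

    walk(None, 0)
    return lines


def self_time_ms(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    child_time = sum(c.duration_ms for c in spans if c.parent_id == span.span_id)
    return span.duration_ms - child_time


slowest = max(spans, key=lambda s: self_time_ms(s, spans))

print("\n".join(render(spans)))
print(f"Most self time: {slowest.service} ({self_time_ms(slowest, spans)} ms)")
```

In this made-up trace, the database span has the most self time, which is the kind of bottleneck a trace visualization is meant to surface.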

Prerequisites
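Trace Analytics requires you to add instrumentation to your application so that it generates trace data.
After you add instrumentation to your application, the OpenTelemetry Collector receives the trace data and exports it to Data Prepper.
Finally, Data Prepper processes the trace data and sends it to your OpenSearch Service domain.
For a Docker Compose file that demonstrates the end-to-end flow of data, see the OpenSearch documentation.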
OpenTelemetry Collector sample configuration
To use the OpenTelemetry Collector with Data Prepper, try the following sample configuration:
receivers:
  jaeger:
    protocols:
      grpc:
  otlp:
    protocols:
      grpc:
  zipkin:

exporters:
  otlp/data-prepper:
    endpoint: data-prepper-host:21890
    insecure: true

service:
  pipelines:
    traces:
      receivers: [jaeger, otlp, zipkin]
      exporters: [otlp/data-prepper]
Data Prepper sample configuration
To send trace data to an OpenSearch Service domain, try the following sample configuration files.
data-prepper-config.yaml
ssl: true
keyStoreFilePath: "/usr/share/data-prepper/keystore.jks" # required if ssl is true
keyStorePassword: "password" # optional, defaults to empty string
privateKeyPassword: "other_password" # optional, defaults to empty string
serverPort: 4900 # port for administrative endpoints, default is 4900
pipelines.yaml
entry-pipeline:
  # Workers is the number of application threads.
  # Try setting this value to the number of CPU cores on the machine.
  # We recommend the same number of workers for all pipelines.
  workers: 4
  delay: "100" # milliseconds
  source:
    otel_trace_source:
      ssl: true
      sslKeyCertChainFile: "config/demo-data-prepper.crt"
      sslKeyFile: "config/demo-data-prepper.key"
  buffer:
    bounded_blocking:
      # Buffer size is the number of export requests to hold in memory.
      # We recommend the same value for all pipelines.
      # Batch size is the maximum number of requests each worker thread processes within the delay.
      # Keep buffer size >= number of workers * batch size.
      buffer_size: 1024
      batch_size: 256
  sink:
    - pipeline:
        name: "raw-pipeline"
    - pipeline:
        name: "service-map-pipeline"
raw-pipeline:
  workers: 4
  # We recommend the default delay for the raw pipeline.
  delay: "3000"
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - otel_trace_raw_prepper:
  buffer:
    bounded_blocking:
      buffer_size: 1024
      batch_size: 256
  sink:
    - opensearch:
        hosts: ["https://domain-endpoint"]
        # # Basic authentication
        # username: "ta-user"
        # password: "ta-password"
        # IAM signing
        aws_sigv4: true
        aws_region: "us-east-1"
        trace_analytics_raw: true
service-map-pipeline:
  workers: 4
  delay: "100"
  source:
    pipeline:
      name: "entry-pipeline"
  prepper:
    - service_map_stateful:
  buffer:
    bounded_blocking:
      buffer_size: 1024
      batch_size: 256
  sink:
    - opensearch:
        hosts: ["https://domain-endpoint"]
        # # Basic authentication
        # username: "ta-user"
        # password: "ta-password"
        # IAM signing
        aws_sigv4: true
        aws_region: "us-east-1"
        trace_analytics_service_map: true
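- For IAM signing, run aws configure using the AWS CLI to set your credentials.
- If you use fine-grained access control with an internal user database, use the basic authentication lines.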
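If your domain uses fine-grained access control, you must map the Data Prepper user or role to the all_access role.
If your domain doesn't use fine-grained access control, the Data Prepper user or role must have write permissions to several indexes and templates, as well as permissions to access Index State Management (ISM) policies and to retrieve cluster settings. The following sample policy shows the minimum required permissions.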
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/data-prepper-sink-user"
      },
      "Action": "es:ESHttp*",
      "Resource": [
        "arn:aws:es:us-east-1:123456789012:domain/domain-name/otel-v1*",
        "arn:aws:es:us-east-1:123456789012:domain/domain-name/_template/otel-v1*",
        "arn:aws:es:us-east-1:123456789012:domain/domain-name/_plugins/_ism/policies/raw-span-policy",
        "arn:aws:es:us-east-1:123456789012:domain/domain-name/_alias/otel-v1*"
      ]
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::123456789012:user/data-prepper-sink-user"
      },
      "Action": "es:ESHttpGet",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/domain-name/_cluster/settings"
    }
  ]
}
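Data Prepper uses port 21890 to receive data, and it must be able to connect to both the OpenTelemetry Collector and the OpenSearch cluster. For performance tuning, adjust the worker counts and buffer settings in the configuration file, along with the Java virtual machine (JVM) heap size for the machine.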
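When you change worker counts or buffer settings, it can help to recheck the sizing rule stated in the sample pipelines.yaml comments: keep the buffer size greater than or equal to the number of workers multiplied by the batch size. The following Python sketch applies that rule to the values from the sample configuration. The function name and the dictionary layout are hypothetical and are not part of Data Prepper.

```python
def buffer_is_large_enough(workers: int, batch_size: int, buffer_size: int) -> bool:
    """Apply the rule from the sample config: buffer_size >= workers * batch_size."""
    return buffer_size >= workers * batch_size


# Values copied from the sample pipelines.yaml; the dict layout is illustrative only.
pipelines = {
    "entry-pipeline": {"workers": 4, "batch_size": 256, "buffer_size": 1024},
    "raw-pipeline": {"workers": 4, "batch_size": 256, "buffer_size": 1024},
    "service-map-pipeline": {"workers": 4, "batch_size": 256, "buffer_size": 1024},
}

results = {name: buffer_is_large_enough(**cfg) for name, cfg in pipelines.items()}

for name, ok in results.items():
    print(f"{name}: {'ok' if ok else 'buffer too small'}")
```

With the sample values, each buffer exactly meets the rule (4 × 256 = 1024), so raising the worker count to 8 without also raising buffer_size would violate it.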
For full documentation of Data Prepper, see the OpenSearch documentation.
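Exploring trace data
The Dashboard view groups traces together by HTTP method and path so that you can see the average latency, error rate, and trends associated with a particular operation. For a more focused view, try filtering by trace group name.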

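To drill down into the traces that make up a trace group, choose the number of traces in the right-hand column. Then choose an individual trace for a detailed summary.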
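To make the Dashboard view's grouping concrete, the following Python sketch computes the trace count, average latency, and error rate for each trace group from a handful of made-up trace summaries. It only illustrates the arithmetic. The tuple layout, the group names, and the `summarize` function are assumptions, and the plugin's actual queries may differ.

```python
from collections import defaultdict

# Hypothetical trace summaries: (trace group name, latency in ms, had an error).
traces = [
    ("HTTP GET /cart", 40.0, False),
    ("HTTP GET /cart", 60.0, False),
    ("HTTP POST /checkout", 200.0, True),
    ("HTTP POST /checkout", 100.0, False),
]


def summarize(traces):
    """Group traces by name and compute count, average latency, and error rate."""
    groups = defaultdict(list)
    for group, latency_ms, had_error in traces:
        groups[group].append((latency_ms, had_error))

    summary = {}
    for group, rows in groups.items():
        summary[group] = {
            "traces": len(rows),
            "avg_latency_ms": sum(latency for latency, _ in rows) / len(rows),
            "error_rate": sum(1 for _, err in rows if err) / len(rows),
        }
    return summary


summary = summarize(traces)
for group, stats in summary.items():
    print(group, stats)
```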
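The Services view lists all services in the application, plus an interactive map that shows how the various services connect to each other. In contrast to the dashboard (which helps identify problems by operation), the service map helps you identify problems by service. Try sorting by error rate or latency to get a sense of potential problem areas of your application.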
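One way to picture how a service map can be derived from trace data is to follow parent-child links between spans and record an edge whenever a call crosses from one service to another. The following Python sketch does this for a few made-up spans. It is a simplification for illustration and does not describe how Data Prepper's service_map_stateful prepper is implemented. The span fields and service names are hypothetical.

```python
from collections import Counter

# Hypothetical spans from one trace; only the fields needed for the map are shown.
spans = [
    {"span_id": "a", "parent_id": None, "service": "frontend"},
    {"span_id": "b", "parent_id": "a", "service": "order-service"},
    {"span_id": "c", "parent_id": "b", "service": "order-service"},  # internal call
    {"span_id": "d", "parent_id": "c", "service": "database"},
    {"span_id": "e", "parent_id": "a", "service": "auth-service"},
]


def service_edges(spans):
    """Count caller -> callee edges between different services."""
    service_of = {s["span_id"]: s["service"] for s in spans}
    edges = Counter()
    for s in spans:
        parent_service = service_of.get(s["parent_id"])
        # Skip root spans and calls that stay inside a single service.
        if parent_service is not None and parent_service != s["service"]:
            edges[(parent_service, s["service"])] += 1
    return edges


edges = service_edges(spans)
for (caller, callee), count in sorted(edges.items()):
    print(f"{caller} -> {callee}: {count}")
```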
