[RFC] Migrating Metrics From Performance Analyzer to OpenTelemetry Framework #585

Open
ansjcy opened this issue Oct 17, 2023 · 5 comments
Labels
enhancement New feature or request

Comments

@ansjcy
Member

ansjcy commented Oct 17, 2023

Introduction

This RFC proposes migrating metrics from the OpenSearch Performance Analyzer (a plugin designed to gather system- and application-level metrics) to the OpenTelemetry framework, in light of the recent integration of OpenTelemetry as a trace/metrics collector within OpenSearch, and eventually deprecating the Performance Analyzer plugin.

Background

OpenSearch Performance Analyzer has been a valuable plugin within OpenSearch, offering insights into system- and application-level performance. With the advancement of observability frameworks and the community's move towards standardization, OpenSearch has integrated OpenTelemetry as a metrics collector. We are now presented with an opportunity to streamline the metrics collection workflow and framework, and to improve its maintainability and performance.

Motivation

  1. Unified Metrics Collection: The integration of OpenTelemetry provides a comprehensive metrics collection framework that can potentially replace the functionality of Performance Analyzer. Consolidating our metrics collection tools will simplify the architecture and reduce the complexity of the system.
  2. Reduce Maintenance Overhead: Maintaining two metrics collection tools is resource-intensive. Performance Analyzer uses a home-grown metrics collection framework that is not an industry standard. By focusing our efforts on a single framework (OpenTelemetry), we can ensure that we provide the best possible support and updates.
  3. Community Adoption: OpenTelemetry has gained significant traction in the community, leading to more integrations, tools, and extensions that our users can benefit from.
  4. Performance: OpenTelemetry is a widely-adopted project with optimizations and improvements being made continuously. Leveraging its capabilities can potentially offer better performance and resource utilization compared to maintaining our custom solution (PA/RCA).

Proposal

  • Deprecation Notice: Begin by adding a deprecation notice to the Performance Analyzer's README and documentation, informing users about the planned deprecation and the timeline for discontinuing support.
  • Migration Plan: Come up with a detailed migration plan that covers:
    • The different types of metrics we collect in Performance Analyzer.
    • For each category, how to get the exact same metrics previously gathered by Performance Analyzer using OpenTelemetry.
    • For the downstream components that consume PA metrics, how to maintain consistency.
    • Running the new metrics system in shadow mode for some time (?).
  • Deprecation: After we are confident in the new metrics collection workflow, officially deprecate the Performance Analyzer:
    • Stopping active development and support.
    • Archiving the repository or clearly marking it as deprecated.
  • Removal: In a subsequent major release of OpenSearch, completely remove the Performance Analyzer from the codebase and documentation.

Appendix

Categories of PA (the plugin) Collectors

  • Host level metrics: collected by directly reading the host/node level metrics.
  • Service level metrics: collected directly from the OpenSearch application; uses the OpenSearchResource object, which is created when the PA plugin is loaded and contains OpenSearch related data such as threadPool, environment, indicesService, etc.
    • Metrics with reflection: involve using Java reflection to get metrics from a library.
  • JVM level metrics: collected directly from the JVM using GarbageCollectorMXBean, etc. (a sketch of an OpenTelemetry equivalent follows this list).
  • Service level metrics with API: collected by calling an API.
  • PA internal metrics: Collects internal metrics from PA/RCA framework, not related to OpenSearch Core.
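
To make the JVM-level category concrete, here is a minimal sketch of what migrating such a collector could look like: the same java.lang.management MXBeans that HeapMetricsCollector/GCInfoCollector already read, registered as asynchronous OpenTelemetry instruments. This is an illustration only; the meter name and metric names are placeholders, not the names PA emits today.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;

/**
 * Sketch: JVM-level metrics (today gathered by HeapMetricsCollector / GCInfoCollector
 * via MXBeans) registered as asynchronous OpenTelemetry instruments.
 * Meter and metric names below are placeholders, not PA's actual names.
 */
public final class JvmOtelMetrics {

    public static void register() {
        Meter meter = GlobalOpenTelemetry.getMeter("org.opensearch.pa.migration");

        // Heap usage, read lazily from MemoryMXBean on each collection cycle.
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        meter.gaugeBuilder("jvm.memory.heap.used")
            .setUnit("By")
            .buildWithCallback(m -> m.record(memory.getHeapMemoryUsage().getUsed()));

        // Cumulative GC counts per collector, equivalent to what GCInfoCollector reads today.
        meter.counterBuilder("jvm.gc.collection.count")
            .buildWithCallback(m -> {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    m.record(gc.getCollectionCount(),
                        Attributes.builder().put("gc.name", gc.getName()).build());
                }
            });
    }
}
```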
| Collector Name | Type | Details: how are metrics collected | Migrate to ..? | Feasible or not? |
| --- | --- | --- | --- | --- |
| OSMetricsCollector | Host level metrics | Several customized data generators gather CPU, disk, and scheduling related metrics by reading the "/proc/<pid>/task/<tid>/*" files in a blocking way for all threads on the node. The metrics are then gathered by the OSMetricsCollector and forwarded to the JSON file in shared memory. | Other agent outside of the OpenSearch process / OTel collector | Feasible |
| DisksCollector | Host level metrics | A customized data generator gathers disk related metrics by reading the "/proc/diskstats" file in a blocking way. The metrics are then gathered by the DisksCollector and forwarded to the JSON file in shared memory (a sketch of an OpenTelemetry equivalent follows this table). | Other agent outside of the OpenSearch process / OTel collector | Feasible |
| NetworkInterfaceCollector | Host level metrics | A customized data generator gathers network related metrics by reading the "/proc/net/snmp", "/proc/net/snmp6", and "/proc/net/dev" files in a blocking way. The metrics are then gathered by the NetworkInterfaceCollector and forwarded to the JSON file in shared memory. | Other agent outside of the OpenSearch process / OTel collector | Feasible |
| HeapMetricsCollector | JVM level metrics | Utilizes the GarbageCollectorMXBean and MemoryMXBean in the java.lang.management library to get JVM related metrics. | Core | Feasible |
| GCInfoCollector | JVM level metrics | Gets GC related info from GarbageCollectorMXBeans. | Core | Feasible |
| CircuitBreakerCollector | Service level metrics | From the circuitBreakerService passed from OpenSearch. | Core | Feasible |
| NodeDetailsCollector | Service level metrics | From the clusterService passed from OpenSearch. | Core | Feasible |
| ClusterManagerServiceMetrics | Service level metrics | Gets the pending tasks stats from clusterService.clusterManagerService. | Core | Feasible |
| ShardStateCollector | Service level metrics | Gets shard state metrics for each shard in each index using the routingTable data within the clusterService passed from OpenSearch. | Core | Feasible, but need to check the CPU level metrics coming from threads. |
| ElectionTermCollector | Service level metrics | Gets the election term metric from the clusterService passed from OpenSearch. | Core | Feasible |
| ThreadPoolMetricsCollector | Service level metrics (with reflection) | Metrics are obtained by calling the stats() function on the threadPool object passed from OpenSearch; Java reflection is used to get the capacity of the thread pool. | Core | Feasible. Migrating to core means we can directly emit thread pool level metrics without using reflection. |
| CacheConfigMetricsCollector | Service level metrics (with reflection) | From the indicesService passed from OpenSearch; uses Java reflection to ensure backward compatibility. The indicesService is provided by DI and the binding is defined here. | Core | Feasible. |
| NodeStatsAllShardsMetricsCollector | Service level metrics (with reflection) | From the indicesService passed from OpenSearch; gets the increment of the high level stats for all shards by calculating the diff against the previous shard stats. | Core | Feasible |
| NodeStatsFixedShardsMetricsCollector | Service level metrics (with reflection) | Similar to NodeStatsAllShardsMetricsCollector; from the indicesService passed from OpenSearch, gets more detailed metrics for the specific shards configured by the user with shardsPerCollection. | Core | Feasible |
| ClusterManagerServiceEventMetrics | Service level metrics (with reflection) | Gets cluster manager task event data from the clusterManagerService object passed from OpenSearch. | Core | Feasible |
| ClusterManagerThrottlingMetricsCollector | Service level metrics (with reflection) | Gets throttling metrics via reflection from org.opensearch.action.support.clustermanager.ClusterManagerThrottlingRetryListener, using the clusterService passed from OpenSearch. | Core | Feasible |
| ClusterApplierServiceStatsCollector | Service level metrics (with reflection) | "ClusterApplierServiceStats in ES is a tracker for the total time taken to apply cluster state and the number of times it has failed." This collector uses the ClusterApplierService from OpenSearch. | Core | Feasible |
| AdmissionControlMetricsCollector | Service level metrics (with reflection) | Uses the admissionController from com.sonian.opensearch.http.jetty.throttling.JettyAdmissionControlService in OpenSearch to get AdmissionControl related metrics. | Core | Feasible |
| ShardIndexingPressureMetricsCollector | Service level metrics (with reflection) | Gets indexing pressure related metrics from the clusterService passed from OpenSearch, using classes like org.opensearch.index.ShardIndexingPressureStore, org.opensearch.index.IndexingPressure, and org.opensearch.index.ShardIndexingPressure. | Core | Feasible |
| FaultDetectionMetricsCollector | PA internal metrics | PA internal queue fault metrics? Gets the FaultDetectionHandlerMetricsQueue from org.opensearch.performanceanalyzer.handler.ClusterFaultDetectionStatsHandler and emits metrics based on each entry. | Deprecate | Feasible |
| StatsCollector | PA internal metrics | PA internal metrics stats collector. | Deprecate | Feasible |
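
For the host-level collectors, one option noted in the table is to hand the /proc scraping to an agent outside the OpenSearch process or to the OTel Collector. As an illustration of what an in-process equivalent of the DisksCollector read could look like, here is a hedged sketch; the metric name is a placeholder, and a real collector would need the same care around blocking reads that PA takes today.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.Meter;

/**
 * Sketch only: the disk metrics that DisksCollector reads from /proc/diskstats,
 * re-exposed as an asynchronous OpenTelemetry counter instead of the shared-memory
 * JSON file. Field offsets follow the documented /proc/diskstats layout.
 */
public final class DiskStatsOtelMetrics {

    public static void register() {
        Meter meter = GlobalOpenTelemetry.getMeter("org.opensearch.pa.migration");

        // Sectors read per device (column 6 of /proc/diskstats), reported cumulatively.
        meter.counterBuilder("system.disk.sectors.read")
            .buildWithCallback(measurement -> {
                try {
                    List<String> lines = Files.readAllLines(Paths.get("/proc/diskstats"));
                    for (String line : lines) {
                        String[] f = line.trim().split("\\s+");
                        if (f.length < 7) {
                            continue;
                        }
                        String device = f[2];
                        long sectorsRead = Long.parseLong(f[5]);
                        measurement.record(sectorsRead,
                            Attributes.builder().put("device", device).build());
                    }
                } catch (IOException | NumberFormatException e) {
                    // A real collector would log this; the callback must not throw.
                }
            });
    }
}
```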
@ansjcy ansjcy added enhancement New feature or request untriaged labels Oct 17, 2023
@ansjcy ansjcy changed the title [RFC] Migrating metrics from Performance Analyzer to OpenTelemetry Framework [RFC] Migrating Metrics From Performance Analyzer to OpenTelemetry Framework Oct 17, 2023
@Gaganjuneja
Collaborator

@ansjcy, thanks for putting this up. Utilizing the OpenSearch telemetry framework for emitting these metrics does seem promising. The PA plugin generators are already well-written, making them easily reusable. Since these metrics are ideally part of a plugin rather than being merged directly into the core, migrating them to the OpenSearch telemetry framework within the PA plugin sounds like a sensible approach.

Thoughts here, @reta @backslasht @msfroh @khushbr @Bukhtawar?

@reta
Contributor

reta commented Apr 12, 2024

Agree with @Gaganjuneja. OpenSearch already collects tons of metrics but exposes them through REST APIs; using the newly developed metric providers, we certainly could unify the approach. Thanks @ansjcy!

@backslasht

+1, I like the idea of migrating the Performance Analyzer plugin metrics into the OpenTelemetry format.

But I would like to understand a bit more about the deprecation of the "Performance Analyzer" plugin part:

  1. Are you suggesting moving the logic into a new plugin that will emit these metrics in OTel format and, once that is done, deprecating the "Performance Analyzer" plugin, OR
  2. Are you suggesting moving the metrics collection into core?

@Gaganjuneja
Collaborator

Thank you, @reta and @backslasht, for your prompt responses. My suggestion is to retain these metrics within the "Performance Analyzer" plugin for the time being, given its extensive collection of operating system metrics. To facilitate this, we can pass the MetricsRegistry from the core to the Performance Analyzer plugin and initiate the migration of metrics to utilize an OpenTelemetry-based metrics registry for publishing purposes. Eventually, we can deliberate on the feasibility of integrating this plugin entirely into the core, taking into consideration the implications of backporting as well.
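
A rough sketch of what that hand-off could look like inside the plugin is below. The createCounter(name, description, unit) shape is assumed based on the current core telemetry interfaces and should be verified against the targeted OpenSearch version; the metric name is purely illustrative.

```java
import org.opensearch.telemetry.metrics.Counter;
import org.opensearch.telemetry.metrics.MetricsRegistry;

/**
 * Illustrative sketch only: a PA-side publisher that receives the core's
 * MetricsRegistry (as suggested above) and emits through it instead of writing
 * to the shared-memory JSON files. The createCounter(name, description, unit)
 * signature is an assumption to be checked against the actual telemetry
 * interfaces in the targeted OpenSearch version.
 */
public final class PaMetricsPublisher {

    private final Counter gcPauseCounter;

    public PaMetricsPublisher(MetricsRegistry metricsRegistry) {
        // The registry is created by core and handed to the plugin at load time.
        this.gcPauseCounter = metricsRegistry.createCounter(
            "pa.jvm.gc.pause.count", "Number of GC pauses observed by PA", "1");
    }

    public void onGcPause() {
        // Cumulative count; tags could be attached here if needed.
        gcPauseCounter.add(1.0);
    }
}
```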

@dblock
Member

dblock commented Jun 17, 2024

Catch All Triage - 1 2 3 4 5

@dblock dblock removed the untriaged label Jun 17, 2024