Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract data models from RCA into separate Repo #307

Closed
khushbr opened this issue Mar 25, 2023 · 10 comments
Closed

Extract data models from RCA into separate Repo #307

khushbr opened this issue Mar 25, 2023 · 10 comments
Assignees
Labels
discuss enhancement New feature or request v2.8.0 Issues and PRs related to version v2.8.0

Comments

@khushbr
Copy link
Collaborator

khushbr commented Mar 25, 2023

Background Context
Performance Analyzer (Writer) plugin captures the system behavior at a fine grain(5s) level in the form of metrics and stats, which is periodically flushed to a shared memory location. The RCA Agent(Reader) component scans this directory for updates, parses the raw data into sqlite entries and populates the on-disk sqlite metricDB.

MetricsProcessor

Problem Statement & Proposed Solution
In the past, a conscious design choice to segregate the OpenSearch independent code into a separate process(RCA) was made. This allowed to deploy changes to RCA without restarting the critical OpenSearch process. However, the code split wasn't done cleanly. The shared data models of the various metrics are currently part of the RCA codebase. This proposal is to address this circular dependency between PA and RCA, extracting out the common data models from RCA codebase to a shared library which can in-turn be consumed by the 2 components.

Feedback Requested:
Currently, within OpenSearch ecosystem, we have no precedent for an OpenSearch plugin - external Agent communication model. (Note: We have a shared library common-utils for between plugins reusable functionality and data models.) We are looking for feedback on how to better organize the code for such use-cases.

@khushbr khushbr added enhancement New feature or request discuss labels Mar 25, 2023
@getsaurabh02
Copy link
Member

Thanks for highlighting this @khushbr. Wondering if this refactoring is primarily to address the circular dependency or has other benefits as well? The shared library common-utils is currently overloaded and we have ongoing discussions for it to be deprecated. Adding more dependency could impact such decisions.
Also, how does it align with our long term goals moving some of these collection/aggregation logic to the core itself? Do we have any active plans on it yet?

@khushbr
Copy link
Collaborator Author

khushbr commented Mar 27, 2023

Hey @getsaurabh02, Thank you for your comment.

We will not be using common-utils, but rather have a separate library (Github Repository) with PA-RCA data models, named performance-analyzer-commons.

The aim of this exercise is to remove the code level dependency PA currently has on RCA repository, the benefits will further extend to making the github build, testing workflows cleaner and avoid the RCA packaging inside PA we currently do for the Maven releases.

Also, how does it align with our long term goals moving some of these collection/aggregation logic to the core itself? Do we have any active plans on it yet?

We discussed this and came to the conclusion that async path and system stats metric collection will continue to be part of the current setup (PA-RCA), while the sync path metrics, a limited set (Latency, CPU, JVM) will be part of the upcoming Tracing Framework.

@eirsep
Copy link
Member

eirsep commented Mar 28, 2023

Is there any way we can have the shared models moved to core and avoid creating a new package?

@lezzago
Copy link
Member

lezzago commented Mar 29, 2023

For choosing this new package instead of common-utils makes sense as we want these data models to only be accessed by PA and RCA, but not other plugins. The idea of the shared data models in common-utils currently is to be shared to any other plugin to call Alerting or Notifications, which is very different.

Is there any way we can have the shared models moved to core and avoid creating a new package?

If currently if we do not want PA/RCA to be first class citizens in OpenSearch core, we should not move these data models into core.

I think the only thing to be careful of creating a new package is the extra overhead to manage a new package for each OpenSearch release. If we can make this package very version agnostic as possible, that would be ideal. It could potentially have a separate versioning scheme from OpenSearch and we should make sure it has no dependency on OpenSearch core.

@Tjofil
Copy link
Contributor

Tjofil commented Mar 29, 2023

Leaving some additional context and findings regarding the logical organization of what is currently PA and PA-RCA repo and the suggestions for improving during the separation process.

Parts of our building, bwc, integ testing and local deployment workflows, scripts and practices are outdated (compared to other OS plugin repos), way too complicated and time consuming and should be reworked alongside the process of the repo separation.

  1. Build process for local deployment and testing, the same one from Github workflows which is triggered on every push/commit is very time consuming and even includes some redundancies.
    For example PA-RCA build is published to local maven repo for PA build to consume. After that, both builds are used in the image building process. Testing below shows redundancy for PA-RCA part, as the same PA-RCA build that we transfer and use separately is already found within the PA build because of its previous consumption.
    image
    (Comparison done using meld version 3.22)

  2. The above mentioned build and deployment process is done basically from scratch, it includes two-stage image creation (from centos, if anyone could provide more context about the origin of this build process I would appreciate), entrypoint script, other docker-compose scripts and licensing data, all of it can be found here. The image is built from latest Opensearch min distro + PA plugin installation + redundant PA-RCA config files from the previous point, alongside other access and JVM parameterizing. The process doesn't install any other plugins that are installed by default with standard OS image. This proved to be impractical and error-prone, causing weird bugs not to be detectable before release, i.e. opensearchproject/opensearch:2.6.0 image manifests bugs that are not happening when image is built in the previously explained way (centos base image + opensearch-min:2.6 + PA:2.6 + PA-RCA:2.6), one of the bugs being #305.

  3. Every change inside PA-RCA repo requires repetition of the whole build process (which is not ideal as RCA is logically built a top off PA, more precisely PA's so-called Observability component and shouldn't require whole PA's rebuild). Addition of the gradle run task that is present in other plugins like Alerting would also be nice and simplify quick testing for the developers.
    Improvements of last two points should all be possible once we get rid of the cyclic dependency in this split.

The other aspect I want to talk about is the way for these two to be separated. I'm going to provide some additional context and present the architecture more in detail. Please raise concerns if any part seems incorrect.

Performance Analyzer is a component that is able function on it's own. Conceptually, it is composed of two larger parts: Reader and Writer. Writer is installed as a standard OS plugin and links its fate to the Opensearch JVM, this is necessary for easier collection of OS and OS JVM specific metrics. Collected metrics are then written to temporary timestamped disk files in the form of events, each file representing five second period starting with marked timestamp. All of this is happening as a part of OS' JVM. Reader on the other hand is part of the agent process running independently and without explicit synchronization, reading, aggregating and persisting metrics from the timestamped files. Metrics are persisted inside rotated SQLite databases and (again) separated into 5-second timestamped aggregations. This all represents core PA functionalities which originally existed on their own, mainly to be used inside PerfTop (apart from PA batch API which is its own story) which are available to standard PA metrics API and require only PA to be enabled (through PA enable request).

Performance Analyzer - Root Cause analysis as a component and a Framework, on the other hand, came later (I'm not sure about this, correct me if I'm wrong) and it's idea was to build on top of the PA's agent and use the PA's 5-second aggregations as Leaf nodes for real-time RCA. It is a concept on its own and consists of large infrastructure that enables generalized communication and RCA component connection between different OS nodes using gRPC. Note that there are no explicitly named sub-components of PA-RCA called Reader or Writer, RCA Leaf nodes are implemented as scheduled jobs that consume PA's Observability component (query the SQLite base storing the 5-second metric aggregations). All the RCA nodes (nodes producing RCA summaries), persist their findings inside another SQLite db, and these are the results you get from the performance-analyzer-rca API. All the RCA analysis are implemented using the PA-RCA framework. More can be found here.

Here is the extended picture of the explained architecture:
image

In order for PA-RCA's analysis to start, separate enabling request is required, but it has to be preceded with the PA's enabling request in order to consume its components and work, while vice-versa is not true. It is good to note that PA's results (original metric sampling and aggregating) are not linked to the PA-RCA part. While being one of its most attractive upgrades, PA-RCA analysis is not the only way to use the PA's results.

The way the code is currently split (if we ignore the cyclic dependency part) is that the code that runs inside the plugin is in the PA repo and the part of the code that's running inside the PA Agent is in the PA-RCA repo. This may cause some confusion when looking for certain parts of the system, i.e. you'll find PA's Reader inside the RCA repo. This may be done for sake of build and packaging convenience although I'm not sure. Is it more ideal to keep it separated conceptually (if it's even possible) or by plugin - agent division? I'm also not sure about this and additional discussion and comments are well appreciated.

@khushbr
Copy link
Collaborator Author

khushbr commented Mar 29, 2023

@Tjofil Thank you for the detailed Write-up. I agree with all of it except the last point, the PA Writer and Reader components were intentionally split up and are arranged such by design; the idea was to minimize the footprint on the OpenSearch process, in which PA plugin is loaded. The writer flushes the raw metric data to disk, as soon as it is available, to ensure the data persistence in the in-memory buffers is minimal. All the post-processing on the data is done by the separate process of PA-Agent, via the MetricProcessor, MetricEmitters subcomponents.

@Tjofil
Copy link
Contributor

Tjofil commented Mar 30, 2023

@khushbr You're completely right. The way the code is split in terms of JVM it runs in (OS's or PA agent's JVM) is completely sound and is not questionable. I was more talking about how repos' names may not be intuitive when looking for something or when building/installing the plugin and features you want. While they are running inside the same process, PA Reader and PA-RCA are not that closely coupled, in fact they communicate through SQLite db (of course referencing common metric types) and they have their own thread pools, but are both located in the same repo called PA-RCA so I wondered if there is a way to keep the JVM separation as it is, but having the PA and PA-RCA split more intuitively or at least label the repos differently alongside this separation process.

@khushbr
Copy link
Collaborator Author

khushbr commented Mar 31, 2023

Linking the older issue on PA side: opensearch-project/performance-analyzer#50

@khushbr khushbr added the v2.8.0 Issues and PRs related to version v2.8.0 label Apr 6, 2023
@khushbr
Copy link
Collaborator Author

khushbr commented Apr 7, 2023

@khushbr
Copy link
Collaborator Author

khushbr commented May 31, 2023

@khushbr khushbr closed this as completed May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss enhancement New feature or request v2.8.0 Issues and PRs related to version v2.8.0
Projects
None yet
Development

No branches or pull requests

6 participants