[BUG] Opensearch 2.6.0 ShardBulkDocs Metrics is always empty #405

Open
pmarjou opened this issue Mar 23, 2023 · 9 comments
Labels: bug (Something isn't working), v2.8.0

Comments

pmarjou commented Mar 23, 2023

What is the bug?
The ShardBulkDocs metric is always empty when calling the Performance Analyzer API. Other metrics work.
I am re-opening #77 because the problem is still present in 2.6.0.

How can one reproduce the bug?
Steps to reproduce the behavior:

Use the enclosed docker compose file:
docker-compose.txt

Follow the steps to activate Performance Analyzer: https://opensearch.org/docs/latest/opensearch/install/docker/#optional-set-up-performance-analyzer

Create a test index (PUT test) and POST documents to it (see enclosed):
postdata.txt

While pushing documents, call this URL from your browser: http://localhost:9600/_plugins/_performanceanalyzer/metrics?metrics=ShardBulkDocs,ShardEvents&agg=sum,sum
You'll get only ShardEvents numbers and no data for ShardBulkDocs:

{"CkXsbxtMTDisNx2ahHqrIw": {"timestamp": 1635425175000, "data": {"fields":[{"name":"ShardBulkDocs","type":"DOUBLE"},{"name":"ShardEvents","type":"DOUBLE"}],"records":[[0.0,27.0]]}}}
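
To make the symptom easy to check while documents are being pushed, here is a minimal Python sketch that sums each metric column of a response like the one above. The response string is copied from this report; the helper name `metric_totals` and the choice to treat null cells as 0 are assumptions of the sketch, not part of the Performance Analyzer API.

```python
import json

# Response body copied verbatim from the report above.
RESPONSE = (
    '{"CkXsbxtMTDisNx2ahHqrIw": {"timestamp": 1635425175000, "data": '
    '{"fields":[{"name":"ShardBulkDocs","type":"DOUBLE"},'
    '{"name":"ShardEvents","type":"DOUBLE"}],"records":[[0.0,27.0]]}}}'
)


def metric_totals(body: str) -> dict:
    """Sum every metric column over all nodes and records in a response."""
    totals = {}
    for node in json.loads(body).values():
        names = [field["name"] for field in node["data"]["fields"]]
        for record in node["data"]["records"]:
            for name, value in zip(names, record):
                # Assumption of this sketch: a null cell counts as 0.
                totals[name] = totals.get(name, 0.0) + (value or 0.0)
    return totals


totals = metric_totals(RESPONSE)
print(totals)
```

With the response above, ShardEvents sums to 27.0 while ShardBulkDocs stays at 0.0, which is the behavior being reported.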

What is the expected behavior?
On the previous Open Distro 1.13.1, this call worked.

What is your host/environment?

OpenSearch 2.6.0 (Docker)

pmarjou added the bug and untriaged labels on Mar 23, 2023
Author

pmarjou commented Mar 23, 2023

In addition, I checked as suggested in @systemosyn's comment on #77 (comment).
The temp files under /dev/shm/performanceanalyzer contain only start events:
[screenshot of the temp files]

Contributor

Tjofil commented Mar 29, 2023

Hey @pmarjou thank you for submitting your findings.

The erroneous response is definitely related to the missing finish events you pointed out, because without them, there's no trigger for the following line and nothing gets persisted inside the SQLite db for later querying.

I reproduced the error using the opensearchproject/opensearch:2.6.0 image and went back to my locally deployed cluster to debug. But there, finish events were present and the request was working as expected.

I built PA and PA-RCA from the 2.6 branch and installed them with our standard build process, which starts from a clean image with the OpenSearch minimal build and installs PA and PA-RCA on top. Again, everything worked well. I first suspected a build problem with PA or PA-RCA in the official image, but found nothing suspicious.

In the end, the only obvious remaining difference between the two setups is the set of plugins installed by default on the standard OpenSearch image that we pull.

By uninstalling them in batches, I found a somewhat convoluted root cause: the presence of the Security plugin.

[opensearch@f440328382f9 ~]$ ls plugins/
opensearch-alerting                   opensearch-neural-search
opensearch-anomaly-detection          opensearch-notifications
opensearch-asynchronous-search        opensearch-notifications-core
opensearch-cross-cluster-replication  opensearch-observability
opensearch-geospatial                 opensearch-performance-analyzer
opensearch-index-management           opensearch-reports-scheduler
opensearch-job-scheduler              opensearch-security
opensearch-knn                        opensearch-security-analytics
opensearch-ml                         opensearch-sql
[opensearch@f440328382f9 ~]$ cd /dev/shm/performanceanalyzer/
[opensearch@f440328382f9 performanceanalyzer]$ cat * | grep shardbulk
^threads/-1/shardbulk/0/start
^threads/-1/shardbulk/1/start
^threads/-1/shardbulk/2/start
^threads/-1/shardbulk/3/start
[opensearch@0f18089877e6 performanceanalyzer]$ ls /usr/share/opensearch/plugins/
opensearch-alerting                   opensearch-neural-search
opensearch-anomaly-detection          opensearch-notifications
opensearch-asynchronous-search        opensearch-notifications-core
opensearch-cross-cluster-replication  opensearch-observability
opensearch-geospatial                 opensearch-performance-analyzer
opensearch-index-management           opensearch-reports-scheduler
opensearch-job-scheduler              opensearch-security-analytics
opensearch-knn                        opensearch-sql
opensearch-ml
[opensearch@0f18089877e6 performanceanalyzer]$ cat * | grep shardbulk
^threads/-1/shardbulk/1/start
^threads/-1/shardbulk/1/finish
^threads/-1/shardbulk/2/start
^threads/-1/shardbulk/2/finish

Above are the recorded shardbulk events of the opensearch setups with and without Security plugin installed, respectively. All expected events are present in the latter. This behavior was consistent across multiple tests.
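
The comparison above can be automated with a short Python sketch that pairs start and finish events. The event lines are copied from the two transcripts; the regular expression and the helper name `unfinished` are assumptions of this sketch, and the second-to-last path component is treated only as an opaque identifier, since the thread does not say what it represents.

```python
import re
from collections import defaultdict

# Event lines copied from the transcripts above.
WITH_SECURITY = [
    "^threads/-1/shardbulk/0/start",
    "^threads/-1/shardbulk/1/start",
    "^threads/-1/shardbulk/2/start",
    "^threads/-1/shardbulk/3/start",
]
WITHOUT_SECURITY = [
    "^threads/-1/shardbulk/1/start",
    "^threads/-1/shardbulk/1/finish",
    "^threads/-1/shardbulk/2/start",
    "^threads/-1/shardbulk/2/finish",
]

# Assumed line shape: ^threads/<thread>/shardbulk/<identifier>/<start|finish>
EVENT = re.compile(r"^\^threads/[^/]+/shardbulk/([^/]+)/(start|finish)$")


def unfinished(lines):
    """Return the identifiers that have a start event but no finish event."""
    seen = defaultdict(set)
    for line in lines:
        match = EVENT.match(line.strip())
        if match:
            seen[match.group(1)].add(match.group(2))
    return sorted(
        ident for ident, kinds in seen.items()
        if "start" in kinds and "finish" not in kinds
    )


print(unfinished(WITH_SECURITY))
print(unfinished(WITHOUT_SECURITY))
```

On a live node the same function could be fed the output of `cat * | grep shardbulk` from /dev/shm/performanceanalyzer; every identifier it returns is a bulk request whose finish event never arrived.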

debug.log
debug2.log

These are the logs of the respective setups (with and without the Security plugin) with the DEBUG option enabled. Nothing obvious stands out at first glance, so I'll have to dig into the details. Comments and suggestions are welcome.

Context:

ShardBulk start and finish events are delivered to the PA plugin through TransportChannel, though in different ways, which would explain why the former works and the latter does not. As usual, channels are reachable through TransportRequestHandlers, which are supplied by TransportInterceptors that are registered inside OpenSearch core during initialization of NetworkModule.

Without the Security plugin installed, the interceptor chain does not include interceptors from Security, the channels from PerformanceAnalyzer are reached successfully, and finish events are emitted. With the Security interceptors registered, my assumption is that the chain somehow gets broken and the handlers from PerformanceAnalyzer are never reached. These are assumptions based on some findings and may not be true. Feedback appreciated.
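
To illustrate how a wrapper chain can produce exactly this symptom (start recorded, finish missing), here is a toy Python model. It is not OpenSearch code and makes no claim about what the Security plugin actually does: `Channel`, `RecordingChannel`, and all three interceptors are hypothetical stand-ins. The recording wrapper logs "start" on receipt and logs "finish" only if the response is sent through the channel it handed downstream.

```python
class Channel:
    """Stand-in for a transport channel: collects the responses sent on it."""

    def __init__(self):
        self.sent = []

    def send_response(self, response):
        self.sent.append(response)


class RecordingChannel(Channel):
    """Wraps a channel and records a 'finish' event when a response is sent."""

    def __init__(self, inner, events):
        super().__init__()
        self.inner = inner
        self.events = events

    def send_response(self, response):
        self.events.append("finish")
        self.inner.send_response(response)


def recording_interceptor(handler, events):
    """Metrics-style wrapper: 'start' on receipt, 'finish' via its channel."""
    def wrapped(request, channel):
        events.append("start")
        handler(request, RecordingChannel(channel, events))
    return wrapped


def passthrough_interceptor(handler):
    """Well-behaved wrapper: forwards the channel it was given."""
    def wrapped(request, channel):
        handler(request, channel)
    return wrapped


def bypassing_interceptor(handler, raw_channel):
    """Hypothetical misbehaving wrapper: replies on a channel captured earlier."""
    def wrapped(request, channel):
        handler(request, raw_channel)
    return wrapped


def core_handler(request, channel):
    channel.send_response(f"ok:{request}")


def run(make_inner):
    """Build recording -> inner -> core and return (events, responses)."""
    events, raw = [], Channel()
    chain = recording_interceptor(make_inner(core_handler, raw), events)
    chain("bulk", raw)
    return events, raw.sent


good = run(lambda handler, raw: passthrough_interceptor(handler))
broken = run(lambda handler, raw: bypassing_interceptor(handler, raw))
print(good)
print(broken)
```

In both runs the client still receives its response, which would match a cluster that indexes documents normally while the metric stays empty. Whether the real chain fails this way is exactly the open question in this comment.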

@praveensameneni
Member

Thank you, @Tjofil, for looking into the issue. Please keep the thread updated.

Collaborator

khushbr commented Apr 7, 2023

@peternied, @scrawfor99 Can you help here? The Security plugin is somehow interfering with the Performance Analyzer listener for shardbulk close events.

@davidlago

When was the last version where this was working successfully? It seems like this has happened before and is now being reopened as still an issue. Did it get fixed back when it was first reported and is now failing again? Or was it never fully fixed?

Contributor

Tjofil commented Apr 7, 2023

@davidlago It was never fully fixed. There were other bugs on the Reader side, like 283, that caused the same effect as this one. We fixed and tested them, unfortunately without the Security plugin, and so didn't catch this one.

Member

peternied commented Apr 7, 2023

@Tjofil I've added a diagram of how access is managed with the Security plugin. We might need more context to know for sure. When PA attempts to write metric information, the context of the request does not have permission to invoke the transport action that saves the metric data. There are three ways this can be resolved:

  • If the context of the request has no bearing on the transport action for writing ShardBulkDocs, PA should use ThreadContext.stashContext() to allow access.
  • If the context of the request is required to have permission to save shard bulk metrics, cluster administrators need to update the settings for the user making the request.
  • The Security plugin adds a special case to ignore these actions: [BUG] Security plugin interfering with Performance Analyzer metric collection security#2658
sequenceDiagram
    participant Client
    participant OpenSearch
    participant SecurityPlugin
    participant Cluster as Plugin
    
    Client->>OpenSearch: Request
    OpenSearch->>SecurityPlugin: Request with no Auth info
    SecurityPlugin->>SecurityPlugin: Add Auth information to request context
    OpenSearch->>Cluster: Client Request
    Cluster->>SecurityPlugin: Execute transport layer action
    SecurityPlugin->>SecurityPlugin: Check if action is allowed
    alt Allowed
        SecurityPlugin->>OpenSearch: Continue request
        OpenSearch-->>Cluster: Transport layer action result
    else Denied
        SecurityPlugin-->>OpenSearch: Return 403 Forbidden
        OpenSearch-->>Client: 403 Forbidden
    end
    alt Plugin run outside user context
    Cluster->>Cluster: Stash context
    Cluster->>SecurityPlugin: Execute transport layer action outside user context
    Cluster-->>SecurityPlugin: Check if action is allowed
    SecurityPlugin->>OpenSearch: Continue request
    OpenSearch-->>Cluster: Transport layer action result
    Cluster->>Cluster: Restore user context
    end
    Cluster-->>SecurityPlugin: Result
    SecurityPlugin-->>OpenSearch: Result
    OpenSearch-->>Client: Result
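
The first option in the list above, which corresponds to the "Plugin run outside user context" branch of the diagram, follows a stash-and-restore pattern. Below is a toy Python model of that pattern, not the OpenSearch API: `ThreadContextModel`, the `PERMISSIONS` table, the user `alice`, and the action name are all hypothetical, and the rule that a call with no user in context is allowed is an assumption that mirrors the diagram rather than the Security plugin's actual logic.

```python
from contextlib import contextmanager

# Hypothetical user and grant, for illustration only.
PERMISSIONS = {"alice": {"indices:data/write/bulk"}}


class ThreadContextModel:
    """Toy stand-in for a per-request context that carries the user."""

    def __init__(self):
        self.headers = {}

    @contextmanager
    def stash_context(self):
        """Swap in an empty context, then restore the original on exit."""
        saved = self.headers
        self.headers = {}
        try:
            yield
        finally:
            self.headers = saved


def is_allowed(ctx, action):
    """Toy permission check used by this sketch."""
    user = ctx.headers.get("user")
    if user is None:
        # Assumption: no user in context means an internal call.
        return True
    return action in PERMISSIONS.get(user, set())


ctx = ThreadContextModel()
ctx.headers["user"] = "alice"
action = "example:metrics/save"  # hypothetical action name

before = is_allowed(ctx, action)      # checked as alice
with ctx.stash_context():
    during = is_allowed(ctx, action)  # checked outside the user context
restored_user = ctx.headers.get("user")

print(before, during, restored_user)
```

The `finally` clause is the important part: the user context comes back even if the stashed call raises, which is the same guarantee the diagram expresses with its "Restore user context" step.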

@pmarjou22

> When was the last version where this was working successfully? It seems like this has happened before and is now being reopened as still an issue. Did it get fixed back when it was first reported and is now failing again? Or was it never fully fixed?

The last time it worked successfully was on "Open Distro 1.13.1" (see #77). It stopped working when moving to the OpenSearch distribution.

khushbr added the v2.8.0 label on Apr 18, 2023

acidul commented Nov 15, 2023

Hello all,
Any update on this issue?
Thanks
