
[QUESTION] Need some more context on cached_response_duration_seconds #443

shivagupta-byte opened this issue Jun 21, 2024 · 7 comments


shivagupta-byte commented Jun 21, 2024

Which feature does your question relate to?
Can I get some more context on what the cached_response_duration_seconds metric is calculating?
From what I understand, it measures the time from when the request was received to when the response was sent, in case of a cache hit.
If that is correct, the cached_response_duration_seconds I'm getting is way higher than I would expect.
[screenshot: cached_response_duration_seconds graph]

Average time is sometimes going over 3s.

@mga-chka (Collaborator)

cached_response_duration_seconds is only the time it took to fetch the query response from the cache.
To get the end-to-end processing time, you'd need to add up the time it took:

  • to start processing the query
  • to fetch the response from the cache (or from ClickHouse, then put it in the cache if it wasn't already there)
  • to send the answer back to the client

You should assess the performance of your Redis to see if it's the bottleneck.
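
For intuition, here's a minimal Go sketch of a metric scoped this way (illustrative only, not chproxy's actual code; fetchFromCache and getFromCache are made-up names). Note that only the cache read sits inside the timed section:

```go
package cache

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cachedResponseDuration times only the cache fetch, not the full
// request/response round trip.
var cachedResponseDuration = prometheus.NewSummary(prometheus.SummaryOpts{
	Name: "cached_response_duration_seconds",
	Help: "Time spent fetching a response from the cache.",
})

// fetchFromCache wraps a cache lookup with the timer; getFromCache is a
// hypothetical stand-in for the real lookup.
func fetchFromCache(key string) ([]byte, error) {
	start := time.Now()
	data, err := getFromCache(key)
	// Request parsing and writing the response back to the client are
	// not included in this observation.
	cachedResponseDuration.Observe(time.Since(start).Seconds())
	return data, err
}

func getFromCache(key string) ([]byte, error) { return nil, nil } // stub
```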

@shivagupta-byte (Author)

[screenshot: Redis latency metrics]
All the calls in Redis are taking less than 50 µs. Redis does not seem to be the bottleneck.

@mga-chka (Collaborator)

Are you storing large results in Redis (more than 1 MB)?
In some edge cases, large results are first written to chproxy's disk before being sent to the client, which might explain why the Redis calls are fast but cached_response_duration_seconds is big. In these edge cases, the measured time covers fetching the result from Redis in many small parts and writing them to disk.
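
A rough sketch of that edge case, assuming go-redis and a made-up spoolToDisk helper (this is not chproxy's actual code):

```go
package cache

import (
	"context"
	"os"

	"github.com/redis/go-redis/v9"
)

const chunkSize = 2 << 20 // 2 MB, the hardcoded bulk size discussed here

// spoolToDisk pulls a large cached result from Redis in fixed-size
// ranges and writes it to a local file before it is served.
func spoolToDisk(ctx context.Context, rdb *redis.Client, key, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	for offset := int64(0); ; offset += chunkSize {
		// Each GETRANGE is cheap on its own, but a 100 MB result still
		// needs ~50 round trips plus 100 MB of disk writes, all of which
		// count toward cached_response_duration_seconds.
		chunk, err := rdb.GetRange(ctx, key, offset, offset+chunkSize-1).Bytes()
		if err != nil {
			return err
		}
		if len(chunk) == 0 {
			break
		}
		if _, err := f.Write(chunk); err != nil {
			return err
		}
	}
	return nil
}
```

This would be consistent with what you're seeing: each individual Redis call stays under 50 µs, while the total of many round trips plus the disk writes shows up in the metric.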

@shivagupta-byte (Author)

Yes, some results are big, more than 1 MB.
I checked the code and saw that we fetch in chunks of 2 MB, and put the data in chunks of 2 MB as well. Any particular reason for this design choice?
Can we increase it to, let's say, 10 MB? It might solve the issue.

@mga-chka (Collaborator)

Here is the rationale for this choice:

If the remaining TTL in Redis is too small, we decided to first put the data into a file before streaming it to the client. This prevents the case where the client is too slow (because it's doing some business logic while consuming the stream, or because its network connection is bad) and the object in Redis expires before the end of the stream, which would lead to an error from chproxy. With this design, the client is guaranteed to get an answer from chproxy: if the transfer into the file fails, the retry mechanism is triggered and the SQL query is run live against ClickHouse, then put into the cache and returned to the client.
To transfer the data from Redis to the file (or from Redis/the file to the client), we decided to do it in a streaming way (in chunks of 2 MB max) to avoid out-of-memory errors when the cached results are too big to fit in the Go runtime's memory. The bigger the chunk size, the bigger the risk of OOM.
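
Sketched out, the decision looks roughly like this (names are illustrative and reuse the hypothetical spoolToDisk from the earlier sketch; this is not chproxy's actual API):

```go
package cache

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

// serveCached: if the key could expire before a slow client finishes
// reading, copy it to a local file first; otherwise stream from Redis.
func serveCached(ctx context.Context, rdb *redis.Client, key string, minTTL time.Duration) error {
	ttl, err := rdb.TTL(ctx, key).Result()
	if err != nil {
		return err
	}
	if ttl >= 0 && ttl < minTTL {
		// The object is close to expiry: spool it to disk now, while it
		// still exists, then serve the file at the client's pace.
		return spoolToDisk(ctx, rdb, key, "/tmp/chproxy-spool")
	}
	// Plenty of TTL left: stream straight from Redis in bounded chunks,
	// so memory stays constant regardless of the result size.
	return streamToClient(ctx, rdb, key)
}

func streamToClient(ctx context.Context, rdb *redis.Client, key string) error { return nil } // stub
```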

If you want, you can open a PR that makes the hardcoded 2 MB chunk size configurable via a configuration variable (keeping 2 MB as the default).
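
A hedged sketch of what that could look like, with made-up field and key names (the real PR should follow chproxy's existing config conventions):

```go
package config

import "gopkg.in/yaml.v2"

// CacheConfig is a hypothetical sketch of the new knob.
type CacheConfig struct {
	// RedisBulkSize would replace the hardcoded 2 MB chunk size used
	// when transferring cached results.
	RedisBulkSize int64 `yaml:"redis_bulk_size,omitempty"`
}

// ParseCacheConfig applies the 2 MB default, then lets YAML override it.
func ParseCacheConfig(data []byte) (*CacheConfig, error) {
	c := &CacheConfig{RedisBulkSize: 2 << 20} // default: 2 MB
	if err := yaml.Unmarshal(data, c); err != nil {
		return nil, err
	}
	return c, nil
}
```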

@shivagupta-byte (Author)

Sure, let me raise that. It might fix the slowness we are facing.

@mga-chka (Collaborator)

OK. If it helps, here is an example of a previous PR that aimed to add a new configuration variable.
