Replace ConcurrentBag<T> with a simpler data structure #3433

Closed
jamescrosswell opened this issue Jun 18, 2024 · 17 comments

Comments

@jamescrosswell
Collaborator

We appear to be using ConcurrentBag<T> in ways it possibly wasn't designed for, and this may be the cause of a memory leak we discovered. We should investigate whether a simpler data structure, one that doesn't rely on ThreadLocals or WorkStealingQueues, could be used instead.

Originally posted by @jamescrosswell in #3432 (comment)
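
For context, here's a small illustration (not SDK code) of why ConcurrentBag<T> carries more state than it might look like for this use case: each thread that ever adds to a bag gets its own work-stealing queue held in a ThreadLocal, and that per-thread, per-bag state stays reachable for roughly as long as the bag does.

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var bag = new ConcurrentBag<int>();

// Adds come from many thread-pool threads, much like one span per request in a web app.
await Task.WhenAll(Enumerable.Range(0, 100).Select(i => Task.Run(() => bag.Add(i))));

// Besides the 100 items, the bag also keeps one work-stealing queue per thread
// that ever added to it, stored in a ThreadLocal - this is the kind of
// ThreadLocal<ConcurrentBag<...>+WorkStealingQueue> object that shows up in memory dumps.
Console.WriteLine(bag.Count); // 100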

@cliedeman

cliedeman commented Jul 9, 2024

We are still experiencing high memory usage related to this.

[image]

@jamescrosswell is there anything we can do in the short term?

@bitsandfoxes
Contributor

We should start looking into something similar to #829
If the SDK knows up front whether a transaction is going to be sampled, then all the potential child spans of an unsampled transaction can be no-ops. That way users regain some control over the memory consumption via the sample rate.
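
To make that concrete, here is a minimal sketch with made-up types (not the SDK's actual API) of what "unsampled transactions hand out no-op child spans" could look like: the sampling decision is made once, up front, and when it is negative every StartChild call returns a shared no-op instance, so nothing accumulates per span.

using System.Collections.Generic;

public interface ISpanLike
{
    ISpanLike StartChild(string operation);
    void Finish();
}

// Shared, stateless no-op span: recording nothing costs (almost) nothing.
public sealed class NoOpSpan : ISpanLike
{
    public static readonly NoOpSpan Instance = new();
    public ISpanLike StartChild(string operation) => Instance;
    public void Finish() { }
}

public sealed class TransactionSketch : ISpanLike
{
    private readonly bool _isSampled;               // decided once, up front
    private readonly List<ISpanLike> _children = new();

    public TransactionSketch(bool isSampled) => _isSampled = isSampled;

    public ISpanLike StartChild(string operation)
    {
        if (!_isSampled)
        {
            // Unsampled: no span object is created or stored.
            return NoOpSpan.Instance;
        }

        var child = new TransactionSketch(isSampled: true);
        _children.Add(child);                       // only sampled traces keep span data
        return child;
    }

    public void Finish() { /* only sampled transactions get sent to Sentry */ }
}

The real change would also need to cover span IDs, timings and the OpenTelemetry integration, but the memory effect is the point: unsampled traces stop holding onto child spans entirely.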

@jamescrosswell
Collaborator Author

@jamescrosswell is there anything we can do in the short term?

@cliedeman as far as I know, the original memory leak this ticket relates to was fixed in:

The purpose of this ticket was to explore a more elegant solution.

We know of a memory leak in Sentry.Profiling (which is still not GA) but otherwise it should all be airtight.

If you're seeing memory growing monotonically and you're not using Sentry.Profiling, then we'd definitely like to dig deeper. The first step for us is always reproducing the problem, so if you can share a small sample project that recreates the issue, we can look into it.

@cliedeman

cliedeman commented Jul 10, 2024

Hi @jamescrosswell

Profiling is not enabled.

I don't believe it's growing monotonically. We haven't found a pattern - I think this could possibly be because of an increase in exceptions, but these exceptions are filtered SDK-side. We had one issue where 3 OOM errors followed about an hour apart.

[image]

Thanks
Ciaran

@jamescrosswell
Collaborator Author

I don't believe it's growing monotonically. We haven't found a pattern - I think this could possibly be because of an increase in exceptions, but these exceptions are filtered SDK-side. We had one issue where 3 OOM errors followed about an hour apart.

Ah, in that case it may be related to @bitsandfoxes's comment. In the dump you've shown above, the largest consumer of memory appears to be a ConcurrentBag that is storing trace information... so TransactionTracer and SpanTracer instances. You'd have to be creating an absolute tonne of these to get an OOM exception though.

The other possibility is the MemoryCache shown in your dump.

You'd really need to capture a dump just before an OOM exception was thrown to be sure which was the culprit though. It may be that in normal circumstances Sentry's ConcurrentBag is allocating relatively more memory than other stuff in the process, but in the circumstances leading up to your OOM exception it's something else entirely.

I put together an experiment a while back with some code that you could potentially include in your application to capture the dump at the right time (just before the OOM).
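
For anyone who wants to try something along those lines, a rough sketch (not the actual experiment code) could look like the following: a background task polls the process's working set and writes a heap dump via the Microsoft.Diagnostics.NETCore.Client package once a chosen threshold is crossed. The threshold and dump path are placeholders to tune for your environment.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;

public static class OomDumpMonitor
{
    // Polls the current process's working set and writes a heap dump the first
    // time it exceeds thresholdBytes, so the dump reflects the state just
    // before an out-of-memory crash rather than after it.
    public static Task StartAsync(long thresholdBytes, string dumpPath, CancellationToken ct = default)
    {
        return Task.Run(async () =>
        {
            var process = Process.GetCurrentProcess();
            while (!ct.IsCancellationRequested)
            {
                process.Refresh();
                if (process.WorkingSet64 > thresholdBytes)
                {
                    var client = new DiagnosticsClient(process.Id);
                    client.WriteDump(DumpType.WithHeap, dumpPath);
                    break;
                }
                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        });
    }
}

You would kick this off somewhere during startup, e.g. OomDumpMonitor.StartAsync(350L * 1024 * 1024, "/tmp/before-oom.dmp") - both values are just examples.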

@cliedeman

These are the Azure crash monitor dumps, taken at crash time.

@jamescrosswell
Collaborator Author

jamescrosswell commented Jul 11, 2024

These are the Azure crash monitor dumps, taken at crash time.

And was that an OOM crash? So you've got less than 400MB of memory available to the process?

@cliedeman

Unsure. The server definitely has more memory. There is no increase in server error rates or request rates. I'll scrutinize the dumps later to try and see why there are so many spans, because that doesn't make sense to me. I am also willing to share the dumps.

Also, the server is 32-bit and the OOM could be caused by a collection hitting its maximum size, not just the heap size.

@cliedeman

Let me rephrase my thinking.

Our application is crashing with OOM. There are no obvious suspects. Sentry is the highest memory user according to the dumps.

There are several issues related to memory on this repo. I would like to at least eliminate Sentry from the causes.

I could turn off Sentry and observe whether the issues stop, but Sentry is really useful.

@bitsandfoxes
Contributor

We will prioritize getting on top of the SDK's memory consumption.
@jamescrosswell has a PR with an optimization experiment: #3434
@cliedeman would you be able/willing to run an alpha build and see if that does anything to reduce this issue?

@cliedeman

Hi @bitsandfoxes

Yes happy to do so.

@jamescrosswell
Collaborator Author

Hi @bitsandfoxes

Yes happy to do so.

@cliedeman apologies for the delay. I had to travel at short notice and my Windows development machine is a mini PC (not a notebook) so I only had my Mac with me... which means I've only just been able to get that branch tidied up for testing in the last couple of days since returning.

In any event, if you are still able to help test this, version 4.9.0-sync.collection.2 of Sentry is available on NuGet. It's basically a fork of the main repo that uses a SynchronizedCollection<T> instead of a ConcurrentBag<T> to store all of the spans related to a trace (essentially all of the changes in #3434).

It'd be great to know whether this has any impact on the memory utilisation in your scenario.
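
For anyone curious, the shape of that replacement is roughly the following (a sketch, not the actual code from #3434): a plain List<T> where every operation takes a single lock, so the only extra state is one lock object - no ThreadLocal or per-thread work-stealing queues.

using System.Collections;
using System.Collections.Generic;

internal sealed class SynchronizedListSketch<T> : IEnumerable<T> // hypothetical name, for illustration
{
    private readonly List<T> _items = new();
    private readonly object _sync = new();

    public void Add(T item)
    {
        lock (_sync) { _items.Add(item); }
    }

    public int Count
    {
        get { lock (_sync) { return _items.Count; } }
    }

    // Enumerates over a snapshot so readers never race with writers.
    public IEnumerator<T> GetEnumerator()
    {
        T[] snapshot;
        lock (_sync) { snapshot = _items.ToArray(); }
        return ((IEnumerable<T>)snapshot).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}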

@Julian-Robinson

Julian-Robinson commented Aug 9, 2024

Hi @jamescrosswell

Thanks for your efforts in troubleshooting this. We've also got a memory leak in our app and our leading suspect is Sentry - we haven't been able to pinpoint the precise cause, but we found this issue and ran your experimental release this morning, so wanted to report back with our findings.

Frontend SPA with

  • @sentry/react: ^8.14.0

Backend API running ASP.NET 8 with

  • Sentry.AspNetCore v4.2.1
  • Sentry.OpenTelemetry v4.2.1
  • Sentry.Serilog v4.2.1

Configuration:

  "Sentry": {
    "MinimumBreadcrumbLevel": "Debug",
    "MinimumEventLevel": "Error",
    "Dsn": "...",
    "EnableTracing": true,
    "MaxRequestBodySize": "None",
    "IncludeActivityData": true,
    "AttachStackTrace": true,
    "DiagnosticLevel": "Error"
  },

Our experience has been:

  • We have been running the configuration above fine for several weeks.
    • Memory was stable.
    • We were not applying any sampling, all traces were being captured.
  • We fixed some Sentry configuration in our frontend - one of these changes was applying tracesSampleRate: 0.1
    • From this point we started noticing a steady increase in memory in our backend containers.
  • We got a few memory dumps from the containers and ran finalizequeue against them to troubleshoot.
    • The results showed a large number of Sentry-related objects, and the suspicious ConcurrentBag you're replacing here.
7f52855043d8 10,351   414,040 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<Sentry.Breadcrumb>+WorkStealingQueue>
7f5285504720 10,351   414,040 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<Sentry.ISpan>+WorkStealingQueue>
7f528523c4b0 11,564   462,560 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<Sentry.SentryPackage>+WorkStealingQueue>
7f528523c7f8 11,564   462,560 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<System.String>+WorkStealingQueue>
7f527d9a3278 25,354 1,825,488 System.Reflection.Emit.DynamicResolver
  • We upgraded the backend to Sentry 4.9.0.
    • The memory leaks continued; however, the pattern did change slightly, becoming more erratic.
  • We ran this experimental 4.9.0-sync.collection.2 version.
    • We only ran this for an hour or so, but it didn't seem to have any impact, memory continued to climb at the same rate.

Screenshot below shows our min/max/average memory utilisation over the last two days. Red lines indicate deployments where the version was changed. x = experimental version.

[Image]

We then disabled Sentry entirely to verify whether it is the cause, and (early days) so far memory is remaining much flatter. Unfortunately I didn't think to grab a memory dump from the experimental containers until now - sorry.

[Image]

Our plan from here is to incrementally re-enable Sentry features to see when things start creeping up again, but the leading contender for us is something related to the sampling rate and unsampled traces (see the sketch after this list).

  • Re-enable Sentry Serilog sink for issue reports.
  • Re-enable tracing with 100% sample rate.
  • Reintroduce the reduced sample rate.
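
A rough sketch of what steps 2 and 3 could look like on the backend if the options were set in code rather than appsettings.json (assumed wiring - adjust to however your app bootstraps Sentry; the Serilog sink from step 1 is configured separately via Sentry.Serilog). The knob that matters for span memory is TracesSampleRate:

// Program.cs (ASP.NET Core) - assumed setup, not taken from the app in question.
var builder = WebApplication.CreateBuilder(args);

builder.WebHost.UseSentry(options =>
{
    options.Dsn = "...";               // same DSN as in appsettings.json

    // Step 2: re-enable tracing with a 100% sample rate.
    options.TracesSampleRate = 1.0;

    // Step 3: once memory looks stable, reintroduce the reduced rate.
    // options.TracesSampleRate = 0.1;
});

var app = builder.Build();
app.Run();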

Hope that's useful - we can probably re-run the experimental version again if that would be worthwhile, but if there's anything else that would help, let me know.

@jamescrosswell
Collaborator Author

jamescrosswell commented Aug 12, 2024

Thanks for providing so much info @Julian-Robinson !

We ran this experimental 4.9.0-sync.collection.2 version.
We only ran this for an hour or so, but it didn't seem to have any impact, memory continued to climb at the same rate.

This, in particular, really helps - thank you.

In the image you provided showing memory utilisation, it doesn't look like the memory is growing monotonically. It looks like it's getting cleaned up regularly by garbage collection and like average memory utilisation could fluctuate between 20-60% indefinitely. It might be good to get memory utilisation down lower than that. Sentry's tracing functionality, in particular, would ideally use less memory. However, we're not aware of any memory "leaks" in the tracing functionality...

Essentially I'm trying to identify whether in your environment you have a memory leak (memory keeps growing until you eventually get an OOM exception) or if it's more a case of Sentry consuming more memory than you would expect/like when tracing is enabled.

@Julian-Robinson

Hey @jamescrosswell

We were seeing OOM errors with all versions. 4.9.0 definitely looked like it released memory a bit more readily than earlier versions, but the general trend across the board was that memory usage continued to grow over time until the container crashed. My graphs probably needed a bit more context though.

We're running several instances on ECS and the charts show the min/avg/max memory across all containers - not just a single instance. Our cluster spins up new containers when memory gets above 60%. Any time the blue line spikes down to 0, that's an indication of a new container spinning up.
The large drops in max memory are due to a container crashing and the max value dropping to the next highest maximum. In some cases the drop isn't huge because the next container has a similar amount of used memory - typically if they were started around the same time.
To muddy the graphs even more, there are a few deployments in there which roll all containers and start everything fresh. The min/max/avg values all climb up together after a deploy.

In other news, over the weekend we had the Sentry Serilog sink re-enabled - without any tracing - and the memory profile has been rock solid. Hoping to try and test it out with 100% profiling today.

@jamescrosswell
Collaborator Author

Aha, thanks @Julian-Robinson - that makes much more sense. I see now why min/avg/max all track together initially then (and indeed it's all in the upward direction when that happens).

Hoping to try and test it out with 100% profiling today.

Just so you're aware, Profiling in the .NET SDK is still under development and there is a known memory leak if you enable profiling:

With profiling disabled, I haven't been able to reproduce any memory leaks however 😦

@jamescrosswell
Collaborator Author

Since replacing ConcurrentBag<T> in the SDK doesn't appear to have any impact on memory issues, we're closing this particular issue.

We've created this new issue to continue investigation into memory issues when tracing is enabled:

If you are experiencing memory problems with the Sentry SDK and are able to help us reliably reproduce these, please add a comment to issue #3550.
