Replace ConcurrentBag<T> with a simpler data structure #3433

Closed
jamescrosswell opened this issue Jun 18, 2024 · 17 comments

Comments

@jamescrosswell
Collaborator

We appear to be using ConcurrentBag<T> in ways it possibly wasn't designed for, and this may be the cause of a memory leak we discovered. We should investigate whether a simpler data structure, one that doesn't rely on ThreadLocals or WorkStealingQueues, could be used instead.

Originally posted by @jamescrosswell in #3432 (comment)
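
For context, here's a small illustration (not SDK code) of why ConcurrentBag<T> carries more state than it might look like for this use case: each thread that ever adds to a bag gets its own work-stealing queue held in a ThreadLocal, and that per-thread, per-bag state stays reachable for roughly as long as the bag does.

using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

var bag = new ConcurrentBag<int>();

// Adds come from many thread-pool threads, much like one span per request in a web app.
await Task.WhenAll(Enumerable.Range(0, 100).Select(i => Task.Run(() => bag.Add(i))));

// Besides the 100 items, the bag also keeps one work-stealing queue per thread
// that ever added to it, stored in a ThreadLocal - this is the kind of
// ThreadLocal<ConcurrentBag<...>+WorkStealingQueue> object that shows up in memory dumps.
Console.WriteLine(bag.Count); // 100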

@cliedeman

cliedeman commented Jul 9, 2024

We are still experiencing high memory usage related to this.

[image]

@jamescrosswell is there anything we can do in the short term?

@bitsandfoxes
Contributor

We should start looking into something similar to #829
If the SDK knows up front whether a transaction is going to be sampled, then all the potential child spans of an unsampled transaction can be no-ops. That way users regain some control over the memory consumption via the sample rate.
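
To make that concrete, here is a minimal sketch with made-up types (not the SDK's actual API) of what "unsampled transactions hand out no-op child spans" could look like: the sampling decision is made once, up front, and when it is negative every StartChild call returns a shared no-op instance, so nothing accumulates per span.

using System.Collections.Generic;

public interface ISpanLike
{
    ISpanLike StartChild(string operation);
    void Finish();
}

// Shared, stateless no-op span: recording nothing costs (almost) nothing.
public sealed class NoOpSpan : ISpanLike
{
    public static readonly NoOpSpan Instance = new();
    public ISpanLike StartChild(string operation) => Instance;
    public void Finish() { }
}

public sealed class TransactionSketch : ISpanLike
{
    private readonly bool _isSampled;               // decided once, up front
    private readonly List<ISpanLike> _children = new();

    public TransactionSketch(bool isSampled) => _isSampled = isSampled;

    public ISpanLike StartChild(string operation)
    {
        if (!_isSampled)
        {
            // Unsampled: no span object is created or stored.
            return NoOpSpan.Instance;
        }

        var child = new TransactionSketch(isSampled: true);
        _children.Add(child);                       // only sampled traces keep span data
        return child;
    }

    public void Finish() { /* only sampled transactions get sent to Sentry */ }
}

The real change would also need to cover span IDs, timings and the OpenTelemetry integration, but the memory effect is the point: unsampled traces stop holding onto child spans entirely.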

@jamescrosswell
Collaborator Author

@jamescrosswell is there anything we can do in the short term?

@cliedeman as far as I know, the original memory leak this ticket relates to was fixed in:

The purpose of this ticket was to explore a more elegant solution.

We know of a memory leak in Sentry.Profiling (which is still not GA) but otherwise it should all be airtight.

If you're seeing memory growing monotonically and you're not using Sentry.Profiling, then we'd definitely like to dig deeper. The first step for us is always reproducing the problem, so if you can share a small sample project that recreates the issue, we can look into it.

@cliedeman

cliedeman commented Jul 10, 2024

Hi @jamescrosswell

Profiling is not enabled.

I don't believe it's growing monotonically. We haven't found a pattern - I think this could possibly be because of an increase in exceptions, but these exceptions are filtered SDK-side. We had one issue where 3 OOM errors followed about an hour apart.

[image]

Thanks
Ciaran

@jamescrosswell
Collaborator Author

I don't believe it's growing monotonically. We haven't found a pattern - I think this could possibly be because of an increase in exceptions, but these exceptions are filtered SDK-side. We had one issue where 3 OOM errors followed about an hour apart.

Ah, in that case it may be related to @bitsandfoxes's comment. In the dump you've shown above, the largest consumer of memory appears to be a ConcurrentBag that is storing trace information... so TransactionTracer and SpanTracer instances. You'd have to be creating an absolute tonne of these to get an OOM exception though.

The other possibility is the MemoryCache shown in your dump.

You'd really need to capture a dump just before an OOM exception was thrown to be sure which was the culprit though. It may be that in normal circumstances Sentry's ConcurrentBag is allocating relatively more memory than other stuff in the process, but in the circumstances leading up to your OOM exception it's something else entirely.

I put together an experiment a while back with some code that you could potentially include in your application to capture the dump at the right time (just before the OOM).
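
For anyone who wants to try something along those lines, a rough sketch (not the actual experiment code) could look like the following: a background task polls the process's working set and writes a heap dump via the Microsoft.Diagnostics.NETCore.Client package once a chosen threshold is crossed. The threshold and dump path are placeholders to tune for your environment.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;

public static class OomDumpMonitor
{
    // Polls the current process's working set and writes a heap dump the first
    // time it exceeds thresholdBytes, so the dump reflects the state just
    // before an out-of-memory crash rather than after it.
    public static Task StartAsync(long thresholdBytes, string dumpPath, CancellationToken ct = default)
    {
        return Task.Run(async () =>
        {
            var process = Process.GetCurrentProcess();
            while (!ct.IsCancellationRequested)
            {
                process.Refresh();
                if (process.WorkingSet64 > thresholdBytes)
                {
                    var client = new DiagnosticsClient(process.Id);
                    client.WriteDump(DumpType.WithHeap, dumpPath);
                    break;
                }
                await Task.Delay(TimeSpan.FromSeconds(5));
            }
        });
    }
}

You would kick this off somewhere during startup, e.g. OomDumpMonitor.StartAsync(350L * 1024 * 1024, "/tmp/before-oom.dmp") - both values are just examples.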

@cliedeman

These are the Azure crash monitor dumps, taken at crash time.

@jamescrosswell
Collaborator Author

jamescrosswell commented Jul 11, 2024

These are the Azure crash monitor dumps, taken at crash time.

And was that an OOM crash? So you've got less than 400MB of memory available to the process?

@cliedeman

Unsure. The server definitely has more memory. There is no increase in server error rates or request rates. I'll scrutinize the dumps later to try and see why there are so many spans, because that doesn't make sense to me. I am also willing to share the dumps.

Also, the server is 32-bit and the OOM could be caused by a collection hitting its maximum size, not just the heap size.

@cliedeman

Let me rephrase my thinking.

Our application is crashing with OOM. There are no obvious suspects. Sentry is the highest memory user according to the dumps.

There are several issues related to memory on this repo. I would like to at least eliminate Sentry from the causes.

I could turn off Sentry and observe whether the issues stop, but Sentry is really useful.

@bitsandfoxes
Contributor

We will prioritize getting on top of the SDK's memory consumption.
@jamescrosswell has a PR with an optimization experiment: #3434
@cliedeman would you be able/willing to run an alpha build and see if that does anything to reduce this issue?

@cliedeman

Hi @bitsandfoxes

Yes happy to do so.

@jamescrosswell
Collaborator Author

Hi @bitsandfoxes

Yes happy to do so.

@cliedeman apologies for the delay. I had to travel at short notice and my Windows development machine is a mini PC (not a notebook) so I only had my Mac with me... which means I've only just been able to get that branch tidied up for testing in the last couple of days since returning.

In any event, if you are still able to help test this, version 4.9.0-sync.collection.2 of Sentry is available on NuGet. It's basically a fork of the main repo that uses a SynchronizedCollection<T> instead of a ConcurrentBag<T> to store all of the spans related to a trace (essentially all of the changes in #3434).

It'd be great to know whether this has any impact on the memory utilisation in your scenario.
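
For anyone curious, the shape of that replacement is roughly the following (a sketch, not the actual code from #3434): a plain List<T> where every operation takes a single lock, so the only extra state is one lock object - no ThreadLocal or per-thread work-stealing queues.

using System.Collections;
using System.Collections.Generic;

internal sealed class SynchronizedListSketch<T> : IEnumerable<T> // hypothetical name, for illustration
{
    private readonly List<T> _items = new();
    private readonly object _sync = new();

    public void Add(T item)
    {
        lock (_sync) { _items.Add(item); }
    }

    public int Count
    {
        get { lock (_sync) { return _items.Count; } }
    }

    // Enumerates over a snapshot so readers never race with writers.
    public IEnumerator<T> GetEnumerator()
    {
        T[] snapshot;
        lock (_sync) { snapshot = _items.ToArray(); }
        return ((IEnumerable<T>)snapshot).GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}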

@Julian-Robinson

Julian-Robinson commented Aug 9, 2024

Hi @jamescrosswell

Thanks for your efforts in troubleshooting this. We've also got a memory leak in our app and our leading suspect is Sentry - we haven't been able to pinpoint the precise cause, but we found this issue and ran your experimental release this morning, so wanted to report back with our findings.

Frontend SPA with

  • @sentry/react: ^8.14.0

Backend API running ASP.NET 8 with

  • Sentry.AspNetCore v4.2.1
  • Sentry.OpenTelemetry v4.2.1
  • Sentry.Serilog v4.2.1

Configuration:

  "Sentry": {
    "MinimumBreadcrumbLevel": "Debug",
    "MinimumEventLevel": "Error",
    "Dsn": "...",
    "EnableTracing": true,
    "MaxRequestBodySize": "None",
    "IncludeActivityData": true,
    "AttachStackTrace": true,
    "DiagnosticLevel": "Error"
  },

Our experience has been:

  • We have been running the configuration above fine for several weeks.
    • Memory was stable.
    • We were not applying any sampling, all traces were being captured.
  • We fixed some Sentry configuration in our frontend - one of these changes was applying tracesSampleRate: 0.1
    • From this point we started noticing a steady increase in memory in our backend containers.
  • We got a few memory dumps from the containers and ran finalizequeue against them to troubleshoot.
    • The results showed a large number of Sentry-related objects, and the suspicious ConcurrentBag you're replacing here.
7f52855043d8 10,351   414,040 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<Sentry.Breadcrumb>+WorkStealingQueue>
7f5285504720 10,351   414,040 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<Sentry.ISpan>+WorkStealingQueue>
7f528523c4b0 11,564   462,560 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<Sentry.SentryPackage>+WorkStealingQueue>
7f528523c7f8 11,564   462,560 System.Threading.ThreadLocal<System.Collections.Concurrent.ConcurrentBag<System.String>+WorkStealingQueue>
7f527d9a3278 25,354 1,825,488 System.Reflection.Emit.DynamicResolver
  • We upgraded the backend to Sentry 4.9.0.
    • The memory leaks continued; however, the pattern did change slightly, becoming more erratic.
  • We ran this experimental 4.9.0-sync.collection.2 version.
    • We only ran this for an hour or so, but it didn't seem to have any impact, memory continued to climb at the same rate.

Screenshot below shows our min/max/average memory utilisation over the last two days. Red lines indicate deployments where the version was changed. x = experimental version.

[Image]

We then disabled Sentry entirely to verify whether it is the cause, and (early days) so far memory is remaining much flatter. Unfortunately I didn't think to grab a memory dump from the experimental containers until now - sorry.

[Image]

Our plan from here is to incrementally re-enable Sentry features to see when things start creeping up again, but the leading contender for us is something related to the sampling rate and unsampled traces (see the sketch after this list).

  • Re-enable Sentry Serilog sink for issue reports.
  • Re-enable tracing with 100% sample rate.
  • Reintroduce the reduced sample rate.
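
A rough sketch of what steps 2 and 3 could look like on the backend if the options were set in code rather than appsettings.json (assumed wiring - adjust to however your app bootstraps Sentry; the Serilog sink from step 1 is configured separately via Sentry.Serilog). The knob that matters for span memory is TracesSampleRate:

// Program.cs (ASP.NET Core) - assumed setup, not taken from the app in question.
var builder = WebApplication.CreateBuilder(args);

builder.WebHost.UseSentry(options =>
{
    options.Dsn = "...";               // same DSN as in appsettings.json

    // Step 2: re-enable tracing with a 100% sample rate.
    options.TracesSampleRate = 1.0;

    // Step 3: once memory looks stable, reintroduce the reduced rate.
    // options.TracesSampleRate = 0.1;
});

var app = builder.Build();
app.Run();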

Hope that's useful - we can probably re-run the experimental version again if that would be worthwhile, but if there's anything else that would help, let me know.

@jamescrosswell
Collaborator Author

jamescrosswell commented Aug 12, 2024

Thanks for providing so much info @Julian-Robinson !

We ran this experimental 4.9.0-sync.collection.2 version.
We only ran this for an hour or so, but it didn't seem to have any impact, memory continued to climb at the same rate.

This, in particular, really helps - thank you.

In the image you provided showing memory utilisation, it doesn't look like the memory is growing monotonically. It looks like it's getting cleaned up regularly by garbage collection and like average memory utilisation could fluctuate between 20-60% indefinitely. It might be good to get memory utilisation down lower than that. Sentry's tracing functionality, in particular, would ideally use less memory. However, we're not aware of any memory "leaks" in the tracing functionality...

Essentially I'm trying to identify whether in your environment you have a memory leak (memory keeps growing until you eventually get an OOM exception) or if it's more a case of Sentry consuming more memory than you would expect/like when tracing is enabled.

@Julian-Robinson

Hey @jamescrosswell

We were seeing OOM errors with all versions. 4.9.0 definitely looked like it released memory a bit more readily than earlier versions, but the general trend across the board was that memory usage continued to grow over time until the container crashed. My graphs probably needed a bit more context though.

We're running several instances on ECS and the charts show the min/avg/max memory across all containers - not just a single instance. Our cluster spins up new containers when memory gets above 60%. Any time the blue line spikes down to 0, that's an indication of a new container spinning up.
The large drops in max memory are due to a container crashing and the max value dropping to the next highest maximum. In some cases the drop isn't huge because the next container has a similar amount of used memory - typically if they were started around the same time.
To muddy the graphs even more, there are a few deployments in there which roll all containers and start everything fresh. The min/max/avg values all climb up together after a deploy.

In other news, over the weekend we had the Sentry Serilog sink re-enabled - without any tracing - and the memory profile has been rock solid. Hoping to try and test it out with 100% profiling today.

@jamescrosswell
Collaborator Author

Aha, thanks @Julian-Robinson - that makes much more sense. I see now why min/avg/max all track together initially then (and indeed it's all in the upward direction when that happens).

Hoping to try and test it out with 100% profiling today.

Just so you're aware, Profiling in the .NET SDK is still under development and there is a known memory leak if you enable profiling:

With profiling disabled, I haven't been able to reproduce any memory leaks however 😦

@jamescrosswell
Collaborator Author

Since replacing ConcurrentBag<T> in the SDK doesn't appear to have any impact on memory issues, we're closing this particular issue.

We've created this new issue to continue investigation into memory issues when tracing is enabled:

If you are experiencing memory problems with the Sentry SDK and are able to help us reliably reproduce these, please add a comment to issue #3550.
