Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(tracing): avoid assigning span's local root to self, so that the python GC doesn't kick in #10809

Merged
merged 17 commits into from
Sep 26, 2024

Conversation

erikayasuda
Copy link
Contributor

@erikayasuda erikayasuda commented Sep 25, 2024

<<Description copied from the original PR>>

We (Modal Labs) noticed that enabling tracing in parts of our code caused the Python GC to kick in a lot and add large delays to our event loop. A simple repro (see test) showed us that there was a self reference cycle in span when a top-level span's local root was assigned to the span itself, preventing Python from cleaning up the objects by reference counting.

The change itself is pretty straightforward - we use None to represent when a span's local root is itself. The span still exposes _local_root with the same behavior via a property, and internally stores a _local_root_value that holds a non-self-referential span. This also means that the tracer no longer needs to explicitly set the local root for top-level spans. This should be fairly non-risky as it doesn't change any code behavior.

Checklist

  • PR author has checked that all the criteria below are met
  • The PR description includes an overview of the change
  • The PR description articulates the motivation for the change
  • The change includes tests OR the PR description describes a testing strategy
  • The PR description notes risks associated with the change, if any
  • Newly-added code is easy to change
  • The change follows the library release note guidelines
  • The change includes or references documentation updates if necessary
  • Backport labels are set (if applicable)

Reviewer Checklist

  • Reviewer has checked that all the criteria below are met
  • Title is accurate
  • All changes are related to the pull request's stated goal
  • Avoids breaking API changes
  • Testing strategy adequately addresses listed risks
  • Newly-added code is easy to change
  • Release note makes sense to a user of the library
  • If necessary, author has acknowledged and discussed the performance implications of this PR as reported in the benchmarks PR comment
  • Backport labels are set in a manner that is consistent with the release branch maintenance policy

Copy link
Contributor

github-actions bot commented Sep 25, 2024

CODEOWNERS have been resolved as:

releasenotes/notes/fix-span-unnecessary-gc.yaml                         @DataDog/apm-python
ddtrace/_trace/span.py                                                  @DataDog/apm-sdk-api-python
ddtrace/_trace/tracer.py                                                @DataDog/apm-sdk-api-python
ddtrace/propagation/http.py                                             @DataDog/apm-sdk-api-python
tests/tracer/test_tracer.py                                             @DataDog/apm-sdk-api-python

@datadog-dd-trace-py-rkomorn
Copy link

datadog-dd-trace-py-rkomorn bot commented Sep 25, 2024

Datadog Report

Branch report: modal-labs/local_root_gc_issue
Commit report: c5d94a7
Test service: dd-trace-py

✅ 0 Failed, 1286 Passed, 0 Skipped, 37m 29.69s Total duration (3.77s time saved)

@pr-commenter
Copy link

pr-commenter bot commented Sep 25, 2024

Benchmarks

Benchmark execution time: 2024-09-25 21:47:32

Comparing candidate commit c5d94a7 in PR branch modal-labs/local_root_gc_issue with baseline commit 057bb26 in branch main.

Found 8 performance improvements and 2 performance regressions! Performance is the same for 361 metrics, 53 unstable metrics.

scenario:iast_aspects-aspect_iast_do_lower

  • 🟩 max_rss_usage [-2.463MB; -2.082MB] or [-8.412%; -7.113%]

scenario:iast_aspects-aspect_iast_do_operator_add_params

  • 🟥 max_rss_usage [+2.581MB; +2.878MB] or [+9.704%; +10.820%]

scenario:iast_aspects-aspect_iast_do_replace

  • 🟥 max_rss_usage [+2.560MB; +2.926MB] or [+9.632%; +11.008%]

scenario:iast_aspects-aspect_iast_do_upper

  • 🟩 max_rss_usage [-2.407MB; -2.082MB] or [-8.219%; -7.107%]

scenario:otelspan-add-metrics

  • 🟩 max_rss_usage [-9.186MB; -8.960MB] or [-26.196%; -25.551%]

scenario:otelspan-add-tags

  • 🟩 max_rss_usage [-9.143MB; -8.891MB] or [-26.098%; -25.379%]

scenario:span-add-metrics

  • 🟩 max_rss_usage [-19.217MB; -18.986MB] or [-38.406%; -37.942%]

scenario:span-add-tags

  • 🟩 max_rss_usage [-5.704MB; -5.477MB] or [-18.982%; -18.226%]

scenario:span-start

  • 🟩 execution_time [-2.508ms; -2.077ms] or [-12.530%; -10.374%]

scenario:span-start-traceid128

  • 🟩 execution_time [-2.514ms; -2.082ms] or [-12.256%; -10.149%]

Copy link
Member

@brettlangdon brettlangdon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@erikayasuda and I also discussed another direction we could take.

Since each span holds a reference to it's parent, any time we need to access the "local root" we can just traverse this linked list back to the root span (span whose _parent is None) to get the local root as well.

In extreme cases a "tall" "local" trace will only be 20-30 spans deep, so the overhead of traversing this should not be too expensive (citation/benchmarks needed).

With this approach, def _local_root() -> "Span": could be a simple traversal, and then we can avoid ever needing to "set" local root, and passing along span._local_root = parent._local_root, etc

This is not a required change, but an observation.

ddtrace/_trace/span.py Outdated Show resolved Hide resolved
tests/tracer/test_tracer.py Show resolved Hide resolved
ddtrace/propagation/http.py Show resolved Hide resolved
@erikayasuda erikayasuda marked this pull request as ready for review September 25, 2024 20:18
Copy link
Contributor

@taegyunkim taegyunkim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From profiling, we need to propagate local_root.span_id and local_root.span_type and this PR will change the behavior by returning early.

  1. if span._local_root is not None:
  2. if span._local_root == span and span.span_type == SpanTypes.WEB:

I hope profiling tests catch those, if not I'm happy to contribute tests.

Oh wait, I think you've written the code in a way to handle these, nvm.

Copy link
Contributor

@taegyunkim taegyunkim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These also need to be updated too, right?

  1. return span._local_root._get_ctx_item(data_key)
  2. root = span._local_root
  3. ddtrace.tracer.sample(dbspan._local_root)

@erikayasuda
Copy link
Contributor Author

From profiling, we need to propagate local_root.span_id and local_root.span_type and this PR will change the behavior by returning early.

@taegyunkim All tests were passing in a prior commit (no functionality was changed, just some mypy nits). Is the concern here that _local_root will return None instead of the actual Span? Given the property getter/setter they added, the behavior of calling _local_root on a Span should still be the same, we just have a new slot to prevent the cyclic reference. If the issue is elsewhere, we'll definitely take up your offer to contribute to add the necessary profiling tests 🙏

@taegyunkim
Copy link
Contributor

@erikayasuda Re-read your code and our tests. We do have tests checking local_root.span_id and local_root.span_type are propagated and they all pass, so should be ok.

local_root_span_id=local_root_span_id,
trace_type=span_type,

Thanks for confirming!

Copy link
Contributor

@taegyunkim taegyunkim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work

@mabdinur
Copy link
Contributor

@erikayasuda and I also discussed another direction we could take.

Since each span holds a reference to it's parent, any time we need to access the "local root" we can just traverse this linked list back to the root span (span whose _parent is None) to get the local root as well.

In extreme cases a "tall" "local" trace will only be 20-30 spans deep, so the overhead of traversing this should not be too expensive (citation/benchmarks needed).

With this approach, def _local_root() -> "Span": could be a simple traversal, and then we can avoid ever needing to "set" local root, and passing along span._local_root = parent._local_root, etc

This is not a required change, but an observation.

This is a bit off topic but with partial flushing the root span can be finished, encoded, and sent to the agent while a child span is still active. Since spans are only encoded once, any modifications made to a finished Span may not be reflected in the payload sent to the Datadog Agent.

We can end up in a scenario where ddtrace.tracer.current_root_span() returns a span that has ended, the span is updated (ex: tags are set) but those changes are never reflected in the Datadog UI. In most cases child spans are finished before parent spans but parent/root spans finishing before a child span is a common scenario in async executions.

Should we solve this issue in this PR? Should we only keep a reference to local root spans that have not been encoded and sent to the Datadog Agent?

Copy link
Contributor

@mabdinur mabdinur left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The memory usage benchmark scenarios look very promising. Can we also include screen captures of the results from the flaskrealworld app in the microbenchmarking platform?

@erikayasuda
Copy link
Contributor Author

erikayasuda commented Sep 25, 2024

@mabdinur

Should we only keep a reference to local root spans that have not been encoded and sent to the Datadog Agent?

Just to confirm, this is already the current behavior right? Or would we be introducing this new behavior of keeping references to a finished root span? If it's the latter, then I'd prefer to address it in this PR. Otherwise, I can own a separate PR to address that issue.

@mabdinur
Copy link
Contributor

@mabdinur

Should we only keep a reference to local root spans that have not been encoded and sent to the Datadog Agent?

Just to confirm, this is already the current behavior right? Or would we be introducing this new behavior of keeping references to a finished root span? If it's the latter, then I'd prefer to address it in this PR. Otherwise, I can own a separate PR to address that issue.

It's a separate issue. I can address it. I would like to propose making Span properties immutable after Span.finish() is called. Trying to modify a finished span should raise a warning.

@brettlangdon
Copy link
Member

@mabdinur

Should we solve this issue in this PR? Should we only keep a reference to local root spans that have not been encoded and sent to the Datadog Agent?

This change does not change any existing behavior or problems. This scenario exists today, and will continue to exist (no worse than today).

Especially since child spans will always hold reference to their parents, as long as child spans are still active (not GC'd at least) then all parents will not get GC'd as well.

Changing that finished spans get "locked" is new unrelated behavior change.

@brettlangdon brettlangdon merged commit e160b36 into main Sep 26, 2024
638 of 640 checks passed
@brettlangdon brettlangdon deleted the modal-labs/local_root_gc_issue branch September 26, 2024 18:51
github-actions bot pushed a commit that referenced this pull request Sep 26, 2024
…python GC doesn't kick in (#10809)

_<<Description copied from the [original
PR](https://github.com/DataDog/dd-trace-py/pull/10624)>>_

We (Modal Labs) noticed that enabling tracing in parts of our code
caused the Python GC to kick in a lot and add large delays to our event
loop. A simple repro (see test) showed us that there was a self
reference cycle in span when a top-level span's local root was assigned
to the span itself, preventing Python from cleaning up the objects by
reference counting.

The change itself is pretty straightforward - we use None to represent
when a span's local root is itself. The span still exposes _local_root
with the same behavior via a property, and internally stores a
_local_root_value that holds a non-self-referential span. This also
means that the tracer no longer needs to explicitly set the local root
for top-level spans. This should be fairly non-risky as it doesn't
change any code behavior.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Daniel Shaar <[email protected]>
Co-authored-by: Daniel Shaar <[email protected]>
(cherry picked from commit e160b36)
github-actions bot pushed a commit that referenced this pull request Sep 26, 2024
…python GC doesn't kick in (#10809)

_<<Description copied from the [original
PR](https://github.com/DataDog/dd-trace-py/pull/10624)>>_

We (Modal Labs) noticed that enabling tracing in parts of our code
caused the Python GC to kick in a lot and add large delays to our event
loop. A simple repro (see test) showed us that there was a self
reference cycle in span when a top-level span's local root was assigned
to the span itself, preventing Python from cleaning up the objects by
reference counting.

The change itself is pretty straightforward - we use None to represent
when a span's local root is itself. The span still exposes _local_root
with the same behavior via a property, and internally stores a
_local_root_value that holds a non-self-referential span. This also
means that the tracer no longer needs to explicitly set the local root
for top-level spans. This should be fairly non-risky as it doesn't
change any code behavior.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Daniel Shaar <[email protected]>
Co-authored-by: Daniel Shaar <[email protected]>
(cherry picked from commit e160b36)
github-actions bot pushed a commit that referenced this pull request Sep 26, 2024
…python GC doesn't kick in (#10809)

_<<Description copied from the [original
PR](https://github.com/DataDog/dd-trace-py/pull/10624)>>_

We (Modal Labs) noticed that enabling tracing in parts of our code
caused the Python GC to kick in a lot and add large delays to our event
loop. A simple repro (see test) showed us that there was a self
reference cycle in span when a top-level span's local root was assigned
to the span itself, preventing Python from cleaning up the objects by
reference counting.

The change itself is pretty straightforward - we use None to represent
when a span's local root is itself. The span still exposes _local_root
with the same behavior via a property, and internally stores a
_local_root_value that holds a non-self-referential span. This also
means that the tracer no longer needs to explicitly set the local root
for top-level spans. This should be fairly non-risky as it doesn't
change any code behavior.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

---------

Co-authored-by: Daniel Shaar <[email protected]>
Co-authored-by: Daniel Shaar <[email protected]>
(cherry picked from commit e160b36)
erikayasuda added a commit that referenced this pull request Sep 26, 2024
…python GC doesn't kick in [backport 2.14] (#10838)

Backport e160b36 from #10809 to 2.14.

_<<Description copied from the [original
PR](https://github.com/DataDog/dd-trace-py/pull/10624)>>_

We (Modal Labs) noticed that enabling tracing in parts of our code
caused the Python GC to kick in a lot and add large delays to our event
loop. A simple repro (see test) showed us that there was a self
reference cycle in span when a top-level span's local root was assigned
to the span itself, preventing Python from cleaning up the objects by
reference counting.

The change itself is pretty straightforward - we use None to represent
when a span's local root is itself. The span still exposes _local_root
with the same behavior via a property, and internally stores a
_local_root_value that holds a non-self-referential span. This also
means that the tracer no longer needs to explicitly set the local root
for top-level spans. This should be fairly non-risky as it doesn't
change any code behavior.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

Co-authored-by: erikayasuda <[email protected]>
erikayasuda added a commit that referenced this pull request Sep 26, 2024
…python GC doesn't kick in [backport 2.13] (#10837)

Backport e160b36 from #10809 to 2.13.

_<<Description copied from the [original
PR](https://github.com/DataDog/dd-trace-py/pull/10624)>>_

We (Modal Labs) noticed that enabling tracing in parts of our code
caused the Python GC to kick in a lot and add large delays to our event
loop. A simple repro (see test) showed us that there was a self
reference cycle in span when a top-level span's local root was assigned
to the span itself, preventing Python from cleaning up the objects by
reference counting.

The change itself is pretty straightforward - we use None to represent
when a span's local root is itself. The span still exposes _local_root
with the same behavior via a property, and internally stores a
_local_root_value that holds a non-self-referential span. This also
means that the tracer no longer needs to explicitly set the local root
for top-level spans. This should be fairly non-risky as it doesn't
change any code behavior.

## Checklist
- [x] PR author has checked that all the criteria below are met
- The PR description includes an overview of the change
- The PR description articulates the motivation for the change
- The change includes tests OR the PR description describes a testing
strategy
- The PR description notes risks associated with the change, if any
- Newly-added code is easy to change
- The change follows the [library release note
guidelines](https://ddtrace.readthedocs.io/en/stable/releasenotes.html)
- The change includes or references documentation updates if necessary
- Backport labels are set (if
[applicable](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting))

## Reviewer Checklist
- [x] Reviewer has checked that all the criteria below are met 
- Title is accurate
- All changes are related to the pull request's stated goal
- Avoids breaking
[API](https://ddtrace.readthedocs.io/en/stable/versioning.html#interfaces)
changes
- Testing strategy adequately addresses listed risks
- Newly-added code is easy to change
- Release note makes sense to a user of the library
- If necessary, author has acknowledged and discussed the performance
implications of this PR as reported in the benchmarks PR comment
- Backport labels are set in a manner that is consistent with the
[release branch maintenance
policy](https://ddtrace.readthedocs.io/en/latest/contributing.html#backporting)

Co-authored-by: erikayasuda <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants