-
Notifications
You must be signed in to change notification settings - Fork 309
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
trace: fixes to help for heap allocation debugging #4548
Conversation
tr_err(&mem_tr, \ | ||
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \ | ||
bytes, zone, caps, flags); \ | ||
} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was meaning from my comment on the previous PR it would be good to dump the heap blocks on any allocation error. I see we are already doing this but in debug mode only ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I was meaning from my comment on the previous PR it would be good to dump the heap blocks on any allocation error. I see we are already doing this but in debug mode only ?
That's true. One more bad thing is that our trace are always filtered out by the trace filtering, that's why I tried to set the filtering to N by default
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do something like this (it sames duplicating the print).
#define _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
do { \
if (!(ptr)) { \
tr_err(&mem_tr, \
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
bytes, zone, caps, flags); \
} \
} while (0)
#if DEBUG
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
alloc_trace_heap(zone, caps, bytes);
#else
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags)
#endif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it doesnt look like this feedback was addressed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, definitely good to deduplicate, but can I propose a slight modification:
#define _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
if (!(ptr)) \
tr_err(&mem_tr, \
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
bytes, zone, caps, flags)
#if DEBUG
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) do { \
_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags); \
alloc_trace_heap(zone, caps, bytes); \
while (0)
#else
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) do { \
_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags); \
while (0)
#endif
to always have do { ... } while (0)
externally
@@ -35,22 +35,22 @@ config TRACEM | |||
config TRACE_FILTERING | |||
bool "Trace filtering" | |||
depends on TRACE | |||
default y | |||
default n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is hiding a very big change in a PR with a totally unrelated name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you mean the PR subject, let me rename it. Anyway, the change is mentioned in the commit subject.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the commit is fine, I was only referring to the PR name.
This relatively complex feature took a lot of effort, so I'm pretty sure it was developed for a reason. This change could kill performance on all production systems with just one character change. So even a scary PR name may not be enough, it should probably be discussed in some larger forum / mailing list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes the commit is fine, I was only referring to the PR name.
This relatively complex feature took a lot of effort, so I'm pretty sure it was developed for a reason. This change could kill performance on all production systems with just one character change. So even a scary PR name may not be enough, it should probably be discussed in some larger forum / mailing list.
No, in real production systems, usually the DMA trace itself is disabled thoroughly, I don't think the feature is useful at all, please correct me if you think so.
Anyway, it filters out useful logs by mistake from time to time for me, it is legit to disable it by default to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filtering is a useful developer debug feature, i.e. helps developers improve the trace SNR when debugging issues when there is high frequency trace output. However for CI usage it can hide some important details in the logs that would be useful for triage. I'm good to try this change for CI and see if it helps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I just change to unselect the whole TRACE_FILTERING feature, frankly speaking I don't know how to use the _TRACE_FILTERING_VERBOSITY at all, I haven't seen the tr_dbg() logs for years. @slawblauciak asked me about this and what I shared to him is just use tr_info() only.
@marc-hb can you share how should CI and customer use this _TRACE_FILTERING_VERBOSITY today? If nobody is using it, no harm to disable it by default, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@keyonjie I also disagree with your approach. Why wasn't this change a different PR? we could have fixed the 'filtering by mistake' instead of disabling the capability entirely.
@lgirdwood we need to keep the rule that there are TWO approvers on such critical PRs. This ignores feedback, it's not a good process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marc-hb can you share how should CI and customer use this _TRACE_FILTERING_VERBOSITY today?
By following the official documentation:
https://thesofproject.github.io/latest/developer_guides/debugability/logger/index.html#trace-filtering
Filtering by log levels and component is really the most basic feature of any logging framework.
This commit apparently caused regression #4779 , according to some people reverting it seems to make the problem go away. Of course the code should in theory work with any Kconfig combination, I am not saying changing Kconfig is the root cause. But we need the tests to pass.
Because both, unrelated settings were changed in the same commit ("just in case?"), validation could not immediately find which one setting caused the regression.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These KConfig changes have been reverted in #4785 because of regression #4779. @keyonjie I now understand that CONFIG_TRACE_FILTERING_ADAPTIVE
is the only default you wanted to change, correct? Please re-submit with only the CONFIG_TRACE_FILTERING_ADAPTIVE
default change (it's all very brittle, so only one change at a time) and after first thoroughly trying to reproduce issue #4779 locally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it's very easy to turn off CONFIG_TRACE_FILTERING_ADAPTIVE at run-time, no recompilation needed: #5154
src/lib/alloc.c
Outdated
@@ -647,14 +647,22 @@ static void alloc_trace_heap(enum mem_zone zone, uint32_t caps, size_t bytes) | |||
|
|||
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \ | |||
do { \ | |||
if (!ptr) { \ | |||
tr_err(&mem_tr, "failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \ | |||
alloc_trace_heap(zone, caps, bytes); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We only need to dump this when the allocation fails otherwise it will flood the trace.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my logic here is:
a. always trace errors when the allocation failed.
b. trace the updated heap status for the successful allocation, only when CONFIG_DEBUG_HEAP is selected ('no' by default) and the DEBUG_TRACE_PTR() is called. It will somewhat 'flood' the trace, but that's what our intention of 'CONFIG_DEBUG_HEAP'?
@@ -35,22 +35,22 @@ config TRACEM | |||
config TRACE_FILTERING | |||
bool "Trace filtering" | |||
depends on TRACE | |||
default y | |||
default n |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filtering is a useful developer debug feature, i.e. helps developers improve the trace SNR when debugging issues when there is high frequency trace output. However for CI usage it can hide some important details in the logs that would be useful for triage. I'm good to try this change for CI and see if it helps
SOFCI TEST |
1 similar comment
SOFCI TEST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still some duplication.
tr_err(&mem_tr, \ | ||
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \ | ||
bytes, zone, caps, flags); \ | ||
} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do something like this (it sames duplicating the print).
#define _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
do { \
if (!(ptr)) { \
tr_err(&mem_tr, \
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
bytes, zone, caps, flags); \
} \
} while (0)
#if DEBUG
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
alloc_trace_heap(zone, caps, bytes);
#else
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags)
#endif
@XiaoyunWu6666 @keyonjie seeing the CML stream offset kernel bug again. here, could be unrelated but trace filtering is OFF for CI so maybe more stream DMA trace traffic at this point. @plbossart @ranj063 fyi.
|
It's not surprising, the previous occurrence of this STREAM_OFFSET timeout was due to a change in the way var_args were handled - and we never root-caused the problem. Here this PR changes the memory management. It seems that all attempts to change the trace come back to the same flakiness with a memory-handling side effect impacting the DMA. I vote for no merge, if a feature adds another bug it's not good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NAK if we end-up with STREAM_OFFSET issues again.
Agree with @plbossart , not to merge it until the CI reported issue is fixed. Let me finish the heap refinement first and then the dma lli one then this one. |
@plbossart can we add some debug in the driver to print the stream name/pcm name so that this can be trace back a bit better, I am suspecting the trace DMA here, we do run it on "one-shot" mode but it could be a race between it's state machine and the driver (and extra trace data makes it more frequent). |
@ujfalusi fyi - if you have 30mins and its easy todo - see above, we could do with more detail in the STREAM_OFFSET kernel error message. |
@plbossart, @lgirdwood, can you take a look at thesofproject/linux#3121 ? Is it going to help with narrowing down the scope? |
Thank you @ujfalusi !
On the FW side we see an IPC timeout, so It's possible we have an IP programming flow around DMA stop that needs some better alignment between FW and drivers... |
9d97c37
to
782156d
Compare
No matter if CONFIG_DEBUG_HEAP selected or not, we need to print out error message when allocation failed. Furthermore, dump the whold heap usage in case under heap debugging or allocation failure. Signed-off-by: Keyon Jie <[email protected]>
The trace filtering will suppress useful logs and make debugging difficult, disable it by default. Signed-off-by: Keyon Jie <[email protected]>
782156d
to
47606c2
Compare
This small series perform fixes to help for heap allocation debugging:
No matter if CONFIG_DEBUG_HEAP selected or not, we need to print out
error message when allocation failed.
trace: Kconfig: disable filtering by default