Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

trace: fixes to help for heap allocation debugging #4548

Merged
merged 2 commits into from
Sep 16, 2021

Conversation

keyonjie
Copy link
Contributor

@keyonjie keyonjie commented Jul 26, 2021

This small series perform fixes to help for heap allocation debugging:

  1. No matter if CONFIG_DEBUG_HEAP selected or not, we need to print out
    error message when allocation failed.

  2. trace: Kconfig: disable filtering by default

tr_err(&mem_tr, \
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
bytes, zone, caps, flags); \
} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was meaning from my comment on the previous PR it would be good to dump the heap blocks on any allocation error. I see we are already doing this but in debug mode only ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I was meaning from my comment on the previous PR it would be good to dump the heap blocks on any allocation error. I see we are already doing this but in debug mode only ?

That's true. One more bad thing is that our trace are always filtered out by the trace filtering, that's why I tried to set the filtering to N by default

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do something like this (it sames duplicating the print).

#define _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
	do { \
		if (!(ptr)) { \
			tr_err(&mem_tr, \
			       "failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
			       bytes, zone, caps, flags); \
		} \
	} while (0)

#if DEBUG
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
                       _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
			alloc_trace_heap(zone, caps, bytes);
#else
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
                       _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags)
#endif

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesnt look like this feedback was addressed?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, definitely good to deduplicate, but can I propose a slight modification:

#define _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
	if (!(ptr)) \
		tr_err(&mem_tr, \
		       "failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
		       bytes, zone, caps, flags)
#if DEBUG
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) do { \
	_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags); \
	alloc_trace_heap(zone, caps, bytes); \
while (0)
#else
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) do { \
	_DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags); \
while (0)
#endif

to always have do { ... } while (0) externally

@@ -35,22 +35,22 @@ config TRACEM
config TRACE_FILTERING
bool "Trace filtering"
depends on TRACE
default y
default n
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is hiding a very big change in a PR with a totally unrelated name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean the PR subject, let me rename it. Anyway, the change is mentioned in the commit subject.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the commit is fine, I was only referring to the PR name.

This relatively complex feature took a lot of effort, so I'm pretty sure it was developed for a reason. This change could kill performance on all production systems with just one character change. So even a scary PR name may not be enough, it should probably be discussed in some larger forum / mailing list.

Copy link
Contributor Author

@keyonjie keyonjie Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes the commit is fine, I was only referring to the PR name.

This relatively complex feature took a lot of effort, so I'm pretty sure it was developed for a reason. This change could kill performance on all production systems with just one character change. So even a scary PR name may not be enough, it should probably be discussed in some larger forum / mailing list.

No, in real production systems, usually the DMA trace itself is disabled thoroughly, I don't think the feature is useful at all, please correct me if you think so.

Anyway, it filters out useful logs by mistake from time to time for me, it is legit to disable it by default to me.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filtering is a useful developer debug feature, i.e. helps developers improve the trace SNR when debugging issues when there is high frequency trace output. However for CI usage it can hide some important details in the logs that would be useful for triage. I'm good to try this change for CI and see if it helps

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I just change to unselect the whole TRACE_FILTERING feature, frankly speaking I don't know how to use the _TRACE_FILTERING_VERBOSITY at all, I haven't seen the tr_dbg() logs for years. @slawblauciak asked me about this and what I shared to him is just use tr_info() only.
@marc-hb can you share how should CI and customer use this _TRACE_FILTERING_VERBOSITY today? If nobody is using it, no harm to disable it by default, no?

Copy link
Member

@plbossart plbossart Sep 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@keyonjie I also disagree with your approach. Why wasn't this change a different PR? we could have fixed the 'filtering by mistake' instead of disabling the capability entirely.

@lgirdwood we need to keep the rule that there are TWO approvers on such critical PRs. This ignores feedback, it's not a good process.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marc-hb can you share how should CI and customer use this _TRACE_FILTERING_VERBOSITY today?

By following the official documentation:
https://thesofproject.github.io/latest/developer_guides/debugability/logger/index.html#trace-filtering

Filtering by log levels and component is really the most basic feature of any logging framework.

This commit apparently caused regression #4779 , according to some people reverting it seems to make the problem go away. Of course the code should in theory work with any Kconfig combination, I am not saying changing Kconfig is the root cause. But we need the tests to pass.

Because both, unrelated settings were changed in the same commit ("just in case?"), validation could not immediately find which one setting caused the regression.

Cc: @ranj063 , @1994lwz

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These KConfig changes have been reverted in #4785 because of regression #4779. @keyonjie I now understand that CONFIG_TRACE_FILTERING_ADAPTIVE is the only default you wanted to change, correct? Please re-submit with only the CONFIG_TRACE_FILTERING_ADAPTIVE default change (it's all very brittle, so only one change at a time) and after first thoroughly trying to reproduce issue #4779 locally.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, it's very easy to turn off CONFIG_TRACE_FILTERING_ADAPTIVE at run-time, no recompilation needed: #5154

@keyonjie keyonjie changed the title alloc: trace error when allocation failed with no condition trace: fixes to help for heap allocation debugging Jul 28, 2021
src/lib/alloc.c Outdated
@@ -647,14 +647,22 @@ static void alloc_trace_heap(enum mem_zone zone, uint32_t caps, size_t bytes)

#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
do { \
if (!ptr) { \
tr_err(&mem_tr, "failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
alloc_trace_heap(zone, caps, bytes); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only need to dump this when the allocation fails otherwise it will flood the trace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my logic here is:
a. always trace errors when the allocation failed.
b. trace the updated heap status for the successful allocation, only when CONFIG_DEBUG_HEAP is selected ('no' by default) and the DEBUG_TRACE_PTR() is called. It will somewhat 'flood' the trace, but that's what our intention of 'CONFIG_DEBUG_HEAP'?

@@ -35,22 +35,22 @@ config TRACEM
config TRACE_FILTERING
bool "Trace filtering"
depends on TRACE
default y
default n
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The filtering is a useful developer debug feature, i.e. helps developers improve the trace SNR when debugging issues when there is high frequency trace output. However for CI usage it can hide some important details in the logs that would be useful for triage. I'm good to try this change for CI and see if it helps

@marc-hb marc-hb requested review from marc-hb and removed request for marc-hb July 28, 2021 14:14
@lgirdwood
Copy link
Member

SOFCI TEST

1 similar comment
@keqiaozhang
Copy link
Collaborator

SOFCI TEST

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still some duplication.

tr_err(&mem_tr, \
"failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
bytes, zone, caps, flags); \
} \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do something like this (it sames duplicating the print).

#define _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
	do { \
		if (!(ptr)) { \
			tr_err(&mem_tr, \
			       "failed to alloc 0x%x bytes zone 0x%x caps 0x%x flags 0x%x", \
			       bytes, zone, caps, flags); \
		} \
	} while (0)

#if DEBUG
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
                       _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
			alloc_trace_heap(zone, caps, bytes);
#else
#define DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags) \
                       _DEBUG_TRACE_PTR(ptr, bytes, zone, caps, flags)
#endif

@lgirdwood
Copy link
Member

@XiaoyunWu6666 @keyonjie seeing the CML stream offset kernel bug again. here, could be unrelated but trace filtering is OFF for CI so maybe more stream DMA trace traffic at this point. @plbossart @ranj063 fyi.

 1322.593425] kernel: rt700 sdw:1:025d:0700:00: Slave impl defined interrupt
[ 1322.593436] kernel: rt700 sdw:1:025d:0700:00: rt700_interrupt_callback control_port_stat=4
[ 1322.852497] kernel: rt700 sdw:1:025d:0700:00: [rt700_sdw_read] b921 0000 => 80000000
[ 1322.853314] kernel: rt700 sdw:1:025d:0700:00: [rt700_sdw_read] 7520 85a0 9c20 aca0 => 00000800
[ 1322.854265] kernel: rt700 sdw:1:025d:0700:00: [rt700_sdw_read] 7520 85a0 9c20 aca0 => 00000000
[ 1322.854282] kernel: rt700 sdw:1:025d:0700:00: in rt700_jack_detect_handler, jack_type=0x3
[ 1322.854290] kernel: rt700 sdw:1:025d:0700:00: in rt700_jack_detect_handler, btn_type=0x0
[ 1323.077134] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 1323.077189] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 1324.645700] kernel: ax88179_178a 2-4:1.0 enx0050b6ba239a: ax88179 - Link status is: 1
[ 1326.344753] kernel: soundwire sdw-master-0: clock stop prep/de-prep done slave:15
[ 1326.345141] kernel: soundwire_intel soundwire_intel.link.1: intel_link_power_down: powering down all links
[ 1327.386182] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 10 dir 1 cmd 0
[ 1327.386190] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[ 1327.386347] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[ 1327.386909] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x240002 timedout
[ 1327.386912] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 0: timeout on STREAM_SD_OFFSET read
[ 1327.386918] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[ 1327.387072] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[ 1327.387098] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 10 dir 1
[ 1327.387183] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: close stream 10 dir 1
[ 1327.392509] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: open stream 11 dir 1
[ 1327.392512] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: period min 192 max 16384 bytes
[ 1327.392514] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: period count 2 max 16
[ 1327.392515] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: buffer max 65536 bytes
[ 1327.392580] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: hw params stream 11 dir 1
[ 1327.392582] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: generating page table for 00000000d7f36897 size 0xfa00 pages 16
[ 1327.392593] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x240000 successful
[ 1327.392611] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x40000 successful
[ 1327.392613] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_dsp_stream_setup_bdl: period_bytes:0x3e80
[ 1327.392614] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_dsp_stream_setup_bdl: periods:4
[ 1327.392623] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: stream_tag 2
[ 1327.392633] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
[ 1327.393057] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60010000: GLB_STREAM_MSG: PCM_PARAMS
[ 1327.393062] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: stream dir 1, posn mailbox offset is 790756
[ 1327.393135] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 11 dir 1 cmd 1
[ 1327.393648] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x2024001e successful
[ 1327.393652] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
[ 1327.393850] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TRIG_START
[ 1328.077106] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 1328.077199] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 1328.577085] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 1328.577184] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 1329.588202] kernel: PM: suspend entry (deep)
[ 1329.588642] kernel: Filesystems sync: 0.000 seconds
[ 1329.650060] kernel: Freezing user space processes ... (elapsed 0.009 seconds) done.
[ 1329.659439] kernel: OOM killer disabled.
[ 1329.659440] kernel: Freezing remaining freezable tasks ... (elapsed 0.000 seconds) done.
[ 1329.660362] kernel: printk: Suspending console(s) (use no_console_suspend to debug)

@plbossart
Copy link
Member

plbossart commented Aug 24, 2021

It's not surprising, the previous occurrence of this STREAM_OFFSET timeout was due to a change in the way var_args were handled - and we never root-caused the problem. Here this PR changes the memory management. It seems that all attempts to change the trace come back to the same flakiness with a memory-handling side effect impacting the DMA.

I vote for no merge, if a feature adds another bug it's not good.

Copy link
Member

@plbossart plbossart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NAK if we end-up with STREAM_OFFSET issues again.

@keyonjie
Copy link
Contributor Author

Agree with @plbossart , not to merge it until the CI reported issue is fixed.

Let me finish the heap refinement first and then the dma lli one then this one.

@lgirdwood
Copy link
Member

@plbossart can we add some debug in the driver to print the stream name/pcm name so that this can be trace back a bit better, I am suspecting the trace DMA here, we do run it on "one-shot" mode but it could be a race between it's state machine and the driver (and extra trace data makes it more frequent).

@lgirdwood
Copy link
Member

@ujfalusi fyi - if you have 30mins and its easy todo - see above, we could do with more detail in the STREAM_OFFSET kernel error message.

@ujfalusi
Copy link
Contributor

@ujfalusi fyi - if you have 30mins and its easy todo - see above, we could do with more detail in the STREAM_OFFSET kernel error message.

@plbossart, @lgirdwood, can you take a look at thesofproject/linux#3121 ? Is it going to help with narrowing down the scope?

@lgirdwood
Copy link
Member

Thank you @ujfalusi !
@plbossart @ranj063 @keyonjie fwiw I think the STREAM_OFFSET error is also happening on the FW side. i.e. from #4656

[  213.706809] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
[  213.707784] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TRIG_START
[  216.780966] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: pcm: trigger stream 0 dir 1 cmd 0
[  216.780998] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[  216.782252] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[  216.782286] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: FW Poll Status: reg[0xa0]=0x240000 successful
[  216.782305] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[  217.285068] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: error: ipc timed out for 0x60030000 size 12
[  217.285225] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: preventing DSP entering D3 state to preserve context
[  217.285231] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ------------[ IPC dump start ]------------
[  217.285243] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: error: hda irq intsts 0x00000000 intlctl 0xc0000001 rirb 00
[  217.285251] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: error: dsp irq ppsts 0x00000000 adspis 0x00000000
[  217.285260] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: error: host status 0x00000000 dsp status 0x00000000 mask 0x00000003
[  217.285268] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ------------[ IPC dump end ]------------
[  217.285274] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ------------[ DSP dump start ]------------
[  217.285282] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: status: fw entered - code 00000005
[  217.285330] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: error: unexpected fault 0x00000000 trace 0x00004000
[  217.285338] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ------------[ DSP dump end ]------------
[  217.285348] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: ASoC: error at soc_component_trigger on 0000:00:0e.0: -110
[  217.285359] kernel:  Port0: ASoC: trigger FE cmd: 0 failed: -110
[  217.285435] kernel: sof-audio-pci-intel-apl 0000:00:0e.0: pcm: free stream 0 dir 1

On the FW side we see an IPC timeout, so It's possible we have an IP programming flow around DMA stop that needs some better alignment between FW and drivers...

No matter if CONFIG_DEBUG_HEAP selected or not, we need to print out
error message when allocation failed. Furthermore, dump the whold heap
usage in case under heap debugging or allocation failure.

Signed-off-by: Keyon Jie <[email protected]>
The trace filtering will suppress useful logs and make debugging
difficult, disable it by default.

Signed-off-by: Keyon Jie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants