Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] timeout on STREAM_SD_OFFSET read when suspend-resume on CML Helios #4558

Closed
XiaoyunWu6666 opened this issue Jul 27, 2021 · 57 comments
Closed
Assignees
Labels
bug Something isn't working as expected CML Applies to Comet Lake platform I2S Applies to I2S bus for codec connection P3 Low-impact bugs or features
Milestone

Comments

@XiaoyunWu6666
Copy link
Contributor

XiaoyunWu6666 commented Jul 27, 2021

Describe the bug
timeout on STREAM_SD_OFFSET read when suspend-resume on CML
inner daily /5515?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-capture-5
To Reproduce

TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt1011-rt5682.tplg  \
 sof-test/test-case/check-suspend-resume-with-audio.sh -S 1 -w 3 -l 300 -m capture

-S 1 -w 3 save some time while preserving an acceptable ssh experience. For faster reproduction on a local system try -S 1 -w 1?

-m capture is critical: -m playback never reproduced even after 100 iterations.

Environment
Start Time: 2021-07-26 21:24:47 UTC
End Time: 2021-07-27 04:58:29 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 2113dc7f
SOF Branch: main
SOF Commit: 57ee04f
Topology:sof-cml-rt1011-rt5682.tplg
Platform:CML_HEL_RT5682

Screenshots or console output
[console]

2021-07-26 23:04:00 UTC Sub-Test: [REMOTE_INFO] ===== Round(2/5) =====
2021-07-26 23:04:00 UTC Sub-Test: [REMOTE_COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Mon Jul 26 23:04:07 2021
2021-07-26 23:04:10 UTC Sub-Test: [REMOTE_COMMAND] sleep for 5
arecord: suspend:1691: suspend: prepare error: Connection timed out
2021-07-26 23:04:15 UTC Sub-Test: [REMOTE_INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1627340635"
2021-07-26 23:04:15 UTC [ERROR] Caught kernel log error
===========================>>
[ 5378.823476] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 5: timeout on STREAM_SD_OFFSET read
[ 5382.615214] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
[ 5382.616259] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 0: timeout on STREAM_SD_OFFSET read
[ 5382.616266] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: DMA trigger stop failed
[ 5382.616272] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 5382.616305] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
[ 5382.616311] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 5382.616315] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
[ 5382.616373] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
[ 5382.616378] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110
[ 5382.616382] kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -110
[ 5382.616513] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: PM: failed to resume async: error -110
[ 5383.132839] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: ipc timed out for 0x30100000 size 48
[ 5383.132909] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ IPC dump start ]------------
[ 5383.132967] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda irq intsts 0x00000000 intlctl 0x00001388 rirb 00
[ 5383.133019] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: dsp irq ppsts 0x00000000 adspis 0x00000003
[ 5383.133077] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: host status 0x00000000 dsp status 0x00040000 mask 0x00000000
[ 5383.133131] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ IPC dump end ]------------
[ 5383.133175] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load pipeline ipc failure
[ 5383.133216] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to load widget PIPELINE.11.DMIC1.IN
[ 5383.133264] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed widget list set up for pcm 7 dir 1
[ 5383.133313] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: set pcm hw_params after resume
[ 5383.133356] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ASoC: error at snd_soc_pcm_component_prepare on 0000:00:1f.3: -110
[ 5383.133496] kernel:  DMIC16kHz: ASoC: soc_pcm_prepare() failed (-110)
[ 5383.133535] kernel:  DMIC16kHz: ASoC: dpcm_fe_dai_prepare() failed (-110)
<<===========================
@slawblauciak slawblauciak transferred this issue from thesofproject/sof Jul 27, 2021
@mengdonglin mengdonglin changed the title [BUG] timeout on STREAM_SD_OFFSET read when suspend-resume on CML [BUG] timeout on STREAM_SD_OFFSET read when suspend-resume on CML Helios Jul 27, 2021
@plbossart
Copy link
Member

@slawblauciak what makes you think it's a kernel issue? There's nothing that changed recently.

@lgirdwood Can you please look into this?

@plbossart
Copy link
Member

@XiaoyunWu6666 The same problem affects other CML devices, it's not just CML_HEL_RT5682

http://sof-ci.sh.intel.com/#/result/planresultdetail/5515?model=CML_SKU0983_SDW&testcase=check-suspend-resume-with-capture-5

http://sof-ci.sh.intel.com/#/result/planresultdetail/5515?model=CML_SKU0955_HDA&testcase=check-suspend-resume-with-capture-5

It's an across-the-board problem for all CML devices using I2S/HDAudio or SoundWire links. That's very very suspicious.

@plbossart
Copy link
Member

Reproduced in 10 iterations on CML SKU 09C6 (same as 0983)

2021-07-27 15:30:35 UTC Sub-Test: [INFO] ===== Round(10/50) =====
2021-07-27 15:30:35 UTC Sub-Test: [COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Jul 27 15:30:40 2021
2021-07-27 15:31:01 UTC Sub-Test: [COMMAND] sleep for 5
arecord: suspend:1716: suspend: prepare error: Connection timed out
2021-07-27 15:31:04 UTC Sub-Test: [INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1627399830"
2021-07-27 15:31:04 UTC [ERROR] Caught kernel log error
===========================>>
[  516.464381] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 5: timeout on STREAM_SD_OFFSET read
[  525.418163] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
[  525.418556] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 0: timeout on STREAM_SD_OFFSET read
[  525.418565] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: DMA trigger stop failed
[  525.418571] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  525.418598] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
[  525.418604] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  525.418609] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
[  525.418661] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
[  525.418667] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110
[  525.418673] kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -110
[  525.418703] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: PM: failed to resume async: error -110
[  525.419950] kernel: soundwire_intel soundwire_intel.link.0: Failed to power up link: -11
[  530.922894] kernel: rt711 sdw:0:025d:0711:00: Initialization not complete, timed out
[  530.922909] kernel: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110

kernel SHA1: 85bfa34a88d6
SOF SHA1: 9a7a5ce (HEAD -> main, origin/main) pipeline: trigger START from the task context

@plbossart
Copy link
Member

plbossart commented Jul 27, 2021

0def9056300a4b23d909ba0b36256087af91d2fe is the first bad commit
commit 0def9056300a4b23d909ba0b36256087af91d2fe
Author: Marc Herbert <[email protected]>
Date:   Fri Jul 16 08:00:02 2021 +0000

    trace: move CONFIG_TRACEM implementation up a couple levels
    
    From deep down trace.c:va_tracelog() up to the _log_message() level.
    
    Also rename va_tracelog() to the more specific dma_tracelog()
    
    Preparation to support the DMA trace in Zephyr.
    
    The only functional change in this commit is that DMA messages copied to
    the shared memory are not de-duplicated any more (a.k.a "adaptive rate
    limiting" or CONFIG_TRACE_FILTERING_ADAPTIVE). These are generally
    supposed to be high level hence rare enough; otherwise there is probably
    a "bigger problem".
    
    Signed-off-by: Marc Herbert <[email protected]>

 src/include/sof/trace/trace.h  | 28 +++++++++++++---
 src/trace/trace.c              | 72 ++++++++++++++++++++----------------------
 test/cmocka/src/common_mocks.c | 10 +++++-
 3 files changed, 67 insertions(+), 43 deletions(-)


git bisect log
git bisect start
# bad: [9a7a5ce17dcb339e3fd667c8f12dd50db13afe31] pipeline: trigger START from the task context
git bisect bad 9a7a5ce17dcb339e3fd667c8f12dd50db13afe31
# good: [9648fb65a5b8b86707e0a160efa992f3f3c97f59] alloc: fix heap_trace() block start index
git bisect good 9648fb65a5b8b86707e0a160efa992f3f3c97f59
# good: [b1c9e858969bf022b0b3f908ca5539548405934e] zephyr_ll.c: declare missing ll_tr trace context
git bisect good b1c9e858969bf022b0b3f908ca5539548405934e
# bad: [b96077cde05cb512ab0cfc1fa8ccf933e5f25b25] zephyr: intel: remove zephyr_ll from mandatory files
git bisect bad b96077cde05cb512ab0cfc1fa8ccf933e5f25b25
# good: [e1aa806d5c6679d93da9d6a6f024d0e504ee6353] trace: add 'atomic' argument to mtrace_dict_entry()
git bisect good e1aa806d5c6679d93da9d6a6f024d0e504ee6353
# bad: [57ee04f2d9312968a8e11fd916195bbacd667180] trace: add _log_nodict() and enable DMA trace for Zephyr
git bisect bad 57ee04f2d9312968a8e11fd916195bbacd667180
# bad: [0def9056300a4b23d909ba0b36256087af91d2fe] trace: move CONFIG_TRACEM implementation up a couple levels
git bisect bad 0def9056300a4b23d909ba0b36256087af91d2fe
# first bad commit: [0def9056300a4b23d909ba0b36256087af91d2fe] trace: move CONFIG_TRACEM implementation up a couple levels

@plbossart
Copy link
Member

@XiaoyunWu6666 @marc-hb @fredoh9 @bardliao I would appreciate it if you can try to reproduce the issue on a CML device. My results show that there is a firmware regression since I don't see any issue with the latest kernel/older firmware.

More details for the record - @lgirdwood FYI
test 5515 on July 27 - BAD
Kernel Commit: 2113dc7f
SOF Commit: 57ee04f

test 5464 July 26 GOOD
Kernel Commit: d54b7803
SOF Commit: 9648fb6

so basically I bisected from yesterday's tested SOF commit
good: [9648fb6] alloc: fix heap_trace() block start index
until today's HEAD
bad: [9a7a5ce] pipeline: trigger START from the task context

@lgirdwood
Copy link
Member

@plbossart thanks for bisect , can you confirm if the trace patch or the pipeline patch is bad ?

@plbossart
Copy link
Member

I don't understand the question @lgirdwood. My tests show what the first bad commit was, I don't know if the last patches on pipelines are correct or not. It could also be that my 'good' tests are 'green failures', and the issue happens with an earlier commit.

@lgirdwood
Copy link
Member

Sorry, was not clear. I mean is the bisect clear cut or vague - I agree though this is a FW issue.

0def9056300a4b23d909ba0b36256087af91d2fe is the first bad commit
commit 0def9056300a4b23d909ba0b36256087af91d2fe
Author: Marc Herbert <[email protected]>
Date:   Fri Jul 16 08:00:02 2021 +0000

OR

so basically I bisected from yesterday's tested SOF commit
good: [9648fb65a5b8b86707e0a160efa992f3f3c97f59] alloc: fix heap_trace() block start index
until today's HEAD
bad: [9a7a5ce17dcb339e3fd667c8f12dd50db13afe31] pipeline: trigger START from the task context

I guess what you are saying it could be either above two commits (and needs further investigation) ?

@plbossart
Copy link
Member

no, I am saying: the problem started with 0def905 or earlier. I don't have a 100% confidence in the 'good' results, but I trust the tests when they tell me the DSP boot failed.

@lgirdwood lgirdwood transferred this issue from thesofproject/linux Jul 27, 2021
@lgirdwood
Copy link
Member

@XiaoyunWu6666 are you able to do some further investigation.

  1. locally revert 0def905 and try 20 iterations
  2. locally revert 9a7a5ce and try 20 iterations.

OR redo the bisect with 20 iterations (based on @plbossart being able to reproduce with 10). Thanks !

@marc-hb @lyakh fyi

@plbossart
Copy link
Member

plbossart commented Jul 27, 2021

Note that sometimes 10-20 iterations is not enough. I just ran a check which didn't fail in 50, and my last took 36 with the SOF main branch

2021-07-27 21:54:09 UTC Sub-Test: [INFO] ===== Round(36/50) =====
2021-07-27 21:54:09 UTC Sub-Test: [COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Jul 27 21:54:14 2021
2021-07-27 21:54:33 UTC Sub-Test: [COMMAND] sleep for 5
arecord: suspend:1716: suspend: prepare error: Connection timed out
2021-07-27 21:54:38 UTC Sub-Test: [INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1627422844"
2021-07-27 21:54:38 UTC [ERROR] Caught kernel log error
===========================>>
[ 1053.206429] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 5: timeout on STREAM_SD_OFFSET read
[ 1062.162980] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
[ 1062.163662] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 0: timeout on STREAM_SD_OFFSET read
[ 1062.163671] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: DMA trigger stop failed
[ 1062.163678] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 1062.163705] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
[ 1062.163712] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 1062.163717] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
[ 1062.163773] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
[ 1062.163780] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110

@plbossart
Copy link
Member

plbossart commented Jul 27, 2021

With these two commands, I was able to run 100 iterations without issues.

git revert 57ee04f2d9312968a8e11fd916195bbacd667180
git revert 0def9056300a4b23d909ba0b36256087af91d2fe

test command:

TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 100 -m capture

@plbossart
Copy link
Member

And with these 3 commands I get an error in 11 iterations

git revert 57ee04f2d9312968a8e11fd916195bbacd667180
git revert 0def9056300a4b23d909ba0b36256087af91d2fe
git cherry-pick 0def9056300a4b23d909ba0b36256087af91d2fe
2021-07-27 23:03:15 UTC Sub-Test: [INFO] ===== Round(11/100) =====
2021-07-27 23:03:15 UTC Sub-Test: [COMMAND] Run the command: rtcwake -m mem -s 5
rtcwake: assuming RTC uses UTC ...
rtcwake: wakeup from "mem" using /dev/rtc0 at Tue Jul 27 23:03:21 2021
2021-07-27 23:03:40 UTC Sub-Test: [COMMAND] sleep for 5
arecord: suspend:1716: suspend: prepare error: Connection timed out
2021-07-27 23:03:45 UTC Sub-Test: [INFO] Check for the kernel log status
declare -- cmd="journalctl_cmd --since=@1627426990"
2021-07-27 23:03:45 UTC [ERROR] Caught kernel log error
===========================>>
[  152.853194] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 5: timeout on STREAM_SD_OFFSET read
[  161.852981] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
[  161.853391] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 0: timeout on STREAM_SD_OFFSET read
[  161.853399] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: DMA trigger stop failed
[  161.853405] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
[  161.853432] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
[  161.853439] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
[  161.853443] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
[  161.853498] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
[  161.853503] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110
[  161.853509] kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -110
[  161.853539] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: PM: failed to resume async: error -110
[  161.854777] kernel: soundwire_intel soundwire_intel.link.0: Failed to power up link: -11
[  167.167712] kernel: rt711 sdw:0:025d:0711:00: Initialization not complete, timed out
[  167.167724] kernel: PM: dpm_run_callback(): acpi_subsys_resume+0x0/0x70 returns -110

I would say this confirms the bisect and rules out all following commits after 0def905

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 28, 2021

I can reproduce relatively quickly to on jf-cml-hel-rt5682-01 with TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt1011-rt5682.tplg sof-test/test-case/check-suspend-resume-with-audio.sh -S 1 -w 3 -l 50 -m capture... even after reverting both 57ee04f and 0def905 !

EDIT: I just reproduced at commit 98e8016, before the commits above.

Note the DSP is permanently wedged after a failure, sof-test/tools/kmod/sof_remove and _insert cannot recover it. A Linux reboot is necessary.

More precisely, after failure, sof-test/kmod/tools/sof_remove.sh unloads the modules but prints this:

Jul 27 23:39:58 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x40010000: GLB_PM_MSG: CTX_SAVE
Jul 27 23:39:59 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: ipc timed out for 0x40010000 size 76
Jul 27 23:39:59 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: -110 failed to prepare DSP for device removal
Jul 27 23:39:59 jf-cml-hel-rt5682-01 acpid[387]: input device has been disconnected, fd 13
Jul 27 23:39:59 jf-cml-hel-rt5682-01 acpid[387]: input device has been disconnected, fd 12
Jul 27 23:39:59 jf-cml-hel-rt5682-01 acpid[387]: input device has been disconnected, fd 11
Jul 27 23:39:59 jf-cml-hel-rt5682-01 acpid[387]: input device has been disconnected, fd 10
Jul 27 23:39:59 jf-cml-hel-rt5682-01 systemd[529]: Stopped target Sound Card.
Jul 27 23:39:59 jf-cml-hel-rt5682-01 systemd[1]: Stopped target Sound Card.
Jul 27 23:39:59 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1d003c timedout
Jul 27 23:39:59 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_core_reset_enter: timeout on HDA_DSP_REG_ADSPCS read
Jul 27 23:39:59 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: dsp core reset failed: core_mask 1

Then sof-test/kmod/tools/sof_insert.sh fails like this:

Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_dsp_stream_setup_bdl: periods:1
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Attempting iteration 0 of Core En/ROM load...
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask f
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0e0e successful
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0e00 successful
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unstall/run core: core_mask = f
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask f
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0000 successful
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unstall/run core: core_mask = 1
Jul 27 23:15:12 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask 1
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xd4]=0x0 timedout
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0f0f successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask f
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: Attempting iteration 1 of Core En/ROM load...
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask f
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0f0f successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0f00 successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unstall/run core: core_mask = f
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask f
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0000 successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unstall/run core: core_mask = 1
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 1 : core_mask 1
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xd4]=0x80000000 successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0xf0f0e0e successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1010e0e successful
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: DSP core(s) enabled? 0 : core_mask e
Jul 27 23:15:13 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 successful
  --- 3 seconds exactly ---
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 timedout
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x140000 successful
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unknown ROM status value 05000001
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware -110
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1d003c timedout
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_core_reset_enter: timeout on HDA_DSP_REG_ADSPCS read
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: dsp core reset failed: core_mask 1
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: sof_probe_work failed err: -110

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 28, 2021

As I wrote in the previous comment, with TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt1011-rt5682.tplg sof-test/test-case/check-suspend-resume-with-audio.sh -S 1 -w 3 -l 50 -m capture I can reproduce fairly quickly after about 30 iterations. I do not think this is a recent SOF regression because I reproduced with different, weeks-old commits like bdae0ee. I even checked out the SOF tag v1.8 and reproduced exactly the same!

So something else changed recently. I used the kernel that was already there, didn't touch it:
5.14.0-rc2-daily-default-20210726 #2113dc7f SMP Mon Jul 26 21:19:19 UTC 2021

Note I also tried to use a remote PDU to powercycle the device and not just reboot it (afraid of something like #4419) but I don't know effective that was. EDIT: the remote PDU cuts the power for about 1s. The battery has been disconnected an all devices. I know it's effective because the RTC clock loses track of time.

It's an across-the-board problem for all CML devices using I2S/HDAudio or SoundWire links. That's very very suspicious.

... while everything else is fine. So how could this be related to generic trace changes?

@mengdonglin mengdonglin added bug Something isn't working as expected CML Applies to Comet Lake platform I2S Applies to I2S bus for codec connection labels Jul 28, 2021
@fredoh9
Copy link
Contributor

fredoh9 commented Jul 28, 2021

@marc-hb Same problem with v1.8, that's hard to believe.
I tested with recently daily builds, I was able to reproduce the same result with daily build reported, good for good, bad for bad.

After reverting two commits above @plbossart mentioned, I'm not able to reproduce the problem.
this is my sof git tree after reverting

0faab8ec (HEAD) Revert "trace: move CONFIG_TRACEM implementation up a couple levels"
8d3b73a2 Revert "trace: add _log_nodict() and enable DMA trace for Zephyr"
9a7a5ce1 (origin/main, origin/HEAD) pipeline: trigger START from the task context
886b7ee4 pipeline: (cosmetic) use bool type, adjust line lengths
e879e954 pipeline: (cosmetic) simplify pipeline_task_init()

I used this command,
TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt1011-rt5682.tplg ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 25 -m capture

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 28, 2021

Looking at daily test results 5539 I see the failures happened in two steps on some platforms and only the first step on others.

  • The first and common step is the single-line error message error: hda_dsp_stream_trigger: cmd X: timeout on STREAM_SD_OFFSET read. The test stops because of that error but it does not seem fatal.
  • The second and less common step starts with the same message, just with a different command number and then it looks as above.

@marc-hb Same problem with v1.8, that's hard to believe.

It was hard to believe for me too but I triple check that by watermarking the build with a .tarball-version and see the watermark in the kernel logs.

I used this command,

Use the same parameters as in the description. -l 25 is not enough. Maybe -S 1 -w 3 is required to reproduce on older versions? Could be timing-sensitive.

PS: CML also suffers from #4560 and #4561

@bardliao
Copy link
Collaborator

bardliao commented Jul 28, 2021

I can still reproduce the issue when 57ee04f and 0def905 are reverted on my CML SKU 098F device.
dmesg_revert_2_commits.txt

bf2ba830b (HEAD -> main) Revert "trace: move CONFIG_TRACEM implementation up a couple levels"
97ad0a634 Revert "trace: add _log_nodict() and enable DMA trace for Zephyr"
9a7a5ce17 (origin/main) pipeline: trigger START from the task context

And test with latest v1.8.x firmware from https://github.com/thesofproject/sof-bin/tree/main/v1.8.x is still on going, so far so good.

Edit: The test result with latest v1.8.x firmware is PASS on my side.

Edit: My kernel commit is thesofproject/linux@85bfa34

@lgirdwood
Copy link
Member

@marc-hb Same problem with v1.8, that's hard to believe.

It was hard to believe for me too but I triple check that by watermarking the build with a .tarball-version and see the watermark in the kernel logs.

So v1.8 would have been validated with a different kernel version. It's also possible the reproducability is dependent on the kernel version too. Has anyone tried the kernel from the v1.8 tag date with HEAD and v1.8 ?

@keyonjie
Copy link
Contributor

@marc-hb Same problem with v1.8, that's hard to believe.

It was hard to believe for me too but I triple check that by watermarking the build with a .tarball-version and see the watermark in the kernel logs.

So v1.8 would have been validated with a different kernel version. It's also possible the reproducability is dependent on the kernel version too. Has anyone tried the kernel from the v1.8 tag date with HEAD and v1.8 ?

the SOF v1.8 was tested with this kernel: https://github.com/thesofproject/linux/tree/release-sof-1.8-20210524

@lgirdwood
Copy link
Member

So we do the following for stream START|STOP

		ret = snd_sof_dsp_read_poll_timeout(sdev,
					HDA_DSP_HDA_BAR,
					sd_offset, run,
					((run &	dma_start) == dma_start),
					HDA_DSP_REG_POLL_INTERVAL_US,
					HDA_DSP_STREAM_RUN_TIMEOUT);

		if (ret < 0) {
			dev_err(sdev->dev,
				"error: %s: cmd %d: timeout on STREAM_SD_OFFSET read\n",
				__func__, cmd);
			return ret;
		}

Where the wait delay is 300uS

sound/soc/sof/intel/hda.h:179:#define HDA_DSP_STREAM_RUN_TIMEOUT		300

The DSP can take up to the next LL tick (up to 1ms) to do the stream DMA START | STOP. Can we try increasing the timeout to 1100uS (including some overhead).

This could be too long for kernel trigger atomic context so we could also remove this polling check here and place it in the PCM position callback e.g.

if (time_from_stream_trigger > 1100 && bit_no_set)
    printk(KERN_ERR "dma not started);

OR alternatively rely on the ALSA pointer timout of 100ms as indication of stream DMA status.

@plbossart
Copy link
Member

We've had issues with firmware boot for a while now, see the query https://github.com/thesofproject/linux/issues?q=is%3Aissue+is%3Aopen+label%3ACSE

What I see from the logs is that we seem to have different patterns of failure

a. the pattern in this bug

[ 5378.823476] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_stream_trigger: cmd 5: timeout on STREAM_SD_OFFSET read
[ 5382.615214] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read

b. a different pattern in kmod tests

Marc's post above and in thesofproject/linux#3056

Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x140000 successful
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unknown ROM status value 05000001
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware -110
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x4]=0x1d003c timedout
Jul 27 23:15:16 jf-cml-hel-rt5682-01 kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: hda_dsp_core_reset_enter: timeout on HDA_DSP_REG_ADSPCS read

c. and yet another pattern in thesofproject/linux#3045

[ 6005.333364] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: hda_dsp_core_reset_leave: timeout on HDA_DSP_REG_ADSPCS read
[ 6005.383872] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: hda_dsp_core_reset_leave: timeout on HDA_DSP_REG_ADSPCS read
[ 6005.434353] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: hda_dsp_core_reset_leave: timeout on HDA_DSP_REG_ADSPCS read
[ 6005.434356] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: dsp core 0/1 power up failed
[ 6005.434357] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump start ]------------
[ 6005.434395] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: extended rom status:  0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff 0xffffffff
[ 6005.434413] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: ------------[ DSP dump end ]------------
[ 6005.434414] kernel: sof-audio-pci-intel-tgl 0000:00:1f.3: error: dsp init failed after 3 attempts with err: -110

So I really wonder if we have multiple causes of failure. The changes with the DMA trace are fairly reproducible, it's rather unclear why.

I will re-test with v1.8, I am not completely sure it's the same problem - I was e.g. unable to reproduce the issue with more than 100 iteration with just the 2 commits reverted.

@plbossart
Copy link
Member

The DSP can take up to the next LL tick (up to 1ms) to do the stream DMA START | STOP. Can we try increasing the timeout to 1100uS (including some overhead).

I am not fully convinced @lgirdwood, the key point is that we only see this issue in suspend-resume tests. If we had a problem with DMAs we would have seen it in regular playback/capture tests. It's the suspend-resume part that's really important.

@lgirdwood
Copy link
Member

The DSP can take up to the next LL tick (up to 1ms) to do the stream DMA START | STOP. Can we try increasing the timeout to 1100uS (including some overhead).

I am not fully convinced @lgirdwood, the key point is that we only see this issue in suspend-resume tests. If we had a problem with DMAs we would have seen it in regular playback/capture tests. It's the suspend-resume part that's really important.

The trace uses the DMA too and is on the LL 1ms tick. Can you try and extend the delay ?

@lgirdwood
Copy link
Member

lgirdwood commented Jul 29, 2021

Yes, and it looks like gcc and xcc do have different results. I tried with internal 20210719 daily test binaries (commit f96e25f). The xcc one pass and the gcc one fail.

Thanks @bardliao , obvious differences between compilers are C calling ABI, code speed, layout of TEXT and DATA.

I can see 2 really interesting things from the data

  1. GCC trace always shows two trace IPCs after TRIGGER_START whilst XCC always shows one EDIT: this is before we enter D3. @bardliao are you able to provide the FW trace logs for each build to check for differences in the trace data ? The trace message should only be sent if the trace buffer > watermark OR every 500ms if there is new data.

  2. The stream IDs are different. GCC uses stream 1 and XCC stream 4 in the dmesg. I would assume these to be the same with the same topology ?

i.e. GCC

[  304.960002] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 1 dir 1 cmd 1
[  304.961806] sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x2024001e successful
[  304.961821] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
[  304.962917] rfkill: input handler enabled
[  304.963079] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TRIG_START
[  304.998420] thermal thermal_zone10: failed to read out thermal zone (-61)
[  304.998486] PM: suspend exit
[  305.089913] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[  305.089945] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[  305.109491] rfkill: input handler disabled
[  305.589981] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[  305.590036] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[  306.430759] ax88179_178a 1-8.3:1.0 enx000ec6a22def: ax88179 - Link status is: 1
[  307.961582] soundwire sdw-master-1: clock stop prep/de-prep done slave:15
[  307.961609] soundwire sdw-master-2: clock stop prep/de-prep done slave:15
[  307.961623] soundwire sdw-master-3: clock stop prep/de-prep done slave:15
[  308.331280] PM: suspend entry (s2idle)

and XCC

[ 3685.065788] sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 4 dir 1 cmd 1
[ 3685.065998] sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x2024001e successful
[ 3685.066010] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60040000: GLB_STREAM_MSG: TRIG_START
[ 3685.066334] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60040000: GLB_STREAM_MSG: TRIG_START
[ 3685.789467] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 3685.789523] sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 3686.310809] ax88179_178a 1-8.3:1.0 enx000ec6a22def: ax88179 - Link status is: 1
[ 3687.729751] soundwire sdw-master-0: clock stop prep/de-prep done slave:15
[ 3687.729773] soundwire sdw-master-1: clock stop prep/de-prep done slave:15
[ 3687.793590] soundwire sdw-master-2: clock stop prep/de-prep done slave:15
[ 3688.071421] PM: suspend entry (s2idle)

@lgirdwood
Copy link
Member

Some things to try

  1. Switch of trace Kconfig in FW and retry with GCC - will help rule out trace.

  2. Modify the script to stop the stream immediately before the suspend (as close as possible). Stress test 100 times, this will help establish whether the stream must be running (alongside the trace)

  3. Consume any remaining bytes in the HDA host DMA FIFO (i.e. update the DMA read pointer on the host side) before we trigger(SUSPEND). One effect of freezing userspace is that the host DMA read pointer will stop updating and block the FW pipeline and also block the DMA HW state machine (nothing todo as my buffer is full). Lets see if we can unblock any state machine.

@plbossart
Copy link
Member

@lgirdwood

  1. we can't remove the trace, currently all sof-test scripts require the DMA trace to be enabled.
    I tried just that in [TEST] disable DMA trace to see if underflows happen linux#3021 and it was a complete fail.
    I think it's a dependency we need to remove but it's not going to happen overnight.

  2. We can't also change the script to stop, because doing so would change what we test. The point of suspend-resume-with-audio is to check the transitions WITHOUT a stop.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 29, 2021

currently all sof-test scripts require the DMA trace to be enabled.

Note it's mostly optional at the developer level already. Otherwise yes, automation always uses it and depends on it right now. Discussed in thesofproject/sof-test#726

We can't also change the script to stop, because doing so would change what we test. The point of suspend-resume-with-audio is to check the transitions WITHOUT a stop.

Testing less always finds fewer bugs ;-)

@bardliao
Copy link
Collaborator

Yes, and it looks like gcc and xcc do have different results. I tried with internal 20210719 daily test binaries (commit f96e25f). The xcc one pass and the gcc one fail.

Thanks @bardliao , obvious differences between compilers are C calling ABI, code speed, layout of TEXT and DATA.

I can see 2 really interesting things from the data

  1. GCC trace always shows two trace IPCs after TRIGGER_START whilst XCC always shows one EDIT: this is before we enter D3. @bardliao are you able to provide the FW trace logs for each build to check for differences in the trace data ? The trace message should only be sent if the trace buffer > watermark OR every 500ms if there is new data.

sof-trace-gcc.txt
sof-trace-xcc.txt
I only tested with 3 runs. i.e. sof-test/test-case/check-suspend-resume-with-audio.sh -S 1 -w 3 -l 3 -m capture
Both test are pass.
And I can see slight difference between the two traces.
For example, I can see some message below in the XCC test, but not in the GCC one.
src/lib/agent.c:65 perf sys_load peak plat 25026 cpu 167257

  1. The stream IDs are different. GCC uses stream 1 and XCC stream 4 in the dmesg. I would assume these to be the same with the same topology ?

Yes, they are the same topology. I think it is because we have two capture devices and GCC failed at the first capture test and XCC finished two capture tests. And the dmesg log is truncated, that's why you see different stream ID.

arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: sofsoundwire [sof-soundwire], device 1: Headset mic (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofsoundwire [sof-soundwire], device 4: Microphones (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0

@lgirdwood
Copy link
Member

@XiaoyunWu6666 please close on Monday if all is good.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 31, 2021

Please don't close this even when the tests pass. Reverting important features without a clue why they failed is not a solution. Set it to the lowest priority if you want.

@XiaoyunWu6666
Copy link
Contributor Author

XiaoyunWu6666 commented Aug 2, 2021

in inner daily [id:5635] CML Helios is good ,let's lower the priority of this issue.
PR that reverted relevant features is here:#4578

@XiaoyunWu6666 XiaoyunWu6666 added the P2 Critical bugs or normal features label Aug 2, 2021
@XiaoyunWu6666
Copy link
Contributor Author

Close it now since relevant feature was reverted in #4578

Since this issue takes only several rounds to reproduce in testcase check-suspend-resume-with-capture-5 , we'll test how long it will take additionally to add this case in PR test.
If the extra time is acceptable ,we can track how it goes in PR test.

@bardliao
Copy link
Collaborator

bardliao commented Aug 3, 2021

BTW, I can still reproduce the issue with #4578 merged in GCC build. That could be a different issue.

@marc-hb marc-hb reopened this Aug 3, 2021
@marc-hb marc-hb added P3 Low-impact bugs or features and removed P2 Critical bugs or normal features labels Aug 3, 2021
@XiaoyunWu6666
Copy link
Contributor Author

XiaoyunWu6666 commented Sep 7, 2021

happen in inner daily 6362
on model CML_RVP_SDW
in testcase check-suspend-resume-with-capture-5

@lgirdwood
Copy link
Member

lgirdwood commented Sep 7, 2021

happen in inner daily 6362
on model CML_RVP_SDW
in testcase check-suspend-resume-with-capture-5

@XiaoyunWu6666 what's the reproducability now and is this fatal ?

@lgirdwood lgirdwood added this to the v1.9 milestone Sep 7, 2021
@marc-hb marc-hb self-assigned this Sep 8, 2021
@XiaoyunWu6666
Copy link
Contributor Author

XiaoyunWu6666 commented Sep 9, 2021

@lgirdwood
The reproduction rate is rather low , see issue summary below

First happened on 2021-07-26 (Report Id: 5515)
Last happened on 2021-09-06 (Report Id: 6362)
3 hits within 43 days totally

Concerning the number of testcases that this issue affects , it is not that fatal (about one each time). But its appearance causes the testcase to fail

@lgirdwood
Copy link
Member

@XiaoyunWu6666 thanks for the update.
@plbossart fwiw, there is a speculative HW channel for the DMA that switched ON/OFF to save power. @kv2019i and I were discussing this week and we could have a race between the DMA channel PM switching during the driver/FW DMA/stream shutdown logic. I wonder if there is a status bit for this that could be polled as part of the shutdown logic......

@marc-hb
Copy link
Collaborator

marc-hb commented Sep 10, 2021

The failure in 6362 looks different to me:

[ 5864.775947] kernel: rt700 sdw:1:025d:0700:00: in rt700_jack_detect_handler, jack_type=0x3
[ 5864.775953] kernel: rt700 sdw:1:025d:0700:00: in rt700_jack_detect_handler, btn_type=0x0
[ 5864.811316] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 5864.811368] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 5865.311346] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx: 0x90020000: GLB_TRACE_MSG
[ 5865.311401] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc rx done: 0x90020000: GLB_TRACE_MSG
[ 5866.538175] kernel: ax88179_178a 2-4:1.0 enx0050b6ba239a: ax88179 - Link status is: 1
[ 5867.992127] kernel: soundwire sdw-master-0: clock stop prep/de-prep done slave:15
[ 5867.992500] kernel: soundwire_intel soundwire_intel.link.1: intel_link_power_down: powering down all links
[ 5869.269665] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: trigger stream 11 dir 1 cmd 0
[ 5869.269674] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[ 5869.269909] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60050000: GLB_STREAM_MSG: TRIG_STOP
[ 5869.270667] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0xa0]=0x240002 timedout
[ 5869.270671] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_dsp_stream_trigger: cmd 0 on dai_link "DMIC16kHz" (Capture, stream_tag: 2): timeout on STREAM_SD_OFFSET read
[ 5869.270677] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[ 5869.270909] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ipc tx succeeded: 0x60030000: GLB_STREAM_MSG: PCM_FREE
[ 5869.270937] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: free stream 11 dir 1
[ 5869.270976] kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: pcm: close stream 11 dir 1
[ 5869.275762] kernel: soundwire_intel soundwire_intel.link.1: intel_link_power_up: powering up all links
[ 5869.275765] kernel: soundwire_intel soundwire_intel.link.1: intel_link_power_up: first link up, programming SYNCPRD
[ 5869.276969] kernel: soundwire sdw-master-0: clock stop prep/de-prep done slave:15
[ 5869.278314] kernel: rt700 sdw:1:025d:0700:00: [rt700_sdw_write] 3121 <= 0000
[ 5869.278430] kernel: rt700 sdw:1:025d:0700:00: [rt700_sdw_write] 3122 <= 0000
[ 5869.278490] kernel: rt700 sdw:1:025d:0700:00: [rt700_sdw_write] 3123 <= 0000

This could mean this issue hasn't happened since revert #4578

@ranj063
Copy link
Collaborator

ranj063 commented Sep 18, 2021

@XiaoyunWu6666 @1994lwz could you please check if thesofproject/linux#3166 helps fix this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working as expected CML Applies to Comet Lake platform I2S Applies to I2S bus for codec connection P3 Low-impact bugs or features
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants