[BUG] DMA trigger stop failed when suspend-resume during capturing on CML-SKU0983-SDW #4779
Comments
This happened in 6200, 6292, and 6655. |
Sorry @XiaoyunWu6666, not following you. Are you saying the "timeout on STREAM_SD_OFFSET" and "MCP_CONTROL_HW_RST is not cleared at iteration N" errors are correlated somehow? @bardliao this would be interesting indeed. |
@XiaoyunWu6666 @1994lwz can someone try reverting 9fadef7 and retest?
|
@lgirdwood @RanderWang @plbossart FYI, I tried reverting 9fadef7 and retested on CML_SKU0983_SDW (5 suspend-resume cycles during capture, for 100 rounds). |
@1994lwz @XiaoyunWu6666 could you please check if thesofproject/linux#3166 helps fix the issue? |
I tried reverting 9fadef7 and retested on CML_SKU0955_HDA (5 suspend-resume cycles during capture, for 50 rounds) and have not reproduced the issue. |
Hi @ranj063, I am afraid not; the same error appears again. I also ran a test in CI with 3166 (see inner result test 6693), and it got a FAIL too. |
The same error appears again on CML_SKU0955_HDA. |
When the stream is cleared during the suspend trigger, the dma_data must be set to NULL and snd_hdac_ext_stream_release() must be called to release the link dev. Without this, some platforms run into issues with triggering the host DMA during system resume. Add the missing sequences to both hda_link_pcm_trigger() to handle all streams that get suspended and to hda_dsp_set_hw_params_upon_resume() to handle paused streams that are reset during system suspend. Also, because the dma_data is set to NULL during suspend, add the checks to ensure link_dev is not NULL during hw_params and hw_free to prevent NULL pointer dereferences. BugLink: thesofproject/sof#4779 Signed-off-by: Ranjani Sridharan <[email protected]>
@XiaoyunWu6666 @1994lwz I have updated my PR thesofproject/linux#3166 to fix the issues during suspend-resume on capture on the CML Helios and the CML-HDA. Could you please confirm? @plbossart I need your help for the CML SDW case. |
@ranj063 @1994lwz @XiaoyunWu6666 has anyone tried reverting 9fadef7 AND running multiple capture streams for this test? I.e. try 2, then 3 capture streams in this test.
|
@lgirdwood I tried 2 capture streams during suspend with my kernel PR on the helios and it works fine. My PR fixes this issue on both the helios and the HDA laptop. |
This issue and related PRs are really hard to follow. I ran a quick test on CML_SKU09C6_SDW (same as CML-SKU0983-SDW) and I don't see any issue. "TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture" We should really deal with separate sightings and validations, it's not helpful if we conflate HDaudio, Chromebook and SoundWire platforms in the same bucket. They use different links and there's no real rationale for errors being common. kernel: 73e5ae02dd8c |
ok, so the command above is a bit misleading. I changed it to "TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 200 -m capture" and I see errors happening somewhat randomly, e.g. round 19, 2, 25. we probably need to change the '5' value since it leads to missed errors in CI. |
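As a rough illustration of why a short loop count can miss an intermittent failure, here is a minimal sketch. It assumes each suspend/resume round fails independently with the same probability, which is a simplification and not something measured in this thread; the value `0.05` and the function name `detection_probability` are hypothetical.

```python
def detection_probability(p_fail: float, rounds: int) -> float:
    """Chance of seeing at least one failure in `rounds` independent rounds.

    Assumes every round fails with the same probability `p_fail`
    (a hypothetical, simplified model of the suspend/resume test).
    """
    if not 0.0 <= p_fail <= 1.0:
        raise ValueError("p_fail must be between 0 and 1")
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    # P(at least one failure) = 1 - P(no failure in any round)
    return 1.0 - (1.0 - p_fail) ** rounds


if __name__ == "__main__":
    p = 0.05  # hypothetical per-round failure rate, for illustration only
    for n in (5, 50, 200):  # 5 and 200 are the '-l' values used above
        print(f"-l {n}: {detection_probability(p, n):.4f}")
```

Under this assumption, raising the `-l` value is what makes a rare failure likely to show up at least once in a single CI run.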
PR thesofproject/linux#3166 does not solve anything on CML_SKU09C6_SDW, now trying with a revert of 9fadef7. |
This reverts commit 9fadef7. After multiple trials on a CometLake SoundWire device, this revert to bring the trace back to what it was seems to be the only solution, the suggested PR thesofproject/linux#3166 does not help on this SoundWire device. We had similar issues with SD offset timeouts and a similar revert with thesofproject#4578 at the end of July, there's something that we are missing on what the trace does and how it impacts the DMA handling. BugLink: thesofproject#4779 Signed-off-by: Pierre-Louis Bossart <[email protected]>
Found another recent CML suspend timeout but it looks a bit different: https://sof-ci.01.org/sofpr/PR4771/build10385/devicetest/?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-capture
|
Not quite following here, @plbossart do you mean changing some Kconfig items will impact the DMA of the normal pipelines? But anyway someone can still manually unselect the trace filtering and then the issue will be exposed again, no? |
#4785 has been merged, and in today's inner daily 6739, CML_SKU0983_SDW and CML_SKU0955_HDA are good. |
When the system is brittle or unstable then all bets are off. |
@lgirdwood I think the right fix for this issue and #4558 is in Linux PR thesofproject/linux#3167 and SOF PR #4793 |
@lgirdwood @keyonjie My theory for this timeout issue is a race condition where the firmware removes access to the L2 SRAM when we stop the DMA (by clearing the DGCS_GEN bit). As a result, if there are pending DMA transactions, clearing the RUN bit from the driver side times out. By changing the way the trace works, and possibly adding more trace transactions, we may see this race condition more often. The recommended sequence is to remove access to the L2 SRAM AFTER the DMA is stopped, with a follow-up IPC. This is clearly indicated in the hardware programming sequence and it's implemented in the Windows driver, but we didn't follow it in the Linux firmware/driver for some reason. |
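The ordering argument in this theory can be sketched with a toy model. This is not the real driver, firmware, or hardware behavior: `ToyDma`, `stop_revoke_first`, and `stop_then_revoke` are hypothetical names, and the model simply assumes that pending transactions can only complete while L2 SRAM access is still granted.

```python
from dataclasses import dataclass


@dataclass
class ToyDma:
    """Toy stand-in for a host DMA stream (hypothetical, not real hardware)."""
    pending: int            # in-flight DMA transactions
    l2_access: bool = True  # whether the DMA can still reach L2 SRAM
    running: bool = True

    def revoke_l2_access(self) -> None:
        """Models the firmware removing access to the L2 SRAM."""
        self.l2_access = False

    def clear_run(self, max_polls: int = 10) -> bool:
        """Models the driver clearing RUN and polling for the stop.

        Returns True once the stream has stopped, False on timeout.
        """
        for _ in range(max_polls):
            if self.pending == 0:
                self.running = False
                return True
            if self.l2_access:
                self.pending -= 1  # assume one transaction drains per poll
        return False


def stop_revoke_first(dma: ToyDma) -> bool:
    """Suspected problematic order: access is removed before the stop."""
    dma.revoke_l2_access()
    return dma.clear_run()


def stop_then_revoke(dma: ToyDma) -> bool:
    """Recommended order: stop the DMA, then revoke access afterwards."""
    stopped = dma.clear_run()
    if stopped:
        dma.revoke_l2_access()  # stands in for the follow-up IPC
    return stopped
```

In this model the first ordering only times out when transactions are still pending at stop time, which would be consistent with the failure showing up intermittently rather than on every round.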
Using Linux PR thesofproject/linux#3167 and SOF PR #4793 to build the kernel and firmware, the issue does not reproduce on cml-sku0983-sdw or cml-sku0955-hda after running the suspend-resume test during capture 100 times. |
@1994lwz can we close now? |
This issue is fixed by thesofproject/linux#3167 and does not reproduce on the recent daily, so I am closing it. |
thesofproject/linux#3167 is not merged yet and neither is #4558 |
This issue no longer reproduces after PR #4785 was merged. We will close it after Linux PR thesofproject/linux#3167 and SOF PR #4793 are merged and verified. |
The relevant PRs have been merged and the issue has not occurred in inner daily tests for nearly 50 days, so closing it. Relevant PRs: thesofproject/linux#3167, #4820, and #4844. |
Observed this issue again when running suspend/resume stress testing on CML_SKU0983_SDW. Test ID:8054.
|
wrong bug @keqiaozhang, this had nothing to do with DMA. This should be added in thesofproject/linux#3012 |
Describe the bug
DMA trigger stop failed when suspend-resume during capturing on CML-SKU0983-SDW
To Reproduce
Run command: "TPLG=/lib/firmware/intel/sof-tplg/sof-cml-rt711-rt1308-mono-rt715.tplg ~/sof-test/test-case/check-suspend-resume-with-audio.sh -l 5 -m capture"
The reproduction rate is 100%
Environment
Kernel Branch: topic/sof-dev
Kernel Commit: a7bb845e
SOF Branch: main
SOF Commit: 9fadef7
Platform: CML-SKU0983-SDW
Screenshots or console output
[dmesg & slogger]
dmesg.txt
slogger.txt
[console]
[coredump log]