
ll_schedule: add a DSP load tracker to ll_tasks_execute #4943

Closed · wants to merge 1 commit

Conversation

kv2019i (Collaborator) commented Nov 1, 2021

Add a lightweight load calculation to ll_tasks_execute() and print out
the average every 2^10 runs (roughly once per second on most systems).

Example log output:
--cut--
[ 6950848.421715] ( 809129.187500) c0 ll-schedule ./schedule/ll_schedule.c:139 INFO ll avg 19720/38400
[ 8868838.449667] ( 679100.000000) c0 ll-schedule ./schedule/ll_schedule.c:139 INFO ll avg 19658/38400
[ 10802857.643650] ( 565147.187500) c0 ll-schedule ./schedule/ll_schedule.c:139 INFO ll avg 19822/38400
--cut--

UPDATE: example trace output in a case where DSP is under >95% load:
https://sof-ci.01.org/sofpr/PR4943/build10955/devicetest/?model=ADLP_RVP_NOCODEC&testcase=multiple-pipeline-all

Signed-off-by: Kai Vehmanen [email protected]
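
For context, here is a minimal sketch of the tracker the description above outlines, reconstructed from the snippets quoted in the review threads below. Only the lines inside the if-block are from the PR; the struct layout, the wrapper function, and the counter reset are illustrative assumptions.

#include <stdint.h>

/* Window of 2^10 runs, i.e. roughly once per second at a 1 ms tick */
#define CHECKS_WINDOW_SIZE 10

/* hypothetical per-scheduler state; the PR's actual layout is not shown here */
struct ll_load {
    uint32_t checks;     /* ll_tasks_execute() runs in the current window */
    uint32_t cycles_sum; /* cycles accumulated over the window */
};

static void ll_load_check(struct ll_load *load, uint32_t cycles_spent)
{
    uint32_t max;

    load->cycles_sum += cycles_spent;

    if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
        /* shift by the window size, i.e. divide by 2^10, to get the average */
        load->cycles_sum >>= CHECKS_WINDOW_SIZE;
        /* the cycle budget available in one CONFIG_SYSTICK_PERIOD tick */
        max = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK, CONFIG_SYSTICK_PERIOD);
        tr_info(&ll_tr, "ll avg %u/%u", load->cycles_sum, max);
        load->cycles_sum = 0;
        load->checks = 0;
    }
}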

kv2019i (Collaborator, Author) commented Nov 1, 2021

We have similar trace output in SOF Zephyr builds, and it has proven very useful for DSP algorithm debug and for tracking performance changes in CI and daily tests.

marc-hb (Collaborator) commented Nov 1, 2021

> We have similar trace output in SOF Zephyr builds, and it has proven very useful for DSP algorithm debug and for tracking performance changes in CI and daily tests.

They've also proven very verbose, and they prevent the DSP from going to sleep. I comment them out in my workspace.

kv2019i (Collaborator, Author) commented Nov 1, 2021

@marc-hb wrote:

> They've also proven very verbose, and they prevent the DSP from going to sleep. I comment them out in my workspace.

Why is there no bug filed about this? It sounds like an error if it prevents the DSP from going to sleep.

It's a good question what the default should be for reporting. We can put this under Kconfig, but it should be enabled in CI for this to serve as a useful tool to track DSP load.

marc-hb (Collaborator) commented Nov 1, 2021

Related to the older discussion in thesofproject/sof-test#726, "[FEATURE] sof-test should run without DMA trace - only etrace".

keyonjie (Contributor) commented Nov 2, 2021

@kv2019i thanks for thinking of this.

Basically the idea here is simple enough, though using CONFIG_SYSTICK_PERIOD may be inaccurate at times, as we are not always scheduled at a 1 ms period.

OTOH, we actually already have a similar mechanism (no percentage data though); please check the perf_cnt_xxx() functions. If we select CONFIG_PERFORMANCE_COUNTERS in Kconfig, we get logs like "perf ll_work peak plat %u cpu %u" at info level, but it only tracks the peak, not the average value. To add average or percentage data support, it would be better to do that within the existing mechanism.

If we want to get accurate percentage data, we might need to leverage the new_target_tick value, e.g.

percentage = 100 * (cycles1 - cycles0) / (new_target_tick - cycles0);
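
A sketch of that calculation, assuming cycles0 is sampled when the run starts, cycles1 when it ends, and new_target_tick is the absolute time of the next scheduled tick; the wrapper function, the guard, and the 64-bit widening are illustrative additions:

#include <stdint.h>

static uint32_t ll_load_percentage(uint32_t cycles0, uint32_t cycles1,
                                   uint32_t new_target_tick)
{
    /* guard against a division by zero if the tick has already passed */
    if (new_target_tick <= cycles0)
        return 0;

    /* widen to 64 bits so 100 * delta cannot overflow */
    return (uint32_t)(100ULL * (cycles1 - cycles0) /
                      (new_target_tick - cycles0));
}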

kv2019i (Collaborator, Author) commented Nov 2, 2021

@keyonjie wrote:

> Basically the idea here is simple enough, though using CONFIG_SYSTICK_PERIOD may be inaccurate at times, as we are not always scheduled at a 1 ms period.

Good point. I ported this from the new Zephyr-based scheduler, which has a fixed tick. But indeed, especially if we want to flag errors from the data, the target tick should be taken into account. That doesn't affect the other part:

> OTOH, we actually already have a similar mechanism (no percentage data though); please check the perf_cnt_xxx() functions. If we select CONFIG_PERFORMANCE_COUNTERS in Kconfig, we get logs like "perf ll_work peak plat %u cpu %u" at info level, but it only tracks the peak, not the average value. To add average or percentage data support, it would be better to do that within the existing mechanism.

I looked again at this, and the main problem is that it is designed as a debugging feature.

There is definite value if the DSP load tracking can be made lightweight enough that we can leave it on at least in CI builds, and even in product builds. This does limit what we can do. The existing mechanisms can print out traces at a much higher rate (when problems arise), and they do more calculations per trace.

The peak data is useful for debugging in-progress algorithms, but I'd argue a periodic DSP load measurement is more useful for debugging issues with e.g. tplg changes and/or regressions in CI.

What this PR implements is not zero cost either, but due to the fixed reporting interval, the cost is at least deterministic. That does mean we don't really want to spend cycles pretty-printing the traces, and e.g. the target tick makes this a bit more challenging, but let's have a look. Maybe we can bake it in.

Of course, if the consensus is that even this is too much for non-R&D builds, then we might be better off just adding this functionality to the existing PERFORMANCE_COUNTERS. Looking forward to input.


if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
    load->cycles_sum >>= CHECKS_WINDOW_SIZE;
    max = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK, CONFIG_SYSTICK_PERIOD);
Review comment from a Collaborator:
This should be light-weight, but the line above performs run-time divisions and multiplications with constants, even if only every N runs. Since clock_us_to_ticks() is a real function, I don't think this will be pre-calculated and optimised away by the compiler. But since it's always the same calculation (as long as the clock frequency doesn't change), it should be calculated once and stored. To also take frequency changes into account, we can store the frequency (index) that was used to calculate it, and recalculate if it changes.
But yes, this would also somehow have to account for imprecise XTOS ticks.

lgirdwood (Member) replied:

The only part doing the div/mult is clock_us_to_ticks(), and this will only happen once per second (and you are right, we could do this at init since it's all const); all the rest is shifts, which are optimal on xtensa.
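
A sketch of the caching both reviewers suggest: compute the budget once, remember the clock frequency it was computed for, and recalculate only when that changes. All names here are illustrative, and clock_get_freq() stands in for whatever frequency query the platform provides:

#include <stdint.h>

struct ll_load_budget {
    uint32_t max_ticks; /* cached clock_us_to_ticks() result */
    uint32_t freq;      /* clock frequency the cache was computed for */
};

static uint32_t ll_load_max_ticks(struct ll_load_budget *b)
{
    /* assumed frequency query; any change invalidates the cache */
    uint32_t freq = clock_get_freq(PLATFORM_DEFAULT_CLOCK);

    if (!b->max_ticks || b->freq != freq) {
        b->max_ticks = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK,
                                         CONFIG_SYSTICK_PERIOD);
        b->freq = freq;
    }

    return b->max_ticks;
}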

if (++load->checks == 1 << CHECKS_WINDOW_SIZE) {
    load->cycles_sum >>= CHECKS_WINDOW_SIZE;
    max = clock_us_to_ticks(PLATFORM_DEFAULT_CLOCK, CONFIG_SYSTICK_PERIOD);
    tr_info(&ll_tr, "ll avg %u/%u", load->cycles_sum, max);
Review comment from a Collaborator:

For automatic parsing, using a blank as the separator between the numbers instead of "/" could be nicer?
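
For instance, the trace call quoted above could become (a hypothetical variant, not code from the PR):

tr_info(&ll_tr, "ll avg %u %u", load->cycles_sum, max);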

@lgirdwood (Member) left a review:

LGTM, but we could do the clock-us conversion at init since it is always const.


kv2019i (Collaborator, Author) commented Dec 8, 2021

No time to make progress on this, so closing.

kv2019i closed this on Dec 8, 2021