Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DRC: math: Replace exponential function for performance #8435

Merged

Conversation

ShriramShastry
Copy link
Contributor

@ShriramShastry ShriramShastry commented Nov 3, 2023

For DRC performance, replace exp_small_fixed() with sofm_exp_int32().
Included supporting change to include sofm_exp_int32() within exp_fixed()
and repositioned exp_small_fixed() for future use.

@ShriramShastry
Copy link
Contributor Author

For compilation, this PR requires further fixes.

@singalsu
Copy link
Collaborator

singalsu commented Nov 3, 2023

In my test this patch dropped drc.2.1 average MCPS from 140 to 96 in topology sof-hda-efx-generic-4ch.tplg. Excellent!

The used topology initializes DRC to passthrough, so I applied another with "sof-ctl -n 36 -s threshold_-35_knee_27_ratio_8.txt". The blob that I used is not in SOF git yet but I'll attach it here:

threshold_-35_knee_27_ratio_8.txt

src/audio/Kconfig Outdated Show resolved Hide resolved
src/math/decibels.c Outdated Show resolved Hide resolved
* The input is Q3.29
* The output is Q9.23
*/
int32_t exp_small_fixed(int32_t x)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need this function here since you are using sofm_exp_int32().

src/audio/Kconfig Outdated Show resolved Hide resolved
*/
y0 = Q_SHIFT_RND(exp_small_fixed(Q_SHIFT_LEFT(xs, 27, 29)), 23, 20);
y0 = Q_SHIFT_RND(sofm_exp_int32(Q_SHIFT_LEFT(xs, 27, 28)), 23, 20);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 78 - 81 still can be optimized further with instrinsic code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file didn't include xtensa header file, do you mean add hifi implementation of functions in this file?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, each for loop deserve an intrinsic implementation.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check if it improves speed. In some cases the C multiply has been as fast.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll give { Q_CONVERT_FLOAT ~, { Q_SHIFT_RND ~, { Q_SHIFT_LEFT ~, and Q_MULTSR_32X32 } a shot as well. If I don't succeed, I'll try again with the next PR.

Copy link
Contributor Author

@ShriramShastry ShriramShastry Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The macro cycle count performance across Generic C ( Q_ *) and HiFi intrinsic ( XT_ *) using the identical input data samples is shown in the table below.

Note: The math is being calculated accurately by the HiFi function for CONVERT_FLOAT.

image

Copy link
Contributor Author

@ShriramShastry ShriramShastry Nov 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have completed the implementation of Q_SHIFT_RND, Q_SHIFT_LEFT, and Q_MULTSR_32X32, and Q_CONVERT_FLOAT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll correct the last few CI errors.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've finished addressing every CI error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShriramShastry , still code style error.

@lgirdwood
Copy link
Member

lgirdwood commented Nov 10, 2023

@singalsu @ShriramShastry @andrula-song I think we need to have some options where we can decide at build time via Kconfig between speed and accuracy for certain maths ops. i.e. for 16bit output we may not need full 24/32 computations and so on. There may also be usages where speed is more important than bit accuracy, but thats up to you guys to identify and target.

@singalsu
Copy link
Collaborator

@singalsu @ShriramShastry @andrula-song I think we need to have some options where we can decide at build time via Kconfig between speed and accuracy for certain maths ops. i.e. for 16bit output we may not need full 24/32 computations and so on. There may also be usages where speed is more important than bit accuracy, but thats up to you guys to identify and target.

Yep, I wanted to avoid Sriram to change the old exponent function code since it is used by many other features. The new version for DRC is significantly faster but trades off slightly accuracy. We need to check case by case with exponent function to utilize.

@singalsu
Copy link
Collaborator

@ShriramShastry Any updates to this? The MCPS saving is large so we should get this cleaned up and merged.

@ShriramShastry ShriramShastry force-pushed the DRC_math_exp_optimization_dev branch 4 times, most recently from 78515d8 to 61ea083 Compare November 24, 2023 06:53
Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @andrula-song pls review.

@lgirdwood lgirdwood added this to the v2.9 milestone Nov 24, 2023
@lgirdwood
Copy link
Member

@ShriramShastry some conflicts, pls rebase as CI wont complete.

@ShriramShastry ShriramShastry marked this pull request as ready for review November 27, 2023 05:44
@ShriramShastry ShriramShastry force-pushed the DRC_math_exp_optimization_dev branch 3 times, most recently from 3601def to 91f8cf5 Compare November 27, 2023 15:25
@ShriramShastry ShriramShastry force-pushed the DRC_math_exp_optimization_dev branch 2 times, most recently from a3519e1 to e1c8418 Compare December 7, 2023 11:44
@ShriramShastry ShriramShastry force-pushed the DRC_math_exp_optimization_dev branch 5 times, most recently from 9fa354a to 69ebd9a Compare December 8, 2023 05:37
*/
y0 = Q_SHIFT_RND(exp_small_fixed(Q_SHIFT_LEFT(xs, 27, 29)), 23, 20);
y0 = Q_SHIFT_RND(sofm_exp_int32(Q_SHIFT_LEFT(xs, 27, 28)), 23, 20);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ShriramShastry , still code style error.

@@ -105,10 +106,10 @@ int32_t exp_fixed(int32_t x)
n++;
}

/* exp_small_fixed() input is Q3.29, while x1 is Q5.27
* exp_small_fixed() output is Q9.23, while y0 is Q12.20
/* sofm_exp_int32() input is Q4.28, while x1 is Q5.27
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for comments, you mentioned input is Q4.28, however, what is x1? means sofm_exp_int32() output?
if means output, please change x1 to output.

* Output is Q12.20, 0.0 .. +2048.0
*/

int32_t exp_fixed(int32_t x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your patch sequence are quite strange, normally, we should have c version first, then hifi version, seems you already have exp_fixed hifi in previous patch, now this patch comes with c version, this is strange sequence.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The critical and hard requirement is that every git commit compiles and passes all the tests. CI does unfortunately not check this (for various, complicated reasons).

@btian1 if you think some git commit does not compile and pass the tests then please "Request Changes" and block this PR.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your patch sequence are quite strange,

every git commit compiles and passes all the tests. CI does unfortunately not check this

The simplest way to solve all these problems and more is to not submit all your commits at the same time:

https://docs.zephyrproject.org/latest/contribute/contributor_expectations.html#defining-smaller-prs

@@ -302,10 +367,10 @@ int32_t exp_fixed(int32_t x)
int i;
int n = 0;

if (x < Q_CONVERT_FLOAT(-11.5, 27))
if (x < exp_hifi_q_convert_float(-11.5, 27))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not build with pre-compiler stage? could you show the asm code to compare?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I don't get the necessity in this situation; could you please explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean why not directly use:Q_CONVERT_FLOAT(-11.5, 27)?

return xt_o;
}

#define ONE_Q20 exp_hifi_q_convert_float(1.0, 20) /* Use Q12.20 */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these are always repeating definition, please move to one common header.

y = ONE_Q20;
for (i = 0; i < (1 << n); i++)
y = (int32_t)Q_MULTSR_32X32((int64_t)y, y0, 20, 20, 20);
y = (int32_t)exp_hifi_q_multsr_32x32((int64_t)y, y0, 20, 20, 20);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this patch is not related with title(exp), you can't use one PR to cover all the changes, please move out this patch from this PR.

@@ -114,7 +115,7 @@ void win_povey_16b(int16_t win[], int length)
/* Calculate x^0.85 as exp(0.85 * log(x)) */
x2 = (int32_t)(ln_int32((uint32_t)x1) >> 1) - WIN_LOG_2POW31_Q26;
x3 = sat_int32(Q_MULTSR_32X32((int64_t)x2, WIN_085_Q31, 26, 31, 27)); /* Q5.27 */
x4 = exp_fixed(x3); /* Q5.27 -> Q12.20 */
x4 = sofm_exp_fixed(x3); /* Q5.27 -> Q12.20 */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

somehow, you can rename from the beginning or start from first patch, then this patch will be removed.

src/audio/module_adapter/module/generic.c Outdated Show resolved Hide resolved
src/math/decibels.c Outdated Show resolved Hide resolved
@singalsu
Copy link
Collaborator

@ShriramShastry I added your signed-off-by into #8605. Better to separate the fix from this large PR to get the common testbench build issue fixed.

src/math/exp_fcn_hifi.c Outdated Show resolved Hide resolved
src/math/exp_fcn_hifi.c Outdated Show resolved Hide resolved
src/math/exp_fcn_hifi.c Show resolved Hide resolved
src/math/exp_fcn_hifi.c Show resolved Hide resolved
@singalsu
Copy link
Collaborator

Seems it's not ensured by C standards that literals floating point macros are calculated in the C pre-processor. Maybe things have changed with xt-clang.

https://stackoverflow.com/questions/21241031/does-the-c-preprocessor-handle-floating-point-math-constants

It would be safest to change macros like #define SOME_COEF Q_CONVERT_FLOAT(-11.5, 27) into #define SOME_COEF -1543503872 /* -11.5 Q5.27 */.

@singalsu
Copy link
Collaborator

Seems it's not ensured by C standards that literals floating point macros are calculated in the C pre-processor. Maybe things have changed with xt-clang.

https://stackoverflow.com/questions/21241031/does-the-c-preprocessor-handle-floating-point-math-constants

It would be safest to change macros like #define SOME_COEF Q_CONVERT_FLOAT(-11.5, 27) into #define SOME_COEF -1543503872 /* -11.5 Q5.27 */.

@ShriramShastry Please check disassembly of your Q_CONVERT_FLOAT() usage. We looked into TGL and MTL builds with @btian1 and we could not find floating point code generation from current usages of the macro. No, float overhead found. So, let's not yet start to replace these before understanding better as I worried in my previous comment.

@@ -8,4 +8,5 @@ cmocka_test(window
${PROJECT_SOURCE_DIR}/src/math/base2log.c
${PROJECT_SOURCE_DIR}/src/math/decibels.c
${PROJECT_SOURCE_DIR}/src/math/exp_fcn.c
${PROJECT_SOURCE_DIR}/src/math/exp_fcn_hifi.c
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please squash this patch with previous one, as Marc said, you need get each patch pass all CI separately, here, there is obvious CI error.

ShriramShastry and others added 5 commits January 8, 2024 12:40
The macros are moved to header file. There are no functional changes.

Signed-off-by: shastry <[email protected]>
Unused variables from HiFi4/5 were reshuffled and placed in order
to use HiFi3 code. If the variable 'ret' is used uninitialized
whenever the 'if' condition is false, set it to false.

Signed-off-by: shastry <[email protected]>
This change allows the fast exponent library to replace
the decibels library for applications like DRC where exponent
function is used in hot code parts.

Signed-off-by: Seppo Ingalsuo <[email protected]>
Signed-off-by: shastry <[email protected]>
In Zephyr CMakeLists, add exponential source files to facilitate
the compilation of math C and HiFi code.

Signed-off-by: shastry <[email protected]>
The exp_fixed() function is replaced by fast sofm_exp_fixed()
and sofm_db2lin() functions. It saves 40 MCPS, from 123 to 83 MCPS
in a test run in TGL platform.

Signed-off-by: shastry <[email protected]>
@@ -26,6 +26,38 @@

#endif

int32_t sofm_exp_int32(int32_t x);
/* TODO: Is there a MCPS difference */
#define USING_QCONVERT 1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I already approved, but let's check in real devices if setting this to zero speeds up the code. In theory it shouldn't impact since Q_CONVERT_FLOAT macro should be evaluated in C pre-processor. Once confirmed we can remove the direct integers. Or if difference seen, remove the Q_CONVERT_FLOAT part.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, can we do this as next steps in a follow up PR and test drive this now.

@lgirdwood lgirdwood merged commit 3ed4ddd into thesofproject:main Jan 8, 2024
41 of 44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants