mulmod_shoup and nmod_vec_scalar_mul_nmod #2055

vneiger · 2024-08-26T16:51:36Z

This PR touches things related to single word modular multiplication with precomputation (n_mulmod_shoup).

documentation enhancements, with minor cleaning and notation more consistent with other functions, and also more clear constraints on the input (notably a restriction on one of the input operands was indicated, but is not required at all)
detailed explanations of this method added at the bottom of ulong_extras/mulmod_precomp_shoup.c (could be put elsewhere, because this file might disappear, see below)
profiling files for related functions nmod_vec_scalar_mul/addmul and nmod_mat_nmod_vec_mul

Two conclusions from the latter performance tests, done on two machines (zen4 / icelake):

For small lengths (5, 10, 15...), it is quite beneficial to have the precomputation inlined, so I would suggest making n_mulmod_precomp_shoup inline: any opinion against that?
The current threshold for switching from mulmod2_preinv to mulmod_shoup, which is 10, seems too high. In fact, I get better results with mulmod_shoup as soon as the length exceeds... 1. For length 1, the results are quite close and sometimes in favor of mulmod_shoup (not really sure why). Some feedback could be useful about the accuracy of the new profiling files for small lengths, because I am not so familiar with benchmarking on so "small" computations. If this conclusion turns out correct, it seems that we could lower the threshold to 2, or even avoid the _generic code entirely (the _fullword one must be kept because this modular multiplication method does not accept 64-bit moduli).

…preinv; on zen4 the threshold is clearly too high for switching to the Shoup precomp multiplication

…ove constraint t < p (or b < n); unify naming (a*b mod n like mulmod2_preinv or mulmod_precomp, instead of w*t mod p) in both doc and code; start explanations of correctness and restrictions

vneiger · 2024-08-26T16:59:26Z

Excerpts from my measurements.

For nmod_vec/profile/p-scalar_mul.c (smaller is better; the three things are "branching interface | _mulmod_shoup | _generic"):

bit/len	1		2		3		4		5		6		7		1024		65536
63	11.5|8.8|9.7	7.3|4.8|6.9	6.0|3.9|5.8	5.7|4.0|5.7	5.6|3.7|5.5	5.9|3.7|5.7	5.5|3.4|5.6	2.1|2.1|4.5	2.2|2.2|4.5

And for mulmod routines (cycle counts should all be off by the same constant factor). The kind of useful feedback I was mentioning above is: are these "microbenchmarks" somehow accurate, can we for example deduce that on this machine, running mulmod_shoup for a single modular multiplication is just a bit slower than running mulmod2_preinv (7.993c versus 6.765c) ?

./p-mulmod_shoup
mulmod_shoup min time / max time is:
   - including precomputation: 7.993 cycles / 8.279 cycles
   - excluding precomputation: 2.056 cycles / 2.081 cycles
   - precomputation alone: 7.880 cycles / 8.821 cycles

./p-mulmod2_preinv
mulmod2_precomp min time is 6.765 cycles, max time is 6.910 cycles

./p-mulmod_precomp
mulmod_precomp min time is 6.908 cycles, max time is 7.894 cycles

albinahlback · 2024-08-26T19:17:48Z

Two conclusions from the latter performance tests, done on two machines (zen4 / icelake):

For small lengths (5, 10, 15...), it is quite beneficial to have the precomputation inlined, so I would suggest making n_mulmod_precomp_shoup inline: any opinion against that?

On x86, this makes a lot of sense since it basically just consists of a div instruction. Should make sense for other architectures as well.

The current threshold for switching from mulmod2_preinv to mulmod_shoup, which is 10, seems too high. In fact, I get better results with mulmod_shoup as soon as the length exceeds... 1. For length 1, the results are quite close and sometimes in favor of mulmod_shoup (not really sure why). Some feedback could be useful about the accuracy of the new profiling files for small lengths, because I am not so familiar with benchmarking on so "small" computations. If this conclusion turns out correct, it seems that we could lower the threshold to 2, or even avoid the _generic code entirely (the _fullword one must be kept because this modular multiplication method does not accept 64-bit moduli).

Yeah, we would really need a test suite as hardware changes over time. Moreover (and I believe I have mentioned this in some issue), we don't know why the threshold of 10 was set (perhaps a specific CPU, whatever). I could only guess that it was, in general, faster on x86 fifteen years ago or so, but that it today is faster the other way around.

src/ulong_extras/mulmod_precomp_shoup.c

src/ulong_extras/profile/p-mulmod_shoup.c

src/ulong_extras/test/t-mulmod_shoup.c

albinahlback · 2024-08-26T19:35:09Z

src/ulong_extras/test/t-mulmod_shoup.c

        umul_ppmm(p1, p2, a, b);
        p1 %= d;
        udiv_qrnnd(q, r2, p1, p2, d);

-        result = (r1 == r2);
+        int result = (r1 == r2);


We usually declare variables as the good old days, i.e. in the beginning of the function. I know this differs somewhat (I know the fft_small module declares them where needed), but I think we should be consistent. Perhaps @fredrik-johansson could give his input when he comes back from his vacation.

I'm indeed interested in viewpoints on this. As I wrote in another comment above, for these test files and some specific variables in them (loop counter, result, etc.) I totally understand that we want consistency and may want to declare them once for all at the beginning. But for core code (non-test, non-profile), I would not say the same, I find that in some places it makes the code harder to read and write because it makes less obvious --- sometimes far less when there many variables or long functions --- the scope of variables, whether they already have a value at some point, or even simply what their type is. Also in some pieces of code it even forbids us to do some things, like declaring some variables const. This being said, I guess (well, I hope) that the compiler is good enough for this to make no difference, so if there is a strong standpoint on the "old C style" declaration, I will follow it; in this case this should probably be written somewhere (I don't see it in the current code conventions in contributing.html).

In the aspect of the compiler, I don't think it makes a difference. For me, it is about being able to study how many registers will be used, sort of. However, this obviously goes against the thought of a higher-level language. Moreover, I suspect I am one of the few that still prefers this way of declaring variables. And yes, whatever way we choose to go, it would be good if it is documented.

…ar (closing FLINT_TEST_INIT) into FLINT_TEST_CLEAR

Co-authored-by: Albin Ahlbäck <[email protected]>

vneiger · 2024-09-03T14:38:12Z

There remains a question about where to put the explanation notes for mulmod_shoup: they are currently in ulong_extras/mulmod_shoup_precomp.c, but after changes this file now contains nothing else than these explanations.

The documentation would be a good place.

Done, in the last commits.

… variable names for nmod matrix*scalar; minor fixes

…e nmod_mat scalar mul into functions

vneiger · 2024-09-04T10:01:40Z

@albinahlback a quick check about the flint-mparam.h and configure.ac modifications would be welcome. My plan for this PR now is just to give a reasonable value for Skylake to the macro FLINT_MULMOD_SHOUP_THRESHOLD, and then I think it will be ready for merge. Any comment welcome!

fredrik-johansson · 2024-09-05T10:26:41Z

An extremely minor nitpick: I think I prefer threshold values to be defined so that one writes

    if (n < cutoff)            if (n >= cutoff)
        algorithm1                 algorithm2
    else                       else
        algorithm2                 algorithm1

rather than

    if (n <= cutoff)           if (n > cutoff)
        algorithm1                 algorithm2
    else                       else
        algorithm2                 algorithm1

Feel free to change this (or don't bother if you have better things to do).

Other than that, positive review from me.

vneiger · 2024-09-05T22:40:37Z

An extremely minor nitpick: I think I prefer threshold values to be defined so that one writes

I did this. It actually also simplifies away some ugly macro checks.

Other than that, positive review from me.

Ok! I've made minor changes and will merge if CI passes. I'll update broadwell's flint-mparam.h later if needed, and @albinahlback should feel free to complain at any point about changes in configure.ac and the various flint-mparam.h files.

vneiger added 8 commits August 23, 2024 18:40

nmod vec scalar nmod mul: profiling file more complete; trying qrnnd_…

2dd846b

…preinv; on zen4 the threshold is clearly too high for switching to the Shoup precomp multiplication

adding profile for mulmod_shoup

fd281f7

extend test file with arbitrary right operand (b can be > modulus)

b2cd7d6

mulmod shoup: update documentation and make restrictions clearer; rem…

1e49f35

…ove constraint t < p (or b < n); unify naming (a*b mod n like mulmod2_preinv or mulmod_precomp, instead of w*t mod p) in both doc and code; start explanations of correctness and restrictions

write explanations for the method

96cd8a7

measured against generic directly: no real difference

7e83f1c

nmod scalar addmul

208c448

minor changes

31ab34e

albinahlback reviewed Aug 26, 2024

View reviewed changes

vneiger and others added 19 commits August 27, 2024 01:04

make n_mulmod_precomp_shoup inline

0c416de

starting mulmod+precomp

37d461d

FLINT_TEST_CLEANUP -> FLINT_TEST_CLEAR

82acb85

FLINT_TEST_CLEANUP -> FLINT_TEST_CLEAR (missed two in examples)

95ba39d

add a couple of missing flint_rand_clear + change many flint_rand_cle…

4e3f105

…ar (closing FLINT_TEST_INIT) into FLINT_TEST_CLEAR

fix declaration in test file

d048276

Update src/test_helpers.h

d72c685

Co-authored-by: Albin Ahlbäck <[email protected]>

Update src/test_helpers.h

546564b

Co-authored-by: Albin Ahlbäck <[email protected]>

Update src/test_helpers.h

cdc62e1

Co-authored-by: Albin Ahlbäck <[email protected]>

Merge branch 'fix_flint_test_clear' into mulmod_shoup

bbcaba9

flint_rand_clear -> FLINT_TEST_CLEAR

de3c692

Forgot to list Harvey's paper in references

e9d2d4e

typo in p-mulmod2_preinv

391304f

full precomp in progress

564b246

fix documentation for _nmod_vec_scalar_mul_nmod_shoup

fb9d37c

fix doc nmod_poly_make_monic

844b1d2

prod with precomp

3351b12

prod with precomp

a40e2ec

test file

2018371

vneiger added 16 commits September 3, 2024 17:26

add macro for mulmod shoup threshold

856a073

insert macro for mulmod_shoup here and there; clean the doc and unify…

d9da4eb

… variable names for nmod matrix*scalar; minor fixes

insert macro for mulmod_shoup in nmod_poly evaluate nmod

e7da78b

add zen4 flint-mparam; add profile for nmod_mat scalar mul; reorganiz…

fd46986

…e nmod_mat scalar mul into functions

clean a bit

b72680d

clean a bit

88a5c75

profile for scalar addmul

90aedbe

add special cases to nmod_vec scalar

b83a9c7

forgot else

b750e0a

document nmod_mat_set_mod

d404497

use nmod_mul, faster

2d09be3

not making these subfunctions public

f2dc1d9

minor cleaning in flint-mparam

fe9da23

icelake tuning

37b7b7a

better if adding the corresponding file

61936d4

why two -march ?

e11f10d

shortcut some tests when mulmod_shoup_threshold is small

21c0ff4

vneiger mentioned this pull request Sep 4, 2024

Faster NMOD_RED using different precomputation? #2061

Open

typo in filename

d797554

vneiger added 5 commits September 5, 2024 23:21

fixes in profile for nmod_mat scalar

8851fef

comparison to threshold done as suggested

29d6a4c

link to broadwell flint-mparam

970e5d5

-march=kabylake does not exist

530deb1

broadwell mparam, to be confirmed because server was under heavy load

b690b4e

broadwell mparam, updated

8a2bfbb

vneiger merged commit 06c1720 into flintlib:main Sep 6, 2024
13 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mulmod_shoup and nmod_vec_scalar_mul_nmod #2055

mulmod_shoup and nmod_vec_scalar_mul_nmod #2055

vneiger commented Aug 26, 2024

vneiger commented Aug 26, 2024

albinahlback commented Aug 26, 2024 •

edited

Loading

albinahlback Aug 26, 2024

vneiger Aug 27, 2024

albinahlback Aug 27, 2024

vneiger commented Sep 3, 2024

vneiger commented Sep 4, 2024

fredrik-johansson commented Sep 5, 2024

vneiger commented Sep 5, 2024

mulmod_shoup and nmod_vec_scalar_mul_nmod #2055

mulmod_shoup and nmod_vec_scalar_mul_nmod #2055

Conversation

vneiger commented Aug 26, 2024

vneiger commented Aug 26, 2024

albinahlback commented Aug 26, 2024 • edited Loading

albinahlback Aug 26, 2024

Choose a reason for hiding this comment

vneiger Aug 27, 2024

Choose a reason for hiding this comment

albinahlback Aug 27, 2024

Choose a reason for hiding this comment

vneiger commented Sep 3, 2024

vneiger commented Sep 4, 2024

fredrik-johansson commented Sep 5, 2024

vneiger commented Sep 5, 2024

albinahlback commented Aug 26, 2024 •

edited

Loading