Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mulmod_shoup and nmod_vec_scalar_mul_nmod #2055

Merged
merged 67 commits into from
Sep 6, 2024
Merged

Conversation

vneiger
Copy link
Collaborator

@vneiger vneiger commented Aug 26, 2024

This PR touches things related to single word modular multiplication with precomputation (n_mulmod_shoup).

  • documentation enhancements, with minor cleaning and notation more consistent with other functions, and also more clear constraints on the input (notably a restriction on one of the input operands was indicated, but is not required at all)
  • detailed explanations of this method added at the bottom of ulong_extras/mulmod_precomp_shoup.c (could be put elsewhere, because this file might disappear, see below)
  • profiling files for related functions nmod_vec_scalar_mul/addmul and nmod_mat_nmod_vec_mul

Two conclusions from the latter performance tests, done on two machines (zen4 / icelake):

  • For small lengths (5, 10, 15...), it is quite beneficial to have the precomputation inlined, so I would suggest making n_mulmod_precomp_shoup inline: any opinion against that?
  • The current threshold for switching from mulmod2_preinv to mulmod_shoup, which is 10, seems too high. In fact, I get better results with mulmod_shoup as soon as the length exceeds... 1. For length 1, the results are quite close and sometimes in favor of mulmod_shoup (not really sure why). Some feedback could be useful about the accuracy of the new profiling files for small lengths, because I am not so familiar with benchmarking on so "small" computations. If this conclusion turns out correct, it seems that we could lower the threshold to 2, or even avoid the _generic code entirely (the _fullword one must be kept because this modular multiplication method does not accept 64-bit moduli).

…preinv; on zen4 the threshold is clearly too high for switching to the Shoup precomp multiplication
…ove constraint t < p (or b < n); unify naming (a*b mod n like mulmod2_preinv or mulmod_precomp, instead of w*t mod p) in both doc and code; start explanations of correctness and restrictions
@vneiger
Copy link
Collaborator Author

vneiger commented Aug 26, 2024

Excerpts from my measurements.

For nmod_vec/profile/p-scalar_mul.c (smaller is better; the three things are "branching interface | _mulmod_shoup | _generic"):

bit/len	1		2		3		4		5		6		7		1024		65536
63	11.5|8.8|9.7	7.3|4.8|6.9	6.0|3.9|5.8	5.7|4.0|5.7	5.6|3.7|5.5	5.9|3.7|5.7	5.5|3.4|5.6	2.1|2.1|4.5	2.2|2.2|4.5

And for mulmod routines (cycle counts should all be off by the same constant factor). The kind of useful feedback I was mentioning above is: are these "microbenchmarks" somehow accurate, can we for example deduce that on this machine, running mulmod_shoup for a single modular multiplication is just a bit slower than running mulmod2_preinv (7.993c versus 6.765c) ?

./p-mulmod_shoup
mulmod_shoup min time / max time is:
   - including precomputation: 7.993 cycles / 8.279 cycles
   - excluding precomputation: 2.056 cycles / 2.081 cycles
   - precomputation alone: 7.880 cycles / 8.821 cycles

./p-mulmod2_preinv
mulmod2_precomp min time is 6.765 cycles, max time is 6.910 cycles

./p-mulmod_precomp
mulmod_precomp min time is 6.908 cycles, max time is 7.894 cycles

@albinahlback
Copy link
Collaborator

albinahlback commented Aug 26, 2024

Two conclusions from the latter performance tests, done on two machines (zen4 / icelake):

  • For small lengths (5, 10, 15...), it is quite beneficial to have the precomputation inlined, so I would suggest making n_mulmod_precomp_shoup inline: any opinion against that?

On x86, this makes a lot of sense since it basically just consists of a div instruction. Should make sense for other architectures as well.

  • The current threshold for switching from mulmod2_preinv to mulmod_shoup, which is 10, seems too high. In fact, I get better results with mulmod_shoup as soon as the length exceeds... 1. For length 1, the results are quite close and sometimes in favor of mulmod_shoup (not really sure why). Some feedback could be useful about the accuracy of the new profiling files for small lengths, because I am not so familiar with benchmarking on so "small" computations. If this conclusion turns out correct, it seems that we could lower the threshold to 2, or even avoid the _generic code entirely (the _fullword one must be kept because this modular multiplication method does not accept 64-bit moduli).

Yeah, we would really need a test suite as hardware changes over time. Moreover (and I believe I have mentioned this in some issue), we don't know why the threshold of 10 was set (perhaps a specific CPU, whatever). I could only guess that it was, in general, faster on x86 fifteen years ago or so, but that it today is faster the other way around.

src/ulong_extras/mulmod_precomp_shoup.c Outdated Show resolved Hide resolved
src/ulong_extras/profile/p-mulmod_shoup.c Outdated Show resolved Hide resolved
src/ulong_extras/test/t-mulmod_shoup.c Outdated Show resolved Hide resolved
umul_ppmm(p1, p2, a, b);
p1 %= d;
udiv_qrnnd(q, r2, p1, p2, d);

result = (r1 == r2);
int result = (r1 == r2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually declare variables as the good old days, i.e. in the beginning of the function. I know this differs somewhat (I know the fft_small module declares them where needed), but I think we should be consistent. Perhaps @fredrik-johansson could give his input when he comes back from his vacation.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm indeed interested in viewpoints on this. As I wrote in another comment above, for these test files and some specific variables in them (loop counter, result, etc.) I totally understand that we want consistency and may want to declare them once for all at the beginning. But for core code (non-test, non-profile), I would not say the same, I find that in some places it makes the code harder to read and write because it makes less obvious --- sometimes far less when there many variables or long functions --- the scope of variables, whether they already have a value at some point, or even simply what their type is. Also in some pieces of code it even forbids us to do some things, like declaring some variables const. This being said, I guess (well, I hope) that the compiler is good enough for this to make no difference, so if there is a strong standpoint on the "old C style" declaration, I will follow it; in this case this should probably be written somewhere (I don't see it in the current code conventions in contributing.html).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the aspect of the compiler, I don't think it makes a difference. For me, it is about being able to study how many registers will be used, sort of. However, this obviously goes against the thought of a higher-level language. Moreover, I suspect I am one of the few that still prefers this way of declaring variables. And yes, whatever way we choose to go, it would be good if it is documented.

@vneiger
Copy link
Collaborator Author

vneiger commented Sep 3, 2024

There remains a question about where to put the explanation notes for mulmod_shoup: they are currently in ulong_extras/mulmod_shoup_precomp.c, but after changes this file now contains nothing else than these explanations.

The documentation would be a good place.

Done, in the last commits.

@vneiger
Copy link
Collaborator Author

vneiger commented Sep 4, 2024

@albinahlback a quick check about the flint-mparam.h and configure.ac modifications would be welcome. My plan for this PR now is just to give a reasonable value for Skylake to the macro FLINT_MULMOD_SHOUP_THRESHOLD, and then I think it will be ready for merge. Any comment welcome!

@fredrik-johansson
Copy link
Collaborator

An extremely minor nitpick: I think I prefer threshold values to be defined so that one writes

    if (n < cutoff)            if (n >= cutoff)
        algorithm1                 algorithm2
    else                       else
        algorithm2                 algorithm1

rather than

    if (n <= cutoff)           if (n > cutoff)
        algorithm1                 algorithm2
    else                       else
        algorithm2                 algorithm1

Feel free to change this (or don't bother if you have better things to do).

Other than that, positive review from me.

@vneiger
Copy link
Collaborator Author

vneiger commented Sep 5, 2024

An extremely minor nitpick: I think I prefer threshold values to be defined so that one writes

I did this. It actually also simplifies away some ugly macro checks.

Other than that, positive review from me.

Ok! I've made minor changes and will merge if CI passes. I'll update broadwell's flint-mparam.h later if needed, and @albinahlback should feel free to complain at any point about changes in configure.ac and the various flint-mparam.h files.

@vneiger vneiger merged commit 06c1720 into flintlib:main Sep 6, 2024
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants