Trivial improvements for xbin_min and xbin_max may lead to speedups in sample_get_x #969

Open
valassi opened this issue Aug 15, 2024 · 3 comments · May be fixed by #970 or #946

valassi commented Aug 15, 2024

I am running a few tests on sample_get_x with a view to vectorising it, see #963.

Apart from the issue reported in #968, I think I have identified two more trivial but useful improvements in sample_get_x.

This is WIP to be confirmed.

@valassi valassi self-assigned this Aug 15, 2024
@valassi valassi linked a pull request Aug 15, 2024 that will close this issue

valassi commented Aug 15, 2024

Two: I checked that in a case like CMS DY+3j, the function is most often called with xmin=0 or xmax=1, so the corresponding xbin_min and xbin_max values can be cached instead of being recomputed on every call.

This is 291bcf5
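
To make the caching idea concrete, here is a minimal Python sketch. It is not the code in dsample.f, which this thread does not show. The class `GridBinLookup`, its methods, and the toy bisection lookup are all hypothetical stand-ins. The sketch also assumes that the cached values depend on the current grid, so the cache is cleared whenever the grid is replaced.

```python
from bisect import bisect_left


class GridBinLookup:
    """Toy stand-in for an xbin-style lookup (hypothetical, not dsample.f)."""

    def __init__(self, edges):
        # edges are the sorted upper bin edges; the last one is assumed to be 1.0
        self.edges = list(edges)
        self.calls = 0      # number of uncached lookups, for illustration only
        self._cache = {}    # results for the two special inputs 0.0 and 1.0

    def _xbin(self, x):
        """Uncached lookup: bisection plus linear interpolation inside the bin."""
        self.calls += 1
        i = min(bisect_left(self.edges, x), len(self.edges) - 1)
        lo = self.edges[i - 1] if i > 0 else 0.0
        hi = self.edges[i]
        return i + (x - lo) / (hi - lo)

    def xbin_cached(self, x):
        """Return the cached result for x == 0.0 or x == 1.0, else look it up."""
        if x == 0.0 or x == 1.0:
            if x not in self._cache:
                self._cache[x] = self._xbin(x)
            return self._cache[x]
        return self._xbin(x)

    def update_grid(self, edges):
        """Replace the grid and drop cached values that depended on the old one."""
        self.edges = list(edges)
        self._cache.clear()


lookup = GridBinLookup([0.1, 0.4, 1.0])
for _ in range(3):
    lookup.xbin_cached(0.0)   # only the first call does a real lookup
    lookup.xbin_cached(1.0)   # same here
lookup.xbin_cached(0.25)      # generic values are never cached
print(lookup.calls)
```

The benefit depends entirely on how often the special values occur, which is why the observation about xmin=0 and xmax=1 in DY+3j matters.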

valassi commented Aug 15, 2024

One: some minor streamlining of the xbin_min and xbin_max calculations seems to be useful.

This might be the relevant change, but it seems too trivial to have an effect, so maybe the gain came from elsewhere: 23a1358

valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 15, 2024
…ode for xbin_min and xbin_max (part1 of madgraph5#969)

There is indeed a small but clear improvement

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.5494s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1688s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.2830s for  1170103 events => throughput is 2.81E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1061s for    49152 events => throughput is 2.16E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1361s for    16384 events => throughput is 8.31E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0519s for    16384 events => throughput is 3.17E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0649s for    16384 events => throughput is 3.96E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1366s for  1170103 events => throughput is 1.17E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4745s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0257s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.5145s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 15, 2024
… for xmin=0 and xbin_max for xmax=1 (part2 of madgraph5#969)

There is indeed another clear and not too small improvement

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.2184s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1695s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0672s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.9293s for  1170103 events => throughput is 2.50E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1094s for    49152 events => throughput is 2.23E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1379s for    16384 events => throughput is 8.42E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0560s for    16384 events => throughput is 3.42E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0707s for    16384 events => throughput is 4.31E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1447s for  1170103 events => throughput is 1.24E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.1834s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
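
A note on reading these counters, with a small Python sketch. The regular expression and the helper `parse_counter` are illustrative assumptions, not part of madgraph4gpu. For the lines shown here, the number printed after "throughput is" equals the elapsed time divided by the event count, so it reads as seconds per event even though it is labelled events/s. The sketch computes both quantities.

```python
import re

# Matches lines such as
#  [COUNTERS] Fortran Random2Momenta ( 3 ) : 3.2830s for 1170103 events => throughput is 2.81E-06 events/s
# The "( id )" part and the "for N events => ..." part are both optional.
COUNTER_RE = re.compile(
    r"\[COUNTERS\]\s+(?P<name>.+?)\s+(?:\(\s*(?P<id>\d+)\s*\)\s*)?:\s*(?P<secs>[\d.]+)s"
    r"(?:\s+for\s+(?P<events>\d+)\s+events\s+=>\s+throughput is\s+(?P<printed>[\d.Ee+-]+))?"
)


def parse_counter(line):
    """Return a dict for one [COUNTERS] line, or None if the line does not match."""
    m = COUNTER_RE.search(line)
    if m is None:
        return None
    rec = {"name": m.group("name"), "seconds": float(m.group("secs"))}
    if m.group("events") is not None:
        events = int(m.group("events"))
        rec["events"] = events
        rec["printed"] = float(m.group("printed"))          # value as printed in the log
        rec["seconds_per_event"] = rec["seconds"] / events  # matches the printed value
        rec["events_per_second"] = events / rec["seconds"]  # throughput in the usual sense
    return rec


line = (" [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.2830s"
        " for  1170103 events => throughput is 2.81E-06 events/s")
rec = parse_counter(line)
print(rec["name"], rec["seconds_per_event"], rec["events_per_second"])
```

For the Random2Momenta line above, this gives roughly 2.8e-06 seconds per event, or about 3.6e+05 events per second.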

valassi commented Aug 15, 2024

Compare the timings, starting from the default (079207d):

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 Found          997  events.
 Wrote           59  events.
 Actual xsec    5.9274488566377981
 [COUNTERS] PROGRAM TOTAL                         :    4.6537s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1603s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0673s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.4183s for  1170103 events => throughput is 2.92E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1002s for    49152 events => throughput is 2.04E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1307s for    16384 events => throughput is 7.98E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0505s for    16384 events => throughput is 3.08E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0657s for    16384 events => throughput is 4.01E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1321s for  1170103 events => throughput is 1.13E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4682s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0257s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.6191s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0346s for    16384 events => throughput is 2.11E-06 events/s

Then change 1, removing a few xbin calls (b69c61c):

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.5494s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1688s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0669s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    3.2830s for  1170103 events => throughput is 2.81E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1061s for    49152 events => throughput is 2.16E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1361s for    16384 events => throughput is 8.31E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0519s for    16384 events => throughput is 3.17E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0649s for    16384 events => throughput is 3.96E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1366s for  1170103 events => throughput is 1.17E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4745s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0257s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.5145s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s

Then change 2, caching the xbin values (a6d57a8):

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.2184s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1695s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0672s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.9293s for  1170103 events => throughput is 2.50E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1094s for    49152 events => throughput is 2.23E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1379s for    16384 events => throughput is 8.42E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0560s for    16384 events => throughput is 3.42E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0707s for    16384 events => throughput is 4.31E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1447s for  1170103 events => throughput is 1.24E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4719s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0267s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.1834s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0350s for    16384 events => throughput is 2.13E-06 events/s

I think this could become a small standalone PR. To discuss with @oliviermattelaer
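
For reference, the relative gains in the three runs above can be worked out with a few lines of Python. This is only arithmetic on the numbers quoted in this comment. Each timing comes from a single run, so small differences may be within run-to-run noise. The helper `reduction_percent` is introduced here for illustration.

```python
# (label, PROGRAM TOTAL in s, Fortran Random2Momenta in s), copied from the logs above
runs = [
    ("default (079207d)", 4.6537, 3.4183),
    ("change 1 (b69c61c)", 4.5494, 3.2830),
    ("change 2 (a6d57a8)", 4.2184, 2.9293),
]


def reduction_percent(before, after):
    """Relative decrease from before to after, in percent."""
    return 100.0 * (before - after) / before


base_label, base_total, base_r2m = runs[0]
for label, total, r2m in runs[1:]:
    print(f"{label}: total -{reduction_percent(base_total, total):.1f}%, "
          f"Random2Momenta -{reduction_percent(base_r2m, r2m):.1f}% vs {base_label}")
```

Relative to the default, Random2Momenta drops by about 4% after change 1 and by about 14% after both changes. The program total drops by about 9% after both.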

valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 19, 2024
… gg_tt.mad), simplify the code for xbin_min and xbin_max (part1 of madgraph5#969)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 19, 2024
… gg_tt.mad), cache xbin_min for xmin=0 and xbin_max for xmax=1 (part2 of madgraph5#969)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 19, 2024
… gg_tt.mad), comment out dead if/then branches (for warnings that are commented out)

This is another minor component of madgraph5#969. It gives almost insignificant performance improvements, but it simplifies the code.

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.1574s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1706s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.8950s for  1170103 events => throughput is 2.47E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.1021s for    49152 events => throughput is 2.08E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1360s for    16384 events => throughput is 8.30E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0518s for    16384 events => throughput is 3.16E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0679s for    16384 events => throughput is 4.15E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1401s for  1170103 events => throughput is 1.20E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4658s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0263s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0347s for    16384 events => throughput is 2.12E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.1227s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0347s for    16384 events => throughput is 2.12E-06 events/s
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 19, 2024
… gg_tt.mad), skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set (part3 of madgraph5#969)

This is a very large improvement, but it may be more controversial, hence it is disabled by default...

CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    4.1142s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1610s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0670s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.8821s for  1170103 events => throughput is 2.46E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0962s for    49152 events => throughput is 1.96E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1278s for    16384 events => throughput is 7.80E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0485s for    16384 events => throughput is 2.96E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0670s for    16384 events => throughput is 4.09E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1355s for  1170103 events => throughput is 1.16E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4683s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0262s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0348s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    4.0794s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0348s for    16384 events => throughput is 2.13E-06 events/s

CUDACPP_RUNTIME_SKIPXBINCHECKS=1 CUDACPP_RUNTIME_DISABLEFPE=1 ./build.cuda_d_inl0_hrd0/madevent_cuda < /tmp/avalassi/input_dy3j_x1_cudacpp
 [COUNTERS] PROGRAM TOTAL                         :    3.2969s
 [COUNTERS] Fortran Other                  (  0 ) :    0.1726s
 [COUNTERS] Fortran Initialise(I/O)        (  1 ) :    0.0674s
 [COUNTERS] Fortran Random2Momenta         (  3 ) :    2.0464s for  1170103 events => throughput is 1.75E-06 events/s
 [COUNTERS] Fortran PDFs                   (  4 ) :    0.0958s for    49152 events => throughput is 1.95E-06 events/s
 [COUNTERS] Fortran UpdateScaleCouplings   (  5 ) :    0.1298s for    16384 events => throughput is 7.92E-06 events/s
 [COUNTERS] Fortran Reweight               (  6 ) :    0.0482s for    16384 events => throughput is 2.94E-06 events/s
 [COUNTERS] Fortran Unweight(LHE-I/O)      (  7 ) :    0.0656s for    16384 events => throughput is 4.00E-06 events/s
 [COUNTERS] Fortran SamplePutPoint         (  8 ) :    0.1412s for  1170103 events => throughput is 1.21E-07 events/s
 [COUNTERS] CudaCpp Initialise             ( 11 ) :    0.4685s
 [COUNTERS] CudaCpp Finalise               ( 12 ) :    0.0266s
 [COUNTERS] CudaCpp MEs                    ( 19 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
 [COUNTERS] OVERALL NON-MEs                ( 21 ) :    3.2620s
 [COUNTERS] OVERALL MEs                    ( 22 ) :    0.0349s for    16384 events => throughput is 2.13E-06 events/s
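
The opt-in mechanism can be pictured with the following Python sketch. It assumes that "set" means the variable is present in the environment with any value, and that the flag is read once rather than on every call. The thread does not describe what the xbin checks actually verify, so the range check below is only a placeholder. `Sampler`, `get_x` and `read_skip_flag` are hypothetical names.

```python
import os

ENV_NAME = "CUDACPP_RUNTIME_SKIPXBINCHECKS"


def read_skip_flag(environ=None):
    """True if the variable is present (assumption: any value counts as 'set')."""
    env = os.environ if environ is None else environ
    return ENV_NAME in env


class Sampler:
    """Toy sampler: maps a random number r in [0, 1] to x in [xmin, xmax]."""

    def __init__(self, environ=None):
        self.skip_checks = read_skip_flag(environ)  # read once, not per call
        self.checks_run = 0                         # for illustration only

    def get_x(self, r, xmin, xmax):
        if not self.skip_checks:
            # Placeholder for the real checks, which are not shown in this thread.
            self.checks_run += 1
            if not (0.0 <= xmin < xmax <= 1.0):
                raise ValueError(f"invalid range [{xmin}, {xmax}]")
        return xmin + r * (xmax - xmin)


default = Sampler(environ={})
fast = Sampler(environ={ENV_NAME: "1"})
print(default.get_x(0.5, 0.0, 1.0), default.checks_run)
print(fast.get_x(0.5, 0.0, 1.0), fast.checks_run)
```

Keeping the checks on by default matches the cautious choice described above: the faster path has to be requested explicitly.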
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 19, 2024
…5#969 performance improvements in sample_get_x in dsample.f

This includes
- simplify the code for xbin_min and xbin_max (remove dead code)
- cache xbin_min for xmin=0 and xbin_max for xmax=1
- comment out dead if/then branches (for warnings that were already commented out)
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that regenerating gg_tt.mad is ok)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 19, 2024
…graph5#969 improvements in dsample.f) on itscrd90

Code generation completed in 245 seconds
Code generation and additional checks completed in 372 seconds
@valassi valassi linked a pull request Aug 19, 2024 that will close this issue
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
… copy this to gg_tt.mad!], skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set (part3 of madgraph5#969)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…5#969 performance improvements in sample_get_x in dsample.f

This includes
- simplify the code for xbin_min and xbin_max (remove dead code)
- cache xbin_min for xmin=0 and xbin_max for xmax=1
- comment out dead if/then branches (for warnings that were already commented out)
- [NOT YET INCLUDED! I forgot this...] optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that regenerating gg_tt.mad is ok)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…graph5#969 improvements in dsample.f) on itscrd90 [NB: CUDACPP_RUNTIME_SKIPXBINCHECKS is still missing here!]

Code generation completed in 245 seconds
Code generation and additional checks completed in 372 seconds
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…cluding the latest timers/counters and madgraph5#969 sample_get_x speedups [NB: CUDACPP_RUNTIME_SKIPXBINCHECKS still missing!]

CUDACPP_RUNTIME_DISABLEFPE=1 ./tlau/lauX.sh -fortran pp_dy3j.mad -togridpack
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
… CUDACPP_RUNTIME_SKIPXBINCHECKS patch madgraph5#968 (on top of madgraph5#969)

This includes
- optionally skip xbin checks if CUDACPP_RUNTIME_SKIPXBINCHECKS is set

The only files that still need to be patched are
- 4 in patch.common: Source/makefile, Source/genps.inc, Source/dsample.f, SubProcesses/makefile
- 4 in patch.P1: auto_dsig1.f, auto_dsig.f, driver.f, matrix1.f

./CODEGEN/generateAndCompare.sh gg_tt --mad --nopatch
git diff --no-ext-diff -R gg_tt.mad/Source/makefile gg_tt.mad/Source/genps.inc gg_tt.mad/SubProcesses/makefile gg_tt.mad/Source/dsample.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.common
git diff --no-ext-diff -R gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig1.f gg_tt.mad/SubProcesses/P1_gg_ttx/auto_dsig.f gg_tt.mad/SubProcesses/P1_gg_ttx/driver.f gg_tt.mad/SubProcesses/P1_gg_ttx/matrix1.f > CODEGEN/PLUGIN/CUDACPP_SA_OUTPUT/MG5aMC_patches/PROD/patch.P1
git checkout gg_tt.mad

(Later checked that regenerating gg_tt.mad is ok)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…UDACPP_RUNTIME_SKIPXBINCHECKS set madgraph5#968 : big improvement!)

Current state of the cuda backend, skipping xbin checks (madgraph5#968).
Phase space sampling in dy+3j has decreased from 78s to 53s (down by about 30%) thanks to the removal of xbin checks.
> [GridPackCmd.launch] GRIDPCK TOTAL                    135.1144
> [madevent COUNTERS]  PROGRAM TOTAL                    130.8140s
> [madevent COUNTERS]  Fortran PhaseSpaceSampling        53.0338s for   44652395 events
> ...
> [madevent COUNTERS]  CudaCpp MEs                       35.4908s for    1769472 events
> [madevent COUNTERS]  OVERALL NON-MEs                   95.3232s
> [madevent COUNTERS]  OVERALL MEs                       35.4908s for    1769472 events

Previous state of the cuda backend, still with xbin checks but including the trivial improvements of madgraph5#969.
Phase space sampling in dy+3j had decreased from 93s to 78s (down by about 15%) thanks to those trivial improvements.
< [GridPackCmd.launch] GRIDPCK TOTAL                    160.1718
< [madevent COUNTERS]  PROGRAM TOTAL                    155.8605s
< [madevent COUNTERS]  Fortran PhaseSpaceSampling        78.1023s for   44652395 events
< ...
< [madevent COUNTERS]  CudaCpp MEs                       35.4320s for    1769472 events
< [madevent COUNTERS]  OVERALL NON-MEs                  120.4290s
< [madevent COUNTERS]  OVERALL MEs                       35.4320s for    1769472 events

Earlier state of the cuda backend in 2e59eca, without the trivial improvements:
< [GridPackCmd.launch] GRIDPCK TOTAL                    176.8891
< [madevent COUNTERS]  PROGRAM TOTAL                    172.6370s
< [madevent COUNTERS]  Fortran Random2Momenta            93.2907s for   44651014 events
< ...
< [madevent COUNTERS]  CudaCpp MEs                       35.4557s for    1769472 events
< [madevent COUNTERS]  OVERALL NON-MEs                  137.1806s
< [madevent COUNTERS]  OVERALL MEs                       35.4557s for    1769472 events
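
As a cross-check of the gridpack numbers above, this Python sketch computes the share of the program total spent in phase space sampling and in matrix elements for each of the three states. It treats the older "Fortran Random2Momenta" counter and the newer "Fortran PhaseSpaceSampling" counter as the same stage, as the comparison above does. The helper `share_percent` is introduced here for illustration.

```python
# (label, PROGRAM TOTAL, phase space sampling, CudaCpp MEs), all in seconds, from the logs above
states = [
    ("2e59eca, no improvements", 172.6370, 93.2907, 35.4557),
    ("with #969 improvements", 155.8605, 78.1023, 35.4320),
    ("also skipping xbin checks", 130.8140, 53.0338, 35.4908),
]


def share_percent(part, total):
    """Fraction of total taken by part, in percent."""
    return 100.0 * part / total


for label, total, sampling, mes in states:
    print(f"{label}: sampling {share_percent(sampling, total):.1f}%, "
          f"MEs {share_percent(mes, total):.1f}% of {total:.1f}s")
```

Sampling goes from about 54% of the program total to about 41%, while the time spent in matrix elements stays at about 35s throughout, so sampling remains the larger of the two contributions even with the checks skipped.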
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)
valassi added a commit to valassi/madgraph4gpu that referenced this issue Aug 22, 2024
…ts - but not yet the latest upstream/master) into cmsdyps

Fix conflicts in patch.P1 and patch.common (NB: the 968/969 improvements are now in the OLD sample_get_x)