(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) #978

Draft · wants to merge 66 commits into base: master

Conversation

@valassi valassi commented Aug 27, 2024

WIP on removing template/inline from helas (related to splitting kernels)

…FVs and for compiling them as separate object files (related to splitting kernels)
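For orientation, a minimal self-contained sketch of the idea described in these commits, with generic names rather than the actual generated helas code: the template stays in a header, but with HELINL=L it is instantiated only once, behind a plain non-template entry point, inside a separately compiled object file.

```cpp
// Minimal sketch only (generic names, not the actual generated helas code).
// helamps_sketch.h -- template amplitude routine, normally inlined at every
// call site (the default HELINL=0 behaviour):
template<typename MemAccess>
__device__ inline void amp_template( const double* w2, const double* w3, double* w1 )
{
  for( int i = 0; i < 6; i++ ) w1[i] = MemAccess::load( w2, i ) * MemAccess::load( w3, i );
}

// HelAmps.cu -- HELINL=L: instantiate the template once, behind a plain
// non-template entry point, in a separately compiled object file:
struct DirectAccess
{
  static __device__ double load( const double* w, int i ) { return w[i]; }
};

__device__ void helas_amp( const double* w2, const double* w3, double* w1 )
{
  amp_template<DirectAccess>( w2, w3, w1 ); // the only instantiation of the template
}

// CPPProcess.cu then only declares 'helas_amp' and links against HelAmps.o,
// instead of re-instantiating (and re-inlining) the template at every call
// site in every P* subprocess.
```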
@valassi valassi self-assigned this Aug 27, 2024
@valassi valassi marked this pull request as draft August 27, 2024 15:37
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal   : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
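This unresolved extern is the expected symptom of calling a `__device__` function whose definition lives in another object file, without relocatable device code. A sketch of the calling side (parameter names are guessed from the demangled symbol, not copied from the generated code):

```cpp
// CPPProcess.cc view (illustrative): only the declaration of the helas wrapper
// is visible here; its definition is compiled separately into HelAmps_cuda.o.
// The mangled name in the ptxas error demangles to
//   mg5amcGpu::helas_VVV1P0_1(double const*, double const*, double const*, double, double, double, double*)
namespace mg5amcGpu
{
  __device__ void helas_VVV1P0_1( const double* V2, const double* V3, const double* COUP,
                                  double Ccoeff, double M1, double W1, double* V1 ); // defined elsewhere
}
// Without relocatable device code, ptxas must resolve every device-side call
// within a single translation unit, hence the fatal error above. Building with
// 'nvcc -rdc=true' (or -dc) defers the resolution to the nvlink device-link
// step; this is the "RDC for CUDA" approach referred to in the PR title.
```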
…cuda tests succeed

The build issues some warnings, however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
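A sketch of what such a dispatch can look like at a call site (illustrative excerpt only; the generated code and the real wrapper names and signatures may differ):

```cpp
// With HELINL=L the build defines MGONGPU_LINKER_HELAMPS, so CPPProcess.cc can
// switch each helas call between the header template and a pre-compiled
// 'linker_' wrapper:
#ifdef MGONGPU_LINKER_HELAMPS
      // out-of-line wrapper, defined in HelAmps.cc and compiled once per P* subprocess
      linker_CD_VVV1P0_1( w_fp[0], w_fp[1], COUPs[0], 1.0, 0., 0., w_fp[4] );
#else
      // header template, instantiated and possibly inlined at every call site
      VVV1P0_1<W_ACCESS, CD_ACCESS>( w_fp[0], w_fp[1], COUPs[0], 1.0, 0., 0., w_fp[4] );
#endif
```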
…c++, a factor 3 slower for cuda...

./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly

diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt

-Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.589473e+07                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08                 )  sec^-1
-MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
-TOTAL       :     0.528239 sec
-INFO: No Floating Point Exceptions have been reported
-     2,222,057,027      cycles                           #    2.887 GHz
-     3,171,868,018      instructions                     #    1.43  insn per cycle
-       0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME]     (23) = ( 2.667135e+07                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07                 )  sec^-1
+MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
+TOTAL       :     0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+     2,272,219,097      cycles                           #    2.889 GHz
+     3,361,475,195      instructions                     #    1.48  insn per cycle
+       0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…P* (the source is the same but it must be compiled in each P* separately)
valassi commented Aug 28, 2024

The functionality is in principle complete, including the backport to CODEGEN. I will now run some functionality and performance tests.

git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.338149e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543      cycles                           #    2.902 GHz
-    16,466,315,526      instructions                     #    2.24  insn per cycle
-       2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.063038e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552      cycles                           #    2.893 GHz
+    17,401,037,642      instructions                     #    2.18  insn per cycle
+       2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.459662e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.835352e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.528240 sec
+TOTAL       :     1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789      cycles                           #    2.703 GHz
-     9,072,597,595      instructions                     #    2.19  insn per cycle
-       1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,738,350,469      cycles                           #    2.705 GHz
+     8,514,195,736      instructions                     #    2.28  insn per cycle
+       1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
 -------------------------------------------------------------------------
… (commented out) for the memory corruption madgraph5#806

This shows an uninitialised value deep inside hiprand

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind ./check_hip.exe -p 1 8 1
==105499== Memcheck, a memory error detector
==105499== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105499== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==105499== Command: ./check_hip.exe -p 1 8 1
==105499==
==105499== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
Get random numbers from Hiprand
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x1253777C: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x12537B82: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
Got random numbers from Hiprand
==105499== Invalid read of size 8
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  Address 0x1c00000043 is not stack'd, malloc'd or (recently) free'd
==105499==
==105499==
==105499== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==105499==  Access not within mapped region at address 0x1C00000043
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  If you believe this happened as a result of a stack
==105499==  overflow in your program's main thread (unlikely but
==105499==  possible), you can try to increase the size of the
==105499==  main thread stack using the --main-stacksize= flag.
==105499==  The main thread stack size used in this run was 16777216.

Unfortunately, however, --common also crashes (and gives the same uninitialised-value problem, whether related or not)
…ad of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert

This makes the valgrind 'conditional jump on uninitialised variable' disappear, but the crash from invalid memory reads still remains
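A minimal sketch of this kind of debugging swap (hypothetical helper and macro names, not the actual MemoryBuffers.h code):

```cpp
#include <cstdlib>
#include <hip/hip_runtime.h>

// Hypothetical helper: switch between pinned (page-locked) HIP host memory and
// plain C++ heap memory, to check whether the valgrind reports change.
inline void* allocHostBuffer( size_t nbytes )
{
#ifdef MGONGPU_DEBUG_PLAIN_MALLOC // hypothetical debug switch, not a real flag
  return std::malloc( nbytes );   // pageable host memory (valgrind-friendly)
#else
  void* ptr = nullptr;
  if( hipHostMalloc( &ptr, nbytes, hipHostMallocDefault ) != hipSuccess ) std::abort();
  return ptr; // pinned host memory, the normal production choice
#endif
}
```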

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==10800== Memcheck, a memory error detector
==10800== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==10800== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==10800== Command: ./check_hip.exe --common -p 1 8 1
==10800==
==10800== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
==10800== Invalid read of size 8
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  Address 0x140000003b is not stack'd, malloc'd or (recently) free'd
==10800==
==10800==
==10800== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10800==  Access not within mapped region at address 0x140000003B
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  If you believe this happened as a result of a stack
==10800==  overflow in your program's main thread (unlikely but
==10800==  possible), you can try to increase the size of the
==10800==  main thread stack using the --main-stacksize= flag.
==10800==  The main thread stack size used in this run was 16777216.
==10800==
==10800== HEAP SUMMARY:
==10800==     in use at exit: 4,784,824 bytes in 17,735 blocks
==10800==   total heap usage: 306,364 allocs, 288,629 frees, 180,986,538 bytes allocated
==10800==
==10800== LEAK SUMMARY:
==10800==    definitely lost: 256 bytes in 5 blocks
==10800==    indirectly lost: 3,522 bytes in 64 blocks
==10800==      possibly lost: 9,544 bytes in 80 blocks
==10800==    still reachable: 4,771,502 bytes in 17,586 blocks
==10800==                       of which reachable via heuristic:
==10800==                         multipleinheritance: 384 bytes in 4 blocks
==10800==         suppressed: 0 bytes in 0 blocks
==10800== Rerun with --leak-check=full to see details of leaked memory
==10800==
==10800== For lists of detected and suppressed errors, rerun with: -s
==10800== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
…madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'

Using valgrind
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==80385== Memcheck, a memory error detector
==80385== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==80385== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==80385== Command: ./check_hip.exe --common -p 1 8 1
==80385==
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() exit
==80385== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '00 GpuInit'
DEBUG: TimerMap::stop() exit
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
Memory access fault by GPU node-4 (Agent handle: 0x1417d4a0) on address 0xfffd862e5000. Reason: Unknown.
==80385==
==80385== Process terminating with default action of signal 6 (SIGABRT): dumping core
==80385==    at 0x63D3D2B: raise (in /lib64/libc-2.31.so)
==80385==    by 0x63D53E4: abort (in /lib64/libc-2.31.so)
==80385==    by 0x12580D1B: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1257ABC8: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1252C9E6: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x127C66E9: start_thread (in /lib64/libpthread-2.31.so)
==80385==    by 0x64A150E: clone (in /lib64/libc-2.31.so)
==80385==
==80385== HEAP SUMMARY:
==80385==     in use at exit: 4,790,652 bytes in 17,774 blocks
==80385==   total heap usage: 306,424 allocs, 288,650 frees, 180,987,695 bytes allocated
==80385==
==80385== LEAK SUMMARY:
==80385==    definitely lost: 184 bytes in 4 blocks
==80385==    indirectly lost: 2,658 bytes in 52 blocks
==80385==      possibly lost: 10,768 bytes in 86 blocks
==80385==    still reachable: 4,777,042 bytes in 17,632 blocks
==80385==                       of which reachable via heuristic:
==80385==                         multipleinheritance: 496 bytes in 5 blocks
==80385==         suppressed: 0 bytes in 0 blocks
==80385== Rerun with --leak-check=full to see details of leaked memory
==80385==
==80385== For lists of detected and suppressed errors, rerun with: -s
==80385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted

Using rocgdb
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) run
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 94651)
New Thread 0x1555470b7700 (LWP 94652)
Thread 0x1554445ff700 (LWP 94651) exited
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) where
0  0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) l
1       ../sysdeps/x86_64/crtn.S: No such file or directory.
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 99032)
New Thread 0x1555470b7700 (LWP 99033)
Thread 0x1554445ff700 (LWP 99032) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
...
(gdb) info threads
  Id   Target Id                                         Frame
  1    Thread 0x1555471dda80 (LWP 98983) "check_hip.exe" 0x0000155547603d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 99017) "check_hip.exe" 0x00001555538f64a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x1555470b7700 (LWP 99033) "check_hip.exe" 0x000015554759fd04 in sem_post@@GLIBC_2.2.5 ()
   from /lib64/libpthread.so.0
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"     0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
… in vxxxxx (which may explain why this only appears in gqttq?)

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 1669)
New Thread 0x155547087700 (LWP 1670)
Thread 0x1554445ff700 (LWP 1669) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
328           vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );
(gdb) where
 0  mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
 1  mg5amcGpu::sigmaKin (allmomenta=<optimized out>, allcouplings=<optimized out>, allrndhel=<optimized out>,
    allrndcol=<optimized out>, allMEs=<optimized out>, allChannelIds=<optimized out>, allNumerators=<optimized out>,
    allDenominators=<optimized out>, allselhel=<optimized out>, allselcol=<optimized out>) at CPPProcess.cc:1043
(gdb) info threads
  Id   Target Id                                        Frame
  1    Thread 0x1555471aea80 (LWP 1645) "check_hip.exe" 0x00001555475d5d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 1655) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x155547087700 (LWP 1670) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"    mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>,
    allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>,
    allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
…d for debugging the crash madgraph5#806 in hipcc

Revert "[amd] in gq_ttq.mad cudacpp.mk, enable -ggdb... the issue seems to be in vxxxxx (which may explain why this only appears in gqttq?)"
This reverts commit 5cc62a6.

Revert "[amd] in gq_ttq.mad timermap.h, add some debug printouts for the crash madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'"
This reverts commit 5b8d92f.

Revert "[amd] in gq_ttq.mad MemoryBuffers.h, temporarely use c++ malloc instead of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert"
This reverts commit 007173a.

Revert "[amd] in gq_ttq.mad HiprandRandomNumberKernel.cc, add debug printouts (commented out) for the memory corruption madgraph5#806"
This reverts commit c7b3dc0.
…adgraph5#806 for HIPCC by disabling hipcc optimizations (use -O0 instead of -O3)

The test now succeeds!
./check_hip.exe  -p 1 8 1
…adgraph5#806 for HIPCC by disabling hipcc -O3, but keep -O2 (better than -O0)

The test now still succeeds!
./check_hip.exe  -p 1 8 1
…) - now they all succeed! gqttq crash madgraph5#806 has disappeared

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 06:24:53 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Thu 19 Sep 2024 07:15:36 PM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(2) AT Thu 19 Sep 2024 07:32:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean  -nocuda
ENDED(3) AT Thu 19 Sep 2024 07:41:44 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst  -nocuda
ENDED(4) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda'
ENDED(5) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda
ENDED(6) AT Thu 19 Sep 2024 07:45:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(7) AT Thu 19 Sep 2024 08:17:24 PM EEST [Status=0]

No errors found in logs
…ds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 11:37:44 PM EEST
(SM tests)
ENDED(1) AT Fri 20 Sep 2024 02:00:00 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 20 Sep 2024 02:08:55 AM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[amd] rerun 30 tmad tests on LUMI against AMD GPUs - now gqttq succeeds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit 0d7d4cd.

Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) - now they all succeed! gqttq crash madgraph5#806 has disappeared"
This reverts commit e41c7ff.
…he getCompiler() function

This gives for instance:
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > ./check_hip.exe  -p 1 8 1
Process = SIGMA_SM_GUX_TTXUX_HIP [hipcc 6.0.32831 (clang 17.0.0)] [inlineHel=0] [hardcodePARAM=0]
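For illustration, a banner like this can be assembled from compiler-provided version macros; a minimal sketch for a hipcc build (not the actual getCompiler() implementation, and the macros actually used there may differ):

```cpp
// Hypothetical stand-in for getCompiler(): build a string such as
// "hipcc 6.0.32831 (clang 17.0.0)" from predefined macros.
#include <hip/hip_version.h> // HIP_VERSION_MAJOR / MINOR / PATCH
#include <sstream>
#include <string>

inline std::string compilerBanner()
{
  std::ostringstream out;
  out << "hipcc " << HIP_VERSION_MAJOR << "." << HIP_VERSION_MINOR << "." << HIP_VERSION_PATCH;
#if defined( __clang__ )
  // hipcc drives a clang compiler, whose version appears in parentheses
  out << " (clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__ << ")";
#endif
  return out.str();
}
```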

(Checked that all is ok when regenerating gq_ttq.mad/SubProcesses/P1_gux_ttxux)
git checkout upstream/master tput/logs_* tmad/logs_*
Fix conflicts (essentially, add -inlL and -inlLonly options to upstream/master scripts):
- epochX/cudacpp/tmad/madX.sh
- epochX/cudacpp/tmad/teeMadX.sh
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
valassi commented Sep 20, 2024

I updated this with the latest master, as I am doing on all PRs.

  • test this mode on HIP (what is the rdc equivalent?)

I had some LUMI shell running and I tried this (after also merging in #1007 with various AMD things)

There is a -fgpu-rdc option: compilation succeeds with it, but the issues come at link time.

Note that #802 is actually a 'shared object initialization failed' error

So the status is:

  • HELINL=L works ok for C++ and (with rdc) for CUDA
  • HELINL=L does not work for HIP yet

…=L) to cuda only as it does not apply to hip

The hip compilation of CPPProcess.cc now fails as follows:
ccache /opt/rocm-6.0.3/bin/hipcc  -I. -I../../src   -O2 --offload-arch=gfx90a -target x86_64-linux-gnu -DHIP_PLATFORM=amd -DHIP_FAST_MATH -I/opt/rocm-6.0.3/include/ -std=c++17 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS  -fPIC -c -x hip CPPProcess.cc -o CPPProcess_hip.o
lld: error: undefined hidden symbol: mg5amcGpu::linker_CD_FFV1_0(double const*, double const*, double const*, double const*, double, double*)
…ompilation on hip for HELINL=L

The hip link of check_hip.exe now fails with
ccache /opt/rocm-6.0.3/bin/hipcc -o check_hip.exe ./check_sa_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o  -L/opt/rocm-6.0.3/lib/ -lhiprand
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __hip_fatbin
…k_hip.exe link on hip for HELINL=L, the build succeeds but at runtime it fails

The execution fails with
./check_hip.exe -p 1 8 1
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558

In addition, the hip link of fcheck_hip.exe fails with
ftn --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64
gfortran-13 -march=znver3 -D__CRAY_X86_TRENTO -D__CRAY_AMD_GFX90A -D__CRAYXT_COMPUTE_LINUX_TARGET -D__TARGET_LINUX__ -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath=$ORIGIN/../../lib -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64 -Wl,-Bdynamic -Wl,--as-needed,-lgfortran,-lquadmath,--no-as-needed -Wl,--as-needed,-lpthread,--no-as-needed -Wl,--disable-new-dtags
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: ../../lib/libmg5amc_gg_ttx_hip.so: undefined reference to `__hip_fatbin'
…ipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert

Also add -ggdb for debugging. At runtime this fails with the usual madgraph5#802.
It is now clear that this happens in gpuMemcpyToSymbol (line 558), and the error is precisely 'shared object initialization failed'.

./fcheck_hip.exe 1 32 1
...
WARNING! Instantiate device Bridge (nevt=32, gpublocks=1, gputhreads=32, gpublocks*gputhreads=32)
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558
fcheck_hip.exe: ./GpuRuntime.h:26: void assertGpu(hipError_t, const char *, int, bool): Assertion `code == gpuSuccess' failed.

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
0  0x14f947bff2e2 in ???
1  0x14f947bfe475 in ???
2  0x14f945f33dbf in ???
3  0x14f945f33d2b in ???
4  0x14f945f353e4 in ???
5  0x14f945f2bc69 in ???
6  0x14f945f2bcf1 in ???
7  0x14f947bcef96 in _Z9assertGpu10hipError_tPKcib
        at ./GpuRuntime.h:26
8  0x14f947bcef96 in _ZN9mg5amcGpu10CPPProcessC2Ebb
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:558
9  0x14f947bd2cf3 in _ZN9mg5amcGpu6BridgeIdEC2Ejjj
        at ./Bridge.h:268
10  0x14f947bd678e in fbridgecreate_
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/fbridge.cc:54
11  0x2168fd in ???
12  0x216bfe in ???
13  0x14f945f1e24c in ???
14  0x216249 in _start
        at ../sysdeps/x86_64/start.S:120
15  0xffffffffffffffff in ???
Aborted
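For reference, a minimal sketch of the gpuMemcpyToSymbol pattern at CPPProcess.cc:558 (illustrative symbol and function names; gpuMemcpyToSymbol is presumably the portability alias for cudaMemcpyToSymbol / hipMemcpyToSymbol):

```cpp
#include <hip/hip_runtime.h>
#include <cassert>

// Illustrative __constant__ symbol (not the actual CPPProcess.cc variable):
// independent parameters copied once from the host to device constant memory.
__constant__ double cIPD[2];

void copyIndependentParametersToDevice( const double* hostIPD )
{
  // On HIP this is where 'shared object initialization failed' (303,
  // hipErrorSharedObjectInitFailed) is reported when the device code of the
  // shared library containing the symbol was not initialized correctly.
  hipError_t code = hipMemcpyToSymbol( HIP_SYMBOL( cIPD ), hostIPD, 2 * sizeof( double ), 0, hipMemcpyHostToDevice );
  assert( code == hipSuccess );
}
```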
… hipcc to link fcheck_hip.exe

Revert "[helas] in gg_tt.mad cudacpp.mk, temporarely go back and try to use hipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert"
This reverts commit 988419b.

NOTE: I tried to use FC=hipcc and this also compiles the fortran ok!
Probably it internally uses flang from llvm madgraph5#804

The problem, however, is that there is no lowercase 'main' in fcheck_sa_fortran.o, only an uppercase 'MAIN_'.

Summary of the status: HELINL=L "rdc" is not supported on our AMD GPUs for now.
…y and support HELINL=L on AMD GPUs via HIP (still incomplete)
@valassi valassi changed the title (WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA) (WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) Sep 21, 2024