(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) #978

Draft · wants to merge 66 commits into base: master

Conversation

@valassi valassi commented Aug 27, 2024

WIP on removing template/inline from helas (related to splitting kernels)

…FVs and for compiling them as separate object files (related to splitting kernels)
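For orientation, a minimal self-contained sketch of the idea described in these commits, with generic names rather than the actual generated helas code: the template stays in a header, but with HELINL=L it is instantiated only once, behind a plain non-template entry point, inside a separately compiled object file.

```cpp
// Minimal sketch only (generic names, not the actual generated helas code).
// helamps_sketch.h -- template amplitude routine, normally inlined at every
// call site (the default HELINL=0 behaviour):
template<typename MemAccess>
__device__ inline void amp_template( const double* w2, const double* w3, double* w1 )
{
  for( int i = 0; i < 6; i++ ) w1[i] = MemAccess::load( w2, i ) * MemAccess::load( w3, i );
}

// HelAmps.cu -- HELINL=L: instantiate the template once, behind a plain
// non-template entry point, in a separately compiled object file:
struct DirectAccess
{
  static __device__ double load( const double* w, int i ) { return w[i]; }
};

__device__ void helas_amp( const double* w2, const double* w3, double* w1 )
{
  amp_template<DirectAccess>( w2, w3, w1 ); // the only instantiation of the template
}

// CPPProcess.cu then only declares 'helas_amp' and links against HelAmps.o,
// instead of re-instantiating (and re-inlining) the template at every call
// site in every P* subprocess.
```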
@valassi valassi self-assigned this Aug 27, 2024
@valassi valassi marked this pull request as draft August 27, 2024 15:37
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal   : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
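This unresolved extern is the expected symptom of calling a `__device__` function whose definition lives in another object file, without relocatable device code. A sketch of the calling side (parameter names are guessed from the demangled symbol, not copied from the generated code):

```cpp
// CPPProcess.cc view (illustrative): only the declaration of the helas wrapper
// is visible here; its definition is compiled separately into HelAmps_cuda.o.
// The mangled name in the ptxas error demangles to
//   mg5amcGpu::helas_VVV1P0_1(double const*, double const*, double const*, double, double, double, double*)
namespace mg5amcGpu
{
  __device__ void helas_VVV1P0_1( const double* V2, const double* V3, const double* COUP,
                                  double Ccoeff, double M1, double W1, double* V1 ); // defined elsewhere
}
// Without relocatable device code, ptxas must resolve every device-side call
// within a single translation unit, hence the fatal error above. Building with
// 'nvcc -rdc=true' (or -dc) defers the resolution to the nvlink device-link
// step; this is the "RDC for CUDA" approach referred to in the PR title.
```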
…cuda tests succeed

The build issues some warnings, however:
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
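A sketch of what such a dispatch can look like at a call site (illustrative excerpt only; the generated code and the real wrapper names and signatures may differ):

```cpp
// With HELINL=L the build defines MGONGPU_LINKER_HELAMPS, so CPPProcess.cc can
// switch each helas call between the header template and a pre-compiled
// 'linker_' wrapper:
#ifdef MGONGPU_LINKER_HELAMPS
      // out-of-line wrapper, defined in HelAmps.cc and compiled once per P* subprocess
      linker_CD_VVV1P0_1( w_fp[0], w_fp[1], COUPs[0], 1.0, 0., 0., w_fp[4] );
#else
      // header template, instantiated and possibly inlined at every call site
      VVV1P0_1<W_ACCESS, CD_ACCESS>( w_fp[0], w_fp[1], COUPs[0], 1.0, 0., 0., w_fp[4] );
#endif
```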
…c++, a factor 3 slower for cuda...

./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly

diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt

-Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.589473e+07                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08                 )  sec^-1
-MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
-TOTAL       :     0.528239 sec
-INFO: No Floating Point Exceptions have been reported
-     2,222,057,027      cycles                           #    2.887 GHz
-     3,171,868,018      instructions                     #    1.43  insn per cycle
-       0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME]     (23) = ( 2.667135e+07                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07                 )  sec^-1
+MeanMatrixElemValue         = ( 2.086689e+00 +- 3.413217e-03 )  GeV^0
+TOTAL       :     0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+     2,272,219,097      cycles                           #    2.889 GHz
+     3,361,475,195      instructions                     #    1.48  insn per cycle
+       0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
…P* (the source is the same but it must be compiled in each P* separately)
valassi commented Aug 28, 2024

The functionality is in principle complete, including the backport to CODEGEN. I will now run some functionality and performance tests.

git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL

ccache /usr/local/cuda-12.0/bin/nvcc  -I. -I../../src  -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX  -std=c++17  -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
… build time is from cache

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…mode (use that from the previous run, not from cache)

./tput/teeThroughputX.sh -ggttggg -makej -makeclean
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?

./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly

diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt  tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
 On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME]     (23) = ( 4.338149e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02                 )  sec^-1
-MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543      cycles                           #    2.902 GHz
-    16,466,315,526      instructions                     #    2.24  insn per cycle
-       2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME]     (23) = ( 4.063038e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02                 )  sec^-1
+MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
+TOTAL       :     2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552      cycles                           #    2.893 GHz
+    17,401,037,642      instructions                     #    2.18  insn per cycle
+       2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
 =========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process                     = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary            = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision                = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv    = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME]     (23) = ( 3.459662e+02                 )  sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02                 )  sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02                 )  sec^-1
+EvtsPerSec[Rmb+ME]     (23) = ( 3.835352e+02                 )  sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02                 )  sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02                 )  sec^-1
 MeanMatrixElemValue         = ( 1.187066e-05 +- 9.825549e-06 )  GeV^-6
-TOTAL       :     1.528240 sec
+TOTAL       :     1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789      cycles                           #    2.703 GHz
-     9,072,597,595      instructions                     #    2.19  insn per cycle
-       1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:94048) (512y:   91) (512z:    0)
+     3,738,350,469      cycles                           #    2.705 GHz
+     8,514,195,736      instructions                     #    2.28  insn per cycle
+       1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4:    0) (avx2:80619) (512y:   89) (512z:    0)
 -------------------------------------------------------------------------
… (commented out) for the memory corruption madgraph5#806

This shows an uninitialised value deep inside hiprand

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind ./check_hip.exe -p 1 8 1
==105499== Memcheck, a memory error detector
==105499== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105499== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==105499== Command: ./check_hip.exe -p 1 8 1
==105499==
==105499== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
Get random numbers from Hiprand
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x1253777C: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x12537B82: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
Got random numbers from Hiprand
==105499== Invalid read of size 8
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  Address 0x1c00000043 is not stack'd, malloc'd or (recently) free'd
==105499==
==105499==
==105499== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==105499==  Access not within mapped region at address 0x1C00000043
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==  If you believe this happened as a result of a stack
==105499==  overflow in your program's main thread (unlikely but
==105499==  possible), you can try to increase the size of the
==105499==  main thread stack using the --main-stacksize= flag.
==105499==  The main thread stack size used in this run was 16777216.

Unfortunately, however, --common also crashes (and gives the same uninitialised-value problem, whether related or not)
…ad of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert

This makes the valgrind 'conditional jump on uninitialised variable' disappear, but the crash from invalid memory reads still remains
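A minimal sketch of this kind of debugging swap (hypothetical helper and macro names, not the actual MemoryBuffers.h code):

```cpp
#include <cstdlib>
#include <hip/hip_runtime.h>

// Hypothetical helper: switch between pinned (page-locked) HIP host memory and
// plain C++ heap memory, to check whether the valgrind reports change.
inline void* allocHostBuffer( size_t nbytes )
{
#ifdef MGONGPU_DEBUG_PLAIN_MALLOC // hypothetical debug switch, not a real flag
  return std::malloc( nbytes );   // pageable host memory (valgrind-friendly)
#else
  void* ptr = nullptr;
  if( hipHostMalloc( &ptr, nbytes, hipHostMallocDefault ) != hipSuccess ) std::abort();
  return ptr; // pinned host memory, the normal production choice
#endif
}
```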

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==10800== Memcheck, a memory error detector
==10800== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==10800== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==10800== Command: ./check_hip.exe --common -p 1 8 1
==10800==
==10800== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
==10800== Invalid read of size 8
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  Address 0x140000003b is not stack'd, malloc'd or (recently) free'd
==10800==
==10800==
==10800== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10800==  Access not within mapped region at address 0x140000003B
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==  If you believe this happened as a result of a stack
==10800==  overflow in your program's main thread (unlikely but
==10800==  possible), you can try to increase the size of the
==10800==  main thread stack using the --main-stacksize= flag.
==10800==  The main thread stack size used in this run was 16777216.
==10800==
==10800== HEAP SUMMARY:
==10800==     in use at exit: 4,784,824 bytes in 17,735 blocks
==10800==   total heap usage: 306,364 allocs, 288,629 frees, 180,986,538 bytes allocated
==10800==
==10800== LEAK SUMMARY:
==10800==    definitely lost: 256 bytes in 5 blocks
==10800==    indirectly lost: 3,522 bytes in 64 blocks
==10800==      possibly lost: 9,544 bytes in 80 blocks
==10800==    still reachable: 4,771,502 bytes in 17,586 blocks
==10800==                       of which reachable via heuristic:
==10800==                         multipleinheritance: 384 bytes in 4 blocks
==10800==         suppressed: 0 bytes in 0 blocks
==10800== Rerun with --leak-check=full to see details of leaked memory
==10800==
==10800== For lists of detected and suppressed errors, rerun with: -s
==10800== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
…madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'

Using valgrind
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==80385== Memcheck, a memory error detector
==80385== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==80385== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==80385== Command: ./check_hip.exe --common -p 1 8 1
==80385==
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() exit
==80385== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '00 GpuInit'
DEBUG: TimerMap::stop() exit
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
Memory access fault by GPU node-4 (Agent handle: 0x1417d4a0) on address 0xfffd862e5000. Reason: Unknown.
==80385==
==80385== Process terminating with default action of signal 6 (SIGABRT): dumping core
==80385==    at 0x63D3D2B: raise (in /lib64/libc-2.31.so)
==80385==    by 0x63D53E4: abort (in /lib64/libc-2.31.so)
==80385==    by 0x12580D1B: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1257ABC8: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1252C9E6: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x127C66E9: start_thread (in /lib64/libpthread-2.31.so)
==80385==    by 0x64A150E: clone (in /lib64/libc-2.31.so)
==80385==
==80385== HEAP SUMMARY:
==80385==     in use at exit: 4,790,652 bytes in 17,774 blocks
==80385==   total heap usage: 306,424 allocs, 288,650 frees, 180,987,695 bytes allocated
==80385==
==80385== LEAK SUMMARY:
==80385==    definitely lost: 184 bytes in 4 blocks
==80385==    indirectly lost: 2,658 bytes in 52 blocks
==80385==      possibly lost: 10,768 bytes in 86 blocks
==80385==    still reachable: 4,777,042 bytes in 17,632 blocks
==80385==                       of which reachable via heuristic:
==80385==                         multipleinheritance: 496 bytes in 5 blocks
==80385==         suppressed: 0 bytes in 0 blocks
==80385== Rerun with --leak-check=full to see details of leaked memory
==80385==
==80385== For lists of detected and suppressed errors, rerun with: -s
==80385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted

Using rocgdb
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) run
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 94651)
New Thread 0x1555470b7700 (LWP 94652)
Thread 0x1554445ff700 (LWP 94651) exited
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) where
0  0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) l
1       ../sysdeps/x86_64/crtn.S: No such file or directory.
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 99032)
New Thread 0x1555470b7700 (LWP 99033)
Thread 0x1554445ff700 (LWP 99032) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
...
(gdb) info threads
  Id   Target Id                                         Frame
  1    Thread 0x1555471dda80 (LWP 98983) "check_hip.exe" 0x0000155547603d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 99017) "check_hip.exe" 0x00001555538f64a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x1555470b7700 (LWP 99033) "check_hip.exe" 0x000015554759fd04 in sem_post@@GLIBC_2.2.5 ()
   from /lib64/libpthread.so.0
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"     0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) ()
   from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
… in vxxxxx (which may explain why this only appears in gqttq?)

[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe  -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 1669)
New Thread 0x155547087700 (LWP 1670)
Thread 0x1554445ff700 (LWP 1669) exited
Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
328           vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );
(gdb) where
 0  mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>,
    allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>,
    jamp2_sv=<optimized out>) at CPPProcess.cc:328
 1  mg5amcGpu::sigmaKin (allmomenta=<optimized out>, allcouplings=<optimized out>, allrndhel=<optimized out>,
    allrndcol=<optimized out>, allMEs=<optimized out>, allChannelIds=<optimized out>, allNumerators=<optimized out>,
    allDenominators=<optimized out>, allselhel=<optimized out>, allselcol=<optimized out>) at CPPProcess.cc:1043
(gdb) info threads
  Id   Target Id                                        Frame
  1    Thread 0x1555471aea80 (LWP 1645) "check_hip.exe" 0x00001555475d5d57 in ?? ()
   from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 1655) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x155547087700 (LWP 1670) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe"    mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>,
    allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>,
    allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
…d for debugging the crash madgraph5#806 in hipcc

Revert "[amd] in gq_ttq.mad cudacpp.mk, enable -ggdb... the issue seems to be in vxxxxx (which may explain why this only appears in gqttq?)"
This reverts commit 5cc62a6.

Revert "[amd] in gq_ttq.mad timermap.h, add some debug printouts for the crash madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'"
This reverts commit 5b8d92f.

Revert "[amd] in gq_ttq.mad MemoryBuffers.h, temporarely use c++ malloc instead of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert"
This reverts commit 007173a.

Revert "[amd] in gq_ttq.mad HiprandRandomNumberKernel.cc, add debug printouts (commented out) for the memory corruption madgraph5#806"
This reverts commit c7b3dc0.
…adgraph5#806 for HIPCC by disabling hipcc optimizations (use -O0 instead of -O3)

The test now succeeds!
./check_hip.exe  -p 1 8 1
…adgraph5#806 for HIPCC by disabling hipcc -O3, but keep -O2 (better than -O0)

The test now still succeeds!
./check_hip.exe  -p 1 8 1
…) - now they all succeed! gqttq crash madgraph5#806 has disappeared

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 06:24:53 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean  -nocuda
ENDED(1) AT Thu 19 Sep 2024 07:15:36 PM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean  -nocuda
ENDED(2) AT Thu 19 Sep 2024 07:32:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean  -nocuda
ENDED(3) AT Thu 19 Sep 2024 07:41:44 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst  -nocuda
ENDED(4) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda'
ENDED(5) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common  -nocuda
ENDED(6) AT Thu 19 Sep 2024 07:45:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean  -nocuda
ENDED(7) AT Thu 19 Sep 2024 08:17:24 PM EEST [Status=0]

No errors found in logs
…ds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)

(Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg)

STARTED  AT Thu 19 Sep 2024 11:37:44 PM EEST
(SM tests)
ENDED(1) AT Fri 20 Sep 2024 02:00:00 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 20 Sep 2024 02:08:55 AM EEST [Status=0]

16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
Revert "[amd] rerun 30 tmad tests on LUMI against AMD GPUs - now gqttq succeeds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit 0d7d4cd.

Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) - now they all succeed! gqttq crash madgraph5#806 has disappeared"
This reverts commit e41c7ff.
…he getCompiler() function

This gives for instance:
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > ./check_hip.exe  -p 1 8 1
Process = SIGMA_SM_GUX_TTXUX_HIP [hipcc 6.0.32831 (clang 17.0.0)] [inlineHel=0] [hardcodePARAM=0]
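For illustration, a banner like this can be assembled from compiler-provided version macros; a minimal sketch for a hipcc build (not the actual getCompiler() implementation, and the macros actually used there may differ):

```cpp
// Hypothetical stand-in for getCompiler(): build a string such as
// "hipcc 6.0.32831 (clang 17.0.0)" from predefined macros.
#include <hip/hip_version.h> // HIP_VERSION_MAJOR / MINOR / PATCH
#include <sstream>
#include <string>

inline std::string compilerBanner()
{
  std::ostringstream out;
  out << "hipcc " << HIP_VERSION_MAJOR << "." << HIP_VERSION_MINOR << "." << HIP_VERSION_PATCH;
#if defined( __clang__ )
  // hipcc drives a clang compiler, whose version appears in parentheses
  out << " (clang " << __clang_major__ << "." << __clang_minor__ << "." << __clang_patchlevel__ << ")";
#endif
  return out.str();
}
```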

(Checked that all is ok when regenerating gq_ttq.mad/SubProcesses/P1_gux_ttxux)
git checkout upstream/master tput/logs_* tmad/logs_*
Fix conflicts (essentially, add -inlL and -inlLonly options to upstream/master scripts):
- epochX/cudacpp/tmad/madX.sh
- epochX/cudacpp/tmad/teeMadX.sh
- epochX/cudacpp/tput/allTees.sh
- epochX/cudacpp/tput/teeThroughputX.sh
- epochX/cudacpp/tput/throughputX.sh
valassi commented Sep 20, 2024

I updated this with the latest master, as I am doing on all PRs.

  • test this mode on HIP (what is the rdc equivalent?)

I had some LUMI shell running and I tried this (after also merging in #1007 with various AMD things)

There is a -fgpu-rdc option: compilation succeeds with it, but the issues come at link time.

Note that #802 is actually a 'shared object initialization failed' error

So the status is:

  • HELINL=L works ok for C++ and (with rdc) for CUDA
  • HELINL=L does not work for HIP yet

…=L) to cuda only as it does not apply to hip

The hip compilation of CPPProcess.cc now fails as follows:
ccache /opt/rocm-6.0.3/bin/hipcc  -I. -I../../src   -O2 --offload-arch=gfx90a -target x86_64-linux-gnu -DHIP_PLATFORM=amd -DHIP_FAST_MATH -I/opt/rocm-6.0.3/include/ -std=c++17 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS  -fPIC -c -x hip CPPProcess.cc -o CPPProcess_hip.o
lld: error: undefined hidden symbol: mg5amcGpu::linker_CD_FFV1_0(double const*, double const*, double const*, double const*, double, double*)
…ompilation on hip for HELINL=L

The hip link of check_hip.exe now fails with
ccache /opt/rocm-6.0.3/bin/hipcc -o check_hip.exe ./check_sa_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o  -L/opt/rocm-6.0.3/lib/ -lhiprand
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __hip_fatbin
…k_hip.exe link on hip for HELINL=L, the build succeeds but at runtime it fails

The execution fails with
./check_hip.exe -p 1 8 1
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558

In addition, the hip link of fcheck_hip.exe fails with
ftn --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib'  -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64
gfortran-13 -march=znver3 -D__CRAY_X86_TRENTO -D__CRAY_AMD_GFX90A -D__CRAYXT_COMPUTE_LINUX_TARGET -D__TARGET_LINUX__ -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath=$ORIGIN/../../lib -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64 -Wl,-Bdynamic -Wl,--as-needed,-lgfortran,-lquadmath,--no-as-needed -Wl,--as-needed,-lpthread,--no-as-needed -Wl,--disable-new-dtags
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: ../../lib/libmg5amc_gg_ttx_hip.so: undefined reference to `__hip_fatbin'
…ipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert

Also add -ggdb for debugging. At runtime this fails with the usual madgraph5#802.
It is now clear that this happens in gpuMemcpyToSymbol (line 558), and the error is precisely 'shared object initialization failed'.

./fcheck_hip.exe 1 32 1
...
WARNING! Instantiate device Bridge (nevt=32, gpublocks=1, gputhreads=32, gpublocks*gputhreads=32)
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558
fcheck_hip.exe: ./GpuRuntime.h:26: void assertGpu(hipError_t, const char *, int, bool): Assertion `code == gpuSuccess' failed.

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
0  0x14f947bff2e2 in ???
1  0x14f947bfe475 in ???
2  0x14f945f33dbf in ???
3  0x14f945f33d2b in ???
4  0x14f945f353e4 in ???
5  0x14f945f2bc69 in ???
6  0x14f945f2bcf1 in ???
7  0x14f947bcef96 in _Z9assertGpu10hipError_tPKcib
        at ./GpuRuntime.h:26
8  0x14f947bcef96 in _ZN9mg5amcGpu10CPPProcessC2Ebb
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:558
9  0x14f947bd2cf3 in _ZN9mg5amcGpu6BridgeIdEC2Ejjj
        at ./Bridge.h:268
10  0x14f947bd678e in fbridgecreate_
        at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/fbridge.cc:54
11  0x2168fd in ???
12  0x216bfe in ???
13  0x14f945f1e24c in ???
14  0x216249 in _start
        at ../sysdeps/x86_64/start.S:120
15  0xffffffffffffffff in ???
Aborted
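For reference, a minimal sketch of the gpuMemcpyToSymbol pattern at CPPProcess.cc:558 (illustrative symbol and function names; gpuMemcpyToSymbol is presumably the portability alias for cudaMemcpyToSymbol / hipMemcpyToSymbol):

```cpp
#include <hip/hip_runtime.h>
#include <cassert>

// Illustrative __constant__ symbol (not the actual CPPProcess.cc variable):
// independent parameters copied once from the host to device constant memory.
__constant__ double cIPD[2];

void copyIndependentParametersToDevice( const double* hostIPD )
{
  // On HIP this is where 'shared object initialization failed' (303,
  // hipErrorSharedObjectInitFailed) is reported when the device code of the
  // shared library containing the symbol was not initialized correctly.
  hipError_t code = hipMemcpyToSymbol( HIP_SYMBOL( cIPD ), hostIPD, 2 * sizeof( double ), 0, hipMemcpyHostToDevice );
  assert( code == hipSuccess );
}
```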
… hipcc to link fcheck_hip.exe

Revert "[helas] in gg_tt.mad cudacpp.mk, temporarely go back and try to use hipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert"
This reverts commit 988419b.

NOTE: I tried to use FC=hipcc and this also compiles the fortran ok!
Probably it internally uses flang from llvm madgraph5#804

The problem, however, is that there is no lowercase 'main' in fcheck_sa_fortran.o, only an uppercase 'MAIN_'.

Summary of the status: HELINL=L "rdc" is not supported on our AMD GPUs for now.
…y and support HELINL=L on AMD GPUs via HIP (still incomplete)
@valassi valassi changed the title (WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA) (WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) Sep 21, 2024