(WIP) HELINL=L (L for linker) helas mode: pre-compile templates into separate .o object files (using RDC for CUDA; still missing HIP) #978
base: master
Conversation
…FVs and for compiling them as separate object files (related to splitting kernels)
…d MemoryAccessMomenta.h
…the P subdirectory (depends on npar) - build succeeds for cpp, link fails for cuda:

```
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -Xcompiler -fPIC -c -x cu CPPProcess.cc -o CPPProcess_cuda.o
ptxas fatal   : Unresolved extern function '_ZN9mg5amcGpu14helas_VVV1P0_1EPKdS1_S1_dddPd'
```
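The ptxas failure above is expected when a `__device__` function called from CPPProcess.cc is defined in a different translation unit: CUDA then requires relocatable device code (RDC) at compile time and a device-link step at link time. A minimal sketch of the extra flags, assuming a simplified makefile (the rule and variable names here are illustrative, not the project's actual cudacpp.mk):

```makefile
# Illustrative sketch only: compile each unit with relocatable device code
# (-rdc=true) so that __device__ symbols defined in HelAmps_cuda.o can be
# resolved when CPPProcess_cuda.o is device-linked against it.
NVCC      = nvcc
ARCHFLAGS = -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70

CPPProcess_cuda.o: CPPProcess.cc
	$(NVCC) $(ARCHFLAGS) -rdc=true -Xcompiler -fPIC -c -x cu $< -o $@

HelAmps_cuda.o: HelAmps.cc
	$(NVCC) $(ARCHFLAGS) -rdc=true -Xcompiler -fPIC -c -x cu $< -o $@

# The final link must go through nvcc (or an explicit -dlink step) so that
# the cross-object device relocations are resolved.
libmg5amc_cuda.so: CPPProcess_cuda.o HelAmps_cuda.o
	$(NVCC) $(ARCHFLAGS) -shared $^ -o $@
```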
…cuda tests succeed. The build issues some warnings however:

```
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './CPPProcess_cuda.o'
nvlink warning : SM Arch ('sm_52') not found in './HelAmps_cuda.o'
```
…ption HELINL=L and '#ifdef MGONGPU_LINKER_HELAMPS'
…nd -inlLonly options
… to ease code generation
…y in the HELINL=L mode
…c++, a factor 3 slower for cuda...

```
./tput/teeThroughputX.sh -ggtt -makej -makeclean -inlLonly
diff -u --color tput/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt tput/logs_ggtt_mad/log_ggtt_mad_d_inlL_hrd0.txt
-Process = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTX_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
 Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.589473e+07 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 1.164485e+08 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 1.280951e+08 ) sec^-1
-MeanMatrixElemValue = ( 2.086689e+00 +- 3.413217e-03 ) GeV^0
-TOTAL : 0.528239 sec
-INFO: No Floating Point Exceptions have been reported
-     2,222,057,027 cycles        # 2.887 GHz
-     3,171,868,018 instructions  # 1.43 insn per cycle
-     0.826440817 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inl0_hrd0/check_cuda.exe -p 2048 256 1
-==PROF== Profiling "sigmaKin": launch__registers_per_thread 214
+EvtsPerSec[Rmb+ME] (23) = ( 2.667135e+07 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.116115e+07 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.251573e+07 ) sec^-1
+MeanMatrixElemValue = ( 2.086689e+00 +- 3.413217e-03 ) GeV^0
+TOTAL : 0.550450 sec
+INFO: No Floating Point Exceptions have been reported
+     2,272,219,097 cycles        # 2.889 GHz
+     3,361,475,195 instructions  # 1.48 insn per cycle
+     0.842685843 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/build.cuda_d_inlL_hrd0/check_cuda.exe -p 2048 256 1
+==PROF== Profiling "sigmaKin": launch__registers_per_thread 190
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
```
…lates in HELINL=L mode
…t.mad of HelAmps.h in HELINL=L mode
…t.mad of CPPProcess.cc in HELINL=L mode
…P* (the source is the same but it must be compiled in each P* separately)
… complete its backport
The functionality is in principle complete, including the backport to CODEGEN. I will now run some functionality and performance tests.
git add *.mad/*/HelAmps.cc *.mad/*/*/HelAmps.cc *.sa/*/HelAmps.cc *.sa/*/*/HelAmps.cc
…ild failed?

```
./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlL
ccache /usr/local/cuda-12.0/bin/nvcc -I. -I../../src -Xcompiler -O3 -gencode arch=compute_70,code=compute_70 -gencode arch=compute_70,code=sm_70 -lineinfo -use_fast_math -I/usr/local/cuda-12.0/include/ -DUSE_NVTX -std=c++17 -ccbin /usr/lib64/ccache/g++ -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_INLINE_HELAMPS -Xcompiler -fPIC -c -x cu CPPProcess.cc -o build.cuda_d_inl1_hrd0/CPPProcess_cuda.o
nvcc error   : 'ptxas' died due to signal 9 (Kill signal)
make[2]: *** [cudacpp.mk:754: build.cuda_d_inl1_hrd0/CPPProcess_cuda.o] Error 9
make[2]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make[1]: *** [makefile:142: build.cuda_d_inl1_hrd0/.cudacpplibs] Error 2
make[1]: Leaving directory '/data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg'
make: *** [makefile:282: bldcuda] Error 2
make: *** Waiting for unfinished jobs....
```
… build time is from cache: `./tput/teeThroughputX.sh -ggttggg -makej -makeclean`
…mode (use that from the previous run, not from cache): `./tput/teeThroughputX.sh -ggttggg -makej -makeclean`
…factor x2 faster (c++? cuda?), runtime is 5-10% slower in C++, but 5-10% faster in cuda!?

```
./tput/teeThroughputX.sh -ggttggg -makej -makeclean -inlLonly
diff -u --color tput/logs_ggttggg_mad/log_ggttggg_mad_d_inlL_hrd0.txt tput/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
...
On itscrd90.cern.ch [CPU: Intel(R) Xeon(R) Silver 4216 CPU] [GPU: 1x Tesla V100S-PCIE-32GB]:
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CUDA [nvcc 12.0.140 (gcc 11.3.1)] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary = CUD:DBL+THX:CURDEV+RMBDEV+MESDEV/none+NAVBRK
 FP precision = DOUBLE (NaN/abnormal=0, zero=0)
-EvtsPerSec[Rmb+ME] (23) = ( 4.338149e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 4.338604e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 4.338867e+02 ) sec^-1
-MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 2.242693 sec
-INFO: No Floating Point Exceptions have been reported
-     7,348,976,543 cycles         # 2.902 GHz
-    16,466,315,526 instructions   # 2.24 insn per cycle
-     2.591057214 seconds time elapsed
-runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inlL_hrd0/check_cuda.exe -p 1 256 1
+EvtsPerSec[Rmb+ME] (23) = ( 4.063038e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 4.063437e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 4.063626e+02 ) sec^-1
+MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
+TOTAL : 2.552546 sec
+INFO: No Floating Point Exceptions have been reported
+     7,969,059,552 cycles         # 2.893 GHz
+    17,401,037,642 instructions   # 2.18 insn per cycle
+     2.954791685 seconds time elapsed
+runNcu /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.cuda_d_inl0_hrd0/check_cuda.exe -p 1 256 1
 ==PROF== Profiling "sigmaKin": launch__registers_per_thread 255
 ==PROF== Profiling "sigmaKin": sm__sass_average_branch_targets_threads_uniform.pct 100%
...
=========================================================================
-runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inlL_hrd0/check_cpp.exe -p 1 256 2 OMP=
+runExe /data/avalassi/GPU2023/madgraph4gpuX/epochX/cudacpp/gg_ttggg.mad/SubProcesses/P1_gg_ttxggg/build.512y_d_inl0_hrd0/check_cpp.exe -p 1 256 2 OMP=
 INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
-Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=L] [hardcodePARAM=0]
+Process = SIGMA_SM_GG_TTXGGG_CPP [gcc 11.3.1] [inlineHel=0] [hardcodePARAM=0]
 Workflow summary = CPP:DBL+CXS:CURHST+RMBHST+MESHST/512y+CXVBRK
 FP precision = DOUBLE (NaN/abnormal=0, zero=0)
 Internal loops fptype_sv = VECTOR[4] ('512y': AVX512, 256bit) [cxtype_ref=YES]
-EvtsPerSec[Rmb+ME] (23) = ( 3.459662e+02 ) sec^-1
-EvtsPerSec[MatrixElems] (3) = ( 3.460086e+02 ) sec^-1
-EvtsPerSec[MECalcOnly] (3a) = ( 3.460086e+02 ) sec^-1
+EvtsPerSec[Rmb+ME] (23) = ( 3.835352e+02 ) sec^-1
+EvtsPerSec[MatrixElems] (3) = ( 3.836003e+02 ) sec^-1
+EvtsPerSec[MECalcOnly] (3a) = ( 3.836003e+02 ) sec^-1
 MeanMatrixElemValue = ( 1.187066e-05 +- 9.825549e-06 ) GeV^-6
-TOTAL : 1.528240 sec
+TOTAL : 1.378567 sec
 INFO: No Floating Point Exceptions have been reported
-     4,140,408,789 cycles         # 2.703 GHz
-     9,072,597,595 instructions   # 2.19 insn per cycle
-     1.532357792 seconds time elapsed
-=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:94048) (512y: 91) (512z: 0)
+     3,738,350,469 cycles         # 2.705 GHz
+     8,514,195,736 instructions   # 2.28 insn per cycle
+     1.382567882 seconds time elapsed
+=Symbols in CPPProcess_cpp.o= (~sse4: 0) (avx2:80619) (512y: 89) (512z: 0)
-------------------------------------------------------------------------
```
… (commented out) for the memory corruption madgraph5#806. This shows an uninitialised value deep inside hiprand:

```
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind ./check_hip.exe -p 1 8 1
==105499== Memcheck, a memory error detector
==105499== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==105499== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==105499== Command: ./check_hip.exe -p 1 8 1
==105499==
==105499== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
Get random numbers from Hiprand
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x1253777C: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==
==105499== Conditional jump or move depends on uninitialised value(s)
==105499==    at 0x12537B82: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12537F40: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x12540782: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x125629DD: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==105499==    by 0x4B825EB: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B88342: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B822FF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B55120: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4B2B590: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D84AF: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x49D87C4: ??? (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
==105499==    by 0x4A00FA2: hipMemcpy (in /opt/rocm-6.0.3/lib/libamdhip64.so.6.0.60003)
Got random numbers from Hiprand
==105499== Invalid read of size 8
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499== Address 0x1c00000043 is not stack'd, malloc'd or (recently) free'd
==105499==
==105499==
==105499== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==105499==  Access not within mapped region at address 0x1C00000043
==105499==    at 0x21F741: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x21D0D1: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499==    by 0x215CBB: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==105499== If you believe this happened as a result of a stack
==105499== overflow in your program's main thread (unlikely but
==105499== possible), you can try to increase the size of the
==105499== main thread stack using the --main-stacksize= flag.
==105499== The main thread stack size used in this run was 16777216.
```

Unfortunately however also --common crashes (and gives the same uninitialised problem, whether related or not).
…ad of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert. This makes the valgrind 'conditional jump on uninitialised variable' disappear, but the crash from invalid memory reads still remains:

```
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==10800== Memcheck, a memory error detector
==10800== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==10800== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==10800== Command: ./check_hip.exe --common -p 1 8 1
==10800==
==10800== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
==10800== Invalid read of size 8
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800== Address 0x140000003b is not stack'd, malloc'd or (recently) free'd
==10800==
==10800==
==10800== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==10800==  Access not within mapped region at address 0x140000003B
==10800==    at 0x21EF01: std::map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, float, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, float> > >::operator[](std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x21CA21: mgOnGpu::TimerMap::start(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800==    by 0x2158A5: main (in /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe)
==10800== If you believe this happened as a result of a stack
==10800== overflow in your program's main thread (unlikely but
==10800== possible), you can try to increase the size of the
==10800== main thread stack using the --main-stacksize= flag.
==10800== The main thread stack size used in this run was 16777216.
==10800==
==10800== HEAP SUMMARY:
==10800==     in use at exit: 4,784,824 bytes in 17,735 blocks
==10800==   total heap usage: 306,364 allocs, 288,629 frees, 180,986,538 bytes allocated
==10800==
==10800== LEAK SUMMARY:
==10800==    definitely lost: 256 bytes in 5 blocks
==10800==    indirectly lost: 3,522 bytes in 64 blocks
==10800==      possibly lost: 9,544 bytes in 80 blocks
==10800==    still reachable: 4,771,502 bytes in 17,586 blocks
==10800==                       of which reachable via heuristic:
==10800==                         multipleinheritance: 384 bytes in 4 blocks
==10800==         suppressed: 0 bytes in 0 blocks
==10800== Rerun with --leak-check=full to see details of leaked memory
==10800==
==10800== For lists of detected and suppressed errors, rerun with: -s
==10800== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Segmentation fault
```
…madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'.

Using valgrind:

```
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > valgrind --track-origins=yes ./check_hip.exe --common -p 1 8 1
==80385== Memcheck, a memory error detector
==80385== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==80385== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==80385== Command: ./check_hip.exe --common -p 1 8 1
==80385==
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() exit
==80385== Warning: set address range perms: large range [0x59c90000, 0x159e91000) (noaccess)
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '00 GpuInit'
DEBUG: TimerMap::stop() exit
INFO: The following Floating Point Exceptions will cause SIGFPE program aborts: FE_DIVBYZERO, FE_INVALID, FE_OVERFLOW
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
Memory access fault by GPU node-4 (Agent handle: 0x1417d4a0) on address 0xfffd862e5000. Reason: Unknown.
==80385==
==80385== Process terminating with default action of signal 6 (SIGABRT): dumping core
==80385==    at 0x63D3D2B: raise (in /lib64/libc-2.31.so)
==80385==    by 0x63D53E4: abort (in /lib64/libc-2.31.so)
==80385==    by 0x12580D1B: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1257ABC8: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x1252C9E6: ??? (in /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1.12.60003)
==80385==    by 0x127C66E9: start_thread (in /lib64/libpthread-2.31.so)
==80385==    by 0x64A150E: clone (in /lib64/libc-2.31.so)
==80385==
==80385== HEAP SUMMARY:
==80385==     in use at exit: 4,790,652 bytes in 17,774 blocks
==80385==   total heap usage: 306,424 allocs, 288,650 frees, 180,987,695 bytes allocated
==80385==
==80385== LEAK SUMMARY:
==80385==    definitely lost: 184 bytes in 4 blocks
==80385==    indirectly lost: 2,658 bytes in 52 blocks
==80385==      possibly lost: 10,768 bytes in 86 blocks
==80385==    still reachable: 4,777,042 bytes in 17,632 blocks
==80385==                       of which reachable via heuristic:
==80385==                         multipleinheritance: 496 bytes in 5 blocks
==80385==         suppressed: 0 bytes in 0 blocks
==80385== Rerun with --leak-check=full to see details of leaked memory
==80385==
==80385== For lists of detected and suppressed errors, rerun with: -s
==80385== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Aborted
```

Using rocgdb:

```
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) run
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 94651)
New Thread 0x1555470b7700 (LWP 94652)
Thread 0x1554445ff700 (LWP 94651) exited
Warning: precise memory violation signal reporting is not enabled, reported
location may not be accurate.  See "show amdgpu precise-memory".

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) where
#0  0x0000155547130598 in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
(gdb) l
1	../sysdeps/x86_64/crtn.S: No such file or directory.
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 99032)
New Thread 0x1555470b7700 (LWP 99033)
Thread 0x1554445ff700 (LWP 99032) exited

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
...
(gdb) info threads
  Id   Target Id                                     Frame
  1    Thread 0x1555471dda80 (LWP 98983) "check_hip.exe" 0x0000155547603d57 in ?? () from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 99017) "check_hip.exe" 0x00001555538f64a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x1555470b7700 (LWP 99033) "check_hip.exe" 0x000015554759fd04 in sem_post@@GLIBC_2.2.5 () from /lib64/libpthread.so.0
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe" 0x000015554713050c in mg5amcGpu::sigmaKin(double const*, double const*, double const*, double const*, double*, unsigned int const*, double*, double*, int*, int*) () from file:///pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/../../lib/libmg5amc_gux_ttxux_hip.so#offset=57344&size=114640
```
… in vxxxxx (which may explain why this only appears in gqttq?)

```
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > rocgdb --args ./check_hip.exe -p 1 8 1
GNU gdb (rocm-rel-6.0-131) 13.2
...
(gdb) set amdgpu precise-memory
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /pfs/lustrep3/scratch/project_465001114/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux/check_hip.exe -p 1 8 1
...
DEBUG: TimerMap::stop() enter
DEBUG: TimerMap::stop() retrieve '0e SGoodHel'
DEBUG: TimerMap::stop() exit
New Thread 0x1554445ff700 (LWP 1669)
New Thread 0x155547087700 (LWP 1670)
Thread 0x1554445ff700 (LWP 1669) exited

Thread 6 "check_hip.exe" received signal SIGSEGV, Segmentation fault.
[Switching to thread 6, lane 0 (AMDGPU Lane 1:2:1:1/0 (0,0,0)[0,0,0])]
mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
328	      vxxxxx<M_ACCESS, W_ACCESS>( momenta, 0., cHel[ihel][0], -1, w_fp[0], 0 );
(gdb) where
#0  mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
#1  mg5amcGpu::sigmaKin (allmomenta=<optimized out>, allcouplings=<optimized out>, allrndhel=<optimized out>, allrndcol=<optimized out>, allMEs=<optimized out>, allChannelIds=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>, allselhel=<optimized out>, allselcol=<optimized out>) at CPPProcess.cc:1043
(gdb) info threads
  Id   Target Id                                     Frame
  1    Thread 0x1555471aea80 (LWP 1645) "check_hip.exe" 0x00001555475d5d57 in ?? () from /opt/rocm-6.0.3/lib/libhsa-runtime64.so.1
  2    Thread 0x1555469ff700 (LWP 1655) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
  5    Thread 0x155547087700 (LWP 1670) "check_hip.exe" 0x00001555538c84a7 in ioctl () from /lib64/libc.so.6
* 6    AMDGPU Wave 1:2:1:1 (0,0,0)/0 "check_hip.exe" mg5amcGpu::calculate_wavefunctions (ihel=<optimized out>, allmomenta=<optimized out>, allcouplings=<optimized out>, allMEs=<optimized out>, channelId=<optimized out>, allNumerators=<optimized out>, allDenominators=<optimized out>, jamp2_sv=<optimized out>) at CPPProcess.cc:328
```
…d for debugging the crash madgraph5#806 in hipcc

Revert "[amd] in gq_ttq.mad cudacpp.mk, enable -ggdb... the issue seems to be in vxxxxx (which may explain why this only appears in gqttq?)"
This reverts commit 5cc62a6.

Revert "[amd] in gq_ttq.mad timermap.h, add some debug printouts for the crash madgraph5#806 - now valgrind gives no invalid read, but there is a 'Memory access fault'"
This reverts commit 5b8d92f.

Revert "[amd] in gq_ttq.mad MemoryBuffers.h, temporarely use c++ malloc instead of HIP pinned host malloc to debug madgraph5#806 - still crashes, will revert"
This reverts commit 007173a.

Revert "[amd] in gq_ttq.mad HiprandRandomNumberKernel.cc, add debug printouts (commented out) for the memory corruption madgraph5#806"
This reverts commit c7b3dc0.
…adgraph5#806 for HIPCC by disabling hipcc optimizations (use -O0 instead of -O3). The test now succeeds: `./check_hip.exe -p 1 8 1`
…adgraph5#806 for HIPCC by disabling hipcc -O3, but keep -O2 (better than -O0). The test still succeeds: `./check_hip.exe -p 1 8 1`
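The workaround lowers the hipcc optimization level only for the HIP backend, leaving the C++ and CUDA builds at -O3. A hedged sketch of how this could look in a makefile (rule and variable names are illustrative, not the actual cudacpp.mk):

```makefile
# Illustrative sketch: keep -O3 for C++ and CUDA builds, but compile the HIP
# backend with -O2 as a workaround for the gq_ttq runtime crash (madgraph5#806).
OPTFLAGS    = -O3
HIPOPTFLAGS = -O2   # -O0 also avoids the crash, but -O2 performs much better

CPPProcess_hip.o: CPPProcess.cc
	$(HIPCC) $(HIPOPTFLAGS) -fPIC -c -x hip $< -o $@
```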
…ead of -O3 (workaround for gq_ttq crash madgraph5#806)
…) - now they all succeed! gqttq crash madgraph5#806 has disappeared. (Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg.)

```
STARTED AT Thu 19 Sep 2024 06:24:53 PM EEST
./tput/teeThroughputX.sh -mix -hrd -makej -eemumu -ggtt -ggttg -ggttgg -gqttq -ggttggg -makeclean -nocuda
ENDED(1) AT Thu 19 Sep 2024 07:15:36 PM EEST [Status=0]
./tput/teeThroughputX.sh -flt -hrd -makej -eemumu -ggtt -ggttgg -inlonly -makeclean -nocuda
ENDED(2) AT Thu 19 Sep 2024 07:32:30 PM EEST [Status=0]
./tput/teeThroughputX.sh -makej -eemumu -ggtt -ggttg -gqttq -ggttgg -ggttggg -flt -bridge -makeclean -nocuda
ENDED(3) AT Thu 19 Sep 2024 07:41:44 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -rmbhst -nocuda
ENDED(4) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
SKIP './tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda'
ENDED(5) AT Thu 19 Sep 2024 07:43:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -eemumu -ggtt -ggttgg -flt -common -nocuda
ENDED(6) AT Thu 19 Sep 2024 07:45:46 PM EEST [Status=0]
./tput/teeThroughputX.sh -mix -hrd -makej -susyggtt -susyggt1t1 -smeftggtttt -heftggbb -makeclean -nocuda
ENDED(7) AT Thu 19 Sep 2024 08:17:24 PM EEST [Status=0]
No errors found in logs
```
…ds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933). (Note: performance on HIP does not seem to be significantly degraded with -O2 with respect to -O3, e.g. on ggttgg.)

```
STARTED AT Thu 19 Sep 2024 11:37:44 PM EEST
(SM tests)
ENDED(1) AT Fri 20 Sep 2024 02:00:00 AM EEST [Status=0]
(BSM tests)
ENDED(1) AT Fri 20 Sep 2024 02:08:55 AM EEST [Status=0]
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_eemumu_mad/log_eemumu_mad_m_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_d_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_f_inl0_hrd0.txt
12 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttggg_mad/log_ggttggg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttgg_mad/log_ggttgg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggttg_mad/log_ggttg_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_ggtt_mad/log_ggtt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_gqttq_mad/log_gqttq_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_d_inl0_hrd0.txt
1 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_heftggbb_mad/log_heftggbb_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_smeftggtttt_mad/log_smeftggtttt_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggt1t1_mad/log_susyggt1t1_mad_m_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_d_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_f_inl0_hrd0.txt
16 /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/tmad/logs_susyggtt_mad/log_susyggtt_mad_m_inl0_hrd0.txt
```
Revert "[amd] rerun 30 tmad tests on LUMI against AMD GPUs - now gqttq succeeds (madgraph5#806 fixed), all as expected (heft fail madgraph5#833, skip ggttggg madgraph5#933)"
This reverts commit 0d7d4cd.
Revert "[amd] rerun 96 tput builds and tests on LUMI worker node (small-g 72h) - now they all succeed! gqttq crash madgraph5#806 has disappeared"
This reverts commit e41c7ff.
…he getCompiler() function. This gives for instance:
[valassia@nid005067 bash] ~/GPU2024/madgraph4gpu/epochX/cudacpp/gq_ttq.mad/SubProcesses/P1_gux_ttxux > ./check_hip.exe -p 1 8 1
Process = SIGMA_SM_GUX_TTXUX_HIP [hipcc 6.0.32831 (clang 17.0.0)] [inlineHel=0] [hardcodePARAM=0]
(Checked that all is ok when regenerating gq_ttq.mad/SubProcesses/P1_gux_ttxux)
git checkout upstream/master tput/logs_* tmad/logs_*
Fix conflicts (essentially, add -inlL and -inlLonly options to upstream/master scripts): - epochX/cudacpp/tmad/madX.sh - epochX/cudacpp/tmad/teeMadX.sh - epochX/cudacpp/tput/allTees.sh - epochX/cudacpp/tput/teeThroughputX.sh - epochX/cudacpp/tput/throughputX.sh
I updated this with the latest master, as I am doing on all PRs.
I had a LUMI shell running and I tried this (after also merging in #1007 with various AMD things). There is a
Note that #802 is actually a 'shared object initialization failed' error. So the status is
…=L) to cuda only as it does not apply to hip. The hip compilation of CPPProcess.cc now fails as
ccache /opt/rocm-6.0.3/bin/hipcc -I. -I../../src -O2 --offload-arch=gfx90a -target x86_64-linux-gnu -DHIP_PLATFORM=amd -DHIP_FAST_MATH -I/opt/rocm-6.0.3/include/ -std=c++17 -DMGONGPU_FPTYPE_DOUBLE -DMGONGPU_FPTYPE2_DOUBLE -DMGONGPU_LINKER_HELAMPS -fPIC -c -x hip CPPProcess.cc -o CPPProcess_hip.o
lld: error: undefined hidden symbol: mg5amcGpu::linker_CD_FFV1_0(double const*, double const*, double const*, double const*, double, double*)
…ompilation on hip for HELINL=L. The hip link of check_hip.exe now fails with
ccache /opt/rocm-6.0.3/bin/hipcc -o check_hip.exe ./check_sa_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib' -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o ./CurandRandomNumberKernel_hip.o ./HiprandRandomNumberKernel_hip.o -L/opt/rocm-6.0.3/lib/ -lhiprand
ld.lld: error: undefined reference due to --no-allow-shlib-undefined: __hip_fatbin
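The missing __hip_fatbin suggests the device code was never device-linked into the shared library. For comparison, this is roughly how the CUDA side handles it with RDC, and the hipcc flags that would be the analogue (a sketch only, with hypothetical rule and variable names; the real cudacpp.mk rules differ):

```makefile
# Sketch (assumption: simplified from cudacpp.mk).
# CUDA: compile with relocatable device code, then device-link the objects so
# that cross-object device symbols (the linker_* helas wrappers) are resolved.
CPPProcess_cuda.o: CPPProcess.cc
	$(NVCC) -rdc=true -c -x cu $< -o $@
mg5amc_cuda_dlink.o: CPPProcess_cuda.o HelAmps_cuda.o
	$(NVCC) -dlink $^ -o $@
# HIP analogue (untested here): compile with -fgpu-rdc and perform the final
# link with hipcc -fgpu-rdc --hip-link, which is the step that embeds the
# device fat binary (__hip_fatbin) into the output.
```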
…k_hip.exe link on hip for HELINL=L, the build succeeds but at runtime it fails. The execution fails with
./check_hip.exe -p 1 8 1
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558
In addition, the hip link of fcheck_hip.exe fails with
ftn --cray-bypass-pkgconfig -craype-verbose -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath='$ORIGIN/../../lib' -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64
gfortran-13 -march=znver3 -D__CRAY_X86_TRENTO -D__CRAY_AMD_GFX90A -D__CRAYXT_COMPUTE_LINUX_TARGET -D__TARGET_LINUX__ -ffixed-line-length-132 -o fcheck_hip.exe ./fcheck_sa_fortran.o ./fsampler_hip.o -L../../lib -lmg5amc_common_hip -Xlinker -rpath=$ORIGIN/../../lib -lgfortran -L../../lib -lmg5amc_gg_ttx_hip ./CommonRandomNumberKernel_hip.o ./RamboSamplingKernels_hip.o -lstdc++ -L/opt/rocm-6.0.3/lib -lamdhip64 -Wl,-Bdynamic -Wl,--as-needed,-lgfortran,-lquadmath,--no-as-needed -Wl,--as-needed,-lpthread,--no-as-needed -Wl,--disable-new-dtags
/usr/lib64/gcc/x86_64-suse-linux/13/../../../../x86_64-suse-linux/bin/ld: ../../lib/libmg5amc_gg_ttx_hip.so: undefined reference to `__hip_fatbin'
…ipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert. Also add -ggdb for debugging.
At runtime this fails with the usual madgraph5#802. It is now clear that this is in gpuMemcpyToSymbol (line 558), and the error is precisely 'shared object initialization failed'.
./fcheck_hip.exe 1 32 1
...
WARNING! Instantiate device Bridge (nevt=32, gpublocks=1, gputhreads=32, gpublocks*gputhreads=32)
ERROR! assertGpu: 'shared object initialization failed' (303) in CPPProcess.cc:558
fcheck_hip.exe: ./GpuRuntime.h:26: void assertGpu(hipError_t, const char *, int, bool): Assertion `code == gpuSuccess' failed.
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
 0 0x14f947bff2e2 in ???
 1 0x14f947bfe475 in ???
 2 0x14f945f33dbf in ???
 3 0x14f945f33d2b in ???
 4 0x14f945f353e4 in ???
 5 0x14f945f2bc69 in ???
 6 0x14f945f2bcf1 in ???
 7 0x14f947bcef96 in _Z9assertGpu10hipError_tPKcib at ./GpuRuntime.h:26
 8 0x14f947bcef96 in _ZN9mg5amcGpu10CPPProcessC2Ebb at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/CPPProcess.cc:558
 9 0x14f947bd2cf3 in _ZN9mg5amcGpu6BridgeIdEC2Ejjj at ./Bridge.h:268
10 0x14f947bd678e in fbridgecreate_ at /users/valassia/GPU2024/madgraph4gpu/epochX/cudacpp/gg_tt.mad/SubProcesses/P1_gg_ttx/fbridge.cc:54
11 0x2168fd in ???
12 0x216bfe in ???
13 0x14f945f1e24c in ???
14 0x216249 in _start at ../sysdeps/x86_64/start.S:120
15 0xffffffffffffffff in ???
Aborted
… hipcc to link fcheck_hip.exe
Revert "[helas] in gg_tt.mad cudacpp.mk, temporarely go back and try to use hipcc instead of gfortran to link fcheck_hip.exe: this links but it fails at runtime, will revert"
This reverts commit 988419b.
NOTE: I tried to use FC=hipcc and this also compiles the fortran ok! Probably it internally uses flang from llvm madgraph5#804. The problem however is that there is no lowercase 'main' in fcheck_sa_fortran.o, only an uppercase 'MAIN_'.
Summary of the status: HELINL=L "rdc" is not supported on our AMD GPUs for now.
…y and support HELINL=L on AMD GPUs via HIP (still incomplete)
WIP on removing template/inline from helas (related to splitting kernels)