[Bug] libmscclpp_nccl fails linking using ROCm 6.0 #349

corey-derochie-amd opened this issue Sep 10, 2024 · 1 comment

While commit 72b99a4 allows libmscclpp to compile using ROCm 6.0, there are still linker errors in libmscclpp_nccl:

ld.lld: error: duplicate symbol: __float2bfloat16(float)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float2bfloat16(float))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x0)

ld.lld: error: duplicate symbol: __bfloat1622float2(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__bfloat1622float2(__hip_bfloat162))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x40)

ld.lld: error: duplicate symbol: __double2bfloat16(double)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__double2bfloat16(double))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x60)

ld.lld: error: duplicate symbol: __float22bfloat162_rn(HIP_vector_type<float, 2u>)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float22bfloat162_rn(HIP_vector_type<float, 2u>))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0xA0)

ld.lld: error: duplicate symbol: __high2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__high2float(__hip_bfloat162))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x120)

ld.lld: error: duplicate symbol: __low2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__low2float(__hip_bfloat162))
>>> defined at allreduce.hpp
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/allreduce.hpp.o:(.text+0x130)

ld.lld: error: duplicate symbol: __float2bfloat16(float)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float2bfloat16(float))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x0)

ld.lld: error: duplicate symbol: __bfloat1622float2(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__bfloat1622float2(__hip_bfloat162))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x40)

ld.lld: error: duplicate symbol: __double2bfloat16(double)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__double2bfloat16(double))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x60)

ld.lld: error: duplicate symbol: __float22bfloat162_rn(HIP_vector_type<float, 2u>)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__float22bfloat162_rn(HIP_vector_type<float, 2u>))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0xA0)

ld.lld: error: duplicate symbol: __high2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__high2float(__hip_bfloat162))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x120)

ld.lld: error: duplicate symbol: __low2float(__hip_bfloat162)
>>> defined at executor.cc
>>>            ../../CMakeFiles/mscclpp_obj.dir/src/executor/executor.cc.o:(__low2float(__hip_bfloat162))
>>> defined at nccl.cu
>>>            CMakeFiles/mscclpp_nccl_obj.dir/src/nccl.cu.o:(.text+0x130)
clang++: error: linker command failed with exit code 1 (use -v to see invocation)
gmake[5]: *** [apps/nccl/CMakeFiles/mscclpp_nccl.dir/build.make:145: apps/nccl/libmscclpp_nccl.so.0.5.2] Error 1
gmake[4]: *** [CMakeFiles/Makefile2:379: apps/nccl/CMakeFiles/mscclpp_nccl.dir/all] Error 2
gmake[4]: *** Waiting for unfinished jobs....
[100%] Built target check-format-cpp
gmake[3]: *** [Makefile:139: all] Error 2
gmake[2]: *** [CMakeFiles/mscclpp_nccl-download.dir/build.make:86: mscclpp_nccl-download-prefix/src/mscclpp_nccl-download-stamp/mscclpp_nccl-download-build] Error 2
gmake[1]: *** [CMakeFiles/Makefile2:83: CMakeFiles/mscclpp_nccl-download.dir/all] Error 2
gmake: *** [Makefile:91: all] Error 2

This does not appear to be an issue with later versions of ROCm.

chhwang commented Sep 17, 2024

Hi @corey-derochie-amd, the team has investigated this before, and it is very tricky to tackle from the mscclpp side. Instead, we use this ROCm patch for include/hip/amd_detail/amd_hip_bf16.h to avoid the issue on ROCm 6.0.

97c97
< #define __HOST_DEVICE__ __device__
---
> #define __HOST_DEVICE__ __device__ static
100c100
< #define __HOST_DEVICE__ __host__ __device__
---
> #define __HOST_DEVICE__ __host__ __device__ static inline

This change is already adopted in ROCm 6.1.
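For anyone wondering why adding `static` (and `inline`) to the macro makes the duplicate symbols go away, here is a minimal host-only sketch of the idea. It does not use HIP, and the macro name `HOST_DEVICE_SKETCH` and both helper functions are hypothetical stand-ins, not the real contents of amd_hip_bf16.h. The conversion simply truncates the low 16 bits, which is an assumption made to keep the example short and is not necessarily how the ROCm header rounds.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the patched __HOST_DEVICE__ macro.
// Without `static`, a non-inline function defined in a header gets external
// linkage, so every translation unit that includes the header emits the same
// global symbol and the linker reports "duplicate symbol".
// With `static`, each translation unit gets its own private copy instead.
#define HOST_DEVICE_SKETCH static inline

// Hypothetical header-defined helper: keep the upper 16 bits of a float.
// (Truncation only; chosen for brevity, not taken from the ROCm header.)
HOST_DEVICE_SKETCH std::uint16_t float_to_bf16_bits_truncate(float f) {
  std::uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));  // reinterpret the float's bit pattern
  return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical inverse helper: widen bfloat16 bits back to a float.
HOST_DEVICE_SKETCH float bf16_bits_to_float(std::uint16_t h) {
  std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```

If these definitions live in a header included by two source files (as executor.cc and nccl.cu both pull in the bf16 header in the log above), both objects can be linked into one library without a clash, because neither helper is exported as a global symbol.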
