[Bug] Is there a known bug with Driver Version: 535.129.03 which causes MscclppAllReduce3 to hang? #260

Open
saeedmaleki opened this issue Feb 6, 2024 · 5 comments

Comments

@saeedmaleki
Contributor

Hi MSCCL++ team,

Do you know whether driver version 535.129.03 has a bug that causes AllReduce3 to time out?

Thanks,
--Saeed

@Binyang2014
Contributor

Hmm... we haven't tested with this version. The Azure HPC image uses driver 535.86.10 and doesn't have this issue.
https://github.com/Azure/azhpc-images/blob/63e5eaa23de69ccc1c6e6a52dff29037c88e96d4/ubuntu/common/install_nvidiagpudriver.sh#L16-L19

@saeedmaleki
Contributor Author

Thanks @Binyang2014! I'm debugging this issue with NVIDIA.

@chhwang
Contributor

chhwang commented Mar 26, 2024

Hi @saeedmaleki, has this issue been resolved on your end? 535.154.05 is working well in my environment.

@saeedmaleki
Contributor Author

It definitely still happens. I think this is a non-deterministic bug. NVIDIA couldn't reproduce it either, so maybe we can ignore it for now.

@chhwang
Contributor

chhwang commented Apr 6, 2024

Actually, I can occasionally reproduce this bug. @Binyang2014 @aashaka please be aware.
