
Multi-node training #305

Open
LeoXinhaoLee opened this issue Aug 31, 2024 · 1 comment

Comments

@LeoXinhaoLee
Hi, thank you so much for releasing this great code base!

I noticed that your LAION blog says that the pre-training of OpenLM 1B/7B used 128 or 256 A100s, so I'm wondering whether the current code supports multi-node training. The current training command seems to use only 4 GPUs on 1 node.

Thank you very much!

@sedrick-keh-tri
Collaborator

Yes, OpenLM supports multi-node training. The standard torchrun multi-node setup should work fine. If you are using something like AWS SageMaker, we also have sample code here.
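For reference, a standard torchrun multi-node launch looks roughly like the sketch below. The torchrun flags are standard PyTorch; the training entrypoint (`open_lm/main.py`) and its arguments are assumptions here and should be checked against the repo's actual training command.

```shell
# Minimal sketch of a 2-node x 4-GPU torchrun launch.
# Run the same command on every node; MASTER_ADDR is the IP/hostname
# of the rendezvous node, reachable from all nodes.
# NOTE: "open_lm/main.py" and the trailing args are placeholders for
# the repo's actual training script and flags.
torchrun \
  --nnodes=2 \
  --nproc-per-node=4 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${MASTER_ADDR}:29500" \
  open_lm/main.py \
  # ... your usual single-node training arguments go here ...
```

With this setup, torchrun spawns one process per GPU on each node and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that the usual `torch.distributed` initialization reads.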
