
Multi-node training #305

Open
LeoXinhaoLee opened this issue Aug 31, 2024 · 1 comment

Comments

@LeoXinhaoLee
Hi, thank you so much for releasing this great code base!

I noticed that your LAION blog says that the pre-training of OpenLM 1B/7B used 128 or 256 A100s, so I'm wondering whether the current code supports multi-node training. The current training command seems to use only 4 GPUs on 1 node.

Thank you very much!

@sedrick-keh-tri
Collaborator

Yes, OpenLM supports multi-node training. The standard torchrun multi-node setup should work fine. If you are using something like AWS SageMaker, we also have sample code here.
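For reference, a standard torchrun multi-node launch looks roughly like the sketch below. The torchrun flags are standard PyTorch; the training entrypoint (`open_lm/main.py`) and its arguments are assumptions here and should be checked against the repo's actual training command.

```shell
# Minimal sketch of a 2-node x 4-GPU torchrun launch.
# Run the same command on every node; MASTER_ADDR is the IP/hostname
# of the rendezvous node, reachable from all nodes.
# NOTE: "open_lm/main.py" and the trailing args are placeholders for
# the repo's actual training script and flags.
torchrun \
  --nnodes=2 \
  --nproc-per-node=4 \
  --rdzv-backend=c10d \
  --rdzv-endpoint="${MASTER_ADDR}:29500" \
  open_lm/main.py \
  # ... your usual single-node training arguments go here ...
```

With this setup, torchrun spawns one process per GPU on each node and sets the `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` environment variables that the usual `torch.distributed` initialization reads.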
