Update multi-node.qmd #1688

shahdivax · 2024-06-07T06:57:51Z

Title: Distributed Finetuning For Multi-Node with Axolotl and Deepspeed

Description:
This PR introduces a comprehensive guide for setting up a distributed finetuning environment using Axolotl and Accelerate. The guide covers the following steps:

- Configuring SSH for passwordless access across multiple nodes
- Generating and exchanging public keys for secure communication
- Configuring Axolotl with shared settings and host files
- Configuring Accelerate for multi-node training with Deepspeed
- Running distributed finetuning using Accelerate
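
As a rough illustration of the host file step, here is a minimal sketch that prints a hostfile in the `<host> slots=<gpus>` format DeepSpeed reads. The helper name `make_hostfile`, the IP addresses, and the GPU count are placeholders assumed for this example; they are not taken from the guide.

```bash
# Hypothetical helper: print one "<host> slots=<gpus>" line per node.
# Arguments: GPUs per node, followed by one or more hostnames or IPs.
make_hostfile() {
  local slots="$1"
  shift
  local host
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots"
  done
}

# Placeholder addresses (assumed): two nodes with 8 GPUs each.
# Redirect the output to a file to use it as the hostfile.
make_hostfile 8 10.0.0.1 10.0.0.2
```

This assumes every node has the same number of GPUs; nodes with different counts would need their own `slots` value per line.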

Added Distributed Finetuning For Multi-Node with Axolotl and Deepspeed

winglian · 2024-06-07T22:09:48Z

@muellerzr seem right?

casper-hansen · 2024-06-11T11:30:38Z

This seems to assume that you have access to each node before training starts. However, many cloud systems such as AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes that you can modify these variables.

@shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.
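
One possible direction for a more automatic setup, sketched for SLURM only: read the node count and node index from the environment that SLURM sets for each task, and turn them into `accelerate launch` flags. The helper name `slurm_launch_args` and the fallback values are assumptions for illustration, and this is not something the PR implements. The main node address and port would still have to be supplied separately.

```bash
# Hypothetical helper: derive per-node launch flags from SLURM's environment.
# SLURM_NNODES and SLURM_NODEID are set by SLURM for tasks started inside a job;
# the fallbacks (1 node, rank 0) are assumptions for running outside SLURM.
slurm_launch_args() {
  local num_machines="${SLURM_NNODES:-1}"
  local machine_rank="${SLURM_NODEID:-0}"
  printf -- '--num_machines %s --machine_rank %s\n' "$num_machines" "$machine_rank"
}

# Print the flags for the current node.
slurm_launch_args
```

AzureML and SageMaker expose their own variables for the same information, so a sketch like this would need a separate branch for each platform.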

shahdivax · 2024-06-11T11:35:57Z

> This seems to assume that you have access to each node before training starts. However, many cloud systems such as AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes that you can modify these variables.
>
> @shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.

This assumes that users are running EC2 instances on AWS.

(I forgot to add that 😓)

Edit: Added this to the heading.

docs/multi-node.qmd

muellerzr · 2024-06-11T17:51:31Z

docs/multi-node.qmd

+On Node 1 (server), run the finetuning process using Accelerate:
+
+```bash
+accelerate launch -m axolotl.cli.train examples/llama-2/qlora.yml
+```
+
+This will start the finetuning process across all nodes. You can check the different IP addresses before each step to verify that the training is running on every node.


To my knowledge this is not the case. You need to run `accelerate launch -m` on every server, or else it will sit there and never actually start.

When we tested, we launched only on a single node (the server), and it was able to use the resources of the other nodes.
As proof, we could see the IPs of both machines on the left, and the total GPU count showed all the GPUs from all the nodes.

@muellerzr My guess is that this is specific to DeepSpeed, since the IP addresses are set in a hostfile. We should make it explicit that the command only needs to be run on the first node in that case. Most other setups, such as FSDP or plain multi-node DDP, will likely still need `accelerate launch` to be run on each node.

@winglian, @muellerzr That might be the case. DeepSpeed was a good option for us because we were doing multi-node finetuning on EC2, which provides public IPs, and with a hostfile it was really easy to connect both machines and run the finetuning from the root node only. That did connect all the other instances (using the resources of all the nodes from a single node).
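
For the non-DeepSpeed case raised in this thread (FSDP or plain multi-node DDP), the per-node launch might look like the sketch below. The helper `node_launch_cmd`, the IP, the port, and the GPU counts are placeholders assumed for illustration. The flags are standard `accelerate launch` options, but the exact set a given setup needs may differ.

```bash
# Hypothetical helper: print the `accelerate launch` command for one node.
# Arguments: machine rank, number of machines, total processes (GPUs across
# all nodes), main node IP, main node port, Axolotl config path.
node_launch_cmd() {
  local rank="$1" num_machines="$2" num_processes="$3"
  local main_ip="$4" main_port="$5" config="$6"
  printf 'accelerate launch --num_machines %s --num_processes %s --machine_rank %s --main_process_ip %s --main_process_port %s -m axolotl.cli.train %s\n' \
    "$num_machines" "$num_processes" "$rank" "$main_ip" "$main_port" "$config"
}

# Placeholder values (assumed): 2 nodes, 8 GPUs each, main node at 10.0.0.1.
node_launch_cmd 0 2 16 10.0.0.1 29500 examples/llama-2/qlora.yml  # command for node 0
node_launch_cmd 1 2 16 10.0.0.1 29500 examples/llama-2/qlora.yml  # command for node 1
```

Only `--machine_rank` changes between nodes; every other value must match, which is why the hostfile approach described above is simpler when DeepSpeed is in use.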

docs/multi-node.qmd

Co-authored-by: Wing Lian <[email protected]>

shahdivax

All the required changes are done, and I think this doc is now good to go.

docs/multi-node.qmd

Update multi-node.qmd

e9ab5d8

Added Distributed Finetuning For Multi-Node with Axolotl and Deepspeed

shahdivax mentioned this pull request Jun 7, 2024

Guide For Multi-Node Distributed Finetuning #1477

Closed

Update multi-node.qmd

81c91ab

muellerzr reviewed Jun 11, 2024


Update multi-node.qmd

93e71f0

winglian reviewed Jun 12, 2024


Update docs/multi-node.qmd

640bb6c

Co-authored-by: Wing Lian <[email protected]>

shahdivax commented Jul 5, 2024
