Update multi-node.qmd #1688

Open
wants to merge 4 commits into base: main

Conversation

shahdivax

Title: Distributed Finetuning For Multi-Node with Axolotl and Deepspeed

Description:
This PR introduces a comprehensive guide for setting up a distributed finetuning environment using Axolotl and Accelerate. The guide covers the following steps:

  1. Configuring SSH for passwordless access across multiple nodes
  2. Generating and exchanging public keys for secure communication
  3. Configuring Axolotl with shared settings and host files
  4. Configuring Accelerate for multi-node training with DeepSpeed
  5. Running distributed finetuning using Accelerate
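
For step 3, a minimal sketch of generating a DeepSpeed-style hostfile, where each line has the form `<host> slots=<gpu count>`. The IP addresses, the GPU count, and the helper name `write_hostfile` are assumptions made for illustration; they do not come from the guide in this PR.

```bash
# Hypothetical node addresses and GPU count; replace with your own values.
NODES="10.0.0.1 10.0.0.2"
GPUS_PER_NODE=8

# Write one "<host> slots=<gpus>" line per node to the given path.
write_hostfile() {
  : > "$1"                                   # create or truncate the file
  for host in $NODES; do
    printf '%s slots=%s\n' "$host" "$GPUS_PER_NODE" >> "$1"
  done
}

HOSTFILE="$(mktemp)"                         # temp path, so the sketch is safe to run
write_hostfile "$HOSTFILE"
cat "$HOSTFILE"
```

Every node listed in the file must be reachable over passwordless SSH from the node that starts the run, which is what steps 1 and 2 set up.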

Added Distributed Finetuning For Multi-Node with Axolotl and Deepspeed
@winglian
Collaborator

winglian commented Jun 7, 2024

@muellerzr seem right?

@casper-hansen
Collaborator

This seems to assume that you have access to each node before your training starts. However, many managed platforms such as AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes that you can modify these variables.

@shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.

@shahdivax
Author

shahdivax commented Jun 11, 2024

> This seems to assume that you have access to each node before your training starts. However, many managed platforms such as AzureML, SLURM, and SageMaker do not let you follow guides like this, because the guide assumes that you can modify these variables.
>
> @shahdivax @winglian I would suggest a bit more of an automatic setup if you want this to work well for users.

This assumes that users are running on AWS EC2 instances.

(I forgot to add that 😓)

Edit: added to the heading.

docs/multi-node.qmd (resolved)
docs/multi-node.qmd (outdated, resolved)
Comment on lines +147 to +153
On Node 1 (server), run the finetuning process using Accelerate:

```bash
accelerate launch -m axolotl.cli.train examples/llama-2/qlora.yml
```

This will start the finetuning process across all nodes. You can check the different IP addresses before each step to verify that the training is running on every node.

To my knowledge this is not the case. You need to run `accelerate launch -m` on every server, otherwise it will sit there and never actually start.

Author

@shahdivax shahdivax Jun 11, 2024

When we tested, we launched only on a single node (the server), and it was able to use the resources of the other nodes.
As proof, we could see the IPs of both machines on the left, and the total GPU count showed all the GPUs from all the nodes.
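
One way to make that check concrete is to compute the GPU total the hostfile implies and compare it with the total the training run reports. This is only a sketch: the hostfile contents and the helper name `expected_gpus` are hypothetical, and it assumes the `<host> slots=<n>` line format.

```bash
# Hypothetical hostfile contents; in practice point HOSTFILE at your real file.
HOSTFILE="$(mktemp)"
printf '%s\n' "10.0.0.1 slots=8" "10.0.0.2 slots=8" > "$HOSTFILE"

# Sum the slots=<n> values across all non-empty, non-comment lines.
expected_gpus() {
  awk '!/^[ \t]*(#|$)/ {
         for (i = 2; i <= NF; i++)
           if (sub(/^slots=/, "", $i)) total += $i
       }
       END { print total + 0 }' "$1"
}

expected_gpus "$HOSTFILE"
```

If the number printed here differs from the total GPU count shown at startup, one of the nodes has probably not joined the run.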

Collaborator

@muellerzr My guess is that this is specific to DeepSpeed, since the IP addresses are set in a hostfile. We should probably clarify that the command only needs to be run on the first node in that case. Most other setups, such as FSDP or plain multi-node DDP, will likely still need `accelerate launch` to be run on each node.
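
For the per-node case, a hedged sketch of what each node would run. The flags are standard `accelerate launch` options; the IP address, port, node count, GPU count, and the helper name `launch_cmd` are assumptions. The function only prints the command so the sketch can be run without a cluster.

```bash
# Hypothetical values; replace with your cluster's details.
MAIN_IP="10.0.0.1"      # address of the rank-0 node
MAIN_PORT=29500         # any free port reachable from every node
NUM_MACHINES=2
GPUS_PER_NODE=8

# Print the launch command for one node. The rank is 0 on the main node
# and 1..N-1 on the others; --num_processes is the total across all nodes.
launch_cmd() {
  machine_rank="$1"
  echo "accelerate launch" \
    "--num_machines $NUM_MACHINES" \
    "--num_processes $((NUM_MACHINES * GPUS_PER_NODE))" \
    "--machine_rank $machine_rank" \
    "--main_process_ip $MAIN_IP" \
    "--main_process_port $MAIN_PORT" \
    "-m axolotl.cli.train examples/llama-2/qlora.yml"
}

launch_cmd 0   # run the printed command on the first node
launch_cmd 1   # run the printed command on the second node
```

With a DeepSpeed hostfile the picture may differ: as far as I understand, DeepSpeed's multi-node launcher can start the remote workers over SSH itself, which would explain why a single launch on the first node was enough in the author's test.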

Author

@shahdivax shahdivax Jul 5, 2024

@winglian, @muellerzr That might be the case. DeepSpeed was a good option for us because we were running multi-node finetuning on EC2, which provides public IPs, and we used a hostfile. That made it easy to connect both machines and launch the finetuning from the root node only, which in turn connected all the other instances (using the resources of all the nodes from a single node).

docs/multi-node.qmd (outdated, resolved)
Author

@shahdivax shahdivax left a comment

All the required changes are done, and I think this doc is now good to go.

docs/multi-node.qmd (resolved)
4 participants