Allow serving llama models with tensor parallel #592

Draft
Jackmin801 wants to merge 1 commit into main
Conversation

@Jackmin801 commented Jul 20, 2024

It would be great to be able to serve Llama models faster by enabling tensor parallelism.

This is currently a hacky implementation, as I haven't quite figured out how to get num_key_value_heads to be correct in the cache conversion function. For now it is hardcoded to work with "cuda:0 cuda:1".
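
To illustrate what I think the cache conversion should do instead of hardcoding two devices, here is a rough sketch (the helper below is hypothetical and not the actual Petals cache conversion function; num_key_value_heads follows the Hugging Face Llama config):

# Hypothetical sketch, not the actual Petals cache conversion code:
# each tensor-parallel shard should hold an equal slice of the KV heads,
# so the per-shard head count comes from dividing by the number of TP devices.
def num_kv_heads_per_shard(num_key_value_heads: int, world_size: int) -> int:
    assert num_key_value_heads % world_size == 0, \
        "num_key_value_heads must be divisible by the number of tensor_parallel_devices"
    return num_key_value_heads // world_size

# e.g. Llama-3-70B has num_key_value_heads = 8, so with "cuda:0 cuda:1" each shard would get 4.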

@Jackmin801 (Author) commented Jul 20, 2024

@justheuristic @borzunov
Does this implementation look roughly correct to you?

It doesn't seem to be working: it hangs while trying to process outputs in def process_output(output, output_actions: Dict[Arg, Callable[[torch.Tensor, int], torch.Tensor]], *, rank: int, world_size: int) of the tensor_parallel library. I am launching with this command on a machine with 2 x 3090:

python -m petals.cli.run_server --port 31337 PrimeIntellect/Meta-Llama-3-70B-Instruct --initial_peers $INITIAL_PEERS --block_indices 0:1 --tensor_parallel_devices cuda:0 cuda:1

These are the inputs to process_output, captured with print statements:

output_actions = {0: <tensor_parallel.communications.NCCLAllReduce object at 0x75e304f099d0>, 2: <tensor_parallel.communications.CollectiveOperation object at 0x75e304f8be50>}
output = {0: <class 'torch.Tensor'>, 1: <class 'NoneType'>, 2: <class 'tuple'>}
rank = 0
world_size = 2

My guess is that only one of the ranks reaches this line, so it waits indefinitely in the all-reduce. Am I launching the server incorrectly?
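
For reference, a minimal standalone repro of the failure mode I suspect (plain torch.distributed with the gloo backend, not Petals or tensor_parallel code); the script hangs by design because only rank 0 enters the collective:

# Minimal sketch of the suspected failure mode, unrelated to Petals internals:
# a collective op blocks until every rank in the group calls it.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    t = torch.ones(1) * rank
    if rank == 0:
        dist.all_reduce(t)  # rank 1 never calls all_reduce, so rank 0 blocks here forever
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)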

@justheuristic (Collaborator) commented
Thank you for the pull request!
It does look roughly correct, but there may be caveats. For this to be merged, it would be great to have at least one test configuration that 1) uses at least one TP server and 2) runs test_full_model.py for a Llama-based model (our tests use Maykeye/TinyLLama-v0). You can view the CI tests below and edit the CI configuration in .github/workflows/run-tests.yaml.

If you change the CI configuration and push changes to this PR, the new tests should run automatically.

P.S. If you're not familiar with how our CI is configured, we use GitHub Actions (see the tutorial here).
