Allow serving llama models with tensor parallel #592

Draft
Jackmin801 wants to merge 1 commit into main
Conversation

@Jackmin801 commented Jul 20, 2024

It would be great to be able to serve Llama models faster by enabling tensor parallelism.

This is currently a hacky implementation, as I haven't quite figured out how to get num_key_value_heads to be correct in the cache conversion function. For now it is hardcoded to work with "cuda:0 cuda:1".
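
To illustrate what I think the cache conversion should do instead of hardcoding two devices, here is a rough sketch (the helper below is hypothetical and not the actual Petals cache conversion function; num_key_value_heads follows the Hugging Face Llama config):

# Hypothetical sketch, not the actual Petals cache conversion code:
# each tensor-parallel shard should hold an equal slice of the KV heads,
# so the per-shard head count comes from dividing by the number of TP devices.
def num_kv_heads_per_shard(num_key_value_heads: int, world_size: int) -> int:
    assert num_key_value_heads % world_size == 0, \
        "num_key_value_heads must be divisible by the number of tensor_parallel_devices"
    return num_key_value_heads // world_size

# e.g. Llama-3-70B has num_key_value_heads = 8, so with "cuda:0 cuda:1" each shard would get 4.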

@Jackmin801 (Author) commented Jul 20, 2024

@justheuristic @borzunov
Does this implementation look roughly correct to you?

It doesn't seem to be working: it hangs while trying to process outputs in def process_output(output, output_actions: Dict[Arg, Callable[[torch.Tensor, int], torch.Tensor]], *, rank: int, world_size: int) of the tensor_parallel library. I am launching with this command on a machine with 2 x 3090:

python -m petals.cli.run_server --port 31337 PrimeIntellect/Meta-Llama-3-70B-Instruct --initial_peers $INITIAL_PEERS --block_indices 0:1 --tensor_parallel_devices cuda:0 cuda:1

These are the inputs to process_output, captured with print statements:

output_actions = {0: <tensor_parallel.communications.NCCLAllReduce object at 0x75e304f099d0>, 2: <tensor_parallel.communications.CollectiveOperation object at 0x75e304f8be50>}
output = {0: <class 'torch.Tensor'>, 1: <class 'NoneType'>, 2: <class 'tuple'>}
rank = 0
world_size = 2

My guess is that only one of the ranks reaches this line, so it waits indefinitely in the all-reduce. Am I launching the server incorrectly?
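
For reference, a minimal standalone repro of the failure mode I suspect (plain torch.distributed with the gloo backend, not Petals or tensor_parallel code); the script hangs by design because only rank 0 enters the collective:

# Minimal sketch of the suspected failure mode, unrelated to Petals internals:
# a collective op blocks until every rank in the group calls it.
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    t = torch.ones(1) * rank
    if rank == 0:
        dist.all_reduce(t)  # rank 1 never calls all_reduce, so rank 0 blocks here forever
    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)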

@justheuristic (Collaborator) commented
Thank you for the pull request!
It does look roughly correct, but there may be caveats. For this to be merged, it would be great to have at least one test configuration that 1) uses at least one TP server and 2) runs test_full_model.py for a Llama-based model (our tests use Maykeye/TinyLLama-v0). You can view the CI tests below and edit the CI configuration in .github/workflows/run-tests.yaml.

If you change the CI configuration and push changes to this PR, the new tests should run automatically.

P.S. If you're not familiar with how our CI is configured, we use GitHub Actions (see the tutorial here).
