
Add new Relax function to the batched model for evaluating query tokens over multiple time steps in parallel #156

Merged
masahi merged 7 commits into octoml:batch-serving on Jan 13, 2024

Conversation

masahi (Member) commented Jan 10, 2024

For speculative decoding, and for restoring KV cache entries of evicted parallel-sampling requests, we need to be able to compute logits over multiple tokens (time steps) while utilizing the KV cache for the past tensors. This is a hybrid of the prefill and decode functions, in that

  • prefill can compute logits over multiple tokens but doesn't read from the KV cache
  • decode reads from the KV cache but works on only one token at a time.

I'm introducing a new function, tentatively called evaluate_multi_query, for this purpose. multi_query_decode is also a good name.
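To make the contrast concrete, here is a toy NumPy illustration of which positions each mode attends to. The `toy_attention` helper, the shapes, and the variable names are made up for this sketch; they are not the PR's actual Relax functions.

```python
import numpy as np

def toy_attention(q, k, v, num_past):
    """Single-head attention where every query sees all `num_past` cached
    positions plus the new positions up to and including its own."""
    n_new = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])             # (n_new, num_past + n_new)
    causal = np.tril(np.ones((n_new, n_new), dtype=bool))
    mask = np.concatenate([np.ones((n_new, num_past), dtype=bool), causal], axis=1)
    scores = np.where(mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v                                     # (n_new, head_dim)

d = 8
past_k, past_v = np.random.randn(5, d), np.random.randn(5, d)    # 5 steps already cached
new_q, new_k, new_v = (np.random.randn(3, d) for _ in range(3))  # 3 new query tokens

# prefill: multiple query tokens, but nothing is read from the KV cache
prefill_out = toy_attention(new_q, new_k, new_v, num_past=0)

# decode: reads the cache, but only the single latest query token
decode_out = toy_attention(new_q[-1:], np.concatenate([past_k, new_k]),
                           np.concatenate([past_v, new_v]), num_past=7)

# evaluate_multi_query: multiple query tokens AND past K/V read from the cache
multi_out = toy_attention(new_q, np.concatenate([past_k, new_k]),
                          np.concatenate([past_v, new_v]), num_past=5)
```

In this toy picture, the new function simply combines decode's cache read with prefill's ability to handle several query positions at once.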

The changes in run_llama_batched_vllm.py show a new request type and how the new function is meant to be used. There is no change under serve yet, since this is purely a model change. After we agree on the approach, I'll integrate the new function into the engine to complete my parallel-sampling work. @yelite needs this for speculative decoding.

There is no attention kernel that reads from the KV cache and operates on multiple queries, except FlashInfer, which has BatchedPrefillWithKVCache. But we can emulate the behavior of such a kernel by materializing past KV tensors from the cache, concatenating them with the present tensors, and running the standard prefill attention. This is not efficient, but its correctness is much easier to verify. Until we integrate FlashInfer, or Flash Attention adds paged KV cache support, we can use this emulation.
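As a rough illustration of that emulation, here is a minimal NumPy sketch assuming a simple block-table paged cache layout. The block size, cache layout, and function names below are assumptions for illustration only, not the actual vLLM cache format used in this repo.

```python
import numpy as np

BLOCK_SIZE = 16  # assumed paged-cache block size, for illustration only

def gather_past_kv(cache_k, cache_v, block_table, num_past):
    """Materialize one sequence's past K/V from a paged cache.
    cache_k/cache_v: (num_blocks, BLOCK_SIZE, head_dim); block_table maps
    logical block index -> physical block id for this sequence."""
    idx = np.arange(num_past)
    blocks = np.asarray(block_table)[idx // BLOCK_SIZE]
    offsets = idx % BLOCK_SIZE
    return cache_k[blocks, offsets], cache_v[blocks, offsets]

def emulated_multi_query_attention(q, present_k, present_v,
                                   cache_k, cache_v, block_table, num_past):
    """Concatenate the materialized past K/V with the present K/V, then run
    ordinary prefill-style attention (causal over the new tokens)."""
    past_k, past_v = gather_past_kv(cache_k, cache_v, block_table, num_past)
    k = np.concatenate([past_k, present_k])
    v = np.concatenate([past_v, present_v])
    n_new = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = np.concatenate([np.ones((n_new, num_past), dtype=bool),
                           np.tril(np.ones((n_new, n_new), dtype=bool))], axis=1)
    scores = np.where(mask, scores, -np.inf)
    probs = np.exp(scores - scores.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    return probs @ v
```

A fused kernel such as FlashInfer's BatchedPrefillWithKVCache would read the cache pages inside the attention kernel instead of materializing and concatenating the past K/V in memory, which is the main source of the emulation's inefficiency.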

@sunggg @yelite @elvin-n

sunggg (Member) commented Jan 11, 2024

Thank you for the PR, @masahi! Which TVM should I use to run this?
Also, would you move examples/python/run_llama_batched_vllm.py under serve/ so that it can be a single folder for mlc-serve?

masahi (Member, Author) commented Jan 11, 2024

For now we need TVM from https://github.com/masahi/tvm/tree/vllm-cache-reconstruct. After apache/tvm#16376 is merged, I'll do a rebase.

> would you move examples/python/run_llama_batched_vllm.py under serve/ so that it can be a single folder for mlc-serve?

examples/python/run_llama_batched_vllm.py is not associated with mlc-serve. mlc-ai/main also has it. I added it to demonstrate how to use the batched llama model.

masahi (Member, Author) commented Jan 11, 2024

Opened #157, which uses the new Relax function from this PR to enable parallel-sampling eviction.

yelite left a comment

This looks great and should be sufficient for speculative decoding with a draft model.

By the way, is it still necessary to keep decode after we have a good kernel for evaluate_multi_query? Will there be a performance loss if we run evaluate_multi_query with one token from each sequence? If not, maybe we can just name this decode. Maybe we can even retire prefill if the kernel can specialize without degrading performance in the case where it doesn't need to read past KV from the cache.

masahi (Member, Author) commented Jan 12, 2024

> By the way, is it still necessary to keep decode after we have a good kernel for evaluate_multi_query? Will there be a performance loss if we run evaluate_multi_query with one token from each sequence? If not, maybe we can just name this decode. Maybe we can even retire prefill if the kernel can specialize without degrading performance in the case where it doesn't need to read past KV from the cache.

This is an interesting idea. I'd like to think that specialization brings performance advantages (a decode kernel shouldn't parallelize over the query tokens, since that dimension is small). FlashInfer implements a dedicated kernel for batched decode even though it also has BatchedPrefillWithKVCache. We'd have to measure and see.

The comparison is a bit subtle, since moving from a single query to multiple ones involves switching to an entirely different kernel implementation (vLLM to Flash Attention / FlashInfer), so performance can be affected by any number of factors besides the increase in the number of query tokens.

masahi merged commit 66a2e53 into octoml:batch-serving on Jan 13, 2024
1 check passed
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request on Jan 30, 2024
This PR reorganizes the artifact structure. We now have two separate
types of directories to store the libs/weights/...: one "prebuilt"
directory which holds all the prebuilt libs and weights downloaded from
the internet, and other model directories that are generated by local
builds.

CLI and test scripts are updated accordingly for this change.