
Integrate Flash-Decoding into engine #181

Merged: 30 commits into octoml:batch-serving on Feb 12, 2024

Conversation

@masahi (Member) commented Jan 31, 2024

A follow-up to #177

As I commented in #177, this PR introduces a breaking change to the build flow (--use-vllm-attention is removed), so I recommend merging it after other high-priority PRs like #82 are merged. Marked as draft to avoid an early merge.

After this PR, replace --use-vllm-attention in your build command with --paged-kv-cache-type vllm or --paged-kv-cache-type flash-decoding. You also need the latest for-mlc-serve-jan12 branch.
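
For reference, a minimal sketch of the change to the build command. Only the flags --use-vllm-attention and --paged-kv-cache-type come from this PR; the build entrypoint and model path are assumptions for illustration (the path is reused from the llmperf command further down):

# before this PR (flag now removed):
#   python3 -m mlc_llm.build --model dist/models/llama-2-13b-chat-hf --use-vllm-attention
# after this PR, equivalent behavior:
python3 -m mlc_llm.build --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type vllm
# after this PR, enabling Flash-Decoding:
python3 -m mlc_llm.build --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type flash-decoding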

Preliminary benchmark results

benchmark_throughput.py

Using --max-num-batched-tokens 4096 --greedy-sampling-ratio 1

llama 7B fp16
FD (block size 256): Engine Throughput: 43.52 requests/s, 15714.20 tokens/s (437 blocks)
FD (block size 128): Engine Throughput: 44.29 requests/s, 15991.46 tokens/s (874 blocks)
vLLM: Engine Throughput: 42.22 requests/s, 15245.43 tokens/s

Mistral 7B fp16
FD (block size 256): Engine Throughput: 46.68 requests/s, 17859.27 tokens/s (1766 blocks)
FD (block size 128): Engine Throughput: 48.80 requests/s, 18673.48 tokens/s (3533 blocks)
vLLM: Engine Throughput: 52.95 requests/s, 20259.87 tokens/s

llama 13B fp16
FD (block size 256): Engine Throughput: 24.02 requests/s, 8674.84 tokens/s (210 blocks)
FD (block size 128): Engine Throughput: 23.73 requests/s, 8569.77 tokens/s (421 blocks)
vLLM: Engine Throughput: 22.73 requests/s, 8206.14 tokens/s

llama 70B fp16, 2 GPUs
FD (block size 256): Engine Throughput: 5.09 requests/s, 1839.43 tokens/s (59 blocks)
FD (block size 128): Engine Throughput: 5.70 requests/s, 2057.70 tokens/s (113 blocks)
vLLM: Engine Throughput: 6.01 requests/s, 2168.58 tokens/s (909 blocks)

Mixtral fp16, 2 GPUs
FD (block size 256): Engine Throughput: 26.84 requests/s, 10270.41 tokens/s (1637 blocks)
FD (block size 128): Engine Throughput: 25.16 requests/s, 9625.92 tokens/s (3274 blocks)
vLLM: Engine Throughput: 26.27 requests/s, 10052.30 tokens/s
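
A note on the parenthesized figures, assuming they are the number of allocated KV cache blocks: halving the block size roughly doubles the block count, so the total cache capacity in tokens (blocks × block size) stays about the same for a given model, e.g. 437 × 256 = 874 × 128 = 111,872 token slots for llama 7B.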

llmperf
Using llama 13B fp16
MLC_API_BASE="http://localhost:8000/v1" MLC_API_KEY="xxxxx" python llmperf.py -r 300 -c 30 --max-tokens 150 -f mlc -m dist/models/llama-2-13b-chat-hf

FD

OK          280
Mismatch     20
Name: count, dtype: int64
Clean DF is: 300
Mean End-to-end: 3191 ms
Mean TTFT: 495 ms (mean tokens in: 504, out: 135)
Max TTFT: 939 ms
TTFT > 3 s: 0.00%
ITL (out): 23.77 ms/token, mean tokens/s output (out): 42.18 token/s

vLLM

OK          278
Mismatch     22
Name: count, dtype: int64
Clean DF is: 300
Mean End-to-end: 3468 ms
Mean TTFT: 503 ms (mean tokens in: 503, out: 134)
Max TTFT: 890 ms
TTFT > 3 s: 0.00%
ITL (out): 26.01 ms/token, mean tokens/s output (out): 38.53 token/s
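
(As a sanity check, the ITL and output-throughput numbers are mutually consistent if tokens/s is taken as the reciprocal of the mean ITL: 1000 / 23.77 ≈ 42.1 tokens/s for FD and 1000 / 26.01 ≈ 38.4 tokens/s for vLLM, matching the figures reported above.)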

@sunggg (Member) commented Jan 31, 2024

Thank you for the great improvement, @masahi! Let me follow up later this week.

> So I recommend merging this PR after other high-priority PRs like #82 are merged.

To avoid an accidental merge, can we mark this PR as a draft for now?

@masahi masahi marked this pull request as draft January 31, 2024 18:06
@masahi masahi marked this pull request as ready for review February 8, 2024 19:43
@masahi (Member, Author) commented Feb 8, 2024

This is ready for review. More benchmarks will be done after it is merged. You should update for-mlc-serve-jan12, and --use-vllm-attention in your build command needs to be replaced with --paged-kv-cache-type vllm. FD is not used unless you specify --paged-kv-cache-type flash-decoding.

@sunggg @elvin-n @yelite @vinx13

@sunggg (Member) left a comment

LGTM, thanks @masahi!

@sunggg sunggg merged commit edf8d27 into octoml:batch-serving Feb 12, 2024
1 check passed
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Feb 27, 2024