
Integrate Flash-Decoding into engine #181

Merged: 30 commits into octoml:batch-serving on Feb 12, 2024

Conversation

@masahi (Member) commented Jan 31, 2024

A follow-up to #177

As I commented in #177, this PR introduces a breaking change to the build flow (--use-vllm-attention is removed), so I recommend merging it after other high-priority PRs like #82 are merged. Marked as draft to avoid an early merge.

After this PR, replace --use-vllm-attention in your build command with --paged-kv-cache-type vllm or --paged-kv-cache-type flash-decoding. You also need the latest for-mlc-serve-jan12 branch.
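
For reference, a minimal sketch of the change to the build command. Only the flags --use-vllm-attention and --paged-kv-cache-type come from this PR; the build entrypoint and model path are assumptions for illustration (the path is reused from the llmperf command further down):

# before this PR (flag now removed):
#   python3 -m mlc_llm.build --model dist/models/llama-2-13b-chat-hf --use-vllm-attention
# after this PR, equivalent behavior:
python3 -m mlc_llm.build --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type vllm
# after this PR, enabling Flash-Decoding:
python3 -m mlc_llm.build --model dist/models/llama-2-13b-chat-hf --paged-kv-cache-type flash-decoding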

Preliminary benchmark results

benchmark_throughput.py

Using --max-num-batched-tokens 4096 --greedy-sampling-ratio 1

llama 7B fp16
FD (block size 256): Engine Throughput: 43.52 requests/s, 15714.20 tokens/s (437 blocks)
FD (block size 128): Engine Throughput: 44.29 requests/s, 15991.46 tokens/s (874 blocks)
vLLM: Engine Throughput: 42.22 requests/s, 15245.43 tokens/s

Mistral 7B fp16
FD (block size 256): Engine Throughput: 46.68 requests/s, 17859.27 tokens/s (1766 blocks)
FD (block size 128): Engine Throughput: 48.80 requests/s, 18673.48 tokens/s (3533 blocks)
vLLM: Engine Throughput: 52.95 requests/s, 20259.87 tokens/s

llama 13B fp16
FD (block size 256): Engine Throughput: 24.02 requests/s, 8674.84 tokens/s (210 blocks)
FD (block size 128): Engine Throughput: 23.73 requests/s, 8569.77 tokens/s (421 blocks)
vLLM: Engine Throughput: 22.73 requests/s, 8206.14 tokens/s

llama 70B fp16, 2 GPUs
FD (block size 256): Engine Throughput: 5.09 requests/s, 1839.43 tokens/s (59 blocks)
FD (block size 128): Engine Throughput: 5.70 requests/s, 2057.70 tokens/s (113 blocks)
vLLM: Engine Throughput: 6.01 requests/s, 2168.58 tokens/s (909 blocks)

Mixtral fp16, 2 GPUs
FD (block size 256): Engine Throughput: 26.84 requests/s, 10270.41 tokens/s (1637 blocks)
FD (block size 128): Engine Throughput: 25.16 requests/s, 9625.92 tokens/s (3274 blocks)
vLLM: Engine Throughput: 26.27 requests/s, 10052.30 tokens/s
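
A note on the parenthesized figures, assuming they are the number of allocated KV cache blocks: halving the block size roughly doubles the block count, so the total cache capacity in tokens (blocks × block size) stays about the same for a given model, e.g. 437 × 256 = 874 × 128 = 111,872 token slots for llama 7B.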

llmperf
Using llama 13B fp16
MLC_API_BASE="http://localhost:8000/v1" MLC_API_KEY="xxxxx" python llmperf.py -r 300 -c 30 --max-tokens 150 -f mlc -m dist/models/llama-2-13b-chat-hf

FD

OK          280
Mismatch     20
Name: count, dtype: int64
Clean DF is: 300
Mean End-to-end: 3191 ms
Mean TTFT: 495 ms (mean tokens in: 504, out: 135)
Max TTFT: 939 ms
TTFT > 3 s: 0.00%
ITL (out): 23.77 ms/token, mean tokens/s output (out): 42.18 token/s

vLLM

OK          278
Mismatch     22
Name: count, dtype: int64
Clean DF is: 300
Mean End-to-end: 3468 ms
Mean TTFT: 503 ms (mean tokens in: 503, out: 134)
Max TTFT: 890 ms
TTFT > 3 s: 0.00%
ITL (out): 26.01 ms/token, mean tokens/s output (out): 38.53 token/s
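
(As a sanity check, the ITL and output-throughput numbers are mutually consistent if tokens/s is taken as the reciprocal of the mean ITL: 1000 / 23.77 ≈ 42.1 tokens/s for FD and 1000 / 26.01 ≈ 38.4 tokens/s for vLLM, matching the figures reported above.)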

@sunggg (Member) commented Jan 31, 2024

Thank you for the great improvement, @masahi! Let me follow up later this week.

> So I recommend merging this PR after other high-priority PRs like #82 are merged.

To avoid an accidental merge, can we mark this PR as a draft for now?

@masahi masahi marked this pull request as draft January 31, 2024 18:06
@masahi masahi marked this pull request as ready for review February 8, 2024 19:43
@masahi (Member, Author) commented Feb 8, 2024

This is ready for review. More benchmarks will be done after it is merged. You should update for-mlc-serve-jan12, and --use-vllm-attention in your build command needs to be replaced with --paged-kv-cache-type vllm. FD is not used unless you specify --paged-kv-cache-type flash-decoding.

@sunggg @elvin-n @yelite @vinx13

@sunggg (Member) left a comment

LGTM, thanks @masahi!

@sunggg sunggg merged commit edf8d27 into octoml:batch-serving Feb 12, 2024
1 check passed
Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Feb 27, 2024