Parallel sampling eviction #157

masahi · 2024-01-11T10:11:51Z

The most interesting change is in engine_common.py, where I use one PrefillRequest and EvalMultiQueryRequest for each sequence to restore the cache entries of evicted parallel-sampling requests.
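For illustration, here is a minimal sketch of one reading of the description above: the shared prompt is restored with a single prefill, and each sampled sequence then replays its already-generated tokens on top of the cached prompt. The two dataclasses are simplified, hypothetical stand-ins that only borrow the names from this PR; their fields, and the helper `build_restore_requests`, are assumptions and not the actual definitions in engine_common.py.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class PrefillRequest:
    # Hypothetical stand-in: recomputes the KV cache for the shared prompt.
    request_id: str
    token_ids: List[int]
    num_sequence: int


@dataclass
class EvalMultiQueryRequest:
    # Hypothetical stand-in: evaluates several query tokens at once
    # on top of `num_past_tokens` tokens that are already cached.
    request_id: str
    sequence_index: int
    num_past_tokens: int
    query_token_ids: List[int]


def build_restore_requests(
    request_id: str,
    prompt_token_ids: List[int],
    generated_per_sequence: List[List[int]],
) -> List[Union[PrefillRequest, EvalMultiQueryRequest]]:
    """Build the requests needed to restore an evicted parallel-sampling request."""
    requests: List[Union[PrefillRequest, EvalMultiQueryRequest]] = [
        PrefillRequest(request_id, list(prompt_token_ids), len(generated_per_sequence))
    ]
    for index, generated in enumerate(generated_per_sequence):
        if not generated:
            # Nothing to replay for a sequence that has not produced tokens yet.
            continue
        requests.append(
            EvalMultiQueryRequest(
                request_id=request_id,
                sequence_index=index,
                num_past_tokens=len(prompt_token_ids),
                query_token_ids=list(generated),
            )
        )
    return requests
```

Only the number of past tokens is passed to the multi-query step, which matches the commit message "Only the number of past tokens is needed" in this PR.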

I couldn't find a good way to test this programmatically. I tested it by changing the condition at https://github.com/octoml/mlc-llm/blob/batch-serving/serve/mlc_serve/engine/engine_common.py#L327, depending on the model and input, to manually force an eviction.
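As a hedged sketch of how such a condition could be made controllable from a test, the eviction check can accept an optional override. Everything here (`EvictionPolicy`, `min_free_tokens`, `force_evict`) is hypothetical and is not taken from engine_common.py; the real condition at the linked line may look quite different.

```python
from typing import Callable, Optional


class EvictionPolicy:
    """Hypothetical eviction check with a test-only override hook."""

    def __init__(
        self,
        min_free_tokens: int,
        force_evict: Optional[Callable[[], bool]] = None,
    ) -> None:
        self.min_free_tokens = min_free_tokens
        self._force_evict = force_evict

    def should_evict(self, free_tokens: int) -> bool:
        # A test can inject `force_evict` to trigger eviction regardless
        # of how much cache space is actually free.
        if self._force_evict is not None and self._force_evict():
            return True
        return free_tokens < self.min_free_tokens
```

With a hook like this, a test could force an eviction for any model and input instead of editing the condition by hand.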

Ready for review @sunggg @elvin-n


masahi · 2024-02-01T10:07:03Z

serve/mlc_serve/engine/engine_common.py

+                        f"since it has generated more than {self.max_num_batched_tokens} tokens in total "
+                        "and currently we do not support preempting such a request.",
+                    )
+                    continue


@sunggg @elvin-n Please be aware of this limitation. Due to this, there is still a case when a parallel-sampling request is cancelled rather than preempted.

In general, we don't have a good solution for preempting a request which has generated more than max_num_batched_tokens tokens. See also #163. The easiest solution would be to stop generation at max_num_batched_tokens, but then we cannot support "unlimited" generation.
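The limitation can be summarized with a small sketch. The function name `eviction_action` and the way the total token count is passed in are assumptions for illustration only; how the engine counts tokens "in total" for a parallel-sampling request is not shown here.

```python
def eviction_action(num_total_tokens: int, max_num_batched_tokens: int) -> str:
    """Decide what to do with a request selected for eviction.

    Hypothetical helper: a request whose total token count exceeds
    max_num_batched_tokens is cancelled, since preempting it is not
    currently supported; any other request is preempted and can be
    restored later.
    """
    if num_total_tokens > max_num_batched_tokens:
        return "cancel"
    return "preempt"
```

For example, with `max_num_batched_tokens=4096`, a request holding 5000 tokens would be cancelled, while one holding 4096 tokens could still be preempted.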


masahi added 29 commits January 11, 2024 07:00

add new model for evaluating logits over multiple queries using KV cache

e0ef4c6

add test

4ccbb27

clean

f1314a5

Only the number of past tokens is needed

2bee022

fix build

756b09f

fix

09ef5b3

correctly handle num_past_tokens > sliding_window case

7b67ba4

wip

e0517fd

blac

cf89a5b

wip

9ca4806

wip

4541b4d

remove cancel call back in eviction

5d376d2

Create MultiQueryDecodeRequest

59c36cc

only the number of past tokens is needed

f58acf7

wip

d9dd2ca

wip

cb11761

wip

24f7bfa

fix

34da221

wip

d94e9d8

wip

4a3bb77

wip

0c6875e

wip

a46abe1

working?

c80bea2

remove dbg print

18239a4

multi gpu works

fd2b2bd

fixed sliding window logic

6ac292b

remove dbug print

2f9d1f7

clean and fix

3a9f6d6

mypy

9fb9261

masahi mentioned this pull request Jan 11, 2024

Add new Relax function to the batched model for evaluating query tokens over multiple time steps in parallel #156

Merged

masahi force-pushed the parallel-sampling-eviction branch from bc3dc83 to 9fb9261 Compare January 13, 2024 07:44

masahi added 5 commits January 13, 2024 07:44

generate signature update

906b23b

Merge branch 'batch-serving' into parallel-sampling-eviction

2c1aa04

more

b197e71

fix mypy

2dfa28d

fix

e287c5f

masahi marked this pull request as ready for review January 13, 2024 08:10

masahi marked this pull request as draft January 18, 2024 04:49

Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Jan 30, 2024

[Community] Add link to Discord server (octoml#157)

26d341a

* add discord link * empty line * fix

masahi added 8 commits January 31, 2024 21:06

Merge branch 'batch-serving' into parallel-sampling-eviction

417750c

fix

c925c52

mypy fix

a4d6e01

Merge branch 'batch-serving' into parallel-sampling-eviction

7360392

refactor

5dbf73e

fix

78a6f77

rename

9189697

Disallow preempting when a request has generated more than max_num_ba…

d4fe2d7

…tched_tokens

masahi marked this pull request as ready for review February 1, 2024 10:00

masahi commented Feb 1, 2024


masahi mentioned this pull request Feb 1, 2024

[Bug] Recovering logic of a long evicted request is broken #163

Open

sunggg merged commit ed0e52f into octoml:batch-serving Feb 2, 2024
1 check passed

sunggg added a commit that referenced this pull request Feb 2, 2024

Revert "Parallel sampling eviction (#157)"

d324181

This reverts commit ed0e52f.

sunggg mentioned this pull request Feb 2, 2024

Revert "Parallel sampling eviction" #189

Merged

sunggg added a commit that referenced this pull request Feb 2, 2024

Revert "Parallel sampling eviction" (#189)

8a119d1

Revert "Parallel sampling eviction (#157)" This reverts commit ed0e52f.
