
KV cache refactor to decouple cache blocks and metadata about them #168

Merged: 1 commit into octoml:batch-serving on Jan 20, 2024

Conversation

masahi (Member) commented on Jan 19, 2024

Right now we attach both the cache blocks and metadata about them, such as block_table, to a single class that is passed back and forth between the engine and the model. While working on multi-GPU support for PT models, I learned that I need to use some RPC framework to manage multiple processes.

The inputs from the engine need to cross the RPC boundary on each inference, and it is awkward for the actual cache blocks to also be passed from the engine every time for no reason. For disco, only a handle to the cache blocks (DRef) is communicated between the engine and the model, but even this is unnecessary.

With this PR, the cache blocks are completely owned by the model, and only block_tables etc. need to be passed from the engine. I renamed the class to KVCacheInfo to reflect its new role. The interface in model_module.py doesn't need to change, since this change is an implementation detail of the paged cache model.
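To make the ownership split concrete, here is a minimal sketch. KVCacheInfo and block_tables come from the PR description; PagedCacheModel, its constructor, and the helper names are illustrative assumptions, not the actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KVCacheInfo:
    # Lightweight metadata that crosses the engine/model (and RPC) boundary.
    block_size: int
    # Maps each sequence id to the physical cache blocks it occupies.
    block_tables: Dict[int, List[int]] = field(default_factory=dict)


class PagedCacheModel:
    # Hypothetical model-side owner of the actual cache blocks.
    def __init__(self, num_blocks: int, block_size: int):
        # The cache blocks live here, owned by the model; they never have to
        # be serialized or shipped back from the engine on each inference.
        self.cache_blocks = self._allocate_blocks(num_blocks, block_size)

    def _allocate_blocks(self, num_blocks: int, block_size: int):
        # Placeholder for device-memory allocation of the paged KV cache.
        return [None] * num_blocks

    def generate(self, requests, cache_info: KVCacheInfo):
        # Only block_tables and similar metadata arrive from the engine.
        ...
```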


```diff
@@ -24,6 +24,7 @@ def get_default_mlc_serve_argparser(description="", allow_override=False):
     parser.add_argument("--artifact-path", type=str, default="dist")
     parser.add_argument("--use-sync-engine", action="store_true")
     parser.add_argument("--max-num-batched-tokens", type=int, default=4096)
+    parser.add_argument("--num-sequences-to-sample", type=int, default=1)
```
masahi (Member, Author) commented on Jan 19, 2024


This was apparently removed by accident in #162.

yelite merged commit 7b22824 into octoml:batch-serving on Jan 20, 2024
1 check passed
yelite commented on Jan 20, 2024

I am wondering if it's still necessary to pass the KVCacheInfo to TextGenerator.generate. Previously I created the interface with an opaque KV cache object to make testing easier and to reduce the implicit relationship between the model and the cache. But now I feel the model and the cache need to be coupled anyway, especially after this PR: the actual cache data is owned by the model, so the engine doesn't have any real choice about what to pass as the KVCacheInfo to TextGenerator.generate.
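For reference, a rough sketch of the interface shape under discussion, assuming illustrative Request and TextGenerationResult types (the real signatures in model_module.py may differ):

```python
from typing import List, Protocol


class TextGenerator(Protocol):
    # After this PR, kv_cache carries only metadata (KVCacheInfo); the model
    # already owns the actual cache blocks, which is why passing it
    # explicitly may feel redundant.
    def generate(
        self, requests: List["Request"], kv_cache: "KVCacheInfo"
    ) -> List["TextGenerationResult"]:
        ...
```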

masahi (Member, Author) commented on Jan 20, 2024

I see what you mean. We could do another refactoring, where

```python
self.text_generator.generate(requests, self.cache_manager.get_cache())
```

is changed to

```python
self.text_generator.generate(requests)
```

and all manipulations of self.cache_manager by the engine indirectly update the cache manager associated with the model.

However, going back to our initial design goals, we want to keep the model as pure as possible and put all states such as cache management into the engine side. The current interface is still good in that sense.

Also, I now realize that this PR itself goes against the model-purity goal a bit, but I think that is reasonable, since the cache blocks are updated in place by the model anyway.
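A hypothetical sketch of that alternative refactoring, in which the engine and the model share one cache manager instance so that generate() no longer takes a cache argument; all names besides cache_manager and generate are illustrative:

```python
class PagedCacheModel:
    def __init__(self, cache_manager):
        # The model holds the cache manager (and the cache blocks) itself.
        self.cache_manager = cache_manager

    def generate(self, requests):
        # Reads block_tables directly from its own cache manager instead of
        # receiving a KVCacheInfo argument from the engine.
        cache_info = self.cache_manager.get_cache()
        ...


class Engine:
    def __init__(self, model):
        self.text_generator = model
        # The engine mutates the same cache manager the model reads from, so
        # allocations and evictions are visible to the model implicitly.
        self.cache_manager = model.cache_manager

    def step(self, requests):
        return self.text_generator.generate(requests)
```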
