
KV cache refactor to decouple cache blocks and metadata about them #168

Merged: 1 commit into octoml:batch-serving on Jan 20, 2024

Conversation

masahi (Member) commented on Jan 19, 2024

Right now we attach both the cache blocks and metadata about them, such as block_table, to a single class that is passed back and forth between the engine and the model. While working on multi-GPU support for PT models, I learned that I need to use some RPC framework to manage multiple processes.

The inputs from the engine need to cross the RPC boundary on each inference, and it is awkward for the actual cache blocks to also be passed from the engine every time for no reason. For disco, only a handle to the cache blocks (DRef) is communicated between the engine and the model, but even this is unnecessary.

With this PR, the cache blocks are completely owned by the model, and only block_tables etc. need to be passed from the engine. I renamed the class to KVCacheInfo to reflect its new role. The interface in model_module.py doesn't need to change, since this change is an implementation detail of the paged cache model.
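To make the ownership split concrete, here is a minimal sketch. KVCacheInfo and block_tables come from the PR description; PagedCacheModel, its constructor, and the helper names are illustrative assumptions, not the actual code:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class KVCacheInfo:
    # Lightweight metadata that crosses the engine/model (and RPC) boundary.
    block_size: int
    # Maps each sequence id to the physical cache blocks it occupies.
    block_tables: Dict[int, List[int]] = field(default_factory=dict)


class PagedCacheModel:
    # Hypothetical model-side owner of the actual cache blocks.
    def __init__(self, num_blocks: int, block_size: int):
        # The cache blocks live here, owned by the model; they never have to
        # be serialized or shipped back from the engine on each inference.
        self.cache_blocks = self._allocate_blocks(num_blocks, block_size)

    def _allocate_blocks(self, num_blocks: int, block_size: int):
        # Placeholder for device-memory allocation of the paged KV cache.
        return [None] * num_blocks

    def generate(self, requests, cache_info: KVCacheInfo):
        # Only block_tables and similar metadata arrive from the engine.
        ...
```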


```diff
@@ -24,6 +24,7 @@ def get_default_mlc_serve_argparser(description="", allow_override=False):
     parser.add_argument("--artifact-path", type=str, default="dist")
     parser.add_argument("--use-sync-engine", action="store_true")
     parser.add_argument("--max-num-batched-tokens", type=int, default=4096)
+    parser.add_argument("--num-sequences-to-sample", type=int, default=1)
```
masahi (Member, Author) commented on Jan 19, 2024


This was apparently removed by accident in #162.

yelite merged commit 7b22824 into octoml:batch-serving on Jan 20, 2024
1 check passed
yelite commented on Jan 20, 2024

I am wondering if it's still necessary to pass the KVCacheInfo to TextGenerator.generate. Previously I created the interface with an opaque KV cache object to make testing easier and to reduce the implicit relationship between the model and the cache. But now I feel the model and the cache need to be coupled anyway, especially after this PR: the actual cache data is owned by the model, so the engine doesn't have any real choice about what to pass as the KVCacheInfo to TextGenerator.generate.
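For reference, a rough sketch of the interface shape under discussion, assuming illustrative Request and TextGenerationResult types (the real signatures in model_module.py may differ):

```python
from typing import List, Protocol


class TextGenerator(Protocol):
    # After this PR, kv_cache carries only metadata (KVCacheInfo); the model
    # already owns the actual cache blocks, which is why passing it
    # explicitly may feel redundant.
    def generate(
        self, requests: List["Request"], kv_cache: "KVCacheInfo"
    ) -> List["TextGenerationResult"]:
        ...
```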

masahi (Member, Author) commented on Jan 20, 2024

I see what you mean. We could do another refactoring, where

```python
self.text_generator.generate(requests, self.cache_manager.get_cache())
```

is changed to

```python
self.text_generator.generate(requests)
```

and all manipulations of self.cache_manager by the engine indirectly update the cache manager associated with the model.

However, going back to our initial design goals, we want to keep the model as pure as possible and put all states such as cache management into the engine side. The current interface is still good in that sense.

Also, I now realize that this PR itself goes against the model-purity goal a bit, but I think that is reasonable, since the cache blocks are updated in place by the model anyway.
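A hypothetical sketch of that alternative refactoring, in which the engine and the model share one cache manager instance so that generate() no longer takes a cache argument; all names besides cache_manager and generate are illustrative:

```python
class PagedCacheModel:
    def __init__(self, cache_manager):
        # The model holds the cache manager (and the cache blocks) itself.
        self.cache_manager = cache_manager

    def generate(self, requests):
        # Reads block_tables directly from its own cache manager instead of
        # receiving a KVCacheInfo argument from the engine.
        cache_info = self.cache_manager.get_cache()
        ...


class Engine:
    def __init__(self, model):
        self.text_generator = model
        # The engine mutates the same cache manager the model reads from, so
        # allocations and evictions are visible to the model implicitly.
        self.cache_manager = model.cache_manager

    def step(self, requests):
        return self.text_generator.generate(requests)
```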
