[Feature] when --tp 2 #2423

Open
maxin9966 opened this issue Sep 4, 2024 · 6 comments

@maxin9966

Motivation

CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /home/ma/work/modelscope/glm-4-9b-chat-GPTQ-Int4 --backend turbomind --model-format gptq --server-port 11231 --tp 2 --session-len 16500 --cache-max-entry-count 0.1 --model-name gpt --max-batch-size 64

Regarding memory usage when --tp 2 is enabled: why does the total memory usage double when tp equals 2? Each GPU appears to load a full copy of the model. Shouldn't the model be split and distributed across the GPUs instead?

Related resources

No response

Additional context

No response

@lvhan028 (Collaborator) commented Sep 4, 2024

Please check out the NOTE part in https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html
The KV cache is allocated as a ratio of the FREE GPU memory remaining after the model is loaded.

@maxin9966 (Author)

@lvhan028 --cache-max-entry-count 0.1

I set it to 0.1, and with tp=2 the two GPUs each take up over 7 GB. When I set tp=1, the single GPU also takes up over 7 GB.

@lvhan028 (Collaborator) commented Sep 5, 2024

Assume ONE GPU's total memory is T, the model's memory footprint is S, the hyper-parameter --cache-max-entry-count is lambda, and the number of GPUs in tensor parallelism is P.

According to LMDeploy's memory management policy, lambda * (T - S/P) will be allocated for the KV cache on each GPU, no matter whether the model is quantized or not.
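As a quick back-of-the-envelope check of that formula, here is a minimal sketch in Python. The numbers (T = 24 GB, S = 7 GB, lambda = 0.1) are hypothetical placeholders, not measurements of this particular setup:

```python
# Back-of-the-envelope estimate of per-GPU memory use, based on the
# formula kv_cache = lambda * (T - S / P).
# All numbers below are hypothetical, for illustration only.

T = 24.0    # total memory of ONE GPU, in GB (assumed)
S = 7.0     # model weight footprint, in GB (assumed)
lam = 0.1   # --cache-max-entry-count

for P in (1, 2):                    # tensor-parallel degree
    weights_per_gpu = S / P         # ideal case: weights evenly split
    kv_cache_per_gpu = lam * (T - weights_per_gpu)
    total_per_gpu = weights_per_gpu + kv_cache_per_gpu
    print(f"tp={P}: weights {weights_per_gpu:.2f} GB, "
          f"kv cache {kv_cache_per_gpu:.2f} GB, "
          f"~{total_per_gpu:.2f} GB per GPU (plus runtime overhead)")
```

Under these assumptions the per-GPU footprint should drop noticeably when going from tp=1 to tp=2, which is why identical per-GPU usage at both settings looks suspicious.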

@maxin9966 (Author)

@lvhan028 I know the formula, but the actual measurements do not match it. With the same command, changing only tp: at tp=1 the single card uses more than 7 GB of VRAM, while at tp=2 each of the two cards also uses more than 7 GB.

Am I missing some startup parameters?

@lvhan028 (Collaborator) commented Sep 6, 2024

The token_embedding and lm_head weights are not split and distributed across GPUs.
Each GPU owns a full copy.
PR #2252 resolves this and will be released next week.
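To get a feel for how much the duplicated copies cost, here is a rough, hypothetical estimate. The vocabulary and hidden sizes below are assumptions approximating glm-4-9b-chat (check the model's config.json for the real values), and the embeddings are assumed to stay in fp16 even for the GPTQ-Int4 checkpoint:

```python
# Rough estimate of the memory cost of keeping a full copy of the
# token_embedding and lm_head weights on every GPU instead of splitting them.
# The config values are assumptions; consult the model's config.json.

vocab_size = 151_552      # assumed vocabulary size
hidden_size = 4_096       # assumed hidden size
bytes_per_param = 2       # fp16 weights (embeddings typically not quantized)

embedding_gb = vocab_size * hidden_size * bytes_per_param / 1024**3
lm_head_gb = vocab_size * hidden_size * bytes_per_param / 1024**3  # if not tied

print(f"token_embedding: ~{embedding_gb:.2f} GB per GPU")
print(f"lm_head:         ~{lm_head_gb:.2f} GB per GPU")
print(f"duplicated on each GPU: ~{embedding_gb + lm_head_gb:.2f} GB")
```

Per-GPU usage therefore does not halve when tp goes from 1 to 2, since each GPU still carries these unsplit weights on top of its shard of the transformer layers.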

@lvhan028 (Collaborator)

You may try v0.6.0.

@lvhan028 lvhan028 self-assigned this Sep 16, 2024