[Feature] when --tp 2 #2423
Comments
Please check out the NOTE part in https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html
@lvhan028 I set --cache-max-entry-count to 0.1. With tp=2, each of the two GPUs uses over 7 GB. When I set tp=1, the single GPU also uses over 7 GB.
Assume ONE GPU's total memory is T and the model weights take W. According to LMDeploy's memory management policy, the weights are sharded across the tp GPUs, and the k/v cache then takes cache-max-entry-count of the free memory left on each GPU, so per-GPU usage is roughly W/tp + (T - W/tp) * cache-max-entry-count.
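As a rough illustration of that policy, here is a minimal sketch (not LMDeploy code); the 24 GiB total and 7 GiB weight figures below are made-up example values:

```python
# Rough sketch of the memory policy described above -- not LMDeploy code.
# total_gib and weight_gib are made-up example values.

def per_gpu_memory_gib(total_gib, weight_gib, tp, cache_max_entry_count):
    """Weights are sharded across `tp` GPUs; the k/v cache then takes
    `cache_max_entry_count` of the memory left free on each GPU."""
    weights_per_gpu = weight_gib / tp
    kv_cache = (total_gib - weights_per_gpu) * cache_max_entry_count
    return weights_per_gpu + kv_cache

for tp in (1, 2):
    used = per_gpu_memory_gib(total_gib=24, weight_gib=7, tp=tp,
                              cache_max_entry_count=0.1)
    print(f"tp={tp}: ~{used:.2f} GiB per GPU")
# tp=1: ~8.70 GiB on the single GPU
# tp=2: ~5.55 GiB on each GPU
```

The point is only that per-GPU usage does not halve when tp=2, because the k/v cache grows into whatever memory the weight shard leaves free; runtime buffers and the per-process CUDA context add further per-GPU overhead that this formula does not cover.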
@lvhan028 I know the formula, but the actual measurement does not match it. For the same command, changing only tp: with tp=1 the single GPU uses more than 7 GB of VRAM, and with tp=2 each of the two GPUs uses more than 7 GB. Am I missing some startup parameters?
You may try v0.6.0.
Motivation
CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /home/ma/work/modelscope/glm-4-9b-chat-GPTQ-Int4 --backend turbomind --model-format gptq --server-port 11231 --tp 2 --session-len 16500 --cache-max-entry-count 0.1 --model-name gpt --max-batch-size 64
Regarding memory usage when --tp 2 is enabled: why does the total memory usage double when tp equals 2? Each GPU appears to be loading a full model individually. Shouldn't the model be split and distributed across the GPUs instead?
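One way to see what each card is actually holding while the server runs is to read per-GPU memory via NVML. A minimal sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed; `nvidia-smi` reports the same numbers:

```python
# Print per-GPU memory usage while the api_server is running.
# Requires the nvidia-ml-py package (imported as pynvml).
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: used {mem.used / 2**30:.2f} GiB "
              f"of {mem.total / 2**30:.2f} GiB")
finally:
    pynvml.nvmlShutdown()
```

Comparing the per-GPU "used" figure for tp=1 versus tp=2 against the weight size of the GPTQ-Int4 checkpoint shows whether the weights are actually sharded or duplicated.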
Related resources
No response
Additional context
No response