[Feature] when --tp 2 #2423

Open
maxin9966 opened this issue Sep 4, 2024 · 6 comments

@maxin9966

Motivation

CUDA_VISIBLE_DEVICES=3,4 lmdeploy serve api_server /home/ma/work/modelscope/glm-4-9b-chat-GPTQ-Int4 --backend turbomind --model-format gptq --server-port 11231 --tp 2 --session-len 16500 --cache-max-entry-count 0.1 --model-name gpt --max-batch-size 64

Regarding memory usage when --tp 2 is enabled: why does the total memory usage double when tp equals 2? Each GPU appears to load a full copy of the model. Shouldn't the model be split and distributed across the GPUs instead?

Related resources

No response

Additional context

No response

@lvhan028 (Collaborator) commented Sep 4, 2024

Please check out the NOTE part in https://lmdeploy.readthedocs.io/en/latest/get_started/get_started.html
The KV cache is allocated as a ratio of the FREE GPU memory remaining after the model is loaded.

@maxin9966 (Author)

@lvhan028 --cache-max-entry-count 0.1

I set it to 0.1, and with tp=2 the two GPUs each take up over 7 GB. When I set tp=1, the single GPU also takes up over 7 GB.

@lvhan028 (Collaborator) commented Sep 5, 2024

Assume ONE GPU's total memory is T, the model's memory footprint is S, the hyper-parameter --cache-max-entry-count is lambda, and the number of GPUs in tensor parallelism is P.

According to LMDeploy's memory management policy, lambda * (T - S/P) will be allocated for the KV cache on each GPU, no matter whether the model is quantized or not.
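As a quick back-of-the-envelope check of that formula, here is a minimal sketch in Python. The numbers (T = 24 GB, S = 7 GB, lambda = 0.1) are hypothetical placeholders, not measurements of this particular setup:

```python
# Back-of-the-envelope estimate of per-GPU memory use, based on the
# formula kv_cache = lambda * (T - S / P).
# All numbers below are hypothetical, for illustration only.

T = 24.0    # total memory of ONE GPU, in GB (assumed)
S = 7.0     # model weight footprint, in GB (assumed)
lam = 0.1   # --cache-max-entry-count

for P in (1, 2):                    # tensor-parallel degree
    weights_per_gpu = S / P         # ideal case: weights evenly split
    kv_cache_per_gpu = lam * (T - weights_per_gpu)
    total_per_gpu = weights_per_gpu + kv_cache_per_gpu
    print(f"tp={P}: weights {weights_per_gpu:.2f} GB, "
          f"kv cache {kv_cache_per_gpu:.2f} GB, "
          f"~{total_per_gpu:.2f} GB per GPU (plus runtime overhead)")
```

Under these assumptions the per-GPU footprint should drop noticeably when going from tp=1 to tp=2, which is why identical per-GPU usage at both settings looks suspicious.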

@maxin9966 (Author)

@lvhan028 I know the formula, but the actual measurements do not match it. With the same command, changing only tp: at tp=1 the single card uses more than 7 GB of VRAM, while at tp=2 each of the two cards also uses more than 7 GB.

Am I missing some startup parameters?

@lvhan028 (Collaborator) commented Sep 6, 2024

The token_embedding and lm_head weights are not split and distributed across GPUs.
Each GPU owns a full copy.
PR #2252 resolves this and will be released next week.
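To get a feel for how much the duplicated copies cost, here is a rough, hypothetical estimate. The vocabulary and hidden sizes below are assumptions approximating glm-4-9b-chat (check the model's config.json for the real values), and the embeddings are assumed to stay in fp16 even for the GPTQ-Int4 checkpoint:

```python
# Rough estimate of the memory cost of keeping a full copy of the
# token_embedding and lm_head weights on every GPU instead of splitting them.
# The config values are assumptions; consult the model's config.json.

vocab_size = 151_552      # assumed vocabulary size
hidden_size = 4_096       # assumed hidden size
bytes_per_param = 2       # fp16 weights (embeddings typically not quantized)

embedding_gb = vocab_size * hidden_size * bytes_per_param / 1024**3
lm_head_gb = vocab_size * hidden_size * bytes_per_param / 1024**3  # if not tied

print(f"token_embedding: ~{embedding_gb:.2f} GB per GPU")
print(f"lm_head:         ~{lm_head_gb:.2f} GB per GPU")
print(f"duplicated on each GPU: ~{embedding_gb + lm_head_gb:.2f} GB")
```

Per-GPU usage therefore does not halve when tp goes from 1 to 2, since each GPU still carries these unsplit weights on top of its shard of the transformer layers.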

@lvhan028 (Collaborator)

You may try v0.6.0.

@lvhan028 lvhan028 self-assigned this Sep 16, 2024