[Bug] 2x4090 with Llama2 70B silently crashes (i.e. without any error message in DEBUG mode) as of v0.6.0a0 and v0.6.0 (but works fine in previous versions) #2468
Comments
Have you set `TM_DEBUG_LEVEL=DEBUG`?
Hi @irexyc, thanks for your speedy response. I just tried it and "unfortunately" it fixes the issue 😅 That is, inference works fine on v0.6.0 with 2x4090 if `TM_DEBUG_LEVEL=DEBUG` is set. Is there anything else you'd like me to try?
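For anyone else hitting this, the full workaround command looks like this (the serve command from the original post, unchanged except for the environment variable prepended):

```bash
# Same serve command as in the original post, with TM_DEBUG_LEVEL=DEBUG
# prepended as the workaround.
TM_DEBUG_LEVEL=DEBUG lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit \
  --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 \
  --session-len 8192 --model-format awq --enable-prefix-caching \
  --quant-policy 4 --log-level DEBUG
```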
I'm seeing the same problem: the model only runs if `TM_DEBUG_LEVEL=DEBUG` is set. Otherwise, with multiple GPUs, one card gets locked up at 100% utilization.
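A quick way to observe the lockup described above (a generic check, not taken from this thread): poll per-GPU utilization while the server hangs.

```bash
# Poll both GPUs once per second; during the hang, one card is expected
# to show ~100% utilization while the other stays near 0%.
watch -n 1 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv'
```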
Can you reproduce it with other models? I can't reproduce it with Qwen2-7B-AWQ or Llama3-70B-AWQ with v0.6.0 on 2 RTX 4090 GPUs.
The same problem shows up on 2x A100 GPUs with the Qwen2-72B-Instruct-GPTQ-Int4 and InternVL2-40B-AWQ models.
@lzhangzz Note that this bug report is about Llama 2 70B. Can you try with Llama 2 70B AWQ instead of Llama 3? Here's my command again from the original post for convenience:
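```bash
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level DEBUG
```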
@josephrocca Sorry for the confusion. Internet access is quite limited on our 4090 environment, so I started with what I already have on the machine.
@lvhan028 I have tested multiple Llama 2 70B AWQ models (not just `lmdeploy/llama2-chat-70b-4bit`), and they all show this silent crash.

(I did try testing Llama 3 70B on 2x4090 just now, but for some reason hit a separate problem with an explicit OOM error - likely an unrelated issue that I just need to spend time to debug. I will look into that issue tomorrow and post a separate issue if needed, but it's likely something that is wrong on my end.)
@josephrocca In my test with Llama3 70B AWQ on 2x4090,
Checklist
Describe the bug
Llama2 70B works fine on a dual RTX 4090 machine in v0.5.3, but fails in v0.6.0a0 and v0.6.0. There is no error message given, even with `--log-level DEBUG`.

Reproduction
I'm testing on Runpod, using the official Docker images from here: https://hub.docker.com/r/openmmlab/lmdeploy/tags
```bash
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 --session-len 8192 --model-format awq --enable-prefix-caching --quant-policy 4 --log-level DEBUG
```
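For reference, the container can be launched roughly as follows (the exact image tag, GPU flags, and port mapping here are illustrative assumptions, not details from the report; substitute whichever cu11/cu12 tag is being tested, as noted below):

```bash
# Image tag and docker flags are illustrative assumptions;
# the serve command itself is the one from this report.
docker run --rm --gpus all -p 3000:3000 \
  openmmlab/lmdeploy:v0.6.0-cu12 \
  lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit \
    --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 \
    --session-len 8192 --model-format awq --enable-prefix-caching \
    --quant-policy 4 --log-level DEBUG
```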
- I tested both the `cu11` and `cu12` image tags, and there was no difference in behavior.
- I also tried removing `--enable-prefix-caching` and `--quant-policy 4`, but this did not fix it (a sketch of that reduced command follows below).
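That reduced command, for clarity (identical to the repro command above, minus the two flags):

```bash
lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit \
  --model-name "lmdeploy/llama2-chat-70b-4bit" --server-port 3000 --tp 2 \
  --session-len 8192 --model-format awq --log-level DEBUG
```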
Environment
Error traceback