Fails hard on CUDA error #523
Stacktrace:
cc @tgaddair
I ran an experiment that makes inference requests sequentially, each time with a different adapter. It eventually fails, every time on this line: lorax/server/lorax_server/adapters/lora.py Line 169 in ecbe9ea
Restarting the server and then trying the same failing adapter works, which means the issue is not with the adapter itself. Is there perhaps an issue with how LoRAX manages adapters in memory?
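For reference, here is a minimal sketch of the experiment, assuming the server is reachable at http://localhost:8080 and using placeholder adapter IDs (both are assumptions, not the actual values from my setup):

```python
# Repro sketch: sequential requests, cycling through different adapters,
# until the server starts failing. Endpoint, prompt, and adapter IDs are
# placeholders, not the values from the failing deployment.
import requests

ADAPTERS = [f"some-org/adapter-{i}" for i in range(20)]  # hypothetical IDs

for i in range(10_000):
    adapter_id = ADAPTERS[i % len(ADAPTERS)]
    resp = requests.post(
        "http://localhost:8080/generate",
        json={
            "inputs": "Hello, world!",
            "parameters": {"adapter_id": adapter_id, "max_new_tokens": 32},
        },
        timeout=120,
    )
    if resp.status_code != 200:
        # Once one request hits the CUDA error, every subsequent request
        # fails as well until the server is restarted.
        print(f"request {i} (adapter {adapter_id}) failed: {resp.text}")
        break
```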
Additionally, the webserver crashes as well.
Resolved after updating the Docker image to the latest version.
No, actually the issue is not resolved. The same test, when run long enough, still eventually crashes, now in a different code path.
@magdyksaleh can you have a look? I was able to catch it on Predibase cloud as well.
Hey @yunmanger1, I'll try and repro this today. In the meantime, if there's any additional info you can provide to help with the repro, please let me know. For example:
System Info
We are using the streaming v1 chat completions API. After some number of requests, or a single request with a large enough context, the lorax server stops responding, and all subsequent requests also fail.
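For reference, a sketch of the kind of streaming request we send, using the OpenAI-compatible client (the base URL, API key, and adapter name are placeholders, not our actual deployment values):

```python
# Sketch of our streaming chat completions usage. Base URL, API key, and
# model/adapter name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

stream = client.chat.completions.create(
    model="some-org/our-adapter",  # placeholder adapter ID
    messages=[{"role": "user", "content": "Summarize this document: ..."}],
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```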
We are running it in Docker with 1 GPU (A100 PCIe) on runpod.io:
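(The exact command is not reproduced here; a typical invocation, following the standard LoRAX Docker quickstart with a placeholder model ID and volume path, looks roughly like this:)

```bash
# Illustrative only: the model ID and volume path are placeholders,
# not the exact command from our deployment.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    ghcr.io/predibase/lorax:latest \
    --model-id mistralai/Mistral-7B-Instruct-v0.1
```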
Full request log:
Information
Tasks
Reproduction
Expected behavior
If one request fails, subsequent requests should not also fail.
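To illustrate, here is a hypothetical sketch (not LoRAX's actual code) of the kind of per-request failure isolation I would expect: a recoverable error like OOM fails only the offending request, while a hard CUDA error, which usually poisons the whole CUDA context, flags the server unhealthy so the container can be restarted instead of failing every subsequent request.

```python
# Hypothetical sketch of the expected failure isolation; this is not
# LoRAX's actual implementation.
import torch

HEALTHY = True  # what a health endpoint would report

def handle_request(generate, request):
    global HEALTHY
    try:
        return generate(request)
    except torch.cuda.OutOfMemoryError:
        # OOM is usually recoverable: fail this request only and free
        # cached blocks so later requests can proceed.
        torch.cuda.empty_cache()
        raise
    except RuntimeError as err:
        if "CUDA error" in str(err):
            # A hard CUDA error typically corrupts the context. Mark the
            # server unhealthy so the orchestrator restarts it, rather
            # than leaving it up and failing every subsequent request.
            HEALTHY = False
        raise
```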