Unable to allocate memory #84

Open
aidinrs opened this issue Jun 17, 2024 · 3 comments

aidinrs commented Jun 17, 2024

There seems to be a problem with memory allocation when processing longer prompts. I used a prompt of around 3500 tokens in LLMEval, and while the prompt was being processed the process climbed to 12.5 GB of memory. Around 5 GB of that is the model weights, which is fine, but the extra 7 GB doesn't seem normal: with a 3500-token prompt that works out to about 2 MB per token. Memory usage drops back to about 6 GB once the prompt-processing phase is done. The issue gets worse when the full context is used (it climbs to ~25 GB).

I don't have this issue with llama.cpp, since it allocates only the memory required for the weights plus a little extra for the calculations.

Configuring the memory and cache limits doesn't help either; the process throws before prompt processing starts.

This issue hinders running and developing applications on devices with less than 32 GB of RAM.

davidkoski (Collaborator) commented

Check out the details here: #17

You might want to set the cache limit to a few megabytes and see how that behaves.
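
For reference, a minimal sketch of what that looks like, assuming the `MLX.GPU` API from mlx-swift (check the exact signature against your version):

```swift
import MLX

// Cap the buffer cache at a few megabytes so freed buffers are
// returned to the system instead of being held for reuse.
MLX.GPU.set(cacheLimit: 4 * 1024 * 1024)
```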

aidinrs (Author) commented Jun 17, 2024

@davidkoski I tried that already and it doesn't help. The initial jump in memory appears only while the prompt is being processed; once tokens are generated one by one, memory usage is back to normal.

awni (Member) commented Aug 19, 2024

The memory needed for long prompts scales with the square of the prompt length. So in your case, with a prompt length of 3500, the attention scores use roughly 3500 * 3500 * num_heads * 2 bytes.
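
As a rough worked example (a minimal sketch; the 32-head count is an assumed value, not necessarily the model you ran):

```swift
// Estimate of attention-score memory for a 3500-token prompt.
let promptLength = 3500
let numHeads = 32       // assumed example value
let bytesPerScore = 2   // float16
let bytes = promptLength * promptLength * numHeads * bytesPerScore
print(Double(bytes) / 1e9)  // ≈ 0.78 GB for the score matrices
```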

What were you running when it jumped to 12 GB?

Also, #93 should bring LLMEval up to parity with our Python counterpart, which can handle much longer prompts with lower memory use.
