
Steady RAM Usage Increase During Video Inference using video.py #39

Open
chain-of-immortals opened this issue Aug 20, 2024 · 7 comments

Comments

@chain-of-immortals

chain-of-immortals commented Aug 20, 2024

Hello,

I’ve been running some tests using the nano_llm.vision.video module with live camera streaming on an AGX Orin 64GB, with the following parameters:

--model Efficient-Large-Model/VILA1.5-13b
--max-images 5
--max-new-tokens 3
--prompt 'do you see a monitor in the frame? reply in binary 0 is no and 1 is yes'

I noticed a steady increase in RAM usage during these tests and wanted to get some clarification on what might be causing this.

Here are the details:

Setup:
First, I used a USB camera streaming at 640x480 resolution.
Then, I tested with another camera streaming at 4K resolution. I have attached a graph of the RAM usage in both cases.
[Graph: RAM usage over time for the 640x480 and 4K cameras]

Observation: In both cases, I observed a continuous climb in RAM usage over time, which persisted throughout the streaming session, with a much quicker ramp-up in the 4K case.
I’m wondering if this behavior could be attributed to how frames are handled or any other aspects of the video processing pipeline in the script. Is there any known issue or specific configuration I should be aware of that might help address this?

Also, how should I think about the optimal size of the video frames I should be feeding this VILA1.5-13B model?
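
For context, VILA-class vision encoders typically resize inputs down to a few hundred pixels per side (roughly 336 to 384 px), so feeding full 4K frames mostly adds memory and copy overhead. Below is a minimal sketch of how one might cap frame resolution before inference; the numpy frame format, the helper name, and the 768 px cap are assumptions for illustration, not NanoLLM defaults.

```python
# Illustrative sketch: cap the long side of each frame before handing it to the VLM.
# Assumes frames are uint8 numpy arrays of shape (H, W, 3); the 768 px cap is an
# arbitrary choice, since the vision encoder resizes to a few hundred px internally.
import numpy as np
from PIL import Image

MAX_SIDE = 768

def downscale(frame: np.ndarray, max_side: int = MAX_SIDE) -> np.ndarray:
    h, w = frame.shape[:2]
    scale = max_side / max(h, w)
    if scale >= 1.0:
        return frame  # already small enough, avoid an unnecessary copy
    img = Image.fromarray(frame).resize(
        (int(w * scale), int(h * scale)), Image.BILINEAR)
    return np.asarray(img)
```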

Any insights or suggestions would be greatly appreciated.

Thank you!

@ms1design
Contributor

ms1design commented Aug 24, 2024

@dusty-nv bump.

Basically, this is what I mentioned a few times during our conversations. In my use case, where I run inference with MLCModel (no vision) in a loop, I hit OOM after around 100 samples and the process gets killed.

I tried running gc after each inference iteration; even chat_history.reset() doesn't help:

[Graph: Meta-Llama-3.1-8B-Instruct memory usage over iterations]

I did some memory profiling, and it looks like the culprit is chat_history.embed_chat(), where the embeddings are joined together using np.concatenate (a toy sketch of that allocation pattern follows the profile below):

Filename: /opt/NanoLLM/nano_llm/chat/history.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
   344   1949.5 MiB   1949.5 MiB           1       @profile
   345                                             def embed_chat(self, use_cache=True, max_tokens=None, wrap_tokens=None, **kwargs):
   346                                                 """
   347                                                 Assemble the embedding of either the latest or entire chat.
   348                                                 
   349                                                 If ``use_cache=True`` (the default), and only the new embeddings will be returned.
   350                                                 If ``use_cache=False``, then the entire chat history will be returned.
   351                                                 
   352                                                 This function returns an ``(embedding, position)`` tuple, where the embedding array
   353                                                 contains the new embeddings (or tokens) from the chat, and position is the current
   354                                                 overall position in the history (up to the model's context window length)
   355                                                 
   356                                                 If the number of tokens in the chat history exceeds the length given in ``max_tokens`` argument
   357                                                 (which is typically the model's context window, minus the max generation length),
   358                                                 then the chat history will drop all but the latest ``wrap_tokens``, starting with a user prompt.
   359                                                 If `max_tokens` is provided but `wrap_tokens` is not, then the overflow tokens will be truncated.
   360                                                 """
   361   1949.5 MiB      0.0 MiB           1           embeddings = []
   362   1949.5 MiB      0.0 MiB           1           position = 0
   363                                               
   364   1976.4 MiB      0.0 MiB           5           for n, msg in enumerate(self.messages):
   365   1976.4 MiB      0.0 MiB           4               if use_cache:
   366                                                         if msg.cached:
   367                                                             position += msg.num_tokens
   368                                                         else:
   369                                                             embeddings.append(msg.embed())
   370                                                             use_cache = False  # all entries after this need to be included
   371                                                     else:
   372   1976.4 MiB     26.9 MiB           4                   embeddings.append(msg.embed())
   373                                                       
   374   1976.4 MiB      0.0 MiB           4               if not use_cache and logging.getLogger().isEnabledFor(logging.DEBUG) and (len(self.messages) - n < 5):
   375                                                         logging.debug(f"chat msg {n}  role={msg.role}  type={msg.type}  tokens={msg.num_tokens}  `{msg.template if msg.template else msg.content if isinstance(msg.content, str) else ''}`".replace('\n', '\\n'))
   376                                         
   377   1976.4 MiB      0.0 MiB           1           entries = len(embeddings)
   378   2000.2 MiB     23.8 MiB           1           embeddings = np.concatenate(embeddings, axis=1) #, position
   379                                         
   380   2000.2 MiB      0.0 MiB           1           '''
   381                                                 if max_tokens and position + embeddings.shape[1] > max_tokens:
   382                                                     if wrap_tokens:
   383                                                         self.reset(wrap_tokens=wrap_tokens)
   384                                                         embeddings, position = self.embed_chat(use_cache=False, max_tokens=max_tokens, wrap_tokens=wrap_tokens, **kwargs)
   385                                                         logging.warning(f"Chat overflow, max history lenth {max_tokens} tokens exceeded (keeping the most recent {embeddings.shape[1]} tokens)")
   386                                                     else:
   387                                                         logging.warning(f"Truncating chat history overflow to {max_tokens} tokens")
   388                                                         return embeddings[:,:max_tokens,:], position
   389                                                 '''
   390                                                     
   391   2000.2 MiB      0.0 MiB           1           logging.debug(f"chat embed  entries={entries}  shape={embeddings.shape}  position={position}")
   392   2000.2 MiB      0.0 MiB           1           return embeddings, position 
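
For illustration, here is a toy numpy sketch of that allocation pattern. The shapes are invented for the example, not taken from NanoLLM (a real embedding is roughly (1, num_tokens, hidden_dim) per message), but the mechanism matches the profile: every call copies the whole history into a brand-new buffer via np.concatenate.

```python
# Toy illustration only: shapes are made up, but the pattern matches the profile
# above -- each call re-copies every per-message embedding into a fresh array.
import numpy as np

history = []
for turn in range(8):
    # each "message" here is ~4 MiB of fp16 embeddings
    history.append(np.zeros((1, 512, 4096), dtype=np.float16))
    full = np.concatenate(history, axis=1)  # new allocation, grows with history
    print(f"turn {turn}: concatenated {full.nbytes / 2**20:.1f} MiB this call")
```

If any of these per-call buffers stays referenced downstream, resident memory keeps climbing even when gc runs.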

@dusty-nv
Owner

Hi guys, thanks for reporting this and providing the charts - will look into this. @ms1design are you using streaming mode for generation? Is your generation script essentially like nano_llm/chat/example.py ?

Can you try setting this line to self.state = self.model._create_kv_cache(use_cache=False) to see if it is related to kv_cache caching? You can edit the NanoLLM sources by cloning/mounting it in 'dev mode' like this: https://www.jetson-ai-lab.com/agent_studio.html#dev-mode

Also I take it you are running the normal latest NanoLLM container on JetPack 6? Thanks for the debugging info you have gathered!

@chain-of-immortals
Author

On my end, yes, I'm running the latest NanoLLM container on JetPack 6. Thanks for looking into this issue.

@dusty-nv
Owner

dusty-nv commented Aug 24, 2024 via email

@chain-of-immortals
Author

chain-of-immortals commented Aug 26, 2024

I noticed a much more pronounced RAM increase when streaming 4K images versus lower resolutions; it is not as noticeable with lower-resolution streams. Looking forward to testing the new release when it's available. Thanks.

@ms1design
Contributor

ms1design commented Aug 26, 2024

@ms1design are you using streaming mode for generation? Is your generation script essentially like nano_llm/chat/example.py ?

It's streaming, and yes, it's still not using Plugins yet.

Can you try setting this line to self.state = self.model._create_kv_cache(use_cache=False) to see if it is related to kv_cache caching?

@dusty-nv yes, I did that also, but unfortunately with the same results:

[Graph: memory usage with use_cache=False]

@dusty-nv would you be so kind as to share your benchmark logic? Let me explain how mine works; maybe the issue is in my loop (a rough sketch follows the steps below):

  1. Load the model (streaming=True, use_cache=True)
  2. Run inference with an empty chat history
  3. Reset the chat history
  4. Repeat step 2 until the end of the samples
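
A rough sketch of that loop, patterned after the nano_llm/chat/example.py flow referenced above (argument names may differ between NanoLLM releases, and the sample prompts here are placeholders, so treat it as an approximation rather than a verbatim copy):

```python
# Approximate benchmark loop; API usage follows the nano_llm chat example,
# but exact argument names may vary between NanoLLM releases.
import gc
from nano_llm import NanoLLM, ChatHistory

samples = ["Describe the weather."] * 100  # placeholder prompts

# 1. load the model (streaming generation, KV cache enabled)
model = NanoLLM.from_pretrained('meta-llama/Meta-Llama-3.1-8B-Instruct', api='mlc')
chat_history = ChatHistory(model, system_prompt="You are a helpful assistant.")

for sample in samples:  # 4. repeat step 2 until the end of the samples
    # 2. inference with an empty chat history
    chat_history.append(role='user', msg=sample)
    embedding, position = chat_history.embed_chat()

    reply = model.generate(
        embedding,
        streaming=True,
        kv_cache=chat_history.kv_cache,
        stop_tokens=chat_history.template.stop,
    )
    for token in reply:
        pass  # consume the streamed tokens

    # 3. reset the chat history (and force a collection, which didn't help)
    chat_history.reset()
    gc.collect()
```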

Also I take it you are running the normal latest NanoLLM container on JetPack 6?

I'm running dustynv/nano_llm:humble-r36.3.0 image (sha256:6944c57c8b1381fc430dc3ebd0ad5ceec1a63a21853dd1c2c544f7959939506f) on:

root@ubuntu:/data/nano_llm_ha# cat /etc/nv_tegra_release
# R36 (release), REVISION: 2.0, GCID: 34956989, BOARD: generic, EABI: aarch64, DATE: Thu Nov 30 19:03:58 UTC 2023
# KERNEL_VARIANT: oot
TARGET_USERSPACE_LIB_DIR=nvidia
TARGET_USERSPACE_LIB_DIR_PATH=usr/lib/aarch64-linux-gnu/nvidia

@dusty-nv any hints on this?

@ms1design
Contributor

I noticed a much more pronounced RAM increase when streaming 4K images versus lower resolutions.

@chain-of-immortals Similar here: when I reduce the length of the system prompt, I can go beyond 250 samples before OOM.
