
[Feature Request] Memory Commit Savings. Possible total memory savings. Allow fully optimized model to be serialized to disk and used as-is without large heap allocs #21448

Open
ivberg opened this issue Jul 22, 2024 · 1 comment
Labels
feature request (request for unsupported feature or enhancement), model:transformer (issues related to a transformer model: BERT, GPT2, Hugging Face, Longformer, T5, etc.), platform:windows (issues related to the Windows platform)


ivberg commented Jul 22, 2024

Describe the feature request

Overview

Users want memory utilization to be as low as possible, together with good system performance, when running AI models.

The benefits are large, multi-GB process memory commit savings and, for some models, possible total overall memory savings.

The feature ask is to allow a fully optimized model to be serialized to disk and used as-is, without large heap allocations.

The following uses Windows examples, but this would likely apply to other OSes as well, since memory-mapping APIs are available on other OSes and the ORT code is cross-platform in this respect.

About

  1. If not all of a model's weights are always used (sparse tensors?), then only the weights actually used could be read in from disk, occupying memory only when accessed. In this case, total memory usage for running an AI model is less than the on-disk size of the model.

  2. For most Large Language Models (LLMs), all of the weights are usually needed and accessed during inference (for example by the attention mechanism). However, how they are accessed and read in from disk has performance and memory implications.

Reducing Process Memory Commit Usage

Using case (2), LLMs, as the example: there are techniques (using OS memory-mapping APIs) to reduce the memory commit usage of a process, and sometimes to obtain higher performance, including inference performance, especially under low-memory conditions. Some AI models are large and much more likely to push a system to its memory limits. If these memory-mapping APIs are used and heap memory does not need to be allocated, then the weights / initializer data from an ONNX model can simply be loaded when accessed and would not occupy process commit (a minimal Win32 sketch of this kind of mapping follows below).

This is very beneficial because system total commit is a precious resource. The commit limit is physical memory plus pagefile size, e.g. 16GB RAM + 16GB pagefile = a maximum of 32GB of memory that can be allocated. Once this limit is reached, no more memory can be allocated anywhere on the system. See more:
Commit_charge
Pushing the Limits of Windows: Virtual Memory
Virtual Address Space and Physical Storage
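
To make the mechanism concrete, here is a minimal Win32 sketch (not ORT code) of reading initializer data through a file-backed mapping rather than a heap buffer; the file name is hypothetical and stands in for a model's external-data file:

```cpp
// Minimal sketch: a read-only, file-backed mapping. Pages in this view are
// backed by the file itself, so they are not charged against system commit
// and can simply be discarded (not written to the pagefile) under memory pressure.
#include <windows.h>

int main() {
    // Hypothetical path standing in for a model's external initializer file.
    HANDLE file = CreateFileW(L"model.onnx.data", GENERIC_READ, FILE_SHARE_READ,
                              nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
    if (file == INVALID_HANDLE_VALUE) return 1;

    HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
    if (!mapping) { CloseHandle(file); return 1; }

    // Map the whole file. Nothing is read from disk yet; pages fault in lazily
    // as they are touched, and remain backed by the file rather than the pagefile.
    const void* view = MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if (view) {
        // Touching a byte pages in just that page from the file on demand.
        volatile unsigned char first = *static_cast<const unsigned char*>(view);
        (void)first;
        UnmapViewOfFile(view);
    }

    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}
```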

For all further examples, we use an example SLM (Small Language Model), Phi Silica, of around 1.85GB on-disk size (3.2B params).

Part 1 - Use ONNX External Data file with proper alignment + disable ORT Arena allocator

For our first experiment, we used ONNX External Data files with proper alignment fixes to generate a file that could be successfully memory-mapped on Windows for all large initializers.
See: External Data Conversion is not saving most data with alignment support. Therefore, mmap support disabled for these initializers
We also disabled the Arena memory allocator, since on CPU it greedily consumes a lot more process memory and clouds the memory picture:
m_session_options.DisableCpuMemArena();

With this in place, ONNX Runtime was able to save a few hundred MB (233MB) of process commit. This comes purely from having an aligned external data file and thus letting ORT use mapped-file support. However, it does not save much commit memory relative to the entire size of the model.
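
For reference, a minimal sketch of the Part 1 session setup with the ORT C++ API, assuming a model whose large initializers live in an aligned external-data file next to it (paths are hypothetical):

```cpp
// Sketch of the Part 1 setup: disable the CPU arena and create a session over
// a model with aligned external data, which ORT can memory-map instead of
// copying the initializers into heap allocations.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "commit-savings");

    Ort::SessionOptions session_options;
    session_options.DisableCpuMemArena();  // avoid the greedy CPU arena clouding commit numbers

    // "model.onnx" references its large initializers in an aligned "model.onnx.data".
    Ort::Session session(env, L"model.onnx", session_options);
    return 0;
}
```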

Part 2 - Disable pre-packing

For our second experiment, in addition to the technique and settings above, we disabled ORT pre-packing, which we determined from tracing was still responsible for the largest allocations - SessionState::PrepackConstantInitializedTensors:
// Disable pre-packing - saves commit but REALLY REALLY bad for inf perf and overall runtime
m_session_options.AddConfigEntry(kOrtSessionOptionsConfigDisablePrepacking, "1");
With this, the commit memory savings were large and in line with most of the size of the model (77%), in this case around 1436MB of commit. The issue, though, is that disabling pre-packing severely degraded runtime inference performance (200x worse), making the model unusable performance-wise, but great memory-wise.

General framework for implementing the feature

What follows is technical information on the general approach ORT might use to pre-pack a model and then serialize that to disk, such that memory mapping would work AND ORT would not need large memory allocations. This would give the best of both worlds: great runtime performance while getting the best utilization of system memory.

Changes in how the model weights are accessed

What would happen is simply that the OS would page in the initializers and weights on demand during inference. Weights that were routinely accessed would be kept in physical memory, not much differently from how heap memory for active working sets is kept in physical memory when needed. The differences from before would be:

  1. System commit would not increase by much when loading or executing a model, so the memory usage of the process running ORT would not show as very large (one way to check this from inside the process is sketched after this list).
  2. Using tools like SysInternals RamMap and VMMap you would see the process using map files.
  3. Under low memory conditions, if memory pages holding weights needed to be made available, the system memory manager would not have to page them out to the disk or pagefile. Instead, the pages are backed by the file on disk and can simply be discarded from memory, saving the disk IO of writing to the pagefile, the memory manager CPU time spent compressing the pages, and other accounting.
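
As a rough way to observe point (1) from inside the process, here is a small sketch of reading the process commit charge with the Win32 psapi API (illustrative only, not part of the feature ask):

```cpp
// Sketch: read the process commit charge (PrivateUsage) before and after
// creating the ORT session to see how much commit the load actually added.
// Windows-only; link against Psapi.lib.
#include <windows.h>
#include <psapi.h>
#include <cstdio>

static size_t CurrentCommitBytes() {
    PROCESS_MEMORY_COUNTERS_EX counters{};
    GetProcessMemoryInfo(GetCurrentProcess(),
                         reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&counters),
                         sizeof(counters));
    return counters.PrivateUsage;  // commit charge for this process, in bytes
}

int main() {
    const size_t before = CurrentCommitBytes();
    // ... create the ORT session / run inference here ...
    const size_t after = CurrentCommitBytes();
    std::printf("commit delta: %zu MB\n", (after - before) / (1024 * 1024));
    return 0;
}
```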
     

Feature request suggestions

So how to go about implementing this feature request?

ONNX Runtime already has the notion of graph optimizations that can be serialized/written to disk, for example in offline mode tied to a specific class of hardware - see graph-optimizations.
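
For context, a sketch of that existing offline flow using the ORT C++ API (paths hypothetical):

```cpp
// Sketch of today's offline optimization flow: run graph optimizations once and
// serialize the optimized model to disk so later sessions can load it directly.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "offline-optimize");

    Ort::SessionOptions session_options;
    session_options.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    session_options.SetOptimizedModelFilePath(L"model.optimized.onnx");

    // Creating the session writes the optimized graph to the path above; later
    // sessions can load model.optimized.onnx with optimizations turned down/off.
    Ort::Session session(env, L"model.onnx", session_options);
    return 0;
}
```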

However, even when using this offline-optimized model, large memory allocations will still occur in ORT due to something called pre-packing. Pre-packing has large positive runtime inference performance benefits. However, our view is that these pre-packing optimizations should be done once and then be serializable to disk, so the data structure on disk matches the most optimized in-memory layout that ORT will use.

Once pre-packing is serialized on disk and used during session load, the data structures needed for inference are already mapped into the process address space via memory mapping and MapViewOfFile. With no other major allocations needed, when ORT accesses a data structure or weights, the OS would simply page those weights in from disk to memory.
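
To make the ask concrete, a purely hypothetical sketch of how this might look from the API side; the config keys below do not exist in ORT today and are only meant to illustrate the shape of the request:

```cpp
// HYPOTHETICAL sketch of the requested feature. The config keys used here are
// invented for illustration and are NOT real ORT options.
#include <onnxruntime_cxx_api.h>

int main() {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "prepacked-serialization");

    // One-time offline step: run pre-packing and serialize the pre-packed
    // initializers to disk in the exact layout ORT uses at runtime.
    Ort::SessionOptions offline_options;
    offline_options.AddConfigEntry("session.save_prepacked_initializers_path",   // hypothetical key
                                   "model.prepacked.data");
    Ort::Session offline_session(env, L"model.onnx", offline_options);

    // Runtime step: later sessions memory-map model.prepacked.data (MapViewOfFile)
    // instead of re-packing weights into freshly committed heap memory.
    Ort::SessionOptions runtime_options;
    runtime_options.AddConfigEntry("session.load_prepacked_initializers_path",   // hypothetical key
                                   "model.prepacked.data");
    Ort::Session runtime_session(env, L"model.onnx", runtime_options);
    return 0;
}
```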

FYI @pranavsharma and @yuslepukhin, with whom we have already been working on this in ORT.

Describe scenario use case

This will be useful for optimizing memory usage in on-device client scenarios with limited physical RAM running large models on CPU.

Larger models on disk (1GB+), for example those with billions of parameters, would utilize memory better with fully working memory-map support.

One such example is Large Language Models (LLMs) or Small Language Models such as Phi Silica 3.


ivberg commented Jul 22, 2024

Related:
[ExternalData - On Windows document proper offset for memory mapping support](onnx/onnx#6247)
[Python external data conversion - Add support for aligning data for mmap](onnx/onnx#6248)
#21195
