
[BYO-FT] Improve check for available memory #210

Merged
1 commit merged into batch-serving on Feb 15, 2024

Conversation

Lunderberg (Member)
Previously, any parameters whose allocation was performed by the relax VM would be double-counted.
```diff
@@ -232,7 +238,9 @@ def profile_memory_usage(self, seq_lens):

         self.mod["evaluate"](input_ids, positions, seq_lens, self.params)

-        return self.get_used_memory()
+        vm_alloc_after = self.get_used_memory()
```
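The intent of the change can be illustrated with a small, self-contained sketch (the `MockAllocator` class and `profile_activation_memory` function below are hypothetical names for illustration, not the actual mlc-llm code): taking a reading before and after the evaluation run and subtracting keeps memory that was already allocated by the VM, such as pre-processed parameters, from being counted twice.

```python
# Hypothetical sketch; MockAllocator and profile_activation_memory are
# illustrative names, not mlc-llm or TVM APIs.

class MockAllocator:
    """Tracks total bytes handed out, like a pooled allocator's statistics."""
    def __init__(self):
        self.used = 0

    def alloc(self, nbytes):
        self.used += nbytes


def profile_activation_memory(alloc, run_model):
    """Return only the memory allocated *during* the profiling run.

    Parameters already allocated by the VM (e.g. outputs of a runtime
    transform_params) are part of alloc.used before the run, so
    subtracting the "before" reading avoids double-counting them.
    """
    vm_alloc_before = alloc.used
    run_model(alloc)
    vm_alloc_after = alloc.used
    return vm_alloc_after - vm_alloc_before


alloc = MockAllocator()
alloc.alloc(1000)  # parameters materialized by the VM before profiling

activations = profile_activation_memory(alloc, lambda a: a.alloc(250))
print(activations)  # 250, not 1250
```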
masahi (Member) commented Feb 14, 2024

Sorry, there is an incredibly embarrassing hack I did to TVM's PooledAllocator, based on the assumption that get_used_memory() is only called once:
https://github.com/octoml/tvm/blob/for-mlc-serve-jan12/src/runtime/memory/pooled_allocator.h#L114-L116

This changes the behavior of PooledAllocator before and after memory profiling, to work around a limitation of the allocator. I have a branch that integrates a third-party CUDA allocator, https://github.com/rapidsai/rmm, to replace PooledAllocator, so that we don't have to worry about this issue.

masahi (Member)

That said, we could make get_used_memory() pure by introducing a global function to toggle the allocation mode. I should have done that, but instead took the shortcut of abusing get_used_memory() to signal the end of memory profiling.

Such a global function would also be useful even after we land the rmm-based allocator, since tracking statistics for memory allocation is costly and should only be done during memory profiling.
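The toggle idea can be sketched in a few lines (a hypothetical model only; `PooledAllocatorModel` and `set_profiling_mode` are illustrative names, not TVM's actual API): moving the mode switch into an explicit setter makes `get_used_memory()` a pure query that is safe to call any number of times.

```python
# Hypothetical sketch of a global toggle for the allocator's profiling
# mode, instead of overloading get_used_memory() with a side effect.
# All names here are illustrative, not TVM's actual API.

class PooledAllocatorModel:
    def __init__(self):
        self.profiling = False
        self.used = 0

    def set_profiling_mode(self, enabled):
        # Expensive statistics tracking is only enabled during profiling.
        self.profiling = enabled

    def alloc(self, nbytes):
        if self.profiling:
            self.used += nbytes

    def get_used_memory(self):
        # A pure query: calling it twice returns the same answer and
        # has no side effect on the allocator's behavior.
        return self.used


pa = PooledAllocatorModel()
pa.set_profiling_mode(True)
pa.alloc(512)
print(pa.get_used_memory())  # 512
print(pa.get_used_memory())  # still 512
```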

Lunderberg (Member Author)

Ah, got it. So by calling it twice, I'm invalidating the safety net anyway.

In that case, I'd lean toward disabling this check altogether and replacing it with compile-time tracking of memory usage. That way, instead of profiling, we could know the maximum size of the live set.
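The compile-time approach can be sketched as follows (a toy model under the assumption of static shapes, where each buffer's size and live interval are known at compile time; `max_live_set` is an illustrative name, not an existing Relax pass): sweep over program points and take the peak of the sum of sizes of simultaneously live buffers.

```python
# Hypothetical sketch of compile-time live-set tracking. With static
# shapes, each buffer's size and [first_use, last_use] interval is
# known statically, so the peak can be computed without profiling.

def max_live_set(buffers):
    """buffers: list of (size_bytes, first_use, last_use) tuples."""
    if not buffers:
        return 0
    last = max(hi for _, _, hi in buffers)
    peak = 0
    for t in range(last + 1):
        # Sum of all buffers live at program point t.
        live = sum(size for size, lo, hi in buffers if lo <= t <= hi)
        peak = max(peak, live)
    return peak


# Buffers 0 and 1 overlap at t=1; buffer 2 is live alone at t=3.
print(max_live_set([(100, 0, 1), (50, 1, 2), (70, 3, 3)]))  # 150
```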

masahi (Member) commented Feb 14, 2024

I'm curious whether such a compile-time estimate of memory usage agrees with the "ground truth" we collect at runtime, given that the latter is affected by the behavior of the underlying allocator, which the memory planner is not aware of. My impression has also been that Relax's memory planning allocates much more than necessary, compared to the allocator in torch.

Lunderberg (Member Author)

Yeah, I think that increased memory would show up in the compile-time estimate, because a large part of the footprint is due to operations that aren't done in place. @slyubomirsky has been doing a lot of good work on in-place operations, which should give us a path forward on reducing the footprint.

masahi (Member) commented Feb 14, 2024

> any parameters whose allocation was performed by the relax VM

When do we encounter such params? I thought all params are allocated by NDArray::Empty().

Lunderberg (Member Author)
> When do we encounter such params? I thought all params are allocated by NDArray::Empty().

In the current flow with pre-processed weights, that is correct: all parameters are allocated by NDArray::Empty(), as part of tvm.nd.array.

For the pre-process-at-initialization flow, which will be used to support bring-your-own-fine-tune, transform_params is called at runtime and outputs the parameters to be used. If any processing is required, transform_params allocates space for the pre-processed parameters. That allocation is performed by the relax VM, but the result is later used as a model parameter.
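To make the flow concrete, here is a hedged sketch (the function body and the specific transform, sharding a weight across devices, are illustrative assumptions, not the actual transform_params): the output arrays are freshly allocated at initialization time and then stay live as model parameters, which is exactly why a profiler reading taken afterward must not count them again.

```python
import numpy as np

# Hypothetical sketch of a runtime weight pre-processing step in the
# bring-your-own-fine-tune flow. The sharding transform and names are
# illustrative; the real transform_params is a compiled relax function.

def transform_params(raw_params, world_size, device_id):
    """Pre-process raw fine-tuned weights at initialization time.

    Each output is a fresh allocation (performed by the relax VM in the
    real flow) that is subsequently kept live as a model parameter.
    """
    out = []
    for w in raw_params:
        shards = np.split(w, world_size, axis=0)  # new allocation
        out.append(np.ascontiguousarray(shards[device_id]))
    return out


raw = [np.arange(8.0).reshape(4, 2)]
params = transform_params(raw, world_size=2, device_id=0)
print(params[0].shape)  # (2, 2)
```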

@masahi masahi merged commit 5e36d1b into batch-serving Feb 15, 2024
1 check passed
@Lunderberg Lunderberg deleted the lunderberg/remove_double_counting_of_memory branch February 16, 2024 16:24
sunggg (Member) commented Feb 16, 2024

Hi, @Lunderberg! Thanks for the contribution!
One quick question:

> If there is any processing required, transform_params will allocate space for the pre-processed parameters.

Does this mean there will be a dynamic allocation? To guarantee that we don't hit an OOM at runtime, we have been very careful about memory allocation, assuming the worst-case scenario. I think we are reaching the point where we need to revisit this assumption, so your input would be very appreciated :)

Lunderberg (Member Author)

Currently, all the allocations are static, but that's entirely due to the weights having static shapes. We could in theory write a transform_params function that is dynamic over the model parameters, but I don't see a need for that at the moment. What would be useful is for the transform_params function to be dynamic over world_size, as then a single .so file could be used with different multi-GPU strategies.

I think it would be good to have a PrimFunc that takes any dynamic parameters, and returns the memory footprint for that set of dynamic parameters.
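Such a footprint function might look like the following sketch (everything here is an assumption for illustration: the name `param_footprint_bytes`, the layer sizes, and modeling only the parameter tensors; the real version would be a PrimFunc generated at compile time): given the dynamic parameters, here just world_size, it returns the per-device memory requirement without having to run the model.

```python
# Hypothetical sketch of the suggested footprint function: given the
# dynamic parameters (here just world_size), return the per-device
# parameter memory. The model sizes below are illustrative defaults.

def param_footprint_bytes(world_size, hidden=4096, layers=32, dtype_bytes=2):
    # e.g. one hidden x hidden weight matrix per layer, sharded evenly
    # across world_size devices
    per_layer = hidden * hidden * dtype_bytes
    return layers * per_layer // world_size


print(param_footprint_bytes(1))  # 1073741824 bytes (1 GiB)
print(param_footprint_bytes(2))  # 536870912 bytes (half that)
```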

Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Feb 27, 2024