
[BYO-FT] Improve check for available memory #210

Merged
1 commit merged into batch-serving on Feb 15, 2024

Conversation

Lunderberg (Member)
Previously, any parameters whose allocation was performed by the relax VM would be double-counted.
```diff
@@ -232,7 +238,9 @@ def profile_memory_usage(self, seq_lens):

         self.mod["evaluate"](input_ids, positions, seq_lens, self.params)

-        return self.get_used_memory()
+        vm_alloc_after = self.get_used_memory()
```
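The intent of the change can be illustrated with a small, self-contained sketch (the `MockAllocator` class and `profile_activation_memory` function below are hypothetical names for illustration, not the actual mlc-llm code): taking a reading before and after the evaluation run and subtracting keeps memory that was already allocated by the VM, such as pre-processed parameters, from being counted twice.

```python
# Hypothetical sketch; MockAllocator and profile_activation_memory are
# illustrative names, not mlc-llm or TVM APIs.

class MockAllocator:
    """Tracks total bytes handed out, like a pooled allocator's statistics."""
    def __init__(self):
        self.used = 0

    def alloc(self, nbytes):
        self.used += nbytes


def profile_activation_memory(alloc, run_model):
    """Return only the memory allocated *during* the profiling run.

    Parameters already allocated by the VM (e.g. outputs of a runtime
    transform_params) are part of alloc.used before the run, so
    subtracting the "before" reading avoids double-counting them.
    """
    vm_alloc_before = alloc.used
    run_model(alloc)
    vm_alloc_after = alloc.used
    return vm_alloc_after - vm_alloc_before


alloc = MockAllocator()
alloc.alloc(1000)  # parameters materialized by the VM before profiling

activations = profile_activation_memory(alloc, lambda a: a.alloc(250))
print(activations)  # 250, not 1250
```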
masahi (Member) commented Feb 14, 2024

Sorry, there is an incredibly embarrassing hack I did to TVM's PooledAllocator, based on the assumption that get_used_memory() is only called once:
https://github.com/octoml/tvm/blob/for-mlc-serve-jan12/src/runtime/memory/pooled_allocator.h#L114-L116

This changes the behavior of PooledAllocator before and after memory profiling, to work around a limitation of the allocator. I have a branch that integrates a third-party CUDA allocator, https://github.com/rapidsai/rmm, to replace PooledAllocator, so that we don't have to worry about this issue.

masahi (Member)

That said, we could make get_used_memory() pure by introducing a global function to toggle the allocation mode. I should have done that, but instead took the shortcut of abusing get_used_memory() to signal the end of memory profiling.

Such a global function would also be useful even after we land the rmm-based allocator, since tracking statistics for memory allocation is costly and should only be done during memory profiling.
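The toggle idea can be sketched in a few lines (a hypothetical model only; `PooledAllocatorModel` and `set_profiling_mode` are illustrative names, not TVM's actual API): moving the mode switch into an explicit setter makes `get_used_memory()` a pure query that is safe to call any number of times.

```python
# Hypothetical sketch of a global toggle for the allocator's profiling
# mode, instead of overloading get_used_memory() with a side effect.
# All names here are illustrative, not TVM's actual API.

class PooledAllocatorModel:
    def __init__(self):
        self.profiling = False
        self.used = 0

    def set_profiling_mode(self, enabled):
        # Expensive statistics tracking is only enabled during profiling.
        self.profiling = enabled

    def alloc(self, nbytes):
        if self.profiling:
            self.used += nbytes

    def get_used_memory(self):
        # A pure query: calling it twice returns the same answer and
        # has no side effect on the allocator's behavior.
        return self.used


pa = PooledAllocatorModel()
pa.set_profiling_mode(True)
pa.alloc(512)
print(pa.get_used_memory())  # 512
print(pa.get_used_memory())  # still 512
```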

Lunderberg (Member Author)

Ah, got it. So by calling it twice, I'm invalidating the safety net anyway.

In that case, I'd lean toward disabling this check altogether and replacing it with compile-time tracking of memory usage. That way, instead of profiling, we could know the maximum size of the live set.
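The compile-time approach can be sketched as follows (a toy model under the assumption of static shapes, where each buffer's size and live interval are known at compile time; `max_live_set` is an illustrative name, not an existing Relax pass): sweep over program points and take the peak of the sum of sizes of simultaneously live buffers.

```python
# Hypothetical sketch of compile-time live-set tracking. With static
# shapes, each buffer's size and [first_use, last_use] interval is
# known statically, so the peak can be computed without profiling.

def max_live_set(buffers):
    """buffers: list of (size_bytes, first_use, last_use) tuples."""
    if not buffers:
        return 0
    last = max(hi for _, _, hi in buffers)
    peak = 0
    for t in range(last + 1):
        # Sum of all buffers live at program point t.
        live = sum(size for size, lo, hi in buffers if lo <= t <= hi)
        peak = max(peak, live)
    return peak


# Buffers 0 and 1 overlap at t=1; buffer 2 is live alone at t=3.
print(max_live_set([(100, 0, 1), (50, 1, 2), (70, 3, 3)]))  # 150
```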

masahi (Member) commented Feb 14, 2024

I'm curious whether such a compile-time estimate of memory usage agrees with the "ground truth" we collect at runtime, given that the latter is affected by the behavior of the underlying allocator, which the memory planner is not aware of. My impression has also been that Relax's memory planning allocates much more than necessary, compared to the allocator in torch.

Lunderberg (Member Author)

Yeah, I think that increased memory would show up in the compile-time estimate, because a large part of the footprint is due to operations that aren't done in place. @slyubomirsky has been doing a lot of good work on in-place operations, which should give us a path forward on reducing the footprint.

masahi (Member) commented Feb 14, 2024

> any parameters whose allocation was performed by the relax VM

When do we encounter such params? I thought all params are allocated by NDArray::Empty().

Lunderberg (Member Author)
> When do we encounter such params? I thought all params are allocated by NDArray::Empty().

In the current flow with pre-processed weights, that is correct: all parameters are allocated by NDArray::Empty(), as part of tvm.nd.array.

For the pre-process-at-initialization flow, which will be used to support bring-your-own-fine-tune, transform_params is called at runtime and outputs the parameters to be used. If any processing is required, transform_params allocates space for the pre-processed parameters. That allocation is performed by the relax VM, but the result is later used as a model parameter.
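To make the flow concrete, here is a hedged sketch (the function body and the specific transform, sharding a weight across devices, are illustrative assumptions, not the actual transform_params): the output arrays are freshly allocated at initialization time and then stay live as model parameters, which is exactly why a profiler reading taken afterward must not count them again.

```python
import numpy as np

# Hypothetical sketch of a runtime weight pre-processing step in the
# bring-your-own-fine-tune flow. The sharding transform and names are
# illustrative; the real transform_params is a compiled relax function.

def transform_params(raw_params, world_size, device_id):
    """Pre-process raw fine-tuned weights at initialization time.

    Each output is a fresh allocation (performed by the relax VM in the
    real flow) that is subsequently kept live as a model parameter.
    """
    out = []
    for w in raw_params:
        shards = np.split(w, world_size, axis=0)  # new allocation
        out.append(np.ascontiguousarray(shards[device_id]))
    return out


raw = [np.arange(8.0).reshape(4, 2)]
params = transform_params(raw, world_size=2, device_id=0)
print(params[0].shape)  # (2, 2)
```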

@masahi masahi merged commit 5e36d1b into batch-serving Feb 15, 2024
1 check passed
@Lunderberg Lunderberg deleted the lunderberg/remove_double_counting_of_memory branch February 16, 2024 16:24
sunggg (Member) commented Feb 16, 2024

Hi, @Lunderberg! Thanks for the contribution!
One quick question:

> If there is any processing required, transform_params will allocate space for the pre-processed parameters.

Does this mean there will be a dynamic allocation? To guarantee that we don't hit an OOM at runtime, we have been very careful about memory allocation, assuming the worst-case scenario. I think we are reaching the point where we need to revisit this assumption, so your input would be very appreciated :)

Lunderberg (Member Author)

Currently, all the allocations are static, but that's entirely due to the weights having static shapes. We could in theory write a transform_params function that is dynamic over the model parameters, but I don't see a need for that at the moment. What would be useful is for the transform_params function to be dynamic over world_size, as then a single .so file could be used with different multi-GPU strategies.

I think it would be good to have a PrimFunc that takes any dynamic parameters, and returns the memory footprint for that set of dynamic parameters.
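Such a footprint function might look like the following sketch (everything here is an assumption for illustration: the name `param_footprint_bytes`, the layer sizes, and modeling only the parameter tensors; the real version would be a PrimFunc generated at compile time): given the dynamic parameters, here just world_size, it returns the per-device memory requirement without having to run the model.

```python
# Hypothetical sketch of the suggested footprint function: given the
# dynamic parameters (here just world_size), return the per-device
# parameter memory. The model sizes below are illustrative defaults.

def param_footprint_bytes(world_size, hidden=4096, layers=32, dtype_bytes=2):
    # e.g. one hidden x hidden weight matrix per layer, sharded evenly
    # across world_size devices
    per_layer = hidden * hidden * dtype_bytes
    return layers * per_layer // world_size


print(param_footprint_bytes(1))  # 1073741824 bytes (1 GiB)
print(param_footprint_bytes(2))  # 536870912 bytes (half that)
```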

Lunderberg pushed a commit to Lunderberg/mlc-llm that referenced this pull request Feb 27, 2024