
Hard to benchmark the operation in the repo #39

Open
mynotwo opened this issue Aug 29, 2024 · 1 comment

Comments


mynotwo commented Aug 29, 2024

Hi, thanks for your work! I recently wanted to benchmark each step's latency in this repo, and I found that if I use torch.cuda.synchronize() and time.time(), I cannot get the actual data copy time.

For example, I believe the data copy happens in these two lines:

    device_expert_buffer.storage.copy_(self.offloaded_storages[info_to_load.index], non_blocking=True)
    offloaded_storage_buffer.copy_(self.main_modules[info_to_evict.index].storage, non_blocking=True)

And time.time() gives me about 1e-5 s, which I believe is far shorter than the real data transfer latency. I think the reason might be that there are multiple processes/threads involved, which would lead to a wrong latency measurement. Could you help me solve this problem?

Many thanks!

dvmazur (Owner) commented Aug 29, 2024

Hi! In this case the .copy_ operation is non-blocking, meaning it doesn't wait for the underlying copy to finish but lets the Python thread proceed as soon as the operation is submitted. You might want to look into torch's profiler. I recommend exporting your traces to JSON and viewing them with Perfetto or chrome://tracing.
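The suggested torch.profiler workflow can be sketched as follows (the profiled workload is a placeholder; on a GPU machine you would also pass ProfilerActivity.CUDA so the trace includes the actual memcpy kernels):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; in the repo this would be the expert load/evict step.
x = torch.randn(1024, 1024)

# Add ProfilerActivity.CUDA to the list when profiling GPU copies/kernels.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    y = x @ x

# Quick textual summary of where time went.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))

# Export a Chrome-trace JSON, viewable at https://ui.perfetto.dev
# or in chrome://tracing.
prof.export_chrome_trace("trace.json")
```

In the resulting trace, an asynchronous copy shows up as a memcpy event on the CUDA stream with its real duration, independent of when the Python thread moved on.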
