
Hard to benchmark the operation in the repo #39

Open
mynotwo opened this issue Aug 29, 2024 · 1 comment

Comments


mynotwo commented Aug 29, 2024

Hi, thanks for your work! I recently wanted to benchmark each step's latency in this repo, and I found that if I use torch.cuda.synchronize() and time.time(), I cannot get the actual data copy time.

For example, I believe the data copy happens in these two lines:

    device_expert_buffer.storage.copy_(self.offloaded_storages[info_to_load.index], non_blocking=True)
    offloaded_storage_buffer.copy_(self.main_modules[info_to_evict.index].storage, non_blocking=True)

And time.time() gives me about 1e-5 s, which I believe is far shorter than the real data transfer latency. I think the reason might be that there are multiple processes/threads involved, which would lead to a wrong latency measurement. Could you help me solve this problem?

Many thanks!

dvmazur (Owner) commented Aug 29, 2024

Hi! In this case the .copy_ operation is non-blocking, meaning it doesn't wait for the underlying copy to finish but lets the Python thread proceed as soon as the operation is submitted. You might want to look into torch's profiler. I recommend exporting your traces to JSON and viewing them with Perfetto or chrome://tracing.
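The suggested torch.profiler workflow can be sketched as follows (the profiled workload is a placeholder; on a GPU machine you would also pass ProfilerActivity.CUDA so the trace includes the actual memcpy kernels):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Placeholder workload; in the repo this would be the expert load/evict step.
x = torch.randn(1024, 1024)

# Add ProfilerActivity.CUDA to the list when profiling GPU copies/kernels.
with profile(activities=[ProfilerActivity.CPU]) as prof:
    y = x @ x

# Quick textual summary of where time went.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))

# Export a Chrome-trace JSON, viewable at https://ui.perfetto.dev
# or in chrome://tracing.
prof.export_chrome_trace("trace.json")
```

In the resulting trace, an asynchronous copy shows up as a memcpy event on the CUDA stream with its real duration, independent of when the Python thread moved on.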
