Wandb-osh cannot handle many runs at once #83

Open
RitwikGupta opened this issue Jan 29, 2024 · 6 comments · May be fixed by #85 or #101
Labels
enhancement New feature or request

Comments

@RitwikGupta

Hello,

We are wandb-osh power users. First of all, thank you for making this excellent utility.
We frequently run 20-30 runs, all logging to wandb, simultaneously. What ends up happening is that the runs which log more frequently crowd out the runs that log less frequently. Therefore, slow runs rarely update to wandb!

If wandb-osh could handle command files in a first-in-first-out fashion rather than "last written", this would fix the issue.
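To make the first-in-first-out idea concrete, here is a minimal sketch. It is not wandb-osh's actual code: `FifoSyncQueue`, `request`, and `pop_next` are hypothetical names, and it assumes each sync request can be identified by its run directory. A run that is already waiting keeps its place in line, so a run that logs often cannot push a slower run back.

```python
from collections import OrderedDict
from typing import Optional


class FifoSyncQueue:
    """Pending sync requests, served oldest-first, one slot per run.

    Hypothetical helper for illustration; not part of wandb-osh.
    """

    def __init__(self) -> None:
        # Insertion order of the keys is the service order.
        self._pending = OrderedDict()

    def request(self, run_dir: str) -> None:
        # If the run is already queued, keep its original position
        # instead of moving it or adding a second entry.
        self._pending.setdefault(run_dir, None)

    def pop_next(self) -> Optional[str]:
        # Return the run that has waited longest, or None if idle.
        if not self._pending:
            return None
        run_dir, _ = self._pending.popitem(last=False)
        return run_dir

    def __len__(self) -> int:
        return len(self._pending)


queue = FifoSyncQueue()
for run_dir in ["fast", "slow", "fast", "fast"]:
    queue.request(run_dir)

served = [queue.pop_next(), queue.pop_next(), queue.pop_next()]
```

Here the "fast" run asks three times but is served once, and the "slow" run is synced right after it instead of being skipped.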

Thank you again for making this. I will attempt to make a PR when I get some time, unless you get to it first.

@klieret
Owner

klieret commented Feb 2, 2024

Thanks a lot for the report and the PR ❤️ . I'm currently reading through your PR.

Let's keep this issue open until we merge it!

klieret reopened this Feb 2, 2024
klieret added the enhancement (New feature or request) label Feb 2, 2024
klieret linked a pull request Feb 2, 2024 that will close this issue
@klieret
Owner

klieret commented Mar 22, 2024

#101 could be a simpler fix for this issue, and I expect it to be merged very soon.

@RitwikGupta
Author

RitwikGupta commented Mar 22, 2024

@klieret Sorry I have not had a chance yet to go back to this and implement your suggestions.
I don't think #101 would address this issue. We are logging to WandB at every step in our epoch, with 20-30 runs all doing that at the same time. wandb-osh cannot keep up with this since it's going through commands one at a time.

Multiprocessing lets it attend to multiple jobs at once.
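As a rough illustration of handling several syncs at once, here is a sketch that is not the code from #85. It uses a thread pool rather than `multiprocessing`, on the assumption that each sync is an external `wandb sync <run_dir>` process, so threads are enough to wait on them in parallel. `wandb_sync`, `sync_all`, and `max_workers=4` are hypothetical choices.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def wandb_sync(run_dir: str) -> int:
    """Run `wandb sync <run_dir>` and return its exit code."""
    return subprocess.run(["wandb", "sync", run_dir], check=False).returncode


def sync_all(
    run_dirs: Iterable[str],
    sync_one: Callable[[str], int] = wandb_sync,
    max_workers: int = 4,
) -> Dict[str, int]:
    """Sync several run directories concurrently.

    Returns a mapping from run directory to the result of `sync_one`.
    """
    dirs = list(run_dirs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `dirs`.
        codes = list(pool.map(sync_one, dirs))
    return dict(zip(dirs, codes))
```

Passing a different `sync_one` makes it possible to try the scheduling without calling the real CLI, for example `sync_all(["run-a", "run-b"], sync_one=lambda d: 0)`.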

@klieret
Owner

klieret commented Mar 22, 2024

Hi @RitwikGupta, thanks for your reply. It would be a tradeoff between update frequency and number of runs. A single wandb sync call probably takes about 2s to complete, so you can sync about 30 runs per minute if you are fine with each run updating once per minute (or less often).

The current setup tries to trigger a sync for every epoch, and that might lead to some runs crowding out others. So #101 allows bringing that rate down to something the wandb sync calls can keep up with.

But I agree that it's still nice to have the multiple-job setup. Unfortunately I also currently don't have much time to put into this. But I'm happy to review again, if you address the comments in #85 :)

@RitwikGupta
Author

@klieret I think the difference between our wandb logging setups is that we are logging every iteration, of which we have 10-20k per epoch. Therefore, our sync calls take about 5-10s to complete, reducing us to 6-12 runs that can sync per minute. We are often running 15+ such runs at a time.
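The arithmetic behind these numbers, as a small sketch. It assumes syncs run strictly one after another and that every run is served in turn; the function names are made up for illustration.

```python
def max_runs_per_minute(sync_seconds: float) -> float:
    """How many sequential syncs fit into one minute."""
    return 60.0 / sync_seconds


def seconds_between_updates(n_runs: int, sync_seconds: float) -> float:
    """Time for one full round if every run is synced once, in turn."""
    return n_runs * sync_seconds


fast_rate = max_runs_per_minute(2)    # the 2s estimate: 30 runs per minute
slow_rates = (max_runs_per_minute(10), max_runs_per_minute(5))  # 6 to 12
round_trip = seconds_between_updates(15, 10)  # 15 runs at 10s each: 150s
```

So with 15 runs and 10s per sync, a strictly sequential round-robin would still reach every run, but only about once every 2.5 minutes.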

I am also currently short on time, but I will find the one hour it will take to implement your feedback soon. It's on my to do list!

@klieret
Owner

klieret commented Apr 1, 2024

Thanks for your reply.

Just for clarification: Even if you log every iteration, you do not need to trigger the sync every iteration. The frequency with which you trigger the sync will be adjustable with #101, so no matter how long the sync takes or how many iterations you log, you should still be able to synchronize all projects (just with less frequent updates).
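A minimal sketch of what decoupling logging from sync triggering could look like. This is not the implementation in #101: `ThrottledTrigger`, `min_interval_s`, and the injected `trigger` callable are hypothetical, and the real option may be expressed differently.

```python
import time
from typing import Callable, Optional


class ThrottledTrigger:
    """Call `trigger` at most once every `min_interval_s` seconds.

    Hypothetical wrapper for illustration; not part of wandb-osh.
    """

    def __init__(
        self,
        trigger: Callable[[], None],
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._trigger = trigger
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last: Optional[float] = None

    def __call__(self) -> bool:
        """Fire the trigger if enough time has passed; report whether it fired."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval_s:
            return False  # too soon; the data is still logged locally
        self._last = now
        self._trigger()
        return True


# Demo with a fake clock so the example does not need to sleep.
fake_times = iter([0.0, 10.0, 59.0, 60.0, 61.0])
fired = []
maybe_sync = ThrottledTrigger(
    trigger=lambda: fired.append("sync"),
    min_interval_s=60.0,
    clock=lambda: next(fake_times),
)
results = [maybe_sync() for _ in range(5)]
```

In a training loop, the wrapper would be called after every logged iteration, and only the calls that are at least `min_interval_s` apart would actually request a sync.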
