Support abandoning jobs that take too long to cancel #78
Comments
An earlier refactor caused these graceful shutdown routines to stop respecting the provided context. No matter what happened with the provided context, they would continue blocking, though they would ultimately return the context's error. With this change, both stop methods will return early if the provided context is cancelled or timed out _prior_ to the clean shutdown completing. This makes it straightforward to implement a shutdown flow which first calls `Stop` with a context that expires after 5 seconds, and after that calls `StopAndCancel`. If the shutdown _still_ isn't complete, the user can choose to exit anyway. Partially fixes #78.
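A minimal sketch of that flow, assuming an already-started client on the pgx driver (`riverClient`, the package name, and the 5-second timeouts are illustrative):

```go
package app

import (
	"context"
	"log"
	"time"

	"github.com/jackc/pgx/v5"
	"github.com/riverqueue/river"
)

// shutdown stops the client in two phases: a graceful Stop bounded by a
// timeout, then StopAndCancel, finally giving up if jobs still hang.
func shutdown(ctx context.Context, riverClient *river.Client[pgx.Tx]) {
	// Phase 1: wait up to 5 seconds for in-flight jobs to finish cleanly.
	stopCtx, stopCancel := context.WithTimeout(ctx, 5*time.Second)
	defer stopCancel()
	if err := riverClient.Stop(stopCtx); err == nil {
		return // clean shutdown completed
	}

	// Phase 2: cancel running jobs and wait a bit longer for them to exit.
	cancelCtx, cancelCancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancelCancel()
	if err := riverClient.StopAndCancel(cancelCtx); err != nil {
		// Shutdown still isn't complete; the process can choose to exit anyway.
		log.Printf("jobs still running after cancel, exiting anyway: %v", err)
	}
}
```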
Thanks for reporting this @vito! I've fixed your issue (1) above as part of #79. This is something we used to have but broke during a refactor in the push toward release.

The second part is a bit trickier, although I think it's conceptually possible. Internally the Client knows which jobs haven't yet completed, so it's conceivable that we could grab that list and make a final SQL update prior to exiting. It would be challenging to make sure that this is all done safely and that there are no adverse effects or panics if a job happens to finish while we've already given up on it and are marking it as … I do think that given we can't actually halt a running goroutine in Go, this is probably the right design for us in the end, if we can find a reasonable way to make it work cleanly.

Meanwhile, your best bet is to rely on a combination of …
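A very rough sketch of what that final pass might look like (`abandonRunningJobs` is hypothetical, as is the assumption that the in-flight job IDs can be collected from the client; `river_job` is River's job table):

```go
package app

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// abandonRunningJobs marks jobs that this client still considers running
// as retryable so another worker can pick them up. The state guard leaves
// alone any job that happens to finish while we're giving up on it.
func abandonRunningJobs(ctx context.Context, tx pgx.Tx, jobIDs []int64) error {
	_, err := tx.Exec(ctx,
		`UPDATE river_job
		    SET state = 'retryable', scheduled_at = now()
		  WHERE id = any($1)
		    AND state = 'running'`,
		jobIDs,
	)
	return err
}
```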
Nice. #79 +1.

I am a little afraid that uncancellable jobs (and by extension, jobs that don't die well after a client stops and are rescued an hour later) end up being a major Achilles' heel for people, just given how easy it is to write an uncancellable job in Go. Having the client do one final pass where it basically runs the rescuer on all jobs that were cancelled but didn't stop, to get their state back in good order for the next time a client starts up, might not be the worst idea.
The first part of this is resolved as per #79 and will be included in the upcoming release. It sounds like @brandur is eager to also see if we can find a way to cleanly handle stuck jobs on shutdown, so I will leave this issue open to track that additional work. Thanks again for the detailed issue @vito 🙏
Yep. On my feature list!
@vito We just cut a new release, 0.0.11, which includes the fix for stopping respecting context cancellation.
@brandur Awesome, thanks for following up! 🙏
Context
Hiya, thanks for making River! Really digging it so far.
I have an app that queues jobs that I don't really need to wait for before exiting, even after cancelling them. I just want to guarantee fast deploys and let jobs get tossed around like a hot potato during a rolling deploy.
Before I continue: this isn't blocking me, and it's a nuanced topic on an edge case within an edge case, so no rush.
`StopAndCancel` is almost what I want. Cancelled jobs end up in a `retryable` state and they either get picked back up elsewhere or recovered when the instance comes back. The one case it doesn't handle is when jobs take too long to stop (typically by not respecting `ctx.Done()`, likely a bug).

I tried to pass a timeout `ctx` to `StopAndCancel`, but it doesn't currently respect it:
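Roughly like this (a minimal sketch of my shutdown path; `db.workerClient` holds the app's `*river.Client`, with imports of `context`, `log`, and `time` assumed):

```go
// Give the worker 10 seconds to stop before bailing out.
stopCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	log.Printf("error stopping worker: %v", err)
}

if stopCtx.Err() != nil {
	// Timed out: I'd like to give up here, but StopAndCancel doesn't
	// actually return until every job has stopped.
}
```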
With a job that does `time.Sleep(time.Minute)`, `StopAndCancel` will just wait the full minute instead of being interrupted after 10 seconds. This seems like a simple bug; happy to submit a PR.

Even if it did respect the timeout, there's a secondary problem: if we bail out at that point, the worker's jobs will be left in a `running` state. At this point the job is stuck, and if you have uniqueness rules like I do, that'll prevent future attempts too. As far as I can tell the only recourse at this point is to delete the job from the database or set its state to `retryable`.

Proposal
This might be problematic for other use cases, but at least for mine, this would be a big help:

1. `StopAndCancel` respects ctx cancellation
2. When `StopAndCancel` times out, it marks all `running` jobs on the local worker as `retryable`

Alternatively, having some way to do (2) independently would be fine. For example, I could do a `db.workerClient.Abandon(stopCtx)` in the `if stopCtx.Err() != nil {` branch above, as sketched after this paragraph. This would allow graceful shutdown in most cases, prevent hanging in the "truly stuck" case, and allow jobs to be retried in all cases except `kill -9`.
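Put together, the shutdown path could look something like this (`Abandon` is the hypothetical method proposed here, not an existing River API):

```go
stopCtx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

// With (1), this would return early once stopCtx expires.
if err := db.workerClient.StopAndCancel(stopCtx); err != nil {
	log.Printf("error stopping worker: %v", err)
}

if stopCtx.Err() != nil {
	// (2), done explicitly: mark this worker's running jobs retryable so
	// they aren't stuck. Use a fresh context since stopCtx has expired.
	if err := db.workerClient.Abandon(context.Background()); err != nil {
		log.Printf("error abandoning jobs: %v", err)
	}
}
```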
Alternative
The root of the problem here is really the stuck `running` jobs. I would also be OK with leaving everything as-is and instead just enforcing a stop timeout at the infra layer, letting the app get `kill -9`'ed, and having River automatically detect when a `running` job is abandoned so things aren't stuck forever.