Workflow with pending pod should be retriable when failed #13579

Open
4 tasks done
shuangkun opened this issue Sep 9, 2024 · 3 comments
Assignees: shuangkun
Labels: area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries), P3 (low priority), problem/more information needed (not enough information has been provided to diagnose this issue), type/bug

Comments

@shuangkun
Member

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

My workflow failed unexpectedly, but I can't retry it.
(screenshot omitted)

In reality, the corresponding pod has already completed, but because the workflow had already failed, that information was never reflected in its status. I think we should support retry in this case.

tianshuangkun@U-4YKHFNR6-2229 argo-workflows % kubectl get pod large-workflow-t696s-sleep-2611185570
NAME                                    READY   STATUS      RESTARTS   AGE
large-workflow-t696s-sleep-2611185570   0/2     Completed   0          46m
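
For reference, a hedged sketch of how this mismatch can be inspected; the workflow name is inferred from the pod name above, and the jsonpath query and argo retry call are illustrative rather than taken from the original report:

# Node phases recorded in the Workflow status (the affected node still shows Pending):
kubectl get wf large-workflow-t696s -o jsonpath='{.status.nodes.*.phase}' | tr ' ' '\n' | sort | uniq -c
# The underlying Pod, however, has already completed:
kubectl get pod large-workflow-t696s-sleep-2611185570
# A manual retry of the failed Workflow is not accepted in this state:
argo retry large-workflow-t696s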

Version(s)

v3.4.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

a large workflow

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@shuangkun self-assigned this Sep 9, 2024
@agilgur5 added the area/retry-manual and P3 (low priority) labels Sep 9, 2024
@agilgur5 changed the title from "Workflow has pending pod should can be retry when failed" to "Workflow with pending pod should be retriable when failed" Sep 9, 2024
@agilgur5
Member

In reality, the corresponding pod has already completed, but because the workflow had already failed, that information was never reflected in its status. I think we should support retry in this case.

I don't think this is correct; that sounds like a Controller bug with the Workflow not being correctly tracked as completed. You are on an older version too.

The Workflow should be stopped before being retried, otherwise that can cause very very unpredictable race conditions. Not to mention that the retry process can delete Pods (you do still have #12734 which still needs further iteration and review), which would be even stranger.
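
For illustration only (the workflow name here is a placeholder, not from this issue), a hedged sketch of that sequence, stopping first and only then retrying:

argo stop my-workflow     # shut the Workflow down (exit handlers still run) so nothing races the retry
argo retry my-workflow    # then retry the failed nodes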

All in all, this suggestion sounds like it would dramatically increase unpredictability, which is not good.

If anything, the root cause of the Controller not tracking correctly should be resolved. You're also missing a Workflow that reproduces this issue; please make sure to include a reproduction. It's not debuggable otherwise for a root cause analysis.

@agilgur5 added the problem/more information needed label Sep 10, 2024
@shuangkun
Member Author

This workflow is already in the Failed state. Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment the workflow Failed, even though the pod has since Completed. Can we manually retry in this scenario? Otherwise I have no way to make this workflow succeed: it has 150,000 pods and nearly 90% of the steps have already completed.

@agilgur5
Member

Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment the workflow Failed, even though the pod has since Completed.

Ah I see, thanks for clarifying.

So in this case you can't "stop" the Workflow either since it is already considered "stopped"?

I'm still thinking this is a Controller bug. The Workflow should still be Running if it has Pending Pods, and it should correctly track all Pods it ran. It would also need to signal a termination to that Pod once it's up, per failFast or similar logic.
