Workflow with pending pod should be retriable when failed #13579

Open
4 tasks done
shuangkun opened this issue Sep 9, 2024 · 3 comments
Assignees: shuangkun
Labels: area/retry-manual (Manual workflow "Retry" Action (API/CLI/UI); see retryStrategy for template-level retries), P3 (low priority), problem/more information needed (not enough information has been provided to diagnose this issue), type/bug

Comments

@shuangkun
Member

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

My workflow failed unexpectedly, but I can't retry it.
(screenshot omitted)

In reality, the corresponding pod has already completed, but because the workflow had already failed, that information was never reflected in its status. I think we should support retry in this case.

tianshuangkun@U-4YKHFNR6-2229 argo-workflows % kubectl get pod large-workflow-t696s-sleep-2611185570
NAME                                    READY   STATUS      RESTARTS   AGE
large-workflow-t696s-sleep-2611185570   0/2     Completed   0          46m
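
For reference, a hedged sketch of how this mismatch can be inspected; the workflow name is inferred from the pod name above, and the jsonpath query and argo retry call are illustrative rather than taken from the original report:

# Node phases recorded in the Workflow status (the affected node still shows Pending):
kubectl get wf large-workflow-t696s -o jsonpath='{.status.nodes.*.phase}' | tr ' ' '\n' | sort | uniq -c
# The underlying Pod, however, has already completed:
kubectl get pod large-workflow-t696s-sleep-2611185570
# A manual retry of the failed Workflow is not accepted in this state:
argo retry large-workflow-t696s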

Version(s)

v3.4.12

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

a large workflow

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
@shuangkun self-assigned this Sep 9, 2024
@agilgur5 added the area/retry-manual and P3 (low priority) labels Sep 9, 2024
@agilgur5 changed the title from "Workflow has pending pod should can be retry when failed" to "Workflow with pending pod should be retriable when failed" Sep 9, 2024
@agilgur5
Member

In reality, the corresponding pod has already completed, but because the workflow had already failed, that information was never reflected in its status. I think we should support retry in this case.

I don't think this is correct; that sounds like a Controller bug with the Workflow not being correctly tracked as completed. You are on an older version too.

The Workflow should be stopped before being retried, otherwise that can cause very very unpredictable race conditions. Not to mention that the retry process can delete Pods (you do still have #12734 which still needs further iteration and review), which would be even stranger.
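
For illustration only (the workflow name here is a placeholder, not from this issue), a hedged sketch of that sequence, stopping first and only then retrying:

argo stop my-workflow     # shut the Workflow down (exit handlers still run) so nothing races the retry
argo retry my-workflow    # then retry the failed nodes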

All in all, this suggestion sounds like it would dramatically increase unpredictability, which is not good.

If anything, the root cause of the Controller not tracking correctly should be resolved. You're also missing a Workflow that reproduces this issue; please make sure to include a reproduction. It's not debuggable otherwise for a root cause analysis.

@agilgur5 added the problem/more information needed label Sep 10, 2024
@shuangkun
Member Author

This workflow is already in the Failed state. Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment the workflow Failed, even though the pod has since Completed. Can we manually retry in this scenario? Otherwise I have no way to make this workflow succeed: it has 150,000 pods and nearly 90% of the steps have already completed.

@agilgur5
Member

Once it fails, pod status changes are no longer tracked, so the nodeStatus was still Pending at the moment the workflow Failed, even though the pod has since Completed.

Ah I see, thanks for clarifying.

So in this case you can't "stop" the Workflow either since it is already considered "stopped"?

I'm still thinking this is a Controller bug. The Workflow should still be Running if it has Pending Pods, and it should correctly track all Pods it ran. It would also need to signal a termination to that Pod once it's up, per failFast or similar logic.
