OCPBUGS-41811: (agent-based installer) let the bootstrap wait for workers before the reboot #910
base: master
In some cases the bootstrap node may reboot before the workers have started the joining process, thus removing the assisted-service that is still required by the workers.
@andfasano: This pull request references Jira Issue OCPBUGS-41811, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh |
@andfasano: This pull request references Jira Issue OCPBUGS-41811, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
/test ? |
/test e2e-agent-ha-dualstack |
Codecov Report

Coverage diff, master vs. #910:

| Metric   | master | #910   | +/-    |
|----------|--------|--------|--------|
| Coverage | 55.70% | 55.61% | -0.09% |
| Files    | 15     | 15     |        |
| Lines    | 3208   | 3231   | +23    |
| Hits     | 1787   | 1797   | +10    |
| Misses   | 1249   | 1257   | +8     |
| Partials | 172    | 177    | +5     |
/test e2e-agent-ha-dualstack |
/test e2e-agent-ha-dualstack |
	return true
}

i.log.Infof("Found %d workers, verifying their current installation stage", len(workers))
From just reading the log output, it is unclear what the wait is for. Perhaps change this log message to indicate that we are waiting for the workers to reach the reboot stage.
Suggested change:
- i.log.Infof("Found %d workers, verifying their current installation stage", len(workers))
+ i.log.Infof("Found %d workers, waiting for workers to reach reboot stage", len(workers))
As described by the analysis in https://issues.redhat.com/browse/OCPBUGS-41811, in some cases the bootstrap node may reboot before the workers have started the joining process, thus removing the assisted-service that is still required by the workers. This prevents the workers from successfully joining the cluster, causing the cluster deployment to fail.
This patch introduces an explicit synchronization between the bootstrap node and the workers (only when the installation is performed via the agent-based installer), delaying the bootstrap reboot until all the workers have passed the `waiting for control plane` stage.