Fix bad interactions between timeouts and build retries #10480
Merge Checklist
All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)
- `*-static` subpackages, etc.) have had their `Release` tag incremented.
- (`./cgmanifest.json`, `./toolkit/scripts/toolchain/cgmanifest.json`, `.github/workflows/cgmanifest.json`)
- (`./LICENSES-AND-NOTICES/SPECS/data/licenses.json`, `./LICENSES-AND-NOTICES/SPECS/LICENSES-MAP.md`, `./LICENSES-AND-NOTICES/SPECS/LICENSE-EXCEPTIONS.PHOTON`)
- `*.signatures.json` files
- `sudo make go-tidy-all` and `sudo make go-test-coverage` pass

Summary
When we queue a package to build (or test), we set a timeout (by default 8h). If the build has not finished by then we forcibly stop the build and mark it as failed.
We also support `PACKAGE_BUILD_RETRIES` and `CHECK_BUILD_RETRIES`, which cause failed builds to re-run.

However, each time a retry was triggered, the timeout would reset. For example, in the buddy builds this means that a stuck package test could take 4x8=32h to build, which would exceed pipeline time limits. We want to exit gracefully with an error state so that we can generate and publish logs correctly. If the pipeline forces the timeout, it can be difficult to debug.
Instead of resetting the timeout with each retry, have all attempts share a single timeout. If the timeout is exceeded, stop retrying (use `RunWithLinearBackoff()`, which takes a `ctx` configured with a timeout, so we can break out early).

As part of this fix, I also noticed that the timeout handling was not cleaning up the build chroot correctly. We should not be using anything related to `panic()` for error handling; instead use `logger.Log.Fatal*()`, which gives the logging library a chance to run its registered cleanup functions (i.e. final chroot cleanup) before exiting "gracefully".

Change Log
- `BuildAgent.BuildPackage()` now takes a `time.Duration` instead of using the value from `BuildAgentConfig`.
- `StopAllChildProcesses()`, which is like `PermanentlyStopAllChildProcesses()` but does not set the disable flag (so we can still run the gpg-agent cleanup on exit).

Does this affect the toolchain?
NO
Associated issues
Test Methodology
(Added custom %check to words with `sleep 9h`)