Fix bad interactions between timeouts and build retries #10480

Open
dmcilvaney wants to merge 4 commits into base: 3.0-dev

@dmcilvaney dmcilvaney commented Sep 18, 2024

Merge Checklist

All boxes should be checked before merging the PR (just tick any boxes which don't apply to this PR)

  • The toolchain has been rebuilt successfully (or no changes were made to it)
  • The toolchain/worker package manifests are up-to-date
  • Any updated packages successfully build (or no packages were changed)
  • Packages depending on static components modified in this PR (Golang, *-static subpackages, etc.) have had their Release tag incremented.
  • Package tests (%check section) have been verified with RUN_CHECK=y for existing SPEC files, or added to new SPEC files
  • All package sources are available
  • cgmanifest files are up-to-date and sorted (./cgmanifest.json, ./toolkit/scripts/toolchain/cgmanifest.json, .github/workflows/cgmanifest.json)
  • LICENSE-MAP files are up-to-date (./LICENSES-AND-NOTICES/SPECS/data/licenses.json, ./LICENSES-AND-NOTICES/SPECS/LICENSES-MAP.md, ./LICENSES-AND-NOTICES/SPECS/LICENSE-EXCEPTIONS.PHOTON)
  • All source files have up-to-date hashes in the *.signatures.json files
  • sudo make go-tidy-all and sudo make go-test-coverage pass
  • Documentation has been updated to match any changes to the build system
  • Ready to merge

Summary

When we queue a package to build (or test), we set a timeout (8h by default). If the build has not finished by then, we forcibly stop it and mark it as failed.

We also support PACKAGE_BUILD_RETRIES and CHECK_BUILD_RETRIES, which will cause failed builds to re-run.

However, the timeout was reset each time a retry was triggered. In the buddy builds, for example, a stuck package test could take 4x8=32h to finish, which exceeds pipeline time limits. We want to exit gracefully with an error state so that we can generate and publish logs correctly. If the pipeline itself enforces the timeout, the failure can be difficult to debug.

Instead of resetting the timeout with each retry, have all attempts share a single timeout. If the timeout is exceeded, stop retrying. This uses RunWithLinearBackoff(), which takes a ctx configured with a timeout so we can break out early.

As part of this fix, I also noticed that the timeout handling was not cleaning up the build chroot correctly. We should not use anything related to panic() for error handling. Instead, use logger.Log.Fatal*(), which gives the logging library a chance to run its registered cleanup functions (i.e. final chroot cleanup) before exiting "gracefully".

Change Log
  • The package build timeout is now shared by all retry attempts. Each invocation of BuildAgent.BuildPackage() now takes a time.Duration instead of using the value from BuildAgentConfig.
  • Properly clean up build chroot on timeout
    • Handle timeout logic inside chroot.Run so that we exit the chroot correctly before leaving the function. Otherwise the chroot cleanup code runs from within the chroot itself and the paths are wrong.
    • Add a new StopAllChildProcesses(), which is like PermanentlyStopAllChildProcesses() but does not set the disable flag, so the gpg-agent cleanup can still run on exit.

Does this affect the toolchain?

NO

Associated issues
  • #xxxx

Test Methodology

Added a custom %check to the words package that runs sleep 9h, so the test outlasts the default 8h timeout.

@dmcilvaney dmcilvaney requested a review from a team as a code owner September 18, 2024 02:46
@dmcilvaney dmcilvaney added the bug, go, and 3.0-dev labels Sep 18, 2024
Labels
3.0-dev, bug, go, Tools