diff --git a/keps/sig-node/4603-tune-crashloopbackoff/README.md b/keps/sig-node/4603-tune-crashloopbackoff/README.md
new file mode 100644
index 00000000000..9e9fbc1e4b0
--- /dev/null
+++ b/keps/sig-node/4603-tune-crashloopbackoff/README.md
@@ -0,0 +1,1776 @@
+
+# KEP-4603: Tune CrashLoopBackoff
+
+
+
+
+
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [Existing backoff curve change: front loaded decay, lower maximum backoff](#existing-backoff-curve-change-front-loaded-decay-lower-maximum-backoff)
+  - [Node specific kubelet config for maximum backoff down to 1 second](#node-specific-kubelet-config-for-maximum-backoff-down-to-1-second)
+  - [Refactor and flat rate to 10 minutes for the backoff counter reset threshold](#refactor-and-flat-rate-to-10-minutes-for-the-backoff-counter-reset-threshold)
+  - [User Stories](#user-stories)
+    - [Task isolation](#task-isolation)
+    - [Fast restart on failure](#fast-restart-on-failure)
+    - [Sidecar containers fast restart](#sidecar-containers-fast-restart)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Front loaded decay curve methodology](#front-loaded-decay-curve-methodology)
+  - [Per node config](#per-node-config)
+  - [Refactor of recovery threshold](#refactor-of-recovery-threshold)
+  - [Conflict resolution](#conflict-resolution)
+  - [Kubelet overhead analysis](#kubelet-overhead-analysis)
+  - [Benchmarking](#benchmarking)
+  - [Relationship with Job API podFailurePolicy and backoffLimit](#relationship-with-job-api-podfailurepolicy-and-backofflimit)
+  - [Relationship with ImagePullBackOff](#relationship-with-imagepullbackoff)
+  - [Relationship with k/k#123602](#relationship-with-kk123602)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+    - [Conflict resolution](#conflict-resolution-1)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+  - [Global override](#global-override)
+  - [Per exit code configuration](#per-exit-code-configuration)
+  - [RestartPolicy: Rapid](#restartpolicy-rapid)
+  - [Flat-rate restarts for Succeeded Pods](#flat-rate-restarts-for-succeeded-pods)
+    - [On Success and the 10 minute recovery threshold](#on-success-and-the-10-minute-recovery-threshold)
+    - [Related: API opt-in for flat rate/quick restarts when transitioning from Succeeded phase](#related-api-opt-in-for-flat-ratequick-restarts-when-transitioning-from-succeeded-phase)
+    - [Related: Succeeded vs Rapidly failing: who's getting the better deal?](#related-succeeded-vs-rapidly-failing-whos-getting-the-better-deal)
+  - [Front loaded decay with interval](#front-loaded-decay-with-interval)
+  - [Late recovery](#late-recovery)
+  - [More complex heuristics](#more-complex-heuristics)
+- [Appendix A](#appendix-a)
+  - [Kubelet SyncPod](#kubelet-syncpod)
+  - [Runtime SyncPod](#runtime-syncpod)
+  - [Kubelet
SyncTerminatingPod + runtime killPod](#kubelet-syncterminatingpod--runtime-killpod) + - [Kubelet SyncTerminatedPod](#kubelet-syncterminatedpod) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + + +## Release Signoff Checklist + + + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) +- [ ] (R) KEP approvers have approved the KEP status as `implementable` +- [ ] (R) Design details are appropriately documented +- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [ ] e2e Tests for all Beta API Operations (endpoints) + - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [x] (R) Graduation criteria is in place + - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [ ] (R) Production readiness review completed +- [ ] (R) Production readiness review approved +- [ ] "Implementation History" section is up-to-date for milestone +- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + + + +[kubernetes.io]: https://kubernetes.io/ +[kubernetes/enhancements]: https://git.k8s.io/enhancements +[kubernetes/kubernetes]: https://git.k8s.io/kubernetes +[kubernetes/website]: https://git.k8s.io/website + +## Summary + + + +CrashLoopBackoff is designed to slow the speed at which failing containers +restart, preventing the starvation of kubelet by a misbehaving container. +Currently it is a subjectively conservative, fixed behavior regardless of +container failure type: when a Pod has a restart policy besides `Never`, after +containers in a Pod exit, the kubelet restarts them with an exponential back-off +delay (10s, 20s, 40s, …), that is capped at five minutes. The delay for restarts +will stay at 5 minutes until a container has executed for 2x the maximum backoff +-- that is, 10 minutes -- without any problems, in which case the kubelet resets +the backoff count for that container and further crash loops start again at the +beginning of the delay curve +([ref](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/)). + +Both the decay to 5 minute back-off delay, and the 10 minute recovery threshold, +are considered too conservative, especially in cases where the exit code was 0 +(Success) and the pod is transitioned into a "Completed" state or the expected +length of the pod run is less than 10 minutes. 
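+
+For illustration only, the sketch below models that documented delay progression
+and reset rule; it is a simplification, not the kubelet's actual
+`client-go`-based implementation.
+
+```go
+// Illustrative sketch only, not kubelet code: the documented delay progression
+// (10s doubling up to a 5 minute cap) and the reset rule (a container must run
+// cleanly for 2x the cap, i.e. 10 minutes, before the counter resets).
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+const (
+    initialDelay = 10 * time.Second
+    maxDelay     = 5 * time.Minute
+    resetWindow  = 2 * maxDelay // 10 minutes of clean running resets the counter
+)
+
+func delayForRestart(n int) time.Duration {
+    d := initialDelay
+    for i := 0; i < n; i++ {
+        d *= 2
+        if d > maxDelay {
+            return maxDelay
+        }
+    }
+    return d
+}
+
+func main() {
+    for i := 0; i < 7; i++ {
+        fmt.Printf("restart %d: wait %v\n", i+1, delayForRestart(i))
+    }
+    fmt.Println("counter resets after running cleanly for", resetWindow)
+}
+```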
+
+This KEP proposes the following changes:
+* Provide 2 alpha-gated changes to get feedback and periodic scalability tests
+  * Change the global initial backoff to 1s and the maximum backoff to 1 minute
+  * A knob for cluster operators to configure the maximum backoff down to a
+    minimum of 1s at the node level
+* Formally split image pull backoff and container restart backoff behavior
+* Formally split the backoff counter reset threshold for container restart
+  backoff behavior and maintain the current 10 minute recovery threshold
+
+![A graph showing the change in elapsed time for observed restarts with the
+CrashLoopBackOffBehavior of today, with the proposed new default, and with the
+proposed minimum per node configuration](./restarts-vs-elapsed-all.png "KEP-4603
+CrashLoopBackoff proposal comparison")
+
+## Motivation
+
+
+
+[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), with
+over 250 positive reactions and in the top five upvoted issues in k/k, covers a
+range of suggestions to change the rate of decay for the backoff delay or the
+criteria to reset the backoff counter, in some cases requesting to make this
+behavior tunable per node, per container, and/or per exit code. Anecdotally,
+there are use cases representative of some of Kubernetes' most rapidly growing
+workload types like
+[gaming](https://github.com/googleforgames/agones/issues/2781) and
+[AI/ML](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures)
+that would benefit from this behavior being different for varying types of
+containers. Application-based workarounds using init containers or startup
+wrapper scripts, or custom operators like
+[kube_remediator](https://github.com/ankilosaurus/kube_remediator) and
+[descheduler](https://github.com/kubernetes-sigs/descheduler) are used by the
+community to anticipate CrashLoopBackoff behavior, prune pods with nonzero
+backoff counters, or otherwise "pretend" the pod did not exit recently to force
+a faster restart from kubelet. Discussions with early Kubernetes contributors
+indicate that the current behavior was not designed beyond the motivation to
+throttle misbehaving containers, and is open to reinterpretation in light of
+user experiences, empirical evidence, and emerging workloads.
+
+By definition this KEP will cause pods to restart faster and more often than the
+current status quo; let it be known that such a change is desired. It is also
+the intention of the author that, to some degree, this change happens without
+the need to reconfigure workloads or expose extensive API surfaces, as
+experience shows that such requirements make changes difficult to adopt,
+increase the risk of misconfiguration, and can make the system overly complex
+to reason about. That being said, this KEP recognizes the need for backoff
+times to protect node stability, and so avoids overly sweeping changes to
+backoff behavior and unreasonable or unpredictable defaults. Therefore the
+approach manages the risk of different decay curves to node stability by
+providing an API surface to opt in to the riskiest option, and ultimately also
+provides the observability instrumentation needed by cluster operators to model
+the impact of this change on their existing deployments.
+
+A large number of alternatives have been discussed over the 5+ years the
+canonical tracking issue has been open, some of which imply high levels of
+sophistication for kubelet to make different decisions based on system state,
+workload specific factors, or by detecting anomalous workload behavior. While
+this KEP does not rule out those directions in the future, the proposal herein
+focuses on the simpler, easily modeled changes that are designed to address the
+most common issues observed today. It is the desire of the author that the
+observability and benchmarking improvements instrumented as part of this
+proposal can serve as a basis for pursuing those types of improvements in the
+future, including as signals for end users to represent such solutions as
+necessary. Some analysis of these alternatives is included in the Alternatives
+Considered section below for future reference.
+
+### Goals
+
+
+
+* Improve pod restart backoff logic to better match the actual load it creates
+  and meet emerging use cases
+* Quantify and expose risks to node stability resulting from decreased and/or
+  heterogeneous backoff configurations
+* Empirically derive defaults and allowable thresholds that approximate current
+  node stability
+* Provide a simple UX that does not require changes for the majority of
+  workloads
+* Must work for Jobs and sidecar containers
+
+### Non-Goals
+
+
+
+* This effort is NOT intending to support fully user-specified configuration, to
+  cap risk to node stability
+* This effort is purposefully NOT implementing more complex heuristics in
+  kubelet (e.g. based on system state, workload specific factors, or detection
+  of anomalous workload behavior), in order to focus on better benchmarking and
+  observability and address common use cases with easily modelled changes first
+* This effort is NOT changing the independent backoff curve for image pulls
+
+## Proposal
+
+
+
+This design seeks to incorporate a two-pronged approach:
+
+1. Change the existing initial value for the backoff curve to 1s to stack more
+   retries earlier, and reduce the maximum backoff for a given restart from 5
+   minutes to 1 minute, for all restarts (`restartPolicy: OnFailure` and
+   `restartPolicy: Always`)
+2. Provide an option for cluster operators to configure an even lower maximum
+   backoff for all containers on a specific Node, down to 1s
+
+To derive these values, manual stress testing was performed, observing the
+behavior of the kubelet, the API server, and overall cluster operations. In
+addition, part of the alpha period will be dedicated entirely to systematically
+stress testing kubelet and the API server with different distributions of
+workloads utilizing the new backoff curves. During the benchmarking period in
+alpha, kubelet memory and CPU, API server latency, and pod restart latency will
+be observed and analyzed to further refine the maximum allowable restart rate
+for fully saturated nodes.
+
+Longer term, these metrics will also supply cluster operators with the data
+necessary to better analyze and anticipate the change in load and node stability
+as a result of upgrading to these changes.
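+
+As a rough illustration of the scale of this change, the sketch below counts
+restarts for a hypothetical container that exits immediately, under the current
+defaults and under the proposed ones. It is a simplification that ignores
+container run time, image pulls, and kubelet polling granularity, so real counts
+will be lower; under these assumptions the proposed defaults yield roughly five
+additional restarts in the first five minutes, in line with the figures
+discussed in [Risks and Mitigations](#risks-and-mitigations).
+
+```go
+// Illustrative sketch only, not kubelet code: counts how many restarts fit in a
+// time window for an instantly-crashing container under a 2x exponential
+// backoff with the given initial value and cap.
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+func restartsWithin(window, initial, maxDelay time.Duration) int {
+    elapsed, delay, restarts := time.Duration(0), initial, 0
+    for {
+        elapsed += delay
+        if elapsed > window {
+            return restarts
+        }
+        restarts++
+        delay *= 2
+        if delay > maxDelay {
+            delay = maxDelay
+        }
+    }
+}
+
+func main() {
+    window := 5 * time.Minute
+    fmt.Println("current defaults (10s/5m) :", restartsWithin(window, 10*time.Second, 5*time.Minute))
+    fmt.Println("proposed defaults (1s/1m) :", restartsWithin(window, time.Second, time.Minute))
+}
+```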
+ +While they are more deeply rationalized in the Alternatives Considered section, +because they are commonly inquired about at this point, note that this proposal +will NOT change: +* backoff behavior for Pods transitioning from the "Success" state differently + from those transitioning from a "Failed" state -- see [here in Alternatives + Considered](#on-success-and-the-10-minute-recovery-threshold) +* the ImagePullBackoff -- out of scope, see [Design + Details](#relationship-with-imagepullbackoff) +* changes that address 'late recovery', or modifications to backoff behavior + once the maximum backoff has been reached -- see + [Alternatives](#more-complex-heuristics) + + + +### Existing backoff curve change: front loaded decay, lower maximum backoff + +This KEP proposes changing the existing backoff curve to load more restarts +earlier by changing the initial value of the exponential backoff, and to retry +periodically more often by reducing the maximum for the backoff. A number of +alternate initial values and maximum backoff values are modelled below. This +proposal suggests we start with a new initial value of 1s (changed from 10s +today) and a maximum backoff of 1 minute (changed from 5 minutes today) based on +prior research, and further analyze its impact on infrastructure during alpha. + +![A graph showing the change in elapsed time for observed restarts with the +CrashLoopBackOffBehavior of today vs the proposed new +default](./restarts-vs-elapsed-new-default.png "Default backoff curve change") + + +### Node specific kubelet config for maximum backoff down to 1 second + +This KEP also proposes providing a more flexible mechanism for modifying the +maximum backoff as an opt-in. The configuration for this will be per node, by an +integer representing seconds (notably NOT resulting in subsecond granularity). +The minimum allowable value will be 1 second. The maximum allowable value will +be 300 seconds. This per node configuration will be configurable only by a user +with awareness of the node holistically. + +![A graph showing the change in elapsed time for observed restarts with the +CrashLoopBackOffBehavior of today vs the proposed minimum for per node +configuration](./restarts-vs-elapsed-minimum-per-node.png "Per node minimum backoff +curve allowed") + + +### Refactor and flat rate to 10 minutes for the backoff counter reset threshold + +Finally, this KEP proposes factoring out the backoff counter reset threshold for +CrashLoopBackoff behavior induced restarts into a new constant. Instead of being +proportional to the configured maximum backoff, it will instead be a flat rate +equivalent to the current implementation, 10 minutes. This will only apply to +backoff counter resets from CrashLoopBackoff behavior. + + +### User Stories + + + +#### Task isolation +By design the container wants to exit so it will be recreated and restarted to +start a new “session” or “task”. It may be a new gaming session or a new +isolated task processing an item from the queue. + +This is not possible to do by creating the wrapper of a container that will +restart the process in a loop because those tasks or gaming sessions desire to +start from scratch. In some cases - they may even want to download a new +container image. + +For these cases, it is important that the startup latency of the full +rescheduling of a new pod is avoided. + +#### Fast restart on failure + +There are AI/ML scenarios where an entire workload is delayed if one of the Pods +is not running. 
Some Pods in this scenario may fail with a recoverable error, for example a
+dependency failure, or be killed by infrastructure (e.g. exec probe failures
+while containerd is slow to restart). In such cases, it is desirable to restart
+the container as fast as possible and not wait out the 5 minute maximum
+crashloop backoff. With the existing maximum crashloop backoff timeout, a
+failure of a single pod in a highly-coupled workload can cause a cascade in the
+workload leading to an overall delay of (much) greater than the 5 minute
+backoff.
+
+The typical pattern here today is to restart the container quickly, but with an
+intermittently failing dependency (e.g. the network is down for some time, or
+there is an intermittent issue with a GPU), the container fails repeatedly and
+the backoff eventually reaches its maximum. When the dependency goes back to
+green, the container is not restarted immediately because it has already
+reached the maximum crash loop backoff duration of 5 minutes.
+
+#### Sidecar containers fast restart
+
+There are cases when the Pod consists of a user container implementing business
+logic and a sidecar providing networking access (Istio), logging (OpenTelemetry
+collector), or orchestration features. In some cases sidecar containers are
+critical for the user container to function, as they provide the basic
+infrastructure for the user container to run in. In such cases it is considered
+beneficial to not apply the exponential backoff to the sidecar container
+restarts and keep it at a low constant value.
+
+This is especially true for cases when the sidecar is killed by infrastructure
+(e.g. OOMKill), as the kill may have happened for reasons independent of the
+sidecar's functionality.
+
+
+### Notes/Constraints/Caveats (Optional)
+
+
+
+### Risks and Mitigations
+
+
+
+The biggest risk of this proposal is reducing the decay of the _default_
+CrashLoopBackoff: doing so too severely compromises node stability, risking that
+the kubelet becomes too slow to respond and that pod lifecycle latency
+increases, or worse, causing entire nodes to crash if kubelet takes up too much
+CPU or memory. Since each phase transition for a Pod also has an accompanying
+API request, if the requests become rapid enough due to fast enough churn of
+Pods through CrashLoopBackoff phases, the central API server could become
+unresponsive, effectively taking down an entire cluster.
+
+The same risk exists for the per Node feature, which, while not the default, by
+design allows a more severe reduction in the decay behavior. In the worst case,
+it could cause nodes to fully saturate with near-instantly restarting pods that
+will never recover, risking similar issues as above: taking down nodes or at
+least nontrivially slowing kubelet, or increasing the API requests to store
+backoff state so significantly that the central API server is unresponsive and
+the cluster fails.
+
+Some observations and analysis were made to quantify these risks going into
+alpha. In the [Kubelet Overhead Analysis](#kubelet-overhead-analysis), the code
+paths all restarting pods go through result in 5 obvious `/pods` API server
+status updates; when observed empirically in an active cluster, while we do see
+an initial surge in `/pods` API traffic to ~5 QPS when deploying 110 mass
+crashing pods for our tests, even with instantly crashing pods and
+instantaneously restarting CrashLoopBackOff behavior, `/pods` API requests
+quickly normalized to ~2 QPS due to rate limiting.
In the same tests, runtime CPU usage increased by 10x and the API server CPU
+usage increased by 2x.
+
+For both of these changes, passing them through the existing SIG-scalability
+tests, while pursuing manual and more detailed periodic benchmarking during the
+alpha period, increases confidence in the changes and lets us explore the
+possibility of reducing the values further in the future.
+
+In the meantime, during alpha, naturally the first line of defense is that the
+enhancements, even the reduced "default" baseline curve for CrashLoopBackoff,
+are not usable by default and must be opted into. In this specific case they are
+opted into separately with different alpha feature gates, so clusters will only
+be affected by each risk if the cluster operator enables the new features during
+the alpha period.
+
+Beyond this, there are two main mitigations during alpha: conservatism in
+changes to the default behavior based on prior stress testing, and limiting any
+further overrides to be opt-in per Node, and only by users with the permissions
+to modify the kubelet configuration -- in other words, a cluster operator
+persona.
+
+The alpha changes to the _default_ backoff curve were chosen because they meet
+emerging use cases and user sentiment from the canonical feature request issue
+and research by the author, AND because they are indicated to be safe changes
+based on initial benchmarking and analysis. The reduction of the initial value
+from 10s to 1s and the reduction of the backoff maximum from 5 minutes to 1
+minute introduce:
+
+* 5 excess restarts per pod in the first five minutes
+  * the first 3 excess in the first 30 seconds
+  * and the last 2 excess between 150 and 300 seconds
+* an excess of 5 restarts every five minutes compared to today, given the change
+  in the backoff maximum
+
+For a hypothetical node with the default maximum of 110 pods all stuck in a
+simultaneous 10s CrashLoopBackoff, at its most efficient under the new curve
+this would result in a new restart for each pod every second, and API requests
+to the API server to change the state transition would increase at their fastest
+period from ~550 requests/10s to ~5500 requests/10s, or 10x. In conjunction, the
+lowered backoff maximum would continuously add 5 excess restarts every five
+minutes compared to today's behavior, so each crashing pod would be adding an
+excess of 25 pod state transition API requests, for a total of 2750 excess API
+requests for a node fully saturated with crashing pods.
+
+For the per node case, because the change could be more significant, with nearly
+trivial (when compared to pod startup metrics) maximum allowable backoffs of 1s,
+more risk to node stability is expected. This change is of particular interest
+for testing by end users during the alpha period, which is why it is still
+included, but only as an opt-in. In this case, for a hypothetical node with the
+default maximum of 110 pods, each with one crashing container, all stuck in a
+simultaneous 1s maximum CrashLoopBackoff, as above, at its most efficient this
+would result in a new restart for each pod every second, and therefore the API
+requests to change the state transition would be expected to increase from ~550
+requests/10s to ~5500 requests/10s, or 10x.
In addition, since the maximum backoff would be lowered, an ideally
+fast-crashing pod would continue to restart more often than under today's
+behavior, adding 305 excess restarts within the first 5 minutes and 310 excess
+restarts every 5 minutes after that; each crashing pod would be contributing an
+excess of ~1550 pod state transition API requests, and a node fully saturated
+with 110 crashing pods would be adding 170,500 new pod transition API requests
+every five minutes, which is an excess of ~5,680 requests/10s.
+<<[UNRESOLVED kubernetes default for the kubelet client rate limit and how this
+changes by machine size]>> <<[/UNRESOLVED]>>
+
+
+## Design Details
+
+
+
+### Front loaded decay curve methodology
+
+As mentioned above, today the standard backoff curve is a 2x exponential decay
+starting at 10s and capping at 5 minutes, resulting in a composite of the
+standard hockey-stick exponential decay graph followed by a linear rise until
+the heat death of the universe, as depicted below:
+
+![A graph showing the backoff decay for a Kubernetes pod in
+CrashLoopBackoff](./crashloopbackoff-succeedingcontainer.png "CrashLoopBackoff
+decay")
+
+Remember that the backoff counter is reset if containers run longer than 10
+minutes, so in the worst case where a container always exits after 9:59:59, this
+means that in the first 30 minute period the container will restart twice. In a
+more easily digestible example used in the models below, for a fast exiting
+container crashing every 10 seconds, in the first 30 minutes the container will
+restart about 10 times, with the first four restarts in the first 5 minutes.
+
+Why change the initial value of the backoff curve instead of its rate, or why
+not change the decay function entirely to other well known equations (like
+functions resulting in curves that are linear, parabolic, sinusoidal, etc.)?
+
+Exponential decay, particularly at a rate of 2x, is commonly used for software
+retry backoff as it has the nice properties of starting restarts at a low value,
+but penalizing repeated crashes harshly, to protect primarily against
+unrecoverable failures. In contrast, we can interpret linear curves as
+penalizing every failure the same, or parabolic and sinusoidal curves as giving
+our software a "second chance" and forgiving later failures more. For a default
+restart decay curve, where the cause of the restart cannot be known, 2x
+exponential decay still best models the desired properties, as the biggest risk
+is unrecoverable failures causing "runaway" containers to overload kubelet.
+
+To determine the effect in the abstract of changing the initial value on current
+behavior, we modeled the change in the starting value of the decay from 10s to
+1s, 250ms, or even 25ms.
+
+!["A graph showing the decay curves for different initial values"](differentinitialvalues.png
+"Alternate CrashLoopBackoff initial values")
+
+For today's decay rate, the first restart is within the first 10s, the second
+within the first 30s, and the third within the first 70s. Using those same time
+windows to compare alternate initial values, for example changing the initial
+value to 1s, we would instead have 3 restarts in the first time window, 1
+restart within the second time window, and two more restarts within the third
+time window. As seen below, this type of change gives us more restarts earlier,
+but even at 250ms or 25ms initial values, each approaches a similar rate of
+restarts after the third time window.
+
+![A graph showing different exponential backoff decays for initial values of
+10s, 1s, 250ms and 25ms](initialvaluesandnumberofrestarts.png "Changes to decay
+with different initial values")
+
+Among these modeled initial values, we would get between 3 and 7 excess restarts
+per backoff lifetime, mostly within the first three time windows matching
+today's restart behavior.
+
+The above modeled values converge like this because of the harshly increasing
+rate caused by the exponential growth, but also because they grow to the same
+maximum value -- 5 minutes. By layering on alternate models for maximum backoff,
+we observe more restarts for longer:
+
+![A graph showing different exponential backoff decays for 2 initial values and
+2 different maximum backoff
+thresholds](initialvaluesandmaxonnumberofrestarts.png "Changes to decay with
+different initial and maximum backoff values")
+
+While this graph includes several variations -- initial values of 10s in red and
+pink, initial values of 1s in purple and lilac, and both of these initial values
+modeled with either a 60s or 30s maximum backoff -- hopefully it can impress
+upon the reader how including a reduction to maximum backoff increases the
+number of excess restarts later in the cycle. In the most extreme case modeled
+above, with an initial value of 1s and a maximum backoff of 30s, there are 2
+excess restarts in the first time window, 1 excess in the second time window, 3
+excess in the fourth time window and 4 excess in the fifth time window, for a
+total of 10 excess restarts compared to today's behavior.
+
+As a first estimate for alpha, in line with maximums discussed on
+[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), the
+initial curve is selected at initial=1s / max=1 minute. During the alpha period
+this will be further investigated against kubelet capacity, potentially
+targeting something closer to an initial value near 0s and a cap of 10-30s. To
+further restrict the blast radius of this change before full and complete
+benchmarking is worked up, this is gated by its own feature gate,
+`ReduceDefaultCrashLoopBackoffDecay`.
+
+### Per node config
+
+For some users in
+[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), any
+delay over 1 minute at any point is just too slow, even if the workload is
+legitimately crashing. A common refrain is that for independently recoverable
+errors, especially system infrastructure events or recovered external
+dependencies, or for absolutely non-negotiably critical sidecar pods, users
+would rather poll more often or more intelligently to reduce the amount of time
+a workload has to wait to try again after a failure. In the extreme cases, users
+want to be able to configure (by container, node, or exit code) the backoff to
+close to 0 seconds. This KEP considers it out of scope to implement fully
+user-customizable behavior, and too risky to node stability, without full and
+complete benchmarking, to allow legitimately crashing workloads to have a
+backoff of 0, but it is in scope for the first alpha to provide users a way to
+opt in to an even faster restart behavior.
+
+So why opt in by node? In fact, the initial proposal of this KEP for 1.31 was to
+opt in by Pod, to minimize the blast radius of a given Pod's worst case restart
+behavior. In 1.31 this was proposed using a new `restartPolicy` value in the Pod
+API, described in [Alternatives Considered here](#restartpolicy-rapid). Concerns
+with this approach fell into two buckets: 1.
technical issues with the API
+(which could have been resolved by a different API approach), and 2. design
+issues with exposing this kind of configuration to users without holistic
+insight into cluster operations, for example users who might have pod manifest
+permissions in their namespace but not in other namespaces in the same cluster
+that might depend on the same kubelet. For 1.32, we were looking to address the
+second issue by moving the configuration somewhere we could better guarantee a
+cluster operator type persona would have exclusive access to. In addition,
+initial manual stress testing and benchmarking indicated that even in the
+unlikely case of mass pathologically crashing and instantly restarting pods
+across an entire node, cluster operations proceeded with acceptable latency,
+disk, CPU and memory usage. Worker polling loops, context timeouts, the
+interaction between various other backoffs, as well as API server rate limiting
+made up the gap in the stability of the system. Therefore, to simplify both the
+implementation and the API surface, this 1.32 proposal puts forth that the
+opt-in will be configured per node via kubelet configuration.
+
+Kubelet configuration is governed by two main input points: 1) command-line flag
+based config, and 2) configuration following the API specification of the
+`kubelet.config.k8s.io/v1beta1 KubeletConfiguration` Kind, which is passed to
+kubelet as a config file or, beta as of Kubernetes 1.30, a config directory
+([ref](https://kubernetes.io/docs/tasks/administer-cluster/kubelet-config-file/)).
+Since this is a per-node configuration that will likely be set on a subset of
+nodes, or potentially even differently per node, it's important that it can be
+manipulated per node. By default `KubeletConfiguration` is intended to be shared
+between nodes, but the beta feature for drop-in configuration files in a
+colocated config directory circumvents this. However, it is still the opinion of
+the author that this setting is better manipulated with a command-line flag, so
+lifecycle tooling that configures nodes can expose it more transparently.
+
+Exposed command-line configuration for kubelet is defined by [`struct
+KubeletFlags`](https://github.com/kubernetes/kubernetes/blob/release-1.31/cmd/kubelet/app/options/options.go#L54)
+and merged with configuration files in [`kubelet/server.go
+NewKubeletCommand`](https://github.com/kubernetes/kubernetes/blob/release-1.31/cmd/kubelet/app/server.go#L141).
+To expose the configuration of this feature, a new field will be added to the
+`KubeletFlags` struct called `NodeMaxCrashLoopBackOff`, of `int32` type and
+representing seconds, that will be validated in kubelet at runtime as a value
+between 1 and 300 inclusive.
+
+### Refactor of recovery threshold
+
+A simple change to the maximum backoff would naturally come with a modification
+of the backoff counter reset threshold -- as it is currently calculated based on
+2x the maximum backoff. Without any other modification, as a result of this KEP,
+default containers would be "rewarded" by having their backoff counter set back
+to 0 for running successfully for 2 \* 1 minute = 2 minutes (instead of for
+2 \* 5 minutes = 10 minutes like it is today); containers on nodes with an
+override could be rewarded for running successfully for as little as 2 seconds
+if they are configured with the minimum allowable backoff of 1s.
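+
+A minimal sketch of that proportional rule, and of the flat 10 minute constant
+this KEP proposes to keep for container restarts instead, is below; the names
+are hypothetical and are not the actual `client-go` or kubelet symbols.
+
+```go
+// Illustrative sketch only: compares the reset window produced by today's
+// proportional rule (2x the maximum backoff) with a flat 10 minute constant.
+package main
+
+import (
+    "fmt"
+    "time"
+)
+
+// proportionalResetWindow models today's rule: the backoff counter resets after
+// a container runs successfully for 2x the maximum backoff.
+func proportionalResetWindow(maxBackoff time.Duration) time.Duration {
+    return 2 * maxBackoff
+}
+
+// flatResetWindow models what this KEP proposes to keep for container restarts.
+const flatResetWindow = 10 * time.Minute
+
+func main() {
+    for _, m := range []time.Duration{5 * time.Minute, time.Minute, time.Second} {
+        fmt.Printf("max backoff %v: proportional reset %v, proposed flat reset %v\n",
+            m, proportionalResetWindow(m), flatResetWindow)
+    }
+}
+```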
+
+From a technical perspective, the granularity of the associated worker polling
+loops governing restart behavior is between 1 and 10 seconds, so a reset value
+under 10 seconds is effectively meaningless (until and unless those loops
+increase in speed or we move to evented PLEG). From a user perspective, it does
+not seem that there is any particular end user value in artificially preserving
+the current 10 minute recovery threshold as part of this implementation, since
+it was an arbitrary value in the first place. However, since late recovery as a
+category of problem space is expressly a non-goal of this KEP, and in the
+interest of reducing the number of changed variables during the alpha period to
+better observe the ones previously enumerated, this proposal intends to maintain
+that 10 minute recovery threshold anyway.
+
+Forecasting that the recovery threshold for CrashLoopBackOff may be better
+served by being configurable in the future, or at least separated in the code
+from all other uses of `client_go.Backoff` for whatever future enhancements
+address the late recovery problem space, the mechanism proposed here is to
+redefine `client_go.Backoff` to accept alternate functions for
+[`client_go.Backoff.hasExpired`](https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/util/flowcontrol/backoff.go#L178),
+and to configure the `client_go.Backoff` object created for use by the kube
+runtime manager for container restart backoff with a function that compares
+against a flat threshold equivalent to the current 10 minute recovery window.
+
+### Conflict resolution
+
+Depending on the enablement state of the alpha gates and the configuration of an
+individual node, some conflicts may need to be resolved to determine the initial
+value or maximum backoff for a given node. See the [Upgrade/Downgrade Strategy
+section](#upgrade--downgrade-strategy) for more details.
+
+### Kubelet overhead analysis
+
+As it's intended that, after this KEP, pods will restart more often than in
+current Kubernetes, it's important to understand what the kubelet does during
+pod restarts. The most informative code path for that is through (all links to
+1.31) [`kubelet/kubelet.go
+SyncPod`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kubelet.go),
+[`kubelet/kuberuntime/kuberuntime_manager.go
+SyncPod`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1052),
+[`kubelet/kuberuntime/kuberuntime_container.go
+startContainer`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kuberuntime/kuberuntime_container.go#L195),
+[`kubelet/kuberuntime/kuberuntime_container.go
+killContainer`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kuberuntime/kuberuntime_container.go#L761),
+[`kubelet/kuberuntime/kuberuntime_manager.go
+killPodWithSyncResult`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1466),
+[`kubelet/kubelet.go
+SyncTerminatingPod`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kubelet.go#L1996),
+and [`kubelet/kubelet.go
+SyncTerminatedPod`](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/kubelet.go#L2143).
+ +!["A sequence diagram showing Kubelet and Container Runtime code paths +responsible for behaviors common to all pod +restarts"](kubeletvsruntime-restartresponsibility.png "High level Kubelet and +Container Runtime restart code paths") + + +As you might imagine this is a very critical code path with hooks to many +in-flight features; [Appendix A](#appendix-a) includes a more complete list (yet +still not as exhaustive as the source code), but the following are selected as +the most important behaviors of kubelet during a restart to know about, that are +not currently behind a feature gate. + +After a Pod is in Terminating phase, kubelet: + +* Clears up old containers using container runtime +* Stops probes and pod sandbox, and unmounts volumes and unregisters + secrets/configmaps (since Pod was in a terminal phase) +* While still in the backoff window, wait for network / attach volumes / + register pod to secret and configmap managers / re-download image secrets if + necessary + +Once the current backoff window has passed, kubelet: + +* Potentially re-downloads the image (utilizing network + IO and blocks other + image downloads) if image pull policy specifies it + ([ref](https://github.com/kubernetes/kubernetes/blob/release-1.31/pkg/kubelet/images/image_manager.go#L135)). +* Recreates the pod sandbox and probe workers +* Recreates the containers using container runtime +* Runs user configured prestart and poststart hooks for each container +* Runs startup probes until containers have started (startup probes may be more + expensive than the readiness probes as they often configured to run more + frequently) +* Redownloads all secrets and configmaps, as the pod has been unregistered and + reregistered to the managers, while computing container environment/EnvVarFrom +* Application runs through its own initialization logic (typically utilizing + more IO) +* Logs information about all container operations (utilizing disk IO and + “spamming” logs) + +The following diagram showcases these same highlights more visually and in +context of the responsible API surface (Kubelet or Runtime aka CRI). + +!["A diagram showing Kubelet and Container Runtime code paths responsible for +behaviors common to all pod restarts"](code-diagram-for-restarts.png "Kubelet +and Container Runtime restart code paths") + +``` + <<[UNRESOLVED answer these question from original PR]>> + >Does this [old container cleanup using containerd] include cleaning up the image filesystem? There might be room for some optimization here, if we can reuse the RO layers. + to answer question: looks like it is per runtime. need to check about leasees. also part of the value of this is to restart the sandbox. +``` + +### Benchmarking + +Again, let it be known that by definition this KEP will cause pods to restart +faster and more often than the current status quo and such a change is desired. +However, to do so as safely as possible, it is required that during the alpha +period, we reevaluate the SLIs and SLOs and benchmarks related to this change and +expose clearly the methodology needed for cluster operators to be able to quantify the +risk posed to their specific clusters on upgrade. + +To best reason about the changes in this KEP, we require the ability to +determine, for a given percentage of heterogenity between "Succeeded" +terminating pods, and "Failed" (aka crashing) terminating pods whose +`restartPolicy: Always`: + * what is the load and rate of Pod restart related API requests to the API + server? 
+ * what are the performance (memory, CPU, and pod start latency) effects on the
+   kubelet component? What about when considering the effects of different
+   plugins (e.g. CSI, CNI)?
+
+Today there are alpha SLIs in Kubernetes that can observe that impact in
+aggregate:
+* Kubelet component CPU and memory
+* `kubelet_http_inflight_requests`
+* `kubelet_http_requests_duration_seconds`
+* `kubelet_http_requests_total`
+* `kubelet_pod_worker_duration_seconds`
+* `kubelet_runtime_operations_duration_seconds`
+* `kubelet_pod_start_duration_seconds`
+* `kubelet_pod_start_sli_duration_seconds`
+
+In addition, estimates of the change in API requests under the currently
+suggested backoff changes are included in
+[Risks and Mitigations](#risks-and-mitigations) and were deciding factors in the
+specific changes to the backoff curves. Since the changes in this proposal are
+deterministic, this is pre-calculable for a given heterogeneity of quantity and
+rate of restarting pods.
+
+In addition, the `kube-state-metrics` project already implements
+restart-specific metadata for metrics that can be used to observe pod restart
+latency in more detail, including:
+* `kube_pod_container_status_restarts_total`
+* `kube_pod_restart_policy`
+* `kube_pod_start_time`
+* `kube_pod_created`
+
+During the alpha period, these metrics, the SIG-Scalability benchmarking tests,
+added kubelet performance tests, and manual benchmarking by the author against
+`kube-state-metrics` will be used to answer the above questions, tying together
+the container restart policy (inherited or declared), the terminal state of a
+container before restarting, and the number of container restarts, to articulate
+the observed rate and load of restart related API requests and the performance
+effects on kubelet.
+
+### Relationship with Job API podFailurePolicy and backoffLimit
+
+The Job API provides its own API surface for describing alternative restart
+behaviors, from [KEP-3329: Retriable and non-retriable Pod failures for
+Jobs](https://github.com/kubernetes/enhancements/tree/master/keps/sig-apps/3329-retriable-and-non-retriable-failures),
+in beta as of Kubernetes 1.30. The following example from that KEP shows the new
+configuration options: `backoffLimit`, which controls the number of retries on
+failure, and `podFailurePolicy`, which controls which types of workload exit
+codes or kube system events are counted or ignored against that `backoffLimit`.
+
+```yaml
+apiVersion: batch/v1
+kind: Job
+spec:
+  [ . . . ]
+  backoffLimit: 3
+  podFailurePolicy:
+    rules:
+    - action: FailJob
+      onExitCodes:
+        containerName: main-job-container
+        operator: In
+        values: [1,2,3]
+    - action: Ignore
+      onPodConditions:
+      - type: DisruptionTarget
+```
+
+The implementation of KEP-3329 is entirely in the Job controller, and the
+restarts are not handled by kubelet at all; in fact, use of this API is only
+available if the `restartPolicy` is set to `Never` (though
+[kubernetes#125677](https://github.com/kubernetes/kubernetes/issues/125677)
+wants to relax this validation to allow it to be used with other `restartPolicy`
+values). As a result, to expose the new backoff curve to Jobs using this
+feature, the updated backoff curve must also be implemented in the Job
+controller. This is currently considered out of scope of the first alpha
+implementation of this design.
+
+### Relationship with ImagePullBackOff
+
+ImagePullBackOff is used, as the name suggests, only when a container needs to
+pull a new image.
If the image pull fails, a backoff decay is used to make later retries on the
+image download wait longer and longer. This backoff is configured
+([here](https://github.com/kubernetes/kubernetes/blob/release-1.30/pkg/kubelet/kubelet.go#L606))
+independently from the backoff for container restarts
+([here](https://github.com/kubernetes/kubernetes/blob/release-1.30/pkg/kubelet/kubelet.go#L855)).
+
+This KEP considers changes to ImagePullBackOff as out of scope, so during
+implementation this will keep the same backoff. This is both to reduce the
+number of variables during the benchmarking period for the restart counter, and
+because the problem space of ImagePullBackOff could likely be handled by a
+completely different pattern, as unlike with CrashLoopBackoff the types of
+errors with ImagePullBackOff are less variable and better interpretable by the
+infrastructure as recoverable or non-recoverable (e.g. 404s).
+
+### Relationship with k/k#123602
+
+It was observed that there is a bug with the current requeue behavior, described
+in https://github.com/kubernetes/kubernetes/issues/123602. The first restart
+will have 0 seconds delay instead of the advertised initial value delay, because
+the backoff object will not be initialized until a key is generated, which
+doesn't happen until after the first restart of a pod (see
+[ref](https://github.com/kubernetes/kubernetes/blob/a7ca13ea29ba5b3c91fd293cdbaec8fb5b30cee2/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L1336-L1349)).
+That first restart will also not impact future backoff calculations, so the
+observed behavior is closer to:
+
+* 0 seconds delay for first restart
+* 10 seconds for second restart
+* 10 * 2^(restart_count - 2) for subsequent restarts
+
+By watching a crashing pod, we can observe that it does not enter a
+CrashLoopBackoff state or behave as advertised until after that first "free"
+restart:
+
+```
+# a pod that crashes every 10s
+thing   0/1   Pending             0              29s
+thing   0/1   Pending             0              97s
+thing   0/1   ContainerCreating   0              97s
+thing   1/1   Running             0              110s
+thing   0/1   Completed           0              116s   # first run completes
+thing   1/1   Running             1 (4s ago)     118s   # no crashloopbackoff observed, 1 restart tracked
+thing   0/1   Completed           1 (10s ago)    2m4s   # second run completes
+thing   0/1   CrashLoopBackOff    1 (17s ago)    2m19s  # crashloopbackoff observed
+thing   1/1   Running             2 (18s ago)    2m20s  # third run starts
+thing   0/1   Completed           2 (23s ago)    2m25s
+thing   0/1   CrashLoopBackOff    2 (14s ago)    2m38s  # crashloopbackoff observed again
+thing   1/1   Running             3 (27s ago)    2m51s
+thing   0/1   Completed           3 (32s ago)    2m56s
+```
+
+Being able to predict the exact number of restarts, or remedying up to 10
+seconds of delay relative to the advertised behavior, is not the ultimate goal
+of this KEP, though certain assumptions were made when calculating risks and
+mitigations and analyzing existing behavior that are affected by this bug. Since
+the behavior is already changing as part of this KEP, and similar code paths
+will be changed, it is within scope of this KEP to address this bug if it is a
+blocker to implementation for alpha; it can wait until beta otherwise. This is
+represented below in the [Graduation Criteria](#graduation-criteria).
+
+### Test Plan
+
+
+[x] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+ +This feature requires two levels of testing: the regular enhancement testing +(described in the template below) and the stress/benchmark testing required to +increase confidence in ongoing node stability given heterogeneous backoff timers +and timeouts. + +Some stress/benchmark testing will still be developed as part of the +implementation of this enhancement, including the kubelet_perf tests indicated +in the e2e section below. + +Some of the benefit of pursuing this change in alpha is to also have the +opportunity to run against the existing SIG-Scalability performance and +benchmarking tests within an alpha candidate. In addition, manual benchmark +testing with GKE clusters can be performed by the author as described in [Design +Details](#benchmarking), and evaluated as candidates for formal, periodic +benchmark testing in the Kubernetes testgrid. + +##### Prerequisite testing updates + + + +* Test coverage of proper requeue behavior; see + https://github.com/kubernetes/kubernetes/issues/123602 + +##### Unit tests + + + + + + +- <<[UNRESOLVED whats up with this]>> + `kubelet/kuberuntime/kuberuntime_manager_test`: **could not find a successful + coverage run on + [prow](https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-coverage-unit/1800947623675301888)** + <<[/UNRESOLVED]>> + +##### Integration tests + + + + + +- k8s.io/kubernetes/test/integration/kubelet: + https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-integration-master/1800944856244162560 + * test with and without feature flags enabled + +##### e2e tests + + + +- Crashlooping container that restarts some number of times (ex 10 times), + timestamp the logs and read it back in the test, and expect the diff in those + time stamps to be minimum the backoff, with a healthy timeout +- k8s.io/kubernetes/test/e2e/node/kubelet_perf: for a given percentage of +heterogenity between "Succeeded" terminating pods, and crashing pods whose +`restartPolicy: Always` or `restartPolicy: OnFailure`, + - what is the load and rate of Pod restart related API requests to the API + server? + - what are the performance (memory, CPU, and pod start latency) effects on the + kubelet component? + - With different plugins (e.g. 
CSI, CNI) + +### Graduation Criteria + +#### Alpha + +- New `int32 NodeMaxCrashLoopBackOff` field in `KubeletFlags` struct, validated + to a minimum of 1 and a maximum of 300, used when + `EnableKubeletCrashLoopBackoffMax` feature flag enabled, to customize + CrashLoopBackOff per node +- New `ReduceDefaultCrashLoopBackoffDecay` feature flag which, when enabled, + changes CrashLoopBackOff behavior (and ONLY the CrashLoopBackOff behavior) to + a 2x decay curve starting at 1s initial value, 1 minute maximum backoff for + all workloads, therefore affecting workload configured with `restartPolicy: + Always` or `restartPolicy: OnFailure` + - Requires a refactor of image pull backoff from container restart backoff + object +- Maintain current 10 minute recovery threshold by refactoring backoff counter + reset threshold and explicitly implementing container restart backoff behavior + at the current 10 minute recovery threshold +- Initial e2e tests completed and enabled + * Feature flag on or off +- Test coverage of proper requeue behavior; see + https://github.com/kubernetes/kubernetes/issues/123602 +- Actually fix https://github.com/kubernetes/kubernetes/issues/123602 if this + blocks the implementation, otherwise beta criteria +- Low confidence in the specific numbers/decay rate + + +#### Beta + +- Gather feedback from developers and surveys +- High confidence in the specific numbers/decay rate +- Benchmark restart load methodology and analysis published and discussed with + SIG-Node +- Discuss PLEG polling loops and its effect on specific decay rates +- Additional e2e and benchmark tests, as identified during alpha period, are in + Testgrid and linked in KEP + +#### GA + +- 2 Kubernetes releases soak in beta +- Completely finalize the decay curves and document them thoroughly +- Remove the feature flag code +- Confirm the exponential backoff decay curve related tests and code are still + in use elsewhere and do not need to be removed +- Conformance test added for per-node configuration + + + + +### Upgrade / Downgrade Strategy + + + +For `ReduceDefaultCrashLoopBackoffDecay`: + +For an existing cluster, no changes are required to configuration, invocations +or API objects to make an upgrade. + +To use the enhancement, the alpha feature gate is turned on. In the future when +(/if) the feature gate is removed, no configurations would be required to be +made, and the default behavior of the baseline backoff curve would -- by design +-- be changed. + +To stop use of this enhancement, the entire cluster must be restarted or just +kubelet must restarts with the `ReduceDefaultCrashLoopBackoffDecay` feature gate +turned off. Since kubelet does not cache the backoff object, on kubelet restart +Pods will start at the beginning of their backoff curve. The exact backoff curve +a given Pod will use will be either the original one with initial value 10s, or +the per-node configured maximum backoff if the +`EnableKubeletCrashLoopBackoffMax` feature gate is turned on. + +For `EnableKubeletCrashLoopBackoffMax`: + +For an existing cluster, no changes are required to configuration, invocations +or API objects to make an upgrade. + +To make use of this enhancement, on cluster upgrade, the +`EnableKubeletCrashLoopBackoffMax` feature gate must first be turned on for the +cluster. Then, if any nodes need to use a different backoff curve, their kubelet +must be completely redeployed either in the same upgrade or after that upgrade +with the `NodeMaxCrashLoopBackOff` kubelet command line flag set. 
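+
+For concreteness, the sketch below shows roughly what that flag plumbing and
+validation could look like; the struct is reduced to the one field this KEP
+names, and the flag spelling itself is hypothetical.
+
+```go
+// Hedged sketch of the proposed (not yet existing) kubelet flag wiring; the
+// real KubeletFlags struct lives in cmd/kubelet/app/options.
+package main
+
+import (
+    "flag"
+    "fmt"
+    "os"
+)
+
+type KubeletFlags struct {
+    NodeMaxCrashLoopBackOff int32 // seconds; 0 means unset (use the default curve)
+}
+
+func main() {
+    f := KubeletFlags{}
+    v := flag.Int("node-max-crashloop-backoff", 0,
+        "hypothetical flag: maximum CrashLoopBackOff delay in seconds for this node (1-300)")
+    flag.Parse()
+    f.NodeMaxCrashLoopBackOff = int32(*v)
+
+    // Mirrors the validation described in this KEP: only honored when the
+    // EnableKubeletCrashLoopBackoffMax feature gate is on, and must be between
+    // 1 and 300 seconds inclusive.
+    if f.NodeMaxCrashLoopBackOff != 0 &&
+        (f.NodeMaxCrashLoopBackOff < 1 || f.NodeMaxCrashLoopBackOff > 300) {
+        fmt.Fprintln(os.Stderr, "invalid --node-max-crashloop-backoff: must be 1-300 seconds")
+        os.Exit(1)
+    }
+    fmt.Printf("node max CrashLoopBackOff: %ds\n", f.NodeMaxCrashLoopBackOff)
+}
+```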
+
+To stop use of this enhancement, there are two options.
+
+On a per-node basis, nodes can be completely redeployed with the
+`NodeMaxCrashLoopBackOff` kubelet command line flag unset. Since kubelet does
+not cache the backoff object, on kubelet restart they will start from the
+beginning of their backoff curve (either the original one with initial value
+10s, or the new baseline with initial value 1s, depending on whether they've
+turned on the `ReduceDefaultCrashLoopBackoffDecay` feature gate).
+
+<<[UNRESOLVED]>> Specify what happens when `NodeMaxCrashLoopBackOff` is set but
+`EnableKubeletCrashLoopBackoffMax` is off: likely log a warning but ignore the
+flag (users might set it and wonder why it has no effect). <<[/UNRESOLVED]>>
+
+Or, the entire cluster can be restarted with the
+`EnableKubeletCrashLoopBackoffMax` feature gate turned off. In this case, any
+Node configured with a different backoff curve will instead use the default
+backoff curve. Again, since the cluster was restarted and Pods were redeployed,
+they will not maintain prior state and will start at the beginning of their
+backoff curve. Also, again, the exact backoff curve they will use will be either
+the original one with initial value 10s, or the new baseline with initial value
+1s, depending on whether they've turned on the
+`ReduceDefaultCrashLoopBackoffDecay` feature gate.
+
+#### Conflict resolution
+
+If on a given node at a given time the per-node configured maximum backoff is
+lower than the initial value, the initial value for that node will instead be
+set to the configured maximum. For example, if
+`ReduceDefaultCrashLoopBackoffDecay` is turned off so the initial value is 10s,
+but `EnableKubeletCrashLoopBackoffMax` is turned on and a given node is
+configured to a maximum of `1s`, then the initial value for that node will be
+configured to 1s. In other words, operator-invoked configuration will have
+precedence over the default if it is faster.
+
+If on a given node at a given time the per-node configured maximum backoff is
+lower than 1 second or higher than 300 seconds, validation will fail and the
+kubelet will crash.
+
+If on a given node at a given time the per-node configured maximum backoff is
+higher than the current initial value, but within validation limits (at most
+300 seconds), it will be honored. In other words, operator-invoked configuration
+will have precedence over the default, even if it is slower, as long as it is
+valid.
+
+scenario | ReduceDefaultCrashLoopBackoffDecay | EnableKubeletCrashLoopBackoffMax (`NodeMaxCrashLoopBackOff` value) | Effective initial value
+---------|---------|----------|---------
+_today's behavior_ | disabled | disabled | 10s
+_new default only_ | enabled | disabled | 1s
+_faster per node config_ | disabled | 2s | 2s
+" | enabled | 2s | 2s
+_slower per node config_ | enabled | 10s | 10s
+" | enabled | 300s | 300s
+" | disabled | 11s | 11s
+" | disabled | 300s | 300s
+_invalid per node config_ | enabled | 301s | kubelet crashes
+" | disabled | 301s | kubelet crashes
+
+
+### Version Skew Strategy
+
+
+
+For both the default backoff curve change and the per-node configuration, no
+coordination is needed between the control plane and the nodes; all behavior
+changes are local to the kubelet component and its startup configuration. An n-3
+kube-proxy, n-1 kube-controller-manager, or n-1 kube-scheduler without this
+feature available is not affected when this feature is used, since the feature
+does not depend on kube-controller-manager or kube-scheduler. Code paths that
+will be touched are exclusively in the kubelet component.
+ +An n-3 kubelet without this feature available will behave like normal, with the +original CrashLoopBackOff behavior. It cannot ingest flags for per node config +and if one is specified on start up, kubelet flag validation will fail. + +## Production Readiness Review Questionnaire + +<<[UNRESOLVED removed when changing from 1.31 proposal to 1.32 proposal, +incoming in separate PR]>> <<[/UNRESOLVED]>> + +## Implementation History + + + +* 04-23-2024: Problem lead opted in by SIG-Node for 1.31 target + ([enhancements#4603](https://github.com/kubernetes/enhancements/issues/4603)) +* 06-04-2024: KEP proposed to SIG-Node focused on providing limited alpha + changes to baseline backoff curve, addition of opt-in `Rapid` curve, and + change to constant backoff for `Succeeded` Pods +* 06-06-2024: Removal of constant backoff for `Succeeded` Pods +* 09-09-2024: Removal of `RestartPolicy: Rapid` in proposal, removal of PRR, in + order to merge a provisional and address the new 1.32 design in a cleaner PR + +## Drawbacks + + + +CrashLoopBackoff behavior has been stable and untouched for most of the +Kubernetes lifetime. It could be argued that it "isn't broken", that most people +are ok with it or have sufficient and architecturally well placed workarounds +using third party reaper processes or application code based solutions, and +changing it just invites high risk to the platform as a whole instead of +individual end user deployments. However, per the [Motivation](#motivation) +section, there are emerging workload use cases and a long history of a vocal +minority in favor of changes to this behavior, so trying to change it now is +timely. Obviously we could still decide not to graduate the change out of alpha +if the risks are determined to be too high or the feedback is not positive. + +Though the issue is highly upvoted, on an analysis of the comments presented in +the canonical tracking issue +[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), 22 +unique commenters were requesting a constant or instant backoff for `Succeeded` +Pods, 19 for earlier recovery tries, and 6 for better late recovery behavior; +the latter is arguably even more highly requested when also considering related +issue [Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375). +Though an early version of this KEP also addressed the `Success` case, in its +current version this KEP really only addresses the early recovery case, which by +our quantitative data is actually the least requested option. That being said, +other use cases described in [User Stories](#user-stories) that don't have +quantitative counts are also driving forces on why we should address the early +recovery cases now. On top of that, compared to the late recovery cases, early +recovery is more approachable and easily modelable and improving benchmarking +and insight can help us improve late recovery later on (see also the related +discussion in Alternatives [here](#more-complex-heuristics) and +[here](#late-recovery)). + +## Alternatives + + + +### Global override + +Allow an override of the global constant of a maximum backoff period (or other +settings) in kubelet configuration. + +### Per exit code configuration + +One alternative is for new container spec values that allow individual containers to +respect overrides on the global timeout behavior depending on the exit reason. 
These overrides would exist for the following categories of exits:

* image download failures
* workload crash: any non-0 exit code from the workload itself
* infrastructure events: terminated by the system, e.g. exec probe failures,
  OOMKill, or other kubelet runtime errors
* success: a 0 exit code from the workload itself

These were selected because there are known use cases where changed restart
behavior would benefit workloads experiencing these categories of failures.

### `RestartPolicy: Rapid`

In the 1.31 version of this proposal, this KEP proposed a two-pronged approach
to revisiting the CrashLoopBackoff behaviors for common use cases:
1. modifying the standard backoff delay to start faster but decay to the same 5m
   threshold
2. allowing Pods to opt in to an even faster backoff curve with a lower max cap

For step (2), the method for Pods to opt in was a new enum value, `Rapid`, for
a Pod's `RestartPolicy`. In this case, Pods and restartable init (aka sidecar)
containers would be able to set a new OneOf value, `restartPolicy: Rapid`, to
opt in to an exponential backoff decay that starts at a lower initial value and
reaches a lower maximum cap. This proposal suggested we start with a new
initial value of 250ms and a cap of 1 minute, and analyze its impact on
infrastructure during alpha.

!["A graph showing today's decay curve against a curve with an initial value of
250ms and a cap of 1 minute for a workload failing every 10s"](todayvsrapid.png
"rapid vs today's CrashLoopBackoff")

**Why not?**: There was still a general community consensus that even though
this was opt-in, giving the power to reduce the backoff curve to users in
control of the pod manifest -- who as a persona are not necessarily users with
cluster-wide or at least node-wide visibility into load and scheduling -- was
too risky for global node stability.

In addition, adding a new value to an existing Pod spec enum, while convenient,
required detailed management of the version skew period, at minimum across 3
kubelet versions per the [API policy for new enum values in existing fields](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api_changes.md#new-enum-value-in-existing-field).
In practice this meant the API server and the kubelets across all nodes had to
be coordinated.

Firstly, `Rapid` must be a valid option for `restartPolicy` in the API server
(which would only be possible if/when the API server was updated), and secondly,
the `Rapid` value must be interpretable by all kubelets on every node.
Unfortunately, it is not possible for the API server to be aware of what version
each kubelet is on, so it cannot serve `Rapid` as `Always` preferentially to
each kubelet depending on its version. Instead, each kubelet must be able to
handle this value properly, both at an n-3 kubelet version and -- more easily --
at its contemporary kubelet version. For updated kubelet versions, each kubelet
would be able to detect if it has the feature gate on, and if so, interpret
`Rapid` to use the new rapid backoff curve; and if the feature gate is off,
interpret it instead as `Always`. But at earlier kubelet versions, `Rapid` must
be ignored in favor of `Always`.
Unfortunately for this KEP, the default value
for `restartPolicy` is `Never`, though even more unfortunately, it looks like
different code paths use different default values (thank you
[@tallclair](https://github.com/tallclair)!!;
[1](https://github.com/kubernetes/kubernetes/blob/a7ca13ea29ba5b3c91fd293cdbaec8fb5b30cee2/pkg/kubelet/container/helpers.go#L105)
defaults to `Always`,
[2](https://github.com/kubernetes/kubernetes/blob/a7ca13ea29ba5b3c91fd293cdbaec8fb5b30cee2/pkg/kubelet/kubelet_pods.go#L1713)
defaults to `OnFailure`,
[3](https://github.com/kubernetes/kubernetes/blob/a7ca13ea29ba5b3c91fd293cdbaec8fb5b30cee2/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L838-L859)
defaults to `Always`, and
[4](https://github.com/kubernetes/kubernetes/blob/a7ca13ea29ba5b3c91fd293cdbaec8fb5b30cee2/pkg/kubelet/status/status_manager.go#L554-L573)
defaults to `Never`), so if the kubelet drops unexpected enum values for
`restartPolicy`, a Pod with `Rapid` will be misconfigured by an old kubelet.

### Flat-rate restarts for `Succeeded` Pods

We start from the assumption that the "Succeeded" phase of a Pod in Kubernetes
means that all workloads completed as expected. Most often this is colloquially
referred to as exit code 0, as this exit code is what is caught by Kubernetes
for Linux containers.

The wording of the public documentation
([ref](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-restarts))
and the naming of the `CrashLoopBackOff` state itself imply that it is a
remedy for a container not exiting as intended, but the restart delay decay
curve is applied even to successful pods if their `restartPolicy = Always`. On
the canonical tracking issue for this problem, a significant number of requests
focus on how an exponential decay curve is inappropriate for workloads
completing as expected, and unnecessarily delays healthy workloads from being
rescheduled.

This alternative would vastly simplify and cut down on the restart delays for
workloads completing as expected, as detectable by their transition through the
"Succeeded" phase in Kubernetes. The target is to get as close as possible to
the kubelet's capacity to restart instantly, anticipated to be somewhere within
a 0-10s flat rate + jitter delay for each restart, pending benchmarking in
alpha.

Fundamentally, this change takes a stand that a successful exit of a workload
is intentional by the end user -- and by extension, if it has been configured
with `restartPolicy = Always`, that its impact on the Kubernetes infrastructure
when restarting is by end user design. This is in contrast to the prevailing
Kubernetes assumption that, on its own, the Pod API best models long-running
containers that rarely or never exit themselves with "Success"; features like
autoscaling, rolling updates, and enhanced workload types like StatefulSets
assume this, while other workload types like those implemented with the Job and
CronJob API better model workloads that do exit themselves, running until
Success or at predictable intervals. If this alternative were pursued, we would
instead interpret an end user's choice to run a relatively fast exiting Pod
(under 10 minutes) with both a successful exit code and configured with
`restartPolicy: Always`, as their intention to restart the pod indefinitely
without penalty.
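
To make the comparison concrete, below is a minimal Go sketch (not kubelet
code) contrasting today's exponential CrashLoopBackOff delays with the
flat-rate-plus-jitter delay this alternative would have used for `Succeeded`
Pods. The 5s flat rate (matching the graph below) and the 10% jitter are
illustrative assumptions only, since the exact rate was left to alpha
benchmarking.

```go
// Sketch only: compares today's exponential CrashLoopBackOff delays with a
// flat-rate-plus-jitter delay for Succeeded Pods. Values are illustrative.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	initialBackoff = 10 * time.Second  // today's first delay
	maxBackoff     = 300 * time.Second // today's 5 minute cap
)

// exponentialDelay models today's curve: 10s, 20s, 40s, ... capped at 5m.
func exponentialDelay(restarts int) time.Duration {
	d := initialBackoff
	for i := 0; i < restarts; i++ {
		d *= 2
		if d >= maxBackoff {
			return maxBackoff
		}
	}
	return d
}

// flatRateDelay models this (rejected) alternative: a constant delay plus a
// little jitter, independent of how many times the container has restarted.
func flatRateDelay(flat time.Duration, jitterFraction float64) time.Duration {
	jitter := time.Duration(rand.Float64() * jitterFraction * float64(flat))
	return flat + jitter
}

func main() {
	for restarts := 0; restarts < 7; restarts++ {
		fmt.Printf("restart %d: today=%v flat-rate=%v\n",
			restarts, exponentialDelay(restarts), flatRateDelay(5*time.Second, 0.1))
	}
}
```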
!["A graph comparing today's CrashLoopBackOff decay curve with a linear 5s delay for a container that exits successfully every 10s"](flatratesuccessvstoday.png "linear vs today's CrashLoopBackoff")

**Why not?**: This provides a workaround (and therefore an opportunity for
abuse) where application developers could catch any number of internal errors
of their workload in their application code, but exit successfully, forcing
extra fast restart behavior in a way that is opaque to the kubelet or the
cluster operator. Something similar is already being taken advantage of by
application developers via wrapper scripts, but that causes no extra strain on
the kubelet as it simply causes the container to run indefinitely and uses no
kubelet overhead for restarts.

#### On Success and the 10 minute recovery threshold

The original version of this proposal included a change specific to Pods
transitioning through the "Succeeded" phase to have flat rate restarts. On
further discussion, this was determined to be both too risky and a non-goal for
Kubernetes architecturally, and moved into the Alternatives section. The risk
of bad actors overloading the kubelet is described in the Alternatives section
and is somewhat obvious. The larger point of it being a non-goal within the
design framework of Kubernetes as a whole is less transparent and is discussed
here.

After discussion with early Kubernetes contributors and members of SIG-Node,
it has become clearer to the author that the prevailing Kubernetes assumption
is that, on its own, the Pod API best models long-running containers that
rarely or never exit themselves with "Success"; features like autoscaling,
rolling updates, and enhanced workload types like StatefulSets assume this,
while other workload types like those implemented with the Job and CronJob API
better model workloads that do exit themselves, running until Success or at
predictable intervals. In line with this assumption, Pods that run "for a
while" (longer than 10 minutes) are the ones that are "rewarded" with a reset
backoff counter -- not Pods that exit with Success. Ultimately, non-Job Pods
are not intended to exit with Success in any way that is meaningful to the
infrastructure, and quick rerun behavior of any application code is considered
to be an application-level concern instead.

Therefore, even though it is widely desired by commenters on
[Kubernetes#57291](https://github.com/kubernetes/kubernetes/issues/57291), this
KEP is no longer pursuing a different backoff curve for Pods exiting with
Success.

For Pods that are today intended to rerun after Success, it is instead suggested
to

1. exec the application logic with an init script or shell that reruns it
   indefinitely, like that described in
   [Kubernetes#57291#issuecomment-377505620](https://github.com/kubernetes/kubernetes/issues/57291#issuecomment-377505620):
   ```bash
   #!/bin/bash

   while true; do
     python /code/app.py
   done
   ```
2. or, if a shell in particular is not desired, implement the application such
   that it starts and monitors the restarting process inline, or as a
   subprocess/separate thread/routine

The author is aware that these solutions still do not address use cases where
users have taken advantage of the "cleaner" state "guarantees" of a restarted
pod to alleviate security or privacy concerns between sequenced Pod runs.

This decision does not disallow the possibility that this is solved in
other ways, for example:
1. 
the Job API, which better models applications with meaningful Success states,
   introducing a variant that models fast-restarting apps by infrastructure
   configuration instead of by their code, i.e. Jobs with `restartPolicy:
   Always` and/or with no completion count target
2. support restart on exit 0 as a directive in the container runtime or as a
   common independent tool, e.g. `RESTARTABLE CMD mycommand` or
   `restart-on-exit-0 -- mycommand -arg -arg -arg`
3. formalized reaper behavior such as discussed in
   [Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375)

However, there will always need to be some throttling or quota for restarts to
protect node stability, so even if these alternatives are pursued separately,
they will depend on the analysis and benchmarking done during this KEP's alpha
stage to stay within node stability boundaries.


##### Related: API opt-in for flat rate/quick restarts when transitioning from `Succeeded` phase

In this alternative, workloads would opt in with `restartPolicy:
FastOnSuccess`, as a foil to `restartPolicy: OnFailure`. In this case, existing
workloads with `restartPolicy: Always` or ones not determined to be in the
critical path would use the new, yet still relatively slower, front-loaded
decay curve and only those updated with `FastOnSuccess` would get truer fast
restart behavior. However, then it becomes impossible for a workload to opt
into both `restartPolicy: FastOnSuccess` and `restartPolicy: Rapid`.

##### Related: `Succeeded` vs `Rapid`ly failing: who's getting the better deal?

When both a flat rate `Succeeded` and a `Rapid` implementation were combined in
this proposal, depending on the variation of the initial value, the first few
restarts of a failed container would be faster than those of a successful
container, which at first look seems backwards.

!["A graph showing the intersection of delay curves between a linear rate for
success and an exponential rate for rapid
failures"](successvsrapidwhenfailed.png "success vs rapid CrashLoopBackoff
decay")

However, based on the use cases, this is still correct because the goal of
restarting failed containers is to take maximum advantage of quickly
recoverable situations, while the goal of restarting successful containers is
only to get them to run again sometime and not penalize them with longer waits
later when they're behaving as expected.

### Front loaded decay with interval

In an effort to anticipate API server stability ahead of the experiential data
we can collect during alpha, the proposed changes were to both reduce the
initial value and add a step function to a higher delay cap once the decay
curve triggers the same number of total restarts as experienced today in a 10
minute time horizon, in order to approximate the load (though not the rate) of
pod restart API server requests.

In short, this alternative proposed a new initial value of 1s, and a catch-up
delay of 569 seconds (almost 9.5 minutes) on the 6th retry.

!["A graph showing the delay interval function needed to maintain restart
number"](controlfornumberofrestarts.png
"Alternate CrashLoopBackoff decay")

**Why not?**: If we keep the same decay rate as today (2x), no matter what the
initial value is, the majority of the added restarts are in the beginning.
Even
if we "catch up" the delay to the total number of restarts, we expect problems
with the kubelet to arise more from the faster restarts in the beginning than
from how long we space out the later ones. In addition, we are only talking
about 3-7 more restarts per backoff, even in the fastest modeled case (25ms
initial value), which is not anticipated to be a significant enough hit to the
infrastructure to warrant implementing such a contrived backoff curve.

!["A graph showing the changes to restarts depending on some initial values"](initialvaluesandnumberofrestarts.png "Different CrashLoopBackoff initial values")

### Late recovery

There are many use cases not covered in this KEP's target [User
Stories](#user-stories) that share the common property of being concerned with
the recovery timeline of Pods that have already reached the maximum cap of
their backoff. Today, some of these Pods will have their backoff counters reset
once they have run successfully for 10 minutes. However, user stories exist
where

1. the Pod will never successfully run for 10 minutes by design
2. the user wants to be able to force the decay curve to restart
   ([Kubernetes#50375](https://github.com/kubernetes/kubernetes/issues/50375))
3. the application knows what to wait for and could communicate that to the
   system (like a restart probe)

As discussed
[here in Alternatives Considered](#on-success-and-the-10-minute-recovery-threshold),
the first case is unlikely to be addressed by Kubernetes.

The latter two are considered out of scope for this KEP, as the most common use
cases concern the initial recovery period. If there is still sufficient
appetite after this KEP reaches beta to specifically address late recovery
scenarios, then that would be a good time to address them without the noise of
this KEP's changes.

### More complex heuristics

The following alternatives are all considered by the author to be in the
category of "more complex heuristics", meaning solutions predicated on the
kubelet making runtime decisions based on a variety of system or workload
states or trends. These approaches all share the common negatives of being:
1. harder to reason about
2. of unknown return on investment for use cases relative to the investment to
   implement
3. expensive to benchmark and test

That being said, after this initial KEP reaches beta and beyond, it is entirely
possible that the community will desire more sophisticated behavior based on or
inspired by some of these considered alternatives. As mentioned above, the
observability and benchmarking work done within the scope of this KEP can help
users provide empirical support for further enhancements, and the following
review may be useful to such efforts in the future.
+ + * Expose podFailurePolicy to nonJob Pods + * Subsidize successful running time/readinessProbe/livenessProbe seconds in + current backoff delay + * Detect anomalous workload crashes + +## Appendix A + +### Kubelet SyncPod + +* Update API status +* Wait for network ready +* Register pod to secrets and configmap managers +* Create/update QOS cgroups for a restarting, unkilled pod if enabled +* Create mirror pod for static pod if deleted +* Make pod data directories +* Wait for volumes to attach/mount, up to 2 minutes per sync loop +* Get image pull secrets +* Add pod to probe manager (start probe workers) +* Vertical pod scaling; potentially do a resize on podmanager/statusmanager + +### Runtime SyncPod + +* Fix/create pod sandbox if necessary +* Start ephemeral containers +* Start init/sidecar containers +* Start normal containers +* Start = + * Get image volumes + * If still in backoff, error; will come back next round + * Error caught by kubelet, isTerminal=False + * Pull image volumes if enabled + * Pull image + * Generate container config, including generating env and pulling secrets + * Create container + * User Pre start hook + * Start container + * User Post start hook + +### Kubelet SyncTerminatingPod + runtime killPod + +* Update Pod status +* Stop probes +* Kill pod (with grace if enabled) + * Send kill signals and wait + * Stop sandboxes + * update QOS cgroups if enabled +* Remove probes +* Deallocate DRA resources, if enabled +* Set the Pod status again with exit codes + +### Kubelet SyncTerminatedPod + +* Update Pod status +* Unmount volumes (up to 2 minutes 3 seconds per sync loop) +* Unregister secret and configmap +* Remove cgroups for qos if enabled +* Release user namespace (if enabled, needs I/O) +* Update Pod status again + + + +## Infrastructure Needed (Optional) + + diff --git a/keps/sig-node/4603-tune-crashloopbackoff/code-diagram-for-restarts.png b/keps/sig-node/4603-tune-crashloopbackoff/code-diagram-for-restarts.png new file mode 100644 index 00000000000..f5755b09be8 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/code-diagram-for-restarts.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/controlfornumberofrestarts.png b/keps/sig-node/4603-tune-crashloopbackoff/controlfornumberofrestarts.png new file mode 100644 index 00000000000..157c913a378 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/controlfornumberofrestarts.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/crashloopbackoff-succeedingcontainer.png b/keps/sig-node/4603-tune-crashloopbackoff/crashloopbackoff-succeedingcontainer.png new file mode 100644 index 00000000000..d2f73828c64 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/crashloopbackoff-succeedingcontainer.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/differentinitialvalues.png b/keps/sig-node/4603-tune-crashloopbackoff/differentinitialvalues.png new file mode 100644 index 00000000000..47f2113cc97 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/differentinitialvalues.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/flatratesuccessvstoday.png b/keps/sig-node/4603-tune-crashloopbackoff/flatratesuccessvstoday.png new file mode 100644 index 00000000000..c6652320359 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/flatratesuccessvstoday.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/initialvaluesandmaxonnumberofrestarts.png 
b/keps/sig-node/4603-tune-crashloopbackoff/initialvaluesandmaxonnumberofrestarts.png new file mode 100644 index 00000000000..7d483b56247 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/initialvaluesandmaxonnumberofrestarts.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/initialvaluesandnumberofrestarts.png b/keps/sig-node/4603-tune-crashloopbackoff/initialvaluesandnumberofrestarts.png new file mode 100644 index 00000000000..809d878f4e7 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/initialvaluesandnumberofrestarts.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/kep.yaml b/keps/sig-node/4603-tune-crashloopbackoff/kep.yaml new file mode 100644 index 00000000000..719fe630b37 --- /dev/null +++ b/keps/sig-node/4603-tune-crashloopbackoff/kep.yaml @@ -0,0 +1,52 @@ +title: Tune Crashloop Backoff +kep-number: 4603 +authors: + - "@lauralorenz" +owning-sig: sig-node +# participating-sigs: +# - sig-aaa +# - sig-bbb +status: provisional +creation-date: 2024-04-29 +reviewers: + - "@tallclair" + - "@pehunt" +approvers: + - "SergeyKanzhelev" + +# see-also: +# - "/keps/sig-aaa/1234-we-heard-you-like-keps" +# - "/keps/sig-bbb/2345-everyone-gets-a-kep" +# replaces: +# - "/keps/sig-ccc/3456-replaced-kep" + +# The target maturity stage in the current dev cycle for this KEP. +stage: alpha + +# The most recent milestone for which work toward delivery of this KEP has been +# done. This can be the current (upcoming) milestone, if it is being actively +# worked on. +latest-milestone: "v1.32" + +# The milestone at which this feature was, or is targeted to be, at each stage. +milestone: + alpha: "v1.32" + # beta: "v1.34" + # stable: "v1.37" + +# The following PRR answers are required at alpha release +# List the feature gate name and the components for which it must be enabled +feature-gates: + - name: ReduceDefaultCrashLoopBackoffDecay + components: + - kube-apiserver + - kubelet + # - name: EnableRapidCrashLoopBackoffDecay + # components: + # - kube-apiserver + # - kubelet +disable-supported: true + +# The following PRR answers are required at beta release +# metrics: +# - my_feature_metric diff --git a/keps/sig-node/4603-tune-crashloopbackoff/kubeletvsruntime-restartresponsibility.png b/keps/sig-node/4603-tune-crashloopbackoff/kubeletvsruntime-restartresponsibility.png new file mode 100644 index 00000000000..39cb73e51e3 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/kubeletvsruntime-restartresponsibility.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-all.png b/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-all.png new file mode 100644 index 00000000000..90aa3d6e50c Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-all.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-minimum-per-node.png b/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-minimum-per-node.png new file mode 100644 index 00000000000..9c7fd5e9994 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-minimum-per-node.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-new-default.png b/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-new-default.png new file mode 100644 index 00000000000..0fea89b9d3a Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/restarts-vs-elapsed-new-default.png differ diff --git 
a/keps/sig-node/4603-tune-crashloopbackoff/successvsrapidwhenfailed.png b/keps/sig-node/4603-tune-crashloopbackoff/successvsrapidwhenfailed.png new file mode 100644 index 00000000000..62860ad7981 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/successvsrapidwhenfailed.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/todayvs1sbackoff.png b/keps/sig-node/4603-tune-crashloopbackoff/todayvs1sbackoff.png new file mode 100644 index 00000000000..0dbefb0103d Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/todayvs1sbackoff.png differ diff --git a/keps/sig-node/4603-tune-crashloopbackoff/todayvsrapid.png b/keps/sig-node/4603-tune-crashloopbackoff/todayvsrapid.png new file mode 100644 index 00000000000..9bd5a7d6a84 Binary files /dev/null and b/keps/sig-node/4603-tune-crashloopbackoff/todayvsrapid.png differ