
Autoscaler forever waiting for gce-mig "target is not ready" when cpu quota is exceeded #839

Open
sam-sla opened this issue Jan 26, 2024 · 3 comments

Comments

@sam-sla

sam-sla commented Jan 26, 2024

Hello,
So this is an edge case we got stuck in very recently. During some increased testing we apparently reached our default vcpu quota in the gcp project, but because we still didn't have alerts configured it went unnoticed for a while, until we realized that the nomad autoscaler wasn't scaling down nodes from the cluster.

The scenario:

  1. gce-mig tries to scale up, but the vCPU quota in the project has been reached, so the instance group reports errors of type QUOTA_EXCEEDED:
     dev-nomad-client-zpgq europe-west1-b Creating Jan 25, 2024, 3:38:35 PM UTC+01:00 Instance 'dev-nomad-client-zpgq' creation failed: Quota 'N2D_CPUS' exceeded. Limit: 500.0 in region europe-west1.
  2. The Nomad Autoscaler periodically checks the MIG but goes no further, because the MIG is never ready (it is stuck trying to scale up):
     [TRACE] policy_manager.policy_handler: target is not ready: policy_id=c00c0934-a44e-c3eb-5200-e71e44255633

In our case we requested a vCPU quota increase and that solved the problem: the MIG finished the scale-up event it had started many hours earlier, the Nomad Autoscaler then saw the MIG as ready, and it started functioning properly again (in our case, scaling down, since the load had decreased a lot).

I would like to know if this could be handled better. Maybe the autoscaler could check the MIG's errors, or force a scaling event (I'm not sure whether that's possible).

@lgfa29
Contributor

lgfa29 commented Feb 10, 2024

Hi @sam-sla 👋

Do you know what would happen if you tried to scale a MIG in that state? I suspect a scale-down might work, and it would probably be desirable if possible 🤔

But I'm also not sure how to detect this state. Currently we check this isStable flag to determine if the target is ready for scaling:

return mig.Status.IsStable, mig.TargetSize, nil

Do you know if there's a field we can check for an error like this?

@sam-sla
Author

sam-sla commented Feb 12, 2024

Hi @lgfa29,

Unfortunately, no. I should have tried making a new scale call to the MIG myself to see what would happen, but I didn't think of it :/ Possibly an explicit scale-down call could have overridden the scale-up it was stuck on.

What I was thinking is: if the autoscaler detects that a MIG hasn't been stable/ready for a while, could it check for MIG errors, to at least log some information? There is a listErrors endpoint: https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listErrors

@sam-sla
Author

sam-sla commented Apr 12, 2024

Hello again,
We landed on this same problem today, and this time I tried changing the MIG size directly in GCP, and it worked. After lowering the group's size, the MIG returned to a ready state and the Nomad Autoscaler recovered.
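For anyone hitting this, the same manual workaround can be done from the CLI with gcloud's resize command. The MIG name, zone, and size below are illustrative; the command is echoed rather than executed so it can be reviewed first (remove the leading `echo` to actually resize the group):

```shell
# Illustrative values; substitute your own MIG name, zone, and target size.
MIG_NAME=dev-nomad-client
ZONE=europe-west1-b
NEW_SIZE=3

# Echoed for review; drop the `echo` to run the resize for real.
echo gcloud compute instance-groups managed resize "$MIG_NAME" \
  --zone="$ZONE" --size="$NEW_SIZE"
```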

So GCP is also being sneaky here, constantly retrying the scale-up even after the CPU quota is hit.

Projects
Status: Needs Roadmapping