
Autoscaler forever waiting for gce-mig "target is not ready" when cpu quota is exceeded #839

Open
sam-sla opened this issue Jan 26, 2024 · 3 comments

Comments

@sam-sla

sam-sla commented Jan 26, 2024

Hello,
So this is an edge case we got stuck in very recently. During some increased testing we apparently reached our default vcpu quota in the gcp project, but because we still didn't have alerts configured it went unnoticed for a while, until we realized that the nomad autoscaler wasn't scaling down nodes from the cluster.

The scenario:

  1. gce-mig tries to scale up, but the vCPU quota in the project has been reached, so the instance group reports errors of type QUOTA_EXCEEDED:
     dev-nomad-client-zpgq europe-west1-b Creating Jan 25, 2024, 3:38:35 PM UTC+01:00 Instance 'dev-nomad-client-zpgq' creation failed: Quota 'N2D_CPUS' exceeded. Limit: 500.0 in region europe-west1.
  2. The Nomad Autoscaler periodically checks the MIG but goes no further, because the MIG is never ready (it is stuck trying to scale up):
     [TRACE] policy_manager.policy_handler: target is not ready: policy_id=c00c0934-a44e-c3eb-5200-e71e44255633

In our case we requested a vCPU quota increase and that solved the problem: the MIG finished the scale-up event it had started many hours earlier, the Nomad Autoscaler then saw the MIG as ready, and it started functioning properly again (in our case, scaling down, since the load had decreased a lot).

I would like to know if this could be handled better. Maybe the autoscaler could check the MIG's errors, or force a scaling event (I'm not sure whether that's possible).

@lgfa29
Contributor

lgfa29 commented Feb 10, 2024

Hi @sam-sla 👋

Do you know what would happen if you tried to scale a MIG in that state? I suspect a scale-down might work, and it would probably be desirable if possible 🤔

But I'm also not sure how to detect this state. Currently we check this isStable flag to determine if the target is ready for scaling:

return mig.Status.IsStable, mig.TargetSize, nil

Do you know if there's a field we can check for an error like this?

@sam-sla
Author

sam-sla commented Feb 12, 2024

Hi @lgfa29,

Unfortunately, no. I should have tried making a new scale call to the MIG myself to see what would happen, but I didn't think of it :/ Possibly an explicit scale-down call could have overridden the scale-up it was stuck on.

What I was thinking is: if the autoscaler detects that a MIG hasn't been stable/ready for a while, could it check for MIG errors, to at least log some information? There is a listErrors endpoint: https://cloud.google.com/compute/docs/reference/rest/v1/instanceGroupManagers/listErrors

@sam-sla
Author

sam-sla commented Apr 12, 2024

Hello again,
We landed on this same problem today, and this time I tried changing the MIG size directly in GCP, and it worked. After lowering the group's size, the MIG returned to a ready state and the Nomad Autoscaler recovered.
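For anyone hitting this, the same manual workaround can be done from the CLI with gcloud's resize command. The MIG name, zone, and size below are illustrative; the command is echoed rather than executed so it can be reviewed first (remove the leading `echo` to actually resize the group):

```shell
# Illustrative values; substitute your own MIG name, zone, and target size.
MIG_NAME=dev-nomad-client
ZONE=europe-west1-b
NEW_SIZE=3

# Echoed for review; drop the `echo` to run the resize for real.
echo gcloud compute instance-groups managed resize "$MIG_NAME" \
  --zone="$ZONE" --size="$NEW_SIZE"
```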

So GCP is also being sneaky here, constantly retrying the scale-up even after the CPU quota is hit.

Projects
Status: Needs Roadmapping