Scaling to meet load #303
From Monday’s call:
Another idea I brought up here was that the loaders could be changed to request all the current availability data from the API at the start of the run, and only send updates for locations when there is a new …
We've long known that the loaders are probably overprovisioned. This downsizes them by 50%. Partially covers #303. See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#task_size for valid cpu/memory values and combinations.
We did a few things here, and additional changes are no longer relevant now that the service is sunsetting in a month.
In this issue I'm going to lay out a few ideas we could implement to better handle potential higher traffic loads. Each idea has an impact vs. difficulty score to help with prioritization.
1: Queue-based database updating
Impact: 7
Difficulty: 5
When assessing our per-endpoint metrics in Datadog, the majority of our traffic comes from calls to the /update method: in the span of an hour we see 130k+ update calls, while the next-busiest endpoint only gets ~300. To better isolate these update calls, and to separate what needs to write to the DB from what only needs to read from it, we can have the loaders publish events to a queue and run a small pool of workers that read from the queue and update the DB.

Having these events on a queue also helps with failure scenarios: if one worker is failing, another can pick up its messages, and in the case of a bad deployment the queue simply backs up so we don't lose any updates once things are resolved. If we do this work in parallel with the read replica work already in flight, we would also be able to keep serving all requests to the API. The tradeoff here is availability vs. consistency: we're optimizing for availability by letting the queue back up and the API continue serving requests rather than waiting for everything to be healthy again.
Implementing this option is also cost-efficient. We would be able to scale down the API task pool because of the lighter load, and the worker itself can be a small ECS service. SQS costs $0.40 per million requests; at our current rate of roughly 88-100 million requests a month, the expected SQS cost would be about $35-40 a month, and if we batch the messages we could probably get that down to $5-10. Given we'll be able to scale down the API, I expect the costs should about even out.
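To make the shape of this concrete, here's a minimal sketch of the loader-side publish and the worker-side consume loop using boto3. The queue name availability-updates and the apply_update() helper are placeholders I made up for illustration, not existing code:

```python
import json
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue name, just for illustration.
QUEUE_URL = sqs.get_queue_url(QueueName="availability-updates")["QueueUrl"]


def publish_updates(updates):
    """Loader side: send availability updates in batches of 10 (the SQS max)."""
    for i in range(0, len(updates), 10):
        batch = updates[i:i + 10]
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[
                {"Id": str(n), "MessageBody": json.dumps(update)}
                for n, update in enumerate(batch)
            ],
        )


def worker_loop(apply_update):
    """Worker side: long-poll the queue and apply each update to the DB."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20
        )
        for msg in resp.get("Messages", []):
            apply_update(json.loads(msg["Body"]))  # placeholder for the real DB write
            # Only delete after the write succeeds, so a failed worker's
            # messages become visible again for another worker to pick up.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```

Batching on the publish side is what brings the SQS request count (and cost) down, and deleting a message only after the DB write succeeds is what lets another worker pick it up if one dies mid-update.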
2: Cut down on logging volume
Impact: 2
Difficulty: 1
This is a small issue, but we should look into it: logging is currently our greatest cost in AWS, accounting for 40% of our total spend. A quick glance in CloudWatch shows that most of these log lines are just "/update 200". Removing that log line alone would probably cut our AWS spend substantially and let us consider a better logging provider.
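As a sketch of how small this fix could be (assuming, and this is an assumption, that the access log goes through Python's standard logging module; the same idea applies to whatever logger we actually use, and the "access" logger name is a placeholder):

```python
import logging


class DropUpdateSuccess(logging.Filter):
    """Drop access-log lines for successful /update calls; keep everything else."""

    def filter(self, record):
        message = record.getMessage()
        return not ("/update" in message and " 200" in message)


# Attach the filter to whatever logger emits the request logs.
logging.getLogger("access").addFilter(DropUpdateSuccess())
```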
3: Resize our application's resource requests
Impact: 3
Difficulty: 3
We already have autoscaling set up for the API, which is great, but we need to do some work to use ECS's reservations more effectively. Right now we use ~10% of the available memory and ~30% of the available CPU. By right-sizing those requests, we'll be able to autoscale the API as needed rather than sitting in the overprovisioned state we're in now. This also has a larger cost implication: vCPU usage is currently in our top three AWS costs, and reducing our CPU reservation will bring that down. We'll probably also want to move the goalposts for autoscaling, which are currently a very wide band (scale down below 10% utilization, scale up above 85%).
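As a rough sketch of what tighter scaling could look like, here's a target-tracking policy on CPU via boto3. Note this is a different mechanism than the step thresholds we have today, and the cluster/service names, capacity bounds, and the 60% target are all placeholder values, not decisions:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Placeholder cluster/service names and capacity bounds.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/our-cluster/api",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=10,
)

# Track average service CPU at an example target of 60% instead of the wide 10%/85% band.
autoscaling.put_scaling_policy(
    PolicyName="api-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/our-cluster/api",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 120,
        "ScaleOutCooldown": 60,
    },
)
```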
4: Decouple updating from reading
Impact: 4
Difficulty: 3
This is a simpler, less cost-effective version of the first proposal. Essentially we run two deployments of the API simultaneously: one for writing, one for reading. That lets us scale them separately and isolate functionality. The downside is that we'd have to create another ALB and another ECS service, which is some added cost. Given our TF configuration, however, this is less involved from a code-change standpoint and might even be doable through infrastructure changes alone.
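Just to illustrate the shape (in practice this would likely be a second service block in our Terraform; every name and ARN below is a placeholder), the write-side API is the same task definition registered as its own ECS service behind its own load balancer:

```python
import boto3

ecs = boto3.client("ecs")

# Placeholder names/ARNs; the real values would come from our Terraform outputs.
ecs.create_service(
    cluster="our-cluster",
    serviceName="api-write",
    taskDefinition="api",  # reuse the existing API task definition unchanged
    desiredCount=2,
    launchType="FARGATE",
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:...:targetgroup/api-write/...",
        "containerName": "api",
        "containerPort": 3000,
    }],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-placeholder"],
            "securityGroups": ["sg-placeholder"],
        }
    },
)
```

The loaders would then point at the write service's load balancer while everything else keeps using the existing read endpoint.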
5: Migrate to Kubernetes (Idea that we probably shouldn't do)
Impact: 8
Difficulty: 9
I thought a bit about whether Kubernetes is right for us, and I came to the conclusion that it's probably not. Kubernetes is a massively powerful tool that would let us eliminate a lot of our larger costs, like CloudWatch (logs can be stored and tailed on the instances) and EC2 data transfer (Kubernetes networking can bypass the need to expose an LB). EKS, however, is itself a large operational cost: EKS clusters cost an extra ~$70 a month right off the bat, not counting the EC2 node groups you need to run. Beyond that, we would need to change a lot about how our ops works; a lot of our Terraform goes away and moves into the Helm world. I don't think the benefits we would gain from Kubernetes would be so massive that it would warrant such an undertaking. I am very familiar with managing Kubernetes, so if we think it's an interesting proposal I could write up a detailed plan. Interested to hear if anyone else has thoughts on this.
Note
I will continue thinking about other things we can do here, and will add comments with those ideas as well.