
Scaling to meet load #303

Closed

jaronoff97 opened this issue Jul 13, 2021 · 2 comments
Labels
enhancement (New feature or request)

Comments

jaronoff97 (Collaborator) commented Jul 13, 2021

In this issue I'm going to lay out a few ideas we could implement to better handle potentially higher traffic loads. Each idea has an impact vs. difficulty score to help with prioritization.

1: Queue based database updating

Impact: 7
Difficulty: 5

Our per-endpoint metrics in Datadog show that the majority of our traffic comes from calls to the /update endpoint. In the span of an hour we see 130k+ update calls, while the next-busiest endpoint only gets ~300. To better isolate these update calls, and to separate what needs to write to the DB from what only needs to read from it, we could have the loaders publish events to a queue and run a small pool of workers that read from the queue and update the DB.

Having these events on a queue would help with failure scenarios: if a worker is failing, another worker can pick up its messages. In the case of a bad deployment, the queue would back up so that we wouldn't lose any of those updates once it's resolved. If we do this work in parallel with the read-replica work already in flight, we would also be able to continue serving all requests to the API. The tradeoff here is availability vs. consistency: we are optimizing for availability by allowing the queue to back up and the API to keep serving requests rather than waiting for everything to be healthy again.

Implementing this option is also cost-efficient. We would be able to scale down the API task pool because of the lighter load, and run a small ECS service for the worker itself. SQS costs $0.40 per million requests; at our current rate of roughly 88-100 million requests to our endpoints a month, our expected SQS cost would be about $35-40 a month. If we batch the requests we could probably get that down to $5-10 a month. Given we'd be able to scale down the API, I expect the costs to about even out.
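
To make the shape of this concrete, here's a rough sketch of the producer/worker split using the AWS SDK's SQS client. The queue URL, message shape, and DB-write function are placeholders, not code from this repo:

```typescript
// Sketch only: assumes an SQS queue already exists and that the loaders and
// worker run with credentials that can send/receive on it.
import {
  SQSClient,
  SendMessageBatchCommand,
  ReceiveMessageCommand,
  DeleteMessageCommand,
} from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});
const QUEUE_URL = process.env.UPDATE_QUEUE_URL!; // hypothetical env var

// Loader side: batch up to 10 location updates per SQS request to keep the
// request count (and cost) down.
export async function publishUpdates(updates: object[]) {
  for (let i = 0; i < updates.length; i += 10) {
    const batch = updates.slice(i, i + 10);
    await sqs.send(
      new SendMessageBatchCommand({
        QueueUrl: QUEUE_URL,
        Entries: batch.map((update, j) => ({
          Id: `${i + j}`,
          MessageBody: JSON.stringify(update),
        })),
      })
    );
  }
}

// Worker side: long-poll the queue, write to the DB, delete on success. If a
// write fails, the message becomes visible again and another worker can retry
// it, which is the failure behavior described above.
export async function pollOnce(writeToDb: (update: unknown) => Promise<void>) {
  const { Messages = [] } = await sqs.send(
    new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20,
    })
  );
  for (const message of Messages) {
    await writeToDb(JSON.parse(message.Body!));
    await sqs.send(
      new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: message.ReceiptHandle!,
      })
    );
  }
}
```

Batching ten messages per SendMessageBatch call is also what would get the request count (and the SQS bill) down toward the $5-10 range mentioned above.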

2: Cut down on logging volume

Impact: 2
Difficulty: 1

This is a small issue, but we should look into it: logging is currently our greatest cost in AWS, accounting for 40% of our total spend. A quick glance in CloudWatch shows that most of these log lines are just "/update 200". Removing that log alone would probably cut our AWS spend significantly and would let us consider a better logging provider.
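
If we keep middleware-style request logging for now, skipping just that line in production could be a very small change. This is a sketch assuming an Express app with morgan; the real logging setup may differ:

```typescript
// Sketch only: assumes an Express app with morgan-style request logging.
// Skips the "/update 200"-style lines in production (the bulk of the
// CloudWatch volume) while keeping full request logs in dev.
import express from "express";
import morgan from "morgan";

const app = express();

app.use(
  morgan("combined", {
    skip: (req, res) =>
      process.env.NODE_ENV === "production" &&
      (req.url ?? "").startsWith("/update") &&
      res.statusCode < 400,
  })
);
```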

3: Resize our application's resource requests

Impact: 3
Difficulty: 3

We already have autoscaling set up for the API, which is great, but we need to do some work to use ECS's reservations more effectively. Right now we use ~10% of the memory and ~30% of the CPU we reserve. By sizing our reservations down to match our actual needs, we'll be able to autoscale the API as demand requires rather than sitting in the overprovisioned state we're in now. This also has a larger cost implication: vCPU usage is currently in our top 3 AWS costs, and reducing our CPU request will bring that down. We'll also probably want to tighten the autoscaling thresholds, which are currently a very wide band (scale down below 10% utilization, scale up above 85%).
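
As a sketch of the shape of the change (the real version would live in our Terraform; the family name, image, and cpu/memory values below are placeholders to be replaced with numbers derived from the actual utilization data):

```typescript
// Sketch only: the real change belongs in our Terraform task definitions;
// this just shows the shape. Family name, image, and sizes are placeholders
// to be replaced with values derived from the measured ~10%/~30% utilization.
import { ECSClient, RegisterTaskDefinitionCommand } from "@aws-sdk/client-ecs";

const ecs = new ECSClient({});

async function rightSizeApiTask() {
  await ecs.send(
    new RegisterTaskDefinitionCommand({
      family: "univaf-api", // hypothetical family name
      requiresCompatibilities: ["FARGATE"],
      networkMode: "awsvpc",
      // Valid Fargate pairs include 256/512, 512/1024, 1024/2048, etc.; pick
      // the smallest pair that leaves reasonable headroom over observed peaks.
      cpu: "512",
      memory: "1024",
      containerDefinitions: [
        {
          name: "api",
          image: "univaf-api:latest", // placeholder image reference
          essential: true,
          portMappings: [{ containerPort: 3000 }],
        },
      ],
    })
  );
}

rightSizeApiTask().catch(console.error);
```

Tightening the scale-up/scale-down thresholds would be a separate change to the service's autoscaling policy.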

4: Decouple updating from reading

Impact: 4
Difficulty: 3

This is a simpler, less cost-effective version of the first proposal. Essentially, we run two deployments of the API simultaneously: one for writing, one for reading. This lets us scale them separately and isolate functionality. The downside is that we'd have to create another ALB and another ECS service, which adds some cost. Given our TF configuration, however, this is less involved from a code-change standpoint and might even be doable through infrastructure changes alone.
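
A sketch of how the same API build could serve both roles, with an environment variable picking which routes to mount (route and handler names are illustrative stand-ins, not the actual router layout):

```typescript
// Sketch only: one API build deployed twice behind separate load balancers,
// with an env var deciding which routes each deployment mounts.
import express, { Request, Response } from "express";

// Hypothetical handlers standing in for the real ones.
const listLocations = (_req: Request, res: Response) => res.json([]);
const getLocation = (_req: Request, res: Response) => res.json({});
const handleUpdate = (_req: Request, res: Response) => res.sendStatus(200);

const app = express();
const role = process.env.API_ROLE ?? "all"; // "read" | "write" | "all" (hypothetical)

if (role === "read" || role === "all") {
  app.get("/locations", listLocations);
  app.get("/locations/:id", getLocation);
}

if (role === "write" || role === "all") {
  app.post("/update", express.json(), handleUpdate); // write path used by the loaders
}

app.listen(3000);
```

The read deployment could then scale independently of the write deployment, without either starving the other.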

5: Migrate to Kubernetes (Idea that we probably shouldn't do)

Impact: 8
Difficulty: 9

I thought a bit about whether Kubernetes is right for us, and I came to the conclusion that it's probably not. Kubernetes is a massively powerful tool that would let us eliminate some of our larger costs, like CloudWatch (logs can be stored and tailed on-instance) and EC2 data transfer (we could use Kubernetes networking and bypass the need to expose an LB). EKS, however, is itself a large operational cost: EKS clusters cost an extra $70 right off the bat, not even counting the EC2 node groups you need to run. Beyond that, we would need to change a lot about how our ops works; much of our Terraform would go away and move into the Helm world. I don't think the benefits we'd gain from Kubernetes would be so massive that they'd warrant such an undertaking. I'm very familiar with managing Kubernetes, so if we think this is an interesting proposal I could write up a detailed plan. Interested to hear if anyone else has thoughts on this.

Note: I will continue thinking about other things we can do here and will add comments with those ideas as well.

jaronoff97 added the enhancement (New feature or request) label Jul 13, 2021
Mr0grog (Collaborator) commented Jul 21, 2021

From Monday’s call:

  1. Queue based database updating: we were worried about a big jump in usage from Vaccinate the States integrating us on their front-end, but that’s no longer going to happen (VtS is ramping down over the next 4 weeks, so will keep to only the existing back-end integration). This task is pretty high effort, but there’s no longer an especially serious need for it, so we should postpone for now.

    • There was also some architectural confusion here: The reason /update is a public endpoint and the loaders don’t just talk directly with the database or have their own private API to talk to is because we had originally been working with partners at vaccinespotter.org and getmyvaccine.org to have them post data directly to the /update endpoint (also one of the reasons the server supports multiple API keys, and why the loaders we operate are named univaf-xyz instead of just xyz). So we’d still need an endpoint to put things onto the proposed queue. (On the other hand, all those groups have shut down or are ramping down, so this is not really a practical concern anymore.)
    • Lambda might be another way to approach this problem.
    • This still seems potentially worth thinking about longer-term (insofar as long-term matters), but the higher priority aspects can probably be better handled in “4. Decouple updating from reading.”
  2. Cut down on logging volume: We should go ahead and do this; @jaronoff97 made a compelling case that request logs aren’t getting us much value in production since we have DataDog metrics now (though it would be nice to retain them in dev). See ☂ Use a real logging library #307.

    • I was dead certain I’d written an issue for this long ago, but that was clearly wrong. All I could find was a side note in another issue and a to-do comment in the code.
  3. Resize our application's resource requests: We should definitely do this, and it seems fairly straightforward, although the impact may be limited because of how spiky our basic pattern (loading in a giant pile of data every N minutes) is.

  4. Decouple updating from reading: Probably worthwhile, and maybe a simpler approach to many of the same issues as point (1). Would be good to get a clearer handle on the costs, but it sounds like this is worth moving forward on.

  5. Migrate to Kubernetes: All agreed we should not.

Another idea I brought up here was that the loaders could be changed to request all the current availability data from the API at the start of the run, and only send updates for locations that have a new valid_at time or that haven't been checked in a while (e.g. 30 minutes or an hour). This would potentially cut down on unnecessary updates and overall traffic to the API. It could further lighten the load if done in conjunction with #201.
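
Very roughly, that loader-side check could look something like the sketch below (the endpoint paths, response shape, and staleness window are assumptions for illustration, not the actual API):

```typescript
// Sketch only: fetch the current availability once per run, then only POST
// records whose valid_at is newer or whose last known data is stale. Endpoint
// paths, field names, and the 30-minute window are illustrative. Uses the
// global fetch available in Node 18+.
const API_BASE = process.env.API_URL ?? "http://localhost:3000";
const STALE_AFTER_MS = 30 * 60 * 1000;

interface AvailabilityRecord {
  locationId: string;
  valid_at: string; // ISO timestamp
}

export async function sendChangedUpdates(scraped: AvailabilityRecord[]) {
  // One read at the start of the run instead of unconditionally posting
  // every scraped record.
  const response = await fetch(`${API_BASE}/availability`);
  const current: AvailabilityRecord[] = await response.json();
  const knownValidAt = new Map(
    current.map((r): [string, number] => [r.locationId, Date.parse(r.valid_at)])
  );

  const now = Date.now();
  const toSend = scraped.filter((record) => {
    const known = knownValidAt.get(record.locationId);
    if (known === undefined) return true; // location the API hasn't seen yet
    if (Date.parse(record.valid_at) > known) return true; // newer data
    return now - known > STALE_AFTER_MS; // periodic refresh even if unchanged
  });

  for (const record of toSend) {
    await fetch(`${API_BASE}/update`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(record),
    });
  }
}
```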

Mr0grog added a commit that referenced this issue Jul 13, 2022
We've long known that the loaders are probably overprovisioned. This downsizes them by 50%. Partially covers #303. See https://docs.aws.amazon.com/AmazonECS/latest/developerguide/task_definition_parameters.html#task_size for valid cpu/memory values and combinations.
Mr0grog (Collaborator) commented May 16, 2023

We did a few things here, and additional changes are no longer relevant with the service sunsetting in a month.

Mr0grog closed this as completed May 16, 2023