Why shouldn't we just spin up some EC2 instances and call it a day? #340

osterman · 2018-12-21T23:00:23Z

what

Sometimes the question comes up:

Why do we need Kubernetes/ECS/EKS?
We can spin up some EC2 instances in an autoscaling group behind a load balancer in about a day and be done with it.

why

Yes, that's true. But it's an oversimplification of the problem. It's very easy to bring up a "proof of concept" in about a day. It's an entirely different story to take this to production.

Questions to ask

How are you going to manage upgrades of the host machines' operating systems?
- Packer is the most common strategy. It still needs to run somewhere. Ideally, it's also wired up with CI/CD to build AMIs automatically. If this process hasn't already been set up, then you'll need to do that too.
- How are you going to configure those machines or AMIs? The classical approach is with "configuration management" software like Chef, Puppet, or Ansible. The scripts for those need to be written and then the CI/CD process wired up. If none of that already exists, it's a large investment.
- How are you going to deploy the new AMIs?
How are you going to handle automated deployments to the autoscaling group?
- A strategy needs to be implemented to handle safe rolling updates or blue/green deployments. Doing this well is non-trivial.
- The strategy needs to work despite servers scaling up/down and cannot assume a fixed size cluster.
- What happens if a server comes online during a software deployment? What version does it get?
- What happens if a rollout fails? Does it roll back? Does it continue? How does it even know that it failed?
- How do you know that all servers are running the current version of the software?
- How will you handle rollbacks?
- What happens if you roll back to a version that was never previously deployed on the server?
- How will you handle rollbacks in the event of failure? (and there will be failures!)
- How will you know if a deployment succeeded?
How are you going to handle Remote Access Management?
- SSH access to these servers should be controlled. Sharing SSH keys shouldn't be an option.
- How is SSH access managed? If using public keys, then those need to be distributed somehow. If using certificate-based authentication, now we'll need to set up a CA and manage that whole process.
- How do you access internal apps, e.g. apps that should not be public? The best practice is to use a BeyondCorp-style "Identity-Aware Proxy" with Single Sign-On integration. As a last resort, a VPN can be set up, which comes with additional overhead and maintenance.
- Are the webservers public so you can access them? They shouldn't be. All instances should be on a private subnet. But if that's the case, then we'll also need a bastion host somewhere for remote access.
How are logs aggregated from all the servers?
- The logs need to be shipped somewhere they can be easily searched.
- CloudWatch Logs is one option, but it's primitive compared to things like Splunk or Sumo Logic.
How are services on the servers monitored?
- It's essential to know that the services on the servers are running appropriately. If they crash, they should be automatically restarted.
- How will you know if some percentage of requests are failing?
- What will happen if there's a memory leak?
How do you deploy multiple apps to those servers when they might have different, conflicting dependencies (e.g. different versions of libraries)?
How are you going to manage secrets and configurations?
- A secure mechanism needs to exist for storing secrets and distributing them to applications
- A strategy must exist for rotating these secrets
How are you going to handle service discovery?
- DNS is the most common approach. If using Route53, something will need to update it.
- When services go away, their DNS entries need to be removed.
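
To give a sense of what the deployment questions above involve, here is a minimal sketch of a batched rolling update with rollback on failure. It is a simulation only: `Server`, `rolling_deploy`, `versions_in_fleet`, and the `is_healthy` callback are hypothetical names made up for illustration, not part of any AWS or Kubernetes API. It also assumes a fixed fleet, so it does not answer what happens when a server joins mid-deployment.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Server:
    """Hypothetical stand-in for an instance in the autoscaling group."""
    name: str
    version: str


def rolling_deploy(
    fleet: List[Server],
    new_version: str,
    is_healthy: Callable[[Server], bool],
    batch_size: int = 1,
) -> bool:
    """Update the fleet in batches; roll back everything on the first failure.

    Returns True if every server ends up on new_version, False if rolled back.
    """
    # Remember each server's current version so we know what to roll back to.
    previous = {server.name: server.version for server in fleet}
    updated: List[Server] = []

    for start in range(0, len(fleet), batch_size):
        batch = fleet[start:start + batch_size]
        for server in batch:
            server.version = new_version
            updated.append(server)

        # Check the batch before touching the next one.
        if not all(is_healthy(server) for server in batch):
            # Failure detected: restore every server touched so far.
            for server in updated:
                server.version = previous[server.name]
            return False

    return True


def versions_in_fleet(fleet: List[Server]) -> Set[str]:
    """Answer 'is everything on the same version?' with a set of versions."""
    return {server.version for server in fleet}


if __name__ == "__main__":
    fleet = [Server(f"web-{i}", "v1") for i in range(4)]
    ok = rolling_deploy(fleet, "v2", is_healthy=lambda s: True, batch_size=2)
    print(ok, versions_in_fleet(fleet))
```

Even this toy version has to track prior versions, define what "healthy" means, and decide when to stop. A real implementation also has to cope with instances launching and terminating while it runs, which is the part a container platform handles for you.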

Long story short, you get most of this for "free" by leveraging a modern container management platform like Kubernetes or ECS. You don't need to implement all this crazy stuff because the platforms provide it for you. In the case of Kubernetes, if it doesn't support something out of the box, there are hundreds of apps for it that will probably implement what you need.

So instead of reinventing the wheel and ending up with a snowflake infrastructure, we advise all of our customers to avoid this trap and go straight for a container management platform.

osterman added the faq label Dec 21, 2018
