Why shouldn't we just spin up some EC2 instances and call it a day? #340

Open
osterman opened this issue Dec 21, 2018 · 0 comments

osterman commented Dec 21, 2018

what

  • Sometimes the question comes up:

Why do we need Kubernetes/ECS/EKS?
We can spin up some EC2 instances in an autoscaling group behind a load balancer in about a day and be done with it.

why

Yes, that's true, but it oversimplifies the problem. It's very easy to bring up a "proof of concept" in about a day. It's an entirely different story to take it to production.

Questions to ask

  1. How are you going to manage upgrades of the host machines' operating systems?

    • Packer is the most common strategy, but it still needs to run somewhere. Ideally, it's also wired up with CI/CD to build AMIs automatically. If this process hasn't already been set up, then you'll need to do that too.
    • How are you going to configure those machines or AMIs? The classical approach is with "configuration management" software like Chef, Puppet, or Ansible. The scripts for those need to be written and then the CI/CD process wired up. If none of that already exists, it's a large investment.
    • How are you going to deploy the new AMIs?
  2. How are you going to handle automated deployments to the autoscaling group?

    • A strategy needs to be implemented to handle safe rolling updates or blue/green deployments. Doing this well is non-trivial.
    • The strategy needs to work despite servers scaling up/down and cannot assume a fixed size cluster.
    • What happens if a server comes online during a software deployment? What version does it get?
    • What happens if a rollout fails? Does it roll back? Does it continue? How does it even know that it failed?
    • How do you know that all servers are running the current version of the software?
    • How will you handle rollbacks?
    • What happens if you roll back to a version that was never previously deployed on the server?
    • How will you handle rollbacks in the event of failure? (and there will be failures!)
    • How will you know if a deployment succeeded?
  3. How are you going to handle Remote Access Management?

    • SSH access to these servers should be controlled. Sharing SSH keys shouldn't be an option.
    • How is SSH access managed? If using public keys, then those need to be distributed somehow. If using certificate-based authentication, then we'll need to set up a CA and manage that whole process.
    • How do you access internal apps, e.g. apps that should not be public? The best practice is to use a BeyondCorp-style "Identity Aware Proxy" with single sign-on integration. As a last resort, a VPN can be set up, which comes with additional overhead and maintenance.
    • Are the webservers public so you can access them? They shouldn't be. All instances should be on a private subnet. But if that's the case, then we'll also need a bastion host somewhere for remote access.
  4. How are logs aggregated from all the servers?

    • The logs need to be shipped somewhere they can be easily searched.
    • CloudWatch Logs is one option, but it's primitive compared to things like Splunk or Sumo Logic.
  5. How are services on the servers monitored?

    • It's essential to know that the services on the servers are running appropriately. If they crash, they should be automatically restarted.
    • How will you know if some percentage of requests are failing?
    • What will happen if there's a memory leak?
  6. How do you deploy multiple apps to those servers when they might have different, conflicting dependencies (e.g. different versions of libraries)?

  7. How are you going to manage secrets and configurations?

    • A secure mechanism needs to exist for storing secrets and distributing them to applications
    • A strategy must exist for rotating these secrets
  8. How are you going to handle service discovery?

    • DNS is the most common approach. If using Route53, something will need to update it.
    • When services go away, their DNS entries need to be removed.
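
To make the deployment questions in item 2 concrete, here is a minimal sketch of what a hand-rolled rolling update has to deal with. It is not tied to any AWS API: `list_servers`, `deploy`, and `is_healthy` are hypothetical callbacks standing in for whatever tooling you would have to build, and `stable_version` is an assumed "known good" release used for rollbacks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    succeeded: bool
    versions: Dict[str, str]  # server id -> version the rollout believes is running


def rolling_update(
    list_servers: Callable[[], List[str]],  # hypothetical: current members of the group
    deploy: Callable[[str, str], None],     # hypothetical: install a version on a server
    is_healthy: Callable[[str], bool],      # hypothetical: post-deploy health check
    versions: Dict[str, str],
    new_version: str,
    stable_version: str,
    batch_size: int = 1,
) -> RolloutResult:
    """Update servers in batches; roll back every touched server on failure."""
    versions = dict(versions)  # work on a copy, leave the caller's mapping alone
    touched: List[str] = []

    while True:
        # Re-list on every pass: the group may have scaled up or down meanwhile.
        # Servers we have never seen have no recorded version, so they are pending.
        pending = [s for s in list_servers() if versions.get(s) != new_version]
        if not pending:
            return RolloutResult(True, versions)

        for server in pending[:batch_size]:
            deploy(server, new_version)
            versions[server] = new_version
            touched.append(server)
            if not is_healthy(server):
                # Failure detected: put every touched server back on the stable
                # version, including newcomers that never ran it before.
                for t in touched:
                    deploy(t, stable_version)
                    versions[t] = stable_version
                return RolloutResult(False, versions)


if __name__ == "__main__":
    fleet = ["i-a", "i-b"]

    def demo_deploy(server: str, version: str) -> None:
        # Simulate a scale-out event in the middle of the rollout.
        if server == "i-a" and "i-c" not in fleet:
            fleet.append("i-c")

    print(rolling_update(lambda: list(fleet), demo_deploy, lambda s: True,
                         {"i-a": "v1", "i-b": "v1"}, "v2", "v1"))
```

Even this toy version ignores draining connections, partial failures during the rollback itself, and servers terminated mid-deploy, which is part of why doing this well is non-trivial.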

Long story short, you get most of this for "free" by leveraging a modern container management platform like Kubernetes or ECS. You don't need to implement all this crazy stuff because the platforms provide it for you. In the case of Kubernetes, if it doesn't support something out of the box, there are hundreds of apps for it that will probably implement what you need.

So instead of reinventing the wheel and ending up with a snowflake infrastructure, we advise all of our customers to avoid this trap and go straight to a container management platform.

@osterman osterman added the faq label Dec 21, 2018