
Flowchart: How should I run containers on AWS?

108 points | kiyanwang | 4 years ago | vladionescu.me

62 comments

[+] atomland|4 years ago|reply
I run multiple small EKS clusters at a small company. It doesn’t cost anywhere near $1 million per year, even taking my salary into account. If you don’t factor in my salary, it’s maybe $50k per year, and that’s for 4 clusters.

Honestly this flowchart is kind of a mess, and I certainly wouldn’t recommend it to anyone.

[+] dzikimarian|4 years ago|reply
Same here. Multiple Java/PHP applications on EKS. It got much better once we found a few guys who focused on resolving the issues instead of complaining about how hard Kubernetes is.
[+] larrymyers|4 years ago|reply
If you have the desire to understand the individual components of running infrastructure for docker containers I'd suggest the full hashicorp stack.

Running nomad, vault, consul together isn't difficult, and will get you a better understanding how to deploy 12 factor apps with good secret storage and service discovery.

Add in Traefik for routing and you've got an equivalent stack to k8s, but you can actually understand what each piece is doing and scale each component as needed.

If you're going to run this all on AWS you can stick to just EC2 and not have to drown in documentation for each new abstraction AWS launches.

As an added bonus nomad can run far more things than just containers, so you have an on-ramp for your legacy apps.

[+] tyingq|4 years ago|reply
I suppose one thing to watch out for is not getting too comfortable with the "isn't difficult" part. That seemed to be the root cause of the extended outage at Roblox: the "it all works together" bit lulled them into not really researching the impact of changes.
[+] jpgvm|4 years ago|reply
Yeah, no. ECS is the worst of all worlds; Fargate made it less shit, it didn't make it good.

If you don't have the people to do k8s then stick to Lightsail, don't do containers poorly just because you can.

Half-assing it will just make everyone miserable and end up with a mishmash of "things that can run on ECS" and "things that can't because they need X", where X is really common/useful stuff like stateful volumes (EBS).

[+] DandyDev|4 years ago|reply
My experience with ECS is quite okay. You start a cluster, stick a container on it through a task and optionally a service + load balancer and it just works.

It doesn't seem any harder than EKS, and it's mostly cheaper.

I also find some comments on this article about vendor lock-in dubious, because in the end it's a bunch of containers created from Dockerfiles, which you can easily reuse elsewhere.

[+] dmw_ng|4 years ago|reply
ECS can do EBS volumes via the Rex-Ray Docker plugin. Depending on how fast you need new instances to come up, installation can be a 5-liner in userdata
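For reference, that userdata install looks roughly like this (region, cluster name, and plugin options here are illustrative placeholders; check the rexray/ebs plugin docs for the exact flags your setup needs):

```shell
#!/bin/bash
# Illustrative ECS instance userdata: install the Rex-Ray EBS Docker plugin
# so tasks can mount EBS-backed Docker volumes. Values below are examples.
docker plugin install rexray/ebs \
  REXRAY_PREEMPT=true \
  EBS_REGION=us-east-1 \
  --grant-all-permissions
echo ECS_CLUSTER=my-cluster >> /etc/ecs/ecs.config
```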
[+] InTheArena|4 years ago|reply
This is awful. It completely ignores the economics of leveraging commodity services and re-usable skills, while overlooking the true places where you can get maximal value if you are willing to accept vendor lock-in: the high-level services.

This is more or less the equivalent of mandating that everyone use VMS once UNIX started to commodify, or use Windows for all of your servers once Linux had taken over the market. Using ECS + Fargate instead of EKS + Fargate provides no savings; all it serves to do is lock you into a single hyper-scaler's infrastructure, at the same time as K8s is forcing the cloud vendors into commoditization.

Want to use AWS effectively? Depend on high level services like Glue, Athena, Kinesis Firehose, Sagemaker. Want to piss away any chance to run your business effectively? Leverage ECS.

If you are a one-man shop, and you know ECS or are willing to depend on underbaked solutions because they solve a problem, more power to you. I suspect that you may benefit over your career by investing in more universal skill sets (alternatively, you may benefit from hyper-specializing on AWS toolsets as well).

[+] glogla|4 years ago|reply
I would very much say it's the other way around.

AWS is good at giving you commodity infrastructure at scale. They are really good at "stupid" services like S3 or EC2 or RDS. The high-level services, meanwhile? I've worked with quite a few of them and they're mostly shit.

Athena has an account-wide limit of 5 concurrent queries, which cannot be increased. Even one larger dashboard will overload it.

Redshift has a similar limit of 30 concurrent queries. That's good enough for casual use but not suitable for a larger company.

Glue Catalog does not scale at all; having more than a few objects will break it, and you will very soon end up begging for API limit increases due to throttling exceptions.
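
Throttling like this is usually papered over with client-side exponential backoff. A minimal sketch of the kind of retry wrapper you end up writing around Glue Catalog calls (the exception name, delays, and attempt count are illustrative, not a specific AWS SDK API):

```python
import time
import random

def with_backoff(fn, max_attempts=5, base_delay=0.5, retryable=("ThrottlingException",)):
    """Wrap fn so throttling-style errors are retried with exponential backoff."""
    def wrapper(*args, **kwargs):
        for attempt in range(max_attempts):
            try:
                return fn(*args, **kwargs)
            except Exception as e:
                # Retry only throttling-style errors; re-raise everything else,
                # and give up after the last attempt.
                if e.__class__.__name__ not in retryable or attempt == max_attempts - 1:
                    raise
                # Exponential backoff with a little jitter: 0.5s, 1s, 2s, ...
                time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return wrapper
```

You end up wrapping every catalog call in something like this, which is exactly the "begging for API limits" experience the parent describes.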

Kinesis has very strange limits (in messages per second) that make it really expensive for use cases where traffic is peaky, which covers quite a few streaming use cases.
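
The peaky-traffic problem is easy to see with a little arithmetic, assuming the classic provisioned-mode per-shard ingest limits (roughly 1,000 records/s or 1 MB/s per shard; verify against the current quotas page):

```python
import math

# Provisioned-mode Kinesis shard ingest limits (illustrative; check current docs).
RECORDS_PER_SHARD = 1_000      # records/sec per shard
BYTES_PER_SHARD = 1_000_000    # bytes/sec per shard

def shards_needed(peak_records_per_sec, peak_bytes_per_sec):
    """Shards must be provisioned for peak traffic, not the average."""
    return max(
        math.ceil(peak_records_per_sec / RECORDS_PER_SHARD),
        math.ceil(peak_bytes_per_sec / BYTES_PER_SHARD),
    )

# A workload averaging 2k records/s that bursts to 50k records/s still needs
# 50 shards provisioned (and paid for) around the clock.
print(shards_needed(50_000, 10_000_000))  # 50
```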

Just like you wouldn't use Amazon.com to buy high-quality, important goods (because you're pretty likely to get something broken or fake), don't use AWS "high level" services. Amazon is a company focused on scaling commodity use cases, not on engineering excellence.

[+] gavinray|4 years ago|reply
I have never heard of AppRunner, thanks for posting this!

I've used Fargate, as I thought it was the easiest/cheapest route, but according to the notes on the chart:

  "From 2020-2021 the best option was/is ECS and Fargate".

  "Maybe from 2023-ish AppRunner will become the best option. It's a preview service right now but on the path to be awesome!"
 
  "It may not totally take over ECS on Fargate, but it will take over most common usecases."
And according to this chart, AppRunner is apparently the service I ought to be using for most of my apps.
[+] CSDude|4 years ago|reply
AppRunner has lacked VPC support since launch. Otherwise it's a great service.
[+] sombremesa|4 years ago|reply
You don’t know about lightsail containers? Why make a flowchart that omits one of the simpler solutions in favor of complex ones?

Agreed with the other commenters that this flowchart is not to be recommended, for this and other reasons.

[+] CSDude|4 years ago|reply
Not a flowchart, but it has additional information: https://www.lastweekinaws.com/blog/the-17-ways-to-run-contai...
[+] DandyDev|4 years ago|reply
Super informative article, thanks for that!

It comes to the same conclusions that I intuitively had as well:

- EKS if you really have to use k8s
- ECS if you have a modicum of complexity but still want to keep it simple
- AppRunner if you just want to run a container

[+] bilalq|4 years ago|reply
Yeah, this article is the first thing that came to mind for me as well.

I still need to explore how AppRunner compares to CodeBuild for this purpose.

[+] fivea|4 years ago|reply
Cool link, although I think that if your deployment package is <100MB, it's preferable to use AWS Lambda by simply deploying the zip archive instead of pushing a container image.
[+] time0ut|4 years ago|reply
Last time I tried it (~May 2021), Lambda containers had terrible cold start times. Even a Node hello world would cold start in 2-3 seconds. The same code packaged as a ZIP file would cold start in less than a tenth of the time. Maybe it's better now?
[+] jayar95|4 years ago|reply
It's been a year since I touched an AWS Lambda, but I'd bet money that cold starts are still an issue. There is a common hack that half works: have your function run every minute (you can use EventBridge rules for this); in the function handler, the first thing you should evaluate is whether it's a warming event, and exit 0 if it is. Your results may vary (mine did lol)
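The handler side of that hack is only a few lines. A sketch, assuming the warming rule sends the standard EventBridge scheduled-event payload (its `source` field is `"aws.events"`; adjust the check if your rule sends a custom payload):

```python
def handler(event, context):
    # Scheduled EventBridge warming ping: bail out before doing any real work.
    if event.get("source") == "aws.events":
        return {"warmed": True}

    # ...real request handling goes here...
    return {"statusCode": 200, "body": "hello"}
```

Note this only keeps one execution environment warm; concurrent requests can still hit cold containers, which is one reason the hack only "half works".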
[+] krinchan|4 years ago|reply
Unfortunately, my enterprise scale employer forced me onto ECS on EC2 for some of my apps for a very specific reason: Reserved instance pricing. I think there’s reserved instance-like pricing for Fargate now. For one particular set of containers (Java application we license from a vendor and then customize with plugins somewhat), the CPU and RAM requirements are fairly large so the savings of ECS on EC2 with the longest Reserved Instance Contract means that I will forever be dealing with the idiocy of ASG based Capacity Providers.

For those not in the know, ASG-based Capacity Providers are hard to work with because they are effectively immutable, so you end up having to create-then-delete for any change that touches the capacity provider. A capacity provider cannot be deleted while the ASG has any instances. Many tools, like Terraform's AWS provider, will refuse to delete the ASG until the capacity provider is deleted. The Terraform provider just cannot properly reason about the process of discovering and scaling in the ECS tasks on the provider, scaling in the ASG, waiting for 0 instances, and then deleting the capacity provider. It's honestly beyond how providers are supposed to work.

TL;DR: The flow chart is somewhat correct: Do everything in your power to run on ECS Fargate. It’s mature enough and has excellent VPC and IAM support these days. Stay as far away from ECS on EC2 as you can.

As for EKS, I like it but this company runs on the whole “everyone can do whatever” so each team would have to run its own EKS cluster. If we had a centralized team providing a base k8s cluster with monitoring and what not built in for us to deploy on, I’d be more amenable to it. As it stands, I would have to learn both the development and ops AND security sides of running EKS for a handful of apps. ECS while seeming similar on the surface is much simpler and externalizes concepts like ingresses and load balancing and persistence into the AWS concepts and tooling (CDK, CloudFormation, Terraform) you already know (one hopes).

[+] jpgvm|4 years ago|reply
I would go as far as to say avoid ECS as much as possible. Using ECS heavily means using either CloudFormation or Terraform heavily, both of which are shitty tools (TF is probably the best tool in its class; I haven't tried Pulumi yet, but that doesn't stop TF from being shit).

Importantly, both are almost impossible for "normal" developers to use with any level of competency. This leads to two inevitable outcomes: a) snowflakes, and all the special per-app ops work that entails, and b) app teams pushing everything back to infrastructure teams, because they either don't want to work with TF or can't be granted sufficient permissions to use it effectively.

k8s solves these challenges much more effectively assuming your ops team is capable of setting it up, managing the cluster(s), namespaces, RBAC, etc and any base-level services like external-dns, some ingress provider, cert-manager, etc.

Once you do this then app teams are able to deploy directly, they can use helm (eww but it works) to spin up whatever off-the-shelf software they want and are able to easily write manifests in a way that they can't fuck up horribly as easily.

Best for both teams, ops and devs. Downside? It requires a competent ops team (hard to find) and also some amount of taste in tooling (use things like Tanka to make manifests less shit), not to mention the time to actually spin all this up in peace without being pushed to do tons of ad-hoc garbage continually (i.e. a competent org).

So in summary, k8s is generally the right solution for larger orgs because it enforces better split of responsibilities and establishes a powerful (relatively) easy to use API that can support practically everything.

Also, in the future there are things like ACK (https://aws.amazon.com/blogs/containers/aws-controllers-for-...) coming, which will further reduce the need for app teams to interact with TF or CloudFormation.

[+] DandyDev|4 years ago|reply
Even without using reserved instance pricing, ECS on EC2 is much cheaper, isn't it? At work we use Hasura, which is written in Haskell and cannot be (easily?) run as a Lambda. Our alternative solution is to run it as a container on ECS. Given that it's a permanently running service, with Fargate we'd pay just to have it sit idle for half of the time, and Fargate is not cheap.

Even when running non-reserved EC2 instances to make up our ECS cluster, it is cheaper than using Fargate.
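
The gap is easy to ballpark for one always-on service. The numbers below are illustrative us-east-1 on-demand rates from around the time of this thread, not current pricing; check the Fargate and EC2 pricing pages before drawing conclusions:

```python
# Rough hourly cost for one always-on 2 vCPU / 8 GB service.
# Prices are illustrative historical us-east-1 on-demand rates.
FARGATE_VCPU_HR = 0.04048   # per vCPU-hour
FARGATE_GB_HR = 0.004445    # per GB-hour
M5_LARGE_HR = 0.096         # m5.large: 2 vCPU, 8 GB

fargate_hr = 2 * FARGATE_VCPU_HR + 8 * FARGATE_GB_HR
print(f"Fargate:        ${fargate_hr:.4f}/hr")          # ~$0.1165/hr
print(f"EC2 (m5.large): ${M5_LARGE_HR:.4f}/hr")
print(f"Fargate premium: {fargate_hr / M5_LARGE_HR - 1:.0%}")
```

At these rates the Fargate premium is around 20% before any reserved/savings-plan discounts, which only widen the gap for steady 24/7 workloads; Fargate tends to win instead when utilization is low or bursty.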

[+] speedgoose|4 years ago|reply
If I don't want vendor lock-in, or a minimal amount, how should I run containers on AWS?
[+] atomland|4 years ago|reply
Well, to begin with, I think people worry too much about vendor lock-in. Use the tools your cloud vendor provides to make your life easier. Isn’t that one of the reasons you chose them?

That said, moving containers to another container orchestrator isn’t terribly difficult, so I don’t personally worry about vendor lock-in for containerized workloads. If your workloads have dependencies on other vendor-specific services, that’s a different story, but basically a container is easy to move elsewhere.

[+] glenjamin|4 years ago|reply
If the containerized app you’re deploying follows 12 factor principles it’s very unlikely that you’ll be locked in due to specific functionality

The cost to move your operations expertise to another platform and learn all of its new quirks might be significant though.
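
The config side of 12-factor is what makes that portability concrete: everything deploy-specific comes from env vars, so moving platforms is just injecting different values. A minimal sketch (the variable names are hypothetical examples, not a required convention):

```python
import os

# 12-factor config: all deploy-specific settings come from the environment,
# so the same container image runs unchanged on ECS, k8s, Nomad, or a bare VM.
# Only the injected env vars differ per platform. Names are examples.
DATABASE_URL = os.environ.get("DATABASE_URL", "postgres://localhost/dev")
PORT = int(os.environ.get("PORT", "8080"))
LOG_LEVEL = os.environ.get("LOG_LEVEL", "info")

print(f"listening on :{PORT}, db={DATABASE_URL}, log={LOG_LEVEL}")
```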

[+] InTheArena|4 years ago|reply
EKS + Fargate.

It’s that simple. If you need extensions into the AWS infrastructure, check out their CRD extensions that allow you to provision all of the infrastructure using K8s.

[+] bkanber|4 years ago|reply
Docker or kubernetes or any other orchestration software of your choice on EC2
[+] evoxmusic|4 years ago|reply
You should take a look at qovery.com - quite simple to deploy a containerized app on AWS
[+] lysecret|4 years ago|reply
Our infra is running on ECS because we set it up right before Docker on Lambda, haha. Now we don't have the time to switch. It would be much better for us though.
[+] pilotneko|4 years ago|reply
If you have less than 250 engineers, this guide is not for you? Strange.

Edit: Turns out I read it wrong, apologies. This guide is for companies with less than 250 engineers.

[+] bluehatbrit|4 years ago|reply
I believe it's actually saying if you have more than 250 engineers, then the guide is not for you.
[+] fivea|4 years ago|reply
> If you have less than 250 engineers, this guide is not for you? Strange.

I'd guess that if your org is big enough to have 250 engineers on its payroll, AWS services are a waste of cash, given you can deploy better and cheaper. For example, Hetzner has fewer than 200 employees, and it's a global cloud provider.

[+] OJFord|4 years ago|reply
*more than, not less. (And then sibling comments/the annotated version explain why if you still think that's strange.)