skywhopper | 10 years ago

Hmm, this is the second such post I've seen recently that lays out the company's infrastructure stack and, a few points in, mentions how they've outsourced all their logging and alerting to DataDog, which solved all their problems in that area. DataDog seems like a nice product, and I know nothing about it, but after seeing how ... aggressively they were marketing at re:Invent, color me skeptical that these stack discussions are entirely spontaneous.

duggan | 10 years ago

Author here. I suspect all this means is that Datadog's marketing is as effective as their product.

It's rare that I give any software or service truly glowing praise. Rare enough that I'm pleased to do so at any opportunity I get.
twic | 10 years ago

> To me, an orchestration system should control the entire provisioning process, turning a plan defined in code into a production system. As a result I (perhaps unhelpfully) don’t believe anyone has built an orchestration system

Stackbuilder!

https://github.com/tim-group/stackbuilder

The docs for Stackbuilder are still horrible, but here's an example stack:

https://github.com/tim-group/stackbuilder-config-example/blo...

Assuming you have a compute fabric in place, running a script like that through the tool will provision and configure VMs for all the parts of a production system: it knows about Java applications (started with java -jar), Apache proxy servers, Linux NAT, IPVS load balancing, MySQL databases, Puppetmasters, and possibly other things. The fabric provides KVM for VMs and BIND for DNS, controlled via some custom MCollective plugins. Configuration is done via Puppet, but Stackbuilder creates the Puppetmaster as part of the build. Application firewalls and Nagios checks also get configured as part of the build, but I can't remember whether it's Stackbuilder itself that does that or some of the Puppet code.

BOSH is in the same space:

https://bosh.io/

Again, it requires an existing fabric, but that can be plain AWS or OpenStack (or VMware). It can then build anything you can write a manifest for; it operates at a lower level of abstraction than Stackbuilder.
falcolas | 10 years ago

Saltstack, with the cloud module, fits into this role rather well. It's more of a declarative, discoverable setup, but I've found that to be acceptable.
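For anyone unfamiliar with the salt-cloud side of that: provisioning is driven by small YAML profile files plus one command. A minimal sketch (the profile name, provider name, and AMI below are placeholders, not from the thread):

```
# /etc/salt/cloud.profiles.d/web.conf
# References a provider block defined in /etc/salt/cloud.providers.d/
web_server:
  provider: my-ec2-config
  image: ami-xxxxxxxx
  size: t2.small
```

Then `salt-cloud -p web_server web1` creates the VM, bootstraps a Salt minion on it, and accepts its key on the master, after which ordinary Salt states apply — which is what makes the setup "discoverable" compared to hand-managed instances.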
mwcampbell | 10 years ago

Have you looked at Joyent's Triton service? They claim they can run containers securely on bare metal, since the underlying kernel is Illumos (but it can run Linux binaries and thus Docker containers). So the trade-off between isolation and efficient resource usage would disappear.

On the one hand, they don't have all the same higher-level services as AWS, like ELB, let alone managed database services. On the other hand, I don't think I'd sleep well running anything based on EBS ever again, since EBS is so notorious for cascading failures.
pradeepchhetri | 10 years ago

I got the chance to play with Joyent's Triton Elastic Container Service. Yes, the trade-off between resource isolation and efficient usage would disappear, since they use SmartOS zones rather than Linux namespaces or cgroups to provide strong isolation between containers. They have forked Mesos[1] and added the capability to run Triton containers as Mesos tasks[2]. Some of the questions that need to be explored:

- Whether they provide some kind of built-in service discovery.

- Whether all existing Mesos frameworks will support Triton-based Mesos deployments, since many frameworks make use of different networking modes and the Docker storage engine.

[1]: https://github.com/joyent/mesos

[2]: https://www.joyent.com/blog/mesos-by-the-pound
duggan | 10 years ago

To my recollection, EBS has only suffered one such major failure, some time in 2011 (I was working for Engine Yard at the time). It hasn't factored strongly into my decision making since, other than the thought that, having weathered it, AWS is likely now well prepared against a repeat.

AWS is a known quantity to me; Joyent, unfortunately, is not. I have a lot of time for Bryan Cantrill, though, so I'm sure they're doing good stuff - I just don't have the spare cognitive cycles to take on a whole new base layer on top of all the scheduler / container stuff right now.
flowerpot | 10 years ago

Maybe I'm missing the right articles, or I'm thinking about this the wrong way, but I can't seem to find any resources on how people run databases in these kinds of infrastructure stacks. I have no problem understanding how I can deploy my 12-factor application on Kubernetes, for example, and load balance across it, but persistence seems to be missing. Do people just use Amazon's/Compose's/etc. database offerings and not worry about it themselves?
duggan | 10 years ago

Yeah, I sort of alluded to it in the article but didn't expand on it - we're doing a combination of things:

1. Using AWS services where we can (DynamoDB).

2. Provisioning more "traditionally" to instances, and managing those independently of the Mesos scheduler.

3. Experimenting with scheduler-based solutions[1] (which are still pretty bleeding edge, but promising).

As I mentioned, EMC are (to me) doing the most interesting stuff here[2], because they're leaning on a lot of existing production systems like EBS and Mesos' own scheduler.

[1]: https://github.com/mesos/elasticsearch

[2]: http://blog.emccode.com/2015/10/08/enabling-external-volume-...
markbnj | 10 years ago

Great post, and it aligns with many of the practices we're following now. It also links to two other great posts, from Segment and Joe Beda, and taken together the three are quite valuable.

On the subject of Kubernetes and heterogeneous environments: Kubernetes itself may allow a mix of instance types (node types) in a cluster, but as implemented in GKE, for example, it does not. I believe ECS has similar constraints. Our response has been to think in terms of separate clusters for services, edge routing, persistence, etc.
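When a cluster does support mixed node types, the Kubernetes-level mechanism for steering workloads onto the right nodes is a node label plus a `nodeSelector` in the pod spec. A sketch (the `disktype` label and names here are made-up examples, not from the post):

```
# First label a node:  kubectl label nodes node-1 disktype=ssd
apiVersion: v1
kind: Pod
metadata:
  name: db
spec:
  nodeSelector:
    disktype: ssd        # schedule only onto nodes carrying this label
  containers:
  - name: postgres
    image: postgres
```

Separate clusters per concern, as described above, sidestep the need for this entirely at the cost of more cluster overhead.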
handimon | 10 years ago

I think these posts are good not only for seeing what other companies are thinking about infrastructure and the tools they use, but also for seeing how many different levels of control one can apply to the problem of creating software in the cloud these days - as well as the amorphous role of devops and the new challenges and solutions that come up all the time.

I also appreciate the end, where Ross talks about limiting the amount of innovation and trying to be practical whenever possible. Too many people are completely reactive to technology, and used that way it only trades one set of problems for another, because it all has to work together.

The design on the blog and the main Barricade site is really awesome too. Great post!
room271 | 10 years ago

One question related to AWS is how, or whether it is worth it, to segment access to resources for containers. If you are using single instances for each service, you can use instance roles for this. These are great, as the credentials are temporary, tied to the machine, and detected automatically by most of the client libraries.

If you are using containers, you presumably need to create proper users. Or you need to give all the permissions to the instance.

Maybe this isn't a problem in practice, because people group related services together into separate orchestration clusters (i.e. a Kubernetes cluster for each service grouping).

But it would be great to hear some real experiences on this.
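For context on why instance roles are "detected automatically": on EC2, the SDKs simply read temporary credentials from the link-local metadata service, which is exactly why every container on the instance inherits the same role. A minimal sketch of that mechanism (the endpoint and JSON field names are the real EC2 ones; the helper functions are illustrative, not an official API):

```python
import json
from urllib.request import urlopen

# Instance-role credentials are served from the link-local metadata
# endpoint; SDKs poll this and refresh before the expiry time.
METADATA_URL = (
    "http://169.254.169.254/latest/meta-data/"
    "iam/security-credentials/"
)

def parse_credentials(doc: str) -> dict:
    """Pull out the fields client libraries actually use from the
    metadata service's JSON credentials document."""
    data = json.loads(doc)
    return {
        "access_key": data["AccessKeyId"],
        "secret_key": data["SecretAccessKey"],
        "token": data["Token"],          # session token: creds are temporary
        "expires": data["Expiration"],   # SDKs refresh before this time
    }

def fetch_role_credentials(role_name: str) -> dict:
    # Only works on an actual EC2 instance with an instance profile attached.
    with urlopen(METADATA_URL + role_name) as resp:
        return parse_credentials(resp.read().decode())
```

Since anything on the instance can hit that endpoint, per-container segmentation means either separate IAM users with static keys, or grouping workloads by trust level onto separate instances/clusters as suggested above.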
nodesocket | 10 years ago

Great post. In terms of logging, check out Papertrail. They are super simple: just set up rsyslog to point to Papertrail, then update your services (nginx, postgres, mongodb) to use syslog.
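The rsyslog side of that is a one-line forwarding rule (the `N` and port are placeholders for the values Papertrail assigns to your account):

```
# /etc/rsyslog.conf — forward all facilities/severities over UDP
# (use @@ instead of @ for TCP)
*.* @logsN.papertrailapp.com:XXXXX
```

After a `service rsyslog restart`, anything the services write to syslog ships to the hosted destination.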
ftwynn | 10 years ago

> [Logging:] I can’t be the only one who thinks this area is lacking its Stripe equivalent.

Can I ask you to expand on this a little? There are a bunch of cloud log management solutions, and I'm not sure what makes any of them Stripe-equivalent or not.
IceyEC | 10 years ago

zimbatm | 10 years ago