Running Postgres in Kubernetes [pdf]

[+] sasavilic|5 years ago|reply

Unless you have a really good shared storage, I don't see any advantage for running Postgres in Kubernetes. Everything is more complicated without any real benefit. You can't scale it up, you can't move the pod. If pg fails to start for some reason, good luck jumping into container to inspect/debug stuff. I am not going to upgrade PG every 2 weeks, nor is it my fresh new microservice that needs to be restarted when it crashes or scaled up when I need more performance. And PG has a high availability solution which is kind of orthogonal to what k8s offers.

One could argue that for sake of consistency you could run PG in K8S, but that is just hammer & nail argument for me.

But if you have a really good shared storage, then it is worth considering. But, I still don't know if any network attached storage can beat a locally attached RAID of solid-state disks in terms of performance and/or latency. And there is/was the fsync bug, which is terrible in combination with somewhat unreliable network storage.

For me, I see any database the same way I see etcd and other components of k8s masters: they are the backbone. And inside the cluster I run my apps/microservices. These apps are subject to frequent change and upgrades and thus profit most from having automatic recovery, failover, (auto)scaling, etc.

[+] qeternity|5 years ago|reply

You don't run shared/network storage. You run PVCs on local storage and you run an HA setup. You ship log files every 10s/30s/1m to object storage. You test your backups regularly which k8s is great for.

All of this means that I don't worry about most things you mention. PG upgrade? Failover and upgrade the pod. Upgrade fails? wal-g clone from object storage and rejoin cluster. Scaling up? Adjust the resource claims. If resource claims necessitate node migration, see failover scenario. It's so resilient. And this is all with raid 10 nvme direct attached storage just as fast as any other setup.

You mention etcd but people don't run etcd the way you're describing postgres. You run a redundant cluster that can achieve quorum and tolerate losses. If you follow that paradigm, you end up with postgres on k8s.
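The quorum arithmetic behind that paradigm is small enough to sketch. This is an illustration only, assuming a simple majority quorum of the kind etcd-style consensus stores use; the function names are made up for the example.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting nodes."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Nodes that can be lost while a majority is still reachable."""
    return members - quorum_size(members)


if __name__ == "__main__":
    for n in (1, 3, 4, 5):
        print(f"{n} members: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Note that going from 3 to 4 members does not buy any extra failure tolerance under this rule, which is why odd cluster sizes are the usual choice.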

[+] jonfw|5 years ago|reply

The nice thing about running a DB inside a cluster is running your entire application, end to end, through one unified declarative model. It's really really easy to spin up a brand new dev or staging environment.

Generally, though, in production you're not going to be taking down DBs on purpose. If it's not supposed to be ephemeral, it doesn't fit the model.

[+] halbritt|5 years ago|reply

I ran ~200 instances of Postgres in production for a SaaS product. This was on top of GCP persistent disk, which qualifies as quite good network storage, all of it backed up by what is now called Velero.

This particular database was not a system of record. The database stored the results of a stream processing system. In the event of a total loss of data, the database could be recovered by re-streaming the original data, making the operation of PG in a kubernetes cluster a fairly low risk endeavor. As such, HA was not implemented.

This setup has been running in production for over two years now. In spite of having no HA, each instance of this application backed by this DB had somewhere between four and five nines of availability while being continuously monitored on one minute intervals from some other spot on the Internet.
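For a sense of scale, the downtime budget implied by "four to five nines" can be worked out directly. A rough sketch, assuming a 365-day year; the helper name is illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for nines in (3, 4, 5):
        minutes = downtime_minutes_per_year(nines)
        print(f"{nines} nines: about {minutes:.1f} minutes of downtime per year")
```

Four nines works out to roughly 53 minutes a year and five nines to roughly 5 minutes, so with probes at one-minute intervals the difference between the two comes down to a few dozen failed checks per year.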

During the course of my tenure, there was only one data loss incident in which an engineer mistakenly dropped a table. Recovery was painless.

I've since moved on to another role and can't imagine having to run a database without having the benefits of Kubernetes. I'm forced to, however, as some folks aren't as progressive, and damn does it feel archaic.

[+] random3|5 years ago|reply

GCP volumes are over network already. You can deploy stateful workloads using StatefulSets. We've run HBase workloads for development purposes (about 20-30x cheaper than BigTable) and it worked great (no issues for over 12 months). While Postgres is hardly a distributed database, there may be some advantages in ensuring availability, and perhaps even more in a replicated setup.

[+] jchw|5 years ago|reply

> If pg fails to start for some reason, good luck jumping into container to inspect/debug stuff.

kubectl exec, attach and cp all make this trivial. Whatever type of inspection you want should be relatively doable.
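As a rough illustration of what that looks like, here is a small sketch that only builds the `kubectl` command lines rather than running them. The pod name `pg-0`, the namespace `db`, and the log path are hypothetical placeholders, not taken from any real deployment.

```python
import shlex
from typing import List


def exec_shell(pod: str, namespace: str) -> List[str]:
    """Open an interactive shell in a running container."""
    return ["kubectl", "exec", "-it", pod, "-n", namespace, "--", "/bin/sh"]


def copy_from_pod(pod: str, namespace: str, remote: str, local: str) -> List[str]:
    """Copy a file or directory out of the pod for offline inspection."""
    return ["kubectl", "cp", f"{namespace}/{pod}:{remote}", local]


def previous_logs(pod: str, namespace: str) -> List[str]:
    """Fetch logs of the previous (crashed) container instance."""
    return ["kubectl", "logs", pod, "-n", namespace, "--previous"]


if __name__ == "__main__":
    # Hypothetical pod, namespace and path, for illustration only.
    for argv in (
        exec_shell("pg-0", "db"),
        copy_from_pod("pg-0", "db", "/var/lib/postgresql/data/log", "./pg-logs"),
        previous_logs("pg-0", "db"),
    ):
        print(shlex.join(argv))
```

One caveat: `kubectl exec` needs a running container, so for a Postgres that exits immediately on start, `kubectl logs --previous` is usually the more useful first step.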

Putting stuff in kubernetes also lets you take advantage of the ecosystem, including network policies, monitoring and alerting systems, health and liveness testing, load balancing, deployment and orchestration, etc.

[+] dilyevsky|5 years ago|reply

Ime most pg deployments don’t need insanely high iops and become cpu bound much quicker. So running ebs or gcp pd ssd or even ceph pd is usually enough.

[+] aprdm|5 years ago|reply

That looks very interesting and super complex.

I wonder how many companies really need this complexity. I bet 99.99% of companies could get away with vertically scaling the writes and horizontally scaling read-only replicas, which would reduce the number of moving parts a lot.

I have yet to play much with kubernetes but when I see those diagrams it just baffles me how people are OK with running so much complexity in their technical stack.

[+] adamcharnock|5 years ago|reply

I generally work with smaller companies, but early on (Kubernetes 1.4 ish) I found that hosting mission-critical stateful services inside Kubernetes was more trouble than it was worth. I now run stand-alone Postgres instances in which each service has its own DB. I’ve found this very reliable.

That being said, I think Kubernetes now has much better support for this kind of thing. But given my method has been so stable, I just keep on going with it.

[+] wiml|5 years ago|reply

I've come to the conclusion that, much like how purchasing decisions seem irrational until you realize that different kinds of purchases come out of different budgets, there are different "complexity budgets" or "ongoing operational maintenance burden" budgets in an organization, and some are tighter than others.

[+] mamon|5 years ago|reply

It actually is not that complex. I'm using Crunchy Postgres Operator at my current employer. You get an Ansible playbook to install an Operator inside Kubernetes, and after that you get a command-line administration tool that lets you create a cluster with a simple

pgo create cluster <cluster_name>

command.

Most administrative tasks like creating or restoring backups (which can be automatically pushed to S3) are just one or two pgo commands.

The linked pdf looks complex, because it:

a. compares 3 different operators

b. goes into implementation details that most users are shielded from.

And I'm actually not sure which one of the three operators the author is recommending :)

[+] merb|5 years ago|reply

BTW, the Zalando operator is rougher, but still pretty easy to use. The Crunchy operator does not work in every environment but is extremely simple (BTW, the Crunchy operator uses the building blocks of Zalando). I've used the Zalando operator since k8s 1.4: no data loss, everything just works. OK, major upgrades are rough, but they are rough even without the Zalando operator.

[+] cryptonector|5 years ago|reply

"But it has to run in Kubernetes!"

[+] caiobegotti|5 years ago|reply

Please don't. It's not because it's possible that it is a good idea. The PDF itself clearly shows how it can get complex quickly. The great majority of people won't ever be able to do this properly, securely and with decent reliability. Of course I may have to swallow my words in the future in case a job requires it but unless you REALLY REALLY REALLY need PostgreSQL inside Kubernetes IMHO you should just stick with private RDS or Cloud SQL then point your Kubernetes workloads to it inside your VPCs, all peered etc. Your SRE mental health, your managers and company costs will thank you.

[+] deathanatos|5 years ago|reply

I've done MySQL RDS, and I've seen k8s database setups. (But not w/ PG.)

RDS is okay, but I would not dismiss the maintenance work required; RDS puts you at the mercy of AWS when things go wrong. We had a fair bit of trouble with failovers taking 10x+ longer than they should. We also set up encryption, and that was also a PITA: we'd consistently get nodes with incorrect subjectAltNames. (Also, at the time, the certs were either for a short key or signed by a short key, I forget which. It was not acceptable at that time, either; this was only 1-2 years ago, and I'm guessing hasn't been fixed.) Getting AWS to actually investigate, instead of "have you tried upgrading" (and there's always an upgrade, it felt like), was a struggle. RDS MySQL's (maybe Aurora? I don't recall fully) first implementation of spatial indexes was flat-out broken, and that was another lengthy support ticket. The point is that bugs will happen, and cloud platform support channels are terrible at getting an engineer in contact with an engineer who can actually do something about the problem.

[+] IanGabes|5 years ago|reply

In my personal opinion, there are three database types.

'Small' databases are the first, and are easy to dump into kubernetes. Any DB with a total storage requirement of 100GB or less (if I lick my finger and try to measure the wind), really, can be easily containerized, dumped into kubernetes and you will be a happy camper because it makes prod / dev testing easy, and you don't really need to think too much here.

'Large' databases are too big to seriously put into a container. You will run into storage and networking limits for cloud providers. Good luck transferring all that data off bare metal! Your tables will more than likely need to be sharded to even start thinking about gaining any benefit from kubernetes. By my own rubric, my team runs a "large" MySQL database with large sets of archived data that uses more storage than managed cloud SQL solutions can provide. It would take us months to re-design to take advantage of the MySQL clustering mechanisms, along with following the learning curve that comes with it.

'Massive' databases need to be planned and designed from "the ground up" to live in multiple regions, and leverage respective clustering technologies. Your tables are sharded, replicated and backed up, and you are running in different DCs attempting to serve edge traffic. Kubernetes wins here as well, but, as the OP suggests, not without high effort. K8S gives you the scaling and operational interface to manage hundreds of database nodes.

It seems weird to me that Vitess and the OP belabour their Monitoring, Pooling, and Backup story, when I think the #1 reason you reach for an orchestrator in these problem spaces is scaling.

All that being said, my main point here is that orchestration technologies are tools, and picking the right one is hard, but can be important :) Databases can go into k8s! Make it easy on yourself and choose the right databases to put there.

[+] GordonS|5 years ago|reply

So, a bit OT, but I'm looking for some advice on building a Postgres cluster, and I'm pretty sure k8s is going to add a lot of complexity with no benefit.

I'm a Postgres fan, and use it a lot, but I've never actually used it in a clustered setup.

What I'm looking at clustering for is not really scalability (we're still at the stage where we can scale vertically), but high availability and backup - if one node is down for an update, or crashes, the other node can take over, and I'd also ideally like point-in-time restore.

There seems to be a plethora of OSS projects claiming to help with this, so it looks like there isn't "one true way" - I'd love to hear how people are actually setting up their Postgres clusters in practice?

[+] penagwin|5 years ago|reply

Compared to many databases, postgres HA is a mess. It has built-in streaming replication, but no failover of any kind; all of that has to be managed by another application.

We've had the best luck with Patroni, but even then you'll find the documentation confusing, have weird issues, etc. You'll need to set up etcd/Consul to use it. That's right, you need a second database cluster to set up your database cluster.... Great...

I have no clue how such a community favorite database has no clear solution to basic HA.

[+] lhenk|5 years ago|reply

Patroni might be interesting: https://github.com/zalando/patroni

[+] random3|5 years ago|reply

The main advantage with Kubernetes (especially in low ops environments like GKE) is not scalability, but availability and ease of development (spinning things up and down is super-easy). The learning curve to stand something up is not very high and pays off over time compared to SSH-ing into VMs.

[+] johncolanduoni|5 years ago|reply

Kubernetes can’t change any database’s HA or durability features; there’s no magic k8s can apply to make a database that does e.g. asynchronous replication have the properties of one that does synchronous replication. So you’ll never gain any properties your underlying database is incapable of providing.
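The gap between the two replication modes can be shown with a toy model. This is a deliberately simplified sketch, not how Postgres implements replication: the `Primary` class and its methods are invented for the illustration, and "acknowledged" just means the client was told the commit succeeded.

```python
from typing import List


class Primary:
    """Toy primary with a single replica."""

    def __init__(self, synchronous: bool) -> None:
        self.synchronous = synchronous
        self.log: List[str] = []          # writes acknowledged to clients
        self.replica_log: List[str] = []  # writes that reached the replica
        self.pending: List[str] = []      # writes not yet shipped

    def commit(self, record: str) -> None:
        """Acknowledge a write to the client."""
        self.log.append(record)
        if self.synchronous:
            # Synchronous: the replica has the record before the ack.
            self.replica_log.append(record)
        else:
            # Asynchronous: ack now, ship to the replica later.
            self.pending.append(record)

    def ship(self) -> None:
        """Background shipping of pending records to the replica."""
        self.replica_log.extend(self.pending)
        self.pending.clear()


def lost_on_failover(primary: Primary) -> List[str]:
    """Acknowledged writes missing from the replica if the primary dies now."""
    return [r for r in primary.log if r not in primary.replica_log]


if __name__ == "__main__":
    for mode in (False, True):
        p = Primary(synchronous=mode)
        p.commit("a")
        p.ship()
        p.commit("b")
        p.commit("c")  # primary crashes before the next ship()
        print("sync" if mode else "async", "lost:", lost_on_failover(p))
```

No orchestrator sits in this picture at all: whether `b` and `c` survive is decided entirely by the replication mode, which is the point being made above.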

However, if I had to run Postgres as part of something I deployed on k8s AND for some reason couldn’t use my cloud provider’s built in solution (AWS RDS, Cloud SQL, etc.) I would probably go with using/writing a k8s operator. The big advantage of this route is that it gives you good framework for coordinating the operational changes you need to be able to handle to actually have failover and self-healing from a Postgres cluster, in a self-contained and testable part of your infrastructure.

When setting up a few Postgres nodes with your chosen HA configuration you’ll quickly run into a few problems you have to solve:

* I lose connectivity to an instance. Is it ever coming back? How do I signal that it’s dead and buried to the system so it knows to spin up a fresh replica in the cases where this cannot be automatically detected?

* How do I safely follow the process I need to when upgrading a component (Postgres, HAProxy, PGBouncer, etc.)? How do I test this procedure, in particular the not-so-happy paths (e.g. where a node decides to die while upgrading).

* How do I make sure whatever daemon that watches to figure out if I need to make some state change to the cluster (due to a failure or requested state change) can both be deployed in a HA manner AND doesn’t have to contend with multiple instances of itself issuing conflicting commands?

* How do I verify that my application can actually handle failover in the way that I expect? If I test this manually, how confident am I that it will continue to handle it gracefully when I next need it?

A k8s operator is a nice way to crystallize these kinds of state management issues on top of a consistent and easily observable state store (namely the k8s API’s etcd instance). Operators also provide a great way to run continuous integration tests in which you throw the situations you’re trying to prepare for at the implementation of the failover logic (and your application code), which gives you some confidence that your HA setup deserves the name.

But again, I wouldn’t bite this off if you can use a managed service for the database. Pay someone else to handle that part, and focus on making your app actually not shit the bed if a failover of Postgres happens. The vast majority of applications I’ve worked on that were pointed at a HA instance would have broken down (and in some cases did) during a failover due to things like expecting durability but using asynchronous replication. You don’t get points for “one of the two things that needed to work to have let us avoid that incident worked”.
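The core of that idea is a reconcile function: compare desired state with observed state and emit the actions needed to close the gap. Below is a minimal sketch under heavy assumptions; `ClusterSpec`, `Instance`, and the action strings are hypothetical and do not correspond to any real operator's API. A real operator would, among other things, pick the replica with the most replayed WAL and hold a leader lease before promoting.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSpec:
    replicas: int  # desired number of healthy standby instances


@dataclass
class Instance:
    name: str
    role: str      # "primary" or "replica"
    healthy: bool


def reconcile(spec: ClusterSpec, instances: List[Instance]) -> List[str]:
    """Return the actions needed to move the observed state toward the spec."""
    actions: List[str] = []

    # Fence anything unhealthy first so it cannot come back as a second primary.
    for inst in instances:
        if not inst.healthy:
            actions.append(f"fence {inst.name}")

    healthy = [i for i in instances if i.healthy]
    primaries = [i for i in healthy if i.role == "primary"]
    replicas = [i for i in healthy if i.role == "replica"]

    # Make sure there is exactly one source of writes.
    if not primaries:
        if replicas:
            promoted = replicas.pop(0)
            actions.append(f"promote {promoted.name}")
        else:
            actions.append("restore primary from backup")

    # Top the replica count back up to the desired number.
    for _ in range(max(spec.replicas - len(replicas), 0)):
        actions.append("create replica")

    return actions


if __name__ == "__main__":
    observed = [
        Instance("pg-0", "primary", healthy=False),
        Instance("pg-1", "replica", healthy=True),
        Instance("pg-2", "replica", healthy=True),
    ]
    print(reconcile(ClusterSpec(replicas=2), observed))
```

Because the function is pure (state in, actions out), the unhappy paths from the list above can be exercised in ordinary unit tests before anything touches a real cluster.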

[+] peterwwillis|5 years ago|reply

Google Cloud blog, gently dissuading you from running a traditional DB in K8s: https://cloud.google.com/blog/products/databases/to-run-or-n...

K8s docs explaining how to run MySQL: https://kubernetes.io/docs/tasks/run-application/run-replica...

You could also run it with Nomad, and skip a few layers of complexity: https://learn.hashicorp.com/nomad/stateful-workloads/host-vo... / https://mysqlrelease.com/2017/12/hashicorp-nomad-and-app-dep...

One of the big problems of K8s is it's a monolith. It's designed for a very specific kind of org to run microservices. Anything else and you're looking at an uphill battle to try to shim something into it.

You can also skip all the automatic scheduling fanciness and just build system images with Packer, and deploy them however you like. If you're on a cloud provider, you can choose how many instances of what kind (manager, read-replica) you deploy, using the storage of your choice, networking of choice, etc. Then later you can add cluster scheduling and other features as needed. This gradual approach to DevOps allows you to get something up and running using best practices, but without immediately incurring the significant maintenance, integration, and performance/availability costs of a full-fledged K8s.

[+] lazyant|5 years ago|reply

> One of the big problems of K8s is it's a monolith

while I pretty much agree with everything else you mention, I think it's kind of the opposite; since k8s is fundamentally an API, it's very modular and extensible and this is why it's being successful (I agree it wants you to do things its way and things like databases need to be shimmed at the moment, so the conclusion is similar... for now)

[+] renewiltord|5 years ago|reply

I much prefer just using RDS Aurora. Far fewer headaches. If I don't need low latency I'd use RDS Aurora no matter which cloud I'm hosted on. Otherwise I'll use hosted SQL.

The reason I mention this is that Kubernetes requires a lot of management to run so the best solution is to use GKE or things like that. If you're using managed k8s, there's little reason to not use managed SQL.

The advantages of k8s are not that valuable for a SQL server cluster. You don't even really get colocation of data because you're realistically going to use a GCE Persistent Disk or EBS volume and those are network attached anyway.

[+] caniszczyk|5 years ago|reply

For the MySQL folks, see Vitess as an example of how to run MySQL on Kubernetes: https://vitess.io

[+] arronax|5 years ago|reply

There are also MySQL operators from Oracle, Presslabs, and Percona. Vitess is much more than just MySQL in k8s, and not everyone will be able to switch to it easily (if at all).

[+] chvid|5 years ago|reply

To all the commenters in this thread.

If kubernetes cannot run a database then what good is it? (And I suppose the same issues pop up for things like a persistent queue or a full text indexer.)

The end goal of Kubernetes is to be able to create and recreate environments and scale them up and down at will, all based on a declarative configuration. But if you take databases out of it, then you are not really achieving that goal and are just left with the flipside of kubernetes: a really complex setup and a piece of technology that is very hard to master.

[+] kbumsik|5 years ago|reply

> The end goal of Kubernetes is to be able to create and recreate environments and scale them up and down at will, all based on a declarative configuration.

PG already has its own clustering solution to scale up and down, which is orthogonal to Kubernetes. So running PG in Kubernetes does not add anything. Also, you are much more likely to mess them up when trying to mix two orthogonal technologies.

And a DB is not meant to be created and recreated often unless you want to purge the data. So my take is this: Kubernetes is for managing and configuring microservices, and DBs are not microservices.

[+] kristofarkas|5 years ago|reply

Some say that running stateful applications on K8S is not a good idea anyways, and K8S is best used for stateless applications. Sure you can connect to a stateful DB but the app itself is stateless.

[+] how_gauche|5 years ago|reply

Postgres + stolon + k8 is the easiest time I've ever had bootstrapping a DB for high availability. I'm not sure I'd use it for extremely high throughput apps, but for smallish datasets that NEED to be online, it was amazing. The biggest reason it's amazing? The dev, staging, and prod environments look exactly the same from a coder's perspective, and bringing a fresh one up is always a single command, because that's just how you work in kube-land.

[+] blyry|5 years ago|reply

Ooh! I've been running the Zalando operator in production on Azure for ~ a year now, nothing crazy but a couple thousand qps and a TB of data spread across several clusters. It's been a little rough since it was designed for AWS, but pretty fun. At this point, I'm 50/50: our team is small and I'm not sure that the extra complexity added by k8s solved any problems that Azure's managed postgres product doesn't also solve. We weren't sure we were going to stay on Azure at the time we made the decision as well -- if I was running in a hybrid cloud environment I would 100% choose postgres on k8s.

The operator let us ramp up real quickly with postgres as a POC and gave us mature clustering and point-in-time restoration, and the value is 100% there for dev/test/uat instances, but depending on our team growth it might be worth it to switch to managed for some subset of those clusters once "Logical Decoding" goes GA on the Azure side. Their hyperscale option looks pretty fun as well, hopefully some day I'll have that much data to play with.

I can also say that the Zalando crew has been crazy responsive on their github, it's an extremely well managed open source project!

[+] sspies|5 years ago|reply

I have been running my own postgres helm chart with read replication and pgpool2 for three years and never had major trouble. If you're interested check out https://github.com/sspies8684/helm-repo

[+] jeremychone|5 years ago|reply

Looks interesting but difficult to get the details from just the slides.

Also, not sure why Azure Arc still gets mentioned. I would have expected something more cloud independent.

Our approach, for now, is to use Kubernetes Postgres for dev, test, and even stage, but cloud Postgres for prod. We have one db.yaml that in production just becomes an endpoint so that all of the services do not even have to know if it is an internal or external Postgres.

Another interesting use of Kubernetes Postgres would be for some transient but bigger than memory store that needs to be queryable for a certain amount of time. It's probably a very niche use-case, but the deployment could be dramatically more straightforward since HA is not performance bound.

[+] zelly|5 years ago|reply

Why? So you pay more money to AWS? Deploying databases is a solved problem. What's the point of the overhead?

[+] kgraves|5 years ago|reply

What's the use-case for running databases in k8s, is this a widely accepted best practice?

[+] ghshephard|5 years ago|reply

I guess I look at it the opposite way - which is why wouldn't you run everything in k8s once you have the basic investment in it. It lets you spin up new environments, vertical scaling becomes trivial, and disaster recovery/business continuity is automatic along with everything else in your k8s environment.

[+] dkhenry|5 years ago|reply

I don't think it's a widely accepted best practice yet, mainly because it's hard to do well, and by itself it's hard to take advantage of the benefits of using k8s. The company I work for has been building out the tools required to run databases well in k8s (fully automated, fully managed, survivable, and scalable) and we are seeing people come around to it. Once you have all the tools in place you can have a system that scales right alongside your applications on heterogeneous hardware, isn't dependent on any single server, can be deployed and managed exactly like your applications, and can be transported everywhere. If you want to take a look, check out planetscale.com

[+] DasIch|5 years ago|reply

If you are running Kubernetes, happen to be a fairly large organization and use microservices, you probably have many databases. Hundreds of them. Most of them are going to be small, using few resources of any kind.

In that context running postgres on K8S makes a lot of sense. You already have K8S and experience running it. Running postgres there makes it possible to share resources between databases and other applications. That improves utilization which means you can reduce costs significantly.

Another advantage is that unlike managed solutions such as RDS, you can use a more recent postgres version and postgres extensions that RDS doesn't support. Extensions such as PgQ or TimescaleDB or ...

Having said all of this. In a large organization, you have the benefit of economies of scale. Large fixed costs (such as developing the expertise required to run (postgres on) K8S reliably) can be amortized. It's possible for this to be a good idea, even a best practice for large organizations while at the same time a terrible idea for smaller ones. Most of the time, using a managed service like RDS is probably a better choice. In other words: You are not Google. You are not Facebook. You are probably not even Zalando. Figure out what's right for you.

[+] etxm|5 years ago|reply

Losing your data.

JK, sort of.

My first go to is something like RDS, but I’ve run Postgres in k8s for pretty much one use case: everything else is already in k8s _and_ I need a PG extension/functionality not present in RDS.

[+] majewsky|5 years ago|reply

Conway's Law. The hardware team deals with the lower parts of the stack: hardware, OS, up to Kubernetes. The applications team(s) deal exclusively with Kubernetes payloads.

[+] m3kw9|5 years ago|reply

I remember running mongodb one socket had so many gotchas and stuff that it wasn’t worth it.

[+] nightowl_games|5 years ago|reply

I think cockroachDB is designed for this.

[+] tyingq|5 years ago|reply

They've thought about the use case. But it still ends up being a cluster inside a cluster, which sounds potentially pretty bad to me. Clusters of different types, mostly unaware of each other. Schema changes and database version upgrades would be complicated.

[+] pjmlp|5 years ago|reply

Instead of messing around with Kubernetes, I would rather advocate for something like Amazon RDS.

[+] cmcc123|5 years ago|reply

I would agree that the complexity is compounded, having gone through the work to automate various operators in kubernetes and the requisite deploy projects for the actual app/service (database) clusters, etc.

The problem is often that the actual costs of maintaining solutions like this aren't always clear and easy to budget for, and perhaps more importantly--explain to management--this includes the continued costs for engineering time to architect H/A solutions, maintain, research solutions, etc. Add to this the abstraction and compounding of complexity and the plethora of hand-waving blogs, etc.

IMHO, the real problems arise when you deploy PostgreSQL (via a Kubernetes operator) into an existing multi-AZ cloud-based kubernetes cluster--without knowing and understanding all of the requisite requirements and restrictions. At the time when I was working on deploying postgres clusters with the operator (mid 2019 AIR) there was not a lot (not much at all) in the strimzi kafka operator docs about handling multi-AZs in kube with the Kubernetes autoscaler and using cluster ASG's, etc. Note that persistentvolumes and persistentvolumeclaims in the cloud cannot span multiple AZ's--this is a critical concern, especially when you throw in Kubernetes and an ASG (autoscaling group). What this means is that if you have some app/service running in a specific AZ that has persistentvolumes and claims in that AZ, you must ensure that the app/service stays in that AZ and that all of its requisite storage resources also remain available in that AZ. The complexity required to manage this is not trivial for most teams. For example, some helm charts that I installed (after `helm templating` in our IAC code) configured node labels on the existing kube cluster's worker nodes--but note that this was not documented in the helm chart, BTW. So, when we later did a routine upgrade of the Kubernetes version and the ASGs spawned new worker nodes, that left those apps/services effectively hard-coded to use nodes that had been terminated by the ASG (as they were older versions that were replaced by the newer versions during the upgrade), and their PVs were in a specific AZ, as noted above.
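The zone constraint described above can be reduced to a small check. This is a sketch only: it assumes nodes carry the standard `topology.kubernetes.io/zone` label, and the node names and zones are made up. It is not the scheduler's actual algorithm, just the rule it has to respect for a zonal volume.

```python
from typing import Dict, List

ZONE_LABEL = "topology.kubernetes.io/zone"


def schedulable_nodes(volume_zone: str, nodes: Dict[str, Dict[str, str]]) -> List[str]:
    """Nodes on which a pod bound to a zonal volume could run.

    `nodes` maps node name to its labels; a zonal volume can only be
    attached to a node in the same availability zone.
    """
    return sorted(
        name for name, labels in nodes.items()
        if labels.get(ZONE_LABEL) == volume_zone
    )


if __name__ == "__main__":
    # Hypothetical cluster before and after an upgrade replaced the workers.
    before = {
        "node-a": {ZONE_LABEL: "eu-west-1a"},
        "node-b": {ZONE_LABEL: "eu-west-1b"},
    }
    after = {
        "node-c": {ZONE_LABEL: "eu-west-1b"},
        "node-d": {ZONE_LABEL: "eu-west-1c"},
    }
    print(schedulable_nodes("eu-west-1a", before))
    print(schedulable_nodes("eu-west-1a", after))
```

When the second call comes back empty, the pod has nowhere to go and stays unscheduled until a node appears in the volume's zone, which is the failure mode described in the upgrade story.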

To do it right, I think you'd need to define AZ-specific storage classes and then ensure that when you are deploying apps/services into kubernetes you ensure that you manage those. Again, from my past experience, when you have Kubernetes in the cloud, with the kubernetes autoscaler, and cloud-based ASG (autoscaling groups), running in an H/A (high-availability i.e. multi-AZ), and now add in stateful requirements using PV's, and now add in very resource intensive apps and services, now this starts to get a bit tricky to maintain--again--despite what the "experts" might be blogging about. Keep in mind that the companies sponsoring the experts might have teams of 10-15 DevOps Kubernetes engineers managing a cluster. This is something we definitely don't have.

I'm sure it will get better with time, but for now, we are doing all we can to maintain stateful apps/services externally--i.e.: and per your initial post, this would be PostgreSQL with RDS. IMHO, RDS does a fantastic job and allows us to abstract all of this, and we simply deploy our clusters with IAC and forget about them to some degree. For the cost point and specifically regarding resource contention, I think it's an ideal ROI to have the cloud provider worry about failover, H/A database internals, scaling with multi-AZ storage, etc.
