
yeahbutbut | 9 years ago

How did you handle shared writable storage and databases?


ex3ndr|9 years ago

We (actor.im) also moved from Google Cloud to our own servers + k8s. Shared persistent storage is a huge pain. We eventually stopped trying to do this; we'll try again once PetSets are in beta and can update their images.

We tried:

* GlusterFS - the cluster can be set up in seconds, really: just launch daemon sets and create the cluster manually (though you can automate this). But then we hit the fact that CoreOS can't mount GlusterFS shares at all. We tried mounting over NFS instead, and hit the next problem.

* NFS from k8s doesn't work at all, mostly because the kubelet (the k8s agent) needs to run directly on the machine rather than via rkt/docker. Instead of updating all our nodes, we mounted the NFS share directly on the nodes.

* PostgreSQL - we haven't tried clustering it yet; if a pod gets killed now and then, resyncing the database can become a huge issue. We ended up running pods dedicated to specific nodes and doing manual master-slave configuration. We haven't tried other solutions yet, but they also look questionable in a k8s cluster.

* RabbitMQ - the biggest nightmare of them all. It needs proper DNS names for each node, and here we have huge problems on the k8s side: there are no static host names at all. The documentation says there are, but there aren't - you can open the kube-dns code, and the feature simply isn't there. Pods only get IP-like domain names such as "10-0-0-10". We ended up not clustering RabbitMQ at all; this dataset is not very important for us and can be easily lost.

* Consul - while working around the RabbitMQ problems and fighting DNS, we found that Consul's DNS API works much better than the built-in kube-dns. So we installed it, and then our cluster just went down when we killed some Consul pods, because their host names and IPs changed. And there is no straightforward way to pin the IPs or hostnames (hostnames don't work at all, only the IP-like ones, which change on pod deletion).
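
The node-pinning workaround described for PostgreSQL above could be sketched roughly like this; all names, labels and paths here are my assumptions, not from the comment:

```yaml
# Pin a PostgreSQL pod to one labeled node and keep its data in a
# hostPath volume, so a replacement pod on that node finds the data again.
apiVersion: v1
kind: Pod
metadata:
  name: postgres-master
spec:
  nodeSelector:
    db-role: master              # label applied to the chosen node beforehand
  containers:
  - name: postgres
    image: postgres:9.5
    volumeMounts:
    - name: pgdata
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: pgdata
    hostPath:
      path: /var/data/postgres   # lives on that node's local disk
```

Resubmitting the same spec (or wrapping it in a replication controller with one replica and the same nodeSelector) puts the replacement pod back on the labeled node, next to its old data directory.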

So the best approach is some fast(!) external storage mounted over the network into your pods. This is much, much slower than direct access to a node's SSD, but it gives you flexibility.

lobster_johnson|9 years ago

As long as you associate a separate service with each RabbitMQ pod, you can make it work without PetSets. (Setting the hostname inside the pod is trivial; just make sure it matches the service name.) Then you can create a "headless" service for clients to connect to, which matches all the pods.
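
A minimal sketch of that setup (service names, labels and ports are hypothetical): one named service per broker pod for a stable DNS name, plus a headless service (clusterIP: None), which resolves to the pod IPs directly instead of a virtual IP:

```yaml
# Headless service: DNS returns all matching broker pod IPs
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq
spec:
  clusterIP: None          # "headless": no load-balanced virtual IP
  selector:
    app: rabbitmq          # matches every RabbitMQ pod
  ports:
  - name: amqp
    port: 5672
---
# Per-pod service: gives this one broker a stable DNS name to use
# as its node name (the pod carries the label pod: rabbitmq-0)
apiVersion: v1
kind: Service
metadata:
  name: rabbitmq-0
spec:
  selector:
    pod: rabbitmq-0
  ports:
  - name: amqp
    port: 5672
```

Clients point at `rabbitmq`; each broker's configured hostname matches its own per-pod service name.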

If you set it up in HA mode, then in theory you don't need persistent volumes, although RabbitMQ is of course flaky for other reasons unrelated to Kubernetes -- I wouldn't run it if I didn't have existing apps that rely on it.

kozikow|9 years ago

What do you recommend instead of RabbitMQ on Kubernetes? I use RabbitMQ as a Celery backend. I should probably switch to Redis...

jondubois|9 years ago

Yeah, this is tricky. I'm not a huge fan of network-attached or distributed storage like EBS, NFS and the rest - that doesn't make much sense for most DBs.

I prefer to have a DB which is "cluster-aware". In that case you can tag your hosts/Nodes and use Node affinity to scale your DB service so that it matches the number of tagged Nodes. Then you can just use a hostDir directory as your volume, so the data is stored directly on your tagged hosts/Nodes. This ensures that dead DB Pods are respawned on the same hosts and can pick up the hostDir (with all the data for the shard/replica) from the previous Pod that died.
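
The tagging step might look like this (node names and the label key/value are made up):

```shell
# Label the nodes that are allowed to hold DB data; a controller whose
# pod template carries a matching nodeSelector (plus a hostDir/hostPath
# volume) will then only ever schedule DB pods onto these nodes.
kubectl label nodes node-1 role=db
kubectl label nodes node-2 role=db

# Check which nodes the DB service can now land on
kubectl get nodes -l role=db
```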

If your DB engine is cluster-aware, it should be able to reshard itself when you scale up or down.

I don't think it's really possible for a DB not to be cluster-aware anyway, since each DB has a different strategy for scaling up and down.

spudfkc|9 years ago

I've found this to be a pain when setting up a container-based environment. The easiest approach is just to avoid it as much as possible - hopefully your cloud provider has managed services (e.g. AWS RDS) that will handle most of it for you.

Otherwise you need to separate your available container hosts into clusters - an Elasticsearch cluster, a Cassandra cluster, etc. - and treat those differently from the machines you deploy your other apps to. Which, to be fair, they are: they're different and need to be treated differently.

tzaman|9 years ago

GlusterFS was pretty much the only damn thing I could get working in a reasonable amount of time (tried NFS, Ceph, Gluster, Flocker).

Basically the solution (until we get PetSets at least) is to:

1. manually spin up two pods (gluster-centos image) without replication controllers, because if they go down we need them to stay down so we can manually fix the issue and bring them back up.

2. each pod should have a custom gcePersistentDisk mounted under /mnt/brick1

3. from within each pod, probe the other (gluster peer probe x.x.x.x)

4. each pod should preferably be deployed to its own node via nodeSelector

5. once the pods have been peered in Gluster, create and start a volume

6. make a service that selects the two pods (via some label you need to put on both)
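
Steps 3 and 5, run from inside one of the Gluster pods, would look roughly like this; the peer IPs and the volume name are placeholders:

```shell
# step 3: pair the two pods into a trusted pool
gluster peer probe 10.0.0.11
gluster peer status            # should report one connected peer

# step 5: create a 2-way replicated volume over the two bricks, then start it
gluster volume create gv0 replica 2 \
    10.0.0.10:/mnt/brick1/gv0 10.0.0.11:/mnt/brick1/gv0
gluster volume start gv0
gluster volume info gv0        # should show gv0 as Started, Replicate
```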