
How We Knew It Was Time to Leave the Cloud

379 points | sytse | 9 years ago | about.gitlab.com

262 comments

[+] boulos|9 years ago|reply
I appreciate that you didn't sling mud at Azure, but re-reading the commit for the move to Azure [1], there were telltale signs even then that it might be bumpy for the storage layer.

For what it's worth, hardware doesn't provide an IOPS/latency SLA either ;). In all seriousness, we (Google, all providers) struggle with deciding what we can strictly promise. Offering you a "guaranteed you can hit 50k IOPS" SLA isn't much comfort if we know that all it takes is a single ToR failure for that to stop being true (providers could still offer it, have you ask for a refund if affected, etc., but your experience isn't changed).

All that said, I would encourage you to reconsider. I know you're frustrated, but rolling your own infrastructure just means you have to build systems even better than the providers. On the plus side, when it's your fault, it's your fault (or the hardware vendor, or the colo facility). You've been through a lot already, but I'd suggest you'd be better off returning to AWS or coming to us (Google) [Note: Our PD offering historically allowed up to 10 TiB per disk and is now a full 64 TiB, I'm sorry if the docs were confusing].

Again, I'm not saying this to have you come to us or another cloud provider, but because I honestly believe this would be a huge time sink for GitLab. Instead of focusing on your great product, you'd have to play "Let's order more storage" (honestly managing Ceph has a similar annoyance). I'm sorry you had a bad experience with your provider, but it's not all the same. Feel free to reach out to me or others, if you want to chat further.

Disclosure: I work on Google Cloud.

[1] https://gitlab.com/gitlab-com/www-gitlab-com/merge_requests/...

[+] jtwaleson|9 years ago|reply
So Gitlab is in a bit of a strange position here. Sticking to a traditional filesystem interface (distributed under the hood) seems stupid at first. Surely there are better technical solutions.

However, considering they make money from private installs of GitLab, it makes sense to keep gitlab.com as an eat-your-own-dog-food-at-scale environment. It's necessary for them to keep experience with large installs of GitLab. If one of their customers running on-prem has performance issues, they can't just say: gitlab.com uses a totally different architecture, so you're on your own. They need gitlab.com to be as close as possible to the standard product.

Pivotal does the same thing with Pivotal Web Services, their public Cloud Foundry solution. All of their money is made in Pivotal Cloud Foundry (private installs).

From a business perspective, private installs are a form of distributed computing. Pretty clever, and a good way of minimizing risk.

[+] brongondwana|9 years ago|reply
Having never really gone to the cloud, thankfully, FastMail can strongly recommend New York Internet if you're looking for a datacentre provider. They've been amazing.

https://blog.fastmail.com/2014/12/10/security-availability/

Actually, we don't appear to blog quite enough about how awesome NYI are. They're a major part of our good uptime.

And some stuff about our hardware. I'd strongly recommend hot stuff on RAID1 SSDs and colder stuff on hard disks. The performance difference between rust and SSD is just massive.

We're looking at PCI SSDs for our next hardware upgrades:

http://www.intel.com.au/content/www/au/en/solid-state-drives...

(there are two types of SSD on the market: ones that lose your data, and SSDs from Intel) - we're currently running mostly DC S3700 SATA/SAS SSDs.

We customise the layout of our machines very much to match the heat patterns of our data:

https://blog.fastmail.com/2015/12/06/getting-the-most-out-of...

https://blog.fastmail.com/2014/12/15/dec-15-putting-the-fast...

https://blog.fastmail.com/2014/12/04/standalone-mail-servers...

[+] blorgle|9 years ago|reply
Coming from OpenStack land where Ceph is used heavily, it's well known that you shouldn't run production Ceph on VMs and that CephFS (which is like an NFS endpoint for Ceph itself) has never been as robust as the underlying Rados Block Device stuff.

They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.

I agree with the other poster who asked why they even need a gigantic distributed FS, and how that seems like a design miss.

[+] x0x0|9 years ago|reply
Also -- if you look at the infra update from the linked article, they mention something about 3M updates/hour to a pg table ([1], slide 9) triggering continuous vacuums. This feels like using a db table as a queue, which is not going to be fun at moderate to high loads.

[1] https://about.gitlab.com/2016/09/26/infrastructure-update/
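To make the vacuum-churn point concrete, here's a back-of-the-envelope sketch. The 3M updates/hour figure is from the slides; the table size and the use of Postgres's default autovacuum settings are illustrative assumptions:

```python
# Rough estimate of how often autovacuum fires on a hot "queue" table.
# In Postgres's MVCC model, every UPDATE leaves one dead tuple behind;
# autovacuum triggers when dead tuples exceed
# autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples.
AUTOVACUUM_THRESHOLD = 50       # Postgres default
AUTOVACUUM_SCALE_FACTOR = 0.2   # Postgres default

def minutes_between_autovacuums(updates_per_hour: float, live_rows: int) -> float:
    """Minutes until dead tuples cross the autovacuum trigger point."""
    trigger = AUTOVACUUM_THRESHOLD + AUTOVACUUM_SCALE_FACTOR * live_rows
    return trigger / updates_per_hour * 60

# 3M updates/hour against a small queue-like table of 10k live rows
# (the row count is a made-up illustration, not from the slides):
print(round(minutes_between_autovacuums(3_000_000, 10_000), 2))  # → 0.04
```

At that rate the table becomes autovacuum-eligible every couple of seconds, i.e. it is effectively being vacuumed continuously.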

[+] YorickPeterse|9 years ago|reply
> They probably could have saved themselves a lot of pain by talking to some Ceph experts still working inside RedHat for architectural and other design decisions.

We have been in contact with RedHat and various other Ceph experts ever since we started using it.

> I agree with other poster who asked why do they even need a gigantic distributed fs and how that seems like a design miss.

Users can self host GitLab. Using some complex custom block storage system would complicate this too much, especially since the vast majority of users won't need it.

[+] sytse|9 years ago|reply
You're right. We talked to experts and they warned us about running Ceph on VMs and we tried it anyway, shame on us.

You do need either a distributed FS (GitHub made their own with DGit http://githubengineering.com/introducing-dgit/; we want to try to reuse an existing technology) or buy a big storage appliance.

[+] camkego|9 years ago|reply
Bingo! Seasoned developers and architects with 15-20+ years of experience would very likely question using software stacks like CephFS that carry warnings about production use on their own website! You really want no exotic 3rd-party stuff in your design, just plain-Jane components like ext3 and Ethernet switches. Choosing a newer exotic distributed filesystem may really come back to bite you in the future.
[+] wandernotlost|9 years ago|reply
It sounds like they didn't design for the cloud and are now experiencing the consequences. The cloud has different tradeoffs and performance characteristics from a datacenter. If you plan for that, it's great. Your software will be antifragile as a result. If you assume the characteristics of a datacenter, you're likely to run into problems.
[+] Rapzid|9 years ago|reply
This got me curious again about the pluggable storage backends in Git (I assume AWS CodeCommit is using something like this). I've looked at Azure's blob storage API in the past and found it incredibly flexible.

Here is an article from a few years ago: http://blog.deveo.com/your-git-repository-in-a-database-plug...

In any case, GitLab is amazing and I can see how it's tempting to believe that GitLab the omnibus package is the core product. However, HOSTED GitLab's core product is GitLab as a SERVICE. That might require designs tailored a bit more for the cloud than simply operating a yoooge fs and calling it a day.

[+] jghn|9 years ago|reply
Can you go more into the difference in the tradeoffs and how one should design differently?
[+] edejong|9 years ago|reply
I'm trying to understand 'antifragile'. Are you trying to say: 'robust'? If not, what is the difference?
[+] user5994461|9 years ago|reply
They designed for a small operation with limited resources. That might be fair given their budget/funding, which we don't know.

At the moment, the discussion in the GitLab issues looks like people who are buying servers to put and run in their garage ^^

[+] solipsism|9 years ago|reply
Planning for the cloud doesn't make your software antifragile. antifragile != robust
[+] zzzcpan|9 years ago|reply
> when you get into the consistency, accessibility, and partition tolerance (CAP) of CephFS, it will just give away availability in exchange for consistency.

Not their fault, POSIX API cannot fit into eventual consistency model to guarantee availability (i.e. latency). Moving to your own hardware doesn't actually solve the problem, just gives some room to scale vertically for some time. After that the only way to keep global consistency but minimize impact of unavailability is to shard everything, at least this way unavailability will be contained in its own shard without any impact on every other shard.

It's better to avoid POSIX API in the first place.
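A minimal sketch of the sharding idea the parent describes (the `shard_for` helper, the shard count, and the repo names are all hypothetical): with a stable repo-to-shard mapping, losing one shard leaves every other shard's repositories untouched.

```python
import hashlib

def shard_for(repo_path: str, num_shards: int) -> int:
    """Stable repo -> shard mapping; an outage on one shard
    leaves repositories on every other shard fully available."""
    digest = hashlib.sha256(repo_path.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

repos = [f"group/project-{i}" for i in range(10_000)]
down_shard = 3
affected = sum(1 for r in repos if shard_for(r, 16) == down_shard)
print(f"{affected / len(repos):.1%} of repos affected")  # roughly 1/16
```

With a uniform hash, a single-shard outage touches roughly 1/16 of repositories instead of blocking the whole fleet, which is exactly the containment the parent is arguing for.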

[+] chetanahuja|9 years ago|reply
This is the correct answer to run github, the SAAS product at scale. Probably not practical for github the on-prem installable software. They may have to bite the bullet and separate out the two systems at some point anyway though.

edit: it's gitlab not github. The point still stands though.

[+] diziet|9 years ago|reply
I priced out the hardware that they specced out to be around $1.4 million:

https://gitlab.com/gitlab-com/infrastructure/issues/727#note...

The R730xd are probably around $50k after Dell discounts - depends a bit on what they ended up configuring with regards to support, exact network configuration, etc.

The R830s are about $50k as configured - 1.5TB RAM is expensive as the R830 only has 48 DIMMs and they need the relatively expensive 32GB RDIMMs

The R630s should be about 15k each:

The switches say they are 48x 40G QSFP+ which are very expensive (I'd put them at 30k each from Dell)

50k * 20 + 50k * 4 + 15k * 10 + 2 * 30k

1m + 200k + 150k + 60k ~= $1.4m invested

updated with better R830s pricing
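As a quick sanity check of the totals (every unit price here is the commenter's estimate, not an actual quote):

```python
# Re-running the estimate above; all unit prices are guesses from this thread.
r730xd   = 20 * 50_000  # storage nodes
r830     = 4  * 50_000  # big-memory nodes (48 DIMMs of 32GB RDIMMs)
r630     = 10 * 15_000  # compute nodes
switches = 2  * 30_000  # 48x 40G QSFP+ switches
total = r730xd + r830 + r630 + switches
print(f"${total:,}")  # → $1,410,000
```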

From my perspective, consumer Samsung 850 EVO drives, under-provisioned to match the 1.8TB (gaining better performance characteristics in the process), would give GitLab cheaper storage with more reliable IOPS/latency compared to 10k RPM 1.8TB drives.

[+] connorshea|9 years ago|reply
Unfortunately I don't believe we'll be able to comment on the accuracy of this since the quotes from companies are private information, but I do love math like this :D

(Community Advocate at GitLab)

[+] Rapzid|9 years ago|reply
I would take a look at Supermicro servers and compare pricing. I worked as a cloud engineer at a hosting provider built entirely on Supermicro kit and didn't see anything there I didn't see from the Dells and HPs when I worked at Rackspace. And Rackspace sure didn't build their cloud on top of HPs or Dells.
[+] sijoe|9 years ago|reply
Without knowing more on the specific needs (I followed some of the threads to try to grok it), it would be hard to guess what they really need.

[commercial alert]

My company (Scalable Informatics) literally builds very high performance Ceph (and other) appliances, specifically for people with huge data flow/load/performance needs.

Relevant links via shortener:

Main site: http://scalableinformatics.com (everything below is at that site under the FastPath->Unison tab) http://bit.ly/1vp3hGd

Ceph appliance: http://bit.ly/1qiOYpy

Especially relevant given the numbers I saw on the benchmarking ...

Ceph appliance benchmark whitepaper: http://bit.ly/2fMahfJ

Our EC test was about 2x better than the Dell unit (and the Supermicro unit), and our Librados tests were even more significantly ahead.

Petabyte scale appliances: http://bit.ly/2fuTTAH

We've even got some very nice SSD and NVM units, the latter starting around $1USD/GB.

[end commercial alert]

I noticed the 10k RPM drives ... really, drop them and go with SSDs if possible. You won't regret it.

Someone suggested underprovisioned 850 EVO. Our strong recommendation is against this, based upon our experience with Ceph, distributed storage, and consumer SSDs. You will be sorry if you go that route, as you will lose journals/MDS or whatever you put on there.

Additionally, I saw a thread about using RAIDs underneath. Just ... don't. Ceph doesn't like this ... or better, won't be able to make as effective use of it. Use the raw devices.

Depending upon the IOP needs (multi-tenant/massive client systems usually devolve into a DDoS against your storage platforms anyway), we'd probably recommend a number of specific SSD variations at various levels.

The systems we build are generally for people doing large scale genomics and financial processing (think thousands of cores hitting storage over 40-100Gb networks, where latency matters, and sustained performance needs to always be high). We do this with disk, flash, and NVMe.

I am at landman _at_ the company name above with no spaces, and a dot com at the end.

[+] redstripe|9 years ago|reply
What kind of discounts does Dell typically give larger customers? We get 0 (from HP) at my workplace and I've always wondered.

I'm sure it depends on how much you buy.

Does it vary by components? They seem to charge a lot for drives so I'm guessing those can be heavily discounted.

[+] kmf|9 years ago|reply
As always, the neat thing about GitLab is how open they are with their process. I enjoyed this read, and followed the trail down to a corresponding ticket, where the staff is discussing their actual plans for moving to bare metal. Very cool.

https://gitlab.com/gitlab-com/infrastructure/issues/727

[+] 20after4|9 years ago|reply
If you like open processes like that then you might like following the work of Wikimedia's Technical Operations team. [1]

You won't find any organization that's much more open than Wikimedia.

Disclosure: I work for Wikimedia (on the release engineering team)

[1] https://phabricator.wikimedia.org/tag/operations/

[+] strictfp|9 years ago|reply
Why do they need one gigantic distributed fs? Seems like a design miss to me.
[+] btgeekboy|9 years ago|reply
Indeed. If there's one thing I've learned in >10 years of building large, multi-tenant systems, it's that you need the ability to partition as you grow. Partitioning eases growth, reduces blast radius, and limits complexity.
[+] jondubois|9 years ago|reply
Agreed, I think the wrong conclusion was drawn here.

But I get where they're coming from; container orchestrators like Kubernetes are heavily promoting distributed file systems as being the 'cloud-native approach'. But maybe this issue is more relevant to 'CephFS' specifically than to all distributed file systems in general.

[+] radicalbyte|9 years ago|reply
It might be, but you have to look at what's important for a product like GitLab. It's in a market where those who'll pay want to run their own special version of the system. So it's naturally partitioned.

Architecting, or even spending mental cycles on day 1 on distribution isn't going to win you as much as focusing on making an awesome product.

This move will probably buy them another year or two, which will hopefully give them enough time to implement some form of partitioning.

[+] user5994461|9 years ago|reply
No information in this article.

How much data? We don't even know if it's GB/TB/PB/EB. How many files/objects? How many read IOPS are needed? How many write IOPS? What's the current setup on AWS? What's the current cost? What are they hosting? Can it scale horizontally? How do they shard jobs/users? What's running on Postgres? What's running on Ceph? What's running on NFS? How much disk bandwidth is used? How much network bandwidth is used?

How are we supposed to review their architecture if they don't explain anything...

I bet that there is a valid narrative where PostgreSQL and NFS was their doom, but I'd need data to explain that ^^

[+] connorshea|9 years ago|reply
Unfortunately I don't have deep knowledge of our infrastructure, so I can't answer all of these questions myself.

That said, a decent chunk of this info can be found in the discussions on our Infrastructure issue tracker[1].

The last infrastructure update[2] includes some slide decks that contain more data (albeit it's now a little under 2 months old).

Looking at our internal Grafana instance, it looks like we're using about 1.25 TiB combined on NFS and just under 16 TiB on Ceph. We're working on migrating the data currently hosted on Ceph back to NFS soon[3].

I'll get someone from the infrastructure team to respond with more info.

[1]: https://gitlab.com/gitlab-com/infrastructure/issues

[2]: https://about.gitlab.com/2016/09/26/infrastructure-update/

[3]: https://gitlab.com/gitlab-com/infrastructure/issues/711

[+] codinghorror|9 years ago|reply
I don't know about this. We had disastrous experiences with Ceph and Gluster on bare metal. I think this says more about the immaturity (and difficulty) of distributed file systems than the cloud per se.
[+] pinewurst|9 years ago|reply
I think it says more about the state of open source distributed file systems than the cloud per se. Ceph and Gluster are not the best examples of these, though Lustre is awful too. I'm paid to dig into these at depth and each is like some combination of religion and dumpster fire. Understand that Red Hat (and Intel in the Lustre case) wants support, training and professional services revenue. Outside of their paid domains, it's truly commerce with demons but inside it's not much better.

BeeGFS is the only nominally open source one that I'd think about trusting my data to. And no, I don't work for them, nor am I compensated in any way for recommendations.

[+] BinaryIdiot|9 years ago|reply
Distributed file systems are tough, especially if you're putting it together yourself. I'd go for an already-built solution every time unless I absolutely could not, for whatever reason.
[+] sytse|9 years ago|reply
Thanks for posting. What went wrong with Ceph and when was this? We have the idea it improved a lot in the last year or so. But we'd love to learn from your experience.
[+] connorshea|9 years ago|reply
Was your experience RE: Ceph/Gluster from Stack Overflow? I'd definitely be interested in hearing more about the specifics of that.
[+] londons_explore|9 years ago|reply
The advice in this post is, IMO, misguided.

In large systems design, you should always design for a large variation in individual systems' performance. You should be able to meet customers' expectations if any machine drops to 1% performance at any time. Here they are blaming the cloud for the variation, but at big enough sizes they'll see the same on real hardware.

Real hardware gets thermally throttled when heatsinks go bad, has IO failures that cause dramatic performance drops, CPUs failing leaving only one core out of 32 operational, or ECC memory controllers that have to correct and reread every byte of memory.

In a large enough system, at any time there will always be a system with a fault like this. Sure you only see it occasionally in your 200 node cluster, but in a 20k machine cluster it'll happen every day.

You'll write code to detect the common cases and exclude the machines, but you'll never find all the cases.

The conclusion is that instead you shouldn't try. Your application should handle performance variation, and to make sure it does, you would be advised to deliberately give it variable performance on all the nodes. Run at low CPU or io priority on a shared machine for example.

In the example of a distributed filesystem, all data is stored in many places for redundancy. Overcome the variable performance by selecting the node to read from based on reported load. In a system with a "master" (which is a bad design pattern anyway IMO), instead have 5 masters and use a 3/5 vote. Now your system performance depends on the median performance of those 5.
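A toy model of the 3-of-5 vote (the helper name and latency numbers are invented): the quorum completes when the third-fastest replica answers, so one pathological straggler barely moves the result.

```python
import random

def quorum_latency(replica_latencies, quorum=3):
    """A 3-of-5 vote completes when the 3rd-fastest replica answers,
    so overall latency tracks the median node, not the slowest one."""
    return sorted(replica_latencies)[quorum - 1]

random.seed(42)
trials = []
for _ in range(10_000):
    # 4 healthy replicas (mean 10ms) plus one degraded straggler (mean 1s)
    lats = [random.expovariate(1 / 10) for _ in range(4)] + \
           [random.expovariate(1 / 1000)]
    trials.append(quorum_latency(lats))

trials.sort()
print(f"p99 quorum latency: {trials[9899]:.0f} ms")
```

Swap the same straggler into a single-master design and every request eats the full delay; with the vote, it is simply outvoted by the healthy majority.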

[+] atsaloli|9 years ago|reply
Not to second guess the infrastructure architects at GitLab, but just to bring up a data point they might possibly not be aware of: Joyent virtual hosts have significantly better I/O performance profiles -- see https://www.joyent.com/blog/docker-bake-off-aws-vs-joyent for detail. Don't necessarily write off the cloud -- there are different clouds out there, and if I were running something I/O intensive, I'd want to try it on Joyent. That said, nothing beats complete control, if you have the resources to handle that level of responsibility. =)
[+] ChuckMcM|9 years ago|reply
There is a threshold of performance on the cloud and if you need more, you will have to pay a lot more, be punished with latencies, or leave the cloud.

I've seen this a lot, and for a given workload I can tell when leaving the cloud will be the right choice. But the unspoken part is "can we change our application given the limitations we see in the cloud?" Probably pretty difficult in a DVCS but not impossible.

Sadly, storage isn't a first class service in most clouds (it should be) and so you end up with machines doing storage inefficiently and that costs time, power, and complexity.

[+] eliben|9 years ago|reply
To me it makes total sense for something like Gitlab to do their own HW, since it's really their core business. Sure, there's no point for say, target.com to use their own servers - computers is not really what they do, and cloud helps them keep expensive programmers and sysadmins to a minimum. But Gitlab is a whole different story.
[+] random3|9 years ago|reply
if latency spikes affect the overall performance, it seems more that CephFS may have a design problem (global FS journal) rather than this being a cloud problem.

However perhaps they shouldn't try to run Ceph in the first place. Azure has a rather powerful blob storage (e.g. block, pages and append-only blobs) that allows high performance applications. You could use that directly and it will likely be cheaper and work better than Ceph on bare metal.

Like other commenters suggest, in order to take advantage of cloud infrastructure you need to design with those constraints in mind, rather than trying to shoehorn the familiar technologies.

Bare metal can be better and cheaper, etc. but it requires even more skills and experience and a relatively large scale.

[+] raspasov|9 years ago|reply
How much total storage do you need?

Something to consider: A few years ago I used Rackspace's OnMetal servers https://www.rackspace.com/en-us/cloud/servers/onmetal for a dedicated MySQL 128GB RAM server that would handle 100s of thousands of very active hardcore mobile game users. We were doing thousands of HTTP requests per second and 10s of thousands of queries (a lot of them writes) per second, all on one server. The DB server would not skip a beat; CPU/IO was always <20% and all of our queries would run in the 1-5ms range.

I'm not affiliated with Rackspace in any capacity, but my experience with them in the past has been top-notch, esp. when it comes to "dedicated-like" cloud hardware, which is what OnMetal is - you are 100% on one machine, no neighbors. Their prices can be high but the reliability is top-notch, and the description of the hardware is very accurate, much more detailed than AWS for example, and without "fluffy" cloud terms :).

For example: Boot device: 2x 240 GB hot-swappable SSDs configured in a RAID 1 mirror

Storage: 2x 1.6 TB PCIe storage devices (Seagate Nytro XP6302)

[+] jayofdoom|9 years ago|reply
Thanks :). Glad to hear you liked the product.
[+] sytse|9 years ago|reply
We need about 70TB now and are planning for 256TB.
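For context, a rough raw-capacity figure for that 256TB target, assuming Ceph's default 3-way replication and some rebalancing headroom (both are assumptions; GitLab's actual settings aren't stated here):

```python
# Raw capacity needed if the usable target is served by 3-way replication.
# 3x is Ceph's default replica count; the 10% headroom for rebalancing and
# near-full ratios is a guess, not a GitLab figure.
usable_tb = 256
replicas = 3
headroom = 1.10
raw_tb = usable_tb * replicas * headroom
print(f"{raw_tb:.0f} TB raw")  # → 845 TB raw
```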
[+] annerajb|9 years ago|reply
BTW HPE provides some storage solutions where you can scale up to 8 petabytes (in a single rack AFAIK).

I have a really small setup, but I would personally look into a DL580 as a VM host, with two for redundancy, and a dual-path storage system - in my case I used a 2U MSA2400 (not sure if that is the latest name)

Since it could continue to scale up and provided dual path too.

I don't have experience running Ceph, so I'm not sure what the hardware requirements for Ceph are.

(Disclaimer: I work at Hewlett Packard Enterprise with servers)

[+] gerbilly|9 years ago|reply
If only there was a way to run git in a decentralized manner. :-)

Then users could host their own repositories themselves and manage their storage.

This kind of setup would scale a lot better.

[+] Steeeve|9 years ago|reply
> The problem with CephFS is that in order to work, it needs to have a really performant underlaying infrastructure because it needs to read and write a lot of things really fast. If one of the hosts delays writing to the journal, then the rest of the fleet is waiting for that operation alone, and the whole file system is blocked.
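The quoted behaviour boils down to a fleet-wide max(): a toy model (all numbers invented) of synchronous journal commits gating the whole filesystem on the slowest host:

```python
# Toy model of the quoted failure mode: with a synchronous shared journal,
# each batch commits at the pace of the slowest journal write in the fleet.
def batch_commit_ms(journal_write_ms):
    """The whole filesystem waits for the laggard."""
    return max(journal_write_ms)

healthy_fleet = [2, 3, 2, 4, 3]    # ms per journal write
one_slow_host = [2, 3, 2, 4, 250]  # one host stalls on I/O

print(batch_commit_ms(healthy_fleet))  # → 4
print(batch_commit_ms(one_slow_host))  # → 250
```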

This could easily have been titled "Why we couldn't use CephFS"