item 14925846

Go reliability and durability at Dropbox

194 points | astdb | 8 years ago | about.sourcegraph.com | reply

101 comments

[+] tlb|8 years ago|reply
What does "Reliability of 99.9999999999% (twelve 9s)" even mean? Obviously you have to exclude large classes of user-visible failures (network outage, account over quota) to achieve that. I don't think they're claiming less than 0.00000000001% chance of a zombie apocalypse/Mad Max/ex Machina/asteroid impact end-of-times situation. So just what failures are counted?

For comparison, public telephony systems aimed for five 9s. That was usually expressed as "20 minutes downtime over 40 years, combined hardware and software budget, for outages affecting more than 32 users." One software crash requiring human intervention would count for more than 20 minutes, so you were allowed <1 of these in 40 years system lifetime.

[+] agrajag|8 years ago|reply
That's a claim about the durability of the data: the odds that a chunk of data will *not* be lost over a year.

They calculate this from the odds that a single node fails, multiplying those odds out across all replicas. This covers the most easily quantifiable failure mode.

Obviously the real odds are somewhat higher when you consider that a rogue admin, malicious actor, or buggy code could delete multiple instances of replicated data at once. There's no way to estimate these odds though, and really they don't matter - they're big enough events that they could spell the end of Dropbox if they happened.
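The replica arithmetic described above can be sketched in a few lines of Go. The per-node failure probability and replica count here are illustrative assumptions, not Dropbox's actual figures:

```go
package main

import (
	"fmt"
	"math"
)

// durability returns the probability that at least one of n independent
// replicas of a chunk survives the year, given a per-node annual
// probability p of losing its copy.
func durability(p float64, n int) float64 {
	return 1 - math.Pow(p, float64(n))
}

func main() {
	// Hypothetical numbers: if each node loses a chunk with probability
	// 1e-4 per year, three independent replicas give 1 - (1e-4)^3,
	// i.e. twelve 9s of durability.
	fmt.Printf("%.12f\n", durability(1e-4, 3))
}
```

Note the independence assumption doing all the work here, which is exactly why correlated failures (rogue admin, buggy code) blow the estimate up.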

[+] sllabres|8 years ago|reply
The video linked by kylequest below [1] speaks about durability, not reliability: "Create a system that provides annual data DURABILITY of 99.9999999999. Create a system with availability of over 99.99%"

[1] https://youtu.be/5doOcaMXx08?t=220

[+] coldtea|8 years ago|reply
>For comparison, public telephony systems aimed for five 9s. That was usually expressed as "20 minutes downtime over 40 years, combined hardware and software budget, for outages affecting more than 32 users." One software crash requiring human intervention would count for more than 20 minutes, so you were allowed <1 of these in 40 years system lifetime.

And all of that is totally bogus (the "aim", not your information), as no public telephony system (and certainly not in my country) ever had anything close to that.

A few hours of downtime a few times a year is much more like it, although it has been getting better over time.

[+] tc|8 years ago|reply
The PSTN and similar systems do target five-9s, but fortunately that only requires keeping it to ~20 minutes downtime over 4 years. ~20 minutes over 40 years would be six-9s.

    (* 1e-5 365.2425 24 60 4) => 21.038
    (* 1e-6 365.2425 24 60 40) => 21.038
[+] pebers|8 years ago|reply
Data durability - the probability that your data is not lost.

FWIW Amazon make a similar claim of 11 9s for data durability on S3: https://aws.amazon.com/s3/faqs/

[+] oconnor663|8 years ago|reply
The next bullet point is availability, so presumably they mean reliability in the sense that you won't permanently lose data?
[+] tw04|8 years ago|reply
I feel like they mean resiliency, not reliability. I could see 12x 9's resiliency with them factoring it based on x amount of data stored for y days. There's 0 chance they could claim that level of reliability for the reasons you mentioned among others.
[+] amelius|8 years ago|reply
> What does "Reliability of 99.9999999999% (twelve 9s)" even mean?

Sort of worst case: it could mean that every hour the system reboots, and the reboot takes 10^-12 of an hour (a few nanoseconds). That doesn't seem like much, but you'd have to restart your client as well, which may take longer, is annoying, and could lose data. So basically, the system would be useless :)

[+] didibus|8 years ago|reply
> It’s easy to be productive in Go.

Hmm, I'm not sure what this means. Is it saying Go is a productive language, or just that you'll master Go quickly and reach peak Go productivity quickly?

[+] barsonme|8 years ago|reply
Both, really. Go's so small you can be quite proficient with it in months. While it's not as productive as, say, Python (wrt how quickly you can get your code up and running), it's much quicker than other languages (and nicer to use in the long run).
[+] SwellJoe|8 years ago|reply
I think it's only slightly ambiguous. It doesn't seem to be making a claim about mastering go or peak go productivity. It just says it's easy to be productive (not maximally productive, just "productive"). Which seems to be borne out by the huge amount of new code written in go just in the past few years.
[+] _ph_|8 years ago|reply
Go is a very productive language in several senses, in my eyes. It is a language you can master quickly, and it is very productive in day-to-day work. There it is important to consider overall productivity: some languages certainly let you implement something a little more quickly, but you also have to weigh the effort of maintaining and further extending a program. For long-term maintenance, Go especially shines. It is easy to come back to some software and pick up modifying it again, and when packages are properly set up, Go software tends to be maintainable and extendable.
[+] 0xCMP|8 years ago|reply
Oh, I thought Dropbox was using Rust instead of Go for a lot of things, but maybe they ended up using both. I can see why they'd want to move to either Rust or Go, since from what I understand they used to be mostly Python for everything.

Cool that they use Go a lot.

[+] kibwen|8 years ago|reply
Last I checked, server-side Rust usage at Dropbox is reserved for the very bottom of the stack, for the bits that are performance-sensitive enough that the alternative would have demanded they be written in C++. Apparently there's a significant amount of Rust in Dropbox's Windows client as well, though I don't know the story there...
[+] didibus|8 years ago|reply
Easier to go from python to go than to rust.
[+] 0xFFC|8 years ago|reply
Go and Rust are not competing for the same space. They are different languages for different purposes.
[+] mostafah|8 years ago|reply
I have an off-topic question: This is the second company (after GitLab) I see with an “about” subdomain. Is this a new trend of using “about.x.com” for the marketing website and “x.com” for the web app? Is there a blog post or discussion about this?
[+] sytse|8 years ago|reply
At GitLab we first used www for the marketing site and the apex domain for the app, but many people assumed they would have the same content. That is why we introduced about. Cool to see we might have started a trend.
[+] usrusr|8 years ago|reply
Browsers are thankfully highlighting the https-verified part of the URL (hostname) relative to the rest, so that a "paypal.com.fake.com" phishing attack is easier to spot. It was just a matter of time before UX people would put that highlighting to creative use. I like it.

On the technological side, I guess separate hostnames might make some ops things a little easier. But that alone can hardly be the reason; that plus good looks can. Also: in a large-scale outage, an about.x.com that is not running on your main cloud provider could be valuable for status updates, because far more people would know about about.x.com than about some status.x.com you might have if your "about" content was on the main hostname.

[+] justinclift|8 years ago|reply
Anyone know where in the talk it has the mention of "Debugging tools (mostly!) work well"?

I'm skipping back and forwards through it, but the talk isn't in the same order as this article which is making it very difficult without watching the whole talk from start -> end.

Asking because debugging is a pain point I've been having with Go for a few months, so am surprised to see it described as mostly working well. I'd like to get my debugging experiences to at least that level of "(mostly) working well". :D

[+] ctrlrsf|8 years ago|reply
What are you having trouble debugging? Or what do you think isn't working well for you?
[+] apta|8 years ago|reply
> The biggest pain with Go that Tammy identified was in dealing with race conditions.

> Data races are the hardest type of bug to debug, spot, fix, etc.

Exactly what Rust aims at preventing. Sad to see that the industry is not learning.

[+] jacquesm|8 years ago|reply
> Sad to see that the industry is not learning.

Sad that Rust advocates are not learning. This sort of comment is what drives people away from Rust. Stop ramming your stuff down other people's throats. Go build that exclusively Rust based Dropbox clone that outperforms Dropbox and show how well Rust performs in that situation.

Rust has trade-offs just like Go has trade-offs. Being honest about the deficiencies of ones chosen platform is a good thing, it helps to keep you sharp and to avoid problems associated with those deficiencies.

Besides having an over-zealous community that posts off-topic comments all over threads that have nothing to do with Rust, Rust has deficiencies too.

Note also that Dropbox is already using Rust in some places.

[+] zzzcpan|8 years ago|reply
Yeah, Go has the worst possible model for concurrency there is - shared memory multithreading. Hopefully more and more companies will realize how bad this model really is and start looking into languages with decent concurrency models, like Erlang and Elixir or at least stick to event loops.
[+] falcolas|8 years ago|reply
I give you Python and Ruby as worse - shared memory and an interpreter lock to limit concurrent operations.

That said, Go is the best of the C-style concurrency breed, having typed message passing and green threads built into the language.
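For anyone unfamiliar with Go, the typed message passing and green threads mentioned above are channels and goroutines; a minimal sketch (the squaring workload is just a stand-in):

```go
package main

import "fmt"

func main() {
	// A typed channel: only ints may pass through it, enforced at
	// compile time. Each goroutine is a cheap "green thread".
	results := make(chan int)

	for i := 1; i <= 3; i++ {
		go func(n int) {
			results <- n * n // each goroutine sends one square back
		}(i)
	}

	sum := 0
	for i := 0; i < 3; i++ {
		sum += <-results // receive three results, in whatever order they arrive
	}
	fmt.Println(sum) // 1 + 4 + 9 = 14
}
```

The goroutines share no mutable state here; all communication goes through the channel, which is the style that avoids the data races discussed elsewhere in this thread.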

[+] innocentoldguy|8 years ago|reply
That is exactly why I use Erlang and Elixir, and not Go. Also, Go processes are 10 times the size of those in Elixir.
[+] jxi|8 years ago|reply

[deleted]

[+] dan-compton|8 years ago|reply
What a terrible article.
[+] lathiat|8 years ago|reply
I think that is, in part, because it's a summary of a talk, so it's written a bit awkwardly and probably misses some of the more interesting in-depth details from the actual talk. Not having watched the talk, though, it does seem on the surface like a good text summary.

On the flip side, I was frustrated, particularly at linux.conf.au last year, at how "not deep" many of the talks were. Having given quite a few talks over many years at similar conferences, it's actually quite hard to nail something technical and be entertaining at the same time. Someone who nails that quite consistently is Aaron Patterson (from the Ruby/Rails world); watch some of his talks on YouTube. I aspire to produce more content on a level similar to his: combining a good, entertaining presentation with actually educating the audience about the technical, non-obvious details of something they probably didn't know, and something that was relevant in a practical project he took on. Working on it, not there yet...

[+] TRManderson|8 years ago|reply
>This post was best-effort live-blogged at the conference

Cut them some slack.