I doubt this HN post could have a more boring title. The actual article is very interesting. Basically Facebook's DB developers talk about what it takes to perform 13 million queries per second, and a lot of the useful tips they've learned along the way to make scaling simpler.
How is the title boring? What could be more telling? Having "Facebook" in the title makes it clear we're going to be talking about astronomical amounts of data. "[insert any db] at Facebook" would be an interesting title in my opinion.
The fact that it's MySQL makes it even more interesting, given the shift of scalability interest to MongoDB, etc.
"They figure out why the response time for the worst query is bad and then fix it."
This can reap huge benefits and doesn't need to be difficult. Just enable the slow query log in MySQL, use the EXPLAIN command to analyze the results, then add indexes where appropriate. I was able to fix poorly indexed tables in a vendor's application with dramatic results. In one case, a twenty-minute(!) query was reduced to less than a second.
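The workflow above can be sketched in a few statements. This is a minimal, hypothetical session: the `orders` table and `customer_id` column are made up for illustration, and in practice you'd set the logging variables in `my.cnf` so they persist across restarts.

```sql
-- Log any query slower than 1 second to the slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;

-- Analyze a query found in the slow log;
-- "type: ALL" in the output means a full table scan
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;

-- Add an index on the column used in the WHERE clause
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);

-- Re-run EXPLAIN; the plan should now use idx_customer_id
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
```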
While your premise is correct, it's not always that simple. More indexes can slow down updates and inserts. A table's overall usage pattern needs to be examined before just adding more indexes. You might have fixed the 20-minute report query while slowing down some other, more critical query that loaded or updated data.
"It is OK if a query is slow as long as it is always slow"
I find this enlightening.
I think it is just one instance of a difference in emphasis between Google and Facebook. Google, a technology-oriented company, would minimize the average response time, while Facebook, a people-oriented company, would minimize the unpredictability (by minimizing the variance).
For what it's worth, Jeff Dean of Google has famously emphasized 95th and 99th percentile performance in preference to average or median performance over the years. The realization that edge cases more powerfully determine user perception of performance than average cases is a deep one, and is not original to Facebook or Google; Dean and colleagues were making some of the same points when they were at DEC's CRC pre-Google.
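The mean-versus-tail distinction is easy to demonstrate numerically. A small sketch with synthetic latency samples (not real measurements): two workloads with identical average latency can have wildly different 99th percentiles, which is what the user at the tail actually experiences.

```python
import statistics

# Two synthetic latency samples (in ms) with the same mean
# but very different tails.
steady = [100] * 100            # always ~100 ms
spiky = [50] * 95 + [1050] * 5  # usually fast, occasionally terrible

def p99(samples):
    """99th percentile via the nearest-rank method."""
    ordered = sorted(samples)
    rank = max(1, round(0.99 * len(ordered)))
    return ordered[rank - 1]

for name, sample in [("steady", steady), ("spiky", spiky)]:
    print(name, "mean:", statistics.mean(sample), "p99:", p99(sample))
```

Both samples average 100 ms, but the spiky workload's p99 is more than ten times worse, which is why tracking percentiles catches problems that averages hide.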
I think the more compelling reason for "queries must be fast or slow, never 'both, depending'" is that it prevents engineers from accidentally using code originally built for reporting purposes (which might well have no latency requirements, or very relaxed ones) and putting it in a widget which gets slapped on people's home pages. You'd notice on your local, test, or staging instance "Hmm, my home page is taking 10 seconds to load -- that's bad" prior to pushing it live and, whoopsie, there's now 50 million people hitting a very scarily expensive query all at once.
I'm sorry, but your interpretation here is completely wrong.
Facebook (and Google!) cares about performance variance because it has more of an impact on overall site performance than average performance does. Variable performance has huge impact on downstream systems, and you can quickly end up with cascading performance problems.
I think that quote is slightly misleading without more context. They prioritize optimizing variable-performing queries higher than others. They aren't going to be using slow queries on the Facebook home page.
>> "It is OK if a query is slow as long as it is always slow"
I'm having trouble understanding the motivation. If a slow query is always slow, then I'm always going to be kept waiting for that page/data. It seems logical to worry about the queries that keep users waiting 100% of the time rather than the queries that keep users waiting <100% of the time.
Does anyone care to explain why this is a good idea (for Facebook at least)?
The concept of variance reduction isn't even new to computing: in manufacturing, there's Six Sigma. I believe the famous GE CEO's quote was "Our Customers Feel the Variance, Not the Mean" (Google it).
I enjoyed the section on creating quality, not quantity, and its emphasis on minimizing variance. I can see how these heuristics could be applied to most startups.
The section on diagnosing should be taken with a grain of salt, though. If your company ever gets to the point where you need to monitor everything at subsecond level to catch problems or analyze and understand every layer of your stack to see how it performs, you've already won. That amount of attention to scalability means your company has a huge base of users. Not only that, it means you have the large and impressive engineering resources to devote to that problem.
That's definitely not my startup, and so the tools described, while definitely useful (and probably fun to build!), aren't anything approaching a priority for me. In the words of the stereotypical Yiddish grandmother, you should be so lucky to have those sorts of problems!
It'd be interesting to know some info on the hardware back end, such as number of servers, storage system, etc. Also, how many servers does a typical query touch?
Does anyone know the throughput of the largest MSSQL installation? I'm searching the web to show off a little information at work, but I can't find anything that compares.
Everyone is using MySQL. But I attended a talk by one of their DBAs where he said that their large OLTP server, the one that processes payments, is Oracle. Single-instance, because Oracle RAC couldn't give them the low latency they needed.
They switch to the biggest machine IBM can give them every few months.
I can't understand why anyone who is in the know would sign up for this. I work at a Fortune 5 corp where Oracle was once king and is being replaced with Microsoft SQL simply due to the outrageous price gouging. It's as if Oracle is trying to squeeze every last penny out of its aging database as OSS solutions chip away at its profits. PostgreSQL is, in my opinion, poised to do exactly that. MySQL just pales in comparison to Postgres, and Oracle is the kitchen sink and then some -- even at a Fortune 5 we barely use all the "features" in Oracle.
I'm just astonished that a company that's been around as long as Oracle could be so dumb. At the Fortune 5, Oracle has a similar practice of gouging us on PeopleSoft licenses due to, in my opinion, lost DB sales.
Charging a customer a license by CPU core is just unethical.
There's a full video of the talk available at http://www.livestream.com/facebookevents/video?clipId=flv_cc...
When all you have is a hammer...
From what I hear, Google and Bing do track response latencies at the 99th percentile and above.
See Deming's work on statistical process control used in wartime production during WWII: http://en.wikipedia.org/wiki/W._Edwards_Deming
I'm glad Facebook is following this old-school engineering tradition.
This is talked about here: http://www.mysqlperformanceblog.com/2010/06/07/performance-o...
It's no wonder... Good riddance.