
Facebook's Top Open Data Problems

180 points| huangwei_chang | 11 years ago |research.facebook.com | reply

61 comments

[+] EricBurnett|11 years ago|reply
I strongly dislike Facebook the product, and to a lesser extent Facebook the company, but I'm continually impressed with Facebook's approach to engineering in the open. I find this an interesting dichotomy. Would I want to work there? I still don't think so, but my opinion on that front is getting less strong over time.
[+] mFixman|11 years ago|reply
Former Facebook intern here. Facebook the company is a lot more 'hack-y', in the right sense of the word, than it looks from the outside. The company and its products are extremely open, and the projects you do working there have extremely little management and corporate bs.

My experience was really similar to my Google internship, and probably even closer to a "cool startup". I know several people who worked at Google, Facebook, and X (with X another major Silicon Valley company) and say that the first two were a lot closer to each other than to X.

[+] rebootthesystem|11 years ago|reply
Well, Facebook the company can feel rather juvenile to deal with. Some of their decisions in dealing with other businesses feel like they are being made by fifteen-year-old kids with no life or business experience.

In terms of the Facebook product one cannot deny the obvious: They touched THE nerve on the internet. World wide. Across languages and cultures.

I don't like the embodiment of the product at all. At the very least it has usability and privacy problems. Yet more people use it successfully than any other web app in the world. So, what do I know? What do experts know?

A similar thing could be said about CraigsList. It's 2014. Every time I use the site I cannot believe what I am looking at. Yet I and lots of other people keep using it. It works.

Facebook has a general lack of elegance (whatever that means). The kind that happens when a product is thrown together and evolved over time. Evolving anything over time means the output "naturally selects" (subverting the theory here) to the environment created by its users.

They survive because they optimized for what is important to their users. Grandma couldn't care less about UI issues or searchability. She wants to see her grandchildren's pictures and videos. And for that it works very well for a huge percentage of the planet.

At some point it becomes almost impossible to break the mold and clean up what might be less than ideal. Why would you? It works. Another "Innovator's Dilemma" [1] situation to a large extent.

[1] http://www.amazon.com/The-Innovators-Dilemma-Revolutionary-B...

[+] IndianAstronaut|11 years ago|reply
I was interviewing with them some time ago and just gave up half way. Those guys are absolute assholes and are hugely arrogant. Not a place I would want to work.
[+] Sven7|11 years ago|reply
That I am sure is also what they said about the Manhattan project. Anyways the train has left the station.
[+] ransom1538|11 years ago|reply
I had a really great time talking to the Facebook engineers during my interviews there. The main pattern I noticed was Harvard (I was applying in management). Even more so, the guys interviewing me were extremely talented and smart. What always weirded me out was... the problems they work on are not that difficult. Once you grasp sharding and operations, you are pretty much set. These guys are not the Manhattan project. The true hard problems in their space (developing their own mobile hardware, keeping teens engaged, pushing the boundaries of design, losing tracking systems in mobile, etc.) they don't face head on. Moving petabytes around or caching lots of things in memcache is something my roommate and I could do with an AWS account and a few beers. Memcache, for God's sake, is what, 300 lines of C?
[+] bachback|11 years ago|reply
True - Facebook scales linearly. If you're interested in really hard problems in distributed systems & blockchains, let me know. Facebook should not own social data. The future will be end-to-end encrypted and decentralised.
[+] mandeepj|11 years ago|reply
Can anybody shed some light on how Facebook's database is designed? I am sure it will be an interesting read.

I read somewhere some time back that each user at FB has their own database. I think that is not possible.

edit: I am googling this topic again now. The first link I found is http://www.quora.com/What-is-Facebooks-database-schema

[+] nbm|11 years ago|reply
There isn't one database, although there are a few major types.

The majority of core information (attributes of people and places and pages and so forth, as well as posts and comments) is stored in MySQL and queried through TAO.

Some data, such as messages, is primarily stored in things like HBase.

Non-primary-storage data (indexes and so forth) exists in various forms optimised for different workloads - so data in either MySQL or HBase might also exist in Hive for data warehouse queries, or in Unicorn for really fast search-style queries.

Other data (such as logs) might reside in one or more of the various data stores, such as Scuba, Hive, and HBase, and be accessible via Presto, for example.

TAO: https://www.facebook.com/publications/507347362668177/

Unicorn: https://www.facebook.com/publications/219621248185635

Hive: https://www.facebook.com/publications/374595109278618/

Scuba: https://www.facebook.com/publications/148418812023978/

Presto: http://facebook.github.io/presto/

[+] crazypyro|11 years ago|reply
This is slightly off topic, but has anyone experienced an increase in "fake" toasts from Facebook mobile? It seems that if I haven't used Facebook mobile in a few days, or I don't respond to their toasts about very minor people in my life uploading a photo, I tend to start getting toasts that say "You have 5 notifications, 3 pokes and 2 messages." Then I open the app and it takes me to an unknown error page.

Am I being too cynical in thinking that Facebook is intentionally misleading its users in an attempt to bump up their metrics? It interests me that they are seeing jumps in their mobile users (and consequently, ad sales) at the same time that I have been receiving more notifications than ever. Interestingly, the slowdown in fake toast notifications coincided with their quarterly earnings report that show mobile ads accounting for an increasingly large portion of revenue and also mentions an increase in mobile user usage.

Comparing Q1 with Q2 with Q3, Q2-Q3 showed double the increase in ad revenue percent from mobile (59% to 62% to 66%). Maybe this is all just anecdotal evidence, but it seems like this sort of fake notification should either not be sent out (a failure of the system that keeps track of which user receives which toasts) or there was a conscious effort to send these notifications....

[+] crazypyro|11 years ago|reply
If anyone cares, I went and looked at their metrics and it seems that Q2 to Q3 was one of their biggest increases in mobile alone (albeit not by a whole lot) in quite a while, yet if you look at the raw user metrics over all platforms, it was slower than almost every other quarter in terms of users gained. I'm not sure if this adds any credibility to my wild theory, but it does at least show there is something affecting the increase in mobile usage, although that could just be market factors.

Interestingly, Twitter's metrics don't appear to show any similar rise in the rate of adoption.

[+] coolsunglasses|11 years ago|reply
It's been doing this for me via email lately and it's really annoying.
[+] andrewchoi|11 years ago|reply
Sorry if this is a silly question, but what are "toasts"?
[+] beagle3|11 years ago|reply
Something does not add up about Hive: they say it holds 300 PB and that they generate 4 PB per day, which means, at this rate, all of the data was generated within the last 75 days.
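
The arithmetic behind that 75-day figure, as a quick sketch. The two numbers are the ones quoted from the article; the variable names are only for illustration, and the calculation assumes nothing is ever deleted or compacted.

```python
# Figures quoted from the article.
warehouse_size_pb = 300  # total size of the Hive warehouse, in PB
daily_growth_pb = 4      # new data generated per day, in PB

# Days of data the warehouse could hold at the current rate,
# assuming no deletion and no compaction.
days_of_data = warehouse_size_pb / daily_growth_pb
print(days_of_data)  # 75.0
```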
[+] boomzilla|11 years ago|reply
Most likely the 300PB is distilled/normalized/compacted data, whereas the 4PB per day is raw logs.
[+] alkonaut|11 years ago|reply
Also 800.000 tables? Surely not in the sense I'm used to, that is, normalized forms where a table corresponds roughly to some business object/noun? Does table mean something else or are there 800k different types of data in there?
[+] zeroonetwothree|11 years ago|reply
Data is not stored forever. If most data is only stored ~30 days, then the numbers make sense.
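
A rough sketch of that reasoning, assuming a flat 4 PB/day of ingest and a single retention window. The article states neither a retention policy nor how much of the warehouse is short-lived, so `retention_days` here is purely hypothetical.

```python
def steady_state_size_pb(daily_ingest_pb, retention_days):
    """Size a store settles at if each day's data is dropped after retention_days."""
    return daily_ingest_pb * retention_days

# Hypothetical 30-day retention on all newly generated data.
print(steady_state_size_pb(4, 30))  # 120
```

Under those assumptions the short-lived data would account for about 120 PB, leaving room in the 300 PB total for data that is kept longer.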
[+] Cakez0r|11 years ago|reply
I'm really curious how they handle paging if they're only using memcached. E.g. if a photo node has 10,000 comment nodes (and thus 10,000 edges linking the photo to the comments), chances are you only want to display the most recent 50 comments. Are all of the 10,000 edges stored in memcached under one key and then paged on the application servers? Are they stored in chunks under multiple keys? How is cache consistency maintained if somebody makes a new comment (maintaining the time ordering seems tricky and expensive)?

This is a problem I'm actively trying to solve for a project, so if somebody knows the answer, please get in touch!
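
One possible shape for the "chunks under multiple keys" option, as a minimal sketch. This is not how Facebook or TAO does it; the class name, key layout, and chunk size are all made up for illustration, and a plain dict stands in for memcached.

```python
class ChunkedEdgeList:
    """Append-only edge list split into fixed-size chunks, one cache key per chunk."""

    def __init__(self, cache, chunk_size=50):
        self.cache = cache            # dict used as a stand-in for memcached
        self.chunk_size = chunk_size  # hypothetical page-sized chunk

    def _count_key(self, node_id):
        return f"edges:{node_id}:count"

    def _chunk_key(self, node_id, index):
        return f"edges:{node_id}:chunk:{index}"

    def append(self, node_id, edge_id):
        # New edges always land in the last chunk, so older chunks never change.
        count = self.cache.get(self._count_key(node_id), 0)
        key = self._chunk_key(node_id, count // self.chunk_size)
        self.cache[key] = self.cache.get(key, []) + [edge_id]
        self.cache[self._count_key(node_id)] = count + 1

    def most_recent(self, node_id, limit):
        # Walk chunks from newest to oldest until enough edges are collected.
        count = self.cache.get(self._count_key(node_id), 0)
        index = (count - 1) // self.chunk_size
        result = []
        while index >= 0 and len(result) < limit:
            chunk = self.cache.get(self._chunk_key(node_id, index), [])
            result.extend(reversed(chunk))
            index -= 1
        return result[:limit]


edges = ChunkedEdgeList({})
for comment_id in range(10_000):
    edges.append("photo:1", comment_id)
print(edges.most_recent("photo:1", 3))  # [9999, 9998, 9997]
```

Because edges are appended in arrival order, time ordering comes for free and a new comment touches only the newest chunk plus the counter. The sketch ignores the hard parts: the read-modify-write in `append` is not atomic (a real memcached client would need something like check-and-set), and a missing chunk would have to be refilled from the backing database.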

[+] alexgartrell|11 years ago|reply
That's what TAO (mentioned in the article) is for
[+] swah|11 years ago|reply
I'd like to use this opportunity to ask: is it a technical limitation that users still can't search their timeline?
[+] mmmooo|11 years ago|reply
So ~650M daily active users and 4PB of data warehouse created each day: that means ~7MB of new data on each active user per day. Given that it's a data warehouse, I'm going to guess it's not images, so that seems like a lot to me. I guess it shouldn't surprise anyone that every interaction, on and off the site, is heavily tracked.
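
The per-user estimate as a quick sketch, using decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes). The function name is only for illustration; the inputs are the figures quoted above.

```python
PB = 10**15  # decimal petabyte, in bytes
MB = 10**6   # decimal megabyte, in bytes

def bytes_per_active_user(daily_bytes, daily_actives):
    """Average new warehouse bytes per daily active user."""
    return daily_bytes / daily_actives

per_user_mb = bytes_per_active_user(4 * PB, 650_000_000) / MB
print(round(per_user_mb, 1))  # 6.2
```

With binary petabytes (2^50 bytes) the same division comes out closer to 7 MB, so the estimate holds either way.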
[+] nbm|11 years ago|reply
A lot of that data is duplicated to allow for efficient querying or transformation. It often is too slow to process the data as it comes in, so an initial process will write the data in a raw form, and some other process might select a subset of the data to process, and then submit it in an "annotated" form (filling in, say, the AS number of the client IP). Another process will run later in a batched fashion and perhaps annotate the full set of information and summarize it into a bunch of easily-queried tables.
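
A toy sketch of that raw, annotated, summarized flow. Everything here is hypothetical: the IP-to-AS lookup table, the record fields, and the function names are made up, and the addresses and AS numbers come from the ranges reserved for documentation.

```python
from collections import Counter

# Hypothetical stand-in for a real IP-to-AS-number database.
AS_BY_IP = {"192.0.2.1": 64500, "198.51.100.7": 64501}

def write_raw(events):
    """Stage 1: store events untouched, because processing inline is too slow."""
    return list(events)

def annotate(record):
    """Stage 2: copy a record and fill in derived fields such as the AS number."""
    return {**record, "asn": AS_BY_IP.get(record["ip"])}

def summarize(records):
    """Stage 3 (batch): roll annotated records up into an easily queried table."""
    return Counter(r["asn"] for r in records)

raw = write_raw([
    {"ip": "192.0.2.1", "path": "/a"},
    {"ip": "198.51.100.7", "path": "/b"},
    {"ip": "192.0.2.1", "path": "/c"},
])
annotated = [annotate(r) for r in raw]
summary = summarize(annotated)
print(summary[64500])  # 2
```

Note that `raw`, `annotated`, and `summary` all exist at once, which is the duplication being described: the same events are stored several times in different shapes.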

A lot of that data is also not tied to individuals either - for example the access logs for the CDN (which, being on a different domain by design, does not share cookies so is not attached to an account) even reasonably heavily sampled is probably tens of gigabytes a day, and is rolled up into efficient forms for queries in various ways. A lot of it isn't even about requests coming through the web site/API - it may just be internal inter-service request information, or inter-datacenter flow analysis, or per-machine service metrics ("Oh, look, process A on machines B through E went from 2GB resident to 24GB in 30 seconds a few seconds before the problem manifested").

(Not that it makes too much of a difference at this scale, but it is closer to 860M daily actives.)

[+] srcmap|11 years ago|reply
FB and Google can clone what you're thinking, maybe predict what you will be thinking? :-)

I wonder if they can predict, with some percentage accuracy, what any particular active US user might vote for today based on the user's graph data?

[+] doque|11 years ago|reply
3. Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day.

So 4 PB per day, but only 300 PB total?

[+] ddoolin|11 years ago|reply
Was wondering the same thing. My guess is that some also gets removed each day as well, but it seems unlikely.
[+] Thaxll|11 years ago|reply
Still using Memcache, wow.
[+] Goranek|11 years ago|reply
What's wrong with Memcached? Why are you so surprised?