I strongly dislike Facebook the product, and to a lesser extent Facebook the company, but I'm continually impressed with Facebook's approach to engineering in the open. I find this an interesting dichotomy. Would I want to work there? I still don't think so, but my opinion on that front is getting less strong over time.
Former Facebook intern here. Facebook the company is a lot more "hacky", in the right sense of the word, than it looks from the outside. The company and its products are extremely open, and the projects you work on there come with very little management and corporate BS.
My experience was really similar to my Google internship, and probably even closer to a "cool startup". I know several people who worked at Google, Facebook, and X (X being another major Silicon Valley company) who say that the first two were a lot closer to each other than to X.
Well, Facebook the company can feel rather juvenile to deal with. Some of their decisions in dealing with other businesses feel like they were made by fifteen-year-old kids with no life or business experience.
In terms of the Facebook product, one cannot deny the obvious: they touched THE nerve on the internet. Worldwide. Across languages and cultures.
I don't like the embodiment of the product at all. At the very least it has usability and privacy problems. Yet more people use it successfully than any other web app in the world. So, what do I know? What do experts know?
A similar thing could be said about CraigsList. It's 2014. Every time I use the site I cannot believe what I am looking at. Yet I and lots of other people keep using it. It works.
Facebook has a general lack of elegance (whatever that means). The kind that happens when a product is thrown together and evolved over time. Evolving anything over time means the output "naturally selects" (subverting the theory here) to the environment created by its users.
They survive because they optimized for what is important to their users. Grandma couldn't care less about UI issues or searchability. She wants to see her grandchildren's pictures and videos. And for that it works very well for a huge percentage of the planet.
At some point it becomes almost impossible to break the mold and clean up what might be less than ideal. Why would you? It works. Another "Innovator's Dilemma" [1] situation to a large extent.
I was interviewing with them some time ago and just gave up half way. Those guys are absolute assholes and are hugely arrogant. Not a place I would want to work.
I had a really great time talking to the Facebook engineers during my interviews there. The main pattern I noticed was Harvard (I was applying in management). Even more so, the guys interviewing me were extremely talented and smart. What always weirded me out was... the problems they work on are not that difficult. Once you grasp sharding and operations, you are pretty much set. These guys are not the Manhattan Project. The truly hard problems in their space (developing their own mobile hardware, keeping teens engaged, pushing the boundaries of design, losing tracking systems in mobile, etc.) they don't face head-on. Moving petabytes around or caching lots of things in memcache is something my roommate and I could do with an AWS account and a few beers. Memcached, for God's sake, is what, 300 lines of C?
True: Facebook scales linearly. If you're interested in really hard problems in distributed systems and blockchains, let me know. Facebook should not own social data; the future will be end-to-end encrypted and decentralised.
There isn't one database, although there are a few major types.
The majority of core information (attributes of people and places and pages and so forth, as well as posts and comments) is stored in MySQL and queried through TAO.
Some data, such as messages, has its primary storage in things like HBase.
Non-primary-storage data (indexes and so forth) exists in various forms optimised for different workloads, so data in either MySQL or HBase might also exist in Hive for data-warehouse queries, or in Unicorn for really fast search-style queries.
Other data (such as logs) might reside in one or more of the various data stores, such as Scuba, Hive, or HBase, and be accessible via Presto, for example.
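To make that layering concrete, here is a toy sketch of a primary store with derived read stores, with plain Python dicts standing in for MySQL/TAO, Unicorn, and Hive. All names, and the inline fan-out, are illustrative assumptions, not how Facebook actually wires it (the real fan-out is asynchronous and far more involved):

```python
from collections import defaultdict

# Toy stand-ins; in reality these are MySQL/TAO, Unicorn, and Hive.
primary = {}                      # source of truth, keyed by object id
search_index = defaultdict(set)   # derived: term -> object ids (search-style)
warehouse_rows = []               # derived: append-only rows (warehouse-style)

def write_object(obj_id, fields):
    """Write to the primary store, then fan out to the derived stores.
    Real systems do the fan-out asynchronously, not inline like this."""
    primary[obj_id] = fields
    for term in fields.get("name", "").lower().split():
        search_index[term].add(obj_id)
    warehouse_rows.append({"id": obj_id, **fields})
```

The point is the split: one authoritative copy, plus purpose-built derived copies optimised for each read workload.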
This is slightly off topic, but has anyone experienced an increase in "fake" toasts from Facebook mobile? It seems that if I haven't used Facebook mobile in a few days, or I don't respond to their toasts about very minor people in my life uploading a photo, I start getting toasts that say "You have 5 notifications, 3 pokes and 2 messages." Then I open the app and it takes me to an unknown-error page.
Am I being too cynical in thinking that Facebook is intentionally misleading its users in an attempt to bump up their metrics? It interests me that they are seeing jumps in their mobile users (and consequently, ad sales) at the same time that I have been receiving more notifications than ever. Interestingly, the slowdown in fake toast notifications coincided with their quarterly earnings report that show mobile ads accounting for an increasingly large portion of revenue and also mentions an increase in mobile user usage.
Comparing Q1 with Q2 with Q3, Q2-Q3 showed double the increase in the percentage of ad revenue from mobile (59% to 62% to 66%). Maybe this is all anecdotal evidence, but it seems like either these fake notifications should not have been sent out (a failure of the system that tracks which toasts each user receives) or there was a conscious effort to send them....
If anyone cares, I went and looked at their metrics, and it seems that Q2 to Q3 was one of their biggest increases in mobile alone (albeit not by a whole lot) in quite a while, yet if you look at the raw user metrics across all platforms, it was slower than almost every other quarter in terms of users gained. I'm not sure if this adds any credibility to my wild theory, but it does at least show there is something affecting the increase in mobile usage, although that could just be market factors.
Interestingly, Twitter's metrics don't appear to show any similar rise in the rate of adoption.
Something does not add up about Hive: they say it holds 300 PB and generates 4 PB per day, which means that, at this rate, all of the data was generated within the last 75 days.
Also, 800,000 tables? Surely not in the sense I'm used to, that is, normalized forms where a table corresponds roughly to some business object/noun? Does "table" mean something else, or are there 800k different types of data in there?
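The arithmetic behind that 75-day observation is just:

```python
TOTAL_PB = 300       # reported warehouse size
NEW_PB_PER_DAY = 4   # reported daily growth
days_to_fill = TOTAL_PB / NEW_PB_PER_DAY  # 75.0 days at a constant rate
```

Of course that assumes nothing is ever compacted, expired, or sampled down, which would be unusual for log-style data.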
I'm really curious how they handle paging if they're only using memcached. E.g., if a photo node has 10,000 comment nodes (and thus 10,000 edges linking the photo to the comments), chances are you only want to display the most recent 50 comments. Are all 10,000 edges stored in memcached under one key and then paged on the application servers? Are they stored in chunks under multiple keys? How is cache consistency maintained if somebody makes a new comment (maintaining the time ordering seems tricky and expensive)?
This is a problem I'm actively trying to solve for a project, so if somebody knows the answer, please get in touch!
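For what it's worth, one common pattern (no claim that this is what Facebook does) is to split the edge list across fixed-size chunk keys, so a pager only fetches the newest chunks. A toy sketch, with a dict standing in for the memcached client and all key names made up:

```python
class FakeMemcache:
    """Dict standing in for a memcached client (get/set only)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def set(self, key, value):
        self.store[key] = value

class ChunkedEdgeCache:
    """Store a node's edges in fixed-size chunks under separate keys,
    so paging the newest N edges touches only a few small keys."""
    def __init__(self, mc, chunk_size=1000):
        self.mc = mc
        self.chunk_size = chunk_size

    def _count_key(self, node):
        return f"{node}:edgecount"

    def _chunk_key(self, node, i):
        return f"{node}:edges:{i}"

    def add_edge(self, node, edge):
        count = self.mc.get(self._count_key(node)) or 0
        i = count // self.chunk_size  # index of the newest chunk
        chunk = self.mc.get(self._chunk_key(node, i)) or []
        chunk.append(edge)
        self.mc.set(self._chunk_key(node, i), chunk)
        self.mc.set(self._count_key(node), count + 1)

    def latest(self, node, n):
        """Newest-first page of up to n edges, reading newest chunks first."""
        count = self.mc.get(self._count_key(node)) or 0
        out = []
        i = (count - 1) // self.chunk_size  # -1 when count == 0
        while i >= 0 and len(out) < n:
            chunk = self.mc.get(self._chunk_key(node, i)) or []
            out.extend(reversed(chunk))
            i -= 1
        return out[:n]
```

Appends always go to the newest chunk, which keeps time ordering cheap: a new comment only rewrites one small key rather than invalidating the whole 10,000-edge list.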
So ~650M daily active users and ~4 PB of data warehouse created each day: that means ~7 MB of new data on each active user per day. Given that it's a data warehouse, I'm going to guess it's not images, so that seems like a lot to me. I guess it shouldn't surprise anyone that every interaction on and off the site is heavily tracked.
A lot of that data is duplicated to allow for efficient querying or transformation. It is often too slow to process the data as it comes in, so an initial process will write the data in a raw form, and some other process might select a subset of the data to process and then submit it in an "annotated" form (filling in, say, the AS number of the client IP). Another process will run later in a batched fashion and perhaps annotate the full set of information and summarize it into a bunch of easily queried tables.
A lot of that data is also not tied to individuals either - for example the access logs for the CDN (which, being on a different domain by design, does not share cookies so is not attached to an account) even reasonably heavily sampled is probably tens of gigabytes a day, and is rolled up into efficient forms for queries in various ways. A lot of it isn't even about requests coming through the web site/API - it may just be internal inter-service request information, or inter-datacenter flow analysis, or per-machine service metrics ("Oh, look, process A on machines B through E went from 2GB resident to 24GB in 30 seconds a few seconds before the problem manifested").
(Not that it makes too much of a difference at this scale, but it is closer to 860M daily actives.)
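The raw / annotated / summarized staging described above can be sketched in a few lines; the field names and the `asn_lookup` table are made up for illustration:

```python
from collections import Counter

def ingest_raw(events):
    """Stage 1: land events as-is; no per-event work on the write path."""
    return list(events)

def annotate(raw_events, asn_lookup):
    """Stage 2: enrich each event, e.g. fill in the AS number of the client IP."""
    return [dict(e, asn=asn_lookup.get(e["client_ip"], 0)) for e in raw_events]

def summarize(annotated_events):
    """Stage 3: batched roll-up into an easily queried summary (requests per AS)."""
    return Counter(e["asn"] for e in annotated_events)
```

Each stage rewrites the same underlying data in a new form, which is one reason the warehouse grows much faster than the "new facts per user" intuition suggests.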
3. Hive is Facebook's data warehouse, with 300 petabytes of data in 800,000 tables. Facebook generates 4 new petabytes of data and runs 600,000 queries and 1 million map-reduce jobs per day.
[1] http://www.amazon.com/The-Innovators-Dilemma-Revolutionary-B...
Memcached is more like 4,000 lines of C (guesstimating after subtracting comments etc.): https://github.com/memcached/memcached/blob/master/memcached...
I was reading somewhere a while back that each user at FB has their own database. I think that is not possible.
edit: I'm googling this topic again now. The first link found is http://www.quora.com/What-is-Facebooks-database-schema
TAO: https://www.facebook.com/publications/507347362668177/
Unicorn: https://www.facebook.com/publications/219621248185635
Hive: https://www.facebook.com/publications/374595109278618/
Scuba: https://www.facebook.com/publications/148418812023978/
Presto: http://facebook.github.io/presto/
I wonder if they can predict, with some percentage of accuracy, what any particular active US user might vote for today based on the user's graph data.
So 4 PB per day, but only 300 PB total?