"It turns out that this one little decision makes it so that we can't horizontally scale that layer of our architecture without losing all the data already there (because all of the keys would point to the wrong server if we added a new one)."
I'm surprised they're looking to switch stacks entirely as opposed to a consistent hashing / distributed hash table (a la chord, dynamo, etc.)
We can't just switch hashing at this point, because none of the data would be in the right place.
Since we have to move all the data anyway, we figured now would be a good time to switch stacks. Memcachedb isn't really a very solid product, so even if we scale it, it isn't a good long term solution.
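The trap described above is easy to see in a toy sketch. Assuming keys are placed by hashing mod the number of servers (the kind of scheme the quote implies; the MD5-mod-N mapping, key format, and server counts here are made up for illustration), adding one server silently invalidates most existing keys:

```python
import hashlib

def server_for(key: str, num_servers: int) -> int:
    """Place a key by its MD5 hash modulo the server count (the naive scheme)."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % num_servers

keys = [f"user:{i}" for i in range(10_000)]

# Where each key lives with 4 cache servers...
before = {k: server_for(k, 4) for k in keys}
# ...and where the same lookup goes after a 5th server is added.
after = {k: server_for(k, 5) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys now point at the wrong server")  # roughly 80%
```

Because MD5 output is uniform, a key stays put only when its digest happens to agree mod 4 and mod 5, so roughly four keys in five end up pointing at the wrong server — exactly the "losing all the data already there" problem.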
showerst | 16 years ago:
Crazy how such a minor design decision (not using consistent hashing) could have such huge consequences down the line...

kscaldef | 16 years ago:
Does anyone know what the specific issue being alluded to here is?

patio11 | 16 years ago:
MemcacheDB periodically flushes its in-memory DB to disk, using BDB. That causes a global read/write lock for a looooong time. (This pauses BCC for 5 seconds at a time when it happens, and I only have 20 MB stored in MemcacheDB. To rectify this I'm moving to Redis as soon as I get a day free. Suffice it to say that if Bingo Card Creator taxes the architecture of your key/value store, you may not quite be ready for prime time.)

sadiq | 16 years ago:
It's memcache but with BerkeleyDB underneath for persistence.
http://memcachedb.org/

jedberg | 16 years ago:
Yeah, that was a typo. MemcacheDB just isn't fast enough. It uses blocking IO when reading from disk, so if it is waiting on the disk for some data, none of the other requests can go through.
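The blocking-read failure mode described here can be shown in miniature. This is purely illustrative — a sleep standing in for a disk read, not MemcacheDB's actual code — but it shows why one stalled read backs up every queued request when IO is blocking and served one request at a time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def slow_disk_read(_):
    time.sleep(0.05)  # stand-in for BerkeleyDB waiting on a disk seek
    return b"value"

# Blocking, one request at a time: every request waits behind every
# disk read queued ahead of it -- the stall described above.
t0 = time.monotonic()
for i in range(10):
    slow_disk_read(i)
blocking = time.monotonic() - t0

# Concurrent reads: one stalled read no longer holds up the other nine.
t0 = time.monotonic()
with ThreadPoolExecutor(max_workers=10) as pool:
    list(pool.map(slow_disk_read, range(10)))
concurrent = time.monotonic() - t0

print(f"blocking: {blocking:.2f}s  concurrent: {concurrent:.2f}s")  # ~0.50s vs ~0.05s
```

Ten 50 ms reads take half a second end-to-end when serialized; overlapped, the slowest single read dominates instead.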
"It is a highly customized user experience, on par with something like Facebook (just not as many users)"
Really, Reddit?
While, I don't doubt that they have complexities to deal with, this sounds like skewed perspective of either overestimating their own complexity, or underestimating Facebook's.
We have had multiple discussions with the Facebook folks, and are well aware of the complexities of both sites. I believe you are actually underestimating reddit's complexity.
Here are some examples:
When you load a comments page with 500 comments, there are 1000s of data points that have to be loaded. For every comment, we have to check how you voted on it to draw the arrows, we have to check if you are the author, we have to check if you are allowed to remove that comment as a moderator, we have to check if you can edit it, if the author is your friend, and so on. And we have to do that 500 times.
When you load a listing, we have to pull your subscriptions, and then merge the results from all the reddits you subscribe to. Then we have to check most of the same things as we do for comments.
After all that checking, we have a to render a page that is customized just for you. Some of that will come from the render cache and some from the data cache, but it is still highly customized.
The big difference is that Facebook doesn't have any logged out users, whereas we do.
Akamai takes care of our logged out users, but rendering the page for a logged in user is extremely complicated.
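The per-comment bookkeeping described above is only survivable if the viewer-specific lookups are batched rather than done once per comment. A hypothetical sketch — the names and storage layer are invented, not reddit's actual code — of fetching the viewer's state (votes, friends, moderator rights) in a few bulk calls and then decorating all 500 comments in memory:

```python
# Hypothetical sketch, not reddit's actual code: batch the viewer-specific
# lookups once, then decorate every comment from in-memory sets.

class FakeStore:
    """Stand-in for the key/value layer; a real one would multi-get from cache."""
    def __init__(self, votes, friends, moderated):
        self._votes, self._friends, self._moderated = votes, friends, moderated

    def get_votes(self, viewer_id, comment_ids):
        # One batched round-trip for every arrow on the page.
        wanted = set(comment_ids)
        return {cid: v for (u, cid), v in self._votes.items()
                if u == viewer_id and cid in wanted}

    def get_friends(self, viewer_id):
        return self._friends.get(viewer_id, set())

    def get_moderated(self, viewer_id):
        return self._moderated.get(viewer_id, set())

def decorate_comments(viewer_id, comments, store):
    votes = store.get_votes(viewer_id, [c["id"] for c in comments])
    friends = store.get_friends(viewer_id)
    moderated = store.get_moderated(viewer_id)
    for c in comments:
        c["vote"] = votes.get(c["id"], 0)              # which arrow to highlight
        c["is_author"] = c["author"] == viewer_id      # edit/delete controls
        c["author_is_friend"] = c["author"] in friends
        c["can_remove"] = c["subreddit"] in moderated  # moderator powers
    return comments

store = FakeStore(votes={("alice", "c1"): 1},
                  friends={"alice": {"bob"}},
                  moderated={"alice": {"pics"}})
comments = [{"id": "c1", "author": "bob", "subreddit": "pics"},
            {"id": "c2", "author": "alice", "subreddit": "funny"}]
decorated = decorate_comments("alice", comments, store)
```

Three bulk fetches replace thousands of per-comment lookups; the per-comment work becomes cheap set membership tests.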
papaf | 16 years ago:
Two things surprise me about this article - probably because I've misunderstood it and don't see the big picture. One is that there are master and slave databases and searches are done off the master; I've always seen them done off the slaves in other systems. The other is that they state that using MD5 doesn't allow for horizontal scaling. One of the qualities of MD5 is that all bits have an equal probability of being 0 or 1. Surely the last 1 or 2 bits can be used to indicate which server is holding the data?

(unattributed):
Searches are likely done off slaves - I suspect that is not presented properly because of the oversimplification of the diagram.

jedberg | 16 years ago:
The search machine is its own database. It feeds its data from the masters for consistency, but the searches themselves run against the search database.

jedberg | 16 years ago:
I think the MD5 thing was covered well below.

smanek | 16 years ago:
You can use a few bits from an MD5 hash to pick the server, as long as you know up front how many servers you're going to have. The problem is that if you later want to add or remove a server, you need to come up with a new scheme and move every piece of data around so it ends up on the right server (which could take days or weeks).
The more scalable and flexible solution is a consistent hashing algorithm (check out some of the papers on Chord), so that adding or removing a server doesn't require you to move as much data around.
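The consistent-hashing alternative raised in this thread can be sketched as a minimal hash ring with virtual nodes (illustrative only; real systems like Chord or Dynamo add replication, membership protocols, and tuned node counts). A key belongs to the first server point clockwise from its hash, so adding a server claims only small arcs of the ring:

```python
import bisect
import hashlib

def _h(s):
    """Hash a string to a point on the ring (MD5, as in the article's scheme)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hash ring. Each server contributes many virtual
    points; a key maps to the first server point clockwise from its hash."""
    def __init__(self, servers, vnodes=100):
        self._points = sorted((_h(f"{s}#{v}"), s)
                              for s in servers for v in range(vnodes))
        self._hashes = [h for h, _ in self._points]

    def server_for(self, key):
        i = bisect.bisect(self._hashes, _h(key)) % len(self._points)
        return self._points[i][1]

keys = [f"link:{i}" for i in range(10_000)]
old = HashRing(["cache0", "cache1", "cache2", "cache3"])
new = HashRing(["cache0", "cache1", "cache2", "cache3", "cache4"])
moved = sum(1 for k in keys if old.server_for(k) != new.server_for(k))
print(f"{moved / len(keys):.0%} of keys move")  # ~20% (about 1/N), vs ~80% with mod-N
```

Only the keys falling into the new server's arcs relocate — about 1/N of the data — which is why adding or removing a server doesn't force the days-long full migration described above.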