
(no title)

I really enjoy Discord's blog. Their Cassandra write-up was excellent as well. A couple of thoughts and questions:

- Having many clusters and assigning messages to a specific cluster seems like an interesting solution.

- I'm curious how they managed to lazily index messages.

- Since only message, channel and server ids are stored in ES, have there been any problems reindexing data after an index fails?


jhgg|9 years ago

The first time you run a search in a server (or the first time after that server's index fails), it triggers a full re-index of that server. Ctrl-F "Historical Index" in the blog post for more details! If you've never used search in a server, its messages are not indexed in real time until you search for the first time. Both of these things make the system "lazy".

The worst case of an index failure is that the search query is delayed while the index rebuilds itself. We throttle the rate of historical indexing into ES to a safe level so that we're not degrading the performance of other components of the system.
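To make the "lazy" behaviour above concrete, here is a minimal in-memory sketch in Python. It is not Discord's implementation: `LazySearchIndex`, `batch_size`, `drop_index`, and the plain lists standing in for ES and the message store are all hypothetical names chosen for illustration. It only shows the control flow described above: nothing is indexed until the first search, the first search triggers a batched historical index, and real-time indexing starts after that.

```python
from dataclasses import dataclass, field


@dataclass
class LazySearchIndex:
    """Toy stand-in for the search index; every name here is hypothetical."""

    batch_size: int = 2  # crude throttle: messages indexed per historical batch
    indexed_servers: set = field(default_factory=set)
    documents: dict = field(default_factory=dict)  # server_id -> indexed messages

    def on_new_message(self, server_id, message):
        # Real-time indexing only happens for servers that were searched before.
        if server_id in self.indexed_servers:
            self.documents[server_id].append(message)
            return True
        return False

    def historical_index(self, server_id, history):
        # Rebuild the server's index in small batches. A real worker would
        # pause between batches to protect the rest of the system.
        self.documents[server_id] = []
        batches = 0
        for start in range(0, len(history), self.batch_size):
            self.documents[server_id].extend(history[start:start + self.batch_size])
            batches += 1
        self.indexed_servers.add(server_id)
        return batches

    def search(self, server_id, term, history):
        # First search (or first search after a failure) pays for the re-index.
        if server_id not in self.indexed_servers:
            self.historical_index(server_id, history)
        return [m for m in self.documents[server_id] if term in m]

    def drop_index(self, server_id):
        # Simulates an index failure: the server becomes "never searched" again.
        self.indexed_servers.discard(server_id)
        self.documents.pop(server_id, None)
```

In this sketch an index failure costs only latency on the next search, which matches the worst case described above.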

stickperson|9 years ago

Oh, I think I get it now: is it that the _initial_ indexing is lazy, but all indexing after that is done automatically by the historical index workers? Basically, when a user searches for something, do you check that ES has something for that user and, if it doesn't, start the initial indexing process, and from there the workers do their thing?

daddykotex|9 years ago

I agree with you, and I've got a question as well.

I'm wondering how long it takes to execute the ES refresh on a search query when the shard was marked as dirty.

If search requests are mostly real time, I suspect this is really short, but if the shard ingests new messages for a while (let's say 50 minutes) and is marked as dirty, a search query would ask ES to refresh 50 minutes' worth of documents before running the actual query.

Has it been shown to be a problem? Does the refresh time grow with the number of documents inserted since the last refresh?
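For readers following along, the refresh-on-dirty pattern being asked about can be sketched roughly as below. This is an assumption-laden toy, not ES or Discord code: the `Shard` class, its `pending` and `searchable` lists, and `refresh_count` are hypothetical, and a list append stands in for an actual ES refresh. It shows why the amount of work done at search time depends on how much was written since the last refresh.

```python
class Shard:
    """Toy model of a shard that is refreshed only when a search needs it."""

    def __init__(self):
        self.pending = []       # written, but not yet visible to searches
        self.searchable = []    # visible to searches
        self.dirty = False      # True when writes happened since the last refresh
        self.refresh_count = 0  # how many refreshes have actually run

    def index(self, message):
        # Writes are cheap: buffer the document and mark the shard dirty.
        self.pending.append(message)
        self.dirty = True

    def refresh(self):
        # Make everything written since the last refresh searchable.
        self.searchable.extend(self.pending)
        self.pending.clear()
        self.dirty = False
        self.refresh_count += 1

    def search(self, term):
        # Only pay for a refresh if something changed since the last one.
        if self.dirty:
            self.refresh()
        return [m for m in self.searchable if term in m]
```

Two searches in a row with no writes in between trigger a single refresh; the second search skips it because the shard is no longer dirty.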

jhgg|9 years ago

Good question. So far we've noticed the refresh time to be negligible (worst case in the tens of milliseconds). It's worth noting that most of the cost of doing a search on Discord is in pulling the message context from Cassandra to provide enough data to render the results in the client.