How we replaced Elasticsearch and MongoDB with Rust and RocksDB

brunohaid|6 months ago

Bit thin on details and not looking like they’ll open source it, but if someone clicked the post because they’re looking for their “replace ES” thing:

Both https://typesense.org/ and https://duckdb.org/ (with their spatial plugin) are excellent geo performance wise, the latter now seems really production ready, especially when the data doesn’t change that often. Both fully open source including clustered/sharded setups.

No affiliation at all, just really happy camper.

j_kao|6 months ago

These are great projects, we use DuckDB to inspect our data lake and for quick munging.

We will have some more blog posts in the future describing different parts of the system in more detail. We were worried too much density in a single post would make it hard to read.

sureglymop|6 months ago

These are great. I am eternally grateful that projects like this are open source. I do, however, find it hard to integrate them into my own projects.

A while ago I tried to create something that has duckdb + its spatial and SQLite extensions statically linked and compiled in. I realized I was a bit in over my head when my build failed because both of them required SQLite symbols but from different versions.

atombender|6 months ago

DuckDB does not have any kind of sharding or clustering? It doesn't even have a server (unless you count the HTTP Server Extension)?

jjordan|6 months ago

Typesense is an absolute beast, and it has a pretty great dev experience to boot.

mcdonje|6 months ago

Not sure what they'd open source. The Rust code? They're calling it a DB, but they described an entire stack.

ericcholis|6 months ago

Typesense as a product has been great (hosted cluster). Customer support has been awesome as well.

maelito|6 months ago

I wonder if this could help Photon, the open source ElasticSearch/OpenSearch search engine for OSM data.

It's a mini-revolution in the OSM world, where most apps have a bad search experience where typos aren't handled.

https://github.com/komoot/photon
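To make "typos aren't handled" concrete, here is a minimal sketch of typo-tolerant name matching using Python's standard `difflib`. It is a stand-in for illustration only, not how Photon or Elasticsearch do fuzzy matching; the `suggest` helper, the sample names, and the 0.6 cutoff are all assumptions.

```python
import difflib


def suggest(query, names, limit=3, cutoff=0.6):
    """Return up to `limit` names that look similar to `query`, best first."""
    # Compare case-insensitively, but return the original spelling.
    by_lower = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(by_lower), n=limit, cutoff=cutoff
    )
    return [by_lower[h] for h in hits]


if __name__ == "__main__":
    # "Berln" is missing a letter; an exact-match search would return nothing.
    print(suggest("Berln", ["Berlin", "Bern", "Bremen"]))
```

This scans every candidate, so it only suits small lists; a real search engine would use an index built for fuzzy lookups.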

hyc_symas|6 months ago

A system built on LMDB will work better for this use case than RocksDB. And OSM Express already uses it. https://wiki.openstreetmap.org/wiki/OSM_Express

pm90|6 months ago

Slightly meta, but I find it's a good sign that we're back to designing and blogging about in-house data storage systems/query engines again. There was an explosion of these in the 2010s, which seemed to slow down/refocus on AI recently.

0xbadcafebee|6 months ago

It slowed down not because of AI, but because it turned out it was mostly pointless. Highly specialized stacks that could usually be matched in performance by tweaking an existing system or scaling a different way.

In-house storage/query systems that are not a product being sold by itself are NIH syndrome by a company with too many engineering resources.

8n4vidtmkvmk|6 months ago

Is it good? What's left to innovate on in this space? I don't really want experimental data stores. Give me something rock solid.

wavemode|6 months ago

NoSQL/alternative databases became kind of a meme once people realized that 95% of enterprises can do fine with just Postgres.

pianoben|6 months ago

Lol I "love" that the first benefit this company lists in their jobs page is "In-Office Culture". Do people actually believe that having to commute is a benefit?

nickm12|6 months ago

You can't reduce the in-office or remote experience purely to commuting. It's just one aspect about how and where you work and work life balance in general.

But since you asked, yes, I actually enjoy commuting when it is less than 30 minutes each way and especially when it involves physical activities. My best commutes have been walking and biking commutes of around 20-25 minutes each way. They give me exercise, a chance to clear my head, and provide "space" between work and home.

During 2020, I worked from home the entire time and eventually I found it just mentally wasn't good for me to work and live in the same space. I couldn't go into the office, so I started taking hour long walks at the end of every day to reset. It helped a lot.

That said, I've also done commutes of up to an hour each way by crowded train and highway driving and those are...not good.

01HNNWZ0MV43FF|6 months ago

In-office culture would be dope if there were actual benefits to an office like maybe

Learning from smart people, making friends, free food and drinks, a DDR machine

My last office job had none of that. Instead it was just sort of like a depressing scaled up version of my home office

victorbjorklund|6 months ago

Some people like being in office. People are different.

aflag|6 months ago

I'd rather commute than WFH. So yeah, people do. Maybe not all the people, but certainly some people.

michaelcampbell|6 months ago

> Do people actually believe that having to commute is a benefit?

Everything is subjective here. I don't love commuting, but I'm remote now and there are days I kind of miss it. I got a lot more podcast listening in when I did, which I really do miss, and I enjoyed getting out of the house, on a schedule, and seeing my city and area.

As for BEING in the office, yes I also miss that. I miss the friendships with people from other parts of the org that I made; I miss the getting together at lunch and talking about both work and non-work stuff; I miss the pinball machines that one enthusiast set up.

THAT SAID, I abhor the _requirement_ to be in an office; it's a top down, heavy handed, hamfisted attempt at trying to force something that IMO can only come naturally, under the guise of "CuLtUrE!", and unless forced to I won't consider any job that requires it. (NB: This, too, is a tradeoff - if it's close to my house and I've got some latitude as to what time to make it there so I can have some freedom to avoid the heaviest of traffic, sure.)

This is just another example of the "open office" concept. When that came out everyone hated it except for the C-suite that didn't have to do it, under the mistaken idea that it forces "collaboration, which is good", when the reality was that the "good" part was emergent, holistic, and natural, and any forcing function kills it. But of course we also know that it was nothing but a cost-savings issue, and the "collaboration" argument was a gaslight retcon of the highest order. Open offices actually worked when PART of the office was open, allowing collaboration _as needed_ and driven by the teams/groups that wanted to do it, not by management. RTO is exactly the same.

trimbo|6 months ago

This article is lacking detail. For example, how is the data sharded, how much time between indexing and serving, and how does it handle node failure, and other distributed systems questions? How does the latency compare? Etc. etc.

softwaredoug|6 months ago

It’s interesting as someone in the search space how many companies are aiming to “replace Elasticsearch”

j_kao|6 months ago

Author here! We were really motivated to turn a "distributed system" problem into a "monolithic system" from an operations perspective and felt this was achievable with current hardware, which is why we went with in-process, embedded storage systems like RocksDB and Tantivy.

Memory-mapping lets us get pretty far, even with global coverage. We are always able to add more RAM, especially since we're running in the cloud.

Backfills and data updates are also trivial and can be performed in an "immutable" way without having to reason about what's currently in ES/Mongo, we just re-index everything with the same binary in a separate node and ship the final assets to S3.
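A minimal sketch of the "immutable" re-index idea described above, under stated assumptions: a local directory stands in for S3, a JSON file stands in for the real RocksDB/Tantivy assets, and the names `build_index`, `publish`, and `load_current` are hypothetical. The point is only that each build lands in a fresh location and readers switch over in one atomic step.

```python
import json
import os
import tempfile


def build_index(records, out_dir):
    """Write a complete, read-only index into a brand-new directory."""
    os.makedirs(out_dir)
    with open(os.path.join(out_dir, "places.json"), "w", encoding="utf-8") as f:
        json.dump(records, f)


def publish(root, version, records):
    """Build `version` off to the side, then switch `current` to it."""
    out_dir = os.path.join(root, f"v{version}")
    build_index(records, out_dir)

    # Point a temporary symlink at the new build, then rename it over
    # `current`. The rename is atomic on POSIX, so readers see either the
    # old build or the new one, never a half-written index.
    tmp_link = os.path.join(root, f"current.tmp-{version}")
    os.symlink(out_dir, tmp_link)
    os.replace(tmp_link, os.path.join(root, "current"))


def load_current(root):
    """Read whichever build `current` points at."""
    with open(os.path.join(root, "current", "places.json"), encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    publish(root, 1, [{"name": "Old Town"}])
    publish(root, 2, [{"name": "Old Town"}, {"name": "New Pier"}])
    print(load_current(root))
```

Old builds stay on disk untouched, so rolling back is just pointing `current` at the previous version again.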

mikeocool|6 months ago

In my experience, the care and feeding that goes into an Elasticsearch cluster feels like it's often substantially higher than that involved in the primary data store, which has always struck me as a little odd (particularly in cases where the primary data store is an RDBMS).

I'd be very happy to use simpler more bulletproof solutions with a subset of ES's features for different use cases.

tracker1|6 months ago

Nice... it's cool to see how different companies are putting together best fit solutions. I'm also glad that they at least started out with off the shelf apps instead of jumping to something like a bespoke solution early on.

Quickwit[1] looks interesting, found via Tantivity reference. Kind of like ES w/ Lucene.

1. https://github.com/quickwit-oss/quickwit

francoismassot|6 months ago

it's tantivy :)

tapirl|6 months ago

It is weird to include "Rust" (a language) in the title. Readers might wonder what is replaced by Rust? Elasticsearch or MongoDB?

0xbadcafebee|6 months ago

Rocks is a fork of Level, and Level is well known for data corruption and other bugs. They are both "run at production scale", but at least back when I worked on stuff that used Level, nobody talked publicly about all the toil spent on cleaning up and repairing Level to keep the services based on it running.

Whenever you see an advertisement like this (these posts are ads for the companies publishing them), they will not be telling you the full truth of their new stack, like the downsides or how serious they can be (if they've even discovered them yet). It's the same for tech talks by people from "big name companies". They are selling you a narrative.

Jweb_Guru|6 months ago

RocksDB diverged from LevelDB a long time ago at this point and has had extensive work done on it by both industry and academia. It's not a toy database like LevelDB was. I can't speak to the problems they're supposedly hiding in their stack, but they are unlikely to come from RocksDB.

KAdot|6 months ago

This is not my experience. I've been running RocksDB for 4 years on thousands of machines, each storing terabytes of data, and I haven't seen a single correctness issue caused by RocksDB.

sophia01|6 months ago

They're not open sourcing it though?

j_kao|6 months ago

It's a bit difficult at the moment, given we have a lot of proprietary data and a lot of the logic follows it. I'm hoping we can get it to a state where it can index and serve OSM data, but that is going to take some time.

That being said, we are currently working on getting our Google S2 Rust bindings open-sourced. This is a geo-hashing library that makes it very easy to write a reverse geocoder, even from a point-in-polygon or polygon-intersection perspective.
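A rough sketch of how a cell-based reverse geocoder can work, for readers unfamiliar with the idea. This does not use S2: a plain fixed-size lat/lng grid is swapped in for S2's hierarchical cells, and `CELL_DEG`, `cell_of`, and `ReverseGeocoder` are hypothetical names. It also ignores the antimeridian and the poles.

```python
import math
from collections import defaultdict

CELL_DEG = 1.0  # assumed grid size in degrees; S2 uses hierarchical cells


def cell_of(lat, lng):
    """Map a coordinate to the grid cell that contains it."""
    return (math.floor(lat / CELL_DEG), math.floor(lng / CELL_DEG))


def point_in_polygon(lat, lng, ring):
    """Ray casting: toggle `inside` for each edge a ray from the point crosses."""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        lat_i, lng_i = ring[i]
        lat_j, lng_j = ring[j]
        if (lat_i > lat) != (lat_j > lat):
            crossing = (lng_j - lng_i) * (lat - lat_i) / (lat_j - lat_i) + lng_i
            if lng < crossing:
                inside = not inside
        j = i
    return inside


class ReverseGeocoder:
    def __init__(self):
        self.cells = defaultdict(list)  # cell -> [(name, ring), ...]

    def add(self, name, ring):
        """Register a polygon under every cell its bounding box touches."""
        lats = [p[0] for p in ring]
        lngs = [p[1] for p in ring]
        low = cell_of(min(lats), min(lngs))
        high = cell_of(max(lats), max(lngs))
        for row in range(low[0], high[0] + 1):
            for col in range(low[1], high[1] + 1):
                self.cells[(row, col)].append((name, ring))

    def lookup(self, lat, lng):
        """Check only the polygons registered in the point's own cell."""
        for name, ring in self.cells.get(cell_of(lat, lng), []):
            if point_in_polygon(lat, lng, ring):
                return name
        return None
```

The cell lookup narrows the candidates to a handful of polygons, so the exact point-in-polygon test only runs on those.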

pbowyer|6 months ago

Doesn't sound like it, but it's a nice writeup of the tools they stitched together. For someone to copy and open source... hopefully :)

9cb14c1ec0|6 months ago

Clicked because of Elasticsearch, then wondered why I hadn't known of radar.com before. Just the autocomplete at a reasonable price that I need.

kosolam|6 months ago

Side note 1: ES can also be embedded in your app (on the JVM). Note 2: I have used RocksDB to solve many use cases, and it’s quite powerful and very performant. If you take anything from this post, take this: it’s open source and a very solid building block. Note 3: I would like to test drive Quickwit as an ES replacement. Haven’t had the time yet.

j_kao|6 months ago

1 - If we were sticking with the JVM, I do wonder whether Lucene would be the right choice

2 - It's a great tool with a lot of tuneability and support!

3 - We've been using it for K8s logs and OTEL (with Jaeger). Seems good so far, though I do wonder how the future of this will play out with the $DDOG acquisition.

vips7L|6 months ago

I really enjoy embedding things in the vm. I run a discord bot with a few thousand users with embedded H2. Recently I’ve been looking at trying to embed keycloak (or something similar) for some other apps.

mexxixan|6 months ago

Would love to know how they scaled it. Also, what happens when you lose the machine and the local db? I imagine there are backups, but they should have mentioned it. Even with backups, how do you ensure zero data loss?

jothirams|6 months ago

Is HorizonDB publicly available for us to try as well?

darqis|6 months ago

Searching for HorizonDB I find a Python project on github.

I'm guessing it's closed source, *aaS only?

reactordev|6 months ago

I mean, anything could replace elasticsearch, but can it actually?

It sounds like they had the wrong architecture to start with and they built a database to handle it. Kudos. Most would have just thrown cache at it or fine tuned a readonly postgis database for the geoip lookups.

Without benchmarks, it’s just bold claims that we’ll have to verify ourselves.

dboreham|6 months ago

These are not the same kinds of things.

nekitamo|6 months ago

I've used RocksDB a lot in the past and am very satisfied with it. It was helpful building a large write-heavy index where most of the data had to be compressed on disk.

I'm wondering if anyone here has experience with LMDB and can comment on how they compare?

https://www.symas.com/mdb

I'm looking at it next for a project which has to cache and serve relatively small static data, and write and look up millions of individual points per minute.

hyc_symas|6 months ago

LMDB is for read-heavy workloads. The opposite of RocksDB.

RocksDB can use thousands of file descriptors at once, on larger DBs. Makes it unsuitable for servers that may also need to manage thousands of client connections at once.

LMDB uses 2 file descriptors at most; just 1 if you don't use its lock management, or if you're serving static data from a readonly filesystem.

RocksDB requires extensive configuration to tune properly. LMDB doesn't require any tuning.
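The file descriptor point can be put as simple arithmetic. A minimal sketch, assuming a Unix host: the helper names and the `reserved` headroom of 64 are hypothetical, and the file counts in the usage example are made up for illustration. (RocksDB has a `max_open_files` option that caps how many files it keeps open, which is one of the tuning knobs in question.)

```python
import resource


def current_soft_limit():
    """Soft RLIMIT_NOFILE for this process (Unix only)."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def descriptors_left(soft_limit, open_table_files, reserved=64):
    """Descriptors remaining for client sockets after the storage engine's share.

    Returns None when the limit is unlimited.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(0, soft_limit - open_table_files - reserved)


if __name__ == "__main__":
    limit = current_soft_limit()
    # Hypothetical: an engine holding 900 table files open vs. one holding 2.
    print("limit:", limit)
    print("left with 900 files open:", descriptors_left(limit, 900))
    print("left with 2 files open:", descriptors_left(limit, 2))
```

With a common default soft limit of 1024, an engine that keeps hundreds of files open leaves very little room for client connections unless the limit is raised.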

lisbbb|6 months ago

I can see ditching Mongo, but what's bad about ElasticSearch? Too expensive in some way?

Isn't RocksDB just the db engine for Kafka?

tinyhouse|6 months ago

fastText? last time I checked it wasn't even maintained.

feverzsj|6 months ago

Sounds like all they need is Postgres or just Sqlite.

benjiro|6 months ago

Yep, their goal was to use a monolithic solution instead of a clustered Elasticsearch.

Postgres + pg_search (= tantivy) would have gotten them 80% of the way there. Sure, Postgres really needs a pluggable storage engine for better SSD support (see OrioleDB).

But creating your own database for your own company is just silly.

There is a lot of money in the database market, and everybody wants to do their own thing to tie customers down to those databases. And that is their main goal.

100 comments