Nice introduction. A good ramp-up from basic splitting to ranking.
It does need to be said that when Lucene had a feature set this small, it was also pretty tiny. And if those are your needs, you could still download it, and it will probably run on a modern JVM.
It is a bit bigger of course, but that's because it already had stemmers, multilingual support, multiple query parsers (including phrase queries), RAM and disk directory implementations, and a bunch of other advanced features you can easily see by unzipping the jar file.
It amazes me how much of this is about creating an ecosystem to scale and handle complexity, and how software grows to answer that instead of being split into two different pieces. Many people/companies/websites just need those basic features, because search is not a core capability for them.
I went down this rabbit hole some ten years ago, and somewhere along the line I discovered SQLite has an FTS engine (up to version 5 now, I think) and just started using that instead. Massive performance for tens of gigs’ worth of text content on a single core.
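For anyone curious, here is a minimal sketch of that approach using Python's built-in sqlite3 module and an FTS5 virtual table (the table name and sample documents are made up; this assumes your SQLite build has FTS5 compiled in, which most modern ones do):

```python
import sqlite3

# In-memory database with an FTS5 full-text index over a single column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [
        ("the quick brown fox jumps",),
        ("a lazy dog sleeps",),
        ("quick thinking wins races",),
    ],
)

# MATCH runs the full-text query; ORDER BY rank sorts by bm25 (best first)
rows = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rank", ("quick",)
).fetchall()
for (body,) in rows:
    print(body)
```

That really is the whole integration surface: one CREATE VIRTUAL TABLE, ordinary INSERTs, and MATCH queries.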
Neat! One production-ready Python library I love to use in this space is https://radimrehurek.com/gensim/. It is quite mature and can handle a good amount of data well!
It offers topic modeling too, and can be helpful for finding similar documents.
As Joel Spolsky said: "Just do me a favor and search the damned hard drive, quickly, for the string I typed, using full-text indexes and other technologies that were boring in 1973."
Author here: you may want to check out something like Whoosh https://whoosh.readthedocs.io/en/latest/intro.html (it's like a clone of Lucene but in pure Python). I've used this to build some basic search for a small Python website and it was more than fast enough for my purposes :-)
It's short and to the point. And then I implemented all that ... in PHP and MySQL :)
It feels daunting at first, but once you understand what it wants you to do, it's actually not that hard (for this particular paper, and this particular approach).
However, you do want to employ a stemming library to normalize word forms.
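To illustrate what that normalization buys you, here is a deliberately crude toy suffix stripper (a real project should use an actual stemming library such as NLTK's Porter stemmer or Snowball; this is only a sketch of the idea):

```python
def crude_stem(word: str) -> str:
    """Toy normalizer: strip a few common suffixes, then collapse a
    doubled final consonant (e.g. "running" -> "runn" -> "run").
    Not a real stemmer; use a proper library in practice."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(crude_stem("running"), crude_stem("searched"), crude_stem("indexes"))
```

The point is that "running", "runs", and "run" should all land on the same index key, so a query for any one form finds documents containing the others.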
It's not relevant, or an indicator of power, that this example requires ten lines of SQL. This presentation just uses SQLite's built-in full-text search system [1]. Of course it's going to require less code to call a library than to implement the functionality yourself.
Does it? Both are implementations and explanations of a well-known algorithm. Most articles on quicksort will also look alike, but there's no reason to assume the author has plagiarized anything.
So? Wikipedia is one of the most convenient, large English corpora available, and I doubt there are many significantly different ways to write the bit of functionality that's built up here. I'm not sure if that's what you're meaning to suggest, or that there was some kind of plagiarism / inspiration going on here.
I saw a case once, I forget where, where someone was accused of plagiarising from Wikipedia, and the truth was that they had themselves written a large part of the Wikipedia article.
I recently built a program in Go that takes a Wikipedia article, gets all of its dependencies, and then ranks the concepts in order of "importance" using tfidf*count. It seems quite good at producing, for math articles, a list of more basic concepts to understand first.
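That scoring idea can be sketched with made-up data (the article names and link corpus below are hypothetical; the score for each linked concept is its link count times its idf across the corpus):

```python
import math
from collections import Counter

# Hypothetical corpus: for each article, the concepts it links to
# (repeats mean the link appears multiple times).
corpus = {
    "calculus": ["function", "limit", "real number", "function"],
    "topology": ["set", "function", "open set"],
    "algebra": ["set", "ring", "function"],
}

def rank_concepts(article: str, corpus: dict) -> list:
    """Rank an article's linked concepts by count * idf, highest first."""
    counts = Counter(corpus[article])
    n_docs = len(corpus)

    def idf(concept: str) -> float:
        df = sum(1 for links in corpus.values() if concept in links)
        return math.log(n_docs / df)

    return sorted(counts, key=lambda c: counts[c] * idf(c), reverse=True)

print(rank_concepts("calculus", corpus))
```

Note the quirk: a concept linked from every article gets an idf of zero, so it drops to the bottom no matter how often it appears, which is exactly the "too common to be informative" behaviour you want.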
Using NLP techniques to create tokens is often not needed for a fast and simple Python search. It makes things slow, and Python's standard string functions are often good enough. Issues arise when searching for combinations of words like 'machine learning' in a sentence. Nice read, but the GitHub repo needs a license to be more useful.
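The phrase problem can be handled without heavy NLP. A minimal sketch (hypothetical documents; real engines use positional indexes instead): run the normal bag-of-words check first, then a stricter substring check for adjacency:

```python
docs = {
    1: "machine learning is fun",
    2: "the machine is learning slowly",
    3: "deep learning on a big machine",
}

def token_match(query: str, text: str) -> bool:
    # Bag-of-words check: every query token appears somewhere in the text
    return set(query.lower().split()) <= set(text.lower().split())

def phrase_match(query: str, text: str) -> bool:
    # Stricter check: the tokens must also be adjacent, in order
    return query.lower() in text.lower()

query = "machine learning"
bag = {i for i, t in docs.items() if token_match(query, t)}
phrase = {i for i, t in docs.items() if phrase_match(query, t)}
print(bag, phrase)  # all three match the bag check, only doc 1 the phrase
```

The substring check is itself a crude approximation (it can match across token boundaries); a positional index is the proper fix, but this is often good enough for small sites.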
Failed for me. I use Chinese and Japanese, where there are no spaces to split on. I also search code, where this fails too. I know it was meant to be simple and to illustrate some points; I think the point it misses is that this is actually a much harder problem in real life than a simple 150-line solution suggests.
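For what it's worth, the usual cheap workaround for space-free scripts is character n-gram tokenization instead of whitespace splitting; a sketch:

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Split text into overlapping character n-grams (bigrams by default),
    a common trick for indexing Chinese/Japanese without word segmentation."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("搜索引擎"))  # ['搜索', '索引', '引擎']
```

A proper segmenter (e.g. MeCab for Japanese) does better, but bigram indexing is a serviceable baseline and is what several CJK tokenizers in mainstream engines do.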
@op great article. I ran your code first without fully reading the article, and it took almost an hour to parse and index the data in the terminal.
Comprehensions in Python are pretty handy, but they can look weird sometimes. This is filtering out `None`, empty string, etc values from a list.
It is iterating over a list (tokens) and binding each element to a temporary variable (token). It tests the truthiness of it (`if token`), which means `None` and `''` evaluate to False and are excluded, and then yields the token itself (the leading expression in the comprehension).
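In other words, the comprehension is equivalent to this explicit loop:

```python
tokens = ["hello", "", None, "world"]

# The comprehension form
clean = [token for token in tokens if token]

# The equivalent explicit loop
clean2 = []
for token in tokens:
    if token:  # False for None, '', 0, empty containers...
        clean2.append(token)

print(clean)  # ['hello', 'world']
```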
I previously worked for a very high-traffic e-commerce company (an Alexa top-300 site).
As part of the search team, I worked on a project where we deliberately rewrote the whole product search engine in Python and Cython, including our own algorithms manipulating documents for deletion, low latency reindexing after edits, and more.
We did this because Solr was too slow, and the process of defining custom sort orders (for example, new sort orders defined by machine-learning ranking algorithms that needed to be A/B tested regularly) was awful.
It was a really fun project. One of the slogans for our group at the time was “rewriting search in Python for speed.”
The system we ultimately deployed was insanely fast. I doubt you could have made it faster even by writing the entire thing directly in C or C++. It became a reference project in the company, helping to counter various flavors of arrogant dismissal of Python as a backend service language for performance-critical systems.
Defining custom sort orders in Solr is as simple as uploading a text file with the values you intend to use for ranking.
This is a great feature that is in fact missing from Elasticsearch and saves you so much reindexing time.
There certainly are use cases where Lucene-based solutions aren't the best fit. But I think the claim that you couldn't make something faster by moving away from Python is outlandish.
The idea was to illustrate the concepts of an inverted index and tf-idf :-) I've built some stuff like this for a smaller project written in Python because it was Easier™ than spinning up an Elasticsearch cluster (ie it definitely didn't need to scale :-D)
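For readers who skipped the article, the inverted-index idea fits in a few lines (toy data below; the article layers tf-idf ranking on top of this):

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog barks",
}

# Inverted index: token -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query: str) -> set:
    """AND-search: ids of documents containing every query token."""
    sets = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(search("quick dog"))  # {3}
```

Lookups touch only the postings sets for the query tokens, never every document, which is why this scales so much better than scanning.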
arafalov | 5 years ago (the old Lucene jar mentioned above):
https://archive.apache.org/dist/lucene/java/
lucene-1.4.3.jar 2004-11-29 14:13 316K
js2 | 5 years ago:
https://redislabs.com/ebook/part-2-core-concepts/chapter-7-s...
[1]: https://sqlite.org/fts5.html
throwaway320912 | 5 years ago:
http://web.archive.org/web/20091230103939/http://myyn.org/m
As a gimmick, I created 36 groups, with group 1 containing the most important concepts:
http://web.archive.org/web/20100109055506/http://myyn.org/m/...
Spoiler: the top ten concepts were:
Function · Set · Number · Integer · Real Number · Point · Property · Finite · Ring · Relation Theory
BTW: Glad the archive has a copy, because I do not.
Bridgeburner4 | 5 years ago:
A suggestion, if you want to make the code more friendly. Either:
1. Write a note in README.md that once you exit the program, you lose the indexed data, or
2. Save the indices to disk. :)
Thanks for the article!
creamytaco | 5 years ago:
I imagine Python performance would be terrible.