Nice introduction. A good ramp-up from basic splitting to ranking.
It does need to be said that when Lucene had a feature set this small, it was also pretty tiny. And if those are your needs, you could still download it, and it will probably run on a modern JVM.
It is a bit bigger of course, but that's because it already had stemmers, multilingual support, multiple query parsers (including phrase queries), RAM and disk directory implementations, and a bunch of other advanced features you can easily see by unzipping the jar file.
It amazes me how much of this is about creating an ecosystem to scale and handle complexity, and how software grows to answer that instead of being split into two different pieces. Many people/companies/websites just need those basic features, because search is not a core capability for them.
I went down this rabbit hole some ten years ago, and somewhere along the line I discovered SQLite has an FTS engine (up to version 5 now, I think) and just started using that instead. Massive performance for tens of gigs’ worth of text content on a single core.
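For anyone curious, here is a minimal sketch of that approach using Python's built-in sqlite3 module and an FTS5 virtual table (the table name and sample documents are made up; this assumes your SQLite build has FTS5 compiled in, which most modern ones do):

```python
import sqlite3

# In-memory database with an FTS5 full-text index over a single column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(body)")
conn.executemany(
    "INSERT INTO docs(body) VALUES (?)",
    [
        ("the quick brown fox jumps",),
        ("a lazy dog sleeps",),
        ("quick thinking wins races",),
    ],
)

# MATCH runs the full-text query; ORDER BY rank sorts by bm25 (best first)
rows = conn.execute(
    "SELECT body FROM docs WHERE docs MATCH ? ORDER BY rank", ("quick",)
).fetchall()
for (body,) in rows:
    print(body)
```

That really is the whole integration surface: one CREATE VIRTUAL TABLE, ordinary INSERTs, and MATCH queries.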
Neat! One production-ready Python library I love to use in this space is https://radimrehurek.com/gensim/. It is quite mature and can handle a good amount of data well!
It offers topic modeling too, and can be helpful for finding similar documents.
As Joel Spolsky said: "Just do me a favor and search the damned hard drive, quickly, for the string I typed, using full-text indexes and other technologies that were boring in 1973."
Author here: you may want to check out something like Whoosh https://whoosh.readthedocs.io/en/latest/intro.html (it's like a clone of Lucene but in pure Python). I've used this to build some basic search for a small Python website and it was more than fast enough for my purposes :-)
It's short and to the point. And then I implemented all that ... in PHP and MySQL :)
It feels daunting at first, but once you understand what it wants you to do, it's actually not that hard (for this particular paper, and this particular approach).
However, you do want to employ a stemming library to normalize word forms.
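To illustrate what that normalization buys you, here is a deliberately crude toy suffix stripper (a real project should use an actual stemming library such as NLTK's Porter stemmer or Snowball; this is only a sketch of the idea):

```python
def crude_stem(word: str) -> str:
    """Toy normalizer: strip a few common suffixes, then collapse a
    doubled final consonant (e.g. "running" -> "runn" -> "run").
    Not a real stemmer; use a proper library in practice."""
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word

print(crude_stem("running"), crude_stem("searched"), crude_stem("indexes"))
```

The point is that "running", "runs", and "run" should all land on the same index key, so a query for any one form finds documents containing the others.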
It's not relevant, or an indicator of power, that this example requires ten lines of SQL. This presentation just uses SQLite's built-in full-text search system [1]. Of course it's going to require less code to call a library than to implement the functionality yourself.
Does it? Both are implementations and explanations of a well-known algorithm. Most articles on quicksort will also look alike, but there's no reason to assume the author has plagiarized anything.
So? Wikipedia is one of the most convenient, large English corpora available, and I doubt there are many significantly different ways to write the bit of functionality that's built up here. I'm not sure if that's what you're meaning to suggest, or that there was some kind of plagiarism / inspiration going on here.
I saw a case once, I forget where, where someone was accused of plagiarising from Wikipedia, and the truth was that they had themselves written a large part of the Wikipedia article.
I recently built a program in Go that takes a Wikipedia article, gets all of its dependencies, and then ranks the concepts in order of "importance" using tfidf*count. It seems quite good at producing, for math articles, a list of more basic concepts to understand first.
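That scoring idea can be sketched with made-up data (the article names and link corpus below are hypothetical; the score for each linked concept is its link count times its idf across the corpus):

```python
import math
from collections import Counter

# Hypothetical corpus: for each article, the concepts it links to
# (repeats mean the link appears multiple times).
corpus = {
    "calculus": ["function", "limit", "real number", "function"],
    "topology": ["set", "function", "open set"],
    "algebra": ["set", "ring", "function"],
}

def rank_concepts(article: str, corpus: dict) -> list:
    """Rank an article's linked concepts by count * idf, highest first."""
    counts = Counter(corpus[article])
    n_docs = len(corpus)

    def idf(concept: str) -> float:
        df = sum(1 for links in corpus.values() if concept in links)
        return math.log(n_docs / df)

    return sorted(counts, key=lambda c: counts[c] * idf(c), reverse=True)

print(rank_concepts("calculus", corpus))
```

Note the quirk: a concept linked from every article gets an idf of zero, so it drops to the bottom no matter how often it appears, which is exactly the "too common to be informative" behaviour you want.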
Using NLP techniques to create tokens is often not needed for a fast and simple Python search. It makes things slow, and Python's standard string functions are often good enough. Issues arise when searching for combinations of words like 'machine learning' in a sentence. Nice read, but the GitHub repo needs a license to be more useful.
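The phrase problem can be handled without heavy NLP. A minimal sketch (hypothetical documents; real engines use positional indexes instead): run the normal bag-of-words check first, then a stricter substring check for adjacency:

```python
docs = {
    1: "machine learning is fun",
    2: "the machine is learning slowly",
    3: "deep learning on a big machine",
}

def token_match(query: str, text: str) -> bool:
    # Bag-of-words check: every query token appears somewhere in the text
    return set(query.lower().split()) <= set(text.lower().split())

def phrase_match(query: str, text: str) -> bool:
    # Stricter check: the tokens must also be adjacent, in order
    return query.lower() in text.lower()

query = "machine learning"
bag = {i for i, t in docs.items() if token_match(query, t)}
phrase = {i for i, t in docs.items() if phrase_match(query, t)}
print(bag, phrase)  # all three match the bag check, only doc 1 the phrase
```

The substring check is itself a crude approximation (it can match across token boundaries); a positional index is the proper fix, but this is often good enough for small sites.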
Failed for me. I use Chinese and Japanese, where there are no spaces to split on. I also search code, where this fails too. I know it was meant to be simple and to illustrate some points; I think the point it misses is that this is actually a much harder problem in real life than a simple 150-line solution suggests.
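For what it's worth, the usual cheap workaround for space-free scripts is character n-gram tokenization instead of whitespace splitting; a sketch:

```python
def char_ngrams(text: str, n: int = 2) -> list:
    """Split text into overlapping character n-grams (bigrams by default),
    a common trick for indexing Chinese/Japanese without word segmentation."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("搜索引擎"))  # ['搜索', '索引', '引擎']
```

A proper segmenter (e.g. MeCab for Japanese) does better, but bigram indexing is a serviceable baseline and is what several CJK tokenizers in mainstream engines do.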
@op great article. I ran your code first without fully reading the article, and it took almost an hour to parse and index the data in the terminal.
Comprehensions in Python are pretty handy, but they can look weird sometimes. This is filtering out `None`, empty string, etc values from a list.
It is iterating over a list (tokens) and binding each element to a temporary variable (token). It tests the truthiness of it (`if token`), which means `None` and `''` evaluate to False and are excluded, and then yields the token itself (the leading expression in the comprehension).
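In other words, the comprehension is equivalent to this explicit loop:

```python
tokens = ["hello", "", None, "world"]

# The comprehension form
clean = [token for token in tokens if token]

# The equivalent explicit loop
clean2 = []
for token in tokens:
    if token:  # False for None, '', 0, empty containers...
        clean2.append(token)

print(clean)  # ['hello', 'world']
```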
I previously worked for a very high-traffic e-commerce company (an Alexa top-300 site).
As part of the search team, I worked on a project where we deliberately rewrote the whole product search engine in Python and Cython, including our own algorithms manipulating documents for deletion, low latency reindexing after edits, and more.
We did this because Solr was too slow, and the process of defining custom sort orders (for example, new sort orders defined by machine-learning ranking algorithms that needed to be A/B tested regularly) was awful.
It was a really fun project. One of the slogans for our group at the time was “rewriting search in Python for speed.”
The system we ultimately deployed was insanely fast. I doubt you could have made it faster even by writing the entire thing directly in C or C++. It became a reference project in the company, helping to counter various flavors of arrogant dismissal of Python as a backend service language for performance-critical systems.
Defining custom sort orders in Solr is as simple as uploading a text file with the values you intend to use for ranking.
This is a great feature that is in fact missing from Elasticsearch and saves you so much reindexing time.
There certainly are use cases where Lucene-based solutions aren't the best fit. But I think the claim that you couldn't make something faster by moving away from Python is outlandish.
The idea was to illustrate the concepts of an inverted index and tf-idf :-) I've built some stuff like this for a smaller project written in Python because it was Easier™ than spinning up an Elasticsearch cluster (ie it definitely didn't need to scale :-D)
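For readers who skipped the article, the inverted-index idea fits in a few lines (toy data below; the article layers tf-idf ranking on top of this):

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "the quick dog barks",
}

# Inverted index: token -> set of document ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query: str) -> set:
    """AND-search: ids of documents containing every query token."""
    sets = [index.get(tok, set()) for tok in query.lower().split()]
    return set.intersection(*sets) if sets else set()

print(search("quick dog"))  # {3}
```

Lookups touch only the postings sets for the query tokens, never every document, which is why this scales so much better than scanning.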
arafalov | 5 years ago (the old Lucene jar mentioned above):
https://archive.apache.org/dist/lucene/java/
lucene-1.4.3.jar 2004-11-29 14:13 316K
js2 | 5 years ago:
https://redislabs.com/ebook/part-2-core-concepts/chapter-7-s...
[1]: https://sqlite.org/fts5.html
throwaway320912 | 5 years ago:
http://web.archive.org/web/20091230103939/http://myyn.org/m
As a gimmick, I created 36 groups, with group 1 containing the most important concepts:
http://web.archive.org/web/20100109055506/http://myyn.org/m/...
Spoiler: the top ten concepts were:
Function · Set · Number · Integer · Real Number · Point · Property · Finite · Ring · Relation Theory
BTW: Glad the archive has a copy, because I do not.
Bridgeburner4 | 5 years ago:
A suggestion, if you want to make the code more friendly. Either:
1. Write a note in README.md that once you exit the program, you lose the indexed data, or
2. Save the indices to disk. :)
Thanks for the article!
creamytaco | 5 years ago:
I imagine Python performance would be terrible.