What's a common approach for keeping the index up to date? A live ETL pipeline from the database to the search engine doesn't sound simple. Another method, once the existing data has been loaded, is to send every write to both the database and the search engine whenever a user performs a CRUD operation. But that's a lot of work too if you don't already have an HTTP API and are mostly serving server-side-rendered HTML.
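The dual-write idea above can be sketched in a few lines. This is only an illustration, not any real library's API: `db` and `searchClient` are hypothetical stand-ins for your database and search-engine clients.

```javascript
// Dual-write: persist to the database first (it stays the source of
// truth), then mirror the change into the search index.
// Both client objects here are hypothetical stand-ins.
async function createBook(db, searchClient, book) {
  const saved = await db.insert('books', book);
  try {
    await searchClient.index('books', saved);
  } catch (err) {
    // If indexing fails, the index can be reconciled later from the DB,
    // e.g. by a periodic batch job that re-syncs recently changed rows.
    console.error('index update failed, will re-sync later:', err);
  }
  return saved;
}
```

Note the failure mode this hides: without a reconciliation job (or a change-data-capture pipeline), a failed index write silently leaves the index stale, which is exactly why dual writes are more work than they first appear.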
Apart from being written in Rust, MeiliSearch (https://github.com/meilisearch/meilisearch) differs mainly in its use of a bucket sort to rank the documents retrieved from the index.
Both MeiliSearch and Typesense use an inverted index with a Levenshtein automaton to handle typos, but they differ when it comes to sorting documents:
- Typesense uses a default_sorting_field on each document, which means that before indexing your documents you need to compute a relevancy score for Typesense to be able to sort them based on your needs (https://typesense.org/docs/0.11.1/guide/#ranking-relevance).
- MeiliSearch, on the other hand, uses a bucket sort, which means there is a default relevancy algorithm based on the proximity of words in the documents, the fields in which the words are found, and the number of typos (https://docs.meilisearch.com/guides/advanced_guides/ranking....). You can still add your own custom rules if you want to alter the default search behavior.
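To make the Typesense side of that comparison concrete, here is roughly what a collection schema with a precomputed ranking field looks like. The field names are illustrative; check the linked Typesense docs for the exact schema format of the version you run.

```javascript
// A Typesense-style collection schema: default_sorting_field must be a
// numeric field present on every document, and its value (a relevancy
// or popularity score) has to be computed before indexing.
const booksSchema = {
  name: 'books',
  fields: [
    { name: 'title', type: 'string' },
    { name: 'publication_year', type: 'int32' },
    { name: 'popularity', type: 'int32' }  // precomputed ranking score
  ],
  default_sorting_field: 'popularity'
};
```

This is the practical difference the parent comment describes: Typesense asks you to supply the ranking signal up front, whereas MeiliSearch applies its built-in rules (word proximity, field, typo count) by default.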
I have played around with Lucene a lot, and Typesense seems to be a very close match to its feature set, apart from the REST interface on top.
Was the decision not to use the mature Lucene platform a technical one? The memory and hardware requirements of Lucene are quite small, even if Elastic and Solr leave a very different impression.
Glad to see a solution positioning itself as a bit leaner than Solr/Elastic, though; they really are a bit heavy for many occasions.
Yes, for typo correction plus instant search, Lucene is definitely not fast enough on large datasets. There are also some limitations with fuzzy searching when you also want to sort/rank documents at the same time. Lucene is also a very generic, mature library aimed at a wider set of use cases.
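For readers unfamiliar with the typo-tolerance mechanism this thread keeps mentioning: it boils down to Levenshtein edit distance, the number of single-character insertions, deletions, and substitutions between two strings. A naive dynamic-programming sketch follows; real engines compile the query term into a Levenshtein automaton and walk the index with it instead of comparing terms pairwise, which is where the speed differences come from.

```javascript
// Levenshtein edit distance via dynamic programming: dp[i][j] holds the
// distance between the first i chars of a and the first j chars of b.
function levenshtein(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0)
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,        // deletion
        dp[i][j - 1] + 1,        // insertion
        dp[i - 1][j - 1] + cost  // substitution
      );
    }
  }
  return dp[a.length][b.length];
}
```

For example, "hary" is within distance 1 of "harry", so a one-typo-tolerant search would still match it.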
I'm new to search libraries (frameworks?) but have been looking for something to use for a huge data dump I'm working with.
Storing everything in memory seems fast, but it also seems like it'd be quite the resource hog on a server -- is that a normal approach to take?
It's reassuring that the examples and documentation all revolve around books (my data set is actually ~55 million books), but since theirs seems to be quite a small subset of that, I worry about how well this scales, and I don't know enough about search libraries to evaluate that.
Is there a good place to start learning about what kinds of situations Typesense works best in (besides needing Levenshtein-based search), versus what kinds of situations it wouldn't work well in (and what other libraries would work better there)?
Typesense's primary focus is speed and developer convenience. It makes the assumption (which is true perhaps 99% of the time) that memory is cheap enough for indexing most datasets, especially given the development time saved and the benefits of a solid search user experience.
Other libraries like Elastic offer more customization but also have a steeper learning curve.
Talking about fastest time to market, this is the biggest factor, more than setting up Elastic, which, annoying as it is, is still faster than creating the UI.
There is a bug in the demo search box on your home page: if no search results are found (whether due to an empty string or no results for the search term), it displays "undefined result. Page 1 of NaN".
Looks great. One of Algolia's strongest features is InstantSearch for vanilla JS, React, Vue, Angular, iOS, and Android. Hopefully there can be this level of support for Typesense.
LrnByTeach | 6 years ago:
> when 1 million Hacker News titles are indexed along with their points, Typesense consumes 165 MB of memory. The same size of that data on disk in JSON format is 88 MB.

I like the compact filter_by, sort_by with qualifiers:

    let searchParameters = {
      'q'         : 'harry',
      'query_by'  : 'title',
      'filter_by' : 'publication_year:<1998',
      'sort_by'   : 'publication_year:desc'
    }
ng7j5d9 | 6 years ago:
Does it do normalization as part of the typo search (in case of missing/incorrect accent marks, etc.)?
Does it do stemming at all, for English or other languages? (I.e., I search for "run" and you show me documents containing "running", or the other way around.)
Any support for Chinese text (which typically doesn't have whitespace between words)?
karterk | 6 years ago:
While it does not support stemming, fuzzy prefix matching largely covers that in practice and is often more useful.
No typo or fuzzy correction for Chinese text yet.
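A toy illustration of why prefix matching approximates stemming for queries like the one above: the query term is matched as a prefix of indexed terms, so "run" finds "running" without any stemmer. Note it is one-directional (a query for "running" will not find "run"), which is the trade-off versus true stemming.

```javascript
// Prefix matching: a query term hits any indexed term that starts
// with it. Real engines do this against a sorted term dictionary or
// trie rather than a linear scan.
function prefixMatch(query, terms) {
  return terms.filter(t => t.startsWith(query));
}
```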
veeralpatel979 | 6 years ago:
I think mentioning any of them would be okay.