top | item 12622809

k_kelly | 9 years ago

I've done two large search projects in the last year. One with Postgres search and one with elastic search. The elastic search version was trivial to configure, and was embarrassingly superior to the point where I wish I had never considered Postgres.

r888888888|9 years ago

It's not that simple.

If you're using a secondary data store like Elasticsearch, presumably you're combining it with a traditional relational database like Postgres or MySQL. And in most use cases you're going to require additional constraints on top of full text, like restricting to a specific customer or topic. This means data duplication. Any experienced developer should be breaking out in hives at this point, because it means:

(1) keeping two data stores in sync;

(2) adding another service to the deployment process;

(3) adding another service to development environments;

(4) adding another service to debug when problems crop up;

(5) large-scale schema changes and migrations are complicated by having two data stores;

(6) operational overhead from maintaining Elasticsearch, like separate backup scripts, deployment scripts, and sharding config (across production, QA, and dev);

(7) queries are potentially complicated because you may have to aggregate information from two data stores (pagination and delayed updates make this even more hellish);

(8) if your company is anal about devs accessing production data, ES just introduces another layer of requesting permissions, and every time the schema changes you have to ask again.

And maybe the advantages of Elasticsearch far outweigh these negatives, but this is a conversation you're going to have with your devs because the cost of using ES is not zero. There are tradeoffs, and maybe just using Postgres full text search is adequate for your use case given sufficient additional constraints.
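As a concrete illustration of "additional constraints on top of full text": on the Elasticsearch side that typically means a bool query mixing a full-text clause with a filter. A minimal sketch as a plain Python dict (the field names and values are hypothetical, not from any real schema):

```python
# Full-text relevance plus a relational-style constraint: match on the body
# text, but only within one customer's documents. Field names (body,
# customer_id) are illustrative.
query = {
    "query": {
        "bool": {
            "must": {"match": {"body": "shipping delay"}},
            "filter": {"term": {"customer_id": 42}},
        }
    }
}
```

The customer_id that drives the filter normally also lives in the relational store, which is exactly where the duplication (and the sync problem) comes from.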

nostrademons|9 years ago

Data duplication virtually always happens as the organization gets larger. Even if you stick to RDBMS-only, you're going to get it once you set up master/slave replication, batch jobs, data science, tape backups, etc.

It's not a serious problem as long as you designate one data source as primary and every other data source as derived. That way, if you ever get data corruption issues, you can re-create the derived data and blow away the old copy. You also don't face complex data-sync issues, because all data-transfer is unidirectional. You write into Postgres or your other source-of-truth, and then you write from Postgres to ElasticSearch. If there's a difference between them, the Postgres version is authoritative, so copy its data over ElasticSearch.
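The primary/derived pattern above can be sketched in a few lines: stream rows out of the source of truth and replay them through Elasticsearch's _bulk endpoint. Rebuilding the derived store is just rerunning the same code over every row. The function name and row shape here are hypothetical:

```python
import json

def rows_to_bulk(rows, index):
    """Turn source-of-truth rows into an Elasticsearch _bulk payload.

    Recovery from corruption is just: stream every row out of Postgres
    and replay it through this function. All data flow is one-way.
    """
    lines = []
    for row in rows:
        # Action line, then the document itself, per the _bulk format.
        lines.append(json.dumps({"index": {"_index": index, "_id": row["id"]}}))
        lines.append(json.dumps({k: v for k, v in row.items() if k != "id"}))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline
```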

Most of your other points are common to adding any new major feature, and you should always weigh the user benefits against the maintenance costs of a large new dependency. Presumably, if you're looking at ElasticSearch, it's because full-text search in your RDBMS is not good enough (which won't always be the case, but has been my experience and apparently the experience of several other people here).

ngrilly|9 years ago

For those thinking that using Elasticsearch is "trivial", please read the parent comment. And read it again. This is pure wisdom.

patsplat|9 years ago

The devops issues aren't a valid counterpoint. Tuning an RDBMS is an art in its own right.

It's much easier to scale an ephemeral search service than a transactional database. Even a single search node will likely support higher query throughput.

vkjv|9 years ago

[deleted]

irrational|9 years ago

If it's so trivial, can you put the instructions here so the rest of us can learn from your experience?

SanFranManDan|9 years ago

For me it was

sudo apt-get install elasticsearch

pip install django-haystack

add haystack to INSTALLED_APPS, add the Elasticsearch backend, and set the endpoint.

use a haystack index class that is prebuilt to hook into all model change signals.

done. No weird configurations, everything was vanilla.
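For reference, the steps described above roughly amount to this settings fragment (django-haystack 2.x names; the endpoint URL is the stock local default):

```python
# Django settings sketch for django-haystack with the Elasticsearch backend.
INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    "haystack",  # add haystack to installed apps
]

HAYSTACK_CONNECTIONS = {
    "default": {
        "ENGINE": "haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine",
        "URL": "http://127.0.0.1:9200/",  # the Elasticsearch endpoint
        "INDEX_NAME": "haystack",
    },
}

# Reindex on every model save/delete via signals, as described above.
HAYSTACK_SIGNAL_PROCESSOR = "haystack.signals.RealtimeSignalProcessor"
```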

Maybe I am missing a step, but it took me less than an hour to get everything going, and it hasn't needed any maintenance since.

When you add ngram support and all the indexes and views in Postgres to replicate the behavior, ES looks less complicated. At the very least I don't see a reduction in complexity from doing it the Postgres way, just one less dependency to worry about.

ngrilly|9 years ago

irrational is right: it sounds trivial only when you don't consider the devilish details.

threeseed|9 years ago

I have also used Elasticsearch on a number of projects and it is extremely simple to install, configure, use, and manage. Most of the functionality is trivially available over simple REST HTTP calls.

Unless you need only basic search and you desperately want to keep everything in the same database, I would split it out. Architecturally it makes more sense, if only to allow for different scaling needs.

nugator|9 years ago

In my experience, Elasticsearch is far from trivial to configure correctly, at least compared to PostgreSQL.

twic|9 years ago

Superior in what ways?

rpedela|9 years ago

With Elasticsearch and Solr, you can easily customize analysis and scoring. There are several scoring algorithms built into them [1], such as BM25 (now the default), which is considered the state of the art for keyword relevance. For analysis, you can remove stopwords, stem, apply synonyms, etc. [2]. Elasticsearch is specifically designed to scale across multiple machines, which is necessary for terabyte-scale datasets. There are also things like "more like this" queries and context-aware spell checking. Some of that you can do with PG, but not all of it, and what PG can do is usually harder to set up.

1. https://www.elastic.co/guide/en/elasticsearch/reference/curr...

2. https://www.elastic.co/guide/en/elasticsearch/guide/current/...
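To make the BM25 mention concrete, here is a toy, from-scratch sketch of the scoring formula. It is not Elasticsearch's implementation; real engines layer analyzers, field weighting, and index statistics on top of this:

```python
import math

def bm25_score(term, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of a single term for one document (toy version).

    `doc` and the entries of `corpus` are lists of pre-tokenized,
    lowercased words.
    """
    n = len(corpus)
    df = sum(1 for d in corpus if term in d)          # docs containing term
    idf = math.log(1 + (n - df + 0.5) / (df + 0.5))   # rarer terms weigh more
    tf = doc.count(term)
    avgdl = sum(len(d) for d in corpus) / n
    norm = k1 * (1 - b + b * len(doc) / avgdl)        # length normalization
    return idf * tf * (k1 + 1) / (tf + norm)          # tf saturates as it grows

corpus = [
    ["cheap", "flights", "to", "london"],
    ["london", "weather", "today"],
    ["cheap", "hotels"],
]
```

Note the two properties that make BM25 better than naive tf-idf for keyword relevance: term frequency saturates instead of growing linearly, and long documents are penalized relative to the corpus average.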

eva1984|9 years ago

A rich library of tokenizers and analyzers, and a battle-tested analysis model and pipeline. For full-text search, scoring modes beyond the trivial tf-idf model mentioned in the article are supported. How are you going to do field-centric ranking in Postgres?

As far as I am concerned, it is far superior to the goodies mentioned in this article.
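Concretely, the two things referred to above might look like this as Elasticsearch request bodies (written as plain Python dicts; the analyzer, index, and field names are illustrative):

```python
# 1. A custom analysis chain: lowercase, strip English stopwords, stem.
#    "stop" and "porter_stem" are built-in Elasticsearch token filters.
settings = {
    "analysis": {
        "analyzer": {
            "english_light": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "stop", "porter_stem"],
            }
        }
    }
}

# 2. Field-centric ranking: a hit in the title counts three times as much
#    as a hit in the body, via multi_match field boosts.
search = {
    "query": {
        "multi_match": {
            "query": "postgres search",
            "fields": ["title^3", "body"],
        }
    }
}
```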

craigkerstiens|9 years ago

For many, the source of truth is still Postgres, but you may want some of the flexibility and search power that comes with Elasticsearch. An interesting project to look at is ZomboDB, which lets you maintain Elasticsearch indexes from within Postgres and query them directly in PG. (https://github.com/zombodb/zombodb)

combatentropy|9 years ago

Postgres was so much easier for me. In fact I cannot see the reason Elasticsearch is so popular. For our website search I first researched Elasticsearch. Each of the short pages of the so-called Getting Started guide (https://www.elastic.co/guide/en/elasticsearch/guide/current/...) felt like walking into either (a) a brick wall or (b) an empty room:

p. 1: sales pitch

p. 2: sales pitch

p. 3: "The only requirement for installing Elasticsearch is a recent version of Java." Ugh.

pp. 4-10: A new JSON API to learn. This is just the part about getting your stuff into Elasticsearch and, at last, doing a query that is no better than "select * from table where last_name like '%Smith%';"

p. 11: "More complicated searches," an apt title. Here is the Elasticsearch query that's equivalent to "select * from table where last_name like '%smith%' and age > 30":

  GET /megacorp/employee/_search
  {
      "query" : {
          "filtered" : {
              "filter" : {
                  "range" : {
                      "age" : { "gt" : 30 } 
                  }
              },
              "query" : {
                  "match" : {
                      "last_name" : "smith" 
                  }
              }
          }
      }
  }
It only gets more complicated from there. There is talk about ranking, but it was still a black box even by the end of the Getting Started guide, and I was going to want to tweak it. It just seemed more obvious how to do so with Postgres (https://www.postgresql.org/docs/9.6/static/textsearch-contro...).

How is Postgres merely "good enough"? How are other products better? What is their draw?

Is it ease of set-up? For me Elasticsearch was harder to install than Postgres and made me learn a bunch of new syntax, beyond the SQL I have to know anyway to write web applications in the first place. Postgres handles stemming, stopwords, ranking, snippets, and highlighting.

Is it better ranking? Postgres can weigh the search words' frequency, their proximity to each other, the length of the document, the number of unique words in the document, and the "mean harmonic distance between extents." I was able to further tweak the results by using other columns and joining other tables.

Is it faster search results? Postgres searches tens of thousands of pages in a split second.
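For reference, the setup described above boils down to something like this on the Postgres side (9.6-era syntax; the pages table, its columns, and the search term are illustrative, and search_vec is assumed to be a GIN-indexed tsvector column):

```sql
-- Stemming and stopword removal happen inside plainto_tsquery;
-- ranking and highlighting are ts_rank and ts_headline.
SELECT title,
       ts_rank(search_vec, q)          AS rank,     -- ranking
       ts_headline('english', body, q) AS snippet   -- snippets/highlighting
FROM   pages,
       plainto_tsquery('english', 'smith') AS q     -- stemming + stopwords
WHERE  search_vec @@ q
ORDER  BY rank DESC
LIMIT  20;
```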

mozumder|9 years ago

It seems like setting up and syncing an entire separate database for search wouldn't exactly be "trivial"...

eric_khun|9 years ago

Last week I migrated 500M documents (100 GB) from MongoDB to ES. It's quite simple and straightforward: use the bulk API, make sure you have a good machine (16 vCPUs), use parallelization, and just wait. After 24 hours, it just works.

threeseed|9 years ago

Elasticsearch can be installed in seconds with most package managers, and syncing is trivial depending on what you want to index. You can use a JDBC river for bulk indexing or rely on triggers.
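A sketch of the trigger approach, assuming a hypothetical docs table: queue changed row ids in Postgres and let a separate indexer process drain the queue into Elasticsearch.

```sql
-- Queue table and trigger; names are made up for illustration.
CREATE TABLE es_sync_queue (
    doc_id     bigint,
    changed_at timestamptz DEFAULT now()
);

CREATE OR REPLACE FUNCTION enqueue_for_es() RETURNS trigger AS $$
BEGIN
  INSERT INTO es_sync_queue (doc_id) VALUES (NEW.id);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER docs_es_sync
AFTER INSERT OR UPDATE ON docs
FOR EACH ROW EXECUTE PROCEDURE enqueue_for_es();
```

This keeps the data flow one-way: Postgres stays the source of truth, and the indexer can replay the queue (or the whole table) at any time.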