top | item 32445780

Improving Postgres text search speed

131 points | kuzee | 3 years ago | charityapi.org

57 comments


zasdffaa|3 years ago

> A key gotcha that tripped us up: when querying with a list of primary keys, Postgres will not return the records in the same order as the IDs provided in a “where … in” clause.

That's frightening they don't know that. So, burn this into your minds: no ordering is guaranteed in SQL unless an ORDER BY is applied (and any ordering in a subquery is lost). Even if it seems to work, it will eventually fail. No guarantees on order unless... scratch that onto your eyeballs so you never forget it.
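And if you actually need the rows back in the order of your ID list, say so explicitly. One Postgres way is array_position; a sketch with a made-up table and IDs:

```sql
-- Without ORDER BY the rows come back in arbitrary order, even for a key list.
-- To preserve the order of the ID list, sort by each id's position in it:
SELECT *
FROM items
WHERE id IN (42, 7, 19)
ORDER BY array_position(ARRAY[42, 7, 19], id);
```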

Also, will people please stop posting rainbow-on-black screenshots, especially with the screenshotted text size smaller than the main text.

silvestrov|3 years ago

Sometimes I wish for a 'debug' version of PG which adds an "order by random()" to all queries (or appends a "random()" clause to an existing "order by" clause).

Or even better: have it as a connection/session parameter.
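Until then, you can fake it by hand; a sketch of what that debug mode would do, with a made-up users table:

```sql
-- No existing ORDER BY: shuffle everything, so any hidden order
-- assumption in the app breaks immediately instead of in prod.
SELECT * FROM users ORDER BY random();

-- Existing ORDER BY: append random() as a final tiebreaker, so rows
-- with equal sort keys stop coming back in an accidentally stable order.
SELECT * FROM users ORDER BY created_at, random();
```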

jonatron|3 years ago

I know this now, but back when I didn't, Postgres returned the same order thousands of times and then changed it for no apparent reason, which made fixing flaky tests very tricky.

irjustin|3 years ago

> That's frightening they don't know that

It's a common problem when moving from MySQL. We personally ran into it shortly after switching to Postgres, but it does seem like something they should've discovered a lot earlier.

kuzee|3 years ago

We shared the mistake so others don't accidentally make the same assumption; I'm sure a few people either learned it for the first time or appreciated the reminder. Rest easy this never made it to prod.

icedchai|3 years ago

This does seem fundamental.

jrochkind1|3 years ago

How can querying for PKs then doing a second query on the FK be faster than a join that's semantically equivalent? Postgres optimizer gone terribly wrong or something? Or am I misunderstanding what's going on?

It would never have occurred to me to even try this, I would have assumed postgres would do as well as could be done at what looks like a very standard join. So I guess that's a lesson not to make assumptions?

I'm not sure. Without even having a theory of why pg would behave this way, I'd be worried that as the data changes it might regress again. Or if the phase of the moon changes or something -- since it seems like magic!

Olreich|3 years ago

It's almost never faster to do two queries. But DBs are complicated query engines with sophisticated compilers, so sometimes you have to go in and convince the compiler to optimize the way you want, the way that makes the query fast. EXPLAIN ANALYZE is your best friend for understanding what the compiler thinks is best. Then you can adjust the query to convince it to do something faster.

In this specific case, I'd bet that reversing the FROM and JOIN tables would be enough to get even better performance than 2 queries: `SELECT * FROM os JOIN o …`.
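Either way, EXPLAIN ANALYZE is where to start; a minimal sketch against a hypothetical orders/order_statuses schema (names invented, not from the article):

```sql
-- Ask the planner what it actually did, with real timings and row counts:
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
JOIN order_statuses os ON os.order_id = o.id
WHERE os.status = 'shipped';
-- Things to look for in the output: seq scans on large tables,
-- big gaps between estimated and actual row counts, and which join
-- strategy was chosen (nested loop vs hash vs merge).
```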

zasdffaa|3 years ago

It's a failure of the optimiser. As a DB guy (MSSQL, not Postgres), there's always a risk of this; you always have to be aware of it, and never assume. MSSQL's optimiser is pretty good, but I've also seen slowdowns on plain queries after adding an index. Optimisers are complex, and individual optimisation rules interact in odd and sometimes counterproductive ways. Also, if your stats are not up to date, trouble is guaranteed.

bastawhiz|3 years ago

You're right. I had the same thought.

Frankly, it's such a simple case that I'm hesitant to pin the problem on Postgres. I'm now inclined to believe that the ORM/query builder they're using (which I'm not familiar with) is generating a query that does something unusual. I'm also having a hard time understanding the code, so it may just be a bad query in the first place. If the generated SQL was shown, it would be much easier to look into.

Generally speaking, if you're just joining some data onto the results of an expensive operation, it's exactly the same as doing a second query, just without the round trip and overhead.

The lesson here is to always look at EXPLAIN ANALYZE any time you're doing Postgres optimization work. There's a wealth of useful information that will often point you to your mistake. Databases are quite often extremely good at doing common-sense operations well.

kuzee|3 years ago

We secretly agree and aren't sure why it's faster to do two queries, but it measurably is. We're going to try some of the suggestions littered in this conversation and will report back, this time with some EXPLAIN output. We appreciate the suggestions and theories.

Groxx|3 years ago

While I generally wonder the same thing... I've seen basically every database follow this pattern. There are fairly frequent scenarios where doing a couple of separate indexed queries performs better than the equivalent join, particularly when queries get kinda large and up.

I can certainly see it when running into per-table memory limits, because doing the full join might require more memory at once, but it frequently happens far below where I'd expect that to occur (in the low thousands of rows). Dunno why. Maybe it uses caches more efficiently? Many simple operations are often more cacheable than a few complex ones.

rubyist5eva|3 years ago

I've seen it happen when people forget to index the foreign key columns. It can also be faster on one-to-many joins if you would otherwise need DISTINCT to remove duplicate rows (instead of joining on a specific subquery that eliminates the duplication implicitly).
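Both cases are easy to sketch with a made-up orders/order_items schema:

```sql
-- 1. Postgres does not index foreign key columns automatically;
--    joins on them fall back to scans until you add one:
CREATE INDEX ON order_items (order_id);

-- 2. One-to-many join, then DISTINCT to dedupe the "one" side:
SELECT DISTINCT o.*
FROM orders o
JOIN order_items i ON i.order_id = o.id
WHERE i.sku = 'ABC';

--    Often faster: a semi-join, so the duplicates are never
--    produced in the first place and DISTINCT isn't needed:
SELECT o.*
FROM orders o
WHERE EXISTS (SELECT 1 FROM order_items i
              WHERE i.order_id = o.id AND i.sku = 'ABC');
```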

kuzee|3 years ago

I wrote up my recent experience optimizing Postgres text search on a database with a few million records, without relying on a new service like Elasticsearch.

canadiantim|3 years ago

Awesome, thanks for the write-up. Definitely cool to see that breaking the query up into 2 queries had such a marked improvement! I know I can often get too obsessed with making just a single query. Cheers.

norwalkbear|3 years ago

Has anyone overcome the 16382 position limit of tsvector?

That, and the automatic stemming and lemmatization of search words even in phrase searches, makes Postgres awful for any software where accurate search is critical.

hombre_fatal|3 years ago

I remember when I ran into that first issue on my forum, I felt like I was the first person to ever use Postgres full-text search. (8?) years ago, googling the exact error brought up nothing except some dev mailing-list chatter. Nobody else was indexing longer text documents? Wasn't very encouraging.

Oh well. The vast majority of forum posts don't hit that limit so I just excluded the exceptionally long posts from the index. I never revisited it again.

iav|3 years ago

I used a trigger function to detect long text and trim the source before it got indexed. That meant that any text over 500kb just got dropped from the index. I also used one index per long text field rather than combining with other fields.
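Roughly like this (names and the 500kb cutoff are illustrative, not my actual code; EXECUTE FUNCTION needs PG 11+, older versions use EXECUTE PROCEDURE):

```sql
-- Keep oversized documents out of the tsvector so the index
-- never hits its internal limits.
CREATE OR REPLACE FUNCTION posts_tsv_update() RETURNS trigger AS $$
BEGIN
  IF length(NEW.body) > 500 * 1024 THEN
    NEW.body_tsv := NULL;      -- long posts are simply dropped from the index
  ELSE
    NEW.body_tsv := to_tsvector('english', NEW.body);
  END IF;
  RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_tsv_trigger
  BEFORE INSERT OR UPDATE ON posts
  FOR EACH ROW EXECUTE FUNCTION posts_tsv_update();
```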

manigandham|3 years ago

The lack of BM25 relevance scoring is the bigger problem. Postgres FTS is fine as a better "LIKE" filter with very little overhead, but it's a poor choice for serious search applications or scale.
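For comparison, the built-in ranking looks like this (hypothetical posts table with a precomputed body_tsv column); ts_rank weighs term frequency within the document but has no BM25-style corpus-wide term weighting:

```sql
-- Built-in relevance ranking: fine as a tiebreaker on top of a filter,
-- but not a substitute for BM25.
SELECT id, ts_rank(body_tsv, query) AS rank
FROM posts, to_tsquery('english', 'postgres & search') AS query
WHERE body_tsv @@ query
ORDER BY rank DESC
LIMIT 10;
```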

eric4smith|3 years ago

PostgreSQL text search is awesome - for English and other Latin-script languages.

But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.

PostgreSQL is weird about joins. Joining on certain columns can be super fast while others are dog slow. And this can flip depending on the size of the table and the index cardinality.

That's why it's important, on databases that grow quickly, to check the performance of even simple queries, as those can balloon in execution time as the profile of the data changes.

jxf|3 years ago

> But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.

I understand that there are likely many significant differences between graphemes in these languages and Roman text, but I'm not familiar enough with any Asian language to construct an example. Can you give an illustrative example that explains why the search doesn't work as well (or at all) in those cases?

SnowHill9902|3 years ago

Can you give more information about your JOIN performance? It's mostly dependent on the presence and optimization of indexes and on cardinality, as you mention. Joining on indexed integers is usually fastest.

rdoherty|3 years ago

I'd love to see some details on the why, using EXPLAIN ANALYZE on each query, plus the schema. It seems like the changes were made on a hunch as to why they were slow?

hinkley|3 years ago

> It seems like the changes were done with a hunch as to why they were slow?

You say that like it's a bad thing. If the hunch works, does that cheapen the effect? The why can come after. It often does.

Mastery isn’t pondering things faster, it’s using system 1 thinking for most of the process with some system 2 sprinkled on top. Or to use different terminology, intuition with a bit of high level executive function guiding things.

Or a lot of hunches and a little thinking.

vosper|3 years ago

PGSync might be useful for those who don't mind also running Elasticsearch

https://github.com/toluaina/pgsync

avinassh|3 years ago

This is so neat! How does it handle updates? Are there similar systems which can sync with Redis for quick caching?

netcraft|3 years ago

Thank you for sharing. Does anyone know of something similar for dynamodb?

maverwa|3 years ago

I remember achieving quite good results when implementing a simple „fuzzyesque“ search for a B2B e-commerce system a few years back, but what hit us hard was the German dictionary used for stemming in the tsvector index. The one that ships with Postgres did not handle German compound words correctly, and the „fix“ was to import a third-party one. I learned the hard way that this comes at a cost: Postgres loads the dictionary every time a new connection uses the index. Running a simple, small PHP app that could not reuse/pool connections, every search query came in on a clean connection and paid a >1s fee for loading the dict.

iirc pgbouncer saved the day there.

Otherwise it worked fine.

anshul|3 years ago

You might want to test using the first query as a subquery or CTE in the second one. That would likely give you the same or better perf. It would avoid the join and save a round trip.
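Something like this, with made-up names (MATERIALIZED needs PG 12+; it forces the planner to run the CTE separately, mimicking the two-query behavior in a single round trip):

```sql
-- Fold the first query (the text-search filter) into the second.
WITH matches AS MATERIALIZED (
  SELECT id
  FROM posts
  WHERE body_tsv @@ to_tsquery('english', 'postgres & search')
)
SELECT *
FROM posts
WHERE id IN (SELECT id FROM matches);
```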

hobs|3 years ago

If you don't need the output of the first query, you'll almost always get the best performance in SQL using EXISTS syntax, e.g.

select * from query1 as q where exists ( select * from query2 as q2 where q.col = q2.col )