> A key gotcha that tripped us up: when querying with a list of primary keys, Postgres will not return the records in the same order as the IDs provided in a “where … in” clause.
It's frightening that they didn't know that. So, burn this into your minds: no ordering is guaranteed in SQL unless an ORDER BY is applied (and any ordering inside a subquery is lost). Even if it seems to work, it will eventually fail. No guarantees on order unless... scratch that onto your eyeballs so you never forget it.
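If the caller really needs rows back in the order the IDs were supplied, one option is to sort by each ID's position in the input array. A sketch, assuming Postgres 9.5+ and a hypothetical `items` table with primary key `id`:

```sql
-- Rows come back in whatever order the plan happens to produce them,
-- so restore the caller's order explicitly:
SELECT *
FROM items
WHERE id = ANY (ARRAY[42, 7, 19])
-- array_position returns each id's index within the input array,
-- so sorting by it reproduces the order the IDs were supplied in.
ORDER BY array_position(ARRAY[42, 7, 19], id);
```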
Also, will people please stop posting rainbow-on-black screenshots, especially when the screenshotted text is smaller than the surrounding text.
Sometimes I wish for a 'debug' version of PG which adds an "order by random()" to all queries (or appends a "random()" clause to an existing "order by" clause).
Or even better: have it as a connection/session parameter.
I know this now, but back when I didn't, Postgres returned the same order thousands of times and then changed it for no apparent reason, which made fixing flaky tests very tricky.
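No such debug mode exists, but a test suite can approximate the idea by wrapping queries under test in a deliberate shuffle (a sketch; the inner query and table here are hypothetical):

```sql
-- Shuffle the result deliberately so order-dependent tests fail fast
-- and consistently, instead of flaking whenever the plan changes.
SELECT * FROM (SELECT id, name FROM items) AS q
ORDER BY random();
```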
It's a common problem when moving from MySQL. We personally ran into it shortly after switching to Postgres, but it seems like something they should've discovered a lot earlier, for sure.
We shared the mistake so others don't accidentally make the same assumption; I'm sure a few people either learned it for the first time or appreciated the reminder. Rest easy, this never made it to prod.
How can querying for PKs, then doing a second query on the FK, be faster than a join that's semantically equivalent? Postgres optimizer gone terribly wrong or something? Or am I misunderstanding what's going on?
It would never have occurred to me to even try this, I would have assumed postgres would do as well as could be done at what looks like a very standard join. So I guess that's a lesson not to make assumptions?
Not sure, without even having a theory of why pg would behave this way, I'd be worried that as the data changes it might regress again. Or if the phase of the moon changes or something -- since it seems like magic!
It’s almost never faster to do two queries. But DBs are complicated query engines with sophisticated compilers, so you sometimes have to go in and convince the compiler to produce the fast plan you have in mind. EXPLAIN ANALYZE is your best friend for understanding what the compiler thinks is best, and then you can adjust the query to convince it that something faster is possible.
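For example (the table and column names here are hypothetical — the point is the plan output, not the schema):

```sql
-- ANALYZE actually runs the query and reports real row counts and timings;
-- BUFFERS adds cache hit/read statistics on top.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.*
FROM orders o
JOIN order_statuses os ON os.order_id = o.id
WHERE os.status = 'shipped';
-- Things to look for in the output: a Seq Scan where you expected an
-- Index Scan, and estimated rows wildly different from actual rows
-- (a sign of stale statistics).
```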
In this specific case, I’d bet that having the FROM and JOIN tables reversed would be enough to get even better performance than 2 queries: `SELECT * FROM os JOIN o …`.
It's a failure of the optimiser. As a DB guy (MSSQL, not Postgres): there's always a risk of this, you always have to be aware of it, and you shouldn't assume. MSSQL's optimiser is pretty good, but I've also seen slowdowns on plain queries after adding an index. Optimisers are complex, and individual optimisation rules interact in odd and sometimes counterproductive ways. Also, if your stats are not up to date, trouble is guaranteed.
Frankly, it's such a simple case that I'm hesitant to pin the problem on Postgres. I'm now inclined to believe that the ORM/query builder they're using (which I'm not familiar with) is generating a query that does something unusual. I'm also having a hard time understanding the code, so it may just be a bad query in the first place. If the generated SQL was shown, it would be much easier to look into.
Generally speaking, if you're just joining some data onto the results of an expensive operation, it's exactly the same as doing a second query, just without the round trip and overhead.
The lesson here is to always look at EXPLAIN ANALYZE any time you're doing Postgres optimization work. There's a wealth of useful information that will often point you to your mistake. Databases are quite often extremely good at doing common-sense operations well.
We secretly agree and aren't sure why it's faster to do two queries, but it measurably is. We're going to try some of the suggestions littered in this conversation and will report back, this time with some EXPLAIN output. We appreciate the suggestions and theories.
While I generally wonder the same thing... I've seen basically every database follow this pattern. There are fairly frequent scenarios where doing a couple of separate indexed queries performs better than an equivalent join, particularly when queries get kinda large and up.
I can certainly see it when running into memory limits per table, because doing the full join might require more memory at once, but it frequently happens far below where I'd expect that to occur (in the low thousands of rows). Dunno why. Maybe it uses caches more efficiently? Many simple operations are often more cache-able than a few complex ones.
I’ve seen it happen when people forget to index the foreign key columns; it can also be faster on one-to-many joins if you need to use DISTINCT to remove duplicate rows (instead of joining on a specific subquery that eliminates the duplication implicitly).
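Worth spelling out: Postgres automatically indexes primary keys and unique constraints, but not the referencing side of a foreign key — that index has to be created by hand. A sketch with hypothetical names:

```sql
-- The FK constraint alone does not index orders.customer_id;
-- without this, the join probe into orders degrades to a scan.
CREATE INDEX orders_customer_id_idx ON orders (customer_id);
```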
I wrote up my recent experience optimizing Postgres text search on a database with a few million records without relying on a new service like Elastic Search.
Awesome thanks for the write up. Definitely cool to see breaking up the query into 2 queries had such a marked improvement! I know I can often get too obsessed with making just a single query. Cheers.
Has anyone overcome the 16382-position limit of tsvector?
That, and the automatic stemming and lemmatization of search words even in phrase searches, makes Postgres awful for any software where accurate search is critical.
I remember when I ran into that first issue on my forum, I felt like I was the first person to ever use Postgres full-text search. (8?) years ago, googling the exact error brought up nothing except some dev email/listserv chatter. Nobody else was indexing longer text documents? Wasn't very encouraging.
Oh well. The vast majority of forum posts don't hit that limit so I just excluded the exceptionally long posts from the index. I never revisited it again.
I used a trigger function to detect long text and trim the source before it got indexed. That meant that any text over 500kb just got dropped from the index. I also used one index per long text field rather than combining with other fields.
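A sketch of that trigger approach, with made-up table and column names (`EXECUTE FUNCTION` needs Postgres 11+; older versions use `EXECUTE PROCEDURE`):

```sql
CREATE FUNCTION posts_tsv_update() RETURNS trigger AS $$
BEGIN
  IF octet_length(NEW.body) > 500 * 1024 THEN
    -- Too long: leave it out of the index entirely rather than
    -- erroring out on the tsvector position limit.
    NEW.body_tsv := NULL;
  ELSE
    NEW.body_tsv := to_tsvector('english', NEW.body);
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER posts_tsv_update_trg
BEFORE INSERT OR UPDATE ON posts
FOR EACH ROW EXECUTE FUNCTION posts_tsv_update();
```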
The lack of BM25 relevance scoring is the bigger problem. Postgres FTS is fine as a better "LIKE" filter with very little overhead, but it's a poor choice for serious search applications or scale.
PostgreSQL text search is awesome - for English and Roman type languages.
But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.
PostgreSQL is weird about joins. Joining on certain columns can be super fast while others are dog slow, and this can flip depending on table size and index cardinality.
That’s why it’s important on databases that grow quickly to check the performance of even simple queries as those can balloon in execution time as the profile of the data changes.
> But Asian languages such as Thai, Japanese, Korean, etc are not going to work at all.
I understand that there are likely to be many significant differences in graphemes in these languages from Roman text but I'm not familiar enough with any Asian language to construct an example. Can you give an illustrative example that explains why the search doesn't work as well (or at all) in those cases?
Can you give more information about your JOIN performance? It’s mostly dependent on the presence and optimization of indexes and on cardinality, as you mention. JOIN ON indexed integers is usually fastest.
I'd love to see some details on the why, using EXPLAIN ANALYZE on each query plus the schema. It seems like the changes were done with a hunch as to why they were slow?
> It seems like the changes were done with a hunch as to why they were slow?
You say that like it’s a bad thing. If the hunch works does that cheapen the effect? The why can come after. It often does.
Mastery isn’t pondering things faster, it’s using system 1 thinking for most of the process with some system 2 sprinkled on top. Or to use different terminology, intuition with a bit of high level executive function guiding things.

Or a lot of hunches and a little thinking.
I remember achieving some quite good results when implementing a simple "fuzzyesque" search for a B2B e-commerce system a few years back, but what hit us hard was the German dictionary used for stemming in the tsvector index. The one that ships with Postgres did not handle German compound words correctly, and the "fix" was to import a third-party one. I learned the hard way that this comes at a cost: Postgres loads the dictionary every time a new connection uses the index. Running a simple, small PHP app that could not reuse/pool connections, every search query came in on a fresh connection and paid a >1s penalty for loading the dict. IIRC pgbouncer saved the day there. Otherwise it worked fine.
You might want to test using the first query as a sub-query or cte in the second one. That would likely give you the same / better perf. It would avoid the join and save a round trip.
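As a sketch, with placeholder tables and filters (the post's actual SQL isn't shown here):

```sql
-- The first query becomes a CTE feeding the second,
-- so it's one round trip instead of two.
WITH wanted AS (
  SELECT id FROM items WHERE name LIKE 'widget%'  -- stand-in for query 1
)
SELECT o.*
FROM orders o
JOIN wanted w ON w.id = o.item_id;  -- stand-in for query 2
```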
https://github.com/toluaina/pgsync
`select * from query1 as q where exists (select * from query2 as q2 where q.col = q2.col)`