
Debugging the Postgres query planner (2018)

61 points | IvanVergiliev | 3 years ago | gocardless.com

24 comments


hyperman1|3 years ago

In general I love what RDBMSs, and PostgreSQL in particular, can do for you, but this is the one corner of them that I hate: planners are too smart for their own good.

This is a standard story: a query runs for a long time without issue, and then one day some minor shift in your statistics happens, and now you have a major performance issue on your hands, without any real change in prod. No non-production environment can help you: either it doesn't have the same data, or it does but the statistics sampling came out slightly different.

When it happens, you have to find an incantation that makes the planner understand what's going wrong. PostgreSQL has things like the extended statistics object and, in this case, the per-column statistics target, but finding the right incantation can be black magic, and adding indexes/statistics/... can take hours, so trial and error is slow.
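
For reference, those two incantations look roughly like this (table and column names are illustrative):

    -- Raise the per-column statistics target so ANALYZE samples more values:
    ALTER TABLE payments ALTER COLUMN payout_id SET STATISTICS 1000;
    -- Declare that two columns are correlated, via an extended
    -- statistics object (Postgres 10+):
    CREATE STATISTICS payments_dep (dependencies)
        ON merchant_id, payout_id FROM payments;
    -- Neither takes effect until the table is re-analysed:
    ANALYZE payments;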

Dumber databases have an edge here: Their performance is probably lower, but it is predictable.

Some RDBMSs have mitigations, e.g. Oracle's optimizer plan stability lets you make the plan unchangeable. It's a double-edged sword, of course: the plan won't get better when the data gives it the chance, but it won't get worse either.

lawrjone|3 years ago

Author of the article here; I thought this was worth commenting on:

> No non-production environment can help you: either it doesn't have the same data, or it does but the statistics sampling came out slightly different.

GoCardless still has a massive Postgres database (10TB or thereabouts) and only managed to scale it by investing heavily in tooling that helps developers work with it safely.

One example is https://github.com/gocardless/draupnir, a tool to create instances of production datasets very quickly (just `eval $(draupnir new); psql` and you have a mini production in ~3s), so you could try things like adding indexes, tweaking plan settings (`set enable_seqscan = off`), and reshaping the data to see how the planner behaved.
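
A typical throwaway session against one of those instances might look like this (the query is illustrative):

    -- Discourage sequential scans for this session and see what plan falls out:
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
        SELECT * FROM payments WHERE merchant_id = 42;
    -- Put the setting back before the next experiment:
    RESET enable_seqscan;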

I think it's very doable, though the planner still has blind spots. I had a side project to add linear-correlation statistics to the planner, which I abandoned when I stopped working with big Postgres databases, but that's an example of a statistic Postgres just doesn't track and that leads to these pathological edge cases.

I'd rather have the clever planner than not, though. I have a healthy appreciation for the heavy lifting Postgres can do for you.

rockwotj|3 years ago

We've had a similar issue here due to using SERIALIZABLE transactions: Postgres chose an index that caused it to lock the whole relation, because of how locks are upgraded if you scan too many rows [1].

Every change to our prod DB requires running EXPLAIN and EXPLAIN ANALYZE on some data to make sure the queries are doing the right thing (we use GCP Query Insights to watch for regressions [2]).
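
Concretely, that check is about comparing the planner's row estimates against what actually comes back (the query is illustrative):

    EXPLAIN (ANALYZE, BUFFERS)
        SELECT *
        FROM events
        WHERE account_id = 42
          AND created_at > now() - interval '1 day';
    -- If the estimated rows diverge wildly from the actual rows
    -- in the output, the plan is suspect.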

The vast majority of our queries are single index scans. I wish there were a database where we could fix the plan when our app is deployed. For the most part our schema is fairly denormalized, so we don't need very complex queries. The flexibility/power of SQL is really for debugging, analytics, and other one-off queries.

Hot take: I wish there were a DB that didn't force SQL, at least for the application; instead you'd just tell it what scan you wanted to do (basically embed the plan directly in the query you send). There could be a reporting mechanism if the DB detected a more efficient plan for the query, or something. You could still have a SQL layer for your debugging and one-off sessions.

I would vastly prefer the predictability over occasional performance spikes or in our case a spike of transaction failures due to a predicate lock being grabbed for a whole table.

[1]: the default here is 32 rows (https://www.postgresql.org/docs/current/runtime-config-locks...)

[2]: https://cloud.google.com/sql/docs/postgres/using-query-insig...

petergeoghegan|3 years ago

> Planners are too smart for their own good.

I know what you mean, but I don't think that that quite captures it. It's more like this: planners are built on a set of assumptions that are often pretty far from robust, but nevertheless work adequately well in almost all cases. Including many cases where the assumptions haven't been met!

The best example is the standard assumption that multiple conditions/columns are independent of each other -- all optimizers make this assumption (some can be coaxed into recognizing specific exceptions). This is obviously not true much of the time, even with a well-normalized schema. Because: why would it be?
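
The arithmetic in miniature (numbers made up): with 10M rows, sel(city = 'London') = 0.01 and sel(zip = 'EC1A 1BB') = 0.001, the planner estimates 10,000,000 * 0.01 * 0.001 = 100 rows. But if that zip code implies the city, the true answer is 10,000,000 * 0.001 = 10,000 rows -- a 100x underestimate.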

All kinds of correlations naturally appear in real data. It's just that they mostly don't cause huge problems, for messy reasons that can't quite be pinned down. You have to get unlucky: the correlations usually have to be very high, and the planner has to make completely the wrong inference for the actual query that you ran (not some hypothetical other query). The planner only has to have approximately the right idea to discover the cheapest plan. And the planner doesn't have to discover the cheapest plan in many cases -- there may be quite a few adequate plans (it's really hard to generalize, but that's often true).

Overall, the fact that cost-based optimizers work as well as they do seems quite surprising to me.

darksaints|3 years ago

> Some RDBMSs have mitigations, e.g. Oracle's optimizer plan stability lets you make the plan unchangeable. It's a double-edged sword, of course: the plan won't get better when the data gives it the chance, but it won't get worse either.

That's simply not true; it's just less noticeable. Because even if your query plan isn't changing, your data is. There will always be some point where your data grows enough that a reasonable planner (whether you or your database) has to adapt. For example, if a small lookup table grows enough, a full table scan stops being faster and an index lookup becomes the reasonable choice. If your plan never changes, your performance gets worse. You may argue that fixed query plans are more predictable, but they are not objectively better.
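
A sketch of that crossover (the table and the plan lines are illustrative, not real output):

    -- On a tiny table the planner rightly prefers a sequential scan,
    -- index or not:
    EXPLAIN SELECT * FROM currencies WHERE code = 'GBP';
    --   Seq Scan on currencies  (cost=0.00..1.25 rows=1 width=20)
    -- After the table grows a few orders of magnitude (and ANALYZE runs),
    -- the cheapest plan for the identical query flips:
    --   Index Scan using currencies_code_idx on currencies
    --     (cost=0.29..8.30 rows=1 width=20)
    -- A pinned plan stays on whichever side of the crossover it was captured.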

cube2222|3 years ago

Afaik Aurora lets you pin plans and will, at the same time, ask you to approve new, better plans if any are discovered.
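
For reference, that's Aurora PostgreSQL's query plan management (the apg_plan_mgmt extension). Roughly, per the AWS docs (exact setting names are worth double-checking):

    -- Record plan baselines for statements as they run:
    SET apg_plan_mgmt.capture_plan_baselines = automatic;
    -- Only execute approved plans; new candidate plans are stored for review:
    SET apg_plan_mgmt.use_plan_baselines = true;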

I haven't used it yet, but it sounds like it lets you have your cake and eat it too.

gurjeet|3 years ago

> A query runs for a long time without issue, and then one day some minor shift in your statistics happens, and now you have a major performance issue ..

This is the precise problem I’m working on solving. See the pg_plan_guarantee extension.

https://github.com/DrPostgres/pg_plan_guarantee

paulryanrogers|3 years ago

There is also pg_hint_plan. MySQL, too, can suffer from this lurking-threshold problem, though it has hints built in.
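
For the curious, pg_hint_plan's hints ride along in a special comment ahead of the query; a minimal sketch (table and index names are illustrative):

    /*+ IndexScan(orders orders_customer_idx) */
    SELECT * FROM orders WHERE customer_id = 42;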

Upgrading major versions and huge surges of writes are often the catalysts for crossing performance cliffs, IME. The first can be planned for, and load testing can help you get ahead of the second.

stonemetal12|3 years ago

It could be cool if the RDBMS A/B-tested its plans: if the new plan isn't better, don't switch to it. Though that would certainly add to the black magic of it; maybe a command to show the dev the top 5 plans and allow them to pick and pin one.

lawrjone|3 years ago

Well this is funny: I'm the author of this post, and pleasantly surprised to see it coming around for a second time!

If you like this content, there's a bunch more on my blog that you'll probably like too:

https://blog.lawrencejones.dev/

While I continue to work with Postgres, I'm enjoying the life of an early start-up where your default is "not much data", though it does make you unlearn a lot of the habits developed when dealing with multi-terabyte databases!

kroolik|3 years ago

Any reason the article has sections appearing out of nowhere that don't fit the surrounding parts? For example:

> large, heterogeneous datasets. When all the rows look similar there will be fewer candidate plans and less chance of a disastrous misstep.