First, both strategies - the one outlined by the Neon/ParadeDB article, and the one used here - are presented as viable alternatives by the Postgres docs: https://www.postgresql.org/docs/current/textsearch-tables.ht....
Second - as the article correctly demonstrates, the problem with Postgres FTS isn't "how can I pick and optimize a single pre-defined query", it's "how do I bring Postgres to Elastic-level performance across a wide range of real-world boolean, fuzzy, faceted, relevance-ranked, etc. queries?"
`pg_search` is designed to solve the latter problem, and the benchmarks were made to reflect that. You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
- Created composite B-tree indexes for each of the queries with boolean predicates
- Extracted all the text fields from the JSONBs and stored and indexed them as separate columns for queries against JSONB
But that's not realistic for many real-world use cases. `pg_search` doesn't require that - it's a simple index definition that works for a variety of "Elastic style" queries and Postgres types and doesn't ask the user to duplicate every text column.
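For concreteness, the single-index setup is one statement along these lines - a sketch only, since the table and column names here are hypothetical and the exact `USING bm25` syntax varies across pg_search versions (check the ParadeDB docs):

    -- one BM25 index covering text, JSONB, and filter columns
    CREATE INDEX logs_search_idx ON logs
    USING bm25 (id, message, country, severity, metadata)
    WITH (key_field = 'id');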
> This is what we did:
> - DB with pg_search: We created a single BM25 index
> - DB without pg_search: We created all these indexes:
>   - GIN index on message (for full-text search)
>   - GIN index on country (for text-based filtering)
>   - B-tree indexes on severity, timestamp, and metadata->>'value' (to speed up filtering, ordering, and aggregations)
See the problem? You didn't store the computed tsvector and index that in the without-pg_search case. You didn't compare apples to apples. TFA is all about that.
Perhaps you can argue that creating a `fastupdate=on` index would have been the right comparison, but you didn't do that in that blog post.
> You can always cherry-pick a query and optimize it at the expense of data duplication and complexity. The Neon/ParadeDB benchmarks contained 12 queries in total, and the benchmarks could have:
TFA isn't cherry-picking to show you that one query could have gone faster. TFA is showing that you didn't compare apples to apples. Looking at those 12 queries, nothing screams at me that TFA's approach of storing the computed tsvector wouldn't work for those too.
Perhaps pg_search scales better and doesn't require trading off update performance for search performance, and that would be a great selling point, but why not just make that point?
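For reference, the apples-to-apples setup TFA describes is just a stored, indexed tsvector - a minimal sketch, assuming a hypothetical logs(message text) table:

    -- precompute the tsvector once per row, instead of on every recheck
    ALTER TABLE logs
      ADD COLUMN message_tsv tsvector
      GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;

    -- index the stored column; fastupdate is the GIN option mentioned above
    CREATE INDEX logs_message_tsv_idx ON logs
      USING GIN (message_tsv) WITH (fastupdate = on);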
I'm shocked that the original post being referred to made this mistake. I recently implemented Postgres FTS in a personal project, and did so by just reading the Postgres documentation on FTS and following the instructions. The docs lead you through the process of creating the base unoptimized case and then optimizing it, explaining the purpose of each step and why it's faster. It's really clear that this is what it's doing, and I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
This is not my area of expertise, so take this with a grain of salt, but my initial instinct was to question why you would need to store the tsvector both in the table and in the index (because the tsvector values will in fact be stored losslessly in a GIN index).
The PG docs make it clear that this only affects row rechecks, so this would only affect performance on matching rows when you need to verify information not stored in the index, e.g. queries with weighted text or queries against a lossy GiST index. It's going to be use-case dependent but I would check if your queries need this before using up the additional disk space.
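For example (hypothetical docs(tsv tsvector) table), a weight-restricted query is one of those recheck cases, because the GIN index stores lexemes without their weight labels, so matching rows have to be checked against the actual tsvector:

    -- the ':A' restriction can't be answered from the GIN index alone
    SELECT id
    FROM docs
    WHERE tsv @@ to_tsquery('english', 'postgres:A');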
I’ve been a Postgres FTS advocate for over a decade since replacing a Solr search with it and getting easier maintenance, more flexibility with queries and virtually no difference in speed.
It’s pretty great.
Elastic is on a different level for a lot of use cases, but pg is more than enough for the vast majority of workloads.
> I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
Small writing note, I probably would've swapped the order of those. Hanlon's Razor and all. :)
Perhaps the most generous interpretation is that the authors were writing an article for people who do the naïve thing without reading the docs. There are quite a few people in that category.
> I could only assume that someone making this mistake is either doing so to intentionally misrepresent Postgres FTS, or because they haven't read the basic documentation.
vibe sysadminning, bro
I first used pg full text in around 2008. I've also used SOLR and ElasticSearch to power search and recommendation in substantial products.
The issue I have had with postgres full text search isn't that it's too slow, it's that it's too inflexible. It's a nice way to add simple search to fields but poor if you want to tune the search at all. Even allowing for general substrings is too much to ask; even allowing for custom tokenization is too much to ask. There's no tokenization pipeline to speak of unless you want to write C extensions (which of course you can't do for hosted databases anyway). Solr and Elasticsearch let you set up very complex indexes and search processing via configuration. There's absolutely nothing that would prevent postgres from adopting a lot of this capability, but instead postgres offers literally NOTHING.
I get the impression that most of the developers for postgres full text haven't spent much time with other solutions, as from previous discussions they don't really understand what I mean when I talk about tokenization and filter setup, and they don't really understand why this is a deal-breaker even for very simple applications. Postgres just splits on whitespace (and lets you basically manually use stopwords and stemming, which is crap). There is really no way to concatenate fields in a clean way into a single index, which again makes it extremely annoying to work with. There's no way to score searches based on field weighting or really any other kind of weighting beyond BM25. Compared to the alternatives it's a toy system.
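To be concrete, this is about the extent of what you can configure in core: dictionary mappings per token type, and hand-rolled field weighting. A sketch (the unaccent dictionary needs the extension; table/column names hypothetical):

    -- a custom configuration: strip accents, then stem
    CREATE EXTENSION IF NOT EXISTS unaccent;
    CREATE TEXT SEARCH CONFIGURATION my_english (COPY = pg_catalog.english);
    ALTER TEXT SEARCH CONFIGURATION my_english
      ALTER MAPPING FOR hword, hword_part, word
      WITH unaccent, english_stem;

    -- "concatenating fields" is done by hand, with per-field weights
    SELECT setweight(to_tsvector('my_english', coalesce(title, '')), 'A')
        || setweight(to_tsvector('my_english', coalesce(body, '')), 'D')
    FROM articles;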
I wish there were some explain plans in either post, since I don't get what's going on.
If the query uses the index, then the on-the-fly tsvector rechecks are only on the matches, and the benchmark queries have LIMIT 10, so few rechecks, right?
Edit: yes, but the query predicates have conditions on two GIN indexes, so I guess the planner chooses to recheck all the matches for one index first even though it could avoid lots of work by rechecking row-wise
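Something like this would show it - a sketch with hypothetical names, where the plan shape is the interesting part:

    EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM logs
    WHERE message_tsv @@ to_tsquery('english', 'timeout')
      AND country_tsv @@ to_tsquery('simple', 'canada')
    LIMIT 10;

    -- expected shape: a BitmapAnd over both GIN indexes feeding a single
    -- Bitmap Heap Scan; its "Recheck Cond" is where the per-row tsvector
    -- recomputation shows up when the tsvector isn't stored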
I'm legitimately curious -- why do people want to put EVERYTHING into postgres? I don't understand this trend (vector search, full text search, workload orchestration, queues, etc.)
I've built a number of systems that run a database and a separate search index (Elasticsearch, Solr, Xapian). The hardest part by far is keeping the search index in sync with the database. I gave a talk about this a while ago: https://simonwillison.net/2017/Aug/16/denormalized-query-eng...
Using the search engine built into PostgreSQL, MySQL or SQLite makes this problem SO MUCH less difficult.
If you can avoid adding an extra service without paying too much penalty, it means not having to acquire an extra skill or hire another devops person or keep yet another service in sync / maintained / etc.
The cost of adding services to an app is so much higher than people give it credit for at organizations of every size; it's shocking to me that more care isn't taken to avoid it. I certainly understand at the enterprise level that the value add of a comprehensive system is worth the cost of a few extra employees or vendors, but if you could flatten all the weird services required by all the weird systems that use them in 30,000+ employee enterprises and replace them with one database and one web/application server, you'd probably save enough money to justify having done it.
We've been using Elasticsearch + PG and it's pretty nice and fast, but it adds a whole layer of extra stuff to deal with when your data is in PG but then also needs to be indexed into an external server outside of those PG transactions. In our case I'm pretty convinced it hasn't been worth the effort. I think we could've optimized PG to be as fast as we needed with a lot less overhead than dealing with an external search index.
We moved our queues to PG and it cuts out the same kind of overhead, since we can wrap an update and the job it kicks off in a single transaction. PG has been plenty fast to keep up with our queue demand.
Ultimately I think being able to do things transactionally just avoids a whole class of syncing issues, which are basically caching issues, and cache invalidation is one of the 2 hard things.
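Both halves of that are small - a sketch, assuming a hypothetical jobs table whose status defaults to 'pending':

    -- enqueue: atomic with the business update that caused it
    BEGIN;
    UPDATE orders SET status = 'paid' WHERE id = 42;
    INSERT INTO jobs (kind, payload) VALUES ('send_receipt', '{"order_id": 42}');
    COMMIT;

    -- dequeue: workers claim jobs without blocking each other
    BEGIN;
    SELECT id, kind, payload
    FROM jobs
    WHERE status = 'pending'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1;
    -- ...process the job, mark it done (or delete it), then COMMIT
    COMMIT;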
You can abstract this to any RDBMS, and the justification is that it makes everything a lot faster & easier.
I just got off a call with a client where their developers were using ORM-style abstractions to manipulate data for downstream processing in code, turning what should have been a few seconds of one custom SQL command into several hours of passing objects around multiple computer systems.
If we can put the FTS engine inside the SQL engine, we can avoid the entire space of APIs, frameworks, 3rd parties, latency, etc that goes along with otherwise integrating this feature.
Modern SQL dialects represent universal computation over arguably the best way we know how to structure complex data. Adding custom functions & modules into the mix is mostly just syntactic sugar over what is already available.
There is ZERO honor in offloading what could have been a SQL view into a shitty pile of nested iterators somewhere. I don't understand where any of this energy comes from. The less code the better. It's pure downside.
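As a toy example of the difference (hypothetical schema), the view version of a typical "fetch rows, loop, aggregate in code" pipeline is a handful of lines:

    -- one statement and one round trip instead of N queries and app-side loops
    CREATE VIEW customer_totals AS
    SELECT customer_id,
           count(*)    AS n_orders,
           sum(amount) AS total_spent
    FROM orders
    GROUP BY customer_id;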
Avoiding distributed systems problems. Distributed systems are so incredibly hard to get right that I will vertically scale postgres until I hit an insurmountable wall before giving in.
It's because postgres is in fact good at a lot of vaguely database-looking things. Even if it weren't the best at anything, if it does 80% of things at 80% of the best possible, it is reasonable to have postgres as the "first thing to reach for" by default.
That said, it's easy to forget to check if you're in either of those 20% (or both.) There's probably a whole bunch of postgres usage where really something else should be used, and people just never checked.
Because PG is a fantastic platform for doing everything you need in SQL and it performs real well. Add PostgREST and you've got a REST API almost for free (you have to design a schema of what to expose, so not entirely free, but still, pretty close). Also, PG moves _fast_ and has a very active developer and user community, which means you'll be getting more awesome functionality in the coming future. (E.g., I've been following the AIO thread and the performance improvements coming from that patch set will be hefty.)
When you're starting something new and the amount of data is still small, it's often better early on to focus on the product than on theoretical performance optimizations that may never pan out (either because the project will fail or because the bottlenecks may ultimately not be what you thought or expected).
At my current gig, we used to shove everything (including binary data) into postgres because it was easy and all our code plugged into it anyways. When it started to become uneconomical (mostly due to RDS storage costs), we then started shunting data to S3, DynamoDB, etc.
Also, not everybody can be on a cloud with easy access to all the fancy products for queuing, caching, etc. Sometimes it's better overall to deal with one complex beast (that you'd have to maintain anyways) than to spend time deploying Kafka, MongoDB, etc. (even though it can sometimes be easier than ever with pre-built manifests for K8s), as well as securing them and keeping them all up to date.
I do strongly encourage people to treat code that deals with these things with as much abstraction as possible to make migrations easier later on, though.
Tbf, MySQL also has everything you mentioned (vectors only in the latest version).
But either way, the answer is simplicity and cost. I assume you’ve heard of Choose Boring Technology [0]? Postgres is boring. It’s immensely complex when you dive into it, but in return, you get immense performance, reliability, and flexibility. You only have to read one manual. You only have to know one language (beyond your app, anyway – though if you do write an app in pure SQL, my hat is off to you). You only have to learn the ins and outs of one thing (again, other than your app). ES is hard to administer; I’ve done it. Postgres is also hard to administer, but if I have to pick between mastering one hard thing and being paged for it, or two hard things and getting paged for them, I’ll take one every day.
[0]: https://boringtechnology.club/
It is very good at doing the job that people over-eagerly offload to specialized services. Queues, notifications, scheduled jobs. And can be specialized with extensions.
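Notifications, for instance, are a couple of statements - a sketch (channel name and payload are hypothetical):

    -- worker session: subscribe to a channel
    LISTEN job_events;

    -- producer: delivered when its transaction commits
    NOTIFY job_events, 'job:42';

    -- or from inside a PL/pgSQL trigger function:
    -- PERFORM pg_notify('job_events', NEW.id::text);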
I am going to hazard a guess that it's because the closer the services are to your data, the easier they are to implement, and you often get great speed too. FTS in Postgres has been fantastic for me, and combining it with vector search and RAG gives you a pretty sweet deal for low effort.
Disclaimer, I have no experience with this kind of thing. However, theoretically, fewer tools are better for an organization - see [0] - and if your job adverts say just "postgres" instead of "postgres, elasticsearch, tool x, tool y, tool z" etc, you don't need to find (or train) a unicorn that is up to speed on all of them.
That said, "postgres" is a very broad subject if you take all of those into consideration; if you need to specialize your search for someone who knows how to do X in PG specifically, you're almost back at the same spot. (I say almost because I'm sure it's easier to learn a specialization in Postgres if you're already familiar with Postgres than it is to learn a completely new tool.)
And caveat, there's a high golden hammer risk there. I'd start questioning things when needing to query JSON blobs inside a database.
[0] https://mcfunley.com/choose-boring-technology
I agree with one of the points from The Art of PostgreSQL by Dimitri Fontaine, which is appropriate for answering this question:
"So when designing your software architecture, think about PostgreSQL NOT as storage layer, but rather as a concurrent data access service. This service is capable of handling data processing."
All these replies have me so confused; the reason to shove everything into your database when you can is that you can transact across it all. That's the thing you can't get once you have a second system.
It’s free, it’s stable, it’s excellent, and it’s everywhere. The question should be “why don’t people put everything in Pg?”
The only things you shouldn’t put in Pg are things where there’s an obviously better alternative that’s so much better as to make running an extra service worth it. There definitely are cases where that’s true (Redis when you need exceptionally quick responses, for example) but it’s a high bar to clear.
Because the vast majority of people on HN or in the real world don’t need to scale beyond 10 concurrent users. It’s insane how much infrastructure script kiddies add to their projects when it could be done in the database and scale well.
Glad to see more 'postgres-native' full-text search implementations.
Alternative solutions (Lucene/Tantivy) are both designed for 'immutable segments' (indexing immutable files), so marrying them with the postgres heap table would result in a worse solution.
The segments themselves being immutable doesn't mean that Tantivy is incompatible with Postgres - it just means that Tantivy needs to be made compatible with Postgres' concurrency control mechanisms (MVCC) and storage format (block storage). This blog post explains the latter: https://www.paradedb.com/blog/block_storage_part_one
There are open large(-ish) text datasets, like full Wikipedia or pre-2022 Reddit comments, that would work much better for benchmarking.
Years ago I wanted to use native FTS (because of all the things mentioned - having to sync to an external system simply adds complexity) and it failed at another point.
Not completely surprising, but on a table with _potentially_ a couple of thousand inserts per second, it slowed down the overall updates to the point that transactions timed out.
We had already added an index for one of the columns we wanted to index and were running the statement for the second one. The moment the second index finished, we started to see timeouts from our system when writing to that table, transactions failing, etc.
We had to drop the indices again. So, sadly, we never got to the point of testing the actual FTS performance :/ I would have liked to test this, because we didn't necessarily have to search hundreds of millions of documents; due to customer tenants this would always be constrained to a few million _at most_.
ps: I already wrote about this -> https://news.ycombinator.com/item?id=27977526 . Never got a chance to try it again nowadays (newer versions of everything, newer hardware, etc.)
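For anyone hitting the same wall, the knobs that might have helped (no guarantees - this is exactly the GIN write-amplification trade-off): building the index without blocking writes, and tuning the pending-list behavior. A sketch with hypothetical names:

    -- build without taking a write-blocking lock
    CREATE INDEX CONCURRENTLY logs_tsv_idx ON logs
      USING GIN (message_tsv)
      WITH (fastupdate = on, gin_pending_list_limit = 4096);  -- limit in kB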
Sounds like the issue was just co-location of the search index + other transactional data in the same table. If you had a table acting only as your search index, would insert lag on that table matter? I could maybe see connections piling up, but with proper batching I bet it'd be fine.
I’ve seen a lot of teams jump straight to Elasticsearch or Meilisearch without realizing how much performance you can get out of native PG FTS when used properly.
Could we get similar performance in the browser using something like SQLite + FTS5 + Wasm? Seems like an interesting direction for offline-first apps...
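In principle yes, at in-browser data sizes - FTS5 is the same idea on a smaller scale, and the Wasm part is just a SQLite build with FTS5 enabled. A minimal sketch:

    -- index and query in a few statements
    CREATE VIRTUAL TABLE docs USING fts5(title, body);
    INSERT INTO docs (title, body)
    VALUES ('offline first', 'full text search running in the browser');
    SELECT title FROM docs WHERE docs MATCH 'browser' ORDER BY rank;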
I don't think the question is speed, it's scale. Use it until it breaks, though.
Off topic, but this kind of content marketing is excellent for any startup that tries to get its name out compared with a well-known competitor, as is the case here with ParadeDB.