Postgres is such a cool project; I have so much respect for its maintainers and community! For me it's the second most impactful OSS project in the business tech world, behind Linux itself.
A real public good to be cherished and praised.
You can get no better recommendation than a phrase I once heard at a presentation by Richard Hipp, primary author of SQLite. As he was explaining how he analyzed new features and design decisions, he said some of his guidance is: "What would Postgres do?"
I recently started an ML project using text data, and the choice was between MySQL and Postgres. Having looked at the respective features and pros and cons, the choice was immediately obvious. Also, with pgvector and https://postgresml.com available, the choice for Postgres was even easier.
On linux there also is preadv2(..., RWF_NOWAIT) which can be used to do optimistic non-blocking read from the page cache. That might be useful for io_method = worker to shave off a bit of latency. Try reading on the main thread with NOWAIT and only offload to a worker thread when that fails.
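A minimal sketch of that idea (hypothetical helper; assumes Linux 4.14+ and glibc 2.26+ for preadv2/RWF_NOWAIT):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Optimistic fast path: try a non-blocking page-cache read first;
 * RWF_NOWAIT makes preadv2() fail with EAGAIN instead of blocking on
 * disk I/O.  In real AIO code the fallback branch would hand the read
 * to an I/O worker rather than, as here, simply reissuing it
 * synchronously. */
static ssize_t read_block(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0)
        return n;                    /* served from the page cache */
    if (errno != EAGAIN && errno != EOPNOTSUPP)
        return -1;                   /* real error */

    /* Slow path: data not cached (or NOWAIT unsupported here). */
    return preadv2(fd, &iov, 1, off, 0);
}
```

As the replies note, the catch is that a miss pays for two page-cache lookups.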
FWIW, I played with that - unfortunately it seems that the overhead of doing twice the page cache lookups is a cure worse than the disease.
Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.
It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial, particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and Postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth over the last ~decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computations.
Update: Most of the comments below seem to be missing the fact that Windows now also has an IORing implementation, as I mentioned above. Comparison article here: https://windows-internals.com/ioring-vs-io_uring-a-compariso...
I am not a Windows guy but I (with help) managed to get IOCP working for this in a basic prototype. Will share publicly soon. I also sketched out an IoRing version (if you are interested in helping debug and flesh that out let me know!).
Main learnings: the IOCP version can't do asynchronous flush! Which we want. The IoRing version can! But it can't do scatter/gather, AKA vector I/O, yet! Which is an essential feature for a buffer pool implementation. So actually I am basically waiting for IoRing to add support for that before taking it too seriously (I can see they are working on it because the ops are present in an enum, it's just that the build functions are missing).
So my guess is that in a year or so we should be able to run all PostgreSQL disk I/O through IoRing on Windows. Maybe?
Another complication is that it really wants to be multithreaded (consuming completions for IOs started in another process requires a lot of hoop jumping; I made it work, but...). This will resolve itself naturally with ongoing work to make PostgreSQL multithreaded.
The next problem is that IoRing doesn't support sockets! So in future work on async networking (early prototypes exist) we will likely also need IOCP for that part.
It depends on the I/O method - as described in the article, "io_uring" is only available on Linux (and requires building with liburing, as well as io_uring being enabled in the kernel), but the default (as of beta1) is actually "worker", which works on any operating system.
The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).
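For reference, a sketch of how this is selected in postgresql.conf (settings as documented for PostgreSQL 18; defaults worth double-checking against your build):

```
# postgresql.conf - PostgreSQL 18 AIO settings (a sketch)
io_method = worker       # default: pool of background I/O worker processes
#io_method = io_uring    # Linux only; server built with --with-liburing
#io_method = sync        # synchronous I/O, the pre-18 behaviour
io_workers = 3           # size of the worker pool (default 3)
```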
Sounds like this feature is based on io_uring which is a Linux feature. I would be surprised if they implemented async io on Windows before they would on Linux given the user/deployment base being very Linux-heavy.
A lot of work has gone into FreeBSD's aio(4) so it will be interesting to see how that works, because it doesn't have the drawbacks of Linux/glibc aio.
BTW I have patches for PostgreSQL AIO on FreeBSD, which I will propose for v19. It works pretty well! I was trying to keep out of Andres's way for the core architectural stuff and basic features, i.e. I didn't want to overload the pipes with confusing new topics for v18 :-)
Thank you for the effort that went into getting this committed. I remember seeing the first discussions about async I/O (and using io_uring) like 6 or 7 years ago. Amazing amount of work to get the design right.
Looking forward to the other places that async I/O can be used in future Postgres releases now that the groundwork is done.
Very nicely written post! I'd love to start running these in production on NVMe and hope it's something major cloud providers start to offer ASAP. The performance gains are _extremely_ attractive.
I recently deployed Postgres on a dedicated Hetzner EX-44 server (20 cores, 64GB RAM, 2x 512GB NVMe SSDs in RAID 1) for €39/month. The price-to-performance ratio is exceptional, providing enterprise-level capacity at a fraction of typical cloud costs.
For security, I implemented Tailscale, which adds only ~5ms of latency while completely eliminating public network exposure - a worthwhile tradeoff for the significant security benefits.
My optimization approach includes:
- Workload-specific configuration generated via PGTune (https://pgtune.leopard.in.ua/)
- Real-time performance monitoring with PgHero for identifying bottlenecks
- Automated VACUUM ANALYZE operations scheduled via pg_cron targeting write-heavy tables, which prevents performance degradation and helps me sleep soundly
- A custom CLI utility I built for ZSTD-compressed backups that achieves impressive compression ratios while maintaining high throughput, with automatic S3 uploading: https://github.com/overflowy/pgbackup
This setup has been remarkably stable and performant, handling our workloads with substantial headroom for growth.
If I were you, I would absolutely use another backup utility (at least in addition to your own) - Barman, pgBackRest, etc.
You are just wrapping pg_dump, which is not a full-featured backup solution. Great for a snapshot...
Use some of the existing tools and you get point-in-time recovery, easy restores to hot standbys for replication, a good failover story, backup rotations, etc.
I sort of had to chuckle at the 20k IOPS AWS instance, given that even a consumer $100-200 NVMe gives ~1 million+ IOPS these days. I suspect that now we have PCIe 5.0 NVMes this will go even higher.
I always do wonder how many issues "arbitrary" cloud limits on things like this cause. I'm sure that async IO is very helpful anyway, but I bet on a 1 million IOPS NVMe it is nowhere near as important.
We're effectively optimising critical infrastructure tech for ~2010 hardware, because that's when big cloud got going and there have been so few price reductions since then relative to the underlying hardware costs.
Obviously a consumer NVMe is not "enterprise", but my point is that we are 3+ orders of magnitude off in performance between cheap consumer hardware and very expensive 'enterprise' AWS/big cloud costs.
> had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1million+ IOPS these days
The IOPS figure usually hides the fact that it is not a single I/O that is really fast, but a collection of them.
More IOPS is generally best achieved by reducing the latency of a single operation, and average latency is what actually contributes to the "fast query" experience, because the next IO often branches off the result of the last one (as in an index or filter lookup).
As more and more disk-to-CPU connectivity goes over the network, we can deliver large IOPS numbers even with very high latencies (by spreading the data across hundreds of SSDs and routing it fast), because with network storage we pay a huge latency cost for durability of the data, simply due to location diversification.
Every foot is a nanosecond, approximately.
The tradeoff is worth it, though, because you don't need clusters to deal with a bad CPU or two - just stop & start to fix memory/CPU errors.
The AWS model pushes the latency problem to the customer, and we see it in the IOPS measurements - but what we're really seeing is latency x queue depth, not the hardware capacity.
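A rough way to see the "latency x queue depth" point is Little's law (the numbers below are illustrative assumptions, not measurements):

```c
/* Little's law applied to storage: sustained IOPS equals the number
 * of I/Os kept in flight (queue depth) divided by per-I/O latency.
 * A device can therefore advertise a big IOPS number while each
 * individual, dependent read stays slow. */
static double sustained_iops(double queue_depth, double latency_seconds)
{
    return queue_depth / latency_seconds;
}
```

At a cloud-ish 1 ms per I/O, queue depth 32 already yields 32,000 IOPS; a local NVMe at 100 us and queue depth 128 sustains 1,280,000. But a pointer-chasing query that can only keep one I/O in flight sees just the latency.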
Yep, I find cloud storage performance to be quite frustrating, but it's the reality for many production database deployments I've seen.
It's worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since it's so much more efficient at issuing I/Os and reducing syscall overhead (for io_uring).
Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailing list thread I could find that shows a 2x and better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...
You probably already know this, but I will say it anyway. Cloud services like AWS are not succeeding in enterprise because of their hardware. They succeed because, in enterprise, CIOs and CTOs want something that is known, has a brand, and that everyone else uses. It's like the old adage "No one ever got fired for buying IBM". Now it is "No one gets fired for hosting with AWS, no matter how ridiculous the cost is for the corresponding feature".
Everything in the cloud is throttled: network, IOPS, CPU. And probably implemented incorrectly. AWS makes billions whether the customer's infrastructure is great or terrible. I found that anything smaller than an AWS EC2 m5.8xlarge had noticeably bad performance on loaded servers (Windows). The list price for that would be about $13k per year, but most organizations get lower than list prices.
This also applies to services, not only compute. Anything associated with Microsoft Office 365 Exchange, scripts may run 10x slower against the cloud using the MSOnline cmdlets. It's absolute insanity, I used to perform a dump of all mailbox statistics that would take about one hour, it could take almost 24 hours against Office 365. You have to be careful to not use the same app or service account in multiple places, because the throttle limits are per-account.
I noticed this with bandwidth. AWS price for bandwidth: $90.00/TB after 0.1TB/month. Price everywhere else (low cost VPSes): $1.50/TB after 1-5TB/month. Price some places (dedicated servers): $0.00/TB up to ~100TB/month, $1.50/TB after.
You pay 60 times the price for the privilege of being on AWS.
Bandwidth is just their most egregious price difference. The servers are more expensive too. The storage is more expensive (except for Glacier). The serverless platforms are mostly more expensive than using a cheap server.
There are only two AWS products that I understand to have good prices: S3 Glacier (and only if you never restore!), and serverless apps (Lambda / API Gateway) if your traffic is low enough to fit in the Always Free tier. For everything else, it appears you get ripped off by using AWS.
FWIW, using the same approach as in the article, i.e. io_uring, is one of the few ways to actually get anywhere close to that 1 million, so it is not as if they are competing concerns.
Is io_uring still plagued by the security issues its use enables? Or have those largely been fixed? My understanding was that many Linux admins (or even distros, by default?) were disabling io_uring.
Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.
One should evaluate the risk according to their specific use case.
It can be a good idea to disable it if you run untrusted workloads (eg: other people's containers sharing the same kernel), but if you have a kernel on a machine (virtual or real) dedicated to your own workload you can pretty much keep using io_uring. There are other technologies to enforce security (eg: SELinux and similar).
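On newer kernels (Linux 6.6+) there is also a middle ground between on and off, via sysctl; a sketch:

```
# /etc/sysctl.d/99-io_uring.conf (Linux 6.6+; a sketch)
# 0 = io_uring fully enabled (the default)
# 1 = restricted to privileged processes and members of the group
#     named by kernel.io_uring_group
# 2 = io_uring creation disabled entirely
kernel.io_uring_disabled = 1
```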
Are there good performance comparisons between Postgres, MariaDB and Percona? I'm really curious at this point about the cases in which each of those databases shines.
Apples to apples, Postgres might lose, but that'd be tying both hands behind its back first.
Remember that Postgres's feature set is far larger than those alternatives. If you can use a range with an exclusion constraint, an unnest with an array, or the like, you'll be seeing Postgres leave the alternatives in the dust.
Imagine writing a benchmark comparing programming languages, but the benchmark only includes idioms that all tested languages shared in common. Wouldn't be a fair comparison, would it?
I recently read a great article exploring what would change if they were to switch from processes to threads for each connection. Running a connection pooler didn't seem so bad to me after reading it. https://medium.com/@tusharmalhotra_81114/why-postgresql-choo...
Do you know how well PgBouncer works with prepared statements nowadays? We are slowly migrating our clusters to pg16 and the new bouncer, and feel unsure about the stability and reliability of its prepared statement support.
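For what it's worth, PgBouncer 1.21 added protocol-level prepared statement support in transaction pooling mode, controlled by a setting along these lines (a sketch; verify the default against your version's docs):

```
; pgbouncer.ini (PgBouncer 1.21+; a sketch)
[pgbouncer]
pool_mode = transaction
; number of prepared statements tracked per server connection;
; 0 disables protocol-level prepared statement support
max_prepared_statements = 200
```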
I remember back in the day when mysql vs postgres was a real debate with mysql being the popular fave. Genuinely interesting to see that postgres eventually won.
It's pretty disappointing that simply using O_NONBLOCK doesn't work as expected on regular files. It would be such a simple and portable mechanism to do async I/O using the same interfaces that we already use for networking.
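That behavior is easy to observe; a minimal sketch (hypothetical helper) showing the flag being accepted but ignored:

```c
#include <fcntl.h>
#include <unistd.h>

/* On a regular file O_NONBLOCK is accepted by open() but ignored for
 * data transfer: read() copies the bytes (blocking on disk if they
 * are not cached) instead of failing with EAGAIN the way a socket or
 * pipe in non-blocking mode would. */
static ssize_t nonblocking_file_read(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);   /* returns data, not EAGAIN */
    close(fd);
    return n;
}
```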
Insane how long it took Postgres to get async I/O right - feels like all the big changes spark a million little tradeoffs, right? Do you think stuff like io_uring is finally going to push Postgres to catch up with the clouds?
Aside from a few problems in specific Linux kernel versions, it works great.
I know Windows has IOCP and also now an IORing implementation of its own (Less familiar with macOS capabilities other than POSIX AIO).
https://learn.microsoft.com/en-us/windows/win32/api/ioringap...
A comparison of Windows IoRing vs Linux io_uring: https://windows-internals.com/ioring-vs-io_uring-a-compariso...
Given that they still target it, there must be a user base. Does anyone know the statistics of usage by platform? Anyone here use it?
Genuinely curious - Windows backend dev is something I know very little about.
Is FreeBSD doing anything significantly different and/or better?
They will eventually figure out using b-trees for tables too.
Why don't you set (per table) the autovacuum_analyze_scale_factor parameter (or autovacuum_analyze_threshold) then let AUTOVACUUM handle this?
Azure managed disk pricing: https://azure.microsoft.com/en-us/pricing/details/managed-di...
Increasing vCPUs also opened up more disk slots to try to improve the situation with disk striping.
In the face of sustained writes? For how long?
And we plan to have 18 out as quickly as we did 17: on the day of release.