Postgres is such a cool project; I have so much respect for its maintainers and community! For me it's the second most impactful OSS project in the business tech world, behind Linux itself.
A real public good to be cherished and praised.
You can get no better recommendation than a phrase I once heard at a presentation by Richard Hipp, primary author of SQLite. As he was explaining how he analyzed new features and design decisions, he said some of his guidance is: "What would Postgres do?"
I recently started an ML project using text data, and the choice was between MySQL and Postgres. Having looked at the respective features and pros and cons, the choice was immediately obvious. Also, with pgvector and https://postgresml.com available, the choice for Postgres was even easier.
On linux there also is preadv2(..., RWF_NOWAIT) which can be used to do optimistic non-blocking read from the page cache. That might be useful for io_method = worker to shave off a bit of latency. Try reading on the main thread with NOWAIT and only offload to a worker thread when that fails.
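A minimal sketch of that idea (hypothetical helper; assumes Linux 4.14+ and glibc 2.26+ for preadv2/RWF_NOWAIT):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Optimistic fast path: try a non-blocking page-cache read first;
 * RWF_NOWAIT makes preadv2() fail with EAGAIN instead of blocking on
 * disk I/O.  In real AIO code the fallback branch would hand the read
 * to an I/O worker rather than, as here, simply reissuing it
 * synchronously. */
static ssize_t read_block(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n >= 0)
        return n;                    /* served from the page cache */
    if (errno != EAGAIN && errno != EOPNOTSUPP)
        return -1;                   /* real error */

    /* Slow path: data not cached (or NOWAIT unsupported here). */
    return preadv2(fd, &iov, 1, off, 0);
}
```

As the replies note, the catch is that a miss pays for two page-cache lookups.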
FWIW, I played with that - unfortunately it seems that the overhead of doing twice the page cache lookups is a cure worse than the disease.
Note that we do not offload IO to workers when doing I/O that the caller will synchronously wait for, just when the caller actually can do IO asynchronously. That reduces the need to avoid the offload cost.
It turns out, as some of the results in Lukas' post show, that the offload to the worker is often actually beneficial, particularly when the data is in the kernel page cache - it parallelizes the memory copy from kernel to userspace and Postgres' checksum computation. Particularly on Intel server CPUs, which have had pretty mediocre per-core memory bandwidth over the last ~decade, memory bandwidth turns out to be a bottleneck for page cache access and checksum computations.
Update: Most of the comments below seem to be missing the fact that Windows now also has an IORing implementation, as I mentioned above. Comparison article here: https://windows-internals.com/ioring-vs-io_uring-a-compariso...
I am not a Windows guy but I (with help) managed to get IOCP working for this in a basic prototype. Will share publicly soon. I also sketched out an IoRing version (if you are interested in helping debug and flesh that out let me know!).
Main learnings: the IOCP version can't do asynchronous flush! Which we want. The IoRing version can! But it can't do scatter/gather, AKA vector I/O, yet! Which is an essential feature for a buffer pool implementation. So actually I am basically waiting for IoRing to add support for that before taking it too seriously (I can see they are working on it because the ops are present in an enum, it's just that the build functions are missing).
So my guess is that in a year or so we should be able to run all PostgreSQL disk I/O through IoRing on Windows. Maybe?
Another complication is that it really wants to be multithreaded (consuming completions for IOs started in another process requires a lot of hoop jumping; I made it work, but...). This will resolve itself naturally with ongoing work to make PostgreSQL multithreaded.
The next problem is that IoRing doesn't support sockets! So in future work on async networking (early prototypes exist) we will likely also need IOCP for that part.
It depends on the I/O method - as described in the article, "io_uring" is only available on Linux (and requires building with liburing, as well as io_uring being enabled in the kernel), but the default (as of beta1) is actually "worker", which works on any operating system.
The "worker" method uses a dedicated pool of I/O worker processes that run in the background, and whilst not as performant as io_uring in our benchmark, did clearly outperform the "sync" method (which is the same as what Postgres currently has in 17 and older).
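For reference, a sketch of how this is selected in postgresql.conf (settings as documented for PostgreSQL 18; defaults worth double-checking against your build):

```
# postgresql.conf - PostgreSQL 18 AIO settings (a sketch)
io_method = worker       # default: pool of background I/O worker processes
#io_method = io_uring    # Linux only; server built with --with-liburing
#io_method = sync        # synchronous I/O, the pre-18 behaviour
io_workers = 3           # size of the worker pool (default 3)
```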
Sounds like this feature is based on io_uring which is a Linux feature. I would be surprised if they implemented async io on Windows before they would on Linux given the user/deployment base being very Linux-heavy.
A lot of work has gone into FreeBSD's aio(4) so it will be interesting to see how that works, because it doesn't have the drawbacks of Linux/glibc aio.
BTW I have patches for PostgreSQL AIO on FreeBSD, which I will propose for v19. It works pretty well! I was trying to keep out of Andres's way for the core architectural stuff and basic features, i.e. I didn't want to overload the pipes with confusing new topics for v18 :-)
Thank you for the effort that went into getting this committed. I remember seeing the first discussions about async I/O (and using io_uring) like 6 or 7 years ago. Amazing amount of work to get the design right.
Looking forward to the other places that async I/O can be used in future Postgres releases now that the groundwork is done.
Very nicely written post! I'd love to start running these in production on NVMe and hope it's something major cloud providers start to offer ASAP. The performance gains are _extremely_ attractive.
I recently deployed Postgres on a dedicated Hetzner EX-44 server (20 cores, 64GB RAM, 2x 512GB NVMe SSDs in RAID 1) for €39/month. The price-to-performance ratio is exceptional, providing enterprise-level capacity at a fraction of typical cloud costs.
For security, I implemented Tailscale, which adds only ~5ms of latency while completely eliminating public network exposure - a worthwhile tradeoff for the significant security benefits.
My optimization approach includes:
- Workload-specific configuration generated via PGTune (https://pgtune.leopard.in.ua/)
- Real-time performance monitoring with PgHero for identifying bottlenecks
- Automated VACUUM ANALYZE operations scheduled via pg_cron targeting write-heavy tables, which prevents performance degradation and helps me sleep soundly
- A custom CLI utility I built for ZSTD-compressed backups that achieves impressive compression ratios while maintaining high throughput, with automatic S3 uploading: https://github.com/overflowy/pgbackup
This setup has been remarkably stable and performant, handling our workloads with substantial headroom for growth.
If I were you, I would absolutely use another backup utility (at least in addition to your own) - Barman, pgBackRest, etc.
You are just wrapping pg_dump, which is not a full-featured backup solution. Great for a snapshot...
Use some of the existing tools and you get point-in-time recovery, easy restores to hot standbys for replication, a good failover story, backup rotations, etc.
I sort of had to chuckle at the 20k IOPS AWS instance, given that even a consumer $100-200 NVMe gives ~1 million+ IOPS these days. I suspect that now we have PCIe 5.0 NVMes this will go even higher.
I always do wonder how many issues "arbitrary" cloud limits on things like this cause. I'm sure that async IO is very helpful anyway, but I bet on a 1 million IOPS NVMe it is nowhere near as important.
We're effectively optimising critical infrastructure tech for ~2010 hardware, because that's when big cloud got going and there have been so few price reductions since then relative to the underlying hardware costs.
Obviously a consumer NVMe is not "enterprise", but my point is that we are 3+ orders of magnitude off in performance between cheap consumer hardware and very expensive 'enterprise' AWS/big cloud costs.
> had to chuckle at the 20k IOPS AWS instance, given even a consumer $100-200 NVMe gives ~1million+ IOPS these days
The IOPS figure usually hides the fact that it is not a single I/O that is really fast, but a collection of them.
More IOPS is generally best achieved by reducing the latency of a single operation, and average latency is what actually contributes to the "fast query" experience, because the next IO often branches off the result of the last one (as in an index or filter lookup).
As more and more disk-to-CPU connectivity goes over the network, we can deliver large IOPS numbers even with very high latencies (by spreading the data across hundreds of SSDs and routing it fast), because with network storage we pay a huge latency cost for durability of the data, simply due to location diversification.
Every foot is a nanosecond, approximately.
The tradeoff is worth it, though, because you don't need clusters to deal with a bad CPU or two - just stop & start to fix memory/CPU errors.
The AWS model pushes the latency problem to the customer, and we see it in the IOPS measurements - but what we're really seeing is latency x queue depth, not the hardware capacity.
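A rough way to see the "latency x queue depth" point is Little's law (the numbers below are illustrative assumptions, not measurements):

```c
/* Little's law applied to storage: sustained IOPS equals the number
 * of I/Os kept in flight (queue depth) divided by per-I/O latency.
 * A device can therefore advertise a big IOPS number while each
 * individual, dependent read stays slow. */
static double sustained_iops(double queue_depth, double latency_seconds)
{
    return queue_depth / latency_seconds;
}
```

At a cloud-ish 1 ms per I/O, queue depth 32 already yields 32,000 IOPS; a local NVMe at 100 us and queue depth 128 sustains 1,280,000. But a pointer-chasing query that can only keep one I/O in flight sees just the latency.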
Yep, I find cloud storage performance to be quite frustrating, but it's the reality for many production database deployments I've seen.
It's worth noting that even on really fast local NVMe drives the new asynchronous I/O work delivers performance benefits, since it's so much more efficient at issuing I/Os and reducing syscall overhead (for io_uring).
Andres Freund (one of the principal authors of the new functionality) did a lot of benchmarking on local NVMe drives during development. Here is one mailing list thread I could find that shows a 2x and better benefit with the patch set at the time: https://www.postgresql.org/message-id/flat/uvrtrknj4kdytuboi...
You probably already know this, but I will say it anyway. Cloud services like AWS are not succeeding in enterprise because of their hardware. They succeed because, in enterprise, CIOs and CTOs want something that is known, has a brand, and that everyone else uses. It's like the old adage "No one ever got fired for buying IBM". Now it is "No one gets fired for hosting with AWS, no matter how ridiculous the cost is for the corresponding feature".
Everything in the cloud is throttled: network, IOPS, CPU. And probably implemented incorrectly. AWS makes billions whether the customer's infrastructure is great or terrible. I found that anything smaller than an AWS EC2 m5.8xlarge had noticeably bad performance on loaded servers (Windows). The list price for that would be about $13k per year, but most organizations get lower than list prices.
This also applies to services, not only compute. Anything associated with Microsoft Office 365 Exchange, scripts may run 10x slower against the cloud using the MSOnline cmdlets. It's absolute insanity, I used to perform a dump of all mailbox statistics that would take about one hour, it could take almost 24 hours against Office 365. You have to be careful to not use the same app or service account in multiple places, because the throttle limits are per-account.
I noticed this with bandwidth. AWS price for bandwidth: $90.00/TB after 0.1TB/month. Price everywhere else (low cost VPSes): $1.50/TB after 1-5TB/month. Price some places (dedicated servers): $0.00/TB up to ~100TB/month, $1.50/TB after.
You pay 60 times the price for the privilege of being on AWS.
Bandwidth is just their most egregious price difference. The servers are more expensive too. The storage is more expensive (except for Glacier). The serverless platforms are mostly more expensive than using a cheap server.
There are only two AWS products that I understand to have good prices: S3 Glacier (and only if you never restore!), and serverless apps (Lambda / API Gateway) if your traffic is low enough to fit in the Always Free tier. For everything else, it appears you get ripped off by using AWS.
FWIW, using the same approach as in the article, i.e. io_uring, is one of the few ways to actually get anywhere close to that 1 million, so it is not as if they are competing concerns.
Is io_uring still plagued by the security issues its use enables? Or have those largely been fixed? My understanding was that many Linux admins (or even distros, by default?) were disabling io_uring.
Disabling io_uring because “guy on the internet said so” or “$faang_company says so” is beyond dumb.
One should evaluate the risk according to their specific use case.
It can be a good idea to disable it if you run untrusted workloads (eg: other people's containers sharing the same kernel), but if you have a kernel on a machine (virtual or real) dedicated to your own workload you can pretty much keep using io_uring. There are other technologies to enforce security (eg: SELinux and similar).
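On newer kernels (Linux 6.6+) there is also a middle ground between on and off, via sysctl; a sketch:

```
# /etc/sysctl.d/99-io_uring.conf (Linux 6.6+; a sketch)
# 0 = io_uring fully enabled (the default)
# 1 = restricted to privileged processes and members of the group
#     named by kernel.io_uring_group
# 2 = io_uring creation disabled entirely
kernel.io_uring_disabled = 1
```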
Are there good performance comparisons between Postgres, MariaDB and Percona? I'm really curious at this point about the cases in which each of those databases shines.
Apples to apples, Postgres might lose, but that'd be tying both hands behind its back first.
Remember that Postgres's feature set is far larger than those alternatives. If you can use a range with an exclusion constraint, an unnest with an array, or the like, you'll be seeing Postgres leave the alternatives in the dust.
Imagine writing a benchmark comparing programming languages, but the benchmark only includes idioms that all tested languages shared in common. Wouldn't be a fair comparison, would it?
I recently read a great article exploring what would change if they were to switch from processes to threads for each connection. Running a connection pooler didn't seem so bad to me after reading it. https://medium.com/@tusharmalhotra_81114/why-postgresql-choo...
Do you know how well PgBouncer works with prepared statements nowadays? We are slowly migrating our clusters to pg16 and the new bouncer, and feel unsure about the stability and reliability of its prepared statement support.
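For what it's worth, PgBouncer 1.21 added protocol-level prepared statement support in transaction pooling mode, controlled by a setting along these lines (a sketch; verify the default against your version's docs):

```
; pgbouncer.ini (PgBouncer 1.21+; a sketch)
[pgbouncer]
pool_mode = transaction
; number of prepared statements tracked per server connection;
; 0 disables protocol-level prepared statement support
max_prepared_statements = 200
```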
I remember back in the day when mysql vs postgres was a real debate with mysql being the popular fave. Genuinely interesting to see that postgres eventually won.
It's pretty disappointing that simply using O_NONBLOCK doesn't work as expected on regular files. It would be such a simple and portable mechanism to do async I/O using the same interfaces that we already use for networking.
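That behavior is easy to observe; a minimal sketch (hypothetical helper) showing the flag being accepted but ignored:

```c
#include <fcntl.h>
#include <unistd.h>

/* On a regular file O_NONBLOCK is accepted by open() but ignored for
 * data transfer: read() copies the bytes (blocking on disk if they
 * are not cached) instead of failing with EAGAIN the way a socket or
 * pipe in non-blocking mode would. */
static ssize_t nonblocking_file_read(const char *path, void *buf, size_t len)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);   /* returns data, not EAGAIN */
    close(fd);
    return n;
}
```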
Insane how long it took Postgres to get async I/O right - feels like all the big changes spark a million little tradeoffs, right? Do you think stuff like io_uring is finally going to push Postgres to catch up with the clouds?
Aside from a few problems in specific Linux kernel versions, it works great.
I know Windows has IOCP and also now an IORing implementation of its own (Less familiar with macOS capabilities other than POSIX AIO).
https://learn.microsoft.com/en-us/windows/win32/api/ioringap...
A comparison of Windows IoRing vs Linux io_uring: https://windows-internals.com/ioring-vs-io_uring-a-compariso...
Given that they still target it, there must be a user base. Does anyone know the statistics of usage by platform? Anyone here use it?
Genuinely curious - Windows backend dev is something I know very little about.
Is FreeBSD doing anything significantly different and/or better?
They will eventually figure out using b-trees for tables too.
Why don't you set (per table) the autovacuum_analyze_scale_factor parameter (or autovacuum_analyze_threshold) then let AUTOVACUUM handle this?
Azure managed disk pricing: https://azure.microsoft.com/en-us/pricing/details/managed-di...
Increasing vCPUs also opened up more disk slots to try to improve the situation with disk striping.
In the face of sustained writes? For how long?
And we plan to have 18 out as quickly as we did 17: on the day of release.