We want to write a paper on the vast extensions ecosystem for Postgres at some point. It's on my TODO list for 2023.
Having so many plug-ins/add-ons available is unique to Postgres and due to Stonebraker's original vision of including extensibility as a first-class design principle in the DBMS.
I want to ask a question to the engineers with experience in WebAssembly.
Last year I invited a student to compile ClickHouse to WebAssembly. But the task turned out to be harder than expected, apparently due to poor 64-bit support in WASM.
The best performance is achieved with the combined method, named 'pread_threadpool'. It uses the `preadv2` syscall to check if the file is in the page cache.
And there is a pull request adding io_uring: https://github.com/ClickHouse/ClickHouse/pull/38456 - Unfortunately, it's unfinished and cannot be merged because the CI has found bugs. Nevertheless, the advantages of io_uring for analytical databases are negligible.
> The best performance is achieved with the combined method, named 'pread_threadpool'.
I found the same when testing my own database engine. `pread` in a suitably well-designed threadpool outperformed every other option for NVMe random-access 4k reads not in cache.
Variations on how many and which types of locks, work queuing and ordering, and when to start and stop threads and control their number made a considerable difference as well. In certain system configs, `CLONE_IO` made a difference. I use tiny (smaller than 1 page) userspace stacks with `clone`-based threads, and dynamic auto-tuning of the number of blocked and executing threads.
That's `preadv2` with the `RWF_NOWAIT` flag. It proved slower when I tried it in my db engine, which was a bit of a surprise. `RWF_NOWAIT` is used to read data synchronously from cache, before passing it to the thread pool to read asynchronously if not in cache.
I expected an average speedup when there are many cache hits, so I was surprised and disappointed to find the `preadv2(..,RWF_NOWAIT)` syscall to be slow enough that it was usually a performance loss overall to use it, at least on the kernel versions and hardware I tested on (a reasonably fast software-RAID NVMe).
A nicer way to look at that is that the auto-tuning thread pool was sufficiently sleek and fast that the asynchronous read was fast enough to make the benefit of a synchronous path too small to be worth it.
One not in your list is mmap_threadpool. I found for many workloads, that was faster than pread_threadpool, and of course it does a better job of sharing memory with the kernel. Unlike synchronous mmap, it is effectively an asynchronous read, where the thread takes the page fault instead of a syscall, so the main thread is not blocked and device I/O queues are kept full enough.
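A minimal sketch of that mmap_threadpool idea (illustrative, not the commenter's actual engine code): the file is mapped once, and pool workers take the page faults so the calling thread never blocks on device I/O. The class name and worker count are placeholders.

```python
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

class MmapReader:
    def __init__(self, path, workers=16):
        self.fd = os.open(path, os.O_RDONLY)
        size = os.fstat(self.fd).st_size
        # One shared read-only mapping; pages are shared with the kernel cache.
        self.mm = mmap.mmap(self.fd, size, prot=mmap.PROT_READ)
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def read_async(self, offset, length):
        # Slicing the map inside the worker is what turns a blocking page
        # fault into an asynchronous read from the caller's point of view.
        return self.pool.submit(lambda: self.mm[offset:offset + length])

    def close(self):
        self.pool.shutdown()
        self.mm.close()
        os.close(self.fd)
```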
Compared with `pread_threadpool` equivalent (in my db engine), I found `io_uring` was sometimes similar, sometimes slower, never better, so not the preferred default. It makes sense that it could almost reach the devices' I/O capability, though with less control over queue depths than doing it directly in threads.
But I was surprised that the "zero-syscall" queues of `io_uring` didn't provide a noticeable improvement over `pread` syscalls, given that I measured a considerable baseline overhead for all syscalls (`pread`, `preadv2`, `futex`, etc.), and that overhead had a significant throughput effect in my `pread_threadpool` equivalent (because the NVMe devices were fast enough for syscall overhead to affect throughput).
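One rough way to observe that baseline syscall overhead is to time back-to-back 4 KiB preads of a page that is certainly cached. This is a sketch; the resulting figure includes Python interpreter call overhead on top of the syscall itself, so a C harness would isolate the raw cost better:

```python
import os
import tempfile
import time

# Create a one-page file and warm the page cache.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(4096))
    path = f.name
fd = os.open(path, os.O_RDONLY)
os.pread(fd, 4096, 0)  # warm the page cache

N = 50_000
t0 = time.perf_counter_ns()
for _ in range(N):
    os.pread(fd, 4096, 0)  # always a cache hit, so this times syscall cost
per_call_ns = (time.perf_counter_ns() - t0) / N

os.close(fd)
os.remove(path)
```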
> - Unfortunately, it's [io_uring] unfinished and cannot be merged because the CI has found bugs.
I found what I think is a subtle memory barrier bug in `liburing`. If the ClickHouse implementation is using `liburing` or copying its methods, it's conceivable that may be the cause of the hangs seen in CI. There are also kernel versions where `io_uring` was buggy, evidenced by changes in later kernels to fix bugs.
There are a number of ideas in the database space that the industry is adopting across the board:
- Separation of storage and compute (Neon, AlloyDB, Aurora). Every cloud database should build one. It's a big undertaking, but the benefits are undeniable.
- Good query processor for analytics (Snowflake, Velox, Singlestore)
- Open source. Especially in OLTP, open source == trust.
- HTAP. Can run mixed workloads (Singlestore, Unistore): both OLTP (apps) and OLAP (reporting). This has always been a dream, but we still live in a world of dedicated systems, e.g. Snowflake for OLAP and Postgres for OLTP.
- Shared nothing sharding (Vitess). This is the most controversial as you lose compatibility with the mothership (MySQL for Vitess). So it's unclear this will be the dominant architecture in the future. I think the world may get to "dynamic sharding" where storage stays separate and compute can be multinode and the user can easily and instantly change the number of nodes.
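To make the resharding pain concrete (an illustrative sketch, not how Vitess actually routes): with naive hash-mod routing, changing the node count remaps most keys, which is one reason consistent hashing and the "dynamic sharding" described above are attractive.

```python
import hashlib

def shard_mod(key, n_nodes):
    """Naive routing: hash the key and take it modulo the node count."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"user:{i}" for i in range(10_000)]
# Fraction of keys that land on a different node after growing 4 -> 5 nodes.
moved = sum(shard_mod(k, 4) != shard_mod(k, 5) for k in keys) / len(keys)
# Roughly 80% of keys move, versus ~1/N for consistent hashing; physically
# moving that much data is why "easily and instantly" changing the node
# count is hard unless storage stays separate from compute.
```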
The separation of storage and compute is controversial, though it is definitely convenient for cloud operators. It is literally a step backward in terms of available storage bandwidth, and many modern database architectures can very effectively use all the storage bandwidth available in modern systems. There is not enough network bandwidth, and won't be for the foreseeable future, for running storage over networks to be efficient. Database workloads are also starting to move to edge environments, which makes this even worse. In practice, the resource mismatch leaves a lot of compute sitting idle.
Another issue with this model is that it tends to offer poor ratios of bandwidth to storage size. This is at odds with the desire to run mixed workloads in a single system.
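Some back-of-envelope arithmetic for the bandwidth argument above. All figures are assumptions for illustration, not measurements: compare eight local NVMe drives at roughly 7 GB/s each against a bonded pair of 25 GbE links to disaggregated storage.

```python
GBIT_PER_GB = 8  # gigabits per gigabyte

# Assumed hardware: 8 local NVMe drives x ~7 GB/s sequential read each.
local_gbps = 8 * 7 * GBIT_PER_GB   # = 448 Gbit/s of local storage bandwidth

# Assumed network path: 2 x 25 GbE to remote/disaggregated storage.
network_gbps = 2 * 25              # = 50 Gbit/s

ratio = local_gbps / network_gbps  # local box has ~9x the storage bandwidth
```

Under these (hypothetical) numbers, the network path delivers about a ninth of what the local drives can, which is the gap the comment is pointing at.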
Regarding separation of storage and compute: do the DBs you mention enable merging the application backend code and the DB compute node into a single process? Meaning I could deploy my application to the DB compute node and use its runtime, or have the compute node available as a component I can import into my backend code as a library. I think this is one powerful thing about Datomic (and maybe XTDB?).
I mentioned it recently[0], but this looks like a very good topic to plug our new database, Seafowl, that we released last year [1]. It also uses Apache DataFusion (like IOx) and separates storage and compute (like Neon, Snowflake etc) but is designed for client-side Web apps to run analytical SQL queries over HTTP (using semantics that make the query results cacheable by browser caches and CDNs). This makes it really useful for things like interactive visualizations or dashboards.
We're currently doing a lot of work at Splitgraph to reposition the product around this "analytics at the edge" use case, with usage-based billing, eventually moving our query execution from PostgreSQL to Seafowl.
You can use ClickHouse as a stateless query engine to process externally hosted datasets.
This way you just publish a MergeTree table to an S3 bucket, or as a set of static files on an HTTP server, and then attach it to an empty ClickHouse server.
My takes for where databases are going in 2023 and beyond:
Separation of storage and compute is clearly the right direction for cloud databases. Pay for what you use. A cache between object storage and compute will paper over the differences and enable supporting workloads that match yesterday's and today's typical application architectures.
Still waiting for a database architecture that embraces the reality of distributed systems / concurrent mutation and exposes the right set of tools to reliably hit the theoretically minimum tradeoffs across a wide range of applications. CRDTs for example are making progress here but have not been generalized enough yet.
At some point transactional workload volume will grow to what we call "big data" today, and we'll be forced to use more storage-efficient columnar data representations for OLTP. I'm curious to see where 'build your own database' kits like DataFusion and efficient dataframe tools like polars/arrow2 can take us.
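As a concrete instance of the CRDT-style tooling mentioned above, here is a minimal grow-only counter (G-Counter) sketch: each replica increments only its own slot, and merge is an elementwise max. Merges are therefore commutative, associative, and idempotent, so replicas converge regardless of delivery order.

```python
class GCounter:
    """Grow-only counter CRDT (illustrative sketch)."""

    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> that replica's local count

    def increment(self, n=1):
        # A replica only ever bumps its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def value(self):
        # The observed value is the sum over all known replicas.
        return sum(self.counts.values())

    def merge(self, other):
        # Elementwise max: safe to apply in any order, any number of times.
        for rid, c in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), c)
```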
Surreal looks good; I've checked it out a few times. Since it's written in Rust, I was wondering: if it were compiled to WASM, and given that Surreal supports offline/online data sync, could it be used in the browser and in apps as an offline solution? Like PouchDB/CouchDB? That's what most solutions are missing for the projects we do.
I am writing a Python-based SQL query engine: https://github.com/marsupialtail/quokka. My personal goal is to get Andy Pavlo to mention it in his year-end blogs.
I agree with many of the points made in the blog by Andy. Writing a distributed database has become way easier due to open source components like Ray, Arrow, Velox, DuckDB, SQLGlot etc.
I personally believe we will see a switch from JVM-based technologies to Rust/C-based ones with Python wrappers.
While the author seems to be an expert in production or web scale databases, I feel like this was a big year for SQLite also! Many people including myself are looking at it for server side use cases.
I'm the founder of RisingWave Labs. Glad to know that our company was mentioned in the blog.
Technology-wise, I agree with Nikita that storage-compute decoupling is the trend. Serverless will also be a main focus.
Market wise, I observed several trends:
* Streaming. Streamlining the data pipeline is a new trend. RisingWave and Materialize are both addressing this market. They will probably need to figure out how to compete against established solutions like ksqlDB and Apache Flink.
* Real-time. There are probably 20+ vendors selling real-time OLAP databases. They all claim they are cheaper, faster, and better than Snowflake/Redshift. To name a few: Druid, Pinot, ClickHouse, StarRocks, etc. Let's see who will be the winner in 2023.
* Small data. The biggest winner in this domain is MotherDuck, a startup commercializing DuckDB. Redshift/Snowflake are overkill for most companies that do not own large amounts of data. MotherDuck may be the right choice.
* Hybrid. Several companies have started selling hybrid databases (mainly HTAP for now): Google AlloyDB, Snowflake Unistore, SingleStore, PingCAP TiDB+TiFlash, etc.
In 2023, we will also watch how Databricks challenges Snowflake in the data warehouse domain.
Happy to find this blog; I wish there was more talk about NewSQL databases. I've personally seen how much cost goes into migrating from traditional RDBMS to sharded RDBMS setups, and I wonder why people don't go with these scalable systems from the start now that they exist.
In fact, a large part of the work of many emerging database systems, including Doris, is to fill in the functionality that is missing compared with traditional databases (like Oracle), so as to help users migrate smoothly from the original system to the new one, such as the CTEs mentioned by zX41ZdbW.
At the same time, most projects emphasize their performance and cost-effectiveness, because this is the first factor in user selection. For example, Doris and SelectDB (the company commercializing Doris) also participated in ClickBench to demonstrate their performance advantages.
But the implementations of the various projects are actually converging (Doris' execution engine also draws on ClickHouse; thanks for open-sourcing it), so we can only look for advantages in specific scenarios, such as high concurrency, time series, etc.
On the other hand, performance and cost-effectiveness do not really solve users' problems, so we are also thinking about what users really need besides technology.
Another interesting point is the trend of "de-distribution". ClickHouse and DuckDB can run well on a single multi-core machine and still meet the needs of many users. The reliability, availability, and scalability problems solved by "distribution" can all be handled by the PaaS offerings of cloud vendors.
Are there any good recommendations for a lecture or series of lectures that go over Databases from a theoretical perspective but also do a practical survey of the various DB systems and technologies out there?
Oh great, it seems like this year he made the sarcasm about Larry Ellison a lot easier to pick up. Frankly I was still unsure how he really felt about Larry after his review from last year[1]. After reading this, it confirms to me that he doesn’t like Larry :).
Andy's obsession with mmap is interesting. I don't think it's the worst idea in some cases, like Prometheus: they use mmap to simplify the code and keep only the relevant (hot) parts of the index/data in memory. I guess the fact that the data is immutable helps, but that's the point; it's a good idea for some use cases, and I don't agree with a blanket "mmap bad".
I came back to MariaDB after leaving MySQL about a decade ago.
The JDBC driver is 10x as big, and connections time out after less than 12 hours by default.
Things are not going in the right direction, but on the upside, I have now learned how to install MariaDB manually so installers can't brick my OS like MySQL did back in 2014 (MySQL bricked itself with no way to recover short of reinstalling Windows).
lioeters | 3 years ago:
https://supabase.com/
pg_graphql - https://supabase.com/blog/pg-graphql-v1
pg_jsonschema - https://supabase.com/blog/pg-jsonschema-a-postgres-extension...
Postgres WASM - https://supabase.com/blog/postgres-wasm
..And other open-source features.
gavinray | 3 years ago:
https://github.com/wasmerio/wasmer-postgres
zX41ZdbW | 3 years ago:
Here is a pull request: https://github.com/ClickHouse/ClickHouse/pull/37871 that we had to close because it is unfinished.
I suspect maybe we can get the result with some additional effort...
danielvaughn | 3 years ago:
edit: I'm okay with being downvoted, but I'm honestly asking why, not complaining. Is Planetscale not an interesting database?
zX41ZdbW | 3 years ago:
ClickHouse has switchable I/O engines: `read`, `pread`, `mmap`, and `pread_threadpool`.
More details here: https://github.com/ClickHouse/ClickHouse/pull/26791
There was also extensive performance research on HDD, SSD, and Optane, comparing various methods: https://clickhouse.com/blog/a-journey-to-io_uring-aio-and-mo...
The `mmap` method is still useful. For example, for data import. See here: https://github.com/ClickHouse/ClickHouse/pull/43927
zX41ZdbW | 3 years ago:
ClickHouse [1] is open source, separates storage and compute, and has the best performance.
It is a source of inspiration for many emerging database engines: Velox, Doris, DuckDB, DataFusion...
[1] https://github.com/ClickHouse/ClickHouse
Disclaimer: I work on ClickHouse.
gavinray | 3 years ago:
https://www.youtube.com/watch?v=rES0yzeERns
Also, it's been neat to have mostly wrapped up replacing our old Heroku integration at Hasura with you folks.
swyx | 3 years ago:
trying to see how much you mean it that the industry is "adopting (HTAP) across the board"
carlineng | 3 years ago:
Wow... that exchange is surreal. I wonder if it was some kind of fraternity prank or troll job?
[1] https://www.youtube.com/watch?t=1490&v=uikbtpVZS2s
[0] https://news.ycombinator.com/item?id=34175545
[1] https://seafowl.io
zX41ZdbW | 3 years ago:
ClickHouse also wins in the comparison of serverless engines: https://datastudio.google.com/u/0/reporting/c870737c-e8b6-46...
Discussed here: https://github.com/ClickHouse/ClickHouse/issues/43589#issuec...
mekster | 3 years ago:
RDBMS haven't really evolved for a very long time, only making minor incremental improvements, and consumers still have to turn to complex queries because the underlying principles never change. It's time we got something more modern.
- It has a web UI for handling data.
- You can write logic in JS (the more logic in the DB, the better).
- You can write queries much like SQL, but way more powerful.
- Receive real-time updates.
- Incremental backups.
https://surrealdb.com/
(Many examples on the landing page.)
https://github.com/surrealdb/surrealdb
https://news.ycombinator.com/item?id=32550543
Omie6541 | 3 years ago:
Why is more logic in the DB better?
It's not visible to all devs, we can't put it in git, and we can't write unit tests for it!
valenterry | 3 years ago:
You mean JavaScript? If we are looking to make progress, then we shouldn't use outdated (though widely used) languages like that.
morningman-cmy | 3 years ago:
- https://github.com/apache/doris
saurabhnanda | 3 years ago:
Didn't get the closing remarks about Larry Ellison, though. Any explanation?
[1] https://news.ycombinator.com/item?id=29733729
mythhouse | 3 years ago:
Weird that dbt is mentioned under 'big database' funding. Also surprised to see the huge investment in dbt.
einpoklum | 3 years ago:
From his 2021 DBMS year in review:
> Too much blood has been shed in the database community in previous benchmark turf wars ... I even broke up with a girlfriend once because of sloppy benchmark results.
Well, better over that than over east const vs const west.