wuputah's comments

wuputah | 1 year ago | on: Using ClickHouse to scale an events engine

That is an incorrect and baseless accusation: we had nothing to do with "Postgres (tuned)". My commits are only in the `hydra` folder. There are no restrictions on how you set up the benchmark in ClickBench, and the settings we use there are analogous to what we use on our cloud service for a similarly sized instance.

As the linked post points out, the main 'advantage' of the "tuned" benchmark is the indexes, which are tuned specifically to the queries in the benchmark. We do not use indexes in our version of the benchmark, aside from the primary key (which actually provides no performance advantage).

wuputah | 2 years ago | on: Show HN: pgxman – npm for Postgres extensions

> does it work with the existing postgres apt/yum repos?

We only support apt for now but plan to support other package managers in the future. It works with existing Postgres apt packages; we recommend using PGDG, but the default system packages on Debian/Ubuntu work as well.

> Does it work with the postgres Docker image?

Yes; in fact, this is how our `container` feature works. https://docs.pgxman.com/container

wuputah | 2 years ago | on: Show HN: pgxman – npm for Postgres extensions

We plan on accomplishing this by using a container; it's not quite something we have today, but this is good feedback. :)

On Ubuntu/Debian, Postgres doesn't typically work this way, so it's not the way that pgxman works. pgxman works on top of the existing `postgresql` packages and with the existing package manager (apt) in order to install extensions -- which is also how it handles runtime dependencies, whether libraries or even other extensions.

That said, we have a container feature that could be used to effectively isolate Postgres for a single project. Right now there is only one "global" container (per Postgres version) that pgxman will manage for you, but this is just an MVP of the feature. I could definitely see something like `pgxman c dev` or similar that would read a local pgxman pack file (pgxman.yaml) in your project and boot a "local" Postgres just for that project.

The pgxman pack is already a thing and is how the local container config is maintained, but we haven't tied it together in the way described above... yet. For more on both pgxman pack and the container feature, check out our docs.
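For context, a pack file looks roughly like this (field names are illustrative, not authoritative; check the pgxman docs for the actual schema):

```yaml
# Hypothetical pgxman.yaml pack file -- see docs.pgxman.com for the real schema.
apiVersion: v1
postgres:
  version: "15"
extensions:
  - name: pgvector
    version: "0.5.0"
  - name: pg_cron
```

A hypothetical `pgxman c dev` could read this file and boot a per-project container with exactly these extensions installed.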

wuputah | 2 years ago | on: Show HN: Hydra - Open-Source Columnar Postgres

Thanks for calling these out; these are just misunderstandings. We will certainly tweak the language around these.

- Installing the extension itself does not change the default table type, this is only the case on Hydra Cloud and our Docker image.

- "Hydra is not a fork" refers to the fact that Hydra did not fork Postgres; it is an extension. We have put in a lot of effort since forking Citus, but it's not our intent to hide that fact.

- Yes, "Hydra External Tables" is a productization around FDWs; there's more we want to do with it, but it hasn't been our focus lately.

wuputah | 2 years ago | on: Show HN: Hydra 1.0 – open-source column-oriented Postgres

First, we added a bitmask to mark rows as deleted; these rows are filtered out on read. Updates are then implemented as deletions + inserts. We have also added vacuum functions to remove/rewrite stripes that have >20% deleted rows in order to reclaim space and optimize those stripes.
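The scheme can be sketched in a few lines (a toy Python model with hypothetical names; the actual implementation lives in C inside the extension and works on on-disk stripes):

```python
# Toy model of delete/update handling in a columnar stripe:
# a deletion bitmask, updates as delete + insert, and a vacuum
# that rewrites the stripe once >20% of its rows are deleted.

class Stripe:
    def __init__(self, rows):
        self.rows = list(rows)
        self.deleted = [False] * len(rows)  # deletion bitmask

    def read(self):
        # Deleted rows are filtered out on read.
        return [r for r, d in zip(self.rows, self.deleted) if not d]

    def delete(self, idx):
        self.deleted[idx] = True

    def update(self, idx, new_row):
        # An update is a deletion plus an insert.
        self.delete(idx)
        self.rows.append(new_row)
        self.deleted.append(False)

    def deleted_fraction(self):
        return sum(self.deleted) / len(self.deleted)

    def vacuum(self):
        # Rewrite the stripe when >20% of rows are deleted,
        # reclaiming space and compacting the live rows.
        if self.deleted_fraction() > 0.20:
            self.rows = self.read()
            self.deleted = [False] * len(self.rows)
```

Note that until vacuum runs, deleted rows still occupy space; the 20% threshold trades rewrite cost against wasted storage.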

wuputah | 3 years ago | on: Hydra – the fastest Postgres for analytics [benchmarks]

Yeah, I agree! However, ClickBench has used 500GB GP2 as the "standard" for some time, so I stuck to it for consistency. We use GP3 for our hosted service, and I did test on GP3 as well, with settings identical to GP2, and the results are very similar.

wuputah | 3 years ago | on: Hydra – the fastest Postgres for analytics [benchmarks]

The metadata can act as a basic form of indexing (or sometimes caching, though Hydra doesn't use metadata to calculate results yet), but it's not an index in the traditional sense. It's used to eliminate stripes and blocks from consideration during a scan.

Columnar is not ideal for a `users` table where you want to select and update specific rows, often in very small, quick transactions (OLTP). You would want to continue to use a traditional (heap) table in that case. That's certainly something you can still do with Hydra, and combining both kinds of tables is considered HTAP, which is a unique use case of our product.

To contrast, columnar is best for "fact" tables -- data about something that happened (thus it does not change) that will be analyzed in an aggregate way. Those might be logs, events, transactions, etc.
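The metadata-based elimination can be illustrated with a toy sketch (hypothetical structure; Hydra's actual metadata is richer, with per-block ranges and compression info): each stripe records min/max values for a column, and a scan skips any stripe whose range cannot overlap the predicate.

```python
# Toy stripe elimination using per-stripe min/max metadata.
# Not Hydra's actual format -- just the pruning idea.

def stripe_metadata(stripes):
    # Record (min, max) of the filtered column for each stripe.
    return [(min(s), max(s)) for s in stripes]

def scan(stripes, meta, lo, hi):
    """Return values in [lo, hi], skipping stripes whose
    min/max range cannot overlap the predicate."""
    out = []
    for values, (mn, mx) in zip(stripes, meta):
        if mx < lo or mn > hi:
            continue  # whole stripe eliminated without reading its rows
        out.extend(v for v in values if lo <= v <= hi)
    return out
```

This is why it acts like a basic index: a selective range predicate can skip most stripes entirely, but unlike a B-tree it can't jump to individual rows.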

wuputah | 3 years ago | on: Hydra – the fastest Postgres for analytics [benchmarks]

A narrow distinction, but Hydra is Postgres - we only install an extension - while Greenplum and Redshift are forks but remain Postgres-compatible (to varying degrees). I'm not up on when Greenplum last merged updates from Postgres, but I would be concerned that it only runs on Ubuntu 18.04. If you have a look at the Greenplum install in ClickBench[1], you'll see it's not a typical Postgres setup. Hopefully we will be able to beat Greenplum straight-up soon. :)

Redshift is multi-node, which puts it in a different category -- with considerably higher costs.

[1]: https://github.com/ClickHouse/ClickBench/blob/main/greenplum...

wuputah | 4 years ago | on: Heroku was Down

I agree with your lead statement and argued as such, but was overruled. Last I knew and understood, Heroku Status is static pages pushed out to Fastly, with the internal admin site (that does that work) running in a Heroku Private Space. If you look at the DNS, it still appears to be served by Fastly, and Heroku Private Spaces are generally pretty isolated infra, so I would be curious what the failure mode was here. But ultimately this is the fire you play with when you self-host your status site...

wuputah | 4 years ago | on: Launch HN: Hydra (YC W22) – Query any database via Postgres

Certainly, I see us expanding to other cloud providers as we follow customer demand, but it will take some time. I think if you wanted to move faster and have a higher level of control, self-hosted would be the way to go. We are offering that, but it's not on our website.

Definitely stay tuned [0] for ClickHouse! And yes, exactly, you can continue to use your ORM of choice.

0: social links are at the bottom of hydras.io :)

wuputah | 4 years ago | on: Launch HN: Hydra (YC W22) – Query any database via Postgres

Hi, JD here, CTO at Hydra. In an HTAP scenario, local transactional data would be replicated, but your data warehouse will likely have a great amount of data that your Postgres database does not. You can still connect that data to Postgres with Hydra. Ultimately, it's up to you if/how you choose to replicate your data -- along with guidance from our team along the way.

wuputah | 4 years ago | on: Launch HN: Hydra (YC W22) – Query any database via Postgres

Hi, JD here, Hydra's CTO. It's still early days and we are considering open source; for now, we wanted to leave our options open, and OSS feels like a one-way door. I think you make a great point here - thanks for sharing your past pain / experience. Definitely food for thought.

Our "no lock-in" claim refers to your data: since Hydra is Postgres, you're not stuck using "HydraDB" forever -- it's relatively easy to migrate in or out using well-established Postgres tools. We are also open to licensing the product should you wish to self-host, run on-prem, etc.

wuputah | 4 years ago | on: Launch HN: Hydra (YC W22) – Query any database via Postgres

Hi, JD here, Hydra's CTO. Thanks for the interest and questions!

Today, queries need to be Postgres-compatible to be intelligently routed, but queries with specific query syntax or functions beyond Postgres can be routed with our manual router[1]. This is our first solution to the problem, and we plan to iterate in response to customer pain.

Sorry for the confusion! Data moves asynchronously -- we're not trying to implement multi-phase commits -- but we can act on data very quickly once committed. Our solution here uses Postgres logical replication. Using the Data Bridge is optional and a customer's existing solutions are welcome as well.

[1]: https://hydras-io.notion.site/Router-a91f5282f1354c54a9ba894...

wuputah | 4 years ago | on: Launch HN: Hydra (YC W22) – Query any database via Postgres

Hydra doesn't ship data to the client in order to then do further work like aggregations -- that's the whole point of Hydra -- but that also means you won't be able to "work around" a performance issue with an underlying data store. For that, we'd need to find a way to replicate the data to a data store that can solve the aggregation performance issue.