Solving duplicate data with performant deduplication

goodroot|2 years ago

Hey! Thanks for upvoting.

Happy to answer any questions about deduplication. One thing that's not included in the write-up is that we also address out-of-order indexing alongside deduplication.

CommanderHux|2 years ago

The dataset link seems to be dead. Do you have a mirror?

goenning|2 years ago

If your ClickHouse ReplacingMergeTree returns twice the expected row count is because your query is wrong. You don’t need to FINAL it, just use aggregation on your queries as per their docs

supercoco9|2 years ago

Hi. Sorry if my query offended you.

I basically executed literally what Clickhouse recommends at their guides for deduplication https://clickhouse.com/docs/en/guides/developer/deduplicatio....

Of course you can also materialize with aggregations or just use a group by, or even force optimize of the table. But my point is that you don't really get exactly once guarantees. Whoever is querying that table needs to be aware than a `SELECT * FROM tb` might contain duplicates and needs to adapt their queries accordingly.

benjaminwootton|2 years ago

Indeed, you should still aggregate even on mergetree tables.

I'm not sure what is about the database world where people are happy to discuss their competitors and include either mistakes or misinformation. It doesn't seem to happen in other parts of the industry.

adren123|2 years ago

An initial import with DuckDB from all the 15 files takes only 36 seconds on a regular (6 years old) desktop computer with 32GB of RAM and 26 seconds (5 times quicker than QuestDB) on a Dell PowerEdge 450 with 20 cores Intel Xeon and 256GB of RAM.

Here is the command to input the files:

CREATE TABLE ecommerce_sample AS SELECT * from read_csv_auto('ecommerce_*.csv');

supercoco9|2 years ago

Thank you! (original article writer here)

DuckDB is awesome. A couple of comments here. First of all, this is totally my fault, as I didn't explain it properly.

I am trying to simulate performance for streaming ingestion, which is the typical use case for QuestDB, Clickhouse, and Timescale. The three of them can do batch processing as well, but they shine when data is coming at high throughput in real time. So, while the data is presented on CSV (for compatibility reasons), I am reading line after line, and sending the data from a python script using the streaming-ingestion API exposed by those databases.

I guess the equivalent in DuckDB would be writing data via INSERTS in batches of 10K records, which I am sure would still be very performant!

The three databases on the article have more efficient methods for ingesting batch data (in QuestDB's case, the COPY keyword, or even importing directly from a Pandas dataframe using our Python client, would be faster than ingesting streaming data). I know Clickhouse and Timescale can also efficiently ingest CSV data way faster than sending streaming inserts.

But that's not the typical use case, or the point of the article, as removing duplicates on batch is way easier than on streaming. I should have made that clearer and will probably update it, so thank you for your feedback.

Other than that, I ran the batch experiment you mention on the box I used from the article (had already done it in the past actually) and the performance I am getting is 37 seconds, which is slower than your numbers. The reason here is that we are using a cloud instance using an EBS drive, and those are slower than a local SSD on your laptop.

You can use local drives on AWS and other cloud providers, but those are way more expensive and have no persistence or snapshots, so not ideal for a database (we use them sometimes for large one-off imports to store the original CSV and speeding up reads while writing to the EBS drive).

Actually one of the claims in DuckDB's entourage is that "big data is dead". With the power in a developer's laptop today you can do things faster than with cloud at unprecedented scale. DuckDB is designed to run locally on a data scientist machine, rather than on a remote server (of course you have Motherduck if you want to do remote, but you are now adding latency and a proprietary layer on top).

Once again thank you for your feedback!

whalesalad|2 years ago

Can anyone comment on QuestDB vs Clickhouse vs TimescaleDB? Real world experience around ergonomics, ops, etc.

Currently using BigQuery for a lot of this (ingesting ~5-10TB monthly) but would like to begin exploring in-house tooling.

On the flip side, we still use PSQL/RDS a lot and I enjoy it for the low operations burden - but we're doing some time series stuff with it now that is starting to fall over. TimescaleDB is nice because it is postgres, but afaik cannot work inside RDS. Clickhouse is next on my list for a test deployment, but QuestDB looks pretty neat too.

anonacct37|2 years ago

Can't answer your question directly but I've touched a few of those here and there.

Clickhouse was deployed to replace a home grown distributed storage system that at one point in time a long time ago was much cheaper and faster than the results the team was getting with BQ.

We evaluated clickhouse and druid for a few data stores for doing interactive queries on a fairly high throughput clickstream data pipeline.

Clickhouse won in terms of performance. We did some chaos testing on it in the form of simulating node outages and network partitions and we were happy with the results. My only complaint is that there are some other database options which don't require me to run a database and that's nice if I can get away with it.

One of the things you might try is dumping your data into parquet files on gcs. There are quite a few databases you can query that with and get good results depending on your indexing and partitioning needs. It's tough to get lower operational burden than "stick it in cloud storage and sometimes spin up some stateless compute to query it".

I think duckdb is super cool but for me at the moment it's a solution that I'm still in search of a problem for.

hansvm|2 years ago

I looked at it every year for a few years, most recently 1-2 yrs ago. It really is 20-50x faster at a lot of workloads than something like postgres, and it really does have a relational core and decent time-series functionality. I'd love to use it more, but IMO it's far too buggy. However it got that way, nothing about query execution felt cleanly composable.

No single query directly corrupted any persistent data, but "complicated" queries (any more nesting than the classic "select from join where" SQL backbone) were prone to returning no results, wrong results, error conditions, or slowly. For a bit I worked around that by wrapping right-sized queries (big enough to do something meaningful and avoid comms overhead, small/simple enough to work as expected) with a normal turing-complete language, but it was painful. Plus the defaults were quirky (broken telemetry opt-out, connections time out, RAM limits being per-arcane-questdb-subcomponent rather than per-questdb, ...) IMO, not that I would have minded enough to look elsewhere if the queries were correct.

Colored by that experience, I'd caution to thoroughly go through the docs and get it configured exactly as you want before deploying to prod, testing that the features you need behave correctly (best practice for most software, but IME not strictly necessary for a lot of products if your application code is sane), and also do something to probe reliability like spending a day hammering it with a toy project (maybe a few 10M row tables so that it's a fast experiment but anything fishy still stinks appropriately) and checking the query results. The current GitHub issues look fairly tame, so maybe it's better by now.

nhourcard|2 years ago

I'd be curious to hear how RDS is starting to fall over with time series data, is it a bottleneck on ingestion, queries, or both?

gigatexal|2 years ago

What about iceberg tables and a lake approach on GCS and then picking a querying engine?

jimsimmons|2 years ago

What is the best way to deduplicate a corpus of documents

marginalia_nu|2 years ago

If you mean in the sense of dealing with documents that are very similar but not binary identical, a locality sensitive hash would do the job.

OnlyMortal|2 years ago

A Rabin finger printing algorithm and a hash of the data it generates.

Reference count the hashes.

25 comments