top | item 46628896


willvarfar | 1 month ago

I had a great euphoric epiphany feeling today. Doesn't come along too often, will celebrate with a nice glass of wine :)

Am doing data engineering for some big data (yeah, big enough) and thinking about efficiency of data enrichment. There's this classic trilemma with data enrichment where you can have good write efficiency, good read efficiency and/or good storage cost, pick two.

E.g. you have a 1TB table and you want to add a column that, say, will take 1GB to store.

You can create a new table that is 1.1TB and then delete the old table, but this is both write-inefficient and often breaks how normal data lake orchestration works.

You can create a new wide table that is 1.1TB and keep it along side the old table, but this is both write-inefficient and expensive to store.

You can create a narrow companion table that has just a join key and 1GB of data. This is efficient to write and store, but inefficient to query when you force all users to do joins on read.
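To make the third option concrete, here's a minimal sketch of the narrow companion table and the join every reader then pays for. sqlite3 just stands in for a lake query engine, and the table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# The original wide fact table (stands in for the 1TB table).
cur.execute("CREATE TABLE events (event_id INTEGER PRIMARY KEY, payload TEXT)")
cur.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, "a"), (2, "b"), (3, "c")])

# The narrow companion table: just the join key plus the new column
# (stands in for the 1GB enrichment).
cur.execute("CREATE TABLE events_enriched (event_id INTEGER PRIMARY KEY, score REAL)")
cur.executemany("INSERT INTO events_enriched VALUES (?, ?)",
                [(1, 0.9), (2, 0.1), (3, 0.5)])

# Cheap to write and store, but every reader now has to do this join
# to see the enriched column.
rows = cur.execute("""
    SELECT e.event_id, e.payload, x.score
    FROM events e
    LEFT JOIN events_enriched x USING (event_id)
    ORDER BY e.event_id
""").fetchall()
print(rows)  # [(1, 'a', 0.9), (2, 'b', 0.1), (3, 'c', 0.5)]
```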

And I've come up with a cunning fourth way where you write a narrow table and read a wide table, so it's literally the best of all worlds! Kinda staggering :) Still on a high.

Might actually be a conference paper, which is new territory for me. Let's see :)

/off dancing



Fazebooking|1 month ago

Sounds off to me tbh.

Where your table is stored shouldn't matter that much if you have proper indexes, which you need anyway, and if you change anything your db is rebuilding the indexes regardless

nurettin|1 month ago

You mean you discovered parallel arrays?

willvarfar|1 month ago

specifically I've discovered how to 'trick' mainstream cloud storage and mainstream query engines using mainstream table formats into reading parallel arrays that are stored outside the table, without a classic join, and treating them as new columns or schema evolution. It'll work on Spark, BigQuery etc.
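The actual trick isn't spelled out in the thread, but the parallel-array idea it rests on is simple: the enrichment is stitched to the base table by row position rather than by a join key. A toy sketch, assuming both sides preserve the same row order (which is the whole bet):

```python
# Column read from the big base table (stands in for the 1TB side).
base_payload = ["a", "b", "c"]
# Parallel array stored outside the table (stands in for the 1GB side).
enrichment = [0.9, 0.1, 0.5]

# Positional alignment is the invariant the trick depends on:
# position i of the enrichment file is the column value for row i.
assert len(base_payload) == len(enrichment)

# "Schema evolution" on read: zip the arrays together, no join key,
# no shuffle.
wide_rows = [
    {"payload": p, "score": s}
    for p, s in zip(base_payload, enrichment)
]
print(wide_rows[0])  # {'payload': 'a', 'score': 0.9}
```

Whatever the real mechanism is, it presumably makes a query engine do this zip natively instead of a hash join.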

hahahahhaah|1 month ago

What's a good place to see parallel arrays defined? I have no data lake experience. Know how relational dbs work.

anonu|1 month ago

look into vector databases. for most representations, a column is just another file on disk