

remywang | 1 month ago

What are the columns and why are there so many of them? The standard approach is to explode into many tables and introduce joins as you said. Why don’t you want joins?


jamesblonde|1 month ago

If they are exploding categorical variables with one-hot encoding (OHE) and storing the resulting columns, that is the wrong thing to do. You should only ever store untransformed feature data in tables. Apply feature transformations like OHE when reading from the tables, because those transformations are parameterized by the data you read (the training data subset you select).

Reference: https://www.hopsworks.ai/post/a-taxonomy-for-data-transforma...
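
To make the point about read-time transformations concrete, here is a minimal sketch in plain Python. The table contents, the `tissue` column, and the helper names `fit_categories` and `one_hot` are all hypothetical and only for illustration. The one-hot vocabulary is learned from whichever training subset you select, so the stored rows stay untransformed.

```python
# Hypothetical raw rows, stored untransformed (names are illustrative only).
rows = [
    {"sample_id": 1, "tissue": "liver"},
    {"sample_id": 2, "tissue": "brain"},
    {"sample_id": 3, "tissue": "liver"},
    {"sample_id": 4, "tissue": "kidney"},
]


def fit_categories(training_rows, column):
    """Learn the one-hot vocabulary from the selected training subset only."""
    return sorted({row[column] for row in training_rows})


def one_hot(row, column, categories):
    """Encode one row at read time; categories unseen in training map to all zeros."""
    return [1 if row[column] == c else 0 for c in categories]


# The encoding depends on which rows were chosen as training data.
train = rows[:3]
categories = fit_categories(train, "tissue")
encoded = [one_hot(r, "tissue", categories) for r in rows]
```

Picking a different training subset changes `categories`, and with it the width and meaning of the encoded columns, which is why storing the encoded columns in the table ties the table to one particular selection.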

anotherpaul|1 month ago

I am speculating here, but as it is genomics data I assume it's information such as gene counts and epigenetic information (methylation, histones, etc.). Once you do 20k times a few post-translational modifications, you can get to quite a few columns quickly.

Usually this would be stored in a sparse long form though. So I might be wrong.

hobs|1 month ago

If you want to do that, why not just use an EAV (entity-attribute-value) pattern or something else that can translate rows to columns?
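
A rough sketch of what translating rows to columns could look like, in plain Python. The sample IDs, attribute names, and the `pivot` helper are made up for illustration and are not taken from the thread. Each stored row is an (entity, attribute, value) triple, and the wide layout is built only when it is read.

```python
from collections import defaultdict

# Hypothetical long-form (entity, attribute, value) rows; only present values are stored.
eav_rows = [
    ("sample_1", "gene_a_count", 12),
    ("sample_1", "gene_b_methylation", 0.4),
    ("sample_2", "gene_a_count", 7),
]


def pivot(eav, fill=None):
    """Turn (entity, attribute, value) triples into one wide row per entity."""
    by_entity = defaultdict(dict)
    for entity, attribute, value in eav:
        by_entity[entity][attribute] = value
    # The column list is derived from the data, not fixed in a schema.
    columns = sorted({attribute for _, attribute, _ in eav})
    wide = {
        entity: [values.get(column, fill) for column in columns]
        for entity, values in by_entity.items()
    }
    return columns, wide


columns, wide = pivot(eav_rows)
```

Missing measurements cost nothing in storage here and only show up as the `fill` value after the pivot, which is the appeal of the sparse long form mentioned above.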