What are the columns and why are there so many of them? The standard approach is to explode into many tables and introduce joins as you said. Why don’t you want joins?
If they are exploding categorical variables using one-hot encoding (OHE) and storing the resulting columns, that is the wrong thing to do. You should only ever store untransformed feature data in tables. Apply feature transformations like OHE when reading from the tables, because those transformations are parameterized by the data you read (the training data subset you select).
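To make the "transform on read" point concrete, here is a minimal sketch in plain Python. The rows, the `tissue` column, and the helpers `fit_one_hot` and `transform_one_hot` are all hypothetical names made up for illustration, not part of any particular feature store API. The idea is only that the table keeps the raw category and the encoding is learned from whichever training subset you select.

```python
# Hypothetical raw rows as they would sit in the table: untransformed categories.
rows = [
    {"id": 1, "tissue": "liver"},
    {"id": 2, "tissue": "brain"},
    {"id": 3, "tissue": "liver"},
    {"id": 4, "tissue": "kidney"},
]


def fit_one_hot(values):
    """Learn the category vocabulary from the selected training values only."""
    return sorted(set(values))


def transform_one_hot(value, vocabulary):
    """Encode one value; categories not seen during fit become all zeros."""
    return [1 if value == category else 0 for category in vocabulary]


# Select a training subset on read; the encoding depends on this choice.
train = [r for r in rows if r["id"] <= 3]
vocabulary = fit_one_hot(r["tissue"] for r in train)

# Apply the fitted encoding to every row, including ones outside the training subset.
encoded = {r["id"]: transform_one_hot(r["tissue"], vocabulary) for r in rows}
```

Pick a different training subset and the vocabulary (and so the column layout) changes, which is why storing the encoded columns in the table bakes in one particular choice.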
I am speculating here, but as it is genomics data I assume it's information such as gene count and epigenetic information (methylation, histones, etc.).
Once you multiply 20k by a few post-translational modifications, you get to quite a few columns quickly.
Usually this would be stored in a sparse long form though. So I might be wrong.
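For anyone unsure what "sparse long form" means here, a rough sketch: one row per observed (sample, gene, feature) value, pivoted to wide only when needed. The sample IDs, gene names, and the `to_wide` helper are hypothetical, and filling missing entries with 0.0 is an assumption that may not be appropriate for real genomics data.

```python
from collections import defaultdict

# Hypothetical long-form records: one row per (sample, gene, feature) actually observed.
long_rows = [
    ("s1", "GENE_A", "methylation", 0.8),
    ("s1", "GENE_B", "count", 12.0),
    ("s2", "GENE_A", "count", 3.0),
]


def to_wide(records, fill=0.0):
    """Pivot long-form records into one dense vector per sample."""
    # Column order is the sorted set of (gene, feature) pairs seen in the data.
    columns = sorted({(gene, feature) for _, gene, feature, _ in records})
    by_sample = defaultdict(dict)
    for sample, gene, feature, value in records:
        by_sample[sample][(gene, feature)] = value
    # Missing (gene, feature) pairs get the fill value (an assumption, see above).
    wide = {s: [vals.get(c, fill) for c in columns] for s, vals in by_sample.items()}
    return columns, wide


columns, wide = to_wide(long_rows)
```

The long table has four columns no matter how many genes or modifications exist; the width only shows up at read time.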
jamesblonde|1 month ago
Reference: https://www.hopsworks.ai/post/a-taxonomy-for-data-transforma...
anotherpaul|1 month ago
hobs|1 month ago