emcq | 5 years ago
If your concerns are about overfitting, there are lots of regularization techniques used in practice, like dropout, weight decay, and data augmentation.
There's nothing preventing you from sharing weights across layers, and it would be interesting to see some research on that.
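To make the idea concrete, here's a minimal numpy sketch of cross-layer weight sharing: a single weight matrix and bias are reused at every depth, so the parameter count stays constant no matter how many "layers" the stack has. All names (`forward`, `dim`, `depth`) are illustrative, not from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, depth = 16, 4

# One weight matrix and bias shared by every "layer" in the stack.
W = rng.standard_normal((dim, dim)) * 0.1
b = np.zeros(dim)

def forward(x, depth=depth):
    # Apply the same affine transform + ReLU at each depth step.
    for _ in range(depth):
        x = np.maximum(W @ x + b, 0.0)
    return x

y = forward(rng.standard_normal(dim))

# Parameter count is that of a single layer, independent of depth:
n_params = W.size + b.size  # 16*16 + 16 = 272
```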
microtonal | 5 years ago
E.g. the ALBERT model does that:
https://arxiv.org/abs/1909.11942
I have done model distillation of XLM-RoBERTa into ALBERT-based models with multiple layer groups and for the tasks that I was working on (syntax) it works really well.
E.g. we have gone from a finetuned ~1000MiB XLM-R base model to a 74MiB ALBERT-based model with barely any loss in accuracy.
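A rough sketch of the layer-group idea, in plain numpy rather than ALBERT's actual implementation: the stack has many logical layers, but consecutive layers within a group share one parameter set, so the model stores only `num_groups` sets of weights. The variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_layers, num_groups = 8, 12, 2

# 12 logical layers backed by only 2 parameter sets (one per group).
groups = [(rng.standard_normal((dim, dim)) * 0.1, np.zeros(dim))
          for _ in range(num_groups)]

def forward(x):
    per_group = num_layers // num_groups
    for i in range(num_layers):
        # Layers 0..5 share parameter set 0, layers 6..11 share set 1.
        W, b = groups[i // per_group]
        x = np.maximum(W @ x + b, 0.0)
    return x

y = forward(rng.standard_normal(dim))

# Stored parameters: 2 * (8*8 + 8) = 144, not 12 * (8*8 + 8).
n_params = sum(W.size + b.size for W, b in groups)
```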
auraham | 5 years ago
https://weightagnostic.github.io/