

emcq | 5 years ago

You can make a model of any size with deep learning.

If your concern is overfitting, there are plenty of regularization techniques used in practice, like dropout, weight decay, and data augmentation.
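For concreteness, here is a minimal, framework-free sketch of two of those regularizers (dropout and L2 weight decay). Real frameworks provide these as built-ins; this toy version just shows the mechanics:

```python
import random

def dropout(xs, p, rng, train=True):
    """Inverted dropout: zero each unit with prob p, rescale survivors by 1/(1-p)."""
    if not train or p == 0.0:
        return list(xs)
    return [0.0 if rng.random() < p else x / (1.0 - p) for x in xs]

def l2_decay_grad(w, grad, lam):
    """L2 weight decay: the (lam/2)*w^2 penalty adds lam*w to each gradient."""
    return [g + lam * wi for g, wi in zip(grad, w)]

rng = random.Random(0)
acts = dropout([1.0] * 8, p=0.5, rng=rng)  # each surviving unit becomes 2.0, dropped ones 0.0
```

At inference time (`train=False`) dropout is a no-op, which is exactly why the 1/(1-p) rescaling is done during training: expected activations match between the two modes.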

There's nothing preventing you from sharing weights across layers, and it would be interesting to see some research about that.
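The sharing idea can be sketched in a few lines (toy 1-d "layers", not a real network): one parameter set is reused at every depth, so the parameter count stays constant no matter how many layers you stack.

```python
def make_layer(w, b):
    """A toy one-dimensional 'layer': y = max(0, w*x + b)."""
    def layer(x):
        return max(0.0, w * x + b)
    return layer

# A single shared layer, applied repeatedly.
shared = make_layer(w=0.5, b=1.0)

def forward(x, depth):
    for _ in range(depth):
        x = shared(x)  # the same (w, b) are reused at every layer
    return x

# A 12-layer and a 24-layer model built this way have identical parameter counts.
```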


microtonal | 5 years ago

> There's nothing preventing you from sharing weights across layers, and would be interesting to see some research about that.

E.g. the ALBERT model does that:

https://arxiv.org/abs/1909.11942

I have done model distillation of XLM-RoBERTa into ALBERT-based models with multiple layer groups, and for the tasks I was working on (syntax) it works really well.

E.g. we went from a fine-tuned ~1000MiB XLM-R base model to a 74MiB ALBERT-based model with barely any loss in accuracy.
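The comment doesn't spell out the distillation objective, but a common choice is Hinton-style knowledge distillation: train the student to match the teacher's temperature-softened output distribution. A minimal sketch (logit values and temperature here are purely illustrative):

```python
import math

def softmax(logits, T=1.0):
    """Softmax over logits, softened by temperature T."""
    exps = [math.exp(l / T) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=2.0):
    """Cross-entropy of the student's softened distribution against the teacher's."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

# The loss shrinks as the student's logits approach the teacher's.
close = kd_loss([2.0, 0.5, -1.0], [1.9, 0.6, -1.0])
far   = kd_loss([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

In practice this soft-target term is usually mixed with the ordinary hard-label loss on the task data.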