rob_c | 1 month ago
It would be awesome to have a way of finding them in advance, but this is also a case for avoiding pure DNNs, given their strong sensitivity to the initialization parameters.
Transformers, by comparison, show a much weaker dependence on the initial parameter values. Does that make the model better or worse at learning, or just more stable?
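The sensitivity being described is easy to see empirically: train the same small network on the same data several times, varying only the random seed used for weight initialization, and compare the final losses. The sketch below does this with a toy two-layer MLP in numpy; the architecture, task, and hyperparameters are all illustrative choices, not anything from the thread.

```python
import numpy as np

def train_mlp(seed, steps=200, lr=0.1):
    """Train a tiny one-hidden-layer MLP on a fixed toy regression
    task; only the weight initialization varies with `seed`.
    Returns the final mean-squared error."""
    rng = np.random.default_rng(0)           # data is identical across runs
    X = rng.uniform(-1, 1, size=(64, 2))
    y = np.sin(3 * X[:, :1]) + X[:, 1:]      # arbitrary smooth target

    init = np.random.default_rng(seed)       # only this differs per run
    W1 = init.normal(0.0, 1.0, size=(2, 16))
    b1 = np.zeros(16)
    W2 = init.normal(0.0, 1.0, size=(16, 1))
    b2 = np.zeros(1)

    for _ in range(steps):
        h = np.tanh(X @ W1 + b1)             # forward pass
        pred = h @ W2 + b2
        err = pred - y
        # manual backprop for the two layers
        gW2 = h.T @ err / len(X)
        gb2 = err.mean(0)
        gh = (err @ W2.T) * (1 - h ** 2)
        gW1 = X.T @ gh / len(X)
        gb1 = gh.mean(0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1
    return float(np.mean(err ** 2))

losses = [train_mlp(s) for s in range(10)]
print(f"final MSE over 10 seeds: min={min(losses):.4f}, max={max(losses):.4f}")
```

If the spread between the best and worst seed is large, the model is init-sensitive in the sense the comment describes; repeating the experiment with a more stable architecture (normalization layers, residual connections, as in transformers) would be the natural comparison.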
snaking0776 | 1 month ago
[1] https://distill.pub/2020/circuits/branch-specialization/