(no title)
miven | 1 month ago
Though I wonder how much of the gains in this case are actually due to 75% extra parameters compared to the baseline, even if the inference FLOPs are matched.
Can't help but see this as a just different twist on parameter use sparsity idea leveraged by MoE models, as those also gain in performance at constant forward pass FLOPs because of extra parameters.
No comments yet.