top | item 42823284

(no title)

ImageXav | 1 year ago

Something that stuck out to me in the updated blog [0] is that Demon Adam performed much better than even AdamW, with very interesting learning curves. I'm wondering now why it didn't become the standard. Anyone here have insights into this?

[0] https://johnchenresearch.github.io/demon/

discuss

gzer0|1 year ago

Demon Adam didn’t become standard largely for the same reason many “better” optimizers never see wide adoption: it’s a newer tweak, not clearly superior on every problem, is less familiar to most engineers, and isn’t always bundled in major frameworks. By contrast, AdamW is now the “safe default” that nearly everyone supports and knows how to tune, so teams stick with it unless they have a strong reason not to.

Edit: Demon involves decaying the momentum parameter over time, which introduces a new schedule or formula for how momentum should be reduced during training. That can feel like additional complexity or a potential hyperparameter rabbit hole. Teams trying to ship products quickly often avoid adding new hyperparameters unless the gains are decisive.