top | item 39955126

(no title)

This is a pretty hyped-up optimizer that seems to have okay-ish performance in-practice, but there are a number of major red flags here. For one, the baselines are decently sandbagged, but the twitter posts sharing them (which are pretty hype-y) directly says that the baselines are "highly tuned" and that there's no benchmark trickery (which is flat-out wrong). If someone has not had experience with said benchmarks, it is a plausible statement, having worked with some these datasets very closely, some of the baselines are simply terrible, I don't know where they came from.

Additionally, the optimizer does actually appear to have a kind of momentum, despite claims directly saying the contrary, and uses it with a nesterov-like step (line 2 of 3 in the inner loop). Finally, it is 'schedule-free' because the schedule is actually hardcoded into the algorithm itself -- 1./steps_taken which is not necessarily a rare learning rate schedule. This is a decently robust but sometimes suboptimal schedule, and I find it sketchy to make claims that it is 'schedule-free'. This also cripples the optimizer by tying performance to the number of steps taken -- which is potentially a problem if you are using any batchsize+lr scaling strategies as I understand.

There is a mixture of hype and substance here, and I wish the author was more straightforward with their approach and claims. I think there is the potential for a good "bolts-included" optimizer with some of the ideas being presented here, but the amount of overhyping and deception makes me not want to trust any of the following work coming.

Unfortunately, hype is what sells best on Twitter, and some of the claims being made here appear to be at the very best deceptive, and at the very worst, untrue. I could be wrong -- these are just my personal opinions from my own experience, but I do occasionally find myself distraught about the things that tend to catch wind in the technical news cycle.

-Fern

discuss

aarondefazio|1 year ago

The behavior is actually more complex than a 1/t schedule. It behaves like a linear decay schedule 1-t/T with fixed stopping time T, as if T had been chosen in advance as the current timestep. When warmup is included, this is similar to high performance triangular learning rate schedules. Schedules of the form 1/t schedules perform really poorly in practice, we actually did a large scale comparison that included them in a prior paper: https://arxiv.org/pdf/2310.07831.pdf

danielhanchen|1 year ago

My main current concerns are I tried asking for a transformer benchmark to see if this worked on transformers, but didn't get any response. Also they seem particularly focused on CNN type benchmarks, but did not bother to benchmark superconvergence + Ranger21 + the learning rate range finder, since they explicitiy said Schedule-Free needs tuning as well.

Their past research on D-Adpatation (won ICML best paper 2023) and their follow up work Prodigy all did worse / similar than AdamW, so maybe this works on CNNs, but does not on transformers - but for CNNs we have superconvergence.

I shall wait for their paper which will come in 1-2 months.