top | item 42821273


sk11001 | 1 year ago

That’s how interviews go, though. It’s not like I’ve ever had to use Bayes’ rule at work, but for a few years everyone loved asking about it in screening rounds.


mike-the-mikado | 1 year ago

In my experience a lot of people "know" maths, but fail to recognise the opportunities to use it. Some of my colleagues were pleased when I showed them that their ad hoc algorithm was equivalent to an application of Bayes' rule. It gave them insights into the meaning of constants that had formerly been chosen by trial and error.
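A concrete (hypothetical) version of this: an additive score with a hand-tuned bias constant is often just a Bayes update written in log-odds form, and the "magic constant" turns out to be the prior log-odds. A minimal sketch, with all probabilities invented for illustration:

```python
import math

def bayes_posterior(prior, likelihood_pos, likelihood_neg):
    """P(H | E) via Bayes' rule: P(E|H)P(H) / P(E)."""
    evidence = likelihood_pos * prior + likelihood_neg * (1 - prior)
    return likelihood_pos * prior / evidence

def adhoc_score(log_likelihood_ratio, b):
    """An "ad hoc" sigmoid score with a tuned constant b."""
    return 1 / (1 + math.exp(-(log_likelihood_ratio + b)))

# The ad hoc score is exactly a Bayes update in log-odds form,
# where the tuned constant b is the prior log-odds
# log(P(H) / (1 - P(H))) -- which gives the constant a meaning.
prior = 0.2
lp, ln = 0.7, 0.1          # P(E|H), P(E|not H) -- illustrative values
b = math.log(prior / (1 - prior))
llr = math.log(lp / ln)
assert abs(bayes_posterior(prior, lp, ln) - adhoc_score(llr, b)) < 1e-12
```

Once the constant is identified as a prior, it can be set from base rates instead of trial and error.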

janalsncm | 1 year ago

Everyone’s experience is different but I’ve been in dozens of MLE interviews (some of which I passed!) and have never once been asked to explain the internals of an optimizer. The interviews were all post 2020, though.

Unless someone had a very good reason, I would consider it weird to use anything other than AdamW. The compute you could save with a slightly better optimizer pales in comparison to the time you will spend debugging an opaque training bug.

yobbo | 1 year ago

For example, if it is meaningful to use large batch sizes, the gradient variance will be lower and Adam could be equivalent to just momentum.

As a model is trained, the gradient variance typically falls.

Those optimizers all work to reduce the variance of the updates in various ways.
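The large-batch case can be checked numerically with a toy single-parameter Adam (hyperparameters below are just the usual defaults, the constant gradient stream is illustrative): when the gradient variance is zero, the second-moment estimate converges to g², and the Adam step collapses to a fixed-size step in the momentum direction, independent of gradient magnitude.

```python
def adam_steps(grads, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """Bias-corrected Adam steps for a single parameter."""
    m = v = 0.0
    steps = []
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g          # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g      # second moment (variance proxy)
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        steps.append(lr * m_hat / (v_hat ** 0.5 + eps))
    return steps

# Zero-variance gradients: m_hat -> g and v_hat -> g^2, so the
# step approaches lr * g / |g|, i.e. ~lr regardless of |g|.
steps = adam_steps([0.5] * 200)
assert abs(steps[-1] - 0.01) < 1e-6
```

With real (noisy) gradients, v_hat exceeds the squared mean by the gradient variance, which is what shrinks Adam's steps relative to plain momentum.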

esafak | 1 year ago

I'd still expect an MLE to know it though.

janalsncm | 1 year ago

Why would you? Implementing optimizers isn’t something that MLEs do. Even the Deepseek team just uses AdamW.

An MLE should be able to look up and understand the differences between optimizers, but memorizing that information is a very low priority compared with other things they might be asked about.
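As an example of the kind of difference one would look up rather than memorize: AdamW differs from Adam-with-L2 only in where the weight decay enters. A minimal sketch of both single-parameter updates (default-style hyperparameters, for illustration only):

```python
def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """AdamW: weight decay is applied directly to the weights,
    decoupled from the gradient-based update."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (v_hat ** 0.5 + eps) + wd * w)
    return w, m, v

def adam_l2_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """Plain Adam with L2 regularization: the decay term is folded
    into the gradient, so it gets rescaled by the adaptive denominator."""
    g = g + wd * w  # the only difference from AdamW
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    return w, m, v
```

Starting from the same state, the two rules produce different weights after one step, which is exactly the decoupling AdamW was introduced for.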