Nocedal's text, 'Numerical Optimization', is the standard reference for that field.
arcanus | 9 years ago
As he notes, I've always been surprised that more techniques in ML do not leverage the Hessian to get quadratic convergence rates.
Nevertheless, the most interesting tidbit of this text, speaking as a computational scientist, was:
'Much more could be said about this rapidly evolving field. Perhaps most importantly, we have neither discussed nor analyzed at length the opportunities offered by parallel and distributed computing.'
The scalability of these algorithms, in particular across distributed-memory systems (e.g., MPI) at extreme scale, will be an extremely important question. I'm very interested in attempting to scale these networks to tens or hundreds of thousands of processing cores. With heroic-scale systems now often eclipsing millions of cores, there is quite a bit of room to scale up, if the algorithms are indeed robust.
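To make the quadratic-convergence point concrete, here is a minimal sketch comparing plain gradient descent with Newton's method (which uses the second derivative, i.e. the 1-D Hessian). The test function f(x) = exp(x) - 2x, the step size of 0.1, and the tolerance are all assumptions chosen for illustration; nothing here comes from the text being discussed.

```python
import math

# Toy objective (an assumption for illustration): f(x) = exp(x) - 2x,
# which is strictly convex with its minimum at x = ln 2.
def grad(x):
    return math.exp(x) - 2.0

def hess(x):
    return math.exp(x)

def gradient_descent(x, step=0.1, tol=1e-10, max_iter=10_000):
    """First-order method: fixed step along the negative gradient."""
    for k in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, k
        x -= step * g
    return x, max_iter

def newton(x, tol=1e-10, max_iter=100):
    """Second-order method: rescale the gradient by the inverse Hessian."""
    for k in range(max_iter):
        g = grad(x)
        if abs(g) < tol:
            return x, k
        x -= g / hess(x)
    return x, max_iter

x_gd, iters_gd = gradient_descent(0.0)
x_nt, iters_nt = newton(0.0)
print(f"gradient descent: x = {x_gd:.10f} after {iters_gd} iterations")
print(f"newton:           x = {x_nt:.10f} after {iters_nt} iterations")
```

Near the minimum, the Newton error is roughly squared at every step, while the gradient descent error only shrinks by a constant factor. The catch in ML settings is that forming and inverting the Hessian gets expensive as the number of parameters grows.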
x0x0 | 9 years ago
L-BFGS is quite common for, e.g., regression without L1 penalties.
MPI is not great at even high hundreds of cores; it's too much work to build redundancy / retry / restart / clean failure in. You really need a framework that helps with this.
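For readers who haven't seen it, here is a rough sketch of L-BFGS applied to least-squares regression with an optional L2 penalty. The objective is smooth, which is why L-BFGS fits; an L1 penalty is non-differentiable at zero and needs a different approach. This is a bare-bones pure-Python version (two-loop recursion plus a backtracking Armijo line search), not a production implementation. The helper names, the synthetic data, and all tolerances are assumptions made for illustration.

```python
from collections import deque

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def make_objective(X, y, lam=0.0):
    """Halved mean squared error plus an optional L2 penalty (smooth everywhere)."""
    n = len(X)
    def f_and_grad(w):
        resid = [dot(w, xi) - yi for xi, yi in zip(X, y)]
        f = sum(r * r for r in resid) / (2 * n) + 0.5 * lam * dot(w, w)
        g = [sum(r * xi[j] for r, xi in zip(resid, X)) / n + lam * w[j]
             for j in range(len(w))]
        return f, g
    return f_and_grad

def lbfgs(f_and_grad, x, m=5, tol=1e-8, max_iter=500):
    history = deque(maxlen=m)            # most recent (s, y, rho) triples
    fx, g = f_and_grad(x)
    for _ in range(max_iter):
        if dot(g, g) ** 0.5 < tol:
            break
        # Two-loop recursion: q approximates (inverse Hessian) * gradient.
        q = list(g)
        alphas = []
        for s, y, rho in reversed(history):          # newest to oldest
            a = rho * dot(s, q)
            alphas.append(a)
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if history:
            s, y, _ = history[-1]
            gamma = dot(s, y) / dot(y, y)            # initial Hessian scaling
            q = [gamma * qi for qi in q]
        for (s, y, rho), a in zip(history, reversed(alphas)):  # oldest to newest
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]
        # Backtracking line search on the Armijo sufficient-decrease condition.
        t, slope = 1.0, dot(g, d)
        while True:
            x_new = [xi + t * di for xi, di in zip(x, d)]
            f_new, g_new = f_and_grad(x_new)
            if f_new <= fx + 1e-4 * t * slope or t < 1e-12:
                break
            t *= 0.5
        s = [a - b for a, b in zip(x_new, x)]
        y = [a - b for a, b in zip(g_new, g)]
        sy = dot(s, y)
        if sy > 1e-30:                   # keep only pairs with positive curvature
            history.append((s, y, 1.0 / sy))
        x, fx, g = x_new, f_new, g_new
    return x

# Synthetic, noiseless data (assumed): y = 2*x1 - 3*x2 + 1, with a bias column.
X = [(i / 10, (i * i % 7) / 7, 1.0) for i in range(20)]
true_w = [2.0, -3.0, 1.0]
y = [dot(true_w, xi) for xi in X]
w_hat = lbfgs(make_objective(X, y), [0.0, 0.0, 0.0])
print(w_hat)
```

The point of the `history` deque is that only the last `m` gradient and step differences are stored, so the method gets curvature information without ever forming the Hessian.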
jey | 9 years ago
The intuition for it is that when you optimize a machine learning objective all the way to machine precision, at some point you cross an (unknown) threshold past which you are over-optimizing the parameters for the particular model class you're using, and that probably puts too much faith in your model specification. So stochastic optimization and early stopping (before the gradient is zero) provide a form of regularization.
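A small sketch of what that looks like in practice: SGD on a linear model, with training halted once a held-out validation loss stops improving rather than when the training gradient reaches zero. The synthetic data, the learning rate, the `patience` rule, and every function name here are assumptions for illustration, not a recipe from the thread.

```python
import random

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def mse(w, data):
    return sum((predict(w, x) - y) ** 2 for x, y in data) / len(data)

def make_data(n, true_w, noise, rng):
    """Noisy linear data with Gaussian features (synthetic, assumed)."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        data.append((x, predict(true_w, x) + rng.gauss(0.0, noise)))
    return data

def sgd_early_stopping(train, val, lr=0.01, patience=5, max_epochs=500, seed=0):
    rng = random.Random(seed)
    w = [0.0] * len(train[0][0])
    # Track the best weights seen so far, as judged on held-out data.
    best_w, best_val, best_epoch = list(w), mse(w, val), 0
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        order = list(train)
        rng.shuffle(order)
        for x, y in order:               # one stochastic gradient step per example
            err = predict(w, x) - y
            w = [wi - lr * 2.0 * err * xi for wi, xi in zip(w, x)]
        val_loss = mse(w, val)
        if val_loss < best_val:
            best_w, best_val, best_epoch = list(w), val_loss, epoch
        elif epoch - best_epoch >= patience:
            break                        # stop well before the training gradient is zero
    return best_w, best_val, best_epoch, epoch

data_rng = random.Random(42)
true_w = [data_rng.gauss(0.0, 1.0) for _ in range(10)]
train = make_data(15, true_w, 0.5, data_rng)   # few examples relative to 10 features
val = make_data(50, true_w, 0.5, data_rng)

best_w, best_val, best_epoch, stopped = sgd_early_stopping(train, val)
print(f"best validation MSE {best_val:.4f} at epoch {best_epoch}; stopped at epoch {stopped}")
```

The returned weights are the snapshot with the lowest validation loss, not the final iterate, so the validation set acts as the check on how far to trust the training objective.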