Non-convex optimization

cosmos 5th July 2019 at 12:51am
Optimization

See resources at Neural network theory

Hiding solutions in random satisfiability problems: A statistical mechanics approach

Entropy landscape and non-Gibbs solutions in constraint satisfaction problems

Learning theory and neural networks gingko tree

See Statistical physics and inference

https://2017.icml.cc/Conferences/2017/Schedule?showEvent=777


convexity-like notions

"when the gradient vanishes something good happens."

See here: Recent Developments in Over-parametrized Neural Networks, Part I

See Recent developments in overparametrized neural networks

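One standard way to make that quote precise (my gloss; the talk may use a different condition) is a Polyak-Łojasiewicz-type inequality: even for a non-convex loss $L$, if

$$ \tfrac{1}{2}\,\lVert \nabla L(\theta) \rVert^2 \;\ge\; \mu \big( L(\theta) - L^* \big) \quad \text{for some } \mu > 0, $$

then $\nabla L(\theta) = 0$ forces $L(\theta) = L^*$, so every stationary point is a global minimum, and for smooth $L$ gradient descent converges to $L^*$ at a linear rate.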

https://youtu.be/uC2IGoTE2u4?t=2518 - should this say the Hessian is negative (i.e. has a negative eigenvalue)? (Jason Lee, "Recent Developments in Over-parametrized Neural Networks, Part I", Deep Learning Boot Camp, https://simons.berkeley.edu/talks/optimizations-i)

https://youtu.be/uC2IGoTE2u4?t=3231 - but not always, right? If, for example, you do what I do and run SGD until the classification error is 0, you are not necessarily at a critical point of whatever differentiable loss you are using; and the set of parameters with zero training classification error is not measure zero. Also, they don't claim SGD returns critical points, only epsilon-critical ones, and then it isn't enough for the Jacobian to be full rank; its smallest singular value also has to be sufficiently large, I guess.

Hmm, but actually the points SGD can reach lie in a measure-zero set, because it uses a finite number of random bits... The initialization doesn't, but then how do you show it still doesn't converge to only a finite set of possible points anyway...

SGD can escape second-order saddle points, but apparently not higher-order ones. Why? Why doesn't SGD find higher-order descent directions by the same argument that lets it find second-order ones?
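
A minimal sketch of what a higher-order saddle looks like (my own example, not from the talk): the monkey saddle $f(x, y) = x^3 - 3xy^2$ has a critical point at the origin where the Hessian also vanishes, so a second-order (negative-curvature) escape argument gets no signal there, yet the origin is not a local minimum.

    import sympy as sp

    x, y = sp.symbols('x y')
    f = x**3 - 3*x*y**2          # monkey saddle: a degenerate (higher-order) saddle at the origin

    grad = [sp.diff(f, v) for v in (x, y)]
    hess = sp.hessian(f, (x, y))

    print([g.subs({x: 0, y: 0}) for g in grad])  # [0, 0]: the origin is a critical point
    print(hess.subs({x: 0, y: 0}))               # zero matrix: no negative eigenvalue to follow
    print(f.subs({x: -0.01, y: 0}))              # -1e-6 < f(0, 0) = 0: a descent direction still exists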

A one-hidden-layer network with quadratic activation functions apparently has no spurious local minima and no saddles of order higher than two, so SGD can't get stuck there.
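
A sketch of why the quadratic-activation landscape is so benign (my gloss; the weights $a_j, w_j$ are my notation): the network computes a quadratic form in its input,

$$ f_\theta(x) \;=\; \sum_{j=1}^{k} a_j \,(w_j^\top x)^2 \;=\; x^\top \Big( \sum_{j=1}^{k} a_j\, w_j w_j^\top \Big) x \;=\; x^\top M(\theta)\, x, $$

so fitting it amounts to fitting the symmetric matrix $M(\theta)$ through a low-rank (Burer-Monteiro-style) factorization, a problem whose critical points are well understood.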

Functional gradient descent and optimality conditions of nonlinear least squares: if the Jacobian from parameters to predictions (on the training set) is non-degenerate at a local optimum, then it is a global optimum! Amazing!

Almost all points (and apparently almost all critical points too?) have a non-degenerate Jacobian!
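
A sketch of the argument as I understand it, assuming squared-error loss and reading "non-degenerate" as the Jacobian having full row rank: with $L(\theta) = \tfrac{1}{2}\lVert f(\theta) - y \rVert^2$ and $J(\theta) = \partial f / \partial \theta$,

$$ \nabla L(\theta) \;=\; J(\theta)^\top \big( f(\theta) - y \big), $$

so at a critical point $J(\theta)^\top \big( f(\theta) - y \big) = 0$; if $J(\theta)^\top$ has trivial kernel (full row rank, which requires at least as many parameters as training outputs, i.e. over-parametrization), this forces $f(\theta) = y$ and hence $L(\theta) = 0$, the global minimum.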

No spurious local minima (any critical point can be escaped along a second-order descent direction, which SGD can apparently find), but only for analytic activation functions, like quadratic, monomial, etc.

connections to Frank-Wolfe

arxiv.org/abs/1605.08361 arxiv.org/abs/1710.10928 arxiv.org/abs/1803.02968