Stochastic gradient descent

cosmos 9th November 2018 at 10:43pm
Deep learning Gradient descent Optimization

A stochastic version of Gradient descent, which can be used for Online learning

To calculate gradients for Artificial neural networks, we use Backpropagation

Nando's vid, Hugo's vid

See Learning theory for more on optimization for learning. See these notes, chapter 8, for convergence guarantees

Stochastic gradient descent is not Gibbsian

(Online algorithm: you process the data sequentially, in chunks. You need this if you do not have access to all of it at the same time, or you have so much data that not all of it fits in your RAM.) In practice, you only use a mini-batch (a small sample) of the input data at a time.
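
A concrete sketch of the mini-batch loop (my own illustration: `grad_loss` and the numpy-array `data` are assumed, not from the note):

```python
import numpy as np

def sgd(params, grad_loss, data, lr=0.01, batch_size=32, epochs=10):
    """Plain mini-batch SGD; grad_loss(params, batch) is assumed to return the gradient."""
    n = len(data)
    for _ in range(epochs):
        perm = np.random.permutation(n)                   # reshuffle the data each epoch
        for start in range(0, n, batch_size):
            batch = data[perm[start:start + batch_size]]  # a small sample of the data
            g = grad_loss(params, batch)                  # noisy estimate of the full gradient
            params = params - lr * g                      # step against the gradient
    return params
```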

There are theorems showing that this converges well.

Downpour – Asynchronous SGD

Polyak averaging. Keep a running average of the parameter values over all the time steps performed up to now.
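
A minimal sketch of the incremental running average (my own illustration, assuming a 1-indexed step count `t`); the averaged parameters are what you would evaluate, while the optimizer keeps stepping with the raw parameters:

```python
def polyak_update(avg_params, params, t):
    """Incremental running average of the parameters after t steps (t >= 1)."""
    return avg_params + (params - avg_params) / t
```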

Momentum. You add inertia to the particle, so that gradient descent is no longer velocity = gradient (as it would be for a particle in a very viscous fluid), but rather acceleration = gradient force plus a viscous drag term.
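
A sketch of the corresponding heavy-ball update (coefficient names and values are illustrative, not from the note):

```python
def momentum_step(params, velocity, grad, lr=0.01, beta=0.9):
    """Heavy-ball momentum: the velocity accumulates gradients and decays like viscous drag."""
    velocity = beta * velocity - lr * grad   # inertia (beta * v) plus the gradient force
    params = params + velocity               # move with the current velocity
    return params, velocity
```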

Adagrad: put more weight on rare features [Duchi et al]. Very useful. Rare features (e.g. the value along a particular dimension) tend to carry more information, i.e. they can tell you more about what the output y should be. This seems maybe related to AIT. Adagrad compensates for the underrepresentation of rare features in gradient descent.
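
A sketch of the Adagrad update from Duchi et al. (variable names are mine): coordinates that rarely receive gradient signal accumulate a small squared-gradient sum, so they keep a comparatively large effective learning rate.

```python
import numpy as np

def adagrad_step(params, grad, sum_sq, lr=0.01, eps=1e-8):
    """Adagrad: per-coordinate step size lr / sqrt(sum of squared past gradients)."""
    sum_sq = sum_sq + grad ** 2                             # accumulate squared gradients
    params = params - lr * grad / (np.sqrt(sum_sq) + eps)   # rare coordinates take larger steps
    return params, sum_sq
```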

AdamOptimizer

rmsprop is an optimizer that utilizes the magnitude of recent gradients to normalize the gradients
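
A sketch of the rmsprop update (exponential moving average in place of Adagrad's full sum; hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_step(params, grad, mean_sq, lr=0.001, rho=0.9, eps=1e-8):
    """rmsprop: divide by the root of a moving average of recent squared gradients."""
    mean_sq = rho * mean_sq + (1 - rho) * grad ** 2          # track recent gradient magnitude
    params = params - lr * grad / (np.sqrt(mean_sq) + eps)   # normalized step
    return params, mean_sq
```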

Online Learning Rate Adaptation with Hypergradient Descent

Don't Decay the Learning Rate, Increase the Batch Size

Paper showing that the proof of convergence of Adam is wrong (David worked on that)

See also Loss surface


Deriving Gibbs distribution from stochastic gradients

Idea I had of adversarial mini-batches (make examples which are currently classified wrong more likely to be sampled); a possible sketch below.
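
One possible reading of this, as a sketch (my interpretation only, not a worked-out method): sample each mini-batch with probability increasing in the current per-example loss, so misclassified examples show up more often.

```python
import numpy as np

def adversarial_batch(losses, batch_size=32, temperature=1.0):
    """Sample indices with probability growing in the per-example loss (losses: numpy array)."""
    p = np.exp(losses / temperature)   # higher loss -> higher weight
    p = p / p.sum()
    return np.random.choice(len(losses), size=batch_size, replace=False, p=p)
```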

Coupling Adaptive Batch Sizes with Learning Rates

Papers

...

Léon Bottou. Stochastic gradient learning in neural networks. Proceedings of Neuro-Nîmes, 91(8), 1991.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pp. 177–186. Physica-Verlag HD, 2010.

Yann LeCun, Léon Bottou, GB Orr, and K-R Müller. Efficient backprop. Lecture notes in computer science, pp. 9–50, 1998.

Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.

Saddles in deep learning: https://arxiv.org/abs/1605.07110. See also Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. How much of a problem are saddle points?

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent converges to minimizers. University of California, Berkeley, 1050:16, 2016.

James Martens. Deep learning via hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 735–742, 2010.

James Martens. New insights and perspectives on the natural gradient method. arXiv preprint arXiv:1412.1193, 2014.

Hossein Mobahi. Training recurrent neural networks by diffusion. arXiv preprint arXiv:1601.04114, 2016.

Ioannis Panageas and Georgios Piliouras. Gradient descent only converges to minimizers: Nonisolated critical points and invariant regions. arXiv preprint arXiv:1605.00405, 2016.

Panos M Pardalos, David Shalloway, and Guoliang Xue. Optimization methods for computing global minima of nonconvex potential energy functions. Journal of Global Optimization, 4(2):117–133, 1994.

Barak A Pearlmutter. Fast exact multiplication by the hessian. Neural computation, 6(1):147–160, 1994.

Tom Schaul, Sixin Zhang, and Yann LeCun. No more pesky learning rates. ICML (3), 28:343–351, 2013. –> newer methods now

The Marginal Value of Adaptive Gradient Methods in Machine Learning. See also Zhang et al.

We finally observe that both the gradient descent and the stochastic gradient descent methods can reach this level within the same number of steps. – https://arxiv.org/abs/1412.6615

A Walk with SGD


See also Loss surface of neural networks