A local optimization technique that decreases a function by following its gradient.
Take steps along the negative gradient, with the step size given by the learning rate.
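A minimal sketch of this update, assuming a gradient function `grad_f` (a hypothetical callable) is available:

```python
import numpy as np

def gradient_descent(grad_f, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step along the negative gradient with a fixed learning rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```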
Convergence can be proven for convex functions.
Convergence conditions for gradient descent determine how to choose the learning rate.
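For reference, one standard condition (assumed background from the usual smooth convex analysis): if $f$ is convex and $\beta$-smooth, gradient descent with learning rate $\eta = 1/\beta$ satisfies

$$ f(x_t) - f(x^\star) \le \frac{2\beta \, \|x_1 - x^\star\|^2}{t - 1}, $$

i.e. an $O(1/t)$ rate, while step sizes larger than $2/\beta$ can diverge (e.g. on a quadratic).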
Nesterov's accelerated gradient
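A hedged sketch of one common formulation of Nesterov's accelerated gradient, using the standard $(k-1)/(k+2)$ momentum schedule; `grad_f` is a hypothetical gradient callable:

```python
import numpy as np

def nesterov_accelerated_gradient(grad_f, x0, learning_rate=0.1, n_steps=100):
    """Gradient steps taken at a momentum-extrapolated (look-ahead) point."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, n_steps + 1):
        momentum = (k - 1) / (k + 2)
        y = x + momentum * (x - x_prev)    # look-ahead point
        x_prev = x
        x = y - learning_rate * grad_f(y)  # gradient step at the look-ahead point
    return x
```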
Line search to find the step size.
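A sketch of backtracking (Armijo) line search, one standard way to choose the step size along the negative gradient; `f` and `grad_f` are assumed callables:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, alpha0=1.0, shrink=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    # Accept alpha once f(x - alpha * g) <= f(x) - c * alpha * ||g||^2.
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= shrink
    return alpha
```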
Methods that use second-derivative information (e.g., Newton's method).
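For example, Newton's method rescales the step by the inverse Hessian; a minimal sketch assuming callables `grad_f` and `hess_f`:

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One Newton step: solve H d = g for the Newton direction d, then move by -d."""
    g = grad_f(x)
    H = hess_f(x)
    d = np.linalg.solve(H, g)
    return x - d
```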
Projected gradient descent: take a gradient step, then project back onto the constraint set.
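A minimal sketch for the special case where the constraint set is a Euclidean ball of radius R, so the projection is a simple rescaling; `grad_f` is assumed:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad_f, x0, radius=1.0, learning_rate=0.1, n_steps=100):
    """Alternate a gradient step with a projection back onto the constraint set."""
    x = project_onto_ball(np.asarray(x0, dtype=float), radius)
    for _ in range(n_steps):
        x = project_onto_ball(x - learning_rate * grad_f(x), radius)
    return x
```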
Natural gradients for deep learning
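For reference, the natural gradient update preconditions the gradient with the inverse Fisher information matrix $F(\theta)$ of the model:

$$ \theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \nabla_\theta L(\theta_t). $$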
Used when the function is non-differentiable
Used when the space is non-Euclidean.
See this video
Convergence can again be proven for convex functions.
https://twitter.com/gabrielpeyre/status/959715154937221120
See these notes, in particular the end of Chapter 1 and Chapter 2.
Black-box model
Assume there is an oracle that returns a subgradient of the function at any queried point.
Here are the convergence rates for projected subgradient descent:
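(Stated here from standard results as assumed background.) For a convex, $L$-Lipschitz function with $\|x_1 - x^\star\| \le R$, projected subgradient descent with step size $\eta = \frac{R}{L\sqrt{t}}$ achieves

$$ f\Big(\frac{1}{t}\sum_{s=1}^{t} x_s\Big) - f(x^\star) \le \frac{R L}{\sqrt{t}}, $$

i.e. an $O(1/\sqrt{t})$ rate; for $\beta$-smooth functions projected gradient descent improves this to $O(1/t)$, and strong convexity gives $O(1/t)$ (Lipschitz case) or a linear, i.e. exponential, rate (smooth case).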
These are the lower bounds in the black-box model:
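(Again standard results, assumed as background; they hold when the dimension is large relative to the number of iterations $t$.) Any black-box first-order method suffers, in the worst case,

$$ f(x_t) - f(x^\star) \ge \Omega\!\Big(\frac{R L}{\sqrt{t}}\Big) \ \ \text{(convex, $L$-Lipschitz)}, \qquad f(x_t) - f(x^\star) \ge \Omega\!\Big(\frac{\beta R^2}{t^2}\Big) \ \ \text{(convex, $\beta$-smooth)}. $$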
Subgradient descent matches the lower bound for Lipschitz functions, but not for smooth ones; accelerated methods are needed to reach the lower bound in the smooth case.
Stochastic oracle
Assume that the oracle yields a random variable that is an unbiased estimator of the gradient (its expectation equals the true gradient).
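A minimal sketch of stochastic gradient descent under this assumption; `stochastic_grad` stands for the oracle, a hypothetical callable returning an unbiased gradient estimate at x:

```python
import numpy as np

def stochastic_gradient_descent(stochastic_grad, x0, learning_rate=0.01, n_steps=1000, seed=0):
    """Follow noisy but unbiased gradient estimates returned by the oracle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = stochastic_grad(x, rng)  # E[g] equals the true gradient at x
        x = x - learning_rate * g
    return x

# Example oracle for f(x) = ||x||^2: the true gradient 2x plus zero-mean noise.
noisy_grad = lambda x, rng: 2 * x + rng.normal(size=x.shape)
x_min = stochastic_gradient_descent(noisy_grad, x0=[3.0, -4.0])
```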