A local optimization technique that decreases a function by following its gradient.
Take steps along the negative gradient, with the step size given by the learning rate.
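A minimal sketch of this update, assuming a gradient function `grad_f` (a hypothetical callable) is available:

```python
import numpy as np

def gradient_descent(grad_f, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step along the negative gradient with a fixed learning rate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - learning_rate * grad_f(x)
    return x

# Example: minimize f(x) = ||x||^2, whose gradient is 2x.
x_min = gradient_descent(lambda x: 2 * x, x0=[3.0, -4.0])
```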
Convergence can be proven for convex functions.
Convergence conditions for gradient descent determine how to choose the learning rate.
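For reference, one standard condition (assumed background from the usual smooth convex analysis): if $f$ is convex and $\beta$-smooth, gradient descent with learning rate $\eta = 1/\beta$ satisfies

$$ f(x_t) - f(x^\star) \le \frac{2\beta \, \|x_1 - x^\star\|^2}{t - 1}, $$

i.e. an $O(1/t)$ rate, while step sizes larger than $2/\beta$ can diverge (e.g. on a quadratic).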
Nesterov's accelerated gradient
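A hedged sketch of one common formulation of Nesterov's accelerated gradient, using the standard $(k-1)/(k+2)$ momentum schedule; `grad_f` is a hypothetical gradient callable:

```python
import numpy as np

def nesterov_accelerated_gradient(grad_f, x0, learning_rate=0.1, n_steps=100):
    """Gradient steps taken at a momentum-extrapolated (look-ahead) point."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    for k in range(1, n_steps + 1):
        momentum = (k - 1) / (k + 2)
        y = x + momentum * (x - x_prev)    # look-ahead point
        x_prev = x
        x = y - learning_rate * grad_f(y)  # gradient step at the look-ahead point
    return x
```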
Line search to find the step size.
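A sketch of backtracking (Armijo) line search, one standard way to choose the step size along the negative gradient; `f` and `grad_f` are assumed callables:

```python
import numpy as np

def backtracking_line_search(f, grad_f, x, alpha0=1.0, shrink=0.5, c=1e-4):
    """Shrink the step size until the Armijo sufficient-decrease condition holds."""
    g = grad_f(x)
    alpha = alpha0
    # Accept alpha once f(x - alpha * g) <= f(x) - c * alpha * ||g||^2.
    while f(x - alpha * g) > f(x) - c * alpha * np.dot(g, g):
        alpha *= shrink
    return alpha
```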
Methods that use second-derivative information (e.g., Newton's method).
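For example, Newton's method rescales the step by the inverse Hessian; a minimal sketch assuming callables `grad_f` and `hess_f`:

```python
import numpy as np

def newton_step(grad_f, hess_f, x):
    """One Newton step: solve H d = g for the Newton direction d, then move by -d."""
    g = grad_f(x)
    H = hess_f(x)
    d = np.linalg.solve(H, g)
    return x - d
```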
Projected gradient descent: take a gradient step, then project back onto the constraint set.
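A minimal sketch for the special case where the constraint set is a Euclidean ball of radius R, so the projection is a simple rescaling; `grad_f` is assumed:

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_gradient_descent(grad_f, x0, radius=1.0, learning_rate=0.1, n_steps=100):
    """Alternate a gradient step with a projection back onto the constraint set."""
    x = project_onto_ball(np.asarray(x0, dtype=float), radius)
    for _ in range(n_steps):
        x = project_onto_ball(x - learning_rate * grad_f(x), radius)
    return x
```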
Natural gradients for deep learning
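For reference, the natural gradient update preconditions the gradient with the inverse Fisher information matrix $F(\theta)$ of the model:

$$ \theta_{t+1} = \theta_t - \eta \, F(\theta_t)^{-1} \nabla_\theta L(\theta_t). $$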
Used when the function is non-differentiable
Used when the space is non-Euclidean.
See this video
Convergence can again be proven for convex functions.
https://twitter.com/gabrielpeyre/status/959715154937221120
See these notes, in particular the end of Chapter 1 and Chapter 2.
Black-box model
Assume there is an oracle that returns a subgradient of the function at any queried point.
Here are the convergence rates for projected subgradient descent:
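(Stated here from standard results as assumed background.) For a convex, $L$-Lipschitz function with $\|x_1 - x^\star\| \le R$, projected subgradient descent with step size $\eta = \frac{R}{L\sqrt{t}}$ achieves

$$ f\Big(\frac{1}{t}\sum_{s=1}^{t} x_s\Big) - f(x^\star) \le \frac{R L}{\sqrt{t}}, $$

i.e. an $O(1/\sqrt{t})$ rate; for $\beta$-smooth functions projected gradient descent improves this to $O(1/t)$, and strong convexity gives $O(1/t)$ (Lipschitz case) or a linear, i.e. exponential, rate (smooth case).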
These are the lower bounds in the black-box model:
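(Again standard results, assumed as background; they hold when the dimension is large relative to the number of iterations $t$.) Any black-box first-order method suffers, in the worst case,

$$ f(x_t) - f(x^\star) \ge \Omega\!\Big(\frac{R L}{\sqrt{t}}\Big) \ \ \text{(convex, $L$-Lipschitz)}, \qquad f(x_t) - f(x^\star) \ge \Omega\!\Big(\frac{\beta R^2}{t^2}\Big) \ \ \text{(convex, $\beta$-smooth)}. $$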
Subgradient descent matches the lower bound for Lipschitz functions, but not for smooth ones; accelerated methods are needed to reach the lower bound in the smooth case.
Stochastic oracle
Assume that the oracle yields a random variable that is an unbiased estimator of the gradient (its expectation equals the true gradient).
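A minimal sketch of stochastic gradient descent under this assumption; `stochastic_grad` stands for the oracle, a hypothetical callable returning an unbiased gradient estimate at x:

```python
import numpy as np

def stochastic_gradient_descent(stochastic_grad, x0, learning_rate=0.01, n_steps=1000, seed=0):
    """Follow noisy but unbiased gradient estimates returned by the oracle."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = stochastic_grad(x, rng)  # E[g] equals the true gradient at x
        x = x - learning_rate * g
    return x

# Example oracle for f(x) = ||x||^2: the true gradient 2x plus zero-mean noise.
noisy_grad = lambda x, rng: 2 * x + rng.normal(size=x.shape)
x_min = stochastic_gradient_descent(noisy_grad, x0=[3.0, -4.0])
```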