Newton's method

cosmos 7th November 2016 at 9:08pm
Gradient descent

A method to find stationary points of a function. Note that it doesn't necessarily minimize: a stationary point can also be a maximum or a saddle point. It is more computationally intensive per step than Gradient descent, but it tends to converge in fewer iterations.

It is based on a local quadratic approximation to the function using Multivariate Taylor expansion:

f(x+\delta x) \approx f(x) + \nabla^T f(x) \delta x + \frac{1}{2} \delta x^T H(x) \delta x

If we minimize this by taking the Gradient w.r.t. \delta x and setting it to zero:

\nabla f(x) + H(x) \delta x = 0

\delta x = -H(x)^{-1} \nabla f(x)
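
A minimal sketch of this update rule, assuming NumPy and a hand-coded gradient and Hessian for an example function (solving the linear system rather than forming the inverse explicitly):

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=50):
    """Find a stationary point by iterating x <- x - H(x)^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H(x) dx = -g instead of inverting the Hessian explicitly.
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx
    return x

# Example: f(x, y) = (x - 1)^2 + 10 * (y + 2)^2, minimum at (1, -2)
grad = lambda v: np.array([2.0 * (v[0] - 1.0), 20.0 * (v[1] + 2.0)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton(grad, hess, x0=[5.0, 5.0]))  # ~ [ 1. -2.]
```

For a quadratic like this example, a single Newton step already lands on the stationary point.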

See section 7 here, and Hugo's video.

(Offline algorithm: you process all the data at each step.)

Taylor-expand to second order (in the multivariate sense) and minimize that approximation, i.e. take the derivative (Gradient) and set it to 0. It performs upper-bound minimization.

Newton-CG (conjugate gradient) algorithms solve the Newton linear system iteratively with conjugate gradients instead of inverting the Hessian directly.
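
A minimal sketch, assuming SciPy (the Rosenbrock test function and its derivatives ship with scipy.optimize): Newton-CG solves H \delta x = -\nabla f approximately with conjugate gradients, so the Hessian is never inverted explicitly.

```python
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

# Newton-CG needs the gradient, and either the full Hessian (`hess`)
# or just Hessian-vector products (`hessp`).
res = minimize(rosen, x0=[1.3, 0.7, 0.8, 1.9, 1.2],
               method='Newton-CG', jac=rosen_der, hess=rosen_hess)
print(res.x)  # close to the minimizer [1, 1, 1, 1, 1]
```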

The expensive part is computing the Hessian and inverting it. It only works (as a minimization method) if the function is locally convex, so that the Hessian is positive definite and invertible. Approximate methods like BFGS and L-BFGS get around this.
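
A sketch of the quasi-Newton route, again assuming SciPy: BFGS and L-BFGS build an approximation to the inverse Hessian from successive gradients, so only the gradient is ever computed.

```python
from scipy.optimize import minimize, rosen, rosen_der

x0 = [1.3, 0.7, 0.8, 1.9, 1.2]
# BFGS keeps a dense approximation of the inverse Hessian.
res_bfgs = minimize(rosen, x0, method='BFGS', jac=rosen_der)
# L-BFGS-B stores only a few recent gradient pairs (limited memory).
res_lbfgs = minimize(rosen, x0, method='L-BFGS-B', jac=rosen_der)
print(res_bfgs.x, res_lbfgs.x)
```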

Trust regions

Line search
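
A minimal sketch, assuming NumPy, of how a line search is combined with the Newton step: the full step is tried first and shrunk by backtracking until it gives a sufficient (Armijo) decrease.

```python
import numpy as np

def damped_newton(f, grad, hess, x0, alpha=1e-4, beta=0.5, tol=1e-8, max_iter=100):
    """Newton's method globalized with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess(x), -g)  # Newton direction
        t = 1.0
        # Shrink the step until it gives a sufficient decrease in f.
        while f(x + t * dx) > f(x) + alpha * t * (g @ dx) and t > 1e-10:
            t *= beta
        x = x + t * dx
    return x
```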

It is very similar to standard gradient descent, but where the learning rate adapts to the local curvature (the step is rescaled by the inverse Hessian).
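
For example, for a one-dimensional quadratic f(x) = \frac{a}{2}x^2 the Newton update is x \leftarrow x - f'(x)/f''(x) = x - ax/a = 0, i.e. gradient descent with learning rate 1/f''(x): it reaches the minimum in one step, and in general the effective learning rate is the inverse of the local curvature.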

Possible problems

The quadratic approximation may be inaccurate (e.g. far from a stationary point)

The Hessian and its inverse may be expensive to compute → Quasi-Newton method

Quasi-Newton method