A method to find stationary points of a function; note that it doesn't necessarily minimize. It's more computationally intensive than gradient descent per step, but it tends to converge faster (in fewer iterations).
It is based on a local quadratic approximation to the function, obtained from a multivariate Taylor expansion:
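Concretely, a sketch of the standard second-order expansion (the notation $x_k$ for the current iterate and $H$ for the Hessian is assumed here, not from the notes):

$$ f(x) \approx f(x_k) + \nabla f(x_k)^\top (x - x_k) + \tfrac{1}{2}\,(x - x_k)^\top H(x_k)\,(x - x_k) $$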
If we minimize this approximation by taking the gradient w.r.t. $x$ and setting it to zero, we get the Newton update:
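Setting the gradient of the quadratic model to zero (same assumed notation as above) gives:

$$ \nabla f(x_k) + H(x_k)\,(x - x_k) = 0 \;\;\Rightarrow\;\; x_{k+1} = x_k - H(x_k)^{-1}\,\nabla f(x_k) $$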
See section 7 here, and Hugo's video.
(Offline algorithm: you process all the data at each step.)
In short: Taylor expand to second order (in the multivariate sense) and minimize that approximation, i.e. take the gradient and set it to 0. When the quadratic model upper-bounds the function, this amounts to upper-bound (surrogate) minimization.
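A minimal sketch of the plain Newton iteration just described (the function names, tolerances, and example quadratic are my own illustration, not from the notes; no line search or trust region):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-8, max_iter=50):
    """Plain Newton iteration: repeatedly minimize the local quadratic model."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stationary point reached
            break
        # Solve H d = -g rather than forming the inverse Hessian explicitly
        d = np.linalg.solve(hess(x), -g)
        x = x + d                          # full Newton step (no step-size control)
    return x

# Illustrative example: f(x, y) = (x - 1)^2 + 10 * (y + 2)^2
grad = lambda v: np.array([2 * (v[0] - 1), 20 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 20.0]])
print(newton_minimize(grad, hess, x0=[5.0, 5.0]))   # -> [ 1. -2.] in one step
```

For a quadratic function the method converges in a single step, since the "approximation" is exact.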
Newton-CG (conjugate gradient) algorithms solve the Newton linear system approximately with conjugate gradients, avoiding an explicit Hessian inverse.
The expensive part is computing the Hessian and inverting it. The update is only well-behaved where the function is locally convex, so that the Hessian is positive definite (and hence invertible). Approximate (quasi-Newton) methods like BFGS and L-BFGS address this (see the SciPy sketch below).
Trust regions (only trust the quadratic model within a bounded region around the current point).
Line search (scale the Newton step until the objective actually decreases).
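Both variants mentioned above are available in SciPy; a small sketch on the Rosenbrock test function (the test function and starting point are my choice of illustration):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der, rosen_hess

x0 = np.array([1.3, 0.7, 0.8, 1.9, 1.2])

# Newton-CG: solves the Newton system with conjugate gradients,
# so the Hessian is used but never explicitly inverted.
res_ncg = minimize(rosen, x0, method="Newton-CG", jac=rosen_der, hess=rosen_hess)

# L-BFGS-B: quasi-Newton, builds a low-memory approximation of the
# inverse Hessian from recent gradients only (no Hessian needed).
res_lbfgs = minimize(rosen, x0, method="L-BFGS-B", jac=rosen_der)

print(res_ncg.x)     # both converge to the minimum at [1, 1, 1, 1, 1]
print(res_lbfgs.x)
```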
It is very similar to standard gradient descent, but where the learning rate adapts to the curvature.
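In one dimension this is explicit: the Newton step is a gradient step whose learning rate is the inverse curvature:

$$ x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} = x_k - \underbrace{\frac{1}{f''(x_k)}}_{\text{adaptive learning rate}} f'(x_k) $$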
Possible problems
Quadratic approximation may be inaccurate
Hessian and its inverse may be expensive → quasi-Newton methods (BFGS update sketched below)
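For reference, the standard BFGS update of the inverse-Hessian approximation $B_k^{-1}$, built only from gradient differences (the notation $s_k$, $y_k$, $\rho_k$ is assumed here):

$$ s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \qquad \rho_k = \frac{1}{y_k^\top s_k} $$

$$ B_{k+1}^{-1} = \left(I - \rho_k\, s_k y_k^\top\right) B_k^{-1} \left(I - \rho_k\, y_k s_k^\top\right) + \rho_k\, s_k s_k^\top $$

L-BFGS keeps only the last few $(s_k, y_k)$ pairs instead of storing the full matrix.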