Mirror descent

cosmos 9th June 2018 at 1:10pm
Gradient descent

https://www.stats.ox.ac.uk/~lienart/blog_opti_mda.html

Mirror descent can be derived as GD with a different norm over the domain (where this can be seen when rearranging the usual form of projected GD to isolate a term that depends on the norm, see here)

Going back to our gradient descent algorithm this means that in general the equation xηf(x)x-\eta\nabla f(x) doesn’t make sense, because f(x)\nabla f(x) is an element of the dual and xx is an element of the primal. Note that the exception in case of the euclidean norm arises because we can write linear functionals as an appropriate scalar product, see Remark 5.0.1. (More generally, the Riesz representation theorem implies that the dual of a Hilbert space is isometrically isomorph to its primal.)

Mirror descent allows us to circumvent this problem by mapping the current point of our descent algorithm to its dual, perform the gradient descent step and map back to our primal space. In general there is no guarantee that our new point in the primal space is in our restriction set X and, hence, an additional projection may be required. Summarised we get the following algorithm.

Uses Bregman divergence

.......

Mirror descent works exponentially better (w.r.t. dimension of solution space) than projected gradient descent, when optimizing on a probability simplex. Can take advantage of KL divergence as a good metric.


One can apply normal GD on the parameter space of a manifold

But then, one can ask the question of what if I want to take a step of a certain fixed size (how to measure the size of the step? Well a natural notion is the size given by the metric, of course), but maximize how much the function changes.

can then solve this constrained maximization problem using Lagrange multipliers. The result would be a direction given by the inverse metric tensor multiplied by the gradient. This makes sense because the inverse metric tensor converts covariant vectors to contravariant ones..