aka ELBO
An objective function which is maximized in Variational inference. We are interested in minimizing the KL divergence between the variational distribution $q(w)$ and the true Posterior $p(w \mid \mathcal{D})$ of some parameters $w$ given the data $\mathcal{D}$, which we can do equivalently by maximizing
$$\mathcal{L}_{VI}(q) = \log p(\mathcal{D}) - D_{KL}[q(w) \,\|\, p(w \mid \mathcal{D})]$$
with respect to $q$ (since the first term doesn't depend on $q$). Note that because the KL divergence is $0$ if and only if $q(w) = p(w \mid \mathcal{D})$, if the set of $q$ over which we are optimizing contains the posterior, then the unique maximizer of $\mathcal{L}_{VI}(q)$ is the posterior.
Note that because the KL divergence is always nonnegative,
$$\mathcal{L}_{VI}(q) \le \log p(\mathcal{D}),$$
with equality iff $q$ is the posterior. $\log p(\mathcal{D})$ is the Bayesian evidence. This is why we call $\mathcal{L}_{VI}(q)$ the evidence lower bound (ELBO).
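The same bound also follows in one line from Jensen's inequality, using the joint-probability form of $\mathcal{L}_{VI}(q)$ derived below:
$$\log p(\mathcal{D}) = \log \mathbb{E}_{q(w)}\!\left[\frac{p(\mathcal{D}, w)}{q(w)}\right] \ge \mathbb{E}_{q(w)}\!\left[\log \frac{p(\mathcal{D}, w)}{q(w)}\right] = \mathcal{L}_{VI}(q).$$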
The issue is that evaluating the posterior $p(w \mid \mathcal{D})$ is hard, by assumption, so we can't compute $D_{KL}[q(w) \,\|\, p(w \mid \mathcal{D})]$ directly, and the above form is not very useful. Instead we rewrite it as:
$$\mathcal{L}_{VI}(q) = -D_{KL}[q(w) \,\|\, p(\mathcal{D})\, p(w \mid \mathcal{D})] = -D_{KL}[q(w) \,\|\, p(w, \mathcal{D})]$$
$$= -D_{KL}[q(w) \,\|\, p(\mathcal{D} \mid w)\, p(w)]$$
$$= \mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] - D_{KL}[q(w) \,\|\, p(w)]$$
(with a slight abuse of notation in the middle steps, since $p(w, \mathcal{D})$ is unnormalized as a density over $w$; there $-D_{KL}[q \,\|\, \tilde{p}]$ means $\mathbb{E}_{q(w)}[\log \tilde{p}(w) - \log q(w)]$). The ELBO can also be written in terms of the joint probability and the entropy of the variational distribution:
$$\mathcal{L}_{VI}(q) = \mathbb{E}_{q(w)}[\log p(\mathcal{D}, w)] - \mathbb{E}_{q(w)}[\log q(w)]$$
This last form is the one used for the optimization!
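As an illustration (not from the note), here is a minimal Monte Carlo sketch of this last form on a toy conjugate Gaussian model, where the exact posterior and evidence are available for comparison. The model, the variational family, and all names in the code are assumptions made for the example:

```python
# Monte Carlo estimate of the ELBO, E_q[log p(D, w)] - E_q[log q(w)],
# for an assumed toy model: prior w ~ N(0, 1), likelihood x_i | w ~ N(w, 1),
# variational family q(w) = N(mu, sigma^2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=1.5, scale=1.0, size=20)  # synthetic data D
n = len(x)

def elbo(mu, sigma, num_samples=100_000):
    """Average of log p(D, w) - log q(w) over samples w ~ q."""
    w = rng.normal(mu, sigma, size=num_samples)
    log_joint = (norm.logpdf(x[:, None], loc=w, scale=1.0).sum(axis=0)
                 + norm.logpdf(w, loc=0.0, scale=1.0))
    log_q = norm.logpdf(w, loc=mu, scale=sigma)
    return np.mean(log_joint - log_q)

# Exact posterior for this conjugate model: N(sum(x)/(n+1), 1/(n+1)).
post_mu, post_sigma = x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))

# Exact log evidence via log p(D) = log p(D|w) + log p(w) - log p(w|D),
# which holds at any w; we evaluate it at w = 0.
w0 = 0.0
log_evidence = (norm.logpdf(x, loc=w0, scale=1.0).sum()
                + norm.logpdf(w0, loc=0.0, scale=1.0)
                - norm.logpdf(w0, loc=post_mu, scale=post_sigma))

print("ELBO at a suboptimal q:", elbo(0.0, 1.0))             # strictly below log p(D)
print("ELBO at the posterior: ", elbo(post_mu, post_sigma))  # matches log p(D)
print("log p(D):              ", log_evidence)
```

With $q$ set to the exact posterior, the gap $\log p(\mathcal{D}) - \mathcal{L}_{VI}(q)$ (which is exactly the KL divergence from above) vanishes up to Monte Carlo noise.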
Overall we have shown that
$$\log p(\mathcal{D}) \ge \mathcal{L}_{VI}(q) = \mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] - D_{KL}[q(w) \,\|\, p(w)]$$
Now, $\log p(\mathcal{D})$ is what appears in the Gibbs posterior version of the PAC-Bayes theorem. On the other hand, if we substitute the right-hand side in place of $\log p(\mathcal{D})$, we get the general version of the PAC-Bayes theorem, which shows that the Gibbs posterior gives the tightest PAC-Bayes bound! (Remember that $\mathbb{E}_{q(w)}[\log p(\mathcal{D} \mid w)] = -\mathbb{E}_{q(w)}\left[\sum_i \ell(x_i; w)\right]$, where $\ell$ is the loss function and the $x_i$ are the data.)
Therefore, maximizing the ELBO is like minimizing the PAC-Bayesian bound!
(Note that under the right relation between the loss function and the likelihood, the Gibbs posterior is just the Bayes Posterior.)
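To spell out this parenthetical: a minimal derivation, assuming i.i.d. data and the Gibbs posterior at unit temperature. If the loss is the negative log-likelihood, $\ell(x; w) = -\log p(x \mid w)$, then
$$q^*(w) \propto p(w)\, e^{-\sum_i \ell(x_i; w)} = p(w) \prod_i p(x_i \mid w) = p(w)\, p(\mathcal{D} \mid w) \propto p(w \mid \mathcal{D}),$$
which is exactly the Bayes Posterior.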
courtesy of Adria Garriga Alonso.
http://users.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf