Evidence lower bound

cosmos 10th April 2019 at 10:58am
Variational inference

aka ELBO

An objective function which is maximized in Variational inference. We are interested in minimizing the KL divergence between the variational distribution $q(w)$ and the true Posterior $p(w|\mathcal{D})$ of some parameters $w$ given the data $\mathcal{D}$, which is equivalent to maximizing

$$\mathcal{L}_{VI}(q) = \log{p(\mathcal{D})}-\mathbb{D}_{KL}[q(w)||p(w|\mathcal{D})]$$

with respect to $q$ (as the first term doesn't depend on $q$). Note that because the KL divergence is $0$ if and only if $q(w)=p(w|\mathcal{D})$, if the set of $q$ over which we are optimizing contains the posterior, then the unique maximizer of $\mathcal{L}_{VI}(q)$ is the posterior.

Note that because the KL divergence is always nonnegative,

$$\mathcal{L}_{VI}(q) \leq \log{p(\mathcal{D})}$$

with equality iff $q$ is the posterior. $\log{p(\mathcal{D})}$ is the Bayesian evidence. This is why we call $\mathcal{L}_{VI}(q)$ the evidence lower bound.
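
As a quick numerical illustration (not part of the original note): below is a minimal Python sketch for a toy conjugate Gaussian model, $x_i \sim \mathcal{N}(w,\sigma^2)$ with prior $w \sim \mathcal{N}(\mu_0,\tau_0^2)$, where the posterior is closed form and $\log{p(\mathcal{D})}$ can be computed by quadrature. All names and numbers here are illustrative assumptions; the point is just that $\mathcal{L}_{VI}(q) = \log{p(\mathcal{D})}-\mathbb{D}_{KL}[q(w)||p(w|\mathcal{D})]$ never exceeds $\log{p(\mathcal{D})}$ and attains it exactly when $q$ is the posterior.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

# Toy conjugate model (illustrative choices): x_i ~ N(w, sigma^2), prior w ~ N(mu0, tau0^2)
rng = np.random.default_rng(0)
sigma, mu0, tau0 = 1.0, 0.0, 2.0
x = rng.normal(1.5, sigma, size=20)
n = len(x)

# Closed-form Gaussian posterior p(w|D) = N(mu_n, tau_n^2)
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)
mu_n = tau_n2 * (mu0 / tau0**2 + x.sum() / sigma**2)

# log evidence log p(D) = log \int p(D|w) p(w) dw, approximated by quadrature on a grid
w_grid = np.linspace(mu_n - 10, mu_n + 10, 20001)
log_joint = norm.logpdf(x[:, None], loc=w_grid, scale=sigma).sum(0) + norm.logpdf(w_grid, mu0, tau0)
log_evidence = logsumexp(log_joint) + np.log(w_grid[1] - w_grid[0])

def kl_gauss(m, s, m2, s2):
    """KL( N(m, s^2) || N(m2, s2^2) ), closed form."""
    return np.log(s2 / s) + (s**2 + (m - m2) ** 2) / (2 * s2**2) - 0.5

def elbo(m, s):
    """ELBO = log p(D) - KL(q || posterior) for a Gaussian q = N(m, s^2)."""
    return log_evidence - kl_gauss(m, s, mu_n, np.sqrt(tau_n2))

# Any other q gives ELBO < log p(D); the posterior itself attains equality.
print(elbo(0.0, 1.0), elbo(mu_n, np.sqrt(tau_n2)), log_evidence)
```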

The issue is that evaluating the posterior $p(w|\mathcal{D})$ is hard, by assumption, so we can't compute $\mathbb{D}_{KL}[q(w)||p(w|\mathcal{D})]$ directly, and the above form is not very useful. Instead we rewrite it as:

$$\mathcal{L}_{VI}(q) = -\mathbb{D}_{KL}[q(w)||p(\mathcal{D})p(w|\mathcal{D})] = -\mathbb{D}_{KL}[q(w)||p(w,\mathcal{D})] = -\mathbb{D}_{KL}[q(w)||p(\mathcal{D}|w)p(w)]$$

$$= \mathbb{E}_{q(w)}[\log{p(\mathcal{D}|w)}]-\mathbb{D}_{KL}[q(w)||p(w)]$$
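
For completeness, the algebra behind this chain is just Bayes' rule, $p(w|\mathcal{D}) = p(\mathcal{D}|w)p(w)/p(\mathcal{D})$, substituted into the definition (a standard identity, spelled out here since the note skips it):

$$\begin{aligned}
\mathcal{L}_{VI}(q) &= \log{p(\mathcal{D})} - \mathbb{E}_{q(w)}\left[\log{q(w)} - \log{p(w|\mathcal{D})}\right] \\
&= \log{p(\mathcal{D})} - \mathbb{E}_{q(w)}\left[\log{q(w)} - \log{p(\mathcal{D}|w)} - \log{p(w)} + \log{p(\mathcal{D})}\right] \\
&= \mathbb{E}_{q(w)}[\log{p(\mathcal{D}|w)}] - \mathbb{E}_{q(w)}\left[\log{q(w)} - \log{p(w)}\right] \\
&= \mathbb{E}_{q(w)}[\log{p(\mathcal{D}|w)}] - \mathbb{D}_{KL}[q(w)||p(w)]
\end{aligned}$$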

which can also be written in terms of the joint probability and the entropy of the variational distribution:

$$= \mathbb{E}_{q(w)}[\log{p(\mathcal{D},w)}]-\mathbb{E}_{q(w)}[\log{q(w)}]$$

The last form is the one used for the optimization!
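
In practice the expectations in this last form are estimated with Monte Carlo samples from $q$ and the estimate is fed to a gradient-based optimizer (typically via the reparameterization trick). A minimal sketch of the estimator, reusing the toy Gaussian model and the `elbo` helper from the sketch above (again, illustrative names, not a prescribed implementation):

```python
def mc_elbo(m, s, num_samples=10_000, seed=1):
    """Monte Carlo estimate of E_q[log p(D, w)] - E_q[log q(w)] for q = N(m, s^2)."""
    eps = np.random.default_rng(seed).standard_normal(num_samples)
    ws = m + s * eps                                                # reparameterized samples w ~ q
    log_lik = norm.logpdf(x[:, None], loc=ws, scale=sigma).sum(0)   # log p(D|w)
    log_prior = norm.logpdf(ws, mu0, tau0)                          # log p(w)
    log_q = norm.logpdf(ws, m, s)                                   # log q(w)
    return np.mean(log_lik + log_prior - log_q)

# Agrees with the closed-form ELBO up to Monte Carlo noise; in a real variational
# fit one would differentiate this estimate with respect to (m, s) and ascend it.
print(mc_elbo(0.5, 0.5), elbo(0.5, 0.5))
```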

Overall we have shown that

$$\log{p(\mathcal{D})} \geq \mathcal{L}_{VI}(q) = \mathbb{E}_{q(w)}[\log{p(\mathcal{D}|w)}]-\mathbb{D}_{KL}[q(w)||p(w)]$$

Now, $\log{p(\mathcal{D})}$ is what appears in the Gibbs posterior version of the PAC-Bayes theorem. On the other hand, if we substitute the right-hand side in place of $\log{p(\mathcal{D})}$, we get the general version of the PAC-Bayes theorem, which shows us that the Gibbs posterior gives the tightest PAC-Bayes bound! (Remember that $\mathbb{E}_{q(w)}[\log{p(\mathcal{D}|w)}] = -\mathbb{E}_{q(w)}[\sum_i l(x_i;w)]$, where $l$ is the loss function and the $x_i$ are the data.)

Therefore, maximizing the ELBO is like minimizing the PAC-Bayesian bound!
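
Spelled out (a sketch of the correspondence, under the assumption that the loss is the negative log-likelihood, $l(x_i;w) = -\log{p(x_i|w)}$):

$$\mathcal{L}_{VI}(q) = -\mathbb{E}_{q(w)}\left[\sum_i l(x_i;w)\right] - \mathbb{D}_{KL}[q(w)||p(w)]$$

so maximizing the ELBO over $q$ is the same as minimizing the KL-regularized empirical risk $\mathbb{E}_{q(w)}[\sum_i l(x_i;w)] + \mathbb{D}_{KL}[q(w)||p(w)]$, which is the quantity PAC-Bayes bounds penalize. Its unconstrained minimizer is the Gibbs posterior $q(w) \propto p(w)e^{-\sum_i l(x_i;w)}$, with minimum value $-\log{p(\mathcal{D})}$, matching the equality case above.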

(note that under the right relation between the loss function and the likelihood, the Gibbs posterior is just the Bayes Posterior)


courtesy of Adria Garriga Alonso.

http://users.umiacs.umd.edu/~xyang35/files/understanding-variational-lower.pdf