Variational inference

cosmos 31st March 2019 at 5:15am
Approximate Bayesian inference Statistical inference

aka VI, variational Bayes, Ensemble learning, Minimum description length

Variational inference is based on reducing inference to an Optimization problem. In Statistical inference, we are often interested in finding a Posterior distribution, defined via a Probabilistic model, like a Graphical model, and some observed data. However, posterior distributions are often hard to compute explicitly, and an approximate method is needed.

Variational inference consists of finding an approximate representation of the posterior distribution (the variational distribution), minimizing the KL divergence to the true Posterior distribution. This is often done by choosing a tractable family of distributions $\mathcal{P}$ and maximizing a variational objective function like the Evidence lower bound (ELBO) over this family.
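
Concretely (writing $w$ for the latent variables/parameters and $D$ for the data, as in the rest of this note), the standard decomposition is

$$\mathbb{D}_{KL}[q(w)||p(w|D)] = \log p(D) - \mathrm{ELBO}(q), \qquad \mathrm{ELBO}(q) = \mathbb{E}_{q(w)}\left[\log p(D,w) - \log q(w)\right].$$

Since $\log p(D)$ does not depend on $q$, maximizing the ELBO over $q \in \mathcal{P}$ is the same as minimizing the KL divergence to the posterior.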

See here.

Why is variational inference more efficient than MCMC methods?

But how does it compare in terms of accuracy?

The interesting thing is that the KL divergence/ELBO is defined via an integral over the latent/hidden variables, which is precisely the thing that we assumed was hard, and the reason to use variational inference! However, the variational distribution $q(w)$ may be much easier to integrate than the posterior!
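
Spelling that out: the hard integral is the evidence $p(D) = \int p(D|w)\,p(w)\,dw$, whereas the ELBO only involves expectations under $q$,

$$\mathrm{ELBO}(q) = \int q(w)\left[\log p(D,w) - \log q(w)\right] dw,$$

which is tractable (sometimes even in closed form) when $q$ is a simple distribution like a Gaussian, and can otherwise be estimated by sampling $w \sim q(w)$, which is easy by construction.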

Otherwise, we can use samples instead of the integration and use Stochastic optimization (like Stochastic gradient descent) [ref]. But we could also have used sampling for the original integral, or used Monte Carlo methods (MCMC). In that case, the nontrivial thing is that stochastic optimization can give a good answer with far fewer samples than naively approximating the integral in Bayesian inference (the denominator/evidence) <- which I think is true; the difference in convergence rates between these two is quite big in high-dimensional problems, I think. Variational inference is also typically more scalable (efficient for large datasets) than MCMC methods. See examples in Variational inference for Gaussian processes.
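
As a minimal sketch of the stochastic-optimization route (not from any particular reference): below is a reparameterization-trick estimate of the ELBO gradient in a toy conjugate model (prior $w \sim \mathcal{N}(0,1)$, likelihood $x_i \sim \mathcal{N}(w,1)$, Gaussian $q(w) = \mathcal{N}(\mu, \sigma^2)$), chosen so the learned $q$ can be checked against the exact posterior. The step size and sample counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (chosen so the exact posterior is known): prior w ~ N(0, 1),
# likelihood x_i ~ N(w, 1).  Variational family: q(w) = N(mu, sigma^2), sigma = exp(rho).
x = rng.normal(loc=1.5, scale=1.0, size=50)
n = len(x)

mu, rho = 0.0, 0.0          # variational parameters
lr, n_samples = 0.01, 8     # step size and Monte Carlo samples per step (arbitrary choices)

for step in range(2000):
    sigma = np.exp(rho)
    grad_mu, grad_rho = 0.0, 0.0
    for _ in range(n_samples):
        eps = rng.normal()
        w = mu + sigma * eps                                  # reparameterization
        dlogjoint_dw = np.sum(x - w) - w                      # d/dw [log p(D|w) + log p(w)]
        grad_mu += dlogjoint_dw / n_samples                   # chain rule: dw/dmu = 1
        grad_rho += dlogjoint_dw * eps * sigma / n_samples    # chain rule: dw/drho = eps * sigma
    grad_rho += 1.0          # gradient of the entropy of q: H = log sigma + const, dH/drho = 1
    mu += lr * grad_mu       # gradient *ascent* on the ELBO
    rho += lr * grad_rho

# Exact posterior for this conjugate model: N(sum(x)/(n+1), 1/(n+1))
print("q:     mean %.3f  var %.4f" % (mu, np.exp(rho) ** 2))
print("exact: mean %.3f  var %.4f" % (np.sum(x) / (n + 1), 1.0 / (n + 1)))
```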

To understand this would require analysis of MCMC, Monte Carlo integration, and stochastic optimization.

Relation to Maximum likelihood estimation

The Evidence lower bound is, as its name suggests, a lower bound on the (log) Bayesian evidence (aka Marginal likelihood of the data). In fact, its maximum (over all distributions) equals the actual log marginal likelihood of the data. Therefore, we can do maximum likelihood estimation by optimizing the ELBO w.r.t. the parameters of a model of the data (a Generative model in the usual context). So if a model has a likelihood which is intractable to compute directly, we can instead optimize the ELBO w.r.t. the model parameters as well as w.r.t. the variational distribution. If we optimized the variational-distribution part first to convergence, over a sufficiently expressive family, this would be the same as direct maximum likelihood estimation of the model parameters. Optimizing both at the same time is usually more efficient, of course.
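
To make this concrete (writing $\theta$ for the model parameters, a notational addition here):

$$\log p_\theta(D) = \mathrm{ELBO}(q, \theta) + \mathbb{D}_{KL}[q(w)||p_\theta(w|D)] \ \geq\ \mathrm{ELBO}(q, \theta),$$

with equality exactly when $q(w) = p_\theta(w|D)$. So if the variational family contains the posterior, $\max_q \mathrm{ELBO}(q,\theta) = \log p_\theta(D)$, and maximizing over $\theta$ as well is maximum likelihood estimation.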

This can be used even in supervised learning when one does Generative supervised learning, and when doing that one can do Semi-supervised learning (see here for an example). One may also get other benefits: more principled (and therefore potentially more accurate) uncertainty estimates, more types of inference than with a simple discriminative model, and it is potentially easier to incorporate prior knowledge into the generative model than into a discriminative one.

Zero-avoidance property

Zero-avoidance property of variational inference. In $\mathbb{D}_{KL}[q||p]$ there is a large positive contribution from regions in which $p$ is near zero unless $q$ is also close to zero there. On the other hand, $\mathbb{D}_{KL}[p||q]$ is minimized by a $q$ that covers the mass of $p$. You want the first distribution to be like a "subset" of the second, because the first is "testing" the second one's code: you don't want to test it in a place which will utterly surprise the second one, as it will have an infinite code length there, and the KL divergence would diverge (badum tss).

The above implies that variational inference via the ELBO tends to overconcentrate the posterior. Minimizing $\mathbb{D}_{KL}[p(w|D)||q(w)]$ instead leads to a different type of method, e.g. Expectation propagation, which is then expected to underconcentrate the posterior. The two objectives can give very different approximations. See [Bishop, 2006] Figure 10.3.
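
A quick numerical illustration of the two KL directions (pure numpy; the bimodal target and the two candidate Gaussians are toy choices just for this check): a $q$ sitting on one mode wins under $\mathbb{D}_{KL}[q||p]$, while a $q$ spreading over both modes wins under $\mathbb{D}_{KL}[p||q]$.

```python
import numpy as np

# Target p: bimodal mixture of two Gaussians; candidate q's: single Gaussians.
# Compare KL[q||p] (ELBO-style VI) and KL[p||q] (EP-style) by numerical integration.
w = np.linspace(-10, 10, 20001)
dw = w[1] - w[0]

def normal(w, mean, std):
    return np.exp(-0.5 * ((w - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

p = 0.5 * normal(w, -2.0, 0.5) + 0.5 * normal(w, 2.0, 0.5)
q_mode = normal(w, 2.0, 0.5)     # sits on one mode ("mode-seeking")
q_cover = normal(w, 0.0, 2.2)    # spreads over both modes ("mass-covering")

def kl(a, b):
    """KL[a||b] on the grid; a and b are density values."""
    mask = a > 1e-300
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(b[mask]))) * dw

for name, q in [("mode-seeking q ", q_mode), ("mass-covering q", q_cover)]:
    print(f"{name}:  KL[q||p] = {kl(q, p):6.2f}   KL[p||q] = {kl(p, q):6.2f}")
```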

See here

I guess variational inference can be used to estimate the Marginal likelihood, by inverting Bayes' theorem to get $p(D) = \frac{p(D|w)p(w)}{p(w|D)} \approx \frac{p(D|w)p(w)}{q(w)}$, which can be evaluated at any $w$, but preferentially at $w$ sampled from the variational distribution, as at those $w$ the variational approximation is likely better. The numerator is tractable to compute in the cases where VI is applied. Using this formula we can see that a method (such as ELBO-based VI) which tends to overconcentrate the posterior (resulting in a larger $q(w)$ at the $w$ where it places its mass) will give a smaller likelihood estimate (the intuition being that the data appearing less likely corresponds to there being fewer $w$ which produce it, so each such $w$, including the one we evaluate at, gets more weight under $q$); on the other hand, methods (such as EP) which tend to underconcentrate the posterior will give a larger likelihood estimate. This probably explains why we obtain lower values for the PAC-Bayes bounds when using EP, compared to VI.
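
A quick numeric sanity check of this in the same toy conjugate Gaussian model as above, where the exact posterior is available (scipy for the Gaussian log-densities; halving the posterior standard deviation is just an arbitrary stand-in for an overconcentrated $q$): the overconcentrated $q$ gives a smaller estimate of $\log p(D)$ than plugging in the exact posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Same toy conjugate model as above: prior w ~ N(0,1), likelihood x_i ~ N(w,1),
# so the exact posterior is N(sum(x)/(n+1), 1/(n+1)) and the identity can be checked.
x = rng.normal(1.5, 1.0, size=50)
n = len(x)
post_mean, post_std = x.sum() / (n + 1), np.sqrt(1.0 / (n + 1))

def log_evidence_estimate(q_mean, q_std, w):
    """log p(D) approx= log p(D|w) + log p(w) - log q(w), evaluated at a single w."""
    log_lik = norm.logpdf(x, loc=w, scale=1.0).sum()
    log_prior = norm.logpdf(w, loc=0.0, scale=1.0)
    log_q = norm.logpdf(w, loc=q_mean, scale=q_std)
    return log_lik + log_prior - log_q

# With q equal to the exact posterior, the identity is exact for any w:
exact = log_evidence_estimate(post_mean, post_std, post_mean)

# An overconcentrated q (same mean, half the std), averaged over w ~ q:
q_std = post_std / 2
ws = rng.normal(post_mean, q_std, size=10000)
overconc = np.mean([log_evidence_estimate(post_mean, q_std, w) for w in ws])

print("log p(D) from exact posterior:    %.3f" % exact)
print("estimate from overconcentrated q: %.3f" % overconc)
```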

Mean field VI

One choice for the family of distributions over which to optimize is the set of distributions in which the individual parameters (or latent variables) are independent. This is known as the mean field approximation (as we ignore statistical dependencies).
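
That is, the variational family is restricted to fully factorized distributions, $q(w) = \prod_i q_i(w_i)$. The ELBO can then be maximized one factor at a time; the standard coordinate-ascent update is

$$q_j^*(w_j) \propto \exp\left(\mathbb{E}_{\prod_{i \neq j} q_i}\left[\log p(w, D)\right]\right),$$

where the expectation is over all factors except the $j$-th (see [Bishop, 2006] Chapter 10).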

Applications

Variational autoencoder


It is typically applied for Bayesian inference; to be more specific, one may then call it Bayesian variational inference.