Expectation propagation

Approximate Bayesian inference, Message passing

http://www.gaussianprocess.org/gpml/chapters/RW3.pdf

https://tminka.github.io/papers/ep/roadmap.html

The expectation propagation (EP) approach to Gaussian process classification, where the likelihood function is intractable, approximates the likelihood function (as a function of the latent variable $f_i$) as an unnormalized Gaussian:

$$p(y_i \mid f_i) \simeq t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \triangleq \tilde{Z}_i \, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$$

$\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ are the site parameters. Note that even though the likelihood (for instance a probit or a Heaviside likelihood) may look very different from a Gaussian, the idea is really to approximate the value of the likelihood near the regions where the posterior places significant weight. Those regions are concentrated (unlike the likelihood function, which may be 1 for $f_i$ arbitrarily large) due to the prior on $f_i$ (which is Gaussian in the case of Gaussian processes).

The approximation to the posterior is then:

$$q(\mathbf{f} \mid X, \mathbf{y}) \triangleq \frac{1}{Z_{\mathrm{EP}}} p(\mathbf{f} \mid X) \prod_{i=1}^{n} t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$$

with $\boldsymbol{\mu} = \Sigma \tilde{\Sigma}^{-1} \tilde{\boldsymbol{\mu}}$ and $\Sigma = \left(K^{-1} + \tilde{\Sigma}^{-1}\right)^{-1}$, where $K$ is the kernel matrix of the GP and $\tilde{\Sigma}$ is the diagonal matrix of site variances $\tilde{\sigma}_i^2$.

$Z_{\mathrm{EP}}$, the normalization constant, is then the EP approximation to the marginal likelihood of the data.
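As a minimal numpy sketch of this step (the function name and variable names are illustrative, not from any particular library), the posterior parameters implied by a given set of site parameters could be computed as:

```python
import numpy as np

def ep_posterior(K, mu_tilde, sigma2_tilde):
    """Posterior N(mu, Sigma) implied by the GP prior and the Gaussian sites.

    K            : (n, n) kernel matrix of the GP prior
    mu_tilde     : (n,) site means
    sigma2_tilde : (n,) site variances
    """
    S_tilde_inv = np.diag(1.0 / sigma2_tilde)              # \tilde{\Sigma}^{-1}
    Sigma = np.linalg.inv(np.linalg.inv(K) + S_tilde_inv)  # (K^{-1} + \tilde{\Sigma}^{-1})^{-1}
    mu = Sigma @ (S_tilde_inv @ mu_tilde)                  # \Sigma \tilde{\Sigma}^{-1} \tilde{\mu}
    return mu, Sigma
```

In practice one would use the Cholesky-based reformulation from the GPML pseudocode (linked in the Implementation section below) rather than inverting $K$ directly.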

The EP iterative procedure

After initializing the site parameters with some initial guess, we repeatedly sweep over the sites, performing the following update at each site so that the posterior approximation becomes more accurate. We iterate, hopefully until convergence to the best accuracy this form of approximate posterior can achieve. (A code sketch of one full site update follows the three steps below.)

1. We compute the cavity distribution for site $i$. This is the approximate marginal distribution of the latent variable at site $i$, conditioned on the observations at every site except $i$. It can be obtained by dividing the approximate marginal of $f_i$ conditioned on every site (which is $q(f_i \mid X, \mathbf{y}) = \mathcal{N}(f_i \mid \mu_i, \sigma_i^2)$) by the approximate likelihood for site $i$ (the local approximation $t_i$), and then normalizing; this gives:
$$q_{-i}(f_i) \triangleq \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2), \quad \text{where } \mu_{-i} = \sigma_{-i}^2 \left(\sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i\right) \text{ and } \sigma_{-i}^2 = \left(\sigma_i^{-2} - \tilde{\sigma}_i^{-2}\right)^{-1}$$
As Rasmussen and Williams note in the GPML chapter linked above, we can also think of constructing $q_{-i}(f_i)$ by multiplying only the approximate likelihoods in $q(\mathbf{f} \mid X, \mathbf{y})$ that don't correspond to site $i$, and then integrating over all of $\mathbf{f}$ except $f_i$.

2. To proceed, we find the new (un-normalized) Gaussian marginal which best approximates the product of the cavity distribution and the exact likelihood:
$$\hat{q}(f_i) \triangleq \hat{Z}_i \, \mathcal{N}(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2) \simeq q_{-i}(f_i) \, p(y_i \mid f_i)$$
It is well known that when $q(x)$ is constrained to be Gaussian, the $q(x)$ which minimizes $\mathrm{KL}(p(x) \,\|\, q(x))$ is the one whose first and second moments match those of $p(x)$; see eq. (A.24) in Rasmussen and Williams. As $\hat{q}(f_i)$ is un-normalized, we additionally impose the condition that the zeroth moments (normalizing constants) should match. (Indeed, we could have just decided to match the normalized marginal, moving $\hat{Z}_i$ to the RHS.) For the probit likelihood $p(y_i \mid f_i) = \Phi(y_i f_i)$, the parameters of the new marginal distribution are:
$$\hat{Z}_i = \Phi(z_i), \quad \hat{\mu}_i = \mu_{-i} + \frac{y_i \sigma_{-i}^2 \mathcal{N}(z_i)}{\Phi(z_i) \sqrt{1 + \sigma_{-i}^2}}$$
$$\hat{\sigma}_i^2 = \sigma_{-i}^2 - \frac{\sigma_{-i}^4 \mathcal{N}(z_i)}{(1 + \sigma_{-i}^2) \Phi(z_i)} \left(z_i + \frac{\mathcal{N}(z_i)}{\Phi(z_i)}\right), \quad \text{where} \quad z_i = \frac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}$$
where $\mathcal{N}$ is the standard Gaussian density and $\Phi$ is the standard Gaussian cumulative distribution function.
3. Finally, we update the local likelihood approximation at site $i$ to match the approximate marginal of $f_i$. The site parameters in terms of the marginal distribution parameters are:
$$\tilde{\mu}_i = \tilde{\sigma}_i^2 \left(\hat{\sigma}_i^{-2} \hat{\mu}_i - \sigma_{-i}^{-2} \mu_{-i}\right), \quad \tilde{\sigma}_i^2 = \left(\hat{\sigma}_i^{-2} - \sigma_{-i}^{-2}\right)^{-1}$$
$$\tilde{Z}_i = \hat{Z}_i \sqrt{2\pi} \sqrt{\sigma_{-i}^2 + \tilde{\sigma}_i^2} \exp\left(\tfrac{1}{2} \left(\mu_{-i} - \tilde{\mu}_i\right)^2 / \left(\sigma_{-i}^2 + \tilde{\sigma}_i^2\right)\right)$$
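Here is a minimal numpy/scipy sketch of one full site update for the probit likelihood, combining the three steps above. All names are made up for illustration, and the numerics are deliberately naive; the GPML pseudocode linked in the Implementation section is the careful version:

```python
import numpy as np
from scipy.stats import norm

def ep_site_update(mu_i, sigma2_i, mu_tilde_i, sigma2_tilde_i, y_i):
    """One EP update at site i for the probit likelihood p(y|f) = Phi(y*f).

    mu_i, sigma2_i             : current approximate marginal q(f_i)
    mu_tilde_i, sigma2_tilde_i : current site parameters
    y_i                        : label in {-1, +1}
    Returns the new site parameters (mu_tilde, sigma2_tilde, Z_tilde).
    """
    # Step 1: cavity distribution q_{-i}(f_i) = N(mu_cav, sigma2_cav)
    sigma2_cav = 1.0 / (1.0 / sigma2_i - 1.0 / sigma2_tilde_i)
    mu_cav = sigma2_cav * (mu_i / sigma2_i - mu_tilde_i / sigma2_tilde_i)

    # Step 2: moments of q_{-i}(f_i) * p(y_i | f_i) for the probit likelihood
    denom = np.sqrt(1.0 + sigma2_cav)
    z = y_i * mu_cav / denom
    Z_hat = norm.cdf(z)
    ratio = norm.pdf(z) / Z_hat                 # N(z) / Phi(z)
    mu_hat = mu_cav + y_i * sigma2_cav * ratio / denom
    sigma2_hat = sigma2_cav - sigma2_cav**2 * ratio * (z + ratio) / (1.0 + sigma2_cav)

    # Step 3: new site parameters that reproduce the matched marginal
    sigma2_tilde_new = 1.0 / (1.0 / sigma2_hat - 1.0 / sigma2_cav)
    mu_tilde_new = sigma2_tilde_new * (mu_hat / sigma2_hat - mu_cav / sigma2_cav)
    Z_tilde_new = Z_hat * np.sqrt(2 * np.pi) * np.sqrt(sigma2_cav + sigma2_tilde_new) \
        * np.exp(0.5 * (mu_cav - mu_tilde_new)**2 / (sigma2_cav + sigma2_tilde_new))
    return mu_tilde_new, sigma2_tilde_new, Z_tilde_new

# e.g. one update at a site with marginal N(0, 1), a vague site, and label +1:
print(ep_site_update(0.0, 1.0, 0.0, 1e6, +1))
```

Between site updates, the marginal parameters $\mu_i, \sigma_i^2$ of $q$ must themselves be recomputed from the updated site parameters (e.g. via the posterior sketch above).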

So overall, we update the approximate posterior parameters as follows: assume that the marginal of $f_i$ given all the observations other than $i$ is right. Then approximate the marginal of $f_i$ obtained using the true likelihood $p(y_i \mid f_i)$ with a Gaussian. Finally, find an approximation to the likelihood at $i$ that gives this (optimal) Gaussian approximation to the marginal of $f_i$. The idea is that this approximation to the likelihood approximates the true likelihood well in the regions where $f_i$ is likely (as determined by the cavity distribution $q_{-i}(f_i)$, which is itself determined by the other sites). We keep updating, propagating the improved estimates of the marginals.

Note that we use the $-i$ subscript for the parameters of the cavity distribution $q_{-i}(f_i)$ (not conditioned on site $i$), which is Gaussian because the approximate posterior is Gaussian and the likelihood approximations are Gaussian.

Implementation

See http://www.gaussianprocess.org/gpml/chapters/RW3.pdf#page=26 for pseudocode, and https://gpy.readthedocs.io/en/deploy/_modules/GPy/likelihoods/bernoulli.html for real code.

In implementing it, one typically defines the following variables:

$$\tilde{\tau}_i = \tilde{\sigma}_i^{-2}, \quad \tilde{S} = \mathrm{diag}(\tilde{\boldsymbol{\tau}}), \quad \tilde{\boldsymbol{\nu}} = \tilde{S} \tilde{\boldsymbol{\mu}}, \quad \tau_{-i} = \sigma_{-i}^{-2}, \quad \nu_{-i} = \tau_{-i} \mu_{-i}$$
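In numpy these definitions are one-liners (a sketch with arbitrary example values, just to make the snippet runnable):

```python
import numpy as np

# example per-site quantities (arbitrary values)
mu_tilde = np.array([0.5, -0.2, 1.0])
sigma2_tilde = np.array([1.5, 2.0, 0.8])

tau_tilde = 1.0 / sigma2_tilde    # \tilde{\tau}_i = \tilde{\sigma}_i^{-2}
S_tilde = np.diag(tau_tilde)      # \tilde{S} = diag(\tilde{\tau})
nu_tilde = tau_tilde * mu_tilde   # \tilde{\nu} = \tilde{S} \tilde{\mu}, since S_tilde is diagonal
```

Working with precisions ($\tau$) and precision-adjusted means ($\nu$) is convenient because products and quotients of Gaussians become sums and differences of these natural parameters.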

Also note that, as used here, $\frac{\mathcal{N}(z_i)}{\Phi(z_i)}$ equals the derivative of $\log \Phi(z_i)$ with respect to its argument (since $\mathcal{N}$ is the derivative of $\Phi$).
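A quick numerical check of this identity (a throwaway snippet, not from any reference):

```python
import numpy as np
from scipy.stats import norm

z, eps = 0.3, 1e-6
analytic = norm.pdf(z) / norm.cdf(z)
numeric = (np.log(norm.cdf(z + eps)) - np.log(norm.cdf(z - eps))) / (2 * eps)
print(analytic, numeric)  # the two values agree to high precision
```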