http://www.gaussianprocess.org/gpml/chapters/RW3.pdf
https://tminka.github.io/papers/ep/roadmap.html
The expectation propagation (EP) approach to Gaussian process classification, where the non-Gaussian likelihood makes the posterior intractable, approximates the likelihood (as a function of the latent variable $f_i$) by an unnormalized Gaussian
$$p(y_i \mid f_i) \simeq t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) \triangleq \tilde{Z}_i \, \mathcal{N}(f_i \mid \tilde{\mu}_i, \tilde{\sigma}_i^2)$$
$\tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2$ are the site parameters. Note that even though the likelihood (for instance a probit or a Heaviside likelihood) may look very different from a Gaussian, the idea really is to approximate the value of the likelihood only near the regions where the posterior places significant weight. Those regions are concentrated (unlike the likelihood function, which may stay at 1 for arbitrarily large $f_i$), due to the prior on $f_i$, which is Gaussian in the case of Gaussian processes.
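As a quick numerical illustration of this global mismatch in shape (a minimal sketch; the site parameter values below are made up for illustration, not the output of running EP):

```python
import numpy as np
from scipy.stats import norm

# Made-up site parameters, purely for illustration:
Z_tilde, mu_tilde, sigma2_tilde = 0.8, 1.2, 4.0

def site_approx(f):
    """Unnormalized-Gaussian site: t(f) = Z_tilde * N(f | mu_tilde, sigma2_tilde)."""
    return Z_tilde * norm.pdf(f, loc=mu_tilde, scale=np.sqrt(sigma2_tilde))

def probit_likelihood(f, y=1):
    """Exact probit likelihood p(y | f) = Phi(y * f)."""
    return norm.cdf(y * f)

# Globally the two look nothing alike: the likelihood tends to 1 as f grows,
# while the site approximation decays. EP only needs them to agree on the
# region where the posterior over f actually puts its mass.
f = np.linspace(-3.0, 6.0, 10)
print(np.column_stack([f, probit_likelihood(f), site_approx(f)]))
```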
The approximation to the posterior is then:
$$q(\mathbf{f} \mid X, \mathbf{y}) \triangleq \frac{1}{Z_{EP}} \, p(\mathbf{f} \mid X) \prod_{i=1}^n t_i(f_i \mid \tilde{Z}_i, \tilde{\mu}_i, \tilde{\sigma}_i^2) = \mathcal{N}(\mathbf{f} \mid \boldsymbol{\mu}, \Sigma)$$
with $\boldsymbol{\mu} = \Sigma \tilde{\Sigma}^{-1} \tilde{\boldsymbol{\mu}}$ and $\Sigma = (K^{-1} + \tilde{\Sigma}^{-1})^{-1}$, where $\tilde{\Sigma} = \operatorname{diag}(\tilde{\sigma}_1^2, \dots, \tilde{\sigma}_n^2)$ and $K$ is the kernel matrix of the GP.
$Z_{EP}$, the normalization constant, is therefore the EP approximation to the marginal likelihood of the data.
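For concreteness, here is a sketch of how $\boldsymbol{\mu}$ and $\Sigma$ would be assembled from given site parameters; the function name is mine, and direct matrix inversion is used for readability rather than the numerically stabler Cholesky route of the GPML pseudocode:

```python
import numpy as np

def ep_posterior(K, mu_tilde, sigma2_tilde):
    """Parameters of q(f) = N(mu, Sigma) from the site parameters.

    Implements Sigma = (K^{-1} + Sigma_tilde^{-1})^{-1} and
    mu = Sigma @ Sigma_tilde^{-1} @ mu_tilde, with Sigma_tilde = diag(sigma2_tilde).
    """
    Sigma_tilde_inv = np.diag(1.0 / sigma2_tilde)
    Sigma = np.linalg.inv(np.linalg.inv(K) + Sigma_tilde_inv)
    mu = Sigma @ (Sigma_tilde_inv @ mu_tilde)
    return mu, Sigma
```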
The EP iterative procedure
After initializing the site parameters with some initial guess, we repeatedly sweep over the sites, performing the following update at each one. Each sweep should make the posterior approximation more accurate, hopefully until convergence to the best accuracy our approximate form of the posterior can achieve. For each site $i$:
- 1. We compute the cavity distribution for site $i$. This is the approximate marginal distribution of the latent variable at site $i$, conditioned on the observations at every site except $i$. It can be obtained by dividing the approximate marginal of $f_i$ conditioned on every site (which is $q(f_i \mid X, \mathbf{y}) = \mathcal{N}(f_i \mid \mu_i, \sigma_i^2)$) by the approximate likelihood for site $i$ (the local approximation $t_i$), and then normalizing; this gives (a code sketch of steps 1–3 follows the list):
- $q_{-i}(f_i) \triangleq \mathcal{N}(f_i \mid \mu_{-i}, \sigma_{-i}^2)$, where $\mu_{-i} = \sigma_{-i}^2 (\sigma_i^{-2} \mu_i - \tilde{\sigma}_i^{-2} \tilde{\mu}_i)$, and $\sigma_{-i}^2 = (\sigma_i^{-2} - \tilde{\sigma}_i^{-2})^{-1}$
- As Rasmussen and Williams note (RW3, linked above), we can also think of constructing $q_{-i}(f_i)$ by keeping, in $q(\mathbf{f} \mid X, \mathbf{y})$, only the prior and the approximate likelihoods that don't correspond to site $i$, and then integrating over all of $\mathbf{f}$ except $f_i$
- 2. To proceed, we find the new (un-normalized) Gaussian marginal which best approximates the product of the cavity distribution and the exact likelihood:
- $\hat{q}(f_i) \triangleq \hat{Z}_i \, \mathcal{N}(f_i \mid \hat{\mu}_i, \hat{\sigma}_i^2) \simeq q_{-i}(f_i) \, p(y_i \mid f_i)$
- It is well known that when $q(x)$ is restricted to be Gaussian, the $q(x)$ which minimizes $\mathrm{KL}(p(x) \,\|\, q(x))$ is the one whose first and second moments match those of $p(x)$; see eq. (A.24) in Rasmussen and Williams. As $\hat{q}(f_i)$ is un-normalized, we additionally impose the condition that the zeroth moments (normalizing constants) should match. I mean, we could have just decided to match the normalized marginal, moving $\hat{Z}_i$ to the RHS, right? For the probit likelihood $p(y_i \mid f_i) = \Phi(y_i f_i)$, the parameters of the new marginal distribution are:
- $\hat{Z}_i = \Phi(z_i), \quad \hat{\mu}_i = \mu_{-i} + \dfrac{y_i \sigma_{-i}^2 \, \mathcal{N}(z_i)}{\Phi(z_i)\sqrt{1 + \sigma_{-i}^2}}, \quad \hat{\sigma}_i^2 = \sigma_{-i}^2 - \dfrac{\sigma_{-i}^4 \, \mathcal{N}(z_i)}{(1 + \sigma_{-i}^2)\,\Phi(z_i)}\left(z_i + \dfrac{\mathcal{N}(z_i)}{\Phi(z_i)}\right)$, where $z_i = \dfrac{y_i \mu_{-i}}{\sqrt{1 + \sigma_{-i}^2}}$
- where $\mathcal{N}$ is the standard Gaussian density and $\Phi$ is the standard Gaussian cumulative distribution function
- 3. Finally, we update the local likelihood approximation at site $i$ to match the approximate marginal of $f_i$. The site parameters in terms of the marginal distribution parameters are:
- $\tilde{\mu}_i = \tilde{\sigma}_i^2 (\hat{\sigma}_i^{-2} \hat{\mu}_i - \sigma_{-i}^{-2} \mu_{-i}), \quad \tilde{\sigma}_i^2 = (\hat{\sigma}_i^{-2} - \sigma_{-i}^{-2})^{-1}, \quad \tilde{Z}_i = \hat{Z}_i \sqrt{2\pi} \sqrt{\sigma_{-i}^2 + \tilde{\sigma}_i^2} \, \exp\!\left(\tfrac{1}{2}(\mu_{-i} - \tilde{\mu}_i)^2 / (\sigma_{-i}^2 + \tilde{\sigma}_i^2)\right)$
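Putting steps 1–3 together, a bare-bones single-site update for the probit likelihood might look as follows (a sketch under the equations above; the function and variable names are mine, and it omits the damping and numerical safeguards a real implementation needs):

```python
import numpy as np
from scipy.stats import norm

def ep_site_update(i, y, mu, Sigma, mu_tilde, sigma2_tilde):
    """One EP update at site i for the probit likelihood p(y_i|f_i) = Phi(y_i f_i).

    mu, Sigma              -- current parameters of q(f) = N(mu, Sigma)
    mu_tilde, sigma2_tilde -- current site parameters (arrays, updated in place)
    Returns (Z_hat, mu_hat, s2_hat), the matched marginal of f_i.
    """
    mu_i, s2_i = mu[i], Sigma[i, i]

    # Step 1: cavity distribution q_{-i}(f_i) = N(mu_m, s2_m),
    # obtained by dividing the marginal N(mu_i, s2_i) by the site.
    s2_m = 1.0 / (1.0 / s2_i - 1.0 / sigma2_tilde[i])
    mu_m = s2_m * (mu_i / s2_i - mu_tilde[i] / sigma2_tilde[i])

    # Step 2: moments of q_{-i}(f_i) * Phi(y_i f_i) (the equations of step 2)
    z = y[i] * mu_m / np.sqrt(1.0 + s2_m)
    Z_hat = norm.cdf(z)
    ratio = norm.pdf(z) / Z_hat                  # N(z) / Phi(z)
    mu_hat = mu_m + y[i] * s2_m * ratio / np.sqrt(1.0 + s2_m)
    s2_hat = s2_m - s2_m ** 2 * ratio * (z + ratio) / (1.0 + s2_m)

    # Step 3: site parameters that reproduce the matched marginal
    sigma2_tilde[i] = 1.0 / (1.0 / s2_hat - 1.0 / s2_m)
    mu_tilde[i] = sigma2_tilde[i] * (mu_hat / s2_hat - mu_m / s2_m)
    return Z_hat, mu_hat, s2_hat
```

After each site update, $\boldsymbol{\mu}$ and $\Sigma$ must be refreshed before moving to the next site, either with the rank-one updates of the GPML pseudocode or, naively, by recomputing them from the sites as in the earlier sketch.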
So overall, we update the approximate posterior parameters by assuming that the marginal of $f_i$ given all the observations other than $i$ is right. Then we approximate the marginal of $f_i$ under the true likelihood $p(y_i \mid f_i)$ with a Gaussian. And finally, we find an approximation to the likelihood at $i$ that gives this (optimal) Gaussian approximation to the marginal of $f_i$. The idea is that this approximation to the likelihood approximates the true likelihood well in the regions where $f_i$ is likely, as determined by its cavity distribution $q_{-i}(f_i)$, which is in turn determined by the other sites. We keep updating, propagating the improved estimates of the marginals.
Note that we use the $-i$ subscript for the parameters of the cavity distribution $q_{-i}(f_i)$ (the marginal not conditioned on site $i$), which is Gaussian because the approximate posterior is Gaussian and the likelihood approximations are Gaussian.
Implementation
See http://www.gaussianprocess.org/gpml/chapters/RW3.pdf#page=26 for pseudocode, and https://gpy.readthedocs.io/en/deploy/_modules/GPy/likelihoods/bernoulli.html for real code.
In implementing it, one typically defines the following variables:
$\tilde{\tau}_i = \tilde{\sigma}_i^{-2}, \quad \tilde{S} = \operatorname{diag}(\tilde{\boldsymbol{\tau}}), \quad \tilde{\boldsymbol{\nu}} = \tilde{S} \tilde{\boldsymbol{\mu}}, \quad \tau_{-i} = \sigma_{-i}^{-2}, \quad \nu_{-i} = \tau_{-i} \mu_{-i}$
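In numpy these are one-liners; the arrays below are placeholder values for illustration. The point of the precision ("natural") parameterization is that multiplying and dividing Gaussians, which EP does constantly, becomes adding and subtracting parameters:

```python
import numpy as np

# Assumed per-site Gaussian parameters (placeholder values):
mu_tilde = np.array([0.5, -1.0, 2.0])
sigma2_tilde = np.array([4.0, 1.5, 3.0])

tau_tilde = 1.0 / sigma2_tilde        # site precisions
S_tilde = np.diag(tau_tilde)
nu_tilde = tau_tilde * mu_tilde       # elementwise form of S_tilde @ mu_tilde
# Analogously for the cavity: tau_m = 1/s2_m, nu_m = tau_m * mu_m.
```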
Also note that, as used above, the ratio $\mathcal{N}(z_i)/\Phi(z_i)$ equals the derivative of $\log \Phi(z_i)$ w.r.t. its argument.
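Since $\Phi(z_i)$ underflows for very negative $z_i$, one common trick (a sketch of mine, not taken from the linked code) is to evaluate the ratio in log space:

```python
import numpy as np
from scipy.stats import norm

def pdf_cdf_ratio(z):
    """N(z)/Phi(z), i.e. d/dz log Phi(z), computed in log space so it stays
    finite even where Phi(z) itself would underflow to zero."""
    return np.exp(norm.logpdf(z) - norm.logcdf(z))

print(pdf_cdf_ratio(np.array([-40.0, 0.0, 3.0])))
```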