Draft of 'Marginal likelihood'

cosmos 26th June 2019 at 12:01am
Bayesian inference

aka Bayesian evidence

https://www.wikiwand.com/en/Marginal_likelihood

It's the denominator in Bayes' theorem

$P(b) = \sum_{a\in A} P(b|a) P(a)$, where $P(a)$ is the Prior and $P(b|a)$ is the likelihood.

When $A$ is a finite set, the sum is in general straightforward to compute. If $a$ belongs to a continuous space (like some manifold, or a subset of $\mathbb{R}$), then the sum becomes an integral, and it may not have a simple analytical form.
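As a tiny concrete instance of the finite case (the numbers below are made up purely for illustration), the marginal likelihood is just a weighted sum:

```python
# Made-up discrete example: three hypotheses a1, a2, a3.
prior = {"a1": 0.5, "a2": 0.3, "a3": 0.2}          # P(a)
likelihood = {"a1": 0.10, "a2": 0.40, "a3": 0.25}  # P(b|a) for the observed b

# P(b) = sum_a P(b|a) P(a)
marginal = sum(likelihood[a] * prior[a] for a in prior)
print(marginal)  # 0.22
```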

A simple Monte Carlo approximation of the integral may work, but is only feasible if $A$ is low dimensional, because in high dimensions there are many distinct parts of the space $A$ where the integrand differs and can't be reconstructed accurately from its neighbouring values (Curse of dimensionality).

So suppose $P(b) = \int_{A} P(b|a) P(a) \, da$, where $da$ is the volume element of $A$.

Then if we sampled $N$ values of $a$ uniformly from $A$ and computed $\frac{1}{N} \sum_a P(b|a) P(a)$, this would approximate $\frac{1}{V(A)} \int_{A} P(b|a) P(a) \, da$, so multiplying by the volume $V(A)$ gives an estimate of $P(b)$. But doing this suffers from the curse of dimensionality we just described.
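A minimal sketch of this uniform-sampling estimate, on a 1-D toy model that I'm assuming just for illustration (prior $a \sim N(0,1)$, likelihood $b|a \sim N(a,1)$, observed $b=1$, so the exact answer is $N(b; 0, \sqrt{2})$):

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model (not from the note): a ~ N(0,1), b|a ~ N(a,1), b = 1.0.
b = 1.0
prior_pdf = lambda a: norm.pdf(a, 0.0, 1.0)
lik = lambda a: norm.pdf(b, a, 1.0)

# Uniform Monte Carlo over a box A = [-10, 10] that contains essentially all
# of the prior mass. The empirical mean approximates (1/V(A)) * integral, so
# we multiply by the volume V(A) = hi - lo to recover P(b).
rng = np.random.default_rng(0)
lo, hi, N = -10.0, 10.0, 100_000
a = rng.uniform(lo, hi, N)
P_b_hat = (hi - lo) * np.mean(lik(a) * prior_pdf(a))

print(P_b_hat, norm.pdf(b, 0.0, np.sqrt(2.0)))  # estimate vs exact
```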

But we are wasting a lot of samples in regions where either $P(a)$ or $P(b|a)$ is small. A more clever way is to sample $a \sim P(a)$, and then compute the empirical average

$\frac{1}{N} \sum_a P(b|a)$

which will approximate its expected value under $P(a)$, which is $P(b)$.
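On the same assumed toy model as above, this prior-sampling estimate needs no volume factor at all:

```python
import numpy as np
from scipy.stats import norm

# Same assumed toy model as above: a ~ N(0,1), b|a ~ N(a,1), b = 1.0.
b = 1.0
rng = np.random.default_rng(0)
N = 100_000

a = rng.normal(0.0, 1.0, N)             # sample a ~ P(a)
P_b_hat = np.mean(norm.pdf(b, a, 1.0))  # (1/N) sum_a P(b|a)

print(P_b_hat, norm.pdf(b, 0.0, np.sqrt(2.0)))  # estimate vs exact
```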

We are now not sampling from regions with low $P(a)$, but we are still sampling from regions where the likelihood $P(b|a)$ is low. What if we instead sampled from the posterior $P(a|b) = P(b|a)P(a)/P(b)$ (Bayes' theorem), so that we don't sample from regions with low $P(b|a)$ either?

The obvious objection: we can't compute the posterior, because we don't have $P(b)$, which is the very thing we wanted to compute in the first place. A chicken-and-egg problem, it seems.

Except that we don't actually need to know $P(b)$ to sample from the posterior: we can use Markov chain Monte Carlo (MCMC), which only requires the unnormalized posterior $P(b|a)P(a)$.
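For instance, a bare-bones random-walk Metropolis sampler (sketched below on the same assumed toy model as in the earlier snippets) only ever evaluates the unnormalized posterior:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: a ~ N(0,1), b|a ~ N(a,1), b = 1.0.
# Only the *unnormalized* posterior P(b|a) P(a) is ever evaluated,
# so P(b) is never needed to generate posterior samples.
b = 1.0
log_unnorm_post = lambda a: norm.logpdf(b, a, 1.0) + norm.logpdf(a, 0.0, 1.0)

rng = np.random.default_rng(0)
a, samples = 0.0, []
for _ in range(20_000):
    prop = a + rng.normal(0.0, 1.0)            # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_unnorm_post(prop) - log_unnorm_post(a):
        a = prop                               # accept
    samples.append(a)
samples = np.array(samples[2_000:])            # drop burn-in

# For this conjugate toy model the exact posterior is N(b/2, 1/2).
print(samples.mean(), samples.var())           # should be close to 0.5 and 0.5
```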

How can we express the marginal likelihood $P(b)$ using samples $a$ from the posterior? Here is one way:

$\sum_{a\in A} \frac{1}{P(b|a)} \frac{P(b|a)P(a)}{P(b)} = \frac{1}{P(b)} \sum_{a} P(a) = \frac{1}{P(b)}$, i.e. the expectation of $\frac{1}{P(b|a)}$ under the posterior $P(a|b)$ equals $\frac{1}{P(b)}$.

The resulting estimate is known as the harmonic mean estimate:

$\hat{P}(b) = \frac{1}{\frac{1}{N}\sum_a \frac{1}{P(b|a)}}$
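Purely to illustrate the formula (same assumed toy model, with exact posterior samples standing in for MCMC output), a sketch of the harmonic mean estimate; as discussed next, it is not a reliable estimator:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: a ~ N(0,1), b|a ~ N(a,1), b = 1.0.
b = 1.0
rng = np.random.default_rng(0)
a_post = rng.normal(b / 2, np.sqrt(0.5), 100_000)  # exact posterior samples,
                                                   # a stand-in for MCMC output

lik = norm.pdf(b, a_post, 1.0)                     # P(b|a) at posterior samples
P_b_hat = 1.0 / np.mean(1.0 / lik)                 # harmonic mean of likelihoods

print(P_b_hat, norm.pdf(b, 0.0, np.sqrt(2.0)))     # estimate vs exact
```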

This is literally THE WORST Monte Carlo estimate ever. See https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/. The intuition is as follows: the terms that contribute significantly to the sum come from regions in $A$ with large prior probability $P(a)$, but the samples come from regions with large posterior probability $P(a|b)$, and so we may need to wait a humongous time to get samples from places with high prior but very low posterior probability.

What alternatives are there? See https://stats.stackexchange.com/questions/209810/computation-of-the-marginal-likelihood-from-mcmc-samples

Instead consider $\int_{A} \frac{1}{P(b|a)P(a)} \frac{P(b|a)P(a)}{P(b)} \, da = \frac{V(A)}{P(b)}$. But actually we don't need to integrate over the whole of $A$: the identity holds for any region $A' \subseteq A$, giving $\frac{V(A')}{P(b)}$. Therefore we can choose the region cleverly. We want the parts of $A'$ that contribute to the integral to be sampled well, so let's choose $A'$ to be a region that is entirely of high posterior probability. This is called a high probability density (HPD) region of the posterior!
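A sketch of this restricted-region estimator on the same assumed 1-D toy model: averaging $1/(P(b|a)P(a))$ over posterior samples that fall in $A'$ approximates $V(A')/P(b)$, so dividing the region's volume by that average estimates $P(b)$. Here $A'$ is a central posterior interval, a crude stand-in for an HPD region:

```python
import numpy as np
from scipy.stats import norm

# Assumed toy model: a ~ N(0,1), b|a ~ N(a,1), b = 1.0.
b = 1.0
rng = np.random.default_rng(0)
a_post = rng.normal(b / 2, np.sqrt(0.5), 100_000)  # stand-in for MCMC samples

lo, hi = np.quantile(a_post, [0.125, 0.875])       # central 75% interval as A'
inside = (a_post >= lo) & (a_post <= hi)           # indicator of A'
unnorm = norm.pdf(b, a_post, 1.0) * norm.pdf(a_post, 0.0, 1.0)  # P(b|a) P(a)

# (1/N) sum_i 1[a_i in A'] / (P(b|a_i) P(a_i))  ~  V(A') / P(b)
P_b_hat = (hi - lo) / np.mean(inside / unnorm)
print(P_b_hat, norm.pdf(b, 0.0, np.sqrt(2.0)))     # estimate vs exact
```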

Variational approach

https://arxiv.org/pdf/1203.3506.pdf. Eq. (7) of the paper gives the estimator.

The idea is to find an objective functional which, when maximized, ensures that the optimal function is normalized and follows the same distribution as some other distribution from which we have samples.

To apply this to the marginal likelihood, the distribution we sample from should be the posterior, which we can do with MCMC. $p_0$ should be the unnormalized posterior, and $p_n$ could be anything (the paper discusses good choices, I guess). In the case of the marginal likelihood, the "x" are the parameters, and so really the only parameter over which we are optimizing is the marginal likelihood estimator itself!
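I haven't checked the exact form of Eq. (7), but the general idea as described above can be sketched as a noise-contrastive-style fit in which the single free parameter is $c = \log P(b)$: classify posterior samples against samples from $p_n$ (here the prior) using the model density $P(b|a)P(a)/e^{c}$, and read off $P(b) = e^{\hat c}$ at the optimum. The toy model and all names below are my own assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

# Sketch only, NOT the paper's Eq. (7) verbatim. Assumed toy model:
# a ~ N(0,1), b|a ~ N(a,1), b = 1.0; p_n is taken to be the prior.
b = 1.0
rng = np.random.default_rng(0)
N = 100_000

a_post = rng.normal(b / 2, np.sqrt(0.5), N)   # stand-in for MCMC posterior samples
a_noise = rng.normal(0.0, 1.0, N)             # samples from p_n = prior

log_unnorm = lambda a: norm.logpdf(b, a, 1.0) + norm.logpdf(a, 0.0, 1.0)
log_pn = lambda a: norm.logpdf(a, 0.0, 1.0)

def neg_objective(c):
    # G(a) = log[ unnorm(a) / exp(c) ] - log p_n(a); the classifier is sigma(G).
    g_post = log_unnorm(a_post) - c - log_pn(a_post)
    g_noise = log_unnorm(a_noise) - c - log_pn(a_noise)
    # log sigma(g) = -log(1 + exp(-g)); log(1 - sigma(g)) = -log(1 + exp(g))
    return -(np.mean(-np.logaddexp(0.0, -g_post)) +
             np.mean(-np.logaddexp(0.0, g_noise)))

c_hat = minimize_scalar(neg_objective).x              # optimal c ~ log P(b)
print(np.exp(c_hat), norm.pdf(b, 0.0, np.sqrt(2.0)))  # estimate vs exact
```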

For the case of Gaussian processes, we should consider the posterior over real-valued functions, for which we can compute the numerator of Bayes' theorem easily and from which we can sample using MCMC, but for which we can't compute the marginal likelihood easily when the likelihood is non-Gaussian.

This approach, of course, only makes sense when the marginal likelihood is an integral over a large space; otherwise simple integration/summation would suffice. This is why thinking of the posterior over discrete-valued functions in the case of GP classification doesn't make much sense: there the complication is in computing the numerator of Bayes' theorem itself (as the marginalization is over a single discrete function, the one that fits the data), so it's not a good framing for applying this technique, which assumes the numerator is easy and the integration is the hard part.

Conditions on $p_n$

Theorem 1 states the condition on $p_n$: the support of $p_n$ must include the support of the "data distribution" (the posterior distribution, in the case of the marginal likelihood). So, for instance, using the prior would work, or the posterior itself (but that can't be used in practice, because we need to evaluate it in the expression of the objective function itself).

Linear regression approach