Bayesian evidence


Marginal likelihood

aka Bayesian evidence

https://www.wikiwand.com/en/Marginal_likelihood

It's the denominator in Bayes' theorem:

$P(b) = \sum_{a\in A} P(b|a) P(a)$, where $P(a)$ is the Prior and $P(b|a)$ is the likelihood.

When $a$ belongs to a finite set $A$, the sum is in general straightforward to compute. If $a$ belongs to a continuous space (like some manifold, or a subset of $\mathbb{R}$), then the sum becomes an integral, and it may not have a simple analytical form.
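To make the finite case concrete, here's a minimal sketch in Python (the hypotheses, prior, and likelihood values are all made up for illustration):

```python
import numpy as np

# Hypothetical discrete model: two hypotheses a in {0, 1},
# a prior P(a) over them, and a likelihood P(b|a) for observed data b.
prior = np.array([0.7, 0.3])        # P(a)
likelihood = np.array([0.2, 0.9])   # P(b|a)

# Marginal likelihood (evidence): P(b) = sum_a P(b|a) P(a)
evidence = np.sum(likelihood * prior)
print(evidence)  # 0.2*0.7 + 0.9*0.3 = 0.41
```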

A simple Monte Carlo approximation of the integral may work, but it is only feasible if $A$ is low-dimensional, because in high dimensions there are many distinct parts of the space $A$ where the integrand takes different values and can't be reconstructed accurately from its neighbouring values (Curse of dimensionality).

So suppose $P(b) = \int_{A} P(b|a) P(a)\, da$, where $da$ is the volume element of $A$.

If we sampled $N$ values of $a$ uniformly from $A$ and computed $\frac{1}{N} \sum_a P(b|a) P(a)$, this would approximate $\frac{1}{V(A)} \int_{A} P(b|a) P(a)\, da$, so we'd multiply by the volume $V(A)$ to recover $P(b)$. But doing this has the curse-of-dimensionality problem we described.
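As a sketch of this uniform-sampling estimator, here's a 1D toy model (Gaussian prior and likelihood, my own choice, so the exact answer is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D model (assumed for illustration): prior a ~ N(0,1),
# likelihood b|a ~ N(a,1), observed b = 2.0, and A = [-10, 10].
b_obs = 2.0
def prior_pdf(a):   return np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)
def likelihood(a):  return np.exp(-(b_obs - a)**2 / 2) / np.sqrt(2 * np.pi)

# Uniform samples over A; multiply the average by V(A) = 20 to get P(b)
N = 100_000
a = rng.uniform(-10, 10, size=N)
estimate = 20 * np.mean(likelihood(a) * prior_pdf(a))

# Exact evidence for this model: b ~ N(0, 2), so P(b) = N(2; 0, 2) ≈ 0.1038
exact = np.exp(-b_obs**2 / 4) / np.sqrt(4 * np.pi)
print(estimate, exact)
```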

But we are wasting a lot of samples in regions where either $P(a)$ or $P(b|a)$ is small. A more clever way is to sample $a \sim P(a)$, and then compute the empirical average

$\frac{1}{N} \sum_a P(b|a)$

which will approximate its expected value under $P(a)$, which is $P(b)$.
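A minimal sketch of this prior-sampling estimator on the same hypothetical Gaussian toy model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Same toy model as above: prior a ~ N(0,1), likelihood b|a ~ N(a,1), b = 2.0
b_obs = 2.0
def likelihood(a):
    return np.exp(-(b_obs - a)**2 / 2) / np.sqrt(2 * np.pi)

# Draw a ~ P(a) and average the likelihood: (1/N) sum_i P(b|a_i) ≈ P(b)
N = 100_000
a = rng.normal(0.0, 1.0, size=N)
print(np.mean(likelihood(a)))  # exact value ≈ 0.1038
```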

We are now not sampling from regions with low $P(a)$, but we are still sampling from regions with low likelihood $P(b|a)$. But what if we instead sampled from the posterior $P(a|b) = P(b|a)P(a)/P(b)$ (Bayes' theorem), so that we don't sample from regions with low $P(b|a)$ either?!

Well, we can't compute the posterior, because we don't have $P(b)$; that's the very thing we wanted to compute in the first place. Chicken and egg, right?

Except, guess what: you don't need to know $P(b)$ to sample from the posterior; you can use Markov chain Monte Carlo (MCMC), which only needs the posterior up to a normalizing constant, i.e. $P(b|a)P(a)$.
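Here's a minimal random-walk Metropolis sketch (same toy model; the step size and chain length are arbitrary choices) showing that the acceptance ratio only involves $P(b|a)P(a)$, so the unknown $P(b)$ cancels:

```python
import numpy as np

rng = np.random.default_rng(2)
b_obs = 2.0

def unnorm_posterior(a):
    # P(b|a) * P(a): proportional to the posterior; P(b) is never needed
    return np.exp(-a**2 / 2) * np.exp(-(b_obs - a)**2 / 2)

# Random-walk Metropolis: accept a proposal with probability min(1, ratio),
# where ratio = unnorm_posterior(proposal) / unnorm_posterior(current)
samples, a = [], 0.0
for _ in range(50_000):
    proposal = a + rng.normal(0.0, 1.0)
    if rng.uniform() < unnorm_posterior(proposal) / unnorm_posterior(a):
        a = proposal
    samples.append(a)
samples = np.array(samples[5_000:])          # discard burn-in
print(samples.mean())  # posterior mean for this toy model is b_obs/2 = 1.0
```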

How can we express the marginal likelihood $P(b)$ using samples $a$ from the posterior? Here is one way:

$\sum_{a\in A} \frac{1}{P(b|a)} \frac{P(b|a)P(a)}{P(b)} = \sum_a P(a) \frac{1}{P(b)} = \frac{1}{P(b)}$,

i.e. the expectation of $\frac{1}{P(b|a)}$ under the posterior is $\frac{1}{P(b)}$, so we can estimate $\frac{1}{P(b)}$ by the empirical average of $\frac{1}{P(b|a)}$ over posterior samples.

The resulting estimate is known as the harmonic mean estimate:

$\hat{P}(b) = \frac{1}{\frac{1}{N}\sum_a \frac{1}{P(b|a)}}$

This is literally THE WORST Monte Carlo estimate ever. See Radford Neal's post: https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/. The intuition can be seen as follows: the terms that contribute significantly to the sum come from regions in $A$ with large prior probability $P(a)$ but low likelihood, while the samples come from regions with large posterior probability $P(a|b)$, and so we may need to wait a humongous time to get samples from places with high prior but very low posterior probability. The estimate's variance can be enormous (even infinite).
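For concreteness, here's the harmonic mean estimate on the toy Gaussian model (where the posterior happens to be known exactly, so we can sample it directly); treat this sketch as a cautionary demo, not a recommendation, since even here the output is much noisier than the earlier estimators:

```python
import numpy as np

rng = np.random.default_rng(3)
b_obs = 2.0
def likelihood(a):
    return np.exp(-(b_obs - a)**2 / 2) / np.sqrt(2 * np.pi)

# For the toy model the posterior is exactly N(b/2, 1/2); sample it directly
N = 100_000
a = rng.normal(b_obs / 2, np.sqrt(0.5), size=N)

# Harmonic mean estimate: P(b) ≈ 1 / ((1/N) sum_i 1/P(b|a_i))
evidence_hat = 1.0 / np.mean(1.0 / likelihood(a))
print(evidence_hat)  # exact value ≈ 0.1038; the estimate has huge variance
```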

What alternatives are there? See https://stats.stackexchange.com/questions/209810/computation-of-the-marginal-likelihood-from-mcmc-samples

Instead consider $\int_{A} \frac{1}{P(b|a)P(a)} \frac{P(b|a)P(a)}{P(b)}\, da = \frac{V(A)}{P(b)}$. But actually we don't need to integrate over the whole of $A$: the identity holds for any region $A' \subseteq A$ (with $V(A')$ in place of $V(A)$). Therefore we can choose the region cleverly. We want the parts of $A$ that contribute to the sum to be regions that are sampled significantly. So let's choose $A'$ to be a region of high posterior probability. This is called a high probability density (HPD) region of the posterior!
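A minimal sketch of this restricted estimator on the toy model, with $A'$ chosen (arbitrarily, for illustration) as a central posterior interval: average $\frac{1}{P(b|a)P(a)}$ over the posterior samples that fall in $A'$, then divide $V(A')$ by that average.

```python
import numpy as np

rng = np.random.default_rng(4)
b_obs = 2.0
def prior_pdf(a):   return np.exp(-a**2 / 2) / np.sqrt(2 * np.pi)
def likelihood(a):  return np.exp(-(b_obs - a)**2 / 2) / np.sqrt(2 * np.pi)

# Posterior samples (exact for the toy model: N(b/2, 1/2))
N = 100_000
a = rng.normal(b_obs / 2, np.sqrt(0.5), size=N)

# Hypothetical high-posterior region A' = [b/2 - 1, b/2 + 1], volume V(A') = 2
lo, hi = b_obs / 2 - 1.0, b_obs / 2 + 1.0
inside = (a >= lo) & (a <= hi)

# (1/N) * sum over samples in A' of 1/(P(b|a) P(a))  ≈  V(A') / P(b)
mean_term = np.mean(inside / (likelihood(a) * prior_pdf(a)))
print((hi - lo) / mean_term)  # ≈ exact evidence ≈ 0.1038
```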