aka Bayesian evidence
https://www.wikiwand.com/en/Marginal_likelihood
It's the denominator $Z = p(\mathcal{D})$ in Bayes' theorem
$$p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \sum_{\theta} p(\mathcal{D} \mid \theta)\, p(\theta),$$
where $p(\theta)$ is the prior and $p(\mathcal{D} \mid \theta)$ is the likelihood.
When $\theta$ belongs to a finite set, the sum is in general straightforward to compute. If $\theta$ belongs to a continuous space (like some manifold, or a subset of $\mathbb{R}^n$), then the sum becomes an integral, and it may not have a simple analytical form.
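To make the finite case concrete, here is a minimal sketch; the biased-coin model, the three candidate biases, and the uniform prior are illustrative assumptions, not taken from any of the references here.

```python
import numpy as np
from scipy.stats import binom

# Toy model: a coin with unknown bias theta restricted to a finite set.
thetas = np.array([0.3, 0.5, 0.7])      # candidate biases (illustrative)
prior = np.array([1/3, 1/3, 1/3])       # uniform prior over them
heads, tosses = 7, 10                   # observed data

# Likelihood p(D | theta) for each candidate value.
likelihood = binom.pmf(heads, tosses, thetas)

# Marginal likelihood: a plain finite sum over the parameter set.
Z = np.sum(likelihood * prior)
posterior = likelihood * prior / Z      # Bayes' theorem, now normalized
print(Z, posterior)
```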
A simple Monte Carlo approximation of the integral may work, but it is only feasible if $\theta$ is low dimensional, because in high dimensions there are many distinct parts of the space where the integrand differs and can't be reconstructed accurately from its neighbouring values (curse of dimensionality).
So suppose $Z = p(\mathcal{D}) = \int_{A} p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta$, where $d\theta$ is the volume element of the parameter space $A$. If we sampled $\theta_1, \dots, \theta_N$ uniformly from $A$ and computed $\frac{\mathrm{Vol}(A)}{N} \sum_{i=1}^{N} p(\mathcal{D} \mid \theta_i)\, p(\theta_i)$, then this would approximate $Z$. But doing this has the curse-of-dimensionality problem we just described.
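A sketch of that naive uniform-sampling estimator on a deliberately easy 1-D toy model; the Gaussian prior/likelihood, the data values, and the integration box $A = [-10, 10]$ are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Toy model: prior theta ~ N(0, 1), likelihood y_i | theta ~ N(theta, 1).
y = np.array([0.8, 1.2, 0.5, 1.0])

def log_prior(theta):
    return norm.logpdf(theta, 0.0, 1.0)

def log_lik(theta):
    return norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)

# Uniform samples over a box A = [-10, 10] assumed to cover the mass.
a, b = -10.0, 10.0
theta = rng.uniform(a, b, size=100_000)

# Z ~ Vol(A) * average of likelihood * prior over uniform draws from A.
Z_hat = (b - a) * np.exp(log_lik(theta) + log_prior(theta)).mean()

# Exact answer for this conjugate model: y ~ N(0, I + 1 1^T).
n = len(y)
Z_exact = multivariate_normal.pdf(y, mean=np.zeros(n),
                                  cov=np.eye(n) + np.ones((n, n)))
print(Z_hat, Z_exact)
```

In one dimension this lands close to the exact value; the same estimator over a box in, say, 50 dimensions would need astronomically many samples, which is the curse of dimensionality referred to above.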
We are also wasting a lot of samples in regions where either the prior $p(\theta)$ or the likelihood $p(\mathcal{D} \mid \theta)$ is small. A cleverer way is to sample $\theta_i \sim p(\theta)$ and then compute the empirical average
$$\hat{Z} = \frac{1}{N} \sum_{i=1}^{N} p(\mathcal{D} \mid \theta_i),$$
which will approximate its expected value under the prior, which is $\mathbb{E}_{p(\theta)}[p(\mathcal{D} \mid \theta)] = Z$.
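A sketch of the prior-sampling version, reusing the same illustrative Gaussian toy model as above:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.array([0.8, 1.2, 0.5, 1.0])   # same toy data: y_i | theta ~ N(theta, 1)

def log_lik(theta):
    return norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)

# Sample theta_i from the prior N(0, 1) and average the likelihood:
# Z = E_{p(theta)}[ p(D | theta) ].
theta = rng.normal(0.0, 1.0, size=100_000)
Z_hat = np.exp(log_lik(theta)).mean()
print(Z_hat)
```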
We are now not sampling from regions with low prior probability, but we are still sampling from regions with low likelihood. What if we instead sampled from the posterior $p(\theta \mid \mathcal{D})$ (Bayes' theorem), so that we don't sample from regions with low likelihood either!?
Well, duh: we can't compute the posterior, because we don't have $Z$; that's the very thing we wanted to compute in the first place. Chicken and egg, much?
But guess what: you don't need to know $Z$ to sample from the posterior; you can use Markov chain Monte Carlo (MCMC), which only needs the unnormalized posterior $p(\mathcal{D} \mid \theta)\, p(\theta)$.
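To make the "no $Z$ needed" point concrete, here is a minimal random-walk Metropolis sketch; it only ever evaluates the unnormalized posterior $p(\mathcal{D} \mid \theta)\, p(\theta)$ (the toy model, step size, and burn-in length are illustrative assumptions).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.array([0.8, 1.2, 0.5, 1.0])   # toy data, y_i | theta ~ N(theta, 1)

def log_unnorm_post(theta):
    # log[ p(D | theta) p(theta) ]; the normalizer Z never appears.
    return norm.logpdf(y, theta, 1.0).sum() + norm.logpdf(theta, 0.0, 1.0)

def metropolis(n_samples, step=0.5, theta0=0.0):
    theta, logp = theta0, log_unnorm_post(theta0)
    out = np.empty(n_samples)
    for i in range(n_samples):
        prop = theta + step * rng.normal()
        logp_prop = log_unnorm_post(prop)
        # Accept with probability min(1, ratio); Z cancels in the ratio.
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = prop, logp_prop
        out[i] = theta
    return out

samples = metropolis(20_000)[5_000:]   # drop burn-in
print(samples.mean(), samples.std())   # posterior mean/sd of theta
```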
How can we express the marginal likelihood using samples from the posterior? Here is one way:
$$\mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\frac{1}{p(\mathcal{D} \mid \theta)}\right] = \int \frac{p(\theta \mid \mathcal{D})}{p(\mathcal{D} \mid \theta)}\, d\theta = \int \frac{p(\theta)}{Z}\, d\theta = \frac{1}{Z}.$$
The resulting estimate is known as the harmonic mean estimate:
$$\hat{Z} = \left( \frac{1}{N} \sum_{i=1}^{N} \frac{1}{p(\mathcal{D} \mid \theta_i)} \right)^{-1}, \qquad \theta_i \sim p(\theta \mid \mathcal{D}).$$
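A sketch of the harmonic mean estimate on the same illustrative toy model; since the toy posterior is conjugate, exact posterior samples stand in for MCMC output, and log-sum-exp is used for numerical stability.

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
y = np.array([0.8, 1.2, 0.5, 1.0])

def log_lik(theta):
    return norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)

# Exact posterior samples for the conjugate toy (stand-in for MCMC output):
# posterior is N(m, s2) with s2 = 1/(1 + n), m = s2 * sum(y).
s2 = 1.0 / (1.0 + len(y))
m = s2 * y.sum()
post = rng.normal(m, np.sqrt(s2), size=50_000)

# log Z ~ -log( (1/N) sum_i 1 / p(D | theta_i) ), theta_i ~ posterior.
ll = log_lik(post)
log_Z_hat = -(logsumexp(-ll) - np.log(len(post)))
print(log_Z_hat)
```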
This is literally THE WORST Monte Carlo estimate ever. See this: https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/. The intuition is as follows: the terms that contribute significantly to the sum come from regions of $A$ with large prior probability $p(\theta)$, but the samples come from regions with large posterior probability $p(\theta \mid \mathcal{D})$, so we may need to wait a humongously long time to get samples from places with high prior but very low posterior probability.
What alternatives are there? See https://stats.stackexchange.com/questions/209810/computation-of-the-marginal-likelihood-from-mcmc-samples
Instead consider the generalized identity
$$\frac{1}{Z} = \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\!\left[\frac{\varphi(\theta)}{p(\mathcal{D} \mid \theta)\, p(\theta)}\right],$$
which holds for any probability density $\varphi$ supported where the posterior is positive (the plain harmonic mean is the special case $\varphi = p(\theta)$). In particular we don't need to involve the whole of $A$; the identity holds with $\varphi$ supported on any region $B \subseteq A$, so we can choose the region cleverly. We want the parts of $A$ that contribute to the sum to be sampled well, so let's choose $B$ to be a region that is all of high posterior probability and take $\varphi$ uniform on $B$. This is called a high probability density (HPD) region of the posterior!
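A minimal 1-D sketch of that truncated estimator, assuming the same illustrative toy model; the HPD region is approximated by the central 50% interval of the posterior samples (fine here because the toy posterior is unimodal; in higher dimensions one would use e.g. an ellipsoid whose volume is still computable).

```python
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(0)
y = np.array([0.8, 1.2, 0.5, 1.0])

def log_unnorm_post(theta):
    # log[ p(D | theta) p(theta) ]
    return (norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)
            + norm.logpdf(theta, 0.0, 1.0))

# Exact posterior samples for the conjugate toy (stand-in for MCMC output).
s2 = 1.0 / (1.0 + len(y))
m = s2 * y.sum()
post = rng.normal(m, np.sqrt(s2), size=50_000)

# Approximate an HPD region B by the central 50% interval of the samples.
lo, hi = np.quantile(post, [0.25, 0.75])
vol_B = hi - lo
inside = (post >= lo) & (post <= hi)

# 1/Z ~ (1/N) sum_{theta_i in B} 1 / ( Vol(B) p(D|theta_i) p(theta_i) ).
log_terms = -np.log(vol_B) - log_unnorm_post(post[inside])
log_inv_Z = logsumexp(log_terms) - np.log(len(post))
print(-log_inv_Z)   # estimate of log Z
```

The central-interval choice is only a convenient stand-in for an HPD region; the point is that every term retained in the sum comes from a region the posterior samples cover well.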
Another alternative: https://arxiv.org/pdf/1203.3506.pdf; Eq. (7) there gives the estimator.
The idea is that we find an objective functional which, when maximized, ensures that the optimum is normalized and follows the same distribution as some other distribution from which we have samples.
To apply this to the marginal likelihood, the distribution from which we sample should be the posterior, which we can do with MCMC. p_0 should be the unnormalized posterior, and p_n could be anything (the paper discusses good choices, I guess). And in the case of the marginal likelihood, the "x" are the parameters, and so really the only parameter over which we are optimizing is the marginal likelihood estimate itself!
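I haven't reproduced Eq. (7) here, but below is a minimal sketch of the general noise-contrastive idea described above, under these assumptions: the "data" samples are posterior samples (exact ones for the conjugate toy, standing in for MCMC), p_n is the prior, and the single free parameter is c = log Z, fitted by maximizing the logistic "posterior sample vs. noise sample" objective. This is not necessarily the exact estimator of the paper.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
y = np.array([0.8, 1.2, 0.5, 1.0])

def log_p0(theta):                   # p_0: unnormalized posterior
    return (norm.logpdf(y[:, None], theta, 1.0).sum(axis=0)
            + norm.logpdf(theta, 0.0, 1.0))

def log_pn(theta):                   # p_n: here simply the prior N(0, 1)
    return norm.logpdf(theta, 0.0, 1.0)

# "Data" samples from the posterior (exact for the conjugate toy, standing in
# for MCMC output) and an equal number of noise samples from p_n.
s2 = 1.0 / (1.0 + len(y))
m = s2 * y.sum()
post = rng.normal(m, np.sqrt(s2), size=20_000)
noise = rng.normal(0.0, 1.0, size=20_000)

def neg_objective(c):
    # Logistic "posterior sample vs. noise sample" objective, where the model
    # density is p_0(theta) / exp(c); its maximizer estimates c = log Z.
    g_post = log_p0(post) - c - log_pn(post)
    g_noise = log_p0(noise) - c - log_pn(noise)
    return -(np.mean(-np.logaddexp(0.0, -g_post))     # mean log sigmoid(g)
             + np.mean(-np.logaddexp(0.0, g_noise)))  # mean log sigmoid(-g)

log_Z_hat = minimize_scalar(neg_objective).x
print(log_Z_hat)
```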
For the case of Gaussian processes, we should consider the posterior over real-valued functions, for which we can compute the numerator of Bayes' theorem easily and from which we can sample using MCMC, but for which we can't compute the marginal likelihood easily for non-Gaussian likelihoods.
This approach, of course, only makes sense when the marginal likelihood is an integral over a large space; otherwise simple integration / summation would suffice. This is why thinking of the posterior over discrete-valued functions in the case of GP classification doesn't make much sense: in that case the complication is in computing the numerator of Bayes' theorem itself (as the marginalization is over a single discrete function, the one that fits the data), so it's not a good framing for applying this technique, which assumes the numerator is easy and the integration is the hard part.
Conditions on p_n
In Theorem 1 the condition on p_n is stated: the support of p_n must include the support of the "data distribution" (the posterior distribution in the case of the marginal likelihood). So for instance using the prior would work, or the posterior itself (but that can't be used, because we need to evaluate it in the expression of the objective function itself).