http://approximateinference.org/
Introduction
Method
- Define variables
- Define the Probabilistic model that we are going to consider.
- We first choose a Prior distribution over the set of hypotheses, for instance one favouring simple hypotheses (see regularization below); the model defines the parametrized family of Likelihood functions.
- We then calculate the posterior distribution using Bayes' theorem: P(θ∣X) = P(X∣θ)P(θ)/P(X), where the normalization is P(X) = ∑θ P(X∣θ)P(θ).
- And we can then make a new prediction by weighting over all hypotheses to calculate the expected value of the output for a new input. I think one can show (see the Elements of Statistical Learning book) that if we knew the real distribution of the output given the input, the conditional expectation is the prediction that minimizes the Generalization error under squared-error loss.
The last two steps are often computationally very difficult, so what's commonly done instead is to maximize the posterior distribution (MAP principle, above).
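A minimal end-to-end sketch of this workflow, assuming a Bernoulli (coin-flip) model with the bias θ as the single hypothesis and a discretized hypothesis grid; all names and numbers are illustrative:

```python
import numpy as np

theta = np.linspace(0.001, 0.999, 999)   # discretized hypothesis space
prior = np.ones_like(theta)              # flat prior (could instead favour simple hypotheses)
prior /= prior.sum()

data = np.array([1, 0, 1, 1, 0, 1, 1, 1])  # observed coin flips

# Likelihood of the data under each hypothesis theta
likelihood = theta ** data.sum() * (1 - theta) ** (len(data) - data.sum())

# Bayes' theorem: posterior ∝ likelihood × prior, normalized over the grid
posterior = likelihood * prior
posterior /= posterior.sum()

# Prediction by weighting over all hypotheses: E[P(x_new = 1 | theta)] under the posterior
p_next_heads = np.sum(theta * posterior)

# The MAP shortcut: just take the argmax of the posterior instead
theta_map = theta[np.argmax(posterior)]
print(p_next_heads, theta_map)
```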
Posterior summaries
- Point summaries.
- Interval summaries. Prefer estimates incorporating uncertainty over point estimates.
Depending on the Loss function, different choices may be optimal, as studied by Decision theory (the posterior mean minimizes expected squared-error loss, the posterior median expected absolute-error loss). However, generally prefer the posterior mean or median over the MAP.
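A short sketch of these summaries computed from posterior samples (the Beta draws below are a stand-in for MCMC output; purely illustrative):

```python
import numpy as np

samples = np.random.default_rng(0).beta(7, 3, size=10_000)  # stand-in posterior draws

post_mean = samples.mean()                   # optimal under squared-error loss
post_median = np.median(samples)             # optimal under absolute-error loss
ci_95 = np.percentile(samples, [2.5, 97.5])  # 95% equal-tailed credible interval
print(post_mean, post_median, ci_95)
```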
Ways of dealing with the problem of integrating over the prior to find the normalization constant
- Conjugate priors are particular choices of prior distribution which give posterior distributions in the same family, available in closed form (e.g. a Beta prior with a Binomial likelihood; see the sketch after this list).
- Discretize Bayes' rule.
- Sampling (discussed next).
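A sketch of the Beta–Binomial conjugate update mentioned above; the prior parameters and data are illustrative:

```python
# Sketch of a conjugate update: a Beta prior with a Binomial likelihood gives
# a Beta posterior in closed form, so no numerical integration is needed.
a, b = 2.0, 2.0         # Beta(a, b) prior (illustrative choice)
heads, tails = 6, 2     # observed coin-flip data

a_post, b_post = a + heads, b + tails     # posterior is Beta(a + heads, b + tails)
post_mean = a_post / (a_post + b_post)    # posterior mean, available analytically
print(post_mean)                          # 8 / 12 ≈ 0.667
```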
Slides. Often we can't calculate the posterior distribution directly, and so we sample from it using Monte Carlo methods (basically just sampling methods).
- Rejection sampling creates independent samples, but it becomes increasingly inefficient as the dimension increases (one example of the Curse of dimensionality).
- Dependent sampling. A sampling algorithm where the next sample depends on the current value.
- Markov chain Monte Carlo. Where to step next is determined via a distribution conditional on the current parameter value (a 1st-order Markov chain). We want to choose the starting position and the conditional sampling distribution so that the distribution of samples converges to the posterior.
- Metropolis algorithm (Random walk Metropolis). Under quite general conditions the Random Walk Metropolis sampler converges asymptotically to the posterior (Ergodic theorem). We move based on the ratio of the proposed un-normalised posterior to that at our current location => no need to calculate the troublesome denominator. See Efficient Bayesian inference with Hamiltonian Monte Carlo -- Michael Betancourt (Part 1). To check for convergence, multiple walkers are used (Multiple chain convergence monitoring), though which measure to use isn't clear; Gelman and Rubin (1992) had the idea of comparing within-chain to between-chain variability. Dependence ↑ => Effective sample size ↓. See the sketch after this list.
- Metropolis-Hastings. See here. Helps with uniform convergence near boundaries. For unconstrained parameters we are free to use symmetric jumping kernels; however, for constrained parameters we are forced to break this symmetry.
- Gibbs sampling
- Hamiltonian Monte Carlo
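A sketch of Random walk Metropolis plus the Gelman–Rubin within/between-chain comparison; the standard-normal log-posterior is purely illustrative (any un-normalised log posterior can be substituted):

```python
import numpy as np

def log_post_unnorm(x):
    return -0.5 * x**2   # illustrative un-normalised log posterior (standard normal)

def rw_metropolis(x0, n_steps, step_size, rng):
    """Random walk Metropolis with a symmetric Gaussian jumping kernel."""
    x, lp = x0, log_post_unnorm(x0)
    out = np.empty(n_steps)
    for i in range(n_steps):
        prop = x + step_size * rng.standard_normal()
        lp_prop = log_post_unnorm(prop)
        # Accept with probability min(1, posterior ratio): the normalization
        # constant cancels in the ratio, so we never need the denominator.
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        out[i] = x
    return out

rng = np.random.default_rng(1)
chains = np.stack([rw_metropolis(x0, 5_000, 0.8, rng) for x0 in (-5.0, 0.0, 5.0)])
chains = chains[:, 2_500:]   # discard burn-in

# Gelman–Rubin: compare within-chain (W) to between-chain (B) variability;
# R_hat ≈ 1 once the chains have mixed.
m, n = chains.shape
W = chains.var(axis=1, ddof=1).mean()
B = n * chains.mean(axis=1).var(ddof=1)
R_hat = np.sqrt(((n - 1) / n * W + B / n) / W)
print(R_hat)
```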
Hierarchical models
Estimating ODE/PDE parameters. Add random noise around the DE solution.
Can use the random walk Metropolis-Hastings algorithm, e.g. as sketched below.
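A sketch of this, assuming a logistic-growth ODE with Gaussian noise around the solution and a flat prior on the growth rate r; the model and all numbers are illustrative:

```python
import numpy as np
from scipy.integrate import solve_ivp

K, y0, sigma = 10.0, 0.5, 0.3           # carrying capacity, initial state, noise level
t_obs = np.linspace(0, 10, 20)

def solve_ode(r):
    """Solve dy/dt = r * y * (1 - y/K) at the observation times."""
    sol = solve_ivp(lambda t, y: r * y * (1 - y / K), (0, 10), [y0], t_eval=t_obs)
    return sol.y[0]

rng = np.random.default_rng(2)
y_obs = solve_ode(0.8) + sigma * rng.standard_normal(t_obs.size)  # synthetic data

def log_post(r):
    if r <= 0:
        return -np.inf                   # flat prior on r > 0
    resid = y_obs - solve_ode(r)
    return -0.5 * np.sum(resid**2) / sigma**2   # Gaussian noise around the DE solution

# Random walk Metropolis over r (same scheme as the sampler above)
r, lp = 1.0, log_post(1.0)
draws = []
for _ in range(2_000):
    prop = r + 0.05 * rng.standard_normal()
    lp_prop = log_post(prop)
    if np.log(rng.uniform()) < lp_prop - lp:
        r, lp = prop, lp_prop
    draws.append(r)
print(np.mean(draws[500:]))              # posterior mean of r after burn-in
```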
Posterior predictive distribution
From θ∣X to X̃∣X: find the probability distribution over new observations by marginalizing over the posterior, P(X̃∣X) = ∑θ P(X̃∣θ, X) P(θ∣X).
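A Monte Carlo version of this marginalization, assuming (illustratively) a Beta posterior and a Bernoulli likelihood: draw θ from the posterior, then draw X̃ from P(X̃∣θ):

```python
import numpy as np

rng = np.random.default_rng(3)
theta_draws = rng.beta(7, 3, size=10_000)  # stand-in posterior samples of theta
x_new = rng.binomial(1, theta_draws)       # one predictive draw per posterior draw
print(x_new.mean())                        # ≈ posterior predictive P(x_new = 1 | X)
```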
Lecture course - notes pdf. notes2
https://www.wikiwand.com/en/Bayesian_inference
Bayesian inference exercises
As exemplified by Gaussian processes, one can also apply Bayes' theorem by modeling the joint distribution over the Data and the parameters (or whatever is to be inferred), which is what appears in the numerator of Bayes' theorem; conditioning on the data then yields the posterior (see the sketch below).
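A sketch of this joint-modeling view for Gaussian process regression: write down the joint Gaussian over observed and new outputs, then condition on the observations. The RBF kernel and all numbers are illustrative:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    """Squared-exponential (RBF) kernel between two sets of 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

X = np.array([-2.0, -0.5, 1.0, 2.5])   # observed inputs
y = np.sin(X)                          # toy observations
Xs = np.linspace(-3, 3, 7)             # new inputs
noise = 1e-3

K = rbf(X, X) + noise * np.eye(len(X))   # joint covariance blocks
Ks = rbf(X, Xs)
Kss = rbf(Xs, Xs)

# Condition the joint Gaussian on y: posterior mean and covariance of f(Xs)
mean = Ks.T @ np.linalg.solve(K, y)
cov = Kss - Ks.T @ np.linalg.solve(K, Ks)
print(mean)
```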
Bayes: what's the optimal predictor for a given prior? What is the optimal prior?
Learning theory: is your prior good enough for the data you have? What is a good enough predictor?