Bayesian inference

cosmos 30th November 2018 at 7:12pm
Bayesian statistics Statistical inference

http://approximateinference.org/

Introduction

Method

Likelihood + Prior –(Bayes' theorem)–> Posterior
  1. Define variables
  2. Define the Probabilistic model that we are going to consider.
  3. We first choose a Prior distribution over the set of hypotheses, for instance favouring simpler ones (see regularization), which defines the parametrized family of Likelihood functions.
  4. We then calculate the posterior distribution using Bayes' theorem: $P(h|D) \propto P(D|h)\,P(h)$.
  5. We can then make a new prediction by weighting over all hypotheses to calculate the expected value of the output for a new input. One can show (see the Elements of Statistical Learning book) that if we knew the true distribution of the output given the input, this expected value is the prediction that minimizes the Generalization error under squared loss.

The last two steps are often computationally very difficult. So, what's commonly done is maximizing the posterior distribution (MAP principle, above).
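
A minimal sketch of steps 2–5 on a grid, assuming a coin-flip model with a flat prior (the data and grid here are made up for illustration); the grid makes the normalization, the posterior-weighted prediction, and the MAP shortcut explicit:

```python
import numpy as np

# Hypothetical example: infer the bias theta of a coin from observed flips,
# using a grid approximation of the posterior.
np.random.seed(0)
data = np.random.binomial(1, 0.7, size=20)        # assumed observed flips

theta = np.linspace(0.001, 0.999, 999)            # grid over hypotheses
prior = np.ones_like(theta)                       # flat prior (step 3, a choice)
log_lik = (data.sum() * np.log(theta)
           + (len(data) - data.sum()) * np.log(1 - theta))  # step 2's likelihood
posterior = prior * np.exp(log_lik)               # Bayes' theorem numerator (step 4)
posterior /= posterior.sum()                      # normalize over the grid

# Step 5: predict by weighting over all hypotheses (posterior mean of theta
# = probability the next flip is heads), versus the MAP shortcut.
pred_next_heads = np.sum(theta * posterior)
theta_map = theta[np.argmax(posterior)]
print(pred_next_heads, theta_map)
```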

Posterior summaries

Depending on the Loss function, different summaries are optimal, as studied in Decision theory: squared loss gives the posterior mean, absolute loss the posterior median, and 0-1 loss the MAP. In general, prefer the posterior mean or median over the MAP.
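
A small sketch of how the loss function picks the summary, using an assumed skewed posterior represented by samples (all distribution choices below are illustrative):

```python
import numpy as np

# Hypothetical skewed posterior (samples from a Gamma), to show that
# different loss functions pick out different summaries.
np.random.seed(1)
samples = np.random.gamma(shape=2.0, scale=1.0, size=100_000)

post_mean = samples.mean()          # optimal under squared-error loss
post_median = np.median(samples)    # optimal under absolute-error loss
# Crude MAP estimate from a histogram (optimal under 0-1 loss):
counts, edges = np.histogram(samples, bins=200)
post_mode = 0.5 * (edges[np.argmax(counts)] + edges[np.argmax(counts) + 1])
print(post_mean, post_median, post_mode)   # mean > median > mode for this skew
```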

Ways of dealing with the problem of integrating the likelihood against the prior to find the normalization constant

  • Conjugate priors are particular choices of prior distribution which give posterior distributions that are analytically integrable (the posterior stays in the same family as the prior); see the sketch after this list.
  • Discretize Bayes' rule (grid approximation).
  • Sampling
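
A minimal conjugate-prior sketch, assuming a Beta prior on a coin bias with a Bernoulli likelihood (the prior parameters and data are invented); the posterior is again a Beta, so no numerical integration is needed:

```python
import numpy as np

# Beta prior is conjugate to the Bernoulli likelihood: the posterior is
# Beta(a0 + heads, b0 + tails), so the normalization never has to be
# computed numerically.
a0, b0 = 2.0, 2.0                        # assumed Beta(a0, b0) prior
data = np.array([1, 0, 1, 1, 1, 0, 1])   # assumed observed flips

a_post = a0 + data.sum()
b_post = b0 + len(data) - data.sum()
post_mean = a_post / (a_post + b_post)   # analytic posterior mean
print(a_post, b_post, post_mean)
```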

Approximate Bayesian inference

Sampling

slides. Often, we can't calculate the posterior distribution directly, and so we sample, using Monte Carlo methods (basically just sampling methods).

  • Rejection sampling creates independent samples, but becomes increasingly inefficient as the dimension increases (one example of the Curse of dimensionality).
  • Dependent sampling: a sampling algorithm where the next sample depends on the current value.
    • Markov chain Monte Carlo. Where to step next is determined via a distribution conditional on the current parameter value (a 1st-order Markov chain). We want to choose the starting position and the conditional sampling distribution so that the distribution of samples converges to the posterior.
      • Metropolis algorithm (random-walk Metropolis). Under quite general conditions the random-walk Metropolis sampler converges asymptotically to the posterior (by the Ergodic theorem). We accept a proposed move based on the ratio of the proposed un-normalised posterior to that at our current location, so there is no need to calculate the troublesome denominator; see the sketch after this list, and Efficient Bayesian inference with Hamiltonian Monte Carlo -- Michael Betancourt (Part 1). To check for convergence, multiple walkers are used (multiple-chain convergence monitoring); which measure to use still isn't clear, but Gelman and Rubin (1992) had the idea of comparing within-chain to between-chain variability. Dependence ↑ ⇒ Effective sample size ↓.
        • Metropolis-Hastings. See here. Helps with convergence near boundaries: for unconstrained parameters we are free to use symmetric jumping kernels, but for constrained parameters we are forced to break this symmetry (and correct for it with the Hastings ratio).
    • Gibbs sampling
    • Hamiltonian Monte Carlo
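
A sketch of random-walk Metropolis on an assumed un-normalised 1D target (a standard normal, chosen only for illustration), run with several chains and a basic Gelman-Rubin comparison of within- to between-chain variability; the step size, chain starts, and burn-in are invented tuning choices:

```python
import numpy as np

def log_unnorm_post(x):
    return -0.5 * x**2                      # log of un-normalised posterior (toy target)

def metropolis(n_steps, x0, step=1.0, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    xs = np.empty(n_steps)
    x, logp = x0, log_unnorm_post(x0)
    for i in range(n_steps):
        prop = x + step * rng.standard_normal()     # symmetric proposal
        logp_prop = log_unnorm_post(prop)
        # Accept with probability given by the ratio of un-normalised
        # posteriors, so the normalization constant cancels.
        if np.log(rng.random()) < logp_prop - logp:
            x, logp = prop, logp_prop
        xs[i] = x
    return xs

rng = np.random.default_rng(0)
chains = np.array([metropolis(5000, x0, rng=rng) for x0 in (-5.0, 0.0, 5.0, 10.0)])
chains = chains[:, 2500:]                   # discard burn-in

# Gelman-Rubin potential scale reduction factor (R-hat near 1 suggests convergence).
m, n = chains.shape
W = chains.var(axis=1, ddof=1).mean()               # within-chain variance
B = n * chains.mean(axis=1).var(ddof=1)             # between-chain variance
var_hat = (n - 1) / n * W + B / n
print("R-hat:", np.sqrt(var_hat / W))
```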

Variational inference

Approximate Bayesian computation


Hierarchical models

Differential equations models

Estimating ODE/PDE parameters: add random observation noise around the DE solution to define the likelihood.

Can use the random-walk Metropolis-Hastings algorithm; see the sketch below.
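
A rough sketch under these assumptions: the ODE is dy/dt = -k·y with known analytic solution, the observation noise is Gaussian with known variance, and a flat prior is placed on k > 0; the data, initial condition, and tuning constants are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 5, 30)
y0, k_true, sigma = 10.0, 0.8, 0.3
# Simulated observations: DE solution plus Gaussian noise.
y_obs = y0 * np.exp(-k_true * t) + sigma * rng.standard_normal(t.size)

def log_post(k):
    if k <= 0:
        return -np.inf                          # flat prior on k > 0
    resid = y_obs - y0 * np.exp(-k * t)
    return -0.5 * np.sum(resid**2) / sigma**2   # Gaussian-noise log likelihood

# Random-walk Metropolis on the decay rate k.
k, logp = 1.0, log_post(1.0)
samples = []
for _ in range(20_000):
    prop = k + 0.05 * rng.standard_normal()
    lp = log_post(prop)
    if np.log(rng.random()) < lp - logp:
        k, logp = prop, lp
    samples.append(k)
print("posterior mean of k:", np.mean(samples[5000:]))
```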

Posterior predictive distribution

From $\theta | X$ to $\tilde{X} | X$: find the probability distribution over new observations by marginalizing over the posterior, $P(\tilde{X}|X) = \sum_{\theta} P(\tilde{X}|\theta, X)\,P(\theta|X)$.
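
A Monte Carlo version of this marginalization, assuming a Beta posterior over a coin bias (the posterior parameters are made up): the predictive probability is the posterior average of the per-θ predictive.

```python
import numpy as np

rng = np.random.default_rng(0)
a_post, b_post = 8.0, 4.0                                 # assumed Beta posterior
theta_samples = rng.beta(a_post, b_post, size=100_000)    # samples of theta | X

# P(next flip = heads | X) = E[P(heads | theta) | X], the marginalization above.
p_next_heads = np.mean(theta_samples)
# A whole predictive distribution, e.g. the number of heads in 10 new flips:
new_counts = rng.binomial(10, theta_samples)              # samples of X~ | X
print(p_next_heads, np.bincount(new_counts, minlength=11) / new_counts.size)
```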


Lecture course - notes pdf. notes2


https://www.wikiwand.com/en/Bayesian_inference

Bayesian inference exercises


As exemplified by Gaussian processes, one can also apply Bayes' theorem by modelling the joint distribution of the Data and the parameter (or thing to be inferred), which is what appears in the numerator.
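
A toy sketch of this joint-modelling view, assuming a single latent value f with a zero-mean Gaussian prior and one noisy observation y = f + noise (all variances invented); conditioning the joint Gaussian on y gives the posterior in closed form, as in Gaussian process regression:

```python
import numpy as np

var_f, var_noise = 2.0, 0.5      # assumed prior and noise variances
# Joint covariance of (f, y) is [[var_f, var_f], [var_f, var_f + var_noise]].
y_obs = 1.3                      # assumed observation

# Gaussian conditioning of the joint on y gives the posterior over f.
post_mean = var_f / (var_f + var_noise) * y_obs        # E[f | y]
post_var = var_f - var_f**2 / (var_f + var_noise)      # Var[f | y]
print(post_mean, post_var)
```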


Bayes: what's the optimal predictor for a given prior? What is the optimal prior? Learning theory: is your prior good enough for the data you have? What is a good enough predictor?