An extension of Probably approximately correct learning to the case where not all concepts in the concept class are equally likely, but instead one has a general probability distribution over them, which can be interpreted as a Prior distribution.
PAC-Bayes theorem (Gibbs posterior version) (Some PAC-Bayesian Theorems)
For any measure $P$ on any concept space and any measure $D$ on a space of instances we have, for any $\delta > 0$, that with probability at least $1-\delta$ over the choice of a sample of $m$ instances, all measurable subsets $U$ of the concepts such that every element of $U$ is consistent with the sample, and with $P(U) > 0$, satisfy the following:

$$\epsilon(Q_U) \;\leq\; \frac{\ln\frac{1}{P(U)} + \ln\frac{1}{\delta} + 2\ln m + 1}{m}$$

where $Q_U(c) = P(c)/P(U)$ for $c \in U$ (and $0$ otherwise), and where $\epsilon(Q_U) = \mathbb{E}_{c \sim Q_U}[\epsilon(c)]$, i.e. the expected value of the generalization errors over concepts in $U$ with probability given by the posterior $Q_U$. Here, $\epsilon(c)$ is the generalization error of $c$ (the probability of the concept disagreeing with the target concept, when sampling inputs from $D$).
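As a quick sanity check on the scaling, a minimal sketch (in Python, with made-up values for $P(U)$, $\delta$ and $m$) of evaluating the right-hand side of this bound:

```python
import math

def gibbs_consistent_bound(prior_mass_U, delta, m):
    """Right-hand side of the consistent-case Gibbs-posterior bound above.

    prior_mass_U : prior probability P(U) of the consistent set U
    delta        : confidence parameter
    m            : number of training instances
    """
    return (math.log(1 / prior_mass_U) + math.log(1 / delta)
            + 2 * math.log(m) + 1) / m

# Example (hypothetical numbers): a consistent set with prior mass 1e-6,
# confidence 0.05, and 10,000 training instances.
print(gibbs_consistent_bound(1e-6, 0.05, 10_000))  # ~0.0036
```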
See the proof in Some PAC-Bayesian Theorems, and also in my notebook in the office.
The Gibbs posterior version is a special case of the General PAC-Bayes theorem (PAC-Bayes model averaging), which states that with probability at least $1-\delta$ over the sample, for all posteriors $Q$,

$$\epsilon(Q) \;\leq\; \hat{\epsilon}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln\frac{1}{\delta} + \ln m + 2}{2m - 1}}$$

where $\hat{\epsilon}(Q)$ is the expected empirical (training) error under $Q$.
This is actually a particular case of the Relative entropy general PAC-Bayes theorem from Langford, which tightens the above bound, especially at large values of the bound (in particular, the Langford one is automatically smaller than 1). The theorem is:

$$\mathrm{kl}\!\left(\hat{\epsilon}(Q) \,\middle\|\, \epsilon(Q)\right) \;\leq\; \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m}$$

which holds with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$; here $\mathrm{kl}(q\|p)$ is the KL divergence between Bernoulli distributions with parameters $q$ and $p$.
See here for the proof of the general (KL) version. See the derivation on the blackboard (pic) for the connection with the Gibbs learner version (graph of bounds for the KL divergence of binary variables).
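To actually use the relative entropy version one has to invert the binary KL numerically. A minimal sketch (Python, bisection; the helper names and the example numbers are mine) of turning the right-hand side into an explicit bound on $\epsilon(Q)$:

```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, budget, tol=1e-9):
    """Largest p in [q_hat, 1) with binary_kl(q_hat, p) <= budget (bisection)."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: empirical Gibbs error 0.05, KL(Q||P) = 10 nats,
# m = 10,000 samples, delta = 0.05.
m, delta, kl_qp, emp_err = 10_000, 0.05, 10.0, 0.05
budget = (kl_qp + math.log((m + 1) / delta)) / m
print(kl_inverse(emp_err, budget))  # upper bound on epsilon(Q), ~0.066 here
```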
To see how Theorem 1 from Some PAC-Bayesian Theorems follows from the general theorem (PAC-Bayes model averaging), see the picture of the blackboard derivation, or see here.
One can also show that the Gibbs posterior gives the optimal bound of the form given by the general theorem. See discussion in Evidence lower bound.
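A sketch of the standard variational argument behind this (the inverse temperature $\beta$ and the linearized objective are my choice of parametrization, not necessarily the exact form in the linked notes):

```latex
% Minimizing the (linearized) PAC-Bayes objective over posteriors Q,
%   J(Q) = beta * E_{c ~ Q}[ \hat\epsilon(c) ] + KL(Q || P),
% by completing the KL divergence:
\begin{align}
J(Q) &= \mathrm{KL}\!\left(Q \,\middle\|\, Q^{*}\right)
        \;-\; \ln \mathbb{E}_{c \sim P}\!\left[ e^{-\beta \hat{\epsilon}(c)} \right],
&
Q^{*}(c) &= \frac{P(c)\, e^{-\beta \hat{\epsilon}(c)}}
                 {\mathbb{E}_{c' \sim P}\!\left[ e^{-\beta \hat{\epsilon}(c')} \right]}.
\end{align}
% So the minimizer is the Gibbs posterior Q^*, and the minimum value,
% -ln E_{c~P}[exp(-beta \hat\epsilon(c))], is a log-partition (ELBO-type) term.
% In the realizable case, beta -> infinity concentrates Q^* on the set U of
% consistent concepts, recovering the posterior Q_U of Theorem 1 above.
```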
– Tutorial on Practical Prediction Theory for Classification (has a nice proof of this in here too)
History
PAC-Bayes is inspired by Structural risk minimization. See this paper by Shawe-Taylor.
There is a version that is essentially just Structural risk minimization/Weighted Union bound:
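A standard form of such a bound (my reconstruction of the usual Occam / weighted union bound over a countable concept class with prior $P$; the exact constants are not necessarily those of the version this note had in mind): with probability at least $1-\delta$ over the sample of $m$ instances, for all concepts $c$ with $P(c) > 0$,

```latex
\begin{equation}
\epsilon(c) \;\leq\; \hat{\epsilon}(c)
  \;+\; \sqrt{\frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta}}{2m}}.
\end{equation}
% Proof sketch: apply Hoeffding's inequality to each concept c with failure
% probability delta * P(c), then take the union bound over all concepts.
```

Each concept gets a slice $\delta P(c)$ of the total failure probability, which is exactly the SRM / weighted-union-bound flavour; PAC-Bayes replaces the point mass on a single $c$ with a general posterior $Q$.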
video lecture on PAC-Bayesian learning
Nonvacuous bounds for stochastic DNNs paper: https://arxiv.org/pdf/1703.11008.pdf
Distribution-dependent bounds
Data-dependent PAC-Bayes priors via differential privacy
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
Deep learning generalizes because the parameter-function map is biased towards simple functions
PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification
Optimal priors
See notes on paper
https://photos.google.com/search/blackboard/photo/AF1QipMudW2vT2w7OTE5t0xflAJRCT7yPC8fEN5oC1tC
See notes on Telegram
I think you get that the average bound is some average entropy term that is independent of the PAC-Bayes prior, plus the cross entropy between the average probability of obtaining a function and the PAC-Bayes prior. Therefore the optimal prior is the average probability of obtaining a function.
One finds this by expressing the KL divergence as a sum over functions (its definition), and the sum over U|X as another sum over functions (as I did in the other derivations), and then swapping the sums.
To be precise, the optimal prior is
$$\pi(c) = \sum_X D(X) \sum_{c'} P(c')\, Q_U(c) = \sum_{c'} P(c') \sum_X D(X)\, Q_U(c),$$
where $U = U(c', X)$ is a function of $c'$ and $X$.
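A sketch of the swap-the-sums argument above (my notation; $\langle Q\rangle$ is just another name for the averaged posterior $\pi$ defined above):

```latex
\begin{align}
\mathbb{E}_{X \sim D}\, \mathbb{E}_{c' \sim P}\,
  \mathrm{KL}\!\left( Q_{U(c',X)} \,\middle\|\, \pi \right)
&= \sum_X D(X) \sum_{c'} P(c') \sum_c Q_U(c) \ln \frac{Q_U(c)}{\pi(c)} \\
&= \underbrace{-\,\overline{H}(Q_U)}_{\text{average entropy, independent of } \pi}
   \;+\; \sum_c \langle Q \rangle(c) \ln \frac{1}{\pi(c)},
\qquad
\langle Q \rangle(c) := \sum_X D(X) \sum_{c'} P(c')\, Q_{U(c',X)}(c).
\end{align}
% The second term is the cross entropy between <Q> and the prior pi, which
% (by Gibbs' inequality) is minimized at pi = <Q>.  So the optimal PAC-Bayes
% prior is the averaged posterior, i.e. the average probability of obtaining
% a function, as claimed above.
```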