An extension of Probably approximately correct learning to the case where not all concepts in the concept class are equally likely, but instead one has a general probability distribution over them, which can be interpreted as a Prior distribution.
PAC-Bayes theorem (Gibbs posterior version) (Some PAC-Bayesian Theorems)
For any measure $P$ on any concept space and any measure $D$ on a space of instances we have, for any $\delta > 0$, that with probability at least $1-\delta$ over the choice of a sample of $m$ instances, all measurable subsets $U$ of the concepts such that every element of $U$ is consistent with the sample, and with $P(U) > 0$, satisfy the following:

$$\epsilon(Q_U) \;\leq\; \frac{\ln\frac{1}{P(U)} + \ln\frac{1}{\delta} + 2\ln m + 1}{m}$$

where $Q_U(c) = P(c)/P(U)$ for $c \in U$ (and $0$ otherwise), and where $\epsilon(Q_U) = \mathbb{E}_{c \sim Q_U}[\epsilon(c)]$, i.e. the expected value of the generalization errors over concepts in $U$ with probability given by the posterior $Q_U$. Here, $\epsilon(c)$ is the generalization error of $c$ (the probability of the concept disagreeing with the target concept, when sampling inputs from $D$).
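As a quick sanity check on the scaling, a minimal sketch (in Python, with made-up values for $P(U)$, $\delta$ and $m$) of evaluating the right-hand side of this bound:

```python
import math

def gibbs_consistent_bound(prior_mass_U, delta, m):
    """Right-hand side of the consistent-case Gibbs-posterior bound above.

    prior_mass_U : prior probability P(U) of the consistent set U
    delta        : confidence parameter
    m            : number of training instances
    """
    return (math.log(1 / prior_mass_U) + math.log(1 / delta)
            + 2 * math.log(m) + 1) / m

# Example (hypothetical numbers): a consistent set with prior mass 1e-6,
# confidence 0.05, and 10,000 training instances.
print(gibbs_consistent_bound(1e-6, 0.05, 10_000))  # ~0.0036
```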
See the proof in Some PAC-Bayesian Theorems, and also in my notebook in the office.
The Gibbs posterior version is a special case of the General PAC-Bayes theorem (PAC-Bayes model averaging), which states that with probability at least $1-\delta$ over the sample, for all posteriors $Q$,

$$\epsilon(Q) \;\leq\; \hat{\epsilon}(Q) + \sqrt{\frac{\mathrm{KL}(Q\|P) + \ln\frac{1}{\delta} + \ln m + 2}{2m - 1}}$$

where $\hat{\epsilon}(Q)$ is the expected empirical (training) error under $Q$.
This is actually a particular case of the Relative entropy general PAC-Bayes theorem from Langford, which tightens the above bound, especially at large values of the bound (in particular, the Langford one is automatically smaller than 1). The theorem is:

$$\mathrm{kl}\!\left(\hat{\epsilon}(Q) \,\middle\|\, \epsilon(Q)\right) \;\leq\; \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m}$$

which holds with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$; here $\mathrm{kl}(q\|p)$ is the KL divergence between Bernoulli distributions with parameters $q$ and $p$.
See here for the proof of the general (KL) version. See the derivation on the blackboard (pic) for the connection with the Gibbs learner version (graph of bounds for the KL divergence of binary variables).
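To actually use the relative entropy version one has to invert the binary KL numerically. A minimal sketch (Python, bisection; the helper names and the example numbers are mine) of turning the right-hand side into an explicit bound on $\epsilon(Q)$:

```python
import math

def binary_kl(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, budget, tol=1e-9):
    """Largest p in [q_hat, 1) with binary_kl(q_hat, p) <= budget (bisection)."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo

# Hypothetical numbers: empirical Gibbs error 0.05, KL(Q||P) = 10 nats,
# m = 10,000 samples, delta = 0.05.
m, delta, kl_qp, emp_err = 10_000, 0.05, 10.0, 0.05
budget = (kl_qp + math.log((m + 1) / delta)) / m
print(kl_inverse(emp_err, budget))  # upper bound on epsilon(Q), ~0.066 here
```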
To see how Theorem 1 from Some PAC-Bayesian Theorems follows from the general theorem (PAC-Bayes model averaging), see the picture of the blackboard derivation, or see here.
One can also show that the Gibbs posterior gives the optimal bound of the form given by the general theorem. See discussion in Evidence lower bound.
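A sketch of the standard variational argument behind this (the inverse temperature $\beta$ and the linearized objective are my choice of parametrization, not necessarily the exact form in the linked notes):

```latex
% Minimizing the (linearized) PAC-Bayes objective over posteriors Q,
%   J(Q) = beta * E_{c ~ Q}[ \hat\epsilon(c) ] + KL(Q || P),
% by completing the KL divergence:
\begin{align}
J(Q) &= \mathrm{KL}\!\left(Q \,\middle\|\, Q^{*}\right)
        \;-\; \ln \mathbb{E}_{c \sim P}\!\left[ e^{-\beta \hat{\epsilon}(c)} \right],
&
Q^{*}(c) &= \frac{P(c)\, e^{-\beta \hat{\epsilon}(c)}}
                 {\mathbb{E}_{c' \sim P}\!\left[ e^{-\beta \hat{\epsilon}(c')} \right]}.
\end{align}
% So the minimizer is the Gibbs posterior Q^*, and the minimum value,
% -ln E_{c~P}[exp(-beta \hat\epsilon(c))], is a log-partition (ELBO-type) term.
% In the realizable case, beta -> infinity concentrates Q^* on the set U of
% consistent concepts, recovering the posterior Q_U of Theorem 1 above.
```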
– Tutorial on Practical Prediction Theory for Classification (has a nice proof of this in here too)
History
PAC-Bayes is inspired by Structural risk minimization. See this paper by Shawe-Taylor.
There is a version that is essentially just Structural risk minimization/Weighted Union bound:
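A standard form of such a bound (my reconstruction of the usual Occam / weighted union bound over a countable concept class with prior $P$; the exact constants are not necessarily those of the version this note had in mind): with probability at least $1-\delta$ over the sample of $m$ instances, for all concepts $c$ with $P(c) > 0$,

```latex
\begin{equation}
\epsilon(c) \;\leq\; \hat{\epsilon}(c)
  \;+\; \sqrt{\frac{\ln\frac{1}{P(c)} + \ln\frac{1}{\delta}}{2m}}.
\end{equation}
% Proof sketch: apply Hoeffding's inequality to each concept c with failure
% probability delta * P(c), then take the union bound over all concepts.
```

Each concept gets a slice $\delta P(c)$ of the total failure probability, which is exactly the SRM / weighted-union-bound flavour; PAC-Bayes replaces the point mass on a single $c$ with a general posterior $Q$.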
video lecture on PAC-Bayesian learning
Nonvacuous bounds for stochastic DNNs paper: https://arxiv.org/pdf/1703.11008.pdf
Distribution-dependent bounds
Data-dependent PAC-Bayes priors via differential privacy
A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks
Deep learning generalizes because the parameter-function map is biased towards simple functions
PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification
Optimal priors
See notes on paper
https://photos.google.com/search/blackboard/photo/AF1QipMudW2vT2w7OTE5t0xflAJRCT7yPC8fEN5oC1tC
See notes on Telegram
I think you get that the average bound is some average entropy term that is independent of the PAC-Bayes prior, plus the cross entropy between the average probability of obtaining a function and the PAC-Bayes prior. Therefore the optimal prior is the average probability of obtaining a function.
One finds this by expressing the KL divergence as a sum over functions (its definition), and the sum over U|X as another sum over functions (as I did in the other derivations), and then swapping the sums.
To be precise, the optimal prior is
$$\pi(c) = \sum_X D(X) \sum_{c'} P(c')\, Q_U(c) = \sum_{c'} P(c') \sum_X D(X)\, Q_U(c),$$
where $U = U(c', X)$ is a function of $c'$ and $X$.
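A sketch of the swap-the-sums argument above (my notation; $\langle Q\rangle$ is just another name for the averaged posterior $\pi$ defined above):

```latex
\begin{align}
\mathbb{E}_{X \sim D}\, \mathbb{E}_{c' \sim P}\,
  \mathrm{KL}\!\left( Q_{U(c',X)} \,\middle\|\, \pi \right)
&= \sum_X D(X) \sum_{c'} P(c') \sum_c Q_U(c) \ln \frac{Q_U(c)}{\pi(c)} \\
&= \underbrace{-\,\overline{H}(Q_U)}_{\text{average entropy, independent of } \pi}
   \;+\; \sum_c \langle Q \rangle(c) \ln \frac{1}{\pi(c)},
\qquad
\langle Q \rangle(c) := \sum_X D(X) \sum_{c'} P(c')\, Q_{U(c',X)}(c).
\end{align}
% The second term is the cross entropy between <Q> and the prior pi, which
% (by Gibbs' inequality) is minimized at pi = <Q>.  So the optimal PAC-Bayes
% prior is the averaged posterior, i.e. the average probability of obtaining
% a function, as claimed above.
```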