Margin-based generalization bound

cosmos 19th December 2019 at 7:05pm
Generalization Rademacher complexity Structural risk minimization

aka Margin theory, margin bound

See Understanding Machine Learning by Shalev-Shwartz and Ben-David ("Shai and Shai") for SVM bounds; they are mostly based on Structural risk minimization and on bounding Rademacher complexity via norms of the parameters/inputs (a schematic statement is sketched below). But some bounds, developed by Langford and others, are based on PAC-Bayesian theory and may be tighter (see the section below).
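As a rough sketch of the Rademacher-complexity route (this is the standard statement, e.g. as in UML or Mohri et al.; the exact constants differ between sources): for a class $F$ of real-valued score functions, a margin $\gamma > 0$, and a sample $S$ of size $m$, with probability at least $1-\delta$, for every $f \in F$

$$\Pr[y f(x) \le 0] \;\le\; \hat{R}_\gamma(f) + \frac{2}{\gamma}\,\mathfrak{R}_m(F) + \sqrt{\frac{\log(1/\delta)}{2m}},$$

where $\hat{R}_\gamma(f)$ is the fraction of training points with margin $y f(x) < \gamma$ and $\mathfrak{R}_m(F)$ is the Rademacher complexity of $F$. For linear/SVM classes one then bounds $\mathfrak{R}_m(F)$ in terms of norms, e.g. $\mathfrak{R}_m(F) \le BR/\sqrt{m}$ when $\|w\| \le B$ and $\|x\| \le R$, which is what makes the bound scale-sensitive.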

Theory of classification: A survey of some recent advances.

section from UML

Video about margin theory

Estimation of Dependences Based on Empirical Data
A training algorithm for optimal margin classifiers
Support-vector networks
The Nature of Statistical Learning Theory

The algorithmic idea of large margin classifiers was introduced in the linear case by Vapnik (1982) (see also (Boser et al., 1992; Cortes and Vapnik, 1995)). Vapnik (1995) gave an intuitive explanation of the performance of these methods based on a sample-dependent VC-dimension calculation, but without generalization bounds. The first rigorous generalization bounds for large margin linear classifiers (Shawe-Taylor et al., 1998) required a scale-sensitive complexity analysis of real-valued function classes. At the same time, a large margin analysis was developed for two-layer networks (Bartlett, 1996), indeed with a proof technique that inspired the layer-wise induction used to prove Theorem 1.1 in the present work. Margin theory was quickly extended to many other settings (see for instance the survey by Boucheron et al. (2005)), one major success being an explanation of the generalization ability of boosting methods, which exhibit an explicit growth in the size of the function class over time, but a stable excess risk (Schapire et al., 1997).

Structural risk minimization over data-dependent hierarchies
For valid generalization the size of the weights is more important than the size of the network
The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network

Boosting the margin: a new explanation for the effectiveness of voting methods

Paper with applications of Rademacher complexity to classes of margin classifiers: http://www.jmlr.org/papers/volume5/luxburg04b/luxburg04b.pdf

PAC-Bayesian approach to margin bounds

Bounds for Averaging Classifiers

PAC-Bayes & Margins (Langford and Shawe-Taylor). In this paper they show that if there exists a weight vector $w$ with margin risk $l_\gamma(w,S)$ for margin $\gamma$ and training set $S$ (i.e. the fraction of training samples with margin smaller than $\gamma$), then we can choose a certain posterior distribution $Q$, depending on $w$, to plug into the general PAC-Bayesian theorem. For this posterior we can compute the relative entropy $\mathrm{KL}(Q\|P)$, and bound the empirical risk of the Gibbs classifier $Q$ by the margin risk of $w$ plus a term which decreases with increasing margin. The idea behind this second term is that $Q$ is a distribution supported on vectors $w'$ lying in a half-space determined by $w$. The larger the margin, the larger the fraction of these vectors that classify the training sample correctly, and so the smaller the bound on the empirical Gibbs risk.
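Schematically (this is one common form of the PAC-Bayes theorem; the corollary in the paper has different constants and log factors): with probability at least $1-\delta$ over $S \sim D^m$, for every posterior $Q$

$$\mathrm{kl}\big(\hat{e}_S(Q)\,\big\|\,e_D(Q)\big) \;\le\; \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m},$$

where $\hat{e}_S(Q)$ and $e_D(Q)$ are the empirical and true error of the Gibbs classifier. Choosing $Q$ concentrated around a unit vector $w$ that achieves margin $\gamma$ (with $\|x\| \le 1$) makes $\mathrm{KL}(Q\|P)$ scale roughly like $1/\gamma^2$, and $\hat{e}_S(Q)$ at most about $l_\gamma(w,S)$ plus a small correction, so the resulting true-error bound is of order $l_\gamma(w,S) + \tilde{O}\!\left(\sqrt{\tfrac{\gamma^{-2} + \ln(1/\delta)}{m}}\right)$.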

Simplified PAC-Bayesian Margin Bounds

The fundamental idea behind the PAC-Bayesian approach to margin bounds is that a small error rate relative to a large safety margin ensures the existence of a posterior distribution (a Gibbs classifier) with a small training error and a small KL-divergence from the prior.
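To make the quantity $l_\gamma(w,S)$ concrete, here is a minimal toy sketch (my own illustration, not code from any of the papers above) that computes the empirical margin risk of a linear classifier; normalizing $w$ to unit norm is an assumed convention, since different papers normalize $w$ and $x$ differently.

```python
import numpy as np

def margin_risk(w, X, y, gamma):
    """Empirical margin risk l_gamma(w, S): the fraction of training points
    (x_i, y_i) whose margin y_i * <w, x_i> falls below gamma.
    Labels are assumed to be in {-1, +1}; w is normalized to unit norm
    (a convention assumed here, not taken from the papers)."""
    w = w / np.linalg.norm(w)
    margins = y * (X @ w)
    return float(np.mean(margins < gamma))

# Toy usage: a linearly separable sample labelled by the true direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.ones(5)
y = np.sign(X @ w_true)
print(margin_risk(w_true, X, y, gamma=0.1))  # small gamma -> small margin risk
print(margin_risk(w_true, X, y, gamma=1.0))  # larger gamma -> larger margin risk
```

The bounds above trade off exactly this quantity against a complexity term that shrinks as $\gamma$ grows: demanding a larger margin raises $l_\gamma(w,S)$ but lowers the second term.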