Margin-based generalization bound

cosmos 19th December 2019 at 7:05pm
Generalization Rademacher complexity Structural risk minimization

aka Margin theory, margin bound

See Understanding Machine Learning by Shalev-Shwartz and Ben-David ("Shai and Shai") for SVM bounds; they are mostly based on Structural risk minimization and on bounding Rademacher complexity via norms of the parameters/inputs (a schematic statement is sketched below). But some bounds, developed by Langford and others, are based on PAC-Bayesian theory and may be tighter (see the section below).
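As a rough sketch of the Rademacher-complexity route (this is the standard statement, e.g. as in UML or Mohri et al.; the exact constants differ between sources): for a class $F$ of real-valued score functions, a margin $\gamma > 0$, and a sample $S$ of size $m$, with probability at least $1-\delta$, for every $f \in F$

$$\Pr[y f(x) \le 0] \;\le\; \hat{R}_\gamma(f) + \frac{2}{\gamma}\,\mathfrak{R}_m(F) + \sqrt{\frac{\log(1/\delta)}{2m}},$$

where $\hat{R}_\gamma(f)$ is the fraction of training points with margin $y f(x) < \gamma$ and $\mathfrak{R}_m(F)$ is the Rademacher complexity of $F$. For linear/SVM classes one then bounds $\mathfrak{R}_m(F)$ in terms of norms, e.g. $\mathfrak{R}_m(F) \le BR/\sqrt{m}$ when $\|w\| \le B$ and $\|x\| \le R$, which is what makes the bound scale-sensitive.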

Theory of classification: A survey of some recent advances.

section from UML

Video about margin theory

Estimation of Dependences Based on Empirical Data
A training algorithm for optimal margin classifiers
Support-vector networks
The Nature of Statistical Learning Theory

The algorithmic idea of large margin classifiers was introduced in the linear case by Vapnik (1982) (see also (Boser et al., 1992; Cortes and Vapnik, 1995)). Vapnik (1995) gave an intuitive explanation of the performance of these methods based on a sample-dependent VC-dimension calculation, but without generalization bounds. The first rigorous generalization bounds for large margin linear classifiers (Shawe-Taylor et al., 1998) required a scale-sensitive complexity analysis of real-valued function classes. At the same time, a large margin analysis was developed for two-layer networks (Bartlett, 1996), indeed with a proof technique that inspired the layer-wise induction used to prove Theorem 1.1 in the present work. Margin theory was quickly extended to many other settings (see for instance the survey by Boucheron et al. (2005)), one major success being an explanation of the generalization ability of boosting methods, which exhibit an explicit growth in the size of the function class over time, but a stable excess risk (Schapire et al., 1997).

Structural risk minimization over data-dependent hierarchies
For valid generalization the size of the weights is more important than the size of the network
The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network

Boosting the margin: a new explanation for the effectiveness of voting methods

Paper with applications of Rademacher complexity to classes of margin classifiers: http://www.jmlr.org/papers/volume5/luxburg04b/luxburg04b.pdf

PAC-Bayesian approach to margin bounds

Bounds for Averaging Classifiers

PAC-Bayes & Margins (Langford and Shawe-Taylor). In this paper they show that if there exists a weight vector $w$ with margin risk $l_\gamma(w,S)$ for margin $\gamma$ and training set $S$ (i.e. the fraction of training samples with margin smaller than $\gamma$), then we can choose a certain posterior distribution $Q$, depending on $w$, to plug into the general PAC-Bayesian theorem. For this posterior we can compute the relative entropy $\mathrm{KL}(Q\|P)$, and bound the empirical risk of the Gibbs classifier $Q$ by the margin risk of $w$ plus a term which decreases with increasing margin. The idea behind this second term is that $Q$ is a distribution supported on vectors $w'$ lying in a half-space determined by $w$. The larger the margin, the larger the fraction of these vectors that classify the training sample correctly, and so the smaller the bound on the empirical Gibbs risk.
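Schematically (this is one common form of the PAC-Bayes theorem; the corollary in the paper has different constants and log factors): with probability at least $1-\delta$ over $S \sim D^m$, for every posterior $Q$

$$\mathrm{kl}\big(\hat{e}_S(Q)\,\big\|\,e_D(Q)\big) \;\le\; \frac{\mathrm{KL}(Q\|P) + \ln\frac{m+1}{\delta}}{m},$$

where $\hat{e}_S(Q)$ and $e_D(Q)$ are the empirical and true error of the Gibbs classifier. Choosing $Q$ concentrated around a unit vector $w$ that achieves margin $\gamma$ (with $\|x\| \le 1$) makes $\mathrm{KL}(Q\|P)$ scale roughly like $1/\gamma^2$, and $\hat{e}_S(Q)$ at most about $l_\gamma(w,S)$ plus a small correction, so the resulting true-error bound is of order $l_\gamma(w,S) + \tilde{O}\!\left(\sqrt{\tfrac{\gamma^{-2} + \ln(1/\delta)}{m}}\right)$.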

Simplified PAC-Bayesian Margin Bounds

The fundamental idea behind the PAC-Bayesian approach to margin bounds is that a small error rate relative to a large safety margin ensures the existence of a posterior distribution (a Gibbs classifier) with a small training error and a small KL-divergence from the prior.
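To make the quantity $l_\gamma(w,S)$ concrete, here is a minimal toy sketch (my own illustration, not code from any of the papers above) that computes the empirical margin risk of a linear classifier; normalizing $w$ to unit norm is an assumed convention, since different papers normalize $w$ and $x$ differently.

```python
import numpy as np

def margin_risk(w, X, y, gamma):
    """Empirical margin risk l_gamma(w, S): the fraction of training points
    (x_i, y_i) whose margin y_i * <w, x_i> falls below gamma.
    Labels are assumed to be in {-1, +1}; w is normalized to unit norm
    (a convention assumed here, not taken from the papers)."""
    w = w / np.linalg.norm(w)
    margins = y * (X @ w)
    return float(np.mean(margins < gamma))

# Toy usage: a linearly separable sample labelled by the true direction.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.ones(5)
y = np.sign(X @ w_true)
print(margin_risk(w_true, X, y, gamma=0.1))  # small gamma -> small margin risk
print(margin_risk(w_true, X, y, gamma=1.0))  # larger gamma -> larger margin risk
```

The bounds above trade off exactly this quantity against a complexity term that shrinks as $\gamma$ grows: demanding a larger margin raises $l_\gamma(w,S)$ but lowers the second term.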