Structural risk minimization

Model selection, Probably approximately correct

aka SRM

Structural risk minimization is an extension of Probably approximately correct learning, where

  1. one decomposes the Hypothesis class into a nested family of hypothesis classes $\mathcal{H} = \cup_i \mathcal{H}_i$, where $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq ...$ (although I think it can be extended easily to the case where the subclasses are not nested)
  2. one applies the weighted version of the Union bound to get a bound (on the Generalization error) that is universal over the classes, rather than uniform. Meaning, this is a bound that holds with high probability simultaneously for functions in all the classes, but the value of the bound is class-dependent (a minimal version is sketched just below).
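
As a minimal sketch of the weighted union bound: take weights $w_i \geq 0$ with $\sum_i w_i \leq 1$ over the (countably many) classes. For finite classes and zero-one loss, Hoeffding's inequality plus the union bound (applied within each class at confidence $w_i \delta$) give that, with probability at least $1-\delta$, simultaneously for every $i$ and every $h \in \mathcal{H}_i$,

$$L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{\ln|\mathcal{H}_i| + \ln(1/(w_i \delta))}{2m}},$$

so the penalty term depends on which class the hypothesis sits in (via $|\mathcal{H}_i|$ and $w_i$), but the statement holds for all classes at once.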

I've sometimes in this wiki used SRM to refer to the weighted Union bound, abusing nomenclature a bit.

One typically combines this with an algorithm that finds a hypothesis with the best bound on its error, which can mean trading off the training error against the complexity of the class to which the hypothesis belongs.
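
A toy sketch of this selection rule, for finite classes with a Hoeffding-style penalty (the function and variable names here are mine, not from any of the references):

```python
import numpy as np

def srm_select(classes, weights, sample, delta):
    """Toy SRM for finite hypothesis classes.

    classes: list of hypothesis classes, each a list of callables h(x) -> label
    weights: prior weights w_i with sum(weights) <= 1
    sample:  list of (x, y) pairs
    Returns the hypothesis minimising (training error + class penalty),
    with the Hoeffding-style penalty sqrt((ln|H_i| + ln(1/(w_i*delta))) / (2m)).
    """
    m = len(sample)
    best_h, best_bound = None, np.inf
    for H, w in zip(classes, weights):
        penalty = np.sqrt((np.log(len(H)) + np.log(1.0 / (w * delta))) / (2 * m))
        for h in H:
            train_err = np.mean([h(x) != y for x, y in sample])
            bound = train_err + penalty
            if bound < best_bound:
                best_h, best_bound = h, bound
    return best_h, best_bound
```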

See here: A framework for structural risk minimization – chapter 7 in Understanding Machine Learning by Shai Shalev-Shwartz and Shai Ben-David.

One can have agnostic or realizable versions of SRM bounds.

Training error-dependent VC dimension-based SRM generalization error bound

See here and here. See more at Growth function and Agnostic learning, and here. I wonder how this bound quantitatively compares with standard training-error-dependent bounds! At the very least, it allows putting a prior over training errors!
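
Schematically (constants vary between references, so this is only meant to show the shape such bounds take), combining the weighted union bound with a relative-deviation VC bound for each class gives, with probability at least $1-\delta$, for all $i$ and all $h \in \mathcal{H}_i$ with $\mathrm{VCdim}(\mathcal{H}_i) = d_i$,

$$L_{\mathcal{D}}(h) \le L_S(h) + O\!\left(\sqrt{\frac{L_S(h)\left(d_i \ln m + \ln\frac{1}{w_i \delta}\right)}{m}} + \frac{d_i \ln m + \ln\frac{1}{w_i \delta}}{m}\right),$$

so the deviation term shrinks with the training error, and the fast $1/m$ rate is recovered when $L_S(h) = 0$.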

The "unlukiness" framework

Structural Risk Minimization Over Data-Dependent Hierarchies: "We address a shortcoming of the SRM method which Vapnik [48, p. 161] highlights: according to the SRM principle the structure has to be defined a priori before the training data appear."

One can ask whether one can obtain better generalization error bounds by making the hierarchy of classes data-dependent.

See more at Introduction to supervised learning theory. SRM gives a bound that is hypothesis-dependent, but not explicitly data-dependent! One also has approaches like Rademacher complexity-based bounds, or Growth function-based bounds, which are data-dependent but not hypothesis-subclass- or hypothesis-dependent. One can combine Rademacher complexity with SRM to obtain hypothesis- and data-dependent bounds, and this is what's done to obtain Margin-based generalization bounds (applied e.g. to Support vector machines). In fact, there are ways to obtain margin bounds using PAC-Bayes too, which is a general (and probably the tightest in most cases) method to obtain hypothesis- and data-dependent bounds!
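
For reference, one standard form of the Rademacher-complexity bound (again, constants differ between references): with probability at least $1-\delta$, for all $h \in \mathcal{H}$,

$$L_{\mathcal{D}}(h) \le L_S(h) + 2\,\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\ln(1/\delta)}{2m}},$$

where $\mathfrak{R}_m(\mathcal{H})$ is the Rademacher complexity of (the loss class induced by) $\mathcal{H}$. Replacing $\mathfrak{R}_m$ by its empirical version makes the bound data-dependent, and applying the weighted union bound over a hierarchy of classes makes it class-dependent as well, which is the combination referred to above.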

Here, Shawe-Taylor et al. propose the "unluckiness" framework as a general framework for hypothesis- and data-dependent bounds. In fact, they later applied this to Bayesian-like algorithms, where they showed that the Bayesian evidence works as a luckiness function with the right properties to obtain generalization bounds.

Apparently this was in fact a precursor of McAllester's PAC-Bayes.

they show that the function which measures the VC dimension of the set of hypotheses on the sample points is a valid (un)luckiness function. This leads to a bound on the generalization performance in terms of this measured dimension rather than the “worst case” bound which involves the VC dimension of the set of hypotheses over the whole input space.

Our approach can be interpreted as a general way of encoding our bias, or prior assumptions, and possibly taking advantage of them if they happen to be correct. In the case of the fixed hierarchy, we expect the target (or a close approximation to it) to be in a class with small . In the maximal separation case, we expect the target to be consistent with some classifying hyperplane that has a large separation from the examples. This corresponds to a collusion between the probability distribution and the target concept, which would be impossible to exploit in the standard PAC distribution-independent framework. If these assumptions happen to be correct for the training data, we can be confident we have an accurate hypothesis from a small data set (at the expense of some small penalty if they are incorrect)

This is as general as the general (KL divergence) version of PAC-Bayes, which allows the posterior, and thus the bound, to depend on the input distribution. Actually, the simpler Bayesian PAC-Bayes also implicitly depends on the input distribution. But in a less general way?

The Bayesian PAC-Bayes can't encode priors over the pair (function, data distribution), while the general one (and margin bounds also) implicitly can! Interesting!
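
For reference, one version of the general PAC-Bayes bound (the Seeger/Maurer kl form; the exact logarithmic term varies between statements): for any prior $P$ fixed before seeing the data, with probability at least $1-\delta$ over the sample, simultaneously for all posteriors $Q$,

$$\mathrm{kl}\big(L_S(Q) \,\|\, L_{\mathcal{D}}(Q)\big) \le \frac{\mathrm{KL}(Q\|P) + \ln(2\sqrt{m}/\delta)}{m},$$

where $L_S(Q)$ and $L_{\mathcal{D}}(Q)$ are the Gibbs (i.e. $Q$-averaged) empirical and true errors, and $\mathrm{kl}$ is the KL divergence between Bernoulli distributions.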


Some intuition for Agnostic structural risk minimization: I can either get really close to the best among a small set of experts, but they all do pretty badly anyway; or I can do somewhat worse than the best among a bigger set of experts, but since that best does significantly better, I may end up doing better overall. Navigating this tradeoff is what structural risk minimization does.
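
With made-up numbers: if the best of 10 experts has error 0.30 and the estimation penalty for that class is 0.02, the guarantee is 0.32; if the best of $10^6$ experts has error 0.10 and the penalty grows to 0.08, the guarantee is 0.18, so the bigger class wins despite its larger penalty (and with fewer data, the comparison could flip).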

Assuming realizability, SRM would just choose the smallest class containing a hypothesis with zero training error, as (with penalties increasing along the hierarchy) that would give the best bound on the generalization error!


What about the algorithm: Check the size of the training set, $m$, and see what is the largest $\mathcal{H}$ in the collection such that $m$ guarantees a generalization gap smaller than $\epsilon$. Then just do ERM with that $\mathcal{H}$? Would that work for Nonuniform learnability?
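
A toy sketch of that scheme for finite classes (the names and the Hoeffding-style rate are my own choices, not from the references):

```python
import math

def largest_class_then_erm(classes, sample, epsilon, delta):
    """Pick the largest class whose uniform-convergence rate at this sample
    size is below epsilon, then run plain ERM inside it.

    classes: nested list of finite hypothesis classes H_1 <= H_2 <= ...
    sample:  list of (x, y) pairs
    Uses the finite-class Hoeffding rate sqrt((ln|H| + ln(2/delta)) / (2m)).
    """
    m = len(sample)

    def rate(H):
        return math.sqrt((math.log(len(H)) + math.log(2 / delta)) / (2 * m))

    # scan from the largest class down and stop at the first one that fits
    chosen = classes[0]
    for H in reversed(classes):
        if rate(H) <= epsilon:
            chosen = H
            break

    def train_error(h):
        return sum(h(x) != y for x, y in sample) / m

    return min(chosen, key=train_error)  # ERM within the chosen class
```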