Introduction to supervised learning theory


In Supervised learning, we want to convert a set of example "input-output" pairs into a procedure for predicting the outputs for new inputs (see Supervised learning for examples).

~Get ready for a deluge of jargon~

Input and output sets. To be precise, we define two Sets: the domain (or input space, or set of instances) $\mathcal{X}$, and the range (or output space, or co-domain; or set of labels, or classes, if the space is finite) $\mathcal{Y}$.

Informal intro to the supervised learning problem (plus some definitions)

What we are given in an instantiation of a supervised learning problem is a training set (which is actually a Tuple) $S = ((x_1,y_1),\ldots,(x_m,y_m))$, consisting of $m$ pairs $(x_i,y_i)$, $i=1,\ldots,m$, where for each $i$, $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$ ($\in$ means "belongs to the set"). These pairs are called examples (or sometimes example pairs; also sometimes "data points", or "data"; this is the "data" everyone talks about in ML).

What we want in a supervised learning problem is to find a procedure for converting a training set $S$ into a predictor $f: \mathcal{X} \to \mathcal{Y}$ (a Function from the input space to the output space), which is "good". A good predictor has guarantees about its generalization to new examples, which we'll explain below (in the section "The metric for assessing quality of predictors").
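
To make the objects concrete, here is a minimal sketch (my own, not from these notes; the choice of spaces and of predictor is arbitrary) of a training set and a predictor in Python:

```python
# A minimal sketch: a training set S as a sequence of (input, output) pairs,
# and a predictor as an ordinary function from X to Y. The choice X = float,
# Y = {0, 1} is just for illustration.
from typing import Callable, Sequence, Tuple

Example = Tuple[float, int]            # an (x, y) pair with x in X, y in Y
Predictor = Callable[[float], int]     # a function f: X -> Y

S: Sequence[Example] = ((0.1, 0), (0.9, 1), (0.4, 0))   # a training set with m = 3

def threshold_predictor(x: float) -> int:
    # one particular predictor f: it predicts label 1 whenever x > 0.5
    return 1 if x > 0.5 else 0
```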

The assumption of the data distribution

The point now is that to have any hope of predicting, we need to assume that new example pairs $(x,y)$, which we use to assess the quality of the predictor, are somehow related in a similar way to the examples in the training set, at least with high probability. The standard way to ensure this is to assume that examples in the training set and test set are independently and identically distributed (if one considers examples which are distributed in more general ways, one has to start looking at other areas, like Reinforcement learning). So for supervised learning we assume

$(x_i,y_i) \sim \mathcal{D}$ i.i.d. for all $i=1,\ldots,m$, for some data distribution $\mathcal{D}$ ($\sim$ means that the left hand side is a Random variable, and the right hand side is the Probability distribution it follows). We have a notation for saying that the $m$ examples are drawn i.i.d., which is $S=((x_1,y_1),\ldots,(x_m,y_m)) \sim \mathcal{D}^m$.

We also assume that new examples on which we test a predictor (sometimes called the test set) are also statistically distributed according to the same distribution, $(x,y) \sim \mathcal{D}$.
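
As a toy illustration of $S \sim \mathcal{D}^m$ (my own example; the particular $\mathcal{D}$, with uniform inputs and 10% label noise, is an arbitrary choice):

```python
# Sampling a training set S ~ D^m from a toy data distribution D:
# x ~ Uniform[0, 1], y = 1[x > 0.5], with the label flipped with probability 0.1.
import numpy as np

rng = np.random.default_rng(0)

def sample_example(rng):
    x = rng.uniform(0.0, 1.0)
    y = int(x > 0.5)
    if rng.uniform() < 0.1:    # label noise makes D genuinely stochastic
        y = 1 - y
    return x, y

def sample_training_set(m, rng):
    # m independent draws from the same D, i.e. S ~ D^m
    return [sample_example(rng) for _ in range(m)]

S = sample_training_set(m=20, rng=rng)
```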

The metric for assessing quality of predictors

We are now ready to define what we formally mean by a "good" predictor. To do this, we define a measure of how bad the predictor is, called the Loss function. The loss function is a function which maps example pairs and a predictor to a Real number,

$$\begin{aligned}l : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} &\to \mathbb{R}\\ (f,x,y) &\mapsto l(f,x,y)\end{aligned}$$

where $\mathcal{F}$ is just the set of all possible predictors which we are considering, called the hypothesis space. (see Function for an explanation of this notation; $A \times B$ is the Cartesian product of $A$ and $B$)

The interpretation is that a large value $l(f,x,y)$ means that the predictor $f$ does badly at predicting the output $y$ for a particular input $x$.

But we don't care about just one particular input, so we average (take the Expectation) over the data distribution $\mathcal{D}$ (which test data points are assumed to follow). This is called the risk (or true risk, or generalization error, or true loss, true error, or sometimes even simply error),

$$L_\mathcal{D}(f) := \mathbf{E}_{(x,y) \sim \mathcal{D}} \left[l(f,x,y)\right]$$

$\mathbf{E}_{(x,y) \sim \mathcal{D}}$ just means the Expectation under sampling examples. We are taking the expectation of the loss function for a fixed predictor, so the true risk is a function of (depends on) both the data distribution $\mathcal{D}$ and the predictor $f$.

A common loss function is the 0-1 loss (whose expectation is the misclassification probability, i.e. one minus the classification accuracy), defined when the output space is finite. It is simply $l(f,x,y) = \mathbf{1}_{f(x)\neq y}$, where $\mathbf{1}_{A}$ is the Indicator function, which is equal to $1$ when the condition $A$ is satisfied, and to $0$ if it isn't.
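
A small sketch of the 0-1 loss and a Monte Carlo estimate of the risk $L_\mathcal{D}(f)$ (my own code, reusing the toy $\mathcal{D}$ and the threshold predictor from the sketches above):

```python
# The 0-1 loss, and an estimate of the true risk L_D(f) = E_{(x,y)~D}[l(f,x,y)]
# obtained by averaging the loss over many fresh examples drawn from D.
def zero_one_loss(f, x, y):
    return 1.0 if f(x) != y else 0.0

def estimate_risk(f, n_test, rng):
    losses = [zero_one_loss(f, *sample_example(rng)) for _ in range(n_test)]
    return sum(losses) / n_test

print(estimate_risk(threshold_predictor, n_test=100_000, rng=rng))  # ~0.1, the label-noise level
```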

Formal definition of a learning algorithm $\mathcal{A}$

We can almost now finally define a learning problem. Well, first let us define one last thing. A learning algorithm $\mathcal{A}$ is defined as a function mapping training sets to predictors. Formally,

$$\mathcal{A}: (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{F}$$

The $^*$ is the Kleene star: $A^*$ means the set of all finite Tuples of elements of $A$. Here we use it to encode "the set of all training sets of any size" in a concise way. Using this notation, the predictor which the learning algorithm produces for a training set $S$ is denoted $\mathcal{A}(S)$.
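
As an illustration of a (deterministic) learning algorithm $\mathcal{A}: (\mathcal{X}\times\mathcal{Y})^* \to \mathcal{F}$, here is a toy sketch (mine) that does empirical risk minimization over a small finite hypothesis class of threshold classifiers, reusing the definitions from the sketches above:

```python
# A toy learning algorithm: it maps a training set of any size to the predictor in a
# finite hypothesis class F that has the smallest training error.
def make_threshold(t):
    return lambda x: 1 if x > t else 0

HYPOTHESIS_CLASS = [make_threshold(t) for t in np.linspace(0.0, 1.0, 21)]   # a finite F

def erm_algorithm(S):
    def training_error(f):
        return sum(zero_one_loss(f, x, y) for x, y in S) / len(S)
    return min(HYPOTHESIS_CLASS, key=training_error)

f_hat = erm_algorithm(S)   # A(S), the predictor the algorithm outputs for this training set
```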

Formal definition of a supervised learning problem

We can now finally define a learning problem.

A supervised learning problem is: Given some properties of the data distribution $\mathcal{D}$, and a loss function $l$, find a learning algorithm $\mathcal{A}$, such that

it is highly probable that upon sampling a training set $S$, the predictor $\mathcal{A}(S)$ which the learning algorithm produces has low risk $L_\mathcal{D}(\mathcal{A}(S))$.

In pseudo-math, $\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right)\text{ is low}\right]\text{ is high}$, or $\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right)\text{ is high}\right]\text{ is low}$. lol (here $\text{Prob}_{S\sim\mathcal{D}^m}$ is the probability of the event in the brackets when sampling $S$ according to its distribution $\mathcal{D}^m$).

Being less pseudo... and more rigorous: we can define "$L_{\mathcal{D}}\left(\mathcal{A}(S)\right)\text{ is low}$" as "$L_{\mathcal{D}}\left(\mathcal{A}(S)\right)\text{ is at most }\epsilon$" for some $\epsilon > 0$; and "$\text{Prob}[\text{blah}]\text{ is high}$" as "$\text{Prob}[\text{blah}]\text{ is at least }1 - \delta$", for some $\delta > 0$.

– So we want to find a learning algorithm $\mathcal{A}$ such that $\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \leq \epsilon \right] \geq 1-\delta$, for any $\epsilon$ and $\delta$ within some range of interest.
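
One can check such a statement empirically on the toy example above (my own sketch): sample many training sets $S\sim\mathcal{D}^m$, run the algorithm on each, estimate each resulting risk, and look at the fraction of runs with risk at most $\epsilon$ (the same samples also estimate the expected generalization error discussed below):

```python
# Estimate Prob_{S ~ D^m}[ L_D(A(S)) <= eps ] for the toy D and ERM algorithm above.
def risks_over_training_sets(algorithm, m, n_trials, rng):
    risks = []
    for _ in range(n_trials):
        S_trial = sample_training_set(m, rng)
        f = algorithm(S_trial)
        risks.append(estimate_risk(f, n_test=5_000, rng=rng))
    return np.array(risks)

risks = risks_over_training_sets(erm_algorithm, m=50, n_trials=200, rng=rng)
eps = 0.15
print("fraction of training sets with L_D(A(S)) <= eps:", np.mean(risks <= eps))
print("mean over training sets (expected generalization error):", risks.mean())
```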

Supervised learning theory is all about what we can say about the distribution of $L_{\mathcal{D}}\left(\mathcal{A}(S)\right)$ under $S\sim\mathcal{D}^m$, under different assumptions on $\mathcal{D}$, $\mathcal{A}$, and $l$.

Above, we talked about bounding $\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \leq \cdot\right]$, which corresponds to bounding the cumulative distribution function (because of the $\leq$) (bounds of this type are also called "bounds in probability", or "high-probability bounds"). However, the most information we can hope to get is the exact distribution of $L_{\mathcal{D}}\left(\mathcal{A}(S)\right)$ under $S\sim\mathcal{D}^m$. With that, we could say, for any level of badness, what the probability is that our learning algorithm performs that badly under this data distribution, which would be very useful!

Another quantity which is useful is the mean of this distribution, that is, $\mathcal{E}_\mathcal{D}(\mathcal{A}) := \mathbf{E}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right)\right]$. We will call this the expected generalization error (or expected error, or expected risk). This is now a single quantity that depends only on the data distribution and the algorithm. This is the quantity that we study when calculating Learning curves. A related quantity comes about when we assume some probability distribution $P$ over data distributions (making Bayesians happy ;) ). This is meant to encode the prior probability of obtaining a particular learning task (which corresponds to a particular data distribution). Perhaps we believe that learning tasks which give all images of dogs the same label are more likely than those that don't. Or perhaps we believe that it's more likely that the data is composed of images of actual things, rather than images of random noise. All of these, and more abstract mathematical assumptions, can be encoded in the prior $P$. If we have such a prior, then we can average the expected error, to obtain the average expected error $\hat{\mathcal{E}}_P(\mathcal{A}) := \mathbf{E}_{\mathcal{D}\sim P}\left[\mathbf{E}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right)\right]\right]$ (which we will sometimes refer to as the average error for short, at the risk of confusing it with the expected error defined above).

Stochastic predictors and stochastic learning algorithms

We have assumed that a predictor is a function from input to output space, $f: \mathcal{X} \to \mathcal{Y}$. However, one can easily generalize all the discussion above to stochastic predictors, which are distributions over the output space, dependent on the input, $f: \mathcal{X} \to \mathcal{P}(\mathcal{Y})$, where I used $\mathcal{P}(\mathcal{Y})$ to just mean the "set of probability distributions over $\mathcal{Y}$". One can interpret this $f$ as the probability of a particular output given (or conditioned on) an input, $P(y|x)$.

In a precisely analogous way, we can generalize our concept of learning algorithm to include stochastic learning algorithms which, given a training set $S$, produce a probability distribution over predictors $Q(f|S)$.
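
A minimal sketch (mine; the sigmoid form is an arbitrary choice) of a stochastic predictor, which returns a distribution over $\mathcal{Y}=\{0,1\}$ rather than a single label:

```python
# A stochastic predictor f: X -> P(Y), returning a distribution over {0, 1}.
def stochastic_predictor(x):
    p1 = 1.0 / (1.0 + np.exp(-10.0 * (x - 0.5)))   # interpreted as P(y = 1 | x)
    return {0: 1.0 - p1, 1: p1}

def sample_label(f, x, rng):
    # draw a label from the predictor's output distribution
    return int(rng.uniform() < f(x)[1])

# A stochastic learning algorithm would analogously output a distribution Q(f|S) over
# predictors, e.g. by returning a freshly sampled predictor each time it is queried.
```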

From now on, any statement we make about predictors and learning algorithms will typically apply to their stochastic versions, unless we specify they have to be deterministic.

Supervised learning theory

No free lunch theorem: Average error with "no assumptions" on data distribution $\mathcal{D}$

What if we make no assumptions about $\mathcal{D}$? Well, we can't say much then. There are several No free lunch theorems formalizing this.

One of them, by Wolpert 1995, says:

No free lunch theorem The average expected error $\hat{\mathcal{E}}_P(\mathcal{A})$, when the prior over data distributions $P$ is uniform (assuming a uniform distribution can be defined for the choice of input and output spaces), is independent of the learning algorithm $\mathcal{A}$.

In particular, this most often implies that any algorithm will perform badly on average. For instance, this theorem is typically applied in the context of classification, where the risk is defined to be the probability of misclassification. The theorem then implies that any algorithm will have an expected risk, averaged over tasks, which is the same as that of the dumbest algorithm, "guess randomly", which predicts rather poorly (it has a probability of success of $0.5$ for binary classification, for instance).

Btw, there are no free lunch theorems for optimization too, which say that all optimization algorithms are equally bad on average.

Fortunately, we often have some prior knowledge that constrains the data distributions we care about. But before that, let us see if we can get some cheap snack, even if we can't get a free lunch.

Aside [Bayesian vs frequentist]: Note that we put "no assumptions" in quotes in the title. This hints at a subtle issue, related to frequentist vs Bayesian approaches to probability, that I should discuss more in depth some other day. In Bayesian statistics, "no assumption" typically means a prior which is a "uniform distribution". However, in frequentist statistics (which is the more common approach in learning theory), "no assumption" means "no assumption" in a more literal sense (closer to the everyday usage), and in terms of what one actually computes, it implies that one looks at the worst case, so that one can say things like "for any task whatsoever that satisfies such and such constraints, {some statement about my algorithm being good} holds". Note the lack of a prior distribution over tasks in that statement!

Agnostic learning: No assumptions on data distribution $\mathcal{D}$ (but assumptions on learning algorithm $\mathcal{A}$)

It turns out that even without making any assumptions on the data distribution, we can say many things. This is studied in Agnostic learning. In particular, it turns out that we can give a bound on the generalization error which depends on the sample $S$ we obtain, and holds with high probability, at least $1-\delta$ for $\delta>0$. That is, for an algorithm $\mathcal{A}$, we can find a function $\epsilon_\mathcal{A}(S,\delta)$ such that, for any distribution $\mathcal{D}$ whatsoever, and for any $\delta>0$ (perhaps with $\delta$ required to be smaller than some threshold for the statement to hold),

$$\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \leq \epsilon_\mathcal{A}(S,\delta) \right] \geq 1-\delta$$

[Call this equation $\star$]

and the awesome thing is that now $\epsilon_\mathcal{A}(S,\delta)$ actually does depend on the algorithm $\mathcal{A}$. What this means is that even though all algorithms perform the same on average, some algorithms will perform very well on a few tasks and pretty badly on most tasks, while other algorithms, like random guessing, will perform equally badly on all tasks. It is quite useful to distinguish these two types of algorithms. And what the above expression says is that if we find $\epsilon_\mathcal{A}(S,\delta)$ for some non-trivial algorithm, we can see from the data $S$ we have whether this is (with high probability) one of the instances in which the algorithm performs well ($\epsilon_\mathcal{A}(S,\delta)$ low) or badly ($\epsilon_\mathcal{A}(S,\delta)$ high), which is quite a useful thing to be able to do!

As an illustrative example, one common function $\epsilon_\mathcal{A}(S,\delta)$ that people (in Statistics) are familiar with is Cross-validation! It gives a function of the data, the algorithm, and the confidence parameter ($\delta$) that allows us to be confident about whether our algorithm will generalize well or badly. Indeed cross-validation (CV) can be analyzed within this whole formalism! (which is something I wanna look at actually!). On the other hand, the approach here is more general, and we will see examples of $\epsilon_\mathcal{A}(S,\delta)$ which are much easier to calculate than cross-validation (for instance, depending only on the training error (see below) and on some simple properties of the learning algorithm, which are easier to compute than the raw cross-validation score!). In some cases, the bounds on the error given by other methods may be better than those given by CV (not sure about this; would be nice to check!)

Aside [Bayesian vs frequentist again]: There is a curious subtlety, which is that if one tries to look instead at a function $\epsilon'_\mathcal{A}(S,\delta)$ such that, for some distribution $P$ over $\mathcal{D}$s,

$$\text{Prob}_{\mathcal{D} \sim P,\, S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \leq \epsilon'_\mathcal{A}(S,\delta) \,\middle|\, S \right] \geq 1-\delta$$
the function one obtains is very different, and depends on $P$. What this probability means is the following. Before, we were looking at the probability, under some fixed but arbitrary $\mathcal{D}$, of sampling $S$ such that the bound on the error holds. Now we are looking at a probability conditioned on a given sample $S$. This only makes sense because we have assumed a prior $P$ over tasks $\mathcal{D}$, so that via Bayes' theorem, there is a distribution $P(\mathcal{D}|S)$, and it is over this distribution that we are asking the bound to hold with high probability. This really gets at the heart of the difference between Bayesian and frequentist statistics! For $P$ uniform (which has the interpretation of "no assumption" in the Bayesian approach), the bound one obtains is much worse (higher) than $\epsilon_\mathcal{A}(S,\delta)$! To me, Agnostic learning is one of the most successful examples of the frequentist approach, suggesting that the frequentist approach is better at encoding the notion of "no assumption on an uncertain quantity". The Bayesian approach, however, is typically better at encoding most other things (namely non-empty sets of assumptions). Although one can combine both approaches, as we will see when we look at PAC-Bayesian learning!

The function $\epsilon_\mathcal{A}(S,\delta)$ is almost always expanded into two terms

$$\epsilon_\mathcal{A}(S,\delta) = \hat{\epsilon}_\mathcal{A}(S) + \frac{\mathcal{C}(\mathcal{A}, S, \delta)}{m^\alpha}$$

where $1/2 \leq \alpha \leq 1$, and the first term is the empirical risk (aka empirical error, empirical loss, training error, training loss), defined as:

$$\hat{\epsilon}_\mathcal{A}(S) := \frac{1}{m}\sum\limits_{i=1}^m l(\mathcal{A}(S), x_i,y_i)$$

The numerator of the second term, $\mathcal{C}(\mathcal{A}, S, \delta)$, could be any function of its arguments. Since $m$ is determined by $S$, the denominator $m^\alpha$ is technically redundant, but we typically write it, because then $\mathcal{C}(\mathcal{A}, S, \delta)$ has the interpretation of a capacity (closely related to the sample complexity) [we will refer to it using that name]. This is a measure of how many training points ($m$) we need to make the second term small.

If the second term is small, we are then saying that the generalization error is basically bounded by the training error: $L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \lesssim \hat{\epsilon}_\mathcal{A}(S)$ (w.h.p.). So, in this case, if we get a small training error, we also get a small generalization error!

Doing well compared to the best in the hypothesis class

There is another popular expansion for $\epsilon_\mathcal{A}(S,\delta)$, which is

$$\epsilon_\mathcal{A}(S,\delta) = \min_{f\in \mathcal{F}} L_\mathcal{D}(f) + \frac{\tilde{\mathcal{C}}(\mathcal{A}, S, \delta)}{m^\alpha}$$

where $\min_{f\in \mathcal{F}} L_\mathcal{D}(f)$ is just the minimum true error achievable by any predictor in the hypothesis class.

If one proves bounds using the expansion in terms of empirical error which hold uniformly for all functions in the hypothesis class, and one considers algorithms that minimize the empirical error (said to perform Empirical risk minimization, ERM for short), then by applying that bound twice one obtains a bound of this second form, where $\tilde{\mathcal{C}}$ is the same as $\mathcal{C}$ up to a small constant (a factor of $2$, basically). The argument: if we know that the true and empirical errors are close for all functions in the class, then they are close both for the function that minimizes the empirical error and for the function that minimizes the true error. The true error of the former therefore can't be much higher than that of the latter (note it must be at least as high, by definition of the latter as a minimizer of the true error): if it exceeded it by more than $2\mathcal{C}/m^\alpha$, then (with high probability) the empirical error of the latter would be lower than the empirical error of the former, contradicting the fact that the former minimizes the empirical error.
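
In symbols (my own rendering of this standard argument), with $f^* := \arg\min_{f\in\mathcal{F}} L_\mathcal{D}(f)$, $\hat{f} := \mathcal{A}(S)$ the ERM output, and $\hat{\epsilon}(f)$ the empirical error of $f$ on $S$, the uniform two-sided bound $|L_\mathcal{D}(f) - \hat{\epsilon}(f)| \leq \mathcal{C}/m^\alpha$ for all $f\in\mathcal{F}$ gives, with high probability:

$$\begin{aligned} L_\mathcal{D}(\hat{f}) &\leq \hat{\epsilon}(\hat{f}) + \frac{\mathcal{C}}{m^\alpha} &&\text{(uniform bound applied to } \hat{f}\text{)}\\ &\leq \hat{\epsilon}(f^*) + \frac{\mathcal{C}}{m^\alpha} &&\text{(}\hat{f}\text{ minimizes the empirical error)}\\ &\leq L_\mathcal{D}(f^*) + \frac{2\mathcal{C}}{m^\alpha} &&\text{(uniform bound applied to } f^*\text{).} \end{aligned}$$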

Algorithms for which, for any $\delta$, the capacity measure $\tilde{\mathcal{C}}(\mathcal{A}, S, \delta)$ is bounded (or, more generally, grows slower than $m^\alpha$, i.e. is $o(m^\alpha)$; this is little-$o$ from Order notation) with high probability, are called Agnostic PAC learners. They are characterized by the fact that for any $\delta$ and for any $\epsilon > 0$, we can get enough data ($m$ sufficiently large) such that the second term is smaller than $\epsilon$ (i.e. $\epsilon_\mathcal{A}(S,\delta) - \min_{f\in \mathcal{F}} L_\mathcal{D}(f) \leq \epsilon$).

Intuitively, agnostic PAC learners are algorithms that can get arbitrarily close to the best predictor within some restricted class, if given enough data. This is assuming nothing about the data distribution (in the frequentist sense, i.e. looking at the worst case over data distributions).

Now we look at different approaches to bounding the generalization error, which vary in what they allow $\mathcal{C}$ to depend on. The more you allow $\mathcal{C}$ to depend on, the tighter the bound can be: when $\mathcal{C}$ doesn't depend on some quantity, and we make no assumptions about that quantity, the bound must hold in particular for the worst case over it, i.e. for the value of that quantity giving the largest generalization error. Then, for values of that quantity that actually give a smaller generalization error, the bound will not be very tight (it will still be a true upper bound, but not very predictive, as it will be much higher than the true value).

Capacity independent of training set

Agnostic PAC bound

A classic theory, initiated by a computer scientist called Leslie Valiant, is known as Probably approximately correct learning. Technically, it is just a name for the general {theory of computing high-probability bounds on the generalization error, i.e. equation $\star$}. However, it is also often used to refer to a certain approach to such bounds, and we will use the term in that way, to refer to that approach. The approach assumes as little as is reasonable about the learning algorithm. It only assumes something about the cardinality of the set of predictors which the algorithm may output (the hypothesis class), $|\mathcal{F}|$.

The Agnostic Occam theorem for finite hypothesis classes states:

For any $\mathcal{D}$, for any learning algorithm with hypothesis class $\mathcal{F}$, equation $\star$ holds with $\epsilon_\mathcal{A}(S,\delta) = \hat{\epsilon}_\mathcal{A}(S) + \sqrt{\frac{\ln{|\mathcal{F}|}+\ln{\frac{1}{\delta}}}{m}}$.

This bounds the difference between the true error and the training error $\hat{\epsilon}_\mathcal{A}(S)$. The second term is independent of the training data. Therefore, as far as the bound is concerned, we should minimize the training error (doing that gives the best bound of this type); namely, use empirical risk minimizers.

From this theorem, we can see that ERM algorithms with finite hypothesis classes are agnostic PAC learners! We will see next that some algorithms with infinite hypothesis classes are also agnostic PAC learners!
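
A sketch (my own) of evaluating this bound numerically for the toy ERM setup from the earlier sketches, which has a finite hypothesis class:

```python
# The finite-class agnostic bound: epsilon_A(S, delta) = training_error + sqrt((ln|F| + ln(1/delta)) / m).
import math

def occam_bound_finite_class(train_error, class_size, m, delta):
    capacity_term = math.sqrt((math.log(class_size) + math.log(1.0 / delta)) / m)
    return train_error + capacity_term

S = sample_training_set(m=200, rng=rng)
f_hat = erm_algorithm(S)
train_err = sum(zero_one_loss(f_hat, x, y) for x, y in S) / len(S)
bound = occam_bound_finite_class(train_err, class_size=len(HYPOTHESIS_CLASS), m=len(S), delta=0.05)
print(f"training error {train_err:.3f}, bound on true risk (holds w.p. >= 0.95): {bound:.3f}")
```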

Agnostic VC dimension bound (tighter version of the PAC bound)

A hypothesis class consisting of infinitely many functions may still allow an algorithm to generalize. If many of the functions are quite close to each other, we may expect that the number of functions is effectively finite! There are several quantities that mathematically express this notion. One, defined for hypothesis classes corresponding to binary classification (the output space has two elements, $\mathcal{Y}=\{0,1\}$), is the VC dimension. Given a set of functions, even if infinite, one can in principle calculate its VC dimension.

The Agnostic Occam theorem for hypothesis classes with finite VC dimension

For binary classification with the 0-1 loss function, for any $\mathcal{D}$, and for any learning algorithm with hypothesis class $\mathcal{F}$, equation $\star$ holds with $\epsilon_\mathcal{A}(S,\delta) = \hat{\epsilon}_\mathcal{A}(S) + C_1\sqrt{\frac{VC(\mathcal{F})+\ln{\frac{1}{\delta}}}{m}}$,

for some constant $C_1$ independent of the algorithm or $\mathcal{D}$. So it's basically the same as the Occam theorem for finite classes, but with the log of the cardinality of the hypothesis class replaced by the VC dimension of the class (denoted $VC(\mathcal{F})$, which is always at most $\log_2{|\mathcal{F}|}$, so the bound is as good or better, up to constants). Note also that this is only for binary classification.
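
As a quick numeric illustration (mine; it uses the standard fact that halfspaces with a bias term in $\mathbb{R}^d$ have VC dimension $d+1$, and simply sets the unspecified constant $C_1$ to $1$), the capacity term shrinks like $1/\sqrt{m}$:

```python
# The VC capacity term C_1 * sqrt((VC(F) + ln(1/delta)) / m) for a hypothetical halfspace
# classifier in dimension d = 10 (so VC(F) = d + 1), with C_1 arbitrarily set to 1.
import math

def vc_capacity_term(vc_dim, m, delta, C1=1.0):
    return C1 * math.sqrt((vc_dim + math.log(1.0 / delta)) / m)

vc_dim = 10 + 1
for m in [100, 1_000, 10_000, 100_000]:
    print(m, round(vc_capacity_term(vc_dim, m, delta=0.05), 3))
```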

This bound turns out to be optimal (up to multiplicative constants, independent of the hypothesis class, algorithm, etc.) for the set of assumptions we are considering here [well, actually, the independence from the algorithm can be relaxed, see below!]. What that means is that it has a matching lower bound. That is, one can prove the following (VC lower bound):

For binary classification, for the 0-1 loss function,

for any algorithm $\mathcal{A}$ producing predictors in $\mathcal{F}$,

there exists a data distribution $\mathcal{D}_{\text{bad}}$ such that
for any $\delta$,

$$\text{Prob}_{S\sim\mathcal{D}_{\text{bad}}^m}\left[L_{\mathcal{D}_{\text{bad}}}\left(\mathcal{A}(S)\right) \geq \hat{\epsilon}_\mathcal{A}(S) + C_2\sqrt{\frac{VC(\mathcal{F})+\ln{\frac{1}{\delta}}}{m}} \right] \geq \delta$$

for some other constant $C_2$.

Therefore no agnostic bound which is independent of the training set can be better than the VC dimension bound, except up to a constant factor. Any high-probability upper bound lower than that would say $\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \geq \text{smaller bound} \right] \leq \delta$, which in turn implies $\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \geq \hat{\epsilon}_\mathcal{A}(S) + C_2\sqrt{\frac{VC(\mathcal{F})+\ln{\frac{1}{\delta}}}{m}} \right] \leq \delta$, contradicting the theorem above in the case of $\mathcal{D}_{\text{bad}}$.

The "for any algorithm" in the VC lower bound theorem above has told us something more than we may have anticipated. It has told us that even if we assume something more about the algorithm, but still don't assume or know anything about the data when computing the bound, the VC dimension bound is still optimal!

So if you are doing binary classification, and you know the hypothesis class of your algorithm, and you assume nothing else (about the data, for instance; in the frequentist sense, i.e. worst-case), then the VC dimension bound is (up to a constant) the best guarantee you can make about the generalization error of your algorithm.
You also know that for binary classification, the algorithms that give the best bounds are ERM algorithms (those that minimize the training error). But within those (there may be many possible algorithms that minimize the training error; think of many different functions that all fit the data equally well), none gives you better worst-case VC dimension bounds than any other.

What about Multiclass classification?

All the results above are about binary classification! There is an analogous quantity to VC dimension, for the case of multiclass classification. It is called the Natarajan dimension. Similar results to the Occam theorem, and the lower bound (optimality) hold.

However, unlike in binary classification, it turns out that not all ERM algorithms satisfy the optimal bound (or are even agnostic PAC learners). For multiclass classification, only a subset of ERM algorithms give the best possible agnostic bound (again, for the case of bounds that don't depend on the training set, like we are considering in this section).

Capacity depends on the training set, but weakly on the algorithm

What if you allow the bound to depend on the training set, but you still don't want to assume anything about the algorithm, except its hypothesis class $\mathcal{F}$? Then it turns out we also have a bound that is optimal (up to multiplicative constants) for this case. As we want the bound to depend on the training set, we need something analogous to the VC dimension but which depends on the training set too, not just on the hypothesis class. That thing is the Rademacher complexity, $\mathcal{R}(S,\mathcal{F},L)$ (which actually depends on the loss function $L$ too; although one can also define the Rademacher complexity of the hypothesis class on some data without reference to a loss function. That will make more sense when I put in the definition of Rademacher complexity)

Here is a good introduction to the definition of Rademacher complexity. However, I plan to write things more in depth here in the future.

Agnostic Rademacher complexity bound

The basic version of the Rademacher complexity bound is one where, instead of having a bound which depends on the training set ($\epsilon_\mathcal{A}(S,\delta)$), it depends on the data distribution (and also on $m$), i.e. we have:

$$\text{Prob}_{S\sim\mathcal{D}^m}\left[L_{\mathcal{D}}\left(\mathcal{A}(S)\right) \leq \epsilon_\mathcal{A}(\mathcal{D},m,\delta,L) \right] \geq 1-\delta$$

where $\epsilon_\mathcal{A}(\mathcal{D},m,\delta,L)$ can be written, as before, as the training error plus a capacity function divided by $m^\alpha$, but where the capacity function now depends, instead of on the VC dimension, on the expected Rademacher complexity $\mathcal{R}(\mathcal{D},\mathcal{F},L,m) := \mathbf{E}_{S\sim \mathcal{D}^m}\left[\mathcal{R}(S,\mathcal{F},L) \right]$.
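
As a sketch of what $\mathcal{R}(S,\mathcal{F},L)$ looks like computationally (my own code; it assumes one standard definition of the empirical Rademacher complexity, $\mathcal{R}(S,\mathcal{F},L) = \mathbf{E}_\sigma\left[\sup_{f\in\mathcal{F}} \frac{1}{m}\sum_{i=1}^m \sigma_i\, l(f,x_i,y_i)\right]$ with i.i.d. uniform signs $\sigma_i \in \{-1,+1\}$, which these notes haven't stated yet, and conventions differ slightly between texts), estimated by Monte Carlo for the toy finite class from the earlier sketches:

```python
# Monte Carlo estimate of the empirical Rademacher complexity R(S, F, L) for a finite class.
def empirical_rademacher_complexity(S, hypothesis_class, loss, n_sigma=2_000, rng=None):
    rng = rng or np.random.default_rng(0)
    m = len(S)
    # loss of every hypothesis on every training example: shape |F| x m
    loss_matrix = np.array([[loss(f, x, y) for (x, y) in S] for f in hypothesis_class])
    total = 0.0
    for _ in range(n_sigma):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(loss_matrix @ sigma) / m   # sup over F of the correlation with the random signs
    return total / n_sigma

print(empirical_rademacher_complexity(S, HYPOTHESIS_CLASS, zero_one_loss, rng=rng))
```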

This is called a distribution-dependent bound. Even if we don't know everything about $\mathcal{D}$, if we know something about it (some property of it) that allows us to calculate the bound, then a distribution-dependent bound may be useful. In that case, we can predict generalization bounds before getting the training data (just like we could with the VC dimension, but here the bounds could be tighter/better because we are putting some assumptions on the data distribution, unlike for the VC dimension). That can be quite useful, because via the dependence on $m$, we can find how many data points we need to train our algorithm on to be able to guarantee a generalization error smaller than some value. Getting data can be quite expensive in the real world, so knowing this can be a critical piece of information for some company or project!

Unfortunately, knowing things about $\mathcal{D}$ that allow us to compute such a bound may be quite hard, or impossible. [We will discuss this further later for bounds that depend on the algorithm as well as the data; but I leave it as an open question: are there any nontrivial problems where we can compute the distribution-dependent Rademacher complexity bound?] Therefore, we may be more interested in a data-dependent bound that depends on $S$ rather than $\mathcal{D}$. Fortunately, there are Concentration inequalities (e.g. the Bounded differences inequality) that guarantee that, with high probability, the Rademacher complexity on a sample $S$, $\mathcal{R}(S,\mathcal{F},L)$, is close to its expected value $\mathcal{R}(\mathcal{D},\mathcal{F},L,m)$. Their difference can be bounded with high probability, and we can therefore obtain a data-dependent bound of the form $\star$.

The Agnostic data-dependent Occam theorem using Rademacher complexity

For any input and output spaces, any loss function which is bounded (has a maximum absolute value), any $\mathcal{D}$, and any learning algorithm with hypothesis class $\mathcal{F}$, equation $\star$ holds with $\epsilon_\mathcal{A}(S,\delta) = \hat{\epsilon}_\mathcal{A}(S) + 2 \mathcal{R}(S,\mathcal{F},L) + 4c\sqrt{\frac{2\ln{\frac{4}{\delta}}}{m}}$,

for some universal constant cc. See Understanding Machine Learning by Shai and Shai for the proof (until I put it here or something lol)

Note this is a bound that works not just for binary classification, but for any supervised learning task!

It also turns out that the Rademacher bound is optimal under its assumptions. Meaning that we can find a matching lower bound, by which:

there exists a data distribution, for which

there exists a learning algorithm with hypothesis class F\mathcal{F}, such that
the error is higher than {something like the above upper bound but with some different constant} with high probability.

Notice that, unlike for the VC dimension, we don't have "for all learning algorithms" in the middle, but "there exists a learning algorithm". This means that the Rademacher complexity bound is only the optimal data-dependent bound if we assume nothing about the algorithm, except that it has a certain hypothesis class! This is different from what happened for the VC dimension data-independent bounds. It means that we can expect to get tighter data-dependent bounds if we use some more information about the learning algorithm. We will explore that in the sections below.

Particular cases of Rademacher complexity

The theory of Rademacher complexity is quite rich. If one specializes to the case of binary classification, one can obtain a bound on the Rademacher complexity that depends on something called the growth function. One can in turn bound the growth function with the VC dimension using Sauer's lemma. We can therefore derive the VC dimension bound from the Rademacher bound.

One can also have bounds on the Rademacher complexity which depend on the Covering numbers and Packing numbers of the hypothesis class. One can also relate these to the VC dimension. In fact, as far as I know, the only way to derive the tightest version of the VC bound is via a trick with covering numbers known as chaining.

Capacity depends on the training set and on the learning algorithm

As we saw above, the Rademacher complexity bound could potentially be improved if we make the bound depend, not just on the data $S$ and the hypothesis class $\mathcal{F}$, but also on further properties of the algorithm $\mathcal{A}$. Here we explore approaches to do that. As far as I am aware, it is not known what the optimal bounds are for this most general case, which explains why there is a bigger variety of approaches here. There are also different ways of defining an "optimal" bound, some frequentist, some Bayesian, some a mixture of the two, and I will try to discuss them here [up to now we have seen only frequentist notions of optimal bounds, in the VC and Rademacher complexity cases]. Data- and algorithm-dependent bounds are also where a lot of interesting current work is happening.

Weighted agnostic PAC bound

Agnostic Structural risk minimization

PAC-Bayes bound

Tighter than the weighted PAC

Algorithmic stability, robustness and sensitivity bounds

Compression bounds

Occam algorithm bound

General version

Kolmogorov complexity version.


TODO: Proofs, and checking rigorous statements of the different types of generalization bound theorems