In Supervised learning, we want to convert a set of example "input-output" pairs into a procedure for predicting the outputs for new inputs (see Supervised learning for examples).
Input and output sets. To be precise, we define two Sets: the domain $\mathcal{X}$ (or input space, or set of instances), and the range $\mathcal{Y}$ (or output space, or co-domain; or set of labels, or classes, if the space is finite).
What we are given in an instantiation of a supervised learning problem is a training set $S$ (which is actually a Tuple), consisting of $m$ pairs $(x_i, y_i)$, $i = 1, \ldots, m$, where for each $i$, $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$ ($\in$ means "belongs to the set"). These pairs are called examples (or sometimes example pairs; also sometimes "data points", or "data"; this is the "data" everyone talks about in ML).
What we want in a supervised learning problem is to find a procedure for converting a training set into a predictor $h: \mathcal{X} \to \mathcal{Y}$ (a Function from the input space to the output space) which is "good". A good predictor has guarantees about its generalization on new examples, which we'll explain below (in the section "The metric for assessing quality of predictors").
The point now is that to have any hope of predicting, we need to assume that new example pairs $(x, y)$, which we use to assess the quality of the predictor, are somehow related in a similar way to the examples in the training set, at least with high probability. The standard way to ensure this is to assume that examples in the training set and test set are independently and identically distributed (if one considers examples which are more generally distributed, one has to start looking at other areas, like Reinforcement learning). So for supervised learning we assume
$(x_i, y_i) \sim \mathcal{D}$ i.i.d. for all $i$, for some data distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ ($\sim$ means that the left-hand side is a Random variable, and the right-hand side is the Probability distribution it follows). We have a notation for saying that the $m$ examples are drawn i.i.d., which is $S \sim \mathcal{D}^m$.
We also assume that new examples on which we test a predictor (sometimes called the test set) are distributed according to the same distribution $\mathcal{D}$.
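To make this concrete, here is a minimal Python sketch (the particular distribution, a noisy threshold on $[0, 1]$, is just a made-up illustration, not anything from the text): both the training set and the test set are sampled i.i.d. from the same data distribution $\mathcal{D}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_examples(m):
    """Draw m i.i.d. examples (x_i, y_i) from a toy data distribution D.
    This particular D is a made-up illustration: x is uniform on [0, 1] and
    the label y is 1[x > 0.5], flipped with probability 0.1 (label noise)."""
    x = rng.uniform(0.0, 1.0, size=m)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=m) < 0.1
    y = np.where(flip, 1 - y, y)
    return list(zip(x, y))

train_set = sample_examples(100)    # the training set S ~ D^m, with m = 100
test_set = sample_examples(1000)    # fresh test examples, drawn from the same D
```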
We are now ready to define what we formally mean by a "good" predictor. To do this, we define a measure of how bad the predictor is, called the Loss function. The loss function is a function $\ell$ which maps example pairs and a predictor to a Real number, $\ell: (\mathcal{X} \times \mathcal{Y}) \times \mathcal{H} \to \mathbb{R}$,
where $\mathcal{H}$ is just the set of all possible predictors which we are considering, called the hypothesis space (see Function for an explanation of this notation; $\mathcal{X} \times \mathcal{Y}$ is the Cartesian product of $\mathcal{X}$ and $\mathcal{Y}$).
The interpretation is that a large value of $\ell((x, y), h)$ means that the predictor $h$ does badly at predicting the output $y$ for a particular input $x$.
But we don't care about just one particular input, so we average (take the Expectation) over the data distribution $\mathcal{D}$ (which test data points are assumed to follow). This is called the risk (or true risk, or generalization error, or true loss, true error, or sometimes even simply error),
$$L_{\mathcal{D}}(h) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell((x, y), h)\right].$$
$\mathbb{E}_{(x, y) \sim \mathcal{D}}$ just means the Expectation under sampling examples $(x, y) \sim \mathcal{D}$. We are taking the expectation of the loss function for a fixed predictor, so that the true risk $L_{\mathcal{D}}(h)$ is a function of (depends on) both the data distribution $\mathcal{D}$ and the predictor $h$.
A common loss function is the 0-1 loss (or classification error), defined when the output space is finite. It is simply $\ell((x, y), h) = \mathbf{1}[h(x) \neq y]$, where $\mathbf{1}[\cdot]$ is the Indicator function, which is equal to $1$ when the condition is satisfied, and to $0$ if it isn't.
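As a small sanity check of these definitions, here is a sketch (with a made-up fixed predictor) of the 0-1 loss and of the averaging that turns it into a risk: on the training set the average is the empirical risk, and on a large fresh sample from $\mathcal{D}$ it approximates the true risk.

```python
import numpy as np

def zero_one_loss(example, h):
    """0-1 loss of predictor h on a single example (x, y): 1 if h misclassifies it."""
    x, y = example
    return float(h(x) != y)

def average_loss(examples, h):
    """Average loss of h over a set of examples.
    On the training set this is the empirical risk; on a large fresh sample
    from D it is a Monte Carlo estimate of the true risk (the expectation over D)."""
    return float(np.mean([zero_one_loss(e, h) for e in examples]))

# A deliberately simple, fixed predictor (not learned): threshold at 0.5.
h = lambda x: int(x > 0.5)
# average_loss(test_set, h) then approximates the true risk of h.
```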
We can now almost define a learning problem. Well, first let us define one last thing. A learning algorithm is defined as a function mapping training sets to predictors. Formally, $A: (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{H}$.
The $*$ is the Kleene star: $Z^*$ means the set of all finite Tuples of elements of $Z$. Here we use it to encode "the set of all training sets of any size" in a concise way. Using this notation, the predictor which the learning algorithm $A$ produces for a training set $S$ is denoted $A(S)$.
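Here is a toy sketch of a learning algorithm in this sense, a function from training sets (of any size) to predictors; the hypothesis class of threshold classifiers is a made-up illustration, not anything specific from the text above.

```python
import numpy as np

def learn(training_set):
    """A toy learning algorithm A: a function mapping any training set S to a
    predictor A(S). Here A does empirical risk minimization over a small,
    made-up hypothesis class of threshold classifiers h_t(x) = 1[x > t]."""
    thresholds = np.linspace(0.0, 1.0, 21)            # the (finite) hypothesis class
    xs = np.array([x for x, _ in training_set])
    ys = np.array([y for _, y in training_set])

    def training_error(t):
        return np.mean((xs > t).astype(int) != ys)    # empirical 0-1 error of h_t

    best_t = min(thresholds, key=training_error)
    return lambda x: int(x > best_t)                  # the predictor A(S)

# predictor = learn(train_set)   # usage: A applied to a training set of (x, y) pairs
```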
We can now finally define a learning problem.
A supervised learning problem is: Given some properties of the data distribution $\mathcal{D}$, and a loss function $\ell$, find a learning algorithm $A$, such that
- it is highly probable that, upon sampling a training set $S \sim \mathcal{D}^m$, the predictor $A(S)$ which the learning algorithm produces has low risk $L_{\mathcal{D}}(A(S))$.
In pseudo-math, $P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \text{ is low}\right]$ is high. Or $P\left[\text{risk is low}\right] \approx 1$. lol (here $P[\cdot]$ is the probability of obtaining a particular value of the thing in the brackets, when sampling $S$ according to its distribution $\mathcal{D}^m$).
Being less pseudo... and more rigorous: we can define "the risk is low" as "$L_{\mathcal{D}}(A(S)) \leq \epsilon$" for some small $\epsilon > 0$; and "highly probable" as "with probability at least $1 - \delta$", for some small $\delta > 0$.
– So we want to find a learning algorithm $A$ such that $P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \epsilon\right] \geq 1 - \delta$, for any $\epsilon$ and $\delta$ within some range of interest.
Supervised learning theory is all about what we can say about $P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \epsilon\right]$ under different assumptions on $\mathcal{D}$, $\ell$, and $A$.
Above, we talked about properties of the distribution of $L_{\mathcal{D}}(A(S))$, corresponding to bounding its cumulative distribution (because of the $\leq \epsilon$) (bounds of this type are also called "bounds in probability", or "high-probability bounds"). However, the most information we can hope to get is the exact distribution of $L_{\mathcal{D}}(A(S))$ under $S \sim \mathcal{D}^m$. With that, we can say what the probability is that our learning algorithm performs this badly under this data distribution, which would be very useful!
Another quantity which is useful is the mean of this distribution, that is $\mathbb{E}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S))\right]$. We will call this the expected generalization error (or expected error, or expected risk). This is now a single quantity that only depends on the data distribution and the algorithm. This is the quantity that we study when calculating Learning curves. A related quantity comes about when we assume some probability distribution $P(\mathcal{D})$ over data distributions (making Bayesians happy ;) ). This is meant to encode the prior probability of obtaining a particular learning task (which corresponds to a particular data distribution). Perhaps we believe that learning tasks which give all images of dogs the same label are more likely than those that don't. Or perhaps we believe that it's more likely that the data is composed of images of actual things, rather than images of random noise. All of these, and more abstract mathematical assumptions, can be encoded in the prior $P(\mathcal{D})$. If we have such a prior, then we can average the expected error over it, to obtain the average expected error (which we will sometimes refer to as average error for short, at the risk of confusing it with the expected error, defined above).
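Here is a sketch of how one point of such a learning curve could be estimated numerically (the data distribution and the function names are illustrative assumptions; `learn` stands for whatever learning algorithm $A$ we are studying):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_examples(m):
    """Toy data distribution D (an illustrative assumption): noisy threshold labels."""
    x = rng.uniform(0.0, 1.0, size=m)
    y = (x > 0.5).astype(int)
    y = np.where(rng.uniform(size=m) < 0.1, 1 - y, y)
    return list(zip(x, y))

def expected_error(learn, m, n_trials=200, n_test=2000):
    """Monte Carlo estimate of the expected generalization error:
    average, over many freshly sampled training sets S of size m, the test error
    of the predictor learn(S). Sweeping m traces out a learning curve."""
    errors = []
    for _ in range(n_trials):
        h = learn(sample_examples(m))     # A(S) for a fresh S ~ D^m
        test = sample_examples(n_test)    # a large sample approximates the true risk
        errors.append(np.mean([h(x) != y for x, y in test]))
    return float(np.mean(errors))
```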
We have assumed that a predictor is a function from input to output space, $h: \mathcal{X} \to \mathcal{Y}$. However, one can easily generalize all the discussion above to stochastic predictors, which are distributions over the output space, dependent on the input, $h: \mathcal{X} \to \mathcal{P}(\mathcal{Y})$, where I used $\mathcal{P}(\mathcal{Y})$ to just mean the "set of probability distributions over $\mathcal{Y}$". One can interpret this as giving the probability of a particular output $y$ given (or conditioned on) an input $x$.
In a precisely analogous way, we can generalize our concept of learning algorithm to include stochastic learning algorithms which, given a training set $S$, produce a probability distribution over predictors, $A: (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{P}(\mathcal{H})$.
From now on, any statement we make about predictors and learning algorithms will typically apply to their stochastic versions, unless we specify they have to be deterministic.
What if we make no assumptions about $\mathcal{D}$? Well, we can't say much then. There are several No free lunch theorems formalizing this.
One of them, by Wolpert 1995, says:
No free lunch theorem: The average expected error $\mathbb{E}_{\mathcal{D} \sim P(\mathcal{D})}\left[\mathbb{E}_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S))\right]\right]$, when the prior $P(\mathcal{D})$ over data distributions is uniform (assuming a uniform distribution can be defined for the choice of input and output spaces), is independent of the learning algorithm $A$.
In particular, this most often implies that any algorithm will perform badly on average. For instance, this theorem is typically applied in the context of classification, where the risk is defined to be the probability of misclassification. The theorem then implies that any algorithm will have an expected risk, on average, which is the same as the dumbest algorithm, which is: "guess randomly", which predicts rather poorly (has probability of success of $1/2$ for binary classification, for instance).
Btw, there are no free lunch theorems for optimization too, which say that all optimization algorithms perform equally badly on average.
Fortunately, we often have some prior knowledge that constrains the data distributions we care about. But before that, let us see if we can get some cheap snack, even if we can't get a free lunch.
It turns out that even without making any assumptions on the data distribution, we can say many things. This is studied in Agnostic learning. In particular, it turns out that we can give a bound on the generalization error, which depends on the sample we obtain, and holds with high probability, at least for suitable hypothesis classes and loss functions (as we will see below). That is, for algorithm $A$, we can find a function $\epsilon(S, A, \delta)$ such that, for any distribution $\mathcal{D}$ whatsoever, and for any $\delta > 0$ (perhaps $\delta$ is required to be smaller than some number for the equation to hold),
$$P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \epsilon(S, A, \delta)\right] \geq 1 - \delta. \qquad (*)$$
[Call this equation $(*)$.]
and the awesome thing is that now $\epsilon(S, A, \delta)$ actually does depend on the algorithm $A$. What this means is that even though all algorithms perform the same on average, some algorithms will perform very well on a few tasks, and pretty badly on most tasks; while other algorithms, like random guessing, will perform equally badly on all tasks. It is quite useful to distinguish these two types of algorithms. And what the above expression says is that if we find such an $\epsilon(S, A, \delta)$ for some non-trivial algorithm, we can see from the data we have whether this is (with high probability) one of the instances in which this algorithm performs well ($\epsilon$ low) or badly ($\epsilon$ high), which is quite a useful thing to be able to do!
As an illustrative example, one common such function that people (in Statistics) are familiar with is Cross-validation! It offers a function of the data, the algorithm, and the confidence parameter ($\delta$), that allows us to be confident of whether our algorithm will generalize well or badly. Indeed cross-validation (CV) can be analyzed within this whole formalism! (which is something I wanna look at actually!). On the other hand, the approach here is more general, and we will see examples of $\epsilon(S, A, \delta)$ which are much easier to calculate than the cross-validation score (for instance, ones dependent only on the training error, see below, and on some simple properties of the learning algorithm, which are easier to compute than the raw cross-validation score!). In some cases, the bounds on the error given by other methods may be better than those given by CV (not sure about this; would be nice to check!)
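For concreteness, here is a minimal sketch of $k$-fold cross-validation (assuming a learning algorithm `learn` and examples given as $(x, y)$ pairs, as in the earlier sketches):

```python
import numpy as np

def cross_validation_score(learn, examples, k=5, seed=0):
    """k-fold cross-validation: hold out each fold in turn, train the learning
    algorithm `learn` on the remaining folds, and average the held-out 0-1 error.
    The score depends on the data and on the algorithm, so it plays the role of a
    data- and algorithm-dependent estimate of generalization."""
    rng = np.random.default_rng(seed)
    examples = list(examples)
    rng.shuffle(examples)
    folds = np.array_split(np.arange(len(examples)), k)
    errors = []
    for fold in folds:
        held_out = [examples[i] for i in fold]
        train = [e for i, e in enumerate(examples) if i not in set(fold)]
        h = learn(train)
        errors.append(np.mean([h(x) != y for x, y in held_out]))
    return float(np.mean(errors))

# cv = cross_validation_score(learn, train_set, k=5)   # using the toy `learn` above
```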
The function $\epsilon(S, A, \delta)$ is almost always expanded in two terms,
$$\epsilon(S, A, \delta) = L_S(A(S)) + \sqrt{\frac{C(S, A, \delta)}{m}},$$
where $m = |S|$ is the number of training examples, and the first term $L_S(A(S))$ is the empirical risk (aka empirical error, empirical loss, training error, training loss) and is defined as:
$$L_S(h) = \frac{1}{m}\sum_{i=1}^{m} \ell((x_i, y_i), h).$$
The numerator $C(S, A, \delta)$ of the second term could be any function of its arguments. As the denominator $m$ is determined by $S$, it's technically redundant, but we typically put it there, because then $C$ has the interpretation of capacity (aka sample complexity) [we will refer to it using that name]. This is a measure of how many training points ($m$) we need to get a small second term.
If the second term is small, we are then saying that the generalization error is basically bounded by the training error (w.h.p.). So, in this case, if we get a small training error, we then get a small generalization error!
Doing well compared to the best in the hypothesis class
There is another popular expansion for $\epsilon(S, A, \delta)$, which is
$$\epsilon(S, A, \delta) = \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \sqrt{\frac{C'(S, A, \delta)}{m}},$$
where $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ is just the minimum true error achievable by any predictor in the hypothesis class.
If one proves bounds using the expansion in terms of empirical error, which hold for all functions in the hypothesis class, and one considers algorithms that minimize the empirical error (said to do Empirical risk minimization (ERM for short)), then by applying that bound twice, and using a simple argument, one obtains a bound of this second form, where $C'$ is the same as $C$ except for some small constant factor (a factor of $2$ in the bound, basically). Basically, one says that if we know that the true and empirical error are close (within $\epsilon$, say) for all functions in the class, then they are close for both the function that minimizes the empirical error and the function that minimizes the true error. Then the true error of the former can't be much higher than that of the latter (note it has to be at least as high, by definition of the latter being a minimizer of true error), because if it were higher (by more than $2\epsilon$), then the empirical error of the latter would be lower than the empirical error of the former (with high probability), contradicting the fact that the former was a minimizer of empirical error.
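Spelling that argument out (a standard derivation; here $h_{\mathrm{ERM}}$ is any empirical risk minimizer, $h^*$ any true risk minimizer, and we condition on the high-probability event that $|L_{\mathcal{D}}(h) - L_S(h)| \leq \epsilon$ for all $h \in \mathcal{H}$):

```latex
\begin{align*}
L_{\mathcal{D}}(h_{\mathrm{ERM}})
  &\le L_S(h_{\mathrm{ERM}}) + \epsilon
     && \text{(uniform convergence, applied to } h_{\mathrm{ERM}}\text{)}\\
  &\le L_S(h^{*}) + \epsilon
     && \text{(}h_{\mathrm{ERM}}\text{ minimizes the empirical risk)}\\
  &\le L_{\mathcal{D}}(h^{*}) + 2\epsilon
     && \text{(uniform convergence, applied to } h^{*}\text{)}\\
  &= \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + 2\epsilon .
\end{align*}
```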
Algorithms for which, for any $\delta$, the capacity measure $C'$ is bounded (or, more generally, grows slower than $m$, i.e. $C' = o(m)$; this is little-$o$ from Order notation) with high probability, are called Agnostic PAC learners. They are characterized by the fact that for any $\epsilon > 0$ and for any $\delta > 0$, we can get enough data ($m$ sufficiently high) such that the second term is smaller than $\epsilon$ (i.e. $\sqrt{C'/m} \leq \epsilon$), for $\epsilon$ and $\delta$ arbitrarily close to $0$.
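In symbols, the standard definition (as in Understanding Machine Learning by Shai and Shai) reads: $A$ is an agnostic PAC learner for $\mathcal{H}$ if for every $\epsilon, \delta \in (0, 1)$ there is a sample size $m(\epsilon, \delta)$ such that, for every data distribution $\mathcal{D}$, whenever $m \geq m(\epsilon, \delta)$,

```latex
\begin{equation*}
P_{S \sim \mathcal{D}^m}\!\left[\, L_{\mathcal{D}}(A(S)) \;\le\; \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon \,\right] \;\ge\; 1 - \delta .
\end{equation*}
```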
Now we look at different approaches to bounding the generalization error, which vary in what they allow the bound $\epsilon$ to depend upon. The more you allow $\epsilon$ to depend on, the tighter the bound can be, because when $\epsilon$ doesn't depend on some quantity, then, if we make no assumptions on that quantity, the bound must be true, in particular, for the worst case over that quantity, i.e. for the value of that quantity that gives the biggest generalization error. Then, for values of that quantity that actually give a smaller generalization error, because our bound doesn't depend on that quantity, our bound will not be very tight (it will be a true upper bound, but won't be very predictive, as it will be much higher than the true value).
Agnostic PAC bound
A classic theory, initiated by a computer scientist called Leslie Valiant, is known as Probably approximately correct learning. Technically, it is just a name for the general theory of computing high-probability bounds on the generalization error, i.e. equation $(*)$. However, it is also often used to refer to a certain approach for such bounds, and we will use the term in that way, to refer to that approach. The approach assumes as little as is reasonable about the learning algorithm. It only assumes something about the cardinality of the set of predictors which the algorithm may output (the hypothesis class), $|\mathcal{H}|$.
The Agnostic Occam theorem for finite hypothesis classes states: for a loss taking values in $[0, 1]$, with probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, simultaneously for all $h \in \mathcal{H}$,
$$\left|L_{\mathcal{D}}(h) - L_S(h)\right| \leq \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}}.$$
This bound bounds the difference between the true error $L_{\mathcal{D}}(h)$ and the training error $L_S(h)$. The second term is independent of the training data. Therefore, as far as the bound is concerned, we should minimize the training error (doing that gives the best bound of this type). Namely, use empirical risk minimizers.
From this theorem, we can see that ERM algorithms with finite hypothesis classes are agnostic PAC learners! We will see next that some algorithms with infinite hypothesis classes are also agnostic PAC learners!
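As a quick numerical illustration, here is the bound of the theorem above, in the Hoeffding-plus-union-bound form stated there (valid for losses in $[0, 1]$; the example numbers are made up):

```python
import numpy as np

def occam_bound_finite_class(train_error, class_size, m, delta=0.05):
    """High-probability bound for a finite hypothesis class and a loss in [0, 1]:
    with probability >= 1 - delta, simultaneously for all h in H,
        L_D(h) <= L_S(h) + sqrt((ln|H| + ln(2/delta)) / (2 m)).
    Returns the right-hand side for a given training error."""
    capacity = np.log(class_size) + np.log(2.0 / delta)
    return train_error + np.sqrt(capacity / (2.0 * m))

# Example: 21 threshold hypotheses, m = 100 training points, 3% training error.
print(occam_bound_finite_class(0.03, class_size=21, m=100, delta=0.05))
```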
Agnostic VC dimension bound (tighter version of the PAC bound)
A hypothesis class consisting of infinitely many functions may still allow an algorithm to generalize. If many of the functions are quite close to each other, we may expect that the number of functions is effectively finite! There are several quantities that mathematically express this notion. One, defined for hypothesis classes corresponding to binary classification (the output space has two elements, $|\mathcal{Y}| = 2$), is the VC dimension. Given a set of functions, even if infinite, one can in principle calculate its VC dimension.
The Agnostic Occam theorem for hypothesis classes with finite VC dimension
With probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, simultaneously for all $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h) \leq L_S(h) + K\sqrt{\frac{d_{VC}(\mathcal{H}) + \ln(1/\delta)}{m}},$$
for some constant $K$ independent of the algorithm, $\mathcal{H}$, or $\mathcal{D}$. So it's basically the same as the Occam theorem for finite classes, but where the log of the cardinality of the hypothesis class is substituted by the VC dimension of the class (denoted $d_{VC}(\mathcal{H})$, which is always equal to or smaller than $\log_2 |\mathcal{H}|$, so the bound is as good or better). Note also that this is only for binary classification.
This bound turns out to be optimal (up to multiplicative constants, independent of the hypothesis class, algorithm, etc) for the set of assumptions we are considering here [well, actually, the independence of the algorithm can be relaxed, see below!]. What that means is that it has a matching lower bound. What that means is that one can prove that (VC lower bound):
For binary classification, for the 0-1 loss function (classification error),
for any algorithm $A$ producing predictors in $\mathcal{H}$,
there exists a data distribution $\mathcal{D}$ such that, with probability at least $\delta$ over $S \sim \mathcal{D}^m$,
$$L_{\mathcal{D}}(A(S)) \;\geq\; \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + K'\sqrt{\frac{d_{VC}(\mathcal{H})}{m}},$$
for some other constant $K'$.
Therefore no agnostic bound which is independent of the training set can be better than the VC dimension bound, except up to a constant factor. Any high-probability upper bound smaller than that would imply that, with high probability, $L_{\mathcal{D}}(A(S)) < \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + K'\sqrt{d_{VC}(\mathcal{H})/m}$ for every data distribution, which is false for the distribution whose existence is guaranteed by the theorem above.
The "for any algorithm" in the VC lower bound theorem above has told us something more than we may have anticipated. It has told us that even if we assume something more about the algorithm, but still don't assume or know anything about the data when computing the bound, the VC dimension bound is still optimal!
So if you are doing binary classification, and you know the hypothesis class of your algorithm, and you assume nothing else (about the data, for instance; in the frequentist sense, i.e. worst-case), then the VC dimension bound is (up to a constant) the best guarantee you can make about the generalization error of your algorithm.
You also know that for binary classification, the algorithms that give the best bounds are ERM algorithms (minimize training error). But within those (there may be many possible algorithms that minimize the training error; consider that there could be many functions that all fit the data equally well), none gives you better worst-case VC dimension bounds than any other.
What about Multiclass classification?
All the results above are about binary classification! There is an analogous quantity to VC dimension, for the case of multiclass classification. It is called the Natarajan dimension. Similar results to the Occam theorem, and the lower bound (optimality) hold.
However, it turns out that not all ERM algorithms satisfy the optimal bound (or are even agnostic PAC learners), unlike in binary classification, where every ERM algorithm did. For multiclass classification, only a subset of ERM algorithms give the best possible agnostic bound (again, for the case of bounds that don't depend on the training set, like we are considering in this section).
What if you allow the bound to depend on the training set, but you still don't want to assume anything about the algorithm, except its hypothesis class $\mathcal{H}$? Then it turns out we also have a bound that is optimal (up to multiplicative constants) for this case. As we want the bound to depend on the training set, we need something analogous to the VC dimension but which depends on the training set too, not just on the hypothesis class. That thing is the Rademacher complexity $\mathcal{R}_S(\ell \circ \mathcal{H})$ (which actually depends on the loss function too; although one can also define the Rademacher complexity of the hypothesis class on some data, without reference to the loss function. That will make more sense when I put in the definition of Rademacher complexity).
Here is a good introduction to the definition of Rademacher complexity. However, I plan to write things more in depth here in the future.
Agnostic Rademacher complexity bound
The basic version of the Rademacher complexity bound is one where, instead of having a bound whose capacity term depends on the training set ($S$), it depends on the data distribution $\mathcal{D}$ (and also on $\mathcal{H}$ and $\delta$), i.e. we have:
$$P_{S \sim \mathcal{D}^m}\left[L_{\mathcal{D}}(A(S)) \leq \epsilon(S, \mathcal{D}, \delta)\right] \geq 1 - \delta,$$
where $\epsilon(S, \mathcal{D}, \delta)$ can be written, as before, as the training error plus some ratio of a capacity function and $m$, but now the capacity function depends, instead of on the VC dimension, on the expected Rademacher complexity $\mathbb{E}_{S' \sim \mathcal{D}^m}\left[\mathcal{R}_{S'}(\ell \circ \mathcal{H})\right]$.
This is called a distribution-dependent bound. Even if we don't know everything about $\mathcal{D}$, if we know something about it (some property of it) that allows us to calculate the bound, then a distribution-dependent bound may be useful. And in that case, we can predict generalization bounds before getting the training data (just like we could for VC dimension, but here the bounds could be tighter/better because we are putting some assumptions on the data distribution, unlike for VC dimension). That can be quite useful, because via the dependence of the bound on $m$, we can find how many data points we need to train our algorithm to be able to guarantee a generalization error smaller than some value. Getting data can be quite expensive in the real world, so knowing this can be a critical piece of information for some company or project!
Unfortunately, knowing things about $\mathcal{D}$ that allow us to compute a bound may be quite hard, or not possible. [We will discuss this further later for bounds that depend on the algorithm as well as the data; but I leave it as an open question: are there any nontrivial problems where we can predict the distribution-dependent Rademacher complexity bound?] Therefore, we may be more interested in a data-dependent bound that depends on $S$ rather than $\mathcal{D}$. Fortunately, there are Concentration inequalities (the Bounded differences inequality) that guarantee that, with high probability, the Rademacher complexity on a sample, $\mathcal{R}_S(\ell \circ \mathcal{H})$, is close to its expected value, $\mathbb{E}_{S \sim \mathcal{D}^m}\left[\mathcal{R}_S(\ell \circ \mathcal{H})\right]$. Their difference can be bounded, with high probability, and we can therefore obtain a data-dependent bound of the form
The Agnostic data-dependent Occam theorem using Rademacher complexity
With probability at least $1 - \delta$ over $S \sim \mathcal{D}^m$, simultaneously for all $h \in \mathcal{H}$,
$$L_{\mathcal{D}}(h) \leq L_S(h) + 2\,\mathcal{R}_S(\ell \circ \mathcal{H}) + c\sqrt{\frac{\ln(1/\delta)}{m}},$$
for some universal constant $c$. See Understanding Machine Learning by Shai and Shai for the proof (until I put it here or something lol).
Note this is a bound that works not just for binary classification, but for any supervised learning task!
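For small, explicitly enumerable hypothesis classes, the empirical Rademacher complexity of the loss class can simply be estimated by Monte Carlo over random sign vectors; here is a sketch (the threshold class and the toy data are made-up illustrations):

```python
import numpy as np

def empirical_rademacher(loss_matrix, n_draws=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
        R_S = E_sigma [ sup_h (1/m) sum_i sigma_i * loss(h, z_i) ],
    where the sigma_i are i.i.d. uniform +/-1 signs.
    loss_matrix has one row per hypothesis h and one column per example z_i."""
    rng = np.random.default_rng(seed)
    n_hyp, m = loss_matrix.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(loss_matrix @ sigma) / m   # sup over the hypotheses
    return total / n_draws

# Example: 0-1 losses of 21 threshold classifiers on 50 toy examples.
rng = np.random.default_rng(1)
x = rng.uniform(size=50)
y = (x > 0.5).astype(float)
thresholds = np.linspace(0.0, 1.0, 21)
losses = np.array([((x > t).astype(float) != y).astype(float) for t in thresholds])
print(empirical_rademacher(losses))
```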
It also turns out that the Rademacher bound is optimal under its assumptions. Meaning that we can find a matching lower bound, by which:
there exists a learning algorithm (producing predictors in $\mathcal{H}$), and there exists a data distribution, for which the generalization error matches the Rademacher complexity bound above, up to a constant factor.
Notice that, unlike for VC dimension, we don't have "for all learning algorithms" in the middle, but "there exists a learning algorithm". This means that the Rademacher complexity bound is only the optimal data-dependent bound if we assume nothing about the algorithm, except that it has a certain hypothesis class! This is different from what happened for the VC dimension data-independent bounds. It means that we can expect to get tighter data-dependent bounds if we use some more information about the learning algorithm. We will explore that in the sections below.
Particular cases of Rademacher complexity
The theory of Rademacher complexity is quite rich. If one specializes to the case of binary classification, one can obtain a bound on the Rademacher complexity that depends on something called the growth function. One can in turn bound the growth function with the VC dimension using Sauer's lemma. We can therefore derive the VC dimension bound from the Rademacher bound.
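Roughly, the chain of bounds goes as follows (standard statements, see e.g. Understanding Machine Learning by Shai and Shai; the constants depend on conventions):

```latex
% Growth function bound (via Massart's lemma), for binary classification with 0-1 loss:
\begin{equation*}
\mathcal{R}_S(\ell \circ \mathcal{H}) \;\le\; \sqrt{\frac{2\ln \tau_{\mathcal{H}}(m)}{m}},
\end{equation*}
% where tau_H(m) is the growth function: the maximum number of distinct labelings
% that hypotheses in H can produce on any m input points.
% Sauer's lemma: if d = d_VC(H) is finite, then
\begin{equation*}
\tau_{\mathcal{H}}(m) \;\le\; \sum_{i=0}^{d} \binom{m}{i} \;\le\; \left(\frac{e m}{d}\right)^{d} \qquad (m \ge d),
\end{equation*}
% so plugging this in recovers a VC-type bound of order sqrt(d * log(m/d) / m).
```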
One can also have bounds on the Rademacher complexity which depend on the Covering numbers and Packing numbers of the hypothesis class. One can also relate these to VC dimension. In fact, as far as I know, the only way to derive the tightest version of the VC bound is using a trick with covering numbers known as chaining.
As we saw above, the Rademacher complexity bound could potentially be improved if we make the bound depend not just on the data $S$ and the hypothesis class $\mathcal{H}$, but also on further properties of the algorithm $A$. Here we explore approaches to do that. As far as I am aware, it is not known what the optimal bounds are for this most general case, which explains why there is a bigger variety of approaches here. There are also different ways of defining an "optimal" bound, some frequentist, some Bayesian, some a mixture of the two. And I will try to discuss them here [up to now we have seen only frequentist notions of optimal bounds, in the VC and Rademacher complexity cases]. Data- and algorithm-dependent bounds are also where a lot of interesting current work is happening.
Weighted agnostic PAC bound
Agnostic Structural risk minimization
PAC-Bayes bound
Tighter than the weighted PAC
Algorithmic stability, robustness and sensitivity bounds
Occam algorithm bound
General version
Kolmogorov complexity version.
TODO: Proofs, and checking rigorous statements of the different types of generalization bound theorems