Growth function generalization error bound

cosmos 17th May 2019 at 12:35am

So, in Supervised learning theory people are pretty obsessed with proving bounds on the Generalization error, so let's try that.

Check my Introduction to supervised learning theory for a more detailed description of all the terms and things. But I don't actually mind repeating myself, so let's see.

We want to model the idea of learning a pattern from a set of examples. What kind of pattern? Well, each of these examples is actually a pair of things. One thing we call an input, and the other thing we call an output. The pattern we want to learn is a Function, that is, a procedure for predicting an output when we are given an input.

Examples.

  • The inputs could be images, and the outputs could be labels.
  • Inputs could be memes and the output could be haha / k
  • Inputs could be movies, and the output could be soundtracks
  • Input could be a bunch of lego pieces, and output could be lego castles. Yes, this also technically falls within supervised learning. The function could literally be a machine now (who says machine learning is about "software" haha). This is a very non-deterministic function because there are many lego castles that are possible with a given set of pieces, but that's fine, we allow for stochastic functions. To make it less stochastic, you could feed an instruction book as input too.

As the last example illustrates, in the general case we assume that the output could be sampled from a probability distribution that depends on the input, rather than being deterministic. But actually, for simplicity, I'm going to assume a deterministic function for now.

Next assumption! We assume the examples are sampled from a probability distribution $\mathcal{D}$ independently and identically, like coin tosses are.

We want to predict whether a particular function will predict correctly on new independent samples from the same distribution. We call the probability that a function (aka hypothesis) $h$ predicts incorrectly the generalization error, with symbol $\mathcal{D}(h)$.

Now, I'm going to prove one of the more advanced results in this theory, so sorry if it's a bit of a jump :P (it will be unless you're already familiar, I guess)

We call the set of $m$ examples, which makes up our training set, $S_0$.

Agnostic growth function generalization bound

Trying to get a tight version as per discussion here: https://twitter.com/guillefix/status/1129103523092803584

Now, we call the true function, the one that determines the outputs given the inputs in the set of examples, the target function. We assume the outputs in this case are just $0$ or $1$ (aka binary classification). In this case we sometimes call a function a concept; concepts are 1-to-1 with sets (the set corresponding to a function is the set of all inputs mapping to $1$).

Without loss of generality, we assume the target concept to be the empty set, i.e. the function with output uniformly $0$. You can think of this as an input-dependent relabelling of the outputs, which doesn't change anything information-theoretically (labels are just labels; I can call them duck/orange or whatever instead of 0/1, and the labelling can depend on the particular input too, as long as it's fixed).

We now assume that the algorithm can only output functions within a certain set of functions, which we call the hypothesis class, and denote $\mathcal{H}$.

We will denote the fraction of examples in $S_0$ which the algorithm misclassifies (the training error) by $S_0(h)$.
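To make these two quantities concrete, here's a minimal Python sketch. The setup is a toy one of my own (not from anywhere in particular): inputs uniform on $[0,1]$, the all-zeros target, and threshold functions as hypotheses, so the generalization error is known exactly and we can compare it to a Monte Carlo estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (my own illustration): inputs are uniform on [0, 1], the target
# concept is the all-zeros function, and a hypothesis h_t labels x as 1 iff
# x < t, so its generalization error is exactly D(h_t) = t.
def h(t, x):
    return (x < t).astype(int)

m, t = 100, 0.3
S0_x = rng.uniform(size=m)                 # training inputs, i.i.d. from D
S0_y = np.zeros(m, dtype=int)              # target labels are all 0

train_error = np.mean(h(t, S0_x) != S0_y)  # S_0(h): fraction misclassified
gen_error = np.mean(h(t, rng.uniform(size=1_000_000)) != 0)  # Monte Carlo D(h)

print(f"S_0(h) = {train_error:.3f}, D(h) ~ {gen_error:.3f} (true value {t})")
```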

Call $\{\mathcal{D}(h) - S_0(h) \geq \epsilon\}$ property $P(h)$. Let event $A$ be $\{\exists h\in \mathcal{H} : P(h)\}$, that is, the event that there exists a function in $\mathcal{H}$ whose difference between generalization and training error is at least some constant $\epsilon$.

If such a function doesn't exist, then we can guarantee for our algorithm (which can only produce functions in $\mathcal{H}$) that the generalization error is close (within $\epsilon$) to the training error, so that minimizing the training error approximately minimizes the generalization error. Proving that the generalization error and the training error are close is the goal of Agnostic learning theory at large.

So the theorem we want to prove is that the probability of such a function existing, that is, the probability of $A$, is low.

This is not too hard if $\mathcal{H}$ is a finite set, see Probably approximately correct. You just use the Union bound, after bounding the probability that a single function has the bad property ($\mathcal{D}(h) - S_0(h) \geq \epsilon$). But we are hardcore, so let's consider the possibility of the set being infinite.
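For the record, here's what the finite case buys you numerically. This is a sketch assuming the standard route: Hoeffding's inequality gives $e^{-2m\epsilon^2}$ per hypothesis, and the union bound multiplies by $|\mathcal{H}|$.

```python
import numpy as np

# For a single h, Hoeffding gives Prob[D(h) - S_0(h) >= eps] <= exp(-2 m eps^2);
# the union bound multiplies this by |H|.  Setting |H| exp(-2 m eps^2) = delta
# and solving for eps gives the familiar finite-class rate below.
def finite_class_eps(H_size, m, delta):
    return np.sqrt(np.log(H_size / delta) / (2 * m))

for H_size in (10, 1_000, 10**6):
    print(H_size, finite_class_eps(H_size, m=10_000, delta=0.05))
```

Note how the dependence on $|\mathcal{H}|$ is only logarithmic, which is why the union bound is so forgiving for finite classes, and why we want a finite surrogate for infinite ones.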

Aha, here's the most clever trick in all of supervised learning theory. We can transform a statement about an infinite set into a statement about a finite one as follows.

First, assume $\mathcal{D}(h) - S_0(h) \geq \epsilon$ for some $h$, i.e. assume property $P(h)$. Then sample an extra $m$ samples, and call the new sample $S_1$. We can show that if the true error is $\mathcal{D}(h)$, then it's very unlikely that the error on the new sample, $S_1(h)$, is very different. In particular, using Chernoff bounds we can show that

$\text{Prob}[S_1(h) > \mathcal{D}(h)/2 \mid P(h)] \geq 1/2$

(see page 348, sec 28.3, in Understanding Machine Learning by Shai and Shai)
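We can sanity-check this claim numerically: for a fixed $h$ with true error $p$, the number of errors on the fresh sample is Binomial$(m, p)$, so the probability is an exact binomial tail. A sketch (needs scipy; the claim needs $mp$ to not be too small):

```python
from scipy.stats import binom

# For a fixed h with true error p, the error count on the fresh sample S_1 is
# Binomial(m, p), so Prob[S_1(h) > p/2] = 1 - BinomCDF(floor(m*p/2), m, p).
m = 1000
for p in (0.01, 0.05, 0.1, 0.5):
    prob = 1 - binom.cdf(int(m * p / 2), m, p)
    print(f"p = {p}: Prob[S_1(h) > p/2] = {prob:.3f}")
```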

Now, call $\{S_1(h) > \mathcal{D}(h)/2\}$ property $Q(h)$.

Then

$\text{Prob}[\exists h\in \mathcal{H}: P(h) \cap Q(h)] = P(A)\, P[Q(h) \mid P(h)] \geq \frac{P(A)}{2}$

So

$\frac{P(A)}{2} \leq \text{Prob}[\exists h\in \mathcal{H}: P(h) \cap Q(h)] \leq \text{Prob}[\exists h \in \mathcal{H}: \text{hmm}]$

$2 S_1(h) - S_0(h) > \mathcal{D}(h) - S_0(h) \geq \epsilon$

eeh not working. See below.


Hmmm,

let us assume for now that $\text{Prob}[S_1(h) > \mathcal{D}(h) \mid P(h)] \geq 1/4$. They say here that in fact this is true (assuming $\mathcal{D}(h) > 2/m$). But how to show it? Well, $e^{-2} + 2e^{-2} + 2e^{-2} < 3/4$ is an upper bound on the probability that $S_1(h) \leq \mathcal{D}(h)$ if the binomial probability is $\mathcal{D}(h) = 2/m$ (the three terms are the Poisson($2$) probabilities of $0$, $1$ or $2$ errors). For larger $\mathcal{D}(h)$, this probability only decreases, as the Binomial distribution only shifts further right with increasing $p$, so the upper bound still works*. Now, because the condition $P(h)$ ensures $\mathcal{D}(h) \geq \epsilon$, we can guarantee $\mathcal{D}(h) > 2/m$ by guaranteeing $\delta$ is smaller than $1/2$ or so, because in the actual statement of the theorem we define $\epsilon$ to be a certain function of $\delta$, $m$, etc. Alternatively, think of us putting a condition on $\epsilon$: it has to be at least $2/m$.

*We still need to show that to fully complete the proof.
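Here's a quick numerical check of both pieces (a sketch of my own): the Poisson($2$) approximation at $p = 2/m$, and the exact binomial probability that the error count exceeds its mean for a range of $p \geq 2/m$.

```python
import numpy as np
from scipy.stats import binom

# The claim: if D(h) = p >= 2/m, then the error count on S_1 is Binomial(m, p)
# and Prob[count > m*p] >= 1/4.  At p = 2/m the Poisson(2) approximation gives
# Prob[count <= 2] ~ (1 + 2 + 2) e^{-2} ~ 0.677 < 3/4.
print("Poisson(2) approximation:", 5 * np.exp(-2))

m = 1000
for k in (2, 5, 20, 100):                # mean error counts m*p = k
    p = k / m
    prob = 1 - binom.cdf(k, m, p)        # Prob[S_1(h) > D(h)]
    print(f"p = {p}: Prob[S_1(h) > D(h)] = {prob:.3f}")
```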

$\frac{P(A)}{4} \leq \text{Prob}[\exists h \in \mathcal{H}: S_1(h) - S_0(h) > \epsilon]$

Call $\{S_1(h) - S_0(h) > \epsilon\}$ property $T(h)$.

(We imagine first sampling a set of instances, which we call $X$, and then deciding its order to get the tuple $S = S_0 + S_1$ (using Python tuple sum :P).)

$\leq \sum\limits_X \mathcal{D}(X) \sum\limits_{S|X} P(S|X) \max\limits_{h\in \mathcal{H}} \mathbb{1}_{T(h)}$

NOW. This is the crucial moment. We notice that property $T(h)$ only depends on the labelling that $h$ gives to the elements in $X$, so the max over $\mathcal{H}$ is the same as the max over $\Delta(X)$, the set of labellings of $X$ induced by $\mathcal{H}$. This is now a finite set! Hurray! We can safely union bound the hecc out of this guy.
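To see the trick in action, here's a tiny sketch (using a toy threshold class, my own choice of example): an infinite hypothesis class induces only finitely many labellings on a finite $X$.

```python
import numpy as np

# The projection trick, concretely: thresholds t in [0, 1] form an infinite
# hypothesis class, but on a finite sample X they induce only finitely many
# labellings -- the set Delta(X).  For thresholds this is |X| + 1 labellings.
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(size=10))

labellings = {tuple((X < t).astype(int)) for t in np.linspace(0, 1, 100_001)}
print(f"|X| = {len(X)}, |Delta(X)| = {len(labellings)}")   # prints 11 = |X| + 1
```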

Using the Union bound (which, when expressing the union as a max of indicators, is the same as saying the sum is greater than the max for nonnegative numbers):

$\leq \sum\limits_X \mathcal{D}(X) \sum\limits_{l\in \Delta(X)} \sum\limits_{S|X} P(S|X) \mathbb{1}_{T(l)}$

Now, $\sum_{S|X} P(S|X)\, \mathbb{1}_{T(l)}$ is the probability that a random shuffle of $a$ things in a set of $2m$ things (where $p = a/2m > \epsilon$) gives us a shuffle such that $S_1(l) - S_0(l) > \epsilon$.

We can look at the random variable $S_1(l) - S_0(l)$ itself. It has mean $0$, and is a function of a series of binary variables which indicate whether a particular $x \in X$ is in $S_0$, say. These variables can be approximately modelled as being i.i.d. with probability $p$. However, they are not really i.i.d.: we are doing a random shuffle of them. Still, if they were i.i.d., then the Bounded differences inequality would apply (the difference from flipping one of the variables is $2/m$).

Actually, you can see here that we can convert it into an i.i.d. thing by looking at a different group: rather than the full permutation group, only a subgroup of it generated by swaps between the $i$th and the $(m+i)$th elements (so between corresponding elements in the two halves of $S$). This is still measure-preserving, but one can write the relevant event as depending on a set of i.i.d. swap variables... Neato. The trick apparently comes from here?
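Here's a Monte Carlo sketch of that swap subgroup in action (my own illustration; the parameters are arbitrary): fix one labelling with $a$ errors among the $2m$ points, flip each $(i, m+i)$ pair independently with probability $1/2$, and compare the empirical probability of $T(l)$ to the bounded-differences bound derived below.

```python
import numpy as np

# Monte Carlo sketch of the swap-subgroup trick: fix a labelling with `a`
# errors among 2m points, pair position i with position m+i, and swap each
# pair independently with probability 1/2.  This is measure-preserving on the
# S_0 / S_1 split, and the swap variables are i.i.d., which is what lets the
# bounded differences inequality apply.
rng = np.random.default_rng(2)
m, a, eps, trials = 500, 100, 0.05, 100_000

errors = np.zeros(2 * m, dtype=int)
errors[:a] = 1
rng.shuffle(errors)                       # one fixed labelling l on the 2m points
first, second = errors[:m], errors[m:]

swaps = rng.integers(0, 2, size=(trials, m)).astype(bool)
S0 = np.where(swaps, second, first).mean(axis=1)   # error rate on first half
S1 = np.where(swaps, first, second).mean(axis=1)   # error rate on second half

print(f"empirical Prob[S_1 - S_0 > eps] = {np.mean(S1 - S0 > eps):.4f}")
print(f"bound e^(-m*eps^2/2)            = {np.exp(-m * eps**2 / 2):.4f}")
```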

Then the difference between the value and its mean ($0$) is, with probability at least $1-\delta$, smaller than $\frac{2}{m}\sqrt{\frac{m}{2}\log{\frac{1}{\delta}}} = \sqrt{\frac{2}{m}\log{\frac{1}{\delta}}}$,

so that if $\sqrt{\frac{2}{m}\log{\frac{1}{\delta}}} = \epsilon$

$\log{\frac{1}{\delta}} = m\frac{\epsilon^2}{2}$

$\delta = e^{-m\frac{\epsilon^2}{2}}$

So that, under this approximation, $\text{Prob}[S_1(l) - S_0(l) > \epsilon] \leq e^{-m\frac{\epsilon^2}{2}}$

So

$\frac{P(A)}{4} \leq \Delta(\mathcal{H})\, e^{-m\frac{\epsilon^2}{2}}$

where $\Delta(\mathcal{H})$ is the Growth function (evaluated at $2m$, since $X$ has $2m$ elements), a bound on $|\Delta(X)|$.
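To get a feel for the final bound, here's a sketch that plugs in Sauer's lemma, a standard fact not proven in this note: for a class of VC dimension $d$, the growth function at $2m$ points is at most $(2em/d)^d$.

```python
import numpy as np

# Putting the pieces together (a sketch; assuming Sauer's lemma, which bounds
# the growth function at 2m points by (2em/d)^d for VC dimension d):
#   P(A) <= 4 * (2em/d)^d * e^{-m eps^2 / 2}
def bound_on_P_A(m, d, eps):
    return 4 * (2 * np.e * m / d) ** d * np.exp(-m * eps**2 / 2)

for m in (10**3, 10**4, 10**5, 10**6):
    print(m, bound_on_P_A(m, d=10, eps=0.1))
# vacuous for small m, then it collapses: the polynomial growth function
# eventually loses to the exponential in m.
```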

See a good proof here and here.

In here they actually prove a tighter version of the Growth function generalization error bound, which allows them to bound a sort of relative version of the generalization gap (divided by the square root of the true error, or by the actual true error, if they assume a bound on it too).

The bound they get is not uniform over the training error; it's more like a generalization of the realizable bound. It shows that, with high probability, if the training error is less than blah (fixed before drawing the sample), then the generalization error is less than blah (just like this other blah).

That is why they later do the weighted union bound in here, because they don't really have an agnostic bound. They have a bound that can be applied to any training error, but doesn't hold uniformly over all training errors. It's a different kind of thing.


Actually, people seem to prove this using Rademacher complexity and Massart's lemma.