Rademacher complexity

cosmos 13th December 2019 at 6:32pm
Learning theory

The Rademacher complexity of a set $T \subseteq \mathbb{R}^m$ is

$$\mathcal{R}(T) := \mathbf{E}\sup_{t \in T} \frac{1}{m} \sum_{i=1}^m \epsilon_i t_i$$

where $\epsilon_1, ..., \epsilon_m$ are i.i.d. Random variables uniform in $\{-1,1\}$
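To make the definition concrete, here is a minimal Monte Carlo sketch (in Python; the function name and the example set are just illustrative choices) that estimates $\mathcal{R}(T)$ for a finite set of vectors by sampling the $\epsilon_i$'s:

```python
import numpy as np

def rademacher_complexity(T, n_draws=10000, seed=0):
    """Monte Carlo estimate of R(T) for a finite set T of vectors in R^m.

    T: array of shape (n_vectors, m); each row is one t in T.
    """
    T = np.asarray(T, dtype=float)
    rng = np.random.default_rng(seed)
    m = T.shape[1]
    # i.i.d. Rademacher vectors, uniform in {-1, +1}^m
    eps = rng.choice([-1.0, 1.0], size=(n_draws, m))
    # for each draw, sup over t in T of (1/m) sum_i eps_i t_i
    sups = (eps @ T.T / m).max(axis=1)
    return sups.mean()

# Example: the two-point set {(1,...,1), (-1,...,-1)} in R^10
T = np.vstack([np.ones(10), -np.ones(10)])
print(rademacher_complexity(T))  # ~ E|mean of eps| ≈ sqrt(2/(pi*m)) ≈ 0.25
```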

See here for the book chapter (UML); also the notes in LectureNotes on tablet (ML-AI/Learning theory), these notes, and here

The Rademacher complexity is an important tool for bounding the expected supremum of a random process.

Video

See Learning theory

The expected Rademacher complexity bounds the expected Representativeness of the training set:

$$\mathbb{E}_{S \sim \mathcal{D}^m} [\text{Rep}_\mathcal{D} (\mathcal{F},S)] \leq 2\, \mathbb{E}_{S \sim \mathcal{D}^m} R(\mathcal{F} \circ S)$$

where the representativeness is defined as:

$$\text{Rep}_\mathcal{D} (\mathcal{F},S) := \sup\limits_{f\in \mathcal{F}} \big(L_\mathcal{D}(f) - L_S(f)\big)$$
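Here $\mathcal{F} \circ S$ denotes (following the convention in UML, where $\mathcal{F}$ is typically a class of loss functions) the set of vectors of values that the functions in $\mathcal{F}$ take on the sample $S = (z_1, \ldots, z_m)$:

$$\mathcal{F} \circ S := \{ (f(z_1), \ldots, f(z_m)) : f \in \mathcal{F} \} \subseteq \mathbb{R}^m$$

so that $R(\mathcal{F} \circ S)$ is just the Rademacher complexity of this set of vectors.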

The proof is a symmetrization argument, sketched below: we introduce a second, i.i.d. ("ghost") training set $S'$, so that the term inside the sup becomes a difference of empirical losses on the two sets, and multiply the summands by Rademacher variables. The key fact is that swapping the $i$-th example of $S$ with the $i$-th example of $S'$ leaves the joint expectation unchanged, and such a swap turns out to be equivalent to flipping the sign of the $i$-th summand. The expectation over the two training sets is therefore the same for every realization of the Rademacher variables, so averaging over them changes nothing, and we can introduce them for free.
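A rough sketch of that symmetrization argument (following the UML chapter), with $S' = (z'_1, \ldots, z'_m)$ the ghost sample:

$$\begin{aligned} \mathbb{E}_{S}\big[\text{Rep}_\mathcal{D}(\mathcal{F},S)\big] &= \mathbb{E}_{S}\Big[\sup_{f \in \mathcal{F}} \big(\mathbb{E}_{S'}[L_{S'}(f)] - L_S(f)\big)\Big] \leq \mathbb{E}_{S,S'}\Big[\sup_{f \in \mathcal{F}} \tfrac{1}{m}\textstyle\sum_i \big(f(z'_i) - f(z_i)\big)\Big] \\ &= \mathbb{E}_{\epsilon}\,\mathbb{E}_{S,S'}\Big[\sup_{f \in \mathcal{F}} \tfrac{1}{m}\textstyle\sum_i \epsilon_i\big(f(z'_i) - f(z_i)\big)\Big] \leq \mathbb{E}_{S'} R(\mathcal{F} \circ S') + \mathbb{E}_{S} R(\mathcal{F} \circ S) = 2\,\mathbb{E}_{S} R(\mathcal{F} \circ S) \end{aligned}$$

The first inequality is Jensen (pulling $\mathbb{E}_{S'}$ outside the sup), the middle equality holds because any fixed sign pattern just swaps some $z_i \leftrightarrow z'_i$, which leaves the joint distribution of $(S,S')$ unchanged, and the last step splits the sup into two terms and uses $-\epsilon_i \overset{d}{=} \epsilon_i$.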

This allows us to bound Generalization gaps both in expectation and with high probability (using McDiarmid's inequality).

The first result is that, via McDiarmid's concentration inequality, $\text{Rep}_\mathcal{D} (\mathcal{F},S)$ is close to its expectation $\mathbb{E}_{S \sim \mathcal{D}^m} [\text{Rep}_\mathcal{D} (\mathcal{F},S)]$ w.h.p. over $S \sim \mathcal{D}^m$, so that if we can bound $\mathbb{E}_{S \sim \mathcal{D}^m} [\text{Rep}_\mathcal{D} (\mathcal{F},S)]$, we are bounding the worst-case difference between the true risk and the empirical risk, w.h.p. To bound this we need to bound $\mathbb{E}_{S \sim \mathcal{D}^m} R(\mathcal{F} \circ S)$, but we don't know $\mathcal{D}$, so instead we show that, w.h.p. over training sets $S \sim \mathcal{D}^m$, $R(\mathcal{F} \circ S)$ is close to its expectation (via the same concentration inequality as above). Then we can estimate the expectation from the $R(\mathcal{F} \circ S)$ we compute on our training set!

So the bound depends on the training set; the key, however, is that it doesn't depend much on the particular draw $S$, although it may depend significantly on the underlying distribution $\mathcal{D}$. This is similar to how bounds like those for Occam algorithms depend on the target function, or how other Data-dependent generalization bounds depend on the target distribution, to bypass the problem of distribution-free worst-case bounds. However, it is still worst-case over $\mathcal{F}$, and it doesn't allow for a biased distribution over functions, like PAC-Bayes does.
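Concretely, the resulting high-probability bounds are of the following shape (constants as in UML Theorem 26.5, if I recall them correctly; McDiarmid applies because replacing one example changes these quantities by at most $O(c/m)$ when the loss is bounded by $c$): with probability at least $1-\delta$ over $S \sim \mathcal{D}^m$, simultaneously for all $f \in \mathcal{F}$,

$$L_\mathcal{D}(f) - L_S(f) \;\leq\; 2\,\mathbb{E}_{S' \sim \mathcal{D}^m} R(\mathcal{F} \circ S') + c\sqrt{\frac{2\ln(2/\delta)}{m}} \qquad \text{and} \qquad L_\mathcal{D}(f) - L_S(f) \;\leq\; 2\,R(\mathcal{F} \circ S) + 4c\sqrt{\frac{2\ln(4/\delta)}{m}}$$

the second being the data-dependent version discussed above.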

So for instance you could have a hypothesis class that contains the vector of all 1s, plus all vectors with half 1s and half 0s. For random target functions this would have high expected Rad comp and generalize badly, but for the target function of all 1s it would have low expected Rad comp and generalize well. However, for the hypothesis class of all functions, the Rad comp would be high for all targets!
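As a sanity check on that last point, here is a small Python sketch (the class construction is just an illustrative choice) comparing the Rademacher complexity of the restricted set of output vectors with that of all $2^m$ binary vectors:

```python
import numpy as np
from itertools import combinations, product

def rad(T, n_draws=20000, seed=0):
    # Monte Carlo estimate of the Rademacher complexity of a finite set T of vectors in R^m
    T = np.asarray(T, dtype=float)
    eps = np.random.default_rng(seed).choice([-1.0, 1.0], size=(n_draws, T.shape[1]))
    return (eps @ T.T / T.shape[1]).max(axis=1).mean()

m = 10
# Restricted class: the all-1s vector plus every vector with exactly m/2 ones
half_ones = []
for idx in combinations(range(m), m // 2):
    v = np.zeros(m)
    v[list(idx)] = 1.0
    half_ones.append(v)
A = np.vstack([np.ones(m)] + half_ones)

# Unrestricted class: all 2^m binary vectors
B = np.array(list(product([0.0, 1.0], repeat=m)))

print(rad(A), rad(B))  # roughly 0.38 vs 0.5 for m = 10
```

The unrestricted class always achieves $\frac{1}{m}\#\{i : \epsilon_i = +1\}$, which averages to $1/2$, while the restricted class cannot match every sign pattern, so its Rademacher complexity is lower.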

I think there are probably some ways to solve this, and incorporate some bias within $\mathcal{F}$. This would require making the bound depend on $f$, like something based on Structural risk minimization, with a set of classes (typically nested) which have different Rad comps.

Paper with applications of Rademacher complexity to classes of margin classifiers: http://www.jmlr.org/papers/volume5/luxburg04b/luxburg04b.pdf


Translating a set of vectors by a constant vector doesn't change the Rademacher complexity, and multiplying them by a constant $c$ scales it by $|c|$ (so affine transformations act on it in a simple way). This demonstrates that the Rademacher complexity is not only about how well the vectors can match the signs of the $\epsilon_i$'s; it also depends on their scale. It only measures this sign-matching if we limit the norm of the vectors (e.g. when working with a bounded loss).
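A quick check of these two properties: for $c \in \mathbb{R}$, $a_0 \in \mathbb{R}^m$ and $A \subseteq \mathbb{R}^m$,

$$\mathcal{R}(cA + a_0) = \mathbf{E}\sup_{a \in A}\frac{1}{m}\sum_i \epsilon_i (c\,a_i + a_{0,i}) = \mathbf{E}\Big[\frac{1}{m}\sum_i \epsilon_i a_{0,i}\Big] + \mathbf{E}\sup_{a \in A}\frac{1}{m}\sum_i (c\,\epsilon_i)\,a_i = 0 + |c|\,\mathcal{R}(A)$$

using $\mathbf{E}[\epsilon_i]=0$ for the translation term, and the fact that $(c\,\epsilon_1,\ldots,c\,\epsilon_m)$ has the same distribution as $(|c|\,\epsilon_1,\ldots,|c|\,\epsilon_m)$ for the scaling.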

Taking the convex hull of a set of vectors doesn't change the Rademacher complexity. This is because the supremum of a linear objective over a set doesn't change when we take it over the convex hull of the set instead.
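Spelled out: for any fixed sign vector $\epsilon$, a linear functional has the same supremum over $\text{conv}(A)$ as over $A$,

$$\sup_{t \in \text{conv}(A)} \sum_i \epsilon_i t_i = \sup_{\lambda} \sum_j \lambda_j \Big(\sum_i \epsilon_i a^{(j)}_i\Big) = \sup_{a \in A} \sum_i \epsilon_i a_i$$

where $\lambda$ ranges over convex combinations of points $a^{(j)} \in A$; taking the expectation over $\epsilon$ gives $\mathcal{R}(\text{conv}(A)) = \mathcal{R}(A)$.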

Massart lemma

Rademacher complexity contraction lemma

Rademacher complexity of linear classes

Generalization bounds for SVM

Margin-based generalization bound

Norm-based generalization bound

Together with margin-based bounds, these are used in Deep learning theory