Convergence and performance results in learning theory

cosmos 30th April 2017 at 3:58pm
Learning theory

See more detail and results at Computational learning theory. Here are just the basics.

Formally setting up a linear classifier to explore the problems of learning theory, using empirical risk minimization (ERM) as the learning principle; this also uses the definitions above.
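For reference, a minimal recap of the quantities used below, in the standard ERM notation (the superscript-(i) indexing of training examples and the symbol \mathcal{D} for the data distribution are notational assumptions, not from this note):

\hat{\epsilon}(h) = \frac{1}{m} \sum_{i=1}^m 1\{h(x^{(i)}) \neq y^{(i)}\} \quad \text{(training error on } m \text{ samples)}

\epsilon(h) = P_{(x,y) \sim \mathcal{D}}\left(h(x) \neq y\right) \quad \text{(generalization error)}

\hat{h} = \arg\min_{h \in \mathcal{H}} \hat{\epsilon}(h) \quad \text{(hypothesis output by ERM)}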

Hypothesis classes

Hypothesis class and more general def

The simple case of finite hypothesis classes

Lemmas

The union bound
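For reference, the standard statement: for any events A_1, \ldots, A_k (not necessarily independent),

P(A_1 \cup \cdots \cup A_k) \leq P(A_1) + \cdots + P(A_k)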

Hoeffding's inequality
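For reference, in the form used here: let Z_1, \ldots, Z_m be i.i.d. Bernoulli(\phi) random variables and \hat{\phi} = \frac{1}{m}\sum_{i=1}^m Z_i. Then for any \gamma > 0,

P(|\phi - \hat{\phi}| > \gamma) \leq 2 \exp(-2\gamma^2 m)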

Results for ERM for the case of finite hypothesis classes

Intro

Strategy

  1. Training error is a good approximation to generalization error: \hat{\epsilon}(h) \approx \epsilon(h) for all h (uniform convergence)
  2. This implies that minimizing training error will also do pretty well in minimizing generalization error, and this will give us a bound on the generalization error \epsilon(\hat{h}) of the hypothesis \hat{h} output by ERM.

1. Uniform convergence

–>Video. The result is a form of uniform convergence (vid).
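Spelled out, for a finite class with |\mathcal{H}| = k and any fixed \gamma > 0 (this follows from Hoeffding's inequality plus the union bound): w.p. at least 1 - 2k\exp(-2\gamma^2 m), simultaneously for all h \in \mathcal{H},

|\epsilon(h) - \hat{\epsilon}(h)| \leq \gamma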

Other formulations

Sample complexity bound. vid
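In this formulation: to guarantee |\epsilon(h) - \hat{\epsilon}(h)| \leq \gamma for all h \in \mathcal{H} w.p. at least 1-\delta, it suffices to have a training set of size

m \geq \frac{1}{2\gamma^2} \log{\frac{2k}{\delta}}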

Error bound
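In this formulation: fixing m and \delta and solving for \gamma, w.p. at least 1-\delta, for all h \in \mathcal{H},

|\epsilon(h) - \hat{\epsilon}(h)| \leq \sqrt{\frac{1}{2m} \log{\frac{2k}{\delta}}}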

2. Performance of ERM

vid

Theorem. Let |\mathcal{H}| = k (where \mathcal{H} is the hypothesis class, i.e. the set of possible hypotheses), and let any m, \delta be fixed. Then, w.p. at least 1-\delta, the generalization error satisfies

\epsilon(\hat{h}) \leq \left( \min_{h \in \mathcal{H}} \epsilon(h) \right) + 2 \sqrt{\frac{1}{2m} \log{\frac{2k}{\delta}}}

–>Relation to bias/variance tradeoff. First term can be seen as the "bias", and the second as the "variance"
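The reasoning behind the theorem (sketch): write h^* = \arg\min_{h \in \mathcal{H}} \epsilon(h) and let \gamma = \sqrt{\frac{1}{2m}\log{\frac{2k}{\delta}}}; on the uniform convergence event,

\epsilon(\hat{h}) \leq \hat{\epsilon}(\hat{h}) + \gamma \leq \hat{\epsilon}(h^*) + \gamma \leq \epsilon(h^*) + 2\gamma

using, in turn, uniform convergence applied to \hat{h}, the fact that \hat{h} minimizes training error, and uniform convergence applied to h^*.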

Corollary

For \epsilon(\hat{h}) \leq \left( \min_{h \in \mathcal{H}} \epsilon(h) \right) + 2\gamma to hold w.p. at least 1-\delta, it suffices that the training set size m \geq \frac{1}{2\gamma^2} \log{\frac{2k}{\delta}} = O\left(\frac{1}{\gamma^2} \log{\frac{k}{\delta}}\right)

Results for ERM for the case of infinite hypothesis classes

Intuition

Definition of shattering: video
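In words (standard definitions, stated here for reference): a set S = \{x^{(1)}, \ldots, x^{(d)}\} is shattered by \mathcal{H} if \mathcal{H} can realize every possible labeling of S, i.e. for any labels (y^{(1)}, \ldots, y^{(d)}) there is some h \in \mathcal{H} with h(x^{(i)}) = y^{(i)} for all i. The VC dimension VC(\mathcal{H}) is the size of the largest set shattered by \mathcal{H} (and \infty if arbitrarily large sets can be shattered).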

–>Theorem. Let a hypothesis class \mathcal{H} be given, and let its VC dimension be VC(\mathcal{H}) = d. Then, w.p. at least 1-\delta, we have that for all h \in \mathcal{H}

|\epsilon(h)-\hat{\epsilon}(h)| \leq O\left(\sqrt{\frac{d}{m}\log{\frac{m}{d}}+\frac{1}{m}\log{\frac{1}{\delta}}}\right)

where m is the size of the training set. A corollary is that, for ERM, the number of training samples needed for a given performance is roughly linear in the VC dimension of the hypothesis class (see video): the sample complexity is upper bounded by (a quantity linear in) the VC dimension, and in the worst case it is also lower bounded by it. Often, the VC dimension is roughly proportional to the number of parameters in your model. See the sketch below for a concrete illustration.
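As a concrete illustration of shattering and of the "number of parameters" heuristic (a hypothetical sketch, not from the lecture): linear classifiers h(x) = 1\{w \cdot x + b \geq 0\} in the plane have 3 parameters and VC dimension 3. The snippet below tests every labeling of a point set for (strict) linear separability via an LP feasibility problem; the point sets, the helper names, and the use of scipy are assumptions of this sketch.

```python
# Hypothetical sketch: empirically check shattering by linear classifiers in R^2.
import itertools
import numpy as np
from scipy.optimize import linprog

def separable(points, labels):
    # True if some (w, b) strictly separates the labeled points,
    # i.e. y_i (w . x_i + b) >= 1 for all i (an LP feasibility check).
    y = np.where(np.array(labels), 1.0, -1.0)
    A_ub = -y[:, None] * np.hstack([points, np.ones((len(points), 1))])
    b_ub = -np.ones(len(points))
    n_vars = points.shape[1] + 1  # w plus the bias b
    res = linprog(c=np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_vars, method="highs")
    return res.status == 0  # status 0 = feasible, a separator exists

def shattered(points):
    # True if every one of the 2^n labelings of the points is realizable.
    return all(separable(points, labels)
               for labels in itertools.product([False, True], repeat=len(points)))

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])                  # general position
four = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(shattered(three))  # True: 3 points in general position are shattered
print(shattered(four))   # False: no 4 points in the plane are shattered
```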

Application to support vector machines

Smooth loss functions as convex relaxations. Learning algorithms such as logistic regression or support vector machines (which use the hinge loss) may be viewed as approximating ERM with the 0-1 loss.
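Concretely (standard surrogate-loss facts, stated here for reference): writing the margin as z = y\,\theta^T x with y \in \{-1, +1\}, the 0-1 loss 1\{z \leq 0\} is upper bounded by the convex hinge loss and by the (rescaled) logistic loss,

1\{z \leq 0\} \leq \max(0, 1 - z) \qquad \text{and} \qquad 1\{z \leq 0\} \leq \tfrac{1}{\log 2} \log\left(1 + e^{-z}\right)

so minimizing the empirical surrogate loss minimizes an upper bound on the empirical 0-1 risk.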

Neural network theory
Deep learning theory