PAC learning with infinite concept classes

Probably approximately correct

What happens when $H$ or $C$ is infinite?

→ VC dimension

Roughly, the VC dimension plays the same role for infinite classes that $\log|H|$ plays for finite classes. It will enter an extended version of the Occam's razor theorem.
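Indeed, for a finite class $VCdim(H) \leq \log_2|H|$: shattering $m$ points requires at least $2^m$ distinct hypotheses.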

Consistent learning algorithm for half-spaces

We will use Linear programming (for finding feasible solutions) for our consistent learner.
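
As a minimal sketch of this idea (my own illustration, not the course's code; the helper name `consistent_halfspace` is my choice and it assumes scipy is available): finding a half-space consistent with a labeled sample is a linear feasibility problem in $(w, b)$, which an LP solver can handle with a zero objective.

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """Return (w, b) with y_i * (w . x_i + b) >= 1 for all i, or None.

    Finding a consistent half-space is a linear feasibility problem:
    we give linprog a zero objective and only ask for a feasible point.
    X: (m, d) array of examples; y: (m,) array of labels in {-1, +1}.
    """
    m, d = X.shape
    # y_i*(w.x_i + b) >= 1  rewritten as  -y_i*[x_i, 1] . [w, b] <= -1
    A_ub = -y[:, None] * np.hstack([X, np.ones((m, 1))])
    b_ub = -np.ones(m)
    c = np.zeros(d + 1)  # no objective: pure feasibility
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (d + 1), method="highs")
    if not res.success:
        return None  # sample is not linearly separable
    return res.x[:d], res.x[d]

# Toy usage on a linearly separable sample in the plane
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = consistent_halfspace(X, y)
print(w, b, np.sign(X @ w + b))  # predictions agree with y
```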

Is there a minimum number of examples that must be seen for PAC learning?

YES

If $C$ has VC dimension $d$, then any PAC-learning algorithm for $C$ that outputs $h \in C$ requires at least $\frac{d-1}{32\epsilon}$ examples.

The more accuracy we want, or the more complex the class, the more data we need.
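For example, with $d = 100$ and $\epsilon = 0.01$, the bound above says any such learner needs at least $\frac{99}{32 \cdot 0.01} \approx 309$ examples.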

One can't PAC-learn a class of infinite VC dimension.

Chernoff bound: let $X_1, \ldots, X_n$ be independent random variables with $X_i = 1$ w.p. $p_i$ and $X_i = 0$ w.p. $1 - p_i$. Let $X = \sum_i X_i$ and $\mu = E[X] = \sum_i p_i$. Then

$P[|X-\mu| \geq \alpha \mu] \leq 2e^{-\mu\alpha^2/3}$
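
A quick numerical sanity check of this bound (my own sketch; the choices of $n$, $p$ and $\alpha$ are arbitrary):

```python
import numpy as np

# Check P[|X - mu| >= alpha*mu] <= 2*exp(-mu*alpha^2/3) empirically,
# with X a sum of n i.i.d. Bernoulli(p) variables (so mu = n*p).
rng = np.random.default_rng(0)
n, p, alpha = 1000, 0.3, 0.2
mu = n * p

trials = 100_000
X = rng.binomial(n, p, size=trials)                # each draw is one value of X
empirical = np.mean(np.abs(X - mu) >= alpha * mu)  # empirical tail probability
bound = 2 * np.exp(-mu * alpha**2 / 3)             # Chernoff bound

print(f"empirical: {empirical:.2e}   bound: {bound:.2e}")
```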

Let $S$ be the set which gives the VC dimension $d$ (i.e. has maximal cardinality while being shattered by $C$).

$D$ is uniform over $S$.

Suppose your algorithm sees $d/10$ examples and outputs some $h$.

For each point in $S \setminus \{\text{seen examples}\}$, $h$ makes an error with probability at least $1/2$: since $S$ is shattered, the target's labels on the unseen points are unconstrained.

If the number of examples is less than the VC dimension, there's no hope of doing well in general, because the labels of the unobserved points could be anything.
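
A toy simulation of this argument (my own sketch, under the setup above: $D$ uniform over a shattered set $S$ of size $d$, a target labeling chosen uniformly at random, and a learner that memorizes the $d/10$ points it sees and has to guess the rest):

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, trials = 100, 10, 10_000          # |S| = d, learner sees m = d/10 points
errors = []
for _ in range(trials):
    target = rng.integers(0, 2, size=d)       # any labeling is realizable: S is shattered
    seen = rng.choice(d, size=m, replace=False)
    h = rng.integers(0, 2, size=d)            # on unseen points the learner can only guess
    h[seen] = target[seen]                    # it is consistent on the points it saw
    errors.append(np.mean(h != target))       # error under the uniform D over S
print(f"average error: {np.mean(errors):.3f}   (vs (1 - m/d)/2 = {(1 - m/d)/2:.3f})")
```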

See more in the book. What is the meaning of $\epsilon$ here?


Growth function
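
For reference (standard definitions, not spelled out in this note): the growth function counts the number of distinct labelings the class can realize on $m$ points,

$\Pi_H(m) = \max_{|S|=m} |\{(h(x_1), \ldots, h(x_m)) : h \in H\}|$,

and by Sauer's lemma, if $VCdim(H) = d$ then $\Pi_H(m) \leq \sum_{i=0}^{d}\binom{m}{i} = O(m^d)$; this is what replaces $|H|$ in the finite-class bounds.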


Sample complexity upper bound

See here. These are upper bounds on the minimum sample complexity needed to generalize: if our sample size is above such a bound, we know it is above the minimum, so we will generalize.

Theorem (Occam's razor, VC dimension version): Let $C$ be a concept class of VC dimension $d$, and $H$ a hypothesis class. Let $L$ be a consistent learner for $C$ using $H$. Then, $\forall n$, $\forall c \in C_n$, $\forall D$ over $X_n$, if $L$ is given $m$ examples drawn from $EX(c,D)$ such that $m \geq \frac{1}{\epsilon}\left(d\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\right)$, then $L$ outputs $h$ that, w.p. $\geq 1-\delta$, satisfies $err(h) \leq \epsilon$.

Proof: see pic. Uses an Epsilon-net (see more explanation there), and the trick of doubling the training set, which gives finite objects that allow bounding the probabilities. But I need to understand it better. See here, and here.
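
To get a feel for the numbers, here is a tiny calculator (my own sketch; it plugs values straight into the bound above and into the $\frac{d-1}{32\epsilon}$ lower bound quoted earlier, ignoring any hidden constants):

```python
from math import ceil, log

def occam_vc_upper_bound(d, eps, delta):
    """m >= (1/eps) * (d*log(1/eps) + log(1/delta)), constants ignored."""
    return ceil((d * log(1 / eps) + log(1 / delta)) / eps)

def vc_lower_bound(d, eps):
    """(d - 1) / (32*eps), the lower bound quoted earlier in the note."""
    return ceil((d - 1) / (32 * eps))

# e.g. half-spaces in R^10, whose VC dimension is 11 (a standard fact)
d, eps, delta = 11, 0.05, 0.01
print("upper bound on m:", occam_vc_upper_bound(d, eps, delta))  # ~752
print("lower bound on m:", vc_lower_bound(d, eps))               # 7
```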

Sample complexity lower bound

See here


PAC learning power changes when you relax the requirement that the algorithm should work for any distribution over the input data.

Boosting: relaxing the $\epsilon$ condition (weak learning) doesn't increase power.

One can also give better bounds for finite concept classes, I think.