Learning boosting

cosmos 6th March 2017 at 12:28pm

aka boosting

Get an ensemble of hypotheses by changing the input data to the algorithm in different and clever ways. We then have some clever decision rule that combines the classifiers.

boosting neural nets

Weak learning: Like PAC learning, but we now just require $err(h) \leq \frac{1}{2}-\gamma(n, size(c))$, for some function $\gamma$. It is efficient if it runs in poly time, hypotheses in $H$ are poly evaluatable, and $\frac{1}{\gamma}$ is poly bounded (in its arguments).
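A common concrete example of a weak learner is a decision stump (a one-feature threshold classifier). This isn't spelled out in these notes; the sketch below (names like `train_stump` are mine) just illustrates something that can often achieve weighted error $\leq \frac{1}{2}-\gamma$ under a distribution $D$ over the training set:

```python
import numpy as np

def train_stump(X, y, D):
    """Pick the (feature, threshold, sign) with lowest weighted error under D.
    X: (m, d) array, y: (m,) labels in {-1, +1}, D: (m,) weights summing to 1."""
    m, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        for thresh in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thresh, sign, -sign)
                err = D[pred != y].sum()  # weighted 0-1 error under D
                if err < best_err:
                    best_err, best = err, (j, thresh, sign)
    j, thresh, sign = best
    h = lambda Xq: np.where(Xq[:, j] <= thresh, sign, -sign)
    return h, best_err
```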

AdaBoost

(adaptive boosting)

At every iteration $t$, we have a probability distribution $D_t$ that we use to sample points from the training data and train the weak learner. At every iteration, we update the distribution: points which the algorithm got wrong become more likely, while points it got right become less likely (the explicit update is written out after the loop below).

For $t=1, \ldots, T$:

Get $h_t$ by running WL with $D_t$.

At every iteration, we get a hypothesis.
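For reference, the standard AdaBoost reweighting step that implements the description above (the explicit formula isn't written out in these notes) is

$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_{t+1}}$, where $Z_{t+1} = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$ is the normalizer, $\epsilon_t = \sum_i D_t(i)\, 1(h_t(x_i) \neq y_i)$ is the weighted error of $h_t$ under $D_t$, and $\alpha_t$ is chosen below (to minimize $Z_{t+1}$).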

Our final decision rule is obtained from a linear combination, weighted by coefficients $\alpha_t$ that take into account how many points each hypothesis got wrong: $g(x)=\text{sign}\left(\sum\limits_{t=1}^T \alpha_t h_t(x)\right)$. Let $H=\sum\limits_{t=1}^T \alpha_t h_t(\cdot)$.
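Putting the loop, the reweighting and the final sign-of-weighted-sum together, a minimal sketch (assuming a `weak_learner(X, y, D)` in the shape of the stump above, labels in $\{-1,+1\}$, and $0 < \epsilon_t < \frac{1}{2}$):

```python
import numpy as np

def adaboost(X, y, weak_learner, T):
    """weak_learner(X, y, D) -> (h, eps): a hypothesis h and its weighted error eps.
    Returns g(x) = sign(sum_t alpha_t h_t(x))."""
    m = X.shape[0]
    D = np.full(m, 1.0 / m)                    # D_1: uniform over the training set
    hs, alphas = [], []
    for t in range(T):
        h, eps = weak_learner(X, y, D)         # weighted error eps_t = 1/2 - gamma_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # alpha_t that minimizes Z_{t+1}
        D = D * np.exp(-alpha * y * h(X))      # up-weight mistakes, down-weight correct points
        D = D / D.sum()                        # renormalize (divide by Z_{t+1})
        hs.append(h)
        alphas.append(alpha)
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in zip(alphas, hs)))
```

Usage would look like `g = adaboost(X, y, train_stump, T=50)`, then `g(X_test)` gives $\pm 1$ predictions.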

Training error: $\sum_{i=1}^m D_1(i)\, 1(g(x_i)\neq y_i) \leq \sum_i D_1(i)\, e^{-y_i H(x_i)} = Z_2 \cdots Z_{T+1}$ (see notes, and photo). If the $Z$s are less than 1, then it doesn't take many iterations for the training error to become small.

$H_t(x)=\sum_{s=t}^T \alpha_s h_s(x)$
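One way to get the product form above (the standard argument, presumably what the photo shows): unrolling the update gives $D_{T+1}(i) = \frac{D_1(i)\, e^{-y_i H(x_i)}}{Z_2 \cdots Z_{T+1}}$, so $\sum_i D_1(i)\, e^{-y_i H(x_i)} = Z_2 \cdots Z_{T+1} \sum_i D_{T+1}(i) = Z_2 \cdots Z_{T+1}$; and $1(g(x_i)\neq y_i) \leq e^{-y_i H(x_i)}$ because $g(x_i)\neq y_i$ exactly when $y_i H(x_i) \leq 0$.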

$Z_{t+1}=e^{-\alpha_t}(1-\epsilon_t)+e^{\alpha_t}\epsilon_t$

$=2\sqrt{\epsilon_t (1-\epsilon_t)}$ (if we minimize this over $\alpha_t$).

$\leq e^{-2\gamma^2}$, using $\gamma_t \geq \gamma$ (definition of WL). (Note $\epsilon_t = \frac{1}{2}-\gamma_t$.)
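Filling in the two steps above (standard calculation, my addition): setting $\frac{dZ_{t+1}}{d\alpha_t} = 0$ gives $e^{\alpha_t}\epsilon_t = e^{-\alpha_t}(1-\epsilon_t)$, i.e. $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$, and plugging back in gives $Z_{t+1} = 2\sqrt{\epsilon_t(1-\epsilon_t)}$. With $\epsilon_t = \frac{1}{2}-\gamma_t$ this is $\sqrt{1-4\gamma_t^2} \leq e^{-2\gamma_t^2}$ (using $1-x \leq e^{-x}$), which is $\leq e^{-2\gamma^2}$.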

So the training error $\leq Z_2 \cdots Z_{T+1} \leq e^{-2T\gamma^2}$.
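A useful consequence (not stated above, but standard): with $D_1$ uniform the training error is a multiple of $\frac{1}{m}$, so once $e^{-2T\gamma^2} < \frac{1}{m}$, i.e. $T > \frac{\ln m}{2\gamma^2}$, the training error is exactly 0.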

It increases the weight of examples which are classified incorrectly many times, and so has problems with noise in the labels.

See paper on website. The idea of multiplicative weights is quite widely applicable.

Generalization... VC dimension?..