Probably approximately correct

Computational learning theory

aka PAC

The main learning framework in Computational learning theory; it studies the conditions under which an algorithm can perform learning tasks with confidence (a high probability of success) and accuracy (a low probability of error). See more at Learning theory. See also PAC book

notes, nice lectures

An informal set of notes and comments follows. See Learning theory for more rigorous formalizations, and further generalizations of the theory:

Assume the training data comes from a distribution $D$. Test data is drawn from the same distribution.

Example:

We are given points which are labeled as green and red. We are also told that the green points are all sampled from inside a given square (the outer square in the picture below). We consider the algorithm that uses the minimum square that bounds the green points (the inner square in the picture) to classify future points.

Our errors come from the blue region where the points are really green, but we classify them as red.

(Figure: the true outer square, the learned inner square bounding the green points, and the error region between them.)

Consider a set of four regions along the edges of the true square (i.e. regions of the form $x > x_0$), each with probability mass $\epsilon/4$.

If there is at least one training point in each of these regions, then we know that the probability of error is $< \epsilon$.

The probability that at least one of these regions contains no training point is

$P \leq 4\exp{\left(\frac{-m\epsilon}{4}\right)}$ (each region is missed by all $m$ points with probability $(1-\epsilon/4)^m \leq \exp(-m\epsilon/4)$, and we apply the union bound below over the four regions). We want this to be small, because if some region contains no training point, then our error can be greater than $\epsilon$.

Union bound: $P(A\cup B) \leq P(A)+P(B)$

Learning: with probability at least $1-\delta$, our classifier has error $\leq \epsilon$, provided $m\geq \frac{4}{\epsilon}\log{\frac{4}{\delta}}$.

For hyperrectangles in $n$ dimensions, the $4$s in the expression above become $2n$.
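To make this concrete, here is a minimal simulation sketch (not from the notes; the concept, distribution, and parameters are arbitrary choices for illustration) of the bounding-box learner for axis-aligned hyperrectangles. It draws the sample size from the bound above and estimates the error of the learned box by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (all choices arbitrary): the true concept is the unit
# hypercube [0,1]^n inside the instance space [-1,2]^n, and D is uniform.
n = 2

def true_concept(X):
    return np.all((X >= 0.0) & (X <= 1.0), axis=-1)

def sample(m):
    return rng.uniform(-1.0, 2.0, size=(m, n))

def learn_bounding_box(X, y):
    """Tightest axis-aligned box around the positive training points."""
    pos = X[y]
    if len(pos) == 0:                        # no positives seen: predict all-negative
        return np.full(n, np.inf), np.full(n, -np.inf)
    return pos.min(axis=0), pos.max(axis=0)

def predict(box, X):
    lo, hi = box
    return np.all((X >= lo) & (X <= hi), axis=-1)

eps, delta = 0.1, 0.05
# Sample size suggested by the bound, with the 4s replaced by 2n.
m = int(np.ceil((2 * n / eps) * np.log(2 * n / delta)))

X_train = sample(m)
box = learn_bounding_box(X_train, true_concept(X_train))

# Estimate err(h) = P_{x~D}[c(x) != h(x)] on a large fresh sample.
X_test = sample(200_000)
err = np.mean(predict(box, X_test) != true_concept(X_test))
print(f"m = {m}, estimated error = {err:.4f} (target eps = {eps})")
```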

$\delta$: confidence parameter. The sample complexity always has a logarithmic dependence on $1/\delta$.

$\epsilon$: accuracy parameter. The dependence on $1/\epsilon$ can vary from problem to problem.

We want learning to be both statistically and computationally efficient (computational efficiency mostly meaning runtime).

Instance space

$X$ is the set of all possible instances/examples/observations.

Concept class

A concept over $X$ is a Boolean function $c: X \rightarrow \{0,1\}$. A concept class is a collection of concepts.

Distribution: we assume a distribution $D$ over $X$.

Once a target concept $c \in C$ and a distribution $D$ are fixed, any classifier/hypothesis $h:X \rightarrow \{0,1\}$ has error $err(h;c,D) = P_{x\sim D}[c(x) \neq h(x)]$ (zero-one error).
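As a small illustration (the particular concept, hypothesis, and distribution below are made-up assumptions, not from the notes), the zero-one error can be estimated by Monte Carlo sampling from $D$:

```python
import random

def estimate_error(c, h, sample_from_D, num_samples=100_000):
    """Monte Carlo estimate of err(h; c, D) = P_{x~D}[c(x) != h(x)]."""
    disagreements = sum(c(x) != h(x) for x in (sample_from_D() for _ in range(num_samples)))
    return disagreements / num_samples

# Toy example: X = {0,1}^3, D uniform, target c = AND of the first two bits,
# hypothesis h = first bit.
sample_from_D = lambda: tuple(random.randint(0, 1) for _ in range(3))
c = lambda x: x[0] & x[1]
h = lambda x: x[0]
print(estimate_error(c, h, sample_from_D))   # ~0.25: they disagree iff x[0]=1 and x[1]=0
```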

Probably approximately correct (take 1): Let $C$ be a concept class over $X$. We say that $C$ is PAC-learnable if there exists an algorithm $L$ such that $\forall c \in C$, $\forall D$ over $X$, $\forall\, 0< \epsilon < 1/2$, $\forall\, 0<\delta < 1/2$: if $L$ is given access to training data and the inputs $\epsilon$, $\delta$, it outputs $h\in C$ such that, with probability $\geq 1-\delta$, $err(h) \leq \epsilon$.

Efficiently learnable: as above, but $L$ must run in time polynomial in $1/\epsilon$ and $1/\delta$.

Sometimes we want to bound the amount of data too!

The complexity of the concepts in the concept class affects learnability:

  • Polygons: the more vertices, the more numbers that need to be specified.
  • Boolean functions $f: \{0,1\}^n \rightarrow \{0,1\}$. Can be represented with:
    • a truth table: $2^n$ bits.
    • a Boolean circuit, using ANDs and ORs.
    • disjunctive normal form (DNF): a disjunction of conjunctions, e.g. $(X_1 \wedge X_2 \wedge X_3) \vee (X_1 \wedge X_2) \vee \ldots \vee (X_2 \wedge X_3)$ (see the evaluation sketch after this list).
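To make the DNF representation concrete, here is a minimal sketch (the representation choices are mine, not from the notes) that evaluates a DNF given as a list of terms, where each term is a list of (variable index, required value) literals:

```python
# A DNF formula as a list of terms; each term is a list of literals
# (variable index, required value). The example encodes
# (x0 AND x1 AND x2) OR (x0 AND NOT x3).
dnf = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 0)],
]

def eval_dnf(dnf, x):
    """True iff some term has all of its literals satisfied by x."""
    return any(all(x[i] == v for i, v in term) for term in dnf)

print(eval_dnf(dnf, [1, 1, 1, 1]))  # True  (first term satisfied)
print(eval_dnf(dnf, [1, 0, 1, 0]))  # True  (second term satisfied)
print(eval_dnf(dnf, [0, 1, 1, 0]))  # False (no term satisfied)
```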

A representation scheme for a concept class $C$ is a map $R: \Sigma^k \rightarrow C$.

We define a function $size: \Sigma^k \rightarrow \mathbb{N}$. We can then infer another function:

$size(c) = \min\limits_{\sigma:R(\sigma)=c} size(\sigma)$.
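A quick worked illustration (my example, not from the notes): let $R$ map DNF formulas, written as strings over a suitable alphabet, to the Boolean functions they compute, and let $size(\sigma)$ be the string length.

```latex
% Under this scheme, size(c) is the length of the shortest DNF computing c:
\[
  size(c) \;=\; \min_{\sigma : R(\sigma) = c} size(\sigma)
  \;=\; \text{length of the shortest DNF formula for } c .
\]
% A single conjunction has size O(n), while the parity of n bits needs
% 2^{n-1} terms of width n, so its size is \Theta(n\,2^{n}).
```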

Compare Kolmogorov complexity.

Instance size

$X_n$ is the set of instances of size $n$, and $X=\cup_n X_n$.

PAC (take 2): Let $C_n$ be a set of concepts over $X_n$. Let $X=\cup_n X_n$ and $C=\cup_n C_n$. We say that $C$ is PAC-learnable if there exists an algorithm $L$ such that $\forall n\in \mathbb{N}$, $\forall c\in C_n$, $\forall D$ over $X_n$, $\forall\, 0<\epsilon<1/2$, $\forall\, 0<\delta<1/2$: if $L$ is given access to the example oracle $EX(c,D)$ and the inputs $\epsilon$, $\delta$, $size(c)$, then $L$ outputs $h \in C$ such that, with probability $\geq 1-\delta$, $err(h) \leq \epsilon$.

Efficiently learnable: same as before, but the running time must also be polynomial in $size(c)$ (and $n$).

(Basically we want it to be polynomial in anything it can depend on that we reasonably expect could be a large number.)


Example: Learning Boolean conjunctions
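A hedged sketch of the standard approach (the code details are mine): conjunctions are efficiently PAC learnable via the classical elimination algorithm, which starts from the conjunction of all $2n$ literals and deletes every literal falsified by a positive example.

```python
def learn_conjunction(examples):
    """Classical elimination algorithm for Boolean conjunctions.

    examples: list of (x, label) with x a tuple in {0,1}^n and label in {0,1}.
    Returns the hypothesis as a set of literals (i, v), meaning x[i] must equal v.
    """
    n = len(examples[0][0])
    # Start with every literal: x_i and NOT x_i for each variable.
    hypothesis = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label == 1:
            # Delete every literal the positive example falsifies.
            hypothesis = {(i, v) for (i, v) in hypothesis if x[i] == v}
    return hypothesis

def predict_conjunction(hypothesis, x):
    return int(all(x[i] == v for i, v in hypothesis))

# Toy target: x0 AND NOT x2 (over 3 variables); a few labelled examples.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
h = learn_conjunction(data)
print(h)                                   # {(0, 1), (2, 0)}, i.e. x0 AND NOT x2
print(predict_conjunction(h, (1, 1, 0)))   # 1
```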

E.g. Learning 3-term DNFs

They are not efficiently PAC learnable using 3-term DNF hypotheses (unless RP = NP).


Learning 3-CNFs (conjunctive normal form)

They are efficiently PAC learnable: every 3-term DNF can be rewritten as a 3-CNF (by distributing the OR over the terms), and a 3-CNF over $n$ variables is just a conjunction over the $O(n^3)$ possible clauses, so the conjunction learner applies (see the reduction sketch below).
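A hedged sketch of that reduction (the feature encoding and the toy example are my choices): expand each instance into one Boolean feature per possible clause of at most 3 literals, then run the elimination algorithm over those clause-features.

```python
from itertools import combinations, product

def clause_features(x, n):
    """Map x in {0,1}^n to one feature per possible clause of <= 3 literals:
    the feature is 1 iff that clause (an OR of literals) is satisfied by x."""
    feats = []
    for k in (1, 2, 3):
        for idxs in combinations(range(n), k):
            for signs in product((0, 1), repeat=k):
                feats.append(int(any(x[i] == s for i, s in zip(idxs, signs))))
    return tuple(feats)

def learn_3cnf(examples, n):
    """Learn a 3-CNF as a conjunction over clause-features (elimination algorithm)."""
    num_clauses = len(clause_features(examples[0][0], n))
    kept = set(range(num_clauses))                  # start with all clauses
    for x, label in examples:
        if label == 1:
            f = clause_features(x, n)
            kept = {j for j in kept if f[j] == 1}   # drop clauses violated by a positive
    return kept

def predict_3cnf(kept, x, n):
    f = clause_features(x, n)
    return int(all(f[j] == 1 for j in kept))

# Toy target: (x0 OR x1) AND (NOT x2), over n = 3 variables.
data = [((1, 0, 0), 1), ((0, 1, 0), 1), ((1, 1, 0), 1), ((0, 0, 0), 0), ((1, 1, 1), 0)]
kept = learn_3cnf(data, 3)
print(predict_3cnf(kept, (1, 1, 0), 3))  # 1
print(predict_3cnf(kept, (1, 0, 1), 3))  # 0
```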

PAC learning (take 3): We say that a concept class $C$ is PAC-learnable using hypothesis class $H$ if there is an algorithm $L$ such that $\forall n \geq 0$, $\forall c \in C_n$, $\forall D$ over $X_n$, $\forall\, 0 < \epsilon < 1/2$, $\forall\, 0 < \delta < 1/2$: given access to $EX(c,D)$ and the inputs $size(c)$, $\epsilon$, $\delta$, $L$ outputs $h\in H_n$ that, with probability at least $1-\delta$, satisfies $err(h) \leq \epsilon$.

We say that $C$ is efficiently PAC-learnable if the running time of $L$ is polynomial in $n$, $size(c)$, $1/\epsilon$, $1/\delta$, and $H$ is polynomially evaluatable.

$H$ is polynomially evaluatable if there is an algorithm $A$, running in time polynomial in $size(h)$ and $n$, such that $\forall n$, $\forall h \in H_n$, $\forall x \in X_n$: on input $h$, $x$, it outputs $h(x)$.

So in the example above, the problem is efficiently learnable using the hypothesis class $H$ of 3-CNFs, but not using that of 3-term DNFs!


Explanatory models

notes

Given $(\vec{x}_1,y_1),(\vec{x}_2,y_2),\ldots,(\vec{x}_m, y_m)$

with $\vec{x}_i \in \{0,1\}^n$, $y_i \in \{0,1\}$.

Find a hypothesis that is consistent with this data.

We could give a hypothesis that just memorizes all of the data and predicts $0$ on all other inputs.

Occam's razor

Find the simplest possible explanation.

Simplest? shortest explanation?

We will focus on "short enough" being good (since the shortest explanation is not computable). If the description grows linearly with the number of observations, that's bad; we want it to grow more slowly than that!

Consistent learner: We say that $L$ is a consistent learner for concept class $C$, using hypothesis class $H$, if $\forall n$, $\forall c \in C_n$, $\forall m$: given $(x_1, c(x_1)), \ldots, (x_m, c(x_m))$ as input, the algorithm $L$ outputs $h \in H_n$ such that $h(x_i)=c(x_i)$ for all $i=1, \ldots, m$.

We say $L$ is efficient if its running time is polynomial in $n$, $m$, and $size(c)$.


Theorem (Occam's razor, cardinality version): Let $C$ be a concept class and $H$ a hypothesis class. Let $L$ be a consistent learner for $C$ using $H$. Then, $\forall n$, $\forall c \in C_n$, $\forall D$ over $X_n$: if $L$ is given $m$ examples drawn from $EX(c,D)$ with $m \geq \frac{1}{\epsilon} (\log{|H_n|} + \log{(1/\delta)})$, then $L$ outputs $h$ that, with probability $\geq 1-\delta$, satisfies $err(h) \leq \epsilon$.

Similar to theorems in Learning theory, VC dimension, empirical risk minimization, etc.

Proof: see the photos below (see also this nice lecture).

Proof idea (the resulting bound is worked out just below):

  1. The probability that a fixed bad hypothesis (error $> \epsilon$) looks good on the sample (and so may be picked by the algorithm) is small: it is $\leq (1-\epsilon)^m$.
  2. Take the union bound over all $h$ in the hypothesis class $H$, giving a factor of the size of the class, $|H|$.
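Putting the two steps together (a standard calculation, written out here for completeness; natural logarithms assumed):

```latex
\[
  \Pr\big[\exists\, h \in H_n \ \text{with}\ err(h) > \epsilon \ \text{consistent with all } m \text{ examples}\big]
  \;\leq\; |H_n|\,(1-\epsilon)^m
  \;\leq\; |H_n|\, e^{-\epsilon m} .
\]
% Requiring the right-hand side to be at most \delta and solving for m gives
\[
  |H_n|\, e^{-\epsilon m} \;\leq\; \delta
  \quad\Longleftrightarrow\quad
  m \;\geq\; \frac{1}{\epsilon}\left(\log |H_n| + \log \frac{1}{\delta}\right).
\]
```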

The size of the class of conjunctions is about $3^n$ (each variable appears positively, negatively, or not at all).

The size of the class of 3-term DNFs is $\leq 2^{6n}$, and for 3-CNFs it is $\geq 2^{n^3}$.

—> So learning 3-term DNF with unbounded computation requires far fewer training examples than learning via the efficiently learnable 3-CNFs!.. Nice (plugging the class sizes into the Occam bound, below, makes this concrete).
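Plugging the class sizes above into the Occam bound (my arithmetic, ignoring constants):

```latex
% 3-term DNF hypotheses:  \log|H_n| = O(n)    =>  m = O\big((n + \log(1/\delta))/\epsilon\big)
% 3-CNF hypotheses:       \log|H_n| = O(n^3)  =>  m = O\big((n^3 + \log(1/\delta))/\epsilon\big)
\[
  m_{\text{3-term DNF}} = O\!\left(\frac{n + \log(1/\delta)}{\epsilon}\right)
  \qquad\text{vs.}\qquad
  m_{\text{3-CNF}} = O\!\left(\frac{n^3 + \log(1/\delta)}{\epsilon}\right).
\]
```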

photo1

photo2


PAC learning with infinite concept classes

–> VC dimension (tight; characterizes the minimax risk under certain realizability assumptions),

Rademacher complexity for PAC learning with continuous functions, and also tighter bounds by making the bounds data-dependent!

Boosting

Convert weak learners to strong learners.

Get an ensemble of hypotheses by changing the input data fed to the algorithm in different, clever ways (e.g. reweighting the examples). We then combine the resulting classifiers with a clever decision rule (a sketch follows below).
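As a hedged illustration of this idea (AdaBoost with single-feature stumps is one standard instantiation; the details and the toy target below are my choices, not from the notes):

```python
import numpy as np

def adaboost(X, y, num_rounds=30):
    """Minimal AdaBoost sketch with single-feature stumps as weak learners.

    X: (m, n) array with entries in {0,1}; y: (m,) array with labels in {-1,+1}.
    Returns a list of ((feature, sign), alpha) pairs.
    """
    m, n = X.shape
    w = np.full(m, 1.0 / m)                       # example weights
    ensemble = []
    for _ in range(num_rounds):
        # Weak learner: pick the stump h(x) = sign * (2 x_j - 1) with lowest weighted error.
        best = None
        for j in range(n):
            for sign in (+1, -1):
                pred = sign * (2 * X[:, j] - 1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, (j, sign), pred)
        err, stump, pred = best
        err = np.clip(err, 1e-12, 1 - 1e-12)      # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak hypothesis
        w = w * np.exp(-alpha * y * pred)         # upweight the misclassified examples
        w = w / w.sum()
        ensemble.append((stump, alpha))
    return ensemble

def predict_ensemble(ensemble, X):
    """Weighted-majority decision rule over the weak hypotheses."""
    score = sum(alpha * sign * (2 * X[:, j] - 1) for (j, sign), alpha in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy usage: the target is the majority of the first three coordinates.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = np.where(X[:, :3].sum(axis=1) >= 2, 1, -1)
ens = adaboost(X, y)
print("training accuracy:", np.mean(predict_ensemble(ens, X) == y))
```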

Agnostic learning and more

See Computational learning theory for variations on the basic PAC learning framework.

Uniform convergence is the main tool for proving results about PAC learnability.

PAC learning with noise

See https://arxiv.org/pdf/math/0702683.pdf and http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/


Learning band-limited Boolean functions

See Analysis of Boolean functions

See this video for how to learn Boolean functions whose Fourier coefficients are concentrated on a few terms.

Similar result to Occam algorithms
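A hedged sketch of the general idea (estimate each Fourier coefficient $\hat{f}(S) = \mathbb{E}_x[f(x)\chi_S(x)]$ from samples and keep only the large, low-degree ones; the toy function, thresholds, and sample sizes below are my choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4

def f(x):
    """Toy target with Fourier mass concentrated on one low-degree term: x0 XOR x1."""
    return x[0] ^ x[1]

# Work in the +/-1 convention: F(x) = (-1)^{f(x)}, chi_S(x) = prod_{i in S} (-1)^{x_i}.
def chi(S, x):
    return (-1) ** sum(x[i] for i in S)

# Estimate hat{F}(S) = E_x[F(x) chi_S(x)] for all S of size <= degree, from samples.
degree, m, threshold = 2, 2000, 0.3
samples = [tuple(rng.integers(0, 2, size=n)) for _ in range(m)]
coeffs = {}
for k in range(degree + 1):
    for S in itertools.combinations(range(n), k):
        est = np.mean([(-1) ** f(x) * chi(S, x) for x in samples])
        if abs(est) >= threshold:
            coeffs[S] = est          # keep only the large (concentrated) coefficients

def h(x):
    """Hypothesis: sign of the truncated Fourier expansion, mapped back to {0,1}."""
    return 0 if sum(c * chi(S, x) for S, c in coeffs.items()) >= 0 else 1

test = [tuple(rng.integers(0, 2, size=n)) for _ in range(1000)]
print("kept coefficients:", coeffs)
print("error:", np.mean([h(x) != f(x) for x in test]))
```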


Under the discrete cube root assumption, the class of circuits of logarithmic depth is not efficiently PAC-learnable.