Probably approximately correct

Computational learning theory

aka PAC

The main learning framework in Computational learning theory; it studies the conditions under which an algorithm can perform learning tasks with confidence (a high probability of success) and accuracy (a low probability of error). See more at Learning theory. See also PAC book

notes, nice lectures

An informal set of notes and comments follows. See Learning theory for more rigorous formalizations, and further generalizations of the theory:

Assume the training data comes from a distribution $D$. Test data is drawn from the same distribution.

Example:

We are given points which are labeled as green and red. We are also told that the green points are all sampled from inside a given square (the outer square in the picture below). We consider the algorithm that uses the minimum square that bounds the green points (the inner square in the picture) to classify future points.

Our errors come from the blue region where the points are really green, but we classify them as red.

(Figure: the true outer square, the learned inner square bounding the green points, and the error region between them.)

Consider a set of four regions along the edges of the true square (i.e. regions of the form $x > x_0$), each with probability mass $\epsilon/4$.

If there is at least one training point in each of these regions, then we know that the probability of error is $< \epsilon$.

The probability that at least one of these regions contains no training point is

$P \leq 4\exp{\left(\frac{-m\epsilon}{4}\right)}$ (each region is missed by all $m$ points with probability $(1-\epsilon/4)^m \leq \exp(-m\epsilon/4)$, and we apply the union bound below over the four regions). We want this to be small, because if some region contains no training point, then our error can be greater than $\epsilon$.

Union bound: $P(A\cup B) \leq P(A)+P(B)$

Learning: with probability at least $1-\delta$, our classifier has error $\leq \epsilon$, provided $m\geq \frac{4}{\epsilon}\log{\frac{4}{\delta}}$.

For hyperrectangles in $n$ dimensions, the $4$s in the expression above become $2n$.
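To make this concrete, here is a minimal simulation sketch (not from the notes; the concept, distribution, and parameters are arbitrary choices for illustration) of the bounding-box learner for axis-aligned hyperrectangles. It draws the sample size from the bound above and estimates the error of the learned box by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (all choices arbitrary): the true concept is the unit
# hypercube [0,1]^n inside the instance space [-1,2]^n, and D is uniform.
n = 2

def true_concept(X):
    return np.all((X >= 0.0) & (X <= 1.0), axis=-1)

def sample(m):
    return rng.uniform(-1.0, 2.0, size=(m, n))

def learn_bounding_box(X, y):
    """Tightest axis-aligned box around the positive training points."""
    pos = X[y]
    if len(pos) == 0:                        # no positives seen: predict all-negative
        return np.full(n, np.inf), np.full(n, -np.inf)
    return pos.min(axis=0), pos.max(axis=0)

def predict(box, X):
    lo, hi = box
    return np.all((X >= lo) & (X <= hi), axis=-1)

eps, delta = 0.1, 0.05
# Sample size suggested by the bound, with the 4s replaced by 2n.
m = int(np.ceil((2 * n / eps) * np.log(2 * n / delta)))

X_train = sample(m)
box = learn_bounding_box(X_train, true_concept(X_train))

# Estimate err(h) = P_{x~D}[c(x) != h(x)] on a large fresh sample.
X_test = sample(200_000)
err = np.mean(predict(box, X_test) != true_concept(X_test))
print(f"m = {m}, estimated error = {err:.4f} (target eps = {eps})")
```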

$\delta$: confidence parameter. The sample complexity always has a logarithmic dependence on $1/\delta$.

$\epsilon$: accuracy parameter. The dependence on $1/\epsilon$ can vary from problem to problem.

We want learning to be both statistically and computationally efficient (computational efficiency mostly meaning runtime).

Instance space

$X$ is the set of all possible instances/examples/observations.

Concept class

A concept over $X$ is a Boolean function $c: X \rightarrow \{0,1\}$. A concept class is a collection of concepts.

Distribution: we assume a distribution $D$ over $X$.

Once a target concept $c \in C$ and a distribution $D$ are fixed, any classifier/hypothesis $h:X \rightarrow \{0,1\}$ has error $err(h;c,D) = P_{x\sim D}[c(x) \neq h(x)]$ (zero-one error).
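As a small illustration (the particular concept, hypothesis, and distribution below are made-up assumptions, not from the notes), the zero-one error can be estimated by Monte Carlo sampling from $D$:

```python
import random

def estimate_error(c, h, sample_from_D, num_samples=100_000):
    """Monte Carlo estimate of err(h; c, D) = P_{x~D}[c(x) != h(x)]."""
    disagreements = sum(c(x) != h(x) for x in (sample_from_D() for _ in range(num_samples)))
    return disagreements / num_samples

# Toy example: X = {0,1}^3, D uniform, target c = AND of the first two bits,
# hypothesis h = first bit.
sample_from_D = lambda: tuple(random.randint(0, 1) for _ in range(3))
c = lambda x: x[0] & x[1]
h = lambda x: x[0]
print(estimate_error(c, h, sample_from_D))   # ~0.25: they disagree iff x[0]=1 and x[1]=0
```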

Probably approximately correct (take 1): Let $C$ be a concept class over $X$. We say that $C$ is PAC-learnable if there exists an algorithm $L$ such that $\forall c \in C$, $\forall D$ over $X$, $\forall\, 0< \epsilon < 1/2$, $\forall\, 0<\delta < 1/2$: if $L$ is given access to training data and the inputs $\epsilon$, $\delta$, it outputs $h\in C$ such that, with probability $\geq 1-\delta$, $err(h) \leq \epsilon$.

Efficiently learnable: as above, but $L$ must run in time polynomial in $1/\epsilon$ and $1/\delta$.

Sometimes we want to bound the amount of data too!

The complexity of the concepts in the concept class affects learnability:

  • Polygons: the more vertices, the more numbers that need to be specified.
  • Boolean functions $f: \{0,1\}^n \rightarrow \{0,1\}$. Can be represented with:
    • a truth table: $2^n$ bits.
    • a Boolean circuit, using ANDs and ORs.
    • disjunctive normal form (DNF): a disjunction of conjunctions, e.g. $(X_1 \wedge X_2 \wedge X_3) \vee (X_1 \wedge X_2) \vee \ldots \vee (X_2 \wedge X_3)$ (see the evaluation sketch after this list).
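To make the DNF representation concrete, here is a minimal sketch (the representation choices are mine, not from the notes) that evaluates a DNF given as a list of terms, where each term is a list of (variable index, required value) literals:

```python
# A DNF formula as a list of terms; each term is a list of literals
# (variable index, required value). The example encodes
# (x0 AND x1 AND x2) OR (x0 AND NOT x3).
dnf = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 0)],
]

def eval_dnf(dnf, x):
    """True iff some term has all of its literals satisfied by x."""
    return any(all(x[i] == v for i, v in term) for term in dnf)

print(eval_dnf(dnf, [1, 1, 1, 1]))  # True  (first term satisfied)
print(eval_dnf(dnf, [1, 0, 1, 0]))  # True  (second term satisfied)
print(eval_dnf(dnf, [0, 1, 1, 0]))  # False (no term satisfied)
```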

A representation scheme for a concept class $C$ is a map $R: \Sigma^k \rightarrow C$.

We define a function $size: \Sigma^k \rightarrow \mathbb{N}$. We can then infer another function:

$size(c) = \min\limits_{\sigma:R(\sigma)=c} size(\sigma)$.
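A quick worked illustration (my example, not from the notes): let $R$ map DNF formulas, written as strings over a suitable alphabet, to the Boolean functions they compute, and let $size(\sigma)$ be the string length.

```latex
% Under this scheme, size(c) is the length of the shortest DNF computing c:
\[
  size(c) \;=\; \min_{\sigma : R(\sigma) = c} size(\sigma)
  \;=\; \text{length of the shortest DNF formula for } c .
\]
% A single conjunction has size O(n), while the parity of n bits needs
% 2^{n-1} terms of width n, so its size is \Theta(n\,2^{n}).
```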

Compare Kolmogorov complexity.

Instance size

$X_n$ is the set of instances of size $n$, and $X=\cup_n X_n$.

PAC (take 2): Let $C_n$ be a set of concepts over $X_n$. Let $X=\cup_n X_n$ and $C=\cup_n C_n$. We say that $C$ is PAC-learnable if there exists an algorithm $L$ such that $\forall n\in \mathbb{N}$, $\forall c\in C_n$, $\forall D$ over $X_n$, $\forall\, 0<\epsilon<1/2$, $\forall\, 0<\delta<1/2$: if $L$ is given access to the example oracle $EX(c,D)$ and the inputs $\epsilon$, $\delta$, $size(c)$, then $L$ outputs $h \in C$ such that, with probability $\geq 1-\delta$, $err(h) \leq \epsilon$.

Efficiently learnable: same as before, but the running time must also be polynomial in $size(c)$ (and $n$).

(Basically we want it to be polynomial in anything it can depend on that we reasonably expect could be a large number.)


Example: Learning Boolean conjunctions
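A hedged sketch of the standard approach (the code details are mine): conjunctions are efficiently PAC learnable via the classical elimination algorithm, which starts from the conjunction of all $2n$ literals and deletes every literal falsified by a positive example.

```python
def learn_conjunction(examples):
    """Classical elimination algorithm for Boolean conjunctions.

    examples: list of (x, label) with x a tuple in {0,1}^n and label in {0,1}.
    Returns the hypothesis as a set of literals (i, v), meaning x[i] must equal v.
    """
    n = len(examples[0][0])
    # Start with every literal: x_i and NOT x_i for each variable.
    hypothesis = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label == 1:
            # Delete every literal the positive example falsifies.
            hypothesis = {(i, v) for (i, v) in hypothesis if x[i] == v}
    return hypothesis

def predict_conjunction(hypothesis, x):
    return int(all(x[i] == v for i, v in hypothesis))

# Toy target: x0 AND NOT x2 (over 3 variables); a few labelled examples.
data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 0, 1), 0)]
h = learn_conjunction(data)
print(h)                                   # {(0, 1), (2, 0)}, i.e. x0 AND NOT x2
print(predict_conjunction(h, (1, 1, 0)))   # 1
```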

E.g. Learning 3-term DNFs

They are not efficiently PAC learnable using 3-term DNF hypotheses (unless RP = NP).


Learning 3-CNFs (conjunctive normal form)

They are efficiently PAC learnable: every 3-term DNF can be rewritten as a 3-CNF (by distributing the OR over the terms), and a 3-CNF over $n$ variables is just a conjunction over the $O(n^3)$ possible clauses, so the conjunction learner applies (see the reduction sketch below).
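A hedged sketch of that reduction (the feature encoding and the toy example are my choices): expand each instance into one Boolean feature per possible clause of at most 3 literals, then run the elimination algorithm over those clause-features.

```python
from itertools import combinations, product

def clause_features(x, n):
    """Map x in {0,1}^n to one feature per possible clause of <= 3 literals:
    the feature is 1 iff that clause (an OR of literals) is satisfied by x."""
    feats = []
    for k in (1, 2, 3):
        for idxs in combinations(range(n), k):
            for signs in product((0, 1), repeat=k):
                feats.append(int(any(x[i] == s for i, s in zip(idxs, signs))))
    return tuple(feats)

def learn_3cnf(examples, n):
    """Learn a 3-CNF as a conjunction over clause-features (elimination algorithm)."""
    num_clauses = len(clause_features(examples[0][0], n))
    kept = set(range(num_clauses))                  # start with all clauses
    for x, label in examples:
        if label == 1:
            f = clause_features(x, n)
            kept = {j for j in kept if f[j] == 1}   # drop clauses violated by a positive
    return kept

def predict_3cnf(kept, x, n):
    f = clause_features(x, n)
    return int(all(f[j] == 1 for j in kept))

# Toy target: (x0 OR x1) AND (NOT x2), over n = 3 variables.
data = [((1, 0, 0), 1), ((0, 1, 0), 1), ((1, 1, 0), 1), ((0, 0, 0), 0), ((1, 1, 1), 0)]
kept = learn_3cnf(data, 3)
print(predict_3cnf(kept, (1, 1, 0), 3))  # 1
print(predict_3cnf(kept, (1, 0, 1), 3))  # 0
```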

PAC learning (take 3): We say that a concept class $C$ is PAC-learnable using hypothesis class $H$ if there is an algorithm $L$ such that $\forall n \geq 0$, $\forall c \in C_n$, $\forall D$ over $X_n$, $\forall\, 0 < \epsilon < 1/2$, $\forall\, 0 < \delta < 1/2$: given access to $EX(c,D)$ and the inputs $size(c)$, $\epsilon$, $\delta$, $L$ outputs $h\in H_n$ that, with probability at least $1-\delta$, satisfies $err(h) \leq \epsilon$.

We say that $C$ is efficiently PAC-learnable if the running time of $L$ is polynomial in $n$, $size(c)$, $1/\epsilon$, $1/\delta$, and $H$ is polynomially evaluatable.

$H$ is polynomially evaluatable if there is an algorithm $A$, running in time polynomial in $size(h)$ and $n$, such that $\forall n$, $\forall h \in H_n$, $\forall x \in X_n$: on input $h$, $x$, it outputs $h(x)$.

So in the example above, the problem is efficiently learnable using the hypothesis class $H$ of 3-CNFs, but not using that of 3-term DNFs!


Explanatory models

notes

Given $(\vec{x}_1,y_1),(\vec{x}_2,y_2),\ldots,(\vec{x}_m, y_m)$

with $\vec{x}_i \in \{0,1\}^n$, $y_i \in \{0,1\}$.

Find a hypothesis that is consistent with this data.

We could give a hypothesis that just memorizes all of the data and predicts $0$ on all other inputs.

Occam's razor

Find the simplest possible explanation.

Simplest? shortest explanation?

We will focus on "short enough" being good (since the shortest explanation is not computable). If the description grows linearly with the number of observations, that's bad; we want it to grow more slowly than that!

Consistent learner: We say that $L$ is a consistent learner for concept class $C$, using hypothesis class $H$, if $\forall n$, $\forall c \in C_n$, $\forall m$: given $(x_1, c(x_1)), \ldots, (x_m, c(x_m))$ as input, the algorithm $L$ outputs $h \in H_n$ such that $h(x_i)=c(x_i)$ for all $i=1, \ldots, m$.

We say $L$ is efficient if its running time is polynomial in $n$, $m$, and $size(c)$.


Theorem (Occam's razor, cardinality version): Let $C$ be a concept class and $H$ a hypothesis class. Let $L$ be a consistent learner for $C$ using $H$. Then, $\forall n$, $\forall c \in C_n$, $\forall D$ over $X_n$: if $L$ is given $m$ examples drawn from $EX(c,D)$ with $m \geq \frac{1}{\epsilon} (\log{|H_n|} + \log{(1/\delta)})$, then $L$ outputs $h$ that, with probability $\geq 1-\delta$, satisfies $err(h) \leq \epsilon$.

Similar to theorems in Learning theory, VC dimension, empirical risk minimization, etc.

Proof: see the photos below (see also this nice lecture).

Proof idea (the resulting bound is worked out just below):

  1. The probability that a fixed bad hypothesis (error $> \epsilon$) looks good on the sample (and so may be picked by the algorithm) is small: it is $\leq (1-\epsilon)^m$.
  2. Take the union bound over all $h$ in the hypothesis class $H$, giving a factor of the size of the class, $|H|$.
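Putting the two steps together (a standard calculation, written out here for completeness; natural logarithms assumed):

```latex
\[
  \Pr\big[\exists\, h \in H_n \ \text{with}\ err(h) > \epsilon \ \text{consistent with all } m \text{ examples}\big]
  \;\leq\; |H_n|\,(1-\epsilon)^m
  \;\leq\; |H_n|\, e^{-\epsilon m} .
\]
% Requiring the right-hand side to be at most \delta and solving for m gives
\[
  |H_n|\, e^{-\epsilon m} \;\leq\; \delta
  \quad\Longleftrightarrow\quad
  m \;\geq\; \frac{1}{\epsilon}\left(\log |H_n| + \log \frac{1}{\delta}\right).
\]
```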

The size of the class of conjunctions is about $3^n$ (each variable appears positively, negatively, or not at all).

The size of the class of 3-term DNFs is $\leq 2^{6n}$, and for 3-CNFs it is $\geq 2^{n^3}$.

—> So learning 3-term DNF with unbounded computation requires far fewer training examples than learning via the efficiently learnable 3-CNFs!.. Nice (plugging the class sizes into the Occam bound, below, makes this concrete).
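Plugging the class sizes above into the Occam bound (my arithmetic, ignoring constants):

```latex
% 3-term DNF hypotheses:  \log|H_n| = O(n)    =>  m = O\big((n + \log(1/\delta))/\epsilon\big)
% 3-CNF hypotheses:       \log|H_n| = O(n^3)  =>  m = O\big((n^3 + \log(1/\delta))/\epsilon\big)
\[
  m_{\text{3-term DNF}} = O\!\left(\frac{n + \log(1/\delta)}{\epsilon}\right)
  \qquad\text{vs.}\qquad
  m_{\text{3-CNF}} = O\!\left(\frac{n^3 + \log(1/\delta)}{\epsilon}\right).
\]
```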

photo1

photo2


PAC learning with infinite concept classes

–> VC dimension (tight; characterizes the minimax risk under certain realizability assumptions),

Rademacher complexity for PAC learning with continuous functions, and also tighter bounds by making the bounds data-dependent!

Boosting

Convert weak learners to strong learners.

Get an ensemble of hypotheses by changing the input data fed to the algorithm in different, clever ways (e.g. reweighting the examples). We then combine the resulting classifiers with a clever decision rule (a sketch follows below).
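As a hedged illustration of this idea (AdaBoost with single-feature stumps is one standard instantiation; the details and the toy target below are my choices, not from the notes):

```python
import numpy as np

def adaboost(X, y, num_rounds=30):
    """Minimal AdaBoost sketch with single-feature stumps as weak learners.

    X: (m, n) array with entries in {0,1}; y: (m,) array with labels in {-1,+1}.
    Returns a list of ((feature, sign), alpha) pairs.
    """
    m, n = X.shape
    w = np.full(m, 1.0 / m)                       # example weights
    ensemble = []
    for _ in range(num_rounds):
        # Weak learner: pick the stump h(x) = sign * (2 x_j - 1) with lowest weighted error.
        best = None
        for j in range(n):
            for sign in (+1, -1):
                pred = sign * (2 * X[:, j] - 1)
                err = w[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, (j, sign), pred)
        err, stump, pred = best
        err = np.clip(err, 1e-12, 1 - 1e-12)      # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)     # weight of this weak hypothesis
        w = w * np.exp(-alpha * y * pred)         # upweight the misclassified examples
        w = w / w.sum()
        ensemble.append((stump, alpha))
    return ensemble

def predict_ensemble(ensemble, X):
    """Weighted-majority decision rule over the weak hypotheses."""
    score = sum(alpha * sign * (2 * X[:, j] - 1) for (j, sign), alpha in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy usage: the target is the majority of the first three coordinates.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5))
y = np.where(X[:, :3].sum(axis=1) >= 2, 1, -1)
ens = adaboost(X, y)
print("training accuracy:", np.mean(predict_ensemble(ens, X) == y))
```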

Agnostic learning and more

See Computational learning theory for variations on the basic PAC learning framework.

Uniform convergence is the main tool for proving results about PAC learnability.

PAC learning with noise

See https://arxiv.org/pdf/math/0702683.pdf and http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/


Learning band-limited Boolean functions

See Analysis of Boolean functions

See this video for how to learn Boolean functions whose Fourier coefficients are concentrated on a few terms.

Similar result to Occam algorithms
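A hedged sketch of the general idea (estimate each Fourier coefficient $\hat{f}(S) = \mathbb{E}_x[f(x)\chi_S(x)]$ from samples and keep only the large, low-degree ones; the toy function, thresholds, and sample sizes below are my choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 4

def f(x):
    """Toy target with Fourier mass concentrated on one low-degree term: x0 XOR x1."""
    return x[0] ^ x[1]

# Work in the +/-1 convention: F(x) = (-1)^{f(x)}, chi_S(x) = prod_{i in S} (-1)^{x_i}.
def chi(S, x):
    return (-1) ** sum(x[i] for i in S)

# Estimate hat{F}(S) = E_x[F(x) chi_S(x)] for all S of size <= degree, from samples.
degree, m, threshold = 2, 2000, 0.3
samples = [tuple(rng.integers(0, 2, size=n)) for _ in range(m)]
coeffs = {}
for k in range(degree + 1):
    for S in itertools.combinations(range(n), k):
        est = np.mean([(-1) ** f(x) * chi(S, x) for x in samples])
        if abs(est) >= threshold:
            coeffs[S] = est          # keep only the large (concentrated) coefficients

def h(x):
    """Hypothesis: sign of the truncated Fourier expansion, mapped back to {0,1}."""
    return 0 if sum(c * chi(S, x) for S, c in coeffs.items()) >= 0 else 1

test = [tuple(rng.integers(0, 2, size=n)) for _ in range(1000)]
print("kept coefficients:", coeffs)
print("error:", np.mean([h(x) != f(x) for x in test]))
```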


Under the discrete cube root assumption, the class of circuits of logarithmic depth is not efficiently PAC-learnable.