aka PAC
The main learning framework in Computational learning theory; it studies the conditions under which an algorithm can perform learning tasks with confidence (high probability of success) and accuracy (low probability of error). See more at Learning theory. See also PAC book
An informal set of notes and comments follows. See Learning theory for more rigorous formalizations, and further generalizations of the theory:
Assume training data comes from a distribution $\mathcal{D}$. Test data comes from the same distribution.
Example:
We are given points which are labeled as green and red. We are also told that the green points are all sampled from inside a given square (the outer square in the picture below). We consider the algorithm that uses the minimum square that bounds the green points (the inner square in the picture) to classify future points.
Our errors come from the blue region, where the points are really green but we classify them as red.
Consider a set of 4 regions at the edges of the true square (i.e. a thin strip along each edge), each with probability mass $\epsilon/4$.
If there is at least one training point in each of these regions, then we know that the probability of error is at most $\epsilon$ (the error region is then contained in the union of the 4 strips, each of mass $\epsilon/4$).
The probability that a fixed strip contains none of the $m$ training points is $(1-\epsilon/4)^m$; we want this to be small, because if some strip contains no training point, then our error can be greater than $\epsilon$.
Union bound: $\Pr[\text{some strip is missed by all } m \text{ points}] \le 4(1-\epsilon/4)^m \le 4e^{-m\epsilon/4}$.
Learning: with probability at least $1-\delta$ our classifier has error at most $\epsilon$, provided $m \ge \frac{4}{\epsilon}\ln\frac{4}{\delta}$ (so that $4e^{-m\epsilon/4} \le \delta$).
For hyperrectangles in $d$ dimensions, the 4s in the expression above become $2d$.
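A minimal simulation sketch of this example (the uniform distribution, the target square $[0.2, 0.8]^2$, and all the function names below are my own choices, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target concept: the axis-aligned square [0.2, 0.8]^2.
LO, HI = 0.2, 0.8

def label(points):
    # Green (True) iff the point falls inside the target square.
    return np.all((points >= LO) & (points <= HI), axis=1)

def fit_min_bounding_box(points, labels):
    # The learner from the example: the tightest axis-aligned box around the
    # green training points (assumes at least one green point was observed).
    green = points[labels]
    return green.min(axis=0), green.max(axis=0)

def predict(box, points):
    lo, hi = box
    return np.all((points >= lo) & (points <= hi), axis=1)

# Train on m samples from D (uniform on the unit square); estimate the error on fresh samples.
m = 500
train = rng.random((m, 2))
box = fit_min_bounding_box(train, label(train))

test = rng.random((100_000, 2))
err = np.mean(predict(box, test) != label(test))
print(f"estimated zero-one error: {err:.4f}")  # shrinks roughly like 1/m
```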
$\delta$ – confidence parameter. The sample complexity always has logarithmic dependence on $1/\delta$.
$\epsilon$ – accuracy parameter. The dependence on $1/\epsilon$ could be different (e.g. $1/\epsilon^2$ in the agnostic setting).
We want learning to be both statistically efficient (sample complexity) and computationally efficient (mostly runtime).
$X$ is the set of all possible instances/examples/observations.
A concept over $X$ is a Boolean function $c : X \to \{0,1\}$. A concept class $C$ is a collection of concepts.
Distribution: we assume a distribution $\mathcal{D}$ over $X$.
Once a target concept $c$ and a distribution $\mathcal{D}$ are fixed, any classifier/hypothesis $h$ has error $\mathrm{err}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)]$ (zero-one error).
Probably approximately correct (take 1): Let $C$ be a concept class over $X$. We say that $C$ is PAC-learnable if there exists an algorithm $L$ s.t. $\forall c \in C$, $\forall$ distributions $\mathcal{D}$ over $X$, $\forall \epsilon \in (0,1/2)$, $\forall \delta \in (0,1/2)$, if $L$ is given access to training data (examples $x \sim \mathcal{D}$ labeled by $c$) and inputs $\epsilon$, $\delta$, then $L$ outputs $h \in C$ s.t. with probability at least $1-\delta$, $\mathrm{err}(h) \le \epsilon$.
Efficiently PAC-learnable if $L$ runs in time polynomial in $1/\epsilon$ and $1/\delta$.
Sometimes we also want to take into account the size of the data (instances and concept representations)!
A representation scheme for a concept class $C$ is a mapping $R : \Sigma^* \to C$ from strings (representations) to concepts.
We define a function $\mathrm{size}(\sigma)$ on representations (e.g. the length of the string $\sigma$). We can then infer another function on concepts:
$\mathrm{size}(c) = \min\{\mathrm{size}(\sigma) : R(\sigma) = c\}$.
Compare Kolmogorov complexity.
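A concrete instance (the encoding is my choice, for illustration): let $C_n$ be conjunctions over Boolean variables $x_1, \ldots, x_n$, each concept represented by the list of literals it contains. Then

$$
R\bigl(x_1 \wedge \bar{x}_2 \wedge x_4\bigr) = \{\, x \in \{0,1\}^n : x_1 = 1,\; x_2 = 0,\; x_4 = 1 \,\},
\qquad
\mathrm{size}(c) = \text{number of literals in the shortest conjunction representing } c .
$$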
$X_n$ is the set of instances of size $n$. $X = \bigcup_{n \ge 1} X_n$.
PAC (take 2): Let $C_n$ be concepts over $X_n$. Let $C = \bigcup_n C_n$ and $X = \bigcup_n X_n$. We say that $C$ is PAC-learnable if there exists an algorithm $L$ s.t. $\forall n$, $\forall c \in C_n$, $\forall$ distributions $\mathcal{D}$ over $X_n$, $\forall \epsilon, \delta \in (0,1/2)$, if $L$ is given access to a training set and inputs $\epsilon$, $\delta$, $n$, then $L$ outputs $h \in C_n$ s.t. w.p. at least $1-\delta$, $\mathrm{err}(h) \le \epsilon$.
Efficiently learnable: same as before, but also polynomial in $n$ and $\mathrm{size}(c)$.
(Basically we want it to be polynomial in anything it can depend on and that we reasonably expect could be a large number.)
Example: Learning Boolean conjunctions (a sketch of the standard elimination algorithm follows after this list)
E.g. Learning 3-term DNFs
They are not efficiently PAC learnable using 3-term DNF hypotheses (assuming RP ≠ NP)
Learning 3-CNFs (conjunctive normal form)
They are efficiently PAC learnable
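The standard elimination algorithm for plain Boolean conjunctions (referenced above); the encoding of literals as `(index, is_positive)` pairs and the toy data are my choices:

```python
def learn_conjunction(examples):
    """Standard elimination algorithm for Boolean conjunctions.

    examples: list of (x, y) pairs, x a tuple of 0/1 values, y the Boolean label.
    Returns a set of literals, e.g. {(0, True), (2, False)} meaning x_0 AND (NOT x_2).
    """
    n = len(examples[0][0])
    # Start with every literal: (i, True) is x_i, (i, False) is NOT x_i.
    literals = {(i, b) for i in range(n) for b in (True, False)}
    for x, y in examples:
        if not y:                 # negative examples are ignored
            continue
        # Drop every literal that this positive example falsifies.
        literals -= {(i, x[i] == 0) for i in range(n)}
    return literals

def evaluate(literals, x):
    return all(x[i] == (1 if positive else 0) for i, positive in literals)

# Toy usage, with the hypothetical target x_0 AND (NOT x_2) over 3 variables:
data = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False), ((1, 1, 1), False)]
h = learn_conjunction(data)
assert all(evaluate(h, x) == y for x, y in data)  # consistent with the training data
```

The output is always consistent with the sample (the target's literals are never eliminated), and by the Occam argument below roughly $\frac{1}{\epsilon}\bigl(O(n) + \ln\frac{1}{\delta}\bigr)$ examples suffice.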
PAC learning (take 3): we say that concept class $C$ is PAC-learnable using hypothesis class $H$, if there is an algorithm $L$ that, $\forall n$, $\forall c \in C_n$, $\forall$ distributions $\mathcal{D}$ over $X_n$, $\forall \epsilon, \delta \in (0,1/2)$, with access to training data and inputs $\epsilon$, $\delta$, $n$, outputs $h \in H_n$ that with probability at least $1-\delta$ satisfies $\mathrm{err}(h) \le \epsilon$.
We say that $C$ is efficiently PAC-learnable using $H$ if the running time of $L$ is polynomial in $n$, $\mathrm{size}(c)$, $1/\epsilon$, $1/\delta$, and $H$ is polynomially evaluatable.
$H$ is polynomially evaluatable if there is a polynomial-time (in $n$) algorithm that, $\forall n$, $\forall h \in H_n$, $\forall x \in X_n$, on input $h$ (its representation) and $x$, outputs $h(x)$.
So in the example above, the problem is efficiently learnable using the hypothesis class of 3-CNFs, but not that of 3-term DNFs!
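Why the richer hypothesis class helps: every 3-term DNF is equivalent to a 3-CNF, obtained by distributing the OR over the ANDs,

$$
T_1 \vee T_2 \vee T_3 \;=\; \bigwedge_{u \in T_1,\; v \in T_2,\; w \in T_3} (u \vee v \vee w),
$$

where $u, v, w$ range over the literals of the three terms. So a 3-CNF learner covers all 3-term DNF targets: introduce one new Boolean variable per possible clause $(u \vee v \vee w)$ (there are $O(n^3)$ of them) and run the conjunction learner over these, which is efficient. Finding a consistent 3-term DNF itself, in contrast, is NP-hard, which is what blocks efficient proper learning.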
Given $(x_1, c(x_1)), \ldots, (x_m, c(x_m))$,
find a hypothesis that is consistent with this data.
We can give a hypothesis that just memorizes all of the data, and on all other points it predicts (say) 0.
Find the simplest possible explanation.
Simplest? shortest explanation?
We will focus on "short enough" being good (the shortest is not computable...). If the description grows linearly with the number of observations, it's bad. We want it to grow more slowly!
Consistent learner: We say that $L$ is a consistent learner for concept class $C$, using hypothesis class $H$, if $\forall n$, $\forall c \in C_n$, given $(x_1, c(x_1)), \ldots, (x_m, c(x_m))$ as input, the algorithm outputs $h \in H_n$ s.t. $h(x_i) = c(x_i)$ for all $i = 1, \ldots, m$.
We say $L$ is efficient if the running time is polynomial in $n$, $m$, $\mathrm{size}(c)$.
Theorem (Occam's razor, cardinality version): Let $C$ be a concept class and $H$ a hypothesis class. Let $L$ be a consistent learner for $C$ using $H$. Then, $\forall n$, $\forall c \in C_n$, $\forall$ distributions $\mathcal{D}$ over $X_n$, $\forall \epsilon, \delta \in (0,1/2)$: if $L$ is given $m$ examples drawn from $\mathcal{D}$ (labeled by $c$) s.t. $m \ge \frac{1}{\epsilon}\left(\ln|H_n| + \ln\frac{1}{\delta}\right)$, then $L$ outputs $h \in H_n$ that, w.p. at least $1-\delta$, satisfies $\mathrm{err}(h) \le \epsilon$.
Similar to theorems in Learning theory, VC dimension, empirical risk minimization, etc.
Proof...see photo (see also this nice lecture)
Proof idea
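In brief, the standard counting argument: a single "bad" hypothesis (one with $\mathrm{err}(h) > \epsilon$) is consistent with one random example with probability less than $1-\epsilon$, so

$$
\Pr\bigl[\text{some } h \in H_n \text{ with } \mathrm{err}(h) > \epsilon \text{ is consistent with all } m \text{ examples}\bigr]
\;\le\; |H_n|\,(1-\epsilon)^m \;\le\; |H_n|\, e^{-\epsilon m} \;\le\; \delta
$$

whenever $m \ge \frac{1}{\epsilon}\left(\ln|H_n| + \ln\frac{1}{\delta}\right)$. Since $L$ always outputs a consistent hypothesis, with probability at least $1-\delta$ that hypothesis is not a bad one.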
Size of the class of conjunctions is about $3^n$ (each variable appears positively, negated, or not at all).
Size of the class of 3-term DNFs is about $3^{3n}$, and for 3-CNFs it is about $2^{O(n^3)}$ (there are roughly $(2n)^3$ possible clauses, each included or not).
—> Learning 3-term DNF with unbounded computation requires far fewer training examples ($\ln 3^{3n} = O(n)$) than the efficiently learnable 3-CNFs ($\ln 2^{O(n^3)} = O(n^3)$)!.. Nice.
–> VC dimension (tight, Minimax risk under certain realizability assumptions),
Rademacher complexity for PAC learning with continuous functions, and also tighter bounds by making the bounds data-dependent!
Convert weak learners to strong learners (boosting).
Get an ensemble of hypotheses by changing the input data to the algorithm in different and clever ways. We then have some clever decision rule that combines the classifiers.
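One standard instantiation of this idea is AdaBoost; below is a minimal sketch with decision stumps as the weak learners (numeric features, labels in $\{-1,+1\}$; all names and the toy data are my choices):

```python
import numpy as np

def fit_stump(X, y, w):
    """Weak learner: the best single-feature threshold under example weights w."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] <= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best  # (weighted error, feature index, threshold, sign)

def stump_predict(stump, X):
    _, j, thr, sign = stump
    return np.where(X[:, j] <= thr, sign, -sign)

def adaboost(X, y, rounds=30):
    """Reweight the data each round; combine the stumps by a weighted vote."""
    w = np.full(len(y), 1 / len(y))
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = max(stump[0], 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak hypothesis
        pred = stump_predict(stump, X)
        w *= np.exp(-alpha * y * pred)          # up-weight the mistakes
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))

# Toy usage on a hypothetical 2D dataset whose target no single stump can express:
rng = np.random.default_rng(0)
X = rng.random((200, 2))
y = np.where(X[:, 0] + X[:, 1] > 1, 1, -1)
model = adaboost(X, y)
print("training accuracy:", np.mean(predict(model, X) == y))
```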
See Computational learning theory for variations on the basic PAC learning framework.
Uniform convergence is the main tool for proving results about PAC learnability
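For a finite hypothesis class $H$ the basic statement is Hoeffding's inequality plus a union bound:

$$
\Pr\Bigl[\,\sup_{h \in H} \bigl|\widehat{\mathrm{err}}_S(h) - \mathrm{err}(h)\bigr| > \epsilon \Bigr] \;\le\; 2\,|H|\, e^{-2m\epsilon^2},
$$

where $\widehat{\mathrm{err}}_S(h)$ is the empirical error on the $m$ training examples $S$. Once every hypothesis's empirical error is close to its true error, picking the empirical minimizer is approximately as good as picking the best hypothesis in $H$.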
See https://arxiv.org/pdf/math/0702683.pdf and http://www.stats.ox.ac.uk/~rebeschi/teaching/AFoL/18/
See Analysis of Boolean functions
See this video for how to learn Boolean functions that have their Fourier coefficients concentrated on a few terms.
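The underlying idea (the low-degree algorithm): estimate the Fourier coefficients $\hat f(S) = \mathbb{E}_x[f(x)\chi_S(x)]$ of $f : \{-1,1\}^n \to \{-1,1\}$ from random samples, keep only the low-degree ones, and predict with the sign of the truncated expansion. A minimal sketch (the target function and all names are my choices):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 6

def f(x):
    # Hypothetical target: majority of the first three coordinates
    # (its Fourier mass sits on low-degree terms).
    return np.sign(x[:, 0] + x[:, 1] + x[:, 2])

def chi(S, x):
    # Parity (character) chi_S(x) = prod_{i in S} x_i, for x in {-1, +1}^n.
    return np.prod(x[:, list(S)], axis=1) if S else np.ones(len(x))

# Random labeled samples, uniform over {-1, +1}^n.
m = 20_000
X = rng.choice([-1, 1], size=(m, n))
Y = f(X)

# Estimate every Fourier coefficient of degree <= d by an empirical average.
d = 2
coeffs = {S: np.mean(Y * chi(S, X))
          for k in range(d + 1)
          for S in itertools.combinations(range(n), k)}

def predict(x):
    # Sign of the degree-<=d truncation of the estimated Fourier expansion.
    return np.sign(sum(c * chi(S, x) for S, c in coeffs.items()))

X_test = rng.choice([-1, 1], size=(5_000, n))
print("agreement with f:", np.mean(predict(X_test) == f(X_test)))
```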
Similar result to Occam algorithms
Under the discrete cube root assumption (a cryptographic hardness assumption), the class of circuits of logarithmic depth is not efficiently PAC-learnable