Actually, a lot of this is statistical learning theory.
Computational learning theory is Learning theory, but taking into account the Computational complexity of the Learning algorithms. There is a big overlap with topics studied in Statistical learning theory!
See Machine learning for a more general picture.
Mostly Supervised learning (see Learning theory).
Basic setup for supervised learning. Assumptions:
Empirical risk minimization is a good approach. A learner that outputs a hypothesis with zero empirical risk is called a consistent learner.
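A minimal sketch of ERM over a finite class of threshold classifiers (the setup and all names are illustrative, not from these notes):

```python
# ERM sketch: pick, from a finite hypothesis class of thresholds
# h_t(x) = 1 iff x >= t, the hypothesis with minimal empirical risk.
import random

def empirical_risk(t, sample):
    """Fraction of labelled examples (x, y) that h_t misclassifies."""
    return sum(((x >= t) != y) for x, y in sample) / len(sample)

def erm(sample, thresholds):
    """Return the threshold with minimal empirical risk on the sample."""
    return min(thresholds, key=lambda t: empirical_risk(t, sample))

# Noiseless data from a true threshold at 0.5, which is in the class,
# so a consistent learner (zero empirical risk) exists.
random.seed(0)
sample = [(x, x >= 0.5) for x in (random.random() for _ in range(100))]
t_hat = erm(sample, [i / 20 for i in range(21)])
```

Since the target threshold 0.5 is in the grid, the ERM output is consistent here.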
Overfitting. To guard against overfitting, limit the size of the hypothesis class (Occam's razor in learning theory). It basically looks like an extension of Hypothesis testing to a class of hypotheses. Proof of Occam PAC learning theorem
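A sketch of the finite-class (Occam) bound in the realizable case, in the standard formulation:

```latex
% If |H| is finite and some h* in H has zero true risk, then
m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
\;\Longrightarrow\;
\Pr\bigl[\exists\, h \in H:\ L_S(h) = 0 \ \wedge\ L_D(h) > \epsilon\bigr] \le \delta
```

So with probability at least 1 − δ, every consistent hypothesis has true risk at most ε; the sample size needed grows only logarithmically in |H|.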
Two types of error:
Uniform convergence, the main tool for proving results about PAC learnability. The basic result is the Occam theorem (see Probably approximately correct)
VC dimension characterizes PAC-learnability (see "Understanding machine learning" book – Fundamental theorem of learning theory)
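A brute-force shattering check makes the definition concrete (illustrative sketch; thresholds on the line have VC dimension 1):

```python
# A class H shatters a point set X if every labelling of X is realized
# by some h in H. VC dimension = size of the largest shattered set.
def shatters(hypotheses, points):
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Threshold classifiers h_t(x) = (x >= t) on a small grid of t values.
thresholds = [lambda x, t=t: x >= t for t in [0.0, 0.25, 0.5, 0.75, 1.0, 1.5]]

shatters(thresholds, [0.3])        # True: a single point is shattered
shatters(thresholds, [0.3, 0.7])   # False: the labelling (1, 0) is unrealizable
```

Monotonicity is what blocks the (1, 0) labelling: no threshold accepts 0.3 while rejecting 0.7.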
Covering-number generalization error bounds
We say a concept class is efficiently learnable with membership and equivalence queries if there exists a polynomial p(·) and an algorithm L that, with access to the membership and equivalence query oracles for the target concept c, outputs in time polynomial in the relevant parameters a concept that is equivalent to c.
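To make the query model concrete, a toy sketch (not from these notes) of Angluin-style exact learning of a monotone conjunction; for this class equivalence queries alone suffice, and the oracle is simulated by brute force against a hypothetical hidden target:

```python
# Exact learning of a monotone conjunction with equivalence queries.
from itertools import product

def target(x, relevant={0, 2}):        # hypothetical hidden concept: x0 AND x2
    return all(x[i] for i in relevant)

def hyp(x, vars_):
    return all(x[i] for i in vars_)

def equivalence_query(vars_, n=4):
    """Simulated EQ oracle: search all inputs for a counterexample."""
    for x in product([0, 1], repeat=n):
        if hyp(x, vars_) != target(x):
            return x
    return None                        # hypothesis is exactly correct

def learn_monotone_conjunction(n=4):
    vars_ = set(range(n))              # start with the most specific conjunction
    while (x := equivalence_query(vars_, n)) is not None:
        # The hypothesis is always at least as specific as the target, so
        # counterexamples are positive: drop the variables set to 0 in x.
        vars_ -= {i for i in vars_ if not x[i]}
    return vars_
```

Each counterexample removes at least one variable, so the algorithm makes at most n equivalence queries.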
Also, learning in the presence of noise
Membership query, and Equivalence query
Example oracle + membership query is sufficient for PAC learning
Rademacher complexity: analogous to VC dimension, but for real-valued concepts.
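A hedged sketch of estimating empirical Rademacher complexity by Monte Carlo, for a finite class represented by its output vectors on the sample (all names are illustrative):

```python
# Empirical Rademacher complexity of a finite class H on a sample S:
#   R_hat(H; S) = E_sigma [ max_{h in H} (1/m) * sum_i sigma_i * h(x_i) ]
import random
from itertools import product

def empirical_rademacher(hyp_outputs, trials=1000, rng=random.Random(0)):
    """hyp_outputs: list of tuples, each the outputs of one h on the sample."""
    m = len(hyp_outputs[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * o for s, o in zip(sigma, outs)) / m
                     for outs in hyp_outputs)
    return total / trials

# A single constant hypothesis has complexity near 0; a class realizing
# every sign pattern on the sample has complexity exactly 1 (the sup
# always picks the pattern matching sigma).
m = 8
constant = [tuple([1] * m)]
all_patterns = [tuple(p) for p in product((-1, 1), repeat=m)]
```

Richer classes correlate better with random signs, which is exactly why high Rademacher complexity signals a risk of overfitting.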
So far in all the learning frameworks we’ve studied, we’ve made an assumption that there is some “ground truth” target function that we attempt to learn. Our goal has been to identify a hypothesis that is close to this target, with respect to the target distribution. Learning algorithms are given access to the target function in the form of labelled observations, which in some cases may be noisy. In this lecture, we’ll drop the assumption of a ground-truth target completely; it is for this reason that the framework is called agnostic learning. As there is no longer a well-defined notion of target, our goal will be to identify a hypothesis that is competitive with respect to the best concept from a particular class.
video – weakness of the PAC definition which motivates agnostic learning – to approach this we redefine successful learning to have only a relative error guarantee –> Definition of Agnostic PAC learnability
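The standard definition (as in "Understanding machine learning"), sketched:

```latex
% H is agnostic PAC learnable if there exist m_H(\epsilon,\delta) and a
% learner A such that for every distribution D over X x Y and every
% \epsilon, \delta \in (0,1), an i.i.d. sample S of size
% m \ge m_H(\epsilon,\delta) satisfies, with probability at least 1-\delta:
L_D\bigl(A(S)\bigr) \;\le\; \min_{h \in H} L_D(h) \;+\; \epsilon
```

The guarantee is purely relative: no assumption that any h ∈ H (or any function at all) labels the data correctly.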
proof that finite classes are agnostically learnable
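The usual route, sketched: Hoeffding plus a union bound gives uniform convergence, and uniform convergence makes ERM ε-optimal:

```latex
% Uniform convergence for finite H (Hoeffding + union bound):
m \;\ge\; \left\lceil \frac{2\ln(2|H|/\delta)}{\epsilon^2} \right\rceil
\;\Longrightarrow\;
\Pr\Bigl[\forall h \in H:\ |L_S(h) - L_D(h)| \le \tfrac{\epsilon}{2}\Bigr] \ge 1 - \delta,
% and on that event
L_D\bigl(\mathrm{ERM}(S)\bigr) \;\le\; \min_{h \in H} L_D(h) + \epsilon.
```

Note the 1/ε² dependence, versus 1/ε in the realizable Occam bound.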
https://www.cs.bgu.ac.il/~asml162/Class_Material
See "Understanding machine learning" by Shai and Shai for more
examples of other learning tasks
A kind of clustering for representing data with code words.
Probabilistic prediction rules can't achieve lower error than the best deterministic rule. But they can give us our uncertainty in the prediction, for each particular case we encounter. That extra information can be useful! (this is what modern Bayesian inference recognizes)
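A toy sketch of the point (the conditional probabilities are assumed, not from the notes): the hard rule and the probabilistic rule make the same predictions, but only the latter separates confident cases from near-coin-flips:

```python
# Hard vs probabilistic prediction under a known conditional P(y=1 | x).
import random

def p_of_y1(x):                 # hypothetical conditional distribution
    return 0.9 if x == "a" else 0.55

def hard_predict(x):            # thresholded (Bayes-optimal) hard rule
    return int(p_of_y1(x) >= 0.5)

rng = random.Random(0)
xs = [rng.choice("ab") for _ in range(10000)]
ys = [int(rng.random() < p_of_y1(x)) for x in xs]

error = sum(hard_predict(x) != y for x, y in zip(xs, ys)) / len(xs)
# Both inputs get the hard prediction 1, but the probabilistic rule
# reports 0.9 for "a" and only 0.55 for "b": same error, more information.
```

The irreducible error here is about 0.275 (a mix of 0.1 and 0.45), and no rule, randomized or not, can beat it; the probabilities tell you which cases contribute it.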
How to learn in such a general situation?
There are relations to Game theory
Complexity Theoretic Limitations on Learning Halfspaces
Nice classical experiment about supervision (1st lecture of a lecture series on ML)