Learning theory

cosmos 15th May 2019 at 7:33pm
Artificial intelligence Learning Machine learning Statistical inference

The general field of machine learning theory (computational learning theory, statistical learning theory, optimization for learning, etc.); note that these have slightly different connotations/focus.

Mathematical theory of learning, and learning algorithms. See Machine learning, Deep learning theory, wiki.

Learn in theory

See these notes and these notes. See the book and lectures "Understanding machine learning" by Shai Shalev-Shwartz and Shai Ben-David (found here for free). See also Introduction to supervised learning theory.

Statistical learning theory – Computational learning theory

Understanding machine learning (book) – Foundations of machine learning

Learning problem: Design a system that improves its ability to perform a task T, as measured by a performance measure P, through experience E.

Types of learning problems

Some important types:

See more at Machine learning, and Computational learning theory

Design factors in supervised machine learning

Also named: Expressibility, Optimization for learning, Generalization


Learning theory focuses mostly on Supervised learning, which is one of the most common learning problems in practice, and which provides a general foundation for many other learning problems.

Good course on statistical learning theory! (course webpage)

Supervised learning theory

Introduction to supervised learning theory

See Supervised learning for more details on practical algorithms and applications.

Problem definition: risk minimization

Definition and introduction of the problem of learning a function given a loss

If one defines a Loss function (or risk, or cost, see Decision theory), $R_f(\hat{y}, y)$, one can ask which prediction scheme (for instance, which function $\hat{y} = f(x)$) minimizes the expected risk. This is called risk minimization.
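In symbols (a sketch using this notation, writing the loss as $\ell$, the data distribution as $P(x,y)$, and the expected risk as $R$):

$$R(f) = \mathbb{E}_{(x,y)\sim P}\left[\ell(f(x), y)\right], \qquad f^* = \arg\min_f R(f)$$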

Given the previous abstract definition, the solution to expected risk minimization is well defined given the distribution of the data. However, what is a good solution when one is given only training data (the real learning problem)? The answer to this question is what defines a good learning algorithm.

Some answers: Consistency of a learning algorithm and other notions of goodness (a common one is Probably approximately correct learning; see also Wolpert's chapter in the book "The Mathematics of Generalization"). A learning algorithm that has low expected risk with high probability (or related notions) is said to predict well, and if it does so even when given a training set much smaller than the test set, we say the algorithm is able to generalize.
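One hedged way to make "low expected risk with high probability" precise is the PAC-style criterion (writing $\epsilon(h)$ for the expected risk of $h$, as below): for an accuracy parameter $\gamma$ and a confidence parameter $\delta$, an algorithm that outputs $\hat{h}$ from $m$ training samples should satisfy

$$\Pr\left[\epsilon(\hat{h}) \leq \min_{h \in H} \epsilon(h) + \gamma\right] \geq 1-\delta$$

with $m$ growing only moderately in $1/\gamma$ and $1/\delta$.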

Empirical risk minimization

Risk minimization requires knowing the joint probability distribution $P(x,y)$, so one often uses the sample mean of the risk as an estimator for the expected value of the risk. Minimizing this empirical quantity is called empirical risk minimization.
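In symbols (a sketch, with training set $\{(x_i, y_i)\}_{i=1}^m$ and loss $\ell$):

$$\hat{R}(f) = \frac{1}{m}\sum_{i=1}^{m} \ell(f(x_i), y_i), \qquad \hat{f}_{\mathrm{ERM}} = \arg\min_{f \in H} \hat{R}(f)$$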

The empirical risk, for a 0-1 loss function, is also known as the training error $\hat{\epsilon}$, which is just the fraction of training points that your hypothesis misclassifies. See Training error (vid).
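A minimal sketch of this computation (the function name is illustrative; assumes NumPy arrays of predicted and true labels):

```python
import numpy as np

def training_error(y_pred, y_true):
    """Empirical risk under the 0-1 loss: the fraction of
    training points that the hypothesis misclassifies."""
    y_pred = np.asarray(y_pred)
    y_true = np.asarray(y_true)
    return float(np.mean(y_pred != y_true))

# A hypothesis that misclassifies 1 of 4 training points has training error 0.25
print(training_error([1, 0, 1, 1], [1, 0, 0, 1]))
```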

Sometimes the difference between the empirical risk and the true risk is called the generalization gap. Many learning algorithms are about doing approximate ERM while ensuring a low generalization gap (and thus ensuring subsequent generalization/good prediction). This is approached with Regularization. Balancing the two is a tradeoff whose extremes are called overfitting and underfitting.

Generalization/prediction

Making good generalizations is the goal of predictive/supervised learning, where generalization means making predictions about unseen data from seen data, and good generalization means these predictions agree with the actual results as often as possible. See also Induction.

For the case of 0-1 loss functions, the expected risk is also known as the generalization error $\epsilon$, defined as the probability that, if I sample a new example $(x,y)$ from the same distribution producing the data, my hypothesis mislabels that example.
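In symbols (with $D$ the data distribution and $h$ the hypothesis):

$$\epsilon(h) = \Pr_{(x,y)\sim D}\left[h(x) \neq y\right]$$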

No free lunch theorem – Wolpert article on No Free Lunch and the different learning theory frameworks

Overfitting and underfitting

Overfitting and underfitting refer to ways of mis-training a model, i.e., making it have poor generalization error compared to the optimal model. See Bias-variance tradeoff.
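A toy illustration of the two extremes (not from the notes, just a sketch): fit noisy samples of a sine curve with polynomials of increasing degree and compare training versus held-out error.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, 20)
x_test = rng.uniform(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, 200)

for degree in [1, 4, 15]:
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1 underfits (both errors high); degree 15 overfits
    # (training error near zero, held-out error much larger)
    print(degree, round(train_mse, 3), round(test_mse, 3))
```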

Regularization

A way to avoid overfitting, while keeping all parameters in your model. Intro vid

We use a Prior distribution, from Bayesian statistics, to make simpler hypotheses more likely. See Simplicity and learning.
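A minimal sketch of this idea for linear least squares (L2/ridge regularization; the function name and data are illustrative). The L2 penalty corresponds to a Gaussian prior on the weights:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized ERM for linear least squares:
    minimize ||Xw - y||^2 + lam * ||w||^2 (closed-form solution).
    The penalty is equivalent to MAP estimation with a Gaussian prior on w."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Larger lam shrinks the weights towards zero (simpler hypotheses),
# trading some bias for lower variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, 0.0, 0.0, 0.0, 0.0]) + rng.normal(0, 0.1, 30)
for lam in [0.0, 0.1, 10.0]:
    print(lam, np.round(ridge_fit(X, y, lam), 2))
```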

Intuition

Model selection/assessment

Model selection algorithms provide methods to automatically choose optimal bias/variance tradeoffs.
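A minimal sketch of one such method, k-fold cross-validation: hold out each fold in turn, average the validation error, and keep the model/hyperparameter with the lowest average. The helper below is illustrative (`fit` and `predict` stand for any training/prediction routines), not a reference implementation:

```python
import numpy as np

def k_fold_cv_error(fit, predict, X, y, k=5):
    """Average held-out 0-1 error of a learning algorithm over k folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errors.append(np.mean(predict(model, X[val]) != y[val]))
    return float(np.mean(errors))

# Model selection: evaluate each candidate (e.g. each regularization strength)
# with k_fold_cv_error and keep the one with the lowest estimated error.
```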

Statistical supervised learning theory

See post: Learn in theory

Convergence and performance results in learning theory

Mostly about general bounds, which are often pessimistic because they have to cover worst cases, but which give good intuition about general relations, for instance between training set size and model complexity.

Theorem. Let a hypothesis class $H$ be given, and let its VC dimension be $\mathrm{VC}(H) = d$. Then, with probability at least $1-\delta$, we have that for all $h \in H$

$$|\epsilon(h)-\hat{\epsilon}(h)| \leq O\left(\sqrt{\frac{d}{m}\log{\frac{m}{d}}+\frac{1}{m}\log{\frac{1}{\delta}}}\right)$$

where $m$ is the size of the training set. A corollary is that, for ERM, the number of training samples needed for a particular performance is roughly linear in the VC dimension of the hypothesis class (see video). The sample complexity is upper bounded in terms of the VC dimension.
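Stated the other way around (a rough corollary, with $\tilde{O}$ hiding constants and logarithmic factors): to get a generalization gap of at most $\gamma$ with probability $1-\delta$, it suffices to have

$$m = \tilde{O}\!\left(\frac{d + \log\frac{1}{\delta}}{\gamma^2}\right)$$

training samples, i.e., roughly linearly many in the VC dimension $d$.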

Deep learning theory

Neural network theory, Generalization in deep learning


Bayesian statistics

Applicable to problems in supervised learning, but also beyond

Maximum likelihood

Minimize the negative log-likelihood (related to entropy; more precisely, to the cross-entropy, or relative entropy), which corresponds to maximizing the likelihood.
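In symbols (a sketch, for i.i.d. data $\{(x_i, y_i)\}_{i=1}^m$ and a parametric model $p_\theta$):

$$\hat{\theta}_{\mathrm{ML}} = \arg\max_\theta \prod_{i=1}^m p_\theta(y_i \mid x_i) = \arg\min_\theta \left(-\frac{1}{m}\sum_{i=1}^m \log p_\theta(y_i \mid x_i)\right)$$

The minimized quantity is the cross-entropy between the empirical distribution of the data and the model.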

Maximum a posteriori (MAP)
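A sketch of the MAP objective: add the log-prior to the log-likelihood (a Gaussian prior on the parameters recovers L2 regularization, as above):

$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \left[\sum_{i=1}^m \log p_\theta(y_i \mid x_i) + \log p(\theta)\right]$$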

Posterior mean

Reinforcement learning

Optimization for learning


Statistical physics and inference


Learning theory and neural networks gingko tree


General/worst-case analysis vs specific case analysis of learning

quite relevant for Deep learning theory