The general field of machine learning theory (computational learning theory, statistical learning theory, optimization for learning, etc.); note that these terms have slightly different connotations and focus.
Mathematical theory of learning, and learning algorithms. See Machine learning, Deep learning theory, wiki.
See these notes and these notes. See the book and lectures "Understanding Machine Learning" by Shai Shalev-Shwartz and Shai Ben-David (found here for free). See also Introduction to supervised learning theory.
Statistical learning theory – Computational learning theory
Understanding machine learning (book) – Foundations of machine learning
Learning problem: design a system that improves its ability to perform a task T, as measured by a performance measure P, through experience E.
Some important types: see Machine learning and Computational learning theory.
Design factors in supervised machine learning
Namely: Expressibility, Optimization for learning, Generalization
Learning theory focuses mostly on Supervised learning, which is one of the most common learning problems in practice and provides a general foundation for many other learning problems.
Good course on statistical learning theory! (course webpage)
Introduction to supervised learning theory
See Supervised learning for more details on practical algorithms and applications.
Definition and introduction of the problem of learning a function given a loss
If one defines a Loss function $\ell(h(x), y)$ (or risk, or cost, see Decision theory), one can ask which prediction scheme (for instance, which function $h$) minimizes the expected risk $R(h) = \mathbb{E}[\ell(h(x), y)]$. This is called risk minimization.
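In standard notation (assuming a supervised setup with data distribution $P$ over pairs $(x, y)$; the symbols here are the usual ones, not necessarily those of the linked notes):

```latex
% Expected risk of a predictor h under loss \ell, and its minimizer:
R(h) = \mathbb{E}_{(x,y)\sim P}\big[\ell(h(x), y)\big],
\qquad
h^{*} = \arg\min_{h} R(h).
```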
Given the previous abstract definition, the solution to expected risk minimization is well defined once the data distribution is known. However, what counts as a good solution when one is only given training data (the real learning problem)? The answer to this question is what defines a good learning algorithm.
Some answers: Consistency of a learning algorithm and other notions of goodness (a common one is Probably approximately correct learning; see also Wolpert's chapter in the book "The Mathematics of Generalization"). A learning algorithm that achieves low expected risk with high probability (or related notions) is said to predict well, and if it does so even when given a training set much smaller than the test set, we say the algorithm is able to generalize.
Risk minimization requires knowing the joint probability distribution $P(x, y)$, so one often uses the sample mean of the loss over the training set, $\hat{R}(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h(x_i), y_i)$, as an estimator of the expected risk. Minimizing this empirical quantity is called empirical risk minimization (ERM).
The empirical risk for a 0-1 loss function is also known as the training error, which is just the fraction of training points that the hypothesis misclassifies. See Training error (vid).
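As a minimal sketch of the training error computation under the 0-1 loss (the hypothesis `h` and the toy data below are made up for illustration):

```python
import numpy as np

def empirical_risk_01(h, X, y):
    """Empirical risk under the 0-1 loss: the fraction of points that h misclassifies."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy data and a threshold classifier on 1-D inputs (purely illustrative).
h = lambda x: int(x > 0.5)
X_train = np.array([0.1, 0.4, 0.6, 0.9])
y_train = np.array([0, 0, 1, 1])
print(empirical_risk_01(h, X_train, y_train))  # training error = 0.0 for this toy data
```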
Sometimes the difference between the empirical risk and the true risk is called the generalization gap. Many learning algorithms amount to doing approximate ERM while ensuring a low generalization gap (and thus good prediction on unseen data). This is typically approached with Regularization. Balancing fit and gap is a tradeoff whose extremes are called overfitting and underfitting.
Making good generalizations is the goal of predictive/supervised learning, where generalization means making predictions about unseen data from seen data, and good generalization means these predictions agree with the actual results as often as possible. See also Induction.
For the case of 0-1 loss functions, the expected risk is also known as the generalization error, $R(h) = \Pr_{(x,y)\sim P}[h(x) \neq y]$, i.e. the probability that a new sample drawn from the same distribution producing the data is mislabeled by the hypothesis.
No free lunch theorem – Wolpert article on No Free Lunch and the different learning theory frameworks
Overfitting and underfitting refer to ways of training a model badly, i.e., making its generalization error poor compared to that of the optimal model. Bias-variance tradeoff
A way to avoid overfitting while keeping all the parameters in your model. Intro vid
We use a Prior distribution from Bayesian statistics to make simple hypotheses more likely. See Simplicity and learning.
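A concrete instance of this connection is L2-regularized empirical risk minimization, which for a squared loss and a Gaussian prior on the weights coincides with MAP estimation; a minimal sketch, with the data and regularization strength chosen only for illustration:

```python
import numpy as np

def regularized_objective(w, X, y, lam):
    """Squared-loss empirical risk plus an L2 penalty on the weights.
    Under a zero-mean Gaussian prior on w, minimizing this is MAP estimation."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + lam * np.sum(w ** 2)

# Illustrative data; the closed-form minimizer of the objective above (ridge regression).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
lam = 0.1
w_ridge = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(3), X.T @ y)
print(w_ridge, regularized_objective(w_ridge, X, y, lam))
```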
Model selection algorithms provide methods to automatically choose optimal bias/variance tradeoffs.
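For example, k-fold cross-validation is one such model-selection procedure; a rough sketch, where `fit` and `predict` are placeholder callables for whatever candidate model is being evaluated:

```python
import numpy as np

def k_fold_cv_error(fit, predict, X, y, k=5):
    """Average held-out 0-1 error of a candidate model over k folds."""
    n = len(y)
    folds = np.array_split(np.random.permutation(n), k)
    errors = []
    for fold in folds:
        train_idx = np.setdiff1d(np.arange(n), fold)
        model = fit(X[train_idx], y[train_idx])
        errors.append(np.mean(predict(model, X[fold]) != y[fold]))
    return np.mean(errors)

# One would compute this for each candidate (e.g., each regularization strength
# or hypothesis class) and pick the candidate with the lowest estimated error.
```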
See post: Learn in theory
Mostly about general bounds, which are often pessimistic because they have to cover worst cases, but which give good intuition about general relations, for instance between training set size and model complexity.
Theorem. Let a hypothesis class $H$ be given, and let its VC dimension be $\mathrm{VC}(H) = d$. Then with probability at least $1-\delta$, we have that for all $h \in H$,
$$|R(h) - \hat{R}(h)| \le O\!\left(\sqrt{\frac{d}{m}\log\frac{m}{d} + \frac{1}{m}\log\frac{1}{\delta}}\right),$$
where $m$ is the size of the training set. A corollary is that, for ERM, the number of training samples needed to reach a given performance is roughly linear in the VC dimension of the hypothesis class (see video). Sample complexity is upper bounded (up to logarithmic factors) by the VC dimension.
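As a rough numerical illustration of how the bound scales (the theorem only fixes the rate up to constants, so the numbers below are indicative only):

```python
import numpy as np

def vc_bound(d, m, delta):
    """Order of the generalization-gap bound above (constants omitted):
    sqrt( (d/m) * log(m/d) + (1/m) * log(1/delta) )."""
    return np.sqrt((d / m) * np.log(m / d) + (1 / m) * np.log(1 / delta))

for m in [100, 1_000, 10_000]:
    print(m, vc_bound(d=10, m=m, delta=0.05))
# The gap shrinks roughly like sqrt(d/m): to keep the same guarantee after
# doubling d, one needs roughly twice as many samples (up to log factors),
# matching the "linear in the VC dimension" corollary.
```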
Neural network theory, Generalization in deep learning
Applicable to problems in supervised learning, but also beyond
Minimize the negative log-likelihood (closely related to entropy; more precisely, to cross-entropy or relative entropy), which corresponds to maximizing the likelihood.
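A minimal sketch for a classifier that outputs class probabilities (the predicted probabilities and labels below are made up for illustration):

```python
import numpy as np

def negative_log_likelihood(probs, labels):
    """Average negative log-likelihood of the true labels under the model's
    predicted class probabilities (i.e. the cross-entropy loss)."""
    eps = 1e-12  # guard against log(0)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + eps))

# Illustrative predictions for 3 examples over 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 1])
print(negative_log_likelihood(probs, labels))  # lower NLL = higher likelihood
```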
Learning theory and neural networks gingko tree
Quite relevant for Deep learning theory.