See post: Learn in theory
The Statistical learning theory of Supervised learning. It mostly studies ways of ensuring good Generalization for learning algorithms. See more at Computational learning theory and Probably approximately correct
See Introduction to supervised learning theory for a more complete formal introduction, and these notes
We have several approaches to Generalization error bounds (see more detail here); they are listed below.
General principle: the more information the bound depends on (e.g. the hypothesis class, the data distribution, the training sample, the learning algorithm), the tighter it can be
Note that bounds that depend on the training sample are basically bounds that depend on the Data distribution, but where we estimate the distribution-dependent quantity from the sample, using some Concentration inequality
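As a concrete instance of this (notation mine, not from the original note): for a single fixed hypothesis $h$ with loss in $[0,1]$, Hoeffding's inequality gives, with probability at least $1-\delta$ over an i.i.d. sample of size $m$,
$$ R(h) \;\le\; \hat{R}(h) + \sqrt{\frac{\ln(1/\delta)}{2m}}, $$
where $R$ is the true risk and $\hat{R}$ the empirical risk: the distribution-dependent quantity $R(h)$ is controlled by its sample estimate plus a concentration term.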
Realizable learning: where we assume the hypothesis class and data distribution (e.g. the target function) are such that we can reach zero Empirical risk. This leads to PAC learning.
Data-dependent realizable learning bounds
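A standard example of a realizable PAC bound, sketched here for a finite hypothesis class $H$ under 0-1 loss (the finiteness assumption and notation are mine): with probability at least $1-\delta$, every $h \in H$ with zero empirical risk on a sample of size $m$ satisfies
$$ R(h) \;\le\; \frac{\ln|H| + \ln(1/\delta)}{m}. $$
Note the $1/m$ rate here, versus the $1/\sqrt{m}$ rate in the agnostic case below.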
Agnostic learning: where we don't assume anything about the data distribution, and just ask how well we can do relative to the function that does best (has the lowest true risk) in a given hypothesis class. It has a related formulation, where we instead ask how much the true risk differs from the empirical risk. The error in the former can be shown to be at most twice the error in the latter (see the sketch below).
Data-dependent agnostic learning
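The factor-of-two relation above can be sketched as follows (again, notation mine): if $\hat{h}$ minimizes the empirical risk over $H$ and $h^*$ minimizes the true risk, then
$$ R(\hat{h}) - R(h^*) \;\le\; 2 \sup_{h \in H} \big| R(h) - \hat{R}(h) \big|, $$
since $R(\hat{h}) - R(h^*) = [R(\hat{h}) - \hat{R}(\hat{h})] + [\hat{R}(\hat{h}) - \hat{R}(h^*)] + [\hat{R}(h^*) - R(h^*)]$ and the middle term is $\le 0$. For a finite class with loss in $[0,1]$, a union bound plus Hoeffding then gives, with probability at least $1-\delta$,
$$ \sup_{h \in H} \big| R(h) - \hat{R}(h) \big| \;\le\; \sqrt{\frac{\ln|H| + \ln(2/\delta)}{2m}}. $$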
(well technically, what's called "realizability" tends to mean that the true risk is 0, while here I am using it to mean that the empirical risk is 0)
Structural risk minimization is interesting because it depends on the data distribution and the algorithm only implicitly (somewhat like how sample-dependent bounds depend on the data distribution implicitly).
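A minimal sketch of the SRM rule, assuming a nested sequence of finite classes $H_1 \subseteq H_2 \subseteq \cdots$ and a confidence split $\delta_k = \delta / 2^k$ (these choices are mine, just for illustration):
$$ \hat{h} \;=\; \arg\min_{k,\; h \in H_k} \; \left[ \hat{R}(h) + \sqrt{\frac{\ln|H_k| + k\ln 2 + \ln(2/\delta)}{2m}} \right]. $$
The data distribution enters only implicitly, through the empirical risk $\hat{R}(h)$ that drives the choice of level $k$; the penalty itself depends only on the hierarchy $\{H_k\}$, $m$, and $\delta$.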