Statistical supervised learning theory

cosmos 12th December 2019 at 4:41pm
Learning theory

See post: Learn in theory

The Statistical learning theory of Supervised learning. It mostly studies ways of ensuring good Generalization for learning algorithms. See more at Computational learning theory and Probably approximately correct.

See Introduction to supervised learning theory for a more complete formal introduction, and these notes

We have several approaches for obtaining Generalization error bounds (see more detail here); see below.

General principle: the more information the bound is allowed to depend on (the data distribution, the training sample, the algorithm), the tighter it can be.

Note that bounds that depend on the training sample are essentially bounds that depend on the Data distribution, but where the distribution-dependent quantity is estimated from the sample using some Concentration inequality.
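
For concreteness, a minimal sketch of the kind of concentration statement involved (assuming a loss bounded in $[0,1]$ and an i.i.d. sample of size $m$; $R$ denotes the true risk and $\hat{R}_S$ the empirical risk on the sample $S$): Hoeffding's inequality gives, for a fixed hypothesis $h$,

$$\Pr\left[\,|R(h) - \hat{R}_S(h)| \ge \epsilon\,\right] \le 2\exp(-2m\epsilon^2),$$

so the sample-dependent quantity $\hat{R}_S(h)$ stands in for the distribution-dependent quantity $R(h)$ up to a term of order $\sqrt{\log(1/\delta)/m}$.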

Realizable learning

where we assume the hypothesis class and data distribution (e.g. the target function) are such that we can reach 0 Empirical risk. This leads to PAC learning; the simplest such bound is sketched after the list below.

Data-dependent realizable learning bounds

  • Depends on the data distribution. (Deterministic case: depends on the target function.)
  • Depends on a distribution over data distributions. See the analysis when assuming the distribution over target functions follows the Universal probability distribution. Can also do a Bayesian analysis in this case.
  • Depends on the training sample, but not on the algorithm
  • Depends on the training sample, and on the algorithm
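
As a concrete instance (a minimal sketch for the simplest realizable setting: a finite hypothesis class $\mathcal{H}$, an i.i.d. sample of size $m$, 0-1 loss): with probability at least $1-\delta$, every $h \in \mathcal{H}$ with zero empirical risk satisfies

$$R(h) \le \frac{1}{m}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right).$$

This is the classic PAC bound obtained by a union bound over the hypotheses consistent with the sample; the distribution-, prior- and sample-dependent variants listed above refine or replace the $\ln|\mathcal{H}|$ term.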

Agnostic learning

where we don't assume anything about the data distribution, and just ask: how well can we do relative to the function that does best (has the lowest true risk) in a given hypothesis class? There is a related formulation, where we instead ask how much the true risk differs from the empirical risk. The error in the former can be shown to be at most twice the error in the latter.
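
A sketch of that factor-of-two relationship (writing $R$ for the true risk, $\hat{R}_S$ for the empirical risk, $\hat{h}$ for the empirical risk minimizer and $h^*$ for the best hypothesis in the class $\mathcal{H}$):

$$R(\hat{h}) - R(h^*) \le \big(R(\hat{h}) - \hat{R}_S(\hat{h})\big) + \big(\hat{R}_S(h^*) - R(h^*)\big) \le 2\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}_S(h)\big|,$$

using $\hat{R}_S(\hat{h}) \le \hat{R}_S(h^*)$ in the first step. So a uniform bound on the gap between true and empirical risk controls the excess risk, at the cost of at most a factor of 2.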

Data-dependent agnostic learning

(well technically, what's called "realizability" tends to mean that the true risk is 0, while here I am using it to mean that the empirical risk is 0)


Structural risk minimization is interesting because it depends on the data distribution and the algorithm only implicitly (much like sample-dependent bounds depend on the data distribution implicitly).
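
A minimal sketch of the kind of bound behind Structural risk minimization (assuming a countable hierarchy of finite classes $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \dots$, weights $w_k \ge 0$ with $\sum_k w_k \le 1$, a loss bounded in $[0,1]$, and an i.i.d. sample of size $m$): with probability at least $1-\delta$, for every $k$ and every $h \in \mathcal{H}_k$,

$$R(h) \le \hat{R}_S(h) + \sqrt{\frac{\ln|\mathcal{H}_k| + \ln\frac{1}{w_k \delta}}{2m}},$$

so minimizing the right-hand side over $h$ (and $k$) trades empirical risk off against the complexity of the class the hypothesis comes from; the data distribution and the algorithm only enter through which term ends up smallest on the given sample.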