See more detail and results at Computational learning theory. Here are just the basics.
We formally set up a linear classifier to explore the problems of learning theory, using empirical risk minimization (ERM) as the learning principle; this builds on the definitions above.
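(A minimal sketch of mine, not from the source: ERM over a small, hypothetical finite class of linear classifiers with the 0-1 loss. The data and the candidate weight vectors are made up for illustration.)

```python
import numpy as np

def empirical_risk(w, X, y):
    """Average 0-1 loss of the linear classifier sign(w . x) on the sample."""
    return np.mean(np.sign(X @ w) != y)

def erm(hypothesis_class, X, y):
    """Empirical risk minimization: return the hypothesis with lowest training error."""
    return min(hypothesis_class, key=lambda w: empirical_risk(w, X, y))

# Hypothetical toy data: 2-D points with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])

# A (hypothetical) finite hypothesis class: a handful of candidate weight vectors.
H = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
     np.array([1.0, 0.5]), np.array([-1.0, 1.0])]

w_hat = erm(H, X, y)
print("ERM pick:", w_hat, "training error:", empirical_risk(w_hat, X, y))
```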
Hypothesis classes
Hypothesis class and a more general definition
The simple case of finite hypothesis classes
Lemmas
Strategy
–>Video. The result is a form of uniform convergence (vid).
Other formulations
Sample complexity bound (vid)
Theorem. Let $|\mathcal{H}| = k$ (where $\mathcal{H}$ is the hypothesis class, the set of possible hypotheses), and let any $m, \delta$ be fixed. Then, w.p. at least $1-\delta$, the generalization error of the ERM hypothesis $\hat{h}$ satisfies $\varepsilon(\hat{h}) \le \left(\min_{h \in \mathcal{H}} \varepsilon(h)\right) + 2\sqrt{\tfrac{1}{2m}\log\tfrac{2k}{\delta}}$, where $m$ is the number of training examples.
–>Relation to bias/variance tradeoff. The first term can be seen as the "bias", and the second as the "variance".
The training set size $m$ needed for $\varepsilon(\hat{h}) \le \min_{h \in \mathcal{H}} \varepsilon(h) + 2\gamma$ to hold w.p. at least $1-\delta$ is $m \ge \tfrac{1}{2\gamma^2}\log\tfrac{2k}{\delta} = O\!\left(\tfrac{1}{\gamma^2}\log\tfrac{k}{\delta}\right)$, i.e., it grows only logarithmically in $k = |\mathcal{H}|$.
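(Quick numeric check of mine of the corollary above: the sufficient training set size grows only logarithmically in $k$. The particular values of $k$, $\gamma$, $\delta$ are arbitrary.)

```python
import math

def m_sufficient(k, gamma, delta):
    """Training set size sufficient for eps(h_hat) <= min_h eps(h) + 2*gamma
    to hold w.p. at least 1 - delta, for a finite class of size k."""
    return math.ceil(1.0 / (2.0 * gamma ** 2) * math.log(2.0 * k / delta))

for k in (10, 10**3, 10**6):
    print(f"k = {k:>7}: m >= {m_sufficient(k, gamma=0.05, delta=0.01)}")
# m grows only with log(k): even a million hypotheses costs little extra data.
```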
Definition of shattering (video): a hypothesis class $\mathcal{H}$ shatters a set $S$ if, for every possible labeling of the points in $S$, some $h \in \mathcal{H}$ realizes it; the VC dimension is the size of the largest set that can be shattered.
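(A concrete example of mine: threshold classifiers $x \mapsto \mathrm{sign}(x - t)$ on the real line shatter any single point but no set of two points, so their VC dimension is 1.)

```python
from itertools import product

def threshold_labels(points, t):
    """Labels produced by the threshold classifier: +1 iff x >= t."""
    return tuple(1 if x >= t else -1 for x in points)

def is_shattered(points, thresholds):
    """True if every +/-1 labeling of `points` is realized by some threshold."""
    realizable = {threshold_labels(points, t) for t in thresholds}
    return all(lab in realizable for lab in product((-1, 1), repeat=len(points)))

thresholds = [-10, 0.5, 1.5, 10]             # enough to enumerate all behaviors here
print(is_shattered([1.0], thresholds))       # True  -> one point is shattered
print(is_shattered([1.0, 2.0], thresholds))  # False -> the labeling (+1, -1) is never produced
```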
–>Theorem. Let a hypothesis class $\mathcal{H}$ be given, and let its VC dimension be $\mathrm{VC}(\mathcal{H}) = d$. Then, w.p. at least $1-\delta$, we have that for all $h \in \mathcal{H}$, $|\varepsilon(h) - \hat{\varepsilon}(h)| \le O\!\left(\sqrt{\tfrac{d}{m}\log\tfrac{m}{d} + \tfrac{1}{m}\log\tfrac{1}{\delta}}\right),$
where $m$ is the size of the training set. A corollary is that for ERM, the number of training samples needed for a particular performance is roughly linear in the VC dimension of the hypothesis class (see video). Sample complexity is upper-bounded in terms of the VC dimension, and in the worst case it is also lower-bounded by it. Often, the VC dimension is roughly proportional to the number of parameters in the model.
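(An illustration of mine of the "roughly linear" claim: taking the hidden constant in the $O(\cdot)$ bound above as 1, search for the smallest $m$ that drives the uniform-convergence term below a target $\gamma$, for a few values of $d$.)

```python
import math

def vc_term(m, d, delta):
    """sqrt(d/m * log(m/d) + 1/m * log(1/delta)), with the O(.) constant taken as 1."""
    return math.sqrt(d / m * math.log(m / d) + math.log(1.0 / delta) / m)

def samples_needed(d, gamma=0.1, delta=0.05):
    """Smallest m with vc_term <= gamma (scan starts past the term's peak at m ~ e*d)."""
    m = math.ceil(math.e * d) + 1
    while vc_term(m, d, delta) > gamma:
        m += 1
    return m

for d in (5, 10, 20, 40):
    print(f"VC dim {d:>2}: m ~ {samples_needed(d)}")
# The required m scales roughly linearly with d (up to the log factor).
```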
Application to support vector machines
Smooth loss functions as convex relaxations. Learning algorithms such as logistic regression or support vector machines (which use the hinge loss) may be viewed as approximating ERM under the 0-1 loss.
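(A sketch of mine: on the margin $z = y\,w^\top x$, the hinge loss and the base-2 logistic loss are convex upper bounds on the 0-1 loss, which is what makes minimizing them a tractable relaxation of 0-1 loss ERM.)

```python
import numpy as np

def zero_one(z):
    """0-1 loss on the margin z = y * (w . x): 1 iff misclassified."""
    return (z <= 0).astype(float)

def hinge(z):
    """Hinge loss (SVM): convex, never below the 0-1 loss."""
    return np.maximum(0.0, 1.0 - z)

def logistic2(z):
    """Logistic loss in base 2 (logistic regression), also an upper bound on the 0-1 loss."""
    return np.log2(1.0 + np.exp(-z))

margins = np.linspace(-2.0, 2.0, 9)
for name, loss in [("0-1", zero_one), ("hinge", hinge), ("logistic", logistic2)]:
    print(f"{name:>8}: {np.round(loss(margins), 2)}")
# Both surrogates are convex upper bounds on the 0-1 loss, so minimizing them over
# the training set is a tractable relaxation of 0-1 loss ERM.
```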