Learning functions of the form f: X → ℝ (real-valued output).
"Clean data": yᵢ = f(xᵢ); what we actually observe are noisy "statistics": samples (xᵢ, yᵢ), i = 1, …, m, where yᵢ = f(xᵢ) + εᵢ and the noise εᵢ has zero mean.
For discrete-valued functions, we want to find h s.t. h(x) = y as often as possible (every mistake counts the same). However, this is not appropriate for real-valued output, since any deviation, large or small, would be counted as an error of the same kind. Instead we want to find h s.t. h(x) is close to y, i.e., the deviation |h(x) − y| is small on average.
Approach: optimize over a class of functions H: min over h ∈ H of (1/m) ∑ᵢ (h(xᵢ) − yᵢ)² ("square-error").
Minimizing the square error against the noisy labels y is equivalent to minimizing the square error against the true function f, since the two objectives differ only by the (constant) noise variance (see photo). This works for the square-error but not for some other error measures!
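A sketch of the decomposition presumably shown in the photo, assuming y = f(x) + ε with E[ε | x] = 0:

```latex
\begin{aligned}
\mathbb{E}\!\left[(h(x)-y)^2\right]
  &= \mathbb{E}\!\left[\big(h(x)-f(x)-\varepsilon\big)^2\right] \\
  &= \mathbb{E}\!\left[(h(x)-f(x))^2\right]
   - 2\,\mathbb{E}\!\left[(h(x)-f(x))\,\varepsilon\right]
   + \mathbb{E}\!\left[\varepsilon^2\right] \\
  &= \mathbb{E}\!\left[(h(x)-f(x))^2\right] + \mathbb{E}\!\left[\varepsilon^2\right].
\end{aligned}
```

The cross term vanishes because E[ε | x] = 0, and the last term does not depend on h, so minimizing either expression picks out the same h. For other error measures the cross term does not drop out in general.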
E.g.: Linear regression.
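As a concrete sketch (not from the notes): for the class of linear functions h(x) = ⟨w, x⟩, the empirical square-error minimizer has a closed form, here computed with numpy's least-squares solver on synthetic placeholder data:

```python
import numpy as np

# Synthetic regression data: y_i = <w_true, x_i> + noise (illustrative only)
rng = np.random.default_rng(0)
m, d = 200, 5
X = rng.normal(size=(m, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=m)

# Empirical square-error minimizer over linear functions:
#   w_hat = argmin_w (1/m) * sum_i (<w, x_i> - y_i)^2
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

train_err = np.mean((X @ w_hat - y) ** 2)
print("empirical square error:", train_err)
```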
Constrained gradient descent (see photo: just run gradient descent and, after each step, project back onto the set of feasible parameters defined by the constraint). One can show a rate of convergence; the lecturer showed it for the variant where the step size is fixed and we output the average over the iterates we visit.
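A minimal sketch of that projected variant, assuming for illustration that the constraint set is an ℓ2 ball of radius B and the loss is the empirical square error; the step size, radius, and data are placeholders, not values from the lecture:

```python
import numpy as np

def project_l2_ball(w, radius):
    """Project w onto the set {v : ||v||_2 <= radius}."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gd(X, y, radius=1.0, step=0.01, n_steps=500):
    """Fixed-step projected gradient descent on the empirical square error,
    returning the average of the iterates (the variant from the lecture)."""
    m, d = X.shape
    w = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(n_steps):
        grad = (2.0 / m) * X.T @ (X @ w - y)  # gradient of (1/m) * sum_i (<w, x_i> - y_i)^2
        w = project_l2_ball(w - step * grad, radius)
        avg += w
    return avg / n_steps
```

The convergence guarantee mentioned above applies to this averaged iterate, with the rate depending on the step size and the diameter of the constraint set.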
Since the model class has real-valued output, we cannot directly use the VC-dimension here. Instead, we can use a similar concept called the subgraph VC-dimension; the difference is that we count the number of distinct behaviors achieved with a given margin, which in the binary case means requiring yf(x) ≥ η for some margin η. There are different techniques for bounding the subgraph VC-dimension, such as covering numbers and Rademacher complexities. See here
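As an illustrative sketch (not from the notes) of the Rademacher-complexity technique: the empirical Rademacher complexity of a class H on a sample S is E over random signs σ of sup over h ∈ H of (1/m) ∑ᵢ σᵢ h(xᵢ). For ℓ2-norm-bounded linear functions the supremum has a closed form, so the quantity can be estimated by Monte Carlo:

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the rows of X.  For this class the
    supremum is attained in closed form: (B/m) * ||sum_i sigma_i x_i||_2."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    vals = []
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)  # Rademacher signs
        vals.append(B / m * np.linalg.norm(sigma @ X))
    return float(np.mean(vals))
```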
[28] proved that the Rademacher complexity of fully connected feedforward networks on a set S can be bounded in terms of the ℓ1 norm of the weights of the hidden units in each layer. In Chapter 5 we show how the capacity can be controlled for a large family of norms.
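To make the norm-based viewpoint concrete, here is a small illustrative computation of the per-layer quantity such bounds are stated in terms of (the largest ℓ1 norm over a layer's hidden units); the exact form of the bound from [28] is not reproduced here:

```python
import numpy as np

def max_unit_l1_per_layer(weights):
    """For each weight matrix W (shape: fan_out x fan_in), return the largest
    l1 norm over hidden units, i.e. the max over rows of ||W[j, :]||_1.
    These per-layer quantities are what norm-based capacity bounds, such as
    the one attributed to [28], are expressed in terms of."""
    return [float(np.max(np.abs(W).sum(axis=1))) for W in weights]

# Illustrative two-layer fully connected network with random weights
rng = np.random.default_rng(1)
weights = [rng.normal(size=(16, 8)), rng.normal(size=(1, 16))]
print(max_unit_l1_per_layer(weights))
```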