Kernel method

cosmos 1st October 2018 at 4:25pm
Machine learning Nonparametric regression

intro video, lecture notes

Machine learning methods whose Hypothesis space (the set of functions considered as possible outputs) is a Reproducing kernel Hilbert space (or, for kernel classification, a hypothesis space based on one, typically obtained by passing the function through some nonlinearity). These can be quite expressive. They are often used in conjunction with regularization, which makes their training/optimization a finite-dimensional, linear-algebra problem (by the Representer theorem)! They are also nonparametric (when choosing infinite-dimensional function/feature spaces), which has nice properties.

The loss functions that we choose are coercive Strongly convex Functionals. This implies that the minimum exists and that it is unique.
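As a sketch in standard notation (not taken from the linked notes), the Tikhonov-regularized functional being minimized is

$$ J(f) = \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \lambda \|f\|_{\mathcal{H}}^{2}, \qquad \lambda > 0 $$

The $\lambda \|f\|_{\mathcal{H}}^{2}$ term is what makes $J$ coercive (it blows up as $\|f\|_{\mathcal{H}} \to \infty$) and strongly convex whenever $L$ is convex in its second argument, and this is what guarantees that the minimizer exists and is unique.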

Representer theorem

When the function is in an RKHS, loss minimization becomes a finite-dimensional, linear optimization problem. This is shown with the Representer theorem (which relies heavily on squared-norm, i.e. L^2-style / Tikhonov, regularization): it says that the optimal function can be expressed as a linear combination of the Feature maps of the RKHS at the input data points.

This is easy to see analytically for finite-dimensional RKHSs and the case of squared loss, because then the optimum is given by the pseudoinverse, which can be cast in the form of a coefficient vector times the feature vectors of the input data points. Proof idea. more detailed proof Nice
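For instance, for regularized least squares with a finite-dimensional feature map $\phi$ and design matrix $\Phi$ (rows $\phi(x_i)^\top$), a standard derivation (a sketch, not taken from the linked proofs) gives

$$ w^{*} = (\Phi^{\top}\Phi + \lambda I)^{-1}\Phi^{\top} y = \Phi^{\top}(\Phi\Phi^{\top} + \lambda I)^{-1} y = \sum_{i=1}^{n} \alpha_i \phi(x_i), \qquad \alpha = (K + \lambda I)^{-1} y $$

where $K = \Phi\Phi^{\top}$ is the Gram matrix; the middle step is the push-through identity, and taking $\lambda \to 0$ recovers the pseudoinverse solution. So the optimum is explicitly a coefficient vector times the feature vectors of the data points.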

This is what the loss looks like after reducing to finite dimensions.
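Sketching it in standard notation: substituting the representer-theorem expansion $f = \sum_i \alpha_i k(x_i, \cdot)$ into the regularized objective turns it into a finite-dimensional problem in $\alpha$,

$$ \min_{\alpha \in \mathbb{R}^{n}} \; \frac{1}{n}\sum_{i=1}^{n} L\big(y_i, (K\alpha)_i\big) + \lambda \alpha^{\top} K \alpha $$

where $K_{ij} = k(x_i, x_j)$ is the Gram matrix, using $f(x_i) = (K\alpha)_i$ and $\|f\|_{\mathcal{H}}^{2} = \alpha^{\top} K \alpha$.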

Basically, in practice, the idea is just to apply the Kernel trick (substituting inner products with kernel evaluations), which avoids working with the feature vectors explicitly (as is done in standard Dictionary learning). See here for instance. Here he talks about the kernel trick itself
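A small numerical illustration of the trick (a sketch with made-up inputs, not from any of the linked sources): for the degree-2 polynomial kernel k(x, z) = (x·z + 1)^2, the kernel value equals the inner product of an explicit degree-2 feature map, which is exactly the substitution the kernel trick exploits, without ever building those features.

```python
import numpy as np

def poly_kernel(x, z):
    """Degree-2 polynomial kernel: k(x, z) = (x.z + 1)^2."""
    return (x @ z + 1.0) ** 2

def poly_features(x):
    """Explicit feature map for 2-d input x = (x1, x2) whose inner
    product reproduces (x.z + 1)^2 (monomials up to degree 2, with
    the usual sqrt(2) scalings)."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([0.5, -1.0])   # made-up data points
z = np.array([2.0, 0.3])

# Same number, computed two ways; the kernel never builds the features.
print(poly_kernel(x, z))                    # (x.z + 1)^2
print(poly_features(x) @ poly_features(z))  # <phi(x), phi(z)>
```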

All of this allows us to use linear learning algorithms to learn a nonlinear function or decision boundary!
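A minimal kernel ridge regression sketch along these lines (RBF kernel, synthetic 1-d data; the variable names and hyperparameters are my own, not from the linked sources): "training" is a single linear solve for the coefficients alpha, yet the learned function f(x) = sum_i alpha_i k(x_i, x) is nonlinear in x.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.5):
    """Gram matrix K[i, j] = exp(-||A[i] - B[j]||^2 / (2 * lengthscale^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2.0 * lengthscale ** 2))

# Synthetic 1-d regression problem with a clearly nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)

# "Training": alpha = (K + n*lam*I)^{-1} y  (linear algebra only).
lam = 1e-2
K = rbf_kernel(X, X)
alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)

# Prediction uses the representer expansion f(x) = sum_i alpha_i k(x_i, x).
X_test = np.linspace(-3, 3, 200)[:, None]
y_pred = rbf_kernel(X_test, X) @ alpha

# Rough fit error against the noiseless target sin(x).
print(np.max(np.abs(y_pred - np.sin(X_test[:, 0]))))
```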

The Representer theorem implies that the minimum is the minimal-norm solution, and that gradient descent can be rewritten in a certain way. also here

See more in this video

Interpretation of the norm of a function (used for regularization) for RKHSs with different regularizations


The main reason kernel methods aren't used so much nowadays, I think, is their high computational cost compared to alternatives (like Artificial neural networks) for high-dimensional data and, more importantly, for big data sets. See here. This is because linear methods (and non-kernel gradient-descent methods in general) run in time linear in the size of the data set, while kernel methods are at least quadratic: just forming the n×n Gram matrix takes O(n^2) time and memory, and an exact solve O(n^3), as they effectively substitute the dimension of the data with the size of the training set!


Why is this better than just choosing a linear model with p features? Because the features and the number of parameters are chosen by the data: nonparametrics!

Gaussian processes are basically a probabilistic interpretation of kernel methods. They are both nonparametric methods, because the number of parameters grows with the data (linearly). See Gaussian Processes for Machine Learning
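Concretely (a standard identity, stated here with the unnormalized ridge objective $\sum_i (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2$): the GP posterior mean under Gaussian observation noise of variance $\sigma_n^2$ coincides with the kernel ridge regression predictor when $\lambda = \sigma_n^2$,

$$ \bar{f}(x_*) = k(x_*, X)\,(K + \sigma_n^{2} I)^{-1} y = k(x_*, X)\,(K + \lambda I)^{-1} y $$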

Kernel methods can be seen as generalizations of Nearest-neighbour classification where the notion of "closeness" is given by the specific kernel we are using

Common examples

Logistic regression, when we use the Logistic loss function

Support vector machine, when we use the Hinge loss function.

Kernel ridge regression, when we use the Squared error loss function (for Regression)

Random features
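One standard construction here is random Fourier features (Rahimi & Recht), which approximate a shift-invariant kernel by an explicit, finite-dimensional random feature map, so that an ordinary linear model on those features stands in for the kernel method and the cost goes back to being linear in the data-set size. A minimal sketch for the RBF kernel (my own variable names and hyperparameters):

```python
import numpy as np

def rbf_kernel(x, z, lengthscale=1.0):
    """Exact RBF kernel value for two vectors."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * lengthscale ** 2))

def random_fourier_features(X, n_features=2000, lengthscale=1.0, seed=0):
    """Map X (n, d) to Z (n, D) so that Z @ Z.T approximates the RBF Gram matrix."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, n_features)) / lengthscale  # frequencies ~ N(0, 1/lengthscale^2)
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)      # random phases
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

X = np.random.default_rng(1).standard_normal((5, 3))  # made-up data
Z = random_fourier_features(X)
print(rbf_kernel(X[0], X[1]))  # exact kernel value
print(Z[0] @ Z[1])             # Monte Carlo approximation of it
```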


Splines

https://en.wikipedia.org/wiki/Kernel_method