Support vector machine

cosmos 3rd October 2018 at 3:23pm
Kernel method

aka SVM

A method for discriminative Supervised learning, that is, for Classification and Regression analysis. It allows one to learn non-linear functions, like Artificial neural networks.

They correspond to Kernel methods which use the Hinge loss function. See this video. Because the loss function is non-differentiable, we use the Subgradient method (see here).

Andrew Ng lec intro, Notes, Wiki, Notation

Support Vector Machine Intro and Application - Practical Machine Learning Tutorial with Python p.20

See page 39 here, and here, and these notes

Generalization bounds – These bounds seem to not apply to what we observe (see Deep learning theory)

Relationships between Gaussian processes, Support Vector machines and Smoothing Splines

Maximum/Optimal margin classifier

You can intuitively see that Hinge loss + Tikhonov regularization (norm squared) pushes you towards large margin. The smaller the norm of the parameter vector, the more the Hinge loss penalizes points, even ones far from the boundary. Therefore, if you push towards low parameter vector norms, the classifier is pushed toward a larger margin, so that points end up farther from the boundary.
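
As a worked version of this argument: a point at geometric distance $\gamma_i = y^{(i)}(w^T x^{(i)} + b)/||w||$ from the boundary incurs Hinge loss $\max\{0,\, 1 - \gamma_i ||w||\}$, which is zero only when $\gamma_i \geq 1/||w||$. So the smaller the regularizer makes $||w||$, the larger the geometric margin has to be before the loss vanishes.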

Max-margin learning

Summary

Definition of the optimization (learning) problem: choose the classification hyperplane (parametrized by $w, b$) that maximizes the Geometric margin w.r.t. the training data set, with the constraint $||w|| = 1$.
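
Concretely, in the notation of the Andrew Ng notes linked above, this first formulation reads:

$\max_{\gamma, w, b} \; \gamma$

s.t. $y^{(i)}(w^T x^{(i)} + b) \geq \gamma, \quad i = 1, ..., n$

$||w|| = 1$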

An alternative formulation, which however leaves $||w||$ arbitrary.

Adding the constraint $\hat{\gamma} = 1$ (Functional margin), which fixes the arbitrariness of the magnitude of $w$, we get the following optimization problem:

$\min \frac{1}{2} ||w||^2$

s.t. $y^{(i)} (w^T x^{(i)} + b) \geq 1$

This is an example of a quadratic program

This is the same as maximizing the margin under the constraints (see page 8 here): divide the constraints by the norm of $w$ to recover the geometric margin.
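
A minimal numerical sketch of solving this hard-margin quadratic program directly, assuming linearly separable toy data and using cvxpy (the data and the library choice are mine, not from these notes):

```python
import cvxpy as cp
import numpy as np

# Toy linearly separable data: two points per class (hypothetical example)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# min (1/2)||w||^2   s.t.   y_i (w^T x_i + b) >= 1 for all i
objective = cp.Minimize(0.5 * cp.sum_squares(w))
constraints = [cp.multiply(y, X @ w + b) >= 1]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
print("geometric margin =", 1 / np.linalg.norm(w.value))
```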

SVM optimization problem

An SVM works by solving the dual of the optimal margin classifier (OMC) optimization problem defined above. The good thing is that the resulting problem is convex, so there are no local minima and there are very efficient methods for solving it!

Primal and dual Optimization problems: intro –> primal problem –> Dual problem (Skip some algebra)
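
For reference, the dual problem that comes out of that algebra takes the form

$\max_\alpha \; W(\alpha) = \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i,j=1}^n y^{(i)} y^{(j)} \alpha_i \alpha_j \langle x^{(i)}, x^{(j)} \rangle$

s.t. $\alpha_i \geq 0, \quad \sum_{i=1}^n \alpha_i y^{(i)} = 0$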

Support vectors

(Definition of support vectors; the $\alpha$s are Lagrange multipliers)

Support vectors satisfy the constraint inequality with equality, i.e. $y^{(i)}(w^T x^{(i)} + b) = 1$; they are the points with $\alpha_i > 0$.

The approach for deriving the optimal margin classifier for a support vector machine is to solve the dual optimization problem above.

Prediction can be written down using inner products: see here
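
Explicitly, since the dual solution gives $w = \sum_{i=1}^n \alpha_i y^{(i)} x^{(i)}$, the prediction at a new point $x$ is

$w^T x + b = \sum_{i=1}^n \alpha_i y^{(i)} \langle x^{(i)}, x \rangle + b$

so only inner products with the support vectors (the points with $\alpha_i > 0$) are needed.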

Summary

Kernels

See Kernel methods

Kernels Introduction - Practical Machine Learning Tutorial with Python p.29

Kernels allow the inner products that appear in the SVM formulas to be computed more efficiently, without explicitly constructing the feature vectors.
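
A standard illustration: for $x, z \in \mathbb{R}^d$, the kernel $K(x,z) = (x^T z)^2 = \sum_{i,j} (x_i x_j)(z_i z_j)$ equals $\langle \phi(x), \phi(z) \rangle$ for the feature map $\phi(x)$ whose entries are all products $x_i x_j$. Computing $K$ directly costs $O(d)$, whereas writing out $\phi$ costs $O(d^2)$.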

Idea of a Kernel

How to choose Kernels

How to check if a function is a valid Kernel

If $K(x,z)$ is a valid kernel, then the matrix with elements $K_{ij} = K(x^{(i)}, x^{(j)})$, for any set of points $x^{(i)}$, must be symmetric positive semi-definite. It turns out the converse is also true (Mercer's theorem).
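
A quick numerical sanity check of this necessary condition (a sketch; the Gaussian kernel and the tolerance are my choices):

```python
import numpy as np

def gaussian_kernel(x, z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def gram_matrix_is_valid(xs, kernel, tol=1e-10):
    """Check that K_ij = K(x_i, x_j) is symmetric positive semi-definite."""
    n = len(xs)
    K = np.array([[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])
    return np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol)

xs = [np.random.randn(3) for _ in range(20)]
print(gram_matrix_is_valid(xs, gaussian_kernel))  # expect True
```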

How to use a SVM with a kernel
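
In practice this usually means calling a library implementation. A minimal sketch with scikit-learn's SVC and an RBF kernel (the toy data and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.svm import SVC

# Toy non-linearly separable data: +1 inside a disc, -1 outside (hypothetical)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)

# RBF (Gaussian) kernel SVM; C controls the soft-margin trade-off
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
clf.fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])
print("training accuracy:", clf.score(X, y))
```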

How SVM work (to classify non-linearly separable data)

Examples of kernels

  • Gaussian kernel
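
The Gaussian kernel is $K(x, z) = \exp\left(-\frac{||x - z||^2}{2\sigma^2}\right)$; it corresponds to an infinite-dimensional feature map.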

Kernels can be applied to other learning algorithms!

$L_1$ norm soft margin SVM

Soft Margin SVM - Practical Machine Learning Tutorial with Python p.31

Soft margin is a method to deal with data that is not linearly separable (and with outliers).

Intro

Mathematical formulation

$\min \frac{1}{2} ||w||^2 + C \sum_{i=1}^n \xi_i$

s.t. $y^{(i)} (w^T x^{(i)} + b) \geq 1 - \xi_i$

$\xi_i \geq 0$

$i = 1, ..., n$

The slack variables $\xi_i$ basically weight how much each point contributes, so that if there is some outlier it doesn't break the algorithm. Explanation

We need $\xi_i \geq 0$ so that we don't reward easy-to-classify points with negative $\xi_i$.

This constrained optimization problem can be reformulated as an unconstrained optimization problem with a regularized Hinge loss function: $\frac{1}{2}||w||^2 + C \sum_i \max\{0, 1 - y_i(\mathbf{w}^T \mathbf{x}_i)\}$

To see this we have to minimize w.r.t. $\xi$ first, and then w.r.t. $\mathbf{w}$ (which can always be done). So we consider fixing the parameters $w$ and minimizing w.r.t. $\xi$, which gives $\xi_i = \max\{0, 1 - y_i(\mathbf{w}^T \mathbf{x}_i)\}$, i.e. the Hinge loss.
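
A minimal sketch of minimizing this unconstrained objective with the Subgradient method mentioned above (the step size, iteration count, and toy data are my choices):

```python
import numpy as np

def svm_subgradient_descent(X, y, C=1.0, lr=0.01, steps=500):
    """Minimize (1/2)||w||^2 + C * sum_i max{0, 1 - y_i w^T x_i},
    with the bias absorbed into w via a constant feature
    (so the bias gets slightly regularized too)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        margins = y * (Xb @ w)
        violated = margins < 1
        # Subgradient: w from the regularizer, -C y_i x_i for margin-violating points
        grad = w - C * (y[violated][:, None] * Xb[violated]).sum(axis=0)
        w -= lr * grad
    return w

# Toy linearly separable data (hypothetical)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, size=(50, 2)), rng.normal(-2, 1, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)
w = svm_subgradient_descent(X, y)
Xb = np.hstack([X, np.ones((len(X), 1))])
print("training accuracy:", np.mean(np.sign(Xb @ w) == y))
```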

Can also work out the dual of this optimization problem (see more on Optimization and Linear programming about dual optimization problems). See slides here

Convergence conditions

SMO algorithm

Sequential minimal optimization is an optimization algorithm that is a variation of Coordinate descent, applied to the SVM dual optimization problem. This is the reason we use it.

Description of the algorithm (we are maximizing a function $W(\alpha_1, ..., \alpha_m)$)
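
A heavily simplified sketch of the SMO pair update for the dual above, with a linear kernel (pair selection, pass count, and the stopping rule here are placeholders; the real algorithm adds KKT-violation checks and heuristics for choosing the pair, which matter a lot for speed):

```python
import numpy as np

def smo_pair_update(i, j, alpha, b, y, C, K):
    """Jointly re-optimize (alpha_i, alpha_j) in closed form, keeping
    sum_k alpha_k y_k = 0 and 0 <= alpha_k <= C."""
    f = (alpha * y) @ K + b                   # current predictions f(x_k)
    E_i, E_j = f[i] - y[i], f[j] - y[j]
    # Feasible interval for alpha_j implied by the equality constraint
    if y[i] != y[j]:
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = 2 * K[i, j] - K[i, i] - K[j, j]     # curvature along the constraint line
    if L == H or eta >= 0:
        return alpha, b
    ai_old, aj_old = alpha[i], alpha[j]
    alpha[j] = np.clip(aj_old - y[j] * (E_i - E_j) / eta, L, H)
    alpha[i] = ai_old + y[i] * y[j] * (aj_old - alpha[j])
    # Update the bias so the KKT conditions hold for the updated pair
    b1 = b - E_i - y[i]*(alpha[i]-ai_old)*K[i, i] - y[j]*(alpha[j]-aj_old)*K[i, j]
    b2 = b - E_j - y[i]*(alpha[i]-ai_old)*K[i, j] - y[j]*(alpha[j]-aj_old)*K[j, j]
    if 0 < alpha[i] < C:
        b = b1
    elif 0 < alpha[j] < C:
        b = b2
    else:
        b = (b1 + b2) / 2
    return alpha, b

def simplified_smo(X, y, C=1.0, passes=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    K = X @ X.T                               # linear-kernel Gram matrix
    alpha, b = np.zeros(n), 0.0
    for _ in range(passes):
        for i in range(n):
            j = rng.integers(n - 1)
            j = j + 1 if j >= i else j        # random j != i
            alpha, b = smo_pair_update(i, j, alpha, b, y, C, K)
    return alpha, b
```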

Applications of SVMs