Least-squares

cosmos 10th December 2018 at 9:30pm
Kernel method Loss function Optimization

Video

These are called the normal equations. They have applications in Linear regression and in GPS.


Intuition for why the normal equations correspond to projection (apart from being derived by minimizing distance to the column space!)

$X^T X w = X^T Y$.
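A minimal sketch of solving these equations numerically, assuming NumPy; the synthetic data, variable names (X, Y, w_true) and noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3                                # n samples, d features
X = rng.normal(size=(n, d))                  # design matrix (columns = features)
w_true = np.array([2.0, -1.0, 0.5])          # illustrative ground-truth weights
Y = X @ w_true + 0.1 * rng.normal(size=n)    # noisy outputs

# Solve the normal equations X^T X w = X^T Y directly
# (fine when X^T X is well-conditioned; lstsq is the numerically safer route)
w = np.linalg.solve(X.T @ X, X.T @ Y)
w_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(w)        # close to w_true
print(w_lstsq)  # agrees with the normal-equations solution
```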

$X^T Y$ is the correlation of the outputs with each of the features (equivalently, the dot product of the output vector with each feature vector, where each element corresponds to a point from the sample).

The above equation says that $Xw$ should have the same dot products with the features as $Y$ does. Now $Xw$ lies in the column space of $X$ (which, in the overdetermined case, has lower dimensionality than the space in which $Y$ lives), and every element of this space is uniquely determined by its dot products with the basis given by the columns of $X$ (assuming linearly independent columns, i.e. full rank). Furthermore, projecting $Y$ onto the column space doesn't change its dot products with the columns (to see this, decompose $Y$ into components perpendicular and parallel to the column space; the perpendicular part has zero dot product with every column). Therefore the normal equations find the $Xw$ that equals the projection of $Y$ onto the column space, as we expect from the definition of the least-squares problem!
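A small numerical check of this projection picture (again assuming NumPy, reusing X, Y and w from the sketch above): the residual $Y - Xw$ has zero dot product with every column of $X$, and $Xw$ matches the orthogonal projection of $Y$ onto the column space computed independently via a QR decomposition.

```python
residual = Y - X @ w

# X^T (Y - Xw) = 0: Xw reproduces Y's dot products with the features
print(np.allclose(X.T @ residual, 0.0))      # True

# Xw equals the projection of Y onto col(X), built from an orthonormal basis Q
Q, _ = np.linalg.qr(X)
print(np.allclose(X @ w, Q @ (Q.T @ Y)))     # True
```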

When we then take the inverse of the covariance matrix $X^T X$, we give more weight to eigendirections of the covariance matrix with little variance, and less weight to directions with large variance, for a fixed correlation with $Y$. This is because if there is small variance along some direction but changes along it cause equally big changes in the output, then the weight for that direction must be large (so that small changes in the input produce large changes in the output).
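A sketch of this reweighting (assuming NumPy, reusing X, Y and w from above): writing $w = (X^T X)^{-1} X^T Y$ in the eigenbasis of $X^T X$, each eigendirection contributes its correlation with $Y$ divided by its eigenvalue (its variance along that direction), so low-variance directions get amplified.

```python
evals, V = np.linalg.eigh(X.T @ X)   # eigendecomposition of the covariance matrix X^T X
corr = X.T @ Y                       # correlation of the output with each feature

# w = sum_i (v_i . X^T Y) / lambda_i * v_i : small eigenvalue => large weight
w_eig = sum(((V[:, i] @ corr) / evals[i]) * V[:, i] for i in range(d))

print(np.allclose(w_eig, w))         # matches the normal-equations solution
```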