These are called the normal equations. They have applications in linear regression and in GPS.
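Restating them here for reference (in notation assumed below: design matrix $X$ with one sample per row, output vector $Y$, weight vector $\beta$), they come from setting the gradient of the squared error to zero:

$$\min_\beta \|Y - X\beta\|^2 \;\Longrightarrow\; \nabla_\beta \|Y - X\beta\|^2 = -2X^T(Y - X\beta) = 0 \;\Longrightarrow\; X^T X\,\beta = X^T Y.$$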
Intuition for why the normal equations correspond to projection (apart from their being derived by minimizing the distance to the column space!):
$X^T Y$ is the correlation of the outputs with each of the features (up to centering and scaling); equivalently, it is the vector of dot products of the output with each feature vector, where each element of a feature vector corresponds to one point from the sample.
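Component-wise, writing $x_j$ for the $j$-th feature column of $X$ (notation assumed here), this just says

$$(X^T Y)_j = x_j^T Y = \sum_{n} X_{nj}\, Y_n,$$

which for centered and standardized features and outputs is proportional to the sample correlation between feature $j$ and the output.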
The above equation says that $\beta$ should be such that $X\beta$ has the same dot products with the features as $Y$ does. Now $X\beta$ lies in the column space of $X$ (which in the overdetermined case has lower dimensionality than the space in which $Y$ lives), and every element of this space is uniquely determined by its dot products with the basis given by the columns of $X$ (assuming linearly independent columns, i.e. full rank). Furthermore, projecting $Y$ onto the column space doesn't change its dot products with the columns (decompose $Y$ into components perpendicular and parallel to the column space to see this; the perpendicular part has zero dot product with every column). Therefore the equation is finding the $\beta$ for which $X\beta$ equals the projection of $Y$ onto the column space, as we expect from the definition of the least squares problem!
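A minimal numerical check of this picture, under the assumed setup (numpy, a random overdetermined $X$ with full column rank); here $P = X(X^TX)^{-1}X^T$ is the orthogonal projector onto the column space:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # overdetermined: 50 samples, 3 features
Y = rng.normal(size=50)

# Solve the normal equations X^T X beta = X^T Y
beta = np.linalg.solve(X.T @ X, X.T @ Y)

# Orthogonal projection of Y onto the column space of X
P = X @ np.linalg.inv(X.T @ X) @ X.T
Y_parallel = P @ Y
Y_perp = Y - Y_parallel

print(np.allclose(X @ beta, Y_parallel))       # fitted values equal the projection
print(np.allclose(X.T @ Y_perp, 0))            # perp part has zero dot product with the columns
print(np.allclose(X.T @ (X @ beta), X.T @ Y))  # same dot products with the features as Y
```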
When we then apply the inverse of the covariance matrix $X^T X$, we give more weight to eigendirections of the covariance matrix with little variance, and less weight to directions with large variance, for a fixed correlation with $Y$. This is because if there is smaller variance in some direction but changes along this direction cause equally big changes in the output, then the weight for this direction must be large (so that small changes in the input cause large changes in the output).
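One way to make this precise (a sketch, in the same assumed notation): eigendecompose $X^T X = \sum_i \lambda_i v_i v_i^T$ with orthonormal eigenvectors $v_i$. Then

$$\beta = (X^T X)^{-1} X^T Y = \sum_i \frac{v_i^T X^T Y}{\lambda_i}\, v_i,$$

so for a fixed correlation $v_i^T X^T Y$ between an eigendirection and the output, the coefficient along that direction is scaled by $1/\lambda_i$: low-variance directions (small $\lambda_i$) are amplified, high-variance directions are damped.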