See Regression analysis.
See Least-squares
Use Matrix calculus for optimization: leads to normal equations (analytical solution to least squares), etc.
See here
See here. We assume the errors between model values and actual values follow a Normal distribution. This implies the data would be distributed as a Gaussian with mean , and a certain Variance. See here. With this we define the Likelihood function. One can then derive that maximizing likelihood is the same as mimimizing mean squares.
Doing linear regression in Python with Sklearn
Regression Intro - Practical Machine Learning Tutorial with Python p.2
Regression How it Works - Practical Machine Learning Tutorial with Python p.7
There exists a nonparametric generalization of linear regression: Locally-weighted linear regression
The different features should not be linearly dependent in linear models, as parameters would be badly behaved