Empirical risk minimization

cosmos 29th November 2017 at 2:20pm
Learning theory

Risk minimization (see Learning theory) requires knowing the joint probability distribution P(x,y), so one often uses the sample mean of the loss over the training data as an estimator of the risk (the expected loss). Minimizing this empirical quantity is called empirical risk minimization (ERM).

Depending on the form of the loss function, this optimization problem may be convex or non-convex. With the 0-1 loss the problem is non-convex, and finding its minimizer is NP-hard. Replacing it with a smooth surrogate loss can turn it into a convex problem, solvable by Gradient descent.
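As an illustration (not from the original notes), here is a minimal sketch of ERM for a linear classifier with labels in {-1, +1}: the non-convex 0-1 loss is swapped for the smooth, convex logistic loss, and the empirical risk is minimized by plain gradient descent. All names and the synthetic data below are hypothetical.

```python
# Minimal ERM sketch: linear classifier, logistic loss as a smooth
# surrogate for the 0-1 loss, optimized by gradient descent.
import numpy as np

def logistic_loss(w, X, y):
    # empirical risk: mean over the sample of log(1 + exp(-y * <w, x>))
    margins = y * (X @ w)
    return np.mean(np.log1p(np.exp(-margins)))

def logistic_grad(w, X, y):
    # gradient of the empirical risk: mean of -y * x * sigmoid(-y * <w, x>)
    margins = y * (X @ w)
    coef = -y / (1.0 + np.exp(margins))
    return (X * coef[:, None]).mean(axis=0)

def erm_gradient_descent(X, y, lr=0.1, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * logistic_grad(w, X, y)
    return w

# toy usage with synthetic data (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=200))
w_hat = erm_gradient_descent(X, y)
print(logistic_loss(w_hat, X, y))
```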

Neural networks [2.1] : Training neural networks - empirical risk minimization

Empirical risk minimization is thus defined as the Optimization problem of minimizing

\frac{1}{m} \sum\limits_i l(f(x^{(i)}; \mathbf{\theta}), y^{(i)}) + \lambda \Omega(\mathbf{\theta})

where f is our Model (hypothesis), which depends on the model parameters \mathbf{\theta}; l is the Loss function; \Omega(\mathbf{\theta}) is the regularizer; and \lambda is a hyperparameter that balances the two terms.

When we add the regularizer, ERM is called structural risk minimization.
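As a sketch of the regularized objective above (again not from the original notes), the snippet below computes the structural risk for a linear model, assuming a squared loss and an L2 regularizer \Omega(\mathbf{\theta}) = ||\theta||^2; the function name and these choices are illustrative.

```python
# Structural risk: (1/m) * sum_i l(f(x^(i); theta), y^(i)) + lambda * Omega(theta)
import numpy as np

def structural_risk(theta, X, y, lam=1e-2):
    preds = X @ theta                           # f(x^(i); theta) for a linear model
    empirical_risk = np.mean((preds - y) ** 2)  # (1/m) sum_i squared loss
    regularizer = np.dot(theta, theta)          # Omega(theta) = ||theta||^2
    return empirical_risk + lam * regularizer   # trade-off controlled by lambda

# toy usage (illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=100)
print(structural_risk(np.zeros(3), X, y))
```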

Another video explanation
Another good lecture


Formally setting up a linear classifier to explore the problems of learning theory, using empirical risk minimization as the learning principle

Intro to ERM
Definition