No free lunch theorem

cosmos 23rd January 2019 at 11:14pm
Learning theory

video

Wolpert article on No Free Lunch and the different learning theory frameworks

Much of machine learning is concerned with devising different models, and different algorithms to fit them. We can use methods such as cross validation to empirically choose the best method for our particular problem. However, there is no universally best model — this is sometimes called the no free lunch theorem (Wolpert 1996). The reason for this is that a set of assumptions that works well in one domain may work poorly in another.
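The "no universally best model" claim can be checked by brute force on a tiny domain. Below is a minimal sketch (my own illustration, not Wolpert's proof): averaged over *all* possible binary target functions, any deterministic learner has the same off-training-set error, 0.5. The learner names and the 8-point domain are made up for the example.

```python
import itertools

# Tiny brute-force illustration of the no-free-lunch idea: over ALL
# possible binary target functions on a small domain, every deterministic
# learner has the same average off-training-set (OTS) error, namely 0.5.

domain = list(range(8))   # 8 input points
train = domain[:2]        # points shown to the learner
test = domain[2:]         # off-training-set points

def ots_error(hypothesis, target):
    # fraction of off-training-set points where hypothesis and target differ
    return sum(hypothesis[x] != target[x] for x in test) / len(test)

def learner_always_zero(train_labels):
    # deliberately silly learner: predict 0 everywhere
    return {x: 0 for x in domain}

def learner_majority(train_labels):
    # predict the majority label observed in training, everywhere
    maj = int(sum(train_labels.values()) * 2 >= len(train_labels))
    return {x: maj for x in domain}

avg_ots = {}
for learner in (learner_always_zero, learner_majority):
    errors = []
    # enumerate every possible target function f: domain -> {0, 1}
    for bits in itertools.product([0, 1], repeat=len(domain)):
        target = dict(zip(domain, bits))
        hypothesis = learner({x: target[x] for x in train})
        errors.append(ots_error(hypothesis, target))
    avg_ots[learner.__name__] = sum(errors) / len(errors)

print(avg_ots)  # both learners average 0.5 over all targets
```

The point is that averaging over a uniform prior on targets washes out any advantage; an algorithm only wins on the subset of targets matching its assumptions.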

A conservation law for generalization performance.

The supervised learning no-free-lunch theorems.

No free lunch theorems for optimization.

The mathematics of generalization

The Relationship between PAC, the Statistical Physics framework, the Bayesian framework, and the VC framework (1994)


Conditioned only on the data, the test error is most probably not indicative of the error off the given data. Conditioned on the real (target) function, the test error most probably is indicative.

Cross-validation doesn't work (in this setting) because on-training-set error doesn't tell you anything about off-training-set error.

Symmetry between the hypothesis class and the class of possible real (target) functions.

If the realizability assumption is not satisfied, then the PAC theorem doesn't apply.

For an algorithm that can only produce a single function:

$(1-\epsilon)^m \,/\, \left(\frac{H}{F} + \frac{F-H}{F}(1-\epsilon)^m\right)$
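For reference, the realizable-case bound that the note above says breaks down: this is the standard textbook PAC sample-complexity expression (not from this note), sketched with made-up numbers.

```python
import math

# Standard realizable-case PAC bound (textbook result): if the target is
# in H (realizability), then with probability >= 1 - delta every
# hypothesis consistent with m i.i.d. examples has true error <= eps,
# provided
#     m >= (1 / eps) * (ln|H| + ln(1 / delta)).
# Without realizability this guarantee does not apply.

def pac_sample_complexity(h_size, eps, delta):
    # smallest integer m satisfying the bound above
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

m = pac_sample_complexity(h_size=2**20, eps=0.05, delta=0.01)
print(m)  # -> 370
```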

Does PAC-Bayes need a matching prior P(f)?

draw.io things


→ No Free Lunch versus Occam's Razor in Supervised Learning

Basically it's a proof (Theorems 14 and 15) giving a generalization bound similar to the Occam theorems in Probably approximately correct theory; but instead of being probabilistic, it gives a bound for a particular training set and a given complexity of the target function. Intuitively, such a bound is possible in principle: given the target function's complexity (or a bound on it), relative to a particular Turing machine (to make Kolmogorov complexity concrete), we can look at all target functions satisfying this complexity bound and compute the generalization error of each against the function output by the algorithm (which depends only on the training data). We can then check whether all of the corresponding generalization errors fall below some bound. I say this only to show that the information the theorem assumes could, in principle, be enough to construct a generalization bound. The theorem itself uses several properties of Kolmogorov complexity and entropy to obtain the bound.

The idea is the following: given the target function and the function output by the algorithm, we immediately know all the elements of the domain X on which the two functions differ. We assume the algorithm outputs a function that agrees perfectly with the target on the training set.

To describe the training set (i.e. which elements are in it), we then just need to add a description of {which of {the elements of the domain on which the function agrees with the target} are in the training set and which are not}. Intuitively, this information (call it I) is less than the information needed to describe which elements of the *whole* domain are in the training set; they prove this. Because the two functions plus this information constitute a description of the training set, {information needed to describe the functions} + I ≥ K(training set). Furthermore, I ≤ K(training set), as we intuited above. Now, both I and K(training set) turn out to scale with n (the number of elements in the domain), if we assume the training set is a fixed fraction θ of the elements of the domain. If {information needed to describe the functions} scales more slowly, then asymptotically we have I ≥ K(training set) and I ≤ K(training set), so asymptotically I = K(training set). By the intuitive argument above (made rigorous in the paper), this can only happen if the set of {elements of the domain on which the function agrees with the target} asymptotically becomes the whole domain. That is, the function agrees with the target over the whole domain, and in particular over the whole test set, which is a constant fraction 1−θ of the domain.
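The inequality chain can be compressed as follows (my paraphrase and notation, not the paper's: S is the training set, A the agreement set, X the domain, K(f,h) the information needed to describe the two functions):

```latex
\begin{align*}
K(S) &\le K(f,h) + I && \text{($f,h$ determine $A$; $I$ picks $S$ out of $A$)}\\
I &\lesssim K(S) && \text{(picking $S$ out of $A \subseteq X$ is no harder than out of $X$)}\\
I,\; K(S) &= \Theta(n) && \text{(both scale with domain size $n$, for $|S| = \theta n$)}\\
K(f,h) = o(n) &\Rightarrow I = K(S) + o(n) \Rightarrow |A|/|X| \to 1.
\end{align*}
```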

Universal Induction and Optimisation: No Free Lunch


"All models are wrong, but some models are useful." — George Box (Box and Draper 1987, p. 424)


Just like I blame the 2nd law of thermodynamics for my room being messy, I now blame the No Free Lunch theorem whenever I'm not good at something: "No Free Lunch! One can't be good at everything!", "I'm just good at other things", "on average we are just as good", etc.

  1. #mathematicallyjustifiedexcuses