Universal probability

cosmos 16th November 2017 at 12:57am
Algorithmic information theory

a.k.a. algorithmic probability, Levin's probability distribution, universal prior

Probability distribution of the outputs of a Turing machine when fed random inputs. Apparently, the resulting distribution is rather robust to the probability distribution over the inputs, hence the name "universal".

Imagine a monkey sitting at a keyboard and typing the keys at random.
The probability of an input program (string) $p$ is $2^{-l(p)}$. Simple output strings are more likely than complicated strings of the same length, because they have shorter programs.
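For concreteness, the standard way this distribution is written (here for a prefix universal machine $U$; a sketch, not tied to any particular reference on this page):

$m(x) = \sum\limits_{p\,:\,U(p)=x} 2^{-l(p)}$

Each program contributes $2^{-l(p)}$, so an output $x$ gets most of its mass from its shortest programs, which is why $m(x) \approx 2^{-K(x)}$ (the coding theorem).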

Universality of the universal probability

See here (vid) also

Remark. Bounded likelihood ratio. The likelihood ratio $P_{\mathcal{U}}(x)/P_{\mathcal{A}}(x)$ is bounded, and doesn't go to $0$ or $\infty$ for any $x$, thus no universal probability can be totally discarded relative to any other in hypothesis testing. This is essentially because any universal computer can simulate any other, and in that sense the probability distribution obtained by feeding random input into one is also contained in the distribution obtained from the other.
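A minimal sketch of why the ratio is bounded, assuming both machines are universal: there is a fixed simulation program $\pi_{\mathcal{A}}$ of some constant length $c_{\mathcal{A}}$ with $\mathcal{U}(\pi_{\mathcal{A}} p) = \mathcal{A}(p)$ for all $p$, so

$P_{\mathcal{U}}(x) \geq \sum\limits_{p:\mathcal{A}(p)=x} 2^{-(c_{\mathcal{A}} + l(p))} = 2^{-c_{\mathcal{A}}} P_{\mathcal{A}}(x)$

and the symmetric argument with the machines swapped bounds the ratio from above as well.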

In that sense we cannot reject the possibility that the universe is the output of monkeys typing at a computer. However, we can reject the hypothesis that the universe is random (monkeys with no computer). 😮

The example indicates that a random input to a computer is much more likely to produce “interesting” outputs than a random input to a typewriter. We all know that a computer is an intelligence amplifier. Apparently, it creates sense from nonsense as well.

On the machine dependence of UP

That Algorithmic Probability was relatively independent of the choice of which universal machine was used seemed likely but was not altogether certain. It was clear that probabilities assigned by different machines would differ by only a constant factor (independent of length of data described), but this constant factor could be quite large.

Fortunately, in just about all applications of probability, we are not interested in absolute probabilities, but only in probability ratios. I had some heuristic arguments to the effect that the probability ratio for alternative possible continuations of a sequence should be relatively machine independent if the sequence were very long. The second half of the 1964 paper was devoted to examples of how this probability evaluation method could be applied. That it seemed to give reasonable answers was some of the strongest evidence that I had for its correctness.


See Simplicity bias

Coding-theorem Like Behaviour and Emergence of the Universal Distribution from Resource-bounded Algorithmic Probability

Universal distribution, Algorithmic probability, use in induction/inference

There are many properties of m that make it optimal [32, 33, 34, 23]. For example, the same Universal Distribution will work for any problem within a convergent error; it can deal with missing and multidimensional data; the data does not need to be stationary or ergodic; there is no under-fitting or over-fitting because the method is parameter-free and thus the data need not be divided into training and test sets; it is the gold standard for a Bayesian approach in the sense that it updates the distribution in the most efficient and accurate way possible with no assumptions.

Several interesting extensions of resource-bounded Universal Search approaches have been introduced in order to make algorithmic probability more useful in practice [27, 28, 20, 14], some of which provide some theoretical bounds [1]. Some of these approaches have explored the effect of relaxing some of the conditions (e.g. universality) on which Levin's Universal Search is fundamentally based [36], or have introduced domain-specific versions (and thus versions of conditional AP). Here we explore the behaviour of explicitly weaker models of computation, of increasing computational power, in order to investigate the asymptotic behaviour and emergence of the Universal Distribution and the properties of both the different models with which to approximate it and the actual empirical distributions that such models produce (see here).
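For reference (standard material, not specific to the papers cited above): Levin's Universal Search allots to each candidate program $p$ a fraction $2^{-l(p)}$ of the total running time, so a solution found by some program $p^*$ in time $t(p^*)$ is found by the search in time roughly $2^{l(p^*)} \cdot t(p^*)$; the corresponding resource-bounded complexity is $Kt(x) = \min_p \{ l(p) + \log_2 t(p) \}$.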

predict natural distributions (also useful for inference..)

More real-world approaches may then lead to applications, such as helping to predict the most likely (algorithmic) final configuration of a folding protein. If the chemical and thermodynamic laws that drive these processes are considered algorithmic in any way, even under random interactions, e.g. molecular Brownian motion, the Universal Distribution may offer insights that help us quantify the most likely regions, if those laws in any sense constitute forms of computation below or at the Turing level that we explore here.

===

Levin proved that the output distribution established by Algorithmic Probability dominates (up to multiplicative scaling) any other distribution produced by algorithmic means as long as the executor is a universal machine, hence giving the distribution its ‘universal’ character (and name, as the ‘Universal Distribution’).

This so-called Universal Distribution is a signature of Turing-completeness. However, many processes that model or regulate natural phenomena may not necessarily be Turing universal. For example, some models of self-assembly may not be powerful enough to reach Turing-completeness, yet they display similar output distributions to those predicted by the Universal Distribution by way of the algorithmic Coding theorem, with simplicity highly favoured by frequency of production. Noise is another source of power degradation that may hamper universality and therefore the scope and application of algorithmic probability. However, if some sub-universal systems approach the coding-theorem behaviour, these give us great prediction capabilities and less powerful but computable algorithmic complexity measures. Here we ask whether such distributions can be partially or totally explained by importing the relation established by the coding theorem, and under what conditions non-universal systems can display algorithmic coding-theorem like behaviour. (see here)
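The algorithmic Coding theorem being referred to is, for a prefix universal machine (a standard statement, up to an additive constant):

$-\log_2 m(s) = K(s) + O(1)$

i.e. frequency of production under random programs and Kolmogorov complexity track each other; the question here is how much of that relation survives when the machines are sub-universal.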

We produce empirical distributions of systems at each of the computing levels of the Chomsky hierarchy (see here):

  • starting from transducers (Type-3) as defined in [18]. We use an enumeration of transducers introduced in [18] where they also proved an invariance theorem, thus demonstrating that the enumeration choice is invariant (up to a constant).
  • Context-free grammars (Type-2) as defined in [35],
  • linear-bounded nondeterministic Turing machines (Type-1) as approximations to bounded Kolmogorov-Chaitin complexity
  • and a universal procedure from an enumeration of Turing machines (Type-0) as defined in [5, 21].

We report the results of the experiments and comparisons, showing the gradual coding-theorem-like behaviour at the boundary between decidable and undecidable systems. We also explore the consequences of relaxing the halting configuration (e.g. halting state) in models of universal computation (Type-0) when it comes to comparing their output distributions.

Finite-state complexity (see Automata-based descriptional complexity). See definitions here. Check out the invariance theorem for finite-state complexity.

(Finite-state Algorithmic Probability) Let $S$ be the set of encodings of all transducers by a binary string in the form of $\sigma$. We then define the algorithmic probability of a string $s \in \mathcal{B}^*$ as follows:

$AP_S(s) = \sum\limits_{(\sigma,p):\,T^S_\sigma(p)=s} 2^{-(|\sigma|+|p|)}$

Finite-state distribution (Def. 2.4; note there is a typo there).
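A toy numerical illustration of the $AP_S$ definition above (a sketch only: the transducer encoding here is a made-up one-state family, not the encoding from [18], and the sum over programs is truncated at a maximum length):

```python
from itertools import product
from collections import defaultdict

def run_transducer(sigma, p):
    """Toy one-state transducer: sigma (4 bits) gives the output block for
    input bit 0 (sigma[:2]) and for input bit 1 (sigma[2:])."""
    out0, out1 = sigma[:2], sigma[2:]
    return "".join(out1 if b == "1" else out0 for b in p)

def finite_state_AP(max_prog_len=8):
    """Approximate AP_S(s) = sum over (sigma, p) with T_sigma(p) = s of
    2^-(|sigma|+|p|), truncated at programs of length <= max_prog_len."""
    AP = defaultdict(float)
    sigmas = ["".join(bits) for bits in product("01", repeat=4)]
    for sigma in sigmas:
        for plen in range(1, max_prog_len + 1):
            for bits in product("01", repeat=plen):
                p = "".join(bits)
                s = run_transducer(sigma, p)
                AP[s] += 2.0 ** -(len(sigma) + len(p))
    return dict(AP)

if __name__ == "__main__":
    AP = finite_state_AP()
    # Simple (repetitive) outputs should gather more probability mass than
    # "random-looking" outputs of the same length.
    for s, mass in sorted(AP.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{s:>16}  {mass:.6f}")
```

Repetitive outputs (all blocks equal) accumulate much more mass than "random-looking" strings of the same length, which is the coding-theorem-like behaviour in question.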

In the algorithm to enumerate context-free grammars in Chomsky normal form: why does generateSetN[{n, p}] depend on p? I can see that not all n are allowed for a given p. But then..., how is this done so as not to generate the same grammar twice..? Also, couldn't there be terminals on the LHS too? NO, because this algorithm is for context-free grammars! The CYK algorithm is then used to decide whether a string s is generated by a grammar G.
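Since CYK comes up here, a minimal sketch of it (standard bottom-up dynamic programming; the grammar representation is my own ad-hoc one, not the one from the paper):

```python
def cyk(grammar, start, s):
    """Decide whether string s is generated by a grammar in Chomsky normal form.
    grammar: dict mapping a nonterminal to a list of productions, each being
    either a terminal character or a pair (B, C) of nonterminals."""
    n = len(s)
    if n == 0:
        return False  # the special S -> epsilon case is ignored in this sketch
    # table[length][i] = set of nonterminals deriving the substring s[i:i+length]
    table = [[set() for _ in range(n)] for _ in range(n + 1)]
    for i, ch in enumerate(s):
        for A, prods in grammar.items():
            if ch in prods:
                table[1][i].add(A)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for A, prods in grammar.items():
                    for prod in prods:
                        if (isinstance(prod, tuple)
                                and prod[0] in table[split][i]
                                and prod[1] in table[length - split][i + split]):
                            table[length][i].add(A)
    return start in table[n][0]

# Tiny example grammar: S -> A B, A -> 'a', B -> 'b' (generates exactly "ab")
G = {"S": [("A", "B")], "A": ["a"], "B": ["b"]}
print(cyk(G, "S", "ab"), cyk(G, "S", "ba"))  # True False
```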

Context-free Algorithmic Probability. Actually, a CFG empirical distribution.

Linear-bounded complexity

What is the justification for using these "empirical distributions", given that they seem to be quite different from algorithmic probability ($2^{-K}$)? Well, I guess the universal distribution dominates all other distributions.., and also at least this captures the intuition of how easy it is to produce $s$.. The difference between the two is that in AP one is typing symbols sequentially, while in the empirical distribution one gives equal probability to all inputs.... Hmmmmm. Yeah, but the coding theorem and its domination properties (which I guess are related to Simplicity bias) maybe cause them to be similar..? Hmm yeah. The universal distribution (even if it only applies asymptotically) is rather amazing. But I should probably read their paper on sampling Turing Machines, because I think it's the same idea...

Check Numerical Evaluation of Algorithmic Complexity for Short Strings: A Glance into the Innermost Structure of Randomness

Calculating Kolmogorov Complexity from the Output Frequency Distributions of Small Turing Machines

We use the same Turing Machine formalism as [5, 21], which is that of the Busy Beaver introduced by Rado [8], and we use the known values of the Busy Beaver functions.
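A brute-force sketch of how such an output frequency distribution over small machines can be produced (assumptions flagged: the transition-table representation and the "output = contents of the visited tape region" convention are my guesses at the setup, not necessarily the exact formalism of [5, 21]; the step cutoff uses the known 2-state Busy Beaver runtime S(2) = 6):

```python
from itertools import product
from collections import Counter

def run_tm(rules, max_steps):
    """Run a (state, symbol) -> (write, move, next_state) table on a blank tape.
    State 0 is the halting state; the machine starts in state 1 at cell 0.
    Returns the contents of the visited tape region if it halts, else None."""
    tape = {}
    pos, state = 0, 1
    visited = set()
    for _ in range(max_steps):
        visited.add(pos)
        sym = tape.get(pos, 0)
        write, move, nxt = rules[(state, sym)]
        tape[pos] = write
        pos += move
        state = nxt
        if state == 0:
            lo, hi = min(visited), max(visited)
            return "".join(str(tape.get(i, 0)) for i in range(lo, hi + 1))
    return None

def output_distribution(n_states=2, max_steps=6):
    """Empirical output distribution over all (n_states, 2-symbol) machines."""
    entries = list(product(range(1, n_states + 1), [0, 1]))           # (state, symbol)
    choices = list(product([0, 1], [-1, 1], range(0, n_states + 1)))  # (write, move, next)
    counts = Counter()
    for table in product(choices, repeat=len(entries)):
        out = run_tm(dict(zip(entries, table)), max_steps)
        if out is not None:
            counts[out] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}

if __name__ == "__main__":
    D = output_distribution()
    for s, freq in sorted(D.items(), key=lambda kv: -kv[1])[:10]:
        print(f"{s:>8}  {freq:.4f}")
```

Even at 2 states the resulting distribution is heavily skewed towards short, simple strings.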

Time-bounded. Non-Halting

Results

They have many typos. For instance, here, caption and figure axis labels don't agree.

Figure 4 not really clear..


The discovery of algorithmic probability. See also Universal inductive inference. Solomonoff defined 5 models which are variants of UP, discussed in that paper. They are based on different constraints on the Turing machine, and on defining probability in a frequentist way. Some of them were used for sequence prediction.

See here (vid) for description of these.

Other related models/work

Levin 73. On the notion of a random sequence

Levin 74. Laws of information conservation (non-growth) and aspects of the foundations of probability theory

Gacs 74. On the symmetry of algorithmic information

Chaitin 75. A theory of program size formally identical to information theory

http://link.springer.com/chapter/10.1007/978-0-387-49820-1_4#page-1


Physical algorithmic probability

book

Ultimate Intelligence Part I: Physical Completeness and Objectivity of Induction

Diverse Consequences of Algorithmic Probability


Applications as Universal bias (Simplicity bias) to No free lunch theorem