A priori probability estimates from structural complexity

cosmos 11th April 2018 at 3:38pm

See draft paper –> published: Input–output maps are strongly biased towards simple outputs

See MMathPhys oral presentation

Coding theorem connects probability and complexity

AIT differs fundamentally from Shannon information theory because the latter is fundamentally a theory about distributions, whereas the former is a theory about the information content of individual objects (see Descriptional complexity).

If one assumes that the probability of generating a binary input string of length l is simply 2^{-l} (which is true for prefix codes, see Appendix A), then the most likely way to obtain output x by random sampling of inputs is via the shortest program that generates it, a string of length K(x).
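To make the intuition explicit (this is the standard AIT argument, restated here for orientation rather than quoted from the paper): if programs p are sampled with probability 2^{-|p|}, then

P(x) = \sum_{p : U(p) = x} 2^{-|p|} \geq 2^{-K(x)}

and no single term in the sum exceeds the 2^{-K(x)} contributed by the shortest program, which has length K(x).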

Direct application of these results from AIT to many practical systems in science or engineering suffers from a number of well-known problems:

  • K(x) is formally uncomputable (due to the halting problem).
  • Most results in AIT only hold asymptotically, up to O(1) terms.
  • Many of the input-output maps in science and engineering are not UTMs.

The way these problems are tackled is described in the next sections.

Coding theorem for computable functions. We begin with a weaker form of the coding theorem, applicable to real-world (computable) functions:

P(x) \leq 2^{-K(x|f,n) + O(1)}

Eq. (2)

where K(x|f,n) is the complexity of output x, conditioned on the map f and the system size n

(see M. Li and P. M. B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, Springer-Verlag New York Inc., 2008, and Lecture notes on descriptional complexity and randomness).

We provide a derivation of equation (2) in Appendix B, using standard results from AIT, such as the fact that the complexity of a whole set is often much less than the complexity of individual members of the set (9). Informally, K(x|f,n) can be viewed as the length of computer code required to specify x, given that the function f and the value n are already pre-programmed into the computer. Note that equation (2) is only an upper bound; in contrast to the full coding theorem of equation (1), there is no corresponding lower bound.

We mainly consider maps that comply with a few simple restrictions:

  • # inputs \gg # outputs.
  • # outputs \gg 1, to avoid finite-size effects.
  • The descriptional complexity of the map f itself doesn't grow with n. Basically, the map is represented by a fixed rule-set for computing outputs from inputs, largely independently of n.
  • Due to how the algorithmic complexity is approximated in practice, the map needs to be 'well-behaved' in the sense of not producing many outputs like the digits of \pi, which are algorithmically simple but have large entropy values and thus large values of \tilde{K}(x) (the computed approximation of K(x)). This is a practical problem that depends on the complexity measure used. The complexity estimator they use is described in Appendix F.

If instead the map is allowed to contain arbitrary amounts of information, then the map could assign arbitrary probabilities to the outputs, and any coding-theorem-like behaviour would be lost. We discuss this fixed-complexity condition further in Appendix C.

As we show in Appendix E, it turns out that a reasonably broad range of complexity values arises under quite general conditions for fixed-complexity maps.

They use the above conditions to argue that K(x|f,n) \approx K(x).

Approximately computing K(x). Although K(x) is uncomputable, it has been approximated using standard compression algorithms. \tilde{K}(x) is used to denote some real-world (computable) approximation to K(x).
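As a concrete stand-in (a minimal sketch; the estimator actually used, described in Appendix F, may differ in normalisation and detail), a Lempel-Ziv style phrase count gives a computable proxy for K(x) on binary strings. The function names and the normalisation below are illustrative only.

```python
from math import log2

def lz_phrases(s: str) -> int:
    """Number of phrases in a Lempel-Ziv (1976) style exhaustive parsing of s."""
    i, n, phrases = 0, len(s), 0
    while i < n:
        length = 1
        # grow the current phrase while it still occurs earlier in the string
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def k_tilde(s: str) -> float:
    """Rough complexity proxy in bits; trivial strings handled separately."""
    n = len(s)
    if len(set(s)) <= 1:              # trivial strings such as '000...0'
        return log2(n)
    return lz_phrases(s) * log2(n)    # illustrative normalisation only

print(k_tilde("0" * 20), k_tilde("01101001100101101001"))
```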

Importance of O(1) terms. Experimental results on applying the coding theorem to short strings suggest that the O(1) terms are not very important.

Central ansatz and simplicity bias:

P(x) \lesssim 2^{-a\tilde{K}(x) - b}

where the constants a > 0 and b depend on the mapping, but not on x.

We call this upper bound on the probability simplicity bias: high-probability outputs must be 'simple', while complex outputs must have exponentially lower probabilities. In contrast to the full coding theorem, the lack of a lower bound means that simple outputs may also have low probabilities.

They offer estimates for a and b in Appendix D. In Appendix B, they also argue that the upper bound of equation (3) should be tight for most inputs, but weak for many outputs.
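A minimal sketch of how the bound can be probed numerically, using a toy map loosely inspired by the random walk map listed below (the map and the LZ phrase count used as the complexity proxy are illustrative choices, not the paper's exact setup): sample inputs uniformly, estimate P(x) from frequencies, and check that the highest-probability outputs are simple.

```python
import random
from collections import Counter

def lz_phrases(s: str) -> int:
    """Same LZ-style phrase count used as a complexity proxy in the sketch above."""
    i, n, phrases = 0, len(s), 0
    while i < n:
        length = 1
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        phrases += 1
        i += length
    return phrases

def walk_map(bits: str) -> str:
    """Toy input-output map: read bits as +1/-1 steps and output, at each step,
    whether the running sum is non-negative (1) or negative (0)."""
    total, out = 0, []
    for b in bits:
        total += 1 if b == "1" else -1
        out.append("1" if total >= 0 else "0")
    return "".join(out)

n_bits, n_samples = 25, 200_000
counts = Counter(walk_map("".join(random.choice("01") for _ in range(n_bits)))
                 for _ in range(n_samples))

# Simplicity bias check: the most probable outputs should have few LZ phrases.
for x, c in counts.most_common(5):
    print(f"P(x) = {c / n_samples:.4f}   phrases = {lz_phrases(x)}   x = {x}")
```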

Examples of simplicity bias in maps

See them!

Discrete RNA sequence to structure mapping

Coarse-grained ordinary differential equation

Coarse-grained stochastic partial differential equation. Black-Scholes equation

Polynomial curves

Random matrix – bias but not simplicity bias

L-systems

Random walk map

Logistic map (see Nonlinear maps)

Predicting which of two outputs has higher probability


Connection to Chomsky hierarchy (see Formal language), Sloppy systems.


Appendix A: AIT

Appendix B: Upper and lower bound for computable maps

Upper bound: following derivation using Shannon-Fano code as in InfoTheory book.
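Sketch of that argument (the standard one for computable distributions; the book's version may differ in bookkeeping): given f and n, the output distribution P(x) obtained by uniform random sampling of inputs is computable, so a Shannon-Fano code for P provides an effective description of x of length roughly -\log_2 P(x) bits. Hence

K(x|f,n) \leq K(x|P) + O(1) \leq -\log_2 P(x) + O(1)

and rearranging gives P(x) \leq 2^{-K(x|f,n) + O(1)}, which is equation (2).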

Lower bound: Not sure. Ask!, or read! Page 12.

Also many outputs must have probability below their upper bounds.

Appendix C: Fixed complexity map

Appendix D: Making predictions for P(x) in computable maps

Appendix E: Estimating the range of K(xf,n)K(x|f,n).

Arguments based on bounding the complexity via the description: map + index of the output. This gives upper bounds on the min and max complexities of 0 and \log_2 N_0 respectively (everything up to an additive constant, O(1)).

For the max, we also need a lower bound, and this is given by the well-known fact in AIT that if one has N_0 different strings, not all of them can have complexity lower than \log_2 N_0, as there are not enough short descriptions. In fact, most of the strings must have complexity close to \log_2 N_0.
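Explicitly, this is the standard counting argument: there are fewer than 2^m binary descriptions shorter than m bits, so

\#\{x : K(x) < \log_2 N_0 - c\} \leq 2^{\log_2 N_0 - c} - 1 < N_0 \, 2^{-c}

i.e. at most a fraction 2^{-c} of the N_0 outputs can have complexity more than c bits below \log_2 N_0.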

Appendix F: Approximations to K(x)K(x).

Appendix G: Simplicity bias and system size

Appendix H: On the intuitive connection of probability and complexity

Appendix I: Simplicity bias in the logbinomial distribution

Appendix J: Predicting the number of outputs, N_0. By fitting \alpha and estimating \max(K) from the known details of the system.

Appendix K: Further examples and figures

Continuous systems are sampled and discretized to create the output.

L-systems, Circadian rhythm

Cell cycle

Feed-forward network. Sample networks, and measure the complexity of a given network by the entropy of its distribution of outputs.
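A minimal sketch of one way to read that note (the architecture, weight distribution, and thresholds here are hypothetical choices, not taken from the appendix): draw a random feed-forward network, enumerate all binary inputs, and take the entropy of the induced distribution over binary outputs as the 'complexity' of that network.

```python
import random
from collections import Counter
from itertools import product
from math import log2, tanh

def random_net(n_in: int, n_hidden: int, n_out: int):
    """Hypothetical toy feed-forward network: tanh hidden layer, thresholded outputs."""
    w1 = [[random.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    w2 = [[random.gauss(0, 1) for _ in range(n_hidden)] for _ in range(n_out)]
    def forward(x):
        h = [tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
        return tuple(int(sum(w * hi for w, hi in zip(row, h)) > 0) for row in w2)
    return forward

net = random_net(8, 16, 4)
outputs = Counter(net(x) for x in product((0, 1), repeat=8))  # all 256 inputs

# Entropy (in bits) of the distribution of outputs induced by this network.
H = -sum((c / 256) * log2(c / 256) for c in outputs.values())
print(f"output-distribution entropy: {H:.2f} bits")
```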

Logic gate

Appendix L: Histograms of complexity