Relative entropy

Information measures

aka relative information, Kullback–Leibler divergence, KL divergence

The relative entropy between distributions $P$ and $Q$ is defined as:

$$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \log\frac{P(i)}{Q(i)}.$$

It defines a measure of "distance" between probability distributions. However, note that it is not a Metric, as it is not symmetric and it doesn't obey the Triangle inequality. On the other hand, it is $0$ if and only if $P=Q$, and its infinitesimal form, specifically its Hessian, gives a metric tensor known as the Fisher information metric.
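A minimal numerical sketch of the definition, with made-up example distributions and a cross-check against `scipy.stats.entropy`; it also makes the asymmetry visible:

```python
import numpy as np
from scipy.stats import entropy  # entropy(p, q, base=2) computes D_KL(p || q) in bits

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) = sum_i P(i) log(P(i)/Q(i)); terms with P(i) = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])

print(kl_divergence(p, q))    # ~0.531 bits
print(kl_divergence(q, p))    # ~0.737 bits -- not symmetric
print(entropy(p, q, base=2))  # agrees with the hand-rolled sum
```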

It can be interpreted as the average (under $P$) of the difference in code length, when assigning code lengths in an optimal way (as per the Shannon Source coding theorem) for distributions $P$ and $Q$.
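A small sketch of this coding interpretation (distributions chosen arbitrarily): with the idealized Shannon code lengths $-\log_2 P(i)$ and $-\log_2 Q(i)$, the expected length difference under $P$ matches the relative entropy.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # distribution we actually sample from
q = np.array([0.25, 0.25, 0.25, 0.25])   # distribution the code was designed for

len_p = -np.log2(p)                      # idealized code lengths optimal for P
len_q = -np.log2(q)                      # idealized code lengths optimal for Q

expected_len_p = np.sum(p * len_p)       # = H(P)    = 1.75 bits
expected_len_q = np.sum(p * len_q)       # = H(P, Q) = 2.00 bits

kl = np.sum(p * np.log2(p / q))
print(expected_len_q - expected_len_p, kl)  # both 0.25 bits
```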

Interpretation as information gained

Entropy is the number of yes/no questions you expect you need to ask to identify the state of the world, under a Model of the world (Probability distribution over states of the world). I.e. how ignorant I think I am about the world.

If you then for some reason update your model of the world, your expectations change. Because of this, the expected number of yes/no questions using the previously optimal scheme can change. The new number, called Cross entropy, represents how ignorant you now think you *were* about the world.

Relative entropy can then be interpreted as the difference between how ignorant* I now think I *was* before the update (the Cross entropy) and how ignorant I think I am *now* after the update (the new value of the entropy). I.e. how much less ignorant about the world I think I have become after the update – how much information I think I have learned. [*Here I am using "ignorance" for the expected number of yes/no questions I need to ask, which is equivalent to the code length, of course!]
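In symbols, with $Q$ the model before the update and $P$ the model after, this is just the standard identity relating cross entropy, entropy and relative entropy:

$$D_{\mathrm{KL}}(P\|Q) = \underbrace{\sum_i P(i)\log\frac{1}{Q(i)}}_{\text{cross entropy } H(P,Q)} - \underbrace{\sum_i P(i)\log\frac{1}{P(i)}}_{\text{entropy } H(P)},$$

i.e. the expected number of questions saved by switching to the scheme that is optimal under the new beliefs.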

One can show that a Bayesian update from data containing $k$ bits results in a relative entropy of $k$ bits between the posterior and the prior (i.e. an ignorance decrease of $k$ bits). It's an easy exercise, especially if one uses a 0–1 likelihood. The information in the data $D$ under the prior is just $-\log P(D)$.
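A sketch of that exercise, under the stated 0–1 likelihood assumption (the data $D$ simply rules out the inconsistent worlds): the posterior is $P'(i) = P(i)\,\mathbf{1}[i \text{ consistent with } D]/P(D)$, so

$$D_{\mathrm{KL}}(P'\|P) = \sum_{i \text{ consistent with } D} \frac{P(i)}{P(D)} \log\frac{1}{P(D)} = -\log P(D),$$

which equals $k$ bits when $P(D) = 2^{-k}$.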

This is why it is good to interpret it as information gained, because one can go from distribution $Q$ to $P$ by means of $D_{\mathrm{KL}}(P\|Q)$ bits of evidence!

Properties

video introducing the concept and its properties.

Mutual information is a special case where $P$ is a joint distribution and $Q$ is the product of the marginals.
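A toy numerical check of this (the joint table below is just an illustrative assumption): mutual information computed as $D_{\mathrm{KL}}(P_{XY}\|P_X P_Y)$.

```python
import numpy as np

# Joint distribution of two binary variables (rows: X, columns: Y).
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])

px = pxy.sum(axis=1, keepdims=True)  # marginal of X
py = pxy.sum(axis=0, keepdims=True)  # marginal of Y
indep = px * py                      # product of the marginals

# Mutual information as D_KL(joint || product of marginals), in bits.
mi = np.sum(pxy * np.log2(pxy / indep))
print(mi)  # ~0.278 bits
```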

It has applications in estimating hypothesis testing errors and in large deviation theory, as well as in PAC-Bayesian learning.


See this video for connections with areas in physics and biology. In particular, the relative entropy between a non-equilibrium distribution and the equilibrium distribution of some Statistical physics system equals (up to a factor of $\beta = 1/k_B T$) the difference between the Free energies of the non-equilibrium state and the equilibrium state.
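A sketch of the identity this presumably refers to, assuming a canonical ensemble $p_{\mathrm{eq}}(i) = e^{-\beta E_i}/Z$ and the non-equilibrium free energy $F[p] = \langle E \rangle_p - T\,S[p]$ with $S[p] = -k_B \sum_i p(i) \ln p(i)$:

$$D_{\mathrm{KL}}(p\|p_{\mathrm{eq}}) = \sum_i p(i)\ln\frac{p(i)}{p_{\mathrm{eq}}(i)} = \beta\langle E\rangle_p - \frac{S[p]}{k_B} + \ln Z = \beta\left(F[p] - F_{\mathrm{eq}}\right),$$

using $F_{\mathrm{eq}} = -k_B T \ln Z$ (with the divergence measured in nats).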

https://www.wikiwand.com/en/Kullback%E2%80%93Leibler_divergence

https://www.youtube.com/watch?v=QPkb5VcgXAM#t=20m55