aka relative information, Kullback–Leibler divergence, KL divergence
The relative entropy between distributions $P$ and $Q$ is defined as:

$$ D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} $$
Defines a measure of "distance" between probability distributions. However, note that it is not a Metric, as it is not symmetric and it doesn't obey the Triangle inequality. On the other hand, it is $0$ if and only if $P = Q$, and its infinitesimal form, specifically its Hessian, gives a metric tensor known as the Fisher information metric.
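More concretely, for a smooth parametric family $p_\theta$ (a sketch of the standard second-order expansion):

$$
D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) = \tfrac{1}{2}\sum_{i,j} g_{ij}(\theta)\, d\theta^i\, d\theta^j + O(d\theta^3),
\qquad
g_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[\partial_i \log p_\theta\, \partial_j \log p_\theta\right],
$$

where $g_{ij}$ is the Fisher information metric.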
It can be interpreted as the average (under $P$) of the difference in code length when assigning code lengths in an optimal way (as per the Shannon Source coding theorem) for distribution $Q$ versus for distribution $P$.
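A minimal numerical sketch of both readings (plain NumPy, with made-up distributions `p` and `q` just for illustration):

```python
import numpy as np

# Two made-up distributions over the same four outcomes (illustrative only)
p = np.array([0.5, 0.25, 0.125, 0.125])
q = np.array([0.25, 0.25, 0.25, 0.25])

# Direct definition: D_KL(P || Q) = sum_x P(x) * log2( P(x) / Q(x) )  [in bits]
kl_direct = np.sum(p * np.log2(p / q))

# Code-length reading: the optimal code length for an outcome is -log2 of its
# probability, so D_KL is the average (under P) extra length paid for using a
# code optimized for Q instead of one optimized for P.
code_len_q = -np.log2(q)   # lengths of a code optimized for Q
code_len_p = -np.log2(p)   # lengths of a code optimized for P
kl_codelen = np.sum(p * (code_len_q - code_len_p))

print(kl_direct, kl_codelen)  # both ~0.25 bits
```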
Interpretation as information gained
Entropy is the number of yes/no questions you expect to need to ask to identify the state of the world, under a Model of the world (a Probability distribution over states of the world). I.e. how ignorant I think I am about the world.
If you then for some reason update your model of the world, your expectations change. Because of this, the expected number of yes/no questions using the previously optimal questioning scheme can change. The new number, called the Cross entropy, represents how ignorant you now think you *were* about the world.
Relative entropy can then be interpreted as the difference between how ignorant* I think I was before the update (the Cross entropy) and how ignorant I think I am now after the update (the new value of the entropy). I.e. how much less ignorant about the world I think I have become after the update: how much information I think I have learned. [*Here I am using "ignorance" to mean "expected number of yes/no questions I need to ask", which is equivalent to the code length, of course!]
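In symbols, with $P$ the updated model and $Q$ the previous one:

$$
D_{\mathrm{KL}}(P \,\|\, Q) \;=\; \underbrace{-\sum_x P(x)\log Q(x)}_{\text{cross entropy } H(P,\,Q)} \;-\; \underbrace{\Big(-\sum_x P(x)\log P(x)\Big)}_{\text{entropy } H(P)}.
$$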
One can show that a Bayesian update from data containing $k$ bits results in a relative entropy of $k$ between the posterior and the prior (i.e. an ignorance decrease of $k$ bits). It's an easy exercise, especially if one uses a 0-1 likelihood. The information in the data $D$ under the prior is just $-\log_2 P_{\text{prior}}(D) = k$.
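A sketch of that exercise with a 0-1 likelihood (notation is mine): the data $D$ simply rules out some worlds, leaving a surviving set $A$ with prior probability $P_{\text{prior}}(A) = P_{\text{prior}}(D) = 2^{-k}$, and the posterior is the prior restricted to $A$ and renormalized:

$$
P_{\text{post}}(x) = \frac{P_{\text{prior}}(x)}{P_{\text{prior}}(A)} \;\; (x \in A), \qquad
D_{\mathrm{KL}}(P_{\text{post}} \,\|\, P_{\text{prior}})
= \sum_{x \in A} \frac{P_{\text{prior}}(x)}{P_{\text{prior}}(A)} \log_2 \frac{1}{P_{\text{prior}}(A)}
= -\log_2 P_{\text{prior}}(D) = k.
$$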
This is why it is good to interpret it as information gained: one can go from distribution $Q$ to distribution $P$ by means of $D_{\mathrm{KL}}(P \,\|\, Q)$ bits of evidence!
video introducing the concept and its properties.
Mutual information is a special case where $P$ is a joint distribution $p(x, y)$ and $Q$ is the product of the marginals $p(x)\,p(y)$.
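A small numerical sketch of that special case (made-up joint distribution, plain NumPy):

```python
import numpy as np

# Made-up joint distribution p(x, y) over two binary variables (illustrative only)
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])

p_x = p_xy.sum(axis=1)        # marginal p(x)
p_y = p_xy.sum(axis=0)        # marginal p(y)
q_xy = np.outer(p_x, p_y)     # product of the marginals p(x) p(y)

# Mutual information I(X; Y) = D_KL( p(x,y) || p(x) p(y) )  [in bits]
mi = np.sum(p_xy * np.log2(p_xy / q_xy))
print(mi)  # ~0.278 bits for this joint
```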
Applications in estimating hypothesis-testing error probabilities and in large deviation theory. Also in PAC-Bayesian learning.
See this video for connections with areas in physics and biology. In particular, the relative entropy between a non-equilibrium distribution and the equilibrium distribution of some Statistical physics system equals the difference between the Free energies of the non-equilibrium state and the equilibrium state (in units of $k_B T$).
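A sketch of that identity, assuming a Boltzmann equilibrium distribution $p_{\mathrm{eq}}(x) = e^{-\beta E(x)}/Z$ and the non-equilibrium free energy $F[p] = \langle E \rangle_p - T\,S[p]$ (with natural logs and $S = -k_B \sum_x p \ln p$):

$$
D_{\mathrm{KL}}(p \,\|\, p_{\mathrm{eq}})
= \sum_x p(x)\,\big[\ln p(x) + \beta E(x) + \ln Z\big]
= -\frac{S[p]}{k_B} + \beta \langle E \rangle_p + \ln Z
= \beta\big(F[p] - F_{\mathrm{eq}}\big),
$$

with $F_{\mathrm{eq}} = -k_B T \ln Z$.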
https://www.wikiwand.com/en/Kullback%E2%80%93Leibler_divergence