aka relative information, Kullback–Leibler divergence, KL divergence
The relative entropy between distributions $P$ and $Q$ is defined as:
$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$$
Defines a measure of "distance" between probability distributions. However, note that it is not a Metric, as it is not symmetric and it doesn't obey the Triangle inequality. On the other hand, it is zero if and only if $P = Q$, and in its infinitesimal form, specifically its Hessian, it gives a metric tensor known as the Fisher information metric.
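To make the last point concrete, here is the standard second-order expansion (a sketch, assuming a smooth parametric family $p_\theta$; the notation is mine, not from this note):
$$D_{\mathrm{KL}}(p_\theta \,\|\, p_{\theta + d\theta}) = \tfrac{1}{2} \sum_{i,j} F_{ij}(\theta)\, d\theta_i\, d\theta_j + O(\|d\theta\|^3), \qquad F_{ij}(\theta) = \mathbb{E}_{p_\theta}\!\left[\partial_i \ln p_\theta \; \partial_j \ln p_\theta\right]$$
(working in nats, i.e. natural logarithms). The first-order term vanishes because the divergence attains its minimum, zero, at $d\theta = 0$, which is why the Hessian $F$ gives the leading behaviour.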
The divergence can also be interpreted as the average (under $P$) of the difference in code length when assigning code lengths in an optimal way (as per the Shannon Source coding theorem) for distributions $Q$ and $P$.
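A minimal Python sketch of this reading (the function names and toy distributions are mine, purely illustrative): the optimal code for a distribution $R$ assigns outcome $x$ a length of $\log_2 1/R(x)$ bits, and averaging the length difference under $P$ gives back the divergence.
```python
import math

def kl_divergence(p, q):
    """Direct formula: sum_x p(x) * log2(p(x) / q(x))."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def avg_code_length_gap(p, q):
    """Average (under p) of the extra bits paid when using the code
    that is optimal for q instead of the one optimal for p."""
    return sum(px * (math.log2(1 / qx) - math.log2(1 / px))
               for px, qx in zip(p, q) if px > 0)

# Two toy distributions over three outcomes.
P = [0.5, 0.25, 0.25]
Q = [0.4, 0.4, 0.2]

print(kl_divergence(P, Q))        # ≈ 0.072 bits
print(avg_code_length_gap(P, Q))  # same value: log2(1/q) - log2(1/p) = log2(p/q)
```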
Interpretation as information gained
Entropy is the number of yes/no questions you expect to need to ask to identify the state of the world, under a Model of the world (a Probability distribution over states of the world). I.e. how ignorant I think I am about the world.
If you then for some reason update your model of the world, your expectations change. Because of this, the expected number of yes/no questions using the previously optimal scheme can change. The new number, called Cross entropy, represents how ignorant you now think you *were* about the world.
Relative entropy can then be interpreted as the difference between how ignorant* I now think I *was* before the update (the Cross entropy) and how ignorant I think I am *now* after the update (the new value of entropy). I.e. how much less ignorant about the world I think I have become after the update, i.e. how much information I think I have learned. [*Here I am using "ignorance" for "expected number of yes/no questions I need to ask", which is equivalent to the code length, of course!]
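In symbols (a sketch, writing $P_{\text{new}}$ for the updated model and $P_{\text{old}}$ for the previous one; the subscripts are mine):
$$H(P_{\text{new}}) = -\sum_x P_{\text{new}}(x) \log_2 P_{\text{new}}(x), \qquad H(P_{\text{new}}, P_{\text{old}}) = -\sum_x P_{\text{new}}(x) \log_2 P_{\text{old}}(x)$$
$$D_{\mathrm{KL}}(P_{\text{new}} \,\|\, P_{\text{old}}) = H(P_{\text{new}}, P_{\text{old}}) - H(P_{\text{new}})$$
i.e. (ignorance I now think I had, the Cross entropy) minus (ignorance I think I have now, the new entropy).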
One can show that a Bayesian update from data containing $k$ bits results in a relative entropy of $k$ between the posterior and the prior (i.e. an ignorance decrease of $k$ bits). It's an easy exercise, especially if one uses a 0-1 likelihood: the information in the data under the prior is just $-\log_2 P(\text{data}) = k$.
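A sketch of the 0-1 likelihood case (my notation: the data rules out every state outside a set $A$, and the prior assigns $P(A) = 2^{-k}$, i.e. the data carries $k$ bits):
$$P_{\text{post}}(x) = \frac{P(x)\,\mathbf{1}[x \in A]}{P(A)}$$
$$D_{\mathrm{KL}}(P_{\text{post}} \,\|\, P) = \sum_{x \in A} P_{\text{post}}(x) \log_2 \frac{P_{\text{post}}(x)}{P(x)} = \sum_{x \in A} P_{\text{post}}(x) \log_2 \frac{1}{P(A)} = -\log_2 P(A) = k$$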
| This is why it is good to interpret it as information gained, because one can go from distribution $Q$ to $P$ by means of $D_{\mathrm{KL}}(P \,\|\, Q)$ bits of evidence! |
A video introducing the concept and its properties.
Mutual information is a special case where $P$ is a joint distribution and $Q$ is the product of the marginals.
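In symbols (standard notation, not from this note):
$$I(X;Y) = D_{\mathrm{KL}}\!\left(P_{X,Y} \,\big\|\, P_X \otimes P_Y\right) = \sum_{x,y} P_{X,Y}(x,y) \log_2 \frac{P_{X,Y}(x,y)}{P_X(x)\,P_Y(y)}$$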
Applications in estimating hypothesis testing errors and in large deviation theory. Also in PAC-Bayesian learning.
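One standard statement of the hypothesis-testing connection (roughly, from memory) is the Chernoff–Stein lemma: when testing $H_1: X_i \sim P_1$ against $H_2: X_i \sim P_2$ from $n$ i.i.d. samples, with the type-I error held below a fixed $\epsilon$, the best achievable type-II error decays as
$$\beta_n \approx 2^{-n\, D_{\mathrm{KL}}(P_1 \,\|\, P_2)}$$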
See this video for connections with areas in physics and biology. In particular, the relative entropy between a non-equilibrium distribution and the equilibrium distribution in some Statistical physics system equals the difference between the Free energies of the non-equilibrium state and the equilibrium state.
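The relation here, as I recall it (a sketch, assuming a system coupled to a heat bath at temperature $T$, with $p_{\text{eq}}$ the Boltzmann distribution and $F[p] = \langle E \rangle_p - T\,S[p]$ the non-equilibrium Free energy):
$$F[p] - F_{\text{eq}} = k_B T \, D_{\mathrm{KL}}(p \,\|\, p_{\text{eq}})$$
(with the divergence in nats; in bits there is an extra factor of $\ln 2$).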
https://www.wikiwand.com/en/Kullback%E2%80%93Leibler_divergence