Fisher information matrix

Information geometry

The Fisher information matrix (FIM) is the negative of the expected value of the Hessian (w.r.t. the parameters) of the log of the Likelihood function, under the distribution over data given by the likelihood at a fixed parameter:

$$\mathcal{I}(\theta) = - \mathbb{E}_{x\sim p(x|\theta)} \left[ \frac{\partial^2}{\partial \theta^2}\log{p(x|\theta)}\right] = - \int p(x|\theta)\, \frac{\partial^2}{\partial \theta^2} \log{p(x|\theta)}\, dx$$

Of course, it can be applied to any probability distribution, whether or not it has the interpretation of a likelihood.
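As a quick worked example (a standard one, computed from the definition above): for a Bernoulli distribution $p(x|\theta) = \theta^x (1-\theta)^{1-x}$,

$$\log p(x|\theta) = x\log\theta + (1-x)\log(1-\theta), \qquad \frac{\partial^2}{\partial\theta^2}\log p(x|\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2},$$

$$\mathcal{I}(\theta) = \mathbb{E}_{x\sim p(x|\theta)}\left[\frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.$$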

So, if one Taylor expands the log-likelihood around a maximum and keeps only terms up to second order, one is approximating the peak by a Gaussian peak whose inverse covariance is the negative of the Hessian. We then take the average of this (negative) Hessian over the data to get the FIM.
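A minimal sketch of that expansion (the first-order term vanishes at the maximum $\hat\theta$):

$$\log L(\theta) \approx \log L(\hat\theta) + \frac{1}{2}(\theta-\hat\theta)^T H (\theta-\hat\theta), \qquad H = \left.\frac{\partial^2}{\partial\theta\,\partial\theta^T}\log L\right|_{\hat\theta},$$

$$L(\theta) \propto \exp\left(-\frac{1}{2}(\theta-\hat\theta)^T(-H)(\theta-\hat\theta)\right),$$

i.e. a Gaussian in $\theta$ with covariance $(-H)^{-1}$; averaging $-H$ over the data $x\sim p(x|\theta)$ gives $\mathcal{I}(\theta)$.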

As Hessian of relative entropy

It can also be seen as the Hessian of {the Relative entropy between the likelihood at one parameter and the likelihood at a nearby parameter} w.r.t. the parameter displacement. See here for derivation using Relative entropy.

Recall the relative entropy between $P(x;\theta)$ and $P(x;\theta+\delta\theta)$ (where $\theta$ parametrizes these distributions, which are over $x$) is defined as

$$D_{\mathrm{KL}}\left[ P(\cdot;\theta)\,\|\,P(\cdot;\theta+\delta\theta)\right] = \sum_x P(x;\theta) \, \log\frac{P(x;\theta)}{P(x;\theta+\delta\theta)}.$$

Now, let us Taylor expand $P(x;\theta+\delta\theta) = P(x;\theta) + \delta\theta\, \frac{\partial}{\partial\theta}P(x;\theta) + \frac{\delta\theta^2}{2}\frac{\partial^2}{\partial\theta^2}P(x;\theta)+O(\delta\theta^3)$ to get

$$D_{\mathrm{KL}}\left[ P(\cdot;\theta)\,\|\,P(\cdot;\theta+\delta\theta)\right] = \sum_x P(x;\theta) \, \log\left[\frac{P(x;\theta)}{ P(x;\theta) \left(1+\delta\theta \frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)}+ \frac{\delta\theta^2}{2}\frac{\frac{\partial^2}{\partial\theta^2}P(x;\theta)}{P(x;\theta)}+O(\delta\theta^3)\right)}\right].$$

We expand this to second order (recall the Taylor expansion of the natural logarithm, $\log(1+u) = u - \frac{u^2}{2} + O(u^3)$). There are contributions at second order from the first-order term of the log expansion, and from the second-order term as well (coming from the $O(\delta\theta)$ part of its argument):

$$- \sum_x P(x;\theta) \left[ \delta\theta \frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)} -\frac{1}{2}\left(\delta\theta \frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)}\right)^2 + \frac{\delta\theta^2}{2}\frac{\frac{\partial^2}{\partial\theta^2}P(x;\theta)}{P(x;\theta)} +O(\delta\theta^3)\right]$$

$$= - \sum_x \left[\delta\theta \frac{\partial}{\partial\theta}P(x;\theta) + \frac{\delta\theta^2}{2}\frac{\partial^2}{\partial\theta^2}P(x;\theta)\right] + \sum_x P(x;\theta) \left[ \frac{1}{2}\left(\delta\theta \frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)}\right)^2 +O(\delta\theta^3)\right]$$

$$= -\left(\delta\theta \frac{\partial}{\partial\theta} +\frac{\delta\theta^2}{2} \frac{\partial^2}{\partial\theta^2} \right) \sum_x P(x;\theta) + \sum_x P(x;\theta) \left[ \frac{1}{2}\left(\delta\theta \frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)}\right)^2 +O(\delta\theta^3)\right]$$

$$= -\left(\delta\theta \frac{\partial}{\partial\theta} +\frac{\delta\theta^2}{2} \frac{\partial^2}{\partial\theta^2} \right) 1 + \sum_x P(x;\theta) \left[\frac{1}{2}\left(\delta\theta \frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)}\right)^2 +O(\delta\theta^3)\right]$$

$$=\frac{\delta\theta^2}{2} \sum_x P(x;\theta) \left(\frac{ \frac{\partial}{\partial\theta}P(x;\theta)}{P(x;\theta)}\right)^2 +O(\delta\theta^3)$$

$$=\frac{\delta\theta^2}{2} \sum_x P(x;\theta) \left( \frac{\partial}{\partial\theta}\log{P(x;\theta)}\right)^2 +O(\delta\theta^3)$$

$$=\frac{\delta\theta^2}{2} \mathcal{I}(\theta) +O(\delta\theta^3)$$
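A quick numerical sanity check of this relation (a sketch, using the Bernoulli example above, for which $\mathcal{I}(\theta) = 1/(\theta(1-\theta))$):

```python
from math import log

theta, dtheta = 0.3, 1e-3

def kl_bernoulli(p, q):
    """Exact relative entropy D_KL[ Bernoulli(p) || Bernoulli(q) ]."""
    return p * log(p / q) + (1 - p) * log((1 - p) / (1 - q))

fisher = 1.0 / (theta * (1.0 - theta))     # I(theta) for a Bernoulli
kl = kl_bernoulli(theta, theta + dtheta)   # KL to the nearby distribution
approx = 0.5 * dtheta**2 * fisher          # (dtheta^2 / 2) * I(theta)

print(kl, approx)  # the two numbers agree up to the O(dtheta^3) correction
```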

Therefore, the Hessian of {the Relative entropy between a probability distribution and a nearby distribution in a parametrized family of distributions}, taken w.r.t. the parameter displacement, equals the FIM of the distribution at that parameter!

This is a clean way of thinking about the FIM, because one thinks about distributions directly, and about functions of these distributions like the relative entropy, without having to think about averaging w.r.t. these distributions explicitly (of course an average appears in the definition of relative entropy, but one can think of relative entropy as just an abstract function of two probability distributions).

As covariance of the gradient of the log-likelihood

It can also be seen as the Covariance matrix of the Gradient of the log-likelihood function (sometimes called the "score"); the score has zero mean (see the note after the derivation), so its covariance is just its second moment:

$$\mathcal{I}(\theta)= \mathbb{E}_{x\sim p(x|\theta)} \left[ \left(\frac{\partial}{\partial \theta}\log{p(x|\theta)}\right)^2\right]$$

$$=\int p(x|\theta) \left(\frac{\partial}{\partial \theta} \log{p(x|\theta)} \right)^2 dx$$

$$= \int \left(\frac{\partial}{\partial \theta} \log{p(x|\theta)} \right) \frac{\partial}{\partial \theta} p(x|\theta)\, dx$$

$$= - \int \left(\frac{\partial^2}{\partial \theta^2} \log{p(x|\theta)}\right) p(x|\theta)\, dx +\int \frac{\partial}{\partial \theta} \left( \left(\frac{\partial}{\partial \theta} \log{p(x|\theta)} \right) p(x|\theta) \right)dx$$

$$= - \int \left(\frac{\partial^2}{\partial \theta^2} \log{p(x|\theta)}\right) p(x|\theta)\, dx +\int \frac{\partial}{\partial \theta} \left( \frac{\partial}{\partial \theta} p(x|\theta) \right)dx$$

$$= - \int \left(\frac{\partial^2}{\partial \theta^2} \log{p(x|\theta)}\right) p(x|\theta)\, dx + \frac{\partial^2}{\partial \theta^2} \int p(x|\theta)\,dx$$

$$= - \int \left(\frac{\partial^2}{\partial \theta^2} \log{p(x|\theta)}\right) p(x|\theta)\, dx + \frac{\partial^2}{\partial \theta^2} 1$$

$$= - \int \left(\frac{\partial^2}{\partial \theta^2} \log{p(x|\theta)}\right) p(x|\theta)\, dx$$

So this definition is equivalent!
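One step worth making explicit, since it is what justifies the word "covariance": the score has zero mean, so its second moment is its variance,

$$\mathbb{E}_{x\sim p(x|\theta)}\left[\frac{\partial}{\partial\theta}\log p(x|\theta)\right] = \int p(x|\theta)\,\frac{\frac{\partial}{\partial\theta}p(x|\theta)}{p(x|\theta)}\, dx = \frac{\partial}{\partial\theta}\int p(x|\theta)\, dx = \frac{\partial}{\partial\theta} 1 = 0.$$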

See wiki.

Intro to Fisher Matrices

The Covariance matrix of the parameter estimates is (in the Gaussian approximation around the likelihood peak) the inverse of the Fisher matrix.
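For instance (continuing the Bernoulli example from above): with $N$ i.i.d. samples the FIM is $N\,\mathcal{I}(\theta)$, so

$$\mathrm{Var}(\hat\theta) \approx \left[N\,\mathcal{I}(\theta)\right]^{-1} = \frac{\theta(1-\theta)}{N},$$

which is indeed the variance of the sample mean, the maximum-likelihood estimator in this case.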

The change in $\chi^2$ away from the best fit can be calculated as $\Delta\chi^2 = \delta F \delta^T$, where $F$ is the FIM and $\delta$ is a small step in parameter space from the maximum of the likelihood.
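Where this comes from (a sketch, assuming a Gaussian likelihood so that $\chi^2 = -2\log L + \mathrm{const}$, and using the quadratic expansion from the first section, with $\delta$ a row vector):

$$\Delta\chi^2 = -2\left[\log L(\hat\theta+\delta) - \log L(\hat\theta)\right] \approx -2\cdot\frac{1}{2}\,\delta\, H\, \delta^T = \delta\,(-H)\,\delta^T \approx \delta\, F\, \delta^T,$$

where the curvature $-H$ has been replaced by its expectation, the Fisher matrix $F$.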

Reparametrization

https://www.wikiwand.com/en/Fisher_information#/Reparametrization

Note that the likelihood function is not a probability density over the parameters $\theta$, but over $x$. Therefore, when we reparametrize, the likelihood function only changes by changing its argument according to the reparametrization; it does not pick up any Jacobian factor. The derivative with respect to the parameter does pick up a Jacobian factor, so the Fisher information depends on the parametrization. The Metric tensor it encodes is, of course, invariant, by definition of a tensor.
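Explicitly, the transformation law from the linked page: if $\theta = h(\eta)$ with Jacobian $J_{ij} = \partial\theta_i/\partial\eta_j$, then

$$\mathcal{I}_\eta(\eta) = J^T\, \mathcal{I}_\theta\!\left(h(\eta)\right)\, J,$$

which is exactly how a metric (a $(0,2)$ tensor) transforms under a change of coordinates.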

Fisher information as a Metric

https://youtu.be/IKetDJof8pk?t=2250