The Fisher information matrix (FIM) is the negative of the expected value of the Hessian (w.r.t. the parameters) of the log of the Likelihood function, where the expectation is taken under the distribution over data given by the likelihood at a fixed parameter:
$$I(\theta) = -\mathbb{E}_{x\sim p(x\mid\theta)}\left[\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right] = -\int p(x\mid\theta)\,\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\,dx$$
Of course, it can be applied to any probability distribution, whether it has the interpretation of a likelihood or not.
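As a quick worked example (a standard one, included here just for concreteness): for a Bernoulli distribution with parameter $\theta$, so that $p(x\mid\theta) = \theta^x(1-\theta)^{1-x}$ for $x\in\{0,1\}$,
$$\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta) = -\frac{x}{\theta^2} - \frac{1-x}{(1-\theta)^2}
\quad\Longrightarrow\quad
I(\theta) = \mathbb{E}\!\left[\frac{x}{\theta^2} + \frac{1-x}{(1-\theta)^2}\right] = \frac{1}{\theta} + \frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.$$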
So, if one Taylor expands the log-likelihood around a maximum and keeps only terms up to second order, one is approximating the peak by a Gaussian peak whose inverse covariance (precision) is the negative of the Hessian. Taking the expectation of this negative Hessian over the data gives the FIM.
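Concretely, in the one-parameter case, with $\hat\theta$ the maximum (so the first-order term vanishes):
$$\log p(x\mid\theta) \approx \log p(x\mid\hat\theta) + \frac{1}{2}(\theta-\hat\theta)^2 \left.\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta^2}\right|_{\hat\theta}
\quad\Longrightarrow\quad
p(x\mid\theta) \propto \exp\!\left(-\frac{(\theta-\hat\theta)^2}{2\sigma^2}\right),
\qquad
\frac{1}{\sigma^2} = -\left.\frac{\partial^2 \log p(x\mid\theta)}{\partial\theta^2}\right|_{\hat\theta}.$$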
As Hessian of relative entropy
It can also be seen as the Hessian, w.r.t. the parameter displacement, of {the Relative entropy between the likelihood at one parameter and the likelihood at a nearby parameter}. The derivation using Relative entropy follows.
Recall that the relative entropy between P(x;θ) and P(x;θ+δθ) (where θ parametrizes these distributions, which are distributions over x) is defined as
$$D_{\mathrm{KL}}\big[P(\cdot;\theta)\,\big\|\,P(\cdot;\theta+\delta\theta)\big] = \sum_x P(x;\theta)\log\frac{P(x;\theta)}{P(x;\theta+\delta\theta)}.$$
Now, let us Taylor expand
$$P(x;\theta+\delta\theta) = P(x;\theta) + \delta\theta\,\frac{\partial P(x;\theta)}{\partial\theta} + \frac{\delta\theta^2}{2}\frac{\partial^2 P(x;\theta)}{\partial\theta^2} + O(\delta\theta^3)$$
to get
$$D_{\mathrm{KL}}\big[P(\cdot;\theta)\,\big\|\,P(\cdot;\theta+\delta\theta)\big]
= \sum_x P(x;\theta)\log\left[\frac{P(x;\theta)}{P(x;\theta)\left(1 + \delta\theta\,\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta} + \frac{\delta\theta^2}{2}\frac{1}{P(x;\theta)}\frac{\partial^2 P(x;\theta)}{\partial\theta^2} + O(\delta\theta^3)\right)}\right].$$
We expand this to second order (recall the Taylor expansion of the natural logarithm, $\log(1+u) = u - \tfrac{u^2}{2} + O(u^3)$; since the argument of the log above is $1/(1+u)$, the log equals $-\log(1+u)$). There are contributions at second order in $\delta\theta$ from the first-order term of the log expansion (via the $O(\delta\theta^2)$ part of its argument), and from the second-order term as well (via the $O(\delta\theta)$ part of its argument):
$$-\sum_x P(x;\theta)\left[\delta\theta\,\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta} - \frac{1}{2}\left(\delta\theta\,\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta}\right)^2 + \frac{\delta\theta^2}{2}\frac{1}{P(x;\theta)}\frac{\partial^2 P(x;\theta)}{\partial\theta^2} + O(\delta\theta^3)\right]$$
$$= -\sum_x\left[\delta\theta\,\frac{\partial P(x;\theta)}{\partial\theta} + \frac{\delta\theta^2}{2}\frac{\partial^2 P(x;\theta)}{\partial\theta^2}\right] + \sum_x P(x;\theta)\left[\frac{1}{2}\left(\delta\theta\,\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta}\right)^2 + O(\delta\theta^3)\right]$$
$$= -\left(\delta\theta\,\frac{\partial}{\partial\theta} + \frac{\delta\theta^2}{2}\frac{\partial^2}{\partial\theta^2}\right)\sum_x P(x;\theta) + \sum_x P(x;\theta)\left[\frac{1}{2}\left(\delta\theta\,\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta}\right)^2 + O(\delta\theta^3)\right]$$
$$= -\left(\delta\theta\,\frac{\partial}{\partial\theta} + \frac{\delta\theta^2}{2}\frac{\partial^2}{\partial\theta^2}\right)1 + \sum_x P(x;\theta)\left[\frac{1}{2}\left(\delta\theta\,\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta}\right)^2 + O(\delta\theta^3)\right]$$
$$= \frac{\delta\theta^2}{2}\sum_x P(x;\theta)\left(\frac{1}{P(x;\theta)}\frac{\partial P(x;\theta)}{\partial\theta}\right)^2 + O(\delta\theta^3)$$
$$= \frac{\delta\theta^2}{2}\sum_x P(x;\theta)\left(\frac{\partial}{\partial\theta}\log P(x;\theta)\right)^2 + O(\delta\theta^3)$$
$$= \frac{\delta\theta^2}{2}\,I(\theta) + O(\delta\theta^3)$$
Therefore, the Hessian, w.r.t. the parameter displacement, of {the Relative entropy between a probability distribution and a nearby distribution in a parametrized family of distributions} equals the FIM of the distribution at that parameter!
This is a clean way of thinking about the FIM, because one is thinking about distributions directly, and about functions of these distributions like the relative entropy, and one doesn't need to think about averaging w.r.t. these distributions explicitly (of course an average appears in the definition of relative entropy, but one can think of relative entropy as just an abstract function of two probability distributions).
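A quick numerical sanity check of this expansion, as a minimal sketch: it uses the Bernoulli family, whose Fisher information is $1/(\theta(1-\theta))$ (as in the worked example above); the function names and values are illustrative only.

```python
import numpy as np

def bernoulli_kl(p, q):
    """D_KL[ Bern(p) || Bern(q) ], summing over the two outcomes x in {0, 1}."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))   # Fisher information of a Bernoulli

for dtheta in (1e-1, 1e-2, 1e-3):
    kl = bernoulli_kl(theta, theta + dtheta)
    quadratic = 0.5 * dtheta**2 * fisher
    print(dtheta, kl, quadratic, kl / quadratic)   # ratio -> 1 as dtheta -> 0
```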
As covariance of the gradients of the log-likelihood
It can also be seen as the Covariance matrix of the Gradients of the log-likelihood function (sometimes called "scores"). Since the score has zero mean (which follows from differentiating the normalization ∫p dx = 1, the same trick used below), its covariance is just the expectation of its square:
$$I(\theta) = \mathbb{E}_{x\sim p(x\mid\theta)}\left[\left(\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right)^2\right]$$
$$= \int p(x\mid\theta)\left(\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right)^2 dx$$
$$= \int \left(\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right)\frac{\partial}{\partial\theta}p(x\mid\theta)\,dx$$
$$= -\int \left(\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right)p(x\mid\theta)\,dx + \int \frac{\partial}{\partial\theta}\left(\left(\frac{\partial}{\partial\theta}\log p(x\mid\theta)\right)p(x\mid\theta)\right)dx$$
$$= -\int \left(\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right)p(x\mid\theta)\,dx + \int \frac{\partial}{\partial\theta}\left(\frac{\partial}{\partial\theta}p(x\mid\theta)\right)dx$$
$$= -\int \left(\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right)p(x\mid\theta)\,dx + \frac{\partial^2}{\partial\theta^2}\int p(x\mid\theta)\,dx$$
$$= -\int \left(\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right)p(x\mid\theta)\,dx + \frac{\partial^2}{\partial\theta^2}1$$
$$= -\int \left(\frac{\partial^2}{\partial\theta^2}\log p(x\mid\theta)\right)p(x\mid\theta)\,dx$$
So this definition is equivalent!
See wiki.
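A minimal Monte Carlo check of this equivalence (using a Bernoulli distribution purely as an illustrative assumption; the sampling and estimates below are just a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = rng.binomial(1, theta, size=1_000_000).astype(float)

# score: d/dtheta log p(x|theta) for a Bernoulli
score = x / theta - (1 - x) / (1 - theta)
# Hessian: d^2/dtheta^2 log p(x|theta)
hessian = -x / theta**2 - (1 - x) / (1 - theta)**2

print(np.var(score))                # ~ E[score^2]  (E[score] = 0)
print(-np.mean(hessian))            # ~ -E[Hessian]
print(1.0 / (theta * (1 - theta)))  # exact Fisher information, for comparison
```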
Intro to Fisher Matrices
The Covariance matrix of the parameter estimates is the inverse of the Fisher matrix.
The change in $\chi^2$ away from the best fit can be calculated as $\Delta\chi^2 = \delta\,F\,\delta^T$, where $F$ is the FIM and $\delta$ is a small step (row vector) in parameter space from the maximum of the likelihood.
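A toy sketch of these two statements (the linear model $m(x) = a + b\,x$, the noise level, and the sample points below are made-up assumptions): for a Gaussian likelihood with constant noise $\sigma$, the Fisher matrix reduces to $F_{ij} = \sum_k \frac{\partial m_k}{\partial\theta_i}\frac{\partial m_k}{\partial\theta_j}/\sigma^2$, its inverse gives the forecast parameter covariance, and $\delta F \delta^T$ gives $\Delta\chi^2$ for a small step $\delta$.

```python
import numpy as np

# Toy setup (illustrative assumptions): sample points xs, Gaussian noise sigma,
# linear model m(x) = a + b*x with parameters theta = (a, b).
xs = np.linspace(0.0, 1.0, 20)
sigma = 0.1

# F_ij = sum_k (dm_k/dtheta_i)(dm_k/dtheta_j) / sigma^2
grads = np.stack([np.ones_like(xs), xs])   # dm/da = 1, dm/db = x
F = grads @ grads.T / sigma**2

cov = np.linalg.inv(F)                     # forecast parameter covariance
print(np.sqrt(np.diag(cov)))               # 1-sigma errors on (a, b)

# Delta chi^2 for a small step delta away from the best-fit parameters
delta = np.array([0.01, -0.02])
print(delta @ F @ delta)                   # = delta F delta^T
```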
Reparametrization
https://www.wikiwand.com/en/Fisher_information#/Reparametrization
Note that the likelihood function is not a probability density over the parameters θ, but over x. Therefore, when we reparametrize, the likelihood function only changes by having its argument transformed according to the reparametrization; it doesn't pick up any Jacobian factor. The derivative with respect to the parameter does pick up a Jacobian factor, so the Fisher information depends on the parametrization. The Metric tensor it encodes is, of course, invariant, by definition of a tensor.
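Explicitly (following the Wikipedia section linked above), if $\eta$ is a new parameter with $\theta = \theta(\eta)$:
$$I_\eta(\eta) = \left(\frac{d\theta}{d\eta}\right)^2 I_\theta\!\big(\theta(\eta)\big)
\qquad\text{and, in the multi-parameter case,}\qquad
I_\eta(\eta) = J^T\, I_\theta\!\big(\theta(\eta)\big)\, J,$$
where $J_{ij} = \partial\theta_i/\partial\eta_j$ is the Jacobian of the reparametrization.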
Fisher information as a Metric
https://youtu.be/IKetDJof8pk?t=2250