Bayes' theorem as a softmax

cosmos 30th April 2017 at 3:06pm

See Neural network theory

As in Generative learning, we can relate $p(x|y)$ to $p(y|x)$ using Bayes' theorem.

Bayes' theorem can be put into the form of a Boltzmann distribution by defining the "self-information" or "surprisal" of the likelihood as the Hamiltonian, in the context of Statistical physics: $H_x(y) \equiv -\ln p(y|x)$, together with $\mu_x \equiv -\ln p(x)$ for the prior.
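
Written out (a sketch of the standard rewriting, with $x$ the latent variable and $y$ the observed data, matching the conventions used below):

$p(x|y) = \dfrac{p(y|x)\,p(x)}{\sum_{x'} p(y|x')\,p(x')} = \dfrac{e^{-[H_x(y)+\mu_x]}}{\sum_{x'} e^{-[H_{x'}(y)+\mu_{x'}]}}$

which is a Boltzmann distribution over $x$ with energy $H_x(y)+\mu_x$.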

This, in turn, can be written succinctly using the Softmax nonlinear operator $\mathbf{\sigma}$ as

$\mathbf{p}(\mathbf{y}) = \mathbf{\sigma}[-\mathbf{H}(\mathbf{y}) - \mathbf{\mu}]$

That means that if we compute the Hamiltonian (which depends on the generating process encoded in the conditional probability distribution $p(y|x)$), we can add a softmax layer to compute the Bayesian a-posteriori probability distribution. The a-priori distribution $p(x)$ enters through $\mu$, which is the bias term of the final layer.
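
As a minimal numerical sketch of this correspondence (the toy distributions and variable names below are made up for illustration), the softmax of $-H(y) - \mu$ reproduces the posterior obtained from Bayes' theorem directly:

```python
import numpy as np

# Toy generative model: 3 latent classes x, 4 possible observations y.
# Each row of p_y_given_x is p(y|x) for one class x.
p_y_given_x = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])
p_x = np.array([0.5, 0.3, 0.2])  # prior p(x)

y = 2  # index of the observed data point

# Hamiltonian ("surprisal") and bias term, as defined above:
# H_x(y) = -ln p(y|x),  mu_x = -ln p(x)
H = -np.log(p_y_given_x[:, y])
mu = -np.log(p_x)

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Posterior via the softmax form: p(x|y) = sigma[-H(y) - mu]
posterior_softmax = softmax(-H - mu)

# Posterior via Bayes' theorem directly
joint = p_y_given_x[:, y] * p_x
posterior_bayes = joint / joint.sum()

print(posterior_softmax)
print(posterior_bayes)
assert np.allclose(posterior_softmax, posterior_bayes)
```

Here $-H(y) - \mu$ plays the role of the logits of the final layer, with $\mu$ acting as its bias, so the softmax layer performs exactly the normalisation in Bayes' theorem.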