EM algorithm

cosmos 18th June 2018 at 12:07pm
Statistical inference

The expectation–maximization (EM) algorithm is an iterative method for finding Maximum likelihood or Maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. It is widely used in models for Unsupervised learning.

Bootstrap idea: General EM algorithm, see Bootstrapping

EM vs gradient descent: Why should one use EM vs., say, Gradient Descent with MLE? See discussion here; also see video.

Derivation

Deriving the general version of EM algorithm

Have some predetermined model for $P(x,z;\theta)$, for instance a Gaussian mixture model, but observe only $x$. For Mixture models, the $z$ are often labels. Want to maximize the log-likelihood (see Maximum likelihood):

$$l(\theta)=\sum_{i=1}^m \log P\left(x^{(i)};\theta\right) = \sum_{i=1}^m \log \sum_{z^{(i)}} P\left(x^{(i)},z^{(i)};\theta\right) = \sum_{i=1}^m \log \sum_{z^{(i)}} Q_i\left(z^{(i)}\right) \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{Q_i\left(z^{(i)}\right)}$$

[$Q_i(z^{(i)}) \geq 0$, $\sum_{z^{(i)}} Q_i(z^{(i)})=1$; $Q_i$ is some Probability distribution for $z$]

$$=\sum_{i=1}^m \log \mathbb{E}_{z^{(i)} \sim Q_i} \left[\frac{P\left(x^{(i)},z^{(i)};\theta\right)}{Q_i\left(z^{(i)}\right)}\right]$$

We know that $\log \mathbb{E}[x] \geq \mathbb{E}[\log x]$, by the concave version of Jensen's inequality, since $\log$ is a concave function, so:

$$l(\theta) \geq \sum_{i=1}^m \mathbb{E}_{z^{(i)} \sim Q_i} \left[\log \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{Q_i\left(z^{(i)}\right)}\right] = \sum_{i=1}^m \sum_{z^{(i)}} Q_i\left(z^{(i)}\right) \log \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{Q_i\left(z^{(i)}\right)} \quad [1]$$
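As a quick numerical sanity check of the Jensen step, here is a minimal sketch; the discrete distribution and its values are made up purely for illustration:

```python
import math

# A positive random variable X taking values 'vals' with probabilities 'probs'
# (an arbitrary illustrative distribution, not from the derivation above).
vals = [0.5, 1.0, 4.0]
probs = [0.2, 0.5, 0.3]

# E[X] and E[log X] under this distribution.
e_x = sum(p * v for p, v in zip(probs, vals))
e_log_x = sum(p * math.log(v) for p, v in zip(probs, vals))

# Jensen's inequality for the concave log: log E[X] >= E[log X].
print(math.log(e_x), e_log_x)
print(math.log(e_x) >= e_log_x)  # True
```

The gap between the two quantities is exactly what the EM lower bound gives away; it closes when the random variable inside the expectation is constant.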

Intuitive picture of the algorithm: at each iteration, the EM algorithm constructs a lower bound function for $l(\theta)$ which is tight at the current value of $\theta$ (i.e. at that value the inequality is an equality). The algorithm then maximizes this lower bound function to update $\theta$. The equality at the current $\theta$ guarantees that each step actually increases the true $l(\theta)$. To ensure this, we choose the probability distribution $Q_i(z^{(i)})$ appropriately. Note that the lower bound need not be a concave function of $\theta$; could this cause problems? I don't think so, as we constructively show below that the construction shown in the video can always be done.

The inequality becomes an equality if the random variable inside the expectation is a constant (see Jensen's inequality), so we want to choose $Q_i$ s.t.

$$\frac{P\left(x^{(i)},z^{(i)};\theta\right)}{Q_i\left(z^{(i)}\right)} = \text{const. w.r.t. } z^{(i)}$$ at the current value of $\theta$. This is an important point: otherwise we would have equality at all $\theta$, and we would just be maximizing $l(\theta)$ itself directly, so the EM algorithm wouldn't make sense.

$$\therefore Q_i\left(z^{(i)}\right) \propto P\left(x^{(i)},z^{(i)};\theta\right)$$

From normalization, $\sum_{z^{(i)}} Q_i(z^{(i)})=1$, we can determine the constant:

$$Q_i\left(z^{(i)}\right) = \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{\sum_{z^{(i)}} P\left(x^{(i)},z^{(i)};\theta\right)} = \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{P(x^{(i)}; \theta)} = P(z^{(i)}\mid x^{(i)}; \theta) \quad [2]$$
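One can verify directly that this choice makes the bound tight at the current $\theta$: substituting $Q_i(z^{(i)}) = P(z^{(i)} \mid x^{(i)};\theta)$ into the lower bound, the ratio inside the log collapses to $P(x^{(i)};\theta)$, which does not depend on $z^{(i)}$, so

$$\sum_{i=1}^m \sum_{z^{(i)}} P(z^{(i)} \mid x^{(i)};\theta) \log \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{P(z^{(i)} \mid x^{(i)};\theta)} = \sum_{i=1}^m \log P(x^{(i)};\theta) \sum_{z^{(i)}} P(z^{(i)} \mid x^{(i)};\theta) = l(\theta),$$

using that the posterior sums to 1 over $z^{(i)}$.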

EM algorithm

  1. Repeat until convergence
    1. E-step. Guess values of the $z^{(i)}$s. In particular, compute the probability distribution $Q_i(z^{(i)}=j) = P(z^{(i)} = j \mid x^{(i)}; \theta)$, from [2]. This can also be seen as an a posteriori probability distribution. Note: the value of $\theta$ here is the current value of $\theta$; it is held fixed in the maximization in the next step (despite the unfortunate notation). See here.
    2. M-step. Update the parameters $\theta$ as $\theta = \arg\max\limits_\theta \sum_{i=1}^m \sum_{z^{(i)}} Q_i\left(z^{(i)}\right) \log \frac{P\left(x^{(i)},z^{(i)};\theta\right)}{Q_i\left(z^{(i)}\right)}$. Key: this is easier to compute because the probability distribution of $z$ does not depend on the $\theta$ over which we are maximizing!
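As an illustration, here is a minimal sketch of these two steps for a one-dimensional Gaussian mixture with fixed unit variances (the variable names, synthetic data, and the choice to update only means and mixing weights are my own simplifications, not from the notes above):

```python
import math
import random

random.seed(0)

# Synthetic 1D data from two unit-variance Gaussians centred at 0 and 5.
x = [random.gauss(0.0, 1.0) for _ in range(200)] + \
    [random.gauss(5.0, 1.0) for _ in range(200)]

K = 2
mu = [min(x), max(x)]     # initial component means
phi = [1.0 / K] * K       # initial mixing weights P(z = j)

def gauss_pdf(v, m):
    # Unit-variance Gaussian density (variances held fixed for simplicity).
    return math.exp(-0.5 * (v - m) ** 2) / math.sqrt(2 * math.pi)

for _ in range(50):
    # E-step: Q_i(j) = P(z_i = j | x_i; theta), the posterior responsibilities,
    # computed at the *current* theta = (mu, phi).
    Q = []
    for v in x:
        joint = [phi[j] * gauss_pdf(v, mu[j]) for j in range(K)]
        s = sum(joint)
        Q.append([q / s for q in joint])
    # M-step: maximize the lower bound over theta; for a Gaussian mixture
    # this has closed-form updates (weighted means and cluster fractions).
    for j in range(K):
        nj = sum(Q[i][j] for i in range(len(x)))
        mu[j] = sum(Q[i][j] * x[i] for i in range(len(x))) / nj
        phi[j] = nj / len(x)

print(sorted(mu))  # means recovered near 0 and 5
```

Note how the M-step uses the responsibilities `Q` computed at the old $\theta$, exactly as the "despite the unfortunate notation" remark above warns.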

Another way of seeing the EM algorithm: as coordinate ascent, alternately maximizing the lower bound over $Q$ and over $\theta$!

The two steps are defined slightly differently here (though the algorithm is the same overall); that reference also defines the approximate EM algorithm, which can be interpreted as denoising by following gradients towards the prior, and which is used to interpret Ladder networks.


My own derivation of the likelihood using Bayes' theorem

Instead of just maximizing $P(\theta \mid z, x)$, since we don't know $z$ with certainty, we need to maximize $$P(\theta \mid x) = \sum_z P(\theta, z \mid x) = \sum_z \frac{P(x \mid \theta, z)\, P(\theta, z)}{\sum_{\theta, z} P(x, z \mid \theta)\, P(\theta)} \propto \sum_z P(x \mid \theta, z)\, P(\theta, z) = \sum_z P(x, z; \theta),$$ where $z$ represents the set of $z^{(i)}$ and the sum is over all possible configurations. $P(z)$ is known from the E-step.

See here too.

http://www.rmki.kfki.hu/~banmi/elte/bishop_em.pdf


Example

Baum-Welch algorithm