The expectation–maximization (EM) algorithm is an iterative method for finding Maximum likelihood or Maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. It is used in models for Unsupervised learning.
Bootstrap idea – General EM algorithm, see Bootstrapping
EM vs gradient descent –
Why should one use EM rather than, say, gradient descent on the MLE objective? – see discussion here; also see video
Derivation
Deriving the general version of the EM algorithm
Have some predetermined model for $P(x,z;\theta)$, for instance a Gaussian mixture model, but observe only $x$. For Mixture models, the $z$ are often the (unobserved) component labels. Want to maximize the log-likelihood (see Maximum likelihood):
$$\ell(\theta) = \sum_{i=1}^{m} \log P(x^{(i)};\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} P(x^{(i)},z^{(i)};\theta)$$
$$= \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}$$
[$Q_i$ is some Probability distribution for $z$: $Q_i(z^{(i)}) \ge 0$, $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$]
$$= \sum_{i=1}^{m} \log \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]$$
We know that $\log \mathbb{E}[x] \ge \mathbb{E}[\log x]$ by the concave version of Jensen's inequality, since $\log$ is a concave function, so:
$$\ell(\theta) \ge \sum_{i=1}^{m} \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\log \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right] = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} \tag{1}$$
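As a quick numeric sanity check of the Jensen step (a minimal sketch; the uniform samples are an arbitrary positive random variable, not anything from the derivation):

```python
import numpy as np

# Verify log E[x] >= E[log x] (concave version of Jensen's inequality)
# on samples of an arbitrary positive random variable.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 3.0, size=100_000)

lhs = np.log(x.mean())       # log E[x]
rhs = np.log(x).mean()       # E[log x]
print(lhs, rhs, lhs >= rhs)  # the last value prints True
```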
Intuitive picture of the algorithm: at each iteration, EM constructs a lower-bound function for $\ell(\theta)$ that is tight at the current value of $\theta$ (i.e. at that value the inequality is an equality), and then maximizes this lower bound to update $\theta$. The equality at the current $\theta$ guarantees that each step can only increase (or leave unchanged) the actual $\ell(\theta)$. To ensure this, we choose the probability distribution $Q_i(z^{(i)})$ appropriately. Note that the lower bound need not be a concave function of $\theta$; I don't think this causes problems, since we show constructively (as in the video) that such a tight bound can always be built.
The inequality becomes an equality if the random variable inside the expectation is a constant (see Jensen's inequality), so we want to choose $Q_i$ s.t.
$$\frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} = \text{const. w.r.t. } z^{(i)}$$
at the current value of $\theta$. This is an important point: otherwise we would have equality at all $\theta$, and we would just be maximizing $\ell(\theta)$ itself directly, so the EM algorithm wouldn't make sense.
$$\therefore Q_i(z^{(i)}) \propto P(x^{(i)},z^{(i)};\theta)$$
From the normalization $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$, we can determine the constant:
$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)};\theta)}{\sum_{z^{(i)}} P(x^{(i)},z^{(i)};\theta)} = \frac{P(x^{(i)},z^{(i)};\theta)}{P(x^{(i)};\theta)} = P(z^{(i)} \mid x^{(i)};\theta) \tag{2}$$
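To make [2] concrete: a minimal sketch (my own toy setup, assuming a 1-D two-component Gaussian mixture with made-up current parameters pi, mu, sigma) that computes $Q_i$ as the posterior and checks that the lower bound [1] is tight, i.e. equals $\ell(\theta)$, at the current $\theta$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical current parameters theta = (pi, mu, sigma) of a 1-D
# two-component Gaussian mixture (made-up values for illustration).
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.5])

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 30), rng.normal(2, 0.5, 70)])

# Joint P(x^(i), z^(i)=j; theta): rows = data points, columns = components.
joint = pi * norm.pdf(x[:, None], mu, sigma)

# E-step per [2]: Q_i(z=j) = P(z=j | x^(i); theta) = joint / P(x^(i); theta)
Q = joint / joint.sum(axis=1, keepdims=True)

log_lik = np.log(joint.sum(axis=1)).sum()  # l(theta)
bound = (Q * np.log(joint / Q)).sum()      # right-hand side of [1]
print(np.allclose(log_lik, bound))         # True: the bound is tight here
```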
EM algorithm
- Repeat until convergence
- E-step. Guess values of the $z^{(i)}$s. In particular, compute the probability distribution $Q_i(z^{(i)}=j) = P(z^{(i)}=j \mid x^{(i)};\theta)$ from [2]. This can also be seen as a posterior probability distribution. Note: the value of $\theta$ here is the current value of $\theta$; it is held fixed in the maximization in the next step (despite the unfortunate notation). See here.
- M-step. Update the parameters $\theta$ as
$$\theta = \arg\max_{\theta} \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}$$
Key: this is easier to compute because the probability distribution $Q_i$ of $z$ does not depend on the $\theta$ over which we are maximizing! (A sketch of both steps for a Gaussian mixture follows this list.)
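As a concrete instance of the two steps, a minimal self-contained sketch of EM for a 1-D Gaussian mixture (the name em_gmm_1d, the random initialization, and the fixed iteration count are my own illustrative choices, not part of the general algorithm):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Sketch of EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                    # mixing weights
    mu = rng.choice(x, size=k, replace=False)   # crude initialization
    sigma = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities Q_i(z=j) = P(z=j | x^(i); theta), eq. [2]
        joint = pi * norm.pdf(x[:, None], mu, sigma)
        Q = joint / joint.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound [1]; closed form for Gaussians
        Nj = Q.sum(axis=0)
        pi = Nj / len(x)
        mu = (Q * x[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(2, 0.5, 300)])
print(em_gmm_1d(x))  # should recover weights ~(0.4, 0.6), means ~(-1, 2)
```

The M-step is closed form here because, with $Q$ fixed, the objective separates into weighted Gaussian log-likelihood terms; the $-\sum Q_i \log Q_i$ part is constant in $\theta$.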
Another way of seeing the EM algorithm: as coordinate ascent on the lower bound, alternating between maximizing over the $Q_i$ (E-step) and over $\theta$ (M-step)!
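In symbols (my notation; this $F$ is sometimes called the free energy or the evidence lower bound, names not used above):

```latex
% Lower bound from [1], viewed as a function of both Q and theta:
F(Q,\theta) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)})
              \log \frac{P(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

% E-step: coordinate ascent in Q, theta held fixed
%         (the maximum is attained by the posterior, eq. [2])
Q^{(t+1)} = \arg\max_{Q} F(Q, \theta^{(t)})

% M-step: coordinate ascent in theta, Q held fixed
\theta^{(t+1)} = \arg\max_{\theta} F(Q^{(t+1)}, \theta)
```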
The two steps are defined slightly differently here (though the algorithm is the same overall); it also defines the approximate EM algorithm, which can be interpreted as denoising by following gradients towards the prior, and which is used to interpret Ladder networks.
My own derivation of the likelihood using Bayes' theorem
Instead of just maximizing $P(\theta \mid z, x)$, since we don't know $z$ with certainty, we need to maximize
$$P(\theta \mid x) = \sum_{z} P(\theta, z \mid x) = \sum_{z} \frac{P(x \mid \theta, z)\, P(\theta, z)}{\sum_{\theta, z} P(x, z \mid \theta)\, P(\theta)} \propto \sum_{z} P(x \mid \theta, z)\, P(\theta, z) = \sum_{z} P(x, z; \theta),$$
where $z$ represents the set of $z^{(i)}$ and the sum is over all possible configurations; the proportionality holds because the denominator does not depend on $\theta$. $P(z)$ is known from the E-step.
See here too.
http://www.rmki.kfki.hu/~banmi/elte/bishop_em.pdf
Example
Baum-Welch algorithm