The expectation–maximization (EM) algorithm is an iterative method for finding Maximum likelihood or Maximum a posteriori (MAP) estimates of parameters in statistical models that depend on unobserved latent variables. It is used in models for Unsupervised learning.
Bootstrap idea – General EM algorithm, see Bootstrapping
EM vs gradient descent –
Why should one use EM rather than, say, gradient descent on the MLE objective? – see discussion here; also see video
Derivation
Deriving the general version of the EM algorithm
Have some predetermined model for $P(x,z;\theta)$, for instance a Gaussian mixture model, but observe only $x$. For Mixture models, the $z$ are often the (unobserved) component labels. Want to maximize the log-likelihood (see Maximum likelihood):
$$\ell(\theta) = \sum_{i=1}^{m} \log P(x^{(i)};\theta) = \sum_{i=1}^{m} \log \sum_{z^{(i)}} P(x^{(i)},z^{(i)};\theta)$$
$$= \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}$$
[$Q_i$ is some Probability distribution for $z$: $Q_i(z^{(i)}) \ge 0$, $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$]
$$= \sum_{i=1}^{m} \log \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]$$
We know that $\log \mathbb{E}[x] \ge \mathbb{E}[\log x]$ by the concave version of Jensen's inequality, since $\log$ is a concave function, so:
$$\ell(\theta) \ge \sum_{i=1}^{m} \mathbb{E}_{z^{(i)} \sim Q_i}\!\left[\log \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right] = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} \tag{1}$$
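As a quick numeric sanity check of the Jensen step (a minimal sketch; the uniform samples are an arbitrary positive random variable, not anything from the derivation):

```python
import numpy as np

# Verify log E[x] >= E[log x] (concave version of Jensen's inequality)
# on samples of an arbitrary positive random variable.
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 3.0, size=100_000)

lhs = np.log(x.mean())       # log E[x]
rhs = np.log(x).mean()       # E[log x]
print(lhs, rhs, lhs >= rhs)  # the last value prints True
```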
Intuitive picture of the algorithm: at each iteration, EM constructs a lower-bound function for $\ell(\theta)$ that is tight at the current value of $\theta$ (i.e. at that value the inequality is an equality), and then maximizes this lower bound to update $\theta$. The equality at the current $\theta$ guarantees that each step can only increase (or leave unchanged) the actual $\ell(\theta)$. To ensure this, we choose the probability distribution $Q_i(z^{(i)})$ appropriately. Note that the lower bound need not be a concave function of $\theta$; I don't think this causes problems, since we show constructively (as in the video) that such a tight bound can always be built.
The inequality becomes an equality if the random variable inside the expectation is a constant (see Jensen's inequality), so we want to choose $Q_i$ s.t.
$$\frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})} = \text{const. w.r.t. } z^{(i)}$$
at the current value of $\theta$. This is an important point: otherwise we would have equality at all $\theta$, and we would just be maximizing $\ell(\theta)$ itself directly, so the EM algorithm wouldn't make sense.
$$\therefore Q_i(z^{(i)}) \propto P(x^{(i)},z^{(i)};\theta)$$
From the normalization $\sum_{z^{(i)}} Q_i(z^{(i)}) = 1$, we can determine the constant:
$$Q_i(z^{(i)}) = \frac{P(x^{(i)},z^{(i)};\theta)}{\sum_{z^{(i)}} P(x^{(i)},z^{(i)};\theta)} = \frac{P(x^{(i)},z^{(i)};\theta)}{P(x^{(i)};\theta)} = P(z^{(i)} \mid x^{(i)};\theta) \tag{2}$$
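To make [2] concrete: a minimal sketch (my own toy setup, assuming a 1-D two-component Gaussian mixture with made-up current parameters pi, mu, sigma) that computes $Q_i$ as the posterior and checks that the lower bound [1] is tight, i.e. equals $\ell(\theta)$, at the current $\theta$:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical current parameters theta = (pi, mu, sigma) of a 1-D
# two-component Gaussian mixture (made-up values for illustration).
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 2.0])
sigma = np.array([1.0, 0.5])

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1, 1, 30), rng.normal(2, 0.5, 70)])

# Joint P(x^(i), z^(i)=j; theta): rows = data points, columns = components.
joint = pi * norm.pdf(x[:, None], mu, sigma)

# E-step per [2]: Q_i(z=j) = P(z=j | x^(i); theta) = joint / P(x^(i); theta)
Q = joint / joint.sum(axis=1, keepdims=True)

log_lik = np.log(joint.sum(axis=1)).sum()  # l(theta)
bound = (Q * np.log(joint / Q)).sum()      # right-hand side of [1]
print(np.allclose(log_lik, bound))         # True: the bound is tight here
```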
EM algorithm
- Repeat until convergence
- E-step. Guess values of the $z^{(i)}$s. In particular, compute the probability distribution $Q_i(z^{(i)}=j) = P(z^{(i)}=j \mid x^{(i)};\theta)$ from [2]. This can also be seen as a posterior probability distribution. Note: the value of $\theta$ here is the current value of $\theta$; it is held fixed in the maximization in the next step (despite the unfortunate notation). See here.
- M-step. Update the parameters $\theta$ as
$$\theta = \arg\max_{\theta} \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{P(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}$$
Key: this is easier to compute because the probability distribution $Q_i$ of $z$ does not depend on the $\theta$ over which we are maximizing! (A sketch of both steps for a Gaussian mixture follows this list.)
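As a concrete instance of the two steps, a minimal self-contained sketch of EM for a 1-D Gaussian mixture (the name em_gmm_1d, the random initialization, and the fixed iteration count are my own illustrative choices, not part of the general algorithm):

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, k=2, n_iter=50, seed=0):
    """Sketch of EM for a 1-D Gaussian mixture with k components."""
    rng = np.random.default_rng(seed)
    pi = np.full(k, 1.0 / k)                    # mixing weights
    mu = rng.choice(x, size=k, replace=False)   # crude initialization
    sigma = np.full(k, x.std())
    for _ in range(n_iter):
        # E-step: responsibilities Q_i(z=j) = P(z=j | x^(i); theta), eq. [2]
        joint = pi * norm.pdf(x[:, None], mu, sigma)
        Q = joint / joint.sum(axis=1, keepdims=True)
        # M-step: maximize the lower bound [1]; closed form for Gaussians
        Nj = Q.sum(axis=0)
        pi = Nj / len(x)
        mu = (Q * x[:, None]).sum(axis=0) / Nj
        sigma = np.sqrt((Q * (x[:, None] - mu) ** 2).sum(axis=0) / Nj)
    return pi, mu, sigma

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-1, 1, 200), rng.normal(2, 0.5, 300)])
print(em_gmm_1d(x))  # should recover weights ~(0.4, 0.6), means ~(-1, 2)
```

The M-step is closed form here because, with $Q$ fixed, the objective separates into weighted Gaussian log-likelihood terms; the $-\sum Q_i \log Q_i$ part is constant in $\theta$.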
Another way of seeing the EM algorithm: as coordinate ascent on the lower bound, alternating between maximizing over the $Q_i$ (E-step) and over $\theta$ (M-step)!
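In symbols (my notation; this $F$ is sometimes called the free energy or the evidence lower bound, names not used above):

```latex
% Lower bound from [1], viewed as a function of both Q and theta:
F(Q,\theta) = \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)})
              \log \frac{P(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}

% E-step: coordinate ascent in Q, theta held fixed
%         (the maximum is attained by the posterior, eq. [2])
Q^{(t+1)} = \arg\max_{Q} F(Q, \theta^{(t)})

% M-step: coordinate ascent in theta, Q held fixed
\theta^{(t+1)} = \arg\max_{\theta} F(Q^{(t+1)}, \theta)
```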
The two steps are defined slightly differently here (though the algorithm is the same overall); it also defines the approximate EM algorithm, which can be interpreted as denoising by following gradients towards the prior, and which is used to interpret Ladder networks.
My own derivation of the likelihood using Bayes' theorem
Instead of just maximizing $P(\theta \mid z, x)$, since we don't know $z$ with certainty, we need to maximize
$$P(\theta \mid x) = \sum_{z} P(\theta, z \mid x) = \sum_{z} \frac{P(x \mid \theta, z)\, P(\theta, z)}{\sum_{\theta, z} P(x, z \mid \theta)\, P(\theta)} \propto \sum_{z} P(x \mid \theta, z)\, P(\theta, z) = \sum_{z} P(x, z; \theta),$$
where $z$ represents the set of $z^{(i)}$ and the sum is over all possible configurations; the proportionality holds because the denominator does not depend on $\theta$. $P(z)$ is known from the E-step.
See here too.
http://www.rmki.kfki.hu/~banmi/elte/bishop_em.pdf
Example
Baum-Welch algorithm