Understanding Expectation Maximization as Coordinate Ascent

These notes are based on what I learned from my first postdoc advisor, who learned it (I believe) from (Neal and Hinton 1998). See also Section 4 of (Roweis and Ghahramani 1999) for a short derivation, and the broader discussion in Chapter 9 of Bishop, in particular Section 9.4.

Introduction

When performing maximum likelihood estimation of parameters $\theta$ explaining some data $X$, that is, solving $$ \argmax_\theta \log p(X|\theta),$$ it can sometimes be useful to introduce hidden (latent) variables $Z$ and to consider the joint distribution $p(X, Z | \theta)$.

There are a variety of reasons why we would consider doing this. Sometimes, $Z$ represents missing data, which if we knew would make the estimation problem easier. Sometimes, in the generative view, $Z$ represents causes that have generated our observations $X$ – for example, objects in the surrounding environment that have produced the input to our retina. Sometimes, it’s for purely computational reasons, to simplify estimation.

Having introduced the latent variables $Z$, how do they make our life easier? A naive application of the sum rule of probability doesn’t help, $$ \argmax_\theta \log p(X|\theta) = \argmax_\theta \log \int p(Z, X | \theta) dZ,$$ because the integral is inside the logarithm.

The Expectation-Maximization algorithm says that, given a previous estimate $\theta_{t-1}$, we should first compute the expectation $$ Q(\theta, \theta_{t-1}) = \int p(Z|X, \theta_{t-1}) \log p(X, Z|\theta) dZ.$$ Then, we should maximize it with respect to $\theta$ to update the parameters: $$\theta_t = \argmax_\theta Q(\theta, \theta_{t-1}).$$ We are guaranteed that this procedure will not decrease the log likelihood.
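
To make this prescription concrete, here is a minimal sketch (my own illustration, not part of the original notes) of the two steps for a two-component, unit-variance Gaussian mixture in Python/NumPy. The synthetic data, the fixed variances, and all variable names are assumptions made purely for the example; the E-step computes the posterior responsibilities and the M-step maximizes $Q(\theta, \theta_{t-1})$ in closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed toy data: two 1D Gaussian clusters with means -2 and 3 and unit variance.
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])

# Parameters theta = (mixing weights pi, component means mu); variances are held fixed at 1.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])

def log_joint(pi, mu):
    # log p(x_n, z_n = k | theta) for every data point n and component k
    return np.log(pi)[None, :] - 0.5 * (X[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)

for _ in range(50):
    # E-step: responsibilities r_{nk} = p(z_n = k | x_n, theta_{t-1})
    lj = log_joint(pi, mu)
    r = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))

    # M-step: closed-form argmax of Q(theta, theta_{t-1}) = E_q log p(X, Z | theta)
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r * X[:, None]).sum(axis=0) / Nk

print(pi, mu)  # approaches the generating weights (0.4, 0.6) and means (-2, 3)
```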

The motivation behind this algorithm and why it works were always a bit mysterious to me. However, it turns out, as shown in (Neal and Hinton 1998), that we can view EM as coordinate ascent on a lower bound to the log likelihood. That’s what we’re going to describe here.

Lower-bounding the log likelihood

Instead of expressing $p(X|\theta)$ as what’s left over when we marginalize out the latent variables, we take a different approach and express the log likelihood as an expectation over the latent variable.

We start with a trivial step. For any probability distribution $q(Z)$ over $Z$, we have $$\log p(X|\theta) = 1 \cdot \log p(X|\theta) = \left(\int q(Z) dZ \right) \cdot \log p(X|\theta).$$ We can then bring the log likelihood inside the integral, and therefore express it as an expectation relative to $q(Z)$: $$ \dots = \int q(Z) \log p(X|\theta) dZ.$$

We now compound this with another triviality: $$\log p(X|\theta) = \log (p(X|\theta) \cdot 1) = \log \left(p(X|\theta) {q(Z) \over q(Z)}\right).$$

To gain something from all of this, we now have to express the log likelihood in terms of $Z$. As we saw above, doing so in terms of the sum rule of probability (marginalization) wasn’t useful. Instead, we can try the product rule, and express the likelihood as the ratio of joint and conditional probabilities, $$ p(X|\theta) = {p(X, Z|\theta) \over p(Z|X, \theta)}.$$ We now get a more interesting expression for the log likelihood
$$ \dots = \int q(Z) \log {p(X, Z | \theta) q(Z) \over p(Z|X, \theta) q(Z)} dZ.$$

Expanding the logarithm into three expectations, we get \begin{align}\dots &= \underbrace{\int q(Z) \log p(X,Z|\theta) dZ}_{\mathbb E_q \log p(X, Z|\theta)} + \underbrace{\int q(Z) \log {1 \over q(Z)} dZ}_{H(q)} + \underbrace{\int q(Z) \log {q(Z) \over p(Z|X,\theta)} dZ}_{D_{KL}(q||p(Z|X,\theta))}. \end{align}

We now combine the first two terms into the famous negative variational free energy,
$$ \mathcal{F}(q, \theta) = \mathbb E_q \log p(X, Z|\theta) + H(q). \tag{1} $$ This quantity is also known as the Evidence Lower BOund or ELBO, because we can also write it as $$ \mathcal{F}(q, \theta) = \log p(X|\theta) - D_{KL}(q||p(Z|X,\theta)). \tag{2} $$ Notice that this implies $$\mathcal{F}(q, \theta)\le \log p(X|\theta),$$ since the KL divergence is non-negative. Therefore $\mathcal F$ lower bounds the log likelihood, which in some settings is also known as the model evidence, hence the name.
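
As a quick numerical sanity check (an addition of mine, not in the original derivation), the two expressions (1) and (2) can be evaluated directly on a tiny discrete model and compared. The particular probability tables and the use of NumPy are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed tiny discrete model: one observed x, a latent z in {0, ..., K-1},
# with arbitrary fixed tables for p(z | theta) and p(x | z, theta).
K = 4
p_z = rng.dirichlet(np.ones(K))            # p(z | theta)
p_x_given_z = rng.uniform(0.1, 0.9, K)     # p(x | z, theta) for the observed x

p_xz = p_z * p_x_given_z                   # joint p(x, z | theta)
p_x = p_xz.sum()                           # evidence p(x | theta)
p_z_given_x = p_xz / p_x                   # posterior p(z | x, theta)

q = rng.dirichlet(np.ones(K))              # an arbitrary distribution q(z)

elbo_1 = np.sum(q * np.log(p_xz)) - np.sum(q * np.log(q))    # eqn (1): E_q log p(x,z) + H(q)
elbo_2 = np.log(p_x) - np.sum(q * np.log(q / p_z_given_x))   # eqn (2): log p(x) - KL(q || posterior)

print(np.isclose(elbo_1, elbo_2))          # True: the two forms agree
print(elbo_1 <= np.log(p_x))               # True: the ELBO lower bounds the log evidence
```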

EM as coordinate ascent

Now that we have a lower bound to the log likelihood, we can maximize it and hope to improve the log likelihood. Since $\mathcal F(q,\theta)$ is a function of two variables, we can perform this maximization using coordinate-ascent, where we iteratively maximize with respect to the distribution $q$, and the parameters $\theta$.

To do so, we leverage the two different expressions for $\mathcal F(q, \theta)$ we derived above. Notice that in the second one (eqn. 2), the distribution $q$ only appears in the KL divergence term. So, given the previous set of parameters $\theta_{t-1}$, maximizing the ELBO with respect to the distribution $q$ amounts to minimizing the KL divergence, which happens when the two distributions in the divergence are equal. We thus arrive at the E-step:
$$ q_{t}(Z) = p(Z|X, \theta_{t-1}).$$

At the end of the E-step, the KL divergence has been reduced to zero and the ELBO equals the log likelihood: $$ \mathcal{F}(q_t, \theta_{t-1}) = \log p(X|\theta_{t-1}).$$

In fact, the ELBO is tangent to the log likelihood, i.e., it has the same slope with respect to the parameters. This is because, by eqn. 2, its slope is composed of two terms: the contribution from the log likelihood itself, and that from the KL divergence. But the latter term is zero when evaluated at $(q_{t}, \theta_{t-1})$: \begin{align} {\partial D_{KL}(q||p(Z|X,\theta)) \over \partial \theta} &= {\partial \over \partial \theta}\int \left[ q(Z) \log {q(Z) \over p(Z|X,\theta)}\right] dZ \\ &=- \int q(Z) {\partial \log p(Z|X,\theta) \over \partial \theta} dZ\\ &=- \int {q(Z) \over p(Z|X,\theta)} {\partial p(Z|X,\theta) \over \partial \theta} dZ . \end{align} At $(q_t, \theta_{t-1})$ we have, by the E-step, that $q_t(Z) = p(Z|X, \theta_{t-1})$, so the integral simplifies to $$ \left.\dots \right|_{(q_t, \theta_{t-1})} = -\int {\partial p(Z|X,\theta) \over \partial \theta} dZ = -{\partial \over \partial \theta} \int p(Z|X,\theta) dZ = -{\partial \over \partial \theta} 1 = 0.$$
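
Here is a small finite-difference check of the tangency claim, again an illustration of my own rather than part of the original argument. It uses an assumed toy model with one binary latent variable and a single scalar parameter entering through a sigmoid; with $q_t$ set to the posterior at $\theta_{t-1}$, the ELBO matches both the value and the slope of the log likelihood there.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Assumed toy model: binary latent z, scalar parameter theta with p(z=1 | theta) = sigmoid(theta),
# and a fixed table p(x | z) for the single observed x.
p_x_given_z = np.array([0.2, 0.7])

def joint(theta):
    p_z = np.array([1.0 - sigmoid(theta), sigmoid(theta)])
    return p_z * p_x_given_z                      # p(x, z | theta) for z = 0, 1

def log_evidence(theta):
    return np.log(joint(theta).sum())             # log p(x | theta)

theta0 = 0.3                                      # plays the role of theta_{t-1}
q = joint(theta0) / joint(theta0).sum()           # E-step: q(z) = p(z | x, theta0)

def elbo(theta):
    return np.sum(q * np.log(joint(theta))) - np.sum(q * np.log(q))

eps = 1e-5                                        # central finite differences
slope_ll = (log_evidence(theta0 + eps) - log_evidence(theta0 - eps)) / (2 * eps)
slope_elbo = (elbo(theta0 + eps) - elbo(theta0 - eps)) / (2 * eps)
print(np.isclose(elbo(theta0), log_evidence(theta0)))   # the ELBO touches the log likelihood
print(np.isclose(slope_ll, slope_elbo))                 # and has the same slope there
```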

Next, in the M-step, we maximize the ELBO with respect to the parameters. We do this by using the first expression for the ELBO (eqn. 1). In that expression, the parameters appear as a variable in the first term only, in $\log p(X, Z|\theta).$ We then have \begin{align} \theta_t &= \argmax_\theta \mathcal{F}(q_t, \theta) \\&= \argmax_\theta \mathbb E_{q_t} \log p(X, Z|\theta)\\&= \argmax_\theta \mathbb E_{p(Z|X,\theta_{t-1})} \log p(X, Z|\theta)\\&= \argmax_\theta Q(\theta, \theta_{t-1}).\end{align} So we see that the original formulation of EM is equivalent to our coordinate ascent perspective.
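
As a concrete instance (my own worked example in the unit-variance Gaussian-mixture setting sketched earlier, not something model-specific from the original notes), write $r_{nk} = q_t(z_n = k)$ for the responsibilities from the E-step. Then $$ Q(\theta, \theta_{t-1}) = \sum_n \sum_k r_{nk}\left[\log \pi_k - \tfrac{1}{2}(x_n - \mu_k)^2\right] + \text{const},$$ and setting $\partial Q/\partial \mu_k = \sum_n r_{nk}(x_n - \mu_k) = 0$, together with maximizing over the mixing weights under the constraint $\sum_k \pi_k = 1$, gives the closed-form M-step $$ \mu_k = {\sum_n r_{nk} x_n \over \sum_n r_{nk}}, \qquad \pi_k = {\sum_n r_{nk} \over N}.$$ These are exactly the updates used in the code sketch above.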

To see why EM does not decrease the log likelihood, note that $$ \log p(X|\theta_{t-1}) = \mathcal{F}(q_t, \theta_{t-1}) \underbrace{\le}_{\text{M-step}} \mathcal{F}(q_t, \theta_t) \underbrace{\le}_{\text{E-step}} \mathcal{F}(q_{t+1}, \theta_t) = \log p(X|\theta_t).$$ The inequalities will also hold if we merely increase, rather than maximize, $\mathcal F$ in the M-step, so a full maximization is not necessary.
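
The chain above can also be checked numerically. The following sketch (my own, with the same assumed toy mixture as before) runs a single EM iteration and verifies each link: the ELBO matches the log likelihood right after the E-step, the M-step can only raise the ELBO, and the ELBO at the new parameters still lower bounds the new log likelihood.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 300)])  # assumed toy data
pi0, mu0 = np.array([0.5, 0.5]), np.array([-1.0, 1.0])                       # theta_{t-1}

def log_joint(pi, mu):
    # log p(x_n, z_n = k | theta) for a two-component, unit-variance mixture
    return np.log(pi)[None, :] - 0.5 * (X[:, None] - mu[None, :]) ** 2 - 0.5 * np.log(2 * np.pi)

def log_lik(pi, mu):
    return np.logaddexp.reduce(log_joint(pi, mu), axis=1).sum()              # log p(X | theta)

def elbo(q, pi, mu):
    # F(q, theta) = E_q log p(X, Z | theta) + H(q)
    return np.sum(q * log_joint(pi, mu)) - np.sum(q * np.log(q))

# E-step: q_t(Z) = p(Z | X, theta_{t-1})
lj = log_joint(pi0, mu0)
q = np.exp(lj - np.logaddexp.reduce(lj, axis=1, keepdims=True))

# M-step: theta_t maximizes E_{q_t} log p(X, Z | theta) in closed form
Nk = q.sum(axis=0)
pi1, mu1 = Nk / len(X), (q * X[:, None]).sum(axis=0) / Nk

print(np.isclose(log_lik(pi0, mu0), elbo(q, pi0, mu0)))  # equality after the E-step
print(elbo(q, pi0, mu0) <= elbo(q, pi1, mu1))            # the M-step raises the ELBO
print(elbo(q, pi1, mu1) <= log_lik(pi1, mu1))            # the ELBO remains a lower bound
```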

There is a nice picture of the EM process in Bishop, Fig. 9.14:

Bishop, Fig. 9.14, illustrating EM. The ELBO (blue, green curves) always lower bounds the log likelihood (red), and is tangent to it at the old parameter values. The M-step maximizes the ELBO (blue curve) and produces the new parameters $\theta_{new}$. The E-step then updates the distribution over latents to match the true posterior at the new parameters, raising the ELBO so that it is again tangent to the log likelihood at the new parameter value.

Summary

In this post we’ve discussed maximum likelihood estimation in models with latent variables, and seen how the EM prescription $$ \theta_t = \argmax_\theta Q(\theta, \theta_{t-1}),$$ can be seen as coordinate ascent on the ELBO, \begin{align} \mathcal{F}(q, \theta) &= \log p(X|\theta) - D_{KL}(q||p(Z|X, \theta))\\ &= \mathbb E_q \log p(X, Z|\theta) + H(q).\end{align} We alternate between maximizing with respect to $q$ using the first expression, which gives the E-step: $$ q_t = p(Z|X,\theta_{t-1}),$$ and maximizing with respect to the parameters using the second expression, which gives the M-step: $$ \theta_t = \argmax_\theta \mathbb E_{q_t} \log p(X, Z|\theta).$$ In doing so we’re guaranteed not to decrease the log likelihood.

References

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9.

Neal, R. M., & Hinton, G. E. (1998). A view of the EM algorithm that justifies incremental, sparse, and other variants. In M. I. Jordan (Ed.), Learning in Graphical Models. Kluwer Academic Publishers.

Roweis, S., & Ghahramani, Z. (1999). A unifying review of linear Gaussian models. Neural Computation, 11(2), 305–345.
$$\blacksquare$$
