{"id":3575,"date":"2025-03-09T18:02:41","date_gmt":"2025-03-09T18:02:41","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=3575"},"modified":"2025-12-27T20:05:54","modified_gmt":"2025-12-27T20:05:54","slug":"understanding-expectation-maximization-as-coordinate-ascent","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2025\/03\/09\/understanding-expectation-maximization-as-coordinate-ascent\/","title":{"rendered":"Understanding Expectation Maximization as Coordinate Ascent"},"content":{"rendered":"\n<p><em>These notes are based on what I learned from my first postdoc advisor, who learned it (I believe) from (Neal and Hinton 1998). See also section 4 of (Roweis and Ghahramani 1999) for a short derivation, and the broader discussion in Chapter 9 of Bishop, in particular Section 9.4<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p>When performing maximum likelihood estimation of parameters $\\theta$ explaining some data $X$, that is, solving $$ \\argmax_\\theta \\log p(X|\\theta),$$ it can sometimes be useful to introduce hidden (latent) variables $Z$ and to consider the joint distribution $p(X, Z | \\theta)$. <\/p>\n\n\n\n<p>There are a variety of reasons why we would consider doing this. Sometimes, $Z$ represents missing data, which if we knew would make the estimation problem easier. Sometimes, in the generative view, $Z$ represents causes that have generated our observations $X$ &#8211; for example, objects in the surrounding environment that have produced the input to our retina. Sometimes, it&#8217;s for purely computational reasons, to simplify estimation.<\/p>\n\n\n\n<p>Having introduced the latent variables $Z$, how do they make our life easier? A naive application of the sum rule of probability doesn&#8217;t help, $$ \\argmax_\\theta \\log p(X|\\theta) = \\argmax_\\theta \\log \\int p(Z, X | \\theta) dZ,$$ because the integral is inside the logarithm.<\/p>\n\n\n\n<p>The <em>Expectation-Maximization<\/em> algorithm says that, given a previous estimate $\\theta_{t-1}$, we should first compute the <em>expectation<\/em> $$ Q(\\theta, \\theta_{t-1}) = \\int p(Z|X, \\theta_{t-1}) \\log p(X, Z|\\theta) dZ.$$ Then, we should <em>maximize<\/em> it with respect to $\\theta$ to update the parameters: $$\\theta_t = \\argmax_\\theta Q(\\theta, \\theta_{t-1}).$$ We are guaranteed that this procedure will not decrease the log likelihood.<\/p>\n\n\n\n<p>The motivation behind this algorithm and why it works were always a bit mysterious to me. However, it turns out, as shown in (Neal and Hinton 1999) that we can view EM as coordinate descent on a lower bound to the log likelihood. That&#8217;s what we&#8217;re going to describe here.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Lower-bounding the log likelihood<\/h2>\n\n\n\n<p>Instead of expressing $p(X|\\theta)$ as what&#8217;s left over when we marginalize out the latent variables, we take a different approach and express the log likelihood as an expectation over the latent variable. <\/p>\n\n\n\n<p>We start with a trival step. 
For any probability distribution $q(Z)$ over $Z$, we have $$\log p(X|\theta) = 1 \cdot \log p(X|\theta) = \left(\int q(Z) dZ \right) \cdot \log p(X|\theta).$$ We can then bring the log likelihood inside the integral, and therefore express it as an expectation relative to $q(Z)$: $$ \dots = \int q(Z) \log p(X|\theta) dZ.$$<\/p>\n\n\n\n<p>We now compound this with another triviality: $$\log p(X|\theta) = \log (p(X|\theta) \cdot 1) = \log \left(p(X|\theta) {q(Z) \over q(Z)}\right).$$<\/p>\n\n\n\n<p>To gain something from all of this, we now have to express the log likelihood in terms of $Z$. As we saw above, doing so in terms of the sum rule of probability (marginalization) wasn&#8217;t useful. Instead, we can try the product rule, and express the likelihood as the ratio of joint and conditional probabilities, $$ p(X|\theta) = {p(X, Z|\theta) \over p(Z|X, \theta)}.$$ We now get a more interesting expression for the log likelihood:<br>$$ \dots = \int q(Z) \log {p(X, Z | \theta) q(Z) \over p(Z|X, \theta) q(Z)} dZ.$$<\/p>\n\n\n\n<p>Expanding the logarithm into three expectations, we get \begin{align}\dots &amp;= \underbrace{\int q(Z) \log p(X,Z|\theta) dZ}_{\mathbb E_q \log p(X, Z|\theta)} + \underbrace{\int q(Z) \log {1 \over q(Z)} dZ}_{H(q)} + \underbrace{\int q(Z) \log {q(Z) \over p(Z|X,\theta)} dZ}_{D_{KL}(q||p(Z|X,\theta))}. \end{align}<\/p>\n\n\n\n<p>We now combine the first two terms into the famous <strong>negative variational free energy<\/strong>,<br>$$ \mathcal{F}(q, \theta) = \mathbb E_q \log p(X, Z|\theta) + H(q). \tag{1} $$ This quantity is also known as the <strong>Evidence Lower BOund<\/strong> or <strong>ELBO<\/strong>, because we can also write it as $$ \mathcal{F}(q, \theta) = \log p(X|\theta) - D_{KL}(q||p(Z|X,\theta)). \tag{2} $$ Notice that this implies $$\mathcal{F}(q, \theta)\le \log p(X|\theta)$$ since the KL divergence is non-negative. Therefore $\mathcal F$ lower bounds the log likelihood, which in some settings is also known as the model evidence, hence the name.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">EM as coordinate ascent<\/h2>\n\n\n\n<p>Now that we have a lower bound to the log likelihood, we can maximize <em>it<\/em> and hope to improve the log likelihood. Since $\mathcal F(q,\theta)$ is a function of two variables, we can perform this maximization using coordinate ascent, where we iteratively maximize with respect to the distribution $q$ and the parameters $\theta$.<\/p>\n\n\n\n<p>To do so, we leverage the two different expressions for $\mathcal F(q, \theta)$ we derived above. Notice that in the second one (eqn. 2), the distribution $q$ only appears in the KL divergence term. So, given the previous set of parameters $\theta_{t-1}$, we can maximize the ELBO with respect to the distribution $q$ using this expression: since the first term, $\log p(X|\theta_{t-1})$, does not depend on $q$, maximizing the ELBO means minimizing the KL divergence, which is minimized (at zero) exactly when the two distributions in the divergence are equal. We thus arrive at the <strong>E-step<\/strong>:<br>$$ q_{t}(Z) = p(Z|X, \theta_{t-1}).$$<\/p>\n\n\n\n<p>At the end of the E-step, the KL divergence has been reduced to zero and the ELBO equals the log likelihood: $$ \mathcal{F}(q_t, \theta_{t-1}) = \log p(X|\theta_{t-1}).$$ <\/p>\n\n\n\n
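<p>To make the two expressions for the ELBO concrete, here is a minimal numerical sketch; the toy model, fixed weights, unit variances, and variable names are assumptions made purely for illustration. For a single observation from a two-component Gaussian mixture, it checks that eqns. (1) and (2) give the same value for an arbitrary $q$, and that setting $q$ to the posterior, as in the E-step, makes the bound tight.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\n# Illustrative toy model (an assumption for this sketch, not part of the derivation):\n# one observation x from a mixture of two unit-variance Gaussians.\n# The latent Z is the component label; theta holds the two component means.\nx = 1.3\nweights = np.array([0.5, 0.5])     # fixed mixing weights\ntheta = np.array([-1.0, 2.0])      # current parameter estimate\n\ndef log_gauss(x, mu):\n    return -0.5 * np.log(2 * np.pi) - 0.5 * (x - mu) ** 2\n\n# log p(x, z | theta) for z = 0, 1\nlog_joint = np.log(weights) + log_gauss(x, theta)\n\n# log p(x | theta), marginalizing out z\nlog_evidence = np.logaddexp(log_joint[0], log_joint[1])\n\n# An arbitrary distribution q(z), and the exact posterior p(z | x, theta)\nq = np.array([0.3, 0.7])\nposterior = np.exp(log_joint - log_evidence)\n\ndef elbo(q):\n    # Eqn. (1): E_q[log p(x, z | theta)] + H(q)\n    return np.sum(q * log_joint) - np.sum(q * np.log(q))\n\ndef kl(q, p):\n    return np.sum(q * (np.log(q) - np.log(p)))\n\n# Eqn. (1) and eqn. (2) agree for any q ...\nassert np.isclose(elbo(q), log_evidence - kl(q, posterior))\n# ... and after the E-step (q = posterior) the bound is tight.\nassert np.isclose(elbo(posterior), log_evidence)</code><\/pre>\n\n\n\n<p>For any other choice of $q$, the gap in the final comparison is exactly $D_{KL}(q||p(Z|X,\theta))$, which is another way to see that the E-step is the maximizing choice of $q$.<\/p>\n\n\n\n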
<p>In fact, the ELBO is tangent to the log likelihood, i.e., it has the same slope with respect to the parameters. This is because its slope is composed of two terms: the contribution from the log likelihood itself, and that from the KL divergence. But the latter term is zero when evaluated at $(q_{t}, \theta_{t-1})$: \begin{align} {\partial D_{KL}(q||p(Z|X,\theta)) \over \partial \theta} &amp;= {\partial \over \partial \theta}\int \left[ q(Z) \log {q(Z) \over p(Z|X,\theta)}\right] dZ \\ &amp;= -\int q(Z) {\partial \log p(Z|X,\theta) \over \partial \theta} dZ\\ &amp;= -\int {q(Z) \over p(Z|X,\theta)} {\partial p(Z|X,\theta) \over \partial \theta} dZ. \end{align} At $(q_t, \theta_{t-1})$ we have, by the E-step, that $q_t(Z) = p(Z|X, \theta_{t-1})$, so the integral simplifies to $$ \left.\dots \right|_{(q_t, \theta_{t-1})} = \int {\partial p(Z|X,\theta) \over \partial \theta} dZ ={\partial \over \partial \theta} \int p(Z|X,\theta) dZ ={\partial \over \partial \theta} 1 = 0.$$ <\/p>\n\n\n\n<p>Next, in the <strong>M-step<\/strong>, we maximize the ELBO with respect to the parameters. We do this using the first expression for the ELBO (eqn. 1). In that expression, the parameters appear only in the first term, through $\log p(X, Z|\theta)$. We then have \begin{align} \theta_t &amp;= \argmax_\theta \mathcal{F}(q_t, \theta) \\&amp;= \argmax_\theta \mathbb E_{q_t} \log p(X, Z|\theta)\\&amp;= \argmax_\theta \mathbb E_{p(Z|X,\theta_{t-1})} \log p(X, Z|\theta)\\&amp;= \argmax_\theta Q(\theta, \theta_{t-1}).\end{align} So we see that the original formulation of EM is equivalent to our coordinate ascent perspective.<\/p>\n\n\n\n<p>To see why EM does not decrease the log likelihood, note that $$ \log p(X|\theta_{t-1}) = \mathcal{F}(q_t, \theta_{t-1}) \underbrace{\le}_{\text{M-step}} \mathcal{F}(q_t, \theta_t) \underbrace{\le}_{\text{E-step}} \mathcal{F}(q_{t+1}, \theta_t) = \log p(X|\theta_t).$$ The inequalities will also hold if we merely increase, rather than maximize, the ELBO in the M-step, so a full maximization is not necessary.<\/p>\n\n\n\n
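<p>To make the coordinate-ascent view concrete, here is a minimal sketch of the full loop for an assumed toy model: a one-dimensional mixture of two unit-variance Gaussians with equal, fixed weights, where only the means are estimated. The data, variable names, and the closed-form M-step below are specific to this illustrative setup, not part of the general argument. Each iteration performs the E-step and M-step above and checks that the log likelihood never decreases.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>import numpy as np\n\n# Toy setup (an illustrative assumption): 1-D data from two unit-variance\n# Gaussian components with known equal weights; only the means are learned.\nrng = np.random.default_rng(1)\nX = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])\nmu = np.array([-0.5, 0.5])   # initial parameter estimate theta_0\n\ndef log_joint(X, mu):\n    # log p(x_n, z_n = k | theta): rows index data points, columns components\n    return np.log(0.5) - 0.5 * np.log(2 * np.pi) - 0.5 * (X[:, None] - mu[None, :]) ** 2\n\ndef log_likelihood(X, mu):\n    lj = log_joint(X, mu)\n    return np.logaddexp(lj[:, 0], lj[:, 1]).sum()\n\nprev = -np.inf\nfor t in range(50):\n    # E-step: q_t(z_n) = p(z_n | x_n, theta_{t-1}); this makes the ELBO tight.\n    lj = log_joint(X, mu)\n    q = np.exp(lj - np.logaddexp(lj[:, 0], lj[:, 1])[:, None])   # shape (N, 2)\n\n    # M-step: maximize E_q[log p(X, Z | theta)] over the means; for this model\n    # the maximizer is a responsibility-weighted average of the data.\n    mu = (q * X[:, None]).sum(axis=0) \/ q.sum(axis=0)\n\n    ll = log_likelihood(X, mu)\n    assert ll &gt;= prev - 1e-9   # the log likelihood never decreases\n    prev = ll\n\nprint(mu)   # approaches the true means, roughly [-2, 3]</code><\/pre>\n\n\n\n<p>The monotonicity check relies only on the coordinate-ascent argument above; the Gaussian form is not needed for it, it just makes both steps available in closed form.<\/p>\n\n\n\n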
<p>There is a nice picture of the EM process in Bishop, Fig. 9.14:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"747\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/03\/image-1024x747.png\" alt=\"\" class=\"wp-image-3635\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/03\/image-1024x747.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/03\/image-300x219.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/03\/image-768x560.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/03\/image.png 1248w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Bishop, Fig. 9.14, illustrating EM. The ELBO (blue, green curves) always lower bounds the log likelihood (red), and is tangent to it at the old parameter values. The M-step maximizes the ELBO (blue curve) and produces the new parameters $\theta_{new}$. The E-step then updates the distribution over the latents to match the true posterior at the new parameter values, raising the ELBO so that it is again tangent to the log likelihood there.<\/em><\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Summary<\/h2>\n\n\n\n<p>In this post we&#8217;ve discussed maximum likelihood estimation in models with latent variables and seen how the EM prescription $$ \theta_{t+1} = \argmax_\theta Q(\theta, \theta_t)$$ can be understood as coordinate ascent on the ELBO, \begin{align} \mathcal{F}(q, \theta) &amp;= \log p(X|\theta) - D_{KL}(q||p(Z|X, \theta))\\ &amp;= \mathbb E_q \log p(X, Z|\theta) + H(q).\end{align} We alternate between maximizing with respect to $q$ using the first expression, which gives the E-step: $$ q_t = p(Z|X,\theta_{t-1}),$$ and maximizing with respect to the parameters using the second expression, which gives the M-step: $$ \theta_t = \argmax_\theta \mathbb E_{q_t} \log p(X, Z|\theta).$$ In doing so we&#8217;re guaranteed not to decrease the log likelihood.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">References<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li><a href=\"https:\/\/www.cs.utoronto.ca\/~hinton\/absps\/em.htm\">Neal, Hinton 1998: &#8220;A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants.&#8221;<\/a><\/li>\n\n\n\n<li><a href=\"https:\/\/mlg.eng.cam.ac.uk\/zoubin\/papers\/lds.pdf\">Roweis, Ghahramani 1999: &#8220;A unifying review of linear Gaussian models.&#8221;<\/a><\/li>\n\n\n\n<li>Bishop 2006: &#8220;Pattern Recognition and Machine Learning.&#8221;<\/li>\n<\/ul>\n\n\n\n<p><br>$$\blacksquare$$<\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>In these notes we follow Neal and Hinton 1998 and show how to view EM as coordinate ascent on the negative variational free 
energy.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,30],"tags":[107,108,109,98,28,3,106],"class_list":["post-3575","post","type-post","status-publish","format-standard","hentry","category-blog","category-notes-blog","tag-em","tag-expectation-maximization","tag-latent-variable-models","tag-maximum-likelihood","tag-notes","tag-optimization","tag-tutorial"],"acf":[],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3575","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=3575"}],"version-history":[{"count":78,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3575\/revisions"}],"predecessor-version":[{"id":5719,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3575\/revisions\/5719"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=3575"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=3575"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=3575"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}