In this note I work out the EM updates for factor analysis, following the presentation in PRML 12.2.4.
In factor analysis our model of the observations in terms of latents is $$ p(\xx_n|\zz_n, \WW, \bmu, \bPsi) = \mathcal{N}(\xx_n;\WW \zz_n + \bmu, \bPsi).$$ Here $\bPsi$ is a diagonal matrix used to capture the variances of the observation dimensions. This frees the matrix $\WW$ to capture covariances between these dimensions.
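To make the setup concrete, here is a minimal NumPy sketch of the generative model. The dimensions, parameter values, and variable names are illustrative choices of mine, not anything from PRML; I store the diagonal $\bPsi$ as a vector of per-dimension variances.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (my choice): D observed dimensions, K latent dimensions, N samples.
D, K, N = 5, 2, 1000

# Model parameters: W is D x K, mu is length D, and the diagonal Psi is stored
# as a length-D vector of per-dimension noise variances.
W_true = rng.normal(size=(D, K))
mu_true = rng.normal(size=D)
psi_true = rng.uniform(0.1, 0.5, size=D)

# Generative process: z_n ~ N(0, I), then x_n | z_n ~ N(W z_n + mu, Psi).
Z = rng.normal(size=(N, K))
X = Z @ W_true.T + mu_true + rng.normal(size=(N, D)) * np.sqrt(psi_true)
```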
To determine the parameters of the model using EM, we compute the expectation of the log full-data joint. The log joint is \begin{align*} \log p(\XX, \ZZ|\btheta) &= \sum_n \log p(\xx_n, \zz_n | \btheta)\\ &= \sum_n \log p(\xx_n | \zz_n, \btheta) + \log p(\zz_n | \btheta)\\
&=\sum_n -{D \over 2}\log 2\pi - {1 \over 2} \log |\bPsi| - {1 \over 2} (\xx_n - \WW \zz_n - \bmu)^T \bPsi^{-1} (\xx_n - \WW \zz_n - \bmu) - {K \over 2} \log 2\pi - {\zz_n^T \zz_n \over 2}.\end{align*}
EM then consists of taking the expectation of this log joint with respect to the posterior distribution over the latents, given the corresponding observations and the previous values of the parameters, and then maximizing the result with respect to the parameters.
It’s convenient to determine the value of the mean parameter $\bmu$ before starting with EM. To do so, we note that the distribution of observations given the parameters is $$ p(\xx_n | \WW, \bmu, \bPsi) = \mathcal{N}(\xx_n; \bmu, \CC), \qquad \CC \triangleq \WW\WW^T +\bPsi.$$ The log likelihood is then $$ \log p(\XX|\btheta) = \sum_n -{1 \over 2}(\xx_n - \bmu)^T \CC^{-1} (\xx_n - \bmu) + \dots,$$ and maximizing with respect to $\bmu$ gives $$ \bmu = \overline{\xx}.$$
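Since the estimate of $\bmu$ doesn’t depend on $\WW$ or $\bPsi$, the data can be centred once up front. Continuing the sketch above:

```python
# mu = x-bar: the ML estimate of the mean does not depend on W or Psi,
# so we centre the data once and work with x_tilde_n = x_n - x-bar from here on.
mu_hat = X.mean(axis=0)
X_tilde = X - mu_hat
```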
Plugging in this value for $\bmu$, the expectation of the log joint is (dropping constant terms) \begin{align*} \mathbb E (\dots) = -{N \over 2} \log |\bPsi| -{1 \over 2} \sum_n \mathbb E \left[ (\tilde \xx_n - \WW \zz_n)^T \bPsi^{-1} (\tilde \xx_n - \WW \zz_n) + \zz_n^T \zz_n \right],\end{align*} where $$\tilde \xx_n = \xx_n - \overline \xx.$$
Rearranging using traces gives $$ -{N \over 2} \log |\bPsi| -{1 \over 2} \tr\left(\bPsi^{-1}\sum_n \tilde \xx_n \tilde \xx_n^T\right) + \tr\left( \bPsi^{-1} \WW \sum_n \mathbb E(\zz_n)\tilde \xx_n^T \right) - {1\over 2}\tr\left( (\WW^T \bPsi^{-1} \WW + \II) \sum_n \mathbb E(\zz_n \zz_n^T)\right).$$
Taking derivatives with respect to $\WW$ we get $${\partial \over \partial \WW} =\bPsi^{-1} \sum_n \tilde \xx_n \mathbb E (\zz_n)^T - \bPsi^{-1}\WW \sum_n \mathbb E(\zz_n \zz_n^T).$$ Setting this to zero, we get
$$\WW_\text{new} = \left[\sum_n \tilde \xx_n \mathbb E (\zz_n)^T\right]\left[\sum_n \mathbb E(\zz_n \zz_n^T)\right]^{-1}.$$ This is equation 12.69 in PRML, and is the least-squares solution for the linear map taking $\zz_n$ to $\xx_n$.
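As a sketch of what this M-step might look like in NumPy, under the conventions of the earlier snippet: `Ez` holds the posterior means $\mathbb E(\zz_n)$ as rows and `Ezz` stacks the matrices $\mathbb E(\zz_n \zz_n^T)$; both are produced by the E-step derived below.

```python
def m_step_W(X_tilde, Ez, Ezz):
    """W_new = [sum_n x_tilde_n E(z_n)^T] [sum_n E(z_n z_n^T)]^{-1}.

    X_tilde: (N, D) centred data, Ez: (N, K) posterior means,
    Ezz: (N, K, K) posterior second moments.
    """
    A = X_tilde.T @ Ez        # (D, K): sum_n x_tilde_n E(z_n)^T
    B = Ezz.sum(axis=0)       # (K, K): sum_n E(z_n z_n^T), symmetric
    # A @ inv(B), written as a linear solve rather than an explicit inverse.
    return np.linalg.solve(B, A.T).T
```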
Notice, interestingly, that the update for $\WW$ doesn’t depend directly on $\bPsi$. In fact, it has the same form as the corresponding update in probabilistic PCA.
To get the update for $\bPsi$, we’ll work with $\bPsi^{-1}$ instead. In that case, $$ {\partial \over \partial \bPsi^{-1}} = {N \over 2} \bPsi - {1 \over 2} \sum_n \tilde \xx_n \tilde \xx_n^T + \WW \sum_n \mathbb E(\zz_n)\tilde \xx_n^T - {1 \over 2} \WW \sum_n \mathbb E (\zz_n \zz_n^T) \WW^T.$$ Setting this to zero we get $$\bPsi_\text{new} = {1 \over N}\sum_n \text{diag}\left\{ \tilde \xx_n \tilde \xx_n^T - 2 \WW \mathbb E(\zz_n)\tilde \xx_n^T + \WW \mathbb E (\zz_n \zz_n^T) \WW^T \right\}.$$ This is equation 12.70 in PRML.
Notice that this is just $$ \bPsi_\text{new} = \text{diag} \left\{ \mathbb E \left[{1 \over N} \sum_n (\tilde \xx_n - \WW \zz_n)(\tilde \xx_n - \WW \zz_n)^T\right]\right\} = \text{diag} \left\{ \mathbb E \left[ \text{cov}(\tilde \xx - \WW \zz)\right]\right\},$$ where the covariance is computed over data points. This works because the mean of $\tilde \xx_n$ over data points is zero by construction, and the posterior means $\mathbb E(\zz_n)$ also average to zero over data points.
The expression above in terms of covariance makes intuitive sense: if we knew the latents, then we’d just set $\bPsi$ to be the diagonal part of the noise covariance. We don’t know the values of the latents, so we take the expectation of the covariance using the posterior distribution on the latents.
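A corresponding NumPy sketch of this $\bPsi$ update, again under my shape conventions; since `psi` is stored as a vector of variances, only the diagonals of the three terms are ever formed.

```python
def m_step_psi(X_tilde, Ez, Ezz, W_new):
    """Diagonal of (1/N) sum_n E[(x_tilde_n - W z_n)(x_tilde_n - W z_n)^T],
    using the freshly updated W. Returns a length-D variance vector."""
    N = X_tilde.shape[0]
    S = (X_tilde ** 2).sum(axis=0)                          # diag of sum_n x_tilde_n x_tilde_n^T
    cross = np.einsum('nd,nk,dk->d', X_tilde, Ez, W_new)    # diag of sum_n W E(z_n) x_tilde_n^T
    quad = np.einsum('dk,nkl,dl->d', W_new, Ezz, W_new)     # diag of sum_n W E(z_n z_n^T) W^T
    return (S - 2.0 * cross + quad) / N
```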
For the E-step, all we need are the means and covariances of the latents conditioned on the observations and the previous values of the parameters. Since the joint distribution $p(\xx_n, \zz_n | \btheta)$ is Gaussian, the posterior distribution on the latents will also be Gaussian. To find its mean, we can maximize the full-data joint with respect to $\zz_n$ (for a Gaussian, the mode coincides with the mean): $$ {\partial \over \partial \zz_n} = \WW^T \bPsi^{-1} (\xx_n - \WW \zz_n - \overline \xx) - \zz_n.$$ Setting this to zero gives
$$ \mathbb E(\zz_n) = (\II + \WW^T \bPsi^{-1} \WW)^{-1} \WW^T \bPsi^{-1} \tilde \xx_n,$$ which is PRML equation 12.66. To get the covariance, note that the posterior precision is minus the second derivative of the log joint with respect to $\zz_n$, so
$$ \cov(\zz_n) = (\II + \WW^T \bPsi^{-1} \WW)^{-1},$$ which, combined with $\mathbb E(\zz_n \zz_n^T) = \cov(\zz_n) + \mathbb E(\zz_n)\mathbb E(\zz_n)^T$, gives PRML equation 12.67.
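Here is how the E-step might look in NumPy, followed by a few EM sweeps over the synthetic data from the earlier sketches (the function and variable names are mine, not PRML's):

```python
def e_step(X_tilde, W, psi):
    """Posterior moments of the latents for each data point.

    Returns Ez with rows E(z_n) and Ezz stacking E(z_n z_n^T) = cov(z_n) + E(z_n)E(z_n)^T.
    """
    K = W.shape[1]
    WtPinv = W.T / psi                                      # W^T Psi^{-1}, psi is a variance vector
    G = np.linalg.inv(np.eye(K) + WtPinv @ W)               # cov(z_n), the same for every n
    Ez = X_tilde @ (G @ WtPinv).T                           # (N, K), rows are E(z_n)
    Ezz = G[None, :, :] + Ez[:, :, None] * Ez[:, None, :]   # (N, K, K)
    return Ez, Ezz

# A few EM iterations, reusing the centred data and M-step sketches from above.
W_est = rng.normal(size=(D, K))
psi_est = np.ones(D)
for _ in range(200):
    Ez, Ezz = e_step(X_tilde, W_est, psi_est)
    W_est = m_step_W(X_tilde, Ez, Ezz)
    psi_est = m_step_psi(X_tilde, Ez, Ezz, W_est)
```

Since $\WW$ is only identified up to a rotation of the latent space, a recovered `W_est` should be compared with `W_true` through $\WW \WW^T$ rather than entry by entry.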
The final point in this section is about how transformations of the data affect the parameters. In probabilistic PCA, rotating the data keeps us in the same model class and simply rotates the parameters. For factor analysis, the role of rotations is instead played by scalings. The general case is treated in PRML Exercise 12.25, which we’ll outline here.
Take our usual latents $\zz_n \sim \mathcal{N}(\zz_n|\bzero, \II)$ producing observations according to $\xx_n|\zz_n \sim \mathcal{N}(\xx_n|\WW \zz_n + \bmu, \bPhi),$ where $\bPhi$ is some general positive definite noise covariance. The likelihood of a data point is then $$\xx_n|\btheta \sim \mathcal{N}(\xx_n|\bmu, \WW \WW^T + \bPhi).$$
If we now linearly transform the data to $\yy_n = \AA \xx_n$, then $$\yy_n|\btheta \sim \mathcal{N}(\yy_n| \AA \bmu, \AA \WW \WW^T \AA^T + \AA \bPhi \AA^T).$$
But this corresponds to the original likelihood, with parameters transformed $$ \bmu \to \AA \bmu, \quad \WW \to \AA \WW, \quad \text{and} \quad \bPhi \to \AA \bPhi \AA^T.$$
The question is then which transformations $\AA$ keep the model in the same class. For PPCA, $\bPhi$ is a multiple of the identity, so we need $\AA\AA^T$ to be a multiple of the identity, which implies that $\AA$ is a scalar multiple of an orthogonal matrix. In other words, rotating the data, possibly combined with a global scaling, keeps the model in the same class, with correspondingly rotated and scaled parameters.
For factor analysis, $\bPhi$ is diagonal, which means $\AA$ must be diagonal (up to a permutation of the dimensions) to stay in the same class. In other words, factor analysis is invariant to separately rescaling each observed dimension of the data, which just manifests as a corresponding scaling of the parameters.
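A quick numerical check of the diagonal case, reusing the synthetic parameters from the first sketch (the scaling values here are arbitrary):

```python
# Rescale each observed dimension with a diagonal A and check that the marginal
# covariance of y = A x equals that of an FA model with parameters W -> A W and
# Psi -> A Psi A^T (which is still diagonal when A is diagonal).
A = np.diag(rng.uniform(0.5, 2.0, size=D))
C_x = W_true @ W_true.T + np.diag(psi_true)                        # cov of x under the model
C_y_transformed = (A @ W_true) @ (A @ W_true).T + A @ np.diag(psi_true) @ A.T
assert np.allclose(A @ C_x @ A.T, C_y_transformed)
```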
$$\blacksquare$$