In this note I flesh out the computations for Section 12.2.3 of Bishop’s Pattern Recognition and Machine Learning, where he uses automatic relevance determination to choose the dimensionality of the principal subspace in probabilistic PCA.
The principal subspace describing the data is spanned by the columns $\ww_1, \dots, \ww_M$ of $\WW$. The proper Bayesian way to determine dimensionality is to compare the marginal likelihoods of models with different values of $M$.
Instead, we will use automatic relevance determination to pick the model of the correct size automatically.
We do this by starting with some large $M$ that we think will be more than enough to capture the data. We then put a zero-mean Gaussian prior on the columns of $\WW$, with precisions $\alpha_i$, $$ p(\WW|\bal) = \prod_{i=1}^M \left({\alpha_i \over 2\pi}\right)^{D/2} \exp\left(-{1 \over 2} \alpha_i \ww_i^T \ww_i\right).$$ When we optimize the values of $\bal$ to fit the data, we hope to find that some of the precisions are driven to very large values, effectively forcing the corresponding columns of $\WW$ to zero and thereby reducing the effective dimensionality of the subspace spanned by $\WW$. When and why this happens is explained in Section 7.2.2 of Bishop.
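As a notation check, here’s a minimal NumPy sketch (the function name `log_ard_prior` and the array layout are my own choices) that evaluates this log prior for a given $\WW$ and $\bal$:

```python
import numpy as np

def log_ard_prior(W, alpha):
    """Log of the ARD prior p(W | alpha): each column w_i of the D x M matrix W
    is zero-mean Gaussian with isotropic precision alpha[i]."""
    D, M = W.shape
    col_sq_norms = np.sum(W ** 2, axis=0)                  # w_i^T w_i for each column
    return np.sum(0.5 * D * np.log(alpha / (2 * np.pi))    # normalizing constants
                  - 0.5 * alpha * col_sq_norms)            # quadratic penalties
```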
With the inclusion of the ARD precisions on the columns of $\WW$, the graphical model for probabilistic PCA acquires one additional node, $\bal$, whose edge points into $\WW$.
Our task is to maximize the (marginal) likelihood of the data, $$ \log p(\XX|\bmu, \sigma^2, \bal).$$ Since we’ve got latent variables ($\{\zz_n\}$ and $\WW$), we’re going to apply EM. To do that, we start with the full-data log probability $$\log p(\XX, \ZZ, \WW | \bmu, \sigma^2, \bal).$$
To relate this to our updates for ordinary probabilistic PCA, we write this as $$ \log p(\XX, \ZZ | \WW, \bmu, \sigma^2) + \log p(\WW | \bal).$$
Notice that the posterior over $\ZZ$ doesn’t depend on $\bal$, so our E-step updates are the same as before. The M-step updates for $\bmu$ and $\sigma^2$ also remain the same, since they only depend on the first logarithm above, which is the same as in the vanilla probabilistic PCA case. The only thing that changes is the M-step update for $\WW$. To determine that, we compute the expectation, with respect to the latents, of the terms of the log full-data joint that involve $\WW$, \begin{align}\mathbb{E}\left[ \log p(\XX, \ZZ|\bmu, \WW, \sigma^2) + \log p(\WW|\bal)\right] &= \sum_n \left\{ {1\over \sigma^2} \mathbb{E}[\zz_n]^T \WW^T (\xx_n - \bmu) - {1 \over 2\sigma^2} \tr\left(\mathbb E[\zz_n \zz_n^T] \WW^T \WW\right)\right\} \\ &\quad - {1 \over 2}\sum_i \alpha_i \ww_i^T \ww_i + \dots,\end{align} where the dots collect terms that do not involve $\WW$.
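For concreteness, here’s a sketch of that unchanged E-step in NumPy, using the standard probabilistic PCA posterior moments $\mathbb E[\zz_n] = \mathbf M^{-1}\WW^T(\xx_n - \overline\xx)$ and $\mathbb E[\zz_n \zz_n^T] = \sigma^2 \mathbf M^{-1} + \mathbb E[\zz_n]\,\mathbb E[\zz_n]^T$ with $\mathbf M = \WW^T \WW + \sigma^2 \mathbf I$ from Section 12.2.2 (the function name and array shapes are my own choices):

```python
import numpy as np

def e_step(X, xbar, W, sigma2):
    """Unchanged probabilistic-PCA E-step: posterior moments of the latents.
    X is (N, D), W is (D, M); returns Ez of shape (N, M) and Ezz of shape (N, M, M)."""
    M_mat = W.T @ W + sigma2 * np.eye(W.shape[1])           # M = W^T W + sigma^2 I
    M_inv = np.linalg.inv(M_mat)
    Ez = (X - xbar) @ W @ M_inv                             # E[z_n] = M^{-1} W^T (x_n - xbar)
    Ezz = sigma2 * M_inv + Ez[:, :, None] * Ez[:, None, :]  # E[z_n z_n^T]
    return Ez, Ezz
```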
Now we can rewrite the sum in the prior term as a trace:
\begin{align} \sum_i \alpha_i \ww_i^T \ww_i &= \tr( \sum_i \alpha_i \ww_i^T \ww_i) \\&=\tr(\sum_i \alpha_i \ww_i \ww_i^T) \\ &= \tr(\WW \diag{\bal} \WW^T)\\ &= \tr(\diag{\bal} \WW^T \WW).\end{align}
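A throwaway NumPy check of this identity on random $\WW$ and $\bal$:

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 5, 3
W = rng.normal(size=(D, M))
alpha = rng.uniform(0.5, 2.0, size=M)

lhs = sum(alpha[i] * W[:, i] @ W[:, i] for i in range(M))   # sum_i alpha_i w_i^T w_i
rhs = np.trace(np.diag(alpha) @ W.T @ W)                    # tr(diag(alpha) W^T W)
print(np.isclose(lhs, rhs))                                 # True
```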
So we pool this trace with the other term that contains a $\WW^T \WW$, maximize the expectation with respect to $\WW$, and get the modified update
$$\WW_\text{new} = \left[\sum_n (\xx_n - \overline \xx)\, \mathbb E[\zz_n]^T\right]\left[\sigma^2 \diag{\bal} + \sum_n \mathbb E[\zz_n \zz_n^T]\right]^{-1},$$ which is Bishop’s Equation 12.63.
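Here’s a minimal sketch of this update in NumPy, reusing the E-step quantities from the snippet above (the function name is mine):

```python
import numpy as np

def m_step_W(X, xbar, Ez, Ezz, sigma2, alpha):
    """Modified M-step for W (Bishop Eq. 12.63): the ARD precisions enter only
    through the diagonal ridge term sigma^2 * diag(alpha)."""
    numer = (X - xbar).T @ Ez                           # sum_n (x_n - xbar) E[z_n]^T
    denom = sigma2 * np.diag(alpha) + Ezz.sum(axis=0)   # sigma^2 diag(alpha) + sum_n E[z_n z_n^T]
    return numer @ np.linalg.inv(denom)
```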
It’s interesting to consider what happens to this expression when one of the elements of $\bal$ goes to infinity. Let’s assume, without loss of generality, that it’s the first one, and write $$\sigma^2 \diag{\bal} + \sum_n \mathbb E[\zz_n \zz_n^T] = \begin{bmatrix} a + \lambda & \bb^T \\ \bb & \CC \end{bmatrix},$$ where $\lambda = \sigma^2 \alpha_1$ is the part that blows up. You can show, using the formulas for inverting block matrices, that $$ \lim_{\lambda \to \infty} \begin{bmatrix} a + \lambda & \bb^T \\ \bb & \CC \end{bmatrix}^{-1} = \begin{bmatrix}0 & \bzero^T \\ \bzero & \CC^{-1} \end{bmatrix}.$$
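A quick numerical confirmation, with a generic positive-definite matrix standing in for the bracketed term:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 4
A = rng.normal(size=(M, M))
S = A @ A.T + np.eye(M)       # generic symmetric positive-definite stand-in

for lam in (1e2, 1e6, 1e10):
    S_lam = S.copy()
    S_lam[0, 0] += lam        # this is the a + lambda entry blowing up
    print(np.round(np.linalg.inv(S_lam), 6))

# As lambda grows, the first row and column of the inverse go to zero and the
# lower-right block converges to np.linalg.inv(S[1:, 1:]), i.e. C^{-1}.
```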
So then $$\WW_\text{new} = \left[\sum_n (\xx_n - \overline \xx)\, \mathbb E[\zz_n]^T\right]\begin{bmatrix}0 & \bzero^T \\ \bzero & \CC^{-1} \end{bmatrix} = \begin{bmatrix} \bzero & \sum_n (\xx_n - \overline \xx)\, \mathbb E[\zz_n(2:)]^T\, \CC^{-1} \end{bmatrix},$$ where by $\zz_n(2:)$ I mean the subvector consisting of all elements except the first. In other words, the first column of $\WW_\text{new}$ is zero, and the remaining columns no longer depend on the first latent component (whose posterior mean is then driven to zero in the following E-step). So $\alpha_i \to \infty$ removes the corresponding column $\ww_i$ from the model, as expected.
This all assumes we’re given a value of $\bal$. To determine it, Bishop says we should look at the marginal likelihood of the data
$$ p(\XX|\bal, \bmu, \sigma^2) = \int p(\XX|\WW, \bmu, \sigma^2)\; p(\WW|\bal) \; d\WW.$$ This integral is intractable, so instead we assume the integrand is sharply peaked around the current estimate of $\WW$ and pick the value of $\bal$ that maximizes the integrand there. Since the first factor doesn’t depend on $\bal$, this comes down to maximizing $p(\WW|\bal)$.
However, it seems cleaner to me to note that $\bal$ is just another parameter of the now expanded probabilistic PCA model and simply compute its M-step update, which requires maximizing $\log p(\WW | \bal)$ over $\bal$ and gives the same answer as Bishop’s approach. Writing out the objective,
$$ \log p(\WW|\bal) = \sum_i \left[ {D \over 2} \log \left( {\alpha_i \over 2\pi} \right) - {1 \over 2} \alpha_i \ww_i^T \ww_i \right]. $$
Differentiating with respect to $\alpha_i$ and setting the derivative to zero, we get
$$ {D \over 2 \alpha_i} - {1 \over 2} \ww_i^T \ww_i = 0 \qquad\Longrightarrow\qquad \alpha_i^\text{new} = {D \over \ww_i^T \ww_i},$$ which is Bishop’s Equation 12.62.
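In code this final update is a one-liner; the cap on $\alpha_i$ below is my own addition, there to keep columns that have already collapsed from producing infinities:

```python
import numpy as np

ALPHA_MAX = 1e12   # my own guard against (numerically) zero columns

def m_step_alpha(W):
    """ARD precision update (Bishop Eq. 12.62): alpha_i = D / (w_i^T w_i).
    Columns that have collapsed get a huge precision, which pushes them even
    closer to zero at the next W update."""
    D = W.shape[0]
    return np.minimum(D / np.sum(W ** 2, axis=0), ALPHA_MAX)
```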
$$\blacksquare$$