These are my derivations of the maximum likelihood estimates of the parameters of probabilistic PCA, as described in section 12.2.1 of Bishop (2006), with some hints from Tipping and Bishop (1999).
Once we have determined the maximum likelihood estimate of $\mu$ and plugged it in, we have (Bishop 12.44)
$$ L = \ln p(X|W, \mu, \sigma^2) = -{N \over 2} \left(D \ln (2\pi) + \ln |C| + \tr(C^{-1}S)\right).$$
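As a sanity check, this expression is easy to evaluate numerically. Here is a minimal NumPy sketch (the toy data, variable names, and parameter values are my own, not from Bishop), confirming it matches the sum of per-point Gaussian log-densities:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, M = 500, 5, 2

# toy data and arbitrary parameters, just to evaluate the formula
X = rng.normal(size=(N, D))
W = rng.normal(size=(D, M))
sigma2 = 0.5

mu = X.mean(axis=0)                       # ML estimate of mu (the sample mean)
Xc = X - mu
S = Xc.T @ Xc / N                         # sample covariance
C = W @ W.T + sigma2 * np.eye(D)          # model covariance C = W W^T + sigma^2 I

# log-likelihood as written above
L = -N / 2 * (D * np.log(2 * np.pi)
              + np.linalg.slogdet(C)[1]
              + np.trace(np.linalg.solve(C, S)))

# cross-check: summing per-point Gaussian log-densities gives the same number
quad = np.einsum('ni,ij,nj->n', Xc, np.linalg.inv(C), Xc)
L_direct = -0.5 * (N * D * np.log(2 * np.pi)
                   + N * np.linalg.slogdet(C)[1]
                   + quad.sum())
assert np.isclose(L, L_direct)
```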
The differential with respect to $W$ is \begin{align}
dL &\propto \tr(C^{-1} dC - C^{-1} dC C^{-1} S) \\
&= \tr([C^{-1} - C^{-1} S C^{-1}] dC)\\
&= \tr([C^{-1} - C^{-1} S C^{-1}](W dW^T + dW W^T)).
\end{align}
From this we get that
$$ \nabla_W L \propto C^{-1} (I - S C^{-1})W.$$
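Restoring the factor of $-N/2$ dropped by the proportionality (and the factor of 2 that comes from the symmetric differential $W dW^T + dW W^T$), the full gradient is $-N\,C^{-1}(I - S C^{-1})W$. A quick finite-difference sketch to check this, with toy data and names of my own:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, M = 200, 4, 2
X = rng.normal(size=(N, D))
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N

def loglik(W, sigma2):
    C = W @ W.T + sigma2 * np.eye(D)
    return -N / 2 * (D * np.log(2 * np.pi)
                     + np.linalg.slogdet(C)[1]
                     + np.trace(np.linalg.solve(C, S)))

W = rng.normal(size=(D, M))
sigma2 = 0.3
C = W @ W.T + sigma2 * np.eye(D)
Cinv = np.linalg.inv(C)

# gradient from the derivation, with the -N factor restored
grad = -N * Cinv @ (np.eye(D) - S @ Cinv) @ W

# central finite differences of the log-likelihood, entry by entry
eps = 1e-6
num = np.zeros_like(W)
for i in range(D):
    for j in range(M):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (loglik(W + E, sigma2) - loglik(W - E, sigma2)) / (2 * eps)

assert np.allclose(grad, num, rtol=1e-4, atol=1e-4)
```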
In general $C$ will be full rank, so setting the gradient above to zero we get $$ W = S C^{-1} W.$$
One solution is $W = 0$. Another, when $M = D$, is $C = S$.
To get the third class of solutions, for $M < D$, write the SVD $W = U L V^T$, where $U$ is $D \times M$ with orthonormal columns, $L$ is $M \times M$ diagonal, and $V$ is $M \times M$ orthogonal.
In that case, $$C = WW^T + \sigma^2 I = U (L^2 + \sigma^2) U^T + \hat U \hat U^T \sigma^2,$$ where $\hat U$ is the $D \times (D-M)$ matrix completing $U$ to an orthonormal basis (and $L^2 + \sigma^2$ is shorthand for $L^2 + \sigma^2 I$). Since $[U\ \hat U]$ is orthogonal, $$C^{-1} = U(L^2 + \sigma^2)^{-1} U^T + \hat U \hat U^T \sigma^{-2}.$$
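This inverse is easy to verify numerically; a small sketch (the toy $W$ and $\sigma^2$ below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
D, M = 6, 3
sigma2 = 0.4

W = rng.normal(size=(D, M))
U_full, l, Vt = np.linalg.svd(W, full_matrices=True)    # l holds the M singular values
U, U_hat = U_full[:, :M], U_full[:, M:]                  # U_hat completes U to an orthonormal basis

C = W @ W.T + sigma2 * np.eye(D)
Cinv_decomp = (U @ np.diag(1.0 / (l**2 + sigma2)) @ U.T
               + U_hat @ U_hat.T / sigma2)

assert np.allclose(Cinv_decomp, np.linalg.inv(C))
```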
Substituting these into $W = S C^{-1} W$ gives $$ U L V^T = S \left(U(L^2 + \sigma^2)^{-1} U^T + \hat U \hat U^T \sigma^{-2}\right) U L V^T.$$
Multiplying on the right by $V L^{-1}$ (assuming $L$ is invertible, i.e. none of the singular values is zero), we get
$$ U = S \left(U(L^2 + \sigma^2)^{-1} U^T + \hat U \hat U^T \sigma^{-2}\right) U,$$ which, since $U^T U = I$ and $\hat U^T U = 0$, simplifies to $$ U = S U(L^2 + \sigma^2)^{-1}.$$
From this we get the eigenvalue equation $$S U = U (L^2 + \sigma^2).$$ We can then take the columns of $U$ to be any $M$ eigenvectors of $S$, with corresponding eigenvalues $L_M$, and we get $$ L = \sqrt{L_M - \sigma^2}.$$
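Putting this together, any $W = U_M (L_M - \sigma^2)^{1/2} R$, with $U_M$ holding $M$ eigenvectors of $S$ whose eigenvalues are at least $\sigma^2$ and $R$ an arbitrary $M \times M$ orthogonal matrix, satisfies the stationarity condition $W = S C^{-1} W$. A quick check of that construction on toy data (I plug in the mean of the discarded eigenvalues for $\sigma^2$ only because it is a convenient valid value; its optimality comes later):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, M = 1000, 5, 2
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))   # correlated toy data
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / N

evals, evecs = np.linalg.eigh(S)                 # ascending order
evals, evecs = evals[::-1], evecs[:, ::-1]       # make them descending
sigma2 = evals[M:].mean()                        # any value below the kept eigenvalues works

U_M = evecs[:, :M]                               # the M chosen eigenvectors
L_M = evals[:M]                                  # their eigenvalues
R = np.linalg.qr(rng.normal(size=(M, M)))[0]     # an arbitrary orthogonal matrix

W = U_M @ np.diag(np.sqrt(L_M - sigma2)) @ R
C = W @ W.T + sigma2 * np.eye(D)

# stationarity condition W = S C^{-1} W
assert np.allclose(W, S @ np.linalg.solve(C, W))
```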
To determine which eigenvectors to choose, we return to the objective, which, dropping additive constants and the overall factor of $N/2$, is
$$ L = -\ln |C| - \tr(C^{-1}S).$$
The eigenvalues of $C$ are $L^2 + \sigma^2 = L_M$ for the $M$ eigenvectors we chose, and $\sigma^2$ for the rest. So $$ -\ln |C| = -\sum_{i \in M} \ln L_{M_i} - (D-M) \ln \sigma^2 = -\sum_{i\in M}\ln {L_{M_i} \over \sigma^2} - D \ln \sigma^2.$$
Also, $$C^{-1} S = U U^T + \hat U \left({L_{M'} \over \sigma^2}\right) \hat U^T,$$ where $L_{M'}$ holds the $D-M$ eigenvalues we did not choose, so $$\tr(C^{-1} S) = M + \sum_{i \notin M} {L_i \over \sigma^2} = M + \sum_i {L_i \over \sigma^2} - \sum_{i \in M} {L_i \over \sigma^2}.$$
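Before combining them, both identities are easy to check numerically (a self-contained sketch with toy data of my own, $W$ built from the top-$M$ eigenvectors, and an arbitrary valid $\sigma^2$):

```python
import numpy as np

rng = np.random.default_rng(4)
N, D, M = 1000, 5, 2
Xc = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc -= Xc.mean(axis=0)
S = Xc.T @ Xc / N

evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]       # descending
sigma2 = 0.5 * evals[M:].mean()                  # any value below the kept eigenvalues

W = evecs[:, :M] @ np.diag(np.sqrt(evals[:M] - sigma2))
C = W @ W.T + sigma2 * np.eye(D)

# -ln|C| identity
lhs = -np.linalg.slogdet(C)[1]
rhs = -np.sum(np.log(evals[:M] / sigma2)) - D * np.log(sigma2)
assert np.isclose(lhs, rhs)

# tr(C^{-1} S) identity
lhs = np.trace(np.linalg.solve(C, S))
rhs = M + evals[M:].sum() / sigma2
assert np.isclose(lhs, rhs)
```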
Combining these, and dropping another constant term ($-M$, from the trace), we get
\begin{align}
L &= -\ln |C| - \tr(C^{-1} S)\\&=-\sum_{i \in M} \ln {L_{M_i} \over \sigma^2} + \sum_{i \in M}{L_{M_i} \over \sigma^2} - D \ln \sigma^2 - \sum_i {L_i \over \sigma^2}\\
&=- D \ln \sigma^2 - \sum_i {L_i \over \sigma^2} + \sum_{i \in M} f\left[{ L_{M_i} \over \sigma^2}\right]
\end{align}
where
$$ f(x) = x - \ln x.$$
This function is monotonically increasing for $x \ge 1$ (its derivative is $1 - 1/x$), and that condition holds here: each retained eigenvalue is $\sigma^2$ plus the corresponding (nonnegative) squared singular value of $W$, so $L_{M_i}/\sigma^2 \ge 1$. Since the other terms do not depend on which eigenvectors we keep, to maximize $L$ we want to use the largest $M$ eigenvalues.
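A brute-force check of this claim: evaluating $-\ln|C| - \tr(C^{-1}S)$ for every $M$-element subset of eigenvectors (with $\sigma^2$ held fixed below all the eigenvalues so every subset is feasible) does pick out the largest eigenvalues. This is just an illustration on toy data of my own, not part of the derivation:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(5)
N, D, M = 1000, 5, 2
Xc = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc -= Xc.mean(axis=0)
S = Xc.T @ Xc / N

evals, evecs = np.linalg.eigh(S)       # ascending order
sigma2 = 0.5 * evals.min()             # fixed, below every eigenvalue, so any subset is feasible

def objective(idx):
    """-ln|C| - tr(C^{-1} S) for the subset of eigenvectors given by idx."""
    W = evecs[:, idx] @ np.diag(np.sqrt(evals[idx] - sigma2))
    C = W @ W.T + sigma2 * np.eye(D)
    return -(np.linalg.slogdet(C)[1] + np.trace(np.linalg.solve(C, S)))

best = max((list(c) for c in combinations(range(D), M)), key=objective)
assert sorted(best) == list(range(D - M, D))    # the indices of the M largest eigenvalues
```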
We can use our expression above for $L$ to optimize $\sigma^2$:
\begin{align}
{\partial L \over \partial \sigma^2} &= -{D \over \sigma^2} + {\sum L_i \over \sigma^4} + \sum_{i \in M} \left(1 - {\sigma^2 \over L_{M_i}}\right){-L_{M_i} \over \sigma^4} \\
&\propto -D\sigma^2 + \sum L_i - \sum_{i \in M} \left(1 - {\sigma^2 \over L_{M_i}}\right)L_{M_i}\\
&= -D \sigma^2 + \sum L_i - \sum_{i \in M}(L_{M_i} - \sigma^2),
\end{align}
where the second line multiplies through by $\sigma^4 > 0$. Setting this to zero we get $$\sigma^2 = {\sum L_i - \sum_{i \in M} L_{M_i} \over D - M},$$ which is the average of the $D-M$ discarded (bottom) eigenvalues.
$$\blacksquare$$
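As a final numerical check, a grid search over $\sigma^2$ in the profile expression for $L$ above confirms that the maximizer is the mean of the discarded eigenvalues (a rough sketch; data and names are mine):

```python
import numpy as np

rng = np.random.default_rng(6)
N, D, M = 1000, 6, 2
Xc = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc -= Xc.mean(axis=0)
S = Xc.T @ Xc / N

evals = np.sort(np.linalg.eigvalsh(S))[::-1]     # descending
sigma2_ml = evals[M:].mean()                     # claimed maximizer: mean of discarded eigenvalues

def profile(sigma2):
    """The expression for L above (top-M eigenvalues retained, constants dropped)."""
    f = evals[:M] / sigma2 - np.log(evals[:M] / sigma2)
    return -D * np.log(sigma2) - evals.sum() / sigma2 + f.sum()

grid = np.linspace(0.5 * sigma2_ml, 2.0 * sigma2_ml, 2001)
assert all(profile(sigma2_ml) >= profile(s) - 1e-9 for s in grid)
```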