The inference model isn’t giving good performance. But is this because we’re missing data?
In the inference model, the recorded output activity is related to the input according to $$ (\sigma^2 \II + \AA \AA^T) \bLa = \YY,$$
where we’ve absorbed $\gamma$ into $\AA$.
We can model this as $N$ observations of $\yy$ given $\bla$, where
$$ \yy_n \mid \bla_n \sim \norm(\FF \bla_n, \sigma^2_y).$$ If we put an isotropic normal prior on $\bla$, the joint distribution over inputs and outputs is $$ p(\yy_n, \bla_n|\bth) = \norm(\yy_n| \FF \bla_n, \sigma^2_y)\; \norm(\bla_n|\bmu, \sigma_\la^2),$$ where $$\bth \triangleq \{\FF, \bmu, \sigma^2, \sigma^2_y, \sigma^2_\la\}$$ are our parameters.
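As a quick sanity check on the setup, here’s a minimal NumPy sketch of this generative model. All sizes, parameter values, and names are made up for illustration; in particular I’m treating $\FF$ as an arbitrary matrix here, before it gets tied to $\AA$ below.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, N = 8, 5, 100            # latent dim, output dim, number of observations (made up)
F = rng.normal(size=(D, K))    # mixing matrix (placeholder for the real F)
mu = np.zeros(K)               # prior mean on lambda
sig2_la, sig2_y = 0.5, 0.1     # prior and observation variances (made up)

# lambda_n ~ N(mu, sig2_la I), then y_n | lambda_n ~ N(F lambda_n, sig2_y I)
la = mu + np.sqrt(sig2_la) * rng.normal(size=(N, K))
y = la @ F.T + np.sqrt(sig2_y) * rng.normal(size=(N, D))
```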
Our observed $\yy_n$ are determined by both the observed and unobserved $\bla$. Define $\II_0$ and $\II_1$ as the matrices that extract the unobserved and observed values, respectively. Applying these to $\bla$ gives $\bla^0_n$ and $\bla^1_n$. Extracting the relevant parts of the prior on $\bla$ as $$ \bla^0_n \sim \norm(\bla^0_n|\bmu^0, \sigma_\la^2), \quad \bla^1_n \sim \norm(\bla^1_n|\bmu^1, \sigma_\la^2),$$ we then have $$ p(\yy_n, \bla_n^0, \bla_n^1| \bth) = \norm(\yy_n | \FF_0 \bla_n^0 + \FF_1 \bla_n^1, \sigma^2_y)\, \, \norm(\bla^0_n|\bmu^0, \sigma_\la^2)\,\norm(\bla^1_n|\bmu^1, \sigma_\la^2),$$ where $$\FF_0 = \FF \II_0^T, \quad \FF_1 = \FF \II_1^T.$$
We can then rearrange our joint distribution to focus on the unobserved data
$$ p(\yy_n, \bla_n^0, \bla_n^1| \bth) = \norm(\yy_n - \FF_1 \bla_n^1 |\FF_0 \bla_n^0 , \sigma^2_y)\,\norm(\bla^0_n|\bmu^0, \sigma_\la^2)\,\norm(\bla^1_n|\bmu^1, \sigma_\la^2).$$
Marginalising out the missing observations,\begin{align} p(\yy_n, \bla_n^1) &= \int d\bla_n^0\, p(\yy_n, \bla_n^0, \bla_n^1)\\ &= \norm(\bla^1_n|\bmu^1, \sigma_\la^2) \int d\bla_n^0\; \norm(\yy_n - \FF_1 \bla_n^1 | \FF_0 \bla_n^0, \sigma^2_y)\,\norm(\bla^0_n|\bmu^0, \sigma_\la^2).\end{align}
The last integral is the marginal distribution of observations in a linear Gaussian model, so it can be evaluated in closed form, giving $$ p(\yy_n, \bla_n^1) = \norm(\bla_n^1| \bmu^1, \sigma_\la^2) \, \norm(\yy_n - \FF_1 \bla_n^1| \FF_0 \bmu^0, \FF_0 \FF_0^T \sigma_\la^2 + \sigma^2_y \II).$$
We can rearrange this to finally arrive at $$ p(\yy_n, \bla_n^1|\bth) = \norm(\yy_n | \FF_1 \bla_n^1 + \FF_0 \bmu^0, \FF_0\FF_0^T \sigma^2_\la + \sigma^2_y\II)\, \norm(\bla_n^1| \bmu^1, \sigma_\la^2).$$
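In code, this marginalised joint is just a product of two Gaussian densities. A sketch using SciPy’s multivariate normal, again with made-up dimensions and parameter values:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
D, k0, k1 = 5, 3, 4                     # output dim, missing / observed latent dims (made up)
F0, F1 = rng.normal(size=(D, k0)), rng.normal(size=(D, k1))
mu0, mu1 = rng.normal(size=k0), rng.normal(size=k1)
sig2_la, sig2_y = 0.5, 0.1

def log_joint(y_n, la1_n):
    """log p(y_n, la1_n | theta), with la0_n marginalised out."""
    Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(D)
    log_y = multivariate_normal.logpdf(y_n, mean=F1 @ la1_n + F0 @ mu0, cov=Sigma)
    log_la1 = multivariate_normal.logpdf(la1_n, mean=mu1, cov=sig2_la * np.eye(k1))
    return log_y + log_la1

print(log_joint(rng.normal(size=D), rng.normal(size=k1)))
```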
Parameter priors
We have several variance parameters: $\sigma^2, \sigma_y^2$ and $\sigma_\la^2$. In the first instance, we’ll just put an improper prior on these, allowing them to be arbitrarily large.
The connectivity parameters are in $$ \FF = \II_1 (\sigma^2 \II + \AA \AA^T).$$ $\AA$ has $M$ rows and $N$ columns, and we know neither of these. We could run different versions of the model for different values of $M$ and $N$, but we can instead fix one large value for each and use automatic relevance determination to prune the rows and columns that aren’t needed. So, $$ p(\AA|\bal, \bbe) = \prod_{i,j} \norm(A_{ij}|0, \alpha^{-1}_i \beta^{-1}_j) = \prod_{i,j} \sqrt{{\alpha_i \beta_j \over 2\pi}}e^{-\alpha_i \beta_j A_{ij}^2/2}.$$
Then $$ \log p(\AA|\bal, \bbe) \dot = \sum_{i} {N\over 2}\log{\alpha_i} + \sum_j {M \over 2}\log{\beta_j} - {1\over 2} \sum_{i,j} \alpha_i \beta_j A_{ij}^2.$$
Rather than integrating over $\AA$ to infer $\bal$ and $\bbe$, we’ll just set them to their most likely values, as in the evidence approximation. The gradient for $\bal$ is $$ {\partial \log p(\AA|\bal, \bbe) \over \partial \alpha_i} \dot = {N \over \alpha_i} - \sum_j A_{ij}^2 \beta_j,$$ and similarly for $\bbe$. Setting these to zero, the updated values, given $\AA$, are $$ \bal^\text{new} = {N \over \AA^{\circ 2} \bbe}, \quad \bbe^\text{new} = {M \over \AA^{\circ 2, T} \bal}, $$ where the division is elementwise and $\AA^{\circ 2} = \AA \odot \AA,$ whose elements are those of $\AA$ squared.
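These two fixed-point updates can be alternated until they settle. One thing to note: the prior depends on $\bal$ and $\bbe$ only through the products $\alpha_i \beta_j$, so they’re only determined up to a shared rescaling. A sketch with a synthetic $\AA$ whose later columns shrink toward zero (all sizes made up):

```python
import numpy as np

rng = np.random.default_rng(2)
M, N = 20, 15                                              # rows and columns of A (made up)
A = rng.normal(size=(M, N)) * np.exp(-0.5 * np.arange(N))  # later columns nearly zero
A2 = A**2                                                  # elementwise square, A^{o2}

alpha, beta = np.ones(M), np.ones(N)
for _ in range(50):                        # alternate the two fixed-point updates
    alpha = N / (A2 @ beta)                # alpha_i = N / sum_j A_ij^2 beta_j
    beta = M / (A2.T @ alpha)              # beta_j  = M / sum_i A_ij^2 alpha_i

# Large beta_j flags column j of A as irrelevant (similarly alpha_i for rows);
# normalise by the minimum to remove the shared-scale degeneracy.
print(np.round(beta / beta.min(), 1))
```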
Gradients
We want to maximize the likelihood so we’ll need gradients with respect to all the parameters.
Let’s collect the covariance into $$\bSig \triangleq \FF_0\FF_0^T \sigma^2_\la + \sigma^2_y\II.$$ Up to constant terms, the log likelihood is
$$ \log p(\yy_n, \bla_n^1|\bth) \dot{=} -{1\over 2} \log |\bSig| -{1\over 2} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)^T \bSig^{-1} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0) - {m \over 2} \log \sigma_\la^2 - {1 \over 2 \sigma_\la^2} \|\bla_n^1 - \bmu^1\|_2^2 \triangleq \ell(\bth),$$ where $m$ is the dimension of $\bla_n^1$.
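For the gradient checks below, it helps to have $\ell(\bth)$ as code. A direct NumPy transcription, with all names placeholders:

```python
import numpy as np

def ell(y, la1, F0, F1, mu0, mu1, sig2_la, sig2_y):
    """Log likelihood of one observation, up to additive constants."""
    D, m = y.shape[0], la1.shape[0]
    Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(D)
    r = y - F1 @ la1 - F0 @ mu0                      # residual r_n
    _, logdet = np.linalg.slogdet(Sigma)
    return (-0.5 * logdet
            - 0.5 * r @ np.linalg.solve(Sigma, r)
            - 0.5 * m * np.log(sig2_la)
            - 0.5 * np.sum((la1 - mu1) ** 2) / sig2_la)
```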
Gradient wrt $\FF_1$
First let’s get the gradient with respect to $\FF_1$, since it doesn’t show up in the covariance. Defining $$\rr_n = \yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0,$$
the differential is $$d\ell = -\rr_n^T \bSig^{-1}(-d\FF_1)\, \bla_n^1 = \rr_n^T \bSig^{-1}\, d\FF_1\, \bla_n^1,$$
so $$ \boxed{\grad{\FF_1}{\ell} = \bSig^{-1} \rr_n \bla_n^{1,T}.}\; \checkmark$$
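The checkmark can be reproduced numerically. A self-contained finite-difference check of this gradient, with random placeholder values (note $\bSig$ does not depend on $\FF_1$, so only the quadratic term matters):

```python
import numpy as np

rng = np.random.default_rng(3)
D, k0, k1 = 5, 3, 4
F0, F1 = rng.normal(size=(D, k0)), rng.normal(size=(D, k1))
mu0, la1, y = rng.normal(size=k0), rng.normal(size=k1), rng.normal(size=D)
sig2_la, sig2_y = 0.5, 0.1
Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(D)    # independent of F1

def ell(F1):                                        # the F1-dependent term of l
    r = y - F1 @ la1 - F0 @ mu0
    return -0.5 * r @ np.linalg.solve(Sigma, r)

r = y - F1 @ la1 - F0 @ mu0
grad = np.outer(np.linalg.solve(Sigma, r), la1)     # Sigma^{-1} r la1^T

eps, fd = 1e-6, np.zeros_like(F1)
for i in range(D):
    for j in range(k1):
        E = np.zeros_like(F1); E[i, j] = eps
        fd[i, j] = (ell(F1 + E) - ell(F1 - E)) / (2 * eps)

print(np.max(np.abs(grad - fd)))                    # tiny, ~1e-9: agreement
```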
Gradient wrt $\FF_0$
The log likelihood is, again
$$\ell(\bth) = -{1\over 2} \log |\bSig| -{1\over 2} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)^T \bSig^{-1} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)- {m \over 2} \log \sigma_\la^2 - {1 \over 2 \sigma_\la^2} \|\bla_n^1-\bmu^1\|_2^2.$$
The differential is
\begin{align} d\ell &= -{1 \over 2} d\log |\bSig| + \rr_n^T \bSig^{-1} d\FF_0 \bmu^0 - {1 \over 2}\rr_n^T d\bSig^{-1} \rr_n\\ &= -{1 \over 2} \tr{\bSig^{-1} d\bSig} + \rr_n^T \bSig^{-1} d\FF_0 \bmu^0 + {1 \over 2} \rr_n^T \bSig^{-1} d\bSig \bSig^{-1} \rr_n. \end{align}
Now $$ d\bSig = d\FF_0 \FF_0^T \sigma_\la^2 + \FF_0 d\FF_0^T \sigma_\la^2.$$ Plugging this in, \begin{align} d\ell &= -{\sigma^2_\la \over 2} \tr{\bSig^{-1}(d\FF_0 \FF_0^T + \FF_0 d\FF_0^T)} + \rr_n^T \bSig^{-1} d\FF_0 \bmu^0 + {\sigma_\la^2 \over 2} \rr_n^T \bSig^{-1} (d\FF_0 \FF_0^T + \FF_0 d\FF_0^T)\bSig^{-1} \rr_n. \end{align}
From this we get $$\boxed{\grad{\FF_0}{\ell} = -\sigma^2_\la \bSig^{-1} \FF_0 + \bSig^{-1}\rr_n\bmu^{0,T} + \sigma^2_\la \bSig^{-1}\rr_n \rr_n^T \bSig^{-1} \FF_0.}\;\checkmark$$
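This one is easier to get wrong (note the gradient involves $\bmu^0$, not the marginalised-out $\bla^0$), so a finite-difference check is worth doing here too, again with placeholder values:

```python
import numpy as np

rng = np.random.default_rng(4)
D, k0, k1 = 5, 3, 4
F0, F1 = rng.normal(size=(D, k0)), rng.normal(size=(D, k1))
mu0, la1, y = rng.normal(size=k0), rng.normal(size=k1), rng.normal(size=D)
sig2_la, sig2_y = 0.5, 0.1

def ell(F0):                                    # the F0-dependent terms of l
    Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(D)
    r = y - F1 @ la1 - F0 @ mu0
    return -0.5 * np.linalg.slogdet(Sigma)[1] - 0.5 * r @ np.linalg.solve(Sigma, r)

Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(D)
Si = np.linalg.inv(Sigma)
r = y - F1 @ la1 - F0 @ mu0
grad = (-sig2_la * Si @ F0 + np.outer(Si @ r, mu0)
        + sig2_la * np.outer(Si @ r, Si @ r) @ F0)

eps, fd = 1e-6, np.zeros_like(F0)
for i in range(D):
    for j in range(k0):
        E = np.zeros_like(F0); E[i, j] = eps
        fd[i, j] = (ell(F0 + E) - ell(F0 - E)) / (2 * eps)

print(np.max(np.abs(grad - fd)))                # tiny: agreement
```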
Gradient wrt $\bmu^0$ and $\bmu^1$
The log likelihood is, again $$\ell(\bth) = -{1\over 2} \log |\bSig| -{1\over 2} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)^T \bSig^{-1} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)- {m \over 2} \log \sigma_\la^2 - {1 \over 2 \sigma_\la^2} \|\bla_n^1 - \bmu^1\|_2^2.$$
The differential is $$ d\ell = \rr_n^T \bSig^{-1} \FF_0 d\bmu^0,$$ so the gradient is $$ \boxed{\grad{\bmu^0}{\ell} = \FF_0^T \bSig^{-1} \rr_n.}\;\checkmark$$
We can just read off the gradient with respect to $\bmu^1$ as $$\boxed{\grad{\bmu^1}{\ell} = {1 \over \sigma^2_\la} (\bla_n^1 - \bmu^1).}\;\checkmark$$
Gradients wrt $\sigma_\la^2$ and $\sigma_y^2$
These show up inside $\bSig$. The differential is \begin{align} d\ell = -{1 \over 2} \tr{\bSig^{-1} d\bSig} + {1 \over 2} \rr_n^T \bSig^{-1} d\bSig \bSig^{-1} \rr_n.\end{align} For $\sigma^2_\la$, $d\bSig = \FF_0 \FF_0^T d\sigma_\la^2$, so
\begin{align} d\ell = -{1 \over 2} \tr{\bSig^{-1} \FF_0 \FF_0^T}d\sigma_\la^2 + {1 \over 2} \rr_n^T \bSig^{-1} \FF_0 \FF_0^T \bSig^{-1} \rr_n\, d\sigma_\la^2.\end{align} Adding the contribution of the prior terms on $\bla_n^1$, the gradient is $$\boxed{\grad{\sigma^2_\la}{\ell} = -{1 \over 2} \tr{\bSig^{-1} \FF_0 \FF_0^T} + {1 \over 2} \|\FF_0^T \bSig^{-1} \rr_n\|_2^2 - {m \over 2 \sigma_\la^2} +{1 \over 2 \sigma_\la^4} \|\bla_n^1 - \bmu^1\|_2^2.}\; \checkmark $$
For $\sigma_y^2$, where $d\bSig = \II\, d\sigma_y^2$ and the prior terms don’t contribute, this simplifies to
$$\boxed{\grad{\sigma^2_y}{\ell} = -{1 \over 2} \tr{\bSig^{-1}} + {1 \over 2} \|\bSig^{-1} \rr_n\|_2^2.}\; \checkmark$$
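Both variance gradients can be checked the same way as before; a sketch with placeholder values (for $\sigma_\la^2$, remember to include the prior terms on $\bla_n^1$):

```python
import numpy as np

rng = np.random.default_rng(5)
D, k0, k1 = 5, 3, 4
F0, F1 = rng.normal(size=(D, k0)), rng.normal(size=(D, k1))
mu0, mu1 = rng.normal(size=k0), rng.normal(size=k1)
la1, y = rng.normal(size=k1), rng.normal(size=D)

def ell(s_la, s_y):
    Sigma = s_la * F0 @ F0.T + s_y * np.eye(D)
    r = y - F1 @ la1 - F0 @ mu0
    return (-0.5 * np.linalg.slogdet(Sigma)[1]
            - 0.5 * r @ np.linalg.solve(Sigma, r)
            - 0.5 * k1 * np.log(s_la)                 # m = dim(la1) = k1 here
            - 0.5 * np.sum((la1 - mu1) ** 2) / s_la)

s_la, s_y = 0.5, 0.1
Si = np.linalg.inv(s_la * F0 @ F0.T + s_y * np.eye(D))
r = y - F1 @ la1 - F0 @ mu0

g_la = (-0.5 * np.trace(Si @ F0 @ F0.T) + 0.5 * np.sum((F0.T @ Si @ r) ** 2)
        - 0.5 * k1 / s_la + 0.5 * np.sum((la1 - mu1) ** 2) / s_la**2)
g_y = -0.5 * np.trace(Si) + 0.5 * np.sum((Si @ r) ** 2)

eps = 1e-6
print(abs(g_la - (ell(s_la + eps, s_y) - ell(s_la - eps, s_y)) / (2 * eps)))
print(abs(g_y - (ell(s_la, s_y + eps) - ell(s_la, s_y - eps)) / (2 * eps)))
```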
Gradient wrt $\AA$ and $\sigma^2$
\begin{align} \FF_0 &= \II_1 (\sigma^2 \II + \AA \AA^T)\II_0^T.\\ d\FF_0 &= \II_1 \II_0^T d\sigma^2 + \II_1 d\AA \AA^T \II_0^T + \II_1 \AA d\AA^T \II_0^T = d\AA_1 \AA_0^T + \AA_1 d\AA_0^T, \end{align} where $\AA_0 \triangleq \II_0 \AA$, $\AA_1 \triangleq \II_1 \AA$, and we used $\II_1 \II_0^T = \bzero.$
\begin{align} \FF_1 &= \II_1 (\sigma^2 \II + \AA \AA^T)\II_1^T\\ d\FF_1 &= \II_1 \II_1^T d\sigma^2 + \II_1 d\AA \AA^T \II_1^T + \II_1 \AA d\AA^T \II_1^T = \II_m d\sigma^2 + d\AA_1 \AA_1^T + \AA_1 d\AA_1^T, \end{align} where we used $\II_1 \II_1^T = \II_m.$ The differential is then \begin{align} d\ell &= \tr{\grad{\FF_0}{\ell}^T d\FF_0} + \tr{\grad{\FF_1}{\ell}^T d\FF_1}\\ &= \tr{\grad{\FF_0}{\ell}^T(d\AA_1 \AA_0^T + \AA_1 d\AA_0^T)} + \tr{\grad{\FF_1}{\ell}^T(d\AA_1 \AA_1^T + \AA_1 d\AA_1^T)} + \tr{\grad{\FF_1}{\ell}^T\II_m}\, d\sigma^2.
\end{align}
From this we get $$ \boxed{\grad{\AA_0}{\ell} = \grad{\FF_0}{\ell}^T \AA_1,}\;\checkmark$$ $$ \boxed{\grad{\AA_1}{\ell} = \grad{\FF_0}{\ell}\AA_0 + (\grad{\FF_1}{\ell} + \grad{\FF_1}{\ell}^T) \AA_1,}\;\checkmark$$ and, collecting the $d\sigma^2$ term,
$$ \boxed{\grad{\sigma^2}{\ell} = \tr{\grad{\FF_1}{\ell}}.}\;\checkmark$$
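Putting the chain rule together: starting from the $\FF_0$ and $\FF_1$ gradients, we can assemble the full gradient with respect to $\AA$ and $\sigma^2$ and check it by finite differences. A sketch with made-up index sets and sizes (`idx0`/`idx1` play the roles of $\II_0$/$\II_1$):

```python
import numpy as np

rng = np.random.default_rng(6)
K, Nc = 7, 6                                   # latent dim, columns of A (made up)
idx0, idx1 = np.arange(3), np.arange(3, 7)     # indices extracted by I_0 (missing) and I_1 (observed)
m = len(idx1)

A, sig2 = rng.normal(size=(K, Nc)), 0.3
sig2_la, sig2_y = 0.5, 0.1
mu0, la1, y = rng.normal(size=len(idx0)), rng.normal(size=m), rng.normal(size=m)

def ell(A, sig2):                              # the terms of l that depend on A, sig2
    M = sig2 * np.eye(K) + A @ A.T
    F0, F1 = M[np.ix_(idx1, idx0)], M[np.ix_(idx1, idx1)]
    Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(m)
    r = y - F1 @ la1 - F0 @ mu0
    return -0.5 * np.linalg.slogdet(Sigma)[1] - 0.5 * r @ np.linalg.solve(Sigma, r)

# Analytic gradient via the chain rule
M = sig2 * np.eye(K) + A @ A.T
F0, F1 = M[np.ix_(idx1, idx0)], M[np.ix_(idx1, idx1)]
A0, A1 = A[idx0], A[idx1]
Sigma = sig2_la * F0 @ F0.T + sig2_y * np.eye(m)
Si = np.linalg.inv(Sigma)
r = y - F1 @ la1 - F0 @ mu0
G1 = np.outer(Si @ r, la1)                                    # grad wrt F1
G0 = (-sig2_la * Si @ F0 + np.outer(Si @ r, mu0)
      + sig2_la * np.outer(Si @ r, Si @ r) @ F0)              # grad wrt F0

gA = np.zeros_like(A)
gA[idx0] = G0.T @ A1                           # grad wrt A_0
gA[idx1] = G0 @ A0 + (G1 + G1.T) @ A1          # grad wrt A_1
g_sig2 = np.trace(G1)                          # grad wrt sigma^2

eps, fd = 1e-6, np.zeros_like(A)
for i in range(K):
    for j in range(Nc):
        E = np.zeros_like(A); E[i, j] = eps
        fd[i, j] = (ell(A + E, sig2) - ell(A - E, sig2)) / (2 * eps)

print(np.max(np.abs(gA - fd)))                                           # tiny: agreement
print(abs(g_sig2 - (ell(A, sig2 + eps) - ell(A, sig2 - eps)) / (2 * eps)))
```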