{"id":3288,"date":"2025-01-27T16:59:49","date_gmt":"2025-01-27T16:59:49","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=3288"},"modified":"2025-12-27T16:01:55","modified_gmt":"2025-12-27T16:01:55","slug":"the-inference-model-when-missing-observations","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2025\/01\/27\/the-inference-model-when-missing-observations\/","title":{"rendered":"The inference model when missing observations"},"content":{"rendered":"\n<p>The inference model isn&#8217;t giving good performance. But is this because we&#8217;re missing data?<\/p>\n\n\n\n<p>In the inference model, the recorded output activity is related to the input according to $$ (\\sigma^2 \\II + \\AA \\AA^T) \\bLa = \\YY,$$<br>where we&#8217;ve absorbed $\\gamma$ into $\\AA$.<\/p>\n\n\n\n<p>We can model this as $N$ observations of $\\yy$ given $\\bla$, where<br>$$ \\yy_n | \\bla_n = \\FF \\bla_n + \\norm(0, \\sigma^2_y)$$ If we put an isotropic normal prior on $\\bla$, we have that the joint distribution on inputs and outputs is then $$ p(\\yy_n, \\bla_n|\\bth) = \\norm(\\yy_n| \\FF \\bla_n, \\sigma^2_y)\\; \\norm(\\bla_n|\\bmu, \\sigma_\\la^2),$$ where $$\\bth \\triangleq \\{\\FF, \\sigma^2, \\sigma^2_y, \\sigma^2_\\la\\}$$ are our parameters.<\/p>\n\n\n\n<p>Our observed $\\yy_n$ are determined by both the observed and unobserved $\\bla$. Define $\\II_0$ and $\\II_1$ as the matrices that extract the observed and unobserved values. Applying these to $\\bla$ gives $\\bla^0_n$ and $\\bla^1_n$. Extracting the relevant parts of the prior on $\\bla$ as $$ \\bla^0_n \\sim \\norm(\\bla^0_n|\\bmu^0, \\sigma_\\la^2), \\quad \\bla^1_n \\sim \\norm(\\bla^1_n|\\bmu^1, \\sigma_\\la^2),$$ we then have $$ p(\\yy_n, \\bla_n^0, \\bla_n^1| \\bth) = \\mathcal{N}(\\yy_n | \\FF_0 \\bla_n^0 + \\FF_1 \\bla_n^1, \\sigma^2_y)\\, \\, \\norm(\\bla^0_n|\\bmu^0, \\sigma_\\la^2)\\,\\norm(\\bla^1_n|\\bmu^1, \\sigma_\\la^2),$$ where $$\\FF_0 = \\FF \\II_0^T, \\quad \\FF_1 = \\FF \\II_1^T.$$<\/p>\n\n\n\n<p>We can then rearrange our joint distribution to focus on the unobserved data<br>$$ p(\\yy_n, \\bla_n^0, \\bla_n^1| \\bth) = \\mathcal{N}(\\yy_n &#8211; \\FF_1 \\bla_n^1 |\\FF_0 \\bla_n^0 , \\sigma^2_y)\\,\\norm(\\bla^0_n|\\bmu^0, \\sigma_\\la^2)\\,\\norm(\\bla^1_n|\\bmu^1, \\sigma_\\la^2).$$<\/p>\n\n\n\n<p>Marginalising out the missing observations,\\begin{align} p(\\yy_n, \\bla_n^1) &amp;= \\int d\\bla_n^0\\, p(\\yy_n, \\bla_n^0, \\bla_n^1)\\\\ &amp;= \\norm(\\bla^1_n|\\bmu^1, \\sigma_\\la^2) \\int d\\bla_n^0\\; \\mathcal{N}(\\yy_n &#8211; \\FF_1 \\bla_n^1 | \\FF_0 \\bla_n^0, \\sigma^2_y)\\,\\norm(\\bla^0_n|\\bmu^0, \\sigma_\\la^2).\\end{align}<\/p>\n\n\n\n<p>The last integral is the marginal distribution of observations in a linear Gaussian model, so can be evaluated in closed form, giving $$ p(\\yy_n, \\bla_n^1) = \\norm(\\bla_n^1| \\bmu^1, \\sigma_\\la^2) \\, \\norm(\\yy_n &#8211; \\FF_1 \\bla_n^1| \\FF_0 \\bmu^0, \\FF_0 \\FF_0^T \\sigma_\\lambda^2 + \\sigma^2_y \\II).$$<\/p>\n\n\n\n<p>We can rearrange this to finally arrive at $$ p(\\yy_n, \\bla_n^1|\\bth) = \\norm(\\yy_n | \\FF_1 \\bla_n^1 + \\FF_0 \\bmu^0, \\FF_0\\FF_0^T \\sigma^2_\\la + \\sigma^2_y\\II)\\, \\norm(\\bla_n^1| \\bmu^1, \\sigma_\\la^2).$$<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Parameter priors<\/h2>\n\n\n\n<p>We have several variance parameters: $\\sigma^2, \\sigma_y^2$ and $\\sigma_\\la^2$. 
## Parameter priors

We have several variance parameters: $\sigma^2$, $\sigma_y^2$ and $\sigma_\la^2$. In the first instance, we'll just put an improper prior on these, allowing them to be arbitrarily large.

The connectivity parameters are in $$ \FF = \II_1 (\sigma^2 \II + \AA \AA^T).$$ $\AA$ has $M$ rows and $N$ columns, and we don't know either of these in advance. We could run different versions of the model for different values of $M$ and $N$, but we can also use just one large value for each and rely on automatic relevance determination to prune the rows and columns that aren't needed. So, $$ p(\AA|\bal, \bbe) = \prod_{i,j} \norm(A_{ij}|0, \alpha^{-1}_i \beta^{-1}_j) = \prod_{i,j} \sqrt{{\alpha_i \beta_j \over 2\pi}}\,e^{-\alpha_i \beta_j A_{ij}^2/2}.$$

Then $$ \log p(\AA|\bal, \bbe) \dot = \sum_{i} {N\over 2}\log{\alpha_i} + \sum_j {M \over 2}\log{\beta_j} - {1\over 2} \sum_{i,j} \alpha_i \beta_j A_{ij}^2.$$ Rather than integrating over $\AA$, we'll just set $\bal$ and $\bbe$ to their most likely values, as in the evidence approximation. The gradient for $\bal$ is, up to an overall factor of ${1 \over 2}$, $$ {\partial \over \partial \alpha_i} \dot = {N \over \alpha_i} - \sum_j A_{ij}^2 \beta_j,$$ and similarly for $\bbe$. Setting these to zero, the updated values, given $\AA$, are $$ \bal^\text{new} = {N \over \AA^{\circ 2} \bbe}, \quad \bbe^\text{new} = {M \over \AA^{\circ 2, T} \bal}, $$ where $\AA^{\circ 2} = \AA \odot \AA$, whose elements are those of $\AA$ squared, and the divisions are elementwise.
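A minimal NumPy sketch of these fixed-point updates, alternated for a fixed number of rounds; the alternation scheme, initialisation, and names are my own choices.

```python
import numpy as np

def ard_precisions(A, n_iter=100):
    """Alternate the fixed-point updates
    alpha_i = N / (A^{o2} beta)_i,  beta_j = M / (A^{o2,T} alpha)_j."""
    M, N = A.shape
    A2 = A ** 2                                # A^{o2}: elementwise square
    alpha, beta = np.ones(M), np.ones(N)
    for _ in range(n_iter):
        alpha = N / (A2 @ beta + 1e-12)        # small floor guards all-zero rows
        beta = M / (A2.T @ alpha + 1e-12)
    return alpha, beta
```

Note that the updates only pin down the products $\alpha_i \beta_j$: rescaling $\bal \to c\,\bal$ and $\bbe \to \bbe/c$ gives another fixed point, so the overall scale split depends on the initialisation.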
## Gradients

We want to maximize the likelihood, so we'll need gradients with respect to all the parameters.

Let's collect the covariance term into $$\bSig \triangleq \FF_0\FF_0^T \sigma^2_\la + \sigma^2_y\II.$$ The log likelihood is, up to constant terms, $$ \log p(\yy_n, \bla_n^1|\bth) \dot{=} -{1\over 2} \log |\bSig| -{1\over 2} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)^T \bSig^{-1} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0) - {m \over 2} \log \sigma_\la^2 - {1 \over 2 \sigma_\la^2} \|\bla_n^1 - \bmu^1\|_2^2 \triangleq \ell(\bth),$$ where $m$ is the number of observed inputs.

### Gradient wrt $\FF_1$

First let's get the gradient with respect to $\FF_1$, since it doesn't show up in the covariance. Defining the residual $$\rr_n = \yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0,$$ the differential is $$d\ell = - \rr_n^T \bSig^{-1}(-d\FF_1)\, \bla_n^1 = \rr_n^T \bSig^{-1} d\FF_1\, \bla_n^1,$$ so $$ \boxed{\grad{\FF_1}{\ell} = \bSig^{-1} \rr_n \bla_n^{1,T}.}\; \checkmark$$

### Gradient wrt $\FF_0$

The log likelihood is, again, $$\ell(\bth) = -{1\over 2} \log |\bSig| -{1\over 2} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)^T \bSig^{-1} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0) - {m \over 2} \log \sigma_\la^2 - {1 \over 2 \sigma_\la^2} \|\bla_n^1-\bmu^1\|_2^2.$$

This time the parameter appears in both the mean and the covariance. The differential is \begin{align} d\ell &= -{1 \over 2} d\log |\bSig| + \rr_n^T \bSig^{-1} d\FF_0\, \bmu^0 - {1 \over 2}\rr_n^T d\bSig^{-1} \rr_n\\ &= -{1 \over 2} \tr{\bSig^{-1} d\bSig} + \rr_n^T \bSig^{-1} d\FF_0\, \bmu^0 + {1 \over 2} \rr_n^T \bSig^{-1} d\bSig\, \bSig^{-1} \rr_n. \end{align} Now $$ d\bSig = d\FF_0 \FF_0^T \sigma_\la^2 + \FF_0 d\FF_0^T \sigma_\la^2.$$ Plugging this in, \begin{align} d\ell &= -{\sigma^2_\la \over 2} \tr{\bSig^{-1}(d\FF_0 \FF_0^T + \FF_0 d\FF_0^T)} + \rr_n^T \bSig^{-1} d\FF_0\, \bmu^0 + {\sigma_\la^2 \over 2} \rr_n^T \bSig^{-1} (d\FF_0 \FF_0^T + \FF_0 d\FF_0^T)\bSig^{-1} \rr_n. \end{align} From this we get $$\boxed{\grad{\FF_0}{\ell} = -\sigma^2_\la \bSig^{-1} \FF_0 + \bSig^{-1}\rr_n\bmu^{0,T} + \sigma^2_\la \bSig^{-1}\rr_n \rr_n^T \bSig^{-1} \FF_0.}\;\checkmark$$

### Gradient wrt $\bmu^0$ and $\bmu^1$

The log likelihood is, again, $$\ell(\bth) = -{1\over 2} \log |\bSig| -{1\over 2} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0)^T \bSig^{-1} (\yy_n - \FF_1 \bla_n^1 - \FF_0 \bmu^0) - {m \over 2} \log \sigma_\la^2 - {1 \over 2 \sigma_\la^2} \|\bla_n^1 - \bmu^1\|_2^2.$$

The differential with respect to $\bmu^0$ is $$ d\ell = \rr_n^T \bSig^{-1} \FF_0\, d\bmu^0,$$ so the gradient is $$ \boxed{\grad{\bmu^0}{\ell} = \FF_0^T \bSig^{-1} \rr_n.}\;\checkmark$$

We can just read off the gradient with respect to $\bmu^1$ as $$\boxed{\grad{\bmu^1}{\ell} = {1 \over \sigma^2_\la} (\bla_n^1 - \bmu^1).}\;\checkmark$$

### Gradients wrt $\sigma_\la^2$ and $\sigma_y^2$

These show up inside $\bSig$; $\sigma_\la^2$ also appears directly in the prior term on $\bla_n^1$. The differential of the $\bSig$-dependent terms is \begin{align} d\ell = -{1 \over 2} \tr{\bSig^{-1} d\bSig} + {1 \over 2} \rr_n^T \bSig^{-1} d\bSig\, \bSig^{-1} \rr_n.\end{align} For $\sigma^2_\la$, $d\bSig = \FF_0 \FF_0^T d\sigma_\la^2$, so \begin{align} d\ell = -{1 \over 2} \tr{\bSig^{-1} \FF_0 \FF_0^T}d\sigma_\la^2 + {1 \over 2} \rr_n^T \bSig^{-1} \FF_0 \FF_0^T \bSig^{-1} \rr_n\, d\sigma_\la^2.\end{align} Adding the contribution of the prior term, the gradient is $$\boxed{\grad{\sigma^2_\la}{\ell} = -{1 \over 2} \tr{\bSig^{-1} \FF_0 \FF_0^T} + {1 \over 2} \|\FF_0^T \bSig^{-1} \rr_n\|_2^2 - {m \over 2 \sigma_\la^2} +{1 \over 2 \sigma_\la^4} \|\bla_n^1 - \bmu^1\|_2^2.}\; \checkmark $$

For $\sigma_y^2$, $d\bSig = \II\, d\sigma_y^2$, and the result simplifies to $$\boxed{\grad{\sigma^2_y}{\ell} = -{1 \over 2} \tr{\bSig^{-1}} + {1 \over 2} \|\bSig^{-1} \rr_n\|_2^2.}\; \checkmark$$
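The checkmarks indicate gradients verified numerically. As an illustration, here is a hypothetical central-difference check of the boxed $\grad{\FF_0}{\ell}$ against `marginal_loglik` from the earlier sketch (which must be in scope); the test dimensions and seed are arbitrary.

```python
import numpy as np

def grad_F0(y_n, lam1_n, F0, F1, mu0, mu1, sig2_lam, sig2_y):
    """Boxed result: -sig2_lam Sig^{-1} F0 + Sig^{-1} r mu0^T
                     + sig2_lam Sig^{-1} r r^T Sig^{-1} F0."""
    Sigma = sig2_lam * F0 @ F0.T + sig2_y * np.eye(y_n.size)
    r = y_n - F1 @ lam1_n - F0 @ mu0
    Sr = np.linalg.solve(Sigma, r)               # Sigma^{-1} r_n
    return (-sig2_lam * np.linalg.solve(Sigma, F0)
            + np.outer(Sr, mu0)
            + sig2_lam * np.outer(Sr, Sr @ F0))  # r^T Sig^{-1} F0 = (Sig^{-1} r)^T F0

def perturbed(th, i, j, eps):
    """Copy of the parameter dict with F0[i, j] nudged by eps."""
    t = {**th, 'F0': th['F0'].copy()}
    t['F0'][i, j] += eps
    return t

rng = np.random.default_rng(0)
d, k, m = 5, 3, 2
th = dict(y_n=rng.normal(size=d), lam1_n=rng.normal(size=m),
          F0=rng.normal(size=(d, k)), F1=rng.normal(size=(d, m)),
          mu0=rng.normal(size=k), mu1=rng.normal(size=m),
          sig2_lam=0.7, sig2_y=0.3)
i, j, eps = 2, 1, 1e-6
fd = (marginal_loglik(**perturbed(th, i, j, eps))
      - marginal_loglik(**perturbed(th, i, j, -eps))) / (2 * eps)
print(grad_F0(**th)[i, j], fd)   # the two numbers should agree to ~1e-6
```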
\\boxed{\\grad{\\AA_1}{\\ell} = \\grad{\\FF_0}{\\ell}\\AA_0 + (\\grad{\\FF_1}{\\ell} + \\grad{\\FF_1}{\\ell}^T) \\AA_1.}\\;\\checkmark$$ and<br>$$ \\boxed{\\grad{\\sigma^2}{\\ell} = \\tr{\\grad{\\FF_1}{\\ell}}.}\\;\\checkmark$$<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The inference model isn&#8217;t giving good performance. But is this because we&#8217;re missing data? In the inference model, the recorded output activity is related to the input according to $$ (\\sigma^2 \\II + \\AA \\AA^T) \\bLa = \\YY,$$where we&#8217;ve absorbed $\\gamma$ into $\\AA$. We can model this as $N$ observations of $\\yy$ given $\\bla$, where$$ [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1,148],"tags":[95,98,96,97],"class_list":["post-3288","post","type-post","status-publish","format-standard","hentry","category-blog","category-research","tag-inference","tag-maximum-likelihood","tag-missing-observations","tag-model-selection"],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3288","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=3288"}],"version-history":[{"count":57,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3288\/revisions"}],"predecessor-version":[{"id":3345,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/3288\/revisions\/3345"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=3288"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=3288"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=3288"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}