{"id":7014,"date":"2026-03-05T15:14:13","date_gmt":"2026-03-05T15:14:13","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=7014"},"modified":"2026-03-25T17:05:08","modified_gmt":"2026-03-25T17:05:08","slug":"linearizing-the-covariance-loss","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2026\/03\/05\/linearizing-the-covariance-loss\/","title":{"rendered":"Linearizing the Covariance Loss"},"content":{"rendered":"\n<p>We&#8217;re after insight, not an exact solution. ChatGPT had a good suggestion to linearize the loss around $\\zz = 1$. The empirical values we see for $\\zz$ can  be quite large relative to 1, in the range $[-0.5, 1.5]$, but linearization might be enough to give the insight we&#8217;re after. Let&#8217;s see.<\/p>\n\n\n\n<p>If we consider $\\bdelta = \\zz &#8211; \\bone$, then $$L(\\bdelta) = {1\\over 2}\\|\\XX^T[\\bdelta + \\bone]^2 \\JJ [\\bdelta + \\bone] \\XX &#8211; \\SS\\|_2^2 + {\\lambda \\over 2}\\|\\bdelta\\|_2^2.$$<\/p>\n\n\n\n<p>We can linearize the middle as \\begin{align*} [\\bdelta + \\bone]\\JJ[\\bdelta + \\bone] &amp;= [\\bdelta] \\JJ + \\JJ [\\bdelta] + \\JJ \\end{align*}<\/p>\n\n\n\n<p>When we left and right multiply by $\\XX^T$, each of these contributes a term. <\/p>\n\n\n\n<p>The first is $$ \\XX^T [\\bdelta] \\JJ \\XX = \\sum_i \\delta_i \\xx_i \\widetilde \\xx_i^T,$$ where $\\widetilde \\xx_i$ are the <em>population<\/em>-mean-subtracted (i.e. per-odour) responses. 
Note that it is the population mean, computed per odour, that is subtracted; it&#8217;s not that each $\\xx_i$ itself has mean zero.<\/p>\n\n\n\n<p>The second is $$ \\XX^T \\JJ [\\bdelta] \\XX = \\sum_i \\delta_i \\widetilde \\xx_i \\xx_i^T.$$<\/p>\n\n\n\n<p>We can combine these into a single term $$ \\sum_i \\GG_i \\delta_i, \\quad \\GG_i \\triangleq \\xx_i \\widetilde \\xx_i^T + \\widetilde \\xx_i \\xx_i^T.$$<\/p>\n\n\n\n<p>We can combine the remaining term with the target covariance by defining $$ \\bE_0 \\triangleq \\SS - \\XX^T \\JJ \\XX,$$ where the subscript reminds us that this is the error at $\\bdelta = 0.$ <\/p>\n\n\n\n<p>If we vectorize, forming $$\\GG = [\\bg_1 \\dots \\bg_N], \\quad \\bg_i = \\text{vec}(\\GG_i), \\quad \\ee_0 = \\text{vec}(\\bE_0),$$ our loss becomes $$ L(\\bdelta) = {1 \\over 2} \\|\\GG \\bdelta - \\ee_0\\|_2^2 + {\\lambda \\over 2} \\|\\bdelta\\|_2^2.$$ The gradient is then $$ \\nabla L = \\GG^T (\\GG \\bdelta - \\ee_0) + \\lambda \\bdelta,$$ so the solution is $$ \\boxed{\\bdelta = (\\GG^T \\GG + \\lambda \\II)^{-1} \\GG^T \\ee_0.}$$<\/p>\n\n\n\n<p>If we ignore correlations among the $\\bg_i$, this is $$ \\delta_i \\approx {\\bg_i^T \\ee_0 \\over \\|\\bg_i\\|_2^2 + \\lambda}.$$<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"273\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-9-1024x273.png\" alt=\"\" class=\"wp-image-7046\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-9-1024x273.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-9-300x80.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-9-768x204.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-9.png 1379w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The linear update (orange) is good! 
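<\/p>\n\n\n\n<p>To make the linear solution concrete, here is a minimal numpy sketch on synthetic data (all sizes and variable names are illustrative, and the centering matrix $\\JJ$ is assumed to subtract the per-odour population mean, matching the definition of $\\widetilde \\xx_i$). It builds the atoms $\\GG_i$, solves the boxed ridge system, and also forms the per-cell estimate that ignores the correlations among the $\\bg_i$:<\/p>\n\n\n\n

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 8, 5                       # N cells, D odours (illustrative sizes)
lam = 0.1                         # ridge penalty lambda
X = rng.standard_normal((N, D))   # rows are per-cell response profiles x_i

# Centering matrix J: subtracts the mean across cells (the population) per odour
J = np.eye(N) - np.ones((N, N)) / N
Xt = J @ X                        # rows are the mean-subtracted responses

# Atoms G_i = x_i xt_i^T + xt_i x_i^T, vectorized into the columns g_i of G
G = np.stack([np.outer(X[i], Xt[i]) + np.outer(Xt[i], X[i]) for i in range(N)])
G = G.reshape(N, -1).T            # shape (D*D, N); column i is vec(G_i)

# Error at delta = 0: E0 = S - X^T J X, with an arbitrary symmetric target S
S = rng.standard_normal((D, D))
S = S + S.T
e0 = (S - X.T @ J @ X).ravel()

# Boxed solution: delta = (G^T G + lam I)^{-1} G^T e0
delta = np.linalg.solve(G.T @ G + lam * np.eye(N), G.T @ e0)

# Per-cell estimate that ignores correlations among the g_i
delta_naive = (G.T @ e0) / ((G**2).sum(axis=0) + lam)

# delta zeroes the gradient of the linearized loss; the naive estimate differs
grad = G.T @ (G @ delta - e0) + lam * delta
print("max |grad| at solution:", np.abs(grad).max())
print("||delta_naive - delta||:", np.linalg.norm(delta_naive - delta))
```

\n\n\n\n<p>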
However, the correlations among the columns of $\\GG$ are important: ignoring them and keeping only the variances clearly produces a poor estimate.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Explanations<\/h2>\n\n\n\n<p>Now, how to explain the linear approximation intuitively? <\/p>\n\n\n\n<p>One idea I had was to see whether there was an equivalent dataset, one that would give the same coefficients for the same $\\ee_0$ but with decorrelated representational atoms $\\bg_i$, so that the explanation that ignores those correlations would work, and we could explain the results that way. <\/p>\n\n\n\n<p>However, no such dataset exists. Let $\\GG = \\UU \\SS \\VV^T$ be the SVD of $\\GG$ (overloading $\\SS$ here for the diagonal matrix of singular values). Then $$ (\\GG^T \\GG + \\lambda \\II)^{-1} \\GG^T = \\VV {\\SS \\over \\SS^2 + \\lambda \\II} \\UU^T.$$ There are no degrees of freedom here that would allow us to e.g. rotate $\\VV$: once $\\GG$ is known, everything is determined.<\/p>\n\n\n\n<p>Another possibility is to switch coordinates: multiplying both sides of the above on the left by $\\VV^T$, we get $$ \\UU^T \\ee_0 = \\left[{\\SS \\over \\SS^2 + \\lambda \\II}\\right]^{-1} \\VV^T \\bdelta.$$ This has the disadvantage that we&#8217;re no longer explaining the gains $\\bdelta$ themselves, but some rotated version of them.<\/p>\n\n\n\n<p>A further possibility is to simply define the raw overlaps as $\\GG^T \\ee_0$, and take the rows of $(\\GG^T \\GG + \\lambda \\II)^{-1}$ as the cell-specific filters.<\/p>\n\n\n\n<p>ChatGPT suggested an improvement on this, where we split the filters into two whitening steps $$ \\WW^2 \\triangleq (\\GG^T \\GG + \\lambda \\II)^{-1}, \\quad \\WW = (\\GG^T \\GG + \\lambda \\II)^{-{1 \\over 2}}.$$ We can then write $$ \\delta_i = \\ee_i^T \\WW \\WW (\\GG^T \\ee_0) = \\langle \\WW \\ee_i, \\WW (\\GG^T \\ee_0) \\rangle,$$ where $\\ee_i$ is the $i$-th standard basis vector. <\/p>\n\n\n\n<p>If there were no correlations, $\\delta_i \\propto \\ee_i^T \\GG^T \\ee_0.$ With the correlations, we have to whiten both the filter $\\ee_i$ and the overlaps, 
$\\GG^T \\ee_0,$ which is what we have above. <\/p>\n\n\n\n<p>I think that&#8217;s the best we can do. <\/p>\n\n\n\n<p>$$ \\blacksquare$$<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We&#8217;re after insight, not an exact solution. ChatGPT had a good suggestion to linearize the loss around $\\zz = 1$. In this post we do that.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,148],"tags":[161,162,163],"class_list":["post-7014","post","type-post","status-publish","format-standard","hentry","category-blog","category-research","tag-diagonal-model","tag-input-output-transform","tag-iopaper"],"acf":[],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/7014","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=7014"}],"version-history":[{"count":55,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/7014\/revisions"}],"predecessor-version":[{"id":7543,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/7014\/revisions\/7543"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=7014"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=7014"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=7014"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}