{"id":7075,"date":"2026-03-09T11:36:22","date_gmt":"2026-03-09T11:36:22","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=7075"},"modified":"2026-03-24T10:18:30","modified_gmt":"2026-03-24T10:18:30","slug":"linearizing-the-covariance-loss-for-the-free-model","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2026\/03\/09\/linearizing-the-covariance-loss-for-the-free-model\/","title":{"rendered":"Linearizing the Covariance Loss for the Free Model"},"content":{"rendered":"\n<p>Below I&#8217;ve plotted the learned connectivity, minus the identity, for the Free model, which has no constraints. I used the regularization value, $10^6$, that gave the best validation $R^2$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"303\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-11-1024x303.png\" alt=\"\" class=\"wp-image-7077\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-11-1024x303.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-11-300x89.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-11-768x227.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-11-1536x454.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-11.png 1542w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The connectivity deviations are indeed small, presumably because the regularization is so strong. Therefore linearization might be effective here as well.<\/p>\n\n\n\n<p>Letting $\\ZZ= \\II +\\WW$, our loss is \\begin{align*} L(\\WW) &amp;= {1 \\over 2} \\|\\SS &#8211; \\XX^T (\\II + \\WW)^T \\JJ (\\II + \\WW) \\XX\\|^2_F + {\\lambda \\over 2}\\|\\WW\\|_F^2. 
\\end{align*} Keeping only the linear terms in $\\WW$ inside the norm, we get \\begin{align*} L(\\WW) &amp;\\approx {1 \\over 2} \\|\\SS - \\CC - \\XX^T \\WW^T \\JJ \\XX - \\XX^T \\JJ \\WW \\XX\\|_F^2 + {\\lambda \\over 2}\\|\\WW\\|_F^2,\\end{align*} where $\\CC \\triangleq \\XX^T \\XX$.<\/p>\n\n\n\n<p>Then $$ dL = -\\mathrm{tr}\\left(\\bE^T \\XX^T d\\WW^T \\JJ \\XX\\right) - \\mathrm{tr}\\left(\\bE^T \\XX^T \\JJ d\\WW \\XX\\right) + \\lambda \\mathrm{tr}\\left(\\WW^T d\\WW\\right),$$ where we&#8217;ve defined $$\\bE \\triangleq \\SS - \\CC - \\XX^T \\WW^T \\JJ \\XX - \\XX^T \\JJ \\WW \\XX.$$ <\/p>\n\n\n\n<p>From this we read off the gradient (using the symmetry of $\\bE$) as \\begin{align} \\nonumber \\nabla L &amp;= -2 \\JJ \\XX \\bE^T \\XX^T + \\lambda \\WW\\\\  \\nonumber &amp;= -2 \\JJ \\XX (\\SS - \\CC - \\XX^T \\WW^T \\JJ \\XX - \\XX^T \\JJ \\WW \\XX) \\XX^T + \\lambda \\WW\\\\ &amp;= -2 \\JJ \\XX \\bE_0 \\XX^T + 2 \\JJ \\XX \\XX^T \\WW^T \\JJ \\XX \\XX^T + 2 \\JJ \\XX \\XX^T \\JJ \\WW \\XX \\XX^T + \\lambda \\WW,\\end{align} where we&#8217;ve defined the error at $\\WW = \\bzero$ as $$\\bE_0 \\triangleq \\SS - \\XX^T \\XX.$$<\/p>\n\n\n\n<p>Note that the non-regularization part of this is symmetric. 
This means that the updates to $\\WW$ will be symmetric, so its non-symmetric part will decay to zero, and the solution will be symmetric.<\/p>\n\n\n\n<p>From the equation above, and using $\\bone^T \\JJ = \\bzero$, we also have that \\begin{align*} \\bone^T \\nabla L &amp;= \\bone^T \\left( -2 \\JJ \\XX \\bE_0 \\XX^T + 2 \\JJ \\XX \\XX^T \\WW^T \\JJ \\XX \\XX^T + 2 \\JJ \\XX \\XX^T \\JJ \\WW \\XX \\XX^T \\right) + \\lambda \\bone^T \\WW \\\\ &amp;= \\bzero + \\lambda \\bone^T \\WW.\\end{align*} Setting $\\nabla L = \\bzero$ at the optimum therefore gives $\\bone^T \\WW = \\bzero$, and hence $\\WW = \\JJ \\WW$.<\/p>\n\n\n\n<p>Using the symmetry of $\\WW$ and $\\JJ \\WW = \\WW \\JJ = \\WW$, we can then write the above gradient as $$ \\nabla L = -2 \\JJ \\XX \\bE_0 \\XX^T + 4 \\JJ \\XX \\XX^T \\WW  \\XX \\XX^T + \\lambda \\WW.$$<\/p>\n\n\n\n<p>Now $\\XX$ is mean subtracted along the columns, so $\\JJ \\XX = \\XX$. Writing $\\SS_Y \\equiv \\SS$ for the target covariance (to distinguish it from the singular value matrix $\\SS_X$ below), the gradient becomes \\begin{align*} \\nabla L &amp;= -2 \\XX \\bE_0 \\XX^T + 4 \\XX \\XX^T \\WW  \\XX \\XX^T + \\lambda \\WW \\\\ &amp;= -2 \\XX (\\SS_Y - \\XX^T \\XX) \\XX^T + 4 \\XX \\XX^T \\WW  \\XX \\XX^T + \\lambda \\WW \\\\ &amp;= -2 \\XX \\SS_Y \\XX^T + 2 \\XX \\XX^T \\XX \\XX^T + 4 \\XX \\XX^T \\WW  \\XX \\XX^T + \\lambda \\WW.\\end{align*}<\/p>\n\n\n\n<p>Let $\\XX = \\UU \\sqrt{\\SS_X} \\VV^T$ be the SVD of $\\XX$, so that $\\SS_X$ is the diagonal matrix of squared singular values. In these terms, \\begin{align*} \\nabla L &amp;= -2 \\UU \\sqrt{\\SS_X}\\VV^T \\SS_Y \\VV \\sqrt{\\SS_X} \\UU^T + 2 \\UU \\SS_X^2 \\UU^T + 4 \\UU \\SS_X \\UU^T \\WW \\UU \\SS_X \\UU^T + \\lambda \\WW.\\end{align*}<\/p>\n\n\n\n<p>Multiplying on the left by $\\UU^T$ and on the right by $\\UU$, and defining $\\WW_{UU} \\triangleq \\UU^T \\WW \\UU$, we get \\begin{align*} \\UU^T \\nabla L \\UU &amp;= -2 \\sqrt{\\SS_X}\\VV^T \\SS_Y \\VV \\sqrt{\\SS_X} + 2 \\SS_X^2 + 4 \\SS_X \\WW_{UU} \\SS_X + \\lambda \\WW_{UU}. \\end{align*}<\/p>\n\n\n\n<p>Defining \\begin{align*} \\RR &amp;\\triangleq \\VV^T \\SS_Y \\VV \\end{align*} and setting the gradient to zero, we get \\begin{align} \\sqrt{\\SS_X} (\\RR - \\SS_X) \\sqrt{\\SS_X} = 2 \\SS_X \\WW_{UU} \\SS_X + {\\lambda \\over 2} \\WW_{UU}. 
\\end{align}<\/p>\n\n\n\n<p>We can then solve for $\\WW_{UU}$ elementwise as \\begin{align*} W_{UU, ij} &amp;= {\\sqrt{S_i}(R_{ij} - S_i \\delta_{ij}) \\sqrt{S_j} \\over 2 S_i S_j + {\\lambda \\over 2} }\\\\ &amp;= {R_{ij} - S_i \\delta_{ij} \\over 2 \\sqrt{S_i S_j} + {\\lambda \\over 2 \\sqrt{S_i S_j}}} \\\\ &amp;={1 \\over 2} {R_{ij} - S_i \\delta_{ij} \\over \\sqrt{S_i S_j} + \\lambda_{ij}}, \\quad \\lambda_{ij} \\triangleq {\\lambda \\over 4 \\sqrt{S_i S_j}}. \\end{align*}<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Trying it out<\/h2>\n\n\n\n<p>The linearized solution works very well for large values of $\\lambda$. The value that gives the best validation error is $10^6$, but if we set $\\lambda$ much higher, to $10^9$, the approximation is excellent. For example, we can return to equation (1), which equals zero at convergence. Moving its first term to the left hand side, we can plot the values of that term against those of the right hand side:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"448\" height=\"461\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-14.png\" alt=\"\" class=\"wp-image-7174\" style=\"width:328px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-14.png 448w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-14-292x300.png 292w\" sizes=\"auto, (max-width: 448px) 100vw, 448px\" \/><\/figure>\n\n\n\n<p>This shows a very good fit. 
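<\/p>\n\n\n\n<p>As a concrete sketch, the closed-form elementwise solution above might be computed as follows. This is my own NumPy code, not the post&#8217;s; the shapes and the names <code>Sy<\/code> (for $\\SS_Y$) and <code>lam<\/code> (for $\\lambda$) are assumptions.<\/p>

```python
import numpy as np

# Sketch of the linearized closed-form solution. Assumed shapes:
# X is n x T with column means removed, Sy is the T x T target
# covariance, lam is the regularization weight.
def linearized_W(X, Sy, lam):
    # Thin SVD: X = U sqrt(S_X) V^T, so the squared singular
    # values s**2 are the diagonal of S_X.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    S = s ** 2
    R = Vt @ Sy @ Vt.T  # R = V^T S_Y V
    rs = np.sqrt(S)
    # Elementwise solve of
    # sqrt(S_X) (R - S_X) sqrt(S_X) = 2 S_X W_UU S_X + (lam / 2) W_UU.
    num = rs[:, None] * (R - np.diag(S)) * rs[None, :]
    den = 2.0 * S[:, None] * S[None, :] + lam / 2.0
    W_uu = num / den
    # Rotate back to the original basis: W = U W_UU U^T.
    return U @ W_uu @ U.T
```

<p>Since the denominator $2 S_i S_j + {\\lambda \\over 2}$ is positive for any positive $\\lambda$, the elementwise division is well defined even when singular values vanish.<\/p>\n\n\n\n<p>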
However, at the value that gave the best validation error, the fit is not nearly as good:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"465\" height=\"462\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-13.png\" alt=\"\" class=\"wp-image-7173\" style=\"aspect-ratio:1.0064883295260816;width:344px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-13.png 465w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-13-300x298.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-13-150x150.png 150w\" sizes=\"auto, (max-width: 465px) 100vw, 465px\" \/><\/figure>\n\n\n\n<p>Below I&#8217;ve overlaid the diagonal values of $\\WW$ in orange on the off-diagonal values in blue, on the same y-axis (the x-axes are different, since there are different numbers of elements). At the larger value of $\\lambda$, the magnitudes are similarly small:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"773\" height=\"284\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-15.png\" alt=\"\" class=\"wp-image-7176\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-15.png 773w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-15-300x110.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-15-768x282.png 768w\" sizes=\"auto, (max-width: 773px) 100vw, 773px\" \/><\/figure>\n\n\n\n<p>At the smaller, optimal value of $\\lambda$, the magnitudes of some of the diagonal elements become large:<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"773\" height=\"283\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-16.png\" alt=\"\" class=\"wp-image-7177\" 
srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-16.png 773w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-16-300x110.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-16-768x281.png 768w\" sizes=\"auto, (max-width: 773px) 100vw, 773px\" \/><\/figure>\n\n\n\n<p>This indicates that we need to extend the model of the weights to include a diagonal component. <\/p>\n\n\n\n<p>$$ \\blacksquare$$<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I linearize the covariance for the Free model, and find that I need to include an additional diagonal component.<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,166,165,148],"tags":[],"class_list":["post-7075","post","type-post","status-publish","format-standard","hentry","category-blog","category-bulb-io","category-iopaper","category-research"],"acf":[],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/7075","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=7075"}],"version-history":[{"count":72,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/7075\/revisions"}],"predecessor-version":[{"id":7452,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/7075\/revisions\/7452"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=7075"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":
"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=7075"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=7075"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}