{"id":7180,"date":"2026-03-10T17:53:18","date_gmt":"2026-03-10T17:53:18","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=7180"},"modified":"2026-03-10T17:54:13","modified_gmt":"2026-03-10T17:54:13","slug":"linearizing-covariance-for-the-free-model-part-ii","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2026\/03\/10\/linearizing-covariance-for-the-free-model-part-ii\/","title":{"rendered":"Linearizing Covariance for the Free Model, Part II"},"content":{"rendered":"\n<p>In the previous post we saw that linearzing the covariance for the Free connectivity model gives good results for high values of the regularization term $\\lambda$, but not for the lower values which give the best validation error. At those values, the diagonal component of the weights starts becoming large. Below I&#8217;ve overlaid this diagonal component (orange) on the off-diagonal component (blue):<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"773\" height=\"283\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-17.png\" alt=\"\" class=\"wp-image-7181\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-17.png 773w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-17-300x110.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2026\/03\/image-17-768x281.png 768w\" sizes=\"auto, (max-width: 773px) 100vw, 773px\" \/><\/figure>\n\n\n\n<p>This suggests modeling the connectivity as $$\\ZZ = \\II + \\WW + \\DD.$$ We&#8217;ll assume that $\\WW$ is small so that $\\WW^T \\WW \\approx \\bzero$. We will also ignore products of $\\WW \\DD$ for simplicity, but keep $\\DD^2$.<\/p>\n\n\n\n<p>Our loss is then \\begin{align*} L(\\WW) &amp;= {1 \\over 2} \\|\\SS &#8211; \\XX^T (\\II + \\WW + \\DD)^T \\JJ (\\II + \\WW + \\DD) \\XX\\|^2_F + {\\lambda \\over 2}\\|\\WW + \\DD\\|_F^2.\\end{align*}<\/p>\n\n\n\n<p>The inner term multiplying $\\JJ$ expands to $$ \\JJ + \\JJ \\WW + \\JJ \\DD  + \\WW^T \\JJ + \\WW^T \\JJ \\WW + \\WW^T \\JJ \\DD + \\DD^T \\JJ  + \\DD^T \\JJ \\WW + \\DD^T \\JJ \\DD.$$<\/p>\n\n\n\n<p>If we assume that for typical elements of $\\WW$ and $\\DD$ that $ |W| &lt; |D| &lt; 1,$ then, in addition to $\\WW^T \\WW$, we can hope to ignore products of $\\WW$ and $\\DD$, $$ \\approx \\JJ + \\JJ \\WW + \\JJ \\DD  + \\WW^T \\JJ  + \\DD^T \\JJ  +  \\DD^T \\JJ \\DD.$$<\/p>\n\n\n\n<p>Let&#8217;s additionally approximate $\\DD^T \\JJ \\DD$ by just $\\DD^2$, to arrive at $$ \\approx \\JJ + \\JJ \\WW + \\JJ \\DD  + \\WW^T \\JJ  + \\DD^T \\JJ  +  \\DD^2.$$<\/p>\n\n\n\n<p>Then, using that $\\JJ \\XX = \\XX$ for our centered data, and that $\\WW$ will be symmetric at the optimum, we get $$ \\XX^T (\\dots) \\XX \\approx \\XX^T \\XX + 2 \\XX^T \\WW \\XX + 2 \\XX^T \\DD \\XX + \\XX^T \\DD^2 \\XX.$$<\/p>\n\n\n\n<p>Approximate our regularizer similarly, our approximate loss becomes $$ L(\\WW, \\DD) \\approx {1 \\over 2} \\|\\bE_0 &#8211; 2 \\XX^T \\WW \\XX &#8211; 2 \\XX^T \\DD \\XX &#8211; \\XX^T \\DD^2 \\XX\\|_F^2 + {\\lambda \\over 2}\\|\\WW\\|_2^2 + {\\lambda \\over 2} \\|\\DD\\|_F^2, \\quad \\bE_0 \\triangleq \\SS &#8211; \\XX^T \\XX. 
At the optimum the gradient with respect to $\WW$ is then \begin{align*}\bzero = L_\WW &= -2\XX \bE \XX^T + \lambda \WW \\ &= -2\XX (\bE_0 - 2 \XX^T \WW \XX - 2 \XX^T \DD \XX - \XX^T \DD^2 \XX)\XX^T + \lambda \WW \\ 2 \XX \bE_0 \XX^T - 2 \XX \XX^T (2 \DD + \DD^2) \XX \XX^T &= 4 \XX \XX^T \WW \XX \XX^T + \lambda \WW \\ 2 \XX \SS \XX^T - 2\XX \XX^T (\II + \DD)^2 \XX \XX^T &= 4\XX \XX^T \WW \XX \XX^T + \lambda \WW. \end{align*}

Decomposing $\XX = \UU \sqrt{\SS_X} \VV^T$, the above equation becomes $$ 2 \UU \sqrt{\SS_X} \VV^T \SS \VV \sqrt{\SS_X} \UU^T - 2 \UU \SS_X \UU^T (\II + \DD)^2 \UU \SS_X \UU^T = 4 \UU \SS_X \UU^T \WW \UU \SS_X \UU^T + \lambda \WW.$$

Multiplying the above equation on the left by $\UU^T$ and on the right by $\UU$, and writing $\WW_{UU} \triangleq \UU^T \WW \UU$ (and similarly for other matrices), we get \begin{align} \label{main} \sqrt{\SS_X} \VV^T \SS \VV \sqrt{\SS_X} = \SS_X(\II + \DD)_{UU}^2 \SS_X + 2 \SS_X \WW_{UU} \SS_X + {\lambda \over 2}\WW_{UU},\end{align} which reduces to Eqn. 2 in our [previous linearization attempt](https://sinatootoonian.com/index.php/2026/03/09/linearizing-the-covariance-loss-for-the-free-model/) when $\DD = \bzero$.

For the gradient with respect to $\DD$, we'll regroup terms in our loss and write it as $$ L(\DD) = {1 \over 2} \|\SS - 2 \XX^T \WW \XX - \XX^T (\II + \DD)^2 \XX \|_F^2 + {\lambda_D \over 2}\|\DD\|_F^2,$$ where we've generalized to allow $\DD$ its own regularizer.

The differential is then $$ dL = -4\operatorname{tr}\left(\bE \, \XX^T (\II + \DD) \, d\DD \, \XX\right) + \lambda_D \operatorname{tr}(\DD \, d\DD),$$ from which we can read off the gradient as $$ L_\DD = -4 (\II + \DD) \XX \bE \XX^T + \lambda_D \DD,$$ where we've treated $\DD$ as a full matrix; the actual gradient would take the diagonal of this. Setting that diagonal to zero gives $$ [\XX \bE \XX^T] = {\lambda_D \over 4} {\DD \over \bone + \DD},$$ where $[\cdot]$ denotes the diagonal part and the division is elementwise.

The condition from the gradient on $\WW$ tells us that $2 \XX \bE \XX^T = \lambda \WW$. Substituting this in gives \begin{align*} 2 \lambda (\II + \DD) \WW &= \lambda_D \DD, \end{align*} which rearranges to $$ \DD = {2 \lambda \WW \over \lambda_D - 2\lambda \WW}.$$

In our standard regime above, $\lambda_D = \lambda$, so this simplifies to $$\DD = {2 \WW \over 1 - 2 \WW}.$$

This still couples $\DD$ to $\WW$ in a potentially awkward way. The Chatbots also pointed out that there's an identifiability issue here, with $\DD$ and $\WW$ overlapping on the diagonal, hence the coupling above.

An obvious way around this is to enforce $W_{ii}=0$. The standard way to do this is with Lagrange multipliers, but that will get messy. So instead we can try not regularizing $\DD$. In that case, it should mop up the diagonal contribution of $\WW$, and $W_{ii}$ should go to zero through its own regularization.

With $\lambda_D = 0$, the gradient equation for $\DD$ becomes $$ -4(\II + \DD)\XX \bE \XX^T = -2\lambda (\II + \DD)\WW = \bzero \implies (\II + \DD)\WW = \bzero,$$ which has two solutions: $D_{ii} = -1$, which incidentally is the limiting value as $\lambda_D \to 0$ above, or $W_{ii} = 0$, leaving $D_{ii}$ free. The latter is what we should get, given the regularization of $\WW$.
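Stepping back to the $\lambda_D = \lambda$ case, the two stationarity conditions suggest an alternating scheme: solve Eqn. \ref{main} for $\WW_{UU}$ entry by entry (possible because $\SS_X$ is diagonal, so the equation decouples), then update the diagonal via $D_{ii} = 2 W_{ii}/(1 - 2 W_{ii})$. Below is a rough NumPy sketch of that idea, with names of my own choosing, treating $\UU$ as complete; whether the alternation actually converges is untested, so this is only to make the equations concrete:

```python
import numpy as np

def update_W(X, S, d, lam):
    """Closed-form W given the diagonal d: solve Eqn. (main) entrywise
    in the basis of the SVD X = U sqrt(S_X) V^T."""
    U, sv, Vt = np.linalg.svd(X, full_matrices=False)
    s = sv**2                                     # diagonal of S_X
    M = Vt @ S @ Vt.T                             # V^T S V
    G = U.T @ (((1.0 + d)**2)[:, None] * U)       # ((I + D)^2)_UU
    ss = np.outer(s, s)                           # s_i * s_j
    W_UU = (np.sqrt(ss) * M - ss * G) / (2.0 * ss + lam / 2.0)
    return U @ W_UU @ U.T                         # back to the ambient basis

def update_D(W):
    """Diagonal update D = 2W / (1 - 2W), elementwise on the diagonal."""
    w = np.diag(W)
    return 2.0 * w / (1.0 - 2.0 * w)

def alternate(X, S, lam, n_iters=20):
    """Alternate the two closed-form updates, starting from D = 0."""
    d = np.zeros(X.shape[0])
    for _ in range(n_iters):
        W = update_W(X, S, d, lam)
        d = update_D(W)
    return W, d
```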
---

I went back to check the approximation in the $\lambda_D = \lambda$ case, where $D_{ii} = 2 W_{ii}/(1 - 2 W_{ii})$. We need to set the values of $\WW$ and $\DD$ from some learned $\ZZ$. Since $\ZZ = \II + \WW + \DD$, the off-diagonal terms are simply $$W_{ij} = Z_{ij}, \quad i \neq j.$$ The on-diagonal terms satisfy $$ r_i \triangleq z_i - 1 = w_i + {2 w_i \over 1 - 2 w_i}.$$ Then $$ r_i (1 - 2 w_i) = w_i (1 - 2 w_i) + 2 w_i.$$

Rearranging, we get $$ 2 w_i^2 - (2 r_i + 3) w_i + r_i = 0.$$ We can solve this using the quadratic formula to get an expression for $w_i$; the negative root seems to give the right behaviour, sending $w_i \to 0$ as $r_i \to 0$. Once we have an expression for $w_i$, we can use it to get one for $d_i$.
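For reference, a small sketch of that recovery step (names mine), taking the negative root. The discriminant $(2r_i+3)^2 - 8r_i = (2r_i+1)^2 + 8$ is always positive, so the root is always real:

```python
import numpy as np

def split_diagonal(z):
    """Recover (w_i, d_i) from the learned diagonal z_i of Z = I + W + D,
    via d = 2w / (1 - 2w) and the quadratic 2w^2 - (2r + 3)w + r = 0."""
    r = np.asarray(z) - 1.0
    disc = (2.0 * r + 3.0)**2 - 8.0 * r            # = (2r + 1)^2 + 8 > 0
    w = ((2.0 * r + 3.0) - np.sqrt(disc)) / 4.0    # negative root: w -> 0 as r -> 0
    d = 2.0 * w / (1.0 - 2.0 * w)
    return w, d
```

By construction `w + d` reproduces `r`, which makes for a quick sanity check.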
At that point $\WW$ and $\DD$ are completely specified, and we can plug them into Eqn. \ref{main} to see if it holds.

The fit seems OK, except for one large value where there's a big discrepancy:

![Checking Eqn. \ref{main}](https://sinatootoonian.com/wp-content/uploads/2026/03/image-19.png)

Those large values are precisely the ones we're trying to capture…

$$\blacksquare$$