{"id":561,"date":"2024-01-20T21:37:26","date_gmt":"2024-01-20T21:37:26","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=561"},"modified":"2025-12-28T15:34:06","modified_gmt":"2025-12-28T15:34:06","slug":"differentiating-scalar-functions-of-matrices-in-mathematica-in-22-easy-steps","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2024\/01\/20\/differentiating-scalar-functions-of-matrices-in-mathematica-in-22-easy-steps\/","title":{"rendered":"Differentiating scalar functions of matrices in 22 easy steps of Mathematica."},"content":{"rendered":"\n<p>I frequently need to differentiate loss functions with respect to matrices. I usually do this manually, which can be time-consuming and error-prone. Therefore I wanted to see if I could use Mathematica to compute these symbolic derivatives automatically. Mathematica does not have such functionality built in, but ChatGPT suggested achieving it using pattern matching. <\/p>\n\n\n\n<p>The loss function we&#8217;ll differentiate is $$ L(R) = {1 \\over 2} \\|A^T (J + R)^T (J+R) A - D\\|_F^2 + {\\lambda \\over 2}\\|R\\|_F^2.$$ <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Manual differentiation<\/h3>\n\n\n\n<p>Let&#8217;s first differentiate the loss by hand to see what the right answer is. Letting $H = A^T (J + R)^T (J+R) A - D$, and noting that $H^T = H$, we have $$ \\begin{align}L(R) &amp;= {1 \\over 2} \\tr(H H^T) + {\\lambda \\over 2}\\tr(R R^T)\\\\<br>dL &amp;= \\tr(dH^T H) + \\lambda \\tr(dR^T R)\\\\<br>&amp;= \\tr(2 A^T dR^T (J + R)A H) + \\lambda \\tr(dR^T R)\\\\<br>&amp;= \\tr(2 dR^T (J + R)A H A^T ) + \\lambda \\tr(dR^T R)\\\\<br>&amp;=\\tr(2 dR^T (J+R)A(A^T (J + R)^T (J+R) A - D)A^T) + \\lambda \\tr(dR^T R).<br>\\end{align}$$ Therefore,<br>$$\\boxed{\\nabla_R L= 2 (J+R)(A A^T (J + R)^T (J+R) A A^T -A DA^T) + \\lambda  R.}$$<br>That wasn&#8217;t so bad. 
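<\/p>\n\n\n\n<p>As a quick sanity check of the boxed formula, we can compare it against finite differences of $L$ at random matrices. Here is a minimal sketch (the dimensions and variable names are illustrative choices of mine; as in the code below, the matrix $D$ is named D1):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: perl; title: ; notranslate\" title=\"\">\n(* numerical check of the boxed gradient; all names here are illustrative *)\nn = 4; SeedRandom&#x5B;1];\n{A, J, R} = Table&#x5B;RandomReal&#x5B;{-1, 1}, {n, n}], {3}];\nD1 = DiagonalMatrix&#x5B;RandomReal&#x5B;{0, 1}, n]]; lambda = 1\/10;\nloss&#x5B;X_] := 1\/2 Total&#x5B;(Transpose&#x5B;A].Transpose&#x5B;J + X].(J + X).A - D1)^2, 2] + lambda\/2 Total&#x5B;X^2, 2]\ngrad = 2 (J + R).(A.Transpose&#x5B;A].Transpose&#x5B;J + R].(J + R).A.Transpose&#x5B;A] - A.D1.Transpose&#x5B;A]) + lambda R;\n(* entry-wise forward differences; the maximum discrepancy should be tiny *)\neps = 10.^-6;\nfd = Table&#x5B;(loss&#x5B;R + eps SparseArray&#x5B;{i, j} -&gt; 1, {n, n}]] - loss&#x5B;R])\/eps, {i, n}, {j, n}];\nMax&#x5B;Abs&#x5B;fd - grad]]\n<\/pre><\/div>\n\n\n<p>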
Let&#8217;s now try to get the same thing by applying transformation rules.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Semi-automatic differentiation<\/h3>\n\n\n\n<p>There are lots of ways to do this; below is just one, and probably not the most efficient one, given that I&#8217;m a Mathematica novice. Note that the matrix $D$ is written $D1$ in the code, since D is Mathematica&#8217;s built-in differentiation operator.<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: perl; title: ; notranslate\" title=\"\">\nsumOfSquares&#x5B;X_] := Tr&#x5B;X.Transpose&#x5B;X]];\nL&#x5B;R_] := 1\/2 sumOfSquares&#x5B;Transpose&#x5B;A].Transpose&#x5B;J + R].(J + R).A - D1] + \\&#x5B;Lambda]\/2 sumOfSquares&#x5B;R]\n<\/pre><\/div>\n\n\n<p>First we&#8217;ll apply a symbolic differential.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"182\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1024x182.png\" alt=\"\" class=\"wp-image-575\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1024x182.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-300x53.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-768x136.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1536x272.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image.png 1962w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next we&#8217;ll apply a rule that commutes sums with differentials.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"158\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1-1024x158.png\" alt=\"\" class=\"wp-image-576\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1-1024x158.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1-300x46.png 300w, 
https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1-768x118.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1-1536x237.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-1-2048x316.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We want to bring the 1\/2 factors out of the differentials, so we create a rule to commute differentials with scalars. The first line of the output below describes the rule, and the second line shows the result of its application.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"315\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-2-1024x315.png\" alt=\"\" class=\"wp-image-577\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-2-1024x315.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-2-300x92.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-2-768x236.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-2-1536x472.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-2.png 2022w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We want to do the same with the lambda at the end, so:<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"244\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-3-1024x244.png\" alt=\"\" class=\"wp-image-578\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-3-1024x244.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-3-300x71.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-3-768x183.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-3-1536x366.png 1536w, 
https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-3-2048x487.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We&#8217;re aiming at $\\tr[dR^T (\\dots)]$, so next we commute the trace and the differential.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"312\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-4-1024x312.png\" alt=\"\" class=\"wp-image-581\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-4-1024x312.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-4-300x91.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-4-768x234.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-4-1536x468.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-4.png 2022w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Our next aim is to turn this into the trace of a sum of differentials of dot products, e.g. $ \\sum_i \\tr[d(A_i B_i \\dots)]$. To do this, we&#8217;re going to apply a sequence of rules. 
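<\/p>\n\n\n\n<p>Since the rules appear only as screenshots here, the following sketch shows the general shape such rules can take in Mathematica (my reconstruction with made-up names, not the exact notebook code; $d$ is an inert head standing for the differential):<\/p>\n\n\n<div class=\"wp-block-syntaxhighlighter-code \"><pre class=\"brush: perl; title: ; notranslate\" title=\"\">\n(* illustrative rewrite rules; the names are made up *)\ndOverSum = d&#x5B;a_ + b_] :&gt; d&#x5B;a] + d&#x5B;b];  (* d(A+B) = dA + dB *)\ndOverScalar = d&#x5B;k_ a_] :&gt; k d&#x5B;a] \/; NumericQ&#x5B;k];  (* d(kA) = k dA *)\ntransposeOverSum = Transpose&#x5B;a_ + b_] :&gt; Transpose&#x5B;a] + Transpose&#x5B;b];\ntransposeTwice = Transpose&#x5B;Transpose&#x5B;a_]] :&gt; a;\ndotOverSum = Dot&#x5B;pre___, b_ + c_, post___] :&gt; Dot&#x5B;pre, b, post] + Dot&#x5B;pre, c, post];\n(* rules are applied with \/. (once) or \/\/. (until the expression stops changing), e.g. *)\n(* expr \/\/. dOverSum \/\/. transposeOverSum \/\/. dotOverSum *)\n<\/pre><\/div>\n\n\n<p>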
The first expresses that $(A + B)^T = A^T + B^T$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"169\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-5-1024x169.png\" alt=\"\" class=\"wp-image-583\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-5-1024x169.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-5-300x49.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-5-768x126.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-5-1536x253.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-5.png 1896w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The next distributes the dot product over sums, $A.(B+C) = A.B+ A.C$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"186\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-6-1024x186.png\" alt=\"\" class=\"wp-image-584\" style=\"width:537px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-6-1024x186.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-6-300x54.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-6-768x139.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-6-1536x279.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-6.png 1620w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The next expresses $(A^T)^T = A$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"274\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-7-1024x274.png\" alt=\"\" class=\"wp-image-585\" 
style=\"width:409px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-7-1024x274.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-7-300x80.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-7-768x206.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-7.png 1090w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The next commutes transposing with scalar multiplication, i.e. $(k A)^T = k A^T$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"118\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-8-1024x118.png\" alt=\"\" class=\"wp-image-586\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-8-1024x118.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-8-300x35.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-8-768x89.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-8-1536x177.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-8-2048x236.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Finally, the last distributes the differential over sums, $d(A+B) = dA + dB$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"218\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-9-1024x218.png\" alt=\"\" class=\"wp-image-587\" style=\"width:431px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-9-1024x218.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-9-300x64.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-9-768x163.png 768w, 
https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-9.png 1402w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We will now apply these in the order I found by trial and error to be effective. The &#8216;\/.&#8217; applies the rule once; &#8216;\/\/.&#8217; applies it repeatedly until the expression doesn&#8217;t change.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"268\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-10-1024x268.png\" alt=\"\" class=\"wp-image-589\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-10-1024x268.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-10-300x78.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-10-768x201.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-10-1536x402.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-10-2048x536.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The next rule distributes transpose over dot products, $(A B)^T = B^T A^T$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"239\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-11-1024x239.png\" alt=\"\" class=\"wp-image-590\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-11-1024x239.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-11-300x70.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-11-768x179.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-11-1536x358.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-11-2048x477.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" 
\/><\/figure>\n\n\n\n<p>The next rule extracts out the annoying minus signs from dot products: $A B (-C)D = -ABCD$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"209\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-12-1024x209.png\" alt=\"\" class=\"wp-image-591\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-12-1024x209.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-12-300x61.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-12-768x157.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-12-1536x313.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-12-2048x418.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next we apply Leibniz&#8217;s rule to expand out the differentials, $d(AB) = d(A)B + A d(B)$. This produces a very long output which I&#8217;ve truncated.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"246\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-13-1024x246.png\" alt=\"\" class=\"wp-image-592\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-13-1024x246.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-13-300x72.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-13-768x184.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-13-1536x368.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-13-2048x491.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Our next aim is to move all the transposed differentials to be left-most in their terms, and the un-transposed differentials to be right-most, using the circular property 
of the trace, $\\tr(AB) = \\tr(BA)$. <\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"331\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-14-1024x331.png\" alt=\"\" class=\"wp-image-595\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-14-1024x331.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-14-300x97.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-14-768x248.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-14-1536x496.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-14-2048x662.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We next apply a transpose to end up with all terms having transposed differentials at the left-most position. We can then read off their contributions to the gradient as whatever the transposed differentials are being dot-producted against.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"227\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-15-1024x227.png\" alt=\"\" class=\"wp-image-596\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-15-1024x227.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-15-300x66.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-15-768x170.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-15-1536x340.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-15-2048x453.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We only care about differentials of $R$, so we now zero-out the other terms.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" 
decoding=\"async\" width=\"1024\" height=\"361\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-16-1024x361.png\" alt=\"\" class=\"wp-image-598\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-16-1024x361.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-16-300x106.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-16-768x271.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-16-1536x542.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-16-2048x722.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Rather than having one big trace, we distribute the trace over sums, $\\tr(A+B) = \\tr(A) + \\tr(B)$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"289\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-17-1024x289.png\" alt=\"\" class=\"wp-image-599\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-17-1024x289.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-17-300x85.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-17-768x217.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-17-1536x434.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-17-2048x579.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We pull out the constant terms by commuting trace with scalar multiplication, $\\tr(a X) = a \\tr(X)$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"286\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-18-1024x286.png\" alt=\"\" class=\"wp-image-600\" 
srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-18-1024x286.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-18-300x84.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-18-768x214.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-18-1536x429.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-18-2048x571.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We can now extract the gradient as the sum of the terms being dot-producted against $d(R)^T$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"252\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-19-1024x252.png\" alt=\"\" class=\"wp-image-601\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-19-1024x252.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-19-300x74.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-19-768x189.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-19-1536x378.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-19-2048x503.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>$D1$ is a diagonal matrix, so $D1^T = D1$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"177\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-20-1024x177.png\" alt=\"\" class=\"wp-image-602\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-20-1024x177.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-20-300x52.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-20-768x133.png 768w, 
https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-20-1536x266.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-20-2048x355.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next we&#8217;re going to collect some dot products. First, using $AB + AC = A(B+C)$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"223\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-21-1024x223.png\" alt=\"\" class=\"wp-image-603\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-21-1024x223.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-21-300x65.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-21-768x167.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-21-1536x335.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-21-2048x446.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next, using $BA + CA = (B+C)A$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"258\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-22-1024x258.png\" alt=\"\" class=\"wp-image-604\" style=\"width:469px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-22-1024x258.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-22-300x76.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-22-768x194.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-22-1536x387.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-22.png 1848w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next, that $k BA + k CA = k(B + 
C)A$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"190\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-23-1024x190.png\" alt=\"\" class=\"wp-image-605\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-23-1024x190.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-23-300x56.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-23-768x143.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-23-1536x285.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-23-2048x380.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Next, that $ p A B + q A C =  A (p B + qC)$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"183\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-24-1024x183.png\" alt=\"\" class=\"wp-image-606\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-24-1024x183.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-24-300x54.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-24-768x137.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-24-1536x274.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-24-2048x365.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Finally, we collect transposes, $A^T + B^T = (A+B)^T$.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"165\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-25-1024x165.png\" alt=\"\" class=\"wp-image-607\" 
srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-25-1024x165.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-25-300x48.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-25-768x124.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-25-1536x247.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2024\/01\/image-25-2048x329.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Ideally, I&#8217;d pull out the factor of 2 as well, but couldn&#8217;t quite get that to work. Nevertheless, this is the same equation we derived manually above.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Summary<\/h3>\n\n\n\n<p>While we did get the right derivative in the end, the process was very ad hoc and even more laborious than manual differentiation. Having to hard-code the rules was interesting because it made me realise how many of these rules we apply, in problem-specific order, to arrive at the gradient. I suppose that&#8217;s part of the fun! This was an interesting exercise, but I think I&#8217;ll continue to derive my gradients manually for now.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I frequently need to differentiate loss functions with respect to matrices. I usually do this manually, which can be time-consuming and error-prone. Therefore I wanted to see if I could use Mathematica to compute these symbolic derivatives automatically. Mathematica does not have such functionality built in, but ChatGPT suggested achieving it using pattern matching. 
The [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,152],"tags":[21,20,22],"class_list":["post-561","post","type-post","status-publish","format-standard","hentry","category-blog","category-post","tag-differentiation","tag-mathematica","tag-pattern-matching"],"acf":[],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/561","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=561"}],"version-history":[{"count":25,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/561\/revisions"}],"predecessor-version":[{"id":612,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/561\/revisions\/612"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=561"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=561"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=561"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}