{"id":857,"date":"2024-04-08T15:28:48","date_gmt":"2024-04-08T14:28:48","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=857"},"modified":"2025-12-27T16:04:39","modified_gmt":"2025-12-27T16:04:39","slug":"atick-and-redlich-1993","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2024\/04\/08\/atick-and-redlich-1993\/","title":{"rendered":"Notes on Atick and Redlich 1993"},"content":{"rendered":"\n<p>In their <a href=\"https:\/\/direct.mit.edu\/neco\/article-abstract\/5\/1\/45\/5690\/Convergent-Algorithm-for-Sensory-Receptive-Field\">1993 paper<\/a> Atick and Redlich consider the problem of learning receptive fields that optimize information transmission. They consider a linear transformation of a vector of retinal inputs $s$ to ganglion cell outputs of the same dimension $$y = Ks.$$ They aim to find a biologically plausible learning rule that will use the input statistics to find weights $K$ that optimize a particular loss function. Below we state the loss function and motivate it from the perspective of efficient coding. We will then work through their derivation of the learning rule.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Motivating the objective function<\/h3>\n\n\n\n<p>The loss function they minimize is $$E(K) = \\tr(K R K^T) &#8211; \\rho \\log \\det(K^T K).\\label{obj}\\tag{1}$$ Here the matrix $R = \\langle s s^T\\rangle$ contains the input correlations computed over time, and $\\rho$ is a hyperparameter. <\/p>\n\n\n\n<h5 class=\"wp-block-heading\">The first term: reducing signal energy<\/h5>\n\n\n\n<p>To motivate this objective from the perspective of efficient coding, we note that the first term is the expected energy of the outputs. This is because \\begin{align} \\tr(K R K^T) &amp;= \\tr(K \\langle s s^T \\rangle K^T)\\\\ &amp;= \\tr( \\langle K s s^T K^T\\rangle)\\\\ &amp;= \\tr(\\langle y y^T \\rangle)\\\\ &amp;= \\tr(\\langle y^T y \\rangle)\\\\&amp;= \\langle y^T y \\rangle.\\end{align}<\/p>\n\n\n\n<p>If this was the only term in the objective then we could minimize it to its lowest possible value of zero by setting $K = 0$. The problem, of course, is that the resulting output, 0 for every input, also discards all the information present in the input signal. <\/p>\n\n\n\n<h5 class=\"wp-block-heading\">The second term: preserving information<\/h5>\n\n\n\n<p>To minimize the output energy while preventing information loss, we use the second term in the objective. To see why this works, we apply the singular value decomposition to express $K$ as $U \\Lambda V^T$. The energy of an output $y$ in response to an input $s$ can be expressed in terms of this decomposition as $$y^Ty = (U \\Lambda V^T s)^T U \\Lambda V s = s^T V \\Lambda U^T U \\Lambda V^T s = s^T V \\Lambda^2 V^Ts.$$ The first thing to note is that the left singular vectors $U$ are absent from this expression. This reflects the fact that rotating the output $y$ does not change its energy.<\/p>\n\n\n\n<p>Continuing, we can view $V^T s$ as rotating the input signal to produce $\\tilde s$. In this rotated coordinate system, the energy of the transformed signal $y$ is just the sum of the energies of the rotated input signal along each of its dimensions, weighted by the squared singular values. That is, $$ y^T y = s^T V \\Lambda^2 V^T s = \\tilde s^T \\Lambda^2 \\tilde s = \\sum_i \\Lambda_i^2 \\tilde s_i^2.$$ Since rotations preserve information, the only way that information could be lost is if one of the squared singular values, $\\Lambda_i^2$, is zero. 
If this were the only term in the objective, we could minimize it to its lowest possible value of zero by setting $K = 0$. The problem, of course, is that the resulting output, 0 for every input, also discards all the information present in the input signal.

##### The second term: preserving information

To minimize the output energy while preventing information loss, we use the second term in the objective. To see why this works, we apply the singular value decomposition to express $K$ as $U \Lambda V^T$. The energy of an output $y$ in response to an input $s$ can be expressed in terms of this decomposition as $$y^Ty = (U \Lambda V^T s)^T U \Lambda V^T s = s^T V \Lambda U^T U \Lambda V^T s = s^T V \Lambda^2 V^Ts.$$ The first thing to note is that the left singular vectors $U$ are absent from this expression. This reflects the fact that rotating the output $y$ does not change its energy.

Continuing, we can view $V^T s$ as rotating the input signal to produce $\tilde s$. In this rotated coordinate system, the energy of the output $y$ is just the sum of the energies of the rotated input signal along each of its dimensions, weighted by the squared singular values. That is, $$ y^T y = s^T V \Lambda^2 V^T s = \tilde s^T \Lambda^2 \tilde s = \sum_i \Lambda_i^2 \tilde s_i^2.$$ Since rotations preserve information, the only way information could be lost is if one of the squared singular values, $\Lambda_i^2$, is zero. To prevent this we can penalize small singular values by their negative logarithm, which diverges to $\infty$ as its argument approaches zero, penalizing small values as we require. The second term of the objective is proportional to the sum of such penalties over all the singular values, since
\begin{align} -\log \det(K^T K) &= -\log \det(V \Lambda^2 V^T)\\ &= -\log \det(\Lambda^2)\\ &= -\log \prod_i \Lambda_i^2\\ &= -2\sum_i \log \Lambda_i.\end{align} We can thus view the objective function as promoting energy efficiency while avoiding information loss,
$$ E(K) = \underbrace{\tr(K R K^T)}_{\text{energy minimization}} \quad \underbrace{- \rho \log \det(K^T K)}_{\text{information preservation}}, $$ with the parameter $\rho$ determining the balance between the two.

## Convexity of the objective

To determine the optimal transformation that minimizes the output energy while preserving the information in the input signal, we minimize the loss $E(K)$. The first thing to note is that the loss depends on $K$ only through $K^T K$, since by the cyclic property of the trace, $$ E(K) = \tr(K R K^T) - \rho \log \det(K^T K) = \tr(R K^T K) - \rho \log \det(K^T K).$$ Therefore there is rotational redundancy in $K$: the change $K \to U K$ for any rotation $U$ produces the same $K^T K$ and hence the same loss. As mentioned above, this reflects the fact that rotating the outputs changes neither their energy nor their information content.

We can write the loss in terms of $G = K^T K$ as $$ E(G) = \tr(R G) - \rho \log \det(G).$$ Since the trace term is linear (hence convex) and $-\log \det$ is strictly convex in $G$, $E(G)$ is a strictly convex function of $G$, minimized over a convex domain, the set of positive definite matrices. Therefore the loss function has a unique global minimum.

## Optimality condition

To determine the value of $G$ that achieves this minimum, we compute the gradient $$ \nabla_G E = R^T - \rho G^{-1} = R - \rho G^{-1} = 0 \implies R G = \rho I,$$ where we've used that $\nabla_X \log \det(X) = X^{-1}$ for symmetric $X$ and that the correlation matrix $R$ is symmetric. We see that the optimal $G$ is proportional to the inverse of $R$, and is therefore unique.

The optimality condition $R G = \rho I$ in terms of $K$ is (setting $\rho = 1$ for convenience) $$ R K^T K = I \implies K R K^T K = K.$$ Right-multiplying by $K^{-1}$ we arrive at the *optimality condition* $$\boxed{K R K^T = I. \tag{2}\label{opt}}$$ Now $K R K^T = K \cov(s) K^T = \cov(Ks) = \cov(y)$ for zero-mean inputs. Therefore, an optimal $K$ *whitens* the output: it decorrelates the channels and equalizes their variances.

The rotational redundancy in $K$ is also present in the optimality condition, since $K \to U K$ still whitens the output because $$ U K R K ^T U^T = U U^T = I.$$
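To make the whitening condition concrete, here is a small numpy sketch (with $\rho = 1$; the construction of $R$ is my own) that builds the symmetric solution $K = R^{-1/2}$ and checks that it, and any rotation $UK$ of it, satisfies $\Eqn{opt}$.

```python
import numpy as np

# Illustrative sketch, not from the paper: check K R K^T = I for K = R^{-1/2}.
rng = np.random.default_rng(1)
n = 4

A = rng.standard_normal((n, n))
R = A @ A.T + n * np.eye(n)             # a random positive-definite correlation

evals, V = np.linalg.eigh(R)
K = V @ np.diag(evals**-0.5) @ V.T      # symmetric whitening filter R^{-1/2}
print(np.allclose(K @ R @ K.T, np.eye(n)))               # True: output is white

U, _ = np.linalg.qr(rng.standard_normal((n, n)))         # a random rotation
print(np.allclose((U @ K) @ R @ (U @ K).T, np.eye(n)))   # True: U K also whitens
```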
##### Optimizing connectivity directly

Instead of working with $K^T K$, we can find optimal $K$'s by differentiating the loss with respect to $K$ itself. To compute the gradient, we compute the differential
\begin{align} dE &= \tr(dK\, R K^T) + \tr(K R\, dK^T) - \tr((K^T K)^{-1}\, d(K^T K))\\ &=\tr(R K^T dK) + \tr(dK^T K R) - \tr((K^T K)^{-1}(dK^T K + K^T dK)),\end{align} from which we read out the gradient as $$ \nabla_K E = 2 (KR - K (K^T K)^{-1}) = 2 (KR - K^{-T}),\label{Ek}\tag{3}$$ where in the last equality we've used that $K$ is square and invertible. Setting the gradient to zero we get $$ KR - K^{-T} = 0 \implies KR K^T = I,$$ the whitening condition in $\Eqn{opt}$ that we derived above.

## Implementation in a feedforward circuit

At this point it's useful to think about how all of this would occur in a neural circuit.

![A feedforward circuit: inputs project to output units through all-to-all weights K](https://sinatootoonian.com/wp-content/uploads/2024/02/Notiz-19-02-2024-3-3-1024x780.png)

A simple circuit that implements the transformation described so far is the feedforward circuit shown above. Sensory inputs $s_1$ to $s_3$ excite the linear output units $y_1$ to $y_3$ through feedforward synaptic weights $K_{ij}$. The dynamics of the output units asymptotically converge to $y_i = \sum_{j} K_{ij} s_j$, or in vector/matrix notation, $y = Ks$.

We're interested in somehow learning the weights $K$ that optimize the objective in $\Eqn{obj}$. An obvious idea is to descend the gradient in $\Eqn{Ek}$ directly. This gives a learning rule like $$\delta K \propto -\nabla_K E = -KR + K^{-T}.\label{rule1}\tag{4}$$ Unfortunately this rule is not biologically plausible: both terms on the right-hand side make it *non-local*, in that updating the strength of a synapse requires more information than is present at that synapse.

This is easier to see if we look at the rule element-wise and consider the update to the synapse $K_{ij}$ that neuron $i$ receives from input $j$. We would want such a rule to involve only quantities related to units $i$ and $j$: for example the synaptic strength $K_{ij}$ itself, the output $y_i$ of neuron $i$, the input $s_j$ to neuron $j$, and so on.

Instead, we have $$ \delta K_{ij} = -\sum_{n} K_{in} R_{nj} + (K^{-T})_{ij}.\label{rule1el}\tag{5}$$ The first term is already problematic because the weighted sum over $K_{in}$ involves the synapses from *all* inputs to neuron $i$. But this might not be such a problem, since those quantities could still somehow be available at the soma of neuron $i$. Or you might get lucky and your inputs arrive decorrelated. In that case $$R_{ij} \propto \delta_{ij} \implies \sum_{n} K_{in} R_{nj} \propto \sum_n K_{in} \delta_{nj} \propto K_{ij},$$ which is local. You could also try something more sophisticated and use the update $$ \delta K \propto -\nabla_K E\, R^{-1} = -K + K^{-T} R^{-1}.$$ This is still a descent direction (more on this below), so it would still minimize the objective. It would also certainly solve the problem with the first term, since now $\delta K_{ij} = -K_{ij} + \dots$ It also doesn't require that the inputs be decorrelated, only that $R$ be invertible, which we're assuming it is.

The real problem is with the second term, which we've now exacerbated with the $R^{-1}$ factor. Returning to the element-wise expression in $\Eqn{rule1el}$, we see that the update to the synapse $K_{ij}$ requires knowing the corresponding element of the transpose of the inverse of $K$. From [Wikipedia](https://en.wikipedia.org/wiki/Invertible_matrix#Methods_of_matrix_inversion), this element is $$(K^{-T})_{ij} = {C_{ji} \over \det(K)}, \quad C_{ji} = (-1)^{j+i} M_{ji},$$ where the *minor* $M_{ji}$ is the determinant of the submatrix left after removing the $j$th row and $i$th column.

If this sounds complicated, it is: determinants, minors, and so on are global properties of a matrix. In synaptic terms, it means that to update $K_{ij}$ we need to know some complicated function of *all* the other synapses in the system, which is certainly not plausible.
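Although it isn't biologically plausible, $\Eqn{rule1}$ is a perfectly good numerical optimizer. Here is a minimal sketch (the step size, initialization, and iteration count are my choices) that descends $\Eqn{Ek}$ with $\rho = 1$ and recovers the whitening condition:

```python
import numpy as np

# Illustrative sketch, not from the paper: gradient descent on E(K), rho = 1.
rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
R = A @ A.T / n + np.eye(n)             # positive-definite input correlations

K = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # initial weights
eta = 0.01                                          # learning rate (my choice)

for _ in range(20_000):
    K += eta * (-K @ R + np.linalg.inv(K).T)        # delta K = -K R + K^{-T}

print(np.allclose(K @ R @ K.T, np.eye(n), atol=1e-6))  # True: output whitened
```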
## An equivalent recurrent circuit

At this point we're stuck: we have a well-motivated objective function in $\Eqn{obj}$ to evaluate feedforward connectivity $K$, but the learning rule we have in $\Eqn{rule1}$ is non-local, no matter how we have tried to finesse it. Atick and Redlich get around this problem by trying a different circuit.

To every feedforward circuit with (invertible) connectivity $K$, we can associate a recurrent circuit that produces the same asymptotic response to a given input. To see this, consider the circuit below:

![An equivalent recurrent circuit: 1-to-1 feedforward connections plus lateral inhibition B between output units](https://sinatootoonian.com/wp-content/uploads/2024/02/Notiz-19-02-2024-3-Kopie-4-1024x780.png)

The all-to-all feedforward connections have been replaced with 1-to-1 connections from each input channel to a corresponding output unit. In contrast to the feedforward circuit, in which output units did not interact, in this recurrent circuit all output units interact, inhibiting each other with strength $B_{ij}$ (only some of the connections are shown, for clarity). The circuit dynamics are $$ \dot y = -y + s - B y.$$

To see the connection to the feedforward circuit, we determine the asymptotic activity by setting the dynamics to zero. This gives $$y + B y = s \implies (I + B) y = s \implies y = (I + B)^{-1} s.$$ So, as far as asymptotic activity is concerned (and that's all we're concerned with here), a recurrent circuit with connectivity $B$ corresponds to a feedforward circuit with connectivity $K = (I + B)^{-1}$. If we set $W = I + B$, then $K = W^{-1}$. In other words, a feedforward circuit with connectivity $K$ maps uniquely onto a recurrent circuit with connectivity $W = K^{-1}$.
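A quick numerical check of this equivalence (the Euler step size and weight scale are my choices): integrating the recurrent dynamics to steady state reproduces the feedforward response $(I+B)^{-1} s$.

```python
import numpy as np

# Illustrative sketch, not from the paper: recurrent steady state = (I+B)^{-1} s.
rng = np.random.default_rng(3)
n = 4
B = 0.3 * rng.standard_normal((n, n))   # recurrent inhibitory weights
np.fill_diagonal(B, 0.0)                # no self-connections
s = rng.standard_normal(n)              # a fixed input

y, dt = np.zeros(n), 0.01
for _ in range(10_000):                 # Euler-integrate dy/dt = -y + s - B y
    y += dt * (-y + s - B @ y)

print(np.allclose(y, np.linalg.solve(np.eye(n) + B, s)))  # True
```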
## Learning in the recurrent circuit

The problematic term in our feedforward learning rule was the inverse transpose, $K^{-T}$. We have just seen that a feedforward circuit $K$ maps onto a recurrent circuit with the inverse connectivity $W = K^{-1}$, so there is hope that our learning rule $\Eqn{rule1}$, expressed in terms of $W$, might end up being local.

Rather than transforming $\Eqn{rule1}$ directly, they use the update (defining $E_K \equiv \nabla_K E$) $$\delta K^T = -K^T E_K K^T.$$ This is still a descent direction of $E$, since (up to positive factors) \begin{align}\tr(\delta K^T E_K) &= \tr(-K^T E_K K^T E_K)\\ &=-\tr(E_K K^T E_K K^T )\\ &= -\tr((KRK^T - I)(KRK^T-I))\\ &= -\tr((KRK^T - I)(KRK^T - I)^T) \le 0,\end{align} where we've used that $E_K K^T = 2(KRK^T - I)$ is symmetric, and the inequality follows from $\tr(XX^T) \ge 0$ for all $X$.

### Switching coordinates

Next they switch to $W = K^{-1}$ coordinates. These are the weights that convert the output back to the input through $W y = s$. Another way to look at them is through the lateral inhibition circuit $$ \dot y = -y + s - B y \implies (I + B) y = s,$$ so $W - I$ maps onto the recurrent inhibitory weights $B$.

To determine the $W$ update, we use the fact that when $W = K^{-1}$, $\delta W = -K^{-1}\, \delta K\, K^{-1}$. Substituting $\delta K = -K E_K^T K$, the transpose of our $\delta K^T$ update above, gives $$\delta W = K^{-1}(K E_K^T K) K^{-1}= E_K^T \propto RK^T - K^{-1}.$$ Expressing $K$ in terms of $W$, we then get the dynamics $${dW \over dt} \propto \delta W = R W^{-T} - W.$$

It looks like we haven't gained anything, since we've again arrived at a non-local (through the inverse) learning rule. But unlike our non-local update of $K$, where the inverse term appeared alone, here the inverse of $W$ appears together with $R$, which will allow us to express the updates locally (see below).

### Convergence

We can show convergence of the updates using
\begin{align}{d(WW^T) \over dt} &= {dW \over dt} W^T + W {dW ^T \over dt}\\&= (RW^{-T} -W)W^T + W(W^{-1}R - W^T) \\ &= 2(R-WW^T),
\end{align} so $$WW^T(t) = R - e^{-2t} C,$$ where $C = R - WW^T(0)$ depends on the initial conditions. We see that $WW^T$ converges exponentially to $R$. And this is the global minimum of the energy, since $$WW^T = R \implies W^{-1} R W^{-T} = K R K^{T} = I,$$ the whitening condition we found at the minimum.
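Here is a quick numerical confirmation (the integration scheme and constants are my own): Euler-integrating ${dW / dt} = RW^{-T} - W$ drives $WW^T$ to $R$, and the corresponding $K = W^{-1}$ whitens.

```python
import numpy as np

# Illustrative sketch, not from the paper: integrate dW/dt = R W^{-T} - W.
rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
R = A @ A.T / n + np.eye(n)                 # input correlations

W = np.eye(n)                               # initial recurrent weights
dt = 0.01
for _ in range(5_000):
    W += dt * (R @ np.linalg.inv(W).T - W)

print(np.allclose(W @ W.T, R))              # True: W W^T has converged to R
K = np.linalg.inv(W)
print(np.allclose(K @ R @ K.T, np.eye(n)))  # True: K = W^{-1} whitens
```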
### A local implementation

It turns out that our non-local-looking update for $W$ can in fact be made local. This is because \begin{align} {d W \over dt} &= RW^{-T} - W \\ &= \langle s s^T \rangle W^{-T} - W \\ &= \langle s s^T \rangle K^T - W \\ &= \langle s y^T \rangle - W,\end{align} using $W^{-T} = K^T$ and $y = Ks$. Replacing the average by its current sample, we can convert this to an online update,
$$ {dW \over dt} = s(t)\,y(t)^T - W.$$ This rule is local: the update to $W_{ij}$ involves only the input $s_i$ arriving at unit $i$, the activity $y_j$ of the presynaptic unit $j$, and the weight itself.
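Finally, here is a sketch of the full online scheme (the sampling scheme, learning rate, and run length are my choices, and convergence holds only up to stochastic-approximation noise): inputs are drawn one at a time, the circuit settles to $y = W^{-1} s$, and $W$ is updated with the local rule.

```python
import numpy as np

# Illustrative sketch, not from the paper: online local rule dW = s y^T - W.
rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
R_true = A @ A.T / n + np.eye(n)
L = np.linalg.cholesky(R_true)          # to sample inputs with <s s^T> = R_true

W = np.eye(n)
eta = 1e-4                              # small rate to average over samples
for _ in range(300_000):
    s = L @ rng.standard_normal(n)      # one input sample
    y = np.linalg.solve(W, s)           # circuit settles to y = W^{-1} s
    W += eta * (np.outer(s, y) - W)     # local rule: Hebbian-like term + decay

K = np.linalg.inv(W)
print(np.allclose(K @ R_true @ K.T, np.eye(n), atol=0.1))  # True (noisily white)
```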
### Choosing among solutions

We saw at the beginning that the loss $E(K)$ is really only a function of $K^T K$, so it does not determine $K$ uniquely. This was reflected again in the convergence condition, which requires only that $WW^T = R$ and thus specifies $W$ only up to rotation. This means that although our weight dynamics will converge to a solution that minimizes our loss as desired, the particular solution found will depend on the initial conditions. Should we favour some solutions over others?

The authors argue on biological grounds for spatially localized receptive fields and therefore opt for the symmetric solution $W = W^T$. They explain why this solution should produce localized receptive fields, but I didn't understand the explanation.

## Summary

Atick and Redlich formulate the efficient coding problem as optimizing a loss that balances the coding energy against information preservation. By differentiating this loss with respect to connectivity, they come up with a learning rule for the weights in a feedforward implementation. Unfortunately, the learning rule is non-local. However, by switching coordinates to the inverse of the feedforward connectivity, they managed to find a local learning rule for the resulting recurrent weights. That this worked out in this case was a little miraculous, and it's not clear why, or under what conditions, such a procedure would work in general.

$$\blacksquare$$