When we fit connectivity to convert the input representations to the output representations, constraining it only by regularizing its implied feedforward circuit to be approximately the identity, we find, first, that we can fit the representation data reasonably well. Below I show the observed representation of some held-out data on the left, and the predicted representation on the right. The $R^2$ is about 0.4.

Since we don’t constrain the connectivity, it could be quite complex. But when we actually look at it (the left panel below shows the off-diagonal terms), we see that it’s very structured. In fact, we can fit it very well as the sum of a diagonal term and two rank-1 terms, that is, $$ \ZZ \approx \DD + \alpha \uu \uu^T + \beta \vv \vv^T.$$ The panel on the right shows the result for the off-diagonal terms.

The overall $R^2$ is above 0.95, and is 0.8 or so when considering just the off-diagonal terms.
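As a concrete illustration of this decomposition (not the original fitting code), here is a minimal NumPy sketch of one way to extract a diagonal plus a small number of rank-1 terms from a fitted connectivity matrix; the function name and all variable names are my own assumptions.

```python
import numpy as np

def fit_diag_plus_low_rank(Z, n_components=2):
    """Approximate a (near-symmetric) matrix Z as D + sum_i s_i u_i u_i^T.

    A minimal sketch: symmetrize Z, take the leading eigenpairs (largest
    |eigenvalue|) of its off-diagonal part as the rank-1 terms, then let D
    soak up whatever diagonal is left over.
    """
    Zs = 0.5 * (Z + Z.T)                              # enforce symmetry
    off = Zs - np.diag(np.diag(Zs))                   # off-diagonal part only
    evals, evecs = np.linalg.eigh(off)
    idx = np.argsort(-np.abs(evals))[:n_components]   # largest |eigenvalue| first
    U, s = evecs[:, idx], evals[idx]                  # orthonormal modes, signed weights
    low_rank = (U * s) @ U.T
    D = np.diag(np.diag(Zs - low_rank))               # residual diagonal
    return D, U, s
```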
The diagonal terms are mostly 1, except for a small number that deviate by a small amount. Here’s an example of the sorted values:

Rather than just showing this result, I want to understand its logic. That is, why do the diagonal elements have the values they do? And what determines the rank-1 terms?
The Logic of Connectivity
The loss function we’re optimizing is
$$ L(\ZZ) = {1 \over 2} \|\XX^T \ZZ^T \ZZ \XX - \CC\|_F^2 + {\lambda \over 2}\|\ZZ - \II\|_F^2.$$
We determine the value of $\lambda$ through cross-validation. It turns out to be quite high, greater than $10^6$. We’ll be using that fact below.
It also turns out that the solutions we find are symmetric. To see why, we compute the gradient of the loss above and get $$ \nabla L = \ZZ (2 \XX \bE \XX^T) + \lambda (\ZZ - \II),$$ where $\bE \triangleq \XX^T \ZZ^T \ZZ \XX - \CC$ is the residual inside the first norm above. Setting this to zero and solving for $\ZZ,$ we get
$$ \ZZ = \lambda (2 \XX \bE \XX^T + \lambda \II)^{-1} = (2 \XX \bE \XX^T/\lambda + \II)^{-1} .$$ This is symmetric because $\bE$ and $\II$ are. (Since $\bE$ itself depends on $\ZZ$, this is a fixed-point condition rather than a closed-form solution, but that does not affect the symmetry argument.)
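For concreteness, here is a small NumPy sketch of this loss and its gradient. The function name, and the use of plain arrays for $\XX$, $\CC$, and $\lambda$, are my own assumptions, not the original code; for a symmetric $\CC$ the gradient reduces to the expression above.

```python
import numpy as np

def loss_and_grad(Z, X, C, lam):
    """L(Z) = 1/2 ||X^T Z^T Z X - C||_F^2 + lam/2 ||Z - I||_F^2 and its gradient.

    Assumes C is symmetric, so that E = X^T Z^T Z X - C is symmetric and the
    gradient simplifies to Z (2 X E X^T) + lam (Z - I) as in the text.
    """
    I = np.eye(Z.shape[0])
    E = X.T @ Z.T @ Z @ X - C                 # residual inside the first norm
    loss = 0.5 * np.sum(E ** 2) + 0.5 * lam * np.sum((Z - I) ** 2)
    grad = Z @ (2.0 * X @ E @ X.T) + lam * (Z - I)
    return loss, grad
```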
Due to the symmetry, $\ZZ^T \ZZ = \ZZ^2.$
For $\ZZ \approx \DD + \UU \SS \UU^T$, where the diagonal matrix $\SS$ contains the weight, possibly negative, of each of the orthogonal (but not orthonormal) components in the columns of $\UU$, we have $$ \ZZ^2 = \DD^2 + \DD \UU \SS \UU^T + \UU \SS \UU^T \DD + \UU \SS^2 \UU^T.$$
We don’t want to have to deal with the two middle interaction terms, so we’ll use the fact that $\DD \approx \II$ to approximate the square as $$\ZZ^2 \approx \DD^2 + \UU (2 \SS + \SS^2) \UU^T.$$
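A quick numerical sanity check of this approximation, with made-up sizes and magnitudes (none of this is the original data): when $\DD$ is close to the identity and the low-rank weights are small, dropping the cross terms costs very little.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 2
D = np.eye(n) + np.diag(0.01 * rng.standard_normal(n))  # D near the identity
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))         # orthonormal columns
S = np.diag(0.1 * rng.standard_normal(k))                # small signed weights
Z = D + Q @ S @ Q.T

exact = Z @ Z
approx = D @ D + Q @ (2 * S + S @ S) @ Q.T               # drop the cross terms
print(np.linalg.norm(exact - approx) / np.linalg.norm(exact))  # relative error; should be small
```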
I will absorb the magnitudes into $\UU$ and keep only the signs explicit, writing
$$ \UU (2 \SS + \SS^2) \UU^T = \sum_i \uu_i \uu_i^T \sigma_i, \quad \sigma_i \in \{-1, 1\}.$$
With these in hand our prediction term gets approximated as $$ \XX^T \ZZ^T \ZZ \XX \approx \XX^T (\DD^2 + \sum_i \sigma_i \uu_i \uu_i^T ) \XX.$$
Next, we’ll simplify the regularizer using the intuition that the $\UU$ terms, which contribute the off-diagonal entries, will be very small, so the diagonal will be dominated by $\DD$. Keeping $\ZZ$ near $\II$ is then achieved by keeping $\DD$ near $\II$ and the off-diagonal part near zero. That is,
$$ \| \ZZ -\II\|_F^2 = \| \DD + \UU \sigma \UU^T - \II \|_F^2 \approx \| \DD - \II \|_F^2 + \|\UU \sigma \UU^T\|_F^2.$$
The exact value is $$ \| \ZZ - \II\|_F^2 = \| \DD - \II + \UU \sigma \UU^T \|_F^2 = \| \DD -\II \|_F^2 + \|\UU \sigma \UU^T\|_F^2 + 2\tr((\DD - \II)^T \UU \sigma \UU^T),$$ so our approximation amounts to dropping the last term above, which we may be able to get away with due to the sign flips in $\sigma$.
Then $$ \|\UU \sigma \UU^T\|_F^2 = \tr(\UU \sigma \UU^T \UU \sigma \UU^T) = \tr( \UU^T \UU \sigma \UU^T \UU \sigma) = \sum_i \|\uu_i\|_2^4,$$ since the columns of $\UU$ are assumed orthogonal (though not orthonormal), and $\sigma_i^2 = 1$.
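Here is a minimal check of this trace identity (made-up numbers; the variable names are mine): for columns that are orthogonal but not unit norm, and $\sigma_i \in \{-1, +1\}$, the squared Frobenius norm equals $\sum_i \|\uu_i\|_2^4$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 3
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal directions
U = Q * np.array([0.5, 2.0, 3.0])                  # orthogonal, non-unit-norm columns
sigma = np.diag([1.0, -1.0, 1.0])                  # signs

lhs = np.sum((U @ sigma @ U.T) ** 2)               # ||U sigma U^T||_F^2
rhs = np.sum(np.sum(U ** 2, axis=0) ** 2)          # sum_i ||u_i||^4
print(lhs, rhs)                                    # the two should agree
```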
Normally we would also add Lagrange multipliers $\Lambda_{ij}$ to enforce orthogonality. But it will turn out below that the optimal $\uu_i$ will be eigenvectors of a symmetric matrix, so we will get orthogonality for free.
Thus the loss function we’ll study is $$ \boxed{\wt L(\DD, \UU, \boldsymbol{\sigma}) = {1 \over 2}\|\underbrace{\XX^T (\DD^2 + \sum_i \sigma_i \uu_i \uu_i^T) \XX – \CC}_{\bE} \|_F^2 + {\lambda \over 2} \|\DD- \II \|_F^2 + {\lambda \over 2} \sum_i (\uu_i^T \uu_i)^2.}$$
Why is studying this loss function useful? We minimized the original loss function (of $\ZZ$) and found that the result could be decomposed into $\DD + \UU \SS \UU^T.$ What we’re doing above is parameterizing $\ZZ$ that way explicitly, and assuming that if we optimize this parameterization, it will give the same solution.
Note also that the $\sigma_i$ are discrete. So the unconstrained solution will correspond to some setting of these, which we will have to search over explicitly, rather than with gradient descent.
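Here is a minimal NumPy sketch of this parameterized loss, with the discrete $\sigma_i$ enumerated by brute force. The function names, and representing $\DD$ by a vector of diagonal entries, are my own conventions, not the original code.

```python
import numpy as np
from itertools import product

def param_loss(d, U, sigma, X, C, lam):
    """Parameterized loss: d is the diagonal of D, U has columns u_i, sigma is +/-1 per column."""
    Zsq = np.diag(d ** 2) + (U * sigma) @ U.T          # D^2 + sum_i sigma_i u_i u_i^T
    E = X.T @ Zsq @ X - C
    fit = 0.5 * np.sum(E ** 2)
    reg = 0.5 * lam * (np.sum((d - 1.0) ** 2) + np.sum(np.sum(U ** 2, axis=0) ** 2))
    return fit + reg

def best_signs(d, U, X, C, lam):
    """Brute-force search over the discrete sign patterns sigma."""
    k = U.shape[1]
    candidates = [np.array(s) for s in product([-1.0, 1.0], repeat=k)]
    return min(candidates, key=lambda s: param_loss(d, U, s, X, C, lam))
```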
Optimizing the Parameterization
We’ll start with the rank-1 components $\uu_i$. The differential is $$ d\wt L = \tr(\bE^T \XX^T (\sigma_i d\uu_i \uu_i^T + \sigma_i \uu_i d\uu_i^T)\XX) + 2\lambda \|\uu_i\|_2^2 d\uu_i^T \uu_i.$$
Therefore, the gradient is \begin{align}\nabla_{\uu_i} \wt L &= \sigma_i \XX \bE \XX^T \uu_i + \sigma_i \XX \bE^T \XX^T \uu_i + 2\lambda \|\uu_i\|_2^2 \uu_i \\ &= 2 \sigma_i \XX \bE \XX^T \uu_i + 2 \lambda \|\uu_i\|_2^2 \uu_i, \end{align} since $\bE = \bE^T.$
Setting this to zero, dividing by 2, multiplying by $\sigma_i$ (using $\sigma_i^2 = 1$), and rearranging, we get the eigenvalue equation $$ \XX \bE \XX^T \uu_i = -\sigma_i \lambda \|\uu_i\|_2^2 \uu_i.$$
So the $\uu_i$ are eigenvectors of $\XX \bE \XX^T$ with eigenvalues $-\sigma_i \lambda \|\uu_i\|_2^2$.
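Such a check might look like the following sketch, assuming `u_fit` holds the fitted connectivity modes as columns and `X`, `E` are the inputs and the residual from the fit (all names here are assumptions).

```python
import numpy as np

def compare_modes_to_eigenvectors(u_fit, X, E, n_top=None):
    """Return |cosine| overlaps between fitted modes and the leading eigenvectors
    (largest |eigenvalue|) of X E X^T; values near 1 mean aligned up to sign."""
    n_top = u_fit.shape[1] if n_top is None else n_top
    M = X @ E @ X.T
    evals, evecs = np.linalg.eigh(0.5 * (M + M.T))     # symmetrize before eigh
    idx = np.argsort(-np.abs(evals))[:n_top]
    top = evecs[:, idx]
    u_hat = u_fit / np.linalg.norm(u_fit, axis=0)      # normalize fitted modes
    return np.abs(u_hat.T @ top)                       # |cosine| matrix
```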
When we check numerically, we see that this is indeed (approximately) the case. The plot below shows the two connectivity modes (u_fit, blue) overlaid onto the two eigenvectors of $\XX \bE \XX^T$ with largest absolute eigenvalue.

It’s also useful to consider $\vv_i \triangleq \XX^T \uu_i$. Multiplying both sides of the eigenvalue equation above on the left with $\XX^T$, we get $$ \XX^T \XX \bE \vv_i = -\sigma_i \lambda \|\uu_i\|_2^2 \vv_i,$$ so the $\vv_i$ are eigenvectors of $\XX^T\XX \bE$ with eigenvalue $-\sigma_i \lambda \|\uu_i\|_2^2$.
Expanding this out, $$ -\XX^T \XX \bE \hat \vv_i = \XX^T \XX (\CC \,-\, \XX^T \DD^2 \XX \,-\, \sum_j \sigma_j \vv_j \vv_j^T) \hat \vv_i = \sigma_i \lambda \|\uu_i\|_2^2 \hat \vv_i,$$ where $\hat \vv_i \triangleq \vv_i / \|\vv_i\|_2$ is the unit vector along $\vv_i$.
Now $$\sum_j \sigma_j \vv_j \vv_j^T \hat \vv_i = \sigma_i \|\vv_i\|_2^2 \hat \vv_i + \sum_{j \neq i} \sigma_j \|\vv_j\|_2^2 \cos(\theta_{ij}) \hat \vv_j,$$ where $\cos(\theta_{ij}) = \hat \vv_i^T \hat \vv_j.$
We therefore get the not-quite eigenvalue relation $$ \XX^T \XX (\CC - \XX^T \DD^2 \XX - \sigma_i \|\vv_i\|_2^2 \II) \hat \vv_i - \XX^T \XX \sum_{j \neq i} \sigma_j \|\vv_j\|_2^2 \cos(\theta_{ij}) \hat \vv_j = \sigma_i \lambda \|\uu_i\|_2^2 \hat \vv_i.$$
This says that the low-rank, orthogonal components we find, $\sigma_i \vv_i$, satisfy the relation above for some setting of the $\sigma_i$. The relation itself is valid for any setting of the $\sigma_i$; the solutions we find by optimizing the connectivity in an unconstrained way correspond to one particular setting, which we have to find by direct search.
Since $\DD^2 \approx \II$, and if the $\|\vv_i\|_2^2$ are sufficiently small, we may be able to approximate the $\hat \vv_i$ as solutions of the eigenvector equation
$$ \XX^T \XX (\CC - \XX^T \XX) \hat \vv_i = \sigma_i \lambda \|\uu_i\|_2^2 \hat \vv_i.$$
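As a final sketch (not the original analysis code), the candidate directions under this approximation can be read off from an eigendecomposition of $\XX^T \XX (\CC - \XX^T \XX)$; since that product is generally not symmetric, a general eigensolver is used and the leading eigenvalues are expected to be (nearly) real.

```python
import numpy as np

def approx_v_modes(X, C, n_components=2):
    """Leading eigenvectors (by |eigenvalue|) of X^T X (C - X^T X) as candidate
    v-hat directions under the final approximation."""
    M = X.T @ X @ (C - X.T @ X)
    evals, evecs = np.linalg.eig(M)                    # M is generally not symmetric
    idx = np.argsort(-np.abs(evals))[:n_components]
    return np.real(evecs[:, idx]), np.real(evals[idx])
```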
$$\blacksquare$$