Cosyne 2026

My running notes from Cosyne 2026. Most were hastily written during/immediately after live presentations, so likely contain errors reflecting my misunderstandings. Apologies to the presenters.

Posters by Topic

Experimental

Data Analysis

Theory/Modeling

Posters by Session

Poster Session 1

1-007 Convergent motifs of early olfactory processing
  • Vertebrate and invertebrate olfactory systems are similar but not derived from a common ancestor, which implies convergent evolution.
  • Question: Can the structure of the olfactory system be determined normatively?
  • Model
    • Sparse monomolecular concentration vectors $\cc$
    • Sensed by receptors with affinities $\WW$.
    • Expressed, not necessarily one-to-one, in ORNs according to $\bE$.
    • Hill transformed, plus isotropic additive noise, to give ORN responses.$$ \rr = \varphi(\bE \WW \cc) + \eta.$$
    • ORN responses converge, not necessarily one to one, on glomeruli
  • Learned parameters of the Hill function, and mappings of ORs to ORNs, to maximize mutual information between odours and responses.
    • I think there was an energetic cost somewhere as well.
  • Found that the optimal mapping from ORs to ORNs was one-to-one ($\bE \approx \II$).
  • Downstream: also found a one-to-one mapping from ORNs to glomeruli.
  • Discussion: These results depend strongly on the assumed structure of the noise.
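  • A minimal sketch of the forward model above (hypothetical sizes, a generic Hill nonlinearity, and random $\WW$, $\bE$; the poster's exact parameterization and energetic cost are not reproduced):

```python
import numpy as np

rng = np.random.default_rng(0)

n_odors, n_receptors, n_orns = 50, 30, 40   # hypothetical sizes

# Sparse monomolecular concentration vector c
c = np.zeros(n_odors)
c[rng.choice(n_odors, size=3, replace=False)] = rng.lognormal(size=3)

# Receptor affinities W and OR-to-ORN expression map E (not necessarily 1-to-1)
W = rng.lognormal(mean=-1.0, size=(n_receptors, n_odors))
E = rng.dirichlet(np.ones(n_receptors), size=n_orns)     # each ORN mixes ORs

def hill(x, k=1.0, n=2.0):
    """Generic Hill nonlinearity; the poster's exact form is an assumption."""
    xn = np.maximum(x, 0.0) ** n
    return xn / (k ** n + xn)

# ORN responses: r = phi(E W c) + isotropic Gaussian noise
sigma = 0.05
r = hill(E @ W @ c) + sigma * rng.standard_normal(n_orns)
```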
1-063 Sparse glomerular representations explain odor discrimination in complex, concentration-varying mixtures
  • 2AFC task with one target monomolecular odour among up to 16 distractor odours.
  • Performance depended on concentration of target odour, but not number of distractor odours.
  • Recapitulated in model with sparse, highly tuned receptors.
    • These only sense the target odour, unresponsive to distractors, hence affected by target SNR but not number of distractors.
1-140 A generative diffusion model reveals V2’s representation of natural images
  • Modeled the manifold of images by training a diffusion model to produce images.
  • Generated two kinds of “noise”:
    • On-manifold: Gaussian perturbations to the diffusion model input.
    • Off-manifold: Gaussian perturbations to the diffusion model output.
  • Measured responses of Macaque V2.
    • On-manifold noise produced higher variability in the responses.
      • Such “noise” produces valid, different images, so the responses will reflect the new images generated.
      • Cosine similarity of these responses decreased with noise level.
    • Off-manifold noise produced similar responses across all noise levels.
      • This noise produces corrupted versions of the same images, so lack of variability perhaps reflects mapping of all of these to the same image.
      • Cosine similarity was similar across noise levels.
  • Comment: It’s like V2 is inverting the diffusion model.
    • On-manifold noise produces variable activity, as the mapping back to the image-generating latents varies.
    • Off-manifold noise produces constant responses, as the mapping back is to the same image.
1-089 Interpretable time-series analysis with Gumbel dynamics
  • Context was dynamics of discrete latent state generating observations.
  • We track the probability of being in each state.
  • We want state occupancies to be nearly one-hot, for interpretability.
    • We want the system to be mostly in one state, rather than distributed across states.
  • Standard approach uses softmax, which has two problems:
    • If we work with the probabilities, then these can stay nearly uniform.
      • The downstream decoder/observation model can expand the small fluctuations from uniformity as needed.
      • Hard to interpret, as state occupancy will be distributed.
    • Alternatively, we can sample from the softmax.
      • Fixes the interpretability issue, as exactly one state will be occupied.
      • Inefficient, as gradients etc. will only operate on the active state / sample.
  • Solution: Gumbel-Softmax (see the sketch below): $$ z \sim \text{GS}(\pi, \tau) \iff z = \text{softmax}\left({\pi + \eta \over \tau}\right), \quad \eta \sim \text{Gumbel}(0,1).$$
    • Pushes the softmax probabilities so state-occupancy is nearly one-hot.
      • Helps interpretability
      • More efficient than purely one-hot sampling, because states with small probabilities are also updated.
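  • A minimal sketch of drawing a Gumbel-Softmax sample as above (temperature and logits are illustrative; straight-through variants are omitted):

```python
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=None):
    """Draw a near-one-hot relaxed sample z ~ GS(logits, tau)."""
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise via inverse transform of uniform samples
    u = rng.uniform(low=1e-12, high=1.0, size=np.shape(logits))
    gumbel = -np.log(-np.log(u))
    x = (np.asarray(logits) + gumbel) / tau
    x = x - x.max()                   # numerical stability
    z = np.exp(x)
    return z / z.sum()

# Low temperature -> nearly one-hot state occupancy, but every state keeps
# a small, differentiable probability (unlike a hard argmax sample).
z = gumbel_softmax(np.log([0.2, 0.5, 0.3]), tau=0.2)
print(z.round(3))
```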
1-065 Continuous Multinomial Logistic Regression for Neural Decoding
  • Standard logistic regression: $$p(y_k|\xx) \propto \exp(-\ww_k^T \xx).$$
  • Weights are fixed per class.
  • Can be extended to a temporal dimension by treating each time bin independently.
    • This ignores temporal structure.
  • Idea: allow $\ww_k$ to vary smoothly in time by giving it a Gaussian process prior : $$ \ww_k \sim \text{GP}(\bzero, \lambda).$$
    • I think the different states were also linked, so that their weight vectors evolved together.
  • Take the posterior mean $\overline{\ww}_k(t)$ to compute output probabilities: $$ p(\yy_k | \xx(t)) \propto \exp(-\overline{\ww}_k(t)^T \xx(t)).$$
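  • A minimal sketch of such a decoder: per-class weights drawn from a squared-exponential GP prior over time, then pushed through a softmax; the poster's actual inference over $\ww_k(t)$ is not shown, and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, K = 100, 5, 3                      # time bins, features, classes
t = np.linspace(0, 1, T)

# Squared-exponential GP prior on each weight's time course
lengthscale = 0.2
Kt = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / lengthscale ** 2)
L = np.linalg.cholesky(Kt + 1e-8 * np.eye(T))

# w[k, d, :] varies smoothly in time instead of being fixed per class
w = np.einsum('ts,kds->kdt', L, rng.standard_normal((K, D, T)))

x = rng.standard_normal((T, D))          # features at each time bin
logits = np.einsum('kdt,td->tk', w, x)   # class scores per time bin
p = np.exp(logits - logits.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)        # p[t, k] = p(y=k | x(t))
```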

Poster Session 2

2-136 Statistical theory for inferring population geometry in high-dimensional neural data
  • Used RMT to investigate how covariance estimation varies with the number of neurons $N$ and trials $T$.
  • First result was on PCA dimension (participation ratio), showing how subsampling $N$ neurons and $T$ trials to $M$ neurons and $P$ trials affects estimation, by relating the PCA dimensions at the two configurations.
  • Next result was about how estimation error relative to the true covariance, $\|\hat C - C\|_F^2$, varies as $D_N/T$.
    • $D_N$ is the true PCA dimensionality for $N$ neurons in the infinite trials limit.
  • What if the true dimensionality is not known? Replication error, $\|\hat C_1 - \hat C_2\|_F^2$, varies as $2 \hat{D}_{N,T}/T$.
    • $\hat{D}_{N,T}$ is the dimensionality of the sample.
  • Estimating eigenvectors and eigenvalues:
    • Low rank signal model: $C = U D U^\top + \sigma^2 I$.
    • SNR for the $k$'th dimension: $\text{SNR}_k = d_k/(N \sigma^2)$.
    • There was some critical SNR threshold above which eigenvalues and eigenvectors could be estimated.
    • Effective $R^2$ is $R^2 = {\sum_k O_k \text{SNR}_k - K/N \over \sum_k \text{SNR}_k - 1}.$
      • $O_k$ is the alignment of the recovered eigenvector with the true eigenvector.
      • Recovering each mode contributes SNR, weighted by the alignment, but also brings noise (contributing $N^{-1}$ per mode).
      • Best rank to recover is the $K$ at which the numerator no longer increases, where adding one more mode brings more noise than signal.
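  • A minimal sketch of the empirical quantities involved, the participation-ratio dimensionality and the split-half replication error, without the RMT corrections from the poster:

```python
import numpy as np

def participation_ratio(C):
    """PCA dimensionality D = (sum_i lambda_i)^2 / sum_i lambda_i^2."""
    lam = np.linalg.eigvalsh(C)
    return lam.sum() ** 2 / (lam ** 2).sum()

rng = np.random.default_rng(2)
N, T = 100, 400                              # neurons, trials (illustrative)
X = rng.standard_normal((T, N)) @ np.diag(np.linspace(2.0, 0.1, N)) ** 0.5

# Split trials in half and compare the two sample covariances
X1, X2 = X[: T // 2], X[T // 2 :]
C1 = np.cov(X1, rowvar=False)
C2 = np.cov(X2, rowvar=False)

D_hat = participation_ratio(np.cov(X, rowvar=False))
replication_error = np.linalg.norm(C1 - C2, 'fro') ** 2
print(f"estimated dimensionality {D_hat:.1f}, replication error {replication_error:.1f}")
```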
2-157 Continuous partitioning of neuronal variability
  • Classic models of neural responses explain them as homogeneous Poisson processes whose rates are the product of a stimulus dependent tuning $f_s$ and a trial-specific gain $g_k$: $$y \sim \text{Poisson}(f_s g_k).$$
  • Innovation to capture temporal variability: the Continuous Modulated Poisson Model.
  • Tunings and gains as Gaussian processes: $$ \log g(t) \sim \text{GP}(0, K_g), \quad \log f_s(t) \sim \text{GP}(0, K_f(s)).$$
  • Gains modelled as $$K_g = \rho_g \exp\left(-{|t_1 - t_2|^q \over \ell_g}\right).$$
  • Can then e.g. monitor how parameters like optimal temporal correlation lengths $\ell_g$ and heavy-tailed-ness $q$ vary across brain areas.
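  • A minimal sketch of the generative side of such a model: a log-gain drawn from a GP with the kernel above modulating a fixed tuning curve (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0, 10, 200)                 # time (s)
dt = t[1] - t[0]

# Gain kernel K_g(t1, t2) = rho * exp(-|t1 - t2|^q / ell)
rho, ell, q = 0.5, 2.0, 1.5
Kg = rho * np.exp(-np.abs(t[:, None] - t[None, :]) ** q / ell)

# Slowly varying multiplicative gain g(t) = exp(GP sample)
log_g = np.linalg.cholesky(Kg + 1e-8 * np.eye(len(t))) @ rng.standard_normal(len(t))
g = np.exp(log_g)

f = 20.0 * np.exp(-0.5 * (t - 5.0) ** 2)    # stimulus-dependent tuning f_s(t), in Hz
counts = rng.poisson(f * g * dt)            # y ~ Poisson(f * g * dt) per bin
```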
2-092 Uncovering statistical structure in large-scale neural activity with restricted Boltzmann machines
  • Main idea: fit RBMs to neural data.
  • Marginalize out hidden states to get effective interactions between units at all orders.
    • Advantage over e.g. Schneidman’s approach by making it easier to estimate higher order interactions.
      • But surely subsampling problems must be the same? i.e. estimates of high-order interactions will be noisy due to lack of observations.
  • Compute index of higher order interactions between pairs of units.
    • Can indicate either missing units in the recording, or true higher order interactions (e.g. via glia).
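  • A minimal sketch of the marginalization step: integrating out the hidden units of a binary RBM gives an effective energy over visible units with interactions at all orders, and an effective pairwise coupling can be read off by finite differences (toy parameters, my own sign conventions):

```python
import numpy as np

rng = np.random.default_rng(4)
n_vis, n_hid = 6, 4
W = 0.5 * rng.standard_normal((n_vis, n_hid))   # visible-hidden weights
a = 0.1 * rng.standard_normal(n_vis)            # visible biases
b = 0.1 * rng.standard_normal(n_hid)            # hidden biases

def free_energy(v):
    """-log sum_h exp(-E(v, h)); contains all orders of visible interactions."""
    return -v @ a - np.sum(np.logaddexp(0.0, b + v @ W))

def effective_pair_coupling(i, j):
    """Finite-difference estimate of the effective pairwise interaction."""
    v = np.zeros(n_vis)
    F00 = free_energy(v)
    v[i] = 1; F10 = free_energy(v)
    v[i], v[j] = 0, 1; F01 = free_energy(v)
    v[i] = 1; F11 = free_energy(v)
    return -(F11 - F10 - F01 + F00)

print(effective_pair_coupling(0, 1))
```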
2-218 A unified theory of feature learning in RNNs and DNNs
  • Compared the solutions found by RNNs and DNNs on a regression task.
  • RNNs can be viewed as DNNs using temporal unrolling.
  • Key difference is that in the unrolled RNN, the layer weights are shared.
  • Weight sharing imposes an inductive bias which can make RNNs more sample efficient e.g. when learning temporal sequences.
2-152 Dynamical archetype analysis: Autonomous computation
  • Neural systems often have different geometry, but the same topology
    • Converge to the same pattern of fixed points, repelled by the same set of repellers etc.
  • This is called topological conjugacy: Two systems are topologically conjugate if there is a homeomorphism $\Phi$ that transforms one set of dynamics into the other.
  • The set of all homeomorphisms is too large, so parameterize using a Neural ODE.
  • Compute the distance between two sets of dynamics $f$, $g$ by minimizing a combination of a trajectory mismatch $d_\text{traj}$, and homeomorphism complexity $d_\text{cxty}$.
  • Trajectory loss: Given the trajectory at time $t$ starting at $x$ under the $f$ dynamics, $\phi_f^t(x)$, and similarly $\phi_g^t(x)$: $$ d_\text{traj}(\Phi; f, g) = \int_t \| \phi_f^t(x) - \underbrace{\Phi(\phi_g^t ( \Phi^{-1}(x)))}_{\text{$g$ trajectory in $f$ space}}\| \, dt$$
  • Complexity loss: $$ d_\text{cxty}(\Phi) = \int \| \nabla \Phi(x) - I \| \, dx,$$ evaluated along the $f$ trajectory (?).
  • Measure the distance of a given neural system to a fixed set of archetype dynamics, e.g. line attractor, ring attractor.
  • Showed that they could correctly map dynamics to their archetypes when the ground truth was known.
    • E.g. neural ring attractor dynamics (fly system? head direction system?) mapped to ring attractor archetype.
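  • A minimal sketch of evaluating $d_\text{traj}$ for a fixed candidate homeomorphism, here a linear shear between two planar rotations; the Neural-ODE parameterization and the complexity term are omitted:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Archetype dynamics f and an observed system g that is a sheared copy of it
F = np.array([[0.0, -1.0], [1.0, 0.0]])          # rotation (ring-like archetype)
P = np.array([[1.0, 0.5], [0.0, 1.0]])           # the "ground-truth" homeomorphism
G = np.linalg.inv(P) @ F @ P                     # conjugate dynamics

f = lambda t, x: F @ x
g = lambda t, y: G @ y
Phi = lambda y: P @ y                             # candidate homeomorphism
Phi_inv = lambda x: np.linalg.solve(P, x)

x0 = np.array([1.0, 0.0])
ts = np.linspace(0, 10, 400)

traj_f = solve_ivp(f, (0, 10), x0, t_eval=ts, rtol=1e-8).y.T
traj_g = solve_ivp(g, (0, 10), Phi_inv(x0), t_eval=ts, rtol=1e-8).y.T
mapped = traj_g @ P.T                             # Phi(phi_g^t(Phi^{-1}(x0)))

# Riemann-sum approximation of the integral over time
d_traj = np.sum(np.linalg.norm(traj_f - mapped, axis=1)) * (ts[1] - ts[0])
print(f"trajectory mismatch: {d_traj:.2e}")       # ~0: g matches the archetype
```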
2-208 Plastic Circuits for Context-Dependent Decisions
  • Investigated the effect of Short-Term Synaptic Plasticity (STSP) in an RNN performing the Mante 2013 task (output based on colour or motion signal, depending on context).
  • STSP: Strengths are determined as the product of utilization and activity, $w_{\text{eff},ij}(t) = w_{ij}\, u_j(t)\, x_j(t)$.
    • This is meant to model e.g. vesicle pool depletion etc.
  • A network using STSP could perform the task, one with Hebbian plasticity could not (?).
  • Found that in STSP, context information is stored in neural activity, not the synapses.
  • A network with fixed weights that wants to implement the same computation has to do so through its nonlinear activations, $A(t) x(t) \to W \phi(x(t))$, which presumably could get complicated or intractable.
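  • A minimal sketch of an STSP update of this kind, using standard Tsodyks-Markram-style dynamics for utilization $u$ and resources $x$; the poster's exact formulation is an assumption here:

```python
import numpy as np

rng = np.random.default_rng(5)
T, dt = 500, 1e-3                   # steps, step size (s)
tau_f, tau_d, U = 0.2, 0.5, 0.3     # facilitation/depression constants (assumed)
w = 1.0                             # static synaptic weight

u, x = U, 1.0
rate = 20.0                          # presynaptic Poisson rate (Hz)
w_eff = np.zeros(T)

for t in range(T):
    spike = rng.random() < rate * dt
    # Recovery toward baseline between spikes
    u += dt * (U - u) / tau_f
    x += dt * (1.0 - x) / tau_d
    if spike:
        u += U * (1.0 - u)           # facilitation: utilization jumps up
        x -= u * x                   # depression: resources are consumed
    w_eff[t] = w * u * x             # effective weight w_eff = w * u(t) * x(t)
```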
2-173 Spatiotemporal Dynamics in Recurrent Neural Networks as Flow Invariance
  • RNNs are often used to learn stimulus dynamics.
  • It’s natural to want equivariant hidden representations: flow in the stimulus results in corresponding flow in the latents.
  • Incorporating such equivariance into the RNN dynamics can dramatically speed up learning.
2-045 A non-local variational framework for optimal neural representations
  • Tuning curves are often defined by maximizing Fisher information.
  • Fisher information is a local measure – doesn’t capture e.g. errors due to jumps in the inferred values.
  • Mutual information is global, but can be hard to compute.
  • Solution: a non-local loss on tuning curves comparing all pairs of input stimuli: $$ L[f] = {1 \over (2 \pi)^2} \int_\theta \int_{\theta'} \ell(f(\theta), f(\theta'))\, d\theta\, d\theta',$$ where $\ell$ is the misclassification error.
  • How to solve this?
    • $f(\theta) = p(x|\theta)$, the population response.
    • The population response space is a manifold with the Fisher-Rao metric.
    • The responses to two stimuli are two points in this space, separated by a geodesic distance $d$.
    • Classification error $\ell$ is approximately erfc of this distance.
    • The set of all responses (to circular stimuli) forms a closed curve in this space.
    • The optimal tuning curves that minimize the loss form a circle in the space of square-root firing rates.
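  • A minimal numeric sketch of evaluating such a non-local loss for a circular stimulus and a Poisson population: pairwise distances are taken in square-root firing-rate coordinates (where the Fisher-Rao metric is approximately Euclidean) and $\ell$ is approximated with an erfc; the constants and von Mises tuning curves are illustrative, not the poster's:

```python
import numpy as np
from scipy.special import erfc

n_stim, n_neur = 90, 12
theta = np.linspace(0, 2 * np.pi, n_stim, endpoint=False)
prefs = np.linspace(0, 2 * np.pi, n_neur, endpoint=False)

# Illustrative von Mises tuning curves f(theta): (n_stim, n_neur) firing rates
kappa, peak = 2.0, 30.0
f = peak * np.exp(kappa * (np.cos(theta[:, None] - prefs[None, :]) - 1.0))

# Geodesic (Fisher-Rao) distance for Poisson responses ~ Euclidean in sqrt-rates
sq = np.sqrt(f)
d = 2.0 * np.linalg.norm(sq[:, None, :] - sq[None, :, :], axis=-1)

# Pairwise misclassification error, approximated as erfc of the distance
ell = 0.5 * erfc(d / (2.0 * np.sqrt(2.0)))

L = ell.mean()            # average over all stimulus pairs (the double integral)
print(f"non-local loss L[f] = {L:.4f}")
```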

Poster Session 3

3-154 Estimating neural coding fidelity in high dimensions with limited samples
  • d’ measures discriminability, but its estimate can be biased when there are many more neurons than trials.
  • Used RMT to estimate d’ in high-dimensional setting and produce a less-biased estimator.
  • Key quantity: Signal aligned spectrum: $$ G_\rho(x) = \sum_{i=1}^N \left({v_i^T u \over \|u\|}\right)^2 1_{x > \lambda_i},$$ where $v_i$ are the noise directions in decreasing order of variance $\lambda_i$, and $u$ is the signal direction.
  • In those terms, $$ d' = \|u\|^2 \int {1 \over \lambda}\, dG(\lambda).$$
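  • A minimal sketch of the plug-in version of this quantity, without the poster's RMT bias correction; the integral against $G$ reduces to a weighted sum over noise eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 50, 200
noise = rng.standard_normal((T, N)) @ np.diag(np.linspace(1.5, 0.2, N)) ** 0.5
u = rng.standard_normal(N)                     # signal direction (e.g. mean difference)

C = np.cov(noise, rowvar=False)                # sample noise covariance
lam, V = np.linalg.eigh(C)                     # eigenvalues, columns of V are v_i

# G weights: squared alignment of each noise direction with the signal
weights = (V.T @ u) ** 2 / (u @ u)

# d' = ||u||^2 * integral (1/lambda) dG(lambda) = ||u||^2 * sum_i weights_i / lambda_i
d_prime = (u @ u) * np.sum(weights / lam)
print(f"plug-in d' = {d_prime:.2f}")
```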

3-138 Identifying interpretable latent factors within and across brain regions
  • Decompose temporal activity into sparse, orthogonal factors convolved with Gaussian filters, possibly with some delay.
    • Orthogonality and sparsity give interpretability.
    • Convolution with Gaussian filters is faster to fit than general Gaussian process.
3-070 Generalization and memorization in mouse olfactory learning
  • Trained mice to distinguish a variety of sixteen-component mixtures to test generalization vs memorization.
  • Mice can do both, biased towards simple rules when these exist.
3-130 Sensory prediction errors update predictive representations
  • RSC is the source of sensory predictions (not prediction errors) to V1.
3-225 Noise Correlations for Efficient Learning
  • For optimal discrimination we want noise correlations to be orthogonal to the discriminating dimensions.
  • Tested humans on a joint color-motion discrimination task, where the rule would occasionally flip.
  • Modelled this with shallow linear net mapping color and motion to the decision.
  • Observed that noise correlations in the model were parallel to the optimal discrimination direction.
    • This would produce sub-optimal accuracy, and indeed it does.
    • But they hypothesize that it helps find the discriminating directions.
      • I think this is putting the cart before the horse.
      • Classifiers will find the discriminating directions, and that in turn will affect the noise correlations.
3-005 State-dependent modulation of neocortical sensory processing
  • Imaged mouse S1 and PPC during two-alternative multi-sensory discrimination task.
  • Modelled brain activity using a 3-state GLM-HMM.
  • Produced one “engaged” state and two “disengaged” states.
  • In the engaged state:
    • Stronger representations in S1 and PPC-A
    • Stronger communication from S1 to PPC-A
    • Stronger bottom-up activation.
3-069 Generalized DSA: Comparing Neural Population Dynamics by Identifying Optimal Linearizing Embeddings
  • DSA has two steps:
    • Map nonlinear dynamics to high (infinite) dimensional linear dynamics using Koopman theory.
    • Find an orthonormal coordinate transformation that best lines up one set with another: $$ d_\text{DSA}(A,B) = \min_{C \in O(N)} \|A – C B C^T\|.$$
  • Generalized DSA:
    • Find eigenspectrum of dynamics
    • Measure distance between spectra using optimal transport.
  • Ostrow: Is faster / works better than DSA.
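  • A minimal sketch of the generalized-DSA comparison step for two already-linear dynamics matrices: eigenspectra compared with an assignment-based (optimal-transport-style) distance in the complex plane; the Koopman lifting and the poster's exact OT formulation are not reproduced:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def spectrum_distance(A, B):
    """Assignment distance between eigenspectra (exact OT for uniform weights)."""
    lam_a = np.linalg.eigvals(A)
    lam_b = np.linalg.eigvals(B)
    cost = np.abs(lam_a[:, None] - lam_b[None, :])   # pairwise distances in C
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

rng = np.random.default_rng(7)
A = rng.standard_normal((10, 10)) / np.sqrt(10)
Q, _ = np.linalg.qr(rng.standard_normal((10, 10)))
B = Q @ A @ Q.T                                      # same dynamics, rotated basis

print(spectrum_distance(A, B))                       # ~0: spectra coincide
```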
3-030 Identifying Neural Activity Manifolds through Non-reversibility Analysis
  • Neural dynamics are generally not reversible.
    • i.e. $p(x_t = a, x_{t+1} = b) \neq p(x_t = b, x_{t+1}=a)$
  • Noise dynamics are reversible.
  • Idea: Dimension reduce dynamics by finding projections that produce maximally non-reversible dynamics.
  • Let $X_k \in \RR^{N \times T}$ be the responses to condition $k$.
  • Compute the covariance of the vectorized responses $C = \EE\left[\vec{X_k^T}\, \vec{X_k^T}^T\right].$
  • Split this into a reversible and non-reversible part:
    • Let $\sigma(C)$ be the time-transposed covariances.
    • $C^+ = C + \sigma(C)$
    • $C^- = C - \sigma(C)$
  • Non-reversibility index: $$\xi = {\|C^-\|_F \over \|C^+\|_F}.$$
  • It is tough to maximize this ratio directly, so instead just maximize the numerator:
    • Find $U$ to maximize $\|C^-\|$ for the projected data $Y_k = U^T X_k$.
    • Non-reversible part has a simple expression: $$ \|C^-\|_F^2 = \sum_{k,k'} \tr{Y_k Y_{k'}^T}^2 - \tr{Y_k Y_k^T Y_{k'} Y_{k'}^T}.$$
    • Can be kernelized.
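  • A minimal sketch of a non-reversibility index in a simplified form (my own hedged variant, not the vectorized-covariance construction above): the antisymmetric part of the lag-one cross-covariance vanishes for reversible dynamics:

```python
import numpy as np

rng = np.random.default_rng(8)

# Rotational (hence irreversible) linear-Gaussian process x_{t+1} = A x_t + noise
A = np.array([[0.9, -0.3], [0.3, 0.9]])
T = 20000
x = np.zeros((T, 2))
for t in range(T - 1):
    x[t + 1] = A @ x[t] + 0.1 * rng.standard_normal(2)

# Lag-one cross-covariance; time reversal maps C1 -> C1^T
xc = x - x.mean(axis=0)
C1 = xc[:-1].T @ xc[1:] / (T - 1)

C_plus = C1 + C1.T          # reversible (symmetric) part
C_minus = C1 - C1.T         # non-reversible (antisymmetric) part

xi = np.linalg.norm(C_minus) / np.linalg.norm(C_plus)
print(f"non-reversibility index xi = {xi:.3f}")
```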
3-111 Distinguishing probabilistic from heuristic neural representations of uncertainty
  • When are neural representations truly probabilistic rather than just heuristic?
  • Found that truly probabilistic reps require a bottleneck that forces learning of sufficient statistics.
  • Otherwise can just memorize inputs.
    • Measurable by checking whether the inputs can be predicted from the hidden states.

3-046 Canonical cortical circuits: A unified sampling machine for static and dynamic inference
  • Found that the canonical microcircuit was a substrate for Hamiltonian dynamics and allowed fast inference.

Workshops 1

Eero Simoncelli

  • Efficient coding (Barlow): information transmission while minimizing redundancy.
  • First part of the talk: building circuit models that progressively reduce different kinds of redundancy, and how these match up to corresponding visual areas.
  • Second part of talk: accessing the image manifold through denoising
    • Supervised training of image denoisers
      • For a least-squares loss the denoiser reports the posterior mean: $$ y = x + z \mapsto \hat x(y) = \int x\, p(x|y)\, dx .$$
      • The trained system implicitly contains information about the prior over images. How to access this information?
      • Tweedie’s identity: $$\hat{x}(y) = y + \sigma^2 \nabla_y \log p(y).$$
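  • A minimal numeric check of Tweedie's identity in a fully analytic case: a 1D Gaussian prior with Gaussian noise, where both the posterior mean and the score of $p(y)$ are closed-form; in the denoiser setting the analytic score is replaced by the trained network:

```python
import numpy as np

mu, s2, sigma2 = 1.0, 4.0, 0.25      # prior mean/variance, noise variance
y = 2.3                              # an observed noisy value

# Posterior mean E[x | y] for the conjugate Gaussian model
posterior_mean = (y / sigma2 + mu / s2) / (1.0 / sigma2 + 1.0 / s2)

# Tweedie: x_hat(y) = y + sigma^2 * d/dy log p(y), with p(y) = N(mu, s2 + sigma2)
score = -(y - mu) / (s2 + sigma2)
tweedie = y + sigma2 * score

print(posterior_mean, tweedie)       # the two agree
```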

Ken Harris

Workshops 2

Questions

  • Can the OT metric on tuning curves be pulled back to a metric in the response space to allow direct comparison of datasets with different sizes?
  • Non-reversibility analysis looks for an orthogonal projection in neural space. Is there also an optimal time scale, a projection along the time direction, to compute non-reversibility over?
  • Ken Harris mentioned the trivial neural code for numbers, where successive neurons code for successive significant digits.
    • This is an intensive code; information per dimension decays to zero in the large-$N$ limit.
    • What is a corresponding extensive code, that distributes the information evenly among neurons?
      • That is also easy to decode?
      • Does the naive encoding correspond to axis-aligned coordinates?
        • And then a distributed code could be a rotation?
  • Matthew Chalk’s talk
    • He defined a multi-resolution information metric by looking at how Fisher information changed with noise corruption of different magnitudes.
    • The eigenvectors of the metric give the most informative directions for each stimulus.
    • Can the multi-resolution information metric be derived from the locally most informative directions?
