Neurips 2025 Day 2

My notes on Day 2 of Neurips 2025. I wanted to wait till I’d filled more in, but it’s three months later and I think there’s enough here to post!

Rich Sutton “OaK architecture: A vision of super intelligence.”

Agents are smaller than the world
Therefore everything they learn is an approximation.
Therefore the world appears non-stationary
Agents must be able to learn and act at run-time
Building in domain-dependent knowledge is ultimately harmful (the bitter lesson)
Agents should be capable of developing open-ended abstractions.
General agreement across fields on design components of intelligent agents:
- Perception
- Policy
- Value Function
- Transition Model
The one-step trap of RL: planning one step ahead
- Errors compound
- Exploration blows up in time.
Solution: higher-order transition models to plan over larger timescales.

The OaK architecture

Learning higher-order transitions by constructing reward-respecting subgoals.

Options and Knowledge
- Option: A tuple of policy (state $\to$ action) and termination function (state $\to [0,1]$ termination)

Concepts and References

Continual learning
- “Catastrophic loss of plasticity”
“Reward-respecting sub

Why diffusion models don’t memorize

Time to generation $\tau_\text{gen}$ is constant.
Time to memorization increases with number of examples $n$.
Gap between is good generalization time, early stopping.

Poster Session 1

Generalizable, real-time neural decoding with hybrid state-space models

Model has three parts:
- Mapping spikes across units and sessions into a common vector state.
  - Split time into chunks.
  - Learns unit-specific embedding of units.
  - Session-specific keys and values queried with a fixed vector.
- Vector state is updated using a state-state model to maintain memory.
- Reading out the predicted behaviour from the history.
  - Queried with the session and time.
  - Use previous history as keys and values.
What is it doing?
- Finding a state-space representation that can be linearly decoded to extract behaviour.

Finding separatrices of dynamical flows with Deep Koopman Eigenfunctions

Decision making as moving between different basins
Typical analysis is at the fixed points.
Found as locations where $\|f(x)\|_2^2 = 0$.
- i.e. zero of a scalar function of the dynamics.
What such function to find sepratrices?
Key idea: map dynamics to unstable scalar dynamics $\psi(x)$

Zero of the function corresponds to separatrix.
On either side the function increases to +/- $\infty$: unstable dynamics.
Use a neural net to find such a function.
We want $$\dot \psi(x(t)) = \nabla_x \psi^T \dot x = \nabla_x \psi^T f(x) = \lambda \psi,$$ so minimize $$\EE_{x \sim p(x)} \|\nabla \psi ^T f(x) – \lambda \psi \|_2^2.$$
Do this with a neural net using gradient descent?
Once you have it, you can design optimal (smallest) perturbations that will move state to the separatrix, evoke new behaviour.

Learning to Cluster Neuronal Function

Deep learning models can learn per-neuron embeddings that predict responses.
These embeddings may reflect neuronal types, but don’t cluster.
If true neuron types cluster, then incorporating this information should improve embeddings.
Key idea: learn an initial embedding to predict responses, then augment the loss with a clustering promoting term:

Here Q is the data distribution, modelled as a mixture of t-distributions
- Number of distributions is a parameter.
P is a target distribution based on previous work.
Fit using various numbers of clusters, measure quality of clustering using adjusted rand index.
- Optimize $\beta$ using cross-validation?
Setting # clusters to whateve maximizes ARI finds the right number of clusters in marmoset.
No clear peak in ARI in V1, suggests no clear types but rather a continuum.

Johnson-Lindenstrauss Lemma Beyond Euclidean Geometry

JL lemma: given a set of points in $n$-dimensional Euclidean space, there exists a mapping $f$ to $m$-dimensional Euclidean space that approximately preserves distances, so $(1 – \vareps) \|x_i – x_j\| \le \|f(x_i) – f(x_j)\| \le (1 + \vareps) \|x_i – x_j\|$, where $m$ is $O(\vareps^{-2} \log n).$
What if only a distance matrix is available, not point coordinates, and what if distance are not necessarily metric (don’t satisfy triangle inequality)?
Non-metric distances can be expressed as a sum of positive and negative Euclidean part.
- Result: Prove that JL holds for this positive-negative split.
Non-metric distances can also be expressed as a power distance.
- Result: Prove that JL holds for the power distance.

Spectral Analysis of Representational Similarity with Limited Neurons

Brain-like Variational Inference**

Introduce FOND: Free Energy Optimization using Natural Gradient Dynamics.
Their idea is to derive neurally plausible algorithms by iterative minimization of the negative variational free energy.
- This is an old idea.
They do a concrete demonstration by using a poisson model with log rates
Show that it outperforms standard methods on sparsity, neural plausibility, etc.
Show that it outperforms amortized inference methods.

Jacobian-Based Interpretation of Nonlinear Neural Encoding Model

Metric for measuring nonlinearity by quantifying changes in input-output Jacobian of NN model with input.
- Per voxel, compute input output jacobian per sample.
- Compute the mean across samples
- Compute deviation relative to the mean
- Summarize deviations across coordinates using L1 norm
- Measure the variance of the mean absolute deviations across samples.

Brain-Like Processing Pathways Form in Models With Heterogeneous Experts

Poster Session 2

From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Sparse auto-encoders are useful for capturing concepts.
Assume that images are linear combinations of concepts.
Natural scences have hierarchical structure.
Deep nets trained on natural scenes uncover hierarchical structure.
- Concepts at different layers tend to be orthogonal.
Vanilla sparse auto-encoders have trouble capturing this hierarchical structure.
Solution: replace inference stage of auto-encoder with matching pursuit:
- Iteratively find the concept most correlated with the current residual and subtract it out.
- Has the property that the result residual is orthogonal to the selected concept.
- Naturally captures the cross-hierarchy-layer orthogonality.
  - At least for adjacent layers.
Tested on surrogate data.
- Data has hierarchical structure if $p(parent|child) = 1$ but $p(child|parent) < 1$.
- MP-SAE recovers the underlying dictionary element relationships

Representational Difference Explanations

Want to find differences in representation
Outcome: clustering of samples that are considered similar in one rep but not the other.
Approach:
- Compute $n \times n$ distance matrices $D_A$ and $D_B$ for the two reps.
- Normalize distance matrices $d_A$ and $d_B$, using nearest neighbours.
- Compute a relative difference of normalized distances: $$G_{A,B}^{ij} = {d_A^{ij} – d_B^{ij} \over \min(d_A^{ij}, d_B^{ij})}.$$
- Compress these to $\{-1,1\}$ using $\tanh$ and form an affinity matrix by passing through negative exponential: $$F_{A,B}^{ij} = \exp(-\beta\tanh(\gamma G_{A,B}^{ij}).$$
- This will produce strong links between pairs that are more similar in $A$ than in $B$.
- Cluster the resulting graph, forming explanations:
  - Different configurations of samples that are more similar in $A$ than in $B$.

Connecting Jensen–Shannon and Kullback–Leibler Divergences

Fast exact recovery of noisy matrix from few entries: the infinity norm approach

Rich Sutton “OaK architecture: A vision of super intelligence.”

The OaK architecture

Concepts and References

Why diffusion models don’t memorize

Poster Session 1

Generalizable, real-time neural decoding with hybrid state-space models

Finding separatrices of dynamical flows with Deep Koopman Eigenfunctions

Learning to Cluster Neuronal Function

Poster Session 2

From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

Representational Difference Explanations

Connecting Jensen–Shannon and Kullback–Leibler Divergences

Comments

Leave a Reply Cancel reply