NeurIPS 2025 Day 2

My notes on Day 2 of Neurips 2025. I wanted to wait till I’d filled more in, but it’s three months later and I think there’s enough here to post!

Rich Sutton, “The OaK architecture: A vision of superintelligence.”

  • Agents are smaller than the world.
  • Therefore everything they learn is an approximation.
  • Therefore the world appears non-stationary.
  • Agents must be able to learn and act at run-time.
  • Building in domain-dependent knowledge is ultimately harmful (the bitter lesson)
  • Agents should be capable of developing open-ended abstractions.
  • General agreement across fields on design components of intelligent agents:
    • Perception
    • Policy
    • Value Function
    • Transition Model
  • The one-step trap of RL: planning one step ahead
    • Errors compound
    • Exploration blows up in time.
  • Solution: higher-order transition models to plan over larger timescales.

The OaK architecture

  • Learning higher-order transitions by constructing reward-respecting subgoals.
  • Options and Knowledge
    • Option: a tuple of a policy (state $\to$ action) and a termination function (state $\to$ termination probability in $[0,1]$)
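
The policy/termination tuple can be sketched in a few lines. This is a minimal illustration, not Sutton's implementation; the full options formalism also includes an initiation set, omitted here, and the toy chain-walk option is hypothetical.

```python
import numpy as np

class Option:
    """Minimal sketch of an option: a policy plus a termination function.
    (The full formalism also includes an initiation set, omitted here.)"""
    def __init__(self, policy, termination):
        self.policy = policy              # state -> action
        self.termination = termination    # state -> termination prob in [0, 1]

    def act(self, state):
        return self.policy(state)

    def terminates(self, state, rng):
        # Sample termination from the Bernoulli given by the termination fn.
        return rng.random() < self.termination(state)

# Toy option on a 1-D chain: keep moving right until reaching state 5.
move_right = Option(policy=lambda s: +1,
                    termination=lambda s: 1.0 if s >= 5 else 0.0)
```

Planning with such options instead of one-step actions is exactly what lets the agent jump over larger timescales.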

Concepts and References

  • Continual learning
    • “Catastrophic loss of plasticity”
  • “Reward-respecting subgoals”

Why diffusion models don’t memorize

  • Time to generation $\tau_\text{gen}$ is constant in the number of training examples.
  • Time to memorization increases with number of examples $n$.
  • The gap between the two is the window of good generalization; early stopping within it avoids memorization.

Poster Session 1

Generalizable, real-time neural decoding with hybrid state-space models

  • Model has three parts:
    • Mapping spikes across units and sessions into a common vector state.
      • Split time into chunks.
      • Learns a unit-specific embedding for each unit.
      • Session-specific keys and values queried with a fixed vector.
    • Vector state is updated using a state-space model to maintain memory.
    • Reading out the predicted behaviour from the history.
      • Queried with the session and time.
      • Use previous history as keys and values.
  • What is it doing?
    • Finding a state-space representation that can be linearly decoded to extract behaviour.

Finding separatrices of dynamical flows with Deep Koopman Eigenfunctions

  • Decision making as moving between different basins
  • Typical analysis is at the fixed points.
  • Found as locations where $\|f(x)\|_2^2 = 0$.
    • i.e. zero of a scalar function of the dynamics.
  • What scalar function of the dynamics finds separatrices?
  • Key idea: map dynamics to unstable scalar dynamics $\psi(x)$
  • Zero of the function corresponds to separatrix.
  • On either side the function increases to $\pm\infty$: unstable dynamics.
  • Use a neural net to find such a function.
  • We want $$\dot \psi(x(t)) = \nabla_x \psi^T \dot x = \nabla_x \psi^T f(x) = \lambda \psi,$$ so minimize $$\mathbb{E}_{x \sim p(x)} \|\nabla_x \psi^T f(x) - \lambda \psi \|_2^2.$$
  • Do this with a neural net using gradient descent?
  • Once you have it, you can design optimal (smallest) perturbations that will move state to the separatrix, evoke new behaviour.
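
The objective above is easy to write down empirically. A minimal sketch, assuming the gradient of $\psi$ is estimated by finite differences (the poster fits $\psi$ with a neural net; here $\psi$ is any callable):

```python
import numpy as np

def koopman_loss(psi, f, xs, lam, eps=1e-5):
    """Empirical version of E_x || grad psi(x)^T f(x) - lam * psi(x) ||^2,
    with the gradient of psi estimated by central finite differences."""
    losses = []
    for x in xs:
        grad = np.array([(psi(x + eps * e) - psi(x - eps * e)) / (2 * eps)
                         for e in np.eye(len(x))])
        losses.append((grad @ f(x) - lam * psi(x)) ** 2)
    return float(np.mean(losses))

# Sanity check on linear dynamics f(x) = A x: psi(x) = x[0] is a Koopman
# eigenfunction with eigenvalue 2, so the loss should be ~0, and its zero
# set {x : x[0] = 0} is the separatrix of the unstable direction.
A = np.diag([2.0, -1.0])
f = lambda x: A @ x
psi = lambda x: x[0]
xs = np.random.default_rng(0).normal(size=(20, 2))
loss = koopman_loss(psi, f, xs, lam=2.0)
```

In practice one would parameterize $\psi$ as a network and minimize this loss by gradient descent, as the poster suggests.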

Learning to Cluster Neuronal Function

  • Deep learning models can learn per-neuron embeddings that predict responses.
  • These embeddings may reflect neuronal types, but don’t cluster.
  • If true neuron types cluster, then incorporating this information should improve embeddings.
  • Key idea: learn an initial embedding to predict responses, then augment the loss with a clustering-promoting term.
  • Here Q is the data distribution, modelled as a mixture of t-distributions
    • Number of distributions is a parameter.
  • P is a target distribution based on previous work.
  • Fit using various numbers of clusters, measure quality of clustering using adjusted rand index.
    • Optimize $\beta$ using cross-validation?
  • Setting # clusters to whatever maximizes ARI finds the right number of clusters in marmoset.
  • No clear peak in ARI in V1, suggests no clear types but rather a continuum.
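
I didn't note the exact form of the clustering term, but a mixture-of-t-distributions $Q$ with a sharpened target $P$ matches the standard DEC-style formulation (Xie et al. 2016), which I assume is what was meant; a sketch:

```python
import numpy as np

def soft_assignments(Z, centers, alpha=1.0):
    """Student-t soft assignments Q of embeddings Z to cluster centers
    (the mixture-of-t-distributions model; alpha = degrees of freedom)."""
    d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(Q):
    """Sharpened target P (DEC-style): square Q, normalize by cluster mass."""
    w = Q ** 2 / Q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def clustering_term(Q, P):
    """KL(P || Q): the clustering-promoting term added to the response loss."""
    return float(np.sum(P * np.log(P / Q)))
```

The number of centers is the free parameter swept when measuring ARI.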

Johnson-Lindenstrauss Lemma Beyond Euclidean Geometry

  • JL lemma: given $n$ points in high-dimensional Euclidean space, there exists a mapping $f$ to $m$-dimensional Euclidean space that approximately preserves distances, so $(1 - \varepsilon) \|x_i - x_j\| \le \|f(x_i) - f(x_j)\| \le (1 + \varepsilon) \|x_i - x_j\|$, where $m$ is $O(\varepsilon^{-2} \log n)$.
  • What if only a distance matrix is available, not point coordinates, and what if distances are not necessarily metric (don’t satisfy the triangle inequality)?
  • Non-metric distances can be expressed as a sum of positive and negative Euclidean part.
    • Result: Prove that JL holds for this positive-negative split.
  • Non-metric distances can also be expressed as a power distance.
    • Result: Prove that JL holds for the power distance.
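
The classical Euclidean case that these results extend is easy to check numerically; a quick sketch with a random Gaussian projection:

```python
import numpy as np

def jl_project(X, m, rng):
    """Classic JL map: random Gaussian projection to m dimensions,
    scaled so squared distances are preserved in expectation."""
    d = X.shape[1]
    return X @ (rng.normal(size=(d, m)) / np.sqrt(m))

def pairwise(Z):
    # All pairwise Euclidean distances (upper triangle, excluding diagonal).
    D = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
    return D[np.triu_indices(len(Z), 1)]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))          # 50 points in 1000 dimensions
Y = jl_project(X, m=400, rng=rng)
ratios = pairwise(Y) / pairwise(X)       # should all be close to 1
```

The poster's contribution is showing analogous guarantees when only a (possibly non-metric) distance matrix is available.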

Spectral Analysis of Representational Similarity with Limited Neurons

Brain-like Variational Inference

  • Introduce FOND: Free Energy Optimization using Natural Gradient Dynamics.
  • Their idea is to derive neurally plausible algorithms by iterative minimization of the negative variational free energy.
    • This is an old idea.
  • They give a concrete demonstration using a Poisson model with log rates.
  • Show that it outperforms standard methods on sparsity, neural plausibility, etc.
  • Show that it outperforms amortized inference methods.

Jacobian-Based Interpretation of Nonlinear Neural Encoding Model

  • Metric for measuring nonlinearity by quantifying changes in input-output Jacobian of NN model with input.
    • Per voxel, compute the input-output Jacobian for each sample.
    • Compute the mean across samples
    • Compute deviation relative to the mean
    • Summarize deviations across coordinates using L1 norm
    • Measure the variance of the mean absolute deviations across samples.
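
The steps above can be sketched directly. This is my reconstruction of the recipe, with the Jacobian estimated by finite differences and the final summary choices (L1 over coordinates, variance over samples) taken from the list:

```python
import numpy as np

def nonlinearity_score(model, X, eps=1e-4):
    """Per-sample input-output Jacobians (finite differences), absolute
    deviation from the mean Jacobian summarized by an L1 norm per sample,
    then the variance of those summaries across samples. `model` maps a
    vector input to a scalar (one voxel)."""
    jacs = np.array([
        [(model(x + eps * e) - model(x - eps * e)) / (2 * eps)
         for e in np.eye(X.shape[1])]
        for x in X
    ])                                                  # (n_samples, n_inputs)
    dev = np.abs(jacs - jacs.mean(axis=0)).sum(axis=1)  # L1 deviation per sample
    return float(dev.var())

# A linear model has a constant Jacobian, so its score is ~0;
# a genuinely nonlinear model scores higher.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w = rng.normal(size=5)
```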

Brain-Like Processing Pathways Form in Models With Heterogeneous Experts

Poster Session 2

From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit

  • Sparse auto-encoders are useful for capturing concepts.
  • Assume that images are linear combinations of concepts.
  • Natural scenes have hierarchical structure.
  • Deep nets trained on natural scenes uncover hierarchical structure.
    • Concepts at different layers tend to be orthogonal.
  • Vanilla sparse auto-encoders have trouble capturing this hierarchical structure.
  • Solution: replace inference stage of auto-encoder with matching pursuit:
    • Iteratively find the concept most correlated with the current residual and subtract it out.
    • Has the property that the resulting residual is orthogonal to the selected concept.
    • Naturally captures the cross-hierarchy-layer orthogonality.
      • At least for adjacent layers.
  • Tested on surrogate data.
    • Data has hierarchical structure if $p(\text{parent} \mid \text{child}) = 1$ but $p(\text{child} \mid \text{parent}) < 1$.
    • MP-SAE recovers the underlying dictionary element relationships
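
The matching-pursuit inference step is standard; a minimal sketch (the SAE training around it is not shown):

```python
import numpy as np

def matching_pursuit(x, D, k):
    """Greedily select the dictionary atom most correlated with the
    residual and subtract its projection out. D holds unit-norm atoms
    as columns; k is the number of iterations (active concepts)."""
    residual = x.astype(float).copy()
    codes = np.zeros(D.shape[1])
    for _ in range(k):
        corr = D.T @ residual
        j = int(np.argmax(np.abs(corr)))
        codes[j] += corr[j]
        residual = residual - corr[j] * D[:, j]  # now orthogonal to atom j
    return codes, residual

rng = np.random.default_rng(0)
D = rng.normal(size=(8, 4))
D /= np.linalg.norm(D, axis=0)                   # unit-norm atoms
x = D @ np.array([2.0, 0.0, -1.0, 0.0])
codes, residual = matching_pursuit(x, D, k=3)
```

The orthogonality of each residual to the atom just subtracted is what lets MP respect the cross-layer orthogonality of hierarchical concepts.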


Representational Difference Explanations

  • Want to find differences in representation
  • Outcome: clustering of samples that are considered similar in one rep but not the other.
  • Approach:
    • Compute $n \times n$ distance matrices $D_A$ and $D_B$ for the two reps.
    • Normalize the distance matrices using nearest neighbours, giving $d_A$ and $d_B$.
    • Compute a relative difference of normalized distances: $$G_{A,B}^{ij} = {d_A^{ij} - d_B^{ij} \over \min(d_A^{ij}, d_B^{ij})}.$$
    • Squash these to $[-1,1]$ using $\tanh$ and form an affinity matrix by passing through a negative exponential: $$F_{A,B}^{ij} = \exp(-\beta\tanh(\gamma G_{A,B}^{ij})).$$
    • This will produce strong links between pairs that are more similar in $A$ than in $B$.
    • Cluster the resulting graph, forming explanations:
      • Different configurations of samples that are more similar in $A$ than in $B$.
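
The pipeline up to the clustering step can be sketched as below. The exact nearest-neighbour normalization wasn't in my notes, so the geometric-mean version here is an assumption, and `beta`, `gamma`, `k` are hypothetical hyperparameters:

```python
import numpy as np

def difference_affinity(D_A, D_B, beta=1.0, gamma=1.0, k=5):
    """Relative-difference affinity between two representations, given
    their n x n distance matrices. Pairs more similar in A than in B
    (d_A < d_B) get G < 0 and hence affinity > 1: strong links."""
    def normalize(D):
        # Mean distance to the k nearest neighbours (skip self-distance),
        # used as a per-point scale; normalization choice is an assumption.
        knn = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
        return D / np.sqrt(knn[:, None] * knn[None, :])
    dA, dB = normalize(D_A), normalize(D_B)
    G = (dA - dB) / np.minimum(dA, dB).clip(min=1e-12)
    return np.exp(-beta * np.tanh(gamma * G))
```

The resulting affinity matrix is then clustered (e.g. as a graph) to produce the explanations.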

Connecting Jensen–Shannon and Kullback–Leibler Divergences

Fast exact recovery of noisy matrix from few entries: the infinity norm approach

