{"id":5410,"date":"2026-03-13T19:29:12","date_gmt":"2026-03-13T19:29:12","guid":{"rendered":"https:\/\/sinatootoonian.com\/?p=5410"},"modified":"2026-03-13T19:30:52","modified_gmt":"2026-03-13T19:30:52","slug":"neurips-2025-day-2","status":"publish","type":"post","link":"https:\/\/sinatootoonian.com\/index.php\/2026\/03\/13\/neurips-2025-day-2\/","title":{"rendered":"Neurips 2025 Day 2"},"content":{"rendered":"\n<p>My notes on Day 2 of Neurips 2025. I wanted to wait till I&#8217;d filled more in, but it&#8217;s three months later and I think there&#8217;s enough here to post!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Rich Sutton &#8220;OaK architecture: A vision of super intelligence.&#8221;<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Agents are smaller than the world<\/li>\n\n\n\n<li>Therefore everything they learn is an approximation.<\/li>\n\n\n\n<li>Therefore the world appears non-stationary<\/li>\n\n\n\n<li>Agents must be able to learn and act at run-time<\/li>\n\n\n\n<li>Building in domain-dependent knowledge is ultimately harmful (the bitter lesson)<\/li>\n\n\n\n<li>Agents should be capable of developing open-ended abstractions.<\/li>\n\n\n\n<li>General agreement across fields on design components of intelligent agents:\n<ul class=\"wp-block-list\">\n<li>Perception<\/li>\n\n\n\n<li>Policy<\/li>\n\n\n\n<li>Value Function<\/li>\n\n\n\n<li>Transition Model<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>The one-step trap of RL: planning one step ahead\n<ul class=\"wp-block-list\">\n<li>Errors compound<\/li>\n\n\n\n<li>Exploration blows up in time.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Solution: higher-order transition models to plan over larger timescales.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">The OaK architecture<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Learning higher-order transitions by constructing <strong>reward-respecting subgoals<\/strong>.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Options and Knowledge\n<ul class=\"wp-block-list\">\n<li>Option: A tuple of policy (state $\\to$ action) and termination function (state $\\to [0,1]$ termination)<\/li>\n\n\n\n<li><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Concepts and References<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Continual learning\n<ul class=\"wp-block-list\">\n<li>&#8220;Catastrophic loss of plasticity&#8221;<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>&#8220;Reward-respecting sub<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Why diffusion models don&#8217;t memorize<\/h2>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Time to generation $\\tau_\\text{gen}$ is constant.<\/li>\n\n\n\n<li>Time to memorization increases with number of examples $n$.<\/li>\n\n\n\n<li>Gap between is good generalization time, early stopping.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">Poster Session 1<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Generalizable, real-time neural decoding with hybrid state-space models<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Model has three parts:\n<ul class=\"wp-block-list\">\n<li>Mapping spikes across units and sessions into a common vector state.\n<ul class=\"wp-block-list\">\n<li>Split time into chunks.<\/li>\n\n\n\n<li>Learns unit-specific embedding of units.<\/li>\n\n\n\n<li>Session-specific keys and values queried with a fixed vector.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Vector state is updated using a state-state model to maintain memory.<\/li>\n\n\n\n<li>Reading out the predicted behaviour from the history.\n<ul class=\"wp-block-list\">\n<li>Queried with the session and time.<\/li>\n\n\n\n<li>Use previous history as keys and values.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>What is it doing?\n<ul class=\"wp-block-list\">\n<li>Finding a state-space representation that can be linearly decoded to extract behaviour.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/119952\">Finding separatrices of dynamical flows with Deep Koopman Eigenfunctions<\/a><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Decision making as moving between different basins<\/li>\n\n\n\n<li>Typical analysis is at the fixed points.<\/li>\n\n\n\n<li>Found as locations where $\\|f(x)\\|_2^2 = 0$.\n<ul class=\"wp-block-list\">\n<li>i.e. zero of a scalar function of the dynamics.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>What such function to find sepratrices?<\/li>\n\n\n\n<li>Key idea: map dynamics to unstable scalar dynamics $\\psi(x)$<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"702\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-1024x702.png\" alt=\"\" class=\"wp-image-5415\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-1024x702.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-300x206.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-768x526.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-1536x1052.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2048x1403.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Zero of the function corresponds to separatrix.<\/li>\n\n\n\n<li>On either side the function increases to +\/- $\\infty$: unstable dynamics.<\/li>\n\n\n\n<li>Use a neural net to find such a function.<\/li>\n\n\n\n<li>We want $$\\dot \\psi(x(t)) = \\nabla_x \\psi^T \\dot x =  \\nabla_x \\psi^T f(x) =  \\lambda \\psi,$$ so minimize $$\\EE_{x \\sim p(x)} \\|\\nabla \\psi ^T f(x) &#8211; \\lambda \\psi \\|_2^2.$$ <\/li>\n\n\n\n<li>Do this with a neural net using gradient descent?<\/li>\n\n\n\n<li>Once you have it, you can design optimal (smallest) perturbations that will move state to the separatrix, evoke new behaviour.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/119088\">Learning to Cluster Neuronal Function<\/a><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Deep learning models can learn per-neuron embeddings that predict responses.<\/li>\n\n\n\n<li>These embeddings may reflect neuronal types, but don&#8217;t cluster.<\/li>\n\n\n\n<li>If true neuron types cluster, then incorporating this information should improve  embeddings.<\/li>\n\n\n\n<li>Key idea: learn an initial embedding to predict responses, then augment the loss with a clustering promoting term:<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" width=\"590\" height=\"102\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-1.png\" alt=\"\" class=\"wp-image-5420\" style=\"width:409px;height:auto\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-1.png 590w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-1-300x52.png 300w\" sizes=\"auto, (max-width: 590px) 100vw, 590px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Here Q is the data distribution, modelled as a mixture of t-distributions\n<ul class=\"wp-block-list\">\n<li>Number of distributions is a parameter.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>P is a target distribution based on previous work.<\/li>\n\n\n\n<li>Fit using various numbers of clusters, measure quality of clustering using adjusted rand index.\n<ul class=\"wp-block-list\">\n<li>Optimize $\\beta$ using cross-validation?<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Setting # clusters to whateve maximizes ARI finds the right number of clusters in marmoset.<\/li>\n\n\n\n<li>No clear peak in ARI in V1, suggests no clear types but rather a continuum.<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/118378\">Johnson-Lindenstrauss Lemma Beyond Euclidean Geometry<\/a><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>JL lemma: given a set of points in $n$-dimensional Euclidean space, there exists a mapping $f$ to $m$-dimensional Euclidean space that approximately preserves distances, so $(1 &#8211; \\vareps) \\|x_i &#8211; x_j\\| \\le \\|f(x_i) &#8211; f(x_j)\\| \\le (1 + \\vareps) \\|x_i &#8211; x_j\\|$, where $m$ is  $O(\\vareps^{-2} \\log n).$<\/li>\n\n\n\n<li>What if only a distance matrix is available, not point coordinates, and what if distance are not necessarily metric (don&#8217;t satisfy triangle inequality)?<\/li>\n\n\n\n<li>Non-metric distances can be expressed as a sum of positive and negative Euclidean part.\n<ul class=\"wp-block-list\">\n<li>Result: Prove that JL holds for this positive-negative split.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Non-metric distances can also be expressed as a power distance.\n<ul class=\"wp-block-list\">\n<li>Result: Prove that JL holds for the power distance.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/115670\">Spectral Analysis of Representational Similarity with Limited Neurons<\/a><\/p>\n\n\n\n<p><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/119894\">Brain-like Variational Inference<\/a>**<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Introduce FOND: Free Energy Optimization using Natural Gradient Dynamics.<\/li>\n\n\n\n<li>Their idea is to derive neurally plausible algorithms by iterative minimization of the negative variational free energy.\n<ul class=\"wp-block-list\">\n<li>This is an old idea.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>They do a concrete demonstration by using a poisson model with log rates<\/li>\n\n\n\n<li>Show that it outperforms standard methods on sparsity, neural plausibility, etc.<\/li>\n\n\n\n<li>Show that it outperforms amortized inference methods.<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/118247\">Jacobian-Based Interpretation of Nonlinear Neural Encoding Model<\/a><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Metric for measuring nonlinearity by quantifying changes in input-output Jacobian of NN model with input.\n<ul class=\"wp-block-list\">\n<li>Per voxel, compute input output jacobian per sample.<\/li>\n\n\n\n<li>Compute the  mean across samples<\/li>\n\n\n\n<li>Compute deviation relative to the mean<\/li>\n\n\n\n<li>Summarize deviations across coordinates using L1 norm<\/li>\n\n\n\n<li>Measure the variance of the mean absolute deviations across samples.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<p><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/118100\">Brain-Like Processing Pathways Form in Models With Heterogeneous Experts<\/a><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Poster Session 2<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/118531\">From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit<\/a><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Sparse auto-encoders are useful for capturing concepts.<\/li>\n\n\n\n<li>Assume that images are linear combinations of concepts.<\/li>\n\n\n\n<li>Natural scences have hierarchical structure.<\/li>\n\n\n\n<li>Deep nets trained on natural scenes uncover hierarchical structure.\n<ul class=\"wp-block-list\">\n<li>Concepts at different layers tend to be orthogonal.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Vanilla sparse auto-encoders have trouble capturing this hierarchical structure.<\/li>\n\n\n\n<li>Solution: replace inference stage of auto-encoder with matching pursuit:\n<ul class=\"wp-block-list\">\n<li>Iteratively find the concept most correlated with the current residual and subtract it out.<\/li>\n\n\n\n<li>Has the property that the result residual is orthogonal to the selected concept.<\/li>\n\n\n\n<li>Naturally captures the cross-hierarchy-layer orthogonality.\n<ul class=\"wp-block-list\">\n<li>At least for adjacent layers.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Tested on surrogate data.\n<ul class=\"wp-block-list\">\n<li>Data has hierarchical structure if $p(parent|child) = 1$ but $p(child|parent)  &lt; 1$.<\/li>\n\n\n\n<li>MP-SAE recovers the underlying dictionary element relationships<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"366\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2-1024x366.png\" alt=\"\" class=\"wp-image-5446\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2-1024x366.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2-300x107.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2-768x274.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2-1536x548.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-2.png 2022w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><br><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/116114\">Representational Difference Explanations<\/a><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Want to find differences in representation<\/li>\n\n\n\n<li>Outcome: clustering of samples that are considered similar in one rep but not the other.<\/li>\n\n\n\n<li>Approach:\n<ul class=\"wp-block-list\">\n<li>Compute $n \\times n$ distance matrices $D_A$ and $D_B$ for the two reps.<\/li>\n\n\n\n<li>Normalize distance matrices $d_A$ and $d_B$, using nearest neighbours.<\/li>\n\n\n\n<li>Compute a relative difference of normalized distances: $$G_{A,B}^{ij} = {d_A^{ij} &#8211; d_B^{ij} \\over \\min(d_A^{ij}, d_B^{ij})}.$$<\/li>\n\n\n\n<li>Compress these to $\\{-1,1\\}$ using $\\tanh$ and form an affinity matrix by passing through negative exponential: $$F_{A,B}^{ij} = \\exp(-\\beta\\tanh(\\gamma G_{A,B}^{ij}).$$<\/li>\n\n\n\n<li>This will produce strong links between pairs that are more similar in $A$ than in $B$.<\/li>\n\n\n\n<li>Cluster the resulting graph, forming explanations:\n<ul class=\"wp-block-list\">\n<li>Different configurations of samples that are more similar in $A$ than in $B$.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"373\" src=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-3-1024x373.png\" alt=\"\" class=\"wp-image-5448\" srcset=\"https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-3-1024x373.png 1024w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-3-300x109.png 300w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-3-768x280.png 768w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-3-1536x559.png 1536w, https:\/\/sinatootoonian.com\/wp-content\/uploads\/2025\/12\/image-3-2048x746.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><a href=\"http:\/\/Connecting Jensen\u2013Shannon and Kullback\u2013Leibler Divergences: A New Bound for Representation Learning\">Connecting Jensen\u2013Shannon and Kullback\u2013Leibler Divergences<\/a><\/h3>\n\n\n\n<p><a href=\"https:\/\/neurips.cc\/virtual\/2025\/loc\/san-diego\/poster\/116927\">Fast exact recovery of noisy matrix from few entries: the infinity norm approach<br><\/a><\/p>\n\n\n\n<p><\/p>\n","protected":false},"excerpt":{"rendered":"<p>My notes on Day 2 of Neurips 2025. <\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[1,167],"tags":[],"class_list":["post-5410","post","type-post","status-publish","format-standard","hentry","category-blog","category-conference"],"acf":[],"_links":{"self":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/5410","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/comments?post=5410"}],"version-history":[{"count":18,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/5410\/revisions"}],"predecessor-version":[{"id":7309,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/posts\/5410\/revisions\/7309"}],"wp:attachment":[{"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/media?parent=5410"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/categories?post=5410"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sinatootoonian.com\/index.php\/wp-json\/wp\/v2\/tags?post=5410"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}