My notes on Day of NeurIPS 2025. I wanted to wait till they were more complete, but it’s three months later now and they’re useful as is!
Explainable AI
Attribution
- System output is a function of:
- Training data, which makes the system sensitive to
- Input features, that are transformed by
- Model components to produce an output
- Approaches for attributing outputs to features
- Perturbation based approaches
- Compare system output with a feature present and without it
- Game-theoretic approach
- Compute marginal contribution of units
- Combine these into Shapley metrics
- Gradient based approaches
- Partial derivatives of the output with respect to input features
- Produces saliency maps
- Linear approximations
- Approximate model as locally linear: $f(x) \approx w^T x + b.$
- Often enough to replace $x$ with binary indicator of feature presence.
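The game-theoretic recipe above (marginal contributions of units, combined into Shapley metrics) can be made concrete for a tiny model by enumerating all coalitions exactly. A minimal sketch; `shapley_values` and the toy additive model are my own illustrative names, not from the tutorial:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n_features):
    """Exact Shapley values for a set function f over feature subsets.

    f maps a frozenset of feature indices to a scalar model output;
    only feasible for small n_features (2^n coalitions).
    """
    n = n_features
    values = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of feature j to coalition s
                phi += weight * (f(s | {j}) - f(s))
        values.append(phi)
    return values

# Toy additive model: output is the sum of the present features' weights,
# so each feature's Shapley value should recover its weight.
weights = [2.0, -1.0, 0.5]
f = lambda s: sum(weights[i] for i in s)
print(shapley_values(f, 3))
```

For a purely additive model the Shapley values coincide with the per-feature weights, which is a useful sanity check before applying the same recipe to a real model via sampling.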
- Data attribution
- Perturbation based: e.g. Leave One Out: $f(x) - f^{-j}(x)$
- Game theoretic metric: Data Shapley
- Gradient-based
- Loss gradient overlap
- Influence functions: how much do parameters change without a specific training example: $\theta_{\epsilon, x_j} = \text{argmin}_\theta {1 \over n} \sum_{i=1}^n L(x_i, \theta) + \epsilon L(x_j, \theta)$, with removal corresponding to $\epsilon = -{1 \over n}$.
- Linear approximations
- Skip training, directly predict model output with a linear model.
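The Leave One Out idea for data attribution can be sketched on a closed-form model, where retraining without each example is cheap. A minimal illustration with a 1-D least-squares fit; the function names and the toy data are mine:

```python
def fit_linear(xs, ys):
    """Closed-form least-squares fit y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def loo_attribution(xs, ys, x_test):
    """f(x) - f^{-j}(x): change in the prediction at x_test
    when training example j is dropped and the model is refit."""
    a, b = fit_linear(xs, ys)
    full = a * x_test + b
    scores = []
    for j in range(len(xs)):
        aj, bj = fit_linear(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])
        scores.append(full - (aj * x_test + bj))
    return scores

xs, ys = [0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.2, 2.9]
print(loo_attribution(xs, ys, 2.5))
```

For large models this exact refit is infeasible, which is exactly what the gradient-based proxies (influence functions, loss-gradient overlap) are approximating.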
- Component attribution
- Component: Neuron, subnetwork, etc.
- Perturbation based: Causal Mediation Analysis
- Output with and without a specific component.
- Game Theoretic: Neural Shapley
- Causal tracing:
- First, get the normal system output, $x \to f(x)$: What is the Capital of France?
- Then, get the output to a perturbed input, $x' \to f(x')$: What is the Capital of ____?
- Finally, bring in the component activation from the unperturbed case: $x' \to f_{k^*}(x')$.
- Perturb the identified component to obtain target behaviour.
- Gradient-based component attribution:
- $f_{k^*}(x') - f(x') \approx \nabla_{c_k} f(x') \, [c_k(x) - c_k(x')]$?
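The causal-tracing steps above can be sketched on a toy two-layer "network" whose hidden units play the role of components $c_k$: run the clean input, cache a component's activation, then re-run the corrupted input with that one activation restored. All names here are illustrative, not from the talk:

```python
def hidden(x):
    """Component activations c(x) of a toy 2-layer model."""
    return [x[0] + x[1], x[0] - x[1]]

def output(h):
    """Readout f from the hidden activations."""
    return 2 * h[0] + 3 * h[1]

def f(x):
    return output(hidden(x))

def f_patched(x_corrupt, x_clean, k):
    """f_{k*}(x'): run the corrupted input, but restore
    component k's activation from the clean run."""
    h = hidden(x_corrupt)
    h[k] = hidden(x_clean)[k]
    return output(h)

x_clean, x_corr = [1.0, 2.0], [0.0, 0.0]
for k in range(2):
    effect = f_patched(x_corr, x_clean, k) - f(x_corr)
    print(f"restoring component {k} shifts the corrupted output by {effect}")
```

Components whose restored activation moves the corrupted output back toward the clean one are the mediators that causal tracing identifies.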
Building Inherently Explainable AI systems
- Explainability as communication channel between the model and the human.
- Inherently explainable architectures:
- Replacing neural network layers to force explicit representations of human-understandable concepts.
- Modifying loss functions accordingly
- Does not cause performance loss.
- Transformers to Generalized Additive Models
- Backpack Language Models
- Inherently explainable training:
- Gradient-based and perturbation-based approaches can attribute answers differently than masking-based approaches.
- One solution: change training to reflect the post-hoc explainability paradigm.
- E.g. training on randomly masked data.
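Training on randomly masked data, as suggested above, can be sketched as a simple input transform so that masked inputs seen by post-hoc explainers are in-distribution. A hypothetical setup; the function name and masking convention (zeroing features) are my assumptions:

```python
import random

def mask_features(x, p=0.3, mask_value=0.0, rng=random):
    """Randomly replace each feature with mask_value with probability p.
    Applied during training so the model learns to handle the same
    masked inputs that perturbation-based explainers will feed it."""
    return [mask_value if rng.random() < p else v for v in x]

random.seed(0)
batch = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
masked = [mask_features(x) for x in batch]
print(masked)
```

The choice of `mask_value` should match the baseline used by the explainer (zeros, a mean value, or a learned mask token), otherwise the train-time and explanation-time distributions diverge again.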
Concepts and References
- LogitLens
- Shapley / Data Shapley / Neural Shapley
- Causal Mediation Analysis
- Kumar et al. 2022: probing classifiers can rely on spurious features.
- Backpack Language Models
- Generalized Additive Models
- Neural Additive Models
Geometric Deep Learning
- Groups capture symmetries
- Invariant neural networks:
- Output on a transformed input is the same: $f(g.x) = f(x)$
- Equivariance:
- Output respects transformed input: $f(g.x) = g.f(x)$
- Why do we care?
- Learning efficiency
- Noise robustness
- Can affect loss-landscapes, learning
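The two definitions above can be checked numerically for the permutation group: a sum-pool is invariant, while an elementwise map is equivariant. A minimal sketch with illustrative functions of my own:

```python
def f_inv(x):
    """Invariant map: f(g.x) = f(x). Sum-pooling ignores order."""
    return sum(v * v for v in x)

def f_eq(x):
    """Equivariant map: f(g.x) = g.f(x). Elementwise ops commute
    with permutations."""
    return [v * v for v in x]

def g(x, perm):
    """Group action: permute the coordinates."""
    return [x[i] for i in perm]

x, perm = [1.0, 2.0, 3.0], [2, 0, 1]
assert f_inv(g(x, perm)) == f_inv(x)            # invariance
assert f_eq(g(x, perm)) == g(f_eq(x), perm)     # equivariance
print("invariance and equivariance hold for this permutation")
```

This is the same structure DeepSet-style architectures exploit: equivariant elementwise layers followed by an invariant pooling layer.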
Achieving equivariance
- Modifying off-the-shelf (non-equivariant) networks
- Canonicalization
- E.g. alignment, registration
- Problem: not always continuous: small input changes produce large output changes.
- Group averaging
- Average output of all group transformed inputs
- Nice mathematical properties, including continuity
- Problem: Groups can be huge, enumerating all group elements can be hard.
- Solution: Don’t need to use full group, group generators suffice.
- Problem: Group generators can be hard to find.
- Solution: A not too large random subset of group elements can approximately capture the symmetries.
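The group-averaging construction above can be sketched for the cyclic group $C_4$ of planar rotations: averaging any function over the group elements yields an exactly invariant function. The function names and toy point are my own:

```python
import math

def rotate(p, theta):
    """Rotate a 2-D point by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def group_average(f, p, n=4):
    """Symmetrize f over the cyclic group C_n of rotations:
    the average is invariant to any element of C_n."""
    return sum(f(rotate(p, 2 * math.pi * k / n)) for k in range(n)) / n

f = lambda p: p[0]                 # not invariant on its own
x = (1.0, 2.0)
g_x = rotate(x, math.pi / 2)       # act with one C_4 group element
print(abs(group_average(f, x) - group_average(f, g_x)) < 1e-9)
```

Here the group has only four elements, so exact enumeration is trivial; the problems noted above appear when the group is large or continuous, which is where generators or random subsets come in.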
- Building equivariant-networks
- Data augmentation: train on group transformed data.
- Problem: No guarantee to exactly capture the symmetry.
- Weight-sharing: symmetries are reflected in a fixed weight structure.
- E.g. CNNs capturing translation invariance.
- Problem: Might not be sufficiently expressive.
- From invariant theory:
- Build networks sensitive to specific small order polynomial functions of the input.
- E.g. to make invariant to rotations, use dot products.
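The invariant-theory idea of building on dot products can be sketched with the Gram matrix of a point cloud: pairwise dot products are unchanged by any rotation, so any network fed the Gram matrix is rotation-invariant by construction. An illustrative sketch with my own names:

```python
import math

def gram(points):
    """Pairwise dot products of the points: a rotation-invariant
    representation of the cloud (the invariant-theory building block)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points]
            for p in points]

def rotate2d(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

pts = [(1.0, 0.0), (0.0, 2.0)]
rot = [rotate2d(p, 0.7) for p in pts]
same = all(abs(gram(pts)[i][j] - gram(rot)[i][j]) < 1e-9
           for i in range(2) for j in range(2))
print(same)  # dot products are unchanged by the rotation
```

Low-order polynomial invariants like these can be much cheaper than group averaging, at the cost of having to know which invariants generate the ones you need.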
Concepts and References
- Equivariance
- SignNet
- DeepSet
- Villar “Machine Learning and Invariant Theory”
- “Symmetries in Neural Network Parameter Space.”
- Weight space learning “Neural Nets as Data”
- Model merging
- Symmetry-invariant optimization
- Neural Radiance Fields
- Implicit Neural Representations
- Any-dimensional models
- “Representational Stability” B Farb ICM 2014
- Manifold hypothesis helps with the curse of dimensionality if curvature is bounded below by a positive constant
- E.g. avoid hyperbolas, space-filling shapes.