My notes on Day of NeurIPS 2025. I wanted to wait till they were more complete, but it’s three months later now and they’re useful as is!
Explainable AI
Attribution
- System output is a function of:
- Training data, which makes the system sensitive to
- Input features, that are transformed by
- Model components to produce an output
- Approaches for attributing outputs to features
- Perturbation based approaches
- Compare system output with a feature present and without it
- Game-theoretic approach
- Compute marginal contribution of units
- Combine these into Shapley metrics
- Gradient based approaches
- Partial derivatives of the output with respect to input features
- Produces saliency maps
- Linear approximations
- Approximate model as locally linear: $f(x) \approx w^T x + b.$
- Often enough to replace $x$ with binary indicator of feature presence.
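The game-theoretic recipe above (marginal contributions of units, combined into Shapley metrics) can be made concrete for a tiny model by enumerating all coalitions exactly. A minimal sketch; `shapley_values` and the toy additive model are my own illustrative names, not from the tutorial:

```python
from itertools import combinations
from math import factorial

def shapley_values(f, n_features):
    """Exact Shapley values for a set function f over feature subsets.

    f maps a frozenset of feature indices to a scalar model output;
    only feasible for small n_features (2^n coalitions).
    """
    n = n_features
    values = []
    for j in range(n):
        others = [i for i in range(n) if i != j]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Shapley weight for a coalition of size k
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # Marginal contribution of feature j to coalition s
                phi += weight * (f(s | {j}) - f(s))
        values.append(phi)
    return values

# Toy additive model: output is the sum of the present features' weights,
# so each feature's Shapley value should recover its weight.
weights = [2.0, -1.0, 0.5]
f = lambda s: sum(weights[i] for i in s)
print(shapley_values(f, 3))
```

For a purely additive model the Shapley values coincide with the per-feature weights, which is a useful sanity check before applying the same recipe to a real model via sampling.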
- Data attribution
- Perturbation based: e.g. Leave One Out: $f(x) - f^{-j}(x)$
- Game theoretic metric: Data Shapley
- Gradient-based
- Loss gradient overlap
- Influence functions: how much do parameters change without a specific training example: $\theta_{\epsilon, x_j} = \text{argmin}_\theta {1 \over n} \sum_{i=1}^n L(x_i, \theta) + \epsilon L(x_j, \theta)$, with removal corresponding to $\epsilon = -{1 \over n}$.
- Linear approximations
- Skip training, directly predict model output with a linear model.
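The Leave One Out idea for data attribution can be sketched on a closed-form model, where retraining without each example is cheap. A minimal illustration with a 1-D least-squares fit; the function names and the toy data are mine:

```python
def fit_linear(xs, ys):
    """Closed-form least-squares fit y ≈ a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

def loo_attribution(xs, ys, x_test):
    """f(x) - f^{-j}(x): change in the prediction at x_test
    when training example j is dropped and the model is refit."""
    a, b = fit_linear(xs, ys)
    full = a * x_test + b
    scores = []
    for j in range(len(xs)):
        aj, bj = fit_linear(xs[:j] + xs[j + 1:], ys[:j] + ys[j + 1:])
        scores.append(full - (aj * x_test + bj))
    return scores

xs, ys = [0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.2, 2.9]
print(loo_attribution(xs, ys, 2.5))
```

For large models this exact refit is infeasible, which is exactly what the gradient-based proxies (influence functions, loss-gradient overlap) are approximating.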
- Component attribution
- Component: Neuron, subnetwork, etc.
- Perturbation based: Causal Mediation Analysis
- Output with and without a specific component.
- Game Theoretic: Neural Shapley
- Causal tracing:
- First, get the normal system output, $x \to f(x)$: What is the Capital of France?
- Then, get the output to a perturbed input, $x' \to f(x')$: What is the Capital of ____?
- Finally, bring in the component activation from the unperturbed case: $x' \to f_{k^*}(x')$.
- Perturb the identified component to obtain target behaviour.
- Gradient-based component attribution:
- $f_{k^*}(x') - f(x') \approx \nabla_{c_k} f(x') \, [c_k(x) - c_k(x')]$?
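The causal-tracing steps above can be sketched on a toy two-layer "network" whose hidden units play the role of components $c_k$: run the clean input, cache a component's activation, then re-run the corrupted input with that one activation restored. All names here are illustrative, not from the talk:

```python
def hidden(x):
    """Component activations c(x) of a toy 2-layer model."""
    return [x[0] + x[1], x[0] - x[1]]

def output(h):
    """Readout f from the hidden activations."""
    return 2 * h[0] + 3 * h[1]

def f(x):
    return output(hidden(x))

def f_patched(x_corrupt, x_clean, k):
    """f_{k*}(x'): run the corrupted input, but restore
    component k's activation from the clean run."""
    h = hidden(x_corrupt)
    h[k] = hidden(x_clean)[k]
    return output(h)

x_clean, x_corr = [1.0, 2.0], [0.0, 0.0]
for k in range(2):
    effect = f_patched(x_corr, x_clean, k) - f(x_corr)
    print(f"restoring component {k} shifts the corrupted output by {effect}")
```

Components whose restored activation moves the corrupted output back toward the clean one are the mediators that causal tracing identifies.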
Building Inherently Explainable AI systems
- Explainability as communication channel between the model and the human.
- Inherently explainable architectures:
- Replacing neural network layers to force explicit representations of human-understandable concepts.
- Modifying loss functions accordingly
- Does not cause performance loss.
- Transformers to Generalized Additive Models
- Backpack Language Models
- Inherently explainable training:
- Gradient-based and perturbation-based approaches can attribute answers differently than masking-based approaches.
- One solution: change training to reflect the post-hoc explainability paradigm.
- E.g. training on randomly masked data.
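Training on randomly masked data, as suggested above, can be sketched as a simple input transform so that masked inputs seen by post-hoc explainers are in-distribution. A hypothetical setup; the function name and masking convention (zeroing features) are my assumptions:

```python
import random

def mask_features(x, p=0.3, mask_value=0.0, rng=random):
    """Randomly replace each feature with mask_value with probability p.
    Applied during training so the model learns to handle the same
    masked inputs that perturbation-based explainers will feed it."""
    return [mask_value if rng.random() < p else v for v in x]

random.seed(0)
batch = [[1.0, 2.0, 3.0, 4.0] for _ in range(3)]
masked = [mask_features(x) for x in batch]
print(masked)
```

The choice of `mask_value` should match the baseline used by the explainer (zeros, a mean value, or a learned mask token), otherwise the train-time and explanation-time distributions diverge again.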
Concepts and References
- LogitLens
- Shapley / Data Shapley / Neural Shapley
- Causal Mediation Analysis
- Kumar et al. 2022: probing classifiers can rely on spurious features.
- Backpack Language Models
- Generalized Additive Models
- Neural Additive Models
Geometric Deep Learning
- Groups capture symmetries
- Invariant neural networks:
- Output on a transformed input is the same: $f(g.x) = f(x)$
- Equivariance:
- Output respects transformed input: $f(g.x) = g.f(x)$
- Why do we care?
- Learning efficiency
- Noise robustness
- Can affect loss-landscapes, learning
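The two definitions above can be checked numerically for the permutation group: a sum-pool is invariant, while an elementwise map is equivariant. A minimal sketch with illustrative functions of my own:

```python
def f_inv(x):
    """Invariant map: f(g.x) = f(x). Sum-pooling ignores order."""
    return sum(v * v for v in x)

def f_eq(x):
    """Equivariant map: f(g.x) = g.f(x). Elementwise ops commute
    with permutations."""
    return [v * v for v in x]

def g(x, perm):
    """Group action: permute the coordinates."""
    return [x[i] for i in perm]

x, perm = [1.0, 2.0, 3.0], [2, 0, 1]
assert f_inv(g(x, perm)) == f_inv(x)            # invariance
assert f_eq(g(x, perm)) == g(f_eq(x), perm)     # equivariance
print("invariance and equivariance hold for this permutation")
```

This is the same structure DeepSet-style architectures exploit: equivariant elementwise layers followed by an invariant pooling layer.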
Achieving equivariance
- Modifying off-the-shelf (non-equivariant) networks
- Canonicalization
- E.g. alignment, registration
- Problem: not always continuous: small input changes produce large output changes.
- Group averaging
- Average output of all group transformed inputs
- Nice mathematical properties, including continuity
- Problem: Groups can be huge, enumerating all group elements can be hard.
- Solution: Don’t need to use full group, group generators suffice.
- Problem: Group generators can be hard to find.
- Solution: A not too large random subset of group elements can approximately capture the symmetries.
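The group-averaging construction above can be sketched for the cyclic group $C_4$ of planar rotations: averaging any function over the group elements yields an exactly invariant function. The function names and toy point are my own:

```python
import math

def rotate(p, theta):
    """Rotate a 2-D point by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def group_average(f, p, n=4):
    """Symmetrize f over the cyclic group C_n of rotations:
    the average is invariant to any element of C_n."""
    return sum(f(rotate(p, 2 * math.pi * k / n)) for k in range(n)) / n

f = lambda p: p[0]                 # not invariant on its own
x = (1.0, 2.0)
g_x = rotate(x, math.pi / 2)       # act with one C_4 group element
print(abs(group_average(f, x) - group_average(f, g_x)) < 1e-9)
```

Here the group has only four elements, so exact enumeration is trivial; the problems noted above appear when the group is large or continuous, which is where generators or random subsets come in.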
- Building equivariant-networks
- Data augmentation: train on group transformed data.
- Problem: No guarantee to exactly capture the symmetry.
- Weight-sharing: symmetries are reflected in a fixed weight structure.
- E.g. CNNs capturing translation invariance.
- Problem: Might not be sufficiently expressive.
- From invariant theory:
- Build networks sensitive to specific small order polynomial functions of the input.
- E.g. to make invariant to rotations, use dot products.
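The invariant-theory idea of building on dot products can be sketched with the Gram matrix of a point cloud: pairwise dot products are unchanged by any rotation, so any network fed the Gram matrix is rotation-invariant by construction. An illustrative sketch with my own names:

```python
import math

def gram(points):
    """Pairwise dot products of the points: a rotation-invariant
    representation of the cloud (the invariant-theory building block)."""
    return [[sum(a * b for a, b in zip(p, q)) for q in points]
            for p in points]

def rotate2d(p, theta):
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

pts = [(1.0, 0.0), (0.0, 2.0)]
rot = [rotate2d(p, 0.7) for p in pts]
same = all(abs(gram(pts)[i][j] - gram(rot)[i][j]) < 1e-9
           for i in range(2) for j in range(2))
print(same)  # dot products are unchanged by the rotation
```

Low-order polynomial invariants like these can be much cheaper than group averaging, at the cost of having to know which invariants generate the ones you need.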
Concepts and References
- Equivariance
- SignNet
- DeepSet
- Villar “Machine Learning and Invariant Theory”
- “Symmetries in Neural Network Parameter Space.”
- Weight space learning “Neural Nets as Data”
- Model merging
- Symmetry-invariant optimization
- Neural Radiance Fields
- Implicit Neural Representations
- Any-dimensional models
- “Representational Stability” B Farb ICM 2014
- Manifold hypothesis helps with the curse of dimensionality if curvature is bounded below by a positive constant
- E.g. avoid hyperbolas, space-filling shapes.