Uncovering Critical Interactions in Large Language Models at Scale
Introduction
Modern artificial intelligence systems, particularly Large Language Models (LLMs), exhibit remarkable capabilities but remain notoriously opaque. Understanding how these models arrive at their decisions is essential for building trust, ensuring safety, and enabling responsible deployment. The field of interpretability research offers several lenses through which we can examine model behavior: feature attribution identifies the input features that most influence a prediction (Lundberg & Lee, 2017; Ribeiro et al., 2022); data attribution links model outputs to specific training examples (Koh & Liang, 2017; Ilyas et al., 2022); and mechanistic interpretability dissects the functions of internal components (Conmy et al., 2023; Sharkey et al., 2025). While each approach provides valuable insights, they all confront a common obstacle: complexity at scale.

The Challenge of Scale: Why Interactions Matter
Model behavior rarely stems from isolated factors. Instead, it emerges from intricate dependencies among features, training data points, and internal components. To achieve state-of-the-art performance, LLMs must synthesize complex feature relationships, discover shared patterns across diverse training examples, and process information through highly interconnected neural pathways. As the number of features, data points, and components grows, the number of possible pairwise interactions grows quadratically and the number of higher-order interactions grows exponentially, making exhaustive analysis computationally prohibitive.
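To make that growth concrete, the short snippet below counts the candidate interactions for a hypothetical input of 100 features; exhaustively ablating every subset is plainly out of reach.

```python
# Illustration of the combinatorial blow-up for a hypothetical 100-feature input:
# even low-order interactions number in the thousands, and the full set of
# subsets is astronomically large, so exhaustive ablation is infeasible.
import math

n = 100
print("pairs:      ", math.comb(n, 2))   # 4,950
print("triplets:   ", math.comb(n, 3))   # 161,700
print("all subsets:", 2 ** n)            # roughly 1.3e30
```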
Traditional interpretability methods often assume independence or linearity, but real-world model behavior is inherently interactive. For instance, the phrase 'bank deposit' in a financial context triggers a different set of model activations than 'river bank': the interaction between words changes the meaning. Similarly, the influence of a training example may depend on the presence of other examples in the dataset. Without accounting for these interactions, attributions can be misleading or incomplete. For a faithful understanding, we must capture these critical interactions without incurring exponential costs.
Attribution Through Ablation: A Foundational Concept
A powerful and intuitive way to measure influence is through ablation—systematically removing or masking a component and observing the resulting change in the model's output. This principle applies across different interpretability lenses:
- Feature Attribution: We mask specific segments of the input prompt and measure the shift in predictions. For example, erasing words like 'not' can drastically alter sentiment classification.
- Data Attribution: We retrain models on subsets of the training set, assessing how the output on a test point changes when certain training data is omitted.
- Model Component Attribution (Mechanistic Interpretability): We intervene on the forward pass by zeroing out or removing the influence of specific neurons, layers, or attention heads, thereby identifying which internal structures drive a prediction.
In each case, the goal is to isolate the drivers of a decision by perturbing the system. However, each ablation incurs a significant cost—either through expensive inference calls (especially for large LLMs) or retraining models from scratch. Therefore, we seek to compute attributions with the fewest possible ablations, while still capturing the most influential interactions.
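As a deliberately toy illustration of the feature-attribution case, the sketch below masks one token at a time and records the change in a scalar model score. The `score_fn` here is a hypothetical placeholder; in practice it would wrap an expensive LLM call, and each masked variant would cost one inference.

```python
# Minimal sketch of single-feature ablation attribution.
from typing import Callable, List

def score_fn(tokens: List[str]) -> float:
    # Toy stand-in for an expensive LLM call; returns a scalar such as the
    # score of the "positive" label. Here the word "not" lowers the score.
    return 1.0 - 0.8 * ("not" in tokens)

def ablation_attributions(tokens: List[str],
                          score: Callable[[List[str]], float],
                          mask_token: str = "[MASK]") -> List[float]:
    # Attribution of token i = score(full input) - score(input with token i masked).
    base = score(tokens)
    return [base - score(tokens[:i] + [mask_token] + tokens[i + 1:])
            for i in range(len(tokens))]

prompt = "the movie was not good".split()
for token, attr in zip(prompt, ablation_attributions(prompt, score_fn)):
    print(f"{token:>6}: {attr:+.2f}")
```

Even this simplest scheme needs one model call per token, which is why keeping the total number of ablations small matters so much at LLM scale.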
The SPEX and ProxySPEX Framework
Scaling Interaction Discovery with Sparse Ablations
To identify influential interactions at scale, we developed the SPEX (Sparse Principal Explanation eXtraction) algorithm and its efficient variant, ProxySPEX. These methods are designed to discover which pairs or groups of features, data points, or components interact most strongly to affect model output, using a tractable number of ablation experiments.

The core idea is to formulate interaction discovery as a sparse identification problem. Rather than testing all possible interactions—an impossible task for large-scale systems—SPEX samples a subset of ablations strategically, guided by an optimization objective that seeks to reconstruct the full interaction landscape from limited observations. By leveraging techniques from compressed sensing and sparse recovery, SPEX can pinpoint the most significant interactions without enumerating all possibilities.
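The snippet below is not SPEX itself, but a minimal illustration of the sparse-identification idea under stated assumptions: sample far fewer ablation masks than there are possible interactions, then use a sparse regression (here, a Lasso over main-effect and pairwise terms) to recover the handful of coefficients that explain the output. The toy `model_output` and all parameter choices are assumptions made for the demo.

```python
# Toy sparse recovery of interactions from a limited number of random ablations.
# Assumes numpy and scikit-learn are available.
import itertools
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_features, n_ablations = 12, 200   # 200 ablations vs. 2**12 = 4096 possible masks

def model_output(mask: np.ndarray) -> float:
    # Toy ground truth with one main effect and one strong pairwise interaction.
    return 2.0 * mask[3] + 3.0 * mask[1] * mask[7] + 0.05 * rng.standard_normal()

# 1. Sample ablation masks (1 = feature kept, 0 = feature ablated) and query the model.
masks = rng.integers(0, 2, size=(n_ablations, n_features))
y = np.array([model_output(m) for m in masks])

# 2. Build a design matrix of main effects plus all pairwise products.
pairs = list(itertools.combinations(range(n_features), 2))
X = np.hstack([masks, np.array([[m[i] * m[j] for i, j in pairs] for m in masks])])

# 3. Sparse recovery: the Lasso keeps only the few terms that explain the output.
fit = Lasso(alpha=0.05).fit(X, y)
for idx, coef in enumerate(fit.coef_):
    if abs(coef) > 0.1:
        name = f"feature {idx}" if idx < n_features else f"pair {pairs[idx - n_features]}"
        print(f"{name}: {coef:.2f}")
```

Recovery succeeds in this toy setting because only a few terms are truly active, which is exactly the sparsity assumption that makes strategic sampling viable at scale.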
ProxySPEX: Faster Approximations
While SPEX already reduces computational demands, ProxySPEX goes further by employing a proxy model to approximate ablation effects. Instead of performing expensive inference or retraining for every candidate ablation, ProxySPEX trains a lightweight surrogate that mimics the ablation behavior of the original model. This surrogate is then used to guide the selection of the most informative ablation experiments, dramatically reducing the number of actual model evaluations required. The result is a method that can identify critical interactions in LLMs with thousands of features or millions of data points, all within a practical budget.
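One way to picture the proxy idea (the details below are illustrative assumptions, not the exact ProxySPEX procedure) is to fit a cheap surrogate, such as gradient-boosted trees, on a modest budget of real ablation results, then interrogate the surrogate instead of the model. The sketch estimates a pairwise interaction from the surrogate with a simple 2x2 difference; `expensive_model` is a toy placeholder for real LLM inference.

```python
# Hedged sketch of a surrogate-based workflow, assuming numpy and scikit-learn.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n_features = 20

def expensive_model(mask: np.ndarray) -> float:
    # Placeholder for a costly LLM ablation call; contains a (0, 5) interaction.
    return 4.0 * mask[0] * mask[5] + 1.5 * mask[9]

# Train the lightweight surrogate on a small number of real ablations.
train_masks = rng.integers(0, 2, size=(300, n_features))
train_y = np.array([expensive_model(m) for m in train_masks])
proxy = GradientBoostingRegressor(n_estimators=200, max_depth=3).fit(train_masks, train_y)

def proxy_interaction(i: int, j: int, base: np.ndarray) -> float:
    # Estimate the (i, j) interaction from the surrogate via a 2x2 difference,
    # with no further calls to the expensive model.
    def pred(vi, vj):
        m = base.copy()
        m[i], m[j] = vi, vj
        return proxy.predict(m.reshape(1, -1))[0]
    return pred(1, 1) - pred(1, 0) - pred(0, 1) + pred(0, 0)

base = np.ones(n_features, dtype=int)
print("estimated (0, 5) interaction:", round(proxy_interaction(0, 5, base), 2))
print("estimated (2, 3) interaction:", round(proxy_interaction(2, 3, base), 2))
```

The surrogate can be queried thousands of times for the cost of a few hundred real ablations, and only the interactions it flags as promising need to be verified on the original model.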
Practical Implications and Future Directions
The ability to uncover interactions at scale has profound implications for model debugging, fairness auditing, and safety. For example, an LLM used in medical diagnosis may inappropriately rely on an interaction between a patient's race and a medication name, a harmful spurious correlation. Interaction-aware attribution can flag such dependencies, enabling developers to mitigate bias. Similarly, in mechanistic interpretability, understanding how attention heads interact across layers can reveal why a model exhibits certain failure modes.
Future work could extend these algorithms to handle even higher-order interactions (triplets, quadruplets) and to integrate with automated interpretability pipelines. As LLMs continue to grow in size and complexity, methods like SPEX and ProxySPEX will become essential tools for building transparent and trustworthy AI systems.
Conclusion
Interpretability at scale demands we move beyond independent attributions and embrace the reality of interactions. By combining the principled concept of ablation with smart sparse recovery techniques, SPEX and ProxySPEX offer a practical way to identify the most influential interactions among features, data points, or components—without an exponential price tag. These methods bring us one step closer to truly understanding the inner workings of large language models and ensuring they behave as intended.