
Benchmarks
Aquin Labs · April 2026
What makes a feature real?
When a sparse autoencoder decomposes a model's residual stream into tens of thousands of features, a natural question surfaces immediately: which ones are real? A feature vector is just a direction in a high-dimensional space. Before it can be trusted, several distinct things need to be true about it. Its label should predict where it fires. It should fire for one coherent concept rather than many unrelated ones. And when it fires, the model should actually use it.
These are not the same question. A feature can be well-labeled but polysemantic. It can be monosemantic but ignored downstream. It can causally matter to predictions but carry a label that misses the point entirely. Aquin evaluates these dimensions independently and surfaces them as a unified benchmark panel inside the tool, so you can see the full picture on any feature you select.
The benchmarks
Each benchmark is computed live in the tool when you select a feature from the panel. They share the same prompt context and the same underlying model run, so results are directly comparable within a session.
InterpScore: Cohen's d between activation distributions on matching vs. non-matching sentences. Measures how well a feature's label predicts where the feature fires.
FeaturePurityScore: Mean pairwise cosine similarity of sentence embeddings that activate the feature. Measures whether a feature fires for one coherent concept or many unrelated ones.
Model Utilization Index (MUI): KL divergence when the feature is ablated, normalized by baseline entropy. Measures whether the model actually uses the feature or largely ignores it downstream.
InterpScore: does the label predict firing?
The simplest and most direct benchmark. Given a feature's label, we prompt a language model, in isolated runs, to generate two sets of sentences: ones where the feature should fire according to the label, and ones where it should not. We then run each sentence through the model, extract the feature's maximum activation across all token positions at layer 8, and compute Cohen's d between the two activation distributions.
Cohen's d is the difference in means divided by the pooled standard deviation. A score of 0.8 or higher is conventionally considered a large effect size. We clip to [0, 1] for display. A high InterpScore means the label is doing real predictive work: sentences that should activate the feature actually do, and sentences that should not are largely silent.
The default evaluation runs 10 positive and 10 negative sentences per feature. Each sentence is its own forward pass through the full model and SAE. The panel surfaces the highest-activating examples from each set, so you can read what the feature actually fires on rather than trusting the label alone.
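The computation above can be sketched in a few lines. This is a minimal illustration of the Cohen's d calculation, not Aquin's actual implementation; the function name and inputs (one max-activation value per sentence, as described above) are assumptions for the example.

```python
import numpy as np

def interp_score(pos_acts, neg_acts):
    """Cohen's d between the positive and negative activation
    distributions, clipped to [0, 1] for display.

    pos_acts / neg_acts: the feature's max activation per sentence,
    one value per generated sentence. Illustrative sketch only.
    """
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    n1, n2 = len(pos), len(neg)
    # Pooled standard deviation across the two groups.
    pooled_var = ((n1 - 1) * pos.var(ddof=1) + (n2 - 1) * neg.var(ddof=1)) / (n1 + n2 - 2)
    d = (pos.mean() - neg.mean()) / np.sqrt(pooled_var)
    return float(np.clip(d, 0.0, 1.0))
```

With well-separated distributions the raw d easily exceeds 1, so the clip saturates; the score only discriminates below that ceiling.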
example: feature f13910 "capital/seat-of-government"
Fires on
"The capital of France is a major European hub." · 8.41
"Parliament sits at the seat of government." · 7.86
"Washington D.C. is where the president works." · 6.93
Silent on
"She ordered a coffee and opened her laptop." · 0.12
"The algorithm runs in linear time." · 0.08
"Three dogs sat under the oak tree." · 0.03
Cohen's d = (8.07 − 0.08) / pooled_std ≈ 0.84. InterpScore: 84%.
A low InterpScore does not always mean the label is wrong. It can also mean the label is too abstract to generate useful contrasting sentences, or that the feature fires so broadly that any reasonable label undershoots it. That is where FeaturePurityScore comes in.
FeaturePurityScore: is it about one thing?
InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, independently of its label. The question it asks is: do the contexts that activate this feature resemble each other?
We take the positive examples generated for InterpScore (the sentences where the feature actually fired above threshold) and embed them. We then compute the mean pairwise cosine similarity of the resulting embedding matrix, keeping only the upper triangle to exclude self-similarity. Cosine similarity lives in [−1, 1]; we remap to [0, 1] for display.
A high purity score means the activating contexts cluster tightly in embedding space: the feature is monosemantic, firing for one coherent concept. A low purity score means the contexts are scattered: the feature fires for several unrelated things, making it polysemantic and harder to trust as an interpretable unit.
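As a sketch, the purity computation reduces to a normalized Gram matrix and an upper-triangle mean. The function name is illustrative, and the sentence embeddings are assumed to come from whatever embedder the tool uses; only the similarity arithmetic is shown.

```python
import numpy as np

def purity_score(embeddings):
    """Mean pairwise cosine similarity of the activating-sentence
    embeddings, remapped from [-1, 1] to [0, 1]. Sketch only."""
    E = np.asarray(embeddings, dtype=float)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    sims = E @ E.T                                    # cosine similarity matrix
    iu = np.triu_indices(len(E), k=1)                 # upper triangle, no diagonal
    mean_sim = sims[iu].mean()
    return float((mean_sim + 1.0) / 2.0)              # remap [-1, 1] -> [0, 1]
```

Note that the remap places orthogonal (unrelated) contexts at 0.5, not 0, which is why the "low purity" example below still reads as 61%.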
purity: high vs low
High purity · feature f5042 "relational prepositions"
"The cat sat on the mat."
"She lives near the river."
"The book is beside the lamp."
"He stood behind the door."
mean cosine sim ≈ 0.81 → purity 90%
Low purity · hypothetical polysemantic feature
"The merger was announced at noon."
"She whispered in the dark."
"The algorithm converged slowly."
"He scored three goals."
mean cosine sim ≈ 0.21 → purity 61%
Polysemantic features are a known failure mode of sparse autoencoders trained with insufficient sparsity penalty or too few features relative to the residual stream's effective dimensionality. A low purity score is an early warning: weight editing or causal ablation experiments that target this feature are likely to produce unexpected side effects, because the feature is entangled with multiple unrelated concepts.
Model Utilization Index: does the model use it?
A feature can have a perfect label and perfect purity and still be functionally irrelevant. If ablating the feature leaves the model's output distribution unchanged, the feature is firing but not influencing anything downstream. The Model Utilization Index (MUI) measures this directly.
For each token position where the feature fires above a small threshold, we perform a causal ablation: we zero out that feature's contribution to the residual stream by computing the difference between the original SAE decode and the decode with the feature zeroed, then subtracting it from the residual. We run the ablated forward pass through the remaining layers and collect the final logit distribution. We then compute KL divergence between the baseline distribution and the ablated distribution.
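The ablation step can be illustrated with a toy SAE. Everything here is a stand-in: the `ToySAE` class, its random weights, and the function names are assumptions for the example; only the zero-the-feature-and-subtract-the-decode-difference logic mirrors the procedure described above.

```python
import numpy as np

class ToySAE:
    """Toy ReLU autoencoder standing in for the real SAE (illustrative only)."""
    def __init__(self, d_model=8, n_features=32, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(size=(d_model, n_features))
        self.W_dec = rng.normal(size=(n_features, d_model))

    def encode(self, resid):
        return np.maximum(resid @ self.W_enc, 0.0)  # feature activations

    def decode(self, acts):
        return acts @ self.W_dec                    # reconstruction

def ablate_feature(sae, resid, feature_idx):
    """Zero one feature's contribution to the residual stream."""
    acts = sae.encode(resid)
    acts_zeroed = acts.copy()
    acts_zeroed[..., feature_idx] = 0.0
    # The decode difference isolates exactly this feature's contribution.
    delta = sae.decode(acts) - sae.decode(acts_zeroed)
    return resid - delta
```

The ablated residual then replaces the original at that position, and the remaining layers run forward unchanged.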
MUI is the mean KL across all firing positions, normalized by the baseline Shannon entropy of the output distribution. Normalization matters because a high-entropy model is already uncertain, so a given KL shift means less than it would for a low-entropy, confident model. The final score is clipped to [0, 1].
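The aggregation step is then a mean of per-position KL terms over a single entropy normalizer. The sketch below assumes one baseline logit vector and a list of ablated logit vectors, one per firing position; names are illustrative, not Aquin's API.

```python
import numpy as np

def _softmax(logits):
    z = np.asarray(logits, dtype=float)
    z = z - z.max()           # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def mui(baseline_logits, ablated_logits_per_position, eps=1e-12):
    """Mean KL(baseline || ablated) across firing positions, normalized
    by the baseline Shannon entropy and clipped to [0, 1]. Sketch only."""
    p = _softmax(baseline_logits)
    entropy = -float(np.sum(p * np.log(p + eps)))
    kls = []
    for logits in ablated_logits_per_position:
        q = _softmax(logits)
        kls.append(float(np.sum(p * np.log((p + eps) / (q + eps)))))
    return float(np.clip(np.mean(kls) / entropy, 0.0, 1.0))
```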
per-position KL divergence: feature f13933 "geographic country associations"
Firing positions: "...capital of France...", "...France is a country...", "...visiting Paris soon..."
Mean KL divergence: 1.752
Baseline entropy H: 2.3104
MUI = 1.752 / 2.3104 ≈ 76%
feature fires strongest at the "France" token. ablating it shifts the output distribution significantly. MUI = 76%.
High MUI flags features that are causally load-bearing even if poorly labeled. These are the most important candidates for relabeling. Low MUI features, conversely, can be safely deprioritized: they may fire reliably but carry little causal weight in any given inference.
Reading the scores together
The benchmarks are most useful in combination. The panel surfaces an interpretation for each combination of high and low scores, so you can act on the result without mentally reconstructing what the pattern means. The most common patterns and their interpretations are shown below.
InterpScore / FeaturePurityScore / MUI: interpretation
high / high / high: Ideal feature. Well-labeled, monosemantic, and causally critical.
high / high / low: Well-understood but largely decorative. The model doesn't rely on it.
high / low / any: Label is predictive but too broad. Feature fires for several related contexts.
low / high / high: Coherent and causally important, but mislabeled. Priority for relabeling.
low / low / low: Dead or noise feature. Consider filtering it out.
threshold for "high" is 0.6 across all benchmarks. scores are displayed as percentages in the panel.
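The score-pattern lookup is simple enough to state as code. This is a paraphrase of the panel's interpretations, not its implementation; the function name and return strings are assumptions for the example, and the 0.6 threshold is the one stated above.

```python
def interpret(interp, purity, mui, hi=0.6):
    """Map the three benchmark scores to an interpretation label
    (paraphrased from the panel's patterns; threshold 0.6)."""
    h = lambda x: x >= hi
    if h(interp) and h(purity) and h(mui):
        return "ideal feature"
    if h(interp) and h(purity) and not h(mui):
        return "decorative: model doesn't rely on it"
    if h(interp) and not h(purity):
        return "label too broad"
    if not h(interp) and h(purity) and h(mui):
        return "mislabeled: priority for relabeling"
    if not (h(interp) or h(purity) or h(mui)):
        return "dead or noise feature"
    return "mixed pattern"
```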
The most actionable pattern is the fourth row: high purity and high MUI with low InterpScore. The feature is coherent, the model relies on it, but the label misses the concept. In the panel this surfaces as a direct recommendation to relabel. The label was generated from a single prompt context; running a second prompt through the model and inspecting the new activating examples is usually enough to refine it.
The fifth row, low across the board, is a dead feature. These are relatively common in SAEs trained with a high sparsity penalty. They survive training but rarely activate above threshold in practice, and when they do, the activations are noisy. The panel surfaces these as candidates for filtering from the feature list before downstream analysis.
