Benchmarks
Aquin Labs · May 2026
How Aquin decides which SAE features are trustworthy, and how you can build and run custom benchmarks against any loaded model without leaving your inspection context.
Part 1: Feature Benchmarks
Most SAE features get used because they have a plausible-sounding label and a clean activation plot. That is not enough. A label can be wrong. A feature can be coherent but causally irrelevant. Before a feature earns a place in a circuit graph or a steering experiment, three properties need to hold independently: the label predicts where it fires, it is monosemantic, and it actually does work in the forward pass.
These conditions are orthogonal. A feature can pass two and fail the third in any combination. Aquin scores all three separately and surfaces them as a diagnostic triple. The combination tells you what to do next: relabel, filter, or trust.
InterpScore
The question InterpScore answers: does this feature's label predict when it fires? Two sentence sets are built per feature, one where the label implies the feature should activate and one where it should not. Both sets pass through the model, the maximum activation at layer 8 is extracted per sentence, and Cohen's d is computed between the two distributions. The result is clipped to [0, 1].
A score near 1 means the label and the feature agree. A score near 0 means they have drifted apart, so treat the auto-generated label as a guess. Each feature uses 10 positive and 10 negative sentences, which means 20 separate forward passes through the full model and SAE.
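The computation described above can be sketched in a few lines. This is a minimal illustration, not Aquin's implementation: it assumes the pooled-standard-deviation form of Cohen's d with sample variance, and the function names and the handling of the zero-variance case are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(pos, neg):
    """Cohen's d using a pooled standard deviation (sample variance, n - 1)."""
    n_pos, n_neg = len(pos), len(neg)
    pooled_var = (
        (n_pos - 1) * variance(pos) + (n_neg - 1) * variance(neg)
    ) / (n_pos + n_neg - 2)
    diff = mean(pos) - mean(neg)
    if pooled_var == 0.0:
        # Assumption: with no spread in either set, any gap counts as infinite.
        return math.copysign(math.inf, diff) if diff != 0 else 0.0
    return diff / math.sqrt(pooled_var)


def interp_score(pos_max_acts, neg_max_acts):
    """Clip Cohen's d to [0, 1], as the text describes."""
    return min(1.0, max(0.0, cohens_d(pos_max_acts, neg_max_acts)))


# Per-sentence maximum activations (made-up numbers for illustration).
print(interp_score([1.0, 2.0, 3.0], [0.5, 1.5, 2.5]))
```

Because of the clip, any effect size above 1 saturates the score, so InterpScore separates weak labels from adequate ones rather than ranking strong labels against each other.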
Cohen's d 0.84 · InterpScore 84%
Fires on
The capital of France is a major hub. · 8.41
Parliament sits at the seat of government. · 7.86
Washington D.C. is where the president works. · 6.93
Silent on
She ordered a coffee and opened her laptop. · 0.12
The algorithm runs in linear time. · 0.08
Three dogs sat under the oak tree. · 0.03
FeaturePurityScore
InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, with no label involved. The sentences where the feature fired above threshold are embedded, and the mean pairwise cosine similarity is computed over the upper triangle of the similarity matrix, which excludes self-similarity.
High purity means activating contexts cluster tightly in embedding space, so the feature is monosemantic. Low purity means it is firing on surface-level co-occurrence rather than a coherent concept. Polysemantic features concentrate near the sparsity penalty boundary, which is consistent with what the superposition hypothesis predicts.
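A minimal sketch of the similarity step, assuming the sentence embeddings are already available as plain lists of floats. The embedding model, the activation threshold, and the mapping from mean similarity to the reported purity percentage are not specified here, so the sketch stops at the mean similarity. Function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over all unordered pairs (i < j).

    Iterating over combinations is equivalent to averaging the strict upper
    triangle of the similarity matrix, so self-similarity never enters.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two activating sentences")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy 2-d "embeddings": two identical directions and one orthogonal one.
print(mean_pairwise_cosine([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]))
```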
High purity · f5042
"The cat sat on the mat."
"She lives near the river."
"The book is beside the lamp."
"He stood behind the door."
cosine sim 0.81 · purity 90%
Low purity · polysemantic
"The merger was announced at noon."
"She whispered in the dark."
"The algorithm converged slowly."
"He scored three goals."
cosine sim 0.21 · purity 61%
Model Utilization Index
A feature can look perfect on InterpScore and FeaturePurityScore and still be inert. The model computes it but does not route through it. This is the gap MUI is designed to close. Some cleanly labeled, monosemantic features produce near-zero KL divergence under ablation. They are decorative.
MUI measures causal load directly. At each token position where the feature fires above threshold, its projection onto the residual stream is zeroed and the forward pass re-runs. KL divergence between baseline and ablated output distributions is computed at that position, averaged across all firing positions, and normalized by baseline Shannon entropy. The result is a [0, 1] score of how much the model's output depends on this feature when it is active.
Ablating at the "France" token shifts output substantially. MUI = 76%.
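The scoring half of MUI can be sketched as follows, taking the baseline and ablated next-token distributions at each firing position as given. Several details are assumptions rather than documented behavior: the KL direction (baseline relative to ablated), normalizing the mean KL by the mean baseline entropy instead of per position, the epsilon floor, and the final clip to [0, 1]. The ablation itself (zeroing the feature's projection and re-running the forward pass) is outside this sketch.

```python
import math

EPS = 1e-12  # assumption: floor to avoid log(0) when ablation zeroes a token


def kl_divergence(p, q):
    """KL(p || q) in nats, skipping terms where p is zero."""
    return sum(pi * math.log(pi / max(qi, EPS)) for pi, qi in zip(p, q) if pi > 0)


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def mui(baseline_dists, ablated_dists):
    """Mean KL over firing positions, normalized by mean baseline entropy."""
    n = len(baseline_dists)
    if n == 0:
        return 0.0  # feature never fired, so there is nothing to measure
    mean_kl = sum(kl_divergence(b, a) for b, a in zip(baseline_dists, ablated_dists)) / n
    mean_h = sum(entropy(b) for b in baseline_dists) / n
    if mean_h == 0.0:
        return 0.0  # assumption: a fully deterministic baseline scores zero
    return min(1.0, max(0.0, mean_kl / mean_h))


# One firing position over a two-token vocabulary (made-up distributions).
print(mui([[0.5, 0.5]], [[0.9, 0.1]]))
```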
Reading the scores together
The three scores are a diagnostic triple, not a leaderboard. The most actionable pattern is high purity and high MUI with low InterpScore. The feature is coherent and causally load-bearing, but its label is wrong. A relabeling pass using the actual activating examples usually resolves it in minutes. The all-low pattern is a dead feature, appearing disproportionately near the sparsity penalty boundary, and it should be filtered before any downstream analysis.
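High InterpScore, high purity, high MUI: ideal. Well-labeled, monosemantic, causally active.
High InterpScore, high purity, low MUI: understood but decorative. The model does not route through it.
High InterpScore, low purity: label predictive but too coarse. Fires across related contexts.
Low InterpScore, high purity, high MUI: coherent and causally active, but mislabeled. Relabeling priority.
All low: dead or noise. Filter before downstream use.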
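The patterns above amount to a small decision table. The sketch below encodes it under one loud assumption: no cutoffs for "high" and "low" are given here, so the 0.5 defaults are placeholders to be tuned per score. The function name and the returned labels are hypothetical.

```python
def triage(interp, purity, mui, cutoffs=(0.5, 0.5, 0.5)):
    """Map a diagnostic triple to a next action.

    cutoffs are placeholder thresholds for (InterpScore, purity, MUI).
    """
    hi_i = interp >= cutoffs[0]
    hi_p = purity >= cutoffs[1]
    hi_m = mui >= cutoffs[2]

    if not (hi_i or hi_p or hi_m):
        return "filter"           # dead or noise
    if hi_i and hi_p and hi_m:
        return "trust"            # ideal
    if hi_p and hi_m and not hi_i:
        return "relabel"          # coherent and causal, label is wrong
    if hi_i and hi_p and not hi_m:
        return "decorative"       # understood, but the model does not use it
    if hi_i and not hi_p:
        return "label too coarse"
    return "inspect manually"     # combinations the table does not cover


print(triage(0.2, 0.9, 0.8))
```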
Part 2: The Benchmark Builder
Standard benchmark workflows require selecting a suite, configuring a harness, running the eval, and parsing results out of band. For scheduled evaluations that pipeline is fine. For a question that surfaces mid-inspection, say a suspicious feature or an unexpected output, it is a full context switch that almost never happens. The question gets dropped.
The Benchmark Builder removes the context switch. You describe what you want to measure in natural language, the agent writes the prompt suite, runs it against whatever is currently loaded, and returns a scored card in the thread, grounded in the same session that surfaced the question. The card supports four chart types and exports to CSV, JSON, PNG, and PDF.
Building a benchmark
The simplest path is build_benchmark. You supply a title, a list of capability dimensions with scores the agent has measured, and an optional summary. The agent selects the scoring method based on task class and records it in card metadata. A 67% on CoT math with partial-credit scoring is not the same as a 67% on factual recall with next-token probability. The method travels with the result.
Benchmark Builder · flow
1. Describe in natural language: "test reasoning on multi-step problems"
2. Agent writes prompt suite: selects scorer type from task class
3. Prompts run against loaded model, or embedding model via cosine path
4. Scored card returned in thread: bar / pie / radial / line chart, exportable
5. Optionally assemble into suite: build_benchmark_suite · weighted composite
Against the loaded model, prompts run directly. No re-specification is needed.
Against a checkpoint, the benchmark runs at a specific training step. Results are indexed by step and tracked in the regression panel.
Custom evals
The built-in consistency, suppression, and boundary evals cover known failure modes. Custom evals cover yours. run_custom_eval is a spawn-only tool that runs inside a sub-agent, so it does not block the main conversation thread. You pass a name, a description, a prompt set of 3 to 50 items, and a scorer_type.
Custom evals work on both LLMs and embedding models. For LLMs the model generates a response per prompt, then the scorer evaluates it. For embedding models no generation happens: the model encodes each prompt and the scorer receives the embedding vector directly. The same scorer interface handles both paths, so a custom eval written for an LLM can be adapted to an embedding model with a one-line change.
run_custom_eval · parameters
name · string · Short name shown in the result card, e.g. "Deceptive Reasoning Scorer".
description · string · One sentence describing what this eval measures.
prompts · string[] · The prompt or sentence set to evaluate. 3 to 50 items.
scorer_type · enum · semantic_similarity or code. Determines how each response is scored.
scorer_code · string? · Required for scorer_type=code. Python script that returns a float 0-1 or JSON {score, note}.
reference_answers · string[]? · One reference per prompt. For LLMs: keyword overlap. For embedding models: cosine similarity reference.
threshold · number? · Pass threshold. Default 0.5 for LLMs, 0.7 for embedding models.
temperature · number? · LLM only. 0 = greedy. Ignored for embedding models.
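As an illustration of the parameter table, here is what a set of arguments might look like, with a small pre-flight check for the documented constraints. The field names come from the table; the validate_eval_args helper, the example prompts, and the length check on reference_answers are hypothetical, and how the arguments are actually handed to the agent is not shown here.

```python
VALID_SCORERS = {"semantic_similarity", "code"}


def validate_eval_args(args):
    """Return a list of problems; an empty list means the arguments look valid."""
    problems = []
    for key in ("name", "description", "prompts", "scorer_type"):
        if key not in args:
            problems.append(f"missing required field: {key}")
    prompts = args.get("prompts", [])
    if not 3 <= len(prompts) <= 50:
        problems.append("prompts must contain 3 to 50 items")
    if args.get("scorer_type") not in VALID_SCORERS:
        problems.append("scorer_type must be semantic_similarity or code")
    if args.get("scorer_type") == "code" and not args.get("scorer_code"):
        problems.append("scorer_code is required when scorer_type is code")
    refs = args.get("reference_answers")
    if refs is not None and len(refs) != len(prompts):
        problems.append("reference_answers needs one entry per prompt")
    return problems


eval_args = {
    "name": "Capital Recall",
    "description": "Checks whether the model names the correct capital city.",
    "prompts": [
        "What is the capital of France?",
        "What is the capital of Japan?",
        "What is the capital of Kenya?",
    ],
    "scorer_type": "semantic_similarity",
    "reference_answers": ["Paris", "Tokyo", "Nairobi"],
    "threshold": 0.5,    # documented default for LLMs
    "temperature": 0,    # greedy decoding
}

print(validate_eval_args(eval_args))
```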
Scorer types
The two scorer types differ in how much of the evaluation logic you own. Semantic similarity handles the comparison automatically. Code hands you the raw variables and expects a number back.
semantic_similarity · LLM · Keyword overlap between generated response and reference answers. Pass reference_answers with one string per prompt.
semantic_similarity · embedding model · Cosine similarity between the prompt embedding and a reference embedding. Pass reference_answers as the reference texts.
code · LLM · A Python script you write. Receives prompt, response, activations, features. Must print a float 0-1 or JSON {score, note} as the last line.
code · embedding model · A Python script you write. Receives prompt and embedding (np.ndarray). Can compare against reference_embedding if provided.
For the code scorer, LLM variables are prompt, response, activations, and features. Embedding model variables are prompt, embedding (np.ndarray), and reference_embedding. The script must print a float 0-1 or a JSON object with score and note as its last output line. Anything before that is treated as logging.
The code scorer can access activations directly, which means custom evals can measure properties of internal representations, not just surface outputs. A scorer that checks whether a specific SAE feature activated above threshold on the generated response is two lines of Python. That kind of mechanistic criterion is not available to any external eval harness.
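A sketch of such a mechanistic scorer follows. The structure of the injected features variable is not documented here, so this assumes a mapping from feature id to a list of per-token activations; the feature id, the threshold, and the stand-in data are all hypothetical. The core check is the two lines inside score_response; the rest is scaffolding so the script runs on its own.

```python
import json

FEATURE_ID = 5042   # hypothetical feature of interest
THRESHOLD = 1.0     # hypothetical activation threshold


def score_response(features, feature_id=FEATURE_ID, threshold=THRESHOLD):
    """Score 1.0 if the feature fired above threshold anywhere in the response."""
    peak = max(features.get(feature_id, []), default=0.0)
    fired = peak > threshold
    return {
        "score": 1.0 if fired else 0.0,
        "note": f"feature {feature_id} peak activation {peak:.2f}",
    }


# Stand-in for the variable the harness would inject (assumed structure).
features = {5042: [0.0, 0.3, 8.41, 0.1]}

# The last output line must be a float 0-1 or JSON with score and note.
print(json.dumps(score_response(features)))
```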
Benchmark suites
Individual eval cards answer narrow questions. A suite answers a broader one: how does this model perform across a coherent set of concerns? build_benchmark_suite assembles multiple eval results, built-in or custom, into a named suite with a composite score. Each eval entry carries a weight, and the composite is the weighted average. The suite is also spawn-only, so it runs after the constituent evals have finished.
Weights encode judgment about relative importance. A safety audit might weight suppression at 1.5 and a one-off custom eval at 0.8, reflecting that suppression failures matter more to the deployment decision than the narrow custom signal does. The weight is visible in the card so the tradeoff is explicit.
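The composite is a plain weighted average, which a short sketch makes concrete. The weights 1.5 and 0.8 are the ones from the safety-audit example; the eval scores and the function name are made up for illustration.

```python
def composite_score(entries):
    """Weighted average of (score, weight) pairs."""
    total_weight = sum(weight for _, weight in entries)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(score * weight for score, weight in entries) / total_weight


suite = [
    (0.90, 1.5),  # suppression eval, weighted up
    (0.50, 0.8),  # one-off custom eval, weighted down
]

print(round(composite_score(suite), 3))
```

With these numbers the composite sits closer to the suppression score than an unweighted mean of 0.70 would, which is the tradeoff the weights are meant to make visible.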
Reading results
Scores are relative to the generated prompt suite, so they are not directly comparable to published leaderboard numbers unless you explicitly request a named standardized benchmark. The most reliable use is within-session comparison: run the same request against two models or checkpoints and compare rank order, not absolute values.
A low score is a starting point, not a verdict. A reasoning score of 67% driven by spatial failures is a different problem from one driven by arithmetic failures. A follow-up benchmark scoped to the sub-type disambiguates in one additional request. For custom evals, a low score on a code scorer that reads activations is also a prompt to dig into the attribution system, because the behavioral signal and the mechanistic signal are already pointing at the same session.
Task class · Scoring method
factual recall, cloze, MCQ · Log-probability on the target token. No generation required.
code generation, function completion · Generated code run against a test suite. First-attempt pass rate.
summarization, translation · LCS between output and reference as a proxy for content coverage.
refusal, safety · Fraction of prompts producing the expected refusal. Threshold configurable.
data diversity, label balance · KL divergence from a reference distribution, normalized to [0, 1].
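To make one of these methods concrete, here is a minimal sketch of the LCS-based scorer for summarization and translation. The tokenization (lowercased whitespace split) and the normalization by reference length are assumptions; only the use of the longest common subsequence comes from the description above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_coverage(output, reference):
    """Fraction of reference tokens recovered, in order, by the output."""
    out_tokens = output.lower().split()
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return lcs_length(out_tokens, ref_tokens) / len(ref_tokens)


print(lcs_coverage("the cat sat", "the dog sat"))
```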
