Benchmarks

Aquin Labs · May 2026

How Aquin decides which SAE features are trustworthy, and how you can build and run custom benchmarks against any loaded model without leaving your inspection context.

Part 1: Feature Benchmarks

Most SAE features get used because they have a plausible-sounding label and a clean activation plot. That is not enough. A label can be wrong. A feature can be coherent but causally irrelevant. Before a feature earns a place in a circuit graph or a steering experiment, three properties need to hold independently: the label predicts where it fires, it is monosemantic, and it actually does work in the forward pass.

These conditions are orthogonal. A feature can pass two and fail the third in any combination. Aquin scores all three separately and surfaces them as a diagnostic triple. The combination tells you what to do next: relabel, filter, or trust.

InterpScore

The question InterpScore answers: does this feature's label predict when it fires? Two sentence sets are built per feature, one where the label implies the feature should activate and one where it should not. Both pass through the model, maximum activation at layer 8 is extracted per sentence, and Cohen's d is computed between the two distributions. The result clips to [0, 1].

A score near 1 means the label and the feature agree. A score near 0 means they have drifted, so treat the auto-generated label as a guess. Each feature uses 10 positive and 10 negative sentences, 20 separate forward passes through the full model and SAE.
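The scoring step can be sketched in a few lines. This is a minimal illustration, not Aquin's implementation: it assumes the per-sentence maximum activations have already been extracted, uses the pooled-standard-deviation form of Cohen's d, and the function name `interp_score` is hypothetical.

```python
import numpy as np

def interp_score(pos_acts, neg_acts):
    """Cohen's d between positive and negative activations, clipped to [0, 1]."""
    pos = np.asarray(pos_acts, dtype=float)
    neg = np.asarray(neg_acts, dtype=float)
    n_pos, n_neg = pos.size, neg.size
    if n_pos < 2 or n_neg < 2:
        raise ValueError("need at least two activations per sentence set")
    # Pooled variance built from the two sample variances (ddof=1).
    pooled_var = ((n_pos - 1) * pos.var(ddof=1) + (n_neg - 1) * neg.var(ddof=1)) / (
        n_pos + n_neg - 2
    )
    if pooled_var == 0.0:
        # Assumption: with zero spread, any positive gap counts as full separation.
        return 1.0 if pos.mean() > neg.mean() else 0.0
    d = (pos.mean() - neg.mean()) / np.sqrt(pooled_var)
    # Negative effect sizes (label predicts the opposite) clip to 0.
    return float(np.clip(d, 0.0, 1.0))
```

Because of the clip, any effect size of 1 or more saturates the score, and a feature that fires more on the negative set than on the positive set scores 0.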

f13910 · "capital / seat-of-government" · Llama 3.2 1B Instruct

Cohen's d 0.84 · InterpScore 84%

Fires on (max activation)

The capital of France is a major hub. · 8.41

Parliament sits at the seat of government. · 7.86

Washington D.C. is where the president works. · 6.93

Silent on (max activation)

She ordered a coffee and opened her laptop. · 0.12

The algorithm runs in linear time. · 0.08

Three dogs sat under the oak tree. · 0.03

FeaturePurityScore

InterpScore evaluates the label. FeaturePurityScore evaluates the feature itself, with no label involved. The sentences where the feature fired above threshold are embedded, and mean pairwise cosine similarity of the upper triangle is computed, excluding self-similarity.

High purity means activating contexts cluster tightly in embedding space, so the feature is monosemantic. Low purity means it is firing on surface-level co-occurrence rather than a coherent concept. Polysemantic features concentrate near the sparsity penalty boundary, which is consistent with what the superposition hypothesis predicts.
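The similarity computation described above can be sketched as follows. It assumes the activating sentences have already been embedded into a 2-D array with one row per sentence. The function name is hypothetical, and how the mean similarity maps to the reported purity percentage is not shown here.

```python
import numpy as np

def mean_pairwise_cosine(embeddings):
    """Mean cosine similarity over the strict upper triangle (no self-pairs)."""
    X = np.asarray(embeddings, dtype=float)
    if X.ndim != 2 or X.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two embeddings")
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    if np.any(norms == 0.0):
        raise ValueError("zero-norm embedding")
    unit = X / norms
    sim = unit @ unit.T                      # full cosine similarity matrix
    rows, cols = np.triu_indices(X.shape[0], k=1)  # k=1 skips the diagonal
    return float(sim[rows, cols].mean())
```

Taking only the upper triangle counts each sentence pair once and leaves out the diagonal, which would otherwise inflate the mean with self-similarities of 1.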

High purity · f5042 · cosine sim 0.81 · purity 90%

"The cat sat on the mat."

"She lives near the river."

"The book is beside the lamp."

"He stood behind the door."

Low purity · polysemantic · cosine sim 0.21 · purity 61%

"The merger was announced at noon."

"She whispered in the dark."

"The algorithm converged slowly."

"He scored three goals."

Model Utilization Index

A feature can look perfect on InterpScore and FeaturePurityScore and still be inert. The model computes it but does not route through it. This is the gap MUI is designed to close. Some cleanly labeled, monosemantic features produce near-zero KL divergence under ablation. They are decorative.

MUI measures causal load directly. At each token position where the feature fires above threshold, its projection onto the residual stream is zeroed and the forward pass re-runs. KL divergence between baseline and ablated output distributions is computed at that position, averaged across all firing positions, and normalized by baseline Shannon entropy. The result is a [0, 1] score of how much the model's output depends on this feature when it is active.
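The scoring half of this procedure can be sketched as below, assuming the baseline and ablated forward passes have already produced logits at each firing position. Two details are assumptions rather than something the text specifies: the mean KL is divided by the mean baseline entropy (instead of normalizing per position), and the ratio is clipped to reach [0, 1]. The function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)  # numerically stable
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mui_from_logits(base_logits, ablated_logits, eps=1e-12):
    """base_logits, ablated_logits: (n_firing_positions, vocab_size)."""
    p = softmax(np.asarray(base_logits, dtype=float))     # baseline distribution
    q = softmax(np.asarray(ablated_logits, dtype=float))  # after ablation
    # KL(p || q) at each firing position.
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    # Shannon entropy of the baseline distribution at each position.
    entropy = -np.sum(p * np.log(p + eps), axis=-1)
    mean_entropy = entropy.mean()
    if mean_entropy <= eps:
        return 0.0  # assumption: a deterministic baseline leaves nothing to normalize by
    return float(np.clip(kl.mean() / mean_entropy, 0.0, 1.0))
```

Dividing by entropy puts positions where the model was already uncertain on the same footing as confident ones. The clip is needed because KL divergence is not bounded by the baseline entropy.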

f13933 · "geographic country associations" · per-position ablation

Ablating at the "France" token shifts output substantially. MUI = 76%.

SAE dictionary health

Per-feature InterpScore, purity, and MUI answer "is this feature good?" The aquin sae-stats command answers "is the whole dictionary healthy?" by reporting dead features, never-fired units, mean firing rate, and the sparsity distribution across the layer. Run it before benchmarking individual features, or after running aquin sae train on captured activations.
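To make these quantities concrete, here is a minimal sketch of dictionary-level statistics computed from a matrix of SAE activations. It does not show the output format or internals of aquin sae-stats. The function name, the dictionary keys, and the firing threshold are assumptions, and it reports a single never-fired count because the difference between dead features and never-fired units is not spelled out here.

```python
import numpy as np

def dictionary_health(acts, threshold=0.0):
    """acts: (n_tokens, n_features) SAE activations for one layer."""
    A = np.asarray(acts, dtype=float)
    fired = A > threshold
    firing_rate = fired.mean(axis=0)   # fraction of tokens each feature fires on
    l0_per_token = fired.sum(axis=1)   # number of active features per token
    return {
        "n_features": int(A.shape[1]),
        "never_fired": int((firing_rate == 0.0).sum()),
        "mean_firing_rate": float(firing_rate.mean()),
        "mean_l0": float(l0_per_token.mean()),
    }
```

A high never-fired count or a very skewed firing-rate distribution is a hint that per-feature scores on that dictionary deserve extra scrutiny.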

Feature & dictionary benchmarks · CLI

aquin benchmark --feature <id> · InterpScore + purity + MUI for one feature.

aquin sae-stats · Layer-wide dictionary health report.

aquin benchmark · In-session Benchmark Builder (agent-driven suite).

aquin sae train / sae align · Train temp SAE on captured activations; align to public dict.

Commands run against the active session after aquin session start. One model is locked per session — start a new session to load a different checkpoint.

Reading the scores together

The three scores are a diagnostic triple, not a leaderboard. The most actionable pattern is high purity and high MUI with low InterpScore. The feature is coherent and causally load-bearing, but its label is wrong. A relabeling pass using the actual activating examples usually resolves it in minutes. The all-low pattern is a dead feature, appearing disproportionately near the sparsity penalty boundary, and it should be filtered before any downstream analysis.

Interp · Purity · MUI · Reading

High · High · High · Ideal, well-labeled, monosemantic, causally active.

High · High · Low · Understood but decorative. Model does not route through it.

High · Low · High · Label predictive but too coarse. Fires across related contexts.

Low · High · High · Coherent and causally active, but mislabeled. Relabeling priority.

Low · Low · Low · Dead or noise. Filter before downstream use.
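The table can be turned into a small lookup for triaging many features at once. This is a sketch under one loud assumption: the 0.5 cutoff between "High" and "Low" is illustrative, since no threshold is defined here. The names `READINGS` and `read_triple` are hypothetical, and combinations the table does not list fall through to a manual-review message.

```python
# Readings copied from the table; keys are (Interp, Purity, MUI).
READINGS = {
    ("High", "High", "High"): "Ideal, well-labeled, monosemantic, causally active.",
    ("High", "High", "Low"): "Understood but decorative. Model does not route through it.",
    ("High", "Low", "High"): "Label predictive but too coarse. Fires across related contexts.",
    ("Low", "High", "High"): "Coherent and causally active, but mislabeled. Relabeling priority.",
    ("Low", "Low", "Low"): "Dead or noise. Filter before downstream use.",
}

def read_triple(interp, purity, mui, cutoff=0.5):
    """Map three scores in [0, 1] to a reading. The cutoff is an assumed value."""
    key = tuple("High" if score >= cutoff else "Low" for score in (interp, purity, mui))
    return READINGS.get(key, "Not covered by the table. Inspect manually.")
```

For example, `read_triple(0.2, 0.9, 0.8)` lands on the mislabeled row, the pattern the text calls the relabeling priority.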

Part 2: The Benchmark Builder

Standard benchmark workflows require selecting a suite, configuring a harness, running the eval, and parsing results out of band. For scheduled evaluations that pipeline is fine. For a question that surfaces mid-inspection, say a suspicious feature or an unexpected output, it is a full context switch that almost never happens. The question gets dropped.

The Benchmark Builder removes the context switch. You describe what you want to measure in natural language, the agent writes the prompt suite, runs it against whatever is currently loaded, and returns a scored card in the thread, grounded in the same session that surfaced the question. The card supports four chart types and exports to CSV, JSON, PNG, and PDF.

Building a benchmark

The simplest path is build_benchmark. You supply a title, a list of capability dimensions with scores the agent has measured, and an optional summary. The agent selects the scoring method based on task class and records it in card metadata. A 67% on CoT math with partial-credit scoring is not the same as a 67% on factual recall with next-token probability. The method travels with the result.

Benchmark Builder · flow

describe in natural language · "test reasoning on multi-step problems"

agent writes prompt suite · selects scorer type from task class

prompts run against loaded model · or embedding model via cosine path

scored card returned in thread · bar / pie / radial / line chart, exportable

optionally assemble into suite · build_benchmark_suite · weighted composite

01 · model inspection

Prompts run directly against the loaded model. No re-specification needed.

02 · training monitor

Benchmarks a checkpoint at a specific training step. Results are indexed by step and tracked in the regression panel.

6-capability result · model inspection · llama-3.2-1b · 36 prompts

Overall 80%

Custom evals

The built-in consistency, suppression, and boundary evals cover known failure modes. Custom evals cover yours. run_custom_eval is a spawn-only tool that runs inside a sub-agent, so it does not block the main conversation thread. You pass a name, a description, a prompt set of 3 to 50 items, and a scorer_type.

Custom evals work on both LLMs and embedding models. For LLMs the model generates a response per prompt, then the scorer evaluates it. For embedding models no generation happens: the model encodes each prompt and the scorer receives the embedding vector directly. The same scorer interface handles both paths, so a custom eval written for an LLM can be adapted to an embedding model with a one-line change.

run_custom_eval · parameters

name · string · Short name shown in the result card, e.g. "Deceptive Reasoning Scorer".

description · string · One sentence describing what this eval measures.

prompts · string[] · The prompt or sentence set to evaluate. 3 to 50 items.

scorer_type · enum · semantic_similarity or code. Determines how each response is scored.

scorer_code · string? · Required for scorer_type=code. Python script that prints a float 0-1 or JSON {score, note}.

reference_answers · string[]? · One reference per prompt. For LLMs: keyword overlap. For embedding models: cosine similarity reference.

threshold · number? · Pass threshold. Default 0.5 for LLMs, 0.7 for embedding models.

temperature · number? · LLM only. 0 = greedy. Ignored for embedding models.
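The constraints in the parameter list can be checked before an eval is spawned. The sketch below builds an argument dictionary using the parameter names above. The helper itself and its `embedding_model` flag are hypothetical: they are not part of run_custom_eval and exist only to pick the documented default threshold.

```python
SCORER_TYPES = {"semantic_similarity", "code"}

def build_custom_eval_args(name, description, prompts, scorer_type,
                           scorer_code=None, reference_answers=None,
                           threshold=None, temperature=None,
                           embedding_model=False):
    """Validate and assemble arguments using the documented parameter names."""
    if not 3 <= len(prompts) <= 50:
        raise ValueError("prompts must contain 3 to 50 items")
    if scorer_type not in SCORER_TYPES:
        raise ValueError("scorer_type must be semantic_similarity or code")
    if scorer_type == "code" and not scorer_code:
        raise ValueError("scorer_code is required when scorer_type is code")
    if reference_answers is not None and len(reference_answers) != len(prompts):
        raise ValueError("reference_answers needs one entry per prompt")
    if threshold is None:
        # Documented defaults: 0.5 for LLMs, 0.7 for embedding models.
        threshold = 0.7 if embedding_model else 0.5
    args = {"name": name, "description": description, "prompts": list(prompts),
            "scorer_type": scorer_type, "threshold": threshold}
    if scorer_code is not None:
        args["scorer_code"] = scorer_code
    if reference_answers is not None:
        args["reference_answers"] = list(reference_answers)
    if temperature is not None and not embedding_model:
        args["temperature"] = temperature  # ignored for embedding models
    return args
```

Validating up front surfaces a wrong prompt count or a missing scorer_code before the sub-agent starts, instead of as a failed eval card.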

Scorer types

The two scorer types differ in how much of the evaluation logic you own. Semantic similarity handles the comparison automatically. Code hands you the raw variables and expects a number back.

01 · semantic similarity

LLM · Keyword overlap between generated response and reference answers. Pass reference_answers with one string per prompt.

Embed · Cosine similarity between the prompt embedding and a reference embedding. Pass reference_answers as the reference texts.

02 · custom Python scorer

LLM · A Python script you write. Receives prompt, response, activations, features. Must print a float 0-1 or JSON {score, note} as the last line.

Embed · A Python script you write. Receives prompt and embedding (np.ndarray). Can compare against reference_embedding if provided.

For the code scorer, LLM variables are prompt, response, activations, and features. Embedding model variables are prompt, embedding (np.ndarray), and reference_embedding. The script must print a float 0-1 or a JSON object with score and note as its last output line. Anything before that is treated as logging.
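As an illustration of the embedding path, here is a sketch of a code scorer that compares embedding against reference_embedding and prints a JSON object as its last line. The stand-in values at the bottom exist only so the sketch runs on its own; in a real eval those variables would be supplied to the script. Clipping negative cosines to 0 is an assumption made to keep the score in the 0-1 range.

```python
import json
import numpy as np

def score_embedding(embedding, reference_embedding=None):
    """Cosine similarity to the reference, reported as {score, note}."""
    if reference_embedding is None:
        return {"score": 0.0, "note": "no reference_embedding provided"}
    a = np.asarray(embedding, dtype=float)
    b = np.asarray(reference_embedding, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return {"score": 0.0, "note": "zero-norm vector"}
    cos = float(a @ b / denom)
    # Assumption: negative cosines are clipped so the score stays in [0, 1].
    return {"score": max(0.0, min(1.0, cos)), "note": f"cosine={cos:.3f}"}

# Stand-in values for illustration only.
embedding = np.array([1.0, 0.0, 1.0])
reference_embedding = np.array([1.0, 0.0, 0.0])

print("scoring one embedding")                       # earlier output is logging
print(json.dumps(score_embedding(embedding, reference_embedding)))  # last line is the result
```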

The code scorer can access activations directly, which means custom evals can measure properties of internal representations, not just surface outputs. A scorer that checks whether a specific SAE feature activated above threshold on the generated response is two lines of Python. That kind of mechanistic criterion is not available to any external eval harness.
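A feature-activation check of that kind might look like the sketch below. The shape of the features variable is not specified here, so the sketch assumes a dictionary mapping SAE feature id to the maximum activation over the response tokens. The threshold and the stand-in values are hypothetical.

```python
FEATURE_ID = 13910   # the "capital / seat-of-government" feature from Part 1
THRESHOLD = 1.0      # hypothetical activation threshold

def feature_fired(features, feature_id=FEATURE_ID, threshold=THRESHOLD):
    """Return 1.0 if the feature exceeded the threshold, else 0.0."""
    return 1.0 if features.get(feature_id, 0.0) > threshold else 0.0

# Stand-in for the variable the scorer would receive (assumed shape).
features = {13910: 8.41, 5042: 0.0}

print(feature_fired(features))  # last output line is the score
```

The core check is the single comparison inside `feature_fired`. Everything else is scaffolding so the sketch runs standalone.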

Benchmark suites

Individual eval cards answer narrow questions. A suite answers a broader one: how does this model perform across a coherent set of concerns? build_benchmark_suite assembles multiple eval results, built-in or custom, into a named suite with a composite score. Each eval entry carries a weight, and the composite is the weighted average. Like run_custom_eval, build_benchmark_suite is spawn-only, and it runs after the constituent evals have finished.

Weights encode judgment about relative importance. A safety audit might weight suppression at 1.5 and a one-off custom eval at 0.8, reflecting that suppression failures matter more to the deployment decision than the narrow custom signal does. The weight is visible in the card so the tradeoff is explicit.
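The weighted average itself is simple to write down. The sketch below uses the weights from the safety-audit example. The two scores are made-up placeholders, and the function name and tuple layout are hypothetical.

```python
def composite_score(entries):
    """entries: iterable of (name, score, weight). Returns the weighted average."""
    entries = list(entries)
    if any(weight < 0 for _, _, weight in entries):
        raise ValueError("weights must be non-negative")
    total_weight = sum(weight for _, _, weight in entries)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(score * weight for _, score, weight in entries) / total_weight

# Weights from the safety-audit example; the scores are illustrative placeholders.
suite = [
    ("suppression", 0.90, 1.5),
    ("one-off custom eval", 0.50, 0.8),
]
print(round(composite_score(suite), 3))
```

With these placeholder scores the composite is (0.90 × 1.5 + 0.50 × 0.8) / 2.3, so the heavier suppression weight pulls the result above the unweighted mean of 0.70.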

Reading results

Scores are relative to the generated prompt suite, so they are not directly comparable to published leaderboard numbers unless you explicitly request a named standardized benchmark. The most reliable use is within-session comparison: run the same request against two models or checkpoints and compare rank order, not absolute values.

A low score is a starting point, not a verdict. A reasoning score of 67% driven by spatial failures is a different problem from one driven by arithmetic failures. A follow-up benchmark scoped to the sub-type disambiguates in one additional request. For custom evals, a low score on a code scorer that reads activations is also a prompt to dig into the attribution system, because the behavioral signal and the mechanistic signal are already pointing at the same session.

01 · next-token probability · factual recall, cloze, MCQ

Log-probability on the target token. No generation required.

02 · execution-based pass@1 · code generation, function completion

Generated code run against a test suite. First-attempt pass rate.

03 · reference-based ROUGE-L · summarization, translation

LCS between output and reference as a proxy for content coverage.

04 · binary pass rate · refusal, safety

Fraction of prompts producing the expected refusal. Threshold configurable.
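For the ROUGE-L entry, the longest-common-subsequence computation can be sketched as follows. The F-measure with equal weight on precision and recall and the lowercase whitespace tokenization are assumptions. Neither the variant nor the tokenizer used for scoring is specified above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over lowercase whitespace tokens (assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because a subsequence need not be contiguous, "the cat is on the mat" and "the cat sat on the mat" share five of six tokens in order, so both precision and recall are 5/6.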