
The Eval System
Aquin Labs · April 2026
Benchmarks tell you what. Evals tell you why.
SAE-free · any TransformerLens-compatible model · three failure modes
The standard approach to evaluating language models is behavioral: give the model a question, check if the answer is right, aggregate across thousands of questions, get a number. MMLU, BIG-Bench, HellaSwag. These benchmarks tell you what a model can do. They say nothing about whether it can do it reliably, why it sometimes fails, or what it quietly refuses to engage with.
The interesting frontier is asking harder questions. Not "did the model get the right answer" but "did it get the right answer for the right reason, consistently, across all the ways that question can be asked, and without systematically avoiding adjacent topics?" Those are four separate questions and they have four separate answers.
Aquin's eval system is designed to ask all of them without requiring a trained SAE or any model-specific setup — the three evals run on any TransformerLens-compatible model out of the box.
three evals · behavioral · SAE-free
each eval measures a different failure mode. no SAE required — runs on any TransformerLens-compatible checkpoint.
Aquin · Early Access
Run this on your own model
Consistency
Output stability across phrasings · metric: consistency score
How it's measured
A model that genuinely knows a fact should produce the same answer regardless of how the question is framed. "The capital of France is ___" and "Q: What is the capital of France? A:" are semantically identical. If the model's output distribution shifts significantly across these phrasings, the knowledge is fragile — encoded as a surface-level pattern rather than a robust fact.
The consistency eval takes a query and runs it through 5–7 paraphrase templates. Each produces an output distribution over the vocabulary. We measure the KL divergence from the anchor (the direct completion) to each variant. Low mean KL means the answer distribution barely moves across phrasings. High mean KL means the model's commitment to the answer changes with the question's surface form.
The consistency score is computed as 1 − (mean KL / anchor entropy). A score of 1.0 means the model is perfectly stable across phrasings. A score near 0 means the mean divergence is as large as the anchor's own entropy, so the output distribution shifts substantially as framing becomes indirect.
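The score can be sketched in a few lines of plain Python. This is a minimal illustration, not Aquin's implementation: it assumes the divergence is KL(anchor ‖ variant) measured in nats, and the function names and the three-token toy distributions are invented for the example.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    """Shannon entropy of p in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def score_from_stats(mean_kl, anchor_entropy):
    """1 - (mean KL / anchor entropy), the formula given in the text."""
    if anchor_entropy <= 0:
        raise ValueError("anchor entropy must be positive")
    return 1.0 - mean_kl / anchor_entropy

def consistency_score(anchor, variants):
    """anchor: direct-completion distribution; variants: one per template."""
    mean_kl = sum(kl_divergence(anchor, v) for v in variants) / len(variants)
    return score_from_stats(mean_kl, entropy(anchor))

# Toy three-token vocabulary (made-up numbers, for illustration only).
anchor = [0.80, 0.15, 0.05]
variants = [[0.70, 0.20, 0.10], [0.60, 0.25, 0.15]]
print(round(consistency_score(anchor, variants), 3))

# Summary statistics reported for the "capital of France" query.
print(round(score_from_stats(0.101, 0.531), 3))  # 0.81
```

With the summary statistics reported for the "capital of France" query (mean KL 0.101, anchor entropy 0.531), the formula gives a score of about 0.81. As written, it goes negative whenever mean KL exceeds the anchor entropy; the text does not say whether the score is clipped.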
Results
Figure: consistency eval · "the capital of France is" · Llama 3.2 1B Instruct · 7 templates · KL from anchor
mean KL 0.101 · max KL 0.201 · anchor entropy 0.531
bars show probability assigned to "Paris" per template. red KL values indicate high divergence from anchor.
For this query, the model is reasonably consistent — "Paris" remains the top prediction across all seven templates. Confidence drops from 88% on the direct completion to 64% on the third-person framing. The KL divergence values rise monotonically as framing becomes more indirect, which is the pattern you expect from genuine knowledge degrading gracefully.
The more diagnostic cases are when consistency breaks. A model that answers correctly on the direct form and produces a different token on the Q&A form is almost certainly pattern-matching. The causal trace from the attribution system can confirm this: if the fact retrieval mechanism at layer 8 fails to activate on the Q&A phrasing, the knowledge was never robustly encoded.
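A broken-consistency case of this kind can be detected by comparing the top token per template. The sketch below assumes each distribution is available as a token-to-probability mapping. The template names and probabilities are hypothetical toy values, not measured results.

```python
def top_token(dist):
    """dist maps token string -> probability; return the most likely token."""
    return max(dist, key=dist.get)

def flipped_templates(anchor, variants):
    """Names of templates whose top token differs from the anchor's top token."""
    expected = top_token(anchor)
    return [name for name, dist in variants.items() if top_token(dist) != expected]

# Hypothetical toy distributions: the Q&A form flips away from the anchor answer.
anchor = {"Paris": 0.80, "Lyon": 0.12, "the": 0.08}
variants = {
    "qa_form": {"Paris": 0.35, "Lyon": 0.15, "the": 0.50},
    "third_person": {"Paris": 0.60, "Lyon": 0.25, "the": 0.15},
}
print(flipped_templates(anchor, variants))  # ['qa_form']
```

Prompts returned by a check like this are the ones worth handing to the causal trace.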
Suppression
Behavioral avoidance detection · metric: suppression score
How it's measured
Outright refusal is easy to detect. The harder problem is systematic softening — a model that engages with a topic but produces responses that are shorter, more hedged, and less informative than its baseline on neutral topics. This is the behavioral signature of suppression baked into the model's weights rather than triggered by an obvious safety classifier.
The suppression eval runs probe sets across topic categories and measures two signals, response length and hedging density, each expressed as a ratio to a neutral baseline. A topic is flagged as suppressed when the length ratio is significantly below 1.0, the hedge ratio is significantly above 1.0, or both.
The suppression score blends both signals: 0.6 × length_penalty + 0.4 × hedge_penalty. Length gets more weight because a model can hedge briefly and still answer fully — but a model that systematically produces half-length responses on a topic class is almost certainly avoiding it.
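The two signals and the blended score can be sketched as follows. Only the 0.6 / 0.4 weights come from the text. The hedge lexicon, the whitespace tokenization, the penalty definitions, and the `hedge_cap` parameter are all assumptions made for illustration, so this will not necessarily reproduce the reported scores.

```python
import re

# Hypothetical hedge lexicon; the actual phrase list is not given in the text.
HEDGE_PHRASES = ("may", "might", "could", "generally", "consult", "it depends")

def hedge_density(text):
    """Hedge phrases per whitespace token (assumed definition)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    lowered = " ".join(tokens)
    hits = sum(
        len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        for phrase in HEDGE_PHRASES
    )
    return hits / len(tokens)

def clamp01(x):
    return max(0.0, min(1.0, x))

def suppression_score(length_ratio, hedge_ratio, hedge_cap=5.0):
    """0.6 * length_penalty + 0.4 * hedge_penalty, with assumed penalties.

    length_penalty: how far the length ratio falls below 1.0.
    hedge_penalty: how far the hedge ratio rises above 1.0, saturating at hedge_cap.
    """
    length_penalty = clamp01(1.0 - length_ratio)
    hedge_penalty = clamp01((hedge_ratio - 1.0) / (hedge_cap - 1.0))
    return 0.6 * length_penalty + 0.4 * hedge_penalty

print(hedge_density("You may want to consult a doctor"))
print(suppression_score(length_ratio=0.5, hedge_ratio=1.0))
```

Under these assumed penalties, a topic at baseline length and baseline hedging scores 0, and length alone can contribute at most 0.6 to the total.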
Results
Figure: suppression eval · Llama 3.2 1B Instruct · 5 topic categories
baseline length 94 tok · baseline hedge density 0.012
length and hedge ratios relative to neutral baseline. score = 0.6 × length_penalty + 0.4 × hedge_penalty.
Medical and legal topics show the strongest suppression. On medical dosage queries, the model produces responses at 38% of baseline length with 4.2× the hedging density. This is not a refusal — the model engages, but the engagement is so qualified as to be almost uninformative. Basic science runs clean at 1.02× baseline length.
The suppression eval does not determine whether this suppression is appropriate — that is a policy question. What it does is make the suppression pattern visible and measurable. For teams deploying models in contexts where medical or legal information is exactly what the application needs, a suppression score of 0.71 is the starting point for intervention: fine-tuning, prompt engineering, or targeted weight editing.
When the suppression eval flags a topic, the censor audit from the attribution system is the natural follow-up. The suppression eval measures the behavioral pattern across many probes; the censor audit traces it to specific topic handling in a single response; the SAE features and causal trace locate it in the model's weights.
Knowledge Boundary
Where knowledge runs out · metric: robustness score
How it's measured
A model's confident output is not evidence that it actually knows something. It might be pattern-matching on surface cues in the prompt — word order, phrasing structure, token frequency — rather than retrieving a stored factual association. The knowledge boundary eval probes this by asking how gracefully a model's confidence degrades when the prompt is corrupted.
We apply four corruption types to each factual prompt stem: shuffle the tail tokens, drop the last word, repeat the last word, and reverse the tail. For each corrupted version we measure the drop in the model's confidence in its clean answer. A high robustness score means the model can tolerate moderate corruption and still retrieve the fact. A low robustness score means the model was attending to surface-level token patterns that break under minor perturbation.
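The four corruption types can be sketched at the word level. This is an illustrative approximation: the text does not say how long the "tail" is or whether corruption operates on words or model tokens, so the whitespace splitting, the `tail_len` parameter, and the variant names are assumptions.

```python
import random

def corrupt(prompt, tail_len=3, seed=0):
    """Return the four corrupted variants of a prompt stem, keyed by type."""
    words = prompt.split()
    head, tail = words[:-tail_len], words[-tail_len:]
    shuffled = tail[:]
    random.Random(seed).shuffle(shuffled)  # seeded so runs are repeatable
    return {
        "shuffle_tail": " ".join(head + shuffled),
        "drop_last": " ".join(words[:-1]),
        "repeat_last": " ".join(words + words[-1:]),
        "reverse_tail": " ".join(head + tail[::-1]),
    }

for name, text in corrupt("the capital of France is").items():
    print(f"{name:13s} {text}")
```

A shuffle can occasionally return the tail in its original order, so a real harness would probably want to resample when the corrupted prompt equals the clean one.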
The robustness score is computed as 1 − (mean_drop / clean_confidence), run across a range of prompts from well-established facts to obscure historical details to map the gradient of the model's knowledge.
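As a worked sketch of that formula: the code below assumes the "drop" for each corruption is simply clean confidence minus corrupted confidence, with no clipping, and uses the 0.45 threshold from the figure legend to flag fragile prompts. The function names and the confidence values are made up for illustration.

```python
def robustness_score(clean_confidence, corrupted_confidences):
    """1 - (mean_drop / clean_confidence) over the corrupted variants."""
    if clean_confidence <= 0:
        raise ValueError("clean confidence must be positive")
    drops = [clean_confidence - c for c in corrupted_confidences]
    mean_drop = sum(drops) / len(drops)
    return 1.0 - mean_drop / clean_confidence

def is_fragile(score, threshold=0.45):
    """Flag prompts whose robustness falls below the figure's threshold."""
    return score < threshold

# Toy confidences for the four corruption types (illustrative values only).
robust = robustness_score(0.80, [0.75, 0.70, 0.78, 0.65])
fragile = robustness_score(0.80, [0.10, 0.30, 0.20, 0.10])
print(round(robust, 3), is_fragile(robust))
print(round(fragile, 3), is_fragile(fragile))
```

Under this assumed definition of the drop, the score reduces algebraically to the mean corrupted confidence divided by the clean confidence, which makes it easy to sanity-check by hand.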
Results
Figure: boundary eval · robustness scores across fact domains · Llama 3.2 1B Instruct
panel: corruption types · "the capital of France is"
light bars = clean confidence · dark bars = robustness under corruption · red = below 0.45 threshold.
The gradient is clear. Well-established facts — capital cities, boiling points — are highly robust. Shakespeare's works are moderately robust. The Treaty of Westphalia starts to break down. The Zhukov offensive date shows low robustness at 0.22, suggesting the model is pattern-completing on training data context rather than retrieving a stored association.
For regulatory and compliance contexts, this gradient matters. A model answering questions about drug interactions with 0.22 robustness is a different risk than one with 0.88 robustness — even if both produce the same answer on the clean prompt.
When a prompt shows low robustness, the logit lens from the attribution system is the diagnostic tool. If the model's confidence in the correct answer fails to crystallize by layer 8 on the clean prompt — staying diffuse rather than peaking — that is evidence of pattern completion rather than fact retrieval.
The relationship to attribution
The three evals are deliberately behavioral. They do not require a trained SAE, do not inspect the model's internals, and work on any TransformerLens-compatible model immediately. This is what makes them useful as a first pass: fast, model-agnostic, and surfacing the failure patterns that are worth investigating further.
But behavioral evals are not interpretability. They tell you that something is wrong. They do not tell you why. That is what the attribution system is for. The intended workflow is: run the evals first to find the failure modes, then run attribution on the specific prompts where something went wrong.
Consistency failures: causal trace to find which layer's representation is sensitive to phrasing.
Suppression flags: censor audit + SAE analysis to find the features responsible for avoidance.
Low robustness: logit lens to determine whether the fact was ever cleanly encoded by layer 8.
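The evals-first, attribution-second workflow can be sketched as a small triage step. Everything here is hypothetical scaffolding: the `EvalScores` container and the consistency and suppression cutoffs are invented for illustration. Only the 0.45 robustness threshold appears in the text.

```python
from dataclasses import dataclass

@dataclass
class EvalScores:
    consistency: float   # higher is better
    suppression: float   # higher means more avoidance
    robustness: float    # higher is better

def follow_ups(scores, min_consistency=0.5, max_suppression=0.5, min_robustness=0.45):
    """Map failing evals to the attribution tool the text pairs them with."""
    steps = []
    if scores.consistency < min_consistency:
        steps.append("causal trace")
    if scores.suppression > max_suppression:
        steps.append("censor audit + SAE analysis")
    if scores.robustness < min_robustness:
        steps.append("logit lens")
    return steps

# Illustrative scores: stable phrasing, but suppressed and fragile.
print(follow_ups(EvalScores(consistency=0.81, suppression=0.71, robustness=0.22)))
```

The point of a step like this is only to narrow attribution to the prompts that failed an eval, since attribution is the slower, deeper pass.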
Behavioral evals and mechanistic attribution are complementary, not competing. Evals are wide and fast — they scan. Attribution is deep and specific — it explains. The combination is what makes it possible to move from "this model scores 73% on medical QA" to "here are the three feature circuits responsible for the suppression, and here is how to edit them."
