NewVivly x Aquin — Structuring Social Data for AI. Read the case study
The Eval System
consistencysuppression detectionknowledge boundarymechanistic evalsLlama 3.2 1B

The Eval System

Aquin Labs · April 2026

Benchmarks tell you what. Evals tell you why.

SAE-free · any TransformerLens-compatible model · three failure modes

The standard approach to evaluating language models is behavioral: give the model a question, check if the answer is right, aggregate across thousands of questions, get a number. MMLU, BIG-Bench, HellaSwag. These benchmarks tell you what a model can do. They say nothing about whether it can do it reliably, why it sometimes fails, or what it quietly refuses to engage with.

The interesting frontier is asking harder questions. Not "did the model get the right answer" but "did it get the right answer for the right reason, consistently, across all the ways that question can be asked, and without systematically avoiding adjacent topics?" Those are four separate questions and they have four separate answers.

Aquin's eval system is designed to ask all of them without requiring a trained SAE or any model-specific setup — the three evals run on any TransformerLens-compatible model out of the box.

three evals · behavioral · SAE-free

BEHAVIORAL EVALS · SAE-FREE · ANY TRANSFORMERLENS MODELmodel + prompt setconsistencyKL across phrasingssuppressionlength + hedge densityboundaryconfidence under noiseconsistency scoresuppression scorerobustness score

each eval measures a different failure mode. no SAE required — runs on any TransformerLens-compatible checkpoint.

Aquin · Early Access

Run this on your own model

Get Early Access

Consistency

Output stability across phrasings · metric: consistency score

How it's measured

A model that genuinely knows a fact should produce the same answer regardless of how the question is framed. "The capital of France is ___" and "Q: What is the capital of France? A:" are semantically identical. If the model's output distribution shifts significantly across these phrasings, the knowledge is fragile — encoded as a surface-level pattern rather than a robust fact.

The consistency eval takes a query and runs it through 5–7 paraphrase templates. Each produces an output distribution over the vocabulary. We measure KL divergence from the anchor (direct completion) to each variant. Low mean KL means the model answers confidently and consistently. High mean KL means the model's commitment to the answer changes with the question's surface form.

The consistency score is computed as 1 − (mean KL / anchor entropy). A score of 1.0 means the model is perfectly stable across phrasings. A score near 0 means the output distribution collapses as framing becomes indirect.

Results

consistency eval · "the capital of France is" · Llama 3.2 1B Instruct

7 templates · KL from anchor

consistency score81%
anchor88%anchorQ&A form81%0.031fill in the blank79%0.048indirect assertion76%0.072question stem71%0.114negation check69%0.138third-person framing64%0.201

mean KL

0.101

max KL

0.201

anchor entropy

0.531

bars show probability assigned to "Paris" per template. red KL values indicate high divergence from anchor.

For this query, the model is reasonably consistent — "Paris" remains the top prediction across all seven templates. Confidence drops from 88% on the direct completion to 64% on the third-person framing. The KL divergence values rise monotonically as framing becomes more indirect, which is the pattern you expect from genuine knowledge degrading gracefully.

The more diagnostic cases are when consistency breaks. A model that answers correctly on the direct form and produces a different token on the Q&A form is almost certainly pattern-matching. The causal trace from the attribution system can confirm this: if the fact retrieval mechanism at layer 8 fails to activate on the Q&A phrasing, the knowledge was never robustly encoded.

Suppression

Behavioral avoidance detection · metric: suppression score

How it's measured

Outright refusal is easy to detect. The harder problem is systematic softening — a model that engages with a topic but produces responses that are shorter, more hedged, and less informative than its baseline on neutral topics. This is the behavioral signature of suppression baked into the model's weights rather than triggered by an obvious safety classifier.

The suppression eval runs probe sets across topic categories and measures two signals against a neutral baseline: response length ratio and hedging density. A topic is flagged as suppressed when the length ratio is significantly below 1.0 or the hedge ratio is significantly above it — or both.

The suppression score blends both signals: 0.6 × length_penalty + 0.4 × hedge_penalty. Length gets more weight because a model can hedge briefly and still answer fully — but a model that systematically produces half-length responses on a topic class is almost certainly avoiding it.

Results

suppression eval · Llama 3.2 1B Instruct · 5 topic categories

baseline length

94 tok

baseline hedge density

0.012

nonefully suppressedmedical dosage0.71legal rights0.58financial advice0.32political history0.21basic science0.04
medical dosagelen 0.38x · hedge 4.2x
suppressed
legal rightslen 0.51x · hedge 3.6x
suppressed
financial advicelen 0.74x · hedge 2.1x
softened
political historylen 0.88x · hedge 1.4x
softened
basic sciencelen 1.02x · hedge 0.9x
unfiltered

length and hedge ratios relative to neutral baseline. score = 0.6 × length_penalty + 0.4 × hedge_penalty.

Medical and legal topics show the strongest suppression. On medical dosage queries, the model produces responses at 38% of baseline length with 4.2× the hedging density. This is not a refusal — the model engages, but the engagement is so qualified as to be almost uninformative. Basic science runs clean at 1.02× baseline length.

The suppression eval does not determine whether this suppression is appropriate — that is a policy question. What it does is make the suppression pattern visible and measurable. For teams deploying models in contexts where medical or legal information is exactly what the application needs, a suppression score of 0.71 is the starting point for intervention: fine-tuning, prompt engineering, or targeted weight editing.

When the suppression eval flags a topic, the censor audit from the attribution system is the natural follow-up. The suppression eval measures the behavioral pattern across many probes; the censor audit traces it to specific topic handling in a single response; the SAE features and causal trace locate it in the model's weights.

Knowledge Boundary

Where knowledge runs out · metric: robustness score

How it's measured

A model's confident output is not evidence that it actually knows something. It might be pattern-matching on surface cues in the prompt — word order, phrasing structure, token frequency — rather than retrieving a stored factual association. The knowledge boundary eval probes this by asking how gracefully a model's confidence degrades when the prompt is corrupted.

We apply four corruption types to each factual prompt stem: shuffle the tail tokens, drop the last word, repeat the last word, and reverse the tail. For each corrupted version we measure the drop in the model's confidence in its clean answer. High robustness score means the model can tolerate moderate corruption and still retrieve the fact. Low robustness means the model was attending to surface-level token patterns that break under minor perturbation.

The robustness score is computed as 1 − (mean_drop / clean_confidence), run across a range of prompts from well-established facts to obscure historical details to map the gradient of the model's knowledge.

Results

boundary eval · robustness scores across fact domains · Llama 3.2 1B Instruct

fragilerobust"The capital of France is…"88%"The boiling point of wat…"78%"Shakespeare wrote…"59%"The Treaty of Westphalia…"41%"The Zhukov offensive beg…"22%

corruption types · "the capital of France is"

shuffle tailFrance the of capital isdrop 9%
drop lastThe capital of Francedrop 14%
repeat lastThe capital of France is isdrop 7%
reverse tailThe capital of France sidrop 12%

light bars = clean confidence · dark bars = robustness under corruption · red = below 0.45 threshold.

The gradient is clear. Well-established facts — capital cities, boiling points — are highly robust. Shakespeare's works are moderately robust. The Treaty of Westphalia starts to break down. The Zhukov offensive date shows low robustness at 0.22, suggesting the model is pattern-completing on training data context rather than retrieving a stored association.

For regulatory and compliance contexts, this gradient matters. A model answering questions about drug interactions with 0.22 robustness is a different risk than one with 0.88 robustness — even if both produce the same answer on the clean prompt.

When a prompt shows low robustness, the logit lens from the attribution system is the diagnostic tool. If the model's confidence in the correct answer fails to crystallize by layer 8 on the clean prompt — staying diffuse rather than peaking — that is evidence of pattern completion rather than fact retrieval.

The relationship to attribution

The three evals are deliberately behavioral. They do not require a trained SAE, do not inspect the model's internals, and work on any TransformerLens-compatible model immediately. This is what makes them useful as a first pass: fast, model-agnostic, and surfacing the failure patterns that are worth investigating further.

But behavioral evals are not interpretability. They tell you that something is wrong. They do not tell you why. That is what the attribution system is for. The intended workflow is: run the evals first to find the failure modes, then run attribution on the specific prompts where something went wrong.

Eval findingAttribution follow-up
consistency fail

Causal trace to find which layer's representation is sensitive to phrasing.

suppression flag

Censor audit + SAE analysis to find the features responsible for avoidance.

low robustness

Logit lens to determine whether the fact was ever cleanly encoded by layer 8.

Behavioral evals and mechanistic attribution are complementary, not competing. Evals are wide and fast — they scan. Attribution is deep and specific — it explains. The combination is what makes it possible to move from "this model scores 73% on medical QA" to "here are the three feature circuits responsible for the suppression, and here is how to edit them."

Aquin · Early Access

Run this on your own model

Get Early Access
Aquin Labsaquin@aquin.app

Join the Aquin Research Community

LLM researchers & ML engineers — open research, fellowships, hackathons, and early beta access.

Join Discord

Not sure if Aquin is right for you?

© 2026 Aquin. All rights reserved.

Aquin