The Checking System


Aquin Labs · April 2026

What the model says is not what the model knows

Understanding how a language model processes a prompt (which layers activate, which features fire, which tokens carry causal weight) is one layer of interpretability. But it says nothing about whether the model's output is correct, biased, or suppressed. A feature can trace cleanly to a factual claim that happens to be wrong. A response can flow through the network without any anomaly and still systematically omit entire topic areas.

Aquin's checking system operates on the response itself, not the weights. It runs automatically after every model generation and populates the right-hand sidebar with three analyses: a fact check that verifies individual claims against live web search, a bias detection pass that measures lean across axes derived from the specific content, and a censor audit that maps which topics the model addressed, softened, or avoided.

The checks

All three checks run in parallel after the logit lens computation completes. They operate on the same (prompt, response) pair, with fact checking additionally using a live web search tool. Results appear in the Fact Check tab of the sidebar as they arrive.
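The parallel fan-out can be sketched with asyncio. The coroutine names and result shapes below are illustrative stand-ins, not Aquin's actual API; the sketch only shows the structure of running three checks concurrently on the same (prompt, response) pair.

```python
import asyncio

# Hypothetical per-check coroutines; each receives the same (prompt, response) pair.
async def fact_check(prompt: str, response: str) -> dict:
    return {"check": "fact", "claims": []}

async def bias_detection(prompt: str, response: str) -> dict:
    return {"check": "bias", "axes": []}

async def censor_audit(prompt: str, response: str) -> dict:
    return {"check": "censor", "topics": []}

async def run_checks(prompt: str, response: str) -> list[dict]:
    # All three checks run concurrently; gather preserves order, while the
    # sidebar can render each result as its coroutine completes.
    return await asyncio.gather(
        fact_check(prompt, response),
        bias_detection(prompt, response),
        censor_audit(prompt, response),
    )

results = asyncio.run(run_checks("tell me about the Eiffel Tower", "..."))
```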

01 · Fact Check · Claim verification

Extracts distinct factual claims from the model's response and verifies each one via web search. Returns a verdict of supported, refuted, or unverifiable per claim, with sources.

02 · Bias Detection · Lean measurement

Identifies bias dimensions that are genuinely relevant to the specific content and scores where the response sits on each axis. Axes are derived from the response itself, not a fixed list.

03 · Censor Audit · Suppression mapping

Assesses how the model treated each topic area naturally relevant to the prompt: whether it responded directly, hedged, or avoided. Attempts to classify suppression as weight-level or surface-level.

Fact Check: is it true?

The model's response is passed to a verification pipeline with web search enabled. It extracts every distinct verifiable claim from the response, skips opinions and filler, then searches the web and classifies each claim as supported, refuted, or unverifiable.

Each claim comes back with a verdict, a one-sentence explanation, and up to three sources that directly support or refute it. Sources include titles and URLs; the sidebar renders them with favicons and opens them in a new tab on click. Claims that cannot be verified via search (speculative statements, very recent events, highly niche facts) come back as unverifiable with an empty source list rather than a guess.
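The per-claim result described above maps naturally onto a small record type. This is a sketch of that shape, not Aquin's actual schema; the class names and the placeholder URL are assumptions.

```python
from dataclasses import dataclass, field
from typing import Literal

# The three verdicts the fact check can return per claim.
Verdict = Literal["supported", "refuted", "unverifiable"]

@dataclass
class Source:
    title: str
    url: str  # rendered with a favicon, opens in a new tab

@dataclass
class ClaimResult:
    claim: str
    verdict: Verdict
    explanation: str
    # Unverifiable claims come back with an empty source list.
    sources: list[Source] = field(default_factory=list)

result = ClaimResult(
    claim="The Eiffel Tower is the tallest structure in Europe",
    verdict="refuted",
    explanation="Several structures including the Ostankino Tower in Moscow are taller.",
    sources=[Source("List of tallest structures in Europe",
                    "https://example.org/placeholder")],  # placeholder URL
)
```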

Live web search is used rather than a retrieval-augmented approach because it gives access to current information. A model responding about a recent event may produce a claim that was accurate at training time and has since become false. The fact check catches this.

example: prompt "tell me about the Eiffel Tower"

Supported

The Eiffel Tower is 330 meters tall

The Eiffel Tower stands 330 meters tall including its broadcast antenna.

Eiffel Tower official site

Supported

The Eiffel Tower was built in 1889

Construction was completed in 1889 for the World's Fair.

Britannica: Eiffel Tower

Refuted

The Eiffel Tower is the tallest structure in Europe

Several structures including the Ostankino Tower in Moscow are taller.

List of tallest structures in Europe

three of the claims a typical response might make. the third is incorrect and gets caught.

The fact check surfaces the gap between what a model confidently asserts and what is actually verifiable. This is especially useful for models with a training cutoff: they answer in the present tense about a world that may have changed. The check does not attempt to correct the model's response; it annotates it.

Bias Detection: which direction does it lean?

Most bias detection tools apply a fixed set of axes (political lean, sentiment, formality) to every piece of text regardless of what it is about. Aquin's bias detection derives its axes from the content. The pipeline identifies 2-4 bias dimensions that are genuinely relevant to the specific prompt and response, then scores where the response sits on each one from −1.0 to +1.0.

A response about climate policy might yield axes like "alarmist vs dismissive" and "individual responsibility vs systemic change." A medical response might yield "conservative treatment vs aggressive intervention." A historical overview might yield "triumphalist vs critical framing." The axes shift with the content rather than being imposed on it.

The sidebar renders each axis as a horizontal track with a marker showing the score. The filled segment runs from the center to the marker so neutral reads as an empty center, and lean in either direction is immediately visible. A brief explanation sits below each track.
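The center-anchored fill is a small geometric mapping from score to track segment. A minimal sketch, assuming a percent-based track with its center at 50%; this is an illustration of the layout rule, not Aquin's rendering code:

```python
def axis_fill(score: float) -> tuple[float, float]:
    """Map a bias score in [-1.0, +1.0] to a (left, width) fill segment,
    both in percent, on a horizontal track whose center sits at 50%."""
    score = max(-1.0, min(1.0, score))   # clamp out-of-range scores
    width = abs(score) * 50.0            # fill runs from center toward the marker
    left = 50.0 if score >= 0 else 50.0 - width
    return (left, width)

# A neutral score leaves the center empty; lean fills toward the marker.
neutral = axis_fill(0.0)      # empty segment at the center
confident = axis_fill(0.55)   # fill extends right of center
western = axis_fill(-0.4)     # fill extends left of center
```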

example: bias axes for a response about the Eiffel Tower

hedged ← certainty framing → confident

The response states facts without qualification even where debate exists.

Western-centric ← cultural lens → global

Examples and framing draw primarily from Western European and American contexts.

scores: certainty framing +0.55 · cultural lens −0.4. center tick marks 0.0.

Content-derived axes matter because fixed axes are often the wrong lens. Applying a left-right political scale to a response about database architecture produces a misleading neutral score and no information. Deriving axes from the content means the analysis is always asking the most relevant question about that specific response.

The check closes with a one to two sentence summary of the overall bias profile. This is displayed below the axes in the sidebar and gives a compact reading that does not require interpreting each axis individually.

Censor Audit: what did it not say?

Fact check and bias detection both operate on what the model said. Censor audit operates on what it did not. Given the prompt, the pipeline identifies 3-6 topic areas that were present in the query or naturally relevant to it, then assesses how the model treated each one in its response.

Each topic is classified as one of three statuses. Unfiltered means the model addressed it directly and fully without notable hedging. Softened means the model touched it but with excessive caveats, watered-down phrasing, or a framing that deflected from the core of the topic. Suppressed means the model avoided or refused it altogether.

The check also attempts to classify the origin of any suppression it finds. Weight-level suppression is baked into the model's weights through training and shows up as consistent avoidance across different prompt framings. Surface-level suppression looks more like an instruction-following patch: the model starts to engage with a topic, then redirects. This classification is presented as a short note at the bottom of the audit section, not as a definitive verdict; the distinction is genuinely hard to make from output alone without running the kind of causal analysis Aquin's attribution system provides.
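The three statuses and the optional origin note can be modeled as a small record per topic. Names here are hypothetical; the sketch only mirrors the structure the audit reports.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class TopicStatus(Enum):
    UNFILTERED = "unfiltered"   # addressed directly and fully
    SOFTENED = "softened"       # touched, but hedged or deflected
    SUPPRESSED = "suppressed"   # avoided or refused altogether

class SuppressionOrigin(Enum):
    WEIGHT_LEVEL = "weight-level"     # consistent avoidance across framings
    SURFACE_LEVEL = "surface-level"   # engages, then redirects

@dataclass
class TopicAssessment:
    topic: str
    status: TopicStatus
    note: str
    # Only populated when suppression is found, and only as a hypothesis.
    origin: Optional[SuppressionOrigin] = None

audit = [
    TopicAssessment("construction cost", TopicStatus.UNFILTERED,
                    "Discussed budget and financing without hedging."),
    TopicAssessment("political opposition", TopicStatus.SUPPRESSED,
                    "Avoided the public opposition to the tower's construction.",
                    SuppressionOrigin.SURFACE_LEVEL),
]
```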

example: censor audit for a response about the Eiffel Tower

construction cost · Model discussed budget and financing details without hedging. · unfiltered
safety incidents · Acknowledged historical accidents but framed them as resolved. · softened
political opposition · Avoided the substantial public and political opposition to the tower's construction. · suppressed

surface-level RLHF patch detected on political opposition

the model discussed the tower freely but avoided the historical controversy around its construction. origin note suggests instruction-tuning rather than weight-level suppression.

The censor audit is the most speculative of the checks. It relies on a secondary analysis making a judgment about what topics were "naturally relevant" to a prompt, which is inherently subjective. What it is good at is flagging systematic avoidance, where a model consistently deflects a class of topics across many prompts, rather than one-off omissions that might simply reflect a focused response.
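Separating that cross-prompt pattern from one-off omissions is a simple aggregation. A sketch, assuming each audit has been reduced to (topic, status) pairs and using a hypothetical suppression-rate threshold:

```python
from collections import Counter

def systematic_suppression(audits: list[list[tuple[str, str]]],
                           min_rate: float = 0.6) -> dict[str, float]:
    """Flag topics suppressed in at least `min_rate` of the audits where they
    appear, ignoring topics seen only once (possible one-off omissions).
    Each audit is a list of (topic, status) pairs."""
    seen, suppressed = Counter(), Counter()
    for audit in audits:
        for topic, status in audit:
            seen[topic] += 1
            if status == "suppressed":
                suppressed[topic] += 1
    return {t: suppressed[t] / seen[t]
            for t in seen
            if seen[t] > 1 and suppressed[t] / seen[t] >= min_rate}

audits = [
    [("political opposition", "suppressed"), ("construction cost", "unfiltered")],
    [("political opposition", "suppressed"), ("construction cost", "unfiltered")],
    [("political opposition", "softened")],
]
flagged = systematic_suppression(audits)
```

Only "political opposition" clears the threshold here: it was suppressed in two of the three audits that mentioned it.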

The origin note is intended as a hypothesis to investigate, not a finding. Confirming whether suppression is weight-level requires causal intervention: ablating specific features or attention heads and checking whether the suppression disappears. That is the kind of experiment Aquin's attribution and benchmark systems are designed to support. The censor audit surfaces the candidate; the interpretability tools let you trace it.

Relationship to the attribution system

The checking system and the attribution system are complementary. Attribution traces the causal structure of a response inside the model: which layers, features, and prompt tokens produced each output token. Checking evaluates the content of the response from the outside: whether the claims are true, whether the framing is biased, whether certain topics were avoided.

When the censor audit flags a topic as suppressed, the attribution system is the right tool for the follow-up question. Which features were active when the model started to engage with that topic and then deflected? Is there a feature at layer 8 with high MUI that fires specifically on that topic class and whose ablation removes the deflection? The censor audit identifies the behavioral pattern; the attribution system traces it to a mechanism.

Similarly, when the fact check finds a refuted claim, the causal trace and logit lens can show when in the forward pass the model committed to that claim. If confidence in the wrong token crystallizes at layer 8 (the same layer where factual associations tend to concentrate in Llama's architecture), the SAE features active there are the natural candidates for a weight edit that corrects the association.
