Tracing facts through LLMs
causal tracing · SAE · logit lens · Llama 3.2 1B

The Attribution System

Aquin Labs · April 2026


When a language model answers "What is the capital of France?" with "Paris", it is not looking anything up. Somewhere in 1.2 billion parameters trained on a slice of the internet, the answer was stored during training and the model retrieves it at inference time through a sequence of matrix multiplications. The question we set out to answer is: where, exactly? And can we see the retrieval happen in real time?

At Aquin we built an attribution system to answer exactly this. It runs causal mediation analysis across every layer of the network, extracts interpretable features using a trained Sparse Autoencoder, and projects the residual stream back into vocabulary space at each depth to show how the model's confidence in an answer builds from the bottom up.

The experiment

We ran a single factual query end-to-end through the full pipeline. The prompt was intentionally simple so the causal structure would be clear and verifiable.

prompt: "What is the capital of France?"
response: "The capital of France is Paris."
model: meta-llama/Llama-3.2-1B-Instruct
SAE layer: 8 · n_features: 16,384 · L1_coeff: 10.0 · L0: ~679
noise_scale: 3.0 · n_noise_runs: 10 · seq_len: 64

We ran ROME-style causal mediation analysis: for each prompt token in turn, we corrupt its embedding with scaled Gaussian noise, run the forward pass, and measure how much the probability of the target response token drops. We average over multiple noise samples to reduce variance. The result is a score for every (prompt token, response token) pair.
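The corruption loop can be sketched in a few lines. Here the forward pass is replaced by a toy scalar "model" whose sensitivity weights are invented for illustration; in the real pipeline `target_prob` is a full forward pass over vector embeddings, but the loop structure is the same.

```python
import random

random.seed(0)

def target_prob(embeds):
    # Stand-in for a forward pass: a toy "model" whose probability of the
    # target token leans on positions 3-5 ("capital", "of", "France") and
    # barely on position 0 ("What"). These weights are invented.
    weights = [0.05, 0.0, 0.0, 0.40, 0.20, 0.35]
    score = sum(w * e for w, e in zip(weights, embeds))
    return min(1.0, max(0.0, score))  # clamp into [0, 1]

def causal_score(embeds, pos, noise_scale=3.0, n_noise_runs=10):
    """Average drop in target-token probability when the embedding at
    `pos` is corrupted with scaled Gaussian noise."""
    clean = target_prob(embeds)
    drops = []
    for _ in range(n_noise_runs):
        corrupted = list(embeds)
        corrupted[pos] += random.gauss(0.0, noise_scale)
        drops.append(clean - target_prob(corrupted))
    return sum(drops) / n_noise_runs

embeds = [1.0] * 6  # one scalar "embedding" per prompt token
scores = [causal_score(embeds, i) for i in range(len(embeds))]
```

Averaging over `n_noise_runs` samples, exactly as in the config above, is what keeps a single unlucky noise draw from dominating the score.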

What the attribution shows

The causal graph is strikingly clean. Three prompt tokens dominate: "capital", "of", and "France". Together they account for almost all of the causal signal driving "Paris" in the response. "What" contributes almost nothing: the model is not doing full-sentence pattern matching. It is composing the answer from the semantically load-bearing parts of the prompt.

prompt:   What · is · the · capital · of · France · ?
response: The · capital · of · France · is · Paris · .

France  → Paris   94%
capital → Paris   81%
of      → Paris   58%
What    → Paris    6%

causal attribution scores, normalized. higher = more responsible for producing "Paris".

Tokens are color-coded by their causal role: amber for significant prompt drivers, green for the key response token. The same coloring links prompt tokens to the response tokens they causally influence, giving a visual representation of how information flows from input to output.

The same pattern holds for "capital" in the response: it is primarily driven by "capital" and "France" in the prompt, not by "What" or "is". The model does not attend uniformly to its context. It identifies the semantically decisive words and routes most of the causal work through them.

The network: 16 layers, one peak

After establishing which prompt tokens matter, we run causal patching across all 16 transformer layers to find where in the network the fact is stored. For each layer, we restore its clean residual stream while keeping all other layers corrupted and measure how much probability the target token recovers.

The result is a layer-level causal responsibility score. A high score at layer L means that the representation at that layer is load-bearing for retrieving this fact. The graph encodes this directly: node brightness maps to causal drop percentage, edge thickness to signal strength.
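One common way to express this score (an assumption here, since the post does not spell out its exact normalization) is the fraction of the clean-versus-corrupted probability gap that restoring a layer recovers. A minimal sketch with illustrative probabilities, not measured values:

```python
def layer_recovery_score(p_clean, p_corrupt, p_patched):
    """Fraction of the clean target probability recovered when one
    layer's residual stream is restored in an otherwise corrupted run:
    (p_patched - p_corrupt) / (p_clean - p_corrupt)."""
    return (p_patched - p_corrupt) / (p_clean - p_corrupt)

# illustrative numbers: a fact-bearing layer recovers most of the gap,
# a peripheral layer recovers almost none of it
p_clean, p_corrupt = 0.88, 0.02
high = round(layer_recovery_score(p_clean, p_corrupt, 0.77), 2)  # → 0.87
low = round(layer_recovery_score(p_clean, p_corrupt, 0.05), 2)   # → 0.03
```

A score near 1.0 means that layer's representation alone is enough to recover the fact; a score near 0.0 means the layer is causally inert for this query.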

causal drop by layer (token/position embeddings and L0-L3: minimal):

L4 38% · L5 41% · L6 35% · L7 30% · L8 87% (peak) · L9 71% · L10 44% · L11 22% · L12 38% · L13 42% · L14 36% · L15 18% → out "Paris"

causal graph across all 16 transformer layers. node brightness = causal drop %. L8 (amber ring) is the peak: 87% causal responsibility for "Paris".

Layer 8 accounts for 87% of the causal signal for producing "Paris". The model has a specific location where the France-capital-Paris association lives, in the MLP sublayers around the midpoint of the network. Layers 4-7 show moderate warming as the representation of "capital of France" develops. Layers 12-15 contribute mainly by formatting and refining the output rather than encoding the fact.

This is consistent with the mechanistic interpretability literature on factual associations in transformers. Middle-layer MLPs act as key-value stores: the subject representation (here, "France" + "capital") is used as a key to look up and write the associated value ("Paris") into the residual stream.
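The key-value view of an MLP sublayer can be sketched directly: keys are rows of the input projection, values are rows of the output projection, and the write to the residual stream is a sum of values scaled by ReLU key matches. Dimensions and weights below are toy inventions, not the model's.

```python
def mlp_as_keyvalue(x, W_in, W_out):
    """Transformer MLP viewed as a key-value store:
    output = sum_i relu(key_i · x) * value_i."""
    out = [0.0] * len(W_out[0])
    for key, value in zip(W_in, W_out):
        match = max(0.0, sum(k * v for k, v in zip(key, x)))  # key match
        for j, vj in enumerate(value):
            out[j] += match * vj  # write the value, scaled by the match
    return out

# toy example: one key tuned to a "France + capital" direction whose
# value writes along a "Paris" direction (axis 2)
W_in = [[1.0, 1.0, 0.0]]       # key
W_out = [[0.0, 0.0, 1.0]]      # value
subject = [1.0, 0.5, 0.0]      # residual encoding "capital of France"
print(mlp_as_keyvalue(subject, W_in, W_out))  # → [0.0, 0.0, 1.5]
```

When the subject representation matches the stored key, the MLP writes the associated value into the residual stream; a non-matching residual leaves the output at zero.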

The logit lens: watching confidence build

The causal trace tells us which layer is responsible. The logit lens shows us what the model is "thinking" at each layer. After every transformer block, we take the residual stream, apply the final layer norm and unembed it directly into vocabulary space. The result is a probability distribution over the next token at each depth, as if the model had stopped processing there and been forced to guess.
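The lens itself is a few lines of code. A self-contained sketch with a toy three-word vocabulary and an invented unembedding matrix standing in for Llama's:

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(resid, W_unembed, vocab):
    """Unembed a residual-stream vector straight into vocabulary space:
    apply the final layer norm, project through the unembedding matrix,
    and softmax into a next-token distribution."""
    normed = layer_norm(resid)
    logits = [sum(w * r for w, r in zip(row, normed)) for row in W_unembed]
    probs = softmax(logits)
    return sorted(zip(vocab, probs), key=lambda p: -p[1])

vocab = ["Paris", "Lyon", "the"]
W_unembed = [[1.0, 0.0, 0.0, 1.0],   # "Paris" direction (invented)
             [0.0, 1.0, 0.0, 0.0],   # "Lyon"
             [0.0, 0.0, 1.0, 0.0]]   # "the"
resid = [2.0, 0.5, 0.1, 1.5]         # hypothetical mid-layer residual
top = logit_lens(resid, W_unembed, vocab)
```

Running this at every layer's residual stream, with the model's real layer norm and unembedding, produces the per-layer distributions shown below.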

For this query, the progression is striking. Early layers produce generic tokens like "the" and "city" with no commitment. Around layer 5, "France" briefly surfaces as the top prediction before the model narrows in. By layer 8, "Paris" dominates at 78% probability and the distribution barely changes through layer 15. The fact crystallizes exactly at the causal peak identified by the trace.

logit lens: top predictions per layer

layer       top-1          top-2          top-3
L0          the     12%    a        9%    an          6%
L1          the     14%    city     7%    a           6%
L2          the     15%    city     9%    its         7%
L3          city    11%    the     10%    its         8%
L4          France  18%    city    13%    the         9%
L5          France  22%    Paris   11%    city        8%
L6          Paris   29%    France  18%    Lyon        5%
L7          Paris   41%    France  14%    Lyon        4%
L8 (peak)   Paris   78%    Lyon     4%    "Paris,"    3%
L9          Paris   81%    Lyon     3%    Marseille   2%
L10         Paris   83%    Lyon     2%    Marseille   2%
L11         Paris   84%    Lyon     2%    Marseille   1%
L12         Paris   85%    Lyon     2%    Marseille   1%
L13         Paris   86%    Lyon     1%    Marseille   1%
L14         Paris   87%    Lyon     1%    Marseille   1%
L15         Paris   88%    Lyon     1%    Marseille   1%

residual stream unembedded at each layer. probability shown for the top predicted next token. L8 highlighted as the causal peak.

The lens also shows the model briefly entertaining "France" at layer 5 before committing to "Paris". This is the subject representation being assembled before the MLP at layer 8 applies the key-value lookup. The two-step structure, subject formation then fact retrieval, is visible directly in the layer-by-layer probability trace.

What the SAE sees

We trained a 16,384-feature SAE on 2 million residual stream activations at layer 8, then ran the query through it to extract the top activating features at each token position in the response.

For each active feature, we run a causal ablation: zero out that feature's contribution to the residual stream, re-run the forward pass, and compare the before-and-after logit distributions. The tokens most boosted and most suppressed by each feature define its functional role.
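A minimal sketch of the ablation step, using a tiny toy SAE (the real one has 16,384 features; the weights here are invented for illustration):

```python
def sae_encode(x, W_enc, b_enc):
    """SAE feature activations: relu(W_enc @ x + b_enc)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def ablate_feature(x, f, W_enc, b_enc, W_dec):
    """Zero out feature f by subtracting its decoder contribution
    (activation * decoder row) from the residual vector x."""
    act = sae_encode(x, W_enc, b_enc)[f]
    return [xi - act * w for xi, w in zip(x, W_dec[f])]

# toy 2-feature SAE over a 2-d residual stream
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
x = [3.0, 2.0]
x_ablated = ablate_feature(x, 0, W_enc, b_enc, W_dec)  # → [0.0, 2.0]
```

In the full pipeline the ablated residual is fed back through the remaining layers, and the resulting shift in the logit distribution is what defines the feature's functional role.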

top SAE features active when model produces "Paris"

feature   role                              top token    activation
f13933    geographic country associations   "France"     9.75
f13910    capital/seat-of-government        "capital"    7.86
f13007    European nation names             "France"     6.72
f4592     city names after capitals         "Paris"      5.82
f5042     relational prepositions           "capital"    5.38

all five features traced back to "capital", "of", "France" in the prompt. f13933 maps to geographic/country associations. activation values from layer 8 residual stream.

Feature f13933 fires at 9.75 for "France" in the response and traces directly back to "France" in the prompt. Feature f13910 fires at 7.86 for "capital" and traces back to both "capital" and "of". Feature f4592 fires for "Paris" itself and traces back to "capital" and "France": it is a feature that specifically encodes the answer to capital-of queries for certain countries.

The SAE decomposition completes the picture. The causal trace tells us layer 8 is the critical site. The logit lens shows confidence in "Paris" crystallizing there. The SAE shows exactly which features are carrying that information and how they connect back to the specific words in the prompt that activated them.

Cross-token feature attribution

The final layer of analysis connects prompt and response at the feature level. For each response token, Aquin finds which SAE features active on that token also fired on prompt tokens earlier in the context. The overlap, weighted by activation magnitude, builds a feature-level causal bridge between what the model was given and what it produced.
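One way to sketch such a bridge score; the exact weighting Aquin uses is not specified here, so the min-activation overlap below is an assumption, and the prompt-side activation values are invented:

```python
def feature_bridge(prompt_feats, response_feats):
    """Activation-weighted overlap between the SAE features active on a
    prompt token and those active on a response token. Each argument
    maps feature id -> activation."""
    shared = sorted(set(prompt_feats) & set(response_feats))
    score = sum(min(prompt_feats[f], response_feats[f]) for f in shared)
    return score, shared

# feature ids from the analysis above; prompt-side values illustrative
france_prompt = {13933: 9.10, 13007: 6.00}
paris_response = {13933: 9.75, 13007: 6.72, 4592: 5.82}
score, shared = feature_bridge(france_prompt, paris_response)
```

A high score with a small shared set, as for "France" → "Paris" here, is exactly the signature of a specific fact being routed through a handful of features rather than diffuse context mixing.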

feature bridges: prompt tokens driving response tokens

"France"  → "Paris"    via f13933, f13007
"capital" → "capital"  via f13910, f5042
"of"      → "France"   via f13910

feature indices shown are the SAE features bridging each prompt-response pair. activation overlap weighted by magnitude.
