Tracing facts through LLMs
causal tracing · SAE · logit lens · Llama 3.2 1B

The Attribution System

Aquin Labs · April 2026


When a language model answers "What is the capital of France?" with "Paris", it is not looking anything up. Somewhere in 1.2 billion parameters trained on a slice of the internet, the answer was stored during training and the model retrieves it at inference time through a sequence of matrix multiplications. The question we set out to answer is: where, exactly? And can we see the retrieval happen in real time?

At Aquin we built an attribution system to answer exactly this. It runs causal mediation analysis across every layer of the network, extracts interpretable features using a trained Sparse Autoencoder, and projects the residual stream back into vocabulary space at each depth to show how the model's confidence in an answer builds from the bottom up.

The experiment

We ran a single factual query end-to-end through the full pipeline. The prompt was intentionally simple so the causal structure would be clear and verifiable.

prompt: "What is the capital of France?"
response: "The capital of France is Paris."
model: meta-llama/Llama-3.2-1B-Instruct
SAE layer: 8 · n_features: 16,384 · L1_coeff: 10.0 · L0: ~679
noise_scale: 3.0 · n_noise_runs: 10 · seq_len: 64

We ran ROME-style causal mediation analysis: for each prompt token in turn, we corrupt its embedding with scaled Gaussian noise, run the forward pass, and measure how much the probability of the target response token drops. We average over multiple noise samples to reduce variance. The result is a score for every (prompt token, response token) pair.
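The corruption loop can be sketched in a few lines. Here the forward pass is replaced by a toy scalar "model" whose sensitivity weights are invented for illustration; in the real pipeline `target_prob` is a full forward pass over vector embeddings, but the loop structure is the same.

```python
import random

random.seed(0)

def target_prob(embeds):
    # Stand-in for a forward pass: a toy "model" whose probability of the
    # target token leans on positions 3-5 ("capital", "of", "France") and
    # barely on position 0 ("What"). These weights are invented.
    weights = [0.05, 0.0, 0.0, 0.40, 0.20, 0.35]
    score = sum(w * e for w, e in zip(weights, embeds))
    return min(1.0, max(0.0, score))  # clamp into [0, 1]

def causal_score(embeds, pos, noise_scale=3.0, n_noise_runs=10):
    """Average drop in target-token probability when the embedding at
    `pos` is corrupted with scaled Gaussian noise."""
    clean = target_prob(embeds)
    drops = []
    for _ in range(n_noise_runs):
        corrupted = list(embeds)
        corrupted[pos] += random.gauss(0.0, noise_scale)
        drops.append(clean - target_prob(corrupted))
    return sum(drops) / n_noise_runs

embeds = [1.0] * 6  # one scalar "embedding" per prompt token
scores = [causal_score(embeds, i) for i in range(len(embeds))]
```

Averaging over `n_noise_runs` samples, exactly as in the config above, is what keeps a single unlucky noise draw from dominating the score.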

What the attribution shows

The causal graph is strikingly clean. Three prompt tokens dominate: "capital", "of", and "France". Together they account for almost all of the causal signal driving "Paris" in the response. "What" contributes almost nothing: the model is not doing full-sentence pattern matching. It is composing the answer from the semantically load-bearing parts of the prompt.

prompt:   What · is · the · capital · of · France · ?
response: The · capital · of · France · is · Paris · .

France  → Paris   94%
capital → Paris   81%
of      → Paris   58%
What    → Paris    6%

causal attribution scores, normalized. higher = more responsible for producing "Paris".

Tokens are color-coded by their causal role: amber for significant prompt drivers, green for the key response token. The same coloring links prompt tokens to the response tokens they causally influence, giving a visual representation of how information flows from input to output.

The same pattern holds for "capital" in the response: it is primarily driven by "capital" and "France" in the prompt, not by "What" or "is". The model does not attend uniformly to its context. It identifies the semantically decisive words and routes most of the causal work through them.

The network: 16 layers, one peak

After establishing which prompt tokens matter, we run causal patching across all 16 transformer layers to find where in the network the fact is stored. For each layer, we restore its clean residual stream while keeping all other layers corrupted and measure how much probability the target token recovers.

The result is a layer-level causal responsibility score. A high score at layer L means that the representation at that layer is load-bearing for retrieving this fact. The graph encodes this directly: node brightness maps to causal drop percentage, edge thickness to signal strength.
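One common way to express this score (an assumption here, since the post does not spell out its exact normalization) is the fraction of the clean-versus-corrupted probability gap that restoring a layer recovers. A minimal sketch with illustrative probabilities, not measured values:

```python
def layer_recovery_score(p_clean, p_corrupt, p_patched):
    """Fraction of the clean target probability recovered when one
    layer's residual stream is restored in an otherwise corrupted run:
    (p_patched - p_corrupt) / (p_clean - p_corrupt)."""
    return (p_patched - p_corrupt) / (p_clean - p_corrupt)

# illustrative numbers: a fact-bearing layer recovers most of the gap,
# a peripheral layer recovers almost none of it
p_clean, p_corrupt = 0.88, 0.02
high = round(layer_recovery_score(p_clean, p_corrupt, 0.77), 2)  # → 0.87
low = round(layer_recovery_score(p_clean, p_corrupt, 0.05), 2)   # → 0.03
```

A score near 1.0 means that layer's representation alone is enough to recover the fact; a score near 0.0 means the layer is causally inert for this query.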

causal drop by layer (token/position embeddings and L0-L3: minimal):

L4 38% · L5 41% · L6 35% · L7 30% · L8 87% (peak) · L9 71% · L10 44% · L11 22% · L12 38% · L13 42% · L14 36% · L15 18% → out "Paris"

causal graph across all 16 transformer layers. node brightness = causal drop %. L8 (amber ring) is the peak: 87% causal responsibility for "Paris".

Layer 8 accounts for 87% of the causal signal for producing "Paris". The model has a specific location where the France-capital-Paris association lives, in the MLP sublayers around the midpoint of the network. Layers 4-7 show moderate warming as the representation of "capital of France" develops. Layers 12-15 contribute mainly by formatting and refining the output rather than encoding the fact.

This is consistent with the mechanistic interpretability literature on factual associations in transformers. Middle-layer MLPs act as key-value stores: the subject representation (here, "France" + "capital") is used as a key to look up and write the associated value ("Paris") into the residual stream.
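The key-value view of an MLP sublayer can be sketched directly: keys are rows of the input projection, values are rows of the output projection, and the write to the residual stream is a sum of values scaled by ReLU key matches. Dimensions and weights below are toy inventions, not the model's.

```python
def mlp_as_keyvalue(x, W_in, W_out):
    """Transformer MLP viewed as a key-value store:
    output = sum_i relu(key_i · x) * value_i."""
    out = [0.0] * len(W_out[0])
    for key, value in zip(W_in, W_out):
        match = max(0.0, sum(k * v for k, v in zip(key, x)))  # key match
        for j, vj in enumerate(value):
            out[j] += match * vj  # write the value, scaled by the match
    return out

# toy example: one key tuned to a "France + capital" direction whose
# value writes along a "Paris" direction (axis 2)
W_in = [[1.0, 1.0, 0.0]]       # key
W_out = [[0.0, 0.0, 1.0]]      # value
subject = [1.0, 0.5, 0.0]      # residual encoding "capital of France"
print(mlp_as_keyvalue(subject, W_in, W_out))  # → [0.0, 0.0, 1.5]
```

When the subject representation matches the stored key, the MLP writes the associated value into the residual stream; a non-matching residual leaves the output at zero.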

The logit lens: watching confidence build

The causal trace tells us which layer is responsible. The logit lens shows us what the model is "thinking" at each layer. After every transformer block, we take the residual stream, apply the final layer norm and unembed it directly into vocabulary space. The result is a probability distribution over the next token at each depth, as if the model had stopped processing there and been forced to guess.
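The lens itself is a few lines of code. A self-contained sketch with a toy three-word vocabulary and an invented unembedding matrix standing in for Llama's:

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(resid, W_unembed, vocab):
    """Unembed a residual-stream vector straight into vocabulary space:
    apply the final layer norm, project through the unembedding matrix,
    and softmax into a next-token distribution."""
    normed = layer_norm(resid)
    logits = [sum(w * r for w, r in zip(row, normed)) for row in W_unembed]
    probs = softmax(logits)
    return sorted(zip(vocab, probs), key=lambda p: -p[1])

vocab = ["Paris", "Lyon", "the"]
W_unembed = [[1.0, 0.0, 0.0, 1.0],   # "Paris" direction (invented)
             [0.0, 1.0, 0.0, 0.0],   # "Lyon"
             [0.0, 0.0, 1.0, 0.0]]   # "the"
resid = [2.0, 0.5, 0.1, 1.5]         # hypothetical mid-layer residual
top = logit_lens(resid, W_unembed, vocab)
```

Running this at every layer's residual stream, with the model's real layer norm and unembedding, produces the per-layer distributions shown below.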

For this query, the progression is striking. Early layers produce generic tokens like "the" and "city" with no commitment. Around layer 5, "France" briefly surfaces as the top prediction before the model narrows in. By layer 8, "Paris" dominates at 78% probability and the distribution barely changes through layer 15. The fact crystallizes exactly at the causal peak identified by the trace.

logit lens: top predictions per layer

layer       top-1          top-2          top-3
L0          the     12%    a        9%    an          6%
L1          the     14%    city     7%    a           6%
L2          the     15%    city     9%    its         7%
L3          city    11%    the     10%    its         8%
L4          France  18%    city    13%    the         9%
L5          France  22%    Paris   11%    city        8%
L6          Paris   29%    France  18%    Lyon        5%
L7          Paris   41%    France  14%    Lyon        4%
L8 (peak)   Paris   78%    Lyon     4%    "Paris,"    3%
L9          Paris   81%    Lyon     3%    Marseille   2%
L10         Paris   83%    Lyon     2%    Marseille   2%
L11         Paris   84%    Lyon     2%    Marseille   1%
L12         Paris   85%    Lyon     2%    Marseille   1%
L13         Paris   86%    Lyon     1%    Marseille   1%
L14         Paris   87%    Lyon     1%    Marseille   1%
L15         Paris   88%    Lyon     1%    Marseille   1%

residual stream unembedded at each layer. probability shown for the top predicted next token. L8 highlighted as the causal peak.

The lens also shows the model briefly entertaining "France" at layer 5 before committing to "Paris". This is the subject representation being assembled before the MLP at layer 8 applies the key-value lookup. The two-step structure, subject formation then fact retrieval, is visible directly in the layer-by-layer probability trace.

What the SAE sees

We trained a 16,384-feature SAE on 2 million residual stream activations at layer 8, then ran the query through it to extract the top activating features at each token position in the response.

For each active feature, we run a causal ablation: zero out that feature's contribution to the residual stream, re-run the forward pass, and compare the before-and-after logit distributions. The tokens most boosted and most suppressed by each feature define its functional role.
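A minimal sketch of the ablation step, using a tiny toy SAE (the real one has 16,384 features; the weights here are invented for illustration):

```python
def sae_encode(x, W_enc, b_enc):
    """SAE feature activations: relu(W_enc @ x + b_enc)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def ablate_feature(x, f, W_enc, b_enc, W_dec):
    """Zero out feature f by subtracting its decoder contribution
    (activation * decoder row) from the residual vector x."""
    act = sae_encode(x, W_enc, b_enc)[f]
    return [xi - act * w for xi, w in zip(x, W_dec[f])]

# toy 2-feature SAE over a 2-d residual stream
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
x = [3.0, 2.0]
x_ablated = ablate_feature(x, 0, W_enc, b_enc, W_dec)  # → [0.0, 2.0]
```

In the full pipeline the ablated residual is fed back through the remaining layers, and the resulting shift in the logit distribution is what defines the feature's functional role.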

top SAE features active when model produces "Paris"

feature   role                              top token    activation
f13933    geographic country associations   "France"     9.75
f13910    capital/seat-of-government        "capital"    7.86
f13007    European nation names             "France"     6.72
f4592     city names after capitals         "Paris"      5.82
f5042     relational prepositions           "capital"    5.38

all five features traced back to "capital", "of", "France" in the prompt. f13933 maps to geographic/country associations. activation values from layer 8 residual stream.

Feature f13933 fires at 9.75 for "France" in the response and traces directly back to "France" in the prompt. Feature f13910 fires at 7.86 for "capital" and traces back to both "capital" and "of". Feature f4592 fires for "Paris" itself and traces back to "capital" and "France": it is a feature that specifically encodes the answer to capital-of queries for certain countries.

The SAE decomposition completes the picture. The causal trace tells us layer 8 is the critical site. The logit lens shows confidence in "Paris" crystallizing there. The SAE shows exactly which features are carrying that information and how they connect back to the specific words in the prompt that activated them.

Cross-token feature attribution

The final layer of analysis connects prompt and response at the feature level. For each response token, Aquin finds which SAE features active on that token also fired on prompt tokens earlier in the context. The overlap, weighted by activation magnitude, builds a feature-level causal bridge between what the model was given and what it produced.
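One way to sketch such a bridge score; the exact weighting Aquin uses is not specified here, so the min-activation overlap below is an assumption, and the prompt-side activation values are invented:

```python
def feature_bridge(prompt_feats, response_feats):
    """Activation-weighted overlap between the SAE features active on a
    prompt token and those active on a response token. Each argument
    maps feature id -> activation."""
    shared = sorted(set(prompt_feats) & set(response_feats))
    score = sum(min(prompt_feats[f], response_feats[f]) for f in shared)
    return score, shared

# feature ids from the analysis above; prompt-side values illustrative
france_prompt = {13933: 9.10, 13007: 6.00}
paris_response = {13933: 9.75, 13007: 6.72, 4592: 5.82}
score, shared = feature_bridge(france_prompt, paris_response)
```

A high score with a small shared set, as for "France" → "Paris" here, is exactly the signature of a specific fact being routed through a handful of features rather than diffuse context mixing.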

feature bridges: prompt tokens driving response tokens

"France"  → "Paris"    via f13933, f13007
"capital" → "capital"  via f13910, f5042
"of"      → "France"   via f13910

feature indices shown are the SAE features bridging each prompt-response pair. activation overlap weighted by magnitude.
