Inspection: LLMs (non-SAE)
Tools that inspect LLM behavior through activations, attention, and weight health without decomposing the residual stream into SAE features. Requires LLM mode: load an LLM with aquin load --model before running any command below.
4 commands
aquin attention
agent tool: run_attention_routing
Runs a forward pass on the prompt and extracts per-head attention weight matrices for every layer. Each head receives a routing score summarizing how concentrated its attention is. Use this to see which tokens attend to which others, and which heads specialize in syntax versus content.
| Flag | Description |
|---|---|
| --prompt* | Input text to analyze. |
| --top_k | Number of top heads to highlight (default: 8). |
Syncs to web as an attention routing card in the orchestrator panel.
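To make "how concentrated its attention is" concrete, here is a minimal sketch of one possible concentration score. It is not aquin's implementation: the formula (one minus normalized entropy, averaged over query rows), the function names `concentration` and `top_heads`, and the toy head labels are all assumptions chosen for illustration.

```python
import math

def concentration(attn_rows):
    """Mean (1 - normalized entropy) over the query rows of one head.

    attn_rows: list of rows, each a probability distribution over key tokens.
    Returns a value in [0, 1]: 1.0 means every query attends to a single
    token, 0.0 means every query spreads attention uniformly.
    Hypothetical metric, not necessarily the score aquin reports.
    """
    scores = []
    for row in attn_rows:
        n = len(row)
        if n < 2:
            scores.append(1.0)  # a single key is trivially "focused"
            continue
        entropy = -sum(p * math.log(p) for p in row if p > 0)
        scores.append(1.0 - entropy / math.log(n))
    return sum(scores) / len(scores)

def top_heads(heads, top_k=8):
    """Rank heads by concentration, highest first, and keep the top_k."""
    scored = [(name, concentration(rows)) for name, rows in heads.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]

# Toy attention matrices (2 query tokens x 3 key tokens) for three heads.
heads = {
    "L0.H0": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],          # sharply focused
    "L0.H1": [[1 / 3, 1 / 3, 1 / 3], [1 / 3, 1 / 3, 1 / 3]],  # uniform
    "L1.H0": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],          # mostly focused
}

for name, score in top_heads(heads, top_k=2):
    print(name, round(score, 3))
```

With a causal mask, later keys carry zero weight for early queries, so a real implementation would likely normalize each row by the number of unmasked positions rather than the full row length.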
aquin layer-analysis
agent tool: run_layer_analysis
Measures two things at each layer: activation stability across paraphrased prompts, and out-of-distribution (OOD) similarity. The stability score captures how much hidden states drift when the same meaning is phrased differently. OOD similarity compares in-domain and OOD prompt activations to flag layers that collapse on unfamiliar input.
| Flag | Description |
|---|---|
| --prompts | JSON array of paraphrases for stability analysis. |
| --in_domain_prompts | In-distribution prompts for OOD comparison. |
| --ood_prompts | Out-of-distribution prompts. |
| --top_k | Number of top layers to report. |
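The two measurements can be sketched on toy vectors as follows. This is an illustration under stated assumptions, not aquin's code: it assumes each prompt is reduced to one hidden-state vector per layer, that stability is the mean pairwise cosine similarity across paraphrases, and that OOD similarity is the cosine between the mean in-domain and mean OOD vectors. The names `stability` and `ood_similarity` are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def stability(layer_states):
    """Mean pairwise cosine similarity across paraphrases at one layer.

    layer_states: one hidden-state vector per paraphrase (at least two).
    Values near 1.0 mean the layer barely drifts under rephrasing.
    """
    pairs = list(combinations(layer_states, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def mean_vector(vectors):
    """Element-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def ood_similarity(in_domain_states, ood_states):
    """Cosine between the mean in-domain and mean OOD vectors at one layer."""
    return cosine(mean_vector(in_domain_states), mean_vector(ood_states))

# Toy hidden states for three paraphrases at two layers.
stable_layer = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0]]
drifting_layer = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print("stable layer:  ", round(stability(stable_layer), 3))
print("drifting layer:", round(stability(drifting_layer), 3))
print("OOD similarity:", ood_similarity([[1.0, 0.0], [1.0, 0.0]],
                                        [[0.0, 1.0], [0.0, 1.0]]))
```

Other reductions (last-token state instead of a mean over tokens, or a distance other than cosine) would be equally plausible; the source does not specify which one the command uses.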
aquin perturbation
agent tool: run_perturbation_sensitivity
Zeroes out (or adds Gaussian noise to) hidden channels one at a time and measures KL divergence between the perturbed output distribution and the clean baseline. Channels with high KL impact are sensitivity hotspots, useful for finding brittle representations.
| Flag | Description |
|---|---|
| --prompt* | Input prompt. |
| --n_channels | Number of channels to perturb (default: 32). |
| --method | dropout (zero the channel) or gaussian (add Gaussian noise) (default: dropout). |
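The perturb-and-compare loop can be sketched on a toy model as below. Everything here is an assumption for illustration: the "model" is a single linear projection from a hidden vector to logits, the KL direction is KL(clean || perturbed), and `channel_sensitivity`, `sigma`, and `seed` are hypothetical names rather than aquin parameters.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def project(hidden, weights):
    """Toy output head: one weight row per vocabulary entry."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]

def channel_sensitivity(hidden, weights, method="dropout", sigma=1.0, seed=0):
    """Perturb one hidden channel at a time and return KL per channel."""
    rng = random.Random(seed)
    clean = softmax(project(hidden, weights))
    scores = []
    for i in range(len(hidden)):
        perturbed = list(hidden)
        if method == "dropout":
            perturbed[i] = 0.0                      # zero the channel
        elif method == "gaussian":
            perturbed[i] += rng.gauss(0.0, sigma)   # add Gaussian noise
        else:
            raise ValueError(f"unknown method: {method}")
        noisy = softmax(project(perturbed, weights))
        scores.append(kl_divergence(clean, noisy))
    return scores

# Toy setup: 3 hidden channels, 3 vocabulary entries.
hidden = [2.0, 0.5, 0.0]
weights = [[3.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.0, 0.0, 1.0]]

scores = channel_sensitivity(hidden, weights, method="dropout")
for channel, score in enumerate(scores):
    print(f"channel {channel}: KL = {score:.4f}")
```

In this toy setup channel 0 dominates the output, so zeroing it moves the distribution the most, while channel 2 is already zero and shows no impact under dropout. In a real model the perturbation would be applied through a forward hook at a chosen layer rather than to a standalone vector.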
aquin check-weights
agent tool: check_weights
Scans all weight tensors for trojan/backdoor signatures (kurtosis spikes, outlier density, singular-value ratio) and runs SVD rank analysis across Q/K/V/O/MLP matrices. Flags collapsed or suspicious layers before you trust a checkpoint.
| Flag | Description |
|---|---|
| --collapse_threshold | SVD rank ratio below which a layer is flagged collapsed (default: 0.01). |
Renders trojan scan and rank health cards side-by-side on the web.
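Two of the signals named above, kurtosis spikes and outlier density, can be illustrated on a flat list of weights. This is a dependency-free sketch, not aquin's scanner: the 6-sigma outlier cutoff and the function names are assumptions, and the SVD rank analysis is omitted because it needs a linear-algebra library.

```python
import math
import random

def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (about 0 for a Gaussian)."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    if variance == 0:
        return 0.0
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / variance ** 2 - 3.0

def outlier_density(values, n_sigma=6.0):
    """Fraction of weights more than n_sigma standard deviations from the mean.

    The 6-sigma cutoff is a hypothetical choice for this example.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return 0.0
    return sum(1 for v in values if abs(v - mean) > n_sigma * std) / n

# A well-behaved tensor: small Gaussian weights.
rng = random.Random(0)
clean = [rng.gauss(0.0, 0.02) for _ in range(10_000)]

# The same tensor with five implausibly large weights planted in it.
tampered = list(clean)
for i in range(5):
    tampered[i] = 1.0

for label, tensor in (("clean", clean), ("tampered", tampered)):
    print(label,
          "kurtosis:", round(excess_kurtosis(tensor), 2),
          "outlier density:", outlier_density(tensor))
```

A handful of extreme weights is enough to push kurtosis far above the near-zero value of the clean tensor, which is why heavy-tailed statistics are a cheap first-pass signal. They are suggestive rather than conclusive, since some healthy layers also have heavy-tailed weights.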
