
Inspection: LLMs (non-SAE)

Tools that inspect LLM behavior through activations, attention, and weight health without decomposing the residual stream into SAE features. Requires LLM mode: load an LLM with aquin load --model before running any command below.

Prerequisite: aquin load --model gpt2-small (or pythia-70m, llama-3.2-1b, etc.)

4 commands

aquin attention

agent tool: run_attention_routing

Runs a forward pass on the prompt and extracts the per-head attention weight matrices for every layer. Each head receives a routing score that summarizes how concentrated its attention is. Use this to see which tokens attend to which others, and which heads specialize in syntax versus content.
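The page does not define how the routing score is computed. As an illustration only, the sketch below assumes one plausible concentration measure: one minus the normalized entropy of each query row, averaged over the head. The function names `attention_concentration` and `top_k_heads` are hypothetical and are not part of aquin.

```python
import math

def attention_concentration(attn_rows):
    """Mean of (1 - normalized entropy) over the query rows of one head.

    attn_rows: list of rows, each a probability distribution over key
    positions (pass only unmasked positions). 1.0 means every query puts
    all its mass on one token; 0.0 means uniform attention.
    This formula is an assumption, not aquin's documented routing score.
    """
    scores = []
    for row in attn_rows:
        n = len(row)
        if n < 2:
            scores.append(1.0)  # a single key is trivially concentrated
            continue
        entropy = -sum(p * math.log(p) for p in row if p > 0.0)
        scores.append(1.0 - entropy / math.log(n))
    return sum(scores) / len(scores)

def top_k_heads(head_scores, k=8):
    """Return the k highest-scoring (layer, head) entries, best first."""
    return sorted(head_scores.items(), key=lambda kv: kv[1], reverse=True)[:k]
```

Under this assumed measure, a head that always attends to a single token (for example, the previous token) scores near 1, while a head that spreads attention evenly scores near 0.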

Flag        Description
--prompt*   Input text to analyze.
--top_k     Number of top heads to highlight (default: 8).
example

Syncs to web as an attention routing card in the orchestrator panel.

aquin layer-analysis

agent tool: run_layer_analysis

Measures activation stability across paraphrased prompts and out-of-distribution (OOD) similarity at each layer. The stability score captures how much hidden states drift when the same meaning is phrased differently. OOD similarity compares activations for in-domain prompts against activations for OOD prompts to flag layers that collapse on unfamiliar input.
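The exact metrics are not specified here. A minimal sketch, assuming stability is the mean pairwise cosine similarity between per-prompt hidden-state vectors at one layer, and OOD similarity is the cosine between the in-domain and OOD centroids. All function names are hypothetical, and the vectors would come from whatever pooling (for example, last-token or mean-pooled hidden states) the tool actually uses.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def stability(paraphrase_vecs):
    """Mean pairwise cosine similarity across paraphrase activations."""
    if len(paraphrase_vecs) < 2:
        raise ValueError("need at least two paraphrases")
    pairs = list(combinations(paraphrase_vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def centroid(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(col) / n for col in zip(*vecs)]

def ood_similarity(in_domain_vecs, ood_vecs):
    """Cosine similarity between in-domain and OOD activation centroids."""
    return cosine(centroid(in_domain_vecs), centroid(ood_vecs))
```

Computed once per layer, a stability value that drops sharply at a given depth would indicate where paraphrases start to diverge under this assumed metric.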

Flag                   Description
--prompts              JSON array of paraphrases for stability analysis.
--in_domain_prompts    In-distribution prompts for OOD comparison.
--ood_prompts          Out-of-distribution prompts.
--top_k                Top layers to report.
example

aquin perturbation

agent tool: run_perturbation_sensitivity

Zeroes out (or adds Gaussian noise to) hidden channels one at a time and measures KL divergence between the perturbed output distribution and the clean baseline. Channels with high KL impact are sensitivity hotspots, useful for finding brittle representations.
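To make the zero-ablation procedure concrete, here is a toy sketch that stands in a single linear readout plus softmax for the model. The KL direction (clean relative to perturbed) and the names `readout` and `channel_sensitivity` are assumptions for illustration; the page does not state which direction aquin uses.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)

def readout(hidden, weights):
    """Toy output layer: one weight row per output token."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weights]

def channel_sensitivity(hidden, weights):
    """Zero each hidden channel in turn and return KL(clean || perturbed)."""
    clean = softmax(readout(hidden, weights))
    scores = []
    for c in range(len(hidden)):
        ablated = list(hidden)
        ablated[c] = 0.0  # the "dropout" style perturbation: zero one channel
        perturbed = softmax(readout(ablated, weights))
        scores.append(kl_divergence(clean, perturbed))
    return scores
```

Channels whose ablation leaves the output distribution unchanged score 0; the largest scores mark the hotspots. A Gaussian variant would add noise to `ablated[c]` instead of zeroing it.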

Flag           Description
--prompt*      Input prompt.
--n_channels   Number of channels to perturb (default: 32).
--method       dropout or gaussian (default: dropout).
example

aquin check-weights

agent tool: check_weights

Scans all weight tensors for trojan/backdoor signatures (kurtosis spikes, outlier density, singular-value ratio) and runs SVD rank analysis across Q/K/V/O/MLP matrices. Flags collapsed or suspicious layers before you trust a checkpoint.
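Two of the statistics named above, kurtosis and outlier density, can be sketched on a flattened weight tensor as follows. The 4-sigma outlier cutoff and both function names are assumptions for illustration; the page does not give the thresholds aquin applies, and the SVD-based checks are not covered here.

```python
import math

def excess_kurtosis(values):
    """Population excess kurtosis (0 for a Gaussian, large for heavy tails)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / (m2 * m2) - 3.0 if m2 > 0 else 0.0

def outlier_density(values, n_sigma=4.0):
    """Fraction of entries more than n_sigma standard deviations from the mean.

    n_sigma=4.0 is an assumed cutoff, not a documented aquin default.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    if std == 0:
        return 0.0
    return sum(1 for x in values if abs(x - mean) > n_sigma * std) / n
```

A tensor with a handful of extreme entries among otherwise well-behaved weights shows up as a kurtosis spike and a nonzero outlier density, which is the kind of signature the scan is described as looking for.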

Flag                   Description
--collapse_threshold   SVD rank ratio below which a layer is flagged collapsed (default: 0.01).
example

Renders trojan scan and rank health cards side-by-side on the web.