Inspection (SAE): LLMs
Sparse autoencoder (SAE) tools for language models. These commands decompose residual-stream activations into interpretable features, run full attribution pipelines, steer features at inference time, and benchmark feature quality. They require LLM mode and a pulled SAE that matches your model and layer.
7 commands
aquin inspect
agent tool: run_full_inspection
Runs the full attribution pipeline in one pass: generates the model response, runs causal mediation analysis per token and layer, decomposes the top SAE features, renders the logit lens per layer, and builds the circuit graph. This is the primary entry point for LLM inspection. Syncs the complete inspection card to the web.
| Flag | Description |
|---|---|
| --prompt* | Input prompt (model completes from here). |
| --model | Override active model. |
| --layer | SAE layer (default: from pulled SAE config). |
CLI-only shortcut. In aquin chat, ask: Run full inspection on "…"
aquin feature-logits
agent tool: get_feature_logits
Projects one SAE feature's decoder direction through the unembedding matrix and returns the top tokens promoted and suppressed. Answers: if I amplify this feature, which tokens become more likely?
| Flag | Description |
|---|---|
| --feature* | SAE feature index. |
| --prompt | Optional prompt context for activation weighting. |
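
The projection described above can be illustrated with a toy example. This is a minimal sketch, not aquin's implementation: the function name `feature_logit_effects`, the 3-dimensional residual stream, the 4-token vocabulary, and all numeric values are assumptions chosen for readability. It also ignores any final layer norm that a real model may apply before unembedding.

```python
# Toy sketch: all names, sizes, and values are illustrative assumptions,
# not aquin internals.

def feature_logit_effects(decoder_dir, unembed, vocab, top_k=2):
    """Project one feature's decoder direction through the unembedding matrix.

    decoder_dir: list of d_model floats (one SAE decoder direction).
    unembed:     d_model rows, each with len(vocab) floats.
    Returns (promoted, suppressed) as lists of (token, logit_delta).
    """
    deltas = [
        sum(decoder_dir[i] * unembed[i][j] for i in range(len(decoder_dir)))
        for j in range(len(vocab))
    ]
    ranked = sorted(zip(vocab, deltas), key=lambda pair: pair[1], reverse=True)
    promoted = ranked[:top_k]           # largest positive logit change
    suppressed = ranked[::-1][:top_k]   # most negative logit change
    return promoted, suppressed


vocab = ["cat", "dog", "the", "run"]
decoder_dir = [1.0, 0.0, -1.0]
unembed = [
    [2.0, 1.0, -0.5, -1.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
]

promoted, suppressed = feature_logit_effects(decoder_dir, unembed, vocab)
print("promoted:", promoted)
print("suppressed:", suppressed)
```

In this toy setup, amplifying the feature raises the logit of "cat" the most and lowers the logit of "run" the most.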
aquin feature-neighbors
agent tool: get_feature_neighbors
Returns the nearest-neighbor SAE features ranked by decoder cosine similarity. Use it to find redundant, related, or polysemantic features near a target index.
| Flag | Description |
|---|---|
| --feature* | SAE feature index. |
| --top_k | Number of neighbors (default: 5). |
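
As a rough illustration of ranking by decoder cosine similarity, the sketch below scores every other decoder direction against a target and keeps the top matches. The helper names `cosine` and `nearest_features` and the 2-dimensional toy decoder are assumptions for this example only; how aquin computes or caches these similarities is not stated here.

```python
import math

# Toy sketch: names and values are illustrative assumptions, not aquin internals.

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_features(decoder, target, top_k=5):
    """Return (feature_index, similarity) pairs, most similar first."""
    sims = [
        (idx, cosine(decoder[target], row))
        for idx, row in enumerate(decoder)
        if idx != target
    ]
    sims.sort(key=lambda pair: pair[1], reverse=True)
    return sims[:top_k]


decoder = [
    [1.0, 0.0],   # feature 0: the target
    [2.0, 0.0],   # feature 1: same direction, different norm
    [0.0, 1.0],   # feature 2: orthogonal
    [-1.0, 0.0],  # feature 3: opposite direction
    [3.0, 4.0],   # feature 4: partially aligned
]

neighbors = nearest_features(decoder, target=0, top_k=2)
print(neighbors)
```

A similarity close to 1.0, as for feature 1 here, is the kind of result that would suggest a redundant feature.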
aquin steer
agent tool: run_steer_and_show
Adds a scaled multiple of one SAE feature's decoder direction to the residual stream at the SAE layer during the forward pass, then shows the baseline and steered outputs side by side. The feature label auto-resolves from causal labeling if omitted.
| Flag | Description |
|---|---|
| --prompt* | Input prompt. |
| --feature_idx* | SAE feature to steer. |
| --feature_label | Human-readable label (auto-resolved if omitted). |
| --strength | Steering multiplier (default: 1.0). |
| --max_new_tokens | Generation length. |
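
The steering update itself is a vector addition. The following is a minimal sketch under stated assumptions: `steer_residual` is a hypothetical helper, the vectors are toy 3-dimensional values, and a real implementation would typically apply the same update to the residual stream at each token position inside the model's forward pass.

```python
# Toy sketch: the helper name and values are illustrative assumptions,
# not aquin internals.

def steer_residual(residual, decoder_dir, strength=1.0):
    """Return residual + strength * decoder_dir for one token position."""
    return [r + strength * d for r, d in zip(residual, decoder_dir)]


residual = [0.5, -1.0, 2.0]      # residual-stream activation at the SAE layer
decoder_dir = [1.0, 0.0, -2.0]   # decoder direction of the steered feature

baseline = steer_residual(residual, decoder_dir, strength=0.0)
steered = steer_residual(residual, decoder_dir, strength=1.5)
print("baseline:", baseline)
print("steered: ", steered)
```

A strength of 0.0 leaves the activation unchanged, which corresponds to the baseline side of the comparison.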
aquin multi-steer
agent tool: run_multi_steer
Steers multiple SAE features simultaneously in a single forward pass. Pass a JSON array of {feature_idx, strength, label} objects.
| Flag | Description |
|---|---|
| --prompt* | Input prompt. |
| --features* | JSON array, e.g. '[{"feature_idx":42,"strength":1.5}]' |
| --max_new_tokens | Generation length. |
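
Hand-writing the JSON array inside shell quotes is error-prone, so one option is to build it programmatically. The sketch below uses only the Python standard library and the flags documented above; the prompt text, feature indices, strengths, label, and token count are placeholder assumptions.

```python
import json
import shlex

# Placeholder values: indices, strengths, label, and prompt are assumptions.
features = [
    {"feature_idx": 42, "strength": 1.5, "label": "example label"},
    {"feature_idx": 7, "strength": 0.8},
]

# Compact JSON keeps the argument short; shlex.quote makes it shell-safe.
payload = json.dumps(features, separators=(",", ":"))
command = " ".join([
    "aquin", "multi-steer",
    "--prompt", shlex.quote("The weather today is"),
    "--features", shlex.quote(payload),
    "--max_new_tokens", "40",
])
print(command)
```

The printed line can be pasted into a POSIX shell as-is, since `shlex.quote` wraps the JSON in single quotes.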
aquin benchmark
agent tool: run_benchmarks_on_top_feature
Runs InterpScore, Feature Purity, and MUI (Monosemanticity Under Intervention) on a single SAE feature. Quantifies how interpretable and causally coherent the feature is.
| Flag | Description |
|---|---|
| --feature_idx* | SAE feature index. |
Distinct from behavioral evals (audit, red-team). This scores one feature's interpretability.
aquin umap
agent tool: ensure_umap_loaded
Loads the UMAP projection of all SAE decoder directions and opens the UMAP Explorer panel on the web. Requires a precomputed UMAP file for the pulled SAE.
