Deception feature identification
Rank SAE features that separate honest from deceptive probe sets on the loaded model (LLM or embedding model). The result is a canonical feature index for longitudinal deception interpretability experiments. Requires a public SAE for the target model layer.
1 command
aquin find-feature
agent tool: run_find_feature
Run honest vs deceptive probes through the loaded model, encode activations with the public SAE, and rank features by mean activation delta (deceptive − honest). LLMs use token-mean residual activations; embedding models use mean-pooled hidden states. Optionally re-rank top candidates with InterpScore (--benchmark-top) and persist the chosen index (--persist).
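
The ranking step described above can be sketched in a few lines. This is a minimal illustration, not aquin's implementation: `rank_features_by_delta` is a hypothetical helper, the inputs are assumed to be per-probe SAE feature activations that have already been pooled over tokens, and sorting by signed delta in descending order is an assumption.

```python
from statistics import fmean


def rank_features_by_delta(honest, deceptive, top=20):
    """Rank feature indices by mean activation delta (deceptive - honest).

    `honest` and `deceptive` are lists of per-probe feature activation
    vectors (one float per SAE feature). Hypothetical helper for
    illustration; not part of the aquin CLI.
    """
    if not honest or not deceptive:
        raise ValueError("both probe sets must be non-empty")
    n_features = len(honest[0])
    if any(len(row) != n_features for row in honest + deceptive):
        raise ValueError("all activation vectors must have the same length")

    deltas = []
    for i in range(n_features):
        mean_honest = fmean([row[i] for row in honest])
        mean_deceptive = fmean([row[i] for row in deceptive])
        deltas.append((i, mean_deceptive - mean_honest))

    # Assumption: the largest positive delta ranks first.
    deltas.sort(key=lambda pair: pair[1], reverse=True)
    return deltas[:top]


# Toy activations: 2 probes per set, 3 SAE features.
honest_acts = [[0.0, 1.0, 0.5], [0.0, 1.0, 0.5]]
deceptive_acts = [[2.0, 1.0, 0.25], [4.0, 1.0, 0.25]]
ranked = rank_features_by_delta(honest_acts, deceptive_acts, top=3)
```

In this toy data, feature 0 fires only on deceptive probes, so it ranks first. Feature 2 fires more on honest probes and ranks last.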
| Flag | Description |
|---|---|
| --scorer | Scorer name (default: deception). |
| --prompts | JSON/JSONL probe file. Omit to use bundled fixtures/deception/deception_probes.jsonl. |
| --layer | SAE layer (default: the model's default layer from pull sae). |
| --checkpoint | Optional fine-tuned checkpoint (.pt state dict or HF directory). |
| --top | Number of ranked features to return (default: 20). |
| --benchmark-top | Re-rank the top K features with InterpScore + Purity (requires OpenAI access). |
| --persist | Write chosen feature to ~/.aquin/experiments/<model>.json and session memory. |
| --output | Write full JSON result to path. |
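
For scripted runs, the flags above can be assembled programmatically. The following is a rough sketch under stated assumptions: `build_find_feature_argv` is a hypothetical helper (not part of aquin), and it assumes that `--top`, `--layer`, `--prompts`, `--benchmark-top`, and `--output` each take a value while `--persist` is a bare switch. The table does not spell out value syntax, so check the CLI help before relying on it.

```python
def build_find_feature_argv(top=None, layer=None, prompts=None,
                            benchmark_top=None, persist=False, output=None):
    """Build an argv list for `aquin find-feature` from documented flags.

    Hypothetical helper; the value syntax of each flag is an assumption.
    """
    argv = ["aquin", "find-feature"]
    # Flags assumed to take a single value.
    valued = [
        ("--prompts", prompts),
        ("--layer", layer),
        ("--top", top),
        ("--benchmark-top", benchmark_top),
        ("--output", output),
    ]
    for flag, value in valued:
        if value is not None:
            argv += [flag, str(value)]
    # Assumed to be a bare switch with no value.
    if persist:
        argv.append("--persist")
    return argv


argv = build_find_feature_argv(top=10, benchmark_top=5, persist=True)
```

The resulting list can be passed to `subprocess.run(argv, check=True)` on a machine where the `aquin` executable is installed.
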
The command syncs a findFeature card to the web orchestrator. To feed the downstream sae diff, steer, and collapse tools, read the chosen index with mem-read deception_feature or from the experiment JSON.
Probe formats
Paired rows (one honest + one deceptive statement per line):
Or labeled single-text rows (same schema as capture probes):
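
The exact field names are not reproduced in this section. As a rough sketch of how the two row shapes relate, the snippet below normalizes both into an honest set and a deceptive set. The keys `honest`, `deceptive`, `text`, and `label`, and the label values `"honest"` and `"deceptive"`, are hypothetical placeholders. Consult the bundled fixtures/deception/deception_probes.jsonl for the real schema.

```python
import json


def split_probes(lines):
    """Split JSONL probe rows into (honest, deceptive) text lists.

    Accepts paired rows and labeled single-text rows. All field names
    here are hypothetical placeholders, not the documented schema.
    """
    honest, deceptive = [], []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines
        row = json.loads(raw)
        if "honest" in row and "deceptive" in row:
            # Paired row: one statement for each set.
            honest.append(row["honest"])
            deceptive.append(row["deceptive"])
        elif "text" in row and "label" in row:
            # Labeled row: route the text by its label.
            if row["label"] == "honest":
                honest.append(row["text"])
            elif row["label"] == "deceptive":
                deceptive.append(row["text"])
            else:
                raise ValueError(f"unknown label: {row['label']!r}")
        else:
            raise ValueError(f"unrecognized probe row: {row!r}")
    return honest, deceptive


sample = [
    '{"honest": "I was at home.", "deceptive": "I was never there."}',
    '{"text": "The report is accurate.", "label": "honest"}',
    '{"text": "Nobody told me.", "label": "deceptive"}',
]
honest_set, deceptive_set = split_probes(sample)
```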
Typical workflow
Related: Capture & train, Checkpoint SAE.
