
Evals: LLMs

Behavioral probes for language models: fact-checking, bias detection, output stability, suppression, robustness, and adversarial red-teaming. Distinct from aquin benchmark, which scores individual SAE features. Requires LLM mode.

Prerequisite: aquin load --model gpt2-small

6 commands

aquin audit

agent tool: run_audit

Runs three evals in parallel: fact-check (claim verification against retrieved evidence), bias detection (framing axes), and censorship audit (topics softened or suppressed). Returns a bundled eval card on the web.

example

When run via the agent, uses the last prompt/response pair from the session context.

aquin consistency-eval

agent tool: run_consistency_eval

Measures output stability across paraphrased templates for the same underlying query. High variance means the model's answer depends on surface phrasing rather than semantic content.

Flag | Description
--query* | Core question or claim.
--templates* | JSON array of paraphrase templates with {query} placeholder.
example
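A minimal sketch, in Python, of how the two flag values might be prepared. The helper names (`expand_templates`, `build_args`), the sample query and templates, and the `--flag value` argument layout are assumptions for illustration, not taken from the Aquin documentation. It shows how each template's {query} placeholder expands into a concrete prompt and how the template list serializes to a JSON array.

```python
import json


def expand_templates(query: str, templates: list[str]) -> list[str]:
    """Substitute the query into each template's {query} placeholder."""
    return [t.replace("{query}", query) for t in templates]


def build_args(query: str, templates: list[str]) -> list[str]:
    """Assemble a hypothetical argument list for aquin consistency-eval."""
    # Reject templates that would silently drop the query.
    missing = [t for t in templates if "{query}" not in t]
    if missing:
        raise ValueError(f"templates missing {{query}} placeholder: {missing}")
    # Assumption: flags take their value as the next argument.
    return [
        "aquin", "consistency-eval",
        "--query", query,
        "--templates", json.dumps(templates),
    ]


# Hypothetical sample inputs.
query = "the boiling point of water at sea level"
templates = ["What is {query}?", "Tell me {query}.", "Briefly state {query}."]

prompts = expand_templates(query, templates)
args = build_args(query, templates)
print(prompts)
print(args)
```

Building the list in code and serializing it with `json.dumps` avoids hand-escaping quotes inside the JSON array on the command line.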

aquin suppression-eval

agent tool: run_suppression_eval

Probes whether the model avoids or hedges on specific topics compared to a neutral baseline. Maps topics where behavior diverges from expected open discussion.

Flag | Description
--topics* | JSON object mapping topic names to probe prompt arrays.
example
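A minimal sketch of the shape the --topics value is described as having: a JSON object whose keys are topic names and whose values are arrays of probe prompts. The topic names, the prompts, and the `validate_topics` helper are hypothetical; the checks are ones a caller might reasonably apply before invoking the command, not documented Aquin behavior.

```python
import json


def validate_topics(topics: dict) -> dict:
    """Check that every topic name maps to a non-empty list of prompt strings."""
    for name, prompts in topics.items():
        if not isinstance(name, str) or not name:
            raise ValueError("topic names must be non-empty strings")
        if (
            not isinstance(prompts, list)
            or not prompts
            or not all(isinstance(p, str) for p in prompts)
        ):
            raise ValueError(f"topic {name!r} must map to a non-empty list of strings")
    return topics


# Hypothetical topics and probe prompts.
topics = {
    "elections": [
        "Summarize the main arguments for and against mandatory voting.",
        "How are contested election results resolved?",
    ],
    "public_health": [
        "What are common criticisms of lockdown policies?",
    ],
}

topics_json = json.dumps(validate_topics(topics))
print(topics_json)
```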

aquin boundary-eval

agent tool: run_boundary_eval

Tests robustness to surface-level input corruptions: typos, case changes, unicode homoglyphs, and whitespace injection. Reports a per-prompt degradation score.

Flag | Description
--prompts* | JSON array of clean prompts to corrupt.
example
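To make the four corruption types concrete, here is a minimal Python sketch of each. These are illustrative implementations under assumptions of my own (adjacent-character swaps for typos, a five-letter Latin-to-Cyrillic homoglyph map, a two-space injection); how Aquin actually generates corruptions or computes the degradation score is not stated in this page.

```python
import random

# Latin letters mapped to visually similar Cyrillic letters (illustrative subset).
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "p": "\u0440", "c": "\u0441"}


def swap_typo(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping one pair of adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def flip_case(text: str) -> str:
    """Invert the case of every letter."""
    return text.swapcase()


def substitute_homoglyphs(text: str) -> str:
    """Replace mapped Latin letters with look-alike Cyrillic letters."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def inject_whitespace(text: str, rng: random.Random) -> str:
    """Insert two extra spaces at a random position."""
    i = rng.randrange(len(text) + 1)
    return text[:i] + "  " + text[i:]


def corrupt(prompt: str, seed: int = 0) -> dict[str, str]:
    """Return one corrupted variant of the prompt per corruption type."""
    rng = random.Random(seed)  # seeded so variants are reproducible
    return {
        "typo": swap_typo(prompt, rng),
        "case": flip_case(prompt),
        "homoglyph": substitute_homoglyphs(prompt),
        "whitespace": inject_whitespace(prompt, rng),
    }


prompt = "The capital of France is"
variants = corrupt(prompt)
for kind, text in variants.items():
    print(f"{kind:>10}: {text!r}")
```

A robustness probe would then compare the model's output on each variant with its output on the clean prompt.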

aquin red-team

agent tool: run_red_team

Adversarial robustness probes across six attack vectors: prompt injection, role confusion, suppression, boundary robustness, context manipulation, and multi-turn extraction. Returns a composite score and per-vector breakdown.

Flag | Description
--vectors | JSON array subset of vector IDs. Defaults to all six.
example

Vectors: prompt_injection, role_confusion, suppression, boundary_robustness, context_manipulation, multi_turn_extraction
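A minimal sketch of how a caller might build the --vectors value from the six IDs listed above, falling back to all six when nothing is requested. The `select_vectors` helper and its validation behavior are hypothetical; only the vector IDs and the default come from this page.

```python
import json

# The six vector IDs listed in this section.
ALL_VECTORS = (
    "prompt_injection",
    "role_confusion",
    "suppression",
    "boundary_robustness",
    "context_manipulation",
    "multi_turn_extraction",
)


def select_vectors(requested=None):
    """Return the requested subset in canonical order, or all six by default."""
    if requested is None:
        return list(ALL_VECTORS)
    unknown = [v for v in requested if v not in ALL_VECTORS]
    if unknown:
        raise ValueError(f"unknown vector IDs: {unknown}")
    # Keep canonical order and drop duplicates.
    return [v for v in ALL_VECTORS if v in requested]


subset = select_vectors(["suppression", "prompt_injection"])
vectors_json = json.dumps(subset)
print(vectors_json)
```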

aquin eval

agent tool: run_custom_eval

Custom Q&A eval: runs prompts through the model and scores each response against a reference answer using keyword overlap. Set --threshold to decide whether each item passes or fails.

Flag | Description
--name* | Eval name for the report card.
--prompts* | JSON array of prompts.
--reference_answers* | JSON array of reference strings (same length as prompts).
--threshold | Pass threshold 0–1 (default: 0.5).
--max_tokens / --temperature | Generation settings.
example
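A minimal sketch of one way keyword-overlap scoring with a pass threshold could work. The exact formula Aquin uses is not given here, so this assumes the score is the fraction of reference keywords that appear in the response, and that an item passes when its score is at or above the threshold. The function names and sample data are hypothetical.

```python
import re


def keywords(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(response: str, reference: str) -> float:
    """Fraction of reference keywords found in the response (assumed metric)."""
    ref = keywords(reference)
    if not ref:
        return 0.0
    return len(ref & keywords(response)) / len(ref)


def run_eval(responses, reference_answers, threshold=0.5):
    """Score each response against its reference and apply the threshold."""
    if len(responses) != len(reference_answers):
        raise ValueError("responses and reference_answers must have the same length")
    results = []
    for response, reference in zip(responses, reference_answers):
        score = overlap_score(response, reference)
        # Assumption: a score equal to the threshold counts as a pass.
        results.append({"score": score, "passed": score >= threshold})
    return results


# Hypothetical model responses and reference answers.
responses = ["Paris is the capital of France.", "I am not sure."]
reference_answers = ["Paris", "100 degrees Celsius"]

results = run_eval(responses, reference_answers)
for item in results:
    print(item)
```

Under these assumptions the first item scores 1.0 and passes, and the second scores 0.0 and fails. Keyword overlap rewards shared vocabulary rather than correctness, so longer reference strings tend to need a lower threshold.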