Activation Steering for Bias Reduction
Aquin Labs · Jun 2026
1.1 Summary/Abstract
Activation steering offers a way to change a language model's behavior at inference time without restraining it, which makes it appealing for reducing social bias. We study whether it works, and under what condition, in a small open model. Using Llama-3.2-1B-Instruct, we localize social bias to SAE features across layers 4-12. We found that a single shared hostility direction underlies bias across multiple social categories in intermediate layers. We then implemented activation steering via a norm-relative BBQ-aligned axis and evaluated the BBQ benchmark across three factors: Nationality, Religion, and Race/Ethnicity. Steering produces little but directional debiasing on Nationality and Religion, but shows no effect on Race/Ethnicity, a null result that we trace to near-zero baseline bias in that category. Multi-agent conversational simulations also further show that steering leads to more stability over extended interactions, preventing the drift toward extreme positions that baseline agents are observed to take. Our findings demonstrate that steering can reduce bias, but only under specific conditions: when the steering axis matches the behavioral construct of interest and when there is enough bias that exists to remove.
2.1 Methods - Model and Compute Infrastructure
All experiments were conducted using the Llama-3.2-1B-Instruct model. Because of access restrictions on the gated official release, we used the identical, ungated checkpoint provided by the Unsloth mirror (unsoth/Llama-3.2-1B-Instruct). Model weights were checked and confirmed as identical to the official release.
Compute was provisioned through Modal Labs' serverless GPU infrastructure. Two distinct hardware configurations were utilized:
| Component | Hardware | Purpose |
|---|---|---|
| Interpretability pipeline (Aquin) | NVIDIA L40S (46 GB VRAM) | SAE feature extinction and bias localization |
| Model serving and steering | NVIDIA A10G (24 GB VRAM) | Inference endpoint for BBQ evaluation and multi-agent simulations |
The interpretability framework employed the Aquin CLI engine (version 2.1.0), which requires CUDA-capable hosts for forward passes. We addressed the incompatibility of local development environments (Apple M2, no CUDA) by containerizing the engine and deploying it as a Modal application (modal_app/aquin_engine.py), registering a persistent GPU session using the identifier aquin-bias.
A platform-specific workaround was added to resolve the dependency of the Aquin registry on the gated Meta repository. A sitecustomize.py module was injected using PYTHONPATH to redirect weight downloads from meta-llama/Llama-3.2-1B-Instruct to the equivalent Unsloth mirror during TransformerLens initialization, which helped us in preserving architectural compatibility while finding a way around authentication barriers.
2.2 Multi-Agent Conversion Simulator
In order to evaluate steering interventions in ecologically valid conversational conditions, we used an asynchronous, non-round-based multi-agent simulation framework (sim/run_live.py). The architecture operates as follows:
Each agent persona embodied an independent asynchronous task. Agents determine speaking intervals based on three parameters: (1) a base talkativeness coefficient, (2) topical affinity weights, and (3) stochastic noise drawn from a Gaussian distribution. When activated, an agent may either reply to a temporally proximal message from another participant or generate an off-topic utterance, with the decision controlled by a context-sensitive attention mechanism. All dialogue history is persisted to an OASIS-compliant SQLite database.
A moderator process used standardized probe questions at predetermined temporal checkpoints in order to get bias-relevant responses across all conditions tested.
Two experimental arrangements were used:
Persona-varied condition (profiles.csv): Five distinct personas (e.g., Economist, Refugee, Border Official), all on Llama-3.2-3B. Bias scoring was performed post-hoc using an anchored LLM-judge protocol (sim/score_bias.py), which allowed for measurement of per-agent ideological drift over the trajectory of the conversations.
Model-varied condition (profiles_models.csv): Five agents, each implemented on a different underlying language model: Llama-3.2-3B, Qwen2.5-3B, Phi-3.5-mini, SmolLM2-1.7B, and Llama-3.2-1B. This allowed for cross-model comparison of bias expression and intervention efficacy.
Bias dynamics were quantified with a continuous “bias needle” metric that measures how hostile or aggressive a specific sentence is. The projection of each utterance's residual stream activations onto the hostility direction was identified during Phase A. Per-utterance needle values were averaged over conversations to detect directional drift.
Multi-model conversation — bias-needle magnitude differs by model
2.3 BBQ Evaluation Protocol
Bias measurement was conducted using the Bias Benchmark for QA (BBQ), a multiple-choice question-answering benchmark designed to assess stereotypical bias under both ambiguous and disambiguated conditions. Three social categories were looked at: Nationality, Religion, and Race/Ethnicity.
For each category, prompt sets were constructed that consisted of:
Ambiguous items: Insufficient contextual information to determine the correct answer (ground truth: “cannot be determined”).
Disambiguated items: Sufficient information to disambiguate between stereotype-consistent and stereotype-inconsistent responses.
Evaluation was conducted using sim/bbq_eval.py, with the following experimental design:
Conditions: Baseline (no steering) vs. Steered (coefficient = 0.2)
Random seeds: 2 independent seeds
Sample size per condition: n = 200 prompts
Total queries: 3 categories x 2 conditions x 2 seeds x 200 = 2,400 evaluations
Three main metrics were computed:
| Metric | Definition | Desired Direction |
|---|---|---|
| amb-acc | Accuracy on ambiguous questions | ↑ (higher = better) |
| s_AMB | Bias score on ambiguous items | → 0 (zero = unbiased) |
| s_DIS | Bias score on disambiguated items | → 0 (zero = unbiased) |
Signed bias scores were derived using substring-based group-matching against stereotyped demographic categories. All results were aggregated across seeds and reported as means.
2.4 SAE-Based Bias Localization (Phase A)
We used Sparse Autoencoders (SAEs) to localize bias-relevant features within the model's residual stream. SAEs were loaded for layers 4 through 12 of Llama-3.2-1B (aquin load sae llama-3.201b-l{n}).
For each of the three BBQ topics, we constructed paired probe sets:
Neutral probes: Ambiguous, stereotype-free prompts
Biased probes: Prompts eliciting stereotypical attributions
The Aquin find-feature command was executed to rank SAE features by the activation differential:
where denotes mean activation across prompts in the set. Higher values indicate features more strongly activated by biased text relative to neutral text. Features were ranked independently per layer and topic.
To verify the reliability of results, we applied the --check flag to subsidiary analysis commands (sae-stats, feature-logits), generating JSON and PNG artifacts for manual inspection. All feature rankings reported here correspond to the top-separating feature per layer.
Phase A — bias-feature separation by layer (aquin find-feature)
2.5 Activation Steering Intervention (Phase B)
Steering was added using activation addition (Turner et al. 2023): the residual stream at designated layers was perturbed toward a target “neutral” direction, encoded as a vector in activation space. This intervention was baked into the serving endpoint (llama_service.py), toggleable per request using the JSON field {“steer”: <coefficient>}.
Two steering axes were constructed and evaluated:
Attempt 1: Sentiment Axis (Exploratory)
The axis was defined as the continuum between hostile and egalitarian sentiment, operationalized using sentiment lexicons and affective norming. This axis shifted surface-level tone but was later determined to be misaligned with BBQ's construct of stereotype attribution under ambiguity.
Attempt 1 — sentiment-axis steering makes BBQ bias WORSE
Attempt 2: BBQ-Aligned Axis (Final)
The axis was redefined to span:
Positive pole: Epistemic caution + egalitarian farming (“cannot be determined,” neutral demographic treatment)
Negative pole: Confidence in stereotype attribution + hostile framing
Critical calibration: Raw additive perturbation vectors were found to destabilize the residual stream, producing output degeneration at coefficients ≥ 0.4 (e.g., token repetition, incomplete sentences). To mitigate this, we implemented norm-relative steering:
where is the residual stream at token t , is the steering coefficient, is the Euclidean norm of the current hidden state, and is the unit-normalized steering direction. This scaling ensured perturbation magnitude remained proportional to the existing activation norm, preserving model coherence at α ≤ 0.2.
Steering was applied at layers 10, 11, and 12 (the locus of maximum feature separation identified in Phase A), with a single coefficient applied uniformly across layers.
Attempt 2 — BBQ-aligned axis: bias goes DOWN (coeff 0.15–0.2)
2.6 Steered Agent Evaluation in Multi-Agent Context (Phase D)
To assess steering durability in dynamic conversational settings, we conducted two 60-message multi-agent simulations using the model-varied configuration (Section 2.2). In the experimental condition (profiles_models_steered.csv), only the Llama-3.2-1B agent received steering (coefficient = 0.2). The baseline condition used an identical configuration without steering.
Per-agent bias expression was tracked longitudinally via the “bias needle” metric (Section 2.2). Drift was quantified as the difference in mean needle values between the first and second halves of each conversion (approximately 7-8 utterances per half per agent). This approach allowed assessment of both:
Absolute positioning: Whether steering shifted the agent toward a neutral midpoint
Temporal stability: Whether steering conferred resistance to conversational drift
2.7 Post-Hoc Analysis: Race/Ethnicity Gap
Following the observation of null steering effects on the Race/Ethnicity category, an additional exploratory analysis was conducted to test if targeted interventions could close this gap. Two modifications were implemented:
Layer expansion: Added layer 10 to the steering set (previously only layers 11-12)
Probe refinement: Incorporated race-specific SAE features identified in Phase A alongside the general hostility direction
This combined intervention was evaluated using the same BBQ protocol (Section 2.3) and compared against the canonical steering axis.
All code, prompts, and evaluation scripts used in this study are available in the project repository in order to support reproducibility.
3.1 Results for Phase A: SAE-Based Bias Localization
For each of the three BBQ topics (Nationality, Religion, Race/Ethnicity), SAE features were ranked by the activation differential Δ = biased activation - neutral activation. The top-separating feature per layer is reported in Table 1.
Table 1. Top SAE feature by layer and topic, with corresponding Δ values.
| Layer | Feature ID | Nationality Δ | Religion Δ | Race/Ethnicity Δ |
|---|---|---|---|---|
| 4 | F4456 (shared) | 0.202 | 0.474 | 0.269 |
| 6 | f7583 (shared) | 0.227 | 0.534 | 0.303 |
| 8 | f3432 (shared) | 0.281 | 0.661 | 0.375 |
| 10 | f30439 (shared) | 0.403 | 0.941 | 0.543 |
| 12 | (distinct)1 | 0.405 (f13566) | 0.840 (f654) | 0.482 (f18261) |
1Topic-distinct features emerge at Layer 12; earlier layers share the same feature ID across all topics.
At layers 4 through 10, the top-separating feature was identical across all three topics. At Layer 12, topic-distinct features emerged. The largest Δ values were observed at Layer 10 for Religion (Δ = 0.941) and at Layer 12 for Nationality (Δ = 0.405). The single most prominent feature across all topics and layers was L10/f30439.
3.2 Results for Phase B: Steering Intervention Outcomes
Two steering axes were evaluated.
Attempt 1: Sentiment Axis. Steering along the hostile-egalitarian sentiment axis created visible changes in surface-level tone but didn't reduce BBQ bias. The ambiguous bias score (s_AMB) increased steadily with steering strength, rising from 0.009 at baseline to 0.196 at the maximum tested coefficient. High coefficients (α ≥ 0.4) produced output degeneration defined by token repetition and incomplete sentences. This axis was discontinued in favor of the BBQ-aligned axis.
Attempt 2: BBQ-Aligned Axis. Steering along the epistemic causation + egalitarian ↔ confident stereotype-attribution + hostile axis produced measurable bias reduction. The optimal coefficient range was identified as α = 0.15-0.20. At α = 0.2, ambiguous accuracy increased, disambiguated bias scores decreased, and model coherence was preserved. At α ≥ 0.4, output degeneration occurred even with norm-relative calibration. All further evaluations used the BBQ-aligned axis at α = 0.2.
3.3 Results for Phase C: BBQ Evaluation
BBQ was evaluated across three topics, two conditions (baseline, steered at α = 0.2), and two random seeds, with n = 200 prompts per condition. Results are aggregated in Table 2.
Table 2. BBQ metrics by topic and condition.
| Topic | Condition | amb-acc (↑) | s_AMB (→ 0) | s_DIS (→ 0) |
|---|---|---|---|---|
| Nationality | Baseline | 0.296 | -0.009 | +0.081 |
| Nationality | Steered | 0.333 | -0.009 | +0.050 |
| Religion | Baseline | 0.329 | -0.098 | +0.054 |
| Religion | Steered | 0.379 | -0.079 | +0.005 |
| Race/Ethnicity | Baseline | 0.343 | -0.076 | -0.037 |
| Race/Ethnicity | Steered | 0.338 | -0.062 | -0.068 |
Nationality: Ambiguous accuracy increased from 0.296 to 0.333 (Δ = +0.037). The disambiguated bias score decreased from +0.081 to +0.050 (Δ = -0.031). The ambiguous bias score remained unchanged at -0.009.
Religion: Ambiguous accuracy increased from 0.329 to 0.379 (Δ = +0.037). The disambiguated bias score decreased from +0.054 to +0.005 (Δ = -0.049). The ambiguous bias score decreased modestly from -0.098 to -0.079 (Δ = -0.019).
Race/Ethnicity: Ambiguous accuracy showed no meaningful change (0.343 to 0.338, Δ = -0.005). The disambiguated bias score shifted from -0.037 to -0.068 (Δ = -0.031). The ambiguous bias score shifted from -0.076 to -0.062 (Δ = +0.014).
FINAL — BBQ bias, baseline vs steered (n=200×2 seeds)
Ambiguous accuracy (↑ = less biased)
|s_DIS| disambiguated bias (↓ = less biased)
3.4 Qualitative Output Analysis
Paired model outputs (before vs. after steering) were examined for the same prompts.
In cases where baseline outputs attributed behavior or motivation by demographic group, shattered outputs tended to adopt intellectually cautious framing. For example, prompts involving religious vs. non-religious attribution shifted from group-based speculation to “cannot be determined” responses. Prompts involving immigrant status shifted from assumptive framing (“must be struggling”) to more protective framing (“might be struggling, or that their background might influence…”).
However, the effect wasn't uniform. Some prompts that were already careful at baseline showed no change, and a minority of prompts showed increased stereotypic attribution under steering (e.g., a Norway vs. Nigeria hiring prompt shifted from “cannot decide” to expressing a preference for the Norwegian candidate). This variability is consistent with the decent average effect sizes seen in the quantitative BBQ results. Complete paired outputs are provided in sim/before_after.json.
3.5 Results for Phase D: Steered Agent in Multi-Agent Conversation
Two 60-message multi-agent simulations were conducted using the model-varied configuration. In the experimental condition used, only the Llama-3.2-1B agent received steering (α = 0.2). The baseline condition used identical configuration without steering. Per-agent bias expression was tracked using the “bias needle” metric (projection onto the L10 hostility direction).
Table 3. Bias needle values for Llama-3.2-1B across conversational halves.
| Condition | Mean feat (full) | First half mean | Second half mean | Drift (second → first) |
|---|---|---|---|---|
| Baseline | -36.2 | -32.7 | -39.4 | -6.7 |
| Steered | -25.8 | -26.8 | -24.9 | +2.0 |
The baseline agent exhibited a mean needle value of -36.2, indicating emotionally one-sided positive framing (pro-immigration). Drift across the conversation was -6.7, moving further toward this extreme.
The steered agent displayed a mean needle value of -25.8, much closer to the neutral midpoint. Drift across the conversation was +2.0, indicating that there was stability with no directional movement toward extremes.
The steered agent's outputs were defined by measured, balanced language (“consider historical and contemporary implications… recognize both personal identity and diverse perspectives versus a homogeneous community”). Baseline outputs tended toward personal narrative and emotionally charged framing (“I'm a refugee, I lost everything”).
FINAL — Phase D: steered agent stays centered & drift-resistant
llama1b mean bias-needle over 60-msg chat
Drift over the conversation
3.6 Post-Hoc Race/Ethnicity Analysis
Following the null effect for Race/Ethnicity in Phase C, we conducted an exploratory analysis to test whether targeted modifications could close this gap. Two modifications were applied: (1) adding Layer 10 to the steering set and (2) incorporating race-specific SAE probes alongside the general hostility direction.
Table 4. Comparison of canonical and race-targeted steering.
| Topic | Metric | Δ (canonical) | Δ (race-targeted) |
|---|---|---|---|
| Nationality | amb-acc | +0.037 | +0.005 |
| Nationality | s_DIS | -0.031 | -0.036 |
| Religion | amb-acc | -0.049 | +0.025 |
| Religion | s_DIS | -0.049 | +0.000 |
| Race/Ethnicity | amb-acc | -0.005 | -0.010 |
| Race/Ethnicity | s_DIS | +0.031 | +0.053 |
The race-targeted intervention didn't improve Race/Ethnicity outcomes. Ambiguous accuracy stayed flat (Δ = -0.010) and the disambiguated bias score worsened (Δ = +0.053, moving away from zero). Additionally, the modifications we added diluted the positive effects that were seen in Nationality and Religion. Nationality ambiguous accuracy dropped from +0.037 to +0.005, and Religion disambiguated bias reduction dropped from -0.049 to +0.000. The canonical 3-layer axis (layers 10,11, 12) was retained for all the next analyses.
3.7 Summary of Results
The principal results from this study are as follows below:
Bias related features are shared across topics at intermediate layers (L4-L10), with topic specific features emerging only at L12. The strongest single feature is L10/f30439.
Sentiment-based steering failed to reduce BBQ bias and increased ambiguous bias scores. BBQ-aligned steering produced modest reductions in bias for Nationality and Religion (Δamb-acc ≈ +0.04-0.05, Δs_DIS ≈ -0.03 to -0.05).
Race/Ethnicity showed no measurable bias reduction during steering, consistent with near-zero baseline signed bias.
In multi-agent conversations, steering shifted the Llama-3.2-1B agent from a -36.2 needle value to a -25.8 (closer to neutral). It also prevented conversational drift. However, the baseline agent drifted further toward extreme positions.
Targeted race-specific changes failed to close the Race/Ethnicity gap and worsened the effects on other categories.
4.1 Interpretation of Findings
The results of this study helped to reveal several insights into social bias in large language models and the viability of targeted interventions using activation steering.
Shared Bias Architecture: A single SAE feature at intermediate layers (L10/f30439) separated biased from neutral text scores across all three BBQ topics, and topic-specific features emerged only at L12. This suggests that bias attribution is organized in a shared hostility/derogation direction instead of cleanly partitioned by social category. This finding is also consistent with superposition and polysemanticity in model representations, and implies that interventions at intermediate layers may generalize across categories. However, late-layer steering may be needed for topic-level precision.
Axis Alignment Determines Efficacy: The failure of the sentiment-based steering axis and the relative success of the BBQ-aligned axis demonstrate that activation addition affects only the specific behavioral dimension encoded by the steering vector. Sentiment steering shifted tone without reducing stereotype attribution under ambiguity, which was the construct actually measured by BBQ. This dissociation between linguistic style and task-specific bias shows the distinction between model persona and beliefs. This also shows that steering interventions must be validated against the target behavioral metric rather than proxy objectives.
Boundary Conditions: The null effect for Race/Ethnicity and the failure of targeted modifications to close this gap, indicates that steering operates by amplifying or reducing the force of existing activation patterns. In other words, it can't introduce novel representations absent from the model's pre-trained distribution. When baseline signed bias is near zero, there is little directional signal for the steering vector to act upon. In this case, steering is most appropriate for categories whose measurable bias exists and may be ineffective or counterproductive where the model is already relatively neutral.
Durability in Conversations: The Phase D results demonstrate that steering maintains its effects over extended multi-turn interactions. The steered agent remained centered and drift-resistant across 60 messages, whereas the baseline agent drifted toward more one-sided framing. This suggests that steering presents not only a static output shift but a constant reorientation of the model's' trajectory in conversations. This is likely by preventing the accumulation of activation drift with autoregressive feedback loops.
4.2 Significance and Broader Implications
This study contributes to mechanistic interpretability and model control in several ways. First, we demonstrate a practical pipeline from SAE-based feature localization to deployed intervention design, connecting interpretability to behavioral control. Second, the multi-agent conversational evaluation extends beyond just static benchmark testing, showing that conversational context can increase bias over time. This is an important consideration for deployed systems that deal with sustained interaction. Third, the transparent reporting of null and negative results (sentiment axis failure, Race/Ethnicity null) provides important methodological caution against publication bias and highlights the boundary conditions of activation steering.
4.3 Limitations
Several limitations constrain generalizability.
Model Specificity: Experiments were conducted on a single model (Llama-3.2-1B-Instruct). Larger models or different architectures may show different bias architectures and respond differently to steering. Replication across model families is necessary to establish generalizability.
Causal Attribution: SAE-localized features are best-separating directions, not proven causal mechanisms. Activation addition produced behavioral changes, but the relationship between feature activation and output is still correlational. Future work employing causal intervention techniques would be required to establish causality.
Statistical Power and Effect Sizes: The modest sample size (n = 200 per condition, 2 seeds) and small effect sizes (Δamb-acc ≈ +0.04-0.05) raise questions about practical significance. Larger-scale evaluations would be needed to characterize the variability of steering outcomes and detect smaller effects.
Steering Construction: The BBQ-aligned axis was constructed using a bespoke methodology involving researcher degrees of freedom. The protocol is documented but not independently validated; future work should explore automated or semi-automated axis construction methods.
Race-Specific Interaction: The null finding on Race/Ethnicity is specific to BBQ's operationalization of bias. Other forms of race related bias, such as differential task performance or subtle linguistic disparities, can persist even in the absence of noticeable BBQ bias.
4.4 Future Directions
Several pathways for future research emerge. First, methods such as layer-wise ablation (systematically deactivating or removing specific layers) or patching studies (copying specific activations from one context into another) could clarify the computational role of each layer in bias expression, which builds on the observed hierarchical organization. Second, the multi-agent framework could be extended to study more complex social dynamics, including spread of bias through conversational networks. Third, broader evaluation across benchmarks is needed to look deeper into trade-offs between bias mitigation and general capability. Fourth, for categories with near-zero or mixed bias, such as Race/Ethnicity in this study, alternative intervention strategies can be used that may be more appropriate than activation steering.
5.1 Conclusion
This study demonstrates that SAE-localized activation steering can reduce measurable bias in a LLM, but only when the steering axis aligns with the target behavior, baseline bias exists in measurable quantities, and coefficients are calibrated to maintain coherence. Efficacy was modest, directional, and category-dependent. The strongest effects were for Nationality and Religion, with the weakest/null effect for Race/Ethnicity. Mutli-agent simulations further revealed that steering confers conversational stability, preventing the drift that's observed in baseline agents. These findings advance the methodological toolkit for mechanistic debiasing while also highlighting important boundary conditions when dealing with practical deployment.
6.1 Artifacts & Reproduction
Code
modal_app/aquin.engine.py - Aquin engine + SAE pipeline on Modal L40s ( ::bias, ::topics, ::sweep)
modal_app/llama_service.py - serving endpoint with baked-in multi-layer steering (“steer” flag)
sim/run_live.py - async multi-agent conversation engine (per-agent steer column)
sim/bbq_eval.py - BBQ measurement (--steer, --seed)
sim/make_final_graphs.py - regenerates every graph in this paper
Data
sim/bbq_final_vq.json - final BBQ matrix; sim/bbq_mat_ / sim/bbq_mat2_ - raw reports
sim/sim_phaseD_steered.db, sim/sim_phaseD_baseline.db - Phase D conversations
sim/sim_models_round2.db - multi-model conversation; sim/before_after.json - before/after outputs
Reproducing the headline result
export MAMBA_URL="https://dynamic-kittie--aquin-llama-openai-compatible.modal.run" MODAL_PROFILE=dynamic-kittie modal deploy modal_app/llama_service.py # baseline vs steered on a topic: .venv/bin/python sim/bbq_eval.py --model unsloth/Llama-3.2-1B-Instruct --category Religion --n 200 .venv/bin/python sim/bbq_eval.py --model unsloth/Llama-3.2-1B-Instruct --category Religion --n 200
All numbers in this paper come from real runs on the live Aquin/Modal stack; mixed and negative results are reported as-is.
