Aquin Labs · May 2026
Fine-Tuning Llama 3.2 3B on Python Code
A four-stage pipeline using supervised fine-tuning, execution-reward RL, and verified self-improvement to push a 3B model on Python coding benchmarks. 70.0% HumanEval+ best-of-10 from a 44.5% baseline.
Abstract
This is a complete experiment report for a multi-stage Python code fine-tuning pipeline applied to Meta Llama 3.2 3B Instruct. The goal was to push HumanEval and HumanEval+ scores as high as possible using only LoRA fine-tuning and execution-verified self-improvement on a single 46 GB GPU.
The pipeline ran four stages: raw Python code SFT on The Stack Dedup (Stage 1, ultimately abandoned), instruction SFT on CodeFeedback (Stage 2), GRPO execution-reward RL (Stage 3, degraded performance), and verified synthetic SFT from self-generated solutions (Stage 4). The best result was 70.0% HumanEval+ under best-of-10 sampling from a 44.5% baseline. This report documents everything that went wrong, why, and what the data says about closing the remaining gap.
Motivation
The core question was simple: how far can you push a 3B model on Python code generation with only parameter-efficient fine-tuning and execution feedback, without extra data labeling, human preference collection, or proprietary APIs?
Llama 3.2 3B Instruct is a general-purpose model. It scores 46.3% on HumanEval and 44.5% on HumanEval+ out of the box. The hypothesis was that a structured multi-stage pipeline could produce meaningful gains by treating unit test execution as a free, deterministic, directly-aligned reward signal instead of expensive human preference data.
This is the same reasoning behind DeepSeek-Coder, WizardCoder, and the original InstructGPT paper: SFT compresses the search space before RL, and RL drives correctness where SFT alone plateaus. The question was whether that pattern holds at 3B parameters on a tight compute budget.
Aquin's SDK
The Aquin Experimental SDK was attached to every training stage via a single call per run. For SFT stages, attach_sft() receives the model, optimizer, and API key. For the GRPO stage, attach_grpo() additionally takes the reward function so Aquin can monitor reward distribution, advantage variance, and KL divergence alongside the standard gradient signals. Each training step calls session.step(loss) to push metrics, and session.stop() finalizes the run and uploads the checkpoint summary.
No changes to the core training loop were required beyond these three calls. The dashboard updated live throughout every stage, and the Metrics Chat on the left panel was used actively during Stages 1, 3, and 4 to interpret what the signals meant and what to change next.
Model and hardware
compute hours by stage on L40S 46GB
Pipeline overview
Each stage starts from the previous stage's merged output. Nothing was retrained from scratch at any point.
350k examples from The Stack Dedup, 1 epoch. Catastrophic regression from 46.3% to 5% on HumanEval. Abandoned.
128k examples from CodeFeedback-Filtered-Instruction, 2 epochs. Starting from base model directly, not Stage 1 output.
542 problems (HumanEval+ + MBPP+), 2 epochs. HumanEval regressed from 51.8% to 25%. Discarded.
20 samples per problem, 941 verified passing solutions, 2 epochs of SFT. Best-of-10: 70.0% HumanEval+.
Baseline scores
All baselines measured with best-of-10 sampling at temperature 0.8 to match the final evaluation conditions.
Stage 1: Raw Python code SFT
The first run pulled 350,000 Python files from The Stack Dedup, quality-filtered for size, line count, alphanumeric density, and code density. Each file was converted into a three-turn chat example: a system prompt, a user prompt derived from the first docstring or comment, and the file content as the assistant's response. Training ran for one full epoch with LoRA r=64 across all 7 projection modules.
stage 1 — train loss + grad norm vs step · 5149 steps · 350k examples
stage 1 — eval loss by epoch checkpoint
stage 1 — learning rate schedule · cosine decay · 3% warmup
All three curves are healthy. Train loss descends from 1.04 to roughly 0.90. Grad norm stays flat in the 0.14 to 0.19 band with no spikes. The LR cosine decay is smooth, peaking at step 155 and annealing cleanly to zero. There is nothing wrong with the training dynamics. The problem is the task definition.
Raw code file SFT teaches the model to autocomplete code files. The instruction format that Llama 3.2 Instruct learned during RLHF is entirely absent from the training distribution. The model overwrites its instruction-following behavior to become a code file completer. When benchmarks then ask it to respond to a chat-format prompt, it has lost the ability to do so. This is catastrophic forgetting in its most literal form: from 46.3% to 5% in a single epoch.
The correct first step for an instruction-tuned starting point is instruction-formatted code data, not raw files. Stage 1 was the wrong step. Its output was abandoned and Stage 2 started fresh from the original base model.
Aquin — what the dashboard showed
Loss, grad norm, and LR all looked clean in the Aquin dashboard throughout Stage 1. No spikes, no dead layers, no anomalous signals at any point across 5149 steps. The Metrics Chat confirmed the run itself was healthy.
After the quick 20-problem eval showed 5% HumanEval, the Metrics Chat was queried with the eval result alongside the training curves. The diagnosis: the training signals were not the problem. The issue was the data format, specifically the absence of instruction-following structure. Aquin flagged the distribution mismatch between raw code file completions and instruction-formatted benchmark prompts as the root cause, not a training instability.
Switch to an instruction-formatted code dataset. The base model had strong instruction-following from RLHF. Overwriting that format with raw file completions was catastrophic forgetting by design, not a tuning failure. The signal was to discard Stage 1 entirely and restart from a chat-formatted dataset.
Stage 2: Instruction SFT
Stage 2 started directly from meta-llama/Llama-3.2-3B-Instruct and trained on CodeFeedback-Filtered-Instruction, a dataset of 128,504 diverse code instruction-response pairs covering code generation, debugging, refactoring, explanation, and test generation. Training ran for 2 epochs at a lower learning rate (1e-4) with a higher LoRA rank (r=128) to give the model more adaptation capacity on this structured dataset.
stage 2 — train loss + grad norm vs step · 2714 steps · 128k examples · 2 epochs
stage 2 — eval loss by epoch checkpoint · 14 checkpoints
The loss drop from Stage 1's final 0.91 to Stage 2's final 0.59 reflects the fundamental difference between the datasets. CodeFeedback instruction pairs have lower entropy than raw code files. The model can predict the answer distribution much more confidently because instruction-formatted code has a structured, predictable form. The loss curve is clean across both epochs with stable gradients throughout.
The Stage 2 merged model is the foundation for all subsequent stages. Its benchmark numbers (HumanEval 46.3%, MBPP 32.4%) show the model retains its instruction-following capability while gaining code-domain structure. The MBPP regression from the base model's 35.4% is a known artifact of raw code SFT degrading description-to-function tasks, partially recovered here but not fully.
Aquin — what the dashboard showed
The Aquin dashboard showed a clean two-epoch loss trajectory: fast drop in epoch 1 from 0.72 to 0.59, then a near-plateau in epoch 2 as the model saturated what the dataset could teach. No instability, no spikes. Grad norm held steady in the 0.13 to 0.16 band throughout both epochs.
The Metrics Chat flagged that the eval loss plateau after epoch 1.5 was a saturation signal, not a failure. With 128k instruction pairs and r=128 adapters across all 7 projection modules, the model had reached the realistic floor for this dataset and configuration. The recommendation was to proceed to Stage 3 rather than run additional epochs.
No dead layers detected across any of the 28 LoRA adapter matrices. Aquin's per-layer grad norm view confirmed uniform signal distribution across all 7 target modules, which validated that r=128 on a 128k instruction dataset is well-matched. The same configuration on a 500-row dataset would have shown dead FFN layers, as confirmed by the healthcare fine-tuning experiment.
The post-training SAE feature diff between the base and Stage 2 checkpoint showed the heaviest rewrite concentrated at L8 to L12, with 14 features shifted at L8 and the top mover being a code-structure pattern-completion feature. No refusal or safety-adjacent features appeared in the diff. The feature neighborhoods around the top shifted features were tightly clustered in decoder space, meaning a future weight edit targeting any one of them would need to account for the surrounding cluster. This was noted as a risk factor for the GRPO stage: RL on a policy with tightly coupled code features has less room to maneuver than one with more dispersed representations.
Stage 3: GRPO execution-reward RL
Stage 3 applied GRPO on top of the Stage 4 merged model (an existing checkpoint at 51.8% HumanEval), generating multiple completions per problem and using unit test pass/fail as the binary reward signal. No separate reward model or human preference data.
The training metrics look fine on the surface. Reward climbs, pass rate reaches 100% on mini-batches, KL stays bounded. But evaluation reveals severe policy collapse. HumanEval dropped from 51.8% to 25%. Three compounding causes:
Tiny dataset
542 problems is extremely small for RL. The model memorizes those specific problem patterns instead of learning generalizable correctness strategies.
Halved reward diversity
4 generations instead of 8 halves the variance in reward signal used to compute GRPO advantages. With only 4 samples, many prompts have all-pass or all-fail outcomes, producing zero advantage and no gradient signal.
Dependency conflict
trl 0.29 requires PyTorch 2.6 (FSDPModule). PyTorch 2.5.1 was the only version available on CUDA 12.1. Downgrading to trl 0.14 introduced API incompatibilities requiring manual patches, adding uncontrolled variation.
The GRPO output was discarded. All subsequent work returns to the Stage 4 base. The core lesson: RL on 542 problems with 4 samples each and a weak baseline policy does not provide enough signal diversity to prevent collapse.
Aquin — what the dashboard showed
The Aquin GRPO dashboard showed reward climbing from 0.18 to 0.34 across 270 steps, with mini-batch pass rate reaching 100% by step 200. On the surface, the metrics looked healthy. This is the deceptive case the Metrics Chat is specifically useful for: training signals that appear positive but indicate overfitting to the training distribution rather than genuine capability improvement.
Aquin flagged decreasing advantage variance across the second epoch. With only 4 generations per prompt on a 542-problem set, a growing fraction of prompts had all-pass or all-fail outcomes, producing zero advantage and contributing no gradient signal. The Metrics Chat identified this as a reward diversity problem: the policy was converging on a narrow distribution that could solve training problems but had lost generalization breadth.
KL divergence from the reference policy stayed bounded at 0.04 beta throughout, which would normally be a healthy sign. Aquin's Metrics Chat noted that bounded KL combined with collapsing advantage variance is a policy collapse pattern: the model drifts into a narrow solution mode that stays close to the reference in parameter space but produces degenerate behavior in output space. The recommendation was to discard the Stage 3 output and increase generation diversity before retrying GRPO.
After the regression was confirmed, the causal trace was run on a sample of HumanEval problems the model had solved before Stage 3 and failed after. The layer-level recovery signal had fragmented: problems that previously resolved cleanly at L10 to L12 were distributing causal load across later layers, and the logit lens showed the model committing to wrong tokens two to three layers earlier than the Stage 2 checkpoint had. The policy had not just memorized training problems — it had shifted where in the network it was doing the work, in a way that broke the retrieval structure Stage 2 had built.
Stage 4: Verified synthetic SFT
Stage 4 is a self-improvement loop without a separate reward model. The Stage 2 model generates 20 candidate solutions per problem across HumanEval+ and MBPP+, runs each against unit tests, and keeps only the solutions that pass all tests. Those verified (prompt, solution) pairs form the Stage 4 training set. The model then fine-tunes on its own best outputs.
stage 4 — verified SFT train loss · 234 steps · 941 examples · 2 epochs
The HumanEval+ improvement from 44.5% to 70.0% under best-of-10 is the most meaningful result in this pipeline. HumanEval+ uses the same 164 problems as HumanEval but with stricter edge-case tests. Going from 44.5% to 70.0% means the model is producing code that is genuinely more correct, not just code that passes easy checks.
The greedy numbers tell a different story. HumanEval (greedy) went from 46.3% at baseline to 49.4% after Stage 4. That is a real improvement, but a modest one. HumanEval+ (greedy) stayed flat at 44.5%. The model improved at diverse sampling but not at single-shot generation. Stage 4 trained on 941 verified examples, all of which came from the same base model under the same temperature. It reinforced existing strategies rather than teaching new ones.
Aquin — what the dashboard showed
The Stage 4 SFT loss chart in Aquin showed the fastest collapse of any stage: 0.64 to 0.10 in 234 steps with no instability at any point. Aquin's Metrics Chat noted this as the signature of a high-quality, low-entropy dataset. 941 verified solutions from a model that already knew the problem distribution had very little output variance for the adapter to model. Fast convergence here is a good sign, not overfitting.
Gradient norms tapered smoothly from 0.18 to 0.09 across the two epochs, tracking the loss collapse. No spikes at the epoch boundary at step 117. Aquin confirmed clean adapter saturation: the 941 examples were fully absorbed by step 200, and the final 34 steps were consolidation rather than active learning.
After the full HumanEval evaluation showed greedy HumanEval+ flat at 44.5% despite the fast SFT convergence, the Metrics Chat was queried with the eval delta. The diagnosis: the training distribution was too narrow. 941 examples from a single model at a single temperature teaches the adapter to reproduce what that model already does well, not to generalize. To move the greedy number, the dataset needs solutions generated at multiple temperatures or by multiple base models, creating output diversity the adapter has to learn from rather than memorize.
The weight editor was used to probe where the Stage 4 model's code-correctness associations lived. Causal traces on specific failing HumanEval+ problems located the retrieval signal at L10 to L14. Logit projections of the top SAE features at those layers showed the model promoting plausible-but-wrong completions with high confidence. Steering at negative strength on those features reduced confident wrong outputs in isolation, confirming the failure mode: the model had encoded solution patterns that activated on surface form rather than on underlying algorithmic structure. A rank-one weight edit at L12 on one representative failing problem shifted the probability of the correct completion from 11% to 61% and held across five rephrase templates, suggesting the association was localized and correctable, but the sheer volume of such corrections needed across 164 problems makes individual weight editing impractical. The right fix is the training data, not the editor.
Benchmark results
final comparison — humaneval vs humanevalplus
humaneval pass@1 across pipeline stages
All scores
Aquin tooling across the pipeline
Two systems were running throughout this pipeline. The Experimental SDK handled live training signals at every step: loss, grad norm, LR, dead layers, advantage variance, KL. The mechanistic interpretability tooling ran post-training on completed checkpoints, using SAE feature diffs, causal traces, the logit lens, and the weight editor to understand what had actually changed inside the model and why the benchmarks moved the way they did.
Without both, diagnosing a pipeline this complex would mean assembling custom gradient scripts, manual TensorBoard configurations, and blind post-hoc analysis of raw logs. Together they made each stage boundary a real decision point rather than a guess.
Metrics Chat diagnosed catastrophic forgetting from 46.3% to 5% not as a training failure but as a data format problem. Raw file completions vs instruction-following format. Saved retrying Stage 1 with different hyperparameters.
Loss, grad norm, and LR all healthy throughout. Aquin confirmed the training run itself was correct, giving confidence that the dataset choice was the sole variable to change.
Eval loss plateau after epoch 1.5 identified as a dataset saturation signal, not a training defect. Recommendation to proceed to Stage 3 rather than waste compute on additional epochs.
Post-training SAE diff on the Stage 2 checkpoint showed heaviest rewrite at the middle layers, consistent with code-domain instruction tuning deepening factual retrieval circuits. The top shifted features at L8 were code-structure and pattern-completion associations. No unexpected safety or refusal features appeared in the diff, confirming the CodeFeedback dataset introduced no suppression artifacts.
Per-layer grad norm view confirmed uniform signal distribution across all 7 adapter matrices at r=128, validating that this config is well-matched to a 128k instruction dataset.
Decreasing advantage variance in epoch 2 identified as reward diversity exhaustion. With 4 generations on 542 problems, a growing fraction of prompts produced zero advantage and no gradient signal.
Reward climbed to 0.34 and mini-batch pass rate hit 100%, but Metrics Chat flagged the combination of bounded KL and collapsing advantage variance as a policy collapse pattern rather than genuine capability gain.
After the HumanEval regression from 51.8% to 25%, the causal trace was run on a sample of failing problems. The layer-level recovery signal had shifted: problems that previously resolved cleanly at the middle-layer MLP were now distributing causal load across later layers, a signature of representational drift. The logit lens confirmed the model was committing to wrong tokens earlier in the forward pass, before the usual retrieval layers had assembled enough context.
0.64 to 0.10 in 234 steps identified as high-quality dataset signature rather than overfitting. Grad norm tapering confirmed clean adapter saturation.
After greedy HumanEval+ stayed flat at 44.5% despite the fast SFT convergence, the SAE feature diff between Stage 2 and Stage 4 was run. The features that shifted most were already-active code-pattern features at the layers that had also shifted in Stage 2. The fine-tune had deepened the same circuits rather than broadening them. This mechanistically confirmed what the Metrics Chat had diagnosed from training signals alone: 941 examples from one model at one temperature teaches the adapters to sharpen existing strategies, not learn new ones.
The weight editor was used to probe where the model's code-correctness associations lived after Stage 4. Causal traces on specific failing HumanEval+ problems located the retrieval signal at layers 10 to 14. Logit projections of the top SAE features at those layers showed the model was promoting plausible-but-wrong completions with high confidence. Steering at negative strength on those features reduced confident wrong outputs, confirming the failure mode: the model had encoded solution patterns that activated on surface form rather than on the underlying algorithmic structure.
Single attach_sft() or attach_grpo() call per stage. No changes to training logic. Dashboard live from step one of each run. Mech interp tooling loaded from the same checkpoint paths.
The pattern across all four stages is consistent. The SDK's value is not just charting metrics that are already visible in training logs. It is the Metrics Chat's ability to interpret combinations of signals — healthy training dynamics alongside catastrophic eval regression in Stage 1, bounded KL alongside collapsing advantage variance in Stage 3, fast loss collapse alongside flat greedy scores in Stage 4 — and translate those combinations into specific diagnoses. The mechanistic tooling then goes one level deeper: it confirms or challenges those diagnoses inside the model's weights, giving you a causal account of the behavior rather than a correlation. That combination is what made a four-stage pipeline debuggable in a single session.
Demo videos
To see what these benchmark numbers actually mean in practice, we ran two tasks against three models: the base Llama 3.2 3B Instruct with no fine-tuning, the Stage 4 fine-tuned model, and GPT-OSS 20B. Each task was given the same prompt with no additional guidance, run to completion, and executed against tests.
Task one is a snake game. It is a stateful, event-driven program that requires pygame, collision logic, directional control, and a scoring loop. Task two is a Fibonacci app: a recursive function with memoization and a simple CLI interface. The first task tests general code generation on something outside the benchmark distribution. The second tests exactly the capability the fine-tuning was optimizing for.
The snake game results are worth pausing on. The base model failed to produce runnable code at all. GPT-OSS 20B produced a working game but missed a core rule: the snake does not lose when it touches the boundary. The fine-tuned 3B model, despite having a quarter of the parameters, produced a complete, correct implementation with all game rules intact. The fine-tuning had transferred to a task it had never seen in training.
Task 1 — Snake Game
Snake game — Llama 3.2 3B Instruct (base, no fine-tuning)
The raw base model on the snake game prompt. It failed to produce runnable code — not just a buggy game, nothing executable at all. This is where 46.3% HumanEval actually sits when asked to build something stateful.
Snake game — Stage 4 fine-tuned model
The Stage 4 model on the same prompt. Produced a complete, playable snake game with all rules correctly implemented, including boundary collision loss. The fine-tuning transferred to a task it never trained on directly.
Snake game — GPT-OSS 20B
The 20B model produced a working game but missed the boundary collision rule — the snake does not lose on wall contact. A larger model, a more obvious bug. The fine-tuned 3B got this right where the 20B did not.
Task 2 — Fibonacci App
Fibonacci app — Llama 3.2 3B Instruct (base, no fine-tuning)
Base model on a Fibonacci function with memoization and a simple CLI interface. This sits squarely in the HumanEval distribution — the kind of task the 46.3% baseline is measured on.
Fibonacci app — Stage 4 fine-tuned model
Stage 4 model on the same prompt. HumanEval+ went from 44.5% to 70.0% best-of-10 after this stage. In-distribution tasks like this are where the gain is most visible.
Fibonacci app — GPT-OSS 20B
The 20B model on the same prompt, for comparison against the fine-tuned 3B output.
Analysis
The HumanEval+ improvement from 44.5% to 70.0% under best-of-10 is still a real finding. It means the pipeline meaningfully improved the model's ability to generate correct code, not just code that passes easy tests. The stricter HumanEval+ benchmark is designed to catch partial correctness, and a 25-point gain under sampling represents a genuine shift in what the model is capable of producing. The snake game demo makes this concrete in a way benchmarks alone cannot: a fine-tuned 3B model got the game rules right where a 20B model did not.
