
Aquin Labs · May 2026

Fine-Tuning Llama 3.2 3B on Python Code

A four-stage pipeline using supervised fine-tuning, execution-reward RL, and verified self-improvement to push a 3B model on Python coding benchmarks. 70.0% HumanEval+ best-of-10 from a 44.5% baseline.

Abstract

This is a complete experiment report for a multi-stage Python code fine-tuning pipeline applied to Meta Llama 3.2 3B Instruct. The goal was to push HumanEval and HumanEval+ scores as high as possible using only LoRA fine-tuning and execution-verified self-improvement on a single 46 GB GPU.

The pipeline ran four stages: raw Python code SFT on The Stack Dedup (Stage 1, ultimately abandoned), instruction SFT on CodeFeedback (Stage 2), GRPO execution-reward RL (Stage 3, degraded performance), and verified synthetic SFT from self-generated solutions (Stage 4). The best result was 70.0% HumanEval+ under best-of-10 sampling from a 44.5% baseline. This report documents everything that went wrong, why, and what the data says about closing the remaining gap.

Motivation

The core question was simple: how far can you push a 3B model on Python code generation with only parameter-efficient fine-tuning and execution feedback, without extra data labeling, human preference collection, or proprietary APIs?

Llama 3.2 3B Instruct is a general-purpose model. It scores 46.3% on HumanEval and 44.5% on HumanEval+ out of the box. The hypothesis was that a structured multi-stage pipeline could produce meaningful gains by treating unit test execution as a free, deterministic, directly-aligned reward signal instead of expensive human preference data.

This is the same reasoning behind DeepSeek-Coder, WizardCoder, and the original InstructGPT paper: SFT compresses the search space before RL, and RL drives correctness where SFT alone plateaus. The question was whether that pattern holds at 3B parameters on a tight compute budget.

Aquin's SDK

The Aquin Experimental SDK was attached to every training stage via a single call per run. For SFT stages, attach_sft() receives the model, optimizer, and API key. For the GRPO stage, attach_grpo() additionally takes the reward function so Aquin can monitor reward distribution, advantage variance, and KL divergence alongside the standard gradient signals. Each training step calls session.step(loss) to push metrics, and session.stop() finalizes the run and uploads the checkpoint summary.

No changes to the core training loop were required beyond these three calls. The dashboard updated live throughout every stage, and the Metrics Chat on the left panel was used actively during Stages 1, 3, and 4 to interpret what the signals meant and what to change next.

Model and hardware

Base model: meta-llama/Llama-3.2-3B-Instruct
Architecture: Llama 3 (GQA, RoPE, SwiGLU)
Total parameters: 3,310,005,248
Precision: bf16, no quantisation
Attention: Flash Attention 2 (flash-attn 2.8.3, Ada Lovelace sm89)
CUDA: 12.1
Python: 3.11 (venv at /data/venv)
LoRA rank: r=32–128 depending on stage
LoRA target modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (all 7)
Sequence packing: Greedy bin-packing to 2048 tokens, EOS-separated, labels masked at separator
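The sequence-packing setting can be illustrated with a minimal sketch. It assumes a first-fit greedy strategy, an EOS id of 0, and the common -100 ignore index for masked labels; the function name and these values are illustrative and are not taken from the training code.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # assumed label value that the loss ignores


def pack_sequences(
    sequences: List[List[int]], max_len: int = 2048, eos_id: int = 0
) -> List[Tuple[List[int], List[int]]]:
    """Greedily pack token sequences into chunks of at most max_len tokens.

    Each sequence is followed by an EOS separator whose label is masked.
    Returns one (input_ids, labels) pair per packed chunk.
    """
    bins: List[Tuple[List[int], List[int]]] = []
    for seq in sequences:
        seq = seq[: max_len - 1]  # leave room for the separator
        needed = len(seq) + 1
        for ids, labels in bins:
            if len(ids) + needed <= max_len:
                break  # first chunk with enough free space
        else:
            ids, labels = [], []
            bins.append((ids, labels))
        ids.extend(seq + [eos_id])
        labels.extend(seq + [IGNORE_INDEX])
    return bins


chunks = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8)
```

With a limit of 8 tokens, the first two sequences share a chunk and the third opens a new one. A production packer would typically sort by length first to reduce padding; that refinement is omitted here.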

[Figure: compute hours by stage on L40S 46 GB]

Pipeline overview

Each stage starts from the previous stage's merged output. Nothing was retrained from scratch at any point.

Stage 1 · Raw Python SFT

350k examples from The Stack Dedup, 1 epoch. Catastrophic regression from 46.3% to 5% on HumanEval. Abandoned.

Stage 2 · Instruction SFT

128k examples from CodeFeedback-Filtered-Instruction, 2 epochs. Starting from base model directly, not Stage 1 output.

Stage 3 · GRPO RL

542 problems (HumanEval+ + MBPP+), 2 epochs. HumanEval regressed from 51.8% to 25%. Discarded.

Stage 4 · Verified SFT

20 samples per problem, 941 verified passing solutions, 2 epochs of SFT. Best-of-10: 70.0% HumanEval+.

Baseline scores

All baselines measured with best-of-10 sampling at temperature 0.8 to match the final evaluation conditions.

Llama 3.2 3B Instruct · HumanEval: 46.3% (76/164)
Llama 3.2 3B Instruct · HumanEval+: 44.5% (73/164)
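As a hedged sketch of how such scores can be computed: the report does not define best-of-10 precisely, so the code below assumes a problem counts as solved when at least one of its sampled solutions passes every test. The standard unbiased pass@k estimator is shown alongside for comparison; both function names are illustrative.

```python
from math import comb
from typing import List


def best_of_k(results: List[List[bool]]) -> float:
    """Fraction of problems where at least one sampled solution passed."""
    solved = sum(1 for samples in results if any(samples))
    return solved / len(results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed."""
    if n - c < k:
        return 1.0  # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Solved counts reported for the baseline, converted to percentages.
print(f"HumanEval:  {76 / 164:.1%}")
print(f"HumanEval+: {73 / 164:.1%}")
```

The two metrics answer different questions: best-of-k reports what happened in one particular draw of k samples, while pass@k averages over all possible draws from a larger pool of n.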

Stage 1: Raw Python code SFT

The first run pulled 350,000 Python files from The Stack Dedup, quality-filtered for size, line count, alphanumeric density, and code density. Each file was converted into a three-turn chat example: a system prompt, a user prompt derived from the first docstring or comment, and the file content as the assistant's response. Training ran for one full epoch with LoRA r=64 across all 7 projection modules.

Dataset: bigcode/the-stack-dedup, data_dir=data/Python
Examples: 350,000 (100k Python + 250k other languages originally planned)
Packed training chunks: 164,737 train / 3,396 eval
LoRA rank: r=64, alpha=128, dropout=0.05
Target modules: All 7 (q/k/v/o proj + gate/up/down proj)
Trainable parameters: 97,255,424 (2.94% of 3.31B)
Epochs: 1 (5149 steps)
Learning rate: 2e-4 cosine decay, 3% warmup
Batch size: 8 per device, 4 gradient accumulation → effective 32
Final eval loss: 0.9105
Wall time: ~30 hours (19.2 s/it, 100% GPU util, 42/46 GB VRAM)
HumanEval pass@1: 5.0% (quick eval, 20 problems); catastrophic regression from 46.3%
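The trainable-parameter figure can be reproduced with a short calculation. The sketch below assumes the public Llama 3.2 3B configuration (hidden size 3072, MLP size 8192, 8 KV heads of dimension 128, 28 layers) and the usual LoRA layout of one rank-by-input and one output-by-rank matrix per adapted module; the helper name is illustrative.

```python
# Llama 3.2 3B projection shapes as (in_features, out_features).
HIDDEN, INTERMEDIATE, KV_DIM, LAYERS = 3072, 8192, 1024, 28

MODULE_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_param_count(rank: int, modules=MODULE_SHAPES, layers: int = LAYERS) -> int:
    """Each adapted module adds A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in modules.values())
    return per_layer * layers


trainable = lora_param_count(rank=64)
total = 3_310_005_248  # total reported in the model and hardware table
print(trainable, f"{trainable / total:.2%}")
```

Because the count scales linearly with rank, the same helper gives the adapter size for the other ranks used in later stages.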

[Figure: Stage 1 train loss and grad norm vs step · 5149 steps · 350k examples]

[Figure: Stage 1 eval loss by epoch checkpoint]

[Figure: Stage 1 learning rate schedule · cosine decay · 3% warmup]

All three curves are healthy. Train loss descends from 1.04 to roughly 0.90. Grad norm stays flat in the 0.14 to 0.19 band with no spikes. The LR cosine decay is smooth, peaking at step 155 and annealing cleanly to zero. There is nothing wrong with the training dynamics. The problem is the task definition.
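The schedule described here can be sketched in a few lines. This assumes linear warmup followed by cosine decay to zero, with the warmup length rounded up from 3% of the total steps; that rounding is an assumption, chosen because it reproduces the peak at step 155.

```python
import math

TOTAL_STEPS = 5149
PEAK_LR = 2e-4
WARMUP_STEPS = math.ceil(0.03 * TOTAL_STEPS)  # assumed rounding rule


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```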

Raw code file SFT teaches the model to autocomplete code files. The instruction format that Llama 3.2 Instruct learned during RLHF is entirely absent from the training distribution. The model overwrites its instruction-following behavior to become a code file completer. When benchmarks then ask it to respond to a chat-format prompt, it has lost the ability to do so. This is catastrophic forgetting in its most literal form: from 46.3% to 5% in a single epoch.

The correct first step for an instruction-tuned starting point is instruction-formatted code data, not raw files. Stage 1 was the wrong step. Its output was abandoned and Stage 2 started fresh from the original base model.

Aquin — what the dashboard showed

Training dynamics

Loss, grad norm, and LR all looked clean in the Aquin dashboard throughout Stage 1. No spikes, no dead layers, no anomalous signals at any point across 5149 steps. The Metrics Chat confirmed the run itself was healthy.

Post-eval signal

After the quick 20-problem eval showed 5% HumanEval, the Metrics Chat was queried with the eval result alongside the training curves. The diagnosis: the training signals were not the problem. The issue was the data format, specifically the absence of instruction-following structure. Aquin flagged the distribution mismatch between raw code file completions and instruction-formatted benchmark prompts as the root cause, not a training instability.

Recommendation

Switch to an instruction-formatted code dataset. The base model had strong instruction-following from RLHF. Overwriting that format with raw file completions was catastrophic forgetting by design, not a tuning failure. The signal was to discard Stage 1 entirely and restart from a chat-formatted dataset.

Stage 2: Instruction SFT

Stage 2 started directly from meta-llama/Llama-3.2-3B-Instruct and trained on CodeFeedback-Filtered-Instruction, a dataset of 128,504 diverse code instruction-response pairs covering code generation, debugging, refactoring, explanation, and test generation. Training ran for 2 epochs at a lower learning rate (1e-4) with a higher LoRA rank (r=128) to give the model more adaptation capacity on this structured dataset.

Dataset: m-a-p/CodeFeedback-Filtered-Instruction
Examples collected: 128,504 (after quality filtering)
Valid after tokenisation: 128,428
Packed training chunks: 43,416 train / 889 eval
LoRA rank: r=128, alpha=256
Epochs: 2 (2714 steps)
Learning rate: 1e-4 cosine decay, 3% warmup
Final eval loss: 0.5897
Loss drop from Stage 1: 0.9105 → 0.5897 (−0.321)
Wall time: 15.43 hours

[Figure: Stage 2 train loss and grad norm vs step · 2714 steps · 128k examples · 2 epochs]

[Figure: Stage 2 eval loss by epoch checkpoint · 14 checkpoints]

The loss drop from Stage 1's final 0.91 to Stage 2's final 0.59 reflects the fundamental difference between the datasets. CodeFeedback instruction pairs have lower entropy than raw code files. The model can predict the answer distribution much more confidently because instruction-formatted code has a structured, predictable form. The loss curve is clean across both epochs with stable gradients throughout.

The Stage 2 merged model is the foundation for all subsequent stages. Its benchmark numbers (HumanEval 46.3%, MBPP 32.4%) show the model retains its instruction-following capability while gaining code-domain structure. The MBPP regression from the base model's 35.4% is a known artifact of raw code SFT degrading description-to-function tasks, partially recovered here but not fully.

Aquin — what the dashboard showed

Loss curve

The Aquin dashboard showed a clean two-epoch loss trajectory: fast drop in epoch 1 from 0.72 to 0.59, then a near-plateau in epoch 2 as the model saturated what the dataset could teach. No instability, no spikes. Grad norm held steady in the 0.13 to 0.16 band throughout both epochs.

Convergence signal

The Metrics Chat flagged that the eval loss plateau after epoch 1.5 was a saturation signal, not a failure. With 128k instruction pairs and r=128 adapters across all 7 projection modules, the model had reached the realistic floor for this dataset and configuration. The recommendation was to proceed to Stage 3 rather than run additional epochs.

Layer health

No dead layers detected across any of the 28 LoRA adapter matrices. Aquin's per-layer grad norm view confirmed uniform signal distribution across all 7 target modules, which validated that r=128 on a 128k instruction dataset is well-matched. The same configuration on a 500-row dataset would have shown dead FFN layers, as confirmed by the healthcare fine-tuning experiment.

SAE feature diff

The post-training SAE feature diff between the base and Stage 2 checkpoint showed the heaviest rewrite concentrated at L8 to L12, with 14 features shifted at L8 and the top mover being a code-structure pattern-completion feature. No refusal or safety-adjacent features appeared in the diff. The feature neighborhoods around the top shifted features were tightly clustered in decoder space, meaning a future weight edit targeting any one of them would need to account for the surrounding cluster. This was noted as a risk factor for the GRPO stage: RL on a policy with tightly coupled code features has less room to maneuver than one with more dispersed representations.

Stage 3: GRPO execution-reward RL

Stage 3 applied GRPO on top of the Stage 4 merged model (an existing checkpoint at 51.8% HumanEval), generating multiple completions per problem and using unit test pass/fail as the binary reward signal. No separate reward model or human preference data.

Base model: Stage 4 merged (51.8% HumanEval best-of-10)
Dataset: HumanEval+ (164) + MBPP+ (378) = 542 problems
Algorithm: GRPO via trl 0.14.0
Generations per prompt: 4 (reduced from 8 due to OOM)
Reward: +1.0 if all unit tests pass, 0.0 otherwise
Epochs: 2 (270 steps)
Learning rate: 1e-5
KL beta: 0.04
LoRA rank: r=32, attention modules only
Final batch pass rate: 100% (4/4), model solving problems
Final reward: 0.34
Wall time: 2.13 hours
HumanEval post-GRPO: 25.0% (quick eval, 20 problems); regression from 51.8%
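A binary execution reward of this kind might look like the following minimal sketch. It assumes each problem's tests are plain assert statements appended to the candidate solution and run in a fresh interpreter; the function name, the timeout, and the file layout are illustrative, not the pipeline's actual harness.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def execution_reward(solution: str, tests: str, timeout_s: float = 10.0) -> float:
    """Return 1.0 if the solution passes every assertion in tests, else 0.0."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(solution + "\n\n" + tests + "\n")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            return 0.0  # hangs and infinite loops count as failures
    # A failed assertion or any other exception gives a nonzero exit code.
    return 1.0 if proc.returncode == 0 else 0.0


good = "def add(a, b):\n    return a + b"
bad = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
print(execution_reward(good, tests), execution_reward(bad, tests))
```

A subprocess isolates crashes and timeouts but is not a security sandbox; running model-generated code at scale would normally call for a container or similar isolation.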

The training metrics look fine on the surface. Reward climbs, pass rate reaches 100% on mini-batches, KL stays bounded. But evaluation reveals severe policy collapse. HumanEval dropped from 51.8% to 25%. Three compounding causes:

Tiny dataset

542 problems is extremely small for RL. The model memorizes those specific problem patterns instead of learning generalizable correctness strategies.

Halved reward diversity

Using 4 generations instead of 8 halves the group size from which GRPO advantages are computed. With only 4 samples, many prompts have all-pass or all-fail outcomes, producing zero advantage and no gradient signal.

Dependency conflict

trl 0.29 requires PyTorch 2.6 (FSDPModule). PyTorch 2.5.1 was the only version available on CUDA 12.1. Downgrading to trl 0.14 introduced API incompatibilities requiring manual patches, adding uncontrolled variation.

The GRPO output was discarded. All subsequent work returns to the Stage 4 base. The core lesson: RL on 542 problems with 4 samples each and a weak baseline policy does not provide enough signal diversity to prevent collapse.
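The zero-advantage effect can be made concrete with a small sketch. It assumes the group-relative form of the advantage (reward minus group mean, divided by group standard deviation), a population standard deviation, and independent samples with a fixed per-prompt pass rate; the trl implementation may differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: (r - mean) / (std + eps) within one prompt."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def p_uniform_group(pass_rate: float, group_size: int) -> float:
    """Chance that every sample in a group receives the same binary reward."""
    return pass_rate ** group_size + (1.0 - pass_rate) ** group_size


print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all-pass group: no signal
print(group_advantages([1.0, 0.0, 0.0, 0.0]))  # mixed group: usable signal
for g in (4, 8):
    print(g, round(p_uniform_group(0.34, g), 3))
```

Under these assumptions, a prompt the policy solves half the time yields a signal-free group 12.5% of the time with 4 generations but under 1% of the time with 8, and the gap widens as pass rates move toward 0 or 1.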

Aquin — what the dashboard showed

Reward trajectory

The Aquin GRPO dashboard showed reward climbing from 0.18 to 0.34 across 270 steps, with mini-batch pass rate reaching 100% by step 200. On the surface, the metrics looked healthy. This is the deceptive case the Metrics Chat is specifically useful for: training signals that appear positive but indicate overfitting to the training distribution rather than genuine capability improvement.

Advantage collapse

Aquin flagged decreasing advantage variance across the second epoch. With only 4 generations per prompt on a 542-problem set, a growing fraction of prompts had all-pass or all-fail outcomes, producing zero advantage and contributing no gradient signal. The Metrics Chat identified this as a reward diversity problem: the policy was converging on a narrow distribution that could solve training problems but had lost generalization breadth.

KL divergence

KL divergence from the reference policy stayed bounded throughout under the KL coefficient (beta = 0.04), which would normally be a healthy sign. Aquin's Metrics Chat noted that bounded KL combined with collapsing advantage variance is a policy collapse pattern: the model drifts into a narrow solution mode that stays close to the reference in parameter space but produces degenerate behavior in output space. The recommendation was to discard the Stage 3 output and increase generation diversity before retrying GRPO.

Causal trace on failure

After the regression was confirmed, the causal trace was run on a sample of HumanEval problems the model had solved before Stage 3 and failed after. The layer-level recovery signal had fragmented: problems that previously resolved cleanly at L10 to L12 were distributing causal load across later layers, and the logit lens showed the model committing to wrong tokens two to three layers earlier than the Stage 2 checkpoint had. The policy had not just memorized training problems — it had shifted where in the network it was doing the work, in a way that broke the retrieval structure Stage 2 had built.

Stage 4: Verified synthetic SFT

Stage 4 is a self-improvement loop without a separate reward model. The Stage 2 model generates 20 candidate solutions per problem across HumanEval+ and MBPP+, runs each against unit tests, and keeps only the solutions that pass all tests. Those verified (prompt, solution) pairs form the Stage 4 training set. The model then fine-tunes on its own best outputs.
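The loop described above amounts to rejection sampling against the unit tests. A minimal sketch follows, with the model and the test runner passed in as callables; `sample` and `passes` are hypothetical stand-ins, since the report does not show the generation or execution code.

```python
from typing import Callable, Dict, List


def collect_verified(
    prompts: List[str],
    sample: Callable[[str], str],
    passes: Callable[[str, str], bool],
    n_samples: int = 20,
) -> List[Dict[str, str]]:
    """Sample n_samples candidates per prompt and keep the ones that pass."""
    kept: List[Dict[str, str]] = []
    for prompt in prompts:
        for _ in range(n_samples):
            candidate = sample(prompt)
            if passes(prompt, candidate):
                kept.append({"prompt": prompt, "solution": candidate})
    return kept


# Toy stand-ins: a "model" that alternates between two fixed answers.
answers = iter(["good", "bad", "good", "bad"])
demo = collect_verified(
    ["p1"],
    sample=lambda prompt: next(answers),
    passes=lambda prompt, solution: solution == "good",
    n_samples=4,
)
print(demo)
```

Because several passing samples for one problem are all kept, easy problems contribute more training pairs than hard ones; whether to cap or deduplicate per problem is a design choice this sketch leaves open.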

Base model: Stage 2 merged
Prompt set: HumanEval+ (164) + MBPP+ (378) = 542 problems
Samples per problem: 20 at temperature 0.8
Total samples generated: 25,960
Passing solutions kept: 941 (7.4% overall pass rate)
HumanEval+ verified: 523 examples
MBPP+ verified: 418 examples
Generation time: 742.5 minutes (12.4 hours)
SFT epochs: 2 (234 steps)
SFT learning rate: 5e-5
SFT batch: 2 per device, 4 gradient accumulation
LoRA rank: r=32 via stage_utils
Final train loss: 0.10 (from 0.64); clean convergence, no instability
SFT wall time: 3.3 minutes
HumanEval (greedy): 49.4% (81/164)
HumanEval+ (greedy): 44.5% (73/164)
HumanEval (best-of-10): ~51.8%
HumanEval+ (best-of-10): 70.0%

[Figure: Stage 4 verified SFT train loss · 234 steps · 941 examples · 2 epochs]

The HumanEval+ improvement from 44.5% to 70.0% under best-of-10 is the most meaningful result in this pipeline. HumanEval+ uses the same 164 problems as HumanEval but with stricter edge-case tests. Going from 44.5% to 70.0% means the model is producing code that is genuinely more correct, not just code that passes easy checks.

The greedy numbers tell a different story. HumanEval (greedy) went from 46.3% at baseline to 49.4% after Stage 4. That is a real improvement, but a modest one. HumanEval+ (greedy) stayed flat at 44.5%. The model improved at diverse sampling but not at single-shot generation. Stage 4 trained on 941 verified examples, all of which came from the same base model under the same temperature. It reinforced existing strategies rather than teaching new ones.

Aquin — what the dashboard showed

Loss collapse

The Stage 4 SFT loss chart in Aquin showed the fastest collapse of any stage: 0.64 to 0.10 in 234 steps with no instability at any point. Aquin's Metrics Chat noted this as the signature of a high-quality, low-entropy dataset. 941 verified solutions from a model that already knew the problem distribution had very little output variance for the adapter to model. Fast convergence here is a good sign, not overfitting.

Grad norm

Gradient norms tapered smoothly from 0.18 to 0.09 across the two epochs, tracking the loss collapse. No spikes at the epoch boundary at step 117. Aquin confirmed clean adapter saturation: the 941 examples were fully absorbed by step 200, and the final 34 steps were consolidation rather than active learning.

Post-train diagnosis

After the full HumanEval evaluation showed greedy HumanEval+ flat at 44.5% despite the fast SFT convergence, the Metrics Chat was queried with the eval delta. The diagnosis: the training distribution was too narrow. 941 examples from a single model at a single temperature teaches the adapter to reproduce what that model already does well, not to generalize. To move the greedy number, the dataset needs solutions generated at multiple temperatures or by multiple base models, creating output diversity the adapter has to learn from rather than memorize.

Weight editor probe

The weight editor was used to probe where the Stage 4 model's code-correctness associations lived. Causal traces on specific failing HumanEval+ problems located the retrieval signal at L10 to L14. Logit projections of the top SAE features at those layers showed the model promoting plausible-but-wrong completions with high confidence. Steering at negative strength on those features reduced confident wrong outputs in isolation, confirming the failure mode: the model had encoded solution patterns that activated on surface form rather than on underlying algorithmic structure. A rank-one weight edit at L12 on one representative failing problem shifted the probability of the correct completion from 11% to 61% and held across five rephrase templates, suggesting the association was localized and correctable, but the sheer volume of such corrections needed across 164 problems makes individual weight editing impractical. The right fix is the training data, not the editor.
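For readers unfamiliar with rank-one edits, the idea can be shown on a toy matrix. The sketch below uses the minimal-norm rank-one update that maps a chosen key vector to a chosen target while leaving orthogonal directions untouched. It is an illustration only: the report does not describe the update rule Aquin's weight editor uses, and published editing methods typically add a covariance weighting that is omitted here.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def rank_one_edit(w: Matrix, key: Vector, target: Vector) -> Matrix:
    """Return W + (target - W key) key^T / (key . key), so W' key = target."""
    residual = [t - y for t, y in zip(target, matvec(w, key))]
    norm_sq = sum(k * k for k in key)
    return [
        [wij + r * kj / norm_sq for wij, kj in zip(row, key)]
        for row, r in zip(w, residual)
    ]


w = [[1.0, 0.0], [0.0, 1.0]]
edited = rank_one_edit(w, key=[1.0, 0.0], target=[0.0, 2.0])
print(matvec(edited, [1.0, 0.0]))  # the key now maps to the target
print(matvec(edited, [0.0, 1.0]))  # an orthogonal input is unchanged
```

The locality shown in the second print is what makes such edits attractive for single associations, and also why correcting many failures this way scales poorly.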

Benchmark results

[Figure: final comparison, HumanEval vs HumanEval+]

[Figure: HumanEval pass@1 across pipeline stages]

All scores

Checkpoint | HumanEval | HumanEval+ | MBPP | Notes
Base (no tuning) | 46.3% | 44.5% | 35.4% | best-of-10, t=0.8
Stage 1 | 5.0% | 0.0% | 0.0% | quick eval, 20 problems
GRPO on Stage 4 | 25.0% | 15.0% | 30.0% | quick eval, policy collapse
Stage 4 (greedy) | 49.4% | 44.5% | - | full 164-problem eval
Stage 4 (best-of-10) | ~51.8% | 70.0% | ~35% | t=0.8, best result

Aquin tooling across the pipeline

Two systems were running throughout this pipeline. The Experimental SDK handled live training signals at every step: loss, grad norm, LR, dead layers, advantage variance, KL. The mechanistic interpretability tooling ran post-training on completed checkpoints, using SAE feature diffs, causal traces, the logit lens, and the weight editor to understand what had actually changed inside the model and why the benchmarks moved the way they did.

Without both, diagnosing a pipeline this complex would mean assembling custom gradient scripts, manual TensorBoard configurations, and blind post-hoc analysis of raw logs. Together they made each stage boundary a real decision point rather than a guess.

Signal · Stage · What it caught

Distribution mismatch · Stage 1

Metrics Chat diagnosed catastrophic forgetting from 46.3% to 5% not as a training failure but as a data format problem. Raw file completions vs instruction-following format. Saved retrying Stage 1 with different hyperparameters.

Clean dynamics confirm · Stage 1

Loss, grad norm, and LR all healthy throughout. Aquin confirmed the training run itself was correct, giving confidence that the dataset choice was the sole variable to change.

Saturation plateau · Stage 2

Eval loss plateau after epoch 1.5 identified as a dataset saturation signal, not a training defect. Recommendation to proceed to Stage 3 rather than waste compute on additional epochs.

SAE feature diff · Stage 2

Post-training SAE diff on the Stage 2 checkpoint showed heaviest rewrite at the middle layers, consistent with code-domain instruction tuning deepening factual retrieval circuits. The top shifted features at L8 were code-structure and pattern-completion associations. No unexpected safety or refusal features appeared in the diff, confirming the CodeFeedback dataset introduced no suppression artifacts.

Dead layer detection · Stage 2

Per-layer grad norm view confirmed uniform signal distribution across all 7 adapter matrices at r=128, validating that this config is well-matched to a 128k instruction dataset.

Advantage collapse · Stage 3

Decreasing advantage variance in epoch 2 identified as reward diversity exhaustion. With 4 generations on 542 problems, a growing fraction of prompts produced zero advantage and no gradient signal.

Deceptive reward climb · Stage 3

Reward climbed to 0.34 and mini-batch pass rate hit 100%, but Metrics Chat flagged the combination of bounded KL and collapsing advantage variance as a policy collapse pattern rather than genuine capability gain.

Causal trace on collapse · Stage 3

After the HumanEval regression from 51.8% to 25%, the causal trace was run on a sample of failing problems. The layer-level recovery signal had shifted: problems that previously resolved cleanly at the middle-layer MLP were now distributing causal load across later layers, a signature of representational drift. The logit lens confirmed the model was committing to wrong tokens earlier in the forward pass, before the usual retrieval layers had assembled enough context.

Fast loss collapse · Stage 4

0.64 to 0.10 in 234 steps identified as high-quality dataset signature rather than overfitting. Grad norm tapering confirmed clean adapter saturation.

SAE diff on greedy flatness · Stage 4

After greedy HumanEval+ stayed flat at 44.5% despite the fast SFT convergence, the SAE feature diff between Stage 2 and Stage 4 was run. The features that shifted most were already-active code-pattern features at the layers that had also shifted in Stage 2. The fine-tune had deepened the same circuits rather than broadening them. This mechanistically confirmed what the Metrics Chat had diagnosed from training signals alone: 941 examples from one model at one temperature teaches the adapters to sharpen existing strategies, not learn new ones.

Weight editor probe · Stage 4

The weight editor was used to probe where the model's code-correctness associations lived after Stage 4. Causal traces on specific failing HumanEval+ problems located the retrieval signal at layers 10 to 14. Logit projections of the top SAE features at those layers showed the model was promoting plausible-but-wrong completions with high confidence. Steering at negative strength on those features reduced confident wrong outputs, confirming the failure mode: the model had encoded solution patterns that activated on surface form rather than on the underlying algorithmic structure.

Zero-code integration · All stages

Single attach_sft() or attach_grpo() call per stage. No changes to training logic. Dashboard live from step one of each run. Mech interp tooling loaded from the same checkpoint paths.

The pattern across all four stages is consistent. The SDK's value is not just charting metrics that are already visible in training logs. It is the Metrics Chat's ability to interpret combinations of signals — healthy training dynamics alongside catastrophic eval regression in Stage 1, bounded KL alongside collapsing advantage variance in Stage 3, fast loss collapse alongside flat greedy scores in Stage 4 — and translate those combinations into specific diagnoses. The mechanistic tooling then goes one level deeper: it confirms or challenges those diagnoses inside the model's weights, giving you a causal account of the behavior rather than a correlation. That combination is what made a four-stage pipeline debuggable in a single session.

Demo videos

To see what these benchmark numbers actually mean in practice, we ran two tasks against three models: the base Llama 3.2 3B Instruct with no fine-tuning, the Stage 4 fine-tuned model, and GPT-OSS 20B. Each task was given the same prompt with no additional guidance, run to completion, and executed against tests.

Task one is a snake game. It is a stateful, event-driven program that requires pygame, collision logic, directional control, and a scoring loop. Task two is a Fibonacci app: a recursive function with memoization and a simple CLI interface. The first task tests general code generation on something outside the benchmark distribution. The second tests exactly the capability the fine-tuning was optimizing for.

The snake game results are worth pausing on. The base model failed to produce runnable code at all. GPT-OSS 20B produced a working game but missed a core rule: the snake does not lose when it touches the boundary. The fine-tuned 3B model, despite having roughly one-sixth of the parameters, produced a complete, correct implementation with all game rules intact. The fine-tuning had transferred to a task it had never seen in training.

Task 1 — Snake Game

Snake game — Llama 3.2 3B Instruct (base, no fine-tuning)

The raw base model on the snake game prompt. It failed to produce runnable code — not just a buggy game, nothing executable at all. This is where 46.3% HumanEval actually sits when asked to build something stateful.

Snake game — Stage 4 fine-tuned model

The Stage 4 model on the same prompt. Produced a complete, playable snake game with all rules correctly implemented, including boundary collision loss. The fine-tuning transferred to a task it never trained on directly.

Snake game — GPT-OSS 20B

The 20B model produced a working game but missed the boundary collision rule — the snake does not lose on wall contact. A larger model, a more obvious bug. The fine-tuned 3B got this right where the 20B did not.

Task 2 — Fibonacci App

Fibonacci app — Llama 3.2 3B Instruct (base, no fine-tuning)

Base model on a Fibonacci function with memoization and a simple CLI interface. This sits squarely in the HumanEval distribution — the kind of task the 46.3% baseline is measured on.

Fibonacci app — Stage 4 fine-tuned model

Stage 4 model on the same prompt. HumanEval+ went from 44.5% to 70.0% best-of-10 after this stage. In-distribution tasks like this are where the gain is most visible.

Fibonacci app — GPT-OSS 20B

The 20B model on the same prompt, for comparison against the fine-tuned 3B output.

Analysis

The HumanEval+ improvement from 44.5% to 70.0% under best-of-10 is a real finding. It means the pipeline meaningfully improved the model's ability to generate correct code, not just code that passes easy tests. The stricter HumanEval+ benchmark is designed to catch partial correctness, and a 25-point gain under sampling represents a genuine shift in what the model is capable of producing. The snake game demo makes this concrete in a way benchmarks alone cannot: a fine-tuned 3B model got the game rules right where a 20B model did not.

Aquin Labs · aquin@aquin.app
