The Sandbox
sandbox · lora fine-tuning · dataset editing · hyperparameter control · agent tuning · direct training


Aquin Labs · April 2026

Training from the dashboard, without writing a line of code

The standard fine-tuning workflow requires at minimum a training script, a dataset file, a hyperparameter configuration, and a machine with a GPU. Those are reasonable requirements when the training objective is well-understood and the data is already clean. They are significant barriers when the goal is exploratory: testing whether a small dataset can push a model toward a specific behavior, or seeing what happens to calibration when the rank is doubled, or asking whether the agent can generate and validate training data for a niche topic before you invest time writing it yourself.

The sandbox training system removes those barriers. It runs LoRA fine-tuning on Llama 3.2 1B Instruct directly from the Aquin dashboard: you edit the dataset in the browser, configure every hyperparameter through a panel with inline explanations, and start the run with a single button. The training executes on Aquin's GPU infrastructure. The full live monitoring pipeline kicks in automatically, so every step event, every signal, and the full post-training model diff and SAE feature diff arrive in the dashboard exactly as they would from an SDK-connected run.

The agent can participate in every part of that workflow. It reads the current dataset and configuration, generates new training examples directly, tunes hyperparameters based on what it knows about the task, and explains the reasoning behind its suggestions. The loop from question to configured run can happen entirely in the chat panel.

sandbox pipeline · from dataset to diff

Dataset (confirmed, versioned) → Config (rank, lr, scheduler…) → Run (GCP VM · LoRA on GPU) → Monitor (live SSE stream) → Diff (model + SAE + calibration)

the full monitoring stack runs automatically on every sandbox run. no additional setup.

AQUIN · EARLY ACCESS
Train your own model from the dashboard
join waitlist

Building and editing the training dataset

The sandbox uses instruction-response pairs as its training format, which maps directly to the template the training worker applies to each row before tokenization. Each pair becomes a structured string with an instruction block and a response block, passed to the tokenizer at the configured maximum sequence length. The format is fixed; the content is entirely yours.
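As a concrete sketch, the per-row preprocessing looks roughly like the following; the exact instruction and response markers are an assumption, not the sandbox's actual template.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

def format_pair(instruction: str, response: str) -> str:
    # One structured string per dataset row: an instruction block followed
    # by a response block. The markers here are illustrative only.
    return f"### Instruction:\n{instruction}\n\n### Response:\n{response}"

def tokenize_row(row: dict, max_seq_len: int = 128) -> dict:
    # Truncate to the configured maximum sequence length.
    return tokenizer(
        format_pair(row["instruction"], row["response"]),
        truncation=True,
        max_length=max_seq_len,
    )
```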

The dataset panel gives you six ways to populate and modify rows, summarised in the list below. You can edit directly in the inline table, upload a JSONL file, paste multiple JSONL lines at once and have the parser fill the rows automatically, or ask the agent to generate examples for a given topic. The inline editor lets you change any cell in any row at any time, and deletions take effect immediately without a separate save step.

Version tracking is built into the confirmation flow. When you confirm a dataset, it receives a version number. That number is stamped against the run when training starts, so the monitoring panel can always tell you exactly which version of the dataset produced which run. If you edit the dataset after a completed run, the badge changes to flag the mismatch between the confirmed version and the version used in the last run, so you know a new run is needed before the dataset change takes effect in the model.

dataset ingestion paths

01
Start with demo data

The sandbox initialises with 25 general-knowledge instruction-response pairs. Enough to confirm the pipeline runs cleanly before you bring your own data.

02
Edit inline

Open the dataset editor to add, delete, or rewrite individual rows directly in the dashboard. Each field is an editable textarea. No file handling.

03
Upload .jsonl

Drop a JSONL file where each line is a JSON object with instruction and response keys (see the example after this list). The parser silently skips malformed lines and appends only valid pairs.

04
Paste multiple lines

Paste multiple JSONL lines directly into the editor. The system detects newline-separated JSON and fills each row automatically, without requiring a file.

05
Agent generation

Ask the agent to generate rows for a topic and count. It calls the generation endpoint server-side, appends the rows directly to the dataset, and you see them appear live without any dialog.

06
Confirm and version

Confirming the dataset stamps a version number. The version badge tracks which version was used for the last run and flags when the dataset has changed since then.
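For the upload and paste paths (items 03 and 04), each line is a standalone JSON object. A minimal sketch of the expected format and of a lenient parser that skips malformed lines, in the spirit of the behaviour described above:

```python
import json

raw = """\
{"instruction": "What is LoRA?", "response": "A parameter-efficient fine-tuning method that trains small low-rank adapters."}
{"instruction": "What does gradient accumulation do?", "response": "It sums gradients over several steps to simulate a larger batch."}
not valid json, silently dropped
"""

rows = []
for line in raw.splitlines():
    try:
        obj = json.loads(line)
    except json.JSONDecodeError:
        continue  # malformed lines are skipped
    if "instruction" in obj and "response" in obj:
        rows.append({"instruction": obj["instruction"], "response": obj["response"]})

print(len(rows))  # 2 valid pairs appended
```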

version tracking · confirm stamps a version, badge reflects mismatch

v1 · 25 rows · general knowledge · used in run 1
v2 · 40 rows · added 15 rows on reasoning · used in run 2
v3 · 40 rows · edited 3 rows · pending · not yet used

version stamps on confirm. badge shows whether the current dataset matches what the last run used.

Full hyperparameter control

Every configurable parameter in the LoRA training loop is exposed in the algo panel. The panel groups parameters into three sections: LoRA core settings that control the adapter architecture, optimization settings that control how the adapter is trained, and batching settings that control how data is fed to the optimizer. Each parameter has a valid range enforced at the input, a current default, and a tooltip that explains what it does and when to change it.

The target modules selector is worth calling out specifically. It controls which attention projections receive LoRA adapters. The default is Q and V, which is the standard configuration for most instruction-tuning tasks. Adding K and O lets the adapter reshape attention patterns more completely. Adding the MLP projections (gate, up, down) extends adaptation into the feed-forward layers, which is useful when the task requires changes to factual associations rather than just generation style. The panel renders each projection as a toggle, so the selection is visual and does not require knowing the projection key names.
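As a rough sketch of what that selection maps to under the hood, here is the equivalent adapter configuration in Hugging Face PEFT notation, using the standard Llama projection names; the sandbox's internal code may differ.

```python
from peft import LoraConfig

# Default selection: query and value projections only.
default_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
)

# Wider coverage: all attention projections plus the MLP, for tasks that
# need changes to factual associations rather than just generation style.
wide_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```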

Gradient accumulation deserves its own note. The training worker processes each dataset row as its own forward pass, so the real batch size is always one. Gradient accumulation simulates a larger batch by accumulating gradients across N forward passes before each optimizer step. Setting accum steps to 4 on a 40-row dataset produces the same optimizer updates as a batch size of 4 would. The panel displays the effective batch size derived from the accum steps value, so the relationship is visible without calculation.
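A minimal PyTorch-style sketch of that mechanism, assuming a model, optimizer, scheduler, and dataloader are already set up and each batch includes labels:

```python
import torch

accum_steps = 4  # effective batch size of 4 with a per-step batch size of 1

optimizer.zero_grad()
for step, batch in enumerate(dataloader):
    # Scale the loss so the accumulated gradient matches a true batch of accum_steps.
    loss = model(**batch).loss / accum_steps
    loss.backward()  # gradients accumulate in the parameters' .grad buffers
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # grad_clip
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```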

presets · fast / balanced / quality

Fast
rank 4 · alpha 8 · lr 3e-4 · epochs 1 · modules q+v · accum 1 · seq len 128

Low rank, no regularization, one epoch. Right for rapid iteration where you want to confirm data format and pipeline before committing to a longer run.

Balanced
rank 8 · alpha 16 · lr 2e-4 · epochs 1 · modules q+v · accum 1 · seq len 128

Default settings. Good quality-to-speed tradeoff for most instruction-tuning tasks on datasets of 20 to 200 examples.

Quality
rank 16 · alpha 32 · lr 1e-4 · epochs 3 · modules q+k+v+o · accum 4 · seq len 256

Higher rank, warmup, more target modules, longer sequences, gradient accumulation for a larger effective batch. Slower but more thorough adaptation. Use when data quality is high and overfitting is not a concern.

presets apply a coherent configuration across all parameters at once. individual fields remain editable after applying a preset.
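In config terms, applying a preset amounts to a bulk update over the current values, roughly like the sketch below; the values are taken from the presets above, while the key names and payload shape are illustrative.

```python
PRESETS = {
    "fast":     {"rank": 4,  "alpha": 8,  "learning_rate": 3e-4, "epochs": 1,
                 "target_modules": ["q", "v"], "grad_accum_steps": 1, "max_seq_len": 128},
    "balanced": {"rank": 8,  "alpha": 16, "learning_rate": 2e-4, "epochs": 1,
                 "target_modules": ["q", "v"], "grad_accum_steps": 1, "max_seq_len": 128},
    "quality":  {"rank": 16, "alpha": 32, "learning_rate": 1e-4, "epochs": 3,
                 "target_modules": ["q", "k", "v", "o"], "grad_accum_steps": 4, "max_seq_len": 256},
}

def apply_preset(config: dict, name: str) -> dict:
    # Preset values overwrite the matching fields; everything else is kept,
    # and individual fields remain editable afterwards.
    return {**config, **PRESETS[name]}
```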

full parameter reference · key · range · default · description

LoRA core
rank · 1–64 · 8 · Adapter rank. Higher rank means more trainable parameters and more expressive adaptation, at the cost of memory and compute.
alpha · 1–128 · 16 · Scaling factor applied to the adapter output. Typically set to 2× rank. Controls how strongly the adaptation is applied at merge time.
dropout · 0–0.5 · 0.05 · Dropout applied to LoRA layers during training. Adds stochasticity, reduces overfitting on small datasets.
target_modules · selection · q, v · Which attention projections receive adapters. Q and V cover most fine-tuning needs. Adding K, O, and MLP projections increases adapter expressiveness and parameter count.
Optimization
learning_rate · 1e-6–1e-2 · 2e-4 · Step size for the optimizer. The most important hyperparameter for fine-tuning stability. Too high causes divergence; too low causes a plateau without convergence.
epochs · 1–20 · 1 · Number of full passes through the training dataset. More epochs increase the risk of overfitting on small datasets.
warmup_steps · 0–500 · 0 · Linear learning rate warmup before the main schedule takes over. Stabilises early training at high learning rates.
grad_clip · 0.1–10 · 1.0 · Maximum gradient norm. Values above this are rescaled before the optimizer step. Prevents instability from gradient spikes.
weight_decay · 0–0.5 · 0.01 · L2 regularization applied through the optimizer. Penalises large weight magnitudes, reducing overfitting.
optimizer · selection · adamw · AdamW is the default. SGD uses momentum without adaptive learning rates. Lion is more memory-efficient and often matches AdamW quality.
scheduler · selection · cosine · Learning rate decay curve after warmup. Cosine is smooth and well-suited to transformer fine-tuning. Linear decays uniformly. Constant holds the LR fixed.
Batching
grad_accum_steps · 1–64 · 1 · Number of forward passes before each optimizer step. Simulates a larger effective batch size without increasing memory usage per step.
max_seq_len · 32–2048 · 128 · Tokenizer truncation length. Longer sequences capture more context per training example but increase VRAM usage linearly.
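Collected from the defaults column above, a full starting configuration looks roughly like this; the key names follow the table, while the exact payload shape is an assumption.

```python
DEFAULTS = {
    "rank": 8, "alpha": 16, "dropout": 0.05, "target_modules": ["q", "v"],
    "learning_rate": 2e-4, "epochs": 1, "warmup_steps": 0, "grad_clip": 1.0,
    "weight_decay": 0.01, "optimizer": "adamw", "scheduler": "cosine",
    "grad_accum_steps": 1, "max_seq_len": 128,
}
```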

The agent as a training co-pilot

The training agent has read and write access to both the dataset and the hyperparameter configuration before the run starts. It can read the current rows, generate new ones and append them immediately, restructure the dataset, and patch any subset of hyperparameters in a single call. Those changes take effect in the UI the moment the tool call resolves, so you can watch the dataset panel update and the algo config panel reflect new values as the agent works.

The practical workflow this enables is conversational configuration. You can describe the task, the domain, and roughly how many examples you want, and the agent will generate a coherent dataset and suggest a configuration tuned for that task size and domain complexity. For a 30-row factual reinforcement task, it will likely suggest a low rank, low learning rate, and no dropout. For a 100-row instruction-following task with diverse phrasing, it might suggest adding K and O to the target modules, increasing gradient accumulation, and setting a warmup step count to stabilise early training.

The agent does not start training runs autonomously. It prepares the configuration and explains its reasoning, and the run starts when you press the button. That boundary is intentional: the preparation is something the agent can do with high confidence based on the task description, but the decision to commit GPU time is yours. After a run completes, the agent can read the resulting metrics from the training snapshot, evaluate whether the configuration produced healthy training dynamics, and suggest what to change for the next run. The loop from task description to configured run to post-run analysis to next-run configuration is entirely conversational.

agent tool surface · sandbox mode

dataset_get() · Read all current training rows. The agent calls this before editing so it knows what is already there.
dataset_generate(topic, n) · Generate n instruction-response pairs for a topic server-side and append them to the dataset immediately. User sees the rows appear live.
dataset_set(rows) · Replace the dataset with a provided row array. Used after filtering, restructuring, or manual rewrites.
algo_get_config() · Read all current hyperparameter values: rank, alpha, lr, epochs, dropout, target_modules, warmup_steps, grad_clip, weight_decay, grad_accum_steps, optimizer, scheduler, max_seq_len.
algo_set_config(patch) · Update one or more hyperparameters by name. Only the provided fields change. Takes effect immediately in the UI. Cannot be called while training is running.

algo_set_config accepts a partial patch. only the provided fields change. unspecified fields keep their current values.
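As an illustration of the patch semantics, here is roughly what a short agent tool-call sequence might look like; the tool names come from the table above, while the payload shapes and values are assumptions.

```python
calls = [
    {"tool": "dataset_get", "args": {}},                                          # read what is already there
    {"tool": "dataset_generate", "args": {"topic": "unit conversion", "n": 15}},  # append 15 generated rows
    {"tool": "algo_set_config", "args": {"patch": {                               # partial patch: only these fields change
        "rank": 16,
        "target_modules": ["q", "k", "v", "o"],
        "warmup_steps": 20,
    }}},
]
```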

The Sandbox versus SDK mode

The training tab offers two modes: sandbox, which runs from the dashboard and uses Aquin's infrastructure, and SDK, which connects to your own training process via a Python SDK and an API key. The two modes are distinct but feed into the same monitoring system. A sandbox run and an SDK-connected run produce the same step events, the same model diff, the same SAE feature diff, and the same calibration panel.

The practical division is by use case. The Sandbox is for exploration: small datasets, quick iteration, tasks where you want to see a result without setting up infrastructure. SDK is for production: large datasets, custom training loops with framework-specific optimizations, distributed training, or runs that need to execute on hardware you control. The two modes coexist in the same tab; switching is a single selection at the start of the session.

The agent works in both modes, though with different tool availability. In sandbox mode it has access to the full dataset and configuration tool surface. In SDK mode it can still read training metrics and interact with the monitoring panels, but dataset and config tools are unavailable since those are controlled by your training script. The mode determines which tools are active; the agent adapts accordingly.

What happens after you start a run

When a sandbox run starts, the frontend opens a streaming connection to the training API. The training worker runs in a thread on Aquin's GPU infrastructure and emits step events as the run progresses. Those events flow through the same ingestion pipeline that SDK runs use: each step event triggers the signal engine, updates the loss and gradient charts, and feeds the signals panel. There is no polling and no page refresh. The dashboard updates on each event, typically within a few hundred milliseconds of the training step completing.

The monitoring panel locks dataset and config editing while training is running. The version badge is frozen at the version used to start the run. The hyperparameter panel greys out all inputs. The intent is to preserve a clear record of what produced the run: the dataset version and configuration that were active when you pressed the button are exactly what the training worker received, and nothing can be changed mid-run to create ambiguity about that.

When the run completes, the model diff arrives as a streaming event and renders in the monitoring panel. The SAE feature diff follows, then the calibration panel. Those three together give you the behavioral comparison, the internal representation changes, and the confidence calibration shift that your dataset and configuration produced. The agent can read all three from the training snapshot and narrate what they mean for the next iteration, closing the loop from run to diagnosis to adjusted configuration.
