
The Data Inspection System
Aquin Labs · April 2026
What is actually in your training data
Most dataset investigations stop at the sample: pull a hundred rows, eyeball the distribution, move on. What that misses is the systematic: the 6% of rows carrying toxicity signals, the SSNs concentrated in one column, the 33% of data that traces back to model-generated sources. Aquin's data inspection system runs a full eight-module analysis on any dataset you can load, at the row and column level, and produces a complete audit trail of everything it found.
Load a dataset from HuggingFace by ID, or upload a CSV, JSONL, or Parquet file directly. The analysis stack covers toxicity, PII, synthetic detection, liability chain tracing, bias, copyright, text quality, and overall quality scoring. Every finding cites specific rows and columns. Nothing is inferred from samples.
Ingestion
Four input paths: HuggingFace datasets by repo ID (public or gated), plus direct upload of CSV, JSONL, or Parquet files. Every ingestion records source, timestamp, and a file hash as the first entry in the audit trail.
ingestion sources → parse → audit
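The first audit entry can be sketched in a few lines. This is a minimal illustration, not the system's actual code; the function name and field names are assumptions.

```python
import hashlib
from datetime import datetime, timezone

def ingestion_entry(data: bytes, source: str) -> dict:
    # First audit-trail entry: source, timestamp, and content hash.
    return {
        "event": "ingestion",
        "source": source,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Hashing the raw bytes at ingestion means every later finding can be tied to exactly one version of the file.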
Eight analysis modules
Each module answers a question the others cannot. They chain: a high near-duplicate rate triggers synthetic detection; elevated PII density triggers liability chain tracing on those columns. Or you run them directly, starting with the question you have and letting the findings determine what follows.
analysis pipeline
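The chaining logic reduces to a small planning step: findings from one module decide which modules run next. A minimal sketch, with hypothetical finding keys and thresholds:

```python
def plan_modules(findings: dict) -> list[str]:
    # Earlier findings trigger follow-up modules (thresholds are illustrative).
    plan = []
    if findings.get("near_duplicate_rate", 0.0) > 0.05:
        plan.append("synthetic_detection")      # duplicate clusters suggest augmentation
    if findings.get("pii_entities_per_100_rows", 0.0) > 2.0:
        plan.append("liability_chain")          # dense PII warrants provenance tracing
    return plan
```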
Text quality
Text quality runs first because its findings set context for everything else. Language distribution, exact and near-duplicate detection, license resolution (dataset-level and inline columns), jurisdiction inference from URL domains, topic classification, and opt-out registry cross-referencing for web-sourced datasets. A near-duplicate rate of 11.2% on 5,000 rows is a signal: those clusters almost always trace back to synthetic augmentation pipelines.
language · topic · medical_qa_v2.parquet · 5,000 rows
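Near-duplicate detection at this scale is typically done with shingled text and a similarity threshold. A toy version of the idea, using word 3-grams and Jaccard similarity (the production system's method is not specified here):

```python
def shingles(text: str, k: int = 3) -> set:
    # Word k-grams as a cheap fingerprint of a row.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_dup_rate(rows: list[str], threshold: float = 0.8) -> float:
    # Fraction of rows belonging to at least one near-duplicate pair.
    sigs = [shingles(r) for r in rows]
    dup = set()
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            if jaccard(sigs[i], sigs[j]) >= threshold:
                dup.update((i, j))
    return len(dup) / len(rows) if rows else 0.0
```

The O(n²) pairwise loop is fine for a sketch; at 5,000+ rows a real pipeline would use MinHash or similar locality-sensitive hashing.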
Toxicity
Six categories scored per row and aggregated per column: toxicity, severe toxicity, obscenity, threat, insult, and identity attack. The radar shows which categories drive the signal; the table shows where across columns that risk concentrates. A column where 6.2% of rows are flagged with a peak score of 0.94 is a different problem than one where 0.2% are flagged at 0.51.
toxicity · radar + column summary + flagged rows
flag threshold 0.5 · 367 flagged across all columns
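The per-column aggregation behind that table is straightforward once per-row scores exist. A minimal sketch, assuming one score per row per category:

```python
def column_summary(scores: list[float], threshold: float = 0.5) -> dict:
    # Aggregate one column's per-row toxicity scores.
    flagged = [s for s in scores if s >= threshold]
    return {
        "flagged_rows": len(flagged),
        "flag_rate": len(flagged) / len(scores),
        "peak": max(scores),
    }
```

The flag rate and the peak score together distinguish a broadly noisy column from one with a few severe outliers.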
PII detection
Named entity recognition across every text column: SSNs, emails, phone numbers, addresses, financial identifiers, person names, health data references. Each entity is tiered from critical to low. PII rarely distributes evenly. The text column here carries 5.1 entities per 100 rows at critical risk, while context carries 0.9 at medium. Those are different remediation problems.
PII · entity type breakdown + column risk
343 rows · 759 entities · overall risk: critical
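The tiering and per-100-rows density can be illustrated with a stripped-down scanner. The patterns and tier names below are assumptions for illustration; a real detector uses NER, not two regexes:

```python
import re

# Hypothetical patterns and tiers, for illustration only.
PATTERNS = {
    "ssn":   (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "critical"),
    "email": (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "medium"),
}

def scan_column(rows: list[str]) -> tuple[list[dict], float]:
    # Returns detected entities and the density per 100 rows.
    entities = []
    for i, text in enumerate(rows):
        for kind, (pat, tier) in PATTERNS.items():
            for _ in pat.finditer(text):
                entities.append({"row": i, "type": kind, "tier": tier})
    return entities, 100 * len(entities) / len(rows)
```

Tracking density per column, not just a dataset-wide count, is what makes the two remediation problems in the example visible as two problems.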
Synthetic detection
Row-level scoring across four confidence buckets: human, uncertain, likely synthetic, and synthetic. The distribution matters. A dataset that is 12% high-confidence synthetic with 21% uncertain is a different situation from one that is 33% uniformly likely-synthetic. The distribution is the finding.
synthetic · row distribution
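Bucketing per-row scores into the four bands is a one-liner plus a counter. The cut points below are assumed, not the system's actual thresholds:

```python
from collections import Counter

def bucket(score: float) -> str:
    # Illustrative cut points for the four confidence bands.
    if score < 0.25:
        return "human"
    if score < 0.50:
        return "uncertain"
    if score < 0.75:
        return "likely_synthetic"
    return "synthetic"

def distribution(scores: list[float]) -> Counter:
    return Counter(bucket(s) for s in scores)
```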
Liability chain
Synthetic data rarely arrives in one step. The liability chain module traces provenance recursively for rows above the detection threshold: directly generated, paraphrased, translated, or some combination. Each step gets a confidence rating. A three-step chain at high confidence scores higher than a two-step chain with uncertain steps, meaning the structure is the finding, not just the terminal number.
liability chains · high-liability sample
2 deep chains (depth 3+) · avg liability 0.61
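One way to make "structure is the finding" concrete is to score a chain from both its depth and its per-step confidence. This is a hypothetical scoring function, not the module's actual formula:

```python
# Assumed confidence weights per provenance step.
CONF = {"high": 1.0, "medium": 0.6, "uncertain": 0.3}

def chain_liability(steps: list[tuple[str, str]]) -> float:
    # steps: (transform, confidence) pairs, e.g. ("paraphrased", "high").
    if not steps:
        return 0.0
    depth_weight = min(1.0, len(steps) / 3)  # deeper chains weigh more, capped at 3
    avg_conf = sum(CONF[c] for _, c in steps) / len(steps)
    return depth_weight * avg_conf
```

Under this scheme a three-step chain at high confidence outranks a two-step chain of uncertain steps, matching the behavior described above.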
Bias detection
Imbalance across four protected attribute axes: gender, age, geography, and sentiment. Each is shown as deviation from the 50% midpoint, with direction explicit. A 72% US-origin dataset is not automatically broken, but it is information that should inform how the resulting model is evaluated and where it is deployed.
bias · attribute skew · diverging from center
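The diverging-bar view above reduces to one number per axis: the dominant value's signed deviation from the 50% midpoint. A minimal sketch, with hypothetical input shape:

```python
def axis_skew(counts: dict[str, int]) -> tuple[str, float]:
    # Dominant value and its signed deviation from the 50% midpoint.
    total = sum(counts.values())
    top, n = max(counts.items(), key=lambda kv: kv[1])
    return top, n / total - 0.5
```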
Dataset quality scoring
Four dimensions scored independently: completeness, consistency, dedup quality, and label fidelity, rolled up into an overall grade. Each dimension is visible separately so you can see exactly which axis is pulling the overall score down. Consistency and dedup quality are the weak points in this dataset.
quality dimensions
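The roll-up can be sketched as a weighted sum over the four dimensions. The weights here are assumptions, not the system's published weighting:

```python
# Illustrative weights; the real weighting is not specified in this document.
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "dedup_quality": 0.25,
    "label_fidelity": 0.20,
}

def overall_grade(dims: dict[str, float]) -> float:
    # Each dimension scored in [0, 1]; the overall grade is their weighted sum.
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)
```

Keeping the dimensions separate from the roll-up is what lets a weak consistency score stay visible instead of being averaged away.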
Compliance framework coverage
The analysis modules map directly onto three frameworks: EU AI Act Articles 10, 11, and 12; India's DPDPA; and the NIST AI RMF. The compliance view shows which requirements each module satisfies, which are partial, and which are not addressed. This is not a certification, but a map from findings to framework requirements, so your legal team has specifics rather than ambiguity.
compliance coverage · EU AI Act · India DPDPA · NIST AI RMF
Full coverage. Each article maps to a specific module output.
PII and consent covered. Data minimization flagged as partial.
Full alignment with GOVERN and MAP function data requirements.
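A findings-to-framework map is, at bottom, a lookup table with a status per requirement. The entries below are illustrative placeholders echoing the coverage described above, not a complete or authoritative mapping:

```python
# Hypothetical coverage table: (framework, requirement) -> (module, status).
COVERAGE = {
    ("EU AI Act", "Article 10"): ("text_quality", "full"),
    ("EU AI Act", "Article 12"): ("audit_trail", "full"),
    ("India DPDPA", "data minimization"): ("pii_detection", "partial"),
    ("NIST AI RMF", "MAP"): ("bias_detection", "full"),
}

def gaps(coverage: dict) -> list[tuple[str, str]]:
    # Requirements that are only partially covered or not addressed.
    return [req for req, (_mod, status) in coverage.items() if status != "full"]
```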
The audit trail
Every operation is logged: ingestion, each module run, every finding, every threshold applied. The trail is sealed with a SHA-256 hash when the session ends. One-click export produces a structured JSON with the full machine-readable log or a formatted PDF with findings and the sealed trail.
SHA-256 sealed · exportable as PDF or JSON
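Sealing and exporting can be sketched by hashing a canonical serialization of the trail. Function names and export shape are assumptions; only the SHA-256 seal and JSON export are stated above:

```python
import hashlib
import json

def seal(trail: list[dict]) -> str:
    # Hash a canonical JSON serialization so the seal is reproducible.
    blob = json.dumps(trail, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()

def export_json(trail: list[dict]) -> str:
    # Machine-readable export: the full log plus its seal.
    return json.dumps({"audit_trail": trail, "seal": seal(trail)}, indent=2)
```

Canonical serialization (sorted keys, fixed separators) matters: without it, two exports of the same log could hash differently.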
Not sure if Aquin is right for you?
Aquin
