
The Data Inspection System
Aquin Labs · April 2026
What is actually in your training data
Most dataset investigations stop at the sample: pull a hundred rows, eyeball the distribution, move on. What that misses is the systematic: the 6% of rows carrying toxicity signals, the SSNs concentrated in one column, the 33% of data that traces back to model-generated sources. Aquin's data inspection system runs a full eight-module analysis on any dataset you can load, at the row and column level, and produces a complete audit trail of everything it found.
Load a dataset from HuggingFace by ID, or upload a CSV, JSONL, or Parquet file directly. The analysis stack covers toxicity, PII, synthetic detection, liability chain tracing, bias, copyright, text quality, and overall quality scoring. Every finding cites specific rows and columns. Nothing is inferred from samples.
Ingestion
Four input paths: HuggingFace datasets by repo ID (public or gated), plus direct upload of CSV, JSONL, or Parquet files. Every ingestion records source, timestamp, and a file hash as the first entry in the audit trail.
ingestion sources → parse → audit
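The first audit entry can be sketched in a few lines. This is a minimal illustration, not the system's actual code; the function name and field names are assumptions.

```python
import hashlib
from datetime import datetime, timezone

def ingestion_entry(data: bytes, source: str) -> dict:
    # First audit-trail entry: source, timestamp, and content hash.
    return {
        "event": "ingestion",
        "source": source,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(data).hexdigest(),
    }
```

Hashing the raw bytes at ingestion means every later finding can be tied to exactly one version of the file.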
Eight analysis modules
Each module answers a question the others cannot. They chain: a high near-duplicate rate triggers synthetic detection; elevated PII density triggers liability chain tracing on those columns. Or you run them directly, starting with the question you have and letting the findings determine what follows.
analysis pipeline
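The chaining logic reduces to a small planning step: findings from one module decide which modules run next. A minimal sketch, with hypothetical finding keys and thresholds:

```python
def plan_modules(findings: dict) -> list[str]:
    # Earlier findings trigger follow-up modules (thresholds are illustrative).
    plan = []
    if findings.get("near_duplicate_rate", 0.0) > 0.05:
        plan.append("synthetic_detection")      # duplicate clusters suggest augmentation
    if findings.get("pii_entities_per_100_rows", 0.0) > 2.0:
        plan.append("liability_chain")          # dense PII warrants provenance tracing
    return plan
```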
Text quality
Text quality runs first because its findings set context for everything else. Language distribution, exact and near-duplicate detection, license resolution (dataset-level and inline columns), jurisdiction inference from URL domains, topic classification, and opt-out registry cross-referencing for web-sourced datasets. A near-duplicate rate of 11.2% on 5,000 rows is a signal: those clusters almost always trace back to synthetic augmentation pipelines.
language · topic · medical_qa_v2.parquet · 5,000 rows
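Near-duplicate detection at this scale is typically done with shingled text and a similarity threshold. A toy version of the idea, using word 3-grams and Jaccard similarity (the production system's method is not specified here):

```python
def shingles(text: str, k: int = 3) -> set:
    # Word k-grams as a cheap fingerprint of a row.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def near_dup_rate(rows: list[str], threshold: float = 0.8) -> float:
    # Fraction of rows belonging to at least one near-duplicate pair.
    sigs = [shingles(r) for r in rows]
    dup = set()
    for i in range(len(sigs)):
        for j in range(i + 1, len(sigs)):
            if jaccard(sigs[i], sigs[j]) >= threshold:
                dup.update((i, j))
    return len(dup) / len(rows) if rows else 0.0
```

The O(n²) pairwise loop is fine for a sketch; at 5,000+ rows a real pipeline would use MinHash or similar locality-sensitive hashing.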
Toxicity
Six categories scored per row and aggregated per column: toxicity, severe toxicity, obscenity, threat, insult, and identity attack. The radar shows which categories drive the signal; the table shows where across columns that risk concentrates. A column where 6.2% of rows are flagged with a peak score of 0.94 is a different problem than one where 0.2% are flagged at 0.51.
toxicity · radar + column summary + flagged rows
flag threshold 0.5 · 367 flagged across all columns
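The per-column aggregation behind that table is straightforward once per-row scores exist. A minimal sketch, assuming one score per row per category:

```python
def column_summary(scores: list[float], threshold: float = 0.5) -> dict:
    # Aggregate one column's per-row toxicity scores.
    flagged = [s for s in scores if s >= threshold]
    return {
        "flagged_rows": len(flagged),
        "flag_rate": len(flagged) / len(scores),
        "peak": max(scores),
    }
```

The flag rate and the peak score together distinguish a broadly noisy column from one with a few severe outliers.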
PII detection
Named entity recognition across every text column: SSNs, emails, phone numbers, addresses, financial identifiers, person names, health data references. Each entity is tiered from critical to low. PII rarely distributes evenly. The text column here carries 5.1 entities per 100 rows at critical risk, while context carries 0.9 at medium. Those are different remediation problems.
PII · entity type breakdown + column risk
343 rows · 759 entities · overall risk: critical
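The tiering and per-100-rows density can be illustrated with a stripped-down scanner. The patterns and tier names below are assumptions for illustration; a real detector uses NER, not two regexes:

```python
import re

# Hypothetical patterns and tiers, for illustration only.
PATTERNS = {
    "ssn":   (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "critical"),
    "email": (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "medium"),
}

def scan_column(rows: list[str]) -> tuple[list[dict], float]:
    # Returns detected entities and the density per 100 rows.
    entities = []
    for i, text in enumerate(rows):
        for kind, (pat, tier) in PATTERNS.items():
            for _ in pat.finditer(text):
                entities.append({"row": i, "type": kind, "tier": tier})
    return entities, 100 * len(entities) / len(rows)
```

Tracking density per column, not just a dataset-wide count, is what makes the two remediation problems in the example visible as two problems.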
Synthetic detection
Row-level scoring across four confidence buckets: human, uncertain, likely synthetic, and synthetic. The distribution matters. A dataset that is 12% high-confidence synthetic with 21% uncertain is a different situation from one that is 33% uniformly likely-synthetic. The distribution is the finding.
synthetic · row distribution
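Bucketing per-row scores into the four bands is a one-liner plus a counter. The cut points below are assumed, not the system's actual thresholds:

```python
from collections import Counter

def bucket(score: float) -> str:
    # Illustrative cut points for the four confidence bands.
    if score < 0.25:
        return "human"
    if score < 0.50:
        return "uncertain"
    if score < 0.75:
        return "likely_synthetic"
    return "synthetic"

def distribution(scores: list[float]) -> Counter:
    return Counter(bucket(s) for s in scores)
```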
Liability chain
Synthetic data rarely arrives in one step. The liability chain module traces provenance recursively for rows above the detection threshold: directly generated, paraphrased, translated, or some combination. Each step gets a confidence rating. A three-step chain at high confidence scores higher than a two-step chain with uncertain steps, meaning the structure is the finding, not just the terminal number.
liability chains · high-liability sample
2 deep chains (depth 3+) · avg liability 0.61
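One way to make "structure is the finding" concrete is to score a chain from both its depth and its per-step confidence. This is a hypothetical scoring function, not the module's actual formula:

```python
# Assumed confidence weights per provenance step.
CONF = {"high": 1.0, "medium": 0.6, "uncertain": 0.3}

def chain_liability(steps: list[tuple[str, str]]) -> float:
    # steps: (transform, confidence) pairs, e.g. ("paraphrased", "high").
    if not steps:
        return 0.0
    depth_weight = min(1.0, len(steps) / 3)  # deeper chains weigh more, capped at 3
    avg_conf = sum(CONF[c] for _, c in steps) / len(steps)
    return depth_weight * avg_conf
```

Under this scheme a three-step chain at high confidence outranks a two-step chain of uncertain steps, matching the behavior described above.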
Bias detection
Imbalance across four protected attribute axes: gender, age, geography, and sentiment. Each is shown as deviation from the 50% midpoint, with direction explicit. A 72% US-origin dataset is not automatically broken, but it is information that should inform how the resulting model is evaluated and where it is deployed.
bias · attribute skew · diverging from center
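The diverging-bar view above reduces to one number per axis: the dominant value's signed deviation from the 50% midpoint. A minimal sketch, with hypothetical input shape:

```python
def axis_skew(counts: dict[str, int]) -> tuple[str, float]:
    # Dominant value and its signed deviation from the 50% midpoint.
    total = sum(counts.values())
    top, n = max(counts.items(), key=lambda kv: kv[1])
    return top, n / total - 0.5
```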
Dataset quality scoring
Four dimensions scored independently: completeness, consistency, dedup quality, and label fidelity, rolled up into an overall grade. Each dimension is visible separately so you can see exactly which axis is pulling the overall score down. Consistency and dedup quality are the weak points in this dataset.
quality dimensions
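The roll-up can be sketched as a weighted sum over the four dimensions. The weights here are assumptions, not the system's published weighting:

```python
# Illustrative weights; the real weighting is not specified in this document.
WEIGHTS = {
    "completeness": 0.30,
    "consistency": 0.25,
    "dedup_quality": 0.25,
    "label_fidelity": 0.20,
}

def overall_grade(dims: dict[str, float]) -> float:
    # Each dimension scored in [0, 1]; the overall grade is their weighted sum.
    return sum(WEIGHTS[k] * dims[k] for k in WEIGHTS)
```

Keeping the dimensions separate from the roll-up is what lets a weak consistency score stay visible instead of being averaged away.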
Compliance framework coverage
The analysis modules map directly onto three frameworks: EU AI Act Articles 10, 11, and 12; India's DPDPA; and the NIST AI RMF. The compliance view shows which requirements each module satisfies, which are partial, and which are not addressed. This is not a certification, but a map from findings to framework requirements, so your legal team has specifics rather than ambiguity.
compliance coverage · EU AI Act · India DPDPA · NIST AI RMF
Full coverage. Each article maps to a specific module output.
PII and consent covered. Data minimization flagged as partial.
Full alignment with GOVERN and MAP function data requirements.
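A findings-to-framework map is, at bottom, a lookup table with a status per requirement. The entries below are illustrative placeholders echoing the coverage described above, not a complete or authoritative mapping:

```python
# Hypothetical coverage table: (framework, requirement) -> (module, status).
COVERAGE = {
    ("EU AI Act", "Article 10"): ("text_quality", "full"),
    ("EU AI Act", "Article 12"): ("audit_trail", "full"),
    ("India DPDPA", "data minimization"): ("pii_detection", "partial"),
    ("NIST AI RMF", "MAP"): ("bias_detection", "full"),
}

def gaps(coverage: dict) -> list[tuple[str, str]]:
    # Requirements that are only partially covered or not addressed.
    return [req for req, (_mod, status) in coverage.items() if status != "full"]
```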
The audit trail
Every operation is logged: ingestion, each module run, every finding, every threshold applied. The trail is sealed with a SHA-256 hash when the session ends. One-click export produces a structured JSON with the full machine-readable log or a formatted PDF with findings and the sealed trail.
SHA-256 sealed · exportable as PDF or JSON
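Sealing and exporting can be sketched by hashing a canonical serialization of the trail. Function names and export shape are assumptions; only the SHA-256 seal and JSON export are stated above:

```python
import hashlib
import json

def seal(trail: list[dict]) -> str:
    # Hash a canonical JSON serialization so the seal is reproducible.
    blob = json.dumps(trail, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()

def export_json(trail: list[dict]) -> str:
    # Machine-readable export: the full log plus its seal.
    return json.dumps({"audit_trail": trail, "seal": seal(trail)}, indent=2)
```

Canonical serialization (sorted keys, fixed separators) matters: without it, two exports of the same log could hash differently.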
Not sure if Aquin is right for you?
Aquin
