The Data Inspection System


Aquin Labs · April 2026

What is actually in your training data

Most dataset investigations stop at the sample: pull a hundred rows, eyeball the distribution, move on. What that misses is the systematic: the 6% of rows carrying toxicity signals, the SSNs concentrated in one column, the 33% of data that traces back to model-generated sources. Aquin's data inspection system runs a full eight-module analysis on any dataset you can load, at the row and column level, and produces a complete audit trail of everything it found.

Load a dataset from HuggingFace by ID, or upload a CSV, JSONL, or Parquet file directly. The analysis stack covers toxicity, PII, synthetic detection, liability chain tracing, bias, copyright, text quality, and overall quality scoring. Every finding cites specific rows and columns. Nothing is inferred from samples.

AQUIN · EARLY ACCESS
Run this on your own dataset
join waitlist

Ingestion

Four input paths: HuggingFace datasets by repo ID (public or gated), plus direct upload of CSV, JSONL, or Parquet files. Every ingestion records source, timestamp, and a file hash as the first entry in the audit trail.
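As a sketch of what that first audit entry could look like, here is a minimal Python version. The field names and the `ingestion_entry` helper are illustrative assumptions, not Aquin's actual schema; only the ingredients (source, timestamp, SHA-256 of the file bytes) come from the text.

```python
import hashlib
from datetime import datetime, timezone

def ingestion_entry(filename: str, raw: bytes, source: str) -> dict:
    # First audit-trail entry: where the data came from, when, and its hash.
    return {
        "event": "dataset_ingested",
        "source": source,            # e.g. "upload", or a HuggingFace repo ID
        "file": filename,
        "sha256": hashlib.sha256(raw).hexdigest(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

entry = ingestion_entry("medical_qa_v2.parquet", b"raw file bytes", "upload")
```

Hashing the raw bytes before any parsing means later findings can always be tied back to the exact file that was analyzed.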

ingestion sources → parse → audit

HuggingFace · by repo ID | CSV · any delimiter | JSONL · nested fields | Parquet · columnar
ingestion · source hash · timestamp · column parse · audit trail entry #1

Eight analysis modules

Each module answers a question the others cannot. They chain: a high near-duplicate rate triggers synthetic detection; elevated PII density triggers liability chain tracing on those columns. Or you run them directly, starting with the question you have and letting the findings determine what follows.
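The chaining logic can be pictured as a small rule table: each rule inspects the findings so far and, if a threshold trips, queues another module. The rule names and threshold values below are hypothetical, chosen to match the two examples in the text.

```python
# Hypothetical chaining rules: a finding from one module queues the next.
TRIGGERS = [
    (lambda f: f.get("near_duplicate_rate", 0.0) > 0.10, "synthetic_detection"),
    (lambda f: f.get("pii_density", 0.0) > 0.02, "liability_chain"),
]

def next_modules(findings: dict) -> list:
    # Return every module whose trigger condition the findings satisfy.
    return [module for condition, module in TRIGGERS if condition(findings)]
```

A near-duplicate rate of 11.2% would queue synthetic detection under these rules; an empty findings dict queues nothing.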

analysis pipeline

text quality → toxicity → PII → synthetic → liability chain → copyright → bias → quality score

Text quality

Text quality runs first because its findings set context for everything else. Language distribution, exact and near-duplicate detection, license resolution (dataset-level and inline columns), jurisdiction inference from URL domains, topic classification, and opt-out registry cross-referencing for web-sourced datasets. A near-duplicate rate of 11.2% on 5,000 rows is a signal: those clusters almost always trace back to synthetic augmentation pipelines.
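Near-duplicate detection of this kind is commonly built on shingle overlap. The sketch below uses word shingles and Jaccard similarity with a brute-force pairwise scan; it is an illustration of the idea, not Aquin's implementation, which at 5,000+ rows would need MinHash/LSH rather than O(n²) comparison.

```python
def shingles(text, k=3):
    # k-word shingles of a lowercased text.
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def jaccard(a, b):
    # Set overlap: |intersection| / |union|.
    return len(a & b) / len(a | b) if a | b else 0.0

def near_duplicate_pairs(rows, threshold=0.8):
    # Brute-force pairwise scan; real systems use MinHash/LSH at scale.
    sigs = [shingles(r) for r in rows]
    return [(i, j)
            for i in range(len(sigs))
            for j in range(i + 1, len(sigs))
            if jaccard(sigs[i], sigs[j]) >= threshold]
```

Clusters of rows whose pairwise similarity sits just under 1.0 are the signature of augmentation pipelines: templates with slots swapped.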

language · topic · medical_qa_v2.parquet · 5,000 rows

Language: English 88% · German 4% · French 3% · Spanish 2% · Other 3%
Topic distribution: Medical 44% · Legal 18% · Science 16% · Finance 12% · General 10%
Exact duplicates: 207 (4.1%)
Near-duplicate rate: 11.2%
Dataset license: CC BY 4.0
Jurisdiction (URLs): EU 14% · US 61%
Opt-out registry: 1 source flagged
Dominant topic: Medical / clinical

Toxicity

Six categories scored per row and aggregated per column: toxicity, severe toxicity, obscenity, threat, insult, and identity attack. The radar shows which categories drive the signal; the table shows where across columns that risk concentrates. A column where 6.2% of rows are flagged with a peak score of 0.94 is a different problem than one where 0.2% are flagged at 0.51.
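The per-column aggregation described above (flag count, flagged percentage, peak label) can be sketched as a small reducer over per-row scores. The tuple shape and the 0.5 threshold mirror the article's numbers; the function itself is illustrative.

```python
def column_summary(scores, threshold=0.5):
    # scores: list of (row_id, label, score) tuples for one column.
    flagged = [t for t in scores if t[2] >= threshold]
    if not flagged:
        return {"flagged": 0, "pct": 0.0, "peak_label": None, "peak_score": 0.0}
    peak = max(flagged, key=lambda t: t[2])
    return {
        "flagged": len(flagged),
        "pct": round(100 * len(flagged) / len(scores), 1),
        "peak_label": peak[1],
        "peak_score": peak[2],
    }
```

Keeping both the rate and the peak is the point: 6.2% flagged at peak 0.94 and 0.2% flagged at peak 0.51 produce very different summaries.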

toxicity · radar + column summary + flagged rows

toxicity 0.62 · severe toxicity 0.21 · obscenity 0.44 · threat 0.09 · insult 0.38 · identity attack 0.17
Column · Flagged · Pct · Peak label · Severity
text · 312 · 6.2% · toxicity · flagged
response · 47 · 0.9% · identity_attack · moderate
context · 8 · 0.2% · insult · moderate

flagged rows (sample)
#1847 · text · toxicity · 0.94
#3201 · text · severe_toxicity · 0.87
#892 · response · identity_attack · 0.71
#4419 · text · toxicity · 0.68

flag threshold 0.5 · 367 flagged across all columns

PII detection

Named entity recognition across every text column: SSNs, emails, phone numbers, addresses, financial identifiers, person names, health data references. Each entity is tiered from critical to low. PII rarely distributes evenly. The text column here carries 5.1 entities per 100 rows at critical risk, while context carries 0.9 at medium. Those are different remediation problems.
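A toy version of tiered entity detection, using regex patterns rather than a real NER model. The patterns, entity names, and tier assignments below are assumptions for illustration; a production system would combine model-based NER with patterns like these.

```python
import re

# Illustrative patterns and tiers; real PII detection uses NER, not regex alone.
PATTERNS = {
    "SSN":   (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "critical"),
    "EMAIL": (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "high"),
    "PHONE": (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "medium"),
}

def scan_cell(text: str) -> list:
    # Return every entity hit with its character span and risk tier.
    hits = []
    for entity, (pattern, tier) in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append({"entity": entity, "span": m.span(), "tier": tier})
    return hits
```

Because each hit carries a tier, per-column risk rolls up naturally: a column's peak risk is the highest tier seen across its rows.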

PII · entity type breakdown + column risk

759 entities total
Government ID 41% · Contact info 26% · Financial 19% · Person name 14%

Column · PII rows · Pct · Top entity · Peak risk
text · 218 · 4.3% · SSN · critical
response · 91 · 1.8% · EMAIL · high
context · 34 · 0.7% · PERSON_NAME · medium

343 rows · 759 entities · overall risk: critical

Synthetic detection

Row-level scoring across four confidence buckets: human, uncertain, likely synthetic, and synthetic. The distribution matters. Twelve percent high-confidence synthetic plus 21% likely-synthetic is a different situation than 33% uniformly likely-synthetic. The distribution is the finding.
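Bucketing row-level scores into the four bands could look like the sketch below. The cut-off values are hypothetical; only the four bucket labels come from the article.

```python
from collections import Counter

# Hypothetical score cut-offs for the four confidence buckets.
BUCKETS = [(0.85, "synthetic"), (0.60, "likely synthetic"),
           (0.40, "uncertain"), (0.00, "human")]

def bucket(score):
    # First cut-off the score clears determines the label.
    for cutoff, label in BUCKETS:
        if score >= cutoff:
            return label
    return "human"

def distribution(scores):
    # Percentage of rows landing in each bucket.
    counts = Counter(bucket(s) for s in scores)
    return {label: round(100 * counts[label] / len(scores), 1)
            for _, label in BUCKETS}
```

The output is the distribution itself, not a single verdict, which is exactly what the module reports.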

synthetic · row distribution

human 48% · uncertain 19% · likely synthetic 21% · synthetic 12% — 33% synthetic or likely

Liability chain

Synthetic data rarely arrives in one step. The liability chain module traces provenance recursively for rows above the detection threshold: directly generated, paraphrased, translated, or some combination. Each step gets a confidence rating. A three-step chain at high confidence scores higher than a two-step chain with uncertain steps, meaning the structure is the finding, not just the terminal number.
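One way to make "structure is the finding" concrete is a score that rewards both step confidence and chain depth. The formula below (geometric mean of step confidences, weighted upward with depth) is a hypothetical stand-in, not Aquin's scoring function; it exists only to show how a deep, confident chain can outrank a shallow, uncertain one.

```python
import math

def chain_liability(step_confidences):
    # Hypothetical score: geometric mean of step confidences,
    # weighted so deeper chains score higher at equal confidence.
    if not step_confidences:
        return 0.0
    gmean = math.prod(step_confidences) ** (1 / len(step_confidences))
    depth_weight = len(step_confidences) / (len(step_confidences) + 1)
    return round(gmean * depth_weight, 3)
```

Under this scheme a three-step chain at 0.9 per step beats a two-step chain at 0.5 per step, matching the ordering the text describes.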

liability chains · high-liability sample

#2041 · liability 0.91 · synthetic 0.88 → paraphrase 0.74 → translation 0.61
#3887 · liability 0.74 · synthetic 0.79 → paraphrase 0.68

2 deep chains (depth 3+) · avg liability 0.61

Bias detection

Imbalance across four protected attribute axes: gender, age, geography, and sentiment. Each is shown as deviation from the 50% midpoint, with direction explicit. A 72% US-origin dataset is not automatically broken, but it is information that should inform how the resulting model is evaluated and where it is deployed.
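The "deviation from the 50% midpoint, with direction explicit" view reduces to simple arithmetic per axis. A minimal sketch, with the axis tuple and dict shape as assumed conventions:

```python
def skew(axis, counts):
    # Signed deviation of the first pole's share from the 50% midpoint.
    left, right = axis
    total = counts[left] + counts[right]
    share = counts[left] / total
    return {"axis": f"{left} / {right}",
            "share_pct": round(100 * share, 1),
            "deviation_pct": round(100 * (share - 0.5), 1)}
```

A 72% US-origin dataset reads as +22 points on the geography axis: not a verdict, just a measured direction and magnitude.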

bias · attribute skew · diverging from center

Axes, diverging from the 50% midpoint: gender (male ↔ female) · age (young ↔ older) · geography (US ↔ non-US) · sentiment (negative ↔ positive)

Dataset quality scoring

Five dimensions scored independently: completeness, consistency, dedup quality, label fidelity, and an overall grade. Each dimension is visible separately so you can see exactly which axis is pulling the overall score down. Consistency and dedup quality are the weak points in this dataset.
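The article shows the five scores but not how the overall grade is combined. As an illustration only, here is an equal-weight mean of the four component dimensions; the real weighting is unspecified and presumably differs, since equal weights land near but not exactly on the reported overall.

```python
# Illustrative equal weights; Aquin's actual weighting is not documented here.
WEIGHTS = {"completeness": 0.25, "consistency": 0.25,
           "dedup": 0.25, "label": 0.25}

def overall(scores: dict) -> float:
    # Weighted mean of the component dimension scores.
    return round(sum(scores[k] * w for k, w in WEIGHTS.items()), 1)
```

Keeping the components visible alongside the aggregate is the design point: the overall number alone would hide that dedup quality is the weak axis.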

quality dimensions

Completeness 91 · Consistency 76 · Dedup 69 · Label 83 · Overall 78

Compliance framework coverage

The analysis modules map directly onto three frameworks: EU AI Act Articles 10, 11, and 12; India's DPDPA; and the NIST AI RMF. The compliance view shows which requirements each module satisfies, which are partial, and which are not addressed. This is not a certification, but a map from findings to framework requirements, so your legal team has specifics rather than ambiguity.

compliance coverage · EU AI Act · India DPDPA · NIST AI RMF

EU AI Act · 3/3 covered
Art. 10: Data governance
Art. 11: Technical documentation
Art. 12: Record-keeping

Full coverage. Each article maps to a specific module output.

India DPDPA · 2/3 covered
PII identification
Consent traceability
Data minimization audit (partial)

PII and consent covered. Data minimization flagged as partial.

NIST AI RMF · 3/3 covered
Data documentation
Provenance tracking
Bias surface reporting

Full alignment with GOVERN and MAP function data requirements.

The audit trail

Every operation is logged: ingestion, each module run, every finding, every threshold applied. The trail is sealed with a SHA-256 hash when the session ends. One-click export produces a structured JSON with the full machine-readable log or a formatted PDF with findings and the sealed trail.
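Sealing a trail of this kind typically means hashing a canonical serialization of the full log. A minimal sketch, assuming the trail is a list of JSON-serializable entries; the canonicalization choices (`sort_keys`, compact separators) are this sketch's assumptions:

```python
import hashlib
import json

def seal_trail(trail):
    # Canonical JSON of the full trail, hashed once at session end.
    canonical = json.dumps(trail, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any later edit to any entry changes the hash, which is what makes the exported PDF or JSON verifiable against the seal.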

audit trail · medical_qa_v2.parquet
14:02:11 · Dataset ingested · medical_qa_v2.parquet · 5,000 rows · 3 columns
14:02:14 · Text quality started · language · dedup · license · jurisdiction · topic
14:02:31 · Text quality complete · 341 duplicates · CC BY 4.0 · EU 14% · medical
14:02:32 · Toxicity started · columns: text, response, context
14:02:58 · Toxicity complete · 367 flagged · peak toxicity 0.94 on row 1847
14:02:59 · PII detection started · NER across all text columns
14:03:19 · PII complete · 343 rows · 759 entities · overall risk: critical
14:03:20 · Synthetic detection · row-level scoring
14:03:44 · Synthetic complete · 33% synthetic or likely · verdict: mixed
14:03:45 · Liability chain · tracing rows with synthetic score > 0.6
14:04:01 · Liability complete · 2 deep chains (depth 3+) · avg liability 0.61
14:04:02 · Audit trail sealed · SHA-256 hash logged · PDF and JSON export ready

SHA-256 sealed · exportable as PDF or JSON
