NewVivly x Aquin — Structuring Social Data for AI. Read the case study

Case Study

case studyMay 2026

Structuring Social Data for AI

Vivly×Aquin· May 2026
Structuring Social Data for AI — Aquin x Vivly

1. Abstract

This case study details the pipeline of extracting, structuring, and validating public sentiment data during a live privacy scandal. In February 2026, reports surfaced that human contractors were reviewing intimate footage captured by Meta Ray-Ban smart glasses, triggering a massive backlash regarding wearable AI and privacy.

To analyze this critical moment, the project integrated Vivly to autonomously identify signals and Aquin to rigorously inspect the dataset. The result is a clean, 1,500 entry training dataset sourced from Reddit and Hacker News that maps the public response to surveillance, privacy, and deceptive marketing.


2. Background

The catalyst for this data collection was a major privacy breach. It was revealed that human reviewers contracted through Sama in Kenya were actively watching user video and audio clips from Meta Ray-Ban glasses to train AI models. Because users largely believed these interactions were private, highly intimate footage captured in personal spaces was unexpectedly reviewed by third parties.

After Swedish journalists exposed the operation, Meta abruptly cut ties with Sama. This led to over 1,000 workers losing their jobs overnight and sparked immediate lawsuits and investigations over deceptive marketing practices and consumer surveillance.


3. Architecture

This project utilizes two primary platforms to process the unstructured data into a secure training set.

3.1 Data Acquisition and Structuring: Vivly

Vivly is a signal identification platform for public and social data. It surfaces meaningful signals from large-scale discussions, helping enterprises understand exactly what is being discussed, by whom, and why it matters. The project used the Vivly SDK, available via pip and npm, to fetch relevant discussions around the Meta Ray-Ban privacy controversy.

3.2 Dataset Validation and Compliance: Aquin

Aquin is a platform dedicated to building, inspecting, and improving artificial intelligence models, especially large language models. It focuses on peering into how models work internally to ensure they are reliable, safe, and accurate before deployment.

Because the dataset contained raw internet reactions to a highly sensitive privacy controversy, it required thorough sanitization before being used for training and analysis. Aquin's Dataset Inspector ingests raw social data and processes it through a safety and compliance framework designed for AI datasets. The platform performed the following critical checks:

01Prompt Injection Scan
02Opt-Out and Consent Registry
03Bias Surface and Fairness Analysis
04Toxicity Analysis
05System Prompt Leak and Role Confusion Check
06Synthetic Content Detection
07Poisoned Sample Detection
08Copyright and License Risk Assessment
09Privacy and PII Scan
10Text Quality and Duplication Analysis
11Compliance Audit Trail Flags
12Framework Scores and Remediation Plan

4. Data Preparation

4.1 Data Sources

To capture authentic public reactions to Meta Ray-Ban smart glasses, data was sourced from two primary platforms: Reddit and Hacker News. These sites host some of the most unfiltered debates on emerging tech and privacy.

4.2 Data Extraction

The Vivly SDK was used to automate the extraction. The following query was passed to the SDK:

"Meta Ray-Ban glasses privacy recording creepy scandal on Reddit and Hacker News"

The SDK analysed the intent and identified the specific communities actively discussing it. It then interfaced with the Reddit and Hacker News APIs to fetch discussions by matching relevant keywords generated directly from the initial query. Because these platforms contain a large amount of irrelevant chatter and sarcasm, the raw output is inherently noisy. To solve this, data was passed through Vivly's noise-to-signal module, leaving only high-value, relevant conversations focused on the core themes.

4.3 Result

Total entries~1,500 discussion items
Collection methodOfficial Reddit API + Vivly SDK
Data rangeCurrent year (spike period)
Primary themesPrivacy, surveillance, skepticism, wearable AI

The structure below shows how each discussion entry is represented in the raw output:

{
  "id": "1stq3ct",
  "url": "https://www.reddit.com/r/privacy/comments/1stq3ct/...",
  "score": 479,
  "title": "Being recorded with meta glasses during work",
  "content": "Today I was doing my job at a restaurant...",
  "subreddit": "privacy",
  "created_date": "2026-04-23T17:51:33+00:00",
  "num_comments": 227,
  "comments": [
    {
      "id": "ohv9lnq",
      "body": "Mention it to bosses as it has to be addressed in some standard
               yet inoffensive way for staff — that you can politely decline to
               be recorded more than a couple of seconds, say.",
      "score": 351,
      "depth": 0,
      "created_utc": 1776968911.0
    }
  ]
}

5. Data Processing

5.1 Dataset Preparation for Aquin

The raw JSON data extracted from Reddit and Hacker News was deeply valuable but far too unstructured for direct model training.

The key step here was using Claude Sonnet 4.6 not to generate content, but to restructure it. The model analyzed the raw data and logically grouped scattered discussions based on shared article links and core topics. This preserved the contextual richness of the human conversations while organizing them into coherent, unified threads.

Once the discussions were logically grouped, the data was passed through a lightweight formatting script. This step required no additional AI processing — the script simply converted the grouped data into a strict, LLaMA-compatible prompt-and-answer format. The output was a clean JSON Lines (JSONL) file, precisely structured to match the ingestion requirements of Aquin's Dataset Inspector.

Finally, the formatted JSONL file was uploaded into Aquin. The Dataset Inspector automatically processed the entries through its predefined evaluation pipelines.


6. Process Views

A selection of views from the dataset inspector, audit surfaces, and pipeline output across each stage of the project.

View 1

6.1 Prompt Injection Scan

This process scans the dataset's training rows to detect embedded prompt injection patterns — inputs designed to hijack the AI by overriding its primary instructions. The dataset was analyzed across 296 rows and returned a completely clean verdict. Zero rows were flagged, and the average injection score was an extremely low 0.0037, meaning the data is secure from basic injection attacks.

6.2 Opt-Out and Consent Registry

This step checks any web links present in the dataset against the Spawning AI opt-out registry and standard robots.txt restrictions, ensuring the data respects creator consent and legal scraping boundaries. The status was entirely clear. The scanner detected zero URL columns in this dataset, meaning no domains were blocked and no further opt-out compliance checks were required.

6.3 Bias Surface and Fairness Analysis

This analysis detects protected attributes — such as gender, race, or age — and measures label imbalances to ensure the dataset is fair and won't train the AI to exhibit discriminatory behavior. Bias risk was marked as low. The system detected zero protected attributes and zero label columns across the 296 rows, concluding that there are no significant bias signals or fairness concerns.

6.4 Toxicity Analysis

This scan evaluates the dataset for harmful, offensive, or inappropriate language, providing a severity breakdown and pinning the worst offenders for manual review. The overall verdict was clean. While 4.7% of the data (14 rows) was flagged for minor toxicity, only 1 row was classified as severe (scoring ≥ 0.8). The vast majority of the sample remains safe.

6.5 System Prompt Leak & Role Confusion Check

A deeper injection scan targeting adversarial attacks that attempt to cause role confusion or trick the AI into leaking its confidential backend system prompts. Just like the primary injection scan, this came back clean — a 0% flag rate for advanced manipulation tactics in both the user and assistant columns.

6.6 Synthetic Content Detection

This process analyzes text to determine if it was generated by an AI rather than written by a human, scoring the likelihood of AI origin and pinpointing suspect rows. The overall dataset is classified as human, with a low average synthetic score of 0.1432. However, 0.7% of the data (2 rows) was flagged as highly synthetic: Row #178 (assistant) hit 100% synthetic confidence, and Row #138 (user) hit 90% confidence.

6.7 Poisoned Sample Detection

This process detects maliciously altered training samples by searching for cluster outliers, label inconsistencies, and loss anomaly signals. Across all 296 rows, the dataset performed with a 0% flagged rate and a clean verdict. No cluster outliers, label inconsistencies, or loss anomalies were found, and the average anomaly score remained extremely low at 0.1423.

6.8 Copyright and License Risk Assessment

This analysis evaluates potential intellectual property violations by calculating a composite IP score based on domain analysis, inline license signals, and copyrighted content markers. Unlike the security scans, this returned an elevated overall risk with a composite score of 46 out of 100 — driven entirely by the absence of a declared license, which caused the system to assume a restricted status and generate a high license risk score of 75. The actual content analysis, however, posed very low risk (0.0125), with 0% copyright notices, open license references, or book markers found across a 200-row sample.

6.9 Privacy and PII Scan

This scan combs through the dataset to identify Personally Identifiable Information such as names, emails, phone numbers, and locations. This scan resulted in a high risk verdict. The system found that 30.4% of the dataset (90 out of 296 rows) contains PII, detecting 106 specific entities — 105 flagged as sensitive (Nationality/Religion, medium risk) and 1 flagged as contact (a phone number, high risk). Exposure is heavily concentrated in the user column, with a high PII density affecting 29.7% of its rows.

6.10 Text Quality and Duplication Analysis

This check evaluates foundational text quality by analyzing language distribution and scanning for exact or near-duplicate rows. The dataset showed exceptional text hygiene — 100% English with no mixed-language anomalies. Duplicate detection (using a 0.85 Jaccard similarity threshold) confirmed that 100% of the 296 rows are clean, with 0% near-duplicates and 0 exact identical rows.

6.11 Compliance Audit Trail Flags

This process grades the dataset's privacy and security metrics against established regulatory frameworks. The dataset received an overall health score of 40 out of 100. Of 5 clauses assessed, 4 failed and 1 passed. Failures were tied to the PII discovered in the previous scan — the dataset failed Article 10(3) (Special categories of personal data), Section 4 (Lawful basis for processing), NIST AI RMF MAP 2.2, and MANAGE 3.1. It successfully passed Section 9 (Sensitive Personal Data) due to the absence of hyper-sensitive identifiers like Aadhaar, PAN, health records, or financial data.

6.12 Framework Scores and Remediation Plan

This summarizes audit findings into definitive framework scores and outputs a structured remediation plan. Due to unmitigated PII, the dataset is currently non-compliant with the EU AI Act (score: 25%) and NIST AI RMF (score: 25%), while sitting at partial compliance with the India DPDPA (score: 62%). The remediation plan requires: data minimization and anonymization targeting the exposed phone number; aligning processing activities with lawful bases and staff privacy training; and implementing continuous monitoring as the dataset grows.


7. Conclusion

This project successfully gathered and organized public reactions to the Meta Ray-Ban privacy scandal. By using Vivly to collect the discussions and Aquin to thoroughly inspect them, the team proved that the raw dataset is mostly secure and of very high quality. The text is free from AI manipulation, harmful bias, and poisoned data, making it an excellent starting point for understanding real human concerns about wearable AI.

However, the final safety checks highlighted a major privacy issue that must be addressed before the data can be used. Approximately 30% of the dataset still contains personal information — including a phone number and other sensitive details — causing it to fail strict legal compliance thresholds. The immediate next step is to scrub this personal data and follow the generated remediation plan to ensure the dataset is fully safe and legally compliant.

The Vivly SDK handled source discovery. Aquin's Dataset Inspector handled the final generation and compliance pass. Each tool played to its strengths, and the handoffs between them were clean.

Not sure if Aquin is right for you?

© 2026 Aquin. All rights reserved.

Aquin