Embedding Models

Aquin Labs · May 2026

Geometry inspection, retrieval evaluation, fine-tuning monitoring, embedding diff across checkpoints, and sparse autoencoder feature analysis. Load any sentence-transformers compatible encoder and get the full picture of your embedding space.

Embedding models in Aquin

An embedding model is an encoder that collapses a variable-length input into a single dense vector. That vector is the whole output. There is no next-token distribution, no chain of reasoning, no generation. Everything the model knows about the input is compressed into a fixed-size point in a high-dimensional space, and the quality of that compression determines whether downstream retrieval, clustering, or classification works.

Most embedding tooling stops at benchmark numbers. Aquin goes into the space itself. You can load any sentence transformer checkpoint, visualize the geometry of your dataset, measure whether the space is healthy or anisotropic, trace similarity through the encoder layer by layer, evaluate retrieval quality on your own query-document pairs, compare two checkpoints to see exactly what a fine-tune changed, and decompose individual embeddings into interpretable sparse features using a trained sparse autoencoder.

embedding space · UMAP projection · 3 topic clusters

cluster separation, outlier detection, and per-label coloring. OOD points flagged before retrieval.

Supported models

Aquin supports any Hugging Face checkpoint that follows the sentence-transformers interface: a transformer encoder with a pooling layer on top. Pooling strategy is detected automatically from the model config: CLS token pooling, mean pooling, or weighted mean pooling. For Instructor-style models with instruction prefixes, the prefix is applied transparently at inference time.
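As a rough illustration of two of the pooling strategies named above, here is a minimal pure-Python sketch on toy token vectors. The function names and the attention-mask convention (1 for a real token, 0 for padding) are assumptions made for this example, not Aquin's API.

```python
def cls_pool(token_vectors):
    """CLS pooling: use the first token's vector as the sentence embedding."""
    return list(token_vectors[0])

def mean_pool(token_vectors, attention_mask):
    """Mean pooling: average the token vectors, skipping padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / count for value in total]

# Toy input: three real tokens and one padding position, 2 dimensions each.
tokens = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

cls_embedding = cls_pool(tokens)
mean_embedding = mean_pool(tokens, mask)
```

The padding vector is ignored by mean pooling, which is why the mask matters when sentences in a batch have different lengths.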

Bi-encoders

Bi-encoders embed query and document independently and compare them with cosine similarity. This makes them fast for large-scale retrieval: you embed the corpus once, index it, and query at inference time. The tradeoff is that the encoder cannot model interactions between query and document. BGE, E5, GTE, Nomic, Jina, Instructor, MiniLM, and SBERT are all bi-encoders. Every tool in Aquin's embedding system runs on bi-encoders.
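The embed-once, compare-by-cosine pattern can be sketched as follows. The vectors are made up for illustration, and `rank_documents` is a hypothetical helper rather than part of Aquin.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Corpus embeddings are computed once and stored; these values are invented.
index = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.0, 1.0, 0.0],
    "doc_c": [0.7, 0.7, 0.1],
}
query = [1.0, 0.2, 0.0]
ranking = rank_documents(query, index)
```

Only the query is embedded at search time. The document vectors never see the query, which is the interaction limit described above.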

Family | Variants | Pooling
BGE | bge-small-en-v1.5 · bge-base-en-v1.5 · bge-large-en-v1.5 · bge-m3 | CLS
E5 | e5-small-v2 · e5-base-v2 · e5-large-v2 · multilingual-e5-large | mean
GTE | gte-small · gte-base · gte-large · gte-Qwen2-1.5B | mean
Nomic | nomic-embed-text-v1 · nomic-embed-text-v1.5 | mean
Jina | jina-embeddings-v2-base-en · jina-embeddings-v3 | mean
Instructor | instructor-base · instructor-large · instructor-xl | mean
MiniLM | all-MiniLM-L6-v2 · all-MiniLM-L12-v2 | mean
SBERT | all-mpnet-base-v2 · paraphrase-multilingual-mpnet-base-v2 | mean

Cross-encoders

Cross-encoders take a query-document pair as a single concatenated input and output a relevance score. They do not produce an embedding vector. Because they model query-document interactions directly, they are typically more accurate than bi-encoders on reranking tasks, but every candidate pair needs its own forward pass, so they cannot be used for large-scale retrieval directly. Aquin supports cross-encoders for reranking evaluation: load a cross-encoder alongside a bi-encoder retriever and compare the rank distributions before and after reranking.

Inspection signals

Retrieval benchmarks tell you a number. Inspection signals tell you why that number is what it is. Rows that name a command map to a CLI verb on the loaded embedding model (there is no embed- prefix; the mode follows aquin load).

Signal | What it shows
aquin layer-drift | Cosine similarity of representations across encoder layers on probe sentences.
aquin isotropy | Pairwise cosine distribution: collapsed vs healthy geometry.
aquin matrix / aquin space | Similarity matrix heatmap and UMAP geometry explorer.
aquin ood | Distance from in-distribution centroid; flags unreliable queries pre-retrieval.
aquin attribution | Token-level contribution to the pooled embedding.
aquin perturbation | Sensitivity of embedding to token swaps and truncations.
aquin retrieval | Recall@k, MRR, NDCG on your query-document JSONL.
Hard-negative gap | Cosine delta between closest positive and hardest negative per query.

Embedding geometry

The embedding explorer projects your dataset into 2D using UMAP and plots every point. Color by label, by cluster assignment, or by OOD score. Points that sit far from any cluster, inputs the model has not learned to place reliably, are flagged automatically. The explorer is the starting point for understanding whether your embedding space is doing what you need it to do before you run any retrieval or classification on top of it.

Intrinsic dimensionality adds a quantitative view. If a 768-dimensional embedding space only needs 40 dimensions to explain 95% of the variance in your dataset, the model is compressing your data heavily. Whether that is good or bad depends on the task, but knowing it is essential context for choosing embedding dimension, comparing models, and diagnosing retrieval failures.
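One way to read the "dimensions needed for 95% of the variance" figure is as a cumulative count over a variance spectrum, for example the eigenvalues of the embedding covariance from a PCA. The sketch below assumes that interpretation; the spectrum is hypothetical and `components_for_variance` is an illustrative helper, not an Aquin function.

```python
def components_for_variance(component_variances, threshold=0.95):
    """Smallest number of leading components whose variance share reaches threshold."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, variance in enumerate(ordered, start=1):
        running += variance
        if running / total >= threshold:
            return count
    return len(ordered)

# Hypothetical spectrum: a few dominant directions and a long flat tail.
spectrum = [50.0, 30.0, 10.0, 6.0] + [0.5] * 8
needed = components_for_variance(spectrum)
```

Here four of the twelve directions carry 96% of the variance, so the data occupies a much lower-dimensional subspace than the nominal embedding size suggests.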

Anisotropy

Anisotropy is a geometric degeneration where all embeddings cluster in a narrow cone rather than distributing across the full sphere. In an anisotropic space, random pairs of inputs have high cosine similarity not because they are semantically similar, but because every vector points in roughly the same direction. This inflates similarity scores across the board and makes retrieval unreliable.

Aquin measures anisotropy as the mean pairwise cosine similarity across a random sample of embeddings. A well-distributed space has mean similarity near 0. A collapsed space has mean similarity approaching 1. The distribution is plotted as a histogram so you can see whether the problem is severe across the board or concentrated in a subset of the data.
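The measurement described above, mean pairwise cosine over a random sample, can be sketched in a few lines. The toy data, the sample size, and the function name are assumptions made for this example.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors, sample_pairs=1000, seed=0):
    """Estimate anisotropy as the mean cosine over randomly sampled pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(sample_pairs):
        a, b = rng.sample(vectors, 2)  # two distinct embeddings
        total += cosine(a, b)
    return total / sample_pairs

data_rng = random.Random(1)
# Well-spread toy data: zero-mean Gaussian directions.
healthy = [[data_rng.gauss(0, 1) for _ in range(32)] for _ in range(200)]
# Collapsed toy data: the same vectors shifted by a large common offset.
collapsed = [[x + 5.0 for x in vec] for vec in healthy]

healthy_score = mean_pairwise_cosine(healthy)
collapsed_score = mean_pairwise_cosine(collapsed)
```

The common offset plays the role of the shared direction that dominates an anisotropic space: it pushes every pairwise cosine toward 1 regardless of content.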

anisotropy · pairwise cosine similarity distribution · two panels (healthy geometry, high anisotropy) · x-axis from sim = 0.0 to sim = 1.0

left: healthy geometry, mass distributed near 0. right: anisotropic, mass shifted toward 1 and similarity scores are unreliable.

Layer-by-layer analysis

An embedding model's final vector is not built in one step. It emerges across the encoder's layers as attention heads route information and the feed-forward sublayers transform representations. Aquin plots mean pairwise cosine similarity of hidden states at each encoder layer. This shows at which layer the model's representation stabilizes, where collapse begins if it does, and whether the final pooled output reflects the geometry of earlier layers or diverges from it.

layer-wise similarity · mean pairwise cosine by encoder layer

similarity builds steadily toward the final layer. sharp jumps indicate where the most information integration happens.

OOD detection

An input that embeds far from the centroid of your dataset's embedding distribution is out-of-distribution for your corpus. Including OOD inputs in a retrieval index degrades retrieval quality because they pull nearest-neighbor scores away from genuinely relevant results. Aquin computes an OOD proximity score for each input by measuring cosine distance from the corpus centroid. Inputs above a configurable threshold are flagged and listed for review before indexing.
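A minimal sketch of centroid-based OOD flagging follows. The threshold value of 0.5, the helper names, and the vectors are assumptions chosen for illustration.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    """1 minus cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def flag_ood(corpus_vecs, candidates, threshold=0.5):
    """Score each candidate by distance from the corpus centroid and flag outliers."""
    center = centroid(corpus_vecs)
    scores = {name: cosine_distance(vec, center) for name, vec in candidates.items()}
    flagged = [name for name, score in scores.items() if score > threshold]
    return flagged, scores

corpus = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]]
candidates = {"in_domain": [0.95, 0.1], "out_of_domain": [-0.2, 1.0]}
flagged, scores = flag_ood(corpus, candidates)
```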

Retrieval evaluation

Aquin evaluates retrieval quality on your own query-document pairs. Upload a JSONL file with query and document fields, optionally with relevance labels, and Aquin computes the full retrieval metric suite: Recall@1, Recall@5, Recall@10, MRR, and NDCG@10. Results are broken down by topic category when labels are available.

The hard negatives gap is the most actionable metric. It measures how much cosine similarity separates the closest true positive from the closest hard negative for each query. A small gap means the model is barely distinguishing relevant from near-relevant documents at the decision boundary, the failure mode that standard Recall@k scores miss entirely.
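The gap for a single query can be computed as below. This is a sketch with invented vectors; `hard_negative_gap` is a hypothetical helper, and the corpus-level metric would be the mean of this value over all queries.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hard_negative_gap(query_vec, positive_vecs, negative_vecs):
    """Similarity of the closest positive minus similarity of the closest negative."""
    best_positive = max(cosine(query_vec, p) for p in positive_vecs)
    hardest_negative = max(cosine(query_vec, n) for n in negative_vecs)
    return best_positive - hardest_negative

query = [1.0, 0.0]
positives = [[1.0, 0.1]]
negatives = [[1.0, 0.5], [0.0, 1.0]]
gap = hard_negative_gap(query, positives, negatives)
```

A gap near zero, or a negative one, means a hard negative scores as high as the true positive even when the positive still happens to rank first.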

Metric | Description
Recall@1 | Fraction of queries where the top-1 result is the correct document
Recall@5 | Fraction of queries where the correct document appears in the top 5
Recall@10 | Fraction of queries where the correct document appears in the top 10
MRR | Mean Reciprocal Rank, the average of 1/rank across all queries
NDCG@10 | Normalized Discounted Cumulative Gain, which accounts for graded relevance labels
Hard-neg gap | Mean cosine delta between closest positive and closest hard negative
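The ranking metrics in the table can be sketched from a ranked list of document ids per query. This version uses linear gain for NDCG (some implementations use 2^rel - 1 instead), which is an assumption; the document ids and labels are invented.

```python
import math

def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the correct document is in the top k, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the correct document, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0

def ndcg_at_k(ranked_ids, relevance, k):
    """relevance maps doc_id to a graded label; missing ids count as 0."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Two queries: (ranked result ids, id of the correct document).
runs = [(["d2", "d1", "d3"], "d1"), (["d5", "d6", "d4"], "d5")]
recall_1 = sum(recall_at_k(r, rel, 1) for r, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)
ndcg = ndcg_at_k(["d2", "d1", "d3"], {"d1": 2, "d3": 1}, k=10)
```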

nearest-neighbor rank distribution · ground-truth document rank per query

mass at rank 1 means good retrieval. long tail toward higher ranks indicates queries where the model struggles.

Fine-tuning support

Live fine-tune metrics stream through aquin watch ingest (see Training). For contrastive runs, aquin pairs-generate and aquin simulate forecast geometry shifts before you commit GPU time (see Simulating Training).

For contrastive loss objectives (InfoNCE, NT-Xent, triplet), the loss is decomposed into positive-pair similarity and negative-pair similarity, tracked separately. A widening gap between the two is healthy. A narrowing gap means the model is pulling negatives in along with the positives rather than separating them.
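To make the decomposition concrete, here is a sketch of in-batch InfoNCE that also reports mean positive and mean negative similarity. The temperature of 0.05 and the similarity values are assumptions, and the function is illustrative rather than Aquin's training monitor.

```python
import math

def info_nce_with_breakdown(similarity, temperature=0.05):
    """similarity[i][j] is cosine(query_i, document_j); the diagonal holds positives."""
    losses, positives, negatives = [], [], []
    for i, row in enumerate(similarity):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_sum = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum - logits[i])
        positives.append(row[i])
        negatives.extend(s for j, s in enumerate(row) if j != i)
    return {
        "loss": sum(losses) / len(losses),
        "pos_sim": sum(positives) / len(positives),
        "neg_sim": sum(negatives) / len(negatives),
    }

batch = [
    [0.9, 0.2, 0.1],
    [0.3, 0.8, 0.2],
    [0.1, 0.4, 0.7],
]
stats = info_nce_with_breakdown(batch)
gap = stats["pos_sim"] - stats["neg_sim"]
```

Logging `pos_sim` and `neg_sim` alongside the loss shows whether a falling loss comes from positives moving together, negatives moving apart, or both.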

LoRA fine-tuning on embedding models is supported natively. Adapter matrices are merged at load time for inspection. The training monitor tracks per-layer gradient norms across the encoder layers, flagging layers where gradients have died or spiked. This is the same dead-layer detector used for LLMs, applied to the encoder stack.
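Merging an adapter follows the standard LoRA formulation, W' = W + (alpha / r) · B · A. The sketch below shows that arithmetic on tiny matrices; it assumes Aquin's merge matches the standard formulation, and the matrices are invented.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    inner = len(b)
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def merge_lora(weight, lora_a, lora_b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the adapter rank."""
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]

weight = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 projection matrix
lora_a = [[1.0, 2.0]]               # r x in  = 1 x 2
lora_b = [[0.5], [0.0]]             # out x r = 2 x 1
merged = merge_lora(weight, lora_a, lora_b, alpha=2.0)
```

After merging, the layer is an ordinary dense projection again, so inspection tools can treat the fine-tuned model like any other checkpoint.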

Full fine-tune

All encoder parameters updated. Gradient norms tracked per layer.

LoRA

Low-rank adapters on Q, K, V projections. Merged at load for inspection.

Contrastive

InfoNCE, NT-Xent, triplet loss. Positive and negative pair similarity tracked separately.

Embedding diff

When you fine-tune an embedding model, the geometry of the space changes. Aquin's embedding diff runs both checkpoints on the same probe dataset and compares: centroid positions per topic cluster, cosine similarity distribution shift, anisotropy delta, and nearest-neighbor rank changes across the query set. This tells you what the fine-tune changed in the space, not just whether task metrics went up.
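One of the compared quantities, centroid shift per topic cluster, can be sketched as follows. It assumes both checkpoints embed the same probe texts and that their coordinates are directly comparable, which is reasonable when one checkpoint is a fine-tune of the other. The labels and vectors are invented.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

def centroid_shift(base_by_label, tuned_by_label):
    """Euclidean distance between base and fine-tuned centroids for each label."""
    shifts = {}
    for label, base_vecs in base_by_label.items():
        before = centroid(base_vecs)
        after = centroid(tuned_by_label[label])
        shifts[label] = math.dist(before, after)
    return shifts

base = {"medical": [[1.0, 0.0], [1.0, 2.0]], "legal": [[0.0, 1.0], [0.0, 3.0]]}
tuned = {"medical": [[4.0, 4.0], [4.0, 6.0]], "legal": [[0.0, 1.0], [0.0, 3.0]]}
shifts = centroid_shift(base, tuned)
```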

Embedding drift is reported as a composite score: a weighted average of centroid shift magnitude, rank change count, and anisotropy delta. A fine-tune that improves retrieval by pulling topic clusters apart without inflating anisotropy receives a good score. A fine-tune that improves one cluster's retrieval by collapsing another's geometry receives a poor score, even if headline Recall@1 went up.

embedding diff · cluster centroid shift · base vs fine-tuned

dashed circles: base checkpoint cluster positions. solid circles: fine-tuned. arrows show direction and magnitude of centroid drift per topic.

Sparse autoencoders

Geometry tells you the shape of the space. A sparse autoencoder tells you what is in it. Standard embedding analysis shows that two sentences are close together, but not why. SAE feature analysis opens the vector and reads out the specific concepts it contains.

A sparse autoencoder is a dictionary learning model trained on the final-layer activations of an embedding model. It learns a set of unit-norm decoder vectors called features, one per dictionary entry, such that any activation can be approximately reconstructed as a sparse linear combination of them. The coefficients in that combination are the feature activations: a large coefficient on a feature means the input strongly expresses the concept that feature has learned to represent. In practice, features tend to correspond to interpretable concepts: domains (medical, legal, financial), linguistic patterns (negation, formal register), and topic clusters.
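The encode and reconstruct steps can be sketched on a toy dictionary. The ReLU encoder with a bias term is a common SAE design and is an assumption here, since the text above only specifies the decoder side; all weights are invented.

```python
def sae_encode(activation, encoder_weights, encoder_bias):
    """ReLU(W_enc @ x + b): one non-negative coefficient per dictionary feature."""
    coefficients = []
    for weights, bias in zip(encoder_weights, encoder_bias):
        pre = sum(w * x for w, x in zip(weights, activation)) + bias
        coefficients.append(max(0.0, pre))
    return coefficients

def sae_decode(coefficients, decoder_vectors):
    """Reconstruct the activation as a weighted sum of decoder vectors."""
    dim = len(decoder_vectors[0])
    out = [0.0] * dim
    for coeff, direction in zip(coefficients, decoder_vectors):
        if coeff:  # inactive features contribute nothing
            for i in range(dim):
                out[i] += coeff * direction[i]
    return out

# Toy dictionary: 3 unit-norm features over a 2-dimensional activation.
decoder = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
encoder = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
bias = [0.0, 0.0, -2.0]   # the negative bias keeps feature 2 silent on this input

x = [2.0, 0.0]
features = sae_encode(x, encoder, bias)
reconstruction = sae_decode(features, decoder)
```

Only one of the three features is active, and that single coefficient is enough to reconstruct the input, which is the sparse linear combination described above.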

Embedding SAE tools are available through the programming agent. Load an embedding model and ask the agent to decompose a sentence, browse the feature space, or trace how a concept builds across encoder layers.

SAE pipeline · from input text to sparse feature activations

Feature decomposition

The entry point to SAE analysis is feature decomposition: run a text through the embedding model and SAE encoder, and read out which features activate and how strongly. A typical sentence activates 50 to 200 features out of a 16,384-feature dictionary. The top 10 to 20 are usually interpretable, and looking at them reveals the model's understanding of the input at a granularity that neither the raw embedding vector nor the distance to other sentences can show.

Contrastive decomposition runs two texts side by side and returns only the features that differ between them. This is useful when two semantically similar sentences should map to the same retrieval result but do not. The diverging features show where the model is drawing a distinction that might not be meaningful for your task.
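Given two sets of feature activations, the comparison reduces to a filtered difference. In this sketch the feature ids are borrowed from the sample figures in this document, while the activation values, the `min_delta` cutoff, and the function name are assumptions.

```python
def contrastive_features(acts_a, acts_b, min_delta=0.1):
    """Features whose activation differs between two texts by more than min_delta."""
    diffs = {}
    for feature_id in set(acts_a) | set(acts_b):
        delta = acts_a.get(feature_id, 0.0) - acts_b.get(feature_id, 0.0)
        if abs(delta) > min_delta:
            diffs[feature_id] = delta
    # Largest absolute difference first.
    return dict(sorted(diffs.items(), key=lambda item: abs(item[1]), reverse=True))

# feature id -> activation, for two texts that should retrieve the same document
text_a = {319: 2.4, 188: 0.9, 76: 1.2}
text_b = {319: 2.35, 76: 0.2, 27: 1.5}
diverging = contrastive_features(text_a, text_b)
```

Feature 319 fires almost equally on both texts and drops out, leaving only the features that explain why the two embeddings differ.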

Feature browser

The feature browser runs your corpus through the SAE and ranks the features by total activation. For each feature it shows an auto-generated label (derived from the top-activating examples via an LLM), activation frequency across the corpus, the maximum activation value observed, and the three to five sentences that activated it most strongly. Clicking a feature expands the example list.

The browser is the fastest way to understand what your data looks like from the model's perspective. Run it on a corpus and you will see which concepts the model has learned to distinguish and which it has compressed together. Corpora with many domain-specific terms tend to produce a small number of high-frequency domain features. Corpora that mix registers produce more general linguistic features at the top of the ranking.

feature browser · top SAE features · gte-small · 50-sentence medical corpus

# | feature | freq | max act
#319 | medical diagnosis | 22% | 2.41
#142 | legal terminology | 18% | 2.14
#76 | technical writing | 55% | 1.29
#27 | financial risk | 16% | 1.88
#188 | negation patterns | 48% | 1.08

top activating examples for #319: "The patient presented with acute chest pain" · "Diagnosis confirmed via CT scan" · "Symptoms consistent with pneumonia"

click any feature to expand activating examples. features ranked by total activation across corpus.

Network graph

Features do not activate in isolation. Sentences that activate a feature for medical diagnosis also tend to activate features for clinical symptoms and drug dosage. The co-activation network makes these relationships visible. Each node is a feature, sized by mean activation and colored by activation frequency. Edges connect features that co-activate on the same sentences, with thickness proportional to co-activation frequency.

The network reveals the latent cluster structure of your corpus at the feature level. Tightly connected subgraphs correspond to semantic domains where the model has learned to group related concepts. Loosely connected features are general-purpose and fire across domains. Click any node to see its neighbors and their co-activation frequencies.
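The edge list for such a network can be built by counting how often feature pairs are active on the same text. The sketch below treats the threshold as a minimum fraction of texts, which is an assumption about how the threshold is defined; the active-feature sets are invented.

```python
from collections import Counter
from itertools import combinations

def coactivation_edges(active_sets, threshold=0.15):
    """Edges between features that fire together on at least `threshold` of texts."""
    pair_counts = Counter()
    for active in active_sets:
        for pair in combinations(sorted(active), 2):
            pair_counts[pair] += 1
    total = len(active_sets)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= threshold
    }

# One set of active feature ids per text.
texts = [
    {319, 55},
    {319, 55, 188},
    {142, 203},
    {319, 188},
    {142, 203, 188},
    {27},
    {27, 98},
    {55},
]
edges = coactivation_edges(texts)
```

Pairs that co-occur only once fall below the threshold and produce no edge, which keeps incidental overlaps out of the graph.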

SAE co-activation network · 12 features · 3 domain clusters

12 features · 13 edges · threshold 0.15 · 50 texts

legend: high freq · low freq · co-activation · node size = mean activation

nodes sized by mean activation, colored by frequency (amber = high, violet = low). edges connect features that co-activate. click any node to inspect.

Circuit tracing

Feature decomposition tells you which features activate at the final layer. Circuit tracing tells you at which encoder layer each feature appears and how its activation builds across the stack. For a given text and target feature, Aquin runs the SAE independently on the hidden state at each encoder layer and plots the target feature's activation across layers.

The resulting circuit graph is a horizontal DAG, one column per layer, with bezier arcs showing activation growth between layers. Features that appear early and grow steadily are structural: the model is building the concept progressively. Features that appear suddenly in the final two or three layers are late-binding: the model is making a classification-like decision rather than building up the representation. The co-active features in each column show what the model was also representing at that layer.

circuit trace · feature #319 · medical diagnosis · gte-small · 12 layers

input text: "The patient presented with acute chest pain and elevated troponin levels…"

Feature #319: activation through 12 layers (L0–L11) · peak L11: 2.4100 · total gain +2.3000 · 5 growth steps

legend: target feature · co-active features · activation growth

target feature activation grows from layer 4 onward, stabilizing at layer 11. click any column to inspect co-active features at that layer.

Steering

SAE steering adds a scaled version of a feature's decoder direction to the final-layer activation before computing the embedding. Boosting a feature pushes the resulting embedding toward inputs that activate it strongly. Suppressing a feature pulls it away. The result is a modified embedding that you can use to measure how much that feature influences the model's output.

The steering tool reports cosine shift, how far the steered embedding moved from the original, and optionally re-ranks a retrieval corpus to show how the results change. Boosting a domain feature on a borderline query will typically pull in more domain-specific results. This is a way to verify that a feature actually encodes the concept its label suggests, and to understand how robustly that concept influences retrieval.
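The core of steering is a single vector addition followed by a similarity check. In this sketch cosine shift is taken to be 1 minus the cosine between the original and steered vectors, which is an assumption about how the tool defines it; the vectors and delta are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def steer(activation, decoder_direction, delta):
    """Add delta times a feature's decoder direction to the activation."""
    return [a + delta * d for a, d in zip(activation, decoder_direction)]

original = [1.0, 0.0, 0.0]
direction = [0.0, 1.0, 0.0]   # unit-norm decoder vector, made up
steered = steer(original, direction, delta=0.5)
cosine_shift = 1.0 - cosine(original, steered)
```

A negative delta suppresses the feature instead of boosting it, using the same function.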

feature steering · feature #319 · delta +4.0 · retrieval shift

input text: "The patient presented with acute chest pain and elevated troponin levels."

feature #319 · medical diagnosis · delta +4 · cosine shift 0.180

retrieval before: Chest pain evaluation protocol · Troponin interpretation guide · Acute coronary syndrome workup

retrieval after: Myocardial infarction diagnosis criteria · STEMI vs NSTEMI differentiation · Cardiac biomarker reference ranges · Chest pain evaluation protocol

left: original retrieval results. right: results after boosting the medical diagnosis feature. cosine shift 0.18.

Absorption and polysemy diagnostics

A well-trained SAE has features that fire independently and correspond to distinct concepts. Two failure modes undermine this: feature absorption and polysemy. Absorption is when one feature fires almost every time another does (feature B whenever feature A fires), meaning one concept has been absorbed into another and the absorbed feature contributes no independent signal. Polysemy is when a single feature fires on semantically unrelated inputs, meaning it has been overloaded to represent several distinct concepts.

Aquin's diagnostics scanner finds both. Absorption is detected by computing conditional activation probability across a corpus: if P(B activates | A activates) exceeds a threshold, the pair is flagged. Polysemy is detected by measuring the semantic variance of a feature's top-activating examples. High variance means the activating texts are semantically distant from each other. Both reports come with the specific pairs or features identified, so you can judge whether the overlap is an artifact or reflects a genuine latent structure in your data.
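The absorption check can be sketched directly from the conditional probability above. The 0.8 threshold, the helper names, and the active-feature sets are assumptions for illustration; the polysemy check is not covered here.

```python
def conditional_activation(active_sets, a, b):
    """Estimate P(b active | a active) from per-text sets of active features."""
    with_a = [s for s in active_sets if a in s]
    if not with_a:
        return 0.0
    return sum(1 for s in with_a if b in s) / len(with_a)

def absorption_pairs(active_sets, threshold=0.8):
    """All ordered pairs (a, b) where P(b | a) exceeds the threshold."""
    features = sorted(set().union(*active_sets))
    flagged = []
    for a in features:
        for b in features:
            if a != b:
                p = conditional_activation(active_sets, a, b)
                if p > threshold:
                    flagged.append((a, b, p))
    return flagged

# One set of active feature ids per text.
texts = [{319, 55}] * 4 + [{55}, {55, 188}, {188}, {188}]
pairs = absorption_pairs(texts)
```

The check is directional: feature 55 fires every time 319 does, but 319 fires on only four of the six texts that activate 55, so only one ordering is flagged.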

absorption and polysemy diagnostics · gte-small · 50-sentence mixed corpus

absorption pairs

absorber | absorbed | P(B|A)
#319 medical diagnosis | #55 clinical symptoms | 94%
#142 legal terminology | #203 liability clauses | 88%
#27 financial risk | #98 market volatility | 81%

polysemous features

feature | variance | activating text types
#501 interest rates / pricing | 0.72 | central bank policy texts · product pricing documents
#334 formal register / legalese | 0.61 | academic writing · legal contracts

Retrieval faithfulness

Retrieval faithfulness measures which SAE features are load-bearing for retrieval quality. For each feature, Aquin zeros it out in all query embeddings and recomputes NDCG@k on your query-document pairs. The drop in NDCG tells you how much that feature was contributing to retrieval. A large drop means the feature is essential. A negligible drop means the feature, despite high activation, is redundant with other features in the embedding.
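The ablate-and-rescore loop can be sketched for one query and one feature. Two simplifications are assumed: the feature is zeroed by subtracting its contribution (activation times decoder direction) from the query embedding, and NDCG is computed with a single relevant document per query. All vectors and names are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def ndcg_single(query_vec, doc_vecs, relevant_id, k=10):
    """NDCG@k with one relevant document: 1 / log2(rank + 1) if it is in the top k."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    rank = ranked.index(relevant_id) + 1
    return 1.0 / math.log2(rank + 1) if rank <= k else 0.0

def ablate(query_vec, activation, decoder_direction):
    """Remove one feature's contribution (activation * direction) from the embedding."""
    return [q - activation * d for q, d in zip(query_vec, decoder_direction)]

docs = {"relevant": [0.2, 1.0], "distractor": [1.0, 0.1]}
query = [0.6, 1.0]
feature_direction = [0.0, 1.0]   # hypothetical decoder vector
feature_activation = 0.9         # hypothetical activation on this query

before = ndcg_single(query, docs, "relevant")
after = ndcg_single(ablate(query, feature_activation, feature_direction), docs, "relevant")
drop = before - after
```

Here removing the feature pushes the relevant document from rank 1 to rank 2, so this feature is load-bearing for the query. Averaging the drop over all queries gives a per-feature faithfulness score.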

This analysis surfaces a question that benchmark scores cannot answer: which parts of the embedding actually drive retrieval performance? To run it, use aquin sae-faithfulness on your pairs JSONL. The companion command aquin space-decomp decomposes a single embedding into sparse feature contributions for debugging borderline queries.

Cross-model feature matching

Two embedding models trained on similar data may learn similar concepts, but their SAE feature dictionaries are completely independent. Cross-model matching identifies which features correspond across models by computing cosine similarity between decoder vectors. Features with high decoder similarity represent the same concept in both models. Features with no match are model-specific: concepts that one model has learned and the other has not.
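A sketch of the matching step follows. It assumes the two dictionaries' decoder vectors have the same dimensionality and live in comparable coordinates, which is not guaranteed for unrelated models. The 0.8 threshold, the feature ids, and the vectors are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def match_features(decoders_a, decoders_b, threshold=0.8):
    """For each feature in model A, its best match in model B, or None if too weak."""
    matches = {}
    for id_a, vec_a in decoders_a.items():
        best_id, best_sim = max(
            ((id_b, cosine(vec_a, vec_b)) for id_b, vec_b in decoders_b.items()),
            key=lambda item: item[1],
        )
        matches[id_a] = (best_id, best_sim) if best_sim >= threshold else None
    return matches

# feature id -> decoder vector for each model's SAE
model_a = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}
model_b = {10: [0.9, 0.1, 0.0], 11: [0.0, 0.0, 1.0]}
matches = match_features(model_a, model_b)
```

Feature 1 finds a close counterpart in the second model, while feature 2 has no match above the threshold and would be reported as model-specific.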

This is useful when choosing between models for a task. If both models have learned the concepts relevant to your domain and the features match closely, the models are interchangeable for that domain and you can pick on latency and size. If one model has domain-specific features that the other lacks, that is a meaningful capability difference. Cross-model matching makes the comparison specific rather than abstract.

Aquin Labs · aquin@aquin.app