Embedding Architecture (Re-ranker Disabled, Gap-Fill Neutral)

Status (session 23): Both the embedding re-ranker and the gap-fill seeds are confirmed neutral on honest cold-start measurement. Three runs, with and without embeddings, produced effectively identical P@10 (0.176, 0.175, 0.176). The earlier "+11% gap-fill" finding was an artifact of task-memory contamination: gap-fill kept injecting symbols that accumulated task memory then boosted, creating a feedback loop. Framework equivalence classes (263 classes as of session 23) now close the vocabulary gap that gap-fill was designed for, which makes embeddings redundant. The infrastructure is preserved for future investigation with code-tuned models.

The embedding re-ranker is a post-RWR stage, currently disabled, that reorders the top-50 candidates by cosine similarity to the task description. It uses a pre-trained embedding model (nomic-embed-text-v1.5 by default) running locally via pure-Go ONNX inference: no API calls, no cloud services, no charges.

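Originally reported impact (before the session 23 cold-start re-measurement): P@10 0.207 -> 0.247 (+19%), R@10 0.306 -> 0.380 (+24%) on the full 167-task cross-system benchmark. Every metric improved, which at the time made it the biggest single improvement in project history.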

Architecture

The re-ranker sits between scoring (step 7) and budget packing (step 8) in the retrieval pipeline:

[7. Scoring]           6-component formula
    |
    v
[7b. Embedding Re-rank] embed query, cosine-sort top-50 candidates
    |
    v
[8. Budget Packing]    density-ranked greedy knapsack

Why re-ranking, not independent search

Three models (BGE, jina-code, nomic) were tested as an independent Channel 3 (embed the query, run an HNSW search, RRF-fuse with graph results). All three produced P@10 identical to the baseline. The models find the same symbols as BM25 because the vocabulary gap between task descriptions and symbol names is a structural problem, not a model quality problem.
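For readers unfamiliar with the fusion step, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It is an illustration only: the function name rrfFuse, the symbol names, and the constant k=60 (a commonly used RRF default) are assumptions, not taken from the project's code.

```go
package main

import (
	"fmt"
	"sort"
)

// rrfFuse merges ranked lists with Reciprocal Rank Fusion:
// score(id) = sum over lists of 1 / (k + rank), where rank starts at 1.
// Hypothetical helper; not the project's implementation.
func rrfFuse(k float64, lists ...[]string) []string {
	scores := make(map[string]float64)
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}
	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b] // deterministic tie-break by name
	})
	return ids
}

func main() {
	// Made-up symbol names standing in for graph and vector channel results.
	graph := []string{"ParseConfig", "LoadFile", "Validate"}
	vector := []string{"ParseConfig", "ReadEnv", "LoadFile"}
	fmt.Println(rrfFuse(60, graph, vector))
}
```

When both channels return largely the same symbols, as in this toy input, fusion can only reshuffle them slightly, which is consistent with the unchanged P@10 described above.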

Re-ranking works because the graph already surfaces relevant candidates via structural relationships (calls, imports, type membership). The embedding model then promotes candidates whose textual description is semantically close to the task, even when their graph score was low. The architecture matters more than the model.
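The re-rank step itself is small. The sketch below sorts a candidate list by cosine similarity to a query vector. The candidate type, the reRank function, and the toy 2-dimensional vectors are hypothetical stand-ins; the real call site is reRankWithEmbeddings and may differ in every detail.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// candidate is a hypothetical stand-in for a graph-surfaced symbol
// together with its embedding vector.
type candidate struct {
	Name   string
	Vector []float32
}

// cosine returns the cosine similarity of a and b, or 0 if the lengths
// differ or either vector has zero norm.
func cosine(a, b []float32) float64 {
	if len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// reRank returns a copy of cands ordered by descending similarity to query.
// The sort is stable, so ties keep their original (graph) order.
func reRank(query []float32, cands []candidate) []candidate {
	out := append([]candidate(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		return cosine(query, out[i].Vector) > cosine(query, out[j].Vector)
	})
	return out
}

func main() {
	query := []float32{1, 0}
	cands := []candidate{
		{"A", []float32{0, 1}},
		{"B", []float32{1, 0}},
		{"C", []float32{1, 1}},
	}
	for _, c := range reRank(query, cands) {
		fmt.Printf("%s %.3f\n", c.Name, cosine(query, c.Vector))
	}
}
```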

Why pure re-rank (weight=0.0)

A parameter sweep tested blended scoring at weights 0.0, 0.5, 0.7, 0.85, 0.95, 1.0 (where weight = original score contribution). Pure re-rank (weight=0.0) produced the best P@10 and R@10 on the full corpus. Higher weights preserve MRR (rank of the single best result) but sacrifice recall. Since consumers read all 10 results (not just #1), pure re-rank is the correct default.

Configurable via ReRankOriginalWeight in internal/context/walk.go.
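One way to read the weight is as a linear blend, sketched below. This is an assumption: the source only states that the weight is the original score's contribution, with 0.0 meaning pure re-rank and 1.0 meaning no re-rank. The linear formula, the scored type, and the assumption that both scores are normalized to [0, 1] are illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// scored is a hypothetical candidate carrying both scores.
type scored struct {
	Name       string
	GraphScore float64 // original pipeline score, assumed normalized to [0, 1]
	Similarity float64 // cosine similarity to the query, assumed in [0, 1]
}

// blend returns w*original + (1-w)*similarity.
// w = 0.0 ignores the original score; w = 1.0 ignores the similarity.
func blend(original, similarity, w float64) float64 {
	return w*original + (1-w)*similarity
}

// rank orders candidate names by descending blended score.
func rank(cands []scored, w float64) []string {
	out := append([]scored(nil), cands...)
	sort.SliceStable(out, func(i, j int) bool {
		return blend(out[i].GraphScore, out[i].Similarity, w) >
			blend(out[j].GraphScore, out[j].Similarity, w)
	})
	names := make([]string, len(out))
	for i, c := range out {
		names[i] = c.Name
	}
	return names
}

func main() {
	cands := []scored{
		{"HighGraph", 0.9, 0.2},
		{"HighSim", 0.3, 0.8},
	}
	fmt.Println(rank(cands, 0.0)) // pure re-rank: similarity decides
	fmt.Println(rank(cands, 1.0)) // no re-rank: graph score decides
}
```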

Vector Cache

Embedding inference is the bottleneck: ~13ms per text in batch, ~660ms for the full 51-text re-rank call (1 query + 50 candidates). The vector cache eliminates redundant inference by storing pre-computed vectors in SQLite.

How it works

Index time:
  IndexBatch(nodes) -> EmbedBatch(texts) -> store vectors in HNSW + SQLite

Re-rank time (cached):
  ReRankByHashes(query, nodeHashes, fallbackTexts)
    1. Embed query (1 text, ~120ms)
    2. Read 50 vectors from SQLite by node_hash (~100ms)
    3. Compute 50 cosine similarities (~0.006ms)
    Total: ~220ms

Re-rank time (uncached, first run):
  ReRankByHashes(query, nodeHashes, fallbackTexts)
    1. Embed query (1 text)
    2. Cache miss on all hashes
    3. Embed 50 fallback texts (~540ms)
    4. Persist vectors to SQLite for next time
    Total: ~660ms (same as old path, but only happens once)
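
The hit/miss flow above can be sketched as follows. An in-memory map stands in for the SQLite embeddings table, and embedFunc stands in for the batch embedder. The names vectorCache, cacheKey, and vectorsFor are hypothetical; the actual ReRankByHashes signature and internals are not shown in this document.

```go
package main

import "fmt"

// cacheKey mirrors the (node_hash, model) primary key.
type cacheKey struct {
	NodeHash string
	Model    string
}

// embedFunc is a stand-in for a batch embedding call.
type embedFunc func(texts []string) [][]float32

// vectorCache is a hypothetical in-memory substitute for the SQLite table.
type vectorCache struct {
	model string
	rows  map[cacheKey][]float32
	embed embedFunc
}

// vectorsFor returns one vector per hash. Cached vectors are read directly;
// misses are embedded from their fallback text in one batch and persisted.
func (c *vectorCache) vectorsFor(hashes, fallbackTexts []string) (vecs [][]float32, misses int) {
	vecs = make([][]float32, len(hashes))
	var missIdx []int
	var missTexts []string
	for i, h := range hashes {
		if v, ok := c.rows[cacheKey{h, c.model}]; ok {
			vecs[i] = v
			continue
		}
		missIdx = append(missIdx, i)
		missTexts = append(missTexts, fallbackTexts[i])
	}
	if len(missIdx) > 0 {
		fresh := c.embed(missTexts)
		for j, i := range missIdx {
			vecs[i] = fresh[j]
			c.rows[cacheKey{hashes[i], c.model}] = fresh[j] // persist for next time
		}
	}
	return vecs, len(missIdx)
}

func main() {
	// Fake embedder: a 1-dimensional "vector" holding the text length.
	fake := func(texts []string) [][]float32 {
		out := make([][]float32, len(texts))
		for i, t := range texts {
			out[i] = []float32{float32(len(t))}
		}
		return out
	}
	c := &vectorCache{model: "demo-model", rows: map[cacheKey][]float32{}, embed: fake}
	hashes := []string{"h1", "h2"}
	texts := []string{"func A()", "func B() error"}

	_, first := c.vectorsFor(hashes, texts)
	_, second := c.vectorsFor(hashes, texts)
	fmt.Println("misses on first call:", first, "on second call:", second)
}
```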

Storage

The embeddings table stores vectors keyed by (node_hash, model):

CREATE TABLE embeddings (
    node_hash  BLOB NOT NULL,
    model      TEXT NOT NULL,
    vector     BLOB NOT NULL,
    PRIMARY KEY (node_hash, model)
);

Vectors are serialized as little-endian float32 arrays. At 768 dimensions (jina-code), each vector is 3072 bytes (768 x 4 bytes per float32). Approximate storage overhead for typical repos:

Repo size	Nodes embedded	Cache size
Small (1K nodes)	1,000	~3 MB
Medium (5K nodes)	5,000	~15 MB
Large (50K nodes)	50,000	~150 MB
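
The serialization format described above can be illustrated with a short round-trip sketch. The function names encodeVector and decodeVector are hypothetical; only the little-endian float32 layout and the 768-dimension size come from this document.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeVector packs a float32 slice into a little-endian byte blob,
// 4 bytes per element.
func encodeVector(v []float32) []byte {
	buf := make([]byte, 4*len(v))
	for i, f := range v {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(f))
	}
	return buf
}

// decodeVector reverses encodeVector and rejects blobs whose length
// is not a multiple of 4.
func decodeVector(b []byte) ([]float32, error) {
	if len(b)%4 != 0 {
		return nil, fmt.Errorf("vector blob length %d is not a multiple of 4", len(b))
	}
	v := make([]float32, len(b)/4)
	for i := range v {
		v[i] = math.Float32frombits(binary.LittleEndian.Uint32(b[4*i:]))
	}
	return v, nil
}

func main() {
	blob := encodeVector(make([]float32, 768))
	fmt.Println("bytes per 768-dim vector:", len(blob))

	back, err := decodeVector(encodeVector([]float32{1.5, -2}))
	fmt.Println(back, err)
}
```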

The model column allows multiple models to coexist. Switching models (via KNOWING_EMBED_MODEL) does not invalidate vectors from other models.

Cache lifecycle

Population: IndexBatch writes vectors on every index run. Re-indexing updates vectors for changed nodes (UPSERT).
Cache miss: ReRankByHashes falls back to on-the-fly embedding and auto-persists the result. First query after a fresh index is uncached; all subsequent queries hit cache.
Invalidation: Node hashes are content-addressed. When a symbol changes, its hash changes, so stale vectors are naturally orphaned. Old vectors from deleted nodes remain but are never queried (no hash match).
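
The invalidation rule can be shown with a toy content-addressed key. SHA-256 over the symbol text is an assumption made for this sketch; the document does not say how node hashes are actually computed, and nodeHash is a hypothetical name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// nodeHash derives a content-addressed key from symbol text.
// SHA-256 is an assumed choice for illustration only.
func nodeHash(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

func main() {
	cache := map[string][]float32{} // stand-in for the embeddings table

	oldKey := nodeHash("func Add(a, b int) int { return a + b }")
	cache[oldKey] = []float32{0.1, 0.2}

	// Editing the symbol changes its hash, so the lookup misses and the
	// old row is simply never read again.
	newKey := nodeHash("func Add(a, b int64) int64 { return a + b }")
	_, hit := cache[newKey]
	fmt.Println("keys differ:", oldKey != newKey, "cache hit:", hit)
}
```

No explicit delete is needed for correctness under this scheme, though orphaned rows still occupy space until something prunes them.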

Latency Profile

Measured on Apple Silicon, nomic-code (768 dims), hugot v0.7.2 pure-Go ONNX.

Operation	Time
Model load (one-time)	73ms
Single embed	120ms
Batch 51 (old re-rank, no cache)	660ms
Cached re-rank (embed 1 + SQLite read)	220ms
50x cosine similarity (768-dim)	0.006ms

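In the uncached path, essentially all time is ONNX inference (tokenization + forward pass). In the cached path, time is split between the single query embedding (~120ms) and the SQLite read (~100ms). Cosine computation is negligible in both. The cached path is 3x faster than the uncached one (220ms vs 660ms).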

Why not GPU acceleration

CoreML/Metal support exists in hugot via the ORT (ONNX Runtime) backend, but it requires CGO and a platform-specific shared library (libonnxruntime.dylib). This would break knowing's zero-runtime-deps guarantee. With the vector cache, the remaining inference cost is a single query embedding (~120ms), which is acceptable for interactive use. GPU acceleration would save ~90ms at the cost of portability.

Configuration

Setting	Default	Description
`--embeddings`	off	Enable embedding re-ranker on `knowing mcp`
`--embed-model`	nomic-code	Model: `nomic-code` (default), `jina-code`, `bge-small`
`KNOWING_EMBED_MODEL`	nomic-code	Env var (CLI flag takes precedence)
`BENCH_EMBEDDINGS`	0	Enable in benchmark adapter
`BENCH_RERANK_WEIGHT`	0.0	Override `ReRankOriginalWeight` in benchmarks
`ReRankOriginalWeight`	0.0	Blend weight (0.0 = pure re-rank, 1.0 = no re-rank)
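
The precedence between the CLI flag and the environment variable can be sketched as below. The flag and variable names come from the table above; the resolveModel function, its signature, and the use of an empty string as the "flag not set" marker are assumptions for illustration and may not match how knowing parses its options.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
)

const defaultModel = "nomic-code"

// resolveModel picks the embedding model: CLI flag first, then the
// KNOWING_EMBED_MODEL environment variable, then the default.
// getenv is injected so the lookup can be exercised without touching
// the real environment.
func resolveModel(args []string, getenv func(string) string) (string, error) {
	fs := flag.NewFlagSet("knowing", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep parse errors out of stderr in this sketch
	model := fs.String("embed-model", "", "embedding model name")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if *model != "" {
		return *model, nil
	}
	if v := getenv("KNOWING_EMBED_MODEL"); v != "" {
		return v, nil
	}
	return defaultModel, nil
}

func main() {
	fromFlag, _ := resolveModel([]string{"--embed-model", "jina-code"}, os.Getenv)
	fmt.Println("with flag:", fromFlag)

	fromEnvOrDefault, _ := resolveModel(nil, os.Getenv)
	fmt.Println("without flag:", fromEnvOrDefault)
}
```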

Code Map

File	Purpose
`internal/embedding/embedding.go`	`Embedder`: hugot session, model loading, `Embed`/`EmbedBatch`
`internal/embedding/searcher.go`	`Searcher`: HNSW index, `IndexBatch`, `ReRank`, `ReRankByHashes`, vector cache
`internal/context/context.go`	`VectorReRanker` interface, `reRankWithEmbeddings` call site
`internal/context/walk.go`	`ReRankOriginalWeight` (blend parameter)
`internal/store/migrations/019_add_embeddings.sql`	Schema for vector cache
`internal/store/sqlite.go`	`BatchPutEmbeddings`, `GetEmbeddings`
`bench/cross-system/EMBEDDING-EVAL.md`	Evaluation log: all experiments, results, latency data
`docs/proposals/pure-go-embeddings.md`	Original proposal and discovery narrative

Key Findings

Architecture > model. Three models produced identical results as independent search. The re-ranker architecture unlocked value that no model switch could provide.
Pure re-rank beats blending. Weight=0.0 is optimal because agents consume all 10 results, not just the top-1. Blending preserves MRR at the cost of recall.
Cache eliminates the latency problem. The "11s/task" number that originally motivated a custom inference engine was total indexing time, not per-query cost. With cached vectors, re-rank is 220ms per query.
Persistent pack cache must be disabled for experiments. The notes-table cache returns stale results, masking all delta measurements. Always call DisablePersistentCache() in benchmarks.
Embeddings help most where graph connectivity is sparse. Kubernetes (+92.8%) and Kafka (+39.5%) saw the largest gains in session 15. The regressions on VS Code (-16%) and Ocelot (-30.8%) initially reported in session 15 were not reproducible in session 16: both repos showed a 0% P@10 delta, with neutral-to-positive changes in NDCG and MRR. The regressions were likely artifacts of the pre-vector-cache build.