Concurrency Primer
This guide explains where knowing uses Go's concurrency primitives and, more importantly, where it does not. Most of the codebase is sequential. Concurrency exists in six subsystems, each for a specific reason. If you are adding code to knowing, read this before reaching for go.
Overview
knowing uses goroutines, channels, sync.WaitGroup, sync.Mutex, sync.RWMutex, sync/atomic, and semaphore-patterned buffered channels. The indexer pipeline and the enrichment system are the two largest concurrent subsystems. The retrieval path (RWR, HITS, ranking, packing) is entirely sequential but uses a precomputed adjacency cache to eliminate per-node database queries during the walk. The MCP server handles concurrent requests but delegates all coordination to SQLite's WAL mode.
The guiding principle: add concurrency only where a measurable bottleneck exists. Extraction is CPU-bound (tree-sitter parsing). SQLite writes are IO-bound. Separating them into a pipeline doubles throughput. LSP enrichment is IO-bound (network round-trips to language servers). Parallelizing LSP calls with serialized DB writes gives near-linear speedup. Retrieval already completes in under 100ms, so parallelizing it would add complexity for no user benefit.
The Indexer Pipeline
The indexer (internal/indexer/indexer.go) uses a producer-consumer pipeline to overlap CPU-bound extraction with IO-bound storage.
Pipeline architecture:
flowchart LR
subgraph Feeder
F[File Feeder]
end
subgraph Workers["Extraction Workers (GOMAXPROCS)"]
W1[Worker 1]
W2[Worker 2]
W3[Worker N]
end
subgraph Writer
WR[Batch Writer]
end
F -->|workCh| W1
F -->|workCh| W2
F -->|workCh| W3
W1 -->|resultCh| WR
W2 -->|resultCh| WR
W3 -->|resultCh| WR
WR -->|batch INSERT| DB[(SQLite WAL)]
extractWg[extractWg.Wait] -.->|tracks| Workers
writeWg[writeWg.Wait] -.->|tracks| Writer
Architecture:
[workCh] --> N extraction workers --> [resultCh] --> 1 storage writer --> SQLite
- N extraction workers (defaults to
runtime.GOMAXPROCS(0)) read file indices fromworkChand producefileResultvalues intoresultCh. - 1 writer goroutine (the main goroutine of
IndexRepo) reads fromresultChand batches results into SQLite. extractWgtracks when all workers have finished, so the feeder goroutine knows when to closeresultCh.- Channel backpressure:
resultChis buffered atnumWorkers*2. If the writer falls behind, workers block on the send rather than accumulating unbounded results in memory.
resultCh := make(chan fileResult, numWorkers*2)
Why this design: extraction is CPU-bound (tree-sitter parsing takes 1-10ms per file). SQLite writes are IO-bound (batch INSERT across a transaction). Running them in the same goroutine means the CPU sits idle during writes and the disk sits idle during parsing. The pipeline keeps both saturated.
Error Propagation Across Goroutine Boundaries
Extraction workers send errors through resultCh as part of the fileResult struct. There is no separate error channel. The writer goroutine (main goroutine) checks each result:
for fr := range resultCh {
if fr.err != nil {
storeErr = fmt.Errorf("extract file %s: %w", fr.relPath, fr.err)
break
}
// ... batch and store ...
}
When an error occurs, the writer breaks out of the loop. This means:
- Remaining results still buffered in
resultChare abandoned (never read). - Workers that try to send after the writer stops will block until the process exits.
- The function returns
storeErrto its caller. - Partial data already flushed to SQLite in previous batches remains (knowing's content-addressed design makes this safe; the next index run will re-extract the failed file).
Why not use a separate error channel? A single struct keeps the code simpler. The pipeline stops on first error rather than accumulating partial results because a half-indexed file would leave the graph inconsistent. The next run picks up where this one failed.
Batch Flush Timing
The writer goroutine accumulates results and flushes them to SQLite under two conditions:
const batchSize = 500
if len(batchFiles) >= batchSize || completed == totalFiles {
// Flush batch to DB...
}
- Every 500 files: once the batch accumulates 500 file results, flush immediately. This bounds memory usage (at most 500 files' worth of nodes/edges in memory) and gives SQLite data within seconds of starting.
- When the channel drains (final batch): when
completed == totalFiles, flush whatever remains, even if it is less than 500 files. This handles the tail end of processing.
Why 500? Empirically, 500 files produce roughly 2000-5000 nodes and a similar number of edges. Each multi-row INSERT stays well under SQLite's 999-variable limit. Larger batches reduce transaction overhead; smaller batches reduce memory. 500 is the sweet spot tested on repos from 1K to 200K files.
Kill safety: because batches flush incrementally, killing the process at any time leaves a partial but valid graph in SQLite. The next run detects which files were not indexed (via content-hash comparison) and picks them up.
What Happens if the Writer Goroutine Panics
The writer goroutine is the main goroutine of IndexRepo (the for fr := range resultCh loop). If it panics:
- The panic propagates up the call stack and crashes the process (or is caught by a higher-level
recoverif one exists). - Extraction workers continue running until they try to send on
resultCh. Since nobody is reading, buffered slots fill up and workers block permanently. - The feeder goroutine (which closes
resultChafterextractWg.Wait()) also blocks because workers never finish. - Go's runtime detects the main goroutine panic and tears down all goroutines.
In practice this is not a concern because the writer loop's operations (append to slice, call BatchPutNodes) do not panic under normal conditions. SQLite errors are returned, not panicked. The only realistic panic source would be a nil pointer dereference on a malformed fileResult, which would indicate a bug in an extractor.
The CGO Watchdog Pattern
Problem: tree-sitter is a C library called via CGO. Go's context.WithTimeout cancels goroutines by closing a channel, but CGO calls cannot be interrupted by Go. A file with pathological nesting (e.g., deeply nested JSON) can block a tree-sitter parse for minutes.
Solution: fire-and-forget goroutine with a timer select.
Watchdog timeout flow:
sequenceDiagram
participant Main as Main Goroutine
participant Timer as 10s Timer
participant CGO as CGO Goroutine
participant TS as tree-sitter (C)
Main->>CGO: go func() { extract(file) }
Main->>Timer: time.After(10s)
alt Extraction completes first
CGO->>TS: Parse file
TS-->>CGO: AST result
CGO-->>Main: result via channel
Main->>Main: Process result
else Timer fires first
Timer-->>Main: timeout!
Main->>Main: Skip file, log warning
Note over CGO: Goroutine leaks (safe:<br/>stateless, no locks held)
end
done := make(chan extractResult, 1)
go func() {
r, f, err := idx.extractFile(ctx, opts)
done <- extractResult{result: r, file: f, err: err}
}()
timer := time.NewTimer(10 * time.Second)
select {
case er := <-done:
timer.Stop()
resultCh <- fileResult{...}
case <-timer.C:
// Extraction stuck in CGO. Skip this file.
resultCh <- fileResult{result: &types.ExtractResult{}, ...}
}
If the timer fires first, the file is skipped. The background goroutine eventually completes and its result is discarded (nobody reads from done). The goroutine "leaks" in the sense that it continues running, but Go's runtime cleans it up on process exit.
Why this is safe: extraction is stateless per-file. A leaked goroutine holds no locks, writes nothing to the database, and mutates no shared state. The only cost is memory for the parse tree until the CGO call returns.
Why not context.WithTimeout? Go contexts cancel by closing a channel. For pure-Go code, goroutines check ctx.Done() between operations and exit early. But CGO calls are opaque to the Go runtime: the C code never checks channel state, so context cancellation has no effect on a running tree-sitter parse. The watchdog pattern is the only reliable way to bound CGO execution time.
The Git Blame Worker Pool
After extraction completes, the indexer runs git blame on each changed file to extract authorship edges (authored_by). This is expensive (one subprocess per file) and parallelized with a dedicated worker pool.
Architecture:
blameResults := make([]blameResult, len(allFiles)) // pre-allocated, indexed by file
var blameWg sync.WaitGroup
blameCh := make(chan int, len(allFiles))
for w := 0; w < blameWorkers; w++ {
blameWg.Add(1)
go func() {
defer blameWg.Done()
for i := range blameCh {
// Run git blame, store result at blameResults[i]
blameResults[i] = blameResult{nodes: an, edges: ae}
}
}()
}
Key pattern: pre-allocated results slice. Each worker writes to blameResults[i] where i is the file index it received from the channel. Because each index is unique (the channel sends each i exactly once), workers never write to the same slot. This eliminates the need for a mutex on the results collection.
Why this is safe without synchronization:
- Each i is sent exactly once on blameCh, so only one worker ever writes to blameResults[i].
- blameWg.Wait() provides the happens-before guarantee: all writes complete before the main goroutine reads the results.
- Workers read allFiles[i] (shared immutable slice, set before workers start) and call idx.store.NodesByFilePath (thread-safe read, WAL mode).
Error handling: errors from individual git blame invocations are silently skipped (continue). This is intentional: authorship is best-effort metadata. A file that fails blame (binary file, detached HEAD, shallow clone) should not block the index run.
The Enrichment System
The enricher (internal/enrichment/enricher.go) upgrades ast_inferred edges to lsp_resolved by querying language servers (gopls, typescript-language-server, pyright, etc.). It introduces several concurrency patterns distinct from the indexer because the bottleneck is LSP round-trip latency rather than CPU.
Producer-Consumer for Edge Upgrades
upgradeCallEdges uses the same producer-consumer pattern as the indexer but inverted: workers are IO-bound (waiting for LSP responses) rather than CPU-bound.
Architecture:
[workItems] --> N LSP workers (semaphore-bounded) --> [results channel] --> 1 DB writer
- N LSP workers resolve edges via
GetDefinition. After the two-phase warmup (one probe file + retry until server responds), the semaphore is set to 128 concurrent workers. Before warmup, it starts ate.concurrency(default 8). - 1 writer goroutine reads from
resultsand serializes allDeleteEdge/PutEdgemutations against SQLite. writerWgensures the writer finishes beforeupgradeCallEdgesreturns.
// Post-warmup: high concurrency once server is loaded.
postWarmupConcurrency := 128
sem := make(chan struct{}, postWarmupConcurrency)
// Results channel: parallel workers -> serial writer.
results := make(chan edgeResolveResult, postWarmupConcurrency*2)
Why this design: LSP calls take 5-50ms each post-warmup. With 128 concurrent workers, 42K edges on kubernetes resolve in ~20 minutes. The single writer avoids SQLite SQLITE_BUSY contention entirely. The two-phase warmup is critical: without it, all 128 workers fire requests before the server has loaded any packages, producing zero results.
Semaphore Pattern (Buffered Channel as Concurrency Limiter)
Both upgradeCallEdges and discoverNewEdgesBatched use the same semaphore pattern:
sem := make(chan struct{}, e.concurrency)
for i := range workItems {
wg.Add(1)
sem <- struct{}{} // blocks when concurrency slots are full
go func(item WorkItem) {
defer wg.Done()
defer func() { <-sem }() // release slot
// ... do work ...
}(workItems[i])
}
wg.Wait()
The pattern is a bounded fan-out: the main goroutine spawns workers but blocks on the semaphore send when all slots are occupied. Each worker releases its slot on completion. Unlike a fixed pool of long-lived workers, this pattern spawns one goroutine per work item but never exceeds e.concurrency active goroutines at once.
File-Level Parallelism in Discovery
discoverNewEdgesBatched (internal/enrichment/enricher.go) processes files in batches of 50. Within each batch, individual files are queried concurrently (bounded by the semaphore). Between batches, files are closed to release LSP server memory (gopls on Kubernetes with 3K open files consumes 900MB+).
Batch 1: open 50 files -> concurrent symbol queries -> close 50 files
Batch 2: open 50 files -> concurrent symbol queries -> close 50 files
...
DB writes from concurrent workers are serialized by e.writeMu:
type Enricher struct {
// ...
writeMu sync.Mutex // serializes DB writes from concurrent discover workers
}
func (e *Enricher) insertEdgesFromLocations(...) {
for _, loc := range locations {
e.writeMu.Lock()
existing, err := e.store.GetEdge(ctx, edgeHash)
if err != nil || existing != nil {
e.writeMu.Unlock()
continue
}
_ = e.store.PutEdge(ctx, edge)
e.writeMu.Unlock()
}
}
The writeMu ensures that the check-then-insert for each edge is atomic. Without it, two workers discovering the same edge could both pass the existence check and attempt duplicate inserts (causing SQLite UNIQUE constraint violations or SQLITE_BUSY).
Multi-Module Orchestration
runMultiModule (internal/enrichment/enricher.go) handles Go workspaces with multiple go.mod files. Each module needs its own gopls instance because gopls operates per-module.
Strategy:
- Root module (sequential): processed first, solo, because it is typically large (1000+ files, 1.2GB gopls memory).
- Sub-modules (4 parallel): processed concurrently after the root completes. Sub-modules are small (200-500 files each), so 4 simultaneous gopls instances use ~800MB total.
const moduleParallelism = 4
modSem := make(chan struct{}, moduleParallelism)
var modWg sync.WaitGroup
for i := range subModules {
modWg.Add(1)
modSem <- struct{}{}
go func(mod ModuleInfo, idx int) {
defer modWg.Done()
defer func() { <-modSem }()
e.enrichModule(ctx, serverCfg, ..., mod, ...)
}(subModules[i], i)
}
modWg.Wait()
Each enrichModule call starts its own gopls instance, runs the full enrichment pipeline (upgrade + discover), and shuts down the server. Progress is tracked so interrupted runs resume without re-processing completed modules.
Per-Symbol Timeout with Child Context
WithSymbolTimeout (internal/enrichment/timeout.go) wraps each LSP call in a timeout without cancelling the parent context:
func WithSymbolTimeout(ctx context.Context, timeout time.Duration, fn func(ctx context.Context) error) error {
childCtx, cancel := context.WithTimeout(ctx, timeout)
defer cancel()
done := make(chan error, 1)
go func() {
done <- fn(childCtx)
}()
select {
case err := <-done:
return err
case <-childCtx.Done():
if ctx.Err() != nil {
return ctx.Err() // parent cancelled
}
return ErrSymbolTimeout // only this symbol timed out
}
}
Key property: cancelling the child context does NOT cancel the parent. If one symbol's GetReferences call hangs (e.g., gopls analyzing a massive package), only that symbol is skipped. The remaining symbols in the file continue processing normally. The default timeout is 10 seconds per symbol.
Atomic Counters for Statistics
enrichStats uses sync/atomic.Int64 fields for lock-free concurrent updates:
type enrichStats struct {
edgesProcessed atomic.Int64
edgesUpgraded atomic.Int64
edgesSkipped atomic.Int64
edgeErrors atomic.Int64
newEdges atomic.Int64
filesProcessed atomic.Int64
fileErrors atomic.Int64
}
Multiple LSP workers call stats.edgesProcessed.Add(1) concurrently without coordination. The final log summary reads all counters after wg.Wait() ensures all workers have completed (happens-before).
Progress Persistence with Atomic File Writes
EnrichProgress (internal/enrichment/progress.go) tracks per-module completion for crash recovery:
- Thread safety:
sync.Mutexprotects theModulesmap.MarkModuleandIsCompleteboth acquire the lock. - Crash safety:
SaveProgresswrites to a.tmpfile first, then atomically renames it over the target. If the process crashes during the write, the previous progress file remains intact. - Resume semantics: on startup,
LoadProgressreads the file. Modules withCompleted: trueare skipped. This prevents re-enriching modules that finished in a previous (interrupted) run.
func SaveProgress(workspaceRoot string, p *EnrichProgress) error {
tmp := target + ".tmp"
os.WriteFile(tmp, data, 0644)
return os.Rename(tmp, target) // atomic on POSIX
}
SQLite Single-Writer
SQLite in WAL (Write-Ahead Logging) mode allows any number of concurrent readers but exactly one writer at a time. knowing's architecture respects this constraint at every level.
Pragmas (set in internal/store/sqlite.go):
| Pragma | Value | Why |
|---|---|---|
journal_mode |
WAL | Concurrent readers during writes |
synchronous |
NORMAL | fsync on checkpoint only, not every commit |
busy_timeout |
5000 | Retry for 5s on lock contention instead of immediate SQLITE_BUSY |
mmap_size |
268435456 | 256MB memory-mapped IO for read-heavy workloads |
cache_size |
-64000 | 64MB page cache |
Batch INSERT strategy: multi-row INSERT statements pack multiple rows per SQL statement to reduce transaction overhead. The chunk sizes are chosen to stay under SQLite's 999-variable limit:
- Edges: 100 per statement (9 params each = 900)
- Nodes: 99 per statement (10 params each = 990)
- Files: 249 per statement (4 params each = 996)
Design rule: all writes flow through the single writer goroutine in the indexer pipeline. Extraction workers never touch the database directly. They produce results into resultCh, and the writer goroutine consumes them.
In-Memory Node/Edge Cache (sync.Map + Atomic)
The SQLiteStore (internal/store/sqlite.go) layers an in-process cache on top of SQLite to eliminate redundant SQL round-trips on hot-path traversals (blast_radius walks hundreds of edges, each requiring a node lookup).
Implementation:
type SQLiteStore struct {
db *sql.DB
nodeCache sync.Map // types.Hash -> *types.Node
nodeCacheCount atomic.Int64
edgeCache sync.Map // types.Hash -> *types.Edge
edgeCacheCount atomic.Int64
}
Why sync.Map instead of map + RWMutex? The access pattern is heavily read-biased: many concurrent MCP handlers read cached nodes, while writes happen rarely (only on cache miss). sync.Map is optimized for this pattern. It uses internal atomics and copy-on-write techniques that avoid lock contention for reads, whereas RWMutex still contends on the read counter.
Eviction strategy (full clear at capacity):
const nodeCacheMaxEntries = 50_000
func (s *SQLiteStore) GetNode(ctx context.Context, hash types.Hash) (*types.Node, error) {
if v, ok := s.nodeCache.Load(hash); ok {
return v.(*types.Node), nil
}
// ... query DB ...
if s.nodeCacheCount.Load() >= nodeCacheMaxEntries {
s.nodeCache.Range(func(k, _ any) bool {
s.nodeCache.Delete(k)
return true
})
s.nodeCacheCount.Store(0)
}
s.nodeCache.Store(hash, n)
s.nodeCacheCount.Add(1)
return n, nil
}
When the cache exceeds 50,000 entries, it is cleared entirely. This is the simplest possible eviction: no LRU tracking, no sharding, no partial eviction. It works because:
- The cache refills quickly (blast_radius traversals re-populate hundreds of entries per call).
- 50K entries is bounded memory (roughly 50K * 200 bytes = 10MB).
- Full cache clear happens rarely in practice (only during very large traversals or after indexing many repos).
Invalidation: InvalidateCache() clears both caches completely. It is called at the start of each index run so freshly written rows are not shadowed by stale cached values.
Thread safety of eviction: the Range + Delete pattern is safe with sync.Map (it tolerates concurrent loads during deletion). The atomic counter might briefly go negative if multiple goroutines evict simultaneously, but Store(0) resets it. In practice, GetNode is only called from MCP handlers (read path) and eviction is extremely rare.
SubgraphCache and Merkle-Keyed Invalidation
The SubgraphCache (internal/cache/subgraph.go) caches serialized query results (blast_radius, test_scope) keyed by the Merkle root hash of the queried subgraph.
Thread safety: sync.RWMutex protects the entries map:
- Get: acquires read lock to check the map, then acquires write lock to update hit/miss counters.
- Put: acquires exclusive write lock.
- InvalidatePackages: acquires exclusive write lock.
Merkle-keyed invalidation:
The hierarchical Merkle tree assigns each package a root hash derived from all its edges. When a file changes, only its package's root hash changes. Cache entries are keyed by this package root hash, so:
- After each index run, the daemon computes
DiffHierarchicalTrees(prevTree, newTree). - The diff produces
ChangedPackages,AddedPackages,RemovedPackages. - For each changed package, the daemon calls
cache.InvalidatePackages(changedPkgs, newTree). InvalidatePackageslooks up each package's new root hash in the tree and deletes the entry.
func (c *SubgraphCache) InvalidatePackages(packages []string, tree *snapshot.HierarchicalTree) {
c.mu.Lock()
defer c.mu.Unlock()
for _, pkg := range packages {
root, ok := tree.PackageRoots[pkg]
if !ok {
continue
}
delete(c.entries, root)
}
}
Why this works: if a package has NOT changed, its Merkle root is the same, so any cached result for that root is still valid. Only changed packages get evicted. On a 200K-edge graph where a single file changes, this evicts one or two cache entries instead of the entire cache.
Eviction at capacity: when the cache is full and a new key is being inserted, one random entry is removed (Go map iteration order is random). This is O(1) and avoids the complexity of LRU tracking. For knowing's use case (bounded query diversity, most queries repeat), random eviction performs well.
Thread-Safe Extractors
Each Extract() call creates its own tree-sitter parser instance. Parsers are cheap to create (approximately 1 microsecond) but are NOT goroutine-safe. The tree-sitter C library uses internal state that is mutated during parsing.
The extractor struct itself (e.g., GoExtractor, PythonExtractor) is stateless: it has no mutable fields. This means many goroutines can call Extract() on the same extractor instance concurrently, because each call allocates its own parser on the stack.
Rules:
- Never share a *sitter.Parser between goroutines.
- Never store a parser as a field on an extractor struct.
- Creating a parser per-call is intentional, not wasteful.
When multiple extractors handle the same file (e.g., Go extractor + proto extractor), the first one that parses sets opts.ParsedTree; subsequent extractors reuse the same parsed tree. This reuse happens within a single goroutine (the extraction loop for one file), so there is no concurrency concern.
No Concurrency in Retrieval (But Two Key Precomputation Patterns)
The context engine (internal/context/walk.go) and all retrieval paths (ForTask, ForFiles, ForPR) are fully sequential. No goroutines run during the walk. However, two patterns eliminate the performance bottlenecks that would otherwise tempt parallelization:
Adjacency Cache (Precomputed Binary Blob)
BuildAdjacencyCache (internal/context/walk.go) serializes the entire edge set into a compact binary format and stores it in the notes table. During RWR, buildFromCache deserializes and runs BFS entirely in memory (zero DB queries during the walk itself).
Binary format: [num_edges: 4 bytes LE] followed by num_edges records of [source: 32 bytes][target: 32 bytes][type_id: 1 byte] = 65 bytes per edge.
func BuildAdjacencyCache(ctx context.Context, store types.GraphStore) error {
allEdges, _ := bulk.AllEdges(ctx)
buf := bytes.NewBuffer(make([]byte, 0, 4+len(allEdges)*65))
binary.LittleEndian.PutUint32(header, uint32(len(allEdges)))
buf.Write(header)
for _, e := range allEdges {
buf.Write(e.SourceHash[:])
buf.Write(e.TargetHash[:])
buf.WriteByte(adjEdgeTypeToID[e.EdgeType])
}
// Store base64-encoded in notes table.
}
Performance impact: on a 200K-edge graph, the fallback path (per-node EdgesFrom/EdgesTo queries during BFS) takes ~80ms. With the adjacency cache, the same walk completes in ~2ms (4717x improvement measured in benchmarks). The cache is rebuilt after each index run while the daemon's write lock is held.
Why not just keep edges in memory? The MCP server is a long-running daemon, but knowing index (CLI mode) starts fresh each time. The binary cache persists across process restarts without loading all edges on startup. It also avoids the memory cost of holding all edges in Go heap structures (the base64 blob is stored in SQLite and decoded on demand).
Double-Buffer Map Pattern (Zero-Allocation Iteration)
rwrIterate (internal/context/walk.go) uses two pre-allocated maps that swap roles each iteration:
mapA := make(map[types.Hash]float64, len(seedVec)*4)
mapB := make(map[types.Hash]float64, len(seedVec)*4)
prob := mapA
for iter := 0; iter < maxIter; iter++ {
next := mapB
for k := range next {
delete(next, k) // clear without reallocating
}
// ... compute next distribution into `next` ...
prob, next = next, prob
mapA, mapB = mapB, mapA
}
Why this matters: Go's garbage collector tracks map allocations. Creating a new map each iteration (20 iterations on a 5000-entry map) generates 20 allocations of ~40KB each, all becoming garbage simultaneously. The double-buffer pattern reuses two maps across all iterations: mapB is cleared (keys deleted, backing storage retained) and refilled. GC pressure drops to zero for the iteration loop.
Early Termination (Top-K Stability)
RWR terminates early when the top-10 ranked nodes are stable for 2 consecutive iterations, even if low-ranked nodes are still shifting. topKFromProb extracts the top-10 hashes into a fixed [10]types.Hash array that is compared by value between iterations. This typically saves 5-8 iterations on large graphs where the tail never fully converges.
Summary: Why No Goroutines
- RWR: pre-loads the adjacency map via the binary cache (one read, then in-memory BFS), then iterates over the map with double-buffered power iteration. No goroutines.
- HITS: runs on a pre-selected subset of nodes. Pure linear algebra on slices.
- Ranking and packing: sorting and knapsack on scored nodes. Pure computation.
Why not parallelize? Retrieval completes in under 100ms for typical queries (tested on graphs with 200K+ edges). The adjacency cache eliminates the DB-query bottleneck. The iteration loop is pure computation on in-memory maps, which parallelizes poorly (each iteration depends on the previous). Adding goroutines would introduce synchronization overhead for no measurable benefit.
MCP Server Request Handling
The MCP server (internal/mcp/server.go) handles multiple client requests concurrently:
Concurrent readers with serialized writes:
flowchart LR
subgraph Clients["Agent Requests"]
C1[Request 1]
C2[Request 2]
C3[Request 3]
end
subgraph Server["MCP Server"]
H1[Handler goroutine]
H2[Handler goroutine]
H3[Handler goroutine]
end
subgraph Store["SQLiteStore"]
R[WAL Readers<br/>concurrent OK]
W[Single Writer<br/>serialized]
end
C1 --> H1
C2 --> H2
C3 --> H3
H1 -->|"context_for_task<br/>RLock"| R
H2 -->|"graph_query<br/>RLock"| R
H3 -->|"feedback<br/>Lock"| W
R --> DB[(SQLite WAL)]
W --> DB
- Stdio mode: the mcp-go library multiplexes JSON-RPC requests over stdin/stdout, dispatching each to its own goroutine.
- HTTP mode: standard
net/httpserver spawns a goroutine per connection.
All request handlers share a single SQLiteStore instance. This is safe because:
- WAL mode allows unlimited concurrent readers.
- Read-only handlers (blast_radius, context_for_task, etc.) only call SELECT queries.
- Write handlers (index_repo, feedback) go through SQLite's internal serialization. If two writes overlap,
busy_timeout=5000causes the second to retry for up to 5 seconds.
Shared state between handlers:
| State | Protection | Why |
|---|---|---|
SQLiteStore |
WAL + busy_timeout | Read concurrency is native to SQLite |
nodeCache / edgeCache |
sync.Map + atomic | Lock-free reads, bounded by count |
SubgraphCache |
sync.RWMutex | Multiple readers, exclusive writes |
contextCalls / symbolsServed |
atomic.Int64 | Lock-free counter increments |
wire.Session |
Internal to package | Deduplication state for GCF format |
vecSearch (embedding index) |
Built once at startup | Read-only after background build |
Atomic counters: session-level metrics (contextCalls, symbolsServed) use atomic.Int64 for lock-free concurrent increments. These are used for the knowing://session resource that reports usage stats.
Background vector index build: at server startup, if KNOWING_EMBEDDINGS=1 is set, a goroutine builds the HNSW vector index by embedding all nodes. This goroutine accesses the store (read-only, safe via WAL) and writes to the searcher's internal index. The searcher is not used until the build completes (handlers check vecSearch != nil and the searcher internally handles partial state).
Daemon File Watcher
The daemon (internal/daemon/daemon.go) runs four concurrent goroutines coordinated by a sync.RWMutex:
Daemon goroutine architecture:
flowchart TB
subgraph Main["Main Goroutine"]
START[Start] --> MCP[MCP Server<br/>handles requests]
end
subgraph Background["Background Goroutines"]
WL[watchLoop<br/>fsnotify events]
IW[indexWorker<br/>re-index on signal]
TI[traceIngestLoop<br/>OTel traces]
end
WL -->|"indexQueue<br/>(buffered chan)"| IW
IW -->|"mu.Lock()<br/>exclusive write"| DB[(SQLite)]
MCP -->|"mu.RLock()<br/>concurrent read"| DB
TI -->|"mu.Lock()<br/>exclusive write"| DB
FS[File System] -->|fsnotify| WL
WL -->|"debounce<br/>time.AfterFunc"| WL
- watchLoop: reads
CommitEventvalues fromGitWatcherand enqueues index requests into a buffered channel. - indexWorker: drains the index queue sequentially, holding the daemon's write lock during each index run.
- MCP server: serves queries, acquiring a read lock for each request.
- traceIngestLoop (optional): runs OTLP trace ingestion with its own dedicated DB connection.
Debounce pattern: the GitWatcher monitors .git/HEAD and ref files via fsnotify. A single git operation (commit, rebase, merge) can trigger multiple file writes in rapid succession. Each write resets a per-repo time.AfterFunc timer (default 500ms). Only when the timer fires (no writes for 500ms) does the watcher emit a CommitEvent. This coalesces rapid changes into a single re-index.
p.timer = time.AfterFunc(gw.debounce, func() {
gw.handleCommitChange(repoPath)
})
Lock protocol:
- indexWorker acquires d.mu.Lock() (exclusive) during indexing. MCP queries block until indexing completes.
- MCP handlers acquire d.mu.RLock() (shared). Multiple queries run concurrently.
- After indexing completes, a background goroutine spawns for LSP enrichment. This goroutine does NOT hold the write lock, so queries resume immediately.
Why exclusive lock during indexing? Without it, a query could read a half-written graph (some files indexed, others not). The lock ensures readers always see a complete, consistent snapshot.
Index queue overflow: the indexQueue channel is buffered at 128. If the queue is full when a new commit event arrives, the event is dropped (select { case d.indexQueue <- req: default: }). This prevents unbounded memory growth if indexing falls behind rapid commits. The next commit event will trigger a fresh re-index that picks up all accumulated changes.
Community Detection Inside the Write Lock
After a successful index run, the daemon runs incremental community detection while still holding the write lock:
d.mu.Lock()
indexErr := d.cfg.IndexFunc(ctx, ...)
// Still under write lock:
if indexErr == nil && d.cfg.Store != nil {
d.runIncrementalCommunities(ctx, req.repoURL, changedPkgs)
}
d.mu.Unlock()
Why inside the lock? Community detection writes to graph_notes (via SaveAssignments). If it ran outside the lock, a concurrent MCP query reading community assignments could see partially-written state (some nodes assigned, others still with stale assignments). Running inside the lock guarantees atomicity from the readers' perspective.
Performance impact: incremental community detection (DetectIncremental) only recomputes assignments for nodes in changed packages. Unchanged nodes keep their previous assignment (frozen). For a typical edit affecting one package, this completes in under 100ms even on a 5000-node graph. The full Louvain algorithm only runs on the first index (no previous assignments exist).
Also inside the lock: Merkle diff computation, cache invalidation, and scoped FTS rebuild. These are all fast operations (map lookups and bounded writes) that must be atomic with the index to prevent readers from seeing intermediate state.
FTS Rebuild: Parallel Computation, Sequential Writes
The RebuildFTS method in SQLiteStore splits work into three phases:
// Phase 1: Read all nodes into memory (sequential, DB query).
var allRows []ftsRow
// ... scan rows ...
// Phase 2: Parallel splitForFTS computation (CPU-bound, 8 workers).
var wg sync.WaitGroup
workers := 8
chunkSize := (len(allRows) + workers - 1) / workers
for w := 0; w < workers; w++ {
wg.Add(1)
go func(s, e int) {
defer wg.Done()
for i := s; i < e; i++ {
prepped[i] = ftsPrepped{...splitForFTS(...)...}
}
}(start, end)
}
wg.Wait()
// Phase 3: Batch INSERT into SQLite (sequential, single writer).
tx, _ := s.db.BeginTx(ctx, nil)
for _, p := range prepped {
stmt.ExecContext(ctx, ...)
}
tx.Commit()
Why parallelize only Phase 2? splitForFTS splits CamelCase and snake_case identifiers into tokens. On 100K+ nodes, this string manipulation is the bottleneck (not the DB writes). Phase 1 and 3 are I/O-bound (SQLite queries and inserts) and cannot be parallelized due to single-writer constraint.
Same pre-allocated slice pattern as git blame: each worker writes to prepped[i] for indices in its assigned range. No synchronization needed because ranges do not overlap.
Common Pitfalls for Contributors
Real Bugs That Happened
These are not hypothetical warnings. Each was a bug that shipped, was discovered in production or benchmarks, and was fixed. Learn from them.
1. The FTS Background Goroutine Bug (fixed in cab5f79)
The original code ran FTS rebuild in a fire-and-forget goroutine:
// BROKEN: FTS never completes in CLI mode.
go func() {
_ = fr.RebuildFTS(context.Background())
}()
return snap, nil // process exits here, killing the goroutine
In daemon mode this worked fine (process stays alive). In CLI mode (knowing index), the process exits immediately after IndexRepo returns, killing the goroutine before FTS writes complete. Result: nodes_fts_content was always empty for CLI-indexed repos. BM25 search returned zero results.
Fix: run FTS synchronously after snapshot computation. The 200-300ms cost is acceptable because CLI mode is not latency-sensitive.
Lesson: never use fire-and-forget goroutines for work that MUST complete. If the caller can exit before the goroutine finishes, the work will be lost. Use sync.WaitGroup or run synchronously.
2. The SQLite WAL Contention Bug (fixed in 950b91b)
An earlier optimization ran FTS rebuild and snapshot computation concurrently:
// BROKEN: both compete for SQLite WAL locks.
var wg sync.WaitGroup
wg.Add(2)
go func() { defer wg.Done(); computeSnapshot() }()
go func() { defer wg.Done(); rebuildFTS() }()
wg.Wait()
Both operations are write-heavy (snapshot writes a row, FTS does thousands of INSERTs). With SQLite's single-writer constraint, they serialize anyway via busy_timeout. Worse, the contention causes both to be slower than running sequentially (retry overhead). Snapshot computation went from 50ms to 800ms.
Fix: run snapshot first (fast, small write), then FTS (slow, heavy writes). Sequential execution is faster than contention.
Lesson: parallelizing two write-heavy SQLite operations gives you the worst of both worlds: contention overhead plus no actual concurrency. Only parallelize reads (concurrent via WAL) or CPU work (no DB access).
3. The CGO Timeout Stall (watchdog pattern prevents this)
Before the watchdog pattern existed, a single pathological file (deeply nested JSON with 1000+ levels) could stall the entire pipeline for 3+ minutes. All workers were blocked waiting for resultCh to drain (the writer was waiting for this worker), and the writer was waiting for this result. The pipeline deadlocked in slow motion.
The watchdog pattern (fire-and-forget + timer) ensures no single file blocks the pipeline for more than 10 seconds.
Rules for New Code
-
Don't add goroutines to the retrieval path. It is fast enough (<100ms). Goroutines add complexity, make debugging harder, and provide no measurable speedup for in-memory computation.
-
Don't share tree-sitter parsers between goroutines. The C library uses internal mutable state. Create a new parser per
Extract()call. -
Don't write to SQLite from extraction workers. Use the
resultChchannel. The single-writer goroutine owns all database mutations during indexing. Writing from workers causes SQLITE_BUSY errors and data races. -
Don't use
context.WithTimeoutto cancel CGO calls. It does not work. Go contexts cancel by closing a channel; CGO calls cannot observe channel state. Use the watchdog pattern (fire-and-forget goroutine + timer select) instead. -
Don't hold locks across channel sends. If the channel is full (backpressure), the send blocks. If you are holding a mutex, other goroutines waiting for that mutex also block. This creates a deadlock where the channel consumer needs the lock to make progress but cannot acquire it.
-
Don't bypass the daemon's
sync.RWMutex. If you are adding a new background task that modifies the graph, it must acquire the write lock. If it only reads, it must acquire the read lock. Forgetting this causes readers to see inconsistent state. -
Don't run expensive computation inside a lock. The daemon's write lock blocks all MCP queries. Keep the critical section as short as possible: do expensive work (tree-sitter parsing, git blame) outside the lock, then acquire the lock only for the final database writes and fast post-processing (community detection, cache invalidation).
-
Don't use fire-and-forget goroutines for work that must complete. If the process might exit before the goroutine finishes, the work is silently lost. Either run synchronously, track with a
WaitGroupon the daemon'swg, or use a buffered channel that is drained on shutdown. -
Don't parallelize multiple SQLite write operations. WAL mode only helps readers. Writers still serialize. Running two write-heavy operations concurrently causes
busy_timeoutretry overhead that makes both slower than sequential execution. -
Don't write to SQLite from enrichment LSP workers directly. Use the
resultschannel (for edge upgrades) or acquiree.writeMu(for discovery). The enrichment system uses the same "parallel work, serial persistence" principle as the indexer but with different mechanisms depending on the phase. -
Don't cancel the parent context when an LSP call times out. Use
WithSymbolTimeoutwhich creates a child context. Cancelling the parent would abort all remaining work for the file or batch. Only the individual symbol that timed out should be skipped. -
Don't open all files at once in the enrichment system. gopls consumes 300-900MB+ when thousands of files are open simultaneously. Use
discoverNewEdgesBatched's pattern: open a batch, query symbols, close the batch before opening the next. Memory stays bounded at ~50 files' worth of LSP server state. -
Don't skip the writeMu in
insertEdgesFromLocations. The check-then-insert for each edge must be atomic. Without the mutex, two concurrent discover workers that find the same edge would both pass the existence check and attempt duplicate writes.
Quick Reference Table
| Subsystem | Primitives Used | Concurrency Level |
|---|---|---|
| Indexer extraction | goroutines, channels, WaitGroup | N workers (GOMAXPROCS) |
| Indexer storage | single goroutine reading from channel | 1 writer |
| CGO watchdog | goroutine, select, Timer | 1 per file (fire-and-forget) |
| Enrichment edge upgrade | semaphore (buffered chan), WaitGroup, results channel | 128 LSP workers post-warmup, 1 DB writer |
| Enrichment discovery | semaphore (buffered chan), WaitGroup, Mutex (writeMu) | 8 concurrent per batch of 50 files |
| Enrichment multi-module | semaphore (buffered chan), WaitGroup, atomic.Int64 | 1 root sequential, 4 sub-modules parallel |
| Enrichment per-symbol timeout | context.WithTimeout, goroutine, select | child context per LSP call |
| Enrichment progress | sync.Mutex, atomic file write (rename) | mutex-protected map, crash-safe persistence |
| Enrichment stats | sync/atomic.Int64 | lock-free increments from all workers |
| SQLite store | WAL mode, busy_timeout pragma | N readers, 1 writer |
| Node/edge cache | sync.Map, atomic.Int64 | lock-free reads |
| SubgraphCache | sync.RWMutex | N readers, exclusive writes |
| MCP server | goroutine per request (net/http / stdio) | unbounded readers |
| Daemon coordination | sync.RWMutex, buffered channel | write lock during index |
| File watcher | fsnotify, time.AfterFunc, Mutex | 1 event loop goroutine |
| RWR adjacency cache | binary blob precomputation, in-memory BFS | built at index time, read at query time |
| RWR iteration | double-buffer map swap, top-K stability | fully sequential (zero allocation) |
| HITS / ranking / packing | none | fully sequential |
| FTS rebuild | WaitGroup, goroutines (split phase) | 8 workers for string splitting |
| Git blame extraction | WaitGroup, channel | N workers (GOMAXPROCS) |
| Community detection | runs inside daemon write lock | sequential (incremental) |
| Trace ingestor | dedicated goroutine, own DB conn | independent pipeline |
| Vector index build | 1 background goroutine at startup | one-shot, read-only DB |