Proposal: migrate-postgres-semantic-index-to-pgvector

Status: in progress. Tasks: 15/17.

openspec/changes/migrate-postgres-semantic-index-to-pgvector/

Artifacts

Official change artifacts tracked under openspec/.

The Postgres semantic-search path stores embeddings in semanticsearchblob.embedding as JSONB: 384-dimensional float arrays at roughly 4.8 KB/row, versus roughly 1.5 KB as a pgvector vector. It answers queries by SELECTing candidate rows and brute-force cosine-scoring them in JavaScript (postgresSemanticSearch in postgres-search.js). The live deployment already runs the pgvector/pgvector:pg16 image, so the vector extension is available but unused. At the live table size (~1.85M rows / ~10 GB), the JSONB representation takes roughly 3× the storage a vector column would, and the brute-force read path ships every candidate embedding over the wire to be scored in JS. Worse, the candidate SELECT carries a bare LIMIT with no ordering, so on scopes larger than the per-connector overscan the JS pass scores an arbitrary subset of candidates rather than the true nearest neighbors.
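
The storage figures above can be checked with a little arithmetic. The sketch below assumes pgvector's documented layout of 4 bytes per dimension plus 8 bytes of overhead, and treats "roughly 4.8 KB" as 4800 bytes. It counts the embedding column only, not the other columns or indexes that make up the ~10 GB table.

```javascript
// Figures quoted in the proposal (all approximate).
const DIMENSIONS = 384;
const LIVE_ROWS = 1_850_000;
const JSONB_BYTES_PER_ROW = 4800; // "roughly 4.8 KB/row", rounded for illustration

// Assumption: a pgvector value occupies 4 bytes per dimension plus 8 bytes of overhead.
function pgvectorBytesPerRow(dimensions) {
  return 4 * dimensions + 8;
}

const vectorBytes = pgvectorBytesPerRow(DIMENSIONS);
const ratio = JSONB_BYTES_PER_ROW / vectorBytes;
const toGB = (bytes) => (bytes / 1e9).toFixed(2);

console.log(vectorBytes);                            // 1544
console.log(ratio.toFixed(1));                       // 3.1
console.log(toGB(LIVE_ROWS * JSONB_BYTES_PER_ROW));  // 8.88
console.log(toGB(LIVE_ROWS * vectorBytes));          // 2.86
```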

Design

semanticsearchblob on the Postgres backend stores one embedding per (connectorinstanceid, scopekey, recordkey) as a JSONB float array. postgresSemanticSearch SELECTs candidate rows (bare LIMIT, no ordering), parses each JSONB array, and computes cosine distance in JS, then sorts and slices. The live deployment runs pgvector/pgvector:pg16 (extension available, unused) with ~1.85M rows / ~10 GB in this table.
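
The read path described above, and the way an unordered LIMIT can hide the true nearest neighbor, can be sketched as follows. This is an illustration only, not the code in postgres-search.js: the function names, the row shape (recordKey plus a JSON-encoded embedding), and the toy 2-dimensional vectors are all assumptions.

```javascript
// Cosine distance = 1 - cosine similarity. Zero vectors are treated as
// maximally dissimilar (distance 1); that choice is an assumption of this sketch.
function cosineDistance(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1;
  return 1 - dot / Math.sqrt(normA * normB);
}

// Parse each candidate's JSON array, score it, sort ascending, keep the top k.
function bruteForceTopK(candidateRows, queryEmbedding, k) {
  return candidateRows
    .map((row) => ({
      recordKey: row.recordKey,
      distance: cosineDistance(JSON.parse(row.embedding), queryEmbedding),
    }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Hypothetical rows, in whatever order the database happens to return them.
const storedRows = [
  { recordKey: 'a', embedding: '[1, 0]' },
  { recordKey: 'c', embedding: '[0.9, 0.1]' },
  { recordKey: 'b', embedding: '[0, 1]' },
];
const queryEmbedding = [0, 1];

// A bare "LIMIT 2" with no ORDER BY may return only the first two rows.
const limitedCandidates = storedRows.slice(0, 2);

console.log(bruteForceTopK(limitedCandidates, queryEmbedding, 1)[0].recordKey); // c
console.log(bruteForceTopK(storedRows, queryEmbedding, 1)[0].recordKey);        // b
```

With the candidate set truncated before scoring, the exact match 'b' is never seen, so the best result returned is 'c'.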

Spec Deltas (1)

Affected capabilities

Capability specs this change proposes to modify.

Reference Implementation Architecture

The reference SHALL, on a Postgres-backed deployment where the vector (pgvector) extension is available, persist semantic-index embeddings in semanticsearchblob.embedding as pgvector vector values rather than JSONB float arrays, and SHALL answer semantic index queries with the database's cosine-distance operator (embedding <=> query ORDER BY … LIMIT k) supported by an HNSW index over the production embedding dimensionality, rather than fetching candidate embeddings and computing distances in process.
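
A minimal sketch of what the requirement above could look like from JavaScript, assuming a node-postgres-style client that accepts a { text, values } query object. The function names, the index name, and the scope filter are hypothetical; table and column names are written as they appear in this proposal, and the real schema may differ.

```javascript
// Assumption: DDL of this shape enables the extension and builds an HNSW index
// for cosine distance. The index name is hypothetical.
const ENABLE_EXTENSION_SQL = 'CREATE EXTENSION IF NOT EXISTS vector';
const CREATE_HNSW_INDEX_SQL =
  'CREATE INDEX IF NOT EXISTS semanticsearchblob_embedding_hnsw ' +
  'ON semanticsearchblob USING hnsw (embedding vector_cosine_ops)';

// Serialize a JS number array into pgvector's text form, e.g. "[0.1,0.2,0.3]".
function toVectorLiteral(embedding) {
  if (!Array.isArray(embedding) || embedding.length === 0) {
    throw new TypeError('embedding must be a non-empty array');
  }
  for (const value of embedding) {
    if (!Number.isFinite(value)) {
      throw new TypeError('embedding values must be finite numbers');
    }
  }
  return `[${embedding.join(',')}]`;
}

// Build a parameterized k-nearest-neighbor query. The database computes the
// cosine distance (<=>) and orders by it, so only k rows cross the wire.
function buildNearestNeighborQuery(connectorInstanceId, scopeKey, embedding, k) {
  return {
    text: [
      'SELECT recordkey, embedding <=> $3::vector AS distance',
      'FROM semanticsearchblob',
      'WHERE connectorinstanceid = $1 AND scopekey = $2',
      'ORDER BY embedding <=> $3::vector',
      'LIMIT $4',
    ].join('\n'),
    values: [connectorInstanceId, scopeKey, toVectorLiteral(embedding), k],
  };
}

const exampleQuery = buildNearestNeighborQuery('connector-1', 'scope-1', [0.1, 0.2, 0.3], 10);
console.log(exampleQuery.values[2]); // [0.1,0.2,0.3]
```

The resulting object could be passed to something like client.query(exampleQuery). Ordering by the distance expression itself, rather than scoring in process, is what lets the HNSW index serve the query.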

reference-implementation-architecture