Tasks — migrate-postgres-semantic-index-to-pgvector
tasks15/17
Created Updated openspec/changes/migrate-postgres-semantic-index-to-pgvector/tasks.mdView on GitHub →
1. Spec delta
- Add the
reference-implementation-architecturerequirement for pgvector-backed semantic embedding storage on the Postgres backend (vector representation, database-side cosine distance, boot migration, unchanged retrieval semantics, JSONB fallback when the extension is absent). -
openspec validate migrate-postgres-semantic-index-to-pgvector --strict.
2. Schema + boot migration (postgres-storage.js)
- Probe the
vectorextension at bootstrap (existingCREATE EXTENSION IF NOT EXISTS vectorattempt, thenpg_extensioncheck) and cache the semantic embedding mode (jsonb|vector) plus thehnsw.iterative_scancapability, reset on init/close. - Add
migratePostgresSemanticEmbeddingToVectorin the existing boot-migration style: detect the JSONBembeddingcolumn, drop non-castable rows (logged), addembedding_vec vector, backfill in bounded batches (default 50k,PDPP_PG_SEMANTIC_MIGRATION_BATCH_SIZEoverride) with progress logging, swap columns in one transaction, set NOT NULL. - Build the partial expression HNSW index
(
(embedding::vector(384)) vector_cosine_ops WHERE vector_dims(embedding) = 384) withCREATE INDEX IF NOT EXISTS, raisingmaintenance_work_memfor the session (default 256MB,PDPP_PG_SEMANTIC_INDEX_MAINTENANCE_WORK_MEMoverride) and resetting it after. Build serially (max_parallel_maintenance_workers = 0): parallel HNSW builds need/dev/shm≥maintenance_work_memand die in default containers (could not resize shared memory segment, reproduced on pgvector/pgvector:pg16). - Thread an optional
{ log }throughinitPostgresStorage/bootstrapPostgresSchema(no-op default) and pass the startup logger fromserver/index.js.
3. Read/write path (postgres-search.js)
-
postgresSemanticIndexUpsertMany: insert$n::vectorliterals in vector mode ($n::jsonbunchanged in fallback mode); skip empty vectors in vector mode. -
postgresSemanticSearch(vector mode):ORDER BY embedding::vector(d) <=> $q::vector(d) LIMIT kwith the existingconnector_instance_id/scope_key/record_keyscoping plusvector_dims(embedding) = d; transaction-scopedSET LOCAL hnsw.ef_searchandhnsw.iterative_scan = strict_order(when supported); mapNaN → Infinity; re-sort ties in JS under the existing total order. JSONB brute-force path retained verbatim for fallback mode.
4. Tests (test/postgres-semantic-pgvector.test.js, gated on PDPP_TEST_POSTGRES_URL)
- Boot migration on a seeded legacy-JSONB-shape table (scratch schema):
column becomes
vector, row count and values preserved, batched backfill exercised (batch override), HNSW index present, NOT NULL restored. - Resume safety: a manufactured half-migrated state (partial
embedding_vecbackfill) completes on the next bootstrap. - Query parity: ordering and
distancevalues from the pgvector path match a JS brute-force replica on a small fixture (within 1e-5), including scope-key scoping andrecordKeysfiltering. - Mixed-dimension coexistence: rows of different dimensions in the shared table do not interfere (each query sees only its own dimension).
- Existing Postgres suites stay green:
postgres-runtime-storage,semantic-index-state-postgres-routing,connector-instances-acceptance,postgres-records-ingest-noop, plus SQLitesemantic-retrieval,hybrid-retrieval,lexical-retrievalsuites.
5. Quality gates
- Biome/ultracite clean on touched files.
-
openspec validate migrate-postgres-semantic-index-to-pgvector --strict.
6. Live rollout (owner)
- Schedule the boot migration for the live deployment (~1.85M rows: batched backfill minutes, HNSW build tens of minutes; server does not serve until complete; interruption-safe).
- After migration, confirm semantic search results and reclaim the dropped
JSONB column's space via routine vacuum (or
pg_repack/VACUUM FULLin a window).