Runbook: Context Engine Assembly Latency High

Alert: ContextEngineAssemblyP95High
Severity: critical
Owner: Backend SME (primary), DevOps (secondary)

Symptom

Assembly p95 latency has exceeded the 500 ms SLO for 5 minutes. The pipeline described in LLD #7 §2.3 has a 9-step hot path, and at least one of those steps is slow.

Open Grafana → Context Engine Overview.
Compare the p50 / p95 / p99 trendlines:
- Flat p50, spiking p95 → tail-latency issue (GC, slow tenant, DB contention).
- All three rising → systemic regression, most likely a new deploy.
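
The two patterns above amount to a small decision rule, sketched below. This is illustrative only: the function name `classify_trend` and the 1.5x "rising" factor are assumptions rather than values from this runbook, and the sketch compares only p50 and p95 for brevity.

```shell
# classify_trend P50_BASE P50_NOW P95_BASE P95_NOW   (whole milliseconds)
# "Rising" means more than 1.5x the baseline. That factor is an
# illustrative assumption, not a threshold defined by this runbook.
classify_trend() {
  local p50_base="$1" p50_now="$2" p95_base="$3" p95_now="$4"
  local p50_up=0 p95_up=0
  # Integer-only comparison: now > 1.5 * base  <=>  2 * now > 3 * base
  [ $((2 * p50_now)) -gt $((3 * p50_base)) ] && p50_up=1
  [ $((2 * p95_now)) -gt $((3 * p95_base)) ] && p95_up=1
  if [ "$p95_up" -eq 1 ] && [ "$p50_up" -eq 0 ]; then
    echo "tail-latency"      # flat p50, spiking p95
  elif [ "$p95_up" -eq 1 ] && [ "$p50_up" -eq 1 ]; then
    echo "systemic"          # everything rising together
  else
    echo "inconclusive"
  fi
}
```

For example, `classify_trend 80 85 300 700` reports a tail-latency pattern, because p95 more than doubled while p50 barely moved.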

Check the recent deploy log:

gh api repos/upsquad-ai/upsquad-core/releases --jq '.[0:3] | .[] | {name, created_at}'

Identify which step dominates by querying the per-step histograms. These are not available until the emitter wiring lands in a follow-up PR.
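
In the meantime, a p95 can be computed from any histogram that is already emitted. The sketch below builds a standard `histogram_quantile` query and sends it to the Prometheus HTTP API. It assumes the metric is a Prometheus histogram exposing `_bucket` series; `PROM_URL` and the function names `p95_query` and `run_p95` are hypothetical.

```shell
# Build a PromQL p95 query for a histogram metric, grouped by one label.
# Assumes the metric is a Prometheus histogram with <name>_bucket series.
p95_query() {
  local metric="$1" label="$2" window="${3:-5m}"
  printf 'histogram_quantile(0.95, sum by (le, %s) (rate(%s_bucket[%s])))' \
    "$label" "$metric" "$window"
}

# Run the query via the Prometheus HTTP API (/api/v1/query).
# PROM_URL is a hypothetical variable; point it at your Prometheus server.
run_p95() {
  curl -sG "${PROM_URL:?set PROM_URL}/api/v1/query" \
    --data-urlencode "query=$(p95_query "$@")"
}
```

For instance, `run_p95 context_engine_retrieval_duration_seconds search_type` would return the retrieval p95 broken down by `search_type`, provided that metric is being scraped.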

Cause	Signal	Fix
pgvector ANN search slow	`context_engine_retrieval_duration_seconds{search_type="vector"}` p95 > 200ms	Check `rag_chunks_embeddings` `ivfflat` index health; REINDEX if needed.
DB connection pool saturated	pgxpool `acquire_wait` metric > 0	Raise `MAX_DB_CONNS` in the ConfigMap, then restart the pods.
Compaction blocking assembly	Assembly latency correlates with compaction events	Verify that compaction runs asynchronously. If it does not, that is a bug: open a P0 against the Backend SME.
Confidence gate thrashing	`context_engine_retrieval_duration_seconds{search_type="expansion"}` rate > 0.3	Raise `CONFIDENCE_MIN_SCORE` in the ConfigMap.
Redis layer cache cold	`context_engine_cache_hit_ratio{cache_type="layer"}` < 0.3	Investigate why the L1–L4 layer cache was evicted.
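
For the pgvector row, the index check and rebuild can be scripted roughly as follows. This is a sketch under assumptions: it treats `rag_chunks_embeddings` as a table name, the helper names are hypothetical, and `DATABASE_URL` is a hypothetical variable holding a psql connection string. The helpers only print SQL so it can be reviewed before it is piped to `psql`.

```shell
# Print SQL that lists the indexes on a table.
# pg_indexes is a standard PostgreSQL catalog view.
list_indexes_sql() {
  printf "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '%s';\n" "$1"
}

# Print a REINDEX statement for one index. CONCURRENTLY (PostgreSQL 12+)
# rebuilds the index without blocking writes to the table.
reindex_sql() {
  printf 'REINDEX INDEX CONCURRENTLY "%s";\n' "$1"
}

# Usage, assuming DATABASE_URL (hypothetical) is a psql connection string:
#   list_indexes_sql rag_chunks_embeddings | psql "$DATABASE_URL"
#   reindex_sql <index name from the listing> | psql "$DATABASE_URL"
```

The index name is deliberately left as a placeholder; take it from the listing rather than guessing it.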

If the cause cannot be identified within 15 minutes and p95 is still climbing, switch the global engine mode to `retrieval_only` via Redis:

redis-cli -u "$REDIS_URL" set "ff:context:mode:global" "retrieval_only"

This skips steps 2, 5, and 9 of the pipeline and should bring latency back within the SLO. File a P0 issue immediately, and revert the flag once the root cause is known.
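
Reverting cleanly requires knowing what the key held before the override, which this runbook does not specify. One cautious approach, sketched below, is to record the previous value with `redis-cli ... get` before running the `set`, then generate the matching revert command from it. The helper name `revert_cmd` is hypothetical, and the sketch assumes an empty previous value means the key was unset.

```shell
FLAG_KEY="ff:context:mode:global"

# Before overriding, record the current value (empty output means unset):
#   prev="$(redis-cli -u "$REDIS_URL" get "$FLAG_KEY")"

# Print the redis-cli command that restores a previously recorded value.
# An empty previous value is treated as "key was unset", so the revert
# is a DEL rather than a SET.
revert_cmd() {
  local prev="$1"
  if [ -z "$prev" ]; then
    printf 'redis-cli -u "$REDIS_URL" del "%s"\n' "$FLAG_KEY"
  else
    printf 'redis-cli -u "$REDIS_URL" set "%s" "%s"\n' "$FLAG_KEY" "$prev"
  fi
}
```

Paste the output of `revert_cmd "$prev"` into the P0 issue when the flag is flipped, so whoever closes the incident has the exact command to run.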