Runbook: Context Engine Assembly Latency High
Alert: ContextEngineAssemblyP95High
Severity: critical
Owner: Backend SME (primary), DevOps (secondary)
Symptom
Assembly p95 latency has exceeded the 500 ms SLO for 5 minutes. The assembly pipeline (LLD #7 §2.3) has a 9-step hot path, and at least one of those steps is slow.
Triage (5 minutes)
- Open Grafana → Context Engine Overview.
- Compare the p50 / p95 / p99 trendlines:
  - Flat p50, spiking p95 → tail-latency issue (GC, slow tenant, DB contention).
  - All three rising → systemic regression, most likely a new deploy.
- Check the recent deploy log:
gh api repos/upsquad-ai/upsquad-core/releases --jq '.[0:3] | .[] | {name, created_at}'
- Identify which step dominates by querying the per-step histograms. These are not available until the emitter wiring lands in a follow-up PR; until then, rely on the signals listed under Common causes.
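The trendline comparison in the triage steps can be written down as a small decision rule. The sketch below is illustrative only: the function name `classify_trend`, the use of growth ratios (current value divided by a pre-incident baseline), and the default threshold of 1.5 are all assumptions, not values defined by this runbook.

```shell
# classify_trend R50 R95 R99 [THRESHOLD]
# Each R is a growth ratio: current percentile / baseline percentile.
# THRESHOLD (default 1.5) is a hypothetical cut-off for "rising"; tune it
# to your dashboards.
classify_trend() {
  awk -v r50="$1" -v r95="$2" -v r99="$3" -v t="${4:-1.5}" 'BEGIN {
    up50 = (r50 > t); up95 = (r95 > t); up99 = (r99 > t)
    # All three percentiles rising: systemic regression, check recent deploys.
    if (up50 && up95 && up99) print "systemic"
    # p50 flat while p95 spikes: tail latency (GC, slow tenant, DB contention).
    else if (!up50 && up95)   print "tail"
    else                      print "inconclusive"
  }'
}
```

For example, `classify_trend 1.05 2.4 3.0` corresponds to a flat p50 with a spiking tail, while `classify_trend 2.0 2.1 2.5` corresponds to all three trendlines rising.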
Common causes
| Cause | Signal | Fix |
|---|---|---|
| pgvector ANN search slow | context_engine_retrieval_duration_seconds{search_type="vector"} p95 > 200ms | Check rag_chunks_embeddings ivfflat index health; REINDEX if needed. |
| DB connection pool saturated | pgxpool acquire_wait metric > 0 | Raise MAX_DB_CONNS in ConfigMap, restart pods. |
| Compaction blocking assembly | Assembly latency correlates with compaction events | Verify that compaction runs asynchronously; if it does not, that is a bug: open a P0 against the Backend SME. |
| Confidence gate thrashing | context_engine_retrieval_duration_seconds{search_type="expansion"} rate > 0.3 | Raise CONFIDENCE_MIN_SCORE in ConfigMap. |
| Redis layer cache cold | context_engine_cache_hit_ratio{cache_type="layer"} < 0.3 | Investigate why L1–L4 cache was evicted. |
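The p95 signals in the table can be checked from a terminal as well as from Grafana. The following is a minimal sketch that assumes `context_engine_retrieval_duration_seconds` is exported as a Prometheus histogram (so a `_bucket` series exists) and that a Prometheus server is reachable at `$PROM_URL`; both are assumptions, as are the function names and the 5-minute window.

```shell
# Build the PromQL expression for the p95 of one search_type.
# Assumes a Prometheus histogram, i.e. a *_bucket series with an "le" label.
retrieval_p95_query() {
  local search_type="$1"
  printf 'histogram_quantile(0.95, sum by (le) (rate(context_engine_retrieval_duration_seconds_bucket{search_type="%s"}[5m])))' "$search_type"
}

# Run the query through the Prometheus HTTP API and print the value in seconds.
# PROM_URL is a hypothetical environment variable, e.g. http://prometheus:9090
retrieval_p95() {
  curl -sG "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(retrieval_p95_query "$1")" \
    | jq -r '.data.result[0].value[1]'
}
```

Calling `retrieval_p95 vector` and comparing the result against 0.2 (200 ms) mirrors the first row of the table.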
Emergency mitigation
If the cause is not identifiable within 15 minutes and p95 is climbing,
flip the global engine mode to retrieval_only via Redis:
redis-cli -u "$REDIS_URL" set "ff:context:mode:global" "retrieval_only"
This skips steps 2, 5, and 9 of the pipeline and should restore latency. File a P0 issue immediately, and revert the flag once the root cause is known.
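Because the flag has to be reverted later, it helps to record its previous value before overwriting it. The sketch below wraps the command above with a read-before-write and a verification step. The function names are hypothetical, and whether writing the recorded value back is the correct way to revert is an assumption to confirm with the Backend SME; the case where the key was previously unset is deliberately left to the operator.

```shell
REDIS_CLI="${REDIS_CLI:-redis-cli}"   # overridable so the flow can be dry-run
FLAG_KEY="ff:context:mode:global"

# Record the current mode, switch to retrieval_only, then verify the write.
flip_retrieval_only() {
  local prev
  prev="$("$REDIS_CLI" -u "$REDIS_URL" get "$FLAG_KEY")"
  echo "previous mode: ${prev:-<unset>}"   # paste this into the P0 issue
  "$REDIS_CLI" -u "$REDIS_URL" set "$FLAG_KEY" "retrieval_only" >/dev/null
  [ "$("$REDIS_CLI" -u "$REDIS_URL" get "$FLAG_KEY")" = "retrieval_only" ]
}

# Write a previously recorded mode back (assumed revert procedure).
restore_mode() {
  "$REDIS_CLI" -u "$REDIS_URL" set "$FLAG_KEY" "$1" >/dev/null
}
```

`flip_retrieval_only` returns a non-zero status if the read-back does not match, which is a cue to check `$REDIS_URL` before assuming the mitigation is active.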