Runbook: Context Engine Assembly Latency High

Alert: ContextEngineAssemblyP95High
Severity: critical
Owner: Backend SME (primary), DevOps (secondary)

Symptom

Assembly p95 latency has exceeded the 500 ms SLO for 5 minutes. The LLD #7 §2.3 pipeline has a 9-step hot path, and at least one of those steps is slow.

Triage (5 minutes)

  1. Open Grafana → Context Engine Overview.
  2. Compare the p50 / p95 / p99 trendlines:
    • Flat p50, spiking p95 → tail-latency issue (GC, slow tenant, DB contention).
    • All three rising → systemic regression, most likely a new deploy.
  3. Check the recent deploy log:
    gh api repos/upsquad-ai/upsquad-core/releases --jq '.[0:3] | .[] | {name, created_at}'
  4. Identify which step dominates by querying the per-step histograms (available once the emitter wiring lands in a follow-up PR).
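Until the per-step histograms exist, the retrieval-duration metric listed under Common causes can at least be broken down by search type. The following is a minimal sketch, not an established procedure: `PROM_URL` is a hypothetical environment variable for the Prometheus server, the function names are illustrative, and the `_bucket` suffix assumes the metric is exported as a Prometheus histogram.

```shell
#!/usr/bin/env bash
# Sketch only. PROM_URL is a hypothetical variable pointing at the Prometheus
# server; the _bucket suffix assumes the metric is a Prometheus histogram.

# Build a PromQL p95 query, grouped by search_type, for a histogram metric.
p95_by_search_type() {
  printf 'histogram_quantile(0.95, sum by (le, search_type) (rate(%s_bucket[5m])))' "$1"
}

# Run an instant query against the Prometheus HTTP API and print one line per
# series: the search_type label, a tab, then the p95 value in seconds.
query_prom() {
  curl -sG "${PROM_URL}/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[] | "\(.metric.search_type)\t\(.value[1])"'
}
```

With `PROM_URL` exported, `query_prom "$(p95_by_search_type context_engine_retrieval_duration_seconds)"` would print the current p95 per search type, which can be compared against the 200ms threshold for vector search.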

Common causes

| Cause | Signal | Fix |
|---|---|---|
| pgvector ANN search slow | context_engine_retrieval_duration_seconds{search_type="vector"} p95 > 200ms | Check rag_chunks_embeddings ivfflat index health; REINDEX if needed. |
| DB connection pool saturated | pgxpool acquire_wait metric > 0 | Raise MAX_DB_CONNS in ConfigMap, restart pods. |
| Compaction blocking assembly | Assembly latency correlates with compaction events | Verify compaction runs async; if not, it is a bug; open P0 against backend SME. |
| Confidence gate thrashing | context_engine_retrieval_duration_seconds{search_type="expansion"} rate > 0.3 | Raise CONFIDENCE_MIN_SCORE in ConfigMap. |
| Redis layer cache cold | context_engine_cache_hit_ratio{cache_type="layer"} < 0.3 | Investigate why L1–L4 cache was evicted. |
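For the pgvector row, the index check and rebuild could look roughly like the sketch below. It rests on assumptions: that `rag_chunks_embeddings` is a table name, that `DATABASE_URL` (hypothetical) is a connection string psql accepts, and that the server runs PostgreSQL 12 or newer, which is required for `REINDEX ... CONCURRENTLY`. The helper names are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch only. Assumes rag_chunks_embeddings is a table and that the server
# is PostgreSQL 12+ (needed for REINDEX ... CONCURRENTLY).

# SQL that lists every index on a table together with its definition,
# so the ivfflat index name can be read off the output.
list_indexes_sql() {
  printf "SELECT indexname, indexdef FROM pg_indexes WHERE tablename = '%s';" "$1"
}

# SQL that rebuilds a single index without blocking writes to the table.
reindex_sql() {
  printf 'REINDEX INDEX CONCURRENTLY "%s";' "$1"
}
```

Typical use would be `psql "$DATABASE_URL" -c "$(list_indexes_sql rag_chunks_embeddings)"` to find the index name, followed by `psql "$DATABASE_URL" -c "$(reindex_sql INDEX_NAME)"` with the name taken from that output. The concurrent form is chosen here because a plain REINDEX blocks writes to the table while it runs, which is usually undesirable during an incident.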

Emergency mitigation

If the cause is not identifiable within 15 minutes and p95 is climbing, flip the global engine mode to retrieval_only via Redis:

redis-cli -u "$REDIS_URL" set "ff:context:mode:global" "retrieval_only"

This skips steps 2, 5, and 9 of the pipeline and should restore latency. File a P0 issue immediately, and revert the flag once the root cause is known.
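Reverting the flag is easier if its previous value was recorded before the flip. The sketch below shows one way to do that. It is an assumption about workflow, not part of the documented procedure: the `REDIS_CLI` override, the backup file path, and the function names are hypothetical, and it assumes that deleting the key is the right way to restore a flag that was previously unset.

```shell
#!/usr/bin/env bash
# Sketch only. REDIS_CLI, the backup path, and the function names are
# hypothetical; REDIS_URL and the flag key come from the runbook.
REDIS_CLI="${REDIS_CLI:-redis-cli}"
FLAG_KEY="ff:context:mode:global"

# Print the current flag value, or "<unset>" if the key does not exist.
current_mode() {
  if [ "$("$REDIS_CLI" -u "$REDIS_URL" exists "$FLAG_KEY")" = "1" ]; then
    "$REDIS_CLI" -u "$REDIS_URL" get "$FLAG_KEY"
  else
    echo "<unset>"
  fi
}

# Record the previous value in a file, then flip to retrieval_only.
flip_to_retrieval_only() {
  current_mode > "${1:-/tmp/context_mode.prev}"
  "$REDIS_CLI" -u "$REDIS_URL" set "$FLAG_KEY" "retrieval_only"
}

# Restore the recorded value; delete the key if it was unset before the flip
# (assumption: an absent key means the engine falls back to its default mode).
restore_mode() {
  prev_mode="$(cat "${1:-/tmp/context_mode.prev}")"
  if [ "$prev_mode" = "<unset>" ]; then
    "$REDIS_CLI" -u "$REDIS_URL" del "$FLAG_KEY"
  else
    "$REDIS_CLI" -u "$REDIS_URL" set "$FLAG_KEY" "$prev_mode"
  fi
}
```

After sourcing the file, `flip_to_retrieval_only` would replace the bare `set` command during the incident, and `restore_mode` would undo it once the P0 is resolved. Running `current_mode` at any point confirms what the engine is currently reading.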