Runbook: Context Engine Cardinality Explosion
Alert: ContextEngineCardinalityExplosion
Severity: critical
Owner: DevOps
Symptom
Prometheus active series count for context_engine_* metrics has exceeded
50,000. This alert fires well before the $1K/mo cost ceiling but must be
investigated within 30 minutes because cardinality growth is usually
quadratic once a bad label lands.
Triage (5 minutes)
- Open Prometheus and run:
    topk(20, count by (__name__) ({__name__=~"context_engine_.*"}))
  The top result is the offending metric.
- Identify which label exploded:
    count(count by (org_id) ({__name__="<metric>"}))
    count(count by (agent_id) ({__name__="<metric>"}))
    count(count by (violation_type) ({__name__="<metric>"}))
  Any label with more than 1,000 distinct values is the culprit.
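For intuition, the nested count(count by (<label>) (...)) queries above compute the number of distinct values a label takes across the metric's series. A minimal Go sketch of the same check follows; the Series type, the function names, and the sample data are hypothetical and not part of the Context Engine code.

```go
package main

import (
	"fmt"
	"sort"
)

// Series is one time series, represented by its label set (hypothetical type).
type Series map[string]string

// distinctValues counts, for every label name, how many distinct values
// appear across the given series. This mirrors count(count by (label) (...)).
func distinctValues(series []Series) map[string]int {
	sets := map[string]map[string]struct{}{}
	for _, s := range series {
		for name, value := range s {
			if sets[name] == nil {
				sets[name] = map[string]struct{}{}
			}
			sets[name][value] = struct{}{}
		}
	}
	counts := make(map[string]int, len(sets))
	for name, values := range sets {
		counts[name] = len(values)
	}
	return counts
}

// culprits returns the label names with more than limit distinct values,
// sorted alphabetically.
func culprits(counts map[string]int, limit int) []string {
	var out []string
	for name, n := range counts {
		if n > limit {
			out = append(out, name)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	// Synthetic data: 1,500 agents spread over 3 orgs, one violation type.
	var series []Series
	for i := 0; i < 1500; i++ {
		series = append(series, Series{
			"org_id":         fmt.Sprintf("org-%d", i%3),
			"agent_id":       fmt.Sprintf("agent-%d", i),
			"violation_type": "rate_limit",
		})
	}
	fmt.Println(culprits(distinctValues(series), 1000)) // prints [agent_id]
}
```

Only agent_id crosses the 1,000-value threshold here, which is the same signal the three PromQL queries give during triage.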
Mitigation (15 minutes)
- Immediate: add a metricRelabelings drop rule to
  deployments/observability/prometheus/context-engine-servicemonitor.yaml:
    - sourceLabels: [__name__, <offending_label>]
      regex: "<metric_pattern>;.*"
      action: drop
- Open a PR, self-approve for SEV-1, merge. ArgoCD applies within 3 minutes.
- Confirm the active series count drops below 50k in the Grafana Context Engine Cost dashboard.
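To see why the regex in the drop rule has the form "<metric_pattern>;.*": Prometheus relabeling joins the values of the source labels with ";" (the default separator) and matches the result against a fully anchored regex. The Go sketch below approximates that evaluation so a pattern can be sanity-checked before the PR is opened. It is a simplification rather than Prometheus's own code, and the metric names and the shouldDrop helper are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// shouldDrop approximates how a drop relabel rule is evaluated: the values of
// the source labels are joined with ";" and the whole string must match the
// pattern. A label that is absent contributes an empty string.
func shouldDrop(labels map[string]string, sourceLabels []string, pattern string) bool {
	values := make([]string, len(sourceLabels))
	for i, name := range sourceLabels {
		values[i] = labels[name]
	}
	// Anchor the pattern on both ends, as relabel regexes are anchored.
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(strings.Join(values, ";"))
}

func main() {
	source := []string{"__name__", "agent_id"}
	pattern := "context_engine_violations_total;.*" // hypothetical metric name

	exploded := map[string]string{"__name__": "context_engine_violations_total", "agent_id": "a-123"}
	healthy := map[string]string{"__name__": "context_engine_requests_total", "agent_id": "a-123"}

	fmt.Println(shouldDrop(exploded, source, pattern)) // true
	fmt.Println(shouldDrop(healthy, source, pattern))  // false
}
```

Because the trailing ".*" accepts any value of the offending label, the rule drops every series of the matched metric, not only the high-cardinality ones. Other context_engine_* metrics are left untouched.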
Root cause (24 hours)
- Open a follow-up issue against the Context Engine emitter code in
  internal/context/metrics/emitter.go.
- Reduce label cardinality at source, typically by hashing high-cardinality
  values into a bucket (e.g. agent_id_bucket = hash(agent_id) % 64).
- Remove the relabel drop rule in the same PR that ships the emitter fix.
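The bucketing idea above (agent_id_bucket = hash(agent_id) % 64) could look roughly like the following in Go. This is a sketch under assumptions: the runbook does not say which hash function to use, so FNV-1a from the standard library is picked here for illustration, and agentIDBucket is a hypothetical helper, not a function known to exist in emitter.go.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// numBuckets matches the "% 64" example in the runbook.
const numBuckets = 64

// agentIDBucket maps an arbitrary agent ID to one of numBuckets stable label
// values, so the resulting label can never have more than numBuckets values.
// FNV-1a is an assumption; any stable, well-distributed hash would do.
func agentIDBucket(agentID string) string {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // Write on a hash.Hash never returns an error
	return fmt.Sprintf("%02d", h.Sum32()%numBuckets)
}

func main() {
	// Feed many distinct IDs through the bucketing and count label values.
	seen := map[string]struct{}{}
	for i := 0; i < 100000; i++ {
		seen[agentIDBucket(fmt.Sprintf("agent-%d", i))] = struct{}{}
	}
	fmt.Println("distinct label values:", len(seen)) // never more than 64
}
```

The hash must be deterministic across processes and restarts so that a given agent always lands in the same bucket. The trade-off is that per-agent breakdowns are no longer available from this metric; only per-bucket aggregates remain.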
Non-goals
Do NOT temporarily raise the alert threshold above 50k. If the threshold needs adjusting, that is an architecture decision and must go through principal-architect review.