Runbook: Context Engine Cardinality Explosion

Alert: ContextEngineCardinalityExplosion
Severity: critical
Owner: DevOps

Symptom

Prometheus active series count for context_engine_* metrics has exceeded 50,000. This alert fires well before the $1K/mo cost ceiling but must be investigated within 30 minutes because cardinality growth is usually quadratic once a bad label lands.

Triage (5 minutes)

  1. Open Prometheus and run this query:

    topk(20, count by (__name__) ({__name__=~"context_engine_.*"}))

    The top result is the offending metric.

  2. Identify which label exploded:

    count(count by (org_id)({__name__="<metric>"}))
    count(count by (agent_id)({__name__="<metric>"}))
    count(count by (violation_type)({__name__="<metric>"}))

    Any label with more than 1,000 distinct values is the culprit.
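To check many labels without retyping the query, the per-label expressions above can be generated. This is a minimal Go sketch: the helper name labelCardinalityQuery and the metric name context_engine_example_total are hypothetical placeholders, and the label list simply repeats the three labels from step 2.

```go
package main

import "fmt"

// labelCardinalityQuery returns a PromQL expression that counts the distinct
// values of one label on one metric, mirroring the queries in step 2.
// The function name is hypothetical; it is not part of any existing tooling.
func labelCardinalityQuery(metric, label string) string {
	return fmt.Sprintf("count(count by (%s)({__name__=%q}))", label, metric)
}

func main() {
	// Hypothetical metric name; substitute the top result from step 1.
	metric := "context_engine_example_total"
	for _, label := range []string{"org_id", "agent_id", "violation_type"} {
		fmt.Println(labelCardinalityQuery(metric, label))
	}
}
```

Paste each printed expression into Prometheus and compare the results against the 1,000-value limit.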

Mitigation (15 minutes)

  1. Immediate: add a metricRelabelings drop rule to deployments/observability/prometheus/context-engine-servicemonitor.yaml:

    - sourceLabels: [__name__, <offending_label>]
      regex: "<metric_pattern>;.*"
      action: drop

    Because the regex ends in ".*", this rule drops every series of the matched metric until it is removed.
  2. Open a PR, self-approve for SEV-1, merge. ArgoCD applies within 3 minutes.
  3. Confirm in the Grafana Context Engine Cost dashboard that the active series count drops below 50k.
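To sanity-check a drop pattern before opening the PR, the matching behaviour of the rule in step 1 can be approximated locally. This Go sketch assumes Prometheus's documented relabeling behaviour: source label values are joined with the default ";" separator and the regex is anchored at both ends. The helper wouldDrop and the metric names are hypothetical, and the sketch does not call any Prometheus code.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// wouldDrop approximates how a relabel rule with action "drop" decides:
// the source label values are joined with ";" and the regex must match the
// whole joined string. A missing label contributes an empty value.
func wouldDrop(pattern string, sourceValues ...string) bool {
	re := regexp.MustCompile("^(?:" + pattern + ")$")
	return re.MatchString(strings.Join(sourceValues, ";"))
}

func main() {
	// Hypothetical metric names, used only for illustration.
	pattern := "context_engine_example_total;.*"
	fmt.Println(wouldDrop(pattern, "context_engine_example_total", "agent-123")) // true
	fmt.Println(wouldDrop(pattern, "context_engine_example_total", ""))          // true
	fmt.Println(wouldDrop(pattern, "context_engine_other_total", "agent-123"))   // false
}
```

The second case shows that series without the offending label are dropped as well, which is acceptable for a stopgap but is one reason the rule should not outlive the emitter fix.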

Root cause (24 hours)

  1. Open a follow-up issue against the Context Engine emitter code in internal/context/metrics/emitter.go.
  2. Reduce label cardinality at source — typically by hashing high-cardinality values into a bucket (e.g. agent_id_bucket = hash(agent_id) % 64).
  3. Remove the relabel drop rule in the same PR that ships the emitter fix.
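The bucketing idea in step 2 could look roughly like the Go sketch below. It assumes FNV-1a from the standard library as the hash (the runbook only says hash(agent_id) % 64), and the function name agentIDBucket is hypothetical; the real emitter in internal/context/metrics/emitter.go may be structured differently.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"strconv"
)

// agentBuckets caps the number of distinct agent_id_bucket label values.
const agentBuckets = 64

// agentIDBucket maps an unbounded agent_id onto one of 64 stable label
// values. FNV-1a is an assumption; any stable, well-distributed hash works.
func agentIDBucket(agentID string) string {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // Write on a hash.Hash never returns an error
	return strconv.FormatUint(uint64(h.Sum32()%agentBuckets), 10)
}

func main() {
	// The same agent_id always lands in the same bucket.
	for _, id := range []string{"agent-1", "agent-2", "agent-1"} {
		fmt.Printf("%s -> agent_id_bucket=%s\n", id, agentIDBucket(id))
	}
}
```

With this approach the label contributes at most 64 series per metric regardless of how many agents exist, at the cost of no longer being able to filter by an individual agent_id.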

Non-goals

Do NOT temporarily raise the alert threshold above 50k. If the threshold needs adjusting, that is an architecture decision and must go through principal-architect review.