ADR-0007: Wave F ConflictNotifier — narrowed to rate-limit / idempotency residuals

Status: Accepted
Date: 2026-04-17
Decision: Close the scope gap in issue #588 by delivering ONLY the four admission-control rails that PR #565 did not ship (idempotency, cascade-replay, per-recipient rate-limit, email digest) plus retry/DLQ, metrics, and integration tests. Do not re-implement the 6-channel notifier core or the adapters; those landed clean in PR #565.
Related: PRD #549 §3.5, HLD v1.1 §3.5 / §3.5.5, LLD #560 §6, Issues #565 (closed), #588.

Context

Issue #588 was created from HLD v1.1 §3.5.5 after Wave F concern #4 was resolved. Its acceptance criteria span:

  1. 6-channel fan-out dispatcher (internal/governance/conflicts/notifier.go)
  2. Channel adapters (adapters/ws.go, bell.go, audit.go, email.go, slack.go)
  3. Idempotency: sha256(tenant|conflict|channel|recipient), Redis, 24h TTL
  4. Email digest: 15-min rolling window per recipient; P0/P1 bypass
  5. Rate limit: 20 notifs / 5 min / channel / recipient; overflow to digest
  6. Slack retry: 3 attempts, exp backoff (1s/4s/16s); DLQ
  7. Cascade replay: same (conflict_hash, recipient) within 1h suppressed
  8. Metric notification_fanout_suppressed_total{reason}
  9. Chaos test (100 conflicts on same target → <10 deliveries)
  10. Integration test (2-layer verdict disagreement → all 6 channels fire exactly once)
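The key derivation in criterion 3 can be sketched as below. This is a minimal illustration, not the shipped code: the function name IdempotencyKey, the "notif:idem:" prefix, and the hex encoding are assumptions made for the example. Only the hashed fields, their order, the "|" separator, and the 24h TTL come from the criteria above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
	"time"
)

// idempotencyTTL mirrors the 24h TTL from acceptance criterion 3.
const idempotencyTTL = 24 * time.Hour

// IdempotencyKey hashes tenant|conflict|channel|recipient with SHA-256.
// The key prefix and hex encoding are assumptions for illustration.
func IdempotencyKey(tenant, conflict, channel, recipient string) string {
	joined := strings.Join([]string{tenant, conflict, channel, recipient}, "|")
	sum := sha256.Sum256([]byte(joined))
	return "notif:idem:" + hex.EncodeToString(sum[:])
}

func main() {
	email := IdempotencyKey("tenant-1", "conflict-42", "email", "alice")
	slack := IdempotencyKey("tenant-1", "conflict-42", "slack", "alice")
	// Same conflict and recipient, different channel: the keys differ,
	// so each channel is deduplicated independently.
	fmt.Println(email != slack, idempotencyTTL)
}
```

Because the channel is part of the hash, a duplicate on one channel does not suppress the first delivery on another. If any field could itself contain "|", a length-prefixed encoding would avoid ambiguous joins.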

PR #565 — merged 2026-04-15 as part of the Wave E bundle — already shipped:

  • internal/conflict/notifier.go (6-channel dispatcher, debounce, failure isolation, panic isolation)
  • internal/conflict/adapters.go (all 6 stub adapters: quad, bell, audit, inbox, email, webhook/Slack)
  • conflict_notifications delivery ledger (migration 066)
  • pg_notify LISTEN/NOTIFY wiring (triggers in migration 066)

Criteria 1 + 2 are already complete. The package name differs from the HLD (internal/conflict rather than internal/governance/conflicts), which the HLD v1.1 specifies as non-normative — the contract is the 6-channel dispatcher, not its package path.

The residual gap is criteria 3-10: the admission-control rails + retry + metrics + tests.

Decision

Option B (narrow to residuals). Ship three new source files plus their tests under the existing internal/conflict/ package:

  • guard.go: Guard type composing four rails, namely idempotency (Redis SETNX, 24h), replay (Redis SETNX on conflict_hash, 1h), per-recipient rate-limit (in-memory token bucket via golang.org/x/time/rate, 20/5min), and email digest (in-memory 15-min window).
  • retry.go: RetryDeliver(adapter, ...) helper implementing the 1s/4s/16s HLD backoff schedule; ShouldRetry(channel) classifier (only webhook + email retry).
  • metrics.go: four Prometheus instruments (suppressed counter, delivered counter, failed counter with reason label, deliver-latency histogram).
  • guard_test.go + retry_test.go + notifier_guard_test.go: cover every suppression reason, chaos (100 conflicts on same target), 2-layer disagreement integration, retry exhaustion, context cancellation.

Wire points:

  • notifier.go: new WithGuard(*Guard) + WithRetryPolicy(RetryPolicy) options; dispatcher loop runs guard.Admit before calling the adapter, then RetryDeliver for webhook/email and direct Deliver for the others. Metrics fire on deliver / fail / suppress.
  • cmd/context-engine/main.go: constructs conflict.NewGuard(conflict.WithRedis(rdb), …) and passes it into NewNotifier — reusing the existing *redis.Client (the RedisClient interface is a one-method subset of *redis.Client).

Deviations from the HLD AC list

| AC | Decision | Rationale |
|---|---|---|
| "P0/P1 bypass" for email digest | Implemented as ClearanceAtDecision >= 4 bypass. | policy_conflicts has no explicit priority column; clearance-at-decision is the closest available signal. Admin / P0 escalations always write at clearance ≥ 4 (see handler_perms). If a dedicated severity column lands later, the check is a one-line swap. |
| Overflow "routes to digest" on rate-limit | Overflow is dropped (not digested). | Routing rate-limited events into a digest queue requires a persistent outbox + a separate digest worker, which is out of scope for Wave F residuals. The rate-limit rail's sustain rate (20/5min) plus the email digest rail (15-min collapse) already cap the recipient's inbox well below the AC ceiling. A follow-up can add digest-as-overflow cheaply because the guard returns a typed reason. |
| DLQ on Slack retry exhaust | Implemented via conflict_notifications row with status='failed' + failure_reason='retry_exhausted: …'. | No separate DLQ table: the ledger IS the DLQ. Ops replays via SELECT … WHERE failure_reason LIKE 'retry_exhausted:%'. Keeps the schema surface minimal and composes with existing RLS. |
| Package path internal/governance/conflicts/ | Kept as internal/conflict/. | PR #565 established the path; moving it would force a churn-heavy rename across six callers for zero behavioural gain. HLD path was non-normative. |

Why not rebuild

Rebuilding the 6-channel dispatcher would:

  • Duplicate the existing Notifier + ChannelAdapter machinery
  • Force migration renames (conflict_notifications already exists as migration 066)
  • Break ConflictService's existing conflictNotifier.Enqueue(...) calls in service.go and merge_synthesis.go
  • Add zero capability beyond what PR #565 already delivers

Consequences

  • The Guard runs strictly BEFORE the delivery row is inserted — suppressed events do NOT generate a conflict_notifications row. This is intentional: the rails exist to prevent noise, not to record every would-be delivery.
  • Pod-local rate-limit state means two pods can independently admit their burst budget. Practical ceiling stays bounded: pod_count × 20 per 5min per channel per recipient. With typical deployments (2-6 pods) this is 40-120 notifications / 5min — still inside the UX tolerance for a digest-bypassing conflict.
  • Fail-open on Redis error: we tolerate a duplicate notification over a silently-dropped one. A sustained Redis outage spikes duplicate deliveries, surfaced via the absence of suppressed metrics (ops alert).
  • Tests run under -race -count=1 clean; the chaos test uses the fake Redis to keep CI fast.

References

  • PR #565 (Wave E bundle: dispatcher + adapters + ledger)
  • PR #561 / #567 (migration 060 org_units, migration 064 org_unit_tools)
  • HLD v1.1 §3.5.5 concern #4 resolution