ADR-0008: Wave F MergeSynthesisWorker — narrowed to residuals pending PRD #558

Status: Accepted
Date: 2026-04-17
Decision: Narrow issue #589 to hardening the existing MergeSynthesisWorker (MaxAttempts enforcement + exponential backoff + golden tests for StubSynthesizer + worker observability). Leave StubSynthesizer as the plugged implementation. Defer real two-pass Sonnet→Opus synthesis to PRD #558, which owns the LLMGateway + llm_usage_events attribution surface.
Related: PRD #549, HLD #556 v1.1 §3.5.4 / §6.2.2, LLD #560 §9, PRD #558, issues #565 (Wave E shipped the stub), #589 (this).

Context

PR #565 (Wave E, merged 2026-04-10) shipped:

  • internal/ai/router: StubRouter + Router interface, with Resolve(functionKey, orgID) → *ResolvedModel. The only place in the Wave E tree that names a model slug.
  • internal/conflict/merge_synthesis.go: MergeSynthesisWorker pool (size=4 per pod, FOR UPDATE SKIP LOCKED, 30s invoke timeout, per-row SET LOCAL app.org_id under SET LOCAL row_security = off for RLS-aware cross-tenant sweep) + StubSynthesizer that returns a deterministic skeleton merge.

The architect raised residual gaps (#589) against the merged worker. Issue #589's original acceptance list includes real LLMGateway wiring + two-pass Sonnet→Opus + confidence-based escalation + per-call llm_usage_events. This work cannot land in Wave F because:

  1. LLMGateway does not exist. It is the central deliverable of PRD #558 (Platform AI Model Configuration). Stubbing a second LLM client outside that PRD would ship dead code + contradict the master PRD NFR ("NO HARDCODED MODEL SLUGS in business code").
  2. llm_usage_events.agent_id is NOT NULL. The schema for per-call attribution was landed for agent runtime use; the MergeSynthesisWorker is not an agent invocation, so writing rows would require a schema delta that belongs to PRD #558, not a Wave F hardening PR.
  3. Confidence-based escalation requires the real model client. A stub cannot emit confidence_score; gating on it would be fiction.

Founder approved Option B on 2026-04-17 (escalation: https://github.com/upsquad-ai/upsquad-core/issues/589#issuecomment-4266999198) — narrow #589 scope, ship only the hardening residuals, defer real synthesis to PRD #558.

Decision

Ship only these residuals under issue #589:

  1. MaxAttempts enforcement (default 3). Transient failures (router resolve, synthesizer error, SetSuggestedMerge write) increment an in-memory attempt counter keyed by conflict id. On reaching the cap, the row is stamped with a permanent synthesis_max_attempts_exceeded marker so the claim scan stops re-picking it.
  2. Exponential backoff with cap. The Nth retry waits min(BaseBackoff × 2^(N-1), BaseBackoff × 10). With the default BaseBackoff=1s, the schedule is 1s → 2s → 4s over the three-attempt window and can never exceed 10s. This stays well inside the claim tick cadence, so rows don't starve the sweeper.
  3. Golden-file tests for the deterministic StubSynthesizer. Five representative conflict shapes (org-over-member tool call, platform-over-org data export, org_unit-over-member knowledge query, member-wins wildcard target, platform-default-no-loser) locked into internal/conflict/testdata/merge_golden/*.json. An injectable Clock makes generated_at deterministic; UPSQUAD_GOLDEN_UPDATE=1 regenerates.
  4. Worker observability. Three cardinality-bounded Prometheus instruments, registered via promauto so the default /metrics scrape picks them up:
    • synthesis_attempts_total{outcome}, where outcome ∈ {success, transient_failure, permanent_failure, skipped} (4 active series).
    • synthesis_duration_seconds — histogram, 1ms → ~65s buckets.
    • synthesis_failures_total{reason}, where reason ∈ {set_org_id, load_row, router_resolve, synthesizer, invalid_json, write_merge, max_attempts_exceeded, unknown} (8 active series). Total package-level cardinality ≤ 12.
  5. Seam documentation. A prominent block comment at the top of merge_synthesis.go states: "PRD #558 will replace StubSynthesizer with real LLMGateway two-pass. Do not mutate this seam without coordinating with that PRD."

What this ADR explicitly does NOT deliver

  • Real LLM call (Sonnet 4.6 default) — deferred to PRD #558.
  • Opus 4 escalation on confidence_score < 0.7 — deferred to PRD #558.
  • llm_usage_events per-call row — deferred to PRD #558.
  • LLMGateway interface — owned by PRD #558.
  • Migrations or schema changes: this ADR adds no new migrations, makes no changes to internal/ai/router/, and leaves the llm_usage_events schema untouched.

Handoff contract to PRD #558

When PRD #558 lands:

  • The MergeSynthesizer interface is the drop-in seam. PRD #558 will ship a new implementation (e.g. internal/ai/synthesis/LLMGatewayMergeSynthesizer) that the cmd/context-engine/main.go wiring swaps in place of conflict.NewStubSynthesizer().
  • The golden fixtures describe the expected output shape. PRD #558's real LLM prompt MUST produce a compatible payload (field names + types) so downstream consumers (audit exports, client admin UI) don't branch on "stub vs real".
  • The three worker metrics continue to apply; PRD #558 adds provider-specific sub-metrics alongside them rather than replacing them.
  • MaxAttempts semantics stay identical — PRD #558's two-pass (Sonnet → Opus on confidence miss) counts as a single attempt from the worker's perspective because escalation is an internal concern of the synthesizer, not the retry accountant.

Consequences

Positive

  • Wave F ships the narrow hardening without blocking on PRD #558 scope or timeline.
  • No speculative code in the tree: every exported entity has a production caller (shelfware gate).
  • Golden fixtures act as the shape contract for the PRD #558 LLM prompt engineer — less coordination ambiguity than prose.
  • MaxAttempts + backoff eliminate the current risk of a poisoned row being silently retried forever.

Negative

  • The StubSynthesizer remains the plugged implementation in production until PRD #558 ships. This is acceptable because policy_conflicts inbox consumers (admin UI, SIEM export) treat suggested_merge_json as advisory; the authoritative resolution happens in the Acknowledge/Resolve RPCs.
  • An in-memory attempt tracker loses state on pod restart. Acceptable because (a) the row-in-DB is the queue; (b) ClaimPendingSynthesis re-picks eligible rows; (c) MaxAttempts is a last-resort cap, not a correctness property.

Alternatives considered

  • Ship everything in #589's original scope now. Rejected: requires LLMGateway + llm_usage_events.agent_id schema delta + stubbed Opus escalation. Scope belongs to PRD #558.
  • Ship nothing; wait for PRD #558. Rejected: leaves the worker without retry caps (poisoned row risk) and without observability (operators can't see failure modes).
  • Ship MaxAttempts as a DB column. Rejected: adds a migration for a last-resort cap; in-memory tracker is simpler and the row-in-DB is already the queue of record.

Evidence

  • internal/conflict/merge_synthesis.go — seam comment + retry policy + attempt tracker.
  • internal/conflict/merge_synthesis_metrics.go — the three instruments with bounded cardinality.
  • internal/conflict/merge_synthesis_golden_test.go + testdata/merge_golden/*.json — 5 locked shapes.
  • internal/conflict/merge_synthesis_retry_test.go — backoff + attempt tracker unit tests.