HLD: Agent Runtime Wave 2 — Approval Chain Engine + Pause/Resume API
| Field | Value |
|---|---|
| Status | Draft — awaiting founder approval |
| Version | 1.0 |
| Date | 2026-04-13 |
| Author | Principal Technical Architect Agent |
| Scope | PRD #380 Wave 2 — P4.8.3 Approval Chain Engine + P4.8.4 Pause/Resume API |
| PRD | #380 (approved) |
| Parent PRD | #1 |
| Tracker | #381 |
| Milestone | 9 |
| Wave 1 sign-off | https://github.com/upsquad-ai/upsquad-core/issues/381#issuecomment-4247176283 |
| Precedent HLD style | docs/hld/agent-runtime-wave-1-agent-isolation.md |
| Related | docs/hld/agent-runtime-wave2-delta.md (checkpoint persistence primitive) |
1. Why one HLD for two items
P4.8.3 (Approval Chain Engine) and P4.8.4 (Pause/Resume API) are inseparable. Pause/Resume is the runtime primitive; Approval Chain is the first — and for now, only — consumer of that primitive beyond direct operator override. Designing them separately risks an abstraction mismatch where the approval flow leaks runtime concerns or the runtime primitive bakes in approval-specific assumptions.
Architecturally this document specifies:
- A runtime-level PauseSession/ResumeSession RPC pair that suspends a session at a clean message boundary and re-hydrates it from the existing Wave 2-delta checkpoint table, exposing a typed session_paused/session_resumed event on the portal stream.
- An ApprovalService layered on top that activates the P4.3 requirement set: 5-level policy hierarchy, multi-channel recording (dashboard / email / Slack / SCM webhook), timeout + escalation, delegation with clearance check, dedup, and hash-chained audit.
Anything in this HLD that conflicts with the parent PRD or the Wave 2-delta HLD is a bug in this document, not an override.
2. Current state — what exists on main today
Confirmed by direct source inspection (2026-04-13).
| # | Fact | Evidence |
|---|---|---|
| 1 | AGENT_STATUS_SUSPENDED = 5 is already defined in the proto but never emitted | proto/upsquad/runtime/v1/runtime.proto:154 |
| 2 | governance_approvals table already exists with columns (id, org_id, team_id, agent_id, action_type, target, status, requested_at, resolved_at, resolved_by, reason, metadata, expires_at) and RLS policy | internal/context/store/migrations/028_governance_policies.up.sql:52-78 |
| 3 | GovernanceService.Check already returns verdict requires_approval with an approval_id, and ResolveApproval RPC polls until resolved | proto/upsquad/governance/v1/governance.proto:19-33 |
| 4 | MCP middleware blocks in-process for ApprovalTimeout = 5 * time.Minute inside the tool call path when verdict is requires_approval | internal/mcp/middleware/middleware.go:81-176 |
| 5 | Checkpoint persistence (agent_checkpoints table + sweeper-based resume) shipped in the Wave 2-delta and is the state primitive pause/resume will reuse | docs/hld/agent-runtime-wave2-delta.md §5, internal/runtime/session/sweeper.go, internal/runtime/checkpointstore/ |
| 6 | Hash-chained audit (wave-1 item 4) is live; every approval lifecycle event will slot into the existing chain | internal/context/store/migrations/036_audit_log_hash_chain.up.sql, internal/runtime/audit/hashchain.go |
Two things are broken-as-designed and this HLD replaces them:
- B-1. 5-minute in-process block. middleware.go:167-174 holds the MCP request goroutine while polling ResolveApproval. This does not survive orchestrator restarts, wastes compute, caps us at ~5 min approvals even though the PRD calls for 24 h and 72 h defaults, and makes the caller observe a timeout rather than a paused session. Replace with pause/resume.
- B-2. governance_approvals.expires_at default of 1 hour and no escalation. Must become policy-driven (24 h dev, 72 h critical-path) with scheduled escalation to the next hierarchy level on timeout.
3. Architecture overview
3.1 Component diagram
Portal / Operator UI
│
│ gRPC (existing)
▼
┌──────────────────────────────┐
│ LifecycleService (Go) │◄──── PauseSession / ResumeSession (NEW)
│ internal/runtime/server │ — operator-initiated
└───────┬──────────────────────┘
│ session state transitions
▼
┌──────────────────────────────┐
│ session.Manager │
│ + pause/resume transitions │
│ + checkpoint hydrate │
└───────┬──────────────────────┘
│
┌───────────────┼────────────────────────────┐
▼ ▼ ▼
┌────────────────┐ ┌─────────────────┐ ┌──────────────────────────┐
│ checkpointstore│ │ streaming.Pub │ │ Redis sorted-set │
│ (pg) │ │ (session_paused │ │ ZSET approval_deadlines │
│ │ │ / _resumed) │ │ score=unix deadline │
└────────────────┘ └─────────────────┘ └──────────┬───────────────┘
│
▼
┌──────────────────────┐
│ ApprovalScheduler │
│ (orch goroutine) │ NEW
│ timeout + escalation │
└──────────┬───────────┘
│
┌───────────────────────────────┼────────────────────┐
▼ ▼ ▼
┌────────────────────┐ ┌──────────────────────┐ ┌─────────────────┐
│ ApprovalService │ │ PolicyResolver │ │ ChannelAdapter │
│ (gRPC, NEW) │◄─────┤ 5-level hierarchy │ │ Registry (NEW) │
│ - Request │ │ platform→tenant→ │ │ - Dashboard │
│ - RecordDecision │ │ parent→sub→per-req │ │ - Email (HMAC) │
│ - List / Get │ │ lookup │ │ - Slack │
│ - Delegate │ │ │ │ - SCM │
└─────────┬──────────┘ └──────────────────────┘ └───────┬─────────┘
│ │
│ pgx (governance_approvals + NEW tables) │ webhooks in
▼ ▼
┌─────────────────────┐ ┌─────────────────────┐
│ Postgres │ │ ChannelWebhookGW │
│ governance_approvals│ │ (HTTP handler) │
│ approval_policies │ NEW │ HMAC verify, dedup │
│ approval_delegations│ NEW │ → RecordDecision │
│ approval_events │ NEW (lifecycle log) └─────────────────────┘
└─────────────────────┘
│
▼
┌─────────────────────┐
│ agent_audit_log │ (existing, hash-chained)
│ every pause/resume/ │
│ request/decision │
└─────────────────────┘
3.2 Two primary data flows
Flow A — operator-initiated pause. Operator clicks Pause in the portal (future UI; Wave 2 exposes the API only, see §14).
OperatorClient → LifecycleService.PauseSession(session_id, reason)
→ session.Manager.RequestPause(session_id) // sets pause_requested=true
→ worker receives pause signal via control-plane header on next event boundary
→ worker flushes final ExecuteStepEvent for the in-flight step, emits
ExecuteStepEvent{paused=PausedEvent{reason}}
→ orchestrator writes checkpoint, transitions session → SUSPENDED
→ emit AgentEvent{session_paused=SessionPausedEvent{reason, checkpoint_key}}
→ audit entry: session_paused
→ RPC returns
Flow B — approval-initiated pause. MCP tool call hits governance Check, verdict is requires_approval.
agent-worker → MCP middleware → GovernanceService.Check
→ verdict=requires_approval, approval_id=<uuid>
→ middleware emits a PAUSE_REQUIRED control event back through ExecuteStep stream
→ orchestrator runs the same pause path as Flow A (shared code), but records
approval_id on the session row
→ ApprovalService.RequestApproval(approval_id, session_id, tool, args_hash, policy_id)
→ ChannelAdapter.Dispatch → email + slack + SCM comment (per policy template)
→ ApprovalScheduler schedules deadline in Redis ZSET score=expires_at
→ session stays SUSPENDED until RecordDecision
On RecordDecision(approval_id, APPROVED, operator, reason):
ApprovalService validates operator clearance (incl. delegation chain)
→ update governance_approvals.status = 'approved'
→ audit: approval_decision (hash-chained)
→ call LifecycleService.ResumeSession(session_id, operator_input={approval_metadata})
→ session.Manager loads checkpoint, transitions → ACTIVE
→ worker re-dialled, approval metadata injected as system-tagged context message
→ emit AgentEvent{session_resumed=SessionResumedEvent{resumed_by_approval=true}}
On RecordDecision(approval_id, DENIED, operator, reason):
→ update governance_approvals.status = 'denied'
→ audit: approval_decision denied
→ LifecycleService.TerminateSession(session_id, reason="approval denied: …")
→ emit AgentEvent{error=ErrorEvent{reason=approval_denied}}
→ session → TERMINATED
4. RPC surface
4.1 LifecycleService additions (runtime v1)
// proto/upsquad/runtime/v1/lifecycle.proto
service LifecycleService {
// ...existing RPCs unchanged...
// PauseSession requests that an active session transition to SUSPENDED
// at the next clean message boundary. Idempotent: pausing an already-
// paused session returns success with status=SUSPENDED.
rpc PauseSession(PauseSessionRequest) returns (PauseSessionResponse);
// ResumeSession re-hydrates a SUSPENDED session from its latest
// checkpoint and transitions it back to ACTIVE. operator_input is
// optional; when set, it is injected into the session context as a
// system-tagged message before the loop continues.
//
// Resuming a non-SUSPENDED session returns FailedPrecondition.
rpc ResumeSession(ResumeSessionRequest) returns (ResumeSessionResponse);
}
message PauseSessionRequest {
string tenant_id = 1; // from JWT, gateway-injected
string session_id = 2;
// reason is a human-readable label; recorded in audit.
string reason = 3;
// pause_source enumerates who triggered the pause. Approval-engine
// callers set APPROVAL; operator UI sets OPERATOR.
PauseSource pause_source = 4;
// correlation_id optionally links this pause to an approval_id or a
// policy event for cross-system audit correlation.
string correlation_id = 5;
}
enum PauseSource {
PAUSE_SOURCE_UNSPECIFIED = 0;
PAUSE_SOURCE_OPERATOR = 1;
PAUSE_SOURCE_APPROVAL = 2;
PAUSE_SOURCE_POLICY = 3; // future: automated policy gate
}
message PauseSessionResponse {
AgentStatus status = 1; // SUSPENDED on success
string checkpoint_key = 2; // pointer into agent_checkpoints
google.protobuf.Timestamp paused_at = 3;
bool was_already_suspended = 4; // idempotency signal
}
message ResumeSessionRequest {
string tenant_id = 1;
string session_id = 2;
// operator_input is optional structured input injected into the context
// as if from a human message. JSON-encoded; schema negotiated per
// pause_source. For approval resumes this carries:
// { "approval_id": "...", "decision": "approved",
// "operator_id": "...", "reason": "...", "delegated_from": "..." }
bytes operator_input = 3;
string resume_reason = 4;
}
message ResumeSessionResponse {
AgentStatus status = 1; // ACTIVE on success
int32 resumed_at_loop = 2;
google.protobuf.Timestamp resumed_at = 3;
}
Two new event variants extend AgentEvent.oneof event:
message AgentEvent {
oneof event {
TokenEvent token = 1;
StatusEvent status = 2;
CompletionEvent completion = 3;
ErrorEvent error = 4;
SessionPausedEvent session_paused = 5; // NEW
SessionResumedEvent session_resumed = 6; // NEW
}
}
message SessionPausedEvent {
string reason = 1;
PauseSource pause_source = 2;
string correlation_id = 3; // approval_id when source=APPROVAL
string checkpoint_key = 4;
google.protobuf.Timestamp paused_at = 5;
}
message SessionResumedEvent {
bool resumed_by_approval = 1;
string approval_id = 2; // empty when not approval-triggered
int32 resumed_at_loop = 3;
google.protobuf.Timestamp resumed_at = 4;
}
4.2 ApprovalService (governance v1, new file)
New proto file: proto/upsquad/governance/v1/approval.proto. GovernanceService is left untouched; ResolveApproval stays as a polling convenience but is deprecated in place.
syntax = "proto3";
package upsquad.governance.v1;
service ApprovalService {
// RequestApproval is called from the runtime (MCP middleware / worker
// control plane) when a governance Check returns requires_approval.
// Idempotent on (session_id, tool_name, args_sha256) — returns the
// existing approval_id if one is already pending.
rpc RequestApproval(RequestApprovalRequest) returns (RequestApprovalResponse);
// RecordDecision is the single converged entry point for all channels
// (dashboard, email webhook, Slack webhook, SCM webhook). Channel
// adapters translate their inbound payload to this RPC.
rpc RecordDecision(RecordDecisionRequest) returns (RecordDecisionResponse);
// Delegate transfers approval authority to another member. Delegatee
// MUST have clearance >= the approval's required_clearance. The
// original approver retains audit-of-record for the delegation.
rpc Delegate(DelegateRequest) returns (DelegateResponse);
// Get / List are read surfaces for the operator UI (frontend deferred,
// but APIs shipped in Wave 2 so dashboard work can start in parallel).
rpc GetApproval(GetApprovalRequest) returns (Approval);
rpc ListApprovals(ListApprovalsRequest) returns (ListApprovalsResponse);
}
message RequestApprovalRequest {
string org_id = 1;
string team_id = 2;
string session_id = 3;
string agent_id = 4;
string member_id = 5; // the agent's effective member identity
int32 clearance = 6;
string action_type = 7; // e.g. "tool_call"
string tool_name = 8;
string target = 9;
bytes args = 10; // the full tool args (for context)
string args_sha256 = 11; // used for dedup
string policy_id = 12; // policy that produced requires_approval
int32 required_clearance = 13;
string template = 14; // "dev_only" | "dev_review" | "full_pipeline" | "critical_path"
google.protobuf.Timestamp deadline = 15; // scheduler-enforced
repeated string channels = 16; // e.g. ["dashboard","email","slack","scm"]
map<string,string> metadata = 17;
}
message RequestApprovalResponse {
string approval_id = 1;
bool was_deduplicated = 2; // true when an open approval matched
google.protobuf.Timestamp deadline = 3;
}
message RecordDecisionRequest {
string org_id = 1;
string approval_id = 2;
Decision decision = 3;
string operator_id = 4;
string reason = 5;
// channel records who recorded it — for audit + telemetry.
Channel channel = 6;
// idempotency_key prevents double-submission from webhook retries.
string idempotency_key = 7;
}
enum Decision {
DECISION_UNSPECIFIED = 0;
DECISION_APPROVED = 1;
DECISION_DENIED = 2;
}
enum Channel {
CHANNEL_UNSPECIFIED = 0;
CHANNEL_DASHBOARD = 1;
CHANNEL_EMAIL = 2;
CHANNEL_SLACK = 3;
CHANNEL_SCM = 4;
CHANNEL_API = 5; // direct programmatic
}
message RecordDecisionResponse {
// result is OK on first-write, DUPLICATE when idempotency_key collides,
// CONFLICT when the approval is already resolved with a different
// decision (see §6.3 dedup semantics).
RecordResult result = 1;
Approval approval = 2;
}
enum RecordResult {
RECORD_RESULT_UNSPECIFIED = 0;
RECORD_RESULT_OK = 1;
RECORD_RESULT_DUPLICATE = 2;
RECORD_RESULT_CONFLICT = 3;
}
message Approval {
string approval_id = 1;
string org_id = 2;
string session_id = 3;
string agent_id = 4;
string tool_name = 5;
string target = 6;
string status = 7; // pending|approved|denied|expired|escalated
string template = 8;
int32 required_clearance = 9;
google.protobuf.Timestamp requested_at = 10;
google.protobuf.Timestamp deadline = 11;
google.protobuf.Timestamp resolved_at = 12;
string resolved_by = 13;
string resolution_reason = 14;
repeated string channels_dispatched = 15;
repeated DelegationLink delegation_chain = 16;
string policy_id = 17;
int32 escalation_level = 18; // 0 = original; incremented on escalation
}
message DelegateRequest {
string org_id = 1;
string approval_id = 2;
string from_member_id = 3;
string to_member_id = 4;
string reason = 5;
}
message DelegateResponse {
Approval approval = 1;
}
message DelegationLink {
string from_member_id = 1;
string to_member_id = 2;
int32 to_clearance = 3;
google.protobuf.Timestamp at = 4;
string reason = 5;
}
5. State machine
5.1 Session status transitions
Existing enum: AGENT_STATUS_{UNSPECIFIED, INITIALIZING, ACTIVE, SUSPENDED, TERMINATED, ERROR}. This HLD uses all states but does not add new ones.
INITIALIZING ──► ACTIVE ──► SUSPENDED ──► ACTIVE
│ │ │
│ │ └► TERMINATED (deny)
│ └► TERMINATED (force terminate)
└──────────────► TERMINATED (normal completion)
└──────────────► ERROR
Invariants (enforced at session.Manager):
- ACTIVE → SUSPENDED is only entered at a message boundary, i.e. between ExecuteStepEvent batches where loop_count has been fully persisted. Never mid-token, never mid-tool-call (see §5.2).
- SUSPENDED → ACTIVE requires a fresh checkpoint load and re-dial of a worker, possibly a different worker than before.
- SUSPENDED sessions do not count against the worker's concurrency budget. The worker is free to be recycled.
- While SUSPENDED, no AgentEvents (other than the terminating session_resumed / error) may be emitted on the portal stream.
- TERMINATED is absorbing: a terminated session cannot be resumed.
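The transition rules above can be sketched as a small lookup table. This is a minimal illustration, not the session.Manager implementation: the AgentStatus type, its numeric values, and the exact edge set are assumptions based on one reading of the diagram.

```go
package main

import "fmt"

// AgentStatus mirrors the names of the existing AGENT_STATUS_* enum.
// The numeric values are illustrative and are not the proto field numbers.
type AgentStatus int

const (
	Initializing AgentStatus = iota
	Active
	Suspended
	Terminated
	Error
)

// allowed holds the permitted edges (an assumed reading of the diagram).
// Terminated has no entry because it is absorbing.
var allowed = map[AgentStatus][]AgentStatus{
	Initializing: {Active, Error},
	Active:       {Suspended, Terminated, Error},
	Suspended:    {Active, Terminated},
}

// canTransition reports whether the move from -> to is permitted.
func canTransition(from, to AgentStatus) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println("pause allowed:", canTransition(Active, Suspended))
	fmt.Println("resume allowed:", canTransition(Suspended, Active))
	fmt.Println("resume after terminate allowed:", canTransition(Terminated, Active))
}
```

Keeping the edges in data rather than in scattered conditionals makes the "TERMINATED is absorbing" invariant a property of the table itself.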
5.2 Pause granularity — message boundary
Decision: pause at the next message boundary, never mid-tool-call. Rationale:
- Mid-tool-call pause would require tool-level cancellation primitives we do not have. Tools would either double-fire on resume (bad — tokens + side effects) or need per-tool checkpoint protocols (expensive for Wave 2).
- Mid-LLM-stream pause would require cancelling a provider request mid-flight; providers bill for emitted tokens regardless, and partial assistant messages are garbage context for resume.
- Message boundary = after a CompletionEvent OR after a tool result has been folded into state and the loop is about to dispatch the next LLM call. Both are already the checkpoint boundaries in Wave 2-delta §3.2 step 8.
Worst-case pause latency is bounded by the longest single LLM call + tool call in the current step, which we already cap at 120 s via the SendMessage deadline (delta §3.2). The operator sees StatusEvent{pending_pause} immediately and SessionPausedEvent within that bound.
5.3 In-flight tool calls when pause hits
If a governance Check returns requires_approval for a tool call, the tool does not run. The pause-request is raised before dispatch. On resume with APPROVED, the same tool call is re-dispatched with approval metadata in context.
If an operator-initiated pause arrives while a tool is mid-execution, the tool runs to completion, its result is folded in, and the session suspends at the subsequent boundary. The portal stream emits a StatusEvent{pending_pause=true} so the operator UI can show "pausing…".
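A minimal sketch of the boundary rule, assuming a hypothetical session struct and step functions (neither is taken from the codebase): the pause flag can be set at any time, but it is only consulted after the in-flight step has completed.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// session is a hypothetical stand-in for the state session.Manager tracks.
type session struct {
	pauseRequested atomic.Bool // set by a pause request, possibly mid-step
	completedSteps int
	suspended      bool
}

// run executes steps in order. A step (LLM call or tool call) is never
// interrupted; the pause flag is only checked once its result is folded in.
func (s *session) run(steps []func(*session)) {
	for _, step := range steps {
		step(s)
		s.completedSteps++
		// Message boundary: a real implementation would checkpoint here.
		if s.pauseRequested.Load() {
			s.suspended = true
			return
		}
	}
}

func main() {
	s := &session{}
	s.run([]func(*session){
		func(*session) {},
		func(x *session) { x.pauseRequested.Store(true) }, // pause arrives mid-step
		func(*session) {},
	})
	fmt.Println("completed steps:", s.completedSteps, "suspended:", s.suspended)
}
```

The second step still runs to completion even though the pause arrives while it executes; the third step never starts.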
5.4 Streaming subscribers
On pause:
- Fan-out Redis channel remains open; the next event on it is SessionPausedEvent.
- The gRPC portal stream does not close; it remains open, idle, until ResumeSession starts new events. This lets the browser keep the connection warm. A 10-minute idle keepalive ping is added to LifecycleService streams.
- If the portal client disconnects, resume will re-emit via Redis fan-out; reconnecting clients catch up via sequence_num replay (existing Wave 2-delta mechanism, §3.5).
Decision deferred to LLD: whether to close and re-open the portal stream on pause, shifting resume notification to Redis/WebSocket. Simpler model is "keep stream open." We will prototype and measure.
6. Approval policy lookup — 5-level hierarchy
6.1 Resolution order (highest priority wins)
Per-Request override (metadata on the CheckRequest)
Sub-Team policy (team_governance_policies where team_id = agent.sub_team)
Parent-Team policy (team_governance_policies where team_id = agent.parent_team)
Tenant policy (org_governance_policies)
Platform policy (NEW: platform_governance_policies, seeded on bootstrap)
Precedence is first match wins — the resolver walks from most-specific to least-specific and returns the first matching (action_type, target) row.
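The first-match walk can be sketched as follows. The Policy struct, the level labels, and the treatment of a '*' target as a wildcard are assumptions for illustration; the real resolver reads from the policy tables listed above.

```go
package main

import "fmt"

// Policy is a hypothetical flattened view of one policy row.
type Policy struct {
	Level      string // which hierarchy level the row belongs to
	ActionType string
	Target     string // "*" is assumed here to match any target
	Effect     string
}

// levels is ordered from most specific to least specific.
var levels = []string{"per_request", "sub_team", "parent_team", "tenant", "platform"}

// resolve returns the first policy matching (actionType, target),
// walking the hierarchy from most specific to least specific.
func resolve(policies []Policy, actionType, target string) (Policy, bool) {
	for _, lvl := range levels {
		for _, p := range policies {
			if p.Level != lvl || p.ActionType != actionType {
				continue
			}
			if p.Target == target || p.Target == "*" {
				return p, true
			}
		}
	}
	return Policy{}, false
}

func main() {
	policies := []Policy{
		{Level: "platform", ActionType: "tool_call", Target: "*", Effect: "allow"},
		{Level: "tenant", ActionType: "tool_call", Target: "deploy", Effect: "requires_approval"},
	}
	p, ok := resolve(policies, "tool_call", "deploy")
	fmt.Println(ok, p.Level, p.Effect)
}
```

In this sketch the tenant row shadows the platform row for the deploy target, while any other target falls through to the platform default.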
6.2 New schema additions
Migration 037_approval_chain_engine.up.sql:
-- Platform-level policies (Wave 2 new table). Zero rows on greenfield;
-- seeded via Pulumi-managed governance bundle during bootstrap.
CREATE TABLE platform_governance_policies (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
action_type TEXT NOT NULL,
target TEXT NOT NULL DEFAULT '*',
effect TEXT NOT NULL CHECK (effect IN ('allow','deny','requires_approval')),
min_clearance INT NOT NULL DEFAULT 0,
template TEXT NOT NULL DEFAULT 'dev_only', -- dev_only|dev_review|full_pipeline|critical_path
timeout_seconds INT NOT NULL DEFAULT 86400, -- 24h default
escalation_minutes_before_deadline INT NOT NULL DEFAULT 240,
conditions JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (action_type, target)
);
-- Platform policies are not RLS-filtered; they are read-only for tenants
-- and managed exclusively via admin portal.
-- Extend org_ and team_governance_policies with template + timeout fields
-- (idempotent migration — ADD COLUMN IF NOT EXISTS).
ALTER TABLE org_governance_policies
ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT 'dev_only',
ADD COLUMN IF NOT EXISTS timeout_seconds INT NOT NULL DEFAULT 86400,
ADD COLUMN IF NOT EXISTS escalation_minutes_before_deadline INT NOT NULL DEFAULT 240;
ALTER TABLE team_governance_policies
ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT 'dev_only',
ADD COLUMN IF NOT EXISTS timeout_seconds INT NOT NULL DEFAULT 86400,
ADD COLUMN IF NOT EXISTS escalation_minutes_before_deadline INT NOT NULL DEFAULT 240;
-- Extend governance_approvals with the fields the engine needs.
ALTER TABLE governance_approvals
ADD COLUMN IF NOT EXISTS session_id TEXT,
ADD COLUMN IF NOT EXISTS policy_id TEXT,
ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT 'dev_only',
ADD COLUMN IF NOT EXISTS required_clearance INT NOT NULL DEFAULT 0,
ADD COLUMN IF NOT EXISTS tool_name TEXT,
ADD COLUMN IF NOT EXISTS args_sha256 TEXT,
ADD COLUMN IF NOT EXISTS escalation_level INT NOT NULL DEFAULT 0,
ADD COLUMN IF NOT EXISTS channels_dispatched TEXT[] NOT NULL DEFAULT '{}';
-- Dedup: one open approval per (org_id, session_id, tool_name, args_sha256).
CREATE UNIQUE INDEX IF NOT EXISTS ix_gov_approvals_dedup
ON governance_approvals(org_id, session_id, tool_name, args_sha256)
WHERE status = 'pending';
-- Delegation chain.
CREATE TABLE approval_delegations (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id TEXT NOT NULL,
approval_id UUID NOT NULL REFERENCES governance_approvals(id) ON DELETE CASCADE,
from_member_id TEXT NOT NULL,
to_member_id TEXT NOT NULL,
to_clearance INT NOT NULL,
reason TEXT,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE approval_delegations ENABLE ROW LEVEL SECURITY;
ALTER TABLE approval_delegations FORCE ROW LEVEL SECURITY;
CREATE POLICY scope_isolation ON approval_delegations
USING (org_id = current_setting('app.org_id', true));
CREATE INDEX ix_delegations_approval ON approval_delegations(approval_id);
-- Append-only lifecycle log (separate from the hash-chained agent_audit_log,
-- which also receives one entry per transition; this table is the
-- operationally-queryable read model).
CREATE TABLE approval_events (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
org_id TEXT NOT NULL,
approval_id UUID NOT NULL,
event_type TEXT NOT NULL
CHECK (event_type IN ('requested','dispatched','delegated',
'approved','denied','expired','escalated',
'channel_duplicate','channel_conflict')),
channel TEXT,
actor_member_id TEXT,
idempotency_key TEXT,
payload JSONB NOT NULL DEFAULT '{}',
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE approval_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE approval_events FORCE ROW LEVEL SECURITY;
CREATE POLICY scope_isolation ON approval_events
USING (org_id = current_setting('app.org_id', true));
CREATE UNIQUE INDEX ix_approval_events_idem
ON approval_events(approval_id, idempotency_key)
WHERE idempotency_key IS NOT NULL;
6.3 Dedup semantics
Decision: one pending approval per (org_id, session_id, tool_name, args_sha256). A second RequestApproval with matching key returns the same approval_id (was_deduplicated=true). RecordDecision is idempotent on (approval_id, idempotency_key). If two channels race to record the same decision, the first wins and subsequent attempts return DUPLICATE (same decision) or CONFLICT (conflicting decision). On CONFLICT, the first decision stands — per PRD P4.3.10, consensus-breaking on conflicting approvals is explicitly deferred to V1.1.
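A minimal in-memory sketch of the RecordDecision outcomes described above. The approval struct and record method are hypothetical; the real engine would rely on the pending-guarded UPDATE and the unique index rather than on in-process state.

```go
package main

import "fmt"

type Decision int

const (
	Approved Decision = iota + 1
	Denied
)

type Result string

const (
	OK        Result = "OK"
	Duplicate Result = "DUPLICATE"
	Conflict  Result = "CONFLICT"
)

// approval is a hypothetical in-memory stand-in for one governance_approvals row.
type approval struct {
	resolved bool
	decision Decision
	seenKeys map[string]bool
}

func newApproval() *approval { return &approval{seenKeys: map[string]bool{}} }

// record applies first-write-wins semantics.
func (a *approval) record(d Decision, idempotencyKey string) Result {
	switch {
	case a.seenKeys[idempotencyKey]:
		return Duplicate // webhook retry with the same key
	case !a.resolved:
		a.resolved, a.decision = true, d
		a.seenKeys[idempotencyKey] = true
		return OK
	case a.decision == d:
		return Duplicate // another channel recorded the same decision
	default:
		return Conflict // the first decision stands
	}
}

func main() {
	a := newApproval()
	fmt.Println(a.record(Approved, "email-1"))
	fmt.Println(a.record(Approved, "email-1"))
	fmt.Println(a.record(Denied, "slack-1"))
}
```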
7. Multi-channel adapter interface
7.1 Adapter contract (Go)
// internal/runtime/approval/channel/adapter.go
type Adapter interface {
Name() Channel // governancev1.CHANNEL_DASHBOARD etc.
// Dispatch is called once per approval creation (and once per
// escalation). Must be idempotent on approval_id.
Dispatch(ctx context.Context, req DispatchRequest) error
}
type DispatchRequest struct {
ApprovalID string
OrgID string
SessionID string
AgentID string
ToolName string
Target string
ArgsRedacted []byte // redacted per tenant security config
Template string
Deadline time.Time
RequiredClearance int
DashboardURL string
DecisionURLs map[Decision]string // HMAC-signed one-click URLs
}
7.2 Inbound webhook gateway (HMAC)
One HTTP handler shared across email / Slack / SCM replies, mounted at:
POST /api/v1/approvals/callback/{channel}/{approval_id}?t=<deadline>&d=<decision>&sig=<hmac>
- sig is HMAC-SHA256 of approval_id|decision|t|operator_id using a per-tenant rotating HMAC secret stored in Vault (aligned with PRD P4.3.11 "one-click approval for non-technical users").
- Clock skew tolerance: ±5 min.
- Idempotency key = hash(approval_id, channel, decision, t).
- Handler translates to ApprovalService.RecordDecision and returns a signed confirmation page (email) or 200 OK (Slack/SCM webhook).
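The signing and verification steps can be sketched with the Go standard library. Several details here are assumptions rather than part of the design: hex encoding of sig, t expressed as Unix seconds, and the skew tolerance applied as a grace period after t. The sign and verify helpers are hypothetical names.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"time"
)

// clockSkewSeconds is the ±5 min tolerance, in seconds.
const clockSkewSeconds int64 = 5 * 60

// sign returns the hex-encoded HMAC-SHA256 of approval_id|decision|t|operator_id.
func sign(secret []byte, approvalID, decision string, t int64, operatorID string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(approvalID + "|" + decision + "|" + strconv.FormatInt(t, 10) + "|" + operatorID))
	return hex.EncodeToString(mac.Sum(nil))
}

// verify checks the signature in constant time, then rejects links whose
// deadline t (plus the skew tolerance) has already passed.
func verify(secret []byte, approvalID, decision string, t int64, operatorID, sig string, nowUnix int64) bool {
	got, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	want, _ := hex.DecodeString(sign(secret, approvalID, decision, t, operatorID))
	if !hmac.Equal(got, want) {
		return false
	}
	return nowUnix <= t+clockSkewSeconds
}

func main() {
	secret := []byte("example-secret") // in practice fetched from Vault per tenant
	deadline := time.Now().Add(24 * time.Hour).Unix()
	sig := sign(secret, "approval-123", "approved", deadline, "operator-1")
	fmt.Println("valid:", verify(secret, "approval-123", "approved", deadline, "operator-1", sig, time.Now().Unix()))
	fmt.Println("tampered:", verify(secret, "approval-123", "denied", deadline, "operator-1", sig, time.Now().Unix()))
}
```

Because the decision is part of the signed payload, an approve link cannot be rewritten into a deny link (or the reverse) without invalidating the signature.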
7.3 Adding a new channel
New channels implement Adapter, register with channel.Registry, add a Channel enum variant, and extend the webhook handler's routing table. No other changes required.
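A sketch of what registration might look like, under assumptions: the Registry type, its methods, the trimmed DispatchRequest, and the memoryAdapter are hypothetical and only echo the Adapter contract from §7.1.

```go
package main

import (
	"context"
	"fmt"
)

type Channel int

const (
	ChannelDashboard Channel = 1
	ChannelEmail     Channel = 2
)

// DispatchRequest is trimmed to one field for this sketch.
type DispatchRequest struct{ ApprovalID string }

type Adapter interface {
	Name() Channel
	Dispatch(ctx context.Context, req DispatchRequest) error
}

// Registry is a hypothetical shape for channel.Registry.
type Registry struct{ adapters map[Channel]Adapter }

func NewRegistry() *Registry { return &Registry{adapters: map[Channel]Adapter{}} }

// Register rejects a second adapter for the same channel.
func (r *Registry) Register(a Adapter) error {
	if _, dup := r.adapters[a.Name()]; dup {
		return fmt.Errorf("channel %d already registered", a.Name())
	}
	r.adapters[a.Name()] = a
	return nil
}

// Dispatch fans a request out to every enabled channel.
func (r *Registry) Dispatch(ctx context.Context, enabled []Channel, req DispatchRequest) error {
	for _, ch := range enabled {
		a, ok := r.adapters[ch]
		if !ok {
			return fmt.Errorf("no adapter for channel %d", ch)
		}
		if err := a.Dispatch(ctx, req); err != nil {
			return err
		}
	}
	return nil
}

// memoryAdapter records dispatches; keying by approval ID keeps it idempotent.
type memoryAdapter struct {
	name Channel
	sent map[string]bool
}

func (m *memoryAdapter) Name() Channel { return m.name }

func (m *memoryAdapter) Dispatch(_ context.Context, req DispatchRequest) error {
	m.sent[req.ApprovalID] = true
	return nil
}

func main() {
	r := NewRegistry()
	dash := &memoryAdapter{name: ChannelDashboard, sent: map[string]bool{}}
	if err := r.Register(dash); err != nil {
		fmt.Println("register failed:", err)
		return
	}
	err := r.Dispatch(context.Background(), []Channel{ChannelDashboard}, DispatchRequest{ApprovalID: "a1"})
	fmt.Println("dispatch error:", err, "dispatched approvals:", len(dash.sent))
}
```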
8. Timeout and escalation
8.1 Scheduler mechanism — Redis ZSET
Decision: Redis sorted set, not a daily walker. The daily-walker alternative would poll SELECT ... WHERE expires_at < now() AND status='pending', which is O(n) per tick and, running once a day, would routinely miss the 4-hour-before-deadline escalation window.
ZSET key: approvals:deadlines:{deadline_kind}
kind ∈ {"escalation", "expiry"}
member: approval_id
score: unix_timestamp_seconds of next action
Orchestrator singleton goroutine:
every 10s:
ZRANGEBYSCORE key -inf now LIMIT 100
for each approval_id:
load from PG, verify still pending
if kind=escalation: perform escalation (§8.3), re-ZADD with score=expiry
if kind=expiry: mark expired, trigger default action (deny), pause/resume
ZREM key approval_id
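The tick loop above can be exercised without Redis or Postgres by swapping in in-memory stand-ins. In this sketch the zset map, the scheduler struct, and its fields are all hypothetical; only the control flow follows the pseudocode.

```go
package main

import (
	"fmt"
	"sort"
)

// zset is an in-memory stand-in for a Redis sorted set: member -> score.
type zset map[string]int64

// due mimics a ZRANGEBYSCORE from -inf to now, capped at limit members.
func (z zset) due(now int64, limit int) []string {
	var ids []string
	for id, score := range z {
		if score <= now {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool {
		if z[ids[i]] != z[ids[j]] {
			return z[ids[i]] < z[ids[j]]
		}
		return ids[i] < ids[j]
	})
	if len(ids) > limit {
		ids = ids[:limit]
	}
	return ids
}

type scheduler struct {
	escalation, expiry zset
	status             map[string]string // stands in for governance_approvals.status
	deadline           map[string]int64  // unix seconds
	escalations        map[string]int
}

// tick processes due escalations first, then due expiries.
func (s *scheduler) tick(now int64) {
	for _, id := range s.escalation.due(now, 100) {
		if s.status[id] == "pending" { // verify still pending
			s.escalations[id]++
			s.expiry[id] = s.deadline[id] // schedule the expiry action
		}
		delete(s.escalation, id) // ZREM
	}
	for _, id := range s.expiry.due(now, 100) {
		if s.status[id] == "pending" {
			s.status[id] = "expired" // default action (deny) would follow
		}
		delete(s.expiry, id)
	}
}

func main() {
	s := &scheduler{
		escalation:  zset{"a1": 100},
		expiry:      zset{},
		status:      map[string]string{"a1": "pending"},
		deadline:    map[string]int64{"a1": 200},
		escalations: map[string]int{},
	}
	s.tick(150)
	fmt.Println("after escalation tick:", s.status["a1"], s.escalations["a1"])
	s.tick(200)
	fmt.Println("after expiry tick:", s.status["a1"])
}
```

Re-checking the status before acting is what makes a late tick harmless: an approval resolved between scheduling and the tick is simply removed.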
8.2 Defaults per template (from PRD)
| Template | Timeout | Escalation window | Expiry action |
|---|---|---|---|
| dev_only | 24 h | none | deny |
| dev_review | 24 h | 4 h before | escalate once, then deny |
| full_pipeline | 48 h | 8 h before | escalate up hierarchy, then deny |
| critical_path | 72 h | 24 h before | escalate up hierarchy, then deny |
Override chain from §6.1 applies — per-request can tighten but never loosen beyond platform ceiling.
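As a worked example of the table and the tighten-only rule, the sketch below derives the escalation and expiry instants for a request. The helper names are hypothetical, and treating the resolved policy's timeout as the ceiling is an assumption.

```go
package main

import "fmt"

// templateDefaults uses the same units as the timeout_seconds and
// escalation_minutes_before_deadline columns.
type templateDefaults struct {
	TimeoutSeconds          int
	EscalationMinutesBefore int // 0 means no escalation
}

var defaults = map[string]templateDefaults{
	"dev_only":      {24 * 3600, 0},
	"dev_review":    {24 * 3600, 4 * 60},
	"full_pipeline": {48 * 3600, 8 * 60},
	"critical_path": {72 * 3600, 24 * 60},
}

// effectiveTimeout lets a per-request override tighten the timeout but
// never exceed the ceiling; a non-positive request means "no override".
func effectiveTimeout(requestedSeconds, ceilingSeconds int) int {
	if requestedSeconds <= 0 || requestedSeconds > ceilingSeconds {
		return ceilingSeconds
	}
	return requestedSeconds
}

// deadlines returns the unix-second instants for escalation and expiry.
// escalateAt is 0 when the template has no escalation window.
func deadlines(requestedAt int64, timeoutSeconds, escalationMinutesBefore int) (escalateAt, expireAt int64) {
	expireAt = requestedAt + int64(timeoutSeconds)
	if escalationMinutesBefore > 0 {
		escalateAt = expireAt - int64(escalationMinutesBefore)*60
	}
	return escalateAt, expireAt
}

func main() {
	d := defaults["dev_review"]
	timeout := effectiveTimeout(0, d.TimeoutSeconds)
	escalateAt, expireAt := deadlines(1000, timeout, d.EscalationMinutesBefore)
	fmt.Println("escalate at:", escalateAt, "expire at:", expireAt)
}
```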
8.3 Escalation semantics
On escalation tick, the engine walks one level up the policy hierarchy (sub-team → parent-team → tenant → platform), dispatches the approval to the new level's designated approvers (configured on the policy row), increments escalation_level, and writes an approval_events.event_type=escalated row. The original deadline stays; escalation is a notification change, not a time extension. This keeps the semantics simple for Wave 2; tunable extension on escalation is an LLD open item.
8.4 Scheduler HA
Single-replica for Wave 2 (same posture as the crash-recovery sweeper, delta §5.3 OoS #1). Orchestrator Deployment stays pinned to replicas: 1 while this runs in-process. A multi-replica follow-up would gate each tick with a Postgres advisory lock.
9. Delegation
9.1 Data model
approval_delegations table (§6.2). Each row is one hop: from_member_id → to_member_id for a single approval_id. Chains are formed by multiple rows.
9.2 Clearance rule
Delegation is allowed when to_member.clearance >= approval.required_clearance. The engine resolves to_member.clearance via the existing RBAC grants table (rbac_grants, migration 019).
Decision: we do NOT require to_member.clearance >= from_member.clearance. The semantic is "can this operator discharge this specific approval?" — their clearance on other decisions is not at issue. This matches the PRD's "delegatee must have >= required clearance" phrasing.
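A minimal sketch of the clearance rule applied hop by hop. The hop struct and validateChain helper are hypothetical; in the real engine the delegatee's clearance would come from rbac_grants rather than from the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// hop is a hypothetical view of one approval_delegations row.
type hop struct {
	From, To    string
	ToClearance int
}

var errInsufficientClearance = errors.New("delegatee clearance below required_clearance")

// validateChain checks every hop against the approval's required clearance
// and returns the final delegatee. The delegator's own clearance is
// deliberately not compared.
func validateChain(chain []hop, required int) (string, error) {
	resolver := ""
	for i, h := range chain {
		if h.ToClearance < required {
			return "", fmt.Errorf("hop %d (%s -> %s): %w", i, h.From, h.To, errInsufficientClearance)
		}
		resolver = h.To
	}
	return resolver, nil
}

func main() {
	chain := []hop{
		{From: "alice", To: "bob", ToClearance: 3},
		{From: "bob", To: "carol", ToClearance: 2},
	}
	who, err := validateChain(chain, 2)
	fmt.Println("final delegatee:", who, "error:", err)
	_, err = validateChain(chain, 3)
	fmt.Println("with required clearance 3:", err)
}
```

Here carol can discharge an approval requiring clearance 2 even though bob, who delegated to her, holds clearance 3.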
9.3 Audit chain
Every delegation writes a hash-chained audit entry action_type=approval_delegated with {from, to, approval_id, reason}. On subsequent decision, the resolved_by is the final delegatee; the delegation chain is retrievable via GetApproval.delegation_chain.
10. Audit
Every lifecycle event writes a hash-chained agent_audit_log row; approval events additionally write an approval_events row:
| Event | agent_audit_log (hash-chained) | approval_events (operational) |
|---|---|---|
| Pause requested | session_paused | — |
| Session suspended | session_suspended | — |
| Approval requested | approval_requested | requested |
| Channel dispatched | approval_dispatched | dispatched per channel |
| Delegation | approval_delegated | delegated |
| Decision recorded | approval_decision | approved / denied |
| Escalation tick | approval_escalated | escalated |
| Expiry | approval_expired | expired |
| Session resumed | session_resumed | — |
agent_audit_log rows are hash-chained per the wave-1-item-4 mechanism. approval_events is a denormalised, query-optimised read model — the two never diverge because both are written inside the same pgx transaction that mutates governance_approvals.
11. Metrics
OpenTelemetry instrument names (prefix runtime_ or approval_):
- runtime_sessions_paused_total{tenant_id, pause_source}
- runtime_sessions_resumed_total{tenant_id, resumed_by} (approval|operator|policy)
- runtime_pause_latency_seconds{tenant_id}: histogram, request → SUSPENDED
- runtime_resume_latency_seconds{tenant_id}: histogram, RecordDecision → ACTIVE
- runtime_suspended_sessions_current{tenant_id}: gauge
- approval_requests_total{tenant_id, template}
- approval_decisions_total{tenant_id, decision, channel}
- approval_resolution_seconds{tenant_id, template}: histogram, request → decision
- approval_expiries_total{tenant_id, template}
- approval_escalations_total{tenant_id, template, to_level}
- approval_delegations_total{tenant_id}
- approval_webhook_invocations_total{channel, result} (ok|hmac_invalid|duplicate|conflict|expired_link)
- approval_dedup_hits_total{tenant_id}
- approval_scheduler_tick_duration_seconds: histogram
All label cardinality reviewed — no session_id / approval_id on labels (following 6a learning from Wave 2-delta).
12. Threat model
12.1 Defended against
- Policy bypass via forged decision. All inbound channel webhooks are HMAC-signed and verified against per-tenant Vault-stored secrets.
- Replay of approval decision links. Idempotency key = hash(approval_id, channel, decision, t) de-dups at DB level; unique index enforces.
- Cross-tenant approval theft. RLS on governance_approvals, approval_delegations, approval_events gates every read.
- Clearance escalation via delegation. to_member.clearance >= required_clearance verified on the Delegate path; delegations cannot raise an approver's effective authority.
- Stuck session DoS. Deadlines are Redis-scheduler enforced, never operator-trusted.
- In-flight decision race. Single-transaction UPDATE with a WHERE status='pending' guard; losers see CONFLICT / DUPLICATE.
- Approval-spam DoS. Dedup index at (org_id, session_id, tool_name, args_sha256) means repeated tool calls with identical args → one approval.
12.2 Not defended against (accepted)
- Compromised approver account. Outside scope — relies on the Clerk + clearance authentication layer.
- Social engineering of approvers via channel content. Channel payloads are operator-visible; the tool args displayed are redacted per tenant security config but a sufficiently crafted prompt-injected args field could mislead a human. Mitigation is operator training + the eventual ML-layer detection (P4.8.8, Wave 5).
- Consensus-breaking conflicts. If two simultaneous channels record opposite decisions, first-write wins. PRD P4.3.10 explicitly defers consensus resolution to V1.1.
- Scheduler single-point-of-failure during 10s tick window. Crash within that window can delay a deadline tick by up to restart-time + 10 s. Acceptable for 24h-72h SLAs.
13. Rollout
13.1 Feature flags (all via existing tenant_security_config JSONB)
| Flag | Default | Purpose |
|---|---|---|
| `approval_chain_enabled` | `false` | Master switch. When false, MCP middleware falls back to the legacy in-process 5-min block (B-1). Set true per-tenant after migration. |
| `approval_channels` | `["dashboard"]` | Array; controls which channels dispatch. Email/Slack/SCM opt-in. |
| `approval_hmac_secret_id` | `null` | Vault key pointer; required when any non-dashboard channel is enabled. |
| `runtime_pause_resume_enabled` | `true` | Pause/Resume RPC availability. Safe to enable platform-wide since it is additive. |
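The defaults and the one cross-flag constraint in the table can be expressed as a small resolver. This is a sketch only: `resolve_flags` and the error message are hypothetical, and the real `tenant_security_config` JSONB may carry other keys that are ignored here.

```python
import copy

# Defaults copied from the flag table above.
DEFAULTS = {
    "approval_chain_enabled": False,
    "approval_channels": ["dashboard"],
    "approval_hmac_secret_id": None,
    "runtime_pause_resume_enabled": True,
}


def resolve_flags(tenant_security_config: dict) -> dict:
    """Overlay tenant overrides on the defaults and check the HMAC constraint."""
    flags = copy.deepcopy(DEFAULTS)  # never hand out the shared default list
    for name in DEFAULTS:
        if name in tenant_security_config:
            flags[name] = tenant_security_config[name]

    external = [c for c in flags["approval_channels"] if c != "dashboard"]
    if external and not flags["approval_hmac_secret_id"]:
        raise ValueError(
            f"approval_hmac_secret_id is required for channels: {external}"
        )
    return flags


untouched = resolve_flags({})
opted_in = resolve_flags(
    {
        "approval_chain_enabled": True,
        "approval_channels": ["dashboard", "email"],
        "approval_hmac_secret_id": "vault-key-pointer",  # placeholder value
    }
)
```

Rejecting the config at resolution time keeps a tenant from enabling email, Slack, or SCM dispatch without a signing key in place.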
13.2 Backward compatibility
- Sessions started pre-Wave-2 use the old AGENT_STATUS_ACTIVE path and never receive pause events. The checkpoint format in agent_checkpoints (Wave 2-delta) is schema-versioned; pause simply writes a new checkpoint with the same header.
- Clients that do not handle `session_paused`/`session_resumed` ignore unknown oneof variants per proto3 semantics; no wire break.
- `GovernanceService.ResolveApproval` stays functional in deprecation: it now polls the new engine. Internal callers (MCP middleware) are migrated off it in this wave.
13.3 Migration ordering
- Ship migration 037 (non-breaking — additive).
- Ship `LifecycleService.PauseSession`/`ResumeSession` + session state machine; emit paused/resumed events. `approval_chain_enabled=false` means nothing calls them except operator UI (deferred).
- Ship `ApprovalService` RPCs + scheduler + channel adapters. Legacy middleware still active.
- Per-tenant, set `approval_chain_enabled=true`. MCP middleware branch-switches at runtime from legacy block to RequestApproval + pause. Both paths tested in integration; rollback = toggle flag.
- After all tenants migrated, remove legacy branch in a subsequent release.
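The runtime branch-switch in the per-tenant step might look roughly like the following. The function and callback names are hypothetical; the only point illustrated is that the flag is read on each gated tool call, so toggling it back is enough to roll back.

```python
from typing import Callable


def gate_tool_call(
    flags: dict,
    request_approval_and_pause: Callable[[], str],
    legacy_block: Callable[[], str],
) -> str:
    """Pick the approval path for one tool call based on the tenant flag."""
    if flags.get("approval_chain_enabled", False):
        # New path: create the approval and pause the session.
        return request_approval_and_pause()
    # Legacy path: in-process blocking wait.
    return legacy_block()


# Stand-in callbacks; the real ones would call the approval engine or block.
new_path = lambda: "paused_for_approval"
old_path = lambda: "legacy_block"

migrated = gate_tool_call({"approval_chain_enabled": True}, new_path, old_path)
rolled_back = gate_tool_call({"approval_chain_enabled": False}, new_path, old_path)
```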
14. Explicit non-goals (Wave 2)
- Operator dashboard UI. Wave 2 ships APIs only. Dashboard is a separate frontend PRD. Per PRD clarification on #380, frontend is deferred.
- Mobile push notifications for approvals. Not in P4.3.
- Consensus breaking on conflicting approvals. Deferred to V1.1 per PRD P4.3.10. First-write wins.
- Multi-approver quorum (e.g. "requires 2 of 3 approvers"). Deferred — single approver with optional delegation is Wave 2 scope.
- Pause mid-tool-call. §5.2 — message boundary only.
- Tunable deadline extension on escalation. §8.3 — escalation is a notification change, not a time extension, in this wave.
- Multi-replica scheduler HA. §8.4 — single-replica like the sweeper.
- Operator delegation-of-delegation restrictions. Any approver can delegate once; chained delegations are allowed but the clearance check applies at each hop.
- Cross-tenant approval (platform-level approver). Platform approvals use a platform-admin member identity but still carry the `org_id` of the requesting tenant.
- Scheduled-pause. No cron/scheduled pause in Wave 2. Operator-initiated and policy-initiated only.
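The per-hop clearance rule for chained delegations can be sketched as below. It assumes clearance is an integer where a higher value means more authority, which this document does not state; `Member` and `validate_delegation_chain` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    member_id: str
    clearance: int  # assumption: larger number = higher clearance


def validate_delegation_chain(delegatees: list[Member], required_clearance: int) -> bool:
    """Every delegatee in the chain must meet the approval's required clearance."""
    return all(m.clearance >= required_clearance for m in delegatees)


# alice delegates to bob, bob delegates to carol; the approval needs clearance 3.
chain_ok = [Member("bob", 3), Member("carol", 4)]
chain_bad = [Member("bob", 3), Member("dave", 2)]

ok = validate_delegation_chain(chain_ok, required_clearance=3)
bad = validate_delegation_chain(chain_bad, required_clearance=3)
```

Checking each hop against `required_clearance`, rather than against the previous delegator, matches the rule as currently specced and is the behaviour OQ-7 asks the founder to confirm.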
15. Open questions for founder sign-off
Each of these genuinely changes the wave scope or has cost/UX implications the architect should not unilaterally decide.
- OQ-1. Portal stream on pause — keep open vs close? §5.4. Recommended: keep open with idle keepalive. Closing simplifies server code but requires a robust reconnect/replay in every client. Founder preference?
- OQ-2. Scheduler tick interval. 10 s is defensible for 24h-72h deadlines (max 10 s late on escalation). Tighter wastes cycles, looser risks missed 4-hour-before-deadline notifications. Acceptable?
- OQ-3. Channel set at GA. PRD calls for dashboard, email, Slack, SCM. Implementing all four in Wave 2 is tractable, but if we want to ship faster we can gate Slack + SCM behind a Wave 2.1 delta. Recommended: ship all four, opt-in per tenant. Confirm?
- OQ-4. HMAC secret rotation cadence. Per-tenant Vault-stored. Proposed: 90-day rotation, dual-key grace period. Approve or prefer shorter?
- OQ-5. Platform policy management surface. Platform governance policies are admin-portal-only. Wave 2 ships the table + seed bundle; the admin portal CRUD UI is a separate repo's work. Is a seed-file-only bootstrap acceptable for initial GA, with CRUD following?
- OQ-6. Conflict-on-race posture: first write wins, as specced? §6.3 and §12.2. PRD P4.3.10 defers consensus to V1.1; we've encoded that literally (`CONFLICT` result + no reconciliation). Confirm this is acceptable for GA.
- OQ-7. Delegatee-clearance-vs-from-clearance. §9.2. Specced as `to >= required_clearance` only, not `to >= from`. Is that correct?
- OQ-8. Single-replica scheduler. §8.4. Same posture as crash-recovery sweeper. Accept the availability envelope (restart-time + 10 s worst-case) for Wave 2?
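The dual-key grace period proposed in OQ-4 can be sketched as a verifier that accepts a signature made with either the current or the previous secret. The signed message layout, the 300-second skew window, and the function names are all assumptions for illustration; the actual callback format belongs to the channel design.

```python
import hashlib
import hmac
from typing import Optional


def sign(secret: bytes, approval_id: str, decision: str, issued_at: int) -> str:
    """HMAC-SHA256 over a hypothetical 'approval_id|decision|issued_at' message."""
    msg = f"{approval_id}|{decision}|{issued_at}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()


def verify(
    sig: str,
    approval_id: str,
    decision: str,
    issued_at: int,
    now: int,
    current_key: bytes,
    previous_key: Optional[bytes] = None,
    max_skew_s: int = 300,  # assumed window
) -> bool:
    """Accept a callback signed with the current key or, during grace, the previous one."""
    if abs(now - issued_at) > max_skew_s:
        return False
    for key in (current_key, previous_key):
        if key is None:
            continue
        expected = sign(key, approval_id, decision, issued_at)
        # Constant-time comparison on bytes avoids timing side channels.
        if hmac.compare_digest(expected.encode(), sig.encode()):
            return True
    return False


old_key, new_key = b"old-secret", b"new-secret"
sig_old = sign(old_key, "appr-1", "approve", 1_000)

during_grace = verify(sig_old, "appr-1", "approve", 1_000, 1_010, new_key, old_key)
after_grace = verify(sig_old, "appr-1", "approve", 1_000, 1_010, new_key)
stale = verify(sig_old, "appr-1", "approve", 1_000, 2_000, new_key, old_key)
```

Dropping `previous_key` once the grace period ends is what retires the old secret; no signatures need to be reissued before then.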
16. Risks
- R-1. Checkpoint cost on high-volume approval scenarios. Every pause writes a new checkpoint row. A tenant with 100 concurrent agents each triggering multiple approvals per session multiplies checkpoint write rate. Mitigation: Wave 2-delta's existing checkpoint compression (zstd) + index on `(session_id, loop_iteration DESC)`. Measure in load test before GA.
- R-2. MCP middleware branching complexity. Flag-gated dual-path (legacy 5-min block vs new pause/resume) increases test surface. Mitigation: both paths covered by parallel integration tests; legacy branch has a hard deprecation date one release after GA.
- R-3. Channel webhook DoS surface. Public HMAC-signed endpoints are an invitation. Mitigation: per-tenant rate limit on the callback handler; tight HMAC clock-skew window; 401 on sig failure; structured logging of sig failures for monitoring.
- R-4. Redis ZSET as sole source of deadline truth. A Redis AOF/RDB gap could drop scheduled deadlines. Mitigation: ZSET is a cache; the source of truth is `governance_approvals.expires_at`. On scheduler startup, `SELECT approval_id, expires_at FROM governance_approvals WHERE status='pending'` and re-ZADD. This also closes the bootstrap gap.
- R-5. Hash-chain ordering on concurrent transitions. The existing hash chain uses per-session monotonic sequence. Approval lifecycle events happen outside a specific session's execution path but carry `session_id`. Must serialise via the existing chain-tracker to avoid duplicate sequences. Confirm in LLD.
- R-6. Large tool args in approval dispatch. If an agent tries to call a tool with a 1 MB args blob, the channel payload explodes. Mitigation: args redaction/truncation at dispatch time, with a `"View full args in dashboard"` link. Redaction rules come from existing `tenant_security_config`.
- R-7. Deadline clock drift on GKE nodes. Deadlines are UTC wall-clock. Node clock skew > 5 min would break HMAC validation and scheduler accuracy. Mitigation: rely on GKE-managed NTP; existing posture. No new work.
17. Appendix — mapping to PRD P4.3 sub-items
| PRD sub-item | Addressed in HLD section | Notes |
|---|---|---|
| P4.3.1 5-level hierarchy | §6.1 | Platform → Tenant → Parent → Sub → Per-request |
| P4.3.2 Templates | §6.2, §8.2 | Enum + timeout/escalation per template |
| P4.3.3 Dashboard channel | §7.1, §7.2 | Dashboard is CHANNEL_DASHBOARD, default-on |
| P4.3.4 Email channel | §7.1, §7.2 | HMAC one-click links |
| P4.3.5 Slack channel | §7.1, §7.2 | Slack webhook → RecordDecision |
| P4.3.6 SCM channel | §7.1, §7.2 | SCM issue comment webhook |
| P4.3.7 Timeout defaults | §8.2 | 24h/48h/72h per template |
| P4.3.8 Escalation | §8.3 | One-level-up on configured window |
| P4.3.9 Delegation | §9 | Clearance-checked, chained, audited |
| P4.3.10 Consensus | §6.3, §12.2, §14 | Explicitly deferred to V1.1 |
| P4.3.11 Non-technical one-click | §7.2 | HMAC-signed short-URL confirmation |
| P4.8.3 Runtime integration | §3.2 Flow B, §13.3 | MCP middleware migration behind flag |
| P4.8.4 Pause/Resume RPC | §4.1, §5 | PauseSession, ResumeSession, events |