
HLD: Agent Runtime Wave 2 — Approval Chain Engine + Pause/Resume API

| Field | Value |
| --- | --- |
| Status | Draft, awaiting founder approval |
| Version | 1.0 |
| Date | 2026-04-13 |
| Author | Principal Technical Architect Agent |
| Scope | PRD #380 Wave 2: P4.8.3 Approval Chain Engine + P4.8.4 Pause/Resume API |
| PRD | #380 (approved) |
| Parent PRD | #1 |
| Tracker | #381 |
| Milestone | 9 |
| Wave 1 sign-off | https://github.com/upsquad-ai/upsquad-core/issues/381#issuecomment-4247176283 |
| Precedent HLD style | docs/hld/agent-runtime-wave-1-agent-isolation.md |
| Related | docs/hld/agent-runtime-wave2-delta.md (checkpoint persistence primitive) |

1. Why one HLD for two items

P4.8.3 (Approval Chain Engine) and P4.8.4 (Pause/Resume API) are inseparable. Pause/Resume is the runtime primitive; Approval Chain is the first — and for now, only — consumer of that primitive beyond direct operator override. Designing them separately risks an abstraction mismatch where the approval flow leaks runtime concerns or the runtime primitive bakes in approval-specific assumptions.

Architecturally this document specifies:

  1. A runtime-level PauseSession / ResumeSession RPC pair that suspends a session at a clean message boundary and re-hydrates it from the existing Wave 2-delta checkpoint table, exposing a typed session_paused / session_resumed event on the portal stream.
  2. An ApprovalService layered on top that activates the P4.3 requirement set: 5-level policy hierarchy, multi-channel recording (dashboard / email / Slack / SCM webhook), timeout + escalation, delegation with clearance check, dedup, and hash-chained audit.

Anything in this HLD that conflicts with the parent PRD or the Wave 2-delta HLD is a bug in this document, not an override.


2. Current state — what exists on main today

Confirmed by direct source inspection (2026-04-13).

| # | Fact | Evidence |
| --- | --- | --- |
| 1 | AGENT_STATUS_SUSPENDED = 5 is already defined in the proto but never emitted | proto/upsquad/runtime/v1/runtime.proto:154 |
| 2 | governance_approvals table already exists with columns (id, org_id, team_id, agent_id, action_type, target, status, requested_at, resolved_at, resolved_by, reason, metadata, expires_at) and RLS policy | internal/context/store/migrations/028_governance_policies.up.sql:52-78 |
| 3 | GovernanceService.Check already returns verdict requires_approval with an approval_id, and ResolveApproval RPC polls until resolved | proto/upsquad/governance/v1/governance.proto:19-33 |
| 4 | MCP middleware blocks in-process for ApprovalTimeout = 5 * time.Minute inside the tool call path when verdict is requires_approval | internal/mcp/middleware/middleware.go:81-176 |
| 5 | Checkpoint persistence (agent_checkpoints table + sweeper-based resume) shipped in the Wave 2-delta and is the state primitive pause/resume will reuse | docs/hld/agent-runtime-wave2-delta.md §5, internal/runtime/session/sweeper.go, internal/runtime/checkpointstore/ |
| 6 | Hash-chained audit (wave-1 item 4) is live; every approval lifecycle event will slot into the existing chain | internal/context/store/migrations/036_audit_log_hash_chain.up.sql, internal/runtime/audit/hashchain.go |

Two things are broken-as-designed and this HLD replaces them:

  • B-1. 5-minute in-process block. middleware.go:167-174 holds the MCP request goroutine while polling ResolveApproval. This does not survive orchestrator restarts, wastes compute, caps us at ~5 min approvals even though the PRD calls for 24 h and 72 h defaults, and makes the caller observe a timeout rather than a paused session. Replace with pause/resume.
  • B-2. governance_approvals.expires_at default of 1 hour and no escalation. Must become policy-driven (24 h dev, 72 h critical-path) with scheduled escalation to the next hierarchy level on timeout.

3. Architecture overview

3.1 Component diagram

                Portal / Operator UI
                        │
                        │ gRPC (existing)
                        ▼
                ┌──────────────────────────────┐
                │ LifecycleService (Go)        │◄──── PauseSession / ResumeSession (NEW)
                │ internal/runtime/server      │      operator-initiated
                └───────┬──────────────────────┘
                        │ session state transitions
                        ▼
                ┌──────────────────────────────┐
                │ session.Manager              │
                │ + pause/resume transitions   │
                │ + checkpoint hydrate         │
                └───────┬──────────────────────┘
                        │
        ┌───────────────┼────────────────────────────┐
        ▼               ▼                            ▼
┌────────────────┐ ┌─────────────────┐ ┌──────────────────────────┐
│ checkpointstore│ │ streaming.Pub   │ │ Redis sorted-set         │
│ (pg)           │ │ (session_paused │ │ ZSET approval_deadlines  │
│                │ │ / _resumed)     │ │ score=unix deadline      │
└────────────────┘ └─────────────────┘ └──────────┬───────────────┘
                                                  │
                                                  ▼
                                       ┌──────────────────────┐
                                       │ ApprovalScheduler    │
                                       │ (orch goroutine)     │ NEW
                                       │ timeout + escalation │
                                       └──────────┬───────────┘
                                                  │
                  ┌───────────────────────────────┼────────────────────┐
                  ▼                               ▼                    ▼
        ┌────────────────────┐      ┌──────────────────────┐ ┌─────────────────┐
        │ ApprovalService    │      │ PolicyResolver       │ │ ChannelAdapter  │
        │ (gRPC, NEW)        │◄─────┤ 5-level hierarchy    │ │ Registry (NEW)  │
        │ - Request          │      │ platform→tenant→     │ │ - Dashboard     │
        │ - RecordDecision   │      │ parent→sub→per-req   │ │ - Email (HMAC)  │
        │ - List / Get       │      │ lookup               │ │ - Slack         │
        │ - Delegate         │      │                      │ │ - SCM           │
        └─────────┬──────────┘      └──────────────────────┘ └───────┬─────────┘
                  │                                                  │
                  │ pgx (governance_approvals + NEW tables)          │ webhooks in
                  ▼                                                  ▼
        ┌─────────────────────┐                     ┌─────────────────────┐
        │ Postgres            │                     │ ChannelWebhookGW    │
        │ governance_approvals│                     │ (HTTP handler)      │
        │ approval_policies   │ NEW                 │ HMAC verify, dedup  │
        │ approval_delegations│ NEW                 │ → RecordDecision    │
        │ approval_events     │ NEW (lifecycle log) └─────────────────────┘
        └─────────────────────┘
                  │
                  ▼
        ┌─────────────────────┐
        │ agent_audit_log     │ (existing, hash-chained)
        │ every pause/resume/ │
        │ request/decision    │
        └─────────────────────┘

3.2 Two primary data flows

Flow A — operator-initiated pause. Operator clicks Pause in the portal (future UI; Wave 2 exposes the API only, see §14).

OperatorClient → LifecycleService.PauseSession(session_id, reason)
→ session.Manager.RequestPause(session_id) // sets pause_requested=true
→ worker receives pause signal via control-plane header on next event boundary
→ worker flushes final ExecuteStepEvent for the in-flight step, emits
ExecuteStepEvent{paused=PausedEvent{reason}}
→ orchestrator writes checkpoint, transitions session → SUSPENDED
→ emit AgentEvent{session_paused=SessionPausedEvent{reason, checkpoint_key}}
→ audit entry: session_paused
→ RPC returns

Flow B — approval-initiated pause. MCP tool call hits governance Check, verdict is requires_approval.

agent-worker → MCP middleware → GovernanceService.Check
→ verdict=requires_approval, approval_id=<uuid>
→ middleware emits a PAUSE_REQUIRED control event back through ExecuteStep stream
→ orchestrator runs the same pause path as Flow A (shared code), but records
approval_id on the session row
→ ApprovalService.RequestApproval(approval_id, session_id, tool, args_hash, policy_id)
→ ChannelAdapter.Dispatch → email + slack + SCM comment (per policy template)
→ ApprovalScheduler schedules deadline in Redis ZSET score=expires_at
→ session stays SUSPENDED until RecordDecision

On RecordDecision(approval_id, APPROVED, operator, reason):

ApprovalService validates operator clearance (incl. delegation chain)
→ update governance_approvals.status = 'approved'
→ audit: approval_decision (hash-chained)
→ call LifecycleService.ResumeSession(session_id, operator_input={approval_metadata})
→ session.Manager loads checkpoint, transitions → ACTIVE
→ worker re-dialled, approval metadata injected as system-tagged context message
→ emit AgentEvent{session_resumed=SessionResumedEvent{resumed_by_approval=true}}

On RecordDecision(approval_id, DENIED, operator, reason):

→ update governance_approvals.status = 'denied'
→ audit: approval_decision denied
→ LifecycleService.TerminateSession(session_id, reason="approval denied: …")
→ emit AgentEvent{error=ErrorEvent{reason=approval_denied}}
→ session → TERMINATED

4. RPC surface

4.1 LifecycleService additions (runtime v1)

// proto/upsquad/runtime/v1/lifecycle.proto

service LifecycleService {
  // ...existing RPCs unchanged...

  // PauseSession requests that an active session transition to SUSPENDED
  // at the next clean message boundary. Idempotent: pausing an already-
  // paused session returns success with status=SUSPENDED.
  rpc PauseSession(PauseSessionRequest) returns (PauseSessionResponse);

  // ResumeSession re-hydrates a SUSPENDED session from its latest
  // checkpoint and transitions it back to ACTIVE. operator_input is
  // optional; when set, it is injected into the session context as a
  // system-tagged message before the loop continues.
  //
  // Resuming a non-SUSPENDED session returns FailedPrecondition.
  rpc ResumeSession(ResumeSessionRequest) returns (ResumeSessionResponse);
}

message PauseSessionRequest {
  string tenant_id = 1; // from JWT, gateway-injected
  string session_id = 2;
  // reason is a human-readable label; recorded in audit.
  string reason = 3;
  // pause_source enumerates who triggered the pause. Approval-engine
  // callers set APPROVAL; operator UI sets OPERATOR.
  PauseSource pause_source = 4;
  // correlation_id optionally links this pause to an approval_id or a
  // policy event for cross-system audit correlation.
  string correlation_id = 5;
}

enum PauseSource {
  PAUSE_SOURCE_UNSPECIFIED = 0;
  PAUSE_SOURCE_OPERATOR = 1;
  PAUSE_SOURCE_APPROVAL = 2;
  PAUSE_SOURCE_POLICY = 3; // future: automated policy gate
}

message PauseSessionResponse {
  AgentStatus status = 1; // SUSPENDED on success
  string checkpoint_key = 2; // pointer into agent_checkpoints
  google.protobuf.Timestamp paused_at = 3;
  bool was_already_suspended = 4; // idempotency signal
}

message ResumeSessionRequest {
  string tenant_id = 1;
  string session_id = 2;
  // operator_input is optional structured input injected into the context
  // as if from a human message. JSON-encoded; schema negotiated per
  // pause_source. For approval resumes this carries:
  //   { "approval_id": "...", "decision": "approved",
  //     "operator_id": "...", "reason": "...", "delegated_from": "..." }
  bytes operator_input = 3;
  string resume_reason = 4;
}

message ResumeSessionResponse {
  AgentStatus status = 1; // ACTIVE on success
  int32 resumed_at_loop = 2;
  google.protobuf.Timestamp resumed_at = 3;
}
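The operator_input payload described in the ResumeSessionRequest comment can be illustrated with a small Go encoder. This is a minimal sketch, not the runtime's implementation: the type name ApprovalResumeInput, the function EncodeOperatorInput, and the choice to omit delegated_from when empty are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ApprovalResumeInput mirrors the JSON shape listed in the
// ResumeSessionRequest.operator_input comment. The Go type name and the
// omitempty on delegated_from are assumptions of this sketch.
type ApprovalResumeInput struct {
	ApprovalID    string `json:"approval_id"`
	Decision      string `json:"decision"`
	OperatorID    string `json:"operator_id"`
	Reason        string `json:"reason"`
	DelegatedFrom string `json:"delegated_from,omitempty"`
}

// EncodeOperatorInput produces the bytes that would be carried in the
// operator_input field of a ResumeSessionRequest.
func EncodeOperatorInput(in ApprovalResumeInput) ([]byte, error) {
	return json.Marshal(in)
}

func main() {
	b, err := EncodeOperatorInput(ApprovalResumeInput{
		ApprovalID: "appr-1",
		Decision:   "approved",
		OperatorID: "member-42",
		Reason:     "reviewed diff",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```

Keeping the payload as opaque bytes in the proto lets each pause_source define its own schema, at the cost of validating the JSON on the receiving side.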

Two new event variants extend AgentEvent.oneof event:

message AgentEvent {
  oneof event {
    TokenEvent token = 1;
    StatusEvent status = 2;
    CompletionEvent completion = 3;
    ErrorEvent error = 4;
    SessionPausedEvent session_paused = 5; // NEW
    SessionResumedEvent session_resumed = 6; // NEW
  }
}

message SessionPausedEvent {
  string reason = 1;
  PauseSource pause_source = 2;
  string correlation_id = 3; // approval_id when source=APPROVAL
  string checkpoint_key = 4;
  google.protobuf.Timestamp paused_at = 5;
}

message SessionResumedEvent {
  bool resumed_by_approval = 1;
  string approval_id = 2; // empty when not approval-triggered
  int32 resumed_at_loop = 3;
  google.protobuf.Timestamp resumed_at = 4;
}

4.2 ApprovalService (governance v1, new file)

New proto file: proto/upsquad/governance/v1/approval.proto. GovernanceService is left untouched: ResolveApproval remains as a polling convenience but is deprecated in place.

syntax = "proto3";
package upsquad.governance.v1;

// Required for the google.protobuf.Timestamp fields below.
import "google/protobuf/timestamp.proto";

service ApprovalService {
  // RequestApproval is called from the runtime (MCP middleware / worker
  // control plane) when a governance Check returns requires_approval.
  // Idempotent on (session_id, tool_name, args_sha256): returns the
  // existing approval_id if one is already pending.
  rpc RequestApproval(RequestApprovalRequest) returns (RequestApprovalResponse);

  // RecordDecision is the single converged entry point for all channels
  // (dashboard, email webhook, Slack webhook, SCM webhook). Channel
  // adapters translate their inbound payload to this RPC.
  rpc RecordDecision(RecordDecisionRequest) returns (RecordDecisionResponse);

  // Delegate transfers approval authority to another member. Delegatee
  // MUST have clearance >= the approval's required_clearance. The
  // original approver retains audit-of-record for the delegation.
  rpc Delegate(DelegateRequest) returns (DelegateResponse);

  // Get / List are read surfaces for the operator UI (frontend deferred,
  // but APIs shipped in Wave 2 so dashboard work can start in parallel).
  rpc GetApproval(GetApprovalRequest) returns (Approval);
  rpc ListApprovals(ListApprovalsRequest) returns (ListApprovalsResponse);
}

message RequestApprovalRequest {
  string org_id = 1;
  string team_id = 2;
  string session_id = 3;
  string agent_id = 4;
  string member_id = 5; // the agent's effective member identity
  int32 clearance = 6;
  string action_type = 7; // e.g. "tool_call"
  string tool_name = 8;
  string target = 9;
  bytes args = 10; // the full tool args (for context)
  string args_sha256 = 11; // used for dedup
  string policy_id = 12; // policy that produced requires_approval
  int32 required_clearance = 13;
  string template = 14; // "dev_only" | "dev_review" | "full_pipeline" | "critical_path"
  google.protobuf.Timestamp deadline = 15; // scheduler-enforced
  repeated string channels = 16; // e.g. ["dashboard","email","slack","scm"]
  map<string,string> metadata = 17;
}

message RequestApprovalResponse {
  string approval_id = 1;
  bool was_deduplicated = 2; // true when an open approval matched
  google.protobuf.Timestamp deadline = 3;
}

message RecordDecisionRequest {
  string org_id = 1;
  string approval_id = 2;
  Decision decision = 3;
  string operator_id = 4;
  string reason = 5;
  // channel records who recorded it, for audit + telemetry.
  Channel channel = 6;
  // idempotency_key prevents double-submission from webhook retries.
  string idempotency_key = 7;
}

enum Decision {
  DECISION_UNSPECIFIED = 0;
  DECISION_APPROVED = 1;
  DECISION_DENIED = 2;
}

enum Channel {
  CHANNEL_UNSPECIFIED = 0;
  CHANNEL_DASHBOARD = 1;
  CHANNEL_EMAIL = 2;
  CHANNEL_SLACK = 3;
  CHANNEL_SCM = 4;
  CHANNEL_API = 5; // direct programmatic
}

message RecordDecisionResponse {
  // result is OK on first-write, DUPLICATE when idempotency_key collides,
  // CONFLICT when the approval is already resolved with a different
  // decision (see §6.3 dedup semantics).
  RecordResult result = 1;
  Approval approval = 2;
}

enum RecordResult {
  RECORD_RESULT_UNSPECIFIED = 0;
  RECORD_RESULT_OK = 1;
  RECORD_RESULT_DUPLICATE = 2;
  RECORD_RESULT_CONFLICT = 3;
}

message Approval {
  string approval_id = 1;
  string org_id = 2;
  string session_id = 3;
  string agent_id = 4;
  string tool_name = 5;
  string target = 6;
  string status = 7; // pending|approved|denied|expired|escalated
  string template = 8;
  int32 required_clearance = 9;
  google.protobuf.Timestamp requested_at = 10;
  google.protobuf.Timestamp deadline = 11;
  google.protobuf.Timestamp resolved_at = 12;
  string resolved_by = 13;
  string resolution_reason = 14;
  repeated string channels_dispatched = 15;
  repeated DelegationLink delegation_chain = 16;
  string policy_id = 17;
  int32 escalation_level = 18; // 0 = original; incremented on escalation
}

message DelegateRequest {
  string org_id = 1;
  string approval_id = 2;
  string from_member_id = 3;
  string to_member_id = 4;
  string reason = 5;
}

message DelegateResponse {
  Approval approval = 1;
}

message DelegationLink {
  string from_member_id = 1;
  string to_member_id = 2;
  int32 to_clearance = 3;
  google.protobuf.Timestamp at = 4;
  string reason = 5;
}

5. State machine

5.1 Session status transitions

Existing enum: AGENT_STATUS_{UNSPECIFIED, INITIALIZING, ACTIVE, SUSPENDED, TERMINATED, ERROR}. This HLD uses all states but does not add new ones.

INITIALIZING ──► ACTIVE ──► SUSPENDED ──► ACTIVE
│ │ │
│ │ └► TERMINATED (deny)
│ └► TERMINATED (force terminate)
└──────────────► TERMINATED (normal completion)
└──────────────► ERROR

Invariants (enforced at session.Manager):

  • ACTIVE → SUSPENDED is only entered at a message boundary, i.e. between ExecuteStepEvent batches where loop_count has been fully persisted. Never mid-token, never mid-tool-call (see §5.2).
  • SUSPENDED → ACTIVE requires a fresh checkpoint load and re-dial of a worker — possibly a different worker than before.
  • SUSPENDED sessions do not count against the worker's concurrency budget. The worker is free to be recycled.
  • While SUSPENDED, no AgentEvents (other than the terminating session_resumed / error) may be emitted on the portal stream.
  • TERMINATED is absorbing — a terminated session cannot be resumed.
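The invariants above can be read as a transition table. The following Go sketch is illustrative only: the names Status, allowed and CanTransition are hypothetical, and the exact edge set for TERMINATED and ERROR is an assumption read off the diagram rather than taken from session.Manager.

```go
package main

import "fmt"

// Status mirrors the AGENT_STATUS_* names; string values are used here
// for readability and are not the proto enum numbers.
type Status string

const (
	Initializing Status = "INITIALIZING"
	Active       Status = "ACTIVE"
	Suspended    Status = "SUSPENDED"
	Terminated   Status = "TERMINATED"
	Error        Status = "ERROR"
)

// allowed lists the outgoing edges per state. Terminated has no entry
// because it is absorbing. The Error edges are an assumption.
var allowed = map[Status][]Status{
	Initializing: {Active, Error},
	Active:       {Suspended, Terminated, Error},
	Suspended:    {Active, Terminated},
}

// CanTransition reports whether from -> to is a permitted edge.
func CanTransition(from, to Status) bool {
	for _, next := range allowed[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Active, Suspended))    // pause at a boundary
	fmt.Println(CanTransition(Suspended, Active))    // resume from checkpoint
	fmt.Println(CanTransition(Terminated, Active))   // absorbing state
	fmt.Println(CanTransition(Initializing, Suspended))
}
```

A table-driven check like this keeps the "TERMINATED is absorbing" rule in one place instead of scattering it across RPC handlers.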

5.2 Pause granularity — message boundary

Decision: pause at the next message boundary, never mid-tool-call. Rationale:

  • Mid-tool-call pause would require tool-level cancellation primitives we do not have. Tools would either double-fire on resume (bad — tokens + side effects) or need per-tool checkpoint protocols (expensive for Wave 2).
  • Mid-LLM-stream pause would require cancelling a provider request mid-flight; providers bill for emitted tokens regardless, and partial assistant messages are garbage context for resume.
  • Message boundary = after a CompletionEvent OR after a tool result has been folded into state and the loop is about to dispatch the next LLM call. Both are already the checkpoint boundaries in Wave 2-delta §3.2 step 8.

Worst-case pause latency is bounded by the longest single LLM call + tool call in the current step, which we already cap at 120 s via the SendMessage deadline (delta §3.2). The operator sees StatusEvent{pending_pause} immediately and SessionPausedEvent within that bound.

5.3 In-flight tool calls when pause hits

If a governance Check returns requires_approval for a tool call, the tool does not run. The pause-request is raised before dispatch. On resume with APPROVED, the same tool call is re-dispatched with approval metadata in context.

If an operator-initiated pause arrives while a tool is mid-execution, the tool runs to completion, its result is folded in, and the session suspends at the subsequent boundary. The portal stream emits a StatusEvent{pending_pause=true} so the operator UI can show "pausing…".
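To make the boundary rule concrete, here is a minimal Go sketch of a loop that only observes a pause request between steps. The Session type, RequestPause and RunLoop are hypothetical names for illustration; the real loop lives in the worker and orchestrator and also writes a checkpoint, which this sketch only marks with a comment.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Session is a hypothetical stand-in for the state session.Manager tracks.
type Session struct {
	pauseRequested atomic.Bool
	LoopCount      int
	Suspended      bool
}

// RequestPause records the request; it never interrupts a running step.
func (s *Session) RequestPause() { s.pauseRequested.Store(true) }

// RunLoop runs up to maxSteps steps. The pause flag is checked only at
// the message boundary, after the previous step has been fully folded
// into state, so a step (LLM call plus tool call) always completes.
func (s *Session) RunLoop(maxSteps int, step func(loop int)) {
	for s.LoopCount < maxSteps {
		if s.pauseRequested.Load() {
			// A checkpoint write and SessionPausedEvent would happen here.
			s.Suspended = true
			return
		}
		step(s.LoopCount)
		s.LoopCount++
	}
}

func main() {
	s := &Session{}
	s.RunLoop(5, func(loop int) {
		fmt.Println("step", loop)
		if loop == 1 {
			s.RequestPause() // operator pause arrives mid-step
		}
	})
	fmt.Println("loop_count:", s.LoopCount, "suspended:", s.Suspended)
}
```

In this run the pause arrives during step 1, that step finishes, and the session suspends before step 2 starts.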

5.4 Streaming subscribers

On pause:

  1. Fan-out Redis channel remains open; the next event on it is SessionPausedEvent.
  2. The gRPC portal stream does not close — it remains open, idle, until ResumeSession starts new events. This lets the browser keep the connection warm. A 10-minute idle keepalive ping is added to LifecycleService streams.
  3. If the portal client disconnects, resume will re-emit via Redis fan-out; reconnecting clients catch up via sequence_num replay (existing Wave 2-delta mechanism, §3.5).

Decision deferred to LLD: whether to close and re-open the portal stream on pause, shifting resume notification to Redis/WebSocket. Simpler model is "keep stream open." We will prototype and measure.


6. Approval policy lookup — 5-level hierarchy

6.1 Resolution order (highest priority wins)

Per-Request override (metadata on the CheckRequest)
Sub-Team policy (team_governance_policies where team_id = agent.sub_team)
Parent-Team policy (team_governance_policies where team_id = agent.parent_team)
Tenant policy (org_governance_policies)
Platform policy (NEW: platform_governance_policies, seeded on bootstrap)

Precedence is first match wins — the resolver walks from most-specific to least-specific and returns the first matching (action_type, target) row.
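The walk described above can be sketched in Go as follows. This is an illustrative in-memory version, not the PolicyResolver implementation: the Level, Policy and Resolve names are hypothetical, and treating target "*" as a wildcard is an assumption based on the schema default.

```go
package main

import "fmt"

// Level orders the hierarchy from most specific to least specific.
type Level int

const (
	PerRequest Level = iota
	SubTeam
	ParentTeam
	Tenant
	Platform
)

// Policy is a simplified policy row; only the matching fields are kept.
type Policy struct {
	Level      Level
	ActionType string
	Target     string // "*" is treated as a wildcard (assumption)
	Effect     string // allow | deny | requires_approval
}

// Resolve walks levels from PerRequest down to Platform and returns the
// first policy matching (actionType, target).
func Resolve(policies []Policy, actionType, target string) (Policy, bool) {
	for lvl := PerRequest; lvl <= Platform; lvl++ {
		for _, p := range policies {
			if p.Level != lvl || p.ActionType != actionType {
				continue
			}
			if p.Target == target || p.Target == "*" {
				return p, true
			}
		}
	}
	return Policy{}, false
}

func main() {
	policies := []Policy{
		{Level: Tenant, ActionType: "tool_call", Target: "*", Effect: "deny"},
		{Level: SubTeam, ActionType: "tool_call", Target: "deploy", Effect: "requires_approval"},
	}
	p, ok := Resolve(policies, "tool_call", "deploy")
	fmt.Println(p.Effect, ok)
}
```

Here the sub-team row wins over the tenant-wide deny because it is found first in the most-specific-to-least-specific walk.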

6.2 New schema additions

Migration 037_approval_chain_engine.up.sql:

-- Platform-level policies (Wave 2 new table). Zero rows on greenfield;
-- seeded via Pulumi-managed governance bundle during bootstrap.
CREATE TABLE platform_governance_policies (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    action_type TEXT NOT NULL,
    target TEXT NOT NULL DEFAULT '*',
    effect TEXT NOT NULL CHECK (effect IN ('allow','deny','requires_approval')),
    min_clearance INT NOT NULL DEFAULT 0,
    template TEXT NOT NULL DEFAULT 'dev_only', -- dev_only|dev_review|full_pipeline|critical_path
    timeout_seconds INT NOT NULL DEFAULT 86400, -- 24h default
    escalation_minutes_before_deadline INT NOT NULL DEFAULT 240,
    conditions JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    UNIQUE (action_type, target)
);
-- Platform policies are not RLS-filtered; they are read-only for tenants
-- and managed exclusively via admin portal.

-- Extend org_ and team_governance_policies with template + timeout fields
-- (idempotent migration: ADD COLUMN IF NOT EXISTS).
ALTER TABLE org_governance_policies
    ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT 'dev_only',
    ADD COLUMN IF NOT EXISTS timeout_seconds INT NOT NULL DEFAULT 86400,
    ADD COLUMN IF NOT EXISTS escalation_minutes_before_deadline INT NOT NULL DEFAULT 240;

ALTER TABLE team_governance_policies
    ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT 'dev_only',
    ADD COLUMN IF NOT EXISTS timeout_seconds INT NOT NULL DEFAULT 86400,
    ADD COLUMN IF NOT EXISTS escalation_minutes_before_deadline INT NOT NULL DEFAULT 240;

-- Extend governance_approvals with the fields the engine needs.
ALTER TABLE governance_approvals
    ADD COLUMN IF NOT EXISTS session_id TEXT,
    ADD COLUMN IF NOT EXISTS policy_id TEXT,
    ADD COLUMN IF NOT EXISTS template TEXT NOT NULL DEFAULT 'dev_only',
    ADD COLUMN IF NOT EXISTS required_clearance INT NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS tool_name TEXT,
    ADD COLUMN IF NOT EXISTS args_sha256 TEXT,
    ADD COLUMN IF NOT EXISTS escalation_level INT NOT NULL DEFAULT 0,
    ADD COLUMN IF NOT EXISTS channels_dispatched TEXT[] NOT NULL DEFAULT '{}';

-- Dedup: one open approval per (org_id, session_id, tool_name, args_sha256).
CREATE UNIQUE INDEX IF NOT EXISTS ix_gov_approvals_dedup
    ON governance_approvals(org_id, session_id, tool_name, args_sha256)
    WHERE status = 'pending';

-- Delegation chain.
CREATE TABLE approval_delegations (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id TEXT NOT NULL,
    approval_id UUID NOT NULL REFERENCES governance_approvals(id) ON DELETE CASCADE,
    from_member_id TEXT NOT NULL,
    to_member_id TEXT NOT NULL,
    to_clearance INT NOT NULL,
    reason TEXT,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE approval_delegations ENABLE ROW LEVEL SECURITY;
ALTER TABLE approval_delegations FORCE ROW LEVEL SECURITY;
CREATE POLICY scope_isolation ON approval_delegations
    USING (org_id = current_setting('app.org_id', true));
CREATE INDEX ix_delegations_approval ON approval_delegations(approval_id);

-- Append-only lifecycle log (separate from the hash-chained agent_audit_log,
-- which also receives one entry per transition; this table is the
-- operationally-queryable read model).
CREATE TABLE approval_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    org_id TEXT NOT NULL,
    approval_id UUID NOT NULL,
    event_type TEXT NOT NULL
        CHECK (event_type IN ('requested','dispatched','delegated',
                              'approved','denied','expired','escalated',
                              'channel_duplicate','channel_conflict')),
    channel TEXT,
    actor_member_id TEXT,
    idempotency_key TEXT,
    payload JSONB NOT NULL DEFAULT '{}',
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
ALTER TABLE approval_events ENABLE ROW LEVEL SECURITY;
ALTER TABLE approval_events FORCE ROW LEVEL SECURITY;
CREATE POLICY scope_isolation ON approval_events
    USING (org_id = current_setting('app.org_id', true));
CREATE UNIQUE INDEX ix_approval_events_idem
    ON approval_events(approval_id, idempotency_key)
    WHERE idempotency_key IS NOT NULL;

6.3 Dedup semantics

Decision: one pending approval per (org_id, session_id, tool_name, args_sha256). A second RequestApproval with matching key returns the same approval_id (was_deduplicated=true). RecordDecision is idempotent on (approval_id, idempotency_key). If two channels race to record the same decision, the first wins and subsequent attempts return DUPLICATE (same decision) or CONFLICT (conflicting decision). On CONFLICT, the first decision stands — per PRD P4.3.10, consensus-breaking on conflicting approvals is explicitly deferred to V1.1.
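The OK / DUPLICATE / CONFLICT outcomes can be sketched with an in-memory stand-in. This is a simplified illustration under stated assumptions: the Approval struct and RecordDecision method below are hypothetical, and the real engine would rely on the database guard and unique index rather than Go fields.

```go
package main

import "fmt"

// Result mirrors the RecordResult enum names.
type Result string

const (
	OK        Result = "OK"
	Duplicate Result = "DUPLICATE"
	Conflict  Result = "CONFLICT"
)

// Approval is an in-memory stand-in for one governance_approvals row.
type Approval struct {
	Status     string // "pending", "approved" or "denied"
	winningKey string // idempotency key of the first accepted decision
}

// RecordDecision applies first-write-wins semantics: the first decision
// on a pending approval is stored; later calls never change it.
func (a *Approval) RecordDecision(decision, idempotencyKey string) Result {
	if a.Status == "pending" {
		a.Status = decision
		a.winningKey = idempotencyKey
		return OK
	}
	// Webhook retry of the winning call, or another channel agreeing.
	if idempotencyKey == a.winningKey || a.Status == decision {
		return Duplicate
	}
	return Conflict
}

func main() {
	a := &Approval{Status: "pending"}
	fmt.Println(a.RecordDecision("approved", "email-1")) // first write
	fmt.Println(a.RecordDecision("approved", "email-1")) // webhook retry
	fmt.Println(a.RecordDecision("approved", "slack-1")) // same decision
	fmt.Println(a.RecordDecision("denied", "scm-1"))     // opposite decision
	fmt.Println(a.Status)
}
```

The final status stays "approved" even after the conflicting call, which matches the first-decision-stands rule.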


7. Multi-channel adapter interface

7.1 Adapter contract (Go)

// internal/runtime/approval/channel/adapter.go
package channel

import (
	"context"
	"time"
)

// Adapter is implemented once per outbound approval channel.
type Adapter interface {
	Name() Channel // governancev1.CHANNEL_DASHBOARD etc.
	// Dispatch is called once per approval creation (and once per
	// escalation). Must be idempotent on approval_id.
	Dispatch(ctx context.Context, req DispatchRequest) error
}

// DispatchRequest carries the data an adapter needs to notify approvers.
type DispatchRequest struct {
	ApprovalID        string
	OrgID             string
	SessionID         string
	AgentID           string
	ToolName          string
	Target            string
	ArgsRedacted      []byte // redacted per tenant security config
	Template          string
	Deadline          time.Time
	RequiredClearance int
	DashboardURL      string
	DecisionURLs      map[Decision]string // HMAC-signed one-click URLs
}

7.2 Inbound webhook gateway (HMAC)

One HTTP handler shared across email / Slack / SCM replies, mounted at:

POST /api/v1/approvals/callback/{channel}/{approval_id}?t=<deadline>&d=<decision>&sig=<hmac>
  • sig is HMAC-SHA256 of approval_id|decision|t|operator_id using a per-tenant rotating HMAC secret stored in Vault (aligned with PRD P4.3.11 "one-click approval for non-technical users").
  • Clock skew tolerance: ±5 min.
  • Idempotency key = hash(approval_id, channel, decision, t).
  • Handler translates to ApprovalService.RecordDecision and returns a signed confirmation page (email) or 200 OK (Slack/SCM webhook).
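Signing and verifying the callback signature can be sketched with the Go standard library. This is a minimal illustration, not the gateway's code: Sign and Verify are hypothetical helper names, and hex encoding of the signature is an assumption since the encoding is not specified here.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// payload joins the signed fields in the order approval_id|decision|t|operator_id.
func payload(approvalID, decision, t, operatorID string) []byte {
	return []byte(strings.Join([]string{approvalID, decision, t, operatorID}, "|"))
}

// Sign returns the hex-encoded HMAC-SHA256 of the payload.
// Hex encoding is an assumption of this sketch.
func Sign(secret []byte, approvalID, decision, t, operatorID string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload(approvalID, decision, t, operatorID))
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify recomputes the MAC and compares it in constant time.
func Verify(secret []byte, approvalID, decision, t, operatorID, sig string) bool {
	got, err := hex.DecodeString(sig)
	if err != nil {
		return false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(payload(approvalID, decision, t, operatorID))
	return hmac.Equal(got, mac.Sum(nil))
}

func main() {
	secret := []byte("per-tenant-secret") // would come from Vault
	sig := Sign(secret, "appr-1", "approved", "1770000000", "member-42")
	fmt.Println(Verify(secret, "appr-1", "approved", "1770000000", "member-42", sig))
	fmt.Println(Verify(secret, "appr-1", "denied", "1770000000", "member-42", sig))
}
```

Because the decision is part of the signed payload, a link minted for "approved" cannot be replayed as "denied".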

7.3 Adding a new channel

New channels implement Adapter, register with channel.Registry, add a Channel enum variant, and extend the webhook handler's routing table. No other changes required.
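As an illustration of the registration step, the sketch below shows one way a registry keyed by channel could look. Everything here is hypothetical: the Registry type with its Register and Get methods, the trimmed DispatchRequest, and the toy logAdapter are stand-ins and do not describe the actual channel.Registry API.

```go
package main

import (
	"context"
	"fmt"
)

// Channel stands in for the generated governancev1 Channel enum.
type Channel int

const (
	ChannelDashboard Channel = 1
	ChannelEmail     Channel = 2
)

// DispatchRequest is trimmed to one field for this sketch.
type DispatchRequest struct {
	ApprovalID string
}

// Adapter matches the contract in §7.1.
type Adapter interface {
	Name() Channel
	Dispatch(ctx context.Context, req DispatchRequest) error
}

// Registry maps a channel to its adapter (hypothetical shape).
type Registry struct {
	adapters map[Channel]Adapter
}

func NewRegistry() *Registry {
	return &Registry{adapters: map[Channel]Adapter{}}
}

// Register adds an adapter and rejects a second adapter for the same channel.
func (r *Registry) Register(a Adapter) error {
	if _, exists := r.adapters[a.Name()]; exists {
		return fmt.Errorf("channel %d already registered", a.Name())
	}
	r.adapters[a.Name()] = a
	return nil
}

// Get looks up the adapter for a channel.
func (r *Registry) Get(c Channel) (Adapter, bool) {
	a, ok := r.adapters[c]
	return a, ok
}

// logAdapter is a toy adapter that records the approvals it was sent.
type logAdapter struct {
	name Channel
	sent []string
}

func (l *logAdapter) Name() Channel { return l.name }

func (l *logAdapter) Dispatch(_ context.Context, req DispatchRequest) error {
	l.sent = append(l.sent, req.ApprovalID)
	return nil
}

func main() {
	reg := NewRegistry()
	email := &logAdapter{name: ChannelEmail}
	if err := reg.Register(email); err != nil {
		panic(err)
	}
	if a, ok := reg.Get(ChannelEmail); ok {
		_ = a.Dispatch(context.Background(), DispatchRequest{ApprovalID: "appr-1"})
	}
	fmt.Println(email.sent)
}
```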


8. Timeout and escalation

8.1 Scheduler mechanism — Redis ZSET

Decision: Redis sorted set, not a daily walker. The daily-walker alternative would poll SELECT ... WHERE expires_at < now() AND status='pending' once a day; each tick is O(n) in the number of pending approvals, and a daily cadence is far too coarse to hit the 4-hour-before-deadline escalation window.

ZSET key: approvals:deadlines:{deadline_kind}
  kind ∈ {"escalation", "expiry"}
  member: approval_id
  score: unix_timestamp_seconds of next action

Orchestrator singleton goroutine:
  every 10s:
    ZRANGEBYSCORE key -inf now LIMIT 0 100
    for each approval_id:
      load from PG, verify still pending
      if kind=escalation: perform escalation (§8.3), re-ZADD with score=expiry
      if kind=expiry: mark expired, trigger default action (deny), pause/resume
      ZREM key approval_id
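The tick loop above can be illustrated without Redis by using a map as a stand-in for each sorted set. This is a sketch under stated assumptions: zset, due, Scheduler and Tick are hypothetical names, the Status and Deadline maps replace the Postgres lookup, and the escalation itself is reduced to appending the id to a slice.

```go
package main

import (
	"fmt"
	"sort"
)

// zset is an in-memory stand-in for one Redis sorted set: member -> score.
type zset map[string]int64

// due returns up to limit members with score <= now, ordered by score
// and then by member, which is how a ZRANGEBYSCORE with LIMIT 0 limit
// would order them.
func due(z zset, now int64, limit int) []string {
	var ids []string
	for id, score := range z {
		if score <= now {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool {
		if z[ids[i]] != z[ids[j]] {
			return z[ids[i]] < z[ids[j]]
		}
		return ids[i] < ids[j]
	})
	if len(ids) > limit {
		ids = ids[:limit]
	}
	return ids
}

// Scheduler holds both deadline sets plus stand-ins for the database.
type Scheduler struct {
	Escalation zset
	Expiry     zset
	Status     map[string]string // approval_id -> status
	Deadline   map[string]int64  // approval_id -> expiry unix seconds
	Escalated  []string          // ids escalated so far
}

// Tick processes due escalations first, then due expiries.
func (s *Scheduler) Tick(now int64) {
	for _, id := range due(s.Escalation, now, 100) {
		if s.Status[id] == "pending" {
			s.Escalated = append(s.Escalated, id)
			s.Expiry[id] = s.Deadline[id] // schedule the expiry tick
		}
		delete(s.Escalation, id) // ZREM
	}
	for _, id := range due(s.Expiry, now, 100) {
		if s.Status[id] == "pending" {
			s.Status[id] = "expired" // default action (deny) would follow
		}
		delete(s.Expiry, id) // ZREM
	}
}

func main() {
	s := &Scheduler{
		Escalation: zset{"appr-1": 100},
		Expiry:     zset{},
		Status:     map[string]string{"appr-1": "pending"},
		Deadline:   map[string]int64{"appr-1": 200},
	}
	s.Tick(100)
	fmt.Println(s.Escalated, s.Status["appr-1"])
	s.Tick(200)
	fmt.Println(s.Status["appr-1"])
}
```

Re-checking the status before acting keeps the tick safe when a decision was recorded between scheduling and the deadline.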

8.2 Defaults per template (from PRD)

| Template | Timeout | Escalation window | Expiry action |
| --- | --- | --- | --- |
| dev_only | 24 h | none | deny |
| dev_review | 24 h | 4 h before | escalate once, then deny |
| full_pipeline | 48 h | 8 h before | escalate up hierarchy, then deny |
| critical_path | 72 h | 24 h before | escalate up hierarchy, then deny |

Override chain from §6.1 applies — per-request can tighten but never loosen beyond platform ceiling.
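As a worked example of these defaults, the Go sketch below computes the escalation and expiry instants for a template and clamps a per-request timeout to a ceiling. The names TemplateDefaults, Schedule and ClampTimeout are hypothetical, and treating the ceiling as a single number of seconds is a simplifying assumption.

```go
package main

import "fmt"

// TemplateDefaults holds the per-template values from the table above,
// in the same units as the policy columns.
type TemplateDefaults struct {
	TimeoutSeconds          int64
	EscalationMinutesBefore int64 // 0 means no escalation
}

var templates = map[string]TemplateDefaults{
	"dev_only":      {TimeoutSeconds: 24 * 3600, EscalationMinutesBefore: 0},
	"dev_review":    {TimeoutSeconds: 24 * 3600, EscalationMinutesBefore: 4 * 60},
	"full_pipeline": {TimeoutSeconds: 48 * 3600, EscalationMinutesBefore: 8 * 60},
	"critical_path": {TimeoutSeconds: 72 * 3600, EscalationMinutesBefore: 24 * 60},
}

// Schedule returns the unix seconds of the escalation tick (0 if the
// template has none) and of the expiry, given the request time.
func Schedule(requestedAt int64, template string) (escalateAt, expireAt int64, ok bool) {
	d, ok := templates[template]
	if !ok {
		return 0, 0, false
	}
	expireAt = requestedAt + d.TimeoutSeconds
	if d.EscalationMinutesBefore > 0 {
		escalateAt = expireAt - d.EscalationMinutesBefore*60
	}
	return escalateAt, expireAt, true
}

// ClampTimeout lets an override tighten the timeout but never exceed
// the ceiling. A non-positive override means "no override".
func ClampTimeout(overrideSeconds, ceilingSeconds int64) int64 {
	if overrideSeconds <= 0 || overrideSeconds > ceilingSeconds {
		return ceilingSeconds
	}
	return overrideSeconds
}

func main() {
	esc, exp, _ := Schedule(0, "critical_path")
	fmt.Println(esc/3600, exp/3600) // hours after the request
	fmt.Println(ClampTimeout(96*3600, 72*3600) / 3600)
	fmt.Println(ClampTimeout(12*3600, 72*3600) / 3600)
}
```

For critical_path this gives an escalation tick 48 h after the request and expiry at 72 h; a 96 h override is clamped to 72 h while a 12 h override is kept.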

8.3 Escalation semantics

On escalation tick, the engine walks one level up the policy hierarchy (sub-team → parent-team → tenant → platform), dispatches the approval to the new level's designated approvers (configured on the policy row), increments escalation_level, and writes an approval_events.event_type=escalated row. The original deadline stays; escalation is a notification change, not a time extension. This keeps the semantics simple for Wave 2; tunable extension on escalation is an LLD open item.
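The escalation step can be sketched as a small state change. This is illustrative only: Level, Approval and Escalate are hypothetical names, and returning false at the platform level (so the caller falls through to the expiry action) is an assumption about how the top of the hierarchy is handled.

```go
package main

import "fmt"

// Level orders the escalation path from sub-team up to platform.
type Level int

const (
	SubTeam Level = iota
	ParentTeam
	Tenant
	Platform
)

// Approval keeps only the fields that escalation touches or must not touch.
type Approval struct {
	Level           Level
	EscalationLevel int
	Deadline        int64 // unix seconds; never changed by escalation
}

// Escalate moves the approval one level up and increments the counter.
// It returns false when the approval is already at the platform level.
func Escalate(a *Approval) bool {
	if a.Level >= Platform {
		return false
	}
	a.Level++
	a.EscalationLevel++
	return true
}

func main() {
	a := &Approval{Level: SubTeam, Deadline: 1000}
	fmt.Println(Escalate(a), a.Level, a.EscalationLevel, a.Deadline)
	a.Level = Platform
	fmt.Println(Escalate(a), a.Level, a.EscalationLevel, a.Deadline)
}
```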

8.4 Scheduler HA

Single-replica for Wave 2 (same posture as the crash-recovery sweeper, delta §5.3 OoS #1). Orchestrator Deployment stays pinned to replicas: 1 while this runs in-process. A multi-replica follow-up would gate each tick with a Postgres advisory lock.


9. Delegation

9.1 Data model

approval_delegations table (§6.2). Each row is one hop: from_member_id → to_member_id for a single approval_id. Chains are formed by multiple rows.

9.2 Clearance rule

Delegation is allowed when to_member.clearance >= approval.required_clearance. The engine resolves to_member.clearance via the existing RBAC grants table (rbac_grants, migration 019).

Decision: we do NOT require to_member.clearance >= from_member.clearance. The semantic is "can this operator discharge this specific approval?" — their clearance on other decisions is not at issue. This matches the PRD's "delegatee must have >= required clearance" phrasing.
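The clearance rule and its application across a delegation chain can be sketched as below. The function names CanDelegate and ValidateChain are hypothetical, and DelegationLink is trimmed to the fields the check needs; the delegator's own clearance is deliberately not an input.

```go
package main

import (
	"errors"
	"fmt"
)

// DelegationLink mirrors one approval_delegations row (trimmed).
type DelegationLink struct {
	FromMemberID string
	ToMemberID   string
	ToClearance  int
}

// CanDelegate applies the rule: the delegatee's clearance must be at
// least the approval's required clearance. The delegator's clearance
// is not consulted.
func CanDelegate(toClearance, requiredClearance int) bool {
	return toClearance >= requiredClearance
}

// ValidateChain checks every hop and returns the final delegatee.
func ValidateChain(chain []DelegationLink, requiredClearance int) (string, error) {
	if len(chain) == 0 {
		return "", errors.New("empty delegation chain")
	}
	for i, link := range chain {
		if !CanDelegate(link.ToClearance, requiredClearance) {
			return "", fmt.Errorf("hop %d: %s has clearance %d, need %d",
				i, link.ToMemberID, link.ToClearance, requiredClearance)
		}
	}
	return chain[len(chain)-1].ToMemberID, nil
}

func main() {
	chain := []DelegationLink{
		{FromMemberID: "alice", ToMemberID: "bob", ToClearance: 3},
		{FromMemberID: "bob", ToMemberID: "carol", ToClearance: 4},
	}
	final, err := ValidateChain(chain, 3)
	fmt.Println(final, err)
}
```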

9.3 Audit chain

Every delegation writes a hash-chained audit entry action_type=approval_delegated with {from, to, approval_id, reason}. On subsequent decision, the resolved_by is the final delegatee; the delegation chain is retrievable via GetApproval.delegation_chain.


10. Audit

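Every approval lifecycle event writes two rows, one per table. Session-level events (pause requested, suspended, resumed) write only to agent_audit_log:

| Event | agent_audit_log (hash-chained) | approval_events (operational) |
| --- | --- | --- |
| Pause requested | session_paused | (none) |
| Session suspended | session_suspended | (none) |
| Approval requested | approval_requested | requested |
| Channel dispatched | approval_dispatched | dispatched per channel |
| Delegation | approval_delegated | delegated |
| Decision recorded | approval_decision | approved / denied |
| Escalation tick | approval_escalated | escalated |
| Expiry | approval_expired | expired |
| Session resumed | session_resumed | (none) |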

agent_audit_log rows are hash-chained per the wave-1-item-4 mechanism. approval_events is a denormalised, query-optimised read model — the two never diverge because both are written inside the same pgx transaction that mutates governance_approvals.


11. Metrics

OpenTelemetry instrument names (prefix runtime_ or approval_):

  • runtime_sessions_paused_total{tenant_id, pause_source}
  • runtime_sessions_resumed_total{tenant_id, resumed_by} (approval|operator|policy)
  • runtime_pause_latency_seconds{tenant_id} — histogram, request → SUSPENDED
  • runtime_resume_latency_seconds{tenant_id} — histogram, RecordDecision → ACTIVE
  • runtime_suspended_sessions_current{tenant_id} — gauge
  • approval_requests_total{tenant_id, template}
  • approval_decisions_total{tenant_id, decision, channel}
  • approval_resolution_seconds{tenant_id, template} — histogram, request → decision
  • approval_expiries_total{tenant_id, template}
  • approval_escalations_total{tenant_id, template, to_level}
  • approval_delegations_total{tenant_id}
  • approval_webhook_invocations_total{channel, result} (ok|hmac_invalid|duplicate|conflict|expired_link)
  • approval_dedup_hits_total{tenant_id}
  • approval_scheduler_tick_duration_seconds — histogram

All label cardinality reviewed — no session_id / approval_id on labels (following 6a learning from Wave 2-delta).


12. Threat model

12.1 Defended against

  • Policy bypass via forged decision. All inbound channel webhooks are HMAC-signed and verified against per-tenant Vault-stored secrets.
  • Replay of approval decision links. Idempotency key = hash(approval_id, channel, decision, t) de-dups at DB level; unique index enforces.
  • Cross-tenant approval theft. RLS on governance_approvals, approval_delegations, approval_events gates every read.
  • Clearance escalation via delegation. to_member.clearance >= required_clearance verified on the Delegate path; delegations cannot raise an approver's effective authority.
  • Stuck session DoS. Deadlines are Redis-scheduler enforced, never operator-trusted.
  • In-flight decision race. Single-transaction UPDATE with a WHERE status='pending' guard; losers see CONFLICT / DUPLICATE.
  • Approval-spam DoS. Dedup index at (org_id, session_id, tool_name, args_sha256) means repeated tool calls with identical args → one approval.

12.2 Not defended against (accepted)

  • Compromised approver account. Outside scope — relies on the Clerk + clearance authentication layer.
  • Social engineering of approvers via channel content. Channel payloads are operator-visible; the tool args displayed are redacted per tenant security config but a sufficiently crafted prompt-injected args field could mislead a human. Mitigation is operator training + the eventual ML-layer detection (P4.8.8, Wave 5).
  • Consensus-breaking conflicts. If two simultaneous channels record opposite decisions, first-write wins. PRD P4.3.10 explicitly defers consensus resolution to V1.1.
  • Scheduler single-point-of-failure during 10s tick window. Crash within that window can delay a deadline tick by up to restart-time + 10 s. Acceptable for 24h-72h SLAs.

13. Rollout

13.1 Feature flags (all via existing tenant_security_config JSONB)

| Flag | Default | Purpose |
| --- | --- | --- |
| approval_chain_enabled | false | Master switch. When false, MCP middleware falls back to the legacy in-process 5-min block (B-1). Set true per-tenant after migration. |
| approval_channels | ["dashboard"] | Array; controls which channels dispatch. Email/Slack/SCM opt-in. |
| approval_hmac_secret_id | null | Vault key pointer; required when any non-dashboard channel is enabled. |
| runtime_pause_resume_enabled | true | Pause/Resume RPC availability. Safe to enable platform-wide since it is additive. |

13.2 Backward compatibility

  • Sessions started pre-Wave-2 use the old AGENT_STATUS_ACTIVE path and never receive pause events. The checkpoint format in agent_checkpoints (Wave 2-delta) is schema-versioned; pause simply writes a new checkpoint with the same header.
  • Clients that do not handle session_paused / session_resumed ignore unknown oneof variants per proto3 semantics; no wire break.
  • GovernanceService.ResolveApproval remains functional while deprecated; it now polls the new engine. Internal callers (MCP middleware) are migrated off it in this wave.

13.3 Migration ordering

  1. Ship migration 037 (non-breaking — additive).
  2. Ship LifecycleService.PauseSession / ResumeSession + session state machine; emit paused/resumed events. approval_chain_enabled=false means nothing calls them except operator UI (deferred).
  3. Ship ApprovalService RPCs + scheduler + channel adapters. Legacy middleware still active.
  4. Per-tenant, set approval_chain_enabled=true. MCP middleware branch-switches at runtime from legacy block to RequestApproval + pause. Both paths tested in integration; rollback = toggle flag.
  5. After all tenants migrated, remove legacy branch in a subsequent release.
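Step 4's runtime branch switch might look like the sketch below, where the flag is consulted on every gated tool call so that toggling it is the whole rollback. All names here (`flagStore`, `pathFor`, and the two stand-in functions) are hypothetical; the real middleware would call the legacy block or RequestApproval plus PauseSession.

```go
package main

import "fmt"

// gateFunc handles one policy-gated tool call and reports which path ran.
type gateFunc func(sessionID, toolName string) string

// legacyBlock stands in for the legacy in-process 5-min block (B-1).
func legacyBlock(sessionID, toolName string) string {
	return "legacy-block"
}

// requestApprovalAndPause stands in for RequestApproval followed by a pause.
func requestApprovalAndPause(sessionID, toolName string) string {
	return "request-approval+pause"
}

// flagStore is a hypothetical per-tenant view of approval_chain_enabled.
type flagStore map[string]bool

// pathFor reads the flag at call time rather than caching it at startup,
// so enabling or disabling a tenant takes effect on the next gated call.
func (f flagStore) pathFor(orgID string) gateFunc {
	if f[orgID] {
		return requestApprovalAndPause
	}
	return legacyBlock
}

func main() {
	flags := flagStore{"org-a": true}
	fmt.Println(flags.pathFor("org-a")("sess-1", "deploy"))
	fmt.Println(flags.pathFor("org-b")("sess-2", "deploy"))

	flags["org-a"] = false // rollback = toggle flag
	fmt.Println(flags.pathFor("org-a")("sess-1", "deploy"))
}
```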

14. Explicit non-goals (Wave 2)

  • Operator dashboard UI. Wave 2 ships APIs only. Dashboard is a separate frontend PRD. Per PRD clarification on #380, frontend is deferred.
  • Mobile push notifications for approvals. Not in P4.3.
  • Consensus breaking on conflicting approvals. Deferred to V1.1 per PRD P4.3.10. First-write wins.
  • Multi-approver quorum (e.g. "requires 2 of 3 approvers"). Deferred — single approver with optional delegation is Wave 2 scope.
  • Pause mid-tool-call. §5.2 — message boundary only.
  • Tunable deadline extension on escalation. §8.3 — escalation is a notification change, not a time extension, in this wave.
  • Multi-replica scheduler HA. §8.4 — single-replica like the sweeper.
  • Operator delegation-of-delegation restrictions. Any approver can delegate once; chained delegations are allowed but the clearance check applies at each hop.
  • Cross-tenant approval (platform-level approver). Platform approvals use a platform-admin member identity but still carry the org_id of the requesting tenant.
  • Scheduled-pause. No cron/scheduled pause in Wave 2. Operator-initiated and policy-initiated only.
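The per-hop clearance rule for chained delegations noted above can be sketched as follows. It assumes clearance is a comparable integer where higher means more authority, which this section does not state; `Member`, `Hop` and `ValidateChain` are hypothetical names.

```go
package main

import "fmt"

// Member is a minimal approver identity; Clearance as an int is an assumption.
type Member struct {
	ID        string
	Clearance int
}

// Hop is one delegation from one member to another.
type Hop struct {
	From, To Member
}

// ValidateChain applies the clearance check at every hop: each delegatee must
// meet the approval's required clearance. The delegatee is compared against
// the requirement, not against the delegator.
func ValidateChain(chain []Hop, requiredClearance int) error {
	for i, h := range chain {
		if h.To.Clearance < requiredClearance {
			return fmt.Errorf("hop %d: %s has clearance %d, need >= %d",
				i+1, h.To.ID, h.To.Clearance, requiredClearance)
		}
	}
	return nil
}

func main() {
	alice := Member{ID: "alice", Clearance: 4}
	bob := Member{ID: "bob", Clearance: 3}
	carol := Member{ID: "carol", Clearance: 2}

	chain := []Hop{{From: alice, To: bob}, {From: bob, To: carol}}
	fmt.Println(ValidateChain(chain, 2)) // every delegatee meets the requirement
	fmt.Println(ValidateChain(chain, 3)) // carol fails at the second hop
}
```

Because the check runs at each hop, a chain cannot launder an approval down to a member below the required clearance, however many delegations are involved.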

15. Open questions for founder sign-off

Each of these genuinely changes the wave scope or has cost/UX implications the architect should not unilaterally decide.

  1. OQ-1. Portal stream on pause — keep open vs close? §5.4. Recommended: keep open with idle keepalive. Closing simplifies server code but requires a robust reconnect/replay in every client. Founder preference?
  2. OQ-2. Scheduler tick interval. 10 s is defensible for 24h-72h deadlines (max 10 s late on escalation). Tighter wastes cycles, looser risks missed 4-hour-before-deadline notifications. Acceptable?
  3. OQ-3. Channel set at GA. PRD calls for dashboard, email, Slack, SCM. Implementing all four in Wave 2 is tractable, but if we want to ship faster we can gate Slack + SCM behind a Wave 2.1 delta. Recommended: ship all four, opt-in per tenant. Confirm?
  4. OQ-4. HMAC secret rotation cadence. Per-tenant Vault-stored. Proposed: 90-day rotation, dual-key grace period. Approve or prefer shorter?
  5. OQ-5. Platform policy management surface. Platform governance policies are admin-portal-only. Wave 2 ships the table + seed bundle; the admin portal CRUD UI is a separate repo's work. Is a seed-file-only bootstrap acceptable for initial GA, with CRUD following?
  6. OQ-6. Conflict-on-race posture — first write wins, as specced? §6.3 and §12.2. PRD P4.3.10 defers consensus to V1.1; we've encoded that literally (CONFLICT result + no reconciliation). Confirm this is acceptable for GA.
  7. OQ-7. Delegatee-clearance-vs-from-clearance. §9.2. Specced as to_member.clearance >= required_clearance only; the delegatee is not required to match or exceed the delegator's clearance. Is that correct?
  8. OQ-8. Single-replica scheduler. §8.4. Same posture as crash-recovery sweeper. Accept the availability envelope (restart-time + 10 s worst-case) for Wave 2?

16. Risks

  • R-1. Checkpoint cost on high-volume approval scenarios. Every pause writes a new checkpoint row. A tenant with 100 concurrent agents each triggering multiple approvals per session multiplies checkpoint write rate. Mitigation: Wave 2-delta's existing checkpoint compression (zstd) + index on (session_id, loop_iteration DESC). Measure in load test before GA.
  • R-2. MCP middleware branching complexity. Flag-gated dual-path (legacy 5-min block vs new pause/resume) increases test surface. Mitigation: both paths covered by parallel integration tests; legacy branch has a hard deprecation date one release after GA.
  • R-3. Channel webhook DoS surface. Public HMAC-signed endpoints invite abuse, since anyone can send requests to them. Mitigation: per-tenant rate limit on the callback handler; tight HMAC clock-skew window; 401 on sig failure; structured logging of sig failures for monitoring.
  • R-4. Redis ZSET as sole source of deadline truth. A Redis AOF/RDB gap could drop scheduled deadlines. Mitigation: ZSET is a cache — source of truth is governance_approvals.expires_at. On scheduler startup, SELECT approval_id, expires_at FROM governance_approvals WHERE status='pending' and re-ZADD. This also closes the bootstrap gap.
  • R-5. Hash-chain ordering on concurrent transitions. The existing hash chain uses a per-session monotonic sequence. Approval lifecycle events happen outside a specific session's execution path but carry session_id, so they must be serialised via the existing chain-tracker to avoid duplicate sequence numbers. Confirm in LLD.
  • R-6. Large tool args in approval dispatch. If an agent tries to call a tool with a 1 MB args blob, the channel payload explodes. Mitigation: args redaction/truncation at dispatch time, with a "View full args in dashboard" link. Redaction rules come from existing tenant_security_config.
  • R-7. Deadline clock drift on GKE nodes. Deadlines are UTC wall-clock. Node clock skew > 5 min would break HMAC validation and scheduler accuracy. Mitigation: rely on GKE-managed NTP; existing posture. No new work.

17. Appendix — mapping to PRD P4.3 sub-items

| PRD sub-item | Addressed in HLD section | Notes |
| --- | --- | --- |
| P4.3.1 5-level hierarchy | §6.1 | Platform → Tenant → Parent → Sub → Per-request |
| P4.3.2 Templates | §6.2, §8.2 | Enum + timeout/escalation per template |
| P4.3.3 Dashboard channel | §7.1, §7.2 | Dashboard is CHANNEL_DASHBOARD, default-on |
| P4.3.4 Email channel | §7.1, §7.2 | HMAC one-click links |
| P4.3.5 Slack channel | §7.1, §7.2 | Slack webhook → RecordDecision |
| P4.3.6 SCM channel | §7.1, §7.2 | SCM issue comment webhook |
| P4.3.7 Timeout defaults | §8.2 | 24h/48h/72h per template |
| P4.3.8 Escalation | §8.3 | One-level-up on configured window |
| P4.3.9 Delegation | §9 | Clearance-checked, chained, audited |
| P4.3.10 Consensus | §6.3, §12.2, §14 | Explicitly deferred to V1.1 |
| P4.3.11 Non-technical one-click | §7.2 | HMAC-signed short-URL confirmation |
| P4.8.3 Runtime integration | §3.2 Flow B, §13.3 | MCP middleware migration behind flag |
| P4.8.4 Pause/Resume RPC | §4.1, §5 | PauseSession, ResumeSession, events |