
Runbook: Org Model v2.3 — Phase 2 Cutover (Dev)

Parent tracker: #570 · HLD: #556 · LLD: #560 §10.2 · Task issue: #692 · Milestone: Wave J+K — Phase 2 Cutover + Legacy Drop (#11)

This runbook is the procedure; the founder is the operator. The runbook does not automate flag flips on the founder's behalf. Every flag below is flipped by hand by the founder in dev, one at a time, with explicit verification between each flip.


0. Prod flip authority

Prod flip authority: TBD. Revisit this section when the first prod tenant onboards. Until then, every flag-flip step below is dev only. Do not copy this runbook into a prod change ticket without replacing this section with the approved prod authority.


1. What this runbook covers

Four feature flags that move the platform from Phase 1 (dual-write running, reads still from legacy pillars / teams / team_memberships) to Phase 2 (reads swing to the Org Model v2.3 mirror tables, the new services go live):

| # | Flag | Effect when flipped to true | LLD reference |
| --- | --- | --- | --- |
| 1 | dual_write_org_units | Already true throughout Phase 1. Off-switch for the K.2 schema drop; must remain true until the Phase 3 decision. | §10.1 |
| 2 | read_from_org_units | MemberService + TenantService swing reads from team_memberships / teams / pillars to org_units / org_unit_memberships. J.2 shim activation. | §10.2 step 1 |
| 3 | cascade_4layer | GovernanceService Check() uses the 4-layer cascade (platform → org → org_unit walk → member) instead of the legacy 3-layer path. | §10.2 step 4 |
| 4 | new_services_enabled | OrgUnitService, RbacService, ConflictService endpoints start serving traffic at the gateway. | §10.2 step 5 |

The flip order is not arbitrary. Each flag depends on the one above it landing cleanly (see §5 below).

2. Who flips what

  • Dev: the founder flips every flag personally via the procedure in §6. Agents never flip flags on the founder's behalf. If an agent is asked to automate or script the flip, the request is out of scope and must be escalated back to the founder.
  • Prod: TBD — see §0.

3. Pre-flight checks (before the first flip)

Run all of these before flipping any flag. Any failure is a hard stop.

3.1 7-day zero-delta gate

The reconciliation reports in docs/runbooks/org-v23-reconciliation-reports/ must show seven consecutive daily gates of PASS ending on today's (UTC) report. The most recent report must include the banner:

7-day zero-delta gate: MET. Phase 2 cutover is unblocked from the reconciliation side.

# Confirm the banner is present in today's file.
grep -l "7-day zero-delta gate: MET" \
  "docs/runbooks/org-v23-reconciliation-reports/$(date -u +%Y-%m-%d).md"

If the file does not exist or the banner is missing, stop. Wait for the CI baseline run, or run cmd/reconcile-report manually against a live DB. Document why the gate was missed (e.g. the reconciler was paused, drift was observed) in the tracking issue (#570) before retrying.
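The grep above only inspects today's file. As a minimal sketch, the helper below also confirms that a report file exists for each of the seven days in the window. The function name and the `<date>.md` naming for older reports are assumptions for illustration; it checks presence only, because the per-day PASS format is not specified in this runbook.

```shell
# Sketch: confirm a report exists for each of seven dates and that the
# newest one (the first date given) carries the MET banner.
# Usage: check_gate_reports <report-dir> <newest-date> <six older dates...>
check_gate_reports() {
  dir=$1
  shift
  if [ "$#" -ne 7 ]; then
    echo "expected 7 dates, got $#" >&2
    return 2
  fi
  newest=$1
  for day in "$@"; do
    if [ ! -f "$dir/$day.md" ]; then
      echo "missing report: $dir/$day.md" >&2
      return 1
    fi
  done
  # Only the newest report is expected to carry the banner.
  grep -q "7-day zero-delta gate: MET" "$dir/$newest.md"
}
```

On a machine with GNU date, the seven dates can be produced with `date -u -d "$i days ago" +%Y-%m-%d` for `i` from 0 to 6. A non-zero exit is a hard stop, the same as a missing banner.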

3.2 Live drift metric is zero

# kubectl port-forward to the reconciler's metrics endpoint, then:
# The [{ ] class matches both labelled and unlabelled series.
curl -s localhost:9119/metrics | grep '^dualwrite_open_drift[{ ]'

Expected: every series reports 0. A non-zero value means the reconciler detected drift in the current pass — do not flip any flag until that series returns to zero for at least one full detect interval (RECONCILER_DETECT_INTERVAL, default 1h).
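Reading the grep output by eye is error-prone when there are many series. A minimal sketch of a pass/fail filter over the same metrics text follows. The function name is hypothetical, and it assumes the endpoint emits standard Prometheus text exposition without trailing timestamps.

```shell
# Reads Prometheus text exposition on stdin. Succeeds only if at least one
# dualwrite_open_drift sample is present and every sample is zero.
drift_is_zero() {
  awk '
    /^dualwrite_open_drift[{ ]/ {
      seen = 1
      # Assumes no trailing timestamp, so the sample value is the last field.
      if ($NF + 0 != 0) { print "non-zero: " $0; bad = 1 }
    }
    END {
      if (!seen) { print "no dualwrite_open_drift series found"; exit 2 }
      exit (bad ? 1 : 0)
    }
  '
}
```

Usage: `curl -s localhost:9119/metrics | drift_is_zero`. This checks a single scrape only; the wait of one full detect interval after a non-zero reading still applies.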

3.3 No open DualwriteDriftPersisting / DualwriteReconcilerStalled alerts

Check Grafana (Org Model dashboard) or Alertmanager directly. Any open critical-severity dualwrite alert is a hard stop.

3.4 Backend track is merged

Issues #686–#690 (the backend track that lands the flag plumbing, the TenantService.ListTeams shim, and the 4-layer cascade switch) must all be merged and deployed to the dev cluster. Confirm via:

gh api repos/upsquad-ai/upsquad-core/milestones/11 \
  --jq '.open_issues, .closed_issues'
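The milestone counts cover all of Wave J+K, so they do not by themselves show that #686–#690 are done. As a sketch, the loop below checks each backend-track issue individually. The function name is hypothetical, and treating a CLOSED issue as "merged" is an assumption; it does not confirm that the build is deployed to the dev cluster.

```shell
# Succeeds only if every backend-track issue reports state CLOSED.
# Assumes an authenticated gh CLI with access to the repository.
backend_track_closed() {
  for issue in 686 687 688 689 690; do
    state=$(gh issue view "$issue" --repo upsquad-ai/upsquad-core \
      --json state --jq .state) || return 2
    if [ "$state" != "CLOSED" ]; then
      echo "issue #$issue is $state" >&2
      return 1
    fi
  done
}
```

Run `backend_track_closed && echo "backend track closed"`; any other outcome is a hard stop under §3.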

4. Rollback lever (applies to every step below)

Each flag has a single atomic off-switch: flip it back to false. Data is preserved because dual_write_org_units stays true throughout Phase 2 — the mirror tables continue to receive writes, and legacy reads continue to work the moment the read-swing flag is flipped back.

Rollback = flip flags back to false in the exact reverse order you
flipped them, starting with the most recent flip. Verify that the §7
checks return to green. Record what you observed in #570.

Rollback is not possible after Phase 3 (legacy tables dropped). Phase 3 is a separate runbook — do not start it from this document.
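To make the reverse-order rule concrete, the sketch below prints the rollback commands for a list of flags given in the order they were flipped. It only prints; the operator still runs each command by hand. The function name is hypothetical, and the key pattern is the Redis one used in §6, so adjust it if the backend track landed a different flag surface.

```shell
# Prints (does not run) one rollback command per flag, newest flip first.
# Usage: rollback_plan <first-flipped> <second-flipped> ...
rollback_plan() {
  plan=""
  for flag in "$@"; do
    # Prepend, so the most recently flipped flag ends up on the first line.
    plan="kubectl -n platform exec deploy/redis -- redis-cli SET ff:orgv23:${flag}:global false
${plan}"
  done
  printf '%s' "$plan"
}
```

For example, `rollback_plan read_from_org_units cascade_4layer` lists the cascade_4layer command first and the read_from_org_units command second.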

5. Flag flip order (dev)

The LLD §10.2 order is mandatory. Each step has a gate the founder checks before moving on.

Pre-flight §3 passes
   │
   ▼
Step 1: read_from_org_units = true (§6.1, verify §7.1)
   │
   ▼
Step 2: deploy TenantService.ListTeams shim (already landed via
        backend track #686–#690; confirm via §7.2)
   │
   ▼
Step 3: portal smoke tests (§7.3)
   │
   ▼
Step 4: cascade_4layer = true (§6.2, verify §7.4)
   │
   ▼
Step 5: new_services_enabled = true (§6.3, verify §7.5)
   │
   ▼
Post-flip soak §8

dual_write_org_units is not flipped in this runbook. It stays true and is only flipped false as the final off-switch in Phase 3 (covered by Wave K, separate runbook).

6. Procedure — flag flips

The founder executes these commands personally. Agents may help prepare the commands, but the flip itself is a manual act to preserve operator accountability for the change.

6.1 read_from_org_units → true

  1. Pre-flip checklist:

    • §3 pre-flight checks all green.
    • Current time falls within the agreed change window.
    • You (founder) have a shell open against the dev cluster.
  2. The command — exact invocation depends on the flag surface landed by the backend track. The flag is tenant-scoped but for dev we flip it platform-wide via the global key:

    # Flag store is Redis (precedent: internal/context/versioning/featureflag.go).
    # Global key pattern: ff:orgv23:read_from_org_units:global
    kubectl -n platform exec deploy/redis -- \
      redis-cli SET ff:orgv23:read_from_org_units:global true

    If the backend track landed a different flag surface (e.g. a GUC or a ConfigMap), replace the command above with whatever the backend PRs documented in their merge description. Confirm the flag surface with @backend-sme before flipping — do not guess.

  3. Verification: see §7.1.

  4. If §7.1 fails: flip back with

    kubectl -n platform exec deploy/redis -- \
      redis-cli SET ff:orgv23:read_from_org_units:global false

6.2 cascade_4layer → true

  1. Gated on §7.1 and §7.3 green.
  2. Command:

    kubectl -n platform exec deploy/redis -- \
      redis-cli SET ff:orgv23:cascade_4layer:global true

  3. Verification: §7.4.
  4. Rollback: set to false via the same key. Rollback criterion from the LLD: GovernanceService.Check p95 > 7 ms for > 5 min.

6.3 new_services_enabled → true

  1. Gated on §7.4 green.
  2. Command:

    kubectl -n platform exec deploy/redis -- \
      redis-cli SET ff:orgv23:new_services_enabled:global true

  3. Verification: §7.5.
  4. Rollback: set to false. Gateway routes for the new services return 404 again (expected).
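After each manual flip it can help to read the key back before running the §7 probes. The sketch below is read-only and never writes a flag, so the flip itself remains a manual act. The function name is hypothetical, and it assumes the Redis flag surface and key pattern shown in §6.1.

```shell
# Read-only check: compares the stored flag value with the expected one.
# Usage: expect_flag <flag-name> <true|false>
expect_flag() {
  flag=$1
  want=$2
  got=$(kubectl -n platform exec deploy/redis -- \
    redis-cli GET "ff:orgv23:${flag}:global") || return 2
  if [ "$got" = "$want" ]; then
    echo "ok: $flag=$got"
  else
    echo "mismatch: $flag is '$got', expected '$want'" >&2
    return 1
  fi
}
```

For example, `expect_flag cascade_4layer true` after step 2 of §6.2. A mismatch means the SET did not land where expected; confirm the flag surface before retrying.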

7. Verification — per-step checks

7.1 After read_from_org_units = true

# Reads should now resolve from org_units / org_unit_memberships.
# Use the tenant-service list endpoint as a smoke test:
grpcurl -plaintext \
  -H "x-upsquad-org-id: $TEST_ORG_ID" \
  dev.tenant-service:50051 \
  upsquad.tenant.v1.TenantService/ListTeams

# Expected: same row count as the legacy ListTeams returned yesterday.
# If the count drops to zero, the shim is not resolving org_units; roll back.

  • Member count, team count, and org chart must be byte-identical to the pre-flip baseline captured in the change ticket.
  • Gateway error rate must stay < 0.5% for 10 minutes.

7.2 After the shim is live

The shim is expected to be live before §7.1 (it was deployed by the backend track). Confirm its health:

kubectl -n platform logs deploy/tenant-service --since=5m \
  | grep -E "ListTeams|shim"
# Expected: shim path invoked, no "fallback to legacy" warnings.

7.3 Portal smoke tests

Run the Playwright suite against the dev portal:

cd ../upsquad-client && npm run e2e -- --grep "org-chart|governance"

All four flows must pass: ListMembers, ListTeams, Org Chart, Governance policy list (LLD §10.2 step 3).

7.4 After cascade_4layer = true

# GovernanceService p95 latency: target < 5 ms, hard stop at > 7 ms.
kubectl -n platform port-forward svc/prometheus 9090:9090 &
sleep 2  # give the port-forward a moment to bind before querying
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(governance_check_duration_seconds_bucket[5m])) by (le))' \
  | jq -r '.data.result[0].value[1]'
# Expected: < 0.005 (5 ms). Rollback criterion: > 0.007 for > 5 min.

  • policy_conflicts insertion rate should not spike; an unexpected rise means the 4-layer cascade is detecting conflicts the 3-layer path silently collapsed.
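The query returns a bare number in seconds, which is easy to misread by a factor of a thousand. As a sketch, the helper below maps one p95 reading onto the thresholds stated above. The function name and the three labels are made up for illustration, and it judges a single reading only; the "for > 5 min" part of the rollback criterion still needs repeated readings.

```shell
# Maps one p95 reading (in seconds) onto the thresholds from this section.
# Usage: classify_p95 <seconds>
classify_p95() {
  awk -v p95="$1" 'BEGIN {
    # Prometheus reports NaN when the histogram has no samples in range.
    if (p95 !~ /^[0-9]+(\.[0-9]+)?([eE][-+]?[0-9]+)?$/) print "no-data"
    else if (p95 + 0 > 0.007)  print "rollback-candidate"
    else if (p95 + 0 >= 0.005) print "above-target"
    else                       print "ok"
  }'
}
```

For example, `classify_p95 0.0042` prints `ok`, while a reading of `0.0071` prints `rollback-candidate` and should start the five-minute clock.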

7.5 After new_services_enabled = true

# All three new services must answer health.
for svc in orgunit-service rbac-service conflict-service; do
  grpcurl -plaintext "dev.$svc:50051" grpc.health.v1.Health/Check
done
# Expected: SERVING on all three.

  • Gateway 5xx rate stays < 0.1% for 15 minutes after flipping.

8. Post-flip soak

After step 5 lands, soak for 24 hours before moving on to Wave K (legacy drop):

  • Reconciliation reports continue to emit daily; gate stays PASS.
  • dualwrite_open_drift stays at zero across every tenant.
  • No open arch-concern or bug issues filed against MemberService, TenantService, GovernanceService, OrgUnitService, RbacService, or ConflictService.
  • Portal smoke suite run once more at the end of the soak.

9. Sign-off checklist

Every box must be checked by the named owner before the tracker is closed and Wave K is opened.

  • Founder — §3 pre-flight complete (paste command output).
  • Founder — Step 1 flipped, §7.1 green.
  • Founder — §7.2 shim health confirmed.
  • Founder — §7.3 portal smoke tests passed.
  • Founder — Step 4 flipped, §7.4 green.
  • Founder — Step 5 flipped, §7.5 green.
  • Founder — §8 soak complete.
  • DevOps — post-soak reconciliation report attached in #570.
  • Principal Architect — reviewed the soak evidence, no outstanding concerns.
  • Project Manager — closed #692 and the parent §10.2 tracker.

10. Appendix — how this runbook was generated

  • Written as part of Wave J.7 (#692).
  • Flag-flip authority statements reflect founder decisions recorded 2026-04-19 in the #570 thread.
  • Source of truth for the step order is LLD §10.2. Any future change to the flip order MUST update both the LLD and this runbook in the same PR.

11. Appendix — Wave J.4 cascade_4layer auto-rollback breaker

The 4-layer cascade replaces a single-level team-policy walk with a fanned-out member → org_unit → org → platform tightest-first evaluation. The HLD #556 §6 SLO is p95 Check latency < 5ms at depth 10. Wave J.4 (#689) adds a p95 breaker that auto-rolls back to the legacy 3-layer path if the new cascade blows past the SLO for a sustained window. No human in the loop — this supplements the manual rollback lever in §4 and the criterion in §6.2.

11.1 Breaker defaults

| Config | Default | Source (code) |
| --- | --- | --- |
| p95 latency threshold | 7 ms | governance.DefaultBreakerThreshold |
| Consecutive minutes to trip | 5 | governance.DefaultBreakerConsecutiveMinutes |
| Min samples per minute bucket | 20 | governance.DefaultBreakerMinSamplesPerBucket |

The 7ms threshold sits ~40% above the HLD 5ms SLO so the breaker does not false-trip on normal load. Buckets with fewer than MinSamplesPerBucket samples are neutral — they neither count toward the breach chain nor break it. That protects against a single slow probe call flipping the breaker during low-traffic windows.
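The neutral-bucket rule is easiest to see on a small trace. The sketch below replays per-minute buckets through the rule as described above; it is not the code in internal/governance/breaker.go. The input format (one line per minute, `<samples> <p95_ms>`), the function name, and the strict `>` comparison against the threshold are assumptions for illustration.

```shell
# Replays per-minute buckets ("<samples> <p95_ms>" per line) on stdin and
# reports whether the documented defaults would trip the breaker.
breaker_would_trip() {
  awk -v threshold_ms=7 -v need=5 -v min_samples=20 '
    $1 + 0 < min_samples  { next }        # neutral bucket: chain unchanged
    $2 + 0 > threshold_ms { chain++ }     # breaching bucket extends the chain
    $2 + 0 <= threshold_ms { chain = 0 }  # healthy bucket resets the chain
    chain >= need { print "tripped at bucket " NR; tripped = 1; exit }
    END { if (!tripped) print "not tripped" }
  '
}
```

With this model, two breaching minutes, one low-traffic minute, and three more breaching minutes trip the breaker on the sixth bucket, because the low-traffic minute neither adds to the chain nor resets it. A single slow probe in an otherwise idle minute never trips it.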

11.2 What happens on trip

  1. An atomic tripped flag flips to true.
  2. governance.CascadeAutoRollbackTotal counter increments (exactly once per trip — CAS-guarded).
  3. internal/config.DBResolver.Refresh() is called so any cached "true" reads of cascade_4layer are immediately invalidated.
  4. A WARN log fires under key governance.cascade.breaker.tripped with the active threshold + consecutive-minutes config + the required operator action.
  5. Next Engine.Check call takes the legacy 3-layer path. The trip is process-local — other replicas continue running the cascade until their own breaker fires.

11.3 Dashboards / alerts

| Metric | Type | Alert suggestion |
| --- | --- | --- |
| governance_cascade_auto_rollback_total | counter | Any increase → page on-call (indicates a replica rolled back to legacy). |
| governance_check_latency_seconds{layers=4} | histogram | P95 > 5ms sustained 2m → warning; > 7ms sustained 5m = breaker territory. |
| governance_check_latency_seconds{layers=3} | histogram | Control: legacy path baseline for diffing. |
| governance_cascade_duration_seconds | histogram | Inner cascade (LoadCascade + EvaluateCascade) latency; finer-grained. |
| governance_cascade_evaluations_total | counter | Verdict distribution: a sudden deny spike on flag flip = regression. |

Dashboards should show the {layers=3} and {layers=4} histograms side by side, so that a breaker trip is visible as a sudden volume shift rather than requiring a log grep.

11.4 Re-enabling after a trip (manual only)

Auto-recovery is deliberately not wired. An operator must:

  1. Diagnose the root cause (query plan regression, missing index, org_units.ancestor_path drift, etc.) using the governance_cascade_duration_seconds histogram + pg_stat_statements.
  2. Resolve the underlying issue and verify in a canary replica.
  3. Clear the trip by either:
    • Restarting the affected context-engine pod (simplest; the trip is process-local), OR
    • Calling the admin governance.Reset() surface if wired (Wave K).
  4. Confirm via governance_cascade_auto_rollback_total that no subsequent trip has fired within the observation window.

Do NOT re-enable by flipping the DB override while replicas are still reporting tripped — the local override wins even when the DB says true. You will chase a phantom flag-off state.

Related files:

  • internal/config/feature_flags.go: flag metadata + resolvers (J.1).
  • internal/governance/breaker.go: Wave J.4 breaker implementation.
  • internal/governance/engine.go: flag gate + histogram wiring.
  • cmd/context-engine/main.go: production wiring.
  • deploy/feature-flags/{dev,staging,prod}.yaml: environment-specific flag posture source of truth.

12. Appendix — Wave J.5 gateway exposure + smoke tests

Wave J.5 (#690) is the gateway-side counterpart to §6.3. It wires the three new Connect services behind new_services_enabled and ships a Go-side smoke suite that operators run after each flip to prove the portal flows are live.

12.1 Gated service paths

All three services are mounted at the public gateway on the same host as the other Connect services (dev: https://dev.upsquad.ai). The gate is per-request — flipping new_services_enabled applies without a pod restart (see §6.3 rollback lever).

| Service | Path prefix | Flag-off response |
| --- | --- | --- |
| OrgUnitService | /upsquad.orgunit.v1.OrgUnitService/ | HTTP 404 + new_services_enabled=false body |
| RbacService | /upsquad.rbac.v1.RbacService/ | HTTP 404 + new_services_enabled=false body |
| ConflictService | /upsquad.conflict.v1.ConflictService/ | HTTP 404 + new_services_enabled=false body |

When flipped off, every path returns HTTP 404 with a one-line text/plain body naming the flag, so a curl probe during rollback shows operators why the route is dark rather than a silent empty response. Flag state is also echoed at boot under the orgv23 feature flag posture structured-log event.
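The difference between a gate 404 and any other 404 is what tells an operator whether the flag or the deploy is at fault. As a sketch, the helper below classifies one probe result from its status code and body. The function name and the four labels are hypothetical; the only facts taken from this section are the 404 status and the flag name in the body.

```shell
# Classifies one gateway probe.
# Usage: classify_probe <http-status> <response-body>
classify_probe() {
  status=$1
  body=$2
  case $status in
    404)
      case $body in
        *new_services_enabled*) echo "flag-off" ;;           # gate answered
        *)                      echo "unmounted-or-dark" ;;  # raw 404
      esac
      ;;
    5??) echo "gateway-error" ;;
    *)   echo "reached-handler" ;;  # includes auth errors from the handler
  esac
}
```

The status and body can be captured with curl, for example by writing the body to a file with `-o` and printing the code with `-w '%{http_code}'` against one of the path prefixes above.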

12.2 Post-flip smoke (make smoke-wave-j)

The smoke suite at test/smoke/wave_j_test.go exercises the four portal-facing flows end-to-end via the generated Connect clients:

  1. Org chart render — OrgUnitService.GetTree
  2. Governance policy list — GovernanceService.ListPolicies
  3. Role management — RbacService.ListRoles
  4. Conflict inbox — ConflictService.ListConflicts

The suite has two modes:

  • Mount check (no token): asserts every path is live by failing on a raw HTTP 404 from the gate. Unauthenticated / permission-denied Connect errors count as a pass because the handler was reached. This is the recommended post-flip sanity check:

    SMOKE_GATEWAY_URL=https://dev.upsquad.ai make smoke-wave-j
  • End-to-end (token) — uses a bearer token to make real RPCs and asserts success (any row count, including zero):

    SMOKE_GATEWAY_URL=https://dev.upsquad.ai \
      SMOKE_GATEWAY_TOKEN="$(cat ~/.upsquad/dev-token)" \
      SMOKE_GATEWAY_ORG_ID="$TEST_ORG_ID" \
      make smoke-wave-j

Without SMOKE_GATEWAY_URL the suite skips, so it is safe under go test ./... on workstations that cannot reach a dev gateway. In CI, the suite runs as the make smoke-wave-j workflow step, gated on a successful deploy.

12.3 Failure shapes and operator actions

| Smoke failure signal | Likely cause | Operator action |
| --- | --- | --- |
| gate returned 404 — new_services_enabled is false | Flag not flipped, or flipped back by §4 rollback | Re-check §6.3; confirm DB override or env var is set. |
| raw 404 from gateway — service may be dark or unmounted | Pod running a pre-J.5 image | Check Context-Engine deploy version matches Wave J.5 merge SHA. |
| unknown connect error — gateway unhealthy? | Gateway 5xx or TLS error | Inspect context-engine pod logs; run grpc.health.v1 probe. |
| authenticated call failed — code=… | Token / org-id mismatch or RBAC denial | Regenerate the test token; confirm the org-id tagged on the shim. |

Related files:

  • cmd/context-engine/new_services_gate.go: per-request gate.
  • cmd/context-engine/main.go: registration under the gate.
  • test/smoke/wave_j_test.go: Connect-client smoke suite.
  • Makefile: smoke-wave-j target.
  • internal/config/feature_flags.go: flag resolver.

13. Appendix — Wave J.2 follow-up — admin flag-cache refresh endpoint

cfgpkg.DBResolver (internal/config/feature_flags.go) caches each DB-override read for 30 seconds. After flipping a row in platform_feature_flags (§6.1, §6.2, §6.3) the flip is invisible for up to 30s on every replica until the next re-read. That fits within the pre-prod < 5s rollback contract via a pod restart, but #704 closes the gap so an operator can force the cache drop without rolling pods.

13.1 Endpoint

| Field | Value |
| --- | --- |
| Method + path | POST /v1/admin/feature-flags/refresh |
| Required clearance | 100 (platform-admin, same as EmergencyRotateHMAC) |
| Auth | Bearer token via the standard Clerk JWT middleware |
| Body (optional) | {"reason": "<free text, recorded in audit>"} |
| 200 response | {"refreshed": true, "at": "<RFC3339>", "broadcast": <bool>} |
| 401 / 403 | {"error": {"code": "UNAUTHORIZED" \| "INSUFFICIENT_CLEARANCE", "message": …}} |

broadcast reflects whether the cross-pod fan-out succeeded. A false value means the local Refresh ran but the Redis publish failed — re-issue the call to retry the broadcast, or restart the affected peer pods.

13.2 Cross-pod fan-out

The handler publishes a tiny envelope on Redis pub/sub channel v23_flags_refresh after running the local Refresh. Every replica runs a cfgrefresh.RedisSubscriber from boot (wired in cmd/context-engine/main.go and cmd/agent-orchestrator/main.go) that calls the local v23Flags.Refresh() on each event. This mirrors the rbac_role_changed pattern in internal/rbac/pubsub.go.

If Redis is down the local refresh still runs on the pod that received the call. Operators flipping a flag in a Redis-degraded environment should plan to restart pods to propagate, or wait for the 30s TTL.

13.3 Operator workflow (post-flag-flip)

TOKEN=$(cat ~/.upsquad/dev-admin-token)  # platform-admin JWT
curl -sS -X POST https://dev.upsquad.ai/v1/admin/feature-flags/refresh \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"reason":"§6.3 new_services_enabled flip"}'
# {"refreshed":true,"at":"2026-04-26T12:34:56Z","broadcast":true}

Verify the flip took effect using the §7 verification probes for the flag you just flipped (or the §12.2 mount check for new_services_enabled).
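The refreshed and broadcast fields in the response decide whether a follow-up is needed. A minimal sketch of that decision follows; it greps the JSON rather than parsing it, which is adequate for the flat 200 response shape documented in §13.1. The function name and the three labels are hypothetical.

```shell
# Reads the refresh response JSON on stdin and reports how far it reached.
refresh_outcome() {
  resp=$(cat)
  if ! printf '%s' "$resp" | grep -Eq '"refreshed"[[:space:]]*:[[:space:]]*true'; then
    echo "refresh-failed"
    return 1
  fi
  if printf '%s' "$resp" | grep -Eq '"broadcast"[[:space:]]*:[[:space:]]*true'; then
    echo "all-replicas"
  else
    # Local Refresh ran, but the Redis publish did not go out.
    echo "local-only"
    return 1
  fi
}
```

Pipe the curl output above into `refresh_outcome`. On `local-only`, re-issue the call or restart the peer pods as described in §13.1; on `refresh-failed`, check the token and clearance first.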

13.4 Audit shape

Each successful refresh emits exactly one agent_audit_log row via the shared async batch writer (migration 078, #907) so portal audit reads (#246, #333) and SIEM export (LLD-21) see the same shape as Phase B config events:

{
  "audit_kind": "config.refresh",
  "surface": "v23-feature-flags",
  "actor_user_id": "user_…",
  "actor_clearance": 100,
  "reason": "§6.3 new_services_enabled flip"
}

Related files:

  • internal/config/refresh/handler.go: HTTP handler + auth gate.
  • internal/config/refresh/pubsub.go: Redis publisher + subscriber.
  • internal/config/refresh/audit.go: config.refresh audit emitter.
  • cmd/context-engine/main.go: endpoint mount + publisher wiring.
  • cmd/agent-orchestrator/main.go: subscriber wiring.
  • internal/config/feature_flags.go §Refresh(): the cache-drop primitive.