Runbook: Org Model v2.3 — Phase 2 Cutover (Dev)
Parent tracker: #570 · HLD: #556 · LLD: #560 §10.2 · Task issue: #692 · Milestone: Wave J+K — Phase 2 Cutover + Legacy Drop (#11)
This runbook is the procedure; the founder is the operator. The runbook does not automate flag flips on the founder's behalf. Every flag below is flipped by hand by the founder in dev, one at a time, with explicit verification between each flip.
0. Prod flip authority
Prod flip authority: TBD. Revisit this section when the first prod tenant onboards. Until then, every flag-flip step below is dev only. Do not copy this runbook into a prod change ticket without replacing this section with the approved prod authority.
1. What this runbook covers
Four feature flags that move the platform from Phase 1 (dual-write
running, reads still from legacy pillars / teams /
team_memberships) to Phase 2 (reads swing to the Org Model v2.3
mirror tables, the new services go live):
| # | Flag | Effect when flipped to true | LLD reference |
|---|---|---|---|
| 1 | dual_write_org_units | Already true throughout Phase 1. Off-switch for the K.2 schema drop — must remain true until Phase 3 decision. | §10.1 |
| 2 | read_from_org_units | MemberService + TenantService swing reads from team_memberships / teams / pillars to org_units / org_unit_memberships. J.2 shim activation. | §10.2 step 1 |
| 3 | cascade_4layer | GovernanceService Check() uses the 4-layer cascade (platform → org → org_unit walk → member) instead of the legacy 3-layer path. | §10.2 step 4 |
| 4 | new_services_enabled | OrgUnitService, RbacService, ConflictService endpoints start serving traffic at the gateway. | §10.2 step 5 |
The flip order is not arbitrary. Each flag depends on the one above it landing cleanly (see §5 below).
2. Who flips what
- Dev: the founder flips every flag personally via the procedure in §6. Agents never flip flags on the founder's behalf. If an agent is asked to automate or script the flip, the request is out of scope and must be escalated back to the founder.
- Prod: TBD — see §0.
3. Pre-flight checks (before the first flip)
Run all of these before flipping any flag. Any failure is a hard stop.
3.1 7-day zero-delta gate
The reconciliation reports in
docs/runbooks/org-v23-reconciliation-reports/ must show seven
consecutive daily gates of PASS ending on today's (UTC) report. The
most recent report must include the banner:
✅ 7-day zero-delta gate: MET. Phase 2 cutover is unblocked from the reconciliation side.
# Confirm the banner is present in today's file.
grep -l "7-day zero-delta gate: MET" \
docs/runbooks/org-v23-reconciliation-reports/$(date -u +%Y-%m-%d).md
If the file does not exist or the banner is missing, stop. Wait
for the CI baseline run, or run cmd/reconcile-report manually
against a live DB. Document why the gate missed (e.g. the reconciler
was paused, drift was observed) in the tracking issue (#570) before
retrying.
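The grep above only confirms today's banner. A minimal sketch of a fuller check, which also confirms that a report file exists for each of the seven UTC days ending today, is shown below. It assumes GNU date and one <YYYY-MM-DD>.md file per day; check_gate_reports is a hypothetical helper name, and the per-day PASS marker is not inspected because this runbook does not specify its format.

```shell
# Sketch only: verify that reports exist for the seven UTC days ending on $2
# (default: today) and that the newest one carries the MET banner.
# Assumes GNU date (-d "@epoch") and one <YYYY-MM-DD>.md file per day.
check_gate_reports() {
  dir=$1
  today=${2:-$(date -u +%Y-%m-%d)}
  base=$(date -u -d "$today" +%s) || return 2
  i=0
  while [ "$i" -lt 7 ]; do
    # Step back one UTC day (86400 s) per iteration.
    day=$(date -u -d "@$((base - i * 86400))" +%Y-%m-%d)
    if [ ! -f "$dir/$day.md" ]; then
      echo "missing report: $dir/$day.md" >&2
      return 1
    fi
    i=$((i + 1))
  done
  # The banner text is the one quoted in section 3.1.
  grep -q "7-day zero-delta gate: MET" "$dir/$today.md"
}

# Usage:
#   check_gate_reports docs/runbooks/org-v23-reconciliation-reports && echo "gate MET"
```

A non-zero exit is a hard stop, the same as a missing banner.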
3.2 Live drift metric is zero
# kubectl port-forward to the reconciler's metrics endpoint, then:
curl -s localhost:9119/metrics | grep '^dualwrite_open_drift[{ ]'
Expected: every series reports 0. A non-zero value means the
reconciler detected drift in the current pass — do not flip any
flag until that series returns to zero for at least one full
detect interval (RECONCILER_DETECT_INTERVAL, default 1h).
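Reading the grep output by eye is easy to get wrong when there are many series. The sketch below turns the "every series reports 0" rule into an exit code. It assumes the standard Prometheus text format with the sample value as the last field (no trailing timestamps) and accepts both labelled and unlabelled series; drift_is_zero is a hypothetical helper name.

```shell
# Sketch only: read Prometheus text exposition on stdin and succeed only if
# at least one dualwrite_open_drift series is present and all of them are 0.
# Assumes the sample value is the last field on the line (no timestamps).
drift_is_zero() {
  awk '
    /^dualwrite_open_drift[{ ]/ {
      seen = 1
      if ($NF + 0 != 0) { bad = 1; print "non-zero: " $0 }
    }
    END { exit ((seen && !bad) ? 0 : 1) }
  '
}

# Usage (after the port-forward is up):
#   curl -s localhost:9119/metrics | drift_is_zero && echo "drift is zero"
```

The check fails when no series is found at all, so a dead port-forward does not read as a pass.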
3.3 No open DualwriteDriftPersisting / DualwriteReconcilerStalled alerts
Check Grafana (Org Model dashboard) or Alertmanager directly. Any open critical-severity dualwrite alert is a hard stop.
3.4 Backend track is merged
Issues #686–#690
(the backend track that lands the flag plumbing, the
TenantService.ListTeams shim, and the 4-layer cascade switch) must
all be merged and deployed to the dev cluster. Confirm via:
gh api repos/upsquad-ai/upsquad-core/milestones/11 \
--jq '.open_issues, .closed_issues'
4. Rollback lever (applies to every step below)
Each flag has a single atomic off-switch: flip it back to false.
Data is preserved because dual_write_org_units stays true
throughout Phase 2 — the mirror tables continue to receive writes,
and legacy reads continue to work the moment the read-swing flag is
flipped back.
Rollback = flip the just-flipped flag back to false. If more than one
flag has been flipped, roll back in the exact reverse order of the
flips. Verify that the §7 checks return to green. Record what you
observed in #570.
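The reverse-order rule can be sketched as a helper that only prints the rollback commands, leaving the founder to run each one by hand. It assumes the Redis flag surface and the ff:orgv23:<flag>:global key pattern used in §6; rollback_plan is a hypothetical name, not something shipped in the repo.

```shell
# Sketch only: print (never execute) the rollback commands for the flags
# given in flip order, newest flip first.
# Assumes the Redis flag surface and key pattern from section 6.
rollback_plan() {
  plan=""
  for flag in "$@"; do
    # Prepend each command so the last-flipped flag ends up first.
    plan="kubectl -n platform exec deploy/redis -- redis-cli SET ff:orgv23:${flag}:global false
${plan}"
  done
  printf '%s' "$plan"
}

# Usage: list the flags in the order they were flipped.
#   rollback_plan read_from_org_units cascade_4layer
```

Printing instead of executing keeps the flip itself a manual act, in line with §2.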
Rollback is not possible after Phase 3 (legacy tables dropped). Phase 3 is a separate runbook — do not start it from this document.
5. Flag flip order (dev)
The LLD §10.2 order is mandatory. Each step has a gate the founder checks before moving on.
Pre-flight §3 passes
        │
        ▼
Step 1: read_from_org_units = true (§6.1, verify §7.1)
        │
        ▼
Step 2: deploy TenantService.ListTeams shim (already landed via
        backend track #686–#690; confirm via §7.2)
        │
        ▼
Step 3: portal smoke tests (§7.3)
        │
        ▼
Step 4: cascade_4layer = true (§6.2, verify §7.4)
        │
        ▼
Step 5: new_services_enabled = true (§6.3, verify §7.5)
        │
        ▼
Post-flip soak §8
dual_write_org_units is not flipped in this runbook. It stays
true and is only flipped false as the final off-switch in Phase 3
(covered by Wave K, separate runbook).
6. Procedure — flag flips
The founder executes these commands personally. Agents may help prepare the commands, but the flip itself is a manual act to preserve operator accountability for the change.
6.1 read_from_org_units → true
- Pre-flip checklist:
  - §3 pre-flight checks all green.
  - Current time falls within the agreed change window.
  - You (founder) have a shell open against the dev cluster.
- The command: the exact invocation depends on the flag surface landed by the backend track. The flag is tenant-scoped, but for dev we flip it platform-wide via the global key:
  # Flag store is Redis (precedent: internal/context/versioning/featureflag.go).
  # Global key pattern: ff:orgv23:read_from_org_units:global
  kubectl -n platform exec deploy/redis -- \
    redis-cli SET ff:orgv23:read_from_org_units:global true
  If the backend track landed a different flag surface (e.g. a GUC or a ConfigMap), replace the command above with whatever the backend PRs documented in their merge description. Confirm the flag surface with @backend-sme before flipping; do not guess.
- Verification: see §7.1.
- If §7.1 fails: flip back with
  kubectl -n platform exec deploy/redis -- \
    redis-cli SET ff:orgv23:read_from_org_units:global false
6.2 cascade_4layer → true
- Gated on §7.1 and §7.3 green.
- Command:
  kubectl -n platform exec deploy/redis -- \
    redis-cli SET ff:orgv23:cascade_4layer:global true
- Verification: §7.4.
- Rollback: set to false via the same key. Rollback criterion from the LLD: GovernanceService.Check p95 > 7 ms for > 5 min.
6.3 new_services_enabled → true
- Gated on §7.4 green.
- Command:
  kubectl -n platform exec deploy/redis -- \
    redis-cli SET ff:orgv23:new_services_enabled:global true
- Verification: §7.5.
- Rollback: set to false. Gateway routes for the new services return 404 again (expected).
7. Verification — per-step checks
7.1 After read_from_org_units = true
# Reads should now resolve from org_units / org_unit_memberships.
# Use the tenant-service list endpoint as a smoke:
grpcurl -plaintext \
-H "x-upsquad-org-id: $TEST_ORG_ID" \
dev.tenant-service:50051 \
upsquad.tenant.v1.TenantService/ListTeams
# Expected: same row count as the legacy ListTeams returned yesterday.
# If the count drops to zero, the shim is not resolving org_units — roll back.
- Member count, team count, and org chart must be byte-identical to the pre-flip baseline captured in the change ticket.
- Gateway error rate must stay < 0.5% for 10 minutes.
7.2 After the shim is live
The shim is expected to be live before §7.1 (it was deployed by the backend track). Confirm its health:
kubectl -n platform logs deploy/tenant-service --since=5m \
| grep -E "ListTeams|shim"
# Expected: shim path invoked, no "fallback to legacy" warnings.
7.3 Portal smoke tests
Run the Playwright suite against the dev portal:
cd ../upsquad-client && npm run e2e -- --grep "org-chart|governance"
All four flows must pass: ListMembers, ListTeams, Org Chart, Governance policy list (LLD §10.2 step 3).
7.4 After cascade_4layer = true
# GovernanceService p95 latency — target < 5 ms, hard stop at > 7 ms.
kubectl -n platform port-forward svc/prometheus 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query' \
--data-urlencode 'query=histogram_quantile(0.95, sum(rate(governance_check_duration_seconds_bucket[5m])) by (le))' \
| jq '.data.result[0].value[1]'
# Expected: < 0.005 (5 ms). Rollback criterion: > 0.007 for > 5 min.
policy_conflicts insertion rate should not spike; an unexpected rise means the 4-layer cascade is detecting conflicts the 3-layer path silently collapsed.
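The p95 query in this section prints a bare number (as a quoted JSON string unless jq is run with -r), which still has to be compared against the 5 ms target and the 7 ms rollback line. A minimal sketch of that comparison follows. classify_p95 and its three labels are hypothetical, and a single reading above 7 ms is only a candidate: the rollback criterion also requires it to persist for more than 5 minutes.

```shell
# Sketch only: classify one p95 reading (in seconds) from the query above.
# Thresholds come from this section: target < 0.005, rollback line > 0.007.
classify_p95() {
  # jq prints the sample as a quoted string unless run with -r; strip quotes.
  v=$(printf '%s' "$1" | tr -d '"')
  case "$v" in
    ''|NaN|null) echo "no-data"; return 0 ;;
  esac
  awk -v v="$v" 'BEGIN {
    if (v + 0 > 0.007) {
      print "rollback-candidate"
    } else if (v + 0 >= 0.005) {
      print "above-target"
    } else {
      print "ok"
    }
  }'
}

# Usage:
#   classify_p95 "$(curl -s ... | jq '.data.result[0].value[1]')"
```

A no-data result usually means the histogram has no samples in the 5-minute window yet, so repeat the query once traffic has flowed before treating it as a pass.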
7.5 After new_services_enabled = true
# All three new services must answer health.
for svc in orgunit-service rbac-service conflict-service; do
grpcurl -plaintext dev.$svc:50051 grpc.health.v1.Health/Check
done
# Expected: SERVING on all three.
- Gateway 5xx rate stays < 0.1% for 15 minutes after flipping.
8. Post-flip soak
After step 5 lands, soak for 24 hours before moving on to Wave K (legacy drop):
- Reconciliation reports continue to emit daily; gate stays PASS.
- dualwrite_open_drift stays at zero across every tenant.
- No open arch-concern or bug issues filed against MemberService, TenantService, GovernanceService, OrgUnitService, RbacService, or ConflictService.
- Portal smoke suite run once more at the end of the soak.
9. Sign-off checklist
Every box must be checked by the named owner before the tracker is closed and Wave K is opened.
- Founder — §3 pre-flight complete (paste command output).
- Founder — Step 1 flipped, §7.1 green.
- Founder — §7.2 shim health confirmed.
- Founder — §7.3 portal smoke tests passed.
- Founder — Step 4 flipped, §7.4 green.
- Founder — Step 5 flipped, §7.5 green.
- Founder — §8 soak complete.
- DevOps — post-soak reconciliation report attached in #570.
- Principal Architect — reviewed the soak evidence, no outstanding concerns.
- Project Manager — closed #692 and the parent §10.2 tracker.
10. Appendix — how this runbook was generated
- Written as part of Wave J.7 (#692).
- Flag-flip authority statements reflect founder decisions recorded 2026-04-19 in the #570 thread.
- Source of truth for the step order is LLD §10.2. Any future change to the flip order MUST update both the LLD and this runbook in the same PR.
11. Appendix — Wave J.4 cascade_4layer auto-rollback breaker
The 4-layer cascade replaces a single-level team-policy walk with a fanned-out
member → org_unit → org → platform tightest-first evaluation. The HLD #556 §6
SLO is p95 Check latency < 5ms at depth 10. Wave J.4 (#689)
adds a p95 breaker that auto-rolls back to the legacy 3-layer path if the new
cascade blows past the SLO for a sustained window. No human in the loop — this
supplements the manual rollback lever in §4 and the criterion in §6.2.
11.1 Breaker defaults
| Config | Default | Source (code) |
|---|---|---|
| p95 latency threshold | 7 ms | governance.DefaultBreakerThreshold |
| Consecutive minutes to trip | 5 | governance.DefaultBreakerConsecutiveMinutes |
| Min samples per minute bucket | 20 | governance.DefaultBreakerMinSamplesPerBucket |
The 7ms threshold sits ~40% above the HLD 5ms SLO so the breaker does not
false-trip on normal load. Buckets with fewer than MinSamplesPerBucket
samples are neutral — they neither count toward the breach chain nor
break it. That protects against a single slow probe call flipping the
breaker during low-traffic windows.
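The neutral-bucket rule is easiest to see on a worked trace. The sketch below replays a list of per-minute buckets, one "<p95_ms> <samples>" pair per line, and reports whether the breach chain would reach the trip count. It illustrates the rule as described here and is not the code in internal/governance/breaker.go; breaker_would_trip and the input format are hypothetical, and a bucket counts as breaching only when its p95 is strictly above the threshold.

```shell
# Sketch only: replay per-minute buckets ("<p95_ms> <samples>" per line) and
# exit 0 if the breach chain reaches the trip count, 1 otherwise.
# Defaults mirror the table above: 7 ms, 5 consecutive minutes, 20 samples.
breaker_would_trip() {
  thr=${1:-7}
  need=${2:-5}
  min=${3:-20}
  awk -v thr="$thr" -v need="$need" -v min="$min" '
    # Neutral bucket: too few samples, neither extends nor breaks the chain.
    $2 + 0 < min + 0 { next }
    # Breaching bucket: extend the chain and trip once it is long enough.
    $1 + 0 > thr + 0 { if (++run >= need + 0) tripped = 1; next }
    # Healthy bucket with enough samples: break the chain.
    { run = 0 }
    END { exit (tripped ? 0 : 1) }
  '
}

# Usage: two breaches, one neutral minute, then three more breaches.
#   printf '8 100\n9 100\n8 5\n8 100\n9 100\n10 100\n' | breaker_would_trip && echo tripped
```

In that trace the low-sample third minute is skipped, so the chain still reaches five. A run of slow one-sample probe minutes on its own never trips.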
11.2 What happens on trip
- An atomic tripped flag flips to true.
- governance.CascadeAutoRollbackTotal counter increments (exactly once per trip; CAS-guarded).
- internal/config.DBResolver.Refresh() is called so any cached "true" reads of cascade_4layer are immediately invalidated.
- A WARN log fires under key governance.cascade.breaker.tripped with the active threshold + consecutive-minutes config + the required operator action.
- Next Engine.Check call takes the legacy 3-layer path. The trip is process-local: other replicas continue running the cascade until their own breaker fires.
11.3 Dashboards / alerts
| Metric | Type | Alert suggestion |
|---|---|---|
| governance_cascade_auto_rollback_total | counter | Any increase → page on-call (indicates a replica rolled back to legacy). |
| governance_check_latency_seconds{layers=4} | histogram | P95 > 5ms sustained 2m → warning; > 7ms sustained 5m = breaker territory. |
| governance_check_latency_seconds{layers=3} | histogram | Control: legacy path baseline for diffing. |
| governance_cascade_duration_seconds | histogram | Inner cascade (LoadCascade + EvaluateCascade) latency; finer-grained. |
| governance_cascade_evaluations_total | counter | Verdict distribution: sudden deny spike on flag flip = regression. |
Dashboards should panel both {layers=3} and {layers=4} histograms
side-by-side so a breaker trip is visible as a sudden volume shift
rather than requiring a log grep.
11.4 Re-enabling after a trip (manual only)
Auto-recovery is deliberately not wired. An operator must:
- Diagnose the root cause (query plan regression, missing index, org_units.ancestor_path drift, etc.) using the governance_cascade_duration_seconds histogram + pg_stat_statements.
- Resolve the underlying issue and verify in a canary replica.
- Clear the trip by either:
  - Restarting the affected context-engine pod (simplest; the trip is process-local), OR
  - Calling the admin governance.Reset() surface if wired (Wave K).
- Confirm via governance_cascade_auto_rollback_total that no subsequent trip has fired within the observation window.
Do NOT re-enable by flipping the DB override while replicas are still
reporting tripped — the local override wins even when the DB says
true. You will chase a phantom flag-off state.
11.5 Related files
- internal/config/feature_flags.go: flag metadata + resolvers (J.1).
- internal/governance/breaker.go: Wave J.4 breaker implementation.
- internal/governance/engine.go: flag gate + histogram wiring.
- cmd/context-engine/main.go: production wiring.
- deploy/feature-flags/{dev,staging,prod}.yaml: environment-specific flag posture source of truth.
12. Appendix — Wave J.5 gateway exposure + smoke tests
Wave J.5 (#690)
is the gateway-side counterpart to §6.3. It wires the three new Connect
services behind new_services_enabled and ships a Go-side smoke suite
that operators run after each flip to prove the portal flows are live.
12.1 Gated service paths
All three services are mounted at the public gateway on the same host
as the other Connect services (dev: https://dev.upsquad.ai). The gate
is per-request — flipping new_services_enabled applies without a pod
restart (see §6.3 rollback lever).
| Service | Path prefix | Flag-off response |
|---|---|---|
| OrgUnitService | /upsquad.orgunit.v1.OrgUnitService/ | HTTP 404 + new_services_enabled=false body |
| RbacService | /upsquad.rbac.v1.RbacService/ | HTTP 404 + new_services_enabled=false body |
| ConflictService | /upsquad.conflict.v1.ConflictService/ | HTTP 404 + new_services_enabled=false body |
When flipped off, every path returns HTTP 404 with a one-line
text/plain body naming the flag, so a curl probe during rollback
shows operators why the route is dark rather than a silent empty
response. Flag state is also echoed at boot under the
orgv23 feature flag posture structured-log event.
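The curl probe mentioned above can be sketched as a small classifier over the status code and body. It assumes only what this section states: a gated-off path returns 404 with a body naming the flag, while any other 404 is a raw gateway 404. classify_mount and its three labels are hypothetical, and the exact body text may differ from the substring matched here.

```shell
# Sketch only: classify a gateway response by status code and body.
# Assumes the gate's 404 body contains the flag name, as described above.
classify_mount() {
  status=$1
  body=$2
  if [ "$status" != "404" ]; then
    # Any non-404 (including 401/403) means the handler was reached.
    echo "mounted"
  elif printf '%s' "$body" | grep -q 'new_services_enabled'; then
    echo "gated-off"
  else
    echo "raw-404"
  fi
}

# Usage (not run here; GetTree is the RPC named in section 12.2):
#   tmp=$(mktemp)
#   code=$(curl -s -o "$tmp" -w '%{http_code}' -X POST \
#     https://dev.upsquad.ai/upsquad.orgunit.v1.OrgUnitService/GetTree)
#   classify_mount "$code" "$(cat "$tmp")"
```

During a rollback the expected result is gated-off; raw-404 points at an unmounted service, not the flag.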
12.2 Post-flip smoke (make smoke-wave-j)
The smoke suite at test/smoke/wave_j_test.go exercises the four
portal-facing flows end-to-end via the generated Connect clients:
- Org chart render: OrgUnitService.GetTree
- Governance policy list: GovernanceService.ListPolicies
- Role management: RbacService.ListRoles
- Conflict inbox: ConflictService.ListConflicts
The suite has two modes:
- Mount check (no token): asserts every path is live by rejecting a raw HTTP 404 from the gate. Unauthenticated / permission-denied Connect errors are treated as pass because the handler was reached. This is the recommended post-flip sanity check:
  SMOKE_GATEWAY_URL=https://dev.upsquad.ai make smoke-wave-j
- End-to-end (token): uses a bearer token to make real RPCs and asserts success (any row count, including zero):
  SMOKE_GATEWAY_URL=https://dev.upsquad.ai \
  SMOKE_GATEWAY_TOKEN="$(cat ~/.upsquad/dev-token)" \
  SMOKE_GATEWAY_ORG_ID="$TEST_ORG_ID" \
  make smoke-wave-j
Without SMOKE_GATEWAY_URL the target skips, so the suite is safe under
go test ./... on workstations that cannot reach a dev gateway. In CI,
the suite runs as the make smoke-wave-j workflow step, gated on a
successful deploy.
12.3 Failure shapes and operator actions
| Smoke failure signal | Likely cause | Operator action |
|---|---|---|
| gate returned 404 — new_services_enabled is false | Flag not flipped, or flipped back by §4 rollback | Re-check §6.3; confirm DB override or env var is set. |
| raw 404 from gateway — service may be dark or unmounted | Pod running a pre-J.5 image | Check Context-Engine deploy version matches Wave J.5 merge SHA. |
| unknown connect error — gateway unhealthy? | Gateway 5xx or TLS error | Inspect context-engine pod logs; run grpc.health.v1 probe. |
| authenticated call failed — code=… | Token / org-id mismatch or RBAC denial | Regenerate the test token; confirm the org-id tagged on the shim. |
12.4 Related files
- cmd/context-engine/new_services_gate.go: per-request gate.
- cmd/context-engine/main.go: registration under the gate.
- test/smoke/wave_j_test.go: Connect-client smoke suite.
- Makefile: smoke-wave-j target.
- internal/config/feature_flags.go: flag resolver.
13. Appendix — Wave J.2 follow-up — admin flag-cache refresh endpoint
cfgpkg.DBResolver (internal/config/feature_flags.go) caches each
DB-override read for 30 seconds. After flipping a row in
platform_feature_flags (§6.1, §6.2, §6.3), the flip can stay invisible
for up to 30s on each replica until its next re-read. A pod restart
keeps this within the pre-prod < 5s rollback contract, but
#704 closes
the gap so an operator can force the cache drop without rolling pods.
13.1 Endpoint
| Field | Value |
|---|---|
| Method + path | POST /v1/admin/feature-flags/refresh |
| Required clearance | 100 (platform-admin — same as EmergencyRotateHMAC) |
| Auth | Bearer token via the standard Clerk JWT middleware |
| Body (optional) | {"reason": "<free text — recorded in audit>"} |
| 200 response | {"refreshed": true, "at": "<RFC3339>", "broadcast": <bool>} |
| 401 / 403 | {"error": {"code": "UNAUTHORIZED" \| "INSUFFICIENT_CLEARANCE", "message": …}} |
broadcast reflects whether the cross-pod fan-out succeeded. A false
value means the local Refresh ran but the Redis publish failed —
re-issue the call to retry the broadcast, or restart the affected
peer pods.
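The three outcomes of a refresh call (refreshed and broadcast, refreshed locally only, or rejected) can be told apart from the response body alone. The sketch below does that with grep so it has no dependencies beyond a POSIX shell; jq would work equally well. refresh_outcome and its labels are hypothetical, and the field names are the ones in the table above.

```shell
# Sketch only: classify a refresh response body read from stdin.
# Field names (refreshed, broadcast) come from the endpoint table above.
refresh_outcome() {
  resp=$(cat)
  if ! printf '%s' "$resp" | grep -Eq '"refreshed"[[:space:]]*:[[:space:]]*true'; then
    # 401/403 error bodies and anything unexpected land here.
    echo "failed"
  elif printf '%s' "$resp" | grep -Eq '"broadcast"[[:space:]]*:[[:space:]]*true'; then
    echo "ok"
  else
    # Local Refresh ran but the Redis publish did not: re-issue the call
    # or restart the affected peer pods.
    echo "retry-broadcast"
  fi
}

# Usage:
#   curl -sS -X POST https://dev.upsquad.ai/v1/admin/feature-flags/refresh \
#     -H "Authorization: Bearer $TOKEN" | refresh_outcome
```

On retry-broadcast the pod that handled the call is already refreshed; only its peers are still serving the cached value.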
13.2 Cross-pod fan-out
The handler publishes a tiny envelope on Redis pub/sub channel
v23_flags_refresh after running the local Refresh. Every replica
runs a cfgrefresh.RedisSubscriber from boot (wired in
cmd/context-engine/main.go and cmd/agent-orchestrator/main.go)
that calls the local v23Flags.Refresh() on each event. Mirrors the
rbac_role_changed pattern in internal/rbac/pubsub.go.
If Redis is down the local refresh still runs on the pod that received the call. Operators flipping a flag in a Redis-degraded environment should plan to restart pods to propagate, or wait for the 30s TTL.
13.3 Operator workflow (post-flag-flip)
TOKEN=$(cat ~/.upsquad/dev-admin-token) # platform-admin JWT
curl -sS -X POST https://dev.upsquad.ai/v1/admin/feature-flags/refresh \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"reason":"§6.3 new_services_enabled flip"}'
# {"refreshed":true,"at":"2026-04-26T12:34:56Z","broadcast":true}
Verify the flip took effect using the §7 verification probes for the
flag you just flipped (or the §12.2 mount check for
new_services_enabled).
13.4 Audit shape
Each successful refresh emits exactly one agent_audit_log row via
the shared async batch writer (migration 078, #907) so portal audit
reads (#246, #333) and SIEM export (LLD-21) see the same shape as
Phase B config events:
{
"audit_kind": "config.refresh",
"surface": "v23-feature-flags",
"actor_user_id": "user_…",
"actor_clearance": 100,
"reason": "§6.3 new_services_enabled flip"
}
13.5 Related files
- internal/config/refresh/handler.go: HTTP handler + auth gate.
- internal/config/refresh/pubsub.go: Redis publisher + subscriber.
- internal/config/refresh/audit.go: config.refresh audit emitter.
- cmd/context-engine/main.go: endpoint mount + publisher wiring.
- cmd/agent-orchestrator/main.go: subscriber wiring.
- internal/config/feature_flags.go §Refresh(): the cache-drop primitive.