Agent Worktree Isolation — Runbook

Status: active — 2026-04-22. Owner: devops-engineer. Related: core#852, client#107, client#110, client#111, core#827.

Why this exists

Multiple agents (backend-sme, frontend-sme, qa-engineer, principal-architect, etc.) are frequently dispatched in parallel on the same repo. When they all operate on the shared main checkout (/opt/upsquad/<repo>/), their git checkout, git stash, and git clean operations silently clobber each other's uncommitted work.

Direct incidents:

Incident | Cost | Note
-------- | ---- | ----
client#107 (Clerk webhook) | multiple rebuilds | "checkouts got clobbered twice mid-build"
client#110 (dev-bypass removal) | ~90 minutes | untracked files dropped + tracked edits reverted when parallel branches flipped
client#111 (stub regen) | setup tax per agent | "concurrent bot sessions so explicit branch checkout + stash-by-id is required"
core#827 (conflict resolution) | forced workaround | agent created an ad-hoc worktree at /tmp/upsquad-761-merge

Convention

Every agent task works in a dedicated git worktree at /opt/upsquad-worktrees/<repo>/<agent>-<issue>/, branched from a freshly fetched origin/main.

Slot | Value
---- | -----
Root | /opt/upsquad-worktrees/<repo>/ (group-owned by upsquad-devs, mode 2775, setgid)
Worktree dir | <agent>-<issue>
Branch | <prefix>/<issue>-<agent> (e.g. fix/844-devops-engineer)
Prefix | one of fix, feat, chore, docs, refactor, test
Base | origin/main at dispatch time
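
To make the naming rules concrete, here is a minimal sketch that derives the worktree path and branch name from a repo, agent, issue, and prefix. The function names worktree_path and branch_name and the WORKTREE_ROOT variable are hypothetical, chosen for illustration; they are not taken from agent-worktree.sh.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and WORKTREE_ROOT are illustrative assumptions,
# not part of the real agent-worktree.sh.

WORKTREE_ROOT="${WORKTREE_ROOT:-/opt/upsquad-worktrees}"

# worktree_path <repo> <agent> <issue>
worktree_path() {
  printf '%s/%s/%s-%s\n' "$WORKTREE_ROOT" "$1" "$2" "$3"
}

# branch_name <prefix> <issue> <agent>
# Rejects any prefix outside the allowed set from the convention table.
branch_name() {
  case "$1" in
    fix|feat|chore|docs|refactor|test) ;;
    *) echo "unknown prefix: $1" >&2; return 1 ;;
  esac
  printf '%s/%s-%s\n' "$1" "$2" "$3"
}

worktree_path upsquad-core devops-engineer 844
branch_name fix 844 devops-engineer
```

Note that the agent and issue swap order between the two names: the directory is <agent>-<issue>, while the branch is <prefix>/<issue>-<agent>.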

One-time bootstrap (per dev box / CI runner)

sudo mkdir -p /opt/upsquad-worktrees/upsquad-core \
/opt/upsquad-worktrees/upsquad-client \
/opt/upsquad-worktrees/upsquad-admin \
/opt/upsquad-worktrees/upsquad-web
sudo chown -R "$USER":upsquad-devs /opt/upsquad-worktrees
sudo chmod 2775 /opt/upsquad-worktrees
sudo chmod 2775 /opt/upsquad-worktrees/*

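The setgid bit on each directory makes new files and subdirectories inherit the upsquad-devs group, so other agents (running as any user in upsquad-devs) can read them. Group write access additionally depends on the creating process's umask leaving the group-write bit enabled (for example, umask 002).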
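
After bootstrap it may be worth confirming that the setgid bit actually stuck. The sketch below uses the standard `[ -g ]` file test; check_setgid_dir is a hypothetical helper name, and the demonstration runs against a throwaway directory rather than /opt/upsquad-worktrees.

```shell
#!/usr/bin/env bash
# Sketch only: check_setgid_dir is an illustrative helper, not an existing script.

# check_setgid_dir <dir>
# Returns 0 if <dir> has the setgid bit, 1 if it does not, 2 if it is not a directory.
check_setgid_dir() {
  local dir="$1"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 2; }
  if [ -g "$dir" ]; then
    echo "ok: $dir has setgid" >&2
  else
    echo "missing setgid: $dir" >&2
    return 1
  fi
}

# Demonstration on a temporary directory.
demo_dir=$(mktemp -d)
chmod 2775 "$demo_dir"
check_setgid_dir "$demo_dir"
rmdir "$demo_dir"
```

On a real box the same function could be looped over /opt/upsquad-worktrees and each of its per-repo subdirectories.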

Per-dispatch

# Agent prompts receive the issue number. The helper script returns the
# worktree path on stdout; all informational logs go to stderr.
WT=$(AGENT_NAME=backend-sme bash /opt/upsquad/upsquad-core/scripts/agent-worktree.sh 1234)
cd "$WT"
# work, commit, push ...

Script behaviour:

  • Fetches origin/main fresh, branches from it.
  • If the worktree already exists (agent retry), it prints the path and exits 0, so agents can call it idempotently.
  • If the branch already exists locally (previous attempt), it reuses the branch and the new worktree checks it out.
  • Prints only the path on stdout; every other line goes to stderr.
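
The four behaviours above can be sketched in a few lines of bash. This is an illustration under assumptions, not the contents of agent-worktree.sh: the function name ensure_worktree, its argument order, and the log prefix are all hypothetical. It relies only on standard git commands (fetch, show-ref, worktree add).

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real agent-worktree.sh. ensure_worktree and log are
# hypothetical names used to illustrate the documented behaviour.

log() { echo "[agent-worktree] $*" >&2; }   # informational output goes to stderr

# ensure_worktree <main-checkout> <worktree-dir> <branch>
ensure_worktree() {
  local repo="$1" wt="$2" branch="$3"

  # Retry case: the worktree already exists, so just report its path.
  if [ -e "$wt/.git" ]; then
    log "reusing existing worktree"
    printf '%s\n' "$wt"
    return 0
  fi

  # Refresh origin/main before branching from it.
  git -C "$repo" fetch --quiet origin main >&2 || return 1

  if git -C "$repo" show-ref --verify --quiet "refs/heads/$branch"; then
    # Branch left over from a previous attempt: check it out as-is.
    log "reusing existing branch $branch"
    git -C "$repo" worktree add "$wt" "$branch" >&2 || return 1
  else
    git -C "$repo" worktree add -b "$branch" "$wt" origin/main >&2 || return 1
  fi

  printf '%s\n' "$wt"   # the only line written to stdout
}

# Example call (paths are placeholders):
#   WT=$(ensure_worktree /opt/upsquad/upsquad-core \
#        /opt/upsquad-worktrees/upsquad-core/backend-sme-1234 fix/1234-backend-sme)
```

Redirecting git's own output to stderr is what keeps `WT=$(...)` safe: anything git prints on stdout would otherwise end up inside the captured path.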

Success criteria (from the issue)

  • Two agents dispatched simultaneously on the same repo produce independent PRs with zero clobbering.
  • Setup overhead under 5 seconds per dispatch.
  • Documented convention in CLAUDE.md for every repo.

Measured during core#852 delivery:

$ time (AGENT_NAME=test-b bash scripts/agent-worktree.sh 9999 test-b &
AGENT_NAME=test-c bash scripts/agent-worktree.sh 10000 test-c &
wait)
# ...
real 0m1.800s

1.8 s for two parallel worktrees, well under the 5-second target. Writing a separate file (AGENT_B_FILE / AGENT_C_FILE) into each worktree confirmed isolation: git status in each showed only that worktree's own untracked file.
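
The isolation check could be scripted along these lines. This is a sketch, not the procedure actually used during core#852: assert_only_untracked is a hypothetical helper, and WT_B / WT_C in the usage comment stand for the two worktree paths.

```shell
#!/usr/bin/env bash
# Sketch only: assert_only_untracked is an illustrative helper name.

# assert_only_untracked <worktree> <expected-file>
# Succeeds when the worktree's only pending change is that one untracked file.
assert_only_untracked() {
  local wt="$1" expected="$2" actual
  actual=$(git -C "$wt" status --porcelain) || return 2
  [ "$actual" = "?? $expected" ]
}

# Usage, assuming WT_B and WT_C hold the two worktree paths:
#   assert_only_untracked "$WT_B" AGENT_B_FILE &&
#   assert_only_untracked "$WT_C" AGENT_C_FILE &&
#   echo "worktrees are isolated" >&2
```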

Cleanup

Worktrees persist after the agent finishes. This is intentional: if an agent task failed, the next agent picking it up should see the last state rather than a fresh checkout.

Weekly cleanup removes worktrees whose branch no longer exists on origin (i.e. the PR merged or was closed):

# Run from the main checkout so the self-protection guard doesn't
# skip a worktree you're currently inside.
cd /opt/upsquad/upsquad-core
bash scripts/prune-agent-worktrees.sh
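
The prune rule (remove a worktree when its branch is gone from origin) can be sketched as below. This is not the contents of prune-agent-worktrees.sh: stale_worktrees is a hypothetical function that only lists candidates and leaves removal to the caller. One caveat of this naive rule is that a branch which was never pushed also has no ref on origin, so a real script presumably needs an extra guard that this sketch omits.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the real prune-agent-worktrees.sh. stale_worktrees is a
# hypothetical name; it prints candidate paths and removes nothing.

# stale_worktrees <main-checkout> <worktree-root>
stale_worktrees() {
  local repo="$1" root="$2" key value path="" rc
  git -C "$repo" worktree list --porcelain | while read -r key value; do
    case "$key" in
      worktree) path="$value" ;;
      branch)
        # Only consider worktrees under the agent root, never the main checkout.
        case "$path" in "$root"/*) ;; *) continue ;; esac
        # Self-protection: skip the worktree we are currently inside.
        case "$PWD/" in "$path"/*) continue ;; esac
        # ls-remote --exit-code exits with 2 when origin has no matching ref.
        # stdin is redirected so git cannot consume the loop's input.
        rc=0
        git -C "$repo" ls-remote --exit-code origin "$value" \
          >/dev/null 2>&1 </dev/null || rc=$?
        if [ "$rc" -eq 2 ]; then printf '%s\n' "$path"; fi
        ;;
    esac
  done
}

# Usage (paths are placeholders); git worktree remove refuses dirty worktrees
# unless --force is given:
#   stale_worktrees /opt/upsquad/upsquad-core /opt/upsquad-worktrees/upsquad-core |
#     while read -r wt; do git -C /opt/upsquad/upsquad-core worktree remove "$wt"; done
```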

Suggested systemd timer (to be wired by a follow-up PR, not in scope for #852):

# /etc/systemd/system/upsquad-worktree-prune.timer
[Unit]
Description=Weekly prune of merged-branch agent worktrees

[Timer]
OnCalendar=Sun *-*-* 03:30:00
Persistent=true

[Install]
WantedBy=timers.target

Until that timer lands, DevOps runs the prune manually on ops days.

Relationship to Claude Code Agent SDK isolation: "worktree"

The SDK's built-in isolation: "worktree" param gives subagent tool calls their own worktree for the duration of the call. It complements the filesystem convention documented here rather than replacing it:

  • SDK isolation covers subagent tool calls that modify the repo. Agents SHOULD pass isolation: "worktree" when spawning a subagent that might git checkout or write files.
  • Filesystem convention covers human-invoked, cron-invoked, and top-level agent dispatches. Without it, any of those contexts can still clobber an SDK-isolated subagent's parent checkout.

When in doubt, prefer the filesystem convention — it works uniformly across every invocation path.

References

  • scripts/agent-worktree.sh — dispatch helper.
  • scripts/prune-agent-worktrees.sh — weekly cleanup.
  • CLAUDE.md → "Parallel agent dispatches (MANDATORY)" section.
  • core#852 — the task that produced this convention.