Local Dogfood: Agent Runtime Wave 1

This doc describes the local docker-compose dogfood deploy of the Agent Runtime Wave 1 foundation. It exists so contributors can exercise the real gRPC surface, interceptor chain, migrations 018/019, and observability wiring against real Postgres + Redis before anything reaches GKE.

Hard boundary — what this is NOT

  • NOT a GKE deploy. No Pulumi, no ArgoCD, no Kubernetes manifests are touched by this path. The deployments/ tree is untouched.
  • NOT a tenant-facing environment. Hard Gate #155 (PRD #93 v1.5 §15) blocks external tenant onboarding on Wave 1 — production promotion is blocked until Wave 2 lands the agent loop, LLM sourcing, and billing.
  • NOT wired to real LLM providers. No API keys, no outbound calls to Anthropic / OpenAI / Google. The agent-worker service is a Python scaffold that starts a gRPC server but is not yet dialled by the orchestrator — that is Wave 2 work.
  • NOT exposed beyond loopback. Every host port is bound to 127.0.0.1. Do not change this.
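
One way to confirm the loopback rule on a running stack is to filter Docker's port listing for anything not bound to 127.0.0.1. The sketch below is a hypothetical helper, not part of the repo; it assumes input in the shape printed by `docker ps --format '{{.Ports}}'` (comma-separated mappings such as `127.0.0.1:5432->5432/tcp`).

```shell
# Hypothetical helper: print every published port mapping that is NOT
# bound to 127.0.0.1. Reads `docker ps --format '{{.Ports}}'`-style
# lines on stdin. Entries without "->" are container-only ports (not
# published to the host) and are ignored.
non_loopback_bindings() {
  tr ',' '\n' |
    sed 's/^[[:space:]]*//' |
    grep -e '->' |
    grep -v '^127\.0\.0\.1:' || true
}
```

Usage: `docker ps --format '{{.Ports}}' | non_loopback_bindings` should print nothing when every mapping is loopback-only. Note that `docker ps` lists every container on the host, not only this stack, so unrelated containers may show up in the output.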

What the stack runs

| Service | Purpose | Host port |
|---|---|---|
| postgres (pgvector) | Schema at migration v19 (017 audit, 018 usage, 019 RBAC) | 127.0.0.1:5432 |
| pgbouncer | Transaction-mode pooler (unused by orchestrator in dev) | 127.0.0.1:6432 |
| redis 7.2 | Streams / cache | 127.0.0.1:6379 |
| fake-gcs | GCS stub for snapshot path | 127.0.0.1:4443 |
| fake-jwt-issuer | Airgapped OIDC issuer | 127.0.0.1:9090 |
| migrate (one-shot) | Applies 001..019 | (none) |
| prometheus | Scrapes engine + orchestrator | 127.0.0.1:9092 |
| grafana | Dashboards | 127.0.0.1:3001 |
| agent-orchestrator | Wave 1 Go service: gRPC, health, metrics | gRPC 50052, health 8082, metrics 9094 |
| agent-worker | Wave 2 pending: Python scaffold, no host port | (internal only) |

Bring-up

# One command: infra → migrate → orchestrator + worker → tail logs
make dev-up-runtime

# Or step-by-step
docker compose -f docker-compose.dev.yml up -d postgres pgbouncer redis
docker compose -f docker-compose.dev.yml run --rm migrate
docker compose -f docker-compose.dev.yml up -d --build agent-orchestrator agent-worker

Smoke tests

# 1. Process health
curl -fsS http://127.0.0.1:8082/healthz

# 2. Dependency readiness (DB + Redis ping)
curl -fsS http://127.0.0.1:8082/readyz

# 3. Prometheus exposition
curl -fsS http://127.0.0.1:9094/metrics | head -20

# 4. gRPC service enumeration (requires grpcurl; reflection is enabled in dev)
grpcurl -plaintext 127.0.0.1:50052 list

# 5. Migration version
docker compose -f docker-compose.dev.yml exec postgres \
  psql -U upsquad -d upsquad -c "SELECT version FROM schema_migrations;"
# expected: 19

# 6. Startup log sanity
docker compose -f docker-compose.dev.yml logs agent-orchestrator \
  | grep -E "rbac_grants|audit writer|llm_usage_events|started successfully"
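
Right after bring-up, the readiness probe may fail for a few seconds while Postgres and Redis come online. A minimal sketch of a retry wrapper for the checks above follows; `retry` is a hypothetical helper (not something the repo ships), and the attempt count and delay in the usage line are arbitrary.

```shell
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or ATTEMPTS is exhausted, sleeping
# DELAY_SECONDS between tries. Command output is discarded.
# POSIX sh has no `local`, so the loop variables are plain globals.
retry() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Skip the sleep after the final failed attempt.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

Usage: `retry 30 2 curl -fsS http://127.0.0.1:8082/readyz && echo "orchestrator ready"` polls for up to about a minute before giving up.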

Rollback

# Stop and remove runtime services; infra keeps running
make dev-down-runtime

# Nuclear: stop everything and wipe all volumes
docker compose -f docker-compose.dev.yml down -v

There is no rollback risk beyond losing the local Postgres volume. No production state is touched because this stack never reaches production.

Registry-pull path (GHCR :latest)

As an alternative to local builds, the publish-images.yml workflow publishes images for all 6 platform services to GHCR. As of 2026-04-20 (PRD #743 / ticket #745) the workflow runs on a 3-hour cron (0 */3 * * * UTC) rather than on every push to main, with a workflow_dispatch escape hatch for when you need a fresher :latest inside the cron window.

| Service | Image |
|---|---|
| context-engine | ghcr.io/upsquad-ai/upsquad-core/context-engine:latest |
| agent-orchestrator | ghcr.io/upsquad-ai/upsquad-core/agent-orchestrator:latest |
| reconciler | ghcr.io/upsquad-ai/upsquad-core/reconciler:latest |
| agent-worker | ghcr.io/upsquad-ai/upsquad-core/agent-worker:latest |

Each image is also tagged with the commit SHA for pinning.
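
As a rough illustration of pinning, the sketch below builds an image reference from a service name and a tag. `pinned_image` is a hypothetical helper, and the exact form of the SHA tag (full 40-character SHA, short SHA, or a prefixed variant) is an assumption; check the package page on GHCR for the tags that are actually published.

```shell
# Hypothetical helper: build a pinned image reference.
#   $1 = service name as it appears in the image path
#   $2 = tag (ASSUMPTION: the commit SHA is used verbatim as the tag)
pinned_image() {
  printf 'ghcr.io/upsquad-ai/upsquad-core/%s:%s\n' "$1" "$2"
}
```

Usage, assuming the full SHA is the tag: `docker pull "$(pinned_image agent-orchestrator "$(git rev-parse HEAD)")"`.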

:latest freshness SLA

  • Worst-case lag from main merge to GHCR :latest is ~3 hours plus GitHub Actions cron drift (typically 5-15 min). The dev-box reconciler (scripts/dev-reconcile.sh, #672) polls every 3 min, so the effective end-to-end lag from merge to a reconciled dev box is at most about 3h 18min (3h cron interval + 15 min drift + 3 min poll).
  • The cron only rebuilds services whose watched paths changed since the most recent successful run of the same workflow, so a cron tick in a quiet window completes in ≤ 60s without pushing any new image.
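
The arithmetic behind these figures can be sketched in a few lines of shell. The function names are hypothetical; the inputs (180 min cron interval, 15 min drift, 3 min reconciler poll) come from the bullets above. The first function estimates how long until the next `0 */3 * * *` tick, which helps decide whether a manual trigger is worth it.

```shell
# Minutes until the next "0 */3 * * *" tick, given the current UTC
# hour and minute. A call made exactly on a tick reports the full
# 180 minutes, i.e. it assumes that tick has already been missed.
minutes_until_next_tick() {
  # Strip one leading zero ("08" -> "8") so the shell does not try to
  # parse the value as octal; "00" and "0" both end up as 0.
  h=${1#0}; h=${h:-0}
  m=${2#0}; m=${m:-0}
  now=$((h * 60 + m))
  next=$(( (now / 180 + 1) * 180 ))
  echo $((next - now))
}

# Worst-case merge-to-dev-box lag in minutes:
#   $1 = cron interval, $2 = cron drift, $3 = reconciler poll interval
worst_case_lag() {
  echo $(($1 + $2 + $3))
}

# Render a minute count as "Xh Ymin".
format_minutes() {
  echo "$(($1 / 60))h $(($1 % 60))min"
}
```

Usage: `format_minutes "$(worst_case_lag 180 15 3)"` prints `3h 18min`, and `minutes_until_next_tick "$(date -u +%H)" "$(date -u +%M)"` gives the wait from now, ignoring drift.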

Need an image sooner? Manual trigger

Open Actions → Publish Images to GHCR → Run workflow in the browser and pick one of:

  • force_all=false (default): rebuild only services whose paths changed since the last successful run. Good for "I merged one service, push it now" without waiting up to 3h.
  • force_all=true: rebuild all 6 services regardless of path changes. Good for a clean-slate refresh, for example when a go.mod bump has landed and you want every image rebuilt rather than only the services whose watched paths changed.

The gh CLI equivalent:

gh workflow run publish-images.yml --ref main # force_all defaults to false
gh workflow run publish-images.yml --ref main -f force_all=true

The dev-box reconciler will pick up the new :latest within 3 min of the run finishing — no extra manual step required on the dev box.

Bring the stack up from GHCR — no local build step runs:

# Login once if images are private (token needs read:packages)
echo "$GH_TOKEN" | docker login ghcr.io -u <your-username> --password-stdin

# Pull + up with the override
docker compose -f docker-compose.dev.yml -f docker-compose.dev.registry.yml pull
docker compose -f docker-compose.dev.yml -f docker-compose.dev.registry.yml up -d

# Verify the reconciler is healthy (it only runs on the registry path)
curl -fsS http://127.0.0.1:9119/healthz
curl -fsS http://127.0.0.1:9119/metrics | grep orgunit_dualwrite

The override nulls the build: block on every platform service via build: !reset null and sets pull_policy: always, so up always fetches the newest :latest. The base docker-compose.dev.yml is untouched — drop the -f docker-compose.dev.registry.yml flag to return to local builds.

When to use which:

  • Local builds (base file only): active development, uncommitted changes, or branch work not yet on main.
  • Registry pull (override added): smoke testing latest green main, cold-cache CI runs, teammates who want "what just merged" without waiting for a 3–5 min Go + Python image build.
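
Switching between the two paths comes down to whether the override file is on the command line. A small sketch of a wrapper that makes the choice explicit follows; `compose_file_args` is a hypothetical helper, and only the two file names come from this doc.

```shell
# Hypothetical helper: print the -f flags for the chosen mode.
#   local    -> base file only (images built locally)
#   registry -> base file plus the GHCR override
compose_file_args() {
  case "$1" in
    local)    echo "-f docker-compose.dev.yml" ;;
    registry) echo "-f docker-compose.dev.yml -f docker-compose.dev.registry.yml" ;;
    *)        echo "usage: compose_file_args local|registry" >&2; return 2 ;;
  esac
}
```

Usage: `docker compose $(compose_file_args registry) up -d`. The command substitution is left unquoted on purpose so the shell splits the flags into separate arguments.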

Why is the orchestrator's Docker healthcheck missing?

The runtime image is gcr.io/distroless/static-debian12:nonroot, which has no shell, wget, or curl, so a CMD-SHELL healthcheck cannot be expressed from inside the container. Health is asserted externally via curl against http://127.0.0.1:8082/healthz and .../readyz. The orchestrator itself still exposes the endpoints — the only thing missing is the in-container probe binary, and we keep the distroless posture intentionally.

Architectural scope

Wave 1 = foundation only. The orchestrator's worker-dial relay currently returns a hardcoded "acknowledged" response. The agent-worker Python container exists in this compose file purely to prove its Docker packaging ahead of Wave 2; it is not dialled by the orchestrator yet.

When Wave 2 lands:

  • Orchestrator → worker gRPC dial becomes real
  • Worker host port gets published
  • An OTLP collector (or a tracing sidecar) gets added to this stack
  • LLM provider stubs get wired (still no real API keys in dev)