
PRD: Context Engine -- Bootstrap Foundation for Token Optimization & Context Management

Status: Draft -- Awaiting Approval
Version: 1.1
Author: Product Manager Agent
Date: 2026-04-06
Parent PRD: UpSquad Complete PRD v1.6
Priority: [BOOTSTRAP] -- Must be built first; all other agents and platform components depend on it.


Changelog

v1.1 -- 2026-04-06

Triggered by: Internal re-verification against parent PRD (UpSquad Complete PRD v1.6) to close coverage gaps before architect handoff.

  1. Added section 4.10 Embedding & Chunking Pipeline as an explicit first-class sub-component (CE-F48 through CE-F52). Previously implicit in RAG section.
  2. Added CE-F53 LLM prompt-cache exploitation for stable immutable layer prefixes (L1+L2+L3+L4). Major token-reduction multiplier not in parent PRD -- flagged as upstream PRD gap in section 15.
  3. Added CE-F54 Context Engine savings metric -- per-call delta between naive baseline and engine-assembled tokens, integrated with P2.2.8 per-call cost tracking.
  4. Added CE-F55, CE-F56, CE-F57 mapping to P3.4.4, P3.4.5, P3.4.6 (persistent memory versioning with named snapshots, memory rollback, retention policies) -- previously claimed in appendix but not present in the requirements tables.
  5. Added CE-F58 cross-reference to SEC.1.2 (immutable context enforcement validated before every LLM call) -- overlaps with CE-F29, made explicit.
  6. Added section 8.1 Context Engine API Surface -- explicit interface contract between Context Engine and agent runtime, to remove ambiguity for the architect.
  7. Strengthened quality-preservation constraint by embedding it in CE-F5, CE-F11, CE-F12 directly rather than relying only on the global constraint in section 2.
  8. Added section 15 Upstream PRD Gaps -- items discovered during re-verification that should be added to the parent PRD by the Product Manager in a follow-up.
  9. Expanded section 7 Dependencies with explicit provisioning prerequisites (pgvector extension, embedding model selection, compaction LLM routing).
  10. Updated section 14 Parent PRD Cross-Reference appendix to accurately reflect which P3.4.x items are now covered and which were previously claimed-but-missing.

1. Problem Statement

UpSquad is built by its own AI agent team. Every agent interaction consumes LLM tokens, and the cost and quality of those interactions are directly determined by how context is assembled, compressed, and delivered to the model.

Today, agents operate with naive context strategies: entire PRDs (45,000+ tokens), full conversation histories, and unfiltered tool outputs are sent to the LLM on every call. This creates three compounding problems:

  1. Cost: Token consumption is 3-10x higher than necessary. A single PRD analysis session can consume hundreds of thousands of tokens when the agent only needs a fraction of the document.
  2. Quality degradation: When context windows are packed with irrelevant information, LLMs lose focus. Critical details get buried in noise. The model attends to everything equally rather than focusing on what matters for the current task.
  3. Latency: Larger prompts mean slower responses. Time-to-first-token grows with input size, because the model must process the entire prompt before it emits any output.

The Context Engine is the foundational infrastructure that solves all three problems simultaneously. It ensures that every LLM call receives the most relevant context in the fewest tokens, preserving (and amplifying) quality while dramatically reducing cost and latency.

Why build this first? Every other component in the platform -- workflows, governance, agent runtime, the chat interface -- makes LLM calls. The Context Engine sits beneath all of them. Building it first means every subsequent component benefits from optimized context from day one. The agent team building UpSquad will use it immediately (dogfooding), validating it under real production conditions before any customer does.


2. Goals & Success Metrics

Goal | Metric | Target (MVP) | Target (V1.1)
Reduce token consumption | Token reduction ratio (tokens sent vs. naive baseline for same task) | >= 40% reduction | >= 60% reduction
Preserve output quality | Quality score (blind comparison: Context Engine output vs. full-context output, rated by human reviewer) | >= 95% quality parity | >= 100% (quality amplification via better relevance)
Reduce latency | Time-to-first-token improvement | >= 25% faster | >= 40% faster
Context assembly speed | Time to assemble context for an LLM call (retrieval + compaction + assembly) | < 500ms (p95) | < 300ms (p95)
Cost visibility | Per-agent, per-session context efficiency tracking | Available in metrics | Dashboarded with alerts
Compaction quality | Information retention score (key facts preserved after compaction, measured via automated eval) | >= 98% fact retention | >= 99% fact retention
Semantic retrieval accuracy | Relevant chunk recall (% of relevant chunks retrieved in top-K results) | >= 85% recall@10 | >= 92% recall@10

Critical constraint: Using the Context Engine must NEVER compromise the quality of agent work. If the system cannot determine relevance with confidence, it MUST err on the side of including more context, not less. Quality is non-negotiable; token savings are the optimization target only after quality is guaranteed.


3. User Stories

US-CE-1: Agent Receives Relevant Context for Current Task

As an AI agent executing a task, I want to receive only the context relevant to my current step, so that I can focus on the task at hand without being distracted by irrelevant information and without consuming unnecessary tokens.

Acceptance Criteria:

  • Context assembly retrieves chunks semantically relevant to the current task description
  • Irrelevant sections of large documents (e.g., a 45K-token PRD when the agent only needs 2 sections) are excluded
  • The agent's output quality is equal to or better than when given the full document
  • Token usage is measurably lower than the naive baseline

US-CE-2: Long Conversation Context is Compacted Without Loss

As an AI agent in a long-running workflow (50+ messages), I want older conversation context to be intelligently compacted, so that I retain the critical decisions and facts from earlier in the conversation without exceeding my context window.

Acceptance Criteria:

  • Sliding window compaction summarizes older messages while keeping recent messages in full
  • Key decisions, constraints, and facts from compacted messages are preserved as structured entries
  • The agent can still reference earlier decisions accurately after compaction
  • Compaction triggers automatically when context reaches 80% of the model's max tokens (configurable)

US-CE-3: Agent Pushes Session Knowledge to Persistent Memory

As an AI agent completing a task, I want to push the knowledge I have learned (decisions made, facts discovered, state changes) to a persistent memory store, so that future sessions can benefit from this knowledge without re-discovering it.

Acceptance Criteria:

  • /push-context command saves session-derived knowledge to the external memory store
  • Push does NOT modify the agent's role definition, guardrails, or system prompt
  • Push requires write permission to the agent's memory namespace
  • Pushed knowledge is versioned and retrievable by future sessions

US-CE-4: Agent Pulls Fresh Context Without Breaking Session

As an AI agent that needs updated context mid-session, I want to pull the latest context and create a new immutable session snapshot, so that I can operate on up-to-date information without partial state corruption.

Acceptance Criteria:

  • /pull-context fetches latest context from external DB
  • A new immutable session snapshot is created atomically
  • The previous snapshot is fully discarded -- no mixing of old and new state
  • The agent confirms the version transition (e.g., "Context pulled: snapshot v12 -> v13")

US-CE-5: Context is Versioned and Rollbackable

As a team lead (L3+), I want every context mutation to be versioned with content-addressed hashing, so that I can audit changes, compare versions, and roll back to a known-good state if an agent's context becomes corrupted.

Acceptance Criteria:

  • Every context mutation produces a new version with SHA-256 content hash
  • Named snapshots can be created manually or automatically (before critical operations)
  • Rollback to any previous version is possible (L3+ required)
  • Full audit trail: who changed what, when, linked to workflow step

US-CE-6: Platform Operator Sees Context Efficiency Metrics

As a platform operator, I want to see per-agent and per-tenant context efficiency metrics, so that I can identify agents that are consuming tokens inefficiently and tune their context strategies.

Acceptance Criteria:

  • Context efficiency score is calculated: (useful tokens / total tokens sent) per LLM call
  • Metrics are available per agent, per session, per tenant
  • Alerts fire when an agent consistently operates below efficiency threshold
  • Recommendations for better compaction/retrieval settings are surfaced (V1.1)

US-CE-7: Immutable Agent Context Cannot Be Self-Modified

As a platform security engineer, I want to ensure that no agent can modify its own role definition, system prompt, guardrails, or clearance level, so that the chain of trust is preserved and agents cannot escalate their own privileges.

Acceptance Criteria:

  • All self-modification attempts are blocked regardless of vector (DB write, API call, tool invocation, LLM instruction)
  • Self-modification attempts are logged as security events
  • Optional session termination on violation
  • Only parent agents or humans can edit an agent's context (downward edits only)

US-CE-8: Semantic Search Retrieves Relevant Knowledge Before Each LLM Call

As an AI agent about to make an LLM call, I want the system to automatically assemble the most relevant context from my memory, the conversation history, and the team's RAG knowledge bases, so that I have the best possible information for my task without manual curation.

Acceptance Criteria:

  • Semantic search runs against vector store with recency weighting before each LLM call
  • Results are ranked by relevance score and recency
  • Only chunks above a relevance threshold are included
  • Context assembly respects the model's token budget (never exceeds allocated window)
  • Retrieval + assembly completes in < 500ms (p95)

4. Functional Requirements

4.1 Context Ingestion & Storage

ID | Requirement | Release | Parent PRD Item
CE-F1 | Context ingestion layer: ingest messages, tool outputs, workflow events into per-agent context store in database | MVP | P3.1.1
CE-F2 | Per-agent, per-tenant context isolation (separate storage paths, no cross-tenant context leakage) | MVP | P3.4.3
CE-F3 | Memory types: conversation context (per-workflow), learned knowledge (persistent), team preferences (persistent), work artifacts (90-day retention) | MVP | P3.4.2
CE-F4 | Object storage backend with per-tenant prefix, per-agent subfolder, object versioning enabled | MVP | P3.4.1

4.2 Context Compaction

ID | Requirement | Release | Parent PRD Item
CE-F5 | Sliding Window compaction (default strategy): recent N messages in full, older messages summarized via LLM. Quality guardrail: summarization prompts MUST preserve decisions, commitments, constraints, and open questions verbatim or as structured key-facts. If summarization confidence is low, keep the original message in full. Never silently drop information. | MVP | P3.1.2
CE-F6 | Compaction trigger: configurable threshold, default 80% of model max tokens | MVP | P3.1.6
CE-F7 | Per-agent/per-team configuration of compaction strategy and thresholds (via config DB) | MVP | P3.1.7
CE-F8 | Key-Fact Extraction compaction: extract decisions, constraints, requirements as structured facts | V1.1 | P3.1.3
CE-F9 | Hierarchical Summary compaction: minute-to-hour-to-day-to-week summaries | V1.1 | P3.1.4
CE-F10 | Task-Scoped compaction: completed tasks compacted, active tasks kept in full | V1.1 | P3.1.5
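As an illustration of CE-F5 and CE-F6, the trigger and the sliding-window step might look like the sketch below. It is not the engine's implementation: the `Message` type, the `summarize` callable (standing in for an LLM-backed summarizer routed through the LLM router), and precomputed token counts are all assumptions. The fallback branch mirrors the degraded-mode behavior described in section 9.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Message:
    role: str
    text: str
    tokens: int  # token count, assumed to be precomputed by the caller


def should_compact(messages: List[Message], model_max_tokens: int,
                   threshold: float = 0.80) -> bool:
    """CE-F6: trigger once usage reaches the configured share of the window."""
    used = sum(m.tokens for m in messages)
    return used >= threshold * model_max_tokens


def sliding_window_compact(messages: List[Message], keep_recent: int,
                           summarize: Callable[[List[Message]], Message]) -> List[Message]:
    """CE-F5: keep the last `keep_recent` messages in full, summarize the rest."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= keep_recent:
        return list(messages)
    older, recent = messages[:-keep_recent], messages[-keep_recent:]
    try:
        summary = summarize(older)  # hypothetical LLM-backed summarizer
    except Exception:
        # Degraded mode (section 9): plain truncation, never block the agent.
        return list(recent)
    return [summary] + list(recent)
```

A real summarizer would also need to carry the quality guardrail in CE-F5 (decisions, commitments, constraints, and open questions preserved verbatim or as key-facts); the sketch only shows where that call sits.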

4.3 Smart Retrieval & Assembly

ID | Requirement | Release | Parent PRD Item
CE-F11 | Smart retrieval: semantic search via vector store + recency weighting for context assembly before each LLM call. Quality guardrail: if top-K semantic recall drops below the confidence threshold for a query, the engine MUST expand the retrieval set (increase K, lower relevance floor) rather than return a thin result. Erring toward more context is always preferred over missing information. | MVP | P3.1.9
CE-F12 | Relevance scoring: chunks scored by semantic similarity to current task + recency + importance weight. Quality guardrail: scoring is advisory, not exclusive -- a chunk tagged as a "pinned" constraint (decisions, commitments, immutable facts) is always included regardless of score. | MVP | New (implicit in P3.1.9)
CE-F13 | Token budget management: context assembly respects model's max token limit, prioritizing highest-relevance chunks | MVP | New (operational necessity)
CE-F14 | Context processing: entity extraction, reference resolution, importance scoring | V1.1 | P3.1.10
CE-F15 | Context overload detection and alerting (alert when context approaches limit inefficiently) | V1.1 | P3.1.8
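One way to read CE-F12 and CE-F13 together is sketched below: pinned chunks are always included, the rest are ranked by a blended score and packed greedily into the token budget. The weights, the 24-hour recency half-life, and the `Chunk` fields are illustrative assumptions; the 0.7 similarity floor is the proposed default from Open Question #2. The CE-F11 behavior of widening retrieval when results are thin is not shown.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    tokens: int
    similarity: float        # cosine similarity to the current task, 0..1
    age_hours: float
    importance: float = 0.0  # 0..1, assumed to come from upstream tagging
    pinned: bool = False     # pinned constraints bypass scoring (CE-F12)


def score(chunk: Chunk, half_life_hours: float = 24.0,
          w_sim: float = 0.7, w_rec: float = 0.2, w_imp: float = 0.1) -> float:
    """Blend similarity, recency, and importance. Weights are assumptions."""
    recency = 0.5 ** (chunk.age_hours / half_life_hours)  # exponential decay
    return w_sim * chunk.similarity + w_rec * recency + w_imp * chunk.importance


def assemble(chunks: List[Chunk], token_budget: int,
             min_similarity: float = 0.7) -> List[Chunk]:
    """Pinned chunks first, then highest-scoring chunks that still fit."""
    selected = [c for c in chunks if c.pinned]
    used = sum(c.tokens for c in selected)
    candidates = sorted(
        (c for c in chunks if not c.pinned and c.similarity >= min_similarity),
        key=score, reverse=True,
    )
    for c in candidates:
        if used + c.tokens <= token_budget:
            selected.append(c)
            used += c.tokens
    return selected
```

The sketch does not handle the case where pinned chunks alone exceed the budget; that falls under the hard-truncation rule in section 9.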

4.4 Context Push/Pull & Session Management

ID | Requirement | Release | Parent PRD Item
CE-F16 | Context refresh mechanism: explicit refresh creates new immutable session snapshot, atomic operation (old or new, never partial) | MVP | P3.1.11
CE-F17 | Context push operation (/push-context): saves session knowledge to external memory store. Does NOT modify role definition or guardrails. | MVP | P3.1.12
CE-F18 | Context pull operation (/pull-context): fetches latest context, creates new immutable session snapshot, discards previous snapshot | MVP | P3.1.13
CE-F19 | Context push/pull authorization model: push requires write permission to own memory namespace; pull always authorized for own context; neither can modify role/guardrails/system prompt | MVP | P3.1.14
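The "old or new, never partial" rule in CE-F16 and CE-F18 can be illustrated with a build-then-swap pattern: the new snapshot is fully constructed before the session reference changes. This is a minimal in-process sketch; `Session`, `Snapshot`, and the `fetch_latest` callable (standing in for a read from the primary DB) are hypothetical names, and a real implementation would also need durable storage and authorization checks.

```python
import threading
from dataclasses import dataclass
from types import MappingProxyType
from typing import Callable, Mapping


@dataclass(frozen=True)
class Snapshot:
    version: int
    data: Mapping[str, str]  # read-only view; the session cannot mutate it


class Session:
    def __init__(self, fetch_latest: Callable[[], Mapping[str, str]]):
        self._fetch_latest = fetch_latest  # hypothetical primary-DB read
        self._lock = threading.Lock()
        self.snapshot = Snapshot(1, MappingProxyType(dict(fetch_latest())))

    def pull_context(self) -> str:
        # Build the complete new state first; if the fetch raises,
        # the current snapshot is left untouched.
        data = MappingProxyType(dict(self._fetch_latest()))
        with self._lock:
            old = self.snapshot
            self.snapshot = Snapshot(old.version + 1, data)  # single swap
            return f"Context pulled: snapshot v{old.version} -> v{self.snapshot.version}"
```

Copying into a fresh `dict` before wrapping it means later changes in the source store do not leak into a frozen snapshot, which matches the "child continues on its frozen snapshot" behavior in section 9.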

4.5 Context Version Control

ID | Requirement | Release | Parent PRD Item
CE-F20 | Auto-versioning on every context mutation (content-addressed SHA-256 hash, stored in context_versions table) | MVP | P3.2.1
CE-F21 | Named snapshots (before critical operations, per workflow step, manually triggered by L3+) | MVP | P3.2.2
CE-F22 | Context rollback (restore any previous version, L3+ required, recorded in audit log) | MVP | P3.2.4
CE-F23 | Full audit trail (who changed what, when, why -- linked to workflow step) | MVP | P3.2.7
CE-F24 | Context diffing (compare any two versions, show added/removed/modified entries) | V1.1 | P3.2.3
CE-F25 | Context branching (parallel exploration with different context) | V2 | P3.2.5
CE-F26 | Context merging (combine branches with conflict resolution, human review for conflicts) | V2 | P3.2.6
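Content-addressed versioning (CE-F20) with named snapshots (CE-F21) and rollback (CE-F22) can be sketched in memory as follows. The canonical-JSON serialization and the `VersionStore` class are assumptions for illustration; the PRD specifies only SHA-256 hashing and a `context_versions` table, and the clearance check (L3+) is omitted here.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


def content_hash(entries: dict) -> str:
    """SHA-256 over a canonical JSON form, so key order does not matter."""
    canonical = json.dumps(entries, sort_keys=True, separators=(",", ":"),
                           ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass
class VersionStore:
    versions: Dict[str, dict] = field(default_factory=dict)   # hash -> content
    history: List[str] = field(default_factory=list)          # ordered trail
    snapshots: Dict[str, str] = field(default_factory=dict)   # name -> hash

    def commit(self, entries: dict) -> str:
        h = content_hash(entries)
        # Store a deep copy so later caller-side mutation cannot alter history.
        self.versions.setdefault(h, json.loads(json.dumps(entries)))
        self.history.append(h)
        return h

    def snapshot(self, name: str) -> None:
        self.snapshots[name] = self.history[-1]

    def rollback(self, target: str) -> dict:
        """`target` is a snapshot name or a version hash."""
        h = self.snapshots.get(target, target)
        entries = self.versions[h]
        self.history.append(h)  # the rollback itself is recorded
        return entries
```

Because identical content always hashes to the same value, a rollback does not create a duplicate record; it appends the earlier hash to the history.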

4.6 Immutable Agent Context & Chain of Trust

ID | Requirement | Release | Parent PRD Item
CE-F27 | 4-layer enforcement: Platform Rules (hardcoded) > Tenant Policies (L5) > Team Guidelines (L3+) > Agent Persona (set at creation) | MVP | P3.3.1
CE-F28 | Layers injected as system prompt prefix before every LLM interaction (concatenated L1+L2+L3+L4) | MVP | P3.3.2
CE-F29 | Validation layer: blocks runtime override attempts via prompt injection detection | MVP | P3.3.3
CE-F30 | Violation logging and agent termination on immutable context violation | MVP | P3.3.4
CE-F31 | Self-context immutability: agents cannot modify own role definition, system prompt, guardrails, or clearance level | MVP | P3.3.8
CE-F32 | Hierarchical context editing: only parent agent or human can edit, downward only | MVP | P3.3.9
CE-F33 | Agent hierarchy definition: tree structure for chain of trust, configurable per tenant | MVP | P3.3.10
CE-F34 | Chain of trust enforcement: validates all context edits against hierarchy before applying | MVP | P3.3.11
CE-F35 | Guardrail definition format: structured rules with scope, condition, prohibited action, violation response | MVP | P3.3.7
CE-F36 | Blacklist-based action model: default allow, guardrails define what agents CANNOT do | MVP | P3.3.6
CE-F37 | Runtime guardrail enforcement at MCP transport layer [BOOTSTRAP] | MVP | P3.3.12
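CE-F35 and CE-F36 describe guardrails as structured rules (scope, condition, prohibited action, violation response) evaluated under a default-allow model. A minimal sketch of that shape is below; the field types, the `"*"` wildcard scope, and the response strings are assumptions, not a defined format.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Guardrail:
    scope: str                         # agent id, or "*" for all (assumption)
    condition: Callable[[dict], bool]  # evaluated against call context
    prohibited_action: str
    violation_response: str            # e.g. "block_and_log" (assumption)


def check_action(agent_id: str, action: str, context: dict,
                 guardrails: Iterable[Guardrail]) -> str:
    """Default allow (CE-F36): an action is denied only if a rule matches."""
    for rule in guardrails:
        in_scope = rule.scope in ("*", agent_id)
        if in_scope and rule.prohibited_action == action and rule.condition(context):
            return rule.violation_response
    return "allow"
```

In the PRD this check would run at the MCP transport layer (CE-F37), before a tool invocation reaches its target.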

4.7 RAG Knowledge Bases (Context Engine Integration Points)

ID | Requirement | Release | Parent PRD Item
CE-F38 | Vector store per knowledge domain (per tenant schema) | MVP | P3.5.1
CE-F39 | Domain knowledge RAG: tenant uploads documents, auto-chunked, embedded, indexed | MVP | P3.5.2
CE-F40 | Clearance-gated access: which agents can access which knowledge bases, per RBAC | MVP | P3.5.4
CE-F41 | MCP-exposed endpoints: agents access knowledge via MCP tools (search, get_context, find_similar) | MVP | P3.5.5
CE-F42 | Cross-tenant contamination prevention: RAG queries always scoped to tenant | MVP | P3.5.8
CE-F43 | Update triggers: webhooks, doc updates, configurable schedule | MVP | P3.5.6

4.8 Context-Aware Tool Loading

ID | Requirement | Release | Parent PRD Item
CE-F44 | Context-mode MCP sandboxing [BOOTSTRAP]: dynamically load/unload MCP tool definitions based on current task step. Only tools relevant to active operation injected into context window. Reduces idle tool definition context consumption. | MVP | P2.4.16

4.9 Context Efficiency Metrics

ID | Requirement | Release | Parent PRD Item
CE-F45 | Context efficiency score: tokens used vs. context window capacity, per agent, per session | V1.1 | P7.2.4
CE-F46 | Flag context overload: alert when agent approaching token limits inefficiently | V1.1 | P8.1.3
CE-F47 | Suggest better model/compaction/learning settings based on efficiency analysis | V1.1 | P8.1.2
CE-F54 | Context Engine savings metric: for every LLM call, record (naive_baseline_tokens, engine_assembled_tokens, savings_ratio). Aggregated per agent, per session, per tenant. Integrated with per-call cost tracking so savings are visible in billing dashboards and Platform Owner Console. | MVP | New (integrates P2.2.8, P7.1.1)
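The CE-F54 record and its per-agent aggregation could be computed as in the sketch below. The formula `1 - engine_assembled_tokens / naive_baseline_tokens` is an assumed reading of "savings ratio" consistent with the token reduction target in section 2; the `CallRecord` type and the token-weighted aggregation are likewise illustrative.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class CallRecord:
    agent_id: str
    naive_baseline_tokens: int
    engine_assembled_tokens: int

    @property
    def savings_ratio(self) -> float:
        """Share of baseline tokens avoided on this call (assumed definition)."""
        if self.naive_baseline_tokens == 0:
            return 0.0
        return 1.0 - self.engine_assembled_tokens / self.naive_baseline_tokens


def savings_by_agent(records: Iterable[CallRecord]) -> Dict[str, float]:
    """Token-weighted savings per agent, so large calls count proportionally."""
    naive: Dict[str, int] = defaultdict(int)
    assembled: Dict[str, int] = defaultdict(int)
    for r in records:
        naive[r.agent_id] += r.naive_baseline_tokens
        assembled[r.agent_id] += r.engine_assembled_tokens
    return {a: 1.0 - assembled[a] / naive[a] for a in naive if naive[a] > 0}
```

For example, a call that would have sent a 45,000-token PRD but was assembled into 18,000 tokens has a savings ratio of 0.6, which would meet the V1.1 target of >= 60% reduction.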

4.10 Embedding & Chunking Pipeline

This section makes the embedding pipeline a first-class sub-component of the Context Engine. Previously implicit in the RAG section, it is called out here because it is on the critical path for semantic retrieval quality and because embedding model selection has long-term migration consequences.

ID | Requirement | Release | Parent PRD Item
CE-F48 | Chunking service: token-aware chunking with configurable chunk size and overlap. Default: 512 tokens per chunk with 64-token overlap. Respects structural boundaries (headings, paragraphs, code blocks) when possible. Per-document-type strategies (markdown, PDF, code, transcript). | MVP | Implicit in P3.5.2
CE-F49 | Embedding service: unified interface for producing embeddings from text. Initial MVP provider: a single approved embedding model (e.g., text-embedding-3-small or equivalent open model) selected by the architect. Per-tenant allowlisting of embedding models in V1.1. | MVP | Implicit in P3.5.2
CE-F50 | Embedding cache: deduplicate embedding calls by content hash. If the same chunk content is embedded twice (same tenant or across tenants where content is non-sensitive platform data), reuse the cached embedding. | MVP | New (cost optimization)
CE-F51 | Embedding model versioning and dual-index migration: when the embedding model changes, maintain both old and new indices in parallel, gradually migrate queries to the new index, validate recall@10 parity, drop the old index only after validation. No query outage during migration. | V1.1 | Open Question #4 resolution
CE-F52 | Reindexing job: background job that reprocesses source documents when (a) chunking strategy changes, (b) embedding model is upgraded, (c) index corruption is detected. Per-tenant scoped, rate-limited, resumable. | MVP | New (operational necessity)
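The chunking defaults in CE-F48 (512 tokens, 64-token overlap) and the content-hash cache in CE-F50 can be sketched as below. The sketch works on an already-tokenized sequence and ignores structural boundaries; `embed_fn` is a hypothetical stand-in for whichever embedding model the architect selects. Keying the cache on `(tenant_id, hash)` is a conservative assumption that keeps entries tenant-scoped; sharing non-sensitive platform data across tenants would need a separate, explicit key.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple


def chunk_tokens(tokens: Sequence[int], chunk_size: int = 512,
                 overlap: int = 64) -> List[List[int]]:
    """Fixed-size windows where each chunk repeats the last `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks: List[List[int]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


class EmbeddingCache:
    def __init__(self, embed_fn: Callable[[str], List[float]]):
        self._embed_fn = embed_fn  # hypothetical embedding provider call
        self._cache: Dict[Tuple[str, str], List[float]] = {}
        self.misses = 0

    def embed(self, tenant_id: str, text: str) -> List[float]:
        key = (tenant_id, hashlib.sha256(text.encode("utf-8")).hexdigest())
        if key not in self._cache:
            self._cache[key] = self._embed_fn(text)
            self.misses += 1
        return self._cache[key]
```

With the defaults, a 1,000-token document yields three chunks (tokens 0-511, 448-959, and 896-999), so the last chunk may be shorter than `chunk_size`.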

4.11 Prompt Cache Exploitation (Token Optimization Multiplier)

ID | Requirement | Release | Parent PRD Item
CE-F53 | LLM prompt-cache exploitation: when the LLM router supports prompt caching (Anthropic, OpenAI), the Context Engine MUST structure every assembled prompt so that the stable prefix (L1 Platform Rules + L2 Tenant Policies + L3 Team Guidelines + L4 Agent Persona + pinned knowledge base chunks) is placed at the front of the prompt in a cache-friendly order. Unstable tail (current task, retrieved chunks, recent conversation) goes after the cacheable prefix. This is a major token-cost multiplier because immutable layers are injected on every call (per CE-F28) and are highly cacheable across calls in the same session. | MVP | New -- flagged to parent PRD in section 15
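The ordering rule in CE-F53 can be illustrated without any provider-specific API: build the stable prefix deterministically, keep everything volatile in the tail, and check that the prefix is byte-identical across calls. The function names, the `"\n\n"` separator, and sorting pinned chunks for determinism are assumptions of this sketch; how a given provider marks or detects the cacheable prefix is outside its scope.

```python
import hashlib
from typing import Dict, Sequence, Tuple

LAYER_ORDER = ("L1", "L2", "L3", "L4")  # platform, tenant, team, persona


def build_prompt(layers: Dict[str, str], pinned_chunks: Sequence[str],
                 retrieved_chunks: Sequence[str], conversation: Sequence[str],
                 task: str) -> Tuple[str, str]:
    """Return (stable_prefix, volatile_tail). A missing layer raises KeyError."""
    stable_parts = [layers[name] for name in LAYER_ORDER]
    # Sort pinned chunks so the prefix does not depend on retrieval order.
    stable_parts += sorted(pinned_chunks)
    tail_parts = [task, *retrieved_chunks, *conversation]
    return "\n\n".join(stable_parts), "\n\n".join(tail_parts)


def prefix_fingerprint(prefix: str) -> str:
    """Hash of the prefix; equal fingerprints mean a byte-identical prefix."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Logging the fingerprint per call is one cheap way to detect accidental prefix churn (for example, a timestamp leaking into a layer), which would silently defeat caching.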

4.12 Persistent Memory Versioning & Retention

These items were claimed as "covered" in the v1.0 appendix but were missing from the requirements tables. Added here for completeness.

ID | Requirement | Release | Parent PRD Item
CE-F55 | Persistent memory versioning with named snapshots (before major learning cycles, configurable per memory type) | MVP | P3.4.4
CE-F56 | Persistent memory rollback to any version (L4+ required, logged to audit trail) | V1.1 | P3.4.5
CE-F57 | Configurable memory retention policies (30/90/365 days/indefinite, per memory type, per tenant) | V1.1 | P3.4.6

4.13 Immutable Context Enforcement Integration

ID | Requirement | Release | Parent PRD Item
CE-F58 | Immutable context enforcement before every LLM call: the assembled prompt is validated against the 4-layer immutable context before dispatch to the LLM router. If validation fails (tampering, injection, layer missing), the call is blocked and logged as a security event. This is the Context Engine's concrete implementation of SEC.1.2, and subsumes CE-F29 (prompt injection detection). | MVP | SEC.1.2 + P3.3.3

5. Non-Functional Requirements

Category | Requirement | Target
Performance | Context assembly (retrieval + compaction + assembly) | < 500ms p95
Performance | Semantic search over 1M context entries | < 500ms p95 (V1.1 benchmark)
Performance | Compaction execution (single agent session) | < 2s for sessions up to 100K tokens
Scalability | Concurrent context assemblies | 1,000 concurrent (for 1,000 tenants)
Scalability | Vector store entries per tenant | 10M entries without degradation
Reliability | Context version integrity | Zero data loss on versioned context (content-addressed hashing)
Reliability | Push/pull atomicity | Operations are atomic -- never partial state
Security | Tenant isolation | Zero cross-tenant context leakage. Every query scoped to tenant_id.
Security | Immutable context enforcement | System prompt validated before every LLM call
Security | Self-modification prevention | All vectors blocked (DB, API, tool, LLM instruction)
Observability | Per-call token usage tracking | Every LLM call metered: model, input_tokens, output_tokens, context_engine_savings
Quality | Compaction information retention | >= 98% key fact retention (measured by automated eval)
Quality | Retrieval relevance | >= 85% recall@10 for relevant chunks

6. Scope

In Scope (MVP)

  • Context ingestion layer (messages, tool outputs, workflow events)
  • Sliding Window compaction with configurable thresholds
  • Smart retrieval with semantic search and recency weighting
  • Token budget management during context assembly
  • Context push/pull operations with authorization
  • Context versioning with content-addressed hashing
  • Named snapshots and rollback
  • 4-layer immutable agent context enforcement
  • Hierarchical context editing (chain of trust)
  • Self-modification prevention
  • Guardrail enforcement at MCP transport layer
  • RAG vector store with tenant isolation
  • Document ingestion, chunking, embedding pipeline
  • MCP-exposed context access endpoints
  • Context-mode MCP sandboxing (dynamic tool loading)
  • Per-agent context storage isolation

In Scope (V1.1)

  • Key-Fact Extraction compaction
  • Hierarchical Summary compaction
  • Task-Scoped compaction
  • Context processing (entity extraction, reference resolution, importance scoring)
  • Context diffing between versions
  • Context overload detection and alerting
  • Context efficiency score dashboard
  • Optimization recommendations engine
  • Specialized RAG domains (per-tenant configurable)
  • Industry knowledge base templates

Out of Scope (V2+)

  • Context branching (parallel exploration)
  • Context merging with conflict resolution
  • Cross-agent context sharing without explicit push/pull
  • Training embeddings on tenant data (privacy implications -- deferred)
  • S3-compatible backend for on-premise

Explicitly Out of Scope (Not This PRD)

  • Workflow engine (P4) -- consumes context, does not produce it
  • Chat interface (P6) -- presentation layer for context operations
  • Cost engine billing aggregation (P7) -- consumes context metrics
  • Agent executor process (P2.1) -- integrates with context engine but is a separate component
  • User-facing configuration UIs -- separate frontend PRD

7. Dependencies

Dependency | Direction | Description
Agent Runtime (P2.1) | Bidirectional | Context Engine provides context to agents; agents produce context events. Agent session initialization (P2.1.16) fetches context from this engine.
Config DB (P4.6) | Inbound | Compaction strategies, thresholds, guardrail definitions stored in config DB
LLM Router (P2.2) | Outbound | Compaction uses LLM calls (summarization). Context Engine must route these via the LLM router.
MCP Framework (P2.4) | Bidirectional | Context Engine exposes MCP endpoints; MCP transport layer enforces guardrails
RBAC / Clearance Engine (P4.5) | Inbound | Authorization for push/pull/rollback operations checked against RBAC
Metering (P2.1.19) | Outbound | Context Engine emits token usage and efficiency metrics to metering pipeline
Vector Store Infrastructure (P1.2.9) | Inbound | Provisioning prerequisite: pgvector extension enabled on the primary relational DB with per-tenant schema isolation. Must exist before Context Engine deployment.
Object Storage (P1.2.8) | Inbound | Persistent agent memory stored in object storage (per-tenant prefix, per-agent subfolder, versioning enabled).
Cache (P1.2.7) | Inbound | Hot context cached for fast retrieval. Cache never source-of-truth for pull operations.
Embedding Model Selection | Inbound | Architect decision required before MVP build: which embedding model is the MVP default (text-embedding-3-small, bge-large, or equivalent). Model choice affects recall targets and migration cost.
Compaction LLM Routing (P2.2) | Outbound | Compaction uses LLM calls for summarization. Architect must decide: dedicated lightweight model vs. agent's configured model. See Open Question #3.
Security (SEC.1.1, SEC.1.2) | Inbound/Outbound | Input sanitization (SEC.1.1) runs before context assembly; immutable context enforcement (SEC.1.2) runs after assembly, before LLM dispatch. Context Engine owns SEC.1.2 implementation.
Per-call Cost Tracking (P2.2.8, P7.1.1) | Outbound | Every LLM call emits tokens + engine savings delta to metering pipeline.

8. Architecture Context

The Context Engine is a core platform service in UpSquad's architecture:

+------------------------------------------------------------------+
| Agent Runtime: Executor | LLM Router | Tool Registry (MCP) |
| | | | |
| v v v |
+------------------------------------------------------------------+
| >>>>>> CONTEXT ENGINE <<<<<< |
| Context Assembly | Compaction | Smart Retrieval | Versioning |
| Immutable Context Enforcement | Guardrail Engine |
| Push/Pull Manager | RAG Integration | Token Budget Manager |
+------------------------------------------------------------------+
| | | | |
| v v v |
+------------------------------------------------------------------+
| Data Layer: Relational DB + Vector Store | Cache | Object Storage |
+------------------------------------------------------------------+

Every LLM call flows through the Context Engine. The engine sits between the agent runtime (which decides what to do) and the data layer (which stores everything). It is the "lens" that focuses the right information into each LLM call.

8.1 Context Engine API Surface (Interface Contract)

The Context Engine exposes the following logical operations to the agent runtime. This is a product-level contract; the architect owns the concrete gRPC/internal API design.

Operation | Caller | Purpose
AssembleContext(session_id, task_description, token_budget) | Agent runtime (before every LLM call) | Returns the assembled prompt (immutable layers + compacted conversation + retrieved chunks + current task), guaranteed to fit within token_budget and to satisfy quality guardrails.
IngestEvent(session_id, event) | Agent runtime (after every tool call or message) | Appends a context event to the per-agent store with auto-versioning.
Compact(session_id, strategy?) | Agent runtime (on threshold breach) | Applies the configured compaction strategy. Returns new version hash.
PushContext(agent_id, knowledge) | Agent runtime (on /push-context) | Persists session-derived knowledge to the agent's memory namespace. Authorization-checked.
PullContext(agent_id) | Agent runtime (on /pull-context or critical-update signal) | Returns the latest context snapshot (role definition, guardrails, config, memory). Creates a new immutable session snapshot atomically.
RollbackContext(context_id, target_version) | User (L3+) or platform operator | Restores a previous version. Audit-logged.
ValidatePrompt(assembled_prompt) | Context Engine (internal, before dispatch) | Validates the assembled prompt against the 4-layer immutable context and SEC.1.2. Blocks on violation.
EmitMetrics(session_id, call_metrics) | Context Engine (after every LLM call) | Emits naive_baseline_tokens, engine_assembled_tokens, savings_ratio, latency, quality-eval score to metering.
Search(tenant_id, query, domain?, top_k?) | Agent runtime (via MCP tool) | Semantic search against RAG knowledge bases. Tenant-scoped.
Embed(text) | Internal (chunking pipeline, search) | Produces an embedding using the configured model. Cached by content hash.

All operations are tenant-scoped. All operations return structured errors with stable error codes. All operations are metered.


9. Edge Cases & Failure Modes

Scenario | Expected Behavior
Compaction LLM call fails (provider down) | Fall back to truncation strategy (keep recent N messages, drop oldest). Log degraded-mode event. Never block the agent.
Semantic search returns zero results | Fall back to recency-based retrieval (most recent context entries). Log low-relevance event.
Context push fails mid-write | Atomic write -- either fully committed or fully rolled back. Agent retries on next push.
Context pull returns stale data (cache lag) | Pull always reads from primary DB, never cache. Cache is only for assembly-time reads.
Agent attempts to modify own system prompt | Blocked immediately. Logged as security event. Optional session termination per configuration.
Parent agent edits child context while child is mid-session | Child continues on its frozen snapshot. New context takes effect on next session or explicit /pull-context.
Two agents push to same memory namespace simultaneously | Optimistic concurrency with version check. Second push fails with conflict, must retry with latest version.
Compaction produces a summary that loses a critical fact | Quality eval catches this in V1.1. MVP mitigation: compaction always preserves structured key-facts alongside summary. Human review available for critical workflows.
Vector store index corruption | Automatic reindex from source documents. Alert platform operator. Serve from raw document search (degraded mode) during reindex.
Context exceeds model max tokens even after compaction | Hard truncation with priority ordering: (1) immutable layers, (2) current task context, (3) relevant retrieved chunks, (4) recent conversation, (5) older summaries. Never exceed model limit.
Cross-tenant query attempted | Rejected at query layer. Every context query requires tenant_id. Queries without tenant_id are invalid. Automated isolation tests verify this.
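The hard-truncation rule for context that still exceeds the model limit after compaction can be sketched as a priority-ordered fill. Section names and the `(text, tokens)` item shape are illustrative assumptions. Raising an error when the immutable layers alone do not fit is also an assumption: the PRD does not say what should happen in that case, but dropping them would conflict with CE-F28.

```python
from typing import Dict, List, Tuple

# Priority order taken from the edge-case table: highest priority first.
PRIORITY = (
    "immutable_layers",
    "current_task",
    "retrieved_chunks",
    "recent_conversation",
    "older_summaries",
)


def hard_truncate(sections: Dict[str, List[Tuple[str, int]]],
                  model_max_tokens: int) -> Dict[str, List[str]]:
    """Keep items in priority order until the model limit is reached."""
    immutable_cost = sum(tokens for _, tokens in sections.get("immutable_layers", []))
    if immutable_cost > model_max_tokens:
        # Assumption: immutable layers are never dropped, so fail loudly.
        raise ValueError("immutable layers alone exceed the model limit")
    kept: Dict[str, List[str]] = {name: [] for name in PRIORITY}
    remaining = model_max_tokens
    for name in PRIORITY:
        for text, tokens in sections.get(name, []):
            if tokens <= remaining:
                kept[name].append(text)
                remaining -= tokens
    return kept
```

Items within a section are taken in the order given, so callers would pass retrieved chunks sorted by relevance and conversation sorted newest first.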

10. Open Questions

# | Question | Impact | Proposed Default
1 | Should compaction summaries be stored alongside originals, or replace them? | Storage cost vs. audit trail | Store alongside (originals are append-only, summaries are derived views)
2 | What is the right default relevance threshold for semantic search inclusion? | Quality vs. token savings | 0.7 cosine similarity, tunable per agent
3 | Should the Context Engine have its own dedicated LLM for compaction, or share the agent's configured model? | Cost and quality | Dedicated lightweight model for compaction (cheaper, faster); agent's model for task work
4 | How should the system handle embedding model upgrades (existing vectors become incompatible)? | Migration complexity | Dual-index strategy: old + new index, gradual migration, drop old after validation
5 | Should context efficiency metrics count immutable layer tokens as "overhead" or "useful"? | Metric accuracy | Count as "required overhead" -- separate from "task context efficiency"

11. Pricing Tier Mapping

| Capability | Free | Pro | Enterprise |
|------------|------|-----|------------|
| Sliding Window compaction | Yes | Yes | Yes |
| Smart retrieval (semantic search) | Basic (top-5) | Full (top-20, tunable) | Full + custom models |
| Context versioning | Last 10 versions | Last 100 versions | Unlimited |
| Named snapshots | 3 per agent | 50 per agent | Unlimited |
| Context rollback | No | Yes (L3+) | Yes (L3+) |
| RAG knowledge bases | 1 domain, 1 GB | 10 domains, 50 GB | Unlimited |
| Context efficiency dashboard | Basic | Full | Full + alerts + recommendations |
| Custom compaction strategies | No | No | Yes |
| Guardrail customization | Platform defaults only | Team-level | Full 4-layer |
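
One way the numeric limits in this table might be encoded for enforcement is sketched below. The `TierLimits` type, its field names, and the use of `None` for "Unlimited" are assumptions for illustration. The Enterprise `retrieval_top_k` of 20 is also an assumption, since the table only says "Full + custom models".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierLimits:
    retrieval_top_k: int
    max_versions: Optional[int]         # None means unlimited
    snapshots_per_agent: Optional[int]  # None means unlimited
    rollback: bool
    rag_domains: Optional[int]          # None means unlimited
    rag_storage_gb: Optional[int]       # None means unlimited


# Values copied from the pricing table, except where noted as assumed.
TIERS = {
    "free": TierLimits(5, 10, 3, False, 1, 1),
    "pro": TierLimits(20, 100, 50, True, 10, 50),
    # retrieval_top_k=20 is assumed; the table does not give a number here.
    "enterprise": TierLimits(20, None, None, True, None, None),
}


def can_create_snapshot(tier, existing_snapshots):
    """True if an agent on this tier may create one more named snapshot."""
    limit = TIERS[tier].snapshots_per_agent
    return limit is None or existing_snapshots < limit
```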

12. MVP vs V1.1 Summary

MVP (Build First -- Agents Use Immediately)

38 functional requirements. These provide the core context optimization loop:

  1. Ingest context events (messages, tool outputs, workflow events)
  2. Compact when approaching token limits (sliding window, with quality guardrails)
  3. Retrieve semantically relevant context for each LLM call (with quality guardrails)
  4. Manage token budget -- never exceed model limits, prioritize relevance
  5. Version every mutation with content-addressed hashing
  6. Enforce immutability -- 4-layer context, chain of trust, self-modification prevention, SEC.1.2 validation before every LLM call
  7. Push/Pull for controlled context updates across sessions
  8. Integrate with RAG for knowledge retrieval
  9. Sandbox MCP tools -- only load relevant tools per task step
  10. Embed and chunk source documents via the dedicated embedding pipeline with content-hash caching
  11. Exploit LLM prompt caches -- structure prompts so immutable prefixes are cache-friendly, unstable tails come last
  12. Measure savings -- emit naive-baseline vs. engine-assembled token deltas per call into metering
  13. Version persistent memory with named snapshots (P3.4.4)
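
Item 5 in the list above (content-addressed versioning) can be illustrated with a small sketch. It assumes SHA-256 over a canonical JSON encoding and chains each version to its parent, similar in spirit to Git commits. The `VersionLog` class, the inclusion of the parent hash, and the JSON canonicalization are illustrative choices, not requirements stated in this PRD.

```python
import hashlib
import json
from typing import Optional


def content_hash(payload: dict, parent: Optional[str]) -> str:
    """Hash the payload together with its parent version id.

    Sorting keys and fixing separators gives a canonical encoding, so
    logically equal payloads produce the same hash.
    """
    canonical = json.dumps(
        {"parent": parent, "payload": payload},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class VersionLog:
    """Append-only, in-memory version history (hypothetical helper)."""

    def __init__(self):
        self._versions = {}  # version id -> (parent id, payload)
        self.head = None     # id of the latest version, None when empty

    def commit(self, payload: dict) -> str:
        version_id = content_hash(payload, self.head)
        self._versions[version_id] = (self.head, payload)
        self.head = version_id
        return version_id

    def get(self, version_id: str) -> dict:
        return self._versions[version_id][1]
```

Because the parent id is part of the hashed content, committing the same payload twice yields two distinct version ids, which keeps the history linear and tamper-evident.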

V1.1 (Enhanced Optimization)

14 functional requirements adding advanced compaction strategies, context processing, efficiency metrics, optimization recommendations, embedding model migration, memory rollback, and retention policies.


13. Dogfooding Validation Plan

The Context Engine will be validated by UpSquad's own agent team before any customer deployment:

| Validation Step | How |
|-----------------|-----|
| Token reduction measurement | Compare token usage for identical tasks (e.g., PRD analysis) with and without Context Engine |
| Quality parity check | Human blind review: does the agent produce equal or better output with Context Engine? |
| Compaction accuracy | Run 50 compaction operations, verify key-fact retention via automated eval |
| Semantic retrieval relevance | Query test suite: 100 queries against known-relevant documents, measure recall@10 |
| Push/pull correctness | Agent team performs real push/pull operations during development workflow |
| Immutability enforcement | Red-team test: attempt self-modification via all vectors, verify all are blocked |
| Cross-tenant isolation | Automated test: create two tenants, verify zero data leakage across all operations |
| Failure mode resilience | Inject failures (LLM down, DB slow, cache miss) and verify graceful degradation |
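
For the semantic retrieval relevance step, recall@10 could be computed along the following lines. The function names and the input shape (a ranked list of retrieved ids plus a set of known-relevant ids per query) are assumptions for this sketch.

```python
def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the known-relevant documents found in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each query needs at least one known-relevant document")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant) / len(relevant)


def mean_recall_at_k(results, k=10):
    """Average recall@k over a test suite.

    results is a list of (retrieved_ids, relevant_ids) pairs, one per query.
    """
    if not results:
        raise ValueError("the test suite is empty")
    return sum(recall_at_k(ret, rel, k) for ret, rel in results) / len(results)
```

With the 100-query suite described in the table, `results` would hold 100 pairs, and `mean_recall_at_k(results)` would give the single number to track across engine changes.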

14. Appendix: Parent PRD Item Cross-Reference

Every functional requirement in this PRD traces back to the parent PRD (UpSquad Complete PRD v1.6) or is a net-new item flagged for upstream inclusion (see section 15).

Fully covered by this subset PRD:

  • P3.1.1 through P3.1.14 (Context Engine section) -- all items, MVP and V1.1
  • P3.2.1 through P3.2.7 (Context Version Control) -- all items, MVP through V2
  • P3.3.1 through P3.3.12 (Immutable Agent Context) -- all items
  • P3.4.1, P3.4.2, P3.4.3 (Persistent Agent Memory -- storage, types, isolation) -- MVP
  • P3.4.4 (Persistent memory versioning with named snapshots) -- MVP, added in v1.1 of this PRD (CE-F55)
  • P3.4.5 (Persistent memory rollback L4+) -- V1.1, added in v1.1 of this PRD (CE-F56)
  • P3.4.6 (Memory retention policies) -- V1.1, added in v1.1 of this PRD (CE-F57)
  • P3.5.1, P3.5.2, P3.5.4, P3.5.5, P3.5.6, P3.5.8, P3.5.9 (RAG Knowledge Bases, MVP items)
  • P2.4.16 (Context-mode MCP sandboxing)
  • P7.2.4 (Context efficiency score)
  • P8.1.2, P8.1.3 (Optimization suggestions, context overload flagging)
  • SEC.1.2 (Immutable context enforcement before every LLM call) -- owned here as CE-F58

Deferred to V2 (out of scope for this PRD's MVP):

  • P3.4.7 (S3-compatible backend for on-premise)
  • P3.2.5, P3.2.6 (Context branching and merging)
  • P3.5.3, P3.5.7, P3.5.10 (Specialized RAG domains, dedup, industry templates)

Items from the parent PRD that interact with but are not owned by this PRD:

  • P2.1.16 (Agent session initialization) -- consumes context engine, owned by Agent Runtime
  • P2.2.x (LLM Router) -- used by compaction, owned by Agent Runtime
  • P2.2.8 (Per-call cost tracking) -- the engine emits savings deltas into this pipeline
  • P4.5.x (RBAC) -- authorization provider, owned by Governance
  • P6.1.10 (Slash commands /push-context, /pull-context) -- UI layer, owned by Chat Interface
  • P7.1.x (Cost tracking) -- consumes context metrics, owned by Cost Engine
  • SEC.1.1 (Input sanitization) -- runs before context assembly, owned by Security Hardening

15. Upstream PRD Gaps (Flagged for Parent PRD v1.7)

During re-verification of this subset PRD against the parent PRD, the following gaps were discovered in the parent PRD itself. The Product Manager will address these in a follow-up update to UpSquad Complete PRD.

| # | Gap | Proposed Parent PRD Item | Priority |
|---|-----|--------------------------|----------|
| 1 | LLM prompt-cache exploitation is a major token-reduction multiplier not explicitly called out in P3.1 or P2.2. Immutable layers L1+L2+L3+L4 (per P3.3.2) are injected on every LLM call -- they should be structured in a cache-friendly order so that Anthropic/OpenAI/Gemini prompt caches can elide them. This can reduce effective input tokens by 70-90% for subsequent calls in a session. | New P3.1.15 -- "LLM prompt-cache exploitation: assembled prompts must place immutable prefix (L1-L4 + pinned chunks) at the front in cache-friendly order; unstable tail (task, retrieved chunks, recent conversation) follows. Router coordinates with providers that support prompt caching." | MVP [BOOTSTRAP] |
| 2 | Context Engine savings metric -- no explicit requirement in P7.2 for tracking the delta between naive-baseline tokens and engine-assembled tokens. Without this metric, we cannot prove the engine is working or justify its complexity. | New P7.2.5 -- "Context Engine savings metric: per LLM call, record (naive_baseline_tokens, engine_assembled_tokens, savings_ratio, quality_score). Aggregated and dashboarded per agent, session, tenant." | MVP |
| 3 | Embedding pipeline as first-class sub-component -- parent PRD mentions "auto-chunked, embedded, and indexed" in P3.5.2 but does not enumerate chunking strategy, embedding service, embedding cache, or embedding model migration as distinct items. | New P3.5.11 through P3.5.14 covering chunking service, embedding service, embedding cache, embedding model migration | MVP / V1.1 |
| 4 | Compaction quality retention target -- no non-functional requirement specifying minimum key-fact retention for compaction. Without this target, compaction quality is unverifiable. | New P3.1.16 -- "Compaction quality: >= 98% key-fact retention measured by automated eval suite. Compaction strategies must not silently drop decisions, commitments, constraints, or open questions." | MVP |
| 5 | Pinned context -- no explicit concept of "pinned" chunks that are always included in assembly regardless of semantic score. Without this, a low-similarity but critical constraint can be excluded. | New P3.1.17 -- "Pinned context: any chunk tagged as a hard constraint (decisions, commitments, immutable facts) is always included in assembly regardless of relevance score." | MVP |
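
Gaps 1 and 5 both concern how the prompt is assembled, and the proposed ordering can be sketched as below. This is a minimal illustration under stated assumptions: the function names, the `layers` dict keyed by "L1" through "L4", and the decision to sort pinned chunks are hypothetical. No provider API is called. The sketch only shows how a stable prefix can be kept byte-identical across calls, which is what prefix-based prompt caches generally depend on.

```python
import hashlib

LAYER_ORDER = ("L1", "L2", "L3", "L4")  # immutable layers, always in this order


def assemble_prompt(layers, pinned_chunks, task, retrieved_chunks, recent_conversation):
    """Return (stable_prefix, unstable_tail) for one LLM call.

    Pinned chunks are always included, regardless of relevance score, and are
    sorted so the prefix does not change when their input order changes.
    """
    prefix_parts = [layers[name] for name in LAYER_ORDER]
    prefix_parts += sorted(pinned_chunks)
    tail_parts = [task, *retrieved_chunks, *recent_conversation]
    return "\n\n".join(prefix_parts), "\n\n".join(tail_parts)


def prefix_fingerprint(prefix):
    """Hash of the prefix, useful for checking that it stayed stable."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()
```

Comparing `prefix_fingerprint` values across calls in a session is a cheap way to detect accidental prefix churn, such as a timestamp leaking into an immutable layer, that would defeat caching.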

The five gaps listed in this section will be added to the parent PRD in a follow-up update that bumps the UpSquad Complete PRD to v1.7.
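
As an illustration of the savings metric proposed in gap 2, the sketch below records the four fields named in the proposed P7.2.5 text. The definition of `savings_ratio` as 1 minus engine tokens over baseline tokens and the token-weighted aggregation are assumptions, since this PRD does not fix a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SavingsRecord:
    naive_baseline_tokens: int
    engine_assembled_tokens: int
    quality_score: float

    @property
    def savings_ratio(self):
        """Assumed definition: share of baseline tokens the engine avoided."""
        if self.naive_baseline_tokens <= 0:
            return 0.0
        return 1 - self.engine_assembled_tokens / self.naive_baseline_tokens


def aggregate_savings_ratio(records):
    """Token-weighted savings ratio across calls (per agent, session, or tenant)."""
    baseline = sum(r.naive_baseline_tokens for r in records)
    assembled = sum(r.engine_assembled_tokens for r in records)
    if baseline <= 0:
        return 0.0
    return 1 - assembled / baseline
```

Weighting by tokens rather than averaging per-call ratios keeps a handful of tiny calls from dominating the dashboard figure.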