# Maestro

**A continuation-oriented agent runtime for continuous research work.**

> Maestro is the product. BarnOS / Barnyard is the substrate it runs on. This document defines both — the runtime semantics, the data models, the decision functions, and a 90-day build plan targeting Azure AI Foundry and AWS (EKS + Redshift + Bedrock).
>
> First customers: **Cargill researchers and dairy nutrition consultants** working with ANH Digital Platform. Their work is long-running, multi-source, partially-supervised, and genuinely continuous — the exact shape of work the current MCP-plus-graphs stack handles poorly.

---

## 0. Why This Document Exists

The industry converged in 2026 on a stack that looks like: graph orchestrator (LangGraph / Semantic Kernel) + MCP tools + layered memory (vector / episodic / structured / object) + observability. That stack is good enough for chat-shaped work. It is *not* good enough for research-shaped work, where:

- jobs run for weeks, not seconds
- data arrives continuously and asynchronously
- humans intervene irregularly
- cost discipline matters more than latency
- the "answer" is a body of artifacts, not a chat reply

Maestro re-bases the architecture on three primitives the current stack lacks: **continuations** (resumable, migratable reasoning processes), **context fields** (continuously-projected relevance surfaces), and a **policy kernel** (a deterministic control plane for model/budget decisions). MCP and graphs remain — at the edges, as wire protocols — but they are no longer the center.

---

## 1. The Continuation Store

### 1.1 What a Continuation Is

A **Continuation** is a serializable, resumable, migratable unit of reasoning work. It is the atomic primitive of Maestro. Where current frameworks make `model.invoke()` the unit of work, Maestro makes the continuation the unit, and a model call is just one of several things a continuation can do during a tick.

A continuation has:

- **identity** — a stable ID and lineage (parent, root, generation)
- **goal frame** — a structured representation of what it is trying to accomplish
- **state** — the current reasoning state, including a working draft, open questions, and accumulated findings
- **budget vector** — token, dollar, wall-clock, tool-quota, and human-attention allowances
- **wake conditions** — predicates that determine when it should next be scheduled
- **context binding** — a pointer to its context field projection (see §2)
- **lineage edges** — parent, children, siblings, and merge targets
- **status** — one of `running`, `sleeping`, `waiting`, `blocked`, `merged`, `done`, `killed`, `failed`
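A minimal Python sketch of the record (illustrative only; the persisted shape is the logical schema in §1.2, and lineage edges live in their own table rather than on the object):

```python
# Illustrative sketch of a continuation record; mirrors the §1.2 schema.
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Any, Optional
from uuid import UUID

class Status(str, Enum):
    RUNNING = "running"
    SLEEPING = "sleeping"
    WAITING = "waiting"
    BLOCKED = "blocked"
    MERGED = "merged"
    DONE = "done"
    KILLED = "killed"
    FAILED = "failed"

@dataclass
class Continuation:
    continuation_id: UUID
    root_id: UUID                      # root of the lineage tree
    parent_id: Optional[UUID]          # None for a root continuation
    generation: int                    # depth from root
    goal_frame: dict[str, Any]         # structured goal, §1.3
    state: dict[str, Any]              # working draft, open questions, findings
    budget: dict[str, Any]             # budget vector, §1.4
    wake_conditions: dict[str, Any]    # wake predicates, §1.5
    context_field_id: UUID             # binding to its context field, §2
    status: Status = Status.RUNNING
    next_wake_at: Optional[datetime] = None
```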

### 1.2 Data Model

The continuation store is **two-tier**: a hot store for active state (Redis on Azure / ElastiCache on AWS, or DynamoDB / Cosmos DB for durable hot state) and a warm store for analytical and audit work (Redshift on AWS, Synapse on Azure). The hot store is the source of truth at runtime; the warm store receives a change-data-capture stream and is the source of truth for analytics, replay, and billing.

#### Hot store schema (logical)

```
continuation
  continuation_id          UUID, PK
  root_id                  UUID                # root of the lineage tree
  parent_id                UUID NULL
  generation               INT                 # depth from root
  tenant_id                UUID                # Cargill, ANH, consultant org
  researcher_id            UUID                # human owner
  status                   ENUM
  created_at               TIMESTAMP
  updated_at               TIMESTAMP
  next_wake_at             TIMESTAMP NULL      # earliest scheduled wake
  goal_frame               JSONB               # see §1.3
  state                    JSONB               # working memory snapshot
  budget                   JSONB               # see §1.4
  wake_conditions          JSONB               # see §1.5
  context_field_id         UUID                # FK to context field
  policy_class             TEXT                # routing hint for policy kernel
  tags                     TEXT[]              # for grouping/queries

continuation_event           # append-only event log per continuation
  event_id                 UUID, PK
  continuation_id          UUID, FK
  ts                       TIMESTAMP
  kind                     ENUM (spawn, wake, model_call, tool_call,
                                 publish, fork, merge, sleep, kill,
                                 budget_charge, human_signal, error)
  payload                  JSONB
  cost                     JSONB               # tokens, $, wall_ms

continuation_edge            # lineage and dependency graph
  from_id                  UUID
  to_id                    UUID
  kind                     ENUM (parent, fork, merge, depends_on,
                                 publishes_to, subscribes_to)
  created_at               TIMESTAMP

continuation_lease           # cooperative locking for migration
  continuation_id          UUID, PK
  worker_id                TEXT
  lease_until              TIMESTAMP
  generation               INT                 # fencing token
```

The event log is the most important table: every state change of a continuation is an immutable event, which gives us **time-travel debugging, replay, and audit for free**, and is the canonical input to the warm store.

#### Warm store (Redshift) schema

Redshift receives a CDC stream from the hot store and materializes:

- `fact_continuation_step` — one row per event, with cost, latency, and model attribution
- `fact_continuation_lifecycle` — one row per continuation with start/end, total cost, outcome
- `dim_researcher`, `dim_tenant`, `dim_model`, `dim_tool`, `dim_goal_class`
- `agg_daily_cost_by_tenant` — for chargeback and budget alerting
- `agg_policy_decisions` — for offline policy training (§3.4)

This is what powers the dashboards and the chargeback, and — critically — what supplies the offline training data for the policy kernel.

### 1.3 The Goal Frame

The goal frame is the most important field on a continuation, because both the context field and the policy kernel key off it. It is **not** a prompt. It is structured:

```json
{
  "intent": "evaluate_feed_additive_effect",
  "subject": {
    "species": "dairy_cow",
    "lactation_stage": "early",
    "cohort_size_range": [60, 200]
  },
  "question": "Does {additive} at {dose_range} improve {outcome} vs control?",
  "bindings": {
    "additive": "monensin",
    "dose_range_mg_per_day": [200, 400],
    "outcome": "milk_yield_kg_per_day"
  },
  "evidence_requirements": {
    "study_types": ["rct", "field_trial"],
    "min_n_per_arm": 30,
    "min_recency_years": 10,
    "geographic_scope": ["NA", "EU"]
  },
  "deliverable": "structured_evidence_brief",
  "success_predicate": "evidence_brief.confidence >= 0.7 AND human_signoff.received"
}
```

The goal frame is **the contract** between the researcher and Maestro. It is also what fingerprints the continuation for the context field — semantically similar goal frames share context.
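The exact-match half of that fingerprint is cheap to compute. A sketch of the `goal_frame_hash` used later as a reranker cache key (§2.5), assuming canonical-JSON hashing; the semantic half would embed the same canonical text:

```python
import hashlib
import json

def goal_frame_hash(goal_frame: dict) -> str:
    """Stable fingerprint: canonical JSON (sorted keys, no whitespace), SHA-256."""
    canonical = json.dumps(goal_frame, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```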

### 1.4 The Budget Vector

```json
{
  "tokens": {"opus_class": 250000, "sonnet_class": 2000000, "haiku_class": 20000000},
  "dollars": {"hard_cap": 50.00, "soft_cap": 35.00, "spent": 12.40},
  "wall_clock": {"deadline": "2026-06-01T00:00:00Z", "active_seconds_cap": 7200},
  "tool_quotas": {"web_search": 200, "literature_api": 500, "internal_db": 5000},
  "human_attention": {"interrupts_allowed": 3, "interrupts_used": 0}
}
```

Budgets are **enforced by the policy kernel** at every tick (§3). When `soft_cap` is hit, the kernel down-shifts to cheaper models; when `hard_cap` is hit, the continuation is forced to publish whatever it has and terminate.
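A sketch of those cap semantics at tick time (the `publish_partial` flag and the tier list are illustrative; the real decision shape is in §3.2):

```python
TIERS = ["haiku", "sonnet", "opus"]   # cheapest first

def downshift(route: str, by: int = 1) -> str:
    """Step toward the cheapest tier, clamping at the bottom."""
    return TIERS[max(0, TIERS.index(route) - by)]

def apply_dollar_caps(budget: dict, decision: dict) -> dict:
    """Soft cap: keep working, on cheaper models. Hard cap: publish and stop."""
    dollars = budget["dollars"]
    if dollars["spent"] >= dollars["hard_cap"]:
        decision["terminate"] = True
        decision["publish_partial"] = True    # force-publish whatever exists
    elif dollars["spent"] >= dollars["soft_cap"]:
        decision["route"] = downshift(decision["route"])
    return decision
```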

### 1.5 Wake Conditions

A sleeping continuation specifies *why* it should be woken:

```json
{
  "any_of": [
    {"kind": "timer", "at": "2026-05-20T09:00:00Z"},
    {"kind": "event", "stream": "arxiv.q-bio", "predicate": "matches_goal_frame"},
    {"kind": "human_signal", "from": "researcher_id", "topic": "approval"},
    {"kind": "sibling_publish", "root_id": "...", "tag": "finding"},
    {"kind": "data_arrival", "source": "cargill.feed_trial_2026_q2"}
  ]
}
```

This is what makes Maestro *continuous*: a continuation can sleep for three weeks waiting on an arXiv RSS event, wake for five seconds of work, and go back to sleep — without anyone "running" it.
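A minimal evaluator for the `any_of` shape, assuming signals arrive as normalized events; the timer check is complete, while the other kinds match on their indexed fields before any stored predicate runs (stubbed here):

```python
from datetime import datetime, timezone

def should_wake(wake_conditions: dict, signal: dict) -> bool:
    """True if the clock or an incoming signal satisfies any wake condition."""
    now = datetime.now(timezone.utc)
    for cond in wake_conditions.get("any_of", []):
        if cond["kind"] == "timer":
            if now >= datetime.fromisoformat(cond["at"].replace("Z", "+00:00")):
                return True
        elif cond["kind"] == signal.get("kind"):
            # event / human_signal / sibling_publish / data_arrival:
            # every indexed field the condition names must match the signal.
            keys = (set(cond) & set(signal)) - {"kind", "predicate"}
            if all(cond[k] == signal[k] for k in keys):
                return True   # a real matcher would now run cond["predicate"]
    return False
```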

### 1.6 Migration and Failure Semantics

Continuations are designed to be migratable. The `continuation_lease` table implements cooperative locking with fencing tokens (the `generation` field). The contract:

1. A worker acquires a lease on a continuation by writing `(worker_id, lease_until, generation+1)`.
2. The worker may execute the continuation only while its lease is valid.
3. Any state mutation must include the fencing `generation` — stale workers' writes are rejected.
4. On worker death, the lease expires and another worker can pick it up.
5. **Idempotency is enforced at the event-log level**: every event carries a `(continuation_id, generation, sequence)` triple, and duplicate writes are no-ops.

This is the Erlang/Temporal pattern, adapted for LLM-driven work. We do not attempt to migrate *mid-model-call* — model calls are checkpoint boundaries.
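A sketch of the lease half of that contract, against an assumed atomic `compare_and_set` (DynamoDB's `ConditionExpression` or a Redis Lua script in practice) and an assumed generation-checked append:

```python
import time
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 30

@dataclass
class Lease:
    worker_id: str
    lease_until: float    # epoch seconds
    generation: int       # fencing token, monotonically increasing

def try_acquire(store, continuation_id: str, worker_id: str) -> Optional[Lease]:
    """Take the lease iff it is free or expired, bumping the fencing token."""
    current = store.get(continuation_id)          # Optional[Lease]
    now = time.time()
    if current is not None and current.lease_until > now:
        return None                               # another worker holds it
    new = Lease(worker_id, now + LEASE_SECONDS,
                (current.generation if current else 0) + 1)
    ok = store.compare_and_set(continuation_id, expected=current, new=new)
    return new if ok else None

def write_event(store, continuation_id: str, lease: Lease, event: dict) -> bool:
    """Mutations carry the fencing token; the store rejects stale generations."""
    return store.append_if_generation(continuation_id, lease.generation, event)
```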

---

## 2. The Context Field

### 2.1 What a Context Field Is

A **context field** is a continuously-maintained, ranked projection of all available memory sources, scored against a continuation's goal frame. The continuation does not "query" memory — it *reads the top of its field*, which is always fresh.

Think of it as a **streaming materialized view** over your entire memory hierarchy, with the goal frame as the join key and a small ranking model as the projection function.

### 2.2 Sources Flowing Into a Field

A field has multiple **source channels**, each producing scored candidates:

| Channel | Underlying store | Latency target | Candidate type |
|---|---|---|---|
| `semantic` | Qdrant / OpenSearch / Azure AI Search | < 50 ms | document chunks |
| `structured` | Postgres / Cosmos DB | < 20 ms | facts, rows, entities |
| `episodic` | Continuation event log | < 30 ms | prior decisions, findings |
| `graph` | Neo4j / Neptune | < 100 ms | entity-relation paths |
| `stream` | NATS / Kafka / Event Grid | push | newly-arrived items |
| `human` | Notebook annotations | push | researcher notes, corrections |

Each channel emits candidates with a **provenance tag**, an **embedding**, and **structured features** (recency, source authority, prior use, cost-to-include).

### 2.3 The Ranking Algorithm

The field is recomputed when (a) the goal frame changes, (b) a new candidate arrives on a push channel, or (c) a TTL expires. Recomputation is a **two-stage ranking cascade** followed by a materialization step:

#### Stage 1 — Cheap recall (top-K from each channel)

For each channel, fetch top-K candidates by channel-native similarity:

- semantic: cosine over goal-frame embedding
- structured: predicate match against goal-frame bindings
- episodic: time-decayed similarity to current state
- graph: k-hop neighborhood of goal-frame entities
- stream: recent arrivals matching subscribed predicates
- human: all unconsumed annotations for this continuation

This yields a candidate pool of roughly 200–500 items.

#### Stage 2 — Cross-encoder reranking with a small model

A small, fast model (Haiku-class, or a fine-tuned cross-encoder) scores each candidate against the goal frame. The score function is:

```
score(c, g) =
      w_rel  * relevance(c, g)              # cross-encoder output, [0,1]
    + w_rec  * recency_decay(c.ts, now)     # exp(-Δt/τ), τ per source
    + w_auth * source_authority(c.source)   # learned per-source prior
    + w_use  * prior_utility(c, g.tenant)   # was this useful before?
    + w_div  * diversity_penalty(c, picked) # MMR-style anti-redundancy
    - w_cost * inclusion_cost(c)            # tokens to include
```

Weights `w_*` are per-tenant learnable parameters, initialized to sensible defaults and refined from the warm store (§3.4). The continuation's field is the top-N by this score, subject to a token budget.
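"Top-N subject to a token budget" is a greedy selection; re-scoring each round lets the MMR-style diversity term see what is already picked. A sketch, assuming `score_fn(c, picked)` closes over the goal frame and weights and `c.tokens` is the inclusion cost:

```python
def select_field(candidates, score_fn, token_budget: int) -> list:
    """Greedy top-N under a token budget, re-scoring as items are picked."""
    picked, spent = [], 0
    pool = list(candidates)
    while pool:
        best = max(pool, key=lambda c: score_fn(c, picked))
        pool.remove(best)
        if spent + best.tokens <= token_budget:
            picked.append(best)
            spent += best.tokens
    return picked
```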

#### Stage 3 — Field materialization

The top-N items are assembled into a **field manifest**:

```json
{
  "field_id": "...",
  "continuation_id": "...",
  "computed_at": "...",
  "ttl_seconds": 300,
  "token_budget": 12000,
  "items": [
    {"rank": 1, "source": "semantic:pubmed", "tokens": 380,
     "score": 0.91, "provenance": "...", "content_ref": "..."},
    ...
  ],
  "evicted": [...]   # what was dropped, and why — for explainability
}
```

The continuation reads `items` in rank order until its prompt budget is met. **The big model never sees the memory hierarchy directly — it sees the field.**

### 2.4 Eviction and Decay

The field is not a database — it forgets. Each item has a **half-life** that depends on its source and the continuation's pace. Items below a score threshold are evicted. When a continuation publishes a finding that *uses* an item, the item's `prior_utility` is reinforced, raising its score in future recomputations for similar goal frames in the same tenant.

This gives us **organizational learning** as an emergent property: items that prove useful for one consultant's dairy nutrition work get surfaced faster for the next.
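A sketch of the decay-and-reinforce bookkeeping, with illustrative half-lives and threshold (item records are assumed to carry `score`, `ts` in epoch seconds, and `source_kind`):

```python
import time
from typing import Optional

EVICTION_THRESHOLD = 0.15                                  # illustrative
HALF_LIFE_S = {"stream": 3600, "semantic": 7 * 86400, "human": 30 * 86400}

def decayed_score(item: dict, now: Optional[float] = None) -> float:
    """Apply the source-specific half-life to the item's last computed score."""
    now = now if now is not None else time.time()
    half_life = HALF_LIFE_S.get(item["source_kind"], 86400)
    return item["score"] * 0.5 ** ((now - item["ts"]) / half_life)

def sweep(field_items: list) -> list:
    """Evict items whose decayed score has fallen below the threshold."""
    return [i for i in field_items if decayed_score(i) >= EVICTION_THRESHOLD]

def reinforce(utility_table: dict, tenant_id: str, item_id: str, delta=0.1):
    """Called when a published finding used the item: raise prior_utility."""
    key = (tenant_id, item_id)
    utility_table[key] = min(1.0, utility_table.get(key, 0.0) + delta)
```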

### 2.5 Implementation Note

The naive implementation recomputes the field on every tick. The production implementation uses a streaming materialized view (Materialize / RisingWave on open source, or a custom Flink job on managed platforms). Channels publish change events; the view incrementally updates the field; the continuation reads the latest snapshot.

Cost discipline: the Stage-2 reranker is the dominant cost. We cap it at one rerank per continuation per N seconds, and we cache reranker outputs keyed on `(goal_frame_hash, candidate_id)`.
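A sketch of that cap-and-cache discipline (interval and cache shape illustrative; `goal_frame_hash` is the fingerprint from §1.3's sketch):

```python
import time

RERANK_MIN_INTERVAL_S = 30        # at most one rerank per continuation per 30 s
_scores: dict[tuple, float] = {}  # (goal_frame_hash, candidate_id) -> score
_last_rerank: dict[str, float] = {}

def rerank(continuation_id: str, gf_hash: str, candidates, cross_encoder):
    """Rate-capped, cached Stage-2 rerank; the model scores cache misses only."""
    now = time.time()
    if now - _last_rerank.get(continuation_id, 0.0) < RERANK_MIN_INTERVAL_S:
        return None               # caller keeps the previous field snapshot
    _last_rerank[continuation_id] = now
    misses = [c for c in candidates if (gf_hash, c.id) not in _scores]
    for c, s in zip(misses, cross_encoder(misses)):
        _scores[(gf_hash, c.id)] = s
    return {c.id: _scores[(gf_hash, c.id)] for c in candidates}
```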

---

## 3. The Policy Kernel

### 3.1 What the Policy Kernel Decides

The policy kernel is a **small, fast, deterministic** service that runs before every model call and answers six questions:

1. **Route** — which model tier (Haiku / Sonnet / Opus, or GPT equivalents) should handle this tick?
2. **Mode** — synchronous (block and wait), async (fire and yield), or batched (defer)?
3. **Branching** — should this continuation spawn children, or do the work inline?
4. **Tool gating** — which tools is this tick allowed to call, given remaining quota?
5. **Termination** — is the marginal value of continuing positive given the remaining budget?
6. **Human escalation** — should the next tick interrupt the researcher?

It is **not an LLM**. It is a rules-plus-learned-priors function. We do not put LLMs on the critical control path of LLMs.

### 3.2 The Decision Function

For each continuation tick, the kernel receives a **decision input**:

```
DecisionInput = {
  continuation:   { goal_frame, state, status, generation },
  budget:         { remaining_tokens_by_tier, remaining_dollars,
                    deadline, tool_quotas, human_interrupts_left },
  context_field:  { size_tokens, top_score, score_spread,
                    novelty_vs_prior_tick },
  history:        { last_N_ticks_costs, last_N_ticks_progress,
                    consecutive_no_progress_ticks },
  global:         { current_load, model_health, price_signals }
}
```

And emits a **decision**:

```
Decision = {
  route:        "haiku" | "sonnet" | "opus",
  mode:         "sync" | "async" | "batch",
  branching:    { spawn: int, with_goal_frames: [...] },
  tools_allowed: [...],
  terminate:    bool,
  escalate:     bool,
  rationale:    "..."   # for the audit log
}
```

### 3.3 The Decision Logic (concrete)

The kernel is implemented as a layered cascade. Each layer is fast, explainable, and overridable.

**Layer A — Hard gates.** Non-negotiable rules:

```
if budget.remaining_dollars <= 0:        decision.terminate = true
if budget.deadline < now:                decision.terminate = true
if context_field.top_score < 0.2:        decision.terminate = true  # no signal
if consecutive_no_progress_ticks >= 3:   decision.escalate = true
if tenant.compliance_freeze:             decision.terminate = true
```

**Layer B — Tier routing.** Pick the cheapest model that is likely to make progress:

```
required_capability = capability_estimator(
    goal_frame.intent,
    context_field.top_score,
    state.open_questions
)
# capability_estimator is a small learned classifier, cached aggressively

if required_capability == "extract" or "classify":
    route = "haiku"
elif required_capability == "synthesize" or "draft":
    route = "sonnet"
elif required_capability == "plan" or "reason":
    route = "opus"

# Budget-aware downshift:
if budget.remaining_dollars < soft_cap_threshold:
    route = downshift(route, by=1)
```

**Layer C — Mode and branching.** When to fork, when to defer:

```
parallelizable_subgoals = parallelism_estimator(goal_frame, state)
if len(parallelizable_subgoals) > 1 and budget.allows_fanout(len(parallelizable_subgoals)):
    branching.spawn = min(len(parallelizable_subgoals), max_fanout)
    branching.with_goal_frames = parallelizable_subgoals

if route == "opus" and not is_user_waiting:
    mode = "async"     # don't block on expensive calls
elif route == "haiku" and batchable(state):
    mode = "batch"     # defer to next batch window
else:
    mode = "sync"
```

**Layer D — Tool gating.** Cheapest tool first, quota-aware:

```
tools_allowed = [
    t for t in goal_frame.eligible_tools
    if budget.tool_quotas[t.name] > 0
       and t.cost <= remaining_per_tick_budget
]
# Sort by expected_value_per_dollar, learned per tenant
```

**Layer E — Escalation.** When to interrupt the human:

```
if state.has_blocking_question and budget.human_interrupts_left > 0:
    escalate = true
if confidence_in_current_path < 0.3 and cost_so_far > 0.5 * hard_cap:
    escalate = true
if external_signal.contradicts_state:
    escalate = true
```

### 3.4 How the Kernel Learns

The kernel is initialized with hand-tuned rules. It improves via **offline supervised learning** on the warm store:

- input: `DecisionInput` features for past ticks
- label: did the tick produce useful progress, measured by downstream outcomes (publish events, researcher approvals, evidence-brief quality scores)
- model: gradient-boosted trees per decision dimension (route, branching, tools)
- deployment: shadow-mode first, then gated rollout with a kill-switch

We do not use RL on production traffic. The kernel is too critical to be exploratory.
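A sketch of the offline loop with scikit-learn (`fetch_training_frame` is an assumed warm-store query helper returning flattened `DecisionInput` features and the progress label per tick):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

def train_route_model():
    """One GBT per decision dimension; this one learns the route choice."""
    X, y = fetch_training_frame(dimension="route")   # assumed warm-store query
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = GradientBoostingClassifier(n_estimators=200, max_depth=3)
    model.fit(X_tr, y_tr)
    print("holdout accuracy:", model.score(X_te, y_te))
    return model   # exported to ONNX for the kernel's inference path (§4)
```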

### 3.5 Why Not An LLM Here

Three reasons. **Latency**: the kernel runs before every model call, often dozens of times per minute per researcher. **Determinism**: we need to be able to explain to Cargill compliance why a given decision was made. **Cost**: putting an LLM in front of every LLM call is a regress. A 5ms GBT inference beats a 500ms model call for control-plane decisions every time.

### 3.6 Models, Tiers, and How the Kernel Routes Across Clouds

The kernel emits a `route` tier (`haiku` | `sonnet` | `opus`), not a model id. The dispatch layer (§4) resolves the tier to a concrete model on a concrete provider, per tenant. Maestro treats **AWS Bedrock** and **Azure AI Foundry Models** as peer inference surfaces — same Claude family available on both, plus each cloud's own catalog.

#### Tier → provider → model resolution

| Tier | Bedrock model id | Foundry Models deployment | Failover order |
|---|---|---|---|
| `haiku` | `anthropic.claude-haiku-4-5-20251001-v1:0` | `claude-haiku-4-5` | Bedrock → Foundry → Anthropic direct |
| `sonnet` | `anthropic.claude-sonnet-4-6` (via inference profile) | `claude-sonnet-4-6` | Bedrock → Foundry → Anthropic direct |
| `opus`   | `anthropic.claude-opus-4-7` (via inference profile) | `claude-opus-4-6` | Bedrock → Foundry → Anthropic direct |
| `gpt-fast` | `openai.gpt-oss-120b-1:0` | `gpt-4.1-mini`, `gpt-oss-120b` | Foundry → OpenAI direct |
| `gpt-reason` | n/a | `o4`, `gpt-5` family | Foundry → OpenAI direct |

The mapping is per-tenant configuration, not code. Cargill's tenant prefers Bedrock for Claude (existing data lake is on AWS); a hypothetical Azure-resident tenant flips the order. The kernel does not know or care.
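A sketch of how the dispatch layer might consume that configuration (structure illustrative; model ids from the table above):

```python
# Per-tenant failover order: tier -> ordered list of (provider, model_id).
TENANT_ROUTING = {
    "cargill": {
        "sonnet": [("bedrock", "anthropic.claude-sonnet-4-6"),
                   ("foundry", "claude-sonnet-4-6"),
                   ("anthropic", "claude-sonnet-4-6")],
    },
    "azure_resident": {
        "sonnet": [("foundry", "claude-sonnet-4-6"),
                   ("bedrock", "anthropic.claude-sonnet-4-6")],
    },
}

def resolve(tenant_id: str, tier: str):
    """Yield (provider, model_id) in failover order for this tenant and tier."""
    yield from TENANT_ROUTING[tenant_id][tier]
```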

#### Cost levers the kernel can pull

- **Bedrock Cross-Region Inference (CRIS).** Geographic (`us.`, `eu.`, `apac.`) and **Global** inference profiles. Global is ≈10% cheaper than Geographic, with no routing surcharge. The kernel sets the inference profile per call when latency budget allows.
- **Bedrock service tiers.** `default | priority (+75%) | flex (−50%) | reserved`. The kernel routes deferred / batchable ticks to `flex`, user-waiting ticks to `priority`.
- **Bedrock latency-optimized inference.** Available for Haiku 3.5, Llama 70B/405B, Nova Pro. Kernel uses it for `mode=sync` ticks where p50 latency matters.
- **Bedrock prompt caching** via `cachePoint` blocks. Field manifests that don't change tick-over-tick are obvious caching candidates.
- **Foundry PTU** (Global / Data Zone / Regional Provisioned). For tenants running steady, predictable load — Cargill at scale — PTU at the right zone amortizes well below per-token MaaS.
- **Foundry Intelligent Prompt Routing** (within the OpenAI family). Used when the tier is `gpt-fast` and the kernel hasn't committed to a specific deployment.

#### Tool-schema normalization

Bedrock's `Converse` API normalizes `toolConfig.tools[].toolSpec` across providers. Foundry's Responses API has a different shape. The dispatch layer (`core/maestro/dispatch/`) exposes a single internal tool-call ABI; `to_provider_tool_schema()` and `from_provider_tool_call()` adapt at the provider boundary. Every provider is conformance-tested against the same suite — same inputs, same expected `toolUse` envelope out.
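A sketch of the two outbound adapters plus one inbound one, assuming an internal ABI of `{name, description, parameters}` with JSON-Schema parameters (the provider shapes shown track the documented `Converse` and Responses-API formats, but treat the details as illustrative):

```python
def to_bedrock_tool_schema(tool: dict) -> dict:
    """Internal tool ABI -> Bedrock Converse toolConfig entry."""
    return {"toolSpec": {
        "name": tool["name"],
        "description": tool["description"],
        "inputSchema": {"json": tool["parameters"]},   # JSON Schema
    }}

def to_foundry_tool_schema(tool: dict) -> dict:
    """Internal tool ABI -> Responses-API function tool."""
    return {"type": "function",
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["parameters"]}

def from_bedrock_tool_call(block: dict) -> dict:
    """Bedrock `toolUse` content block -> internal tool-call envelope."""
    tu = block["toolUse"]
    return {"call_id": tu["toolUseId"], "name": tu["name"], "args": tu["input"]}
```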

#### Governance and guardrails

- **Bedrock Guardrails** are applied inline on the `Converse` call (`guardrailConfig`) for Bedrock-routed traffic. PII detection, denied topics, contextual grounding, Automated Reasoning checks.
- **Azure AI Content Safety** filters are applied at the Foundry deployment level for Foundry-routed traffic. Same six categories (hate / self-harm / sexual / violence / jailbreak / protected-material) with togglable severity.
- Provider-specific safety verdicts are normalized into a single `safety_outcome` enum on the `model_call` event so the audit trail reads the same regardless of route.

The point of this section is to be explicit: the kernel sits above the providers. Adding a third inference surface (or swapping Claude on Bedrock for Claude on Foundry for a given tenant) is a config change, not a code change.

---

## 4. The 90-Day Build — Dual-Cloud Architecture

We build Maestro on **both** Azure AI Foundry and AWS (EKS + Redshift + Bedrock) from day one. ANH Digital Platform's customers split across clouds, and dual-cloud forces us to keep the abstractions clean. The cloud-specific parts live behind a thin **substrate interface** (§4.3).

### 4.1 Azure AI Foundry Build

```
                         ┌────────────────────────────┐
   Researcher Notebook ─▶│   Azure Front Door + APIM  │
   (Astro + React)       └─────────────┬──────────────┘
                                       ▼
                          ┌─────────────────────────┐
                          │  Maestro API (Container │
                          │   Apps, Foundry-aware)  │
                          └────────┬────────────────┘
                                   ▼
                        ┌─────────────────────────┐
                        │  Continuation Store     │
                        │  Cosmos DB (hot)        │
                        │  + Redis (lease/cache)  │
                        └────────┬────────────────┘
                                   ▼
                ┌──────────────────┴──────────────────┐
                ▼                                     ▼
  ┌──────────────────────────┐         ┌────────────────────────┐
  │ Scheduler (Durable       │         │ Context Field Service  │
  │ Functions): wake, lease, │         │ (Container App):       │
  │ dispatch                 │         │ RisingWave-style view  │
  └──────────┬───────────────┘         │ over Azure AI Search,  │
             │                         │ Cosmos DB, Event Grid  │
             ▼                         └──────────┬─────────────┘
  ┌──────────────────────┐                        │
  │ Worker Pool          │◀───────────────────────┘
  │ (Container Apps      │
  │  + Foundry Agents)   │
  └──────────┬───────────┘
             ▼
  ┌──────────────────────┐    ┌────────────────────┐
  │ Policy Kernel        │    │ Model Dispatch     │
  │ (Container App,      │───▶│ Foundry: GPT-4.x,  │
  │  ONNX GBT inference) │    │ Claude on Bedrock, │
  └──────────────────────┘    │ Azure OpenAI       │
                              └────────────────────┘

  Streams: Event Grid + Service Bus
  Audit/Analytics: Event Hubs → Synapse (warm store)
  Observability: Application Insights + OpenTelemetry
  Identity: Entra ID (researcher), Managed Identity (services)
```

Key Azure-specific choices:

- **Cosmos DB** for the hot continuation store. It gives us global distribution, change feed (for CDC into the warm store), and predictable latency.
- **Durable Functions** for the scheduler — they handle the timer-and-wake mechanics natively.
- **Azure AI Search** as the semantic channel, with hybrid vector+keyword.
- **Foundry Models** as the Azure-side inference catalog — peer of Bedrock. One Foundry resource exposes OpenAI (GPT-5, GPT-4.1, o-series), Anthropic Claude (Sonnet 4.6, Opus 4.6, Haiku 4.5 — billed via Azure Marketplace, MACC-eligible), Llama 3.3/4, DeepSeek V3/R1, Mistral, Cohere, Phi, Grok, and gpt-oss through a single OpenAI-compatible `/openai/v1/chat/completions` endpoint. The dispatch layer treats it as a first-class provider, not a side option. See §3.6.
- **Foundry Agent Service** (GA 2026-03-16, built on the OpenAI Responses API) as a *managed agent runtime* for bounded sub-tasks Maestro chooses to delegate — Hosted-agent sessions for browser automation, file/code work, or single-session research crawls. We use Foundry as an integration partner, not as the orchestrator. See §4.4 for the explicit reasoning.
- **Synapse** as the warm store equivalent for analytics; Redshift remains the AWS path.

### 4.2 AWS Build (EKS + Redshift + Bedrock)

```
                         ┌────────────────────────────┐
   Researcher Notebook ─▶│   CloudFront + API Gateway │
   (Astro + React)       └─────────────┬──────────────┘
                                       ▼
                          ┌─────────────────────────┐
                          │  Maestro API            │
                          │  (EKS, Fargate-backed)  │
                          └────────┬────────────────┘
                                   ▼
                       ┌──────────────────────────┐
                       │  Continuation Store      │
                       │  DynamoDB (hot)          │
                       │  + ElastiCache Redis     │
                       └────────┬─────────────────┘
                                ▼
                ┌──────────────────┴──────────────────┐
                ▼                                     ▼
  ┌──────────────────────────┐         ┌────────────────────────┐
  │ Scheduler (Step          │         │ Context Field Service  │
  │ Functions Express) for   │         │ (EKS): Flink job over  │
  │ wake & dispatch          │         │ OpenSearch, DynamoDB,  │
  └──────────┬───────────────┘         │ MSK streams            │
             │                         └──────────┬─────────────┘
             ▼                                    │
  ┌──────────────────────┐                        │
  │ Worker Pool          │◀───────────────────────┘
  │ (EKS, Karpenter-     │
  │  scaled, spot+OD)    │
  └──────────┬───────────┘
             ▼
  ┌──────────────────────┐    ┌────────────────────┐
  │ Policy Kernel        │    │ Model Dispatch     │
  │ (EKS, ONNX GBT,      │───▶│ Bedrock (Claude,   │
  │  sub-10ms p99)       │    │ Nova), OpenAI API  │
  └──────────────────────┘    └────────────────────┘

  Streams: MSK (Kafka) + EventBridge
  CDC: DynamoDB Streams → Firehose → Redshift (warm store)
  Analytics: Redshift Serverless + QuickSight
  Observability: CloudWatch + OpenTelemetry → Managed Grafana
  Identity: Cognito (researcher), IAM IRSA (services)
```

Key AWS-specific choices:

- **DynamoDB** for hot continuation state, with **DynamoDB Streams** as the CDC source for Redshift.
- **Redshift Serverless** as the warm store. Two reasons: (a) the policy kernel's offline training set lives here, (b) Cargill's existing data lake is on Redshift, so co-locating analytics minimizes data-egress and matches their team's skills.
- **EKS + Karpenter** for the worker pool. Spot-heavy with on-demand fallback for the policy kernel. Continuations are migratable, so spot reclaims are non-events.
- **MSK** (Managed Kafka) as the reactive tool mesh. EventBridge handles the cross-account / cross-tenant routing.
- **Bedrock** as the AWS-side inference catalog — Claude (Haiku 4.5, Sonnet 4.6, Opus 4.7), Nova/Titan, Llama 4, Mistral, Cohere, AI21, DeepSeek, gpt-oss-120b. The `Converse` API normalizes tool schemas across providers. Bedrock's **Cross-Region Inference profiles** (Global ≈ −10% vs Geographic), **service tiers** (`flex`, `priority`, `reserved`), **latency-optimized inference**, and **prompt caching** are the cost levers the policy kernel uses (§3.6).
- **Bedrock AgentCore** (GA 2025-10-13) as the AWS-side managed agent runtime — peer of Foundry Agent Service. Runtime (microVM sessions up to 8h), Memory (events + extracted records), Gateway (MCP-native tool egress with semantic search), Identity (workload identity + token vault), Browser, Code Interpreter, Observability. Maestro delegates bounded sub-tasks to AgentCore the same way it delegates to Foundry on Azure. See §4.4.

### 4.3 The Substrate Interface

The cloud-specific bits hide behind a single interface that every Maestro service depends on:

```
Substrate {
  ContinuationStore       # CRUD + lease semantics
  EventLog                # append-only, partitioned by continuation
  Scheduler               # register/cancel wake conditions
  StreamBus               # publish/subscribe for reactive tools
  ModelDispatch           # call model by tier, returns metered usage
  AgentRuntime            # delegate a bounded sub-task to a vendor agent runtime
  Identity                # who is this caller, what can they do
  Observability           # spans, metrics, audit
}
```

Each cloud ships a concrete `Substrate` implementation. Maestro core code never imports Azure or AWS SDKs directly. This is the discipline that keeps dual-cloud honest.

`AgentRuntime` is the newest method group on the substrate, added so that Maestro can hand a bounded sub-task to **Bedrock AgentCore** on AWS or **Foundry Agent Service** on Azure when it makes sense to do so (see §4.4). The interface is small on purpose:

```
AgentRuntime {
  spawn_session(spec) -> session_id           # bounded sub-task
  stream_events(session_id) -> Iterator       # journal into the continuation event log
  signal(session_id, payload)                 # send a human or sibling signal in
  terminate(session_id)                       # forced stop
  collect_terminal_state(session_id) -> dict  # final state on completion
}
```

The concrete impls are `AgentCoreRuntime` (Runtime + Gateway + Memory) on AWS and `FoundryAgentRuntime` (Hosted agents + Connected agents + Toolboxes) on Azure. Both stream events back through the substrate's `EventLog` so the audit trail in Redshift/Synapse remains the single source of truth.

### 4.4 Managed Agent Runtimes Are Integration Partners, Not Maestro

Two managed agent runtimes matter in 2026: **AWS Bedrock AgentCore** (GA 2025-10-13) and **Azure AI Foundry Agent Service** (GA 2026-03-16). They overlap heavily with what Maestro's worker pool + scheduler do. It is worth being explicit about why Maestro is not built *on* them, and where Maestro nonetheless delegates *to* them.

#### Why neither is Maestro's orchestrator

| Concern | AgentCore | Foundry Agent Service |
|---|---|---|
| Hard session ceiling | 8 hours | 30 days for Hosted agents; runs are request/response |
| Idle timeout | 15 minutes | 15 minutes for Hosted agents |
| Pause / branch / resume | No first-class primitive | No first-class primitive on runs; Hosted-agent sessions resume but not branch |
| Wake on external signal | Out-of-band; requires external orchestrator | Same — needs Logic Apps / Durable Functions / external scheduler |
| Fork / merge with lineage | Not modeled | Not modeled |
| Multi-cloud | Single-cloud by design | Single-cloud by design |

Maestro's continuations sleep for weeks, wake on arXiv RSS or sibling-publish events, fork and merge, and span both clouds. The vendor runtimes are not built for that shape of work. Putting Maestro continuations *on* AgentCore or Foundry would forfeit the central architectural bet.

#### Where Maestro delegates to them

The vendor runtimes are excellent for the specific shape they handle: a **bounded sub-task** within a session, with managed memory, sandboxed tool execution, and an OTEL trace. The policy kernel may choose to delegate a tick when the sub-task is:

- a managed browser crawl (AgentCore Browser / Foundry Browser Automation),
- a code-interpreter session (both vendors offer one),
- a tool-heavy session reachable through the vendor's MCP gateway (AgentCore Gateway / Foundry Toolboxes),
- or anything else that fits cleanly inside the vendor's session limits.

When the kernel delegates, the worker calls `Substrate.AgentRuntime.spawn_session(...)`, streams session events back into the continuation's event log (preserving the audit trail), and on completion collects the terminal state and resumes the continuation on the Maestro worker pool. The vendor session is one event in the continuation's history, not a new orchestrator.
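A sketch of that worker-side flow against the `AgentRuntime` methods from §4.3 (snake_case attribute names assumed):

```python
def delegate_tick(substrate, continuation, spec: dict):
    """Run one bounded sub-task on a vendor runtime, journaling as it goes.

    Every session event lands in the continuation's own event log, so the
    Redshift/Synapse audit trail reads the same whether the tick ran on the
    Maestro worker pool, AgentCore, or Foundry.
    """
    session_id = substrate.agent_runtime.spawn_session(spec)
    try:
        for event in substrate.agent_runtime.stream_events(session_id):
            substrate.event_log.append(continuation.continuation_id, {
                "kind": "tool_call" if event.get("tool") else "model_call",
                "runtime": spec["runtime"],        # "agentcore" | "foundry"
                "payload": event,
            })
        return substrate.agent_runtime.collect_terminal_state(session_id)
    except TimeoutError:
        substrate.agent_runtime.terminate(session_id)   # forced stop, then re-raise
        raise
```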

#### The delegation rule

A continuation tick is eligible for vendor delegation when **all** of:

1. The kernel's `route` is a tier the vendor can serve, and
2. The expected wall-clock for the sub-task is `< vendor.session_ceiling - safety_margin`, and
3. The sub-task does not need to fork, merge, or sleep on a multi-day external signal, and
4. The tenant's policy class allows external-runtime execution (some compliance regimes don't).

Otherwise the tick stays on the Maestro worker pool. The default is "stay on Maestro" — delegation is opt-in per tick, recorded in the `model_call` event as `runtime: maestro | agentcore | foundry`.
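The same rule as a predicate (a direct transcription of the four conditions; the `vendor` capability record is assumed):

```python
def eligible_for_delegation(decision, subtask, vendor, tenant,
                            safety_margin_s: int = 900) -> bool:
    """All four conditions must hold; the default answer is 'stay on Maestro'."""
    return (
        decision.route in vendor.served_tiers                          # 1
        and subtask.expected_wall_clock_s
            < vendor.session_ceiling_s - safety_margin_s               # 2
        and not (subtask.needs_fork or subtask.needs_merge
                 or subtask.sleeps_on_external_signal)                 # 3
        and tenant.policy_class.allows_external_runtime                # 4
    )
```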

#### Why this is the right shape

This keeps Maestro's center honest: continuations as the unit of work, the policy kernel as the control plane, the substrate as the cloud boundary. The vendor runtimes become a *capability* the substrate exposes, in the same row as `ModelDispatch` and `StreamBus`. We benefit from their managed sandboxing and OTEL hooks without inheriting their session-shaped worldview.

### 4.5 90-Day Phasing

**Days 1–30 — Spine.** Continuation store + event log + scheduler on both clouds. A bare worker that can execute a no-op continuation, sleep, wake, and complete. Substrate interface defined and implemented on both. No context field, no policy kernel yet — just the resumable-process backbone.

**Days 31–60 — Field and policy.** Context field service (Stage 1 + Stage 2 reranker) wired to one semantic source per cloud (Azure AI Search / OpenSearch). Policy kernel v0 — rules only, no learned components. End-to-end demo: a researcher submits a goal frame, a continuation runs to completion, calls Claude through Bedrock and GPT through Azure OpenAI, reads from the field, respects budgets.

**Days 61–90 — Continuity and learning.** Wake conditions beyond timers (stream subscriptions, sibling-publish, human signal). Forking and merging. Redshift / Synapse warm store live with CDC from both clouds. Policy kernel v1 — GBT models trained on initial usage data, deployed in shadow mode. Researcher notebook UI shipped. First Cargill pilot with two dairy nutrition consultants.

Detailed phasing is in `maestro-build-plan.yaml`.

---

## 5. Why MCP and Tools Live at the Edges, Not the Center

MCP is a good wire protocol. It is not a good architecture. In Maestro:

- **Tools speak MCP at the edge.** The reactive tool mesh exposes MCP endpoints for any tool that needs to be called imperatively. We do not reinvent the tool protocol.
- **But tools are also event sources.** Every MCP tool worth integrating gets a parallel `stream` adapter that publishes events to the mesh — new papers, new trial results, new market data. The continuation subscribes to the stream rather than polling the tool.
- **The orchestrator is not a graph.** It is the continuation store plus the scheduler. Graphs are an emergent property of lineage edges, not a hand-authored DAG.

This is the central architectural bet: the future of agentic work is **continuations over streams**, not **graphs over RPCs**.

---

## 6. What Maestro Lets a Cargill Researcher Do

A nutrition consultant logs into Maestro on Monday morning. They submit one goal frame: *"Evaluate the evidence base for {additive X} in early-lactation Holsteins, accounting for {regional regulation Y}."*

Maestro spawns a root continuation. Within two minutes, the root has spawned eleven children:

- three pulling structured trial data from Cargill's internal feed-trial database
- four monitoring literature streams (PubMed, ScienceDirect, J. Dairy Sci.) with subscriptions that will fire for weeks as new papers land
- two synthesizing the current state of evidence using Opus-class reasoning
- one drafting the regulatory-context section using Sonnet
- one watching for a competing additive's announcement (a wake condition on a specific RSS feed)

Eight of them go to sleep within ten minutes, waiting on streams or timers. Three keep working. By Tuesday, the consultant has a draft evidence brief in the notebook — generated entirely by the synthesis continuations reading from a field that was assembled from forty-three context items spread across seven sources. By Friday, two of the sleeping continuations have woken because new trials were published, and the brief has been updated with annotations highlighting what changed. Total spend: $14. Total researcher time: 35 minutes — almost all of it review, not authoring.

That is the work Maestro is for. That is the work the current MCP-plus-graphs stack cannot do without bespoke orchestration written for every customer.

---

## 7. Open Risks

A few things to keep honest about as we build:

**Context field cost.** Stage-2 reranking against a moving goal frame, across hundreds of candidates, for hundreds of continuations, can dominate the bill. The cache-and-cap strategy in §2.5 is necessary, not optional. Worth prototyping in isolation in week 2.

**Continuation migration in practice.** Erlang got this right; almost nobody else has. We start with non-migratable continuations pinned to durable workflows (Durable Functions / Step Functions) and earn migration only after the lease/fencing semantics are battle-tested.

**Policy kernel cold start.** With no training data, the kernel is just hand-tuned rules. We need to instrument heavily from day one so we have a corpus to train on by day 60.

**Dual-cloud overhead.** Maintaining two substrate implementations is real work. The substrate interface has to stay narrow. Any time we are tempted to add a method to it, that is a signal to push the complexity into the calling service instead.

**Researcher trust.** Cargill consultants will not hand work to a system whose decisions they cannot explain. Every continuation must produce a human-readable audit trail of its key decisions, sourced from the policy kernel's `rationale` field and the field manifest's `evicted` log.

---

## 8. What This Document Is, and What It Isn't

This is an architecture sketch with enough detail to start building. It is not a spec. The data models in §1 will change once we start writing migrations. The ranking weights in §2 will be tuned. The decision layers in §3 will be reordered. The 90-day plan will slip on at least one phase.

What should *not* change is the shape of the bet: continuations as the unit of work, context fields as the projection of memory, a deterministic policy kernel as the control plane, MCP at the edges. Everything else is implementation.

Maestro is ANH Digital Platform's product. BarnOS / Barnyard is the substrate. Cargill researchers and dairy consultants are the first customers because their work is exactly the wrong shape for the chatbot-era stack and exactly the right shape for this one.

Let's build it.
