XIOPro Production Blueprint v5.0¶
Part 10 — Swarm Architecture v5.0: Multi-Project Agent Orchestration¶
1. Core Principle¶
The Bus is the brain; agents are the hands.
All state flows through Bus + PostgreSQL. Agents are ephemeral; the Bus is permanent.
No agent owns truth. The Bus owns truth. Agents read state, execute, report results, and terminate. If an agent dies, the Bus state survives and a replacement is spawned from that state.
This is the single most important architectural constraint in the system.
2. Layer Architecture¶
XIOPro operates as a four-layer orchestration stack. Each layer has a clear boundary of responsibility, a clear upward escalation path, and a clear downward delegation path.
Layer 0: Infrastructure (always on, no agents)¶
Infrastructure is not orchestrated — it is permanent. These services run whether or not any agent exists.
| Service | Role | Lifecycle |
|---|---|---|
| Bus (Fastify) | 74+ endpoints, SSE, REST events, message relay | Always on, Governor-managed |
| PostgreSQL | bus + devxio DBs, source of truth | Always on, Governor-managed |
| Governor cron | 5-min health checks, auto-restart of crashed services | Always on, systemd |
| Heartbeat cron | Agent liveness detection | Always on, systemd |
| Caddy | TLS termination, reverse proxy | Always on, systemd |
| Dashboard | Stateless web app, reads Bus + DB | Always on, Caddy-served |
Layer 0 has zero agent dependencies. If every agent in the system dies, Layer 0 continues running and preserves all state.
Layer 1: Global Orchestration¶
| Agent | Count | Role |
|---|---|---|
| GO (Global Orchestrator) | Exactly 1 | Primary consumer of Bus, spawns all other orchestrators, resolves cross-project conflicts |
GO is the singleton root of the agent hierarchy. There is one GO in the entire system, always. GO does not execute work directly — it delegates downward.
GO responsibilities:
- Spawn and monitor HOs, POs, IOs
- Resolve cross-project resource conflicts
- Maintain global capacity table
- Route L3+ alerts
- Enforce budget constraints across all projects
Layer 2: Host Orchestration¶
| Agent | Host | Role |
|---|---|---|
| HO@Server1 | Hetzner CPX62 | Docker, Git repos, containers, agent spawning |
| HO@Mac1 | Mac Studio M1 | Browser testing, SOPS keys, GPU, local ops |
| HO@Server2 (future) | TBD | Horizontal scaling |
| HO@Cloud1 (future) | TBD | Cloud burst capacity |
HO (Host Orchestrator) — one per host. Each HO:
- Reports capacity metrics to GO via Bus: host_id, cpu_pct, mem_pct, active_agents, max_agents
- Enforces local memory limits (target: 50% per host, configurable)
- Spawns/terminates agents on its host as directed by GO or PO
- Refuses new agents if over capacity target, reports constraint to GO
- Manages host-specific infrastructure (Docker containers, filesystem, cron)
Layer 3: Project Orchestration¶
| Agent | Project | Role |
|---|---|---|
| PO-MVP1 | Paperclip MVP1 | Sprint plan, ticket queue, velocity, specialist agents |
| PO-XIOPro | XIOPro platform | Platform architecture, self-building pipeline |
| PO-{future} | Any new project | Spawned by GO when project is created |
PO (Project Orchestrator) — one per active project. Each PO owns the Project Template lifecycle pipeline.
PO responsibilities:
- Own sprint plan, ticket queue, and velocity tracking
- Spawn and manage Contextual Agents (long-lived, project-scoped: branding, domain expert, architecture)
- Spawn and manage Ephemeral Agents (task-scoped: coder, researcher, ops, designer)
- Resolve L1-L2 alerts within project scope
- Escalate to GO for cross-project issues
Master PO for composite projects: For composite projects (projects with sub-projects linked via parent_project_id), the top-level PO acts as a Master PO. The Master PO coordinates sub-project POs, resolves cross-sub-project dependencies, and aggregates cost and velocity reporting. Sub-project POs operate autonomously within their scope but escalate to the Master PO for cross-boundary issues. See Part 9, Section 5B for composite project structure.
Layer 4: Interaction Orchestration¶
| Agent | User | Role |
|---|---|---|
| IO@Shai | Shai (founder) | Alert triage, human decisions, idea capture, RC session bridge |
| IO@Client1 (future) | External stakeholder | Scoped project visibility |
IO (Interaction Orchestrator) — one per human user. Each IO:
- Triages alerts by level and routes to the correct human
- Bridges RC (Remote Control) sessions between human and Bus
- Captures ideas from human observation and records as Bus events
- Presents summaries, cost reports, and decision requests
- Never executes work — only mediates between humans and the agent hierarchy
3. Agent Identity¶
Agent identity follows the pattern: Role @ Host.
The host is a deployment detail, not identity. The role defines behavior.
| Agent ID | Role | Notes |
|---|---|---|
| GO | Global Orchestrator | Lives on primary host, singleton |
| HO@Server1 | Host Orchestrator on Hetzner | Was part of GO's infra duties |
| HO@Mac1 | Host Orchestrator on Mac | Was MO |
| PO-MVP1 | Project Orchestrator for MVP1 | New role |
| PO-XIOPro | Project Orchestrator for XIOPro | New role |
| IO@Shai | Interaction Orchestrator for Shai | New role |
Migration from v4.3 naming:
- BM → split into GO + HO@Server1
- MO → renamed to HO@Mac1
- C0 → absorbed into IO@Shai
- Specialist agents → spawned by PO, not directly by GO
4. Alert Taxonomy¶
All alerts flow through the Bus. Each alert has a level that determines routing.
| Level | Category | Examples | Handler | Escalation |
|---|---|---|---|---|
| L0 | Infrastructure | Container down, disk full, OOM | Governor cron auto-resolves | → HO if unresolved |
| L1 | Execution | Test fail, build break, lint error | Specialist retries (3-strike) | → PO |
| L2 | Project | Blocked ticket, design decision needed | PO resolves | → GO |
| L3 | Cross-project | Resource conflict, priority clash | GO resolves | → IO |
| L4 | Human-required | Budget approval, DNS, API keys, branding | IO routes to human via RC | Blocks until resolved |
| L5 | Strategic | New project, pivot, scope change | IO captures, GO plans | Human decision required |
The 3-strike circuit breaker applies at L1: if a specialist fails 3 times on the same task, it halts and escalates to PO. PO may reassign, change approach, or escalate to GO.
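The 3-strike breaker at L1 amounts to a small per-task failure counter. A minimal Python sketch (class and method names are illustrative, not the shipped implementation):

```python
class StrikeBreaker:
    """3-strike circuit breaker for L1 execution failures (illustrative sketch).

    A specialist records each failure per task; on the third strike the
    breaker trips and the task must be halted and escalated to the PO.
    """

    MAX_STRIKES = 3

    def __init__(self):
        self.strikes = {}  # task_id -> consecutive failure count

    def record_failure(self, task_id: str) -> bool:
        """Return True when the specialist must halt and escalate to PO."""
        self.strikes[task_id] = self.strikes.get(task_id, 0) + 1
        return self.strikes[task_id] >= self.MAX_STRIKES

    def record_success(self, task_id: str) -> None:
        """A success resets the count for that task."""
        self.strikes.pop(task_id, None)
```

PO may then reassign, change approach, or escalate to GO, as described above.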
5. Capacity Management¶
Capacity is managed collaboratively between GO and HOs.
5.1 Host Capacity Table (Bus DB)¶
CREATE TABLE host_capacity (
host_id TEXT PRIMARY KEY,
cpu_pct REAL NOT NULL DEFAULT 0,
mem_pct REAL NOT NULL DEFAULT 0,
active_agents INTEGER NOT NULL DEFAULT 0,
max_agents INTEGER NOT NULL DEFAULT 10,
target_mem REAL NOT NULL DEFAULT 50.0,
last_report TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
5.2 Reporting Cycle¶
- Each HO reports metrics every 60 seconds via Bus event
- GO reads the capacity table before any spawn decision
- If no host has capacity, GO queues the spawn request and alerts IO
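Each 60-second report can be written as a single upsert against the `host_capacity` table from 5.1, so the Bus DB always holds exactly one fresh row per host. A sketch (the builder function and psycopg-style `%s` placeholders are assumptions; column names follow the 5.1 DDL):

```python
def build_capacity_report(host_id, cpu_pct, mem_pct, active_agents, max_agents=10):
    """Build the HO's periodic capacity upsert (illustrative sketch).

    ON CONFLICT keeps one row per host, refreshed on every report cycle.
    Returns (sql, params) for a parameterized execute.
    """
    sql = (
        "INSERT INTO host_capacity "
        "(host_id, cpu_pct, mem_pct, active_agents, max_agents, last_report) "
        "VALUES (%s, %s, %s, %s, %s, NOW()) "
        "ON CONFLICT (host_id) DO UPDATE SET "
        "cpu_pct = EXCLUDED.cpu_pct, mem_pct = EXCLUDED.mem_pct, "
        "active_agents = EXCLUDED.active_agents, "
        "max_agents = EXCLUDED.max_agents, last_report = NOW()"
    )
    params = (host_id, cpu_pct, mem_pct, active_agents, max_agents)
    return sql, params
```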
5.3 Spawn Decision Flow¶
GO receives spawn request
→ Query host_capacity WHERE mem_pct < target_mem
→ Select host with lowest mem_pct
→ Send spawn command to that host's HO via Bus
→ HO spawns agent, confirms via Bus
→ If no capacity: queue + alert IO
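The selection step in the flow above is a filter-then-minimum over the capacity table. A minimal Python sketch (dataclass and function names are illustrative; eligibility criteria follow 5.1 and the HO rules in Layer 2):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HostCapacity:
    """Mirror of the host_capacity row (5.1) used in the spawn decision."""
    host_id: str
    mem_pct: float
    target_mem: float
    active_agents: int
    max_agents: int

def select_spawn_host(hosts: List[HostCapacity]) -> Optional[HostCapacity]:
    """Pick the lowest-mem_pct host below its memory target and agent cap.

    Returns None when no host qualifies, in which case GO queues the
    spawn request and alerts IO.
    """
    eligible = [h for h in hosts
                if h.mem_pct < h.target_mem and h.active_agents < h.max_agents]
    if not eligible:
        return None  # queue + alert IO
    return min(eligible, key=lambda h: h.mem_pct)
```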
6. Context Rotation v2 (Bus-Native)¶
Context rotation moves from file-based handoff to Bus-native events.
6.1 Rotation Event¶
{
"type": "rotation",
"agent_id": "PO-MVP1",
"state_summary": { "current_ticket": "T1P-042", "progress": "tests passing, PR pending" },
"next_task": "merge PR and move to T1P-043",
"bus_cursor": 847291
}
6.2 Rotation Flow¶
- Agent detects context limit approaching
- Agent writes rotation event to Bus: POST /events
- Agent terminates
- Orchestrator (GO or PO) sees rotation event via bus_poll
- Orchestrator spawns replacement agent with state reference
- Replacement reads Bus state from cursor, resumes
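On the orchestrator side, the rotation event from 6.1 maps directly onto the replacement's spawn request. A sketch of that mapping (the spawn-request field names are assumptions; the rotation fields are from 6.1):

```python
def replacement_spawn_request(rotation_event: dict) -> dict:
    """Turn a 6.1 rotation event into a spawn request for the replacement.

    The replacement inherits the role identity, resumes reading Bus state
    from the recorded cursor, and starts on the declared next task.
    """
    return {
        "type": "spawn",
        "role": rotation_event["agent_id"],               # same role resumes
        "resume_from_cursor": rotation_event["bus_cursor"],
        "initial_task": rotation_event["next_task"],
    }
```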
6.3 Crash Recovery¶
- Heartbeat cron detects missing heartbeat (agent not responding)
- Governor alerts GO via Bus
- GO reads last Bus state for that agent
- GO spawns replacement, passes last known state
- Replacement resumes from last checkpoint
No file-based handoff. No conversation memory dependency. Bus is the only state authority.
7. Dashboard¶
The Dashboard is Layer 0 infrastructure, not an agent.
- Technology: Stateless React app
- Data source: Bus REST API + PostgreSQL direct reads
- Auth: OAuth/API keys per user (future; currently internal-only)
- Multi-user: Views scoped by project membership
- No agent needed: Dashboard reads data; it does not orchestrate
Dashboard views:
- Agent map (who is running, where, what task)
- Alert feed (L0-L5, filterable)
- Capacity heatmap (per host)
- Sprint board (per project, from PO state)
- Cost tracker (per project, per agent, per model)
- Idea pipeline (from IO captures)
8. Idea Lifecycle¶
Ideas are the first stage of the Project Template pipeline. They are not tickets — they are raw observations that may or may not become work.
8.1 Lifecycle Stages¶
Trigger (human observation)
→ Capture (IO records as Bus event)
→ Research (agent explores feasibility, market, prior art)
→ Brainstorm (architectural discussion, trade-off analysis)
→ Decision (human approves or rejects)
→ Blueprint (if approved, becomes Part N of project blueprint)
→ Tickets (work plan decomposition)
→ Execute (PO manages via sprint)
8.2 Idea Event Schema¶
{
"type": "idea",
"source": "IO@Shai",
"title": "Add multi-tenant support",
"context": "Observed during client demo — they asked about team access",
"priority_hint": "medium",
"project_hint": "MVP1"
}
IO captures ideas without interrupting execution flow. GO reviews idea backlog during planning cycles.
9. Migration from v4.3¶
This architecture replaces the flat BM/MO/C0 model with a layered hierarchy. The migration is defined in transition_v5.yaml and proceeds in 7 phases:
- T1: Split GO from BM (separate global orchestration from host management)
- T2: Rename MO to HO@Mac1 (standardize host orchestrator naming)
- T3: Create PO role (project-scoped orchestration)
- T4: Create IO role (human interaction mediation)
- T5: Alert taxonomy in Bus DB (structured routing)
- T6: Capacity management (host metrics, spawn decisions)
- T7: Context rotation v2 (Bus-native state transfer)
Each phase is independently deployable. No phase requires all previous phases to be complete, though T1 should precede T3 (PO needs GO to exist).
10. Bus Degraded Mode¶
The Bus is permanent infrastructure, but transient unavailability (restart, network blip, deploy) must not halt agent execution. This section defines the expected behavior during Bus downtime.
10.1 Agent Behavior During Bus Unavailability¶
When an agent cannot reach the Bus:
- Continue executing the current task. The agent does not pause, halt, or block on Bus availability. Work in progress is not abandoned.
- Queue Bus writes locally (in-memory). Any event that would normally be written to the Bus (heartbeats, state updates, results, rotation events) is held in a local write queue for the duration of the outage.
- Do not queue indefinitely. If the local queue exceeds 500 entries or the agent reaches context rotation while the Bus is still down, the agent writes its state to the handoff directory (~/STRUXIO_Workspace/STRUXIO_OS/struxio-control/state/handoff/) as a fallback checkpoint.
10.2 Reconnection and Flush¶
On Bus reconnection:
- Flush the local write queue to the Bus in order (oldest event first).
- Write a bus_reconnect event with the flush count and gap duration.
- Resume normal Bus-native operation.
No deduplication is required on flush — the Bus event log is append-only. Downstream consumers must handle idempotency for events that may have been delayed.
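The queue-and-flush behavior of 10.1 and 10.2 can be sketched as a small in-memory structure (class and field names are illustrative, not the shipped implementation):

```python
class LocalBusQueue:
    """In-memory queue of Bus writes held during an outage (sketch).

    Per 10.1, the queue holds events an agent could not write; past 500
    entries the agent must write a filesystem checkpoint. Per 10.2, on
    reconnect events flush oldest-first, followed by a bus_reconnect
    event carrying the flush count and gap duration.
    """

    MAX_ENTRIES = 500

    def __init__(self):
        self._events = []

    def enqueue(self, event: dict) -> None:
        self._events.append(event)

    @property
    def needs_checkpoint(self) -> bool:
        """True once the queue exceeds the 500-entry threshold."""
        return len(self._events) > self.MAX_ENTRIES

    def flush(self, gap_seconds: float) -> list:
        """Return all queued events oldest-first plus a trailing bus_reconnect."""
        flushed = self._events
        self._events = []
        flushed.append({
            "type": "bus_reconnect",
            "flush_count": len(flushed),  # queued events before this marker
            "gap_seconds": gap_seconds,
        })
        return flushed
```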
10.3 SSE Reconnection¶
SSE (Server-Sent Events) connections use the retry:5000 directive already implemented in the Bus. On connection drop:
- The browser/client automatically retries after 5 seconds.
- The Last-Event-ID header is sent on reconnect, allowing the Bus to resume the stream from the last acknowledged event.
- No manual reconnection logic is required in the client.
10.4 REST Retry Policy¶
All REST calls from agents to the Bus use exponential backoff:
| Attempt | Delay |
|---|---|
| 1 (initial) | immediate |
| 2 (first retry) | 1 second |
| 3 (second retry) | 2 seconds |
| 4 (third retry) | 4 seconds |
After 3 retries (4 total attempts) without success, the write is placed in the local queue (see 10.1) and the agent continues executing. The failure is logged locally but does not raise an alert until the Bus is confirmed down for > 60 seconds (Governor detection window).
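The schedule above can be sketched as follows (the `post` callable stands in for the real Bus client and is an assumption; the delay schedule matches the table):

```python
import time

def bus_write_with_retry(post, event, max_retries=3, sleep=time.sleep):
    """POST an event to the Bus with the 10.4 backoff schedule (sketch).

    Attempt 1 fires immediately; retries wait 1s, 2s, 4s. Returns the
    response on success, or None after 4 failed attempts, at which point
    the caller places the event in the local queue (10.1) and continues.
    `sleep` is injectable for testing.
    """
    delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            return post(event)
        except Exception:
            if attempt == max_retries:
                return None  # hand the event to the local queue
            sleep(delay)
            delay *= 2.0
```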
10.5 Governor Role During Bus Outage¶
The Governor cron (5-minute cycle) detects Bus unavailability via the /healthz endpoint. On detection:
- Governor logs the outage event to /opt/struxio/logs/governor.log.
- Governor attempts docker restart deploy-bus-1 after the first failed cycle.
- If Bus does not recover within 2 cycles (10 minutes), Governor raises an L0 alert via the filesystem alert queue (/opt/struxio/bus/alerts/).
- On Bus recovery, Governor logs the recovery event and the outage duration.
Agents are not expected to know about the Governor cycle. Their only obligation is the retry policy (10.4) and local queue (10.1).
11. Design Constraints¶
- No agent owns state. The Bus + PostgreSQL own state. Agents read and write via Bus API.
- No agent is irreplaceable. Any agent can be terminated and respawned from Bus state.
- No cross-layer shortcuts. L3 agents do not talk to L0 infrastructure directly. They go through their HO.
- No implicit communication. All agent-to-agent communication goes through the Bus. No direct function calls between agents.
- Cost visibility is mandatory. Every agent reports token usage. Every PO reports project cost. GO reports total cost. IO presents cost to human.
11.1 Maximum Delegation Depth¶
The spawn hierarchy has a hard depth limit of 4 levels:
- No agent at L4 (Worker) may spawn additional agents.
- No delegation path may exceed 4 hops from GO.
- The Governor enforces this constraint via a spawn depth counter attached to every spawn request. Each spawn increments the counter; any spawn request with depth ≥ 4 is rejected with a spawn_depth_exceeded error event on the Bus.
- If a Specialist requires delegation, it must escalate to its PO to spawn the Worker on its behalf — the Specialist does not spawn directly.
Rationale¶
Unbounded delegation creates supervision gaps, makes cost attribution unreliable, and produces runaway agent trees that the Governor cannot control. The 4-level limit reflects the actual operational hierarchy and prevents architectural drift.
Rule¶
Violations of the depth limit are non-recoverable at spawn time — the spawn is rejected, not queued. The requesting agent must escalate to its parent to re-route the work.
11A. Optimizer Architecture Note¶
The Optimizer (Dream Engine, Idle Maintenance, Stewards) operates as a set of background cron-like tasks, not a unified service. No shared interface exists — each component reads from and writes to the Bus independently. This is intentional: loose coupling over tight integration.
Implications:
- Dream Engine, Idle Maintenance tasks, and Stewards are not coordinated by a shared Optimizer supervisor
- Each component has its own trigger (idle window, schedule, threshold event)
- Each component writes its proposals and outputs as Bus events
- No single endpoint or API represents "the Optimizer" — queries go to individual components
- Adding a new optimizer component requires only that it reads/writes Bus events; no shared interface to extend
12. Design Decisions¶
These are captured architectural decisions for Swarm v5.0. Each decision is final unless explicitly reopened via the decision log.
12.1 Authentication: User Auth vs Agent Auth¶
Decision: User authentication (STRUXIO.ai org OAuth) is separate from agent authentication (API keys / Bus tokens).
Rationale: Users are humans who log in to the Dashboard via browser. Agents are processes that authenticate headlessly. Conflating the two creates security complexity and operational fragility.
Rules:
- Users log in to the Dashboard and Control Center via OAuth (STRUXIO.ai org)
- Agents authenticate to the Bus via Bus tokens (issued per agent, stored in SOPS)
- No agent ever holds a user OAuth token
- No user-facing flow depends on agent tokens
12.2 Dashboard: Per-User Settings and Device Layout¶
Decision: Per-user settings are stored in the database. Dashboard layout includes a device attribute (desktop, tablet, phone). The last saved layout per device is restored on login.
Rationale: Users access the dashboard from multiple devices with different screen geometries. Persisting layout per device provides continuity without requiring manual reconfiguration.
Rules:
- Layout state schema includes: user_id, device (enum: desktop | tablet | phone), layout_json, updated_at
- On login, the Dashboard queries the user's saved layout for the detected device
- If no saved layout exists for the device, a default preset is applied
- Layout changes are debounced and auto-saved to the DB (no explicit save button required)
- Device detection is based on viewport width breakpoints, not user-agent sniffing
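The viewport-based classification can be sketched in a few lines. The specific breakpoint values (768 px, 1200 px) are assumptions, not decided values; only the three-way enum and width-based detection are from 12.2:

```python
def detect_device(viewport_width_px: int) -> str:
    """Classify a viewport into the 12.2 device enum (sketch).

    Breakpoint values are illustrative assumptions. Detection uses
    viewport width only, never user-agent sniffing.
    """
    if viewport_width_px < 768:
        return "phone"
    if viewport_width_px < 1200:
        return "tablet"
    return "desktop"
```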
12.3 Bootstrap: Web Login as Primary Entry Point¶
Decision: Web login (Dashboard) is the primary entry point for all users. Terminal CLI is the fallback and bootstrap backdoor for scenarios where the system is down or not yet initialized.
Rationale: For normal operation, the web UI provides the governed, auditable, and user-friendly control surface. The CLI must remain functional as a backdoor for infrastructure-level recovery and initial bootstrapping.
Rules:
- All production operational tasks are expected to flow through the Dashboard
- CLI is not deprecated — it remains available and maintained as the bootstrap/recovery path
- If the Bus or Dashboard is unreachable, CLI is the recovery method
- CLI-based operations that change system state must still write to the Bus (or queue for sync) when the Bus comes back online
- The bootstrap sequence (new server, fresh install) uses CLI exclusively until the Bus is up
12.4 HO Auto-Discovery¶
Decision: On activation, each HO automatically reads its host's memory and CPU metrics and reports them to the Bus without requiring manual configuration.
Rationale: Manual host configuration is error-prone and creates onboarding friction. Auto-discovery ensures the capacity table is always accurate from the moment an HO starts.
Rules:
- On startup, HO reads: total RAM, available RAM, CPU count, CPU model, disk usage, running Docker containers
- HO writes an initial host_capacity record to the Bus immediately after startup
- HO continues reporting metrics every 60 seconds via Bus heartbeat event
- If host specs change (e.g., RAM upgrade), the next HO restart auto-updates the record
- HO does not require a max_agents config file — it calculates max_agents from available RAM and a per-agent memory estimate (default: 512 MB per agent slot)
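The max_agents derivation can be sketched as follows. The exact formula is an assumption; the 512 MB per-slot default and the 50% memory target are from this section and Layer 2:

```python
def compute_max_agents(available_ram_mb: int,
                       per_agent_mb: int = 512,
                       mem_target_pct: float = 50.0) -> int:
    """Derive max_agents from available RAM (illustrative sketch).

    Only the host's memory-target share (default 50%) is budgeted for
    agent slots, at 512 MB per slot; at least one slot is always kept.
    """
    budget_mb = available_ram_mb * (mem_target_pct / 100.0)
    return max(1, int(budget_mb // per_agent_mb))
```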
12.5 GO Warm Standby (I6)¶
Decision: GO uses a lease-based warm standby model. True leader election is not possible because Claude sessions cannot coordinate directly. Instead, a single-row go_lease table in Bus PostgreSQL acts as the authority.
Rationale: GO is the only singleton in the system with no supervisor above it. Without standby, a crashed GO session leaves the system without a spawn root until manual intervention. The lease model closes this gap with a 30-minute implementation that provides automatic recovery without distributed coordination complexity.
Model:
- The active GO renews the lease every minute via POST /agents/000/lease with ttl=300.
- If the lease has not been renewed for 5 minutes (expires_at < NOW()), it is considered expired.
- The next GO session to start checks the lease and claims it if expired.
- Only the current holder or an expired-lease claimer can update the row (enforced by the UPDATE WHERE clause).
Bus endpoints:
| Method | Path | Purpose |
|---|---|---|
| GET | /agents/000/lease | Return current holder, expires_at, seconds_remaining, expired flag |
| POST | /agents/000/lease | Renew (current holder) or claim (if expired). Returns 409 if lease is active and held by another. |
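The guarded UPDATE behind the POST handler can be sketched as below. This assumes a single-row table keyed by id = 1; the go_lease columns beyond holder and expires_at, and the helper names, are illustrative:

```python
# Renew-or-claim in one statement: succeeds only for the current holder
# or when the lease has already expired (the UPDATE WHERE clause).
LEASE_CLAIM_SQL = """
UPDATE go_lease
   SET holder = %(holder)s,
       expires_at = NOW() + make_interval(secs => %(ttl)s)
 WHERE id = 1
   AND (holder = %(holder)s OR expires_at < NOW())
"""

def lease_status_code(rows_updated: int) -> int:
    """409 when the lease is active and held by another GO, else 200."""
    return 200 if rows_updated == 1 else 409
```

Because the claim condition lives in the WHERE clause, two competing sessions cannot both succeed: the row update is atomic, and the loser sees zero rows updated and receives 409.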
Heartbeat integration: /opt/struxio/scripts/agent_heartbeat.sh runs every 60 seconds via cron and calls POST /agents/000/lease after the agent register call. This ensures the lease is renewed as long as any GO is alive on the host.
Recovery sequence:
1. GO session A crashes. Lease expires after 5 minutes.
2. GO session B starts (new Claude Code session or manual restart).
3. Session B calls POST /agents/000/lease — succeeds because expires_at < NOW().
4. Session B is now the active GO. All Bus state is intact. Normal operation resumes.
5. No manual intervention required.
Limitations: This model does not prevent two simultaneous active GOs during the TTL window. If a GO session goes zombie (still alive but not actively working), the lease renewal keeps it "official" until TTL expires. This is acceptable for the current scale — at most 1 active GO session is expected at any time.
13. Project Roadmap¶
The following is the priority-ordered roadmap for projects that XIOPro will manage. Each project builds on the previous one, creating a virtuous cycle where the platform improves itself.
| Phase | Project | Description | Dependencies |
|---|---|---|---|
| 1 | XIOPro Core | Stabilize the platform: complete v5.0 transition (T1-T7), deploy PO/IO/HO roles, implement RBAC, deploy governance breakers, establish observability. This is the foundation everything else depends on. | None |
| 2 | Template Builder | Meta-tool for building project templates. A researcher agent that analyzes a target domain and produces a complete template (stages, steps, gates, agent roles, resource defaults) calibrated to T1P standards. See Part 9, Section 5A. | XIOPro Core stable |
| 3 | AI Project Template | Use the Template Builder to create a model template for AI/software projects, using XIOPro's own architecture as the reference implementation. This template becomes the standard for all future IT projects. | Template Builder operational |
| 4 | XIOPro v6 Self-Build | Regenerate XIOPro using its own AI Project Template. The platform rebuilds itself through its own orchestration layer, validating the template and identifying gaps. This is the ultimate dogfooding exercise. | AI Project Template validated |
| 5 | MVP1 Composite | Launch MVP1 (Paperclip) as a composite project with 4 sub-projects: Platform (IT Project template), Marketing (Marketing template), Knowledge (Knowledge Expert template), Content (Content Creation template). See Part 9, Section 5B for composite project structure. | XIOPro Core stable, templates defined |
Roadmap Principles¶
- Phase 1 is non-negotiable: Nothing else starts until XIOPro Core is stable. Attempting to run projects on an unstable platform wastes more time than it saves.
- Phases 2-4 are the self-improvement loop: The platform gets better at managing projects by managing the project of improving itself.
- Phase 5 is the first real external-facing project: By the time MVP1 launches, the platform has been tested on itself.
- Phases can overlap: Phase 5 can begin in parallel with Phase 4 once the core templates are defined and PO orchestration is proven.
14. Operator-to-Project Ratios¶
14.1 Current Validated Ratio¶
Current validated ratio: 1 operator managing 2 projects with 12 agents.
This reflects the operational baseline as of v5.0: GO managing XIOPro Core and MVP1 simultaneously, with a combined agent pool of ~12 active agents across both POs.
14.2 Target Ratio¶
Target: 1 operator managing 5+ projects.
This requires:
- Stable PO autonomy (PO handles L1/L2 alerts without GO involvement)
- Reliable capacity management (HO auto-scaling without manual intervention)
- IO filtering alerts to near-zero operator noise for stable projects
- Dream Engine handling routine maintenance autonomously
14.3 Stress Testing Requirement¶
Stress testing is required before the 5-project target is declared achievable.
Minimum stress tests:
- Simulate 5 concurrent POs with active sprint execution
- Inject L2 alert storm (10+ simultaneous blocked tickets across projects)
- Simulate HO capacity saturation and GO spawn-queuing behavior
- Verify alert routing does not cross project boundaries (see Part 7, Section 10.4)
- Verify cost attribution remains correct across all 5 projects under load
Results must be recorded before GO declares the 5-project target validated.
Changelog¶
| Version | Date | Author | Change |
|---|---|---|---|
| 5.0.4 | 2026-03-30 | GO | N16: Added Section 11A — Optimizer Architecture Note. Clarifies that Dream Engine, Idle Maintenance, and Stewards are independent cron-like tasks with no shared interface. Loose coupling over tight integration is intentional. |
| 5.0.5 | 2026-03-30 | GO | N11: Added Section 11.1 — Maximum Delegation Depth. Hard limit of 4 levels (GO → PO → Specialist → Worker). Governor enforces via spawn depth counter. Deeper spawning rejected at spawn time. |
| 5.0.6 | 2026-03-30 | GO | Round 2 review fixes: Layer 1 — changed "Owns Bus" to "Primary consumer of Bus" (Bus is Layer 0 infrastructure, not owned by GO). Layer 3 — added Master PO note for composite projects with sub-project PO coordination, referencing Part 9 Section 5B. |
End of Part 10