
XIOPro Production Blueprint v5.0

Part 10 — Swarm Architecture v5.0: Multi-Project Agent Orchestration


1. Core Principle

The Bus is the brain; the agents are the hands.

All state flows through Bus + PostgreSQL. Agents are ephemeral; the Bus is permanent.

No agent owns truth. The Bus owns truth. Agents read state, execute, report results, and terminate. If an agent dies, the Bus state survives and a replacement is spawned from that state.

This is the single most important architectural constraint in the system.


2. Layer Architecture

XIOPro operates as a four-layer orchestration stack. Each layer has a clear boundary of responsibility, a clear upward escalation path, and a clear downward delegation path.

Layer 0: Infrastructure (always on, no agents)

Infrastructure is not orchestrated — it is permanent. These services run whether or not any agent exists.

| Service | Role | Lifecycle |
| --- | --- | --- |
| Bus (Fastify) | 74+ endpoints, SSE, REST events, message relay | Always on, Governor-managed |
| PostgreSQL | bus + devxio DBs, source of truth | Always on, Governor-managed |
| Governor cron | 5-min health checks, auto-restart of crashed services | Always on, systemd |
| Heartbeat cron | Agent liveness detection | Always on, systemd |
| Caddy | TLS termination, reverse proxy | Always on, systemd |
| Dashboard | Stateless web app, reads Bus + DB | Always on, Caddy-served |

Layer 0 has zero agent dependencies. If every agent in the system dies, Layer 0 continues running and preserves all state.


Layer 1: Global Orchestration

| Agent | Count | Role |
| --- | --- | --- |
| GO (Global Orchestrator) | Exactly 1 | Primary consumer of Bus, spawns all other orchestrators, resolves cross-project conflicts |

GO is the singleton root of the agent hierarchy. There is one GO in the entire system, always. GO does not execute work directly — it delegates downward.

GO responsibilities:

  • Spawn and monitor HOs, POs, IOs
  • Resolve cross-project resource conflicts
  • Maintain global capacity table
  • Route L3+ alerts
  • Enforce budget constraints across all projects

Layer 2: Host Orchestration

| Agent | Host | Role |
| --- | --- | --- |
| HO@Server1 | Hetzner CPX62 | Docker, Git repos, containers, agent spawning |
| HO@Mac1 | Mac Studio M1 | Browser testing, SOPS keys, GPU, local ops |
| HO@Server2 (future) | TBD | Horizontal scaling |
| HO@Cloud1 (future) | TBD | Cloud burst capacity |

HO (Host Orchestrator) — one per host. Each HO:

  • Reports capacity metrics to GO via Bus: host_id, cpu_pct, mem_pct, active_agents, max_agents
  • Enforces local memory limits (target: 50% per host, configurable)
  • Spawns/terminates agents on its host as directed by GO or PO
  • Refuses new agents if over capacity target, reports constraint to GO
  • Manages host-specific infrastructure (Docker containers, filesystem, cron)

Layer 3: Project Orchestration

| Agent | Project | Role |
| --- | --- | --- |
| PO-MVP1 | Paperclip MVP1 | Sprint plan, ticket queue, velocity, specialist agents |
| PO-XIOPro | XIOPro platform | Platform architecture, self-building pipeline |
| PO-{future} | Any new project | Spawned by GO when project is created |

PO (Project Orchestrator) — one per active project. Each PO owns the Project Template lifecycle pipeline:

Idea → Research → Brainstorm → Manifest → Blueprint → Work Plan → Test Plan → Review → Execute

PO responsibilities:

  • Own sprint plan, ticket queue, and velocity tracking
  • Spawn and manage Contextual Agents (long-lived, project-scoped: branding, domain expert, architecture)
  • Spawn and manage Ephemeral Agents (task-scoped: coder, researcher, ops, designer)
  • Resolve L1-L2 alerts within project scope
  • Escalate to GO for cross-project issues

Master PO for composite projects: For composite projects (projects with sub-projects linked via parent_project_id), the top-level PO acts as a Master PO. The Master PO coordinates sub-project POs, resolves cross-sub-project dependencies, and aggregates cost and velocity reporting. Sub-project POs operate autonomously within their scope but escalate to the Master PO for cross-boundary issues. See Part 9, Section 5B for composite project structure.


Layer 4: Interaction Orchestration

| Agent | User | Role |
| --- | --- | --- |
| IO@Shai | Shai (founder) | Alert triage, human decisions, idea capture, RC session bridge |
| IO@Client1 (future) | External stakeholder | Scoped project visibility |

IO (Interaction Orchestrator) — one per human user. Each IO:

  • Triages alerts by level and routes to the correct human
  • Bridges RC (Remote Control) sessions between human and Bus
  • Captures ideas from human observation and records as Bus events
  • Presents summaries, cost reports, and decision requests
  • Never executes work — only mediates between humans and the agent hierarchy

3. Agent Identity

Agent identity follows the pattern: Role @ Host.

The host is a deployment detail, not identity. The role defines behavior.

| Agent ID | Role | Notes |
| --- | --- | --- |
| GO | Global Orchestrator | Lives on primary host, singleton |
| HO@Server1 | Host Orchestrator on Hetzner | Was part of BM's infra duties |
| HO@Mac1 | Host Orchestrator on Mac | Was MO |
| PO-MVP1 | Project Orchestrator for MVP1 | New role |
| PO-XIOPro | Project Orchestrator for XIOPro | New role |
| IO@Shai | Interaction Orchestrator for Shai | New role |

Migration from v4.3 naming:

  • BM → split into GO + HO@Server1
  • MO → renamed to HO@Mac1
  • C0 → absorbed into IO@Shai
  • Specialist agents → spawned by PO, not directly by GO

4. Alert Taxonomy

All alerts flow through the Bus. Each alert has a level that determines routing.

| Level | Category | Examples | Handler | Escalation |
| --- | --- | --- | --- | --- |
| L0 | Infrastructure | Container down, disk full, OOM | Governor cron auto-resolves | → HO if unresolved |
| L1 | Execution | Test fail, build break, lint error | Specialist retries (3-strike) | → PO |
| L2 | Project | Blocked ticket, design decision needed | PO resolves | → GO |
| L3 | Cross-project | Resource conflict, priority clash | GO resolves | → IO |
| L4 | Human-required | Budget approval, DNS, API keys, branding | IO routes to human via RC | Blocks until resolved |
| L5 | Strategic | New project, pivot, scope change | IO captures, GO plans | Human decision required |

The 3-strike circuit breaker applies at L1: if a specialist fails 3 times on the same task, it halts and escalates to PO. PO may reassign, change approach, or escalate to GO.
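The 3-strike behavior can be sketched in a few lines. This is an illustrative model, not the real specialist implementation; the class and method names are assumptions.

```python
# Hypothetical sketch of the L1 three-strike circuit breaker: a specialist
# retries a failing task up to 3 times, then halts and escalates to its PO.

class ThreeStrikeBreaker:
    MAX_STRIKES = 3

    def __init__(self):
        self.strikes = {}  # task_id -> consecutive failure count

    def record_failure(self, task_id: str) -> str:
        """Return the next action for the specialist: 'retry' or 'escalate'."""
        self.strikes[task_id] = self.strikes.get(task_id, 0) + 1
        if self.strikes[task_id] >= self.MAX_STRIKES:
            return "escalate"  # halt; PO may reassign, change approach, or escalate to GO
        return "retry"

    def record_success(self, task_id: str) -> None:
        """Reset the counter once the task succeeds."""
        self.strikes.pop(task_id, None)
```

Note the counter is per task: failures on different tasks do not accumulate toward the same breaker.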


5. Capacity Management

Capacity is managed collaboratively between GO and HOs.

5.1 Host Capacity Table (Bus DB)

```sql
CREATE TABLE host_capacity (
  host_id       TEXT PRIMARY KEY,
  cpu_pct       REAL NOT NULL DEFAULT 0,
  mem_pct       REAL NOT NULL DEFAULT 0,
  active_agents INTEGER NOT NULL DEFAULT 0,
  max_agents    INTEGER NOT NULL DEFAULT 10,
  target_mem    REAL NOT NULL DEFAULT 50.0,
  last_report   TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```

5.2 Reporting Cycle

  • Each HO reports metrics every 60 seconds via Bus event
  • GO reads the capacity table before any spawn decision
  • If no host has capacity, GO queues the spawn request and alerts IO

5.3 Spawn Decision Flow

```
GO receives spawn request
  → Query host_capacity WHERE mem_pct < target_mem
  → Select host with lowest mem_pct
  → Send spawn command to that host's HO via Bus
  → HO spawns agent, confirms via Bus
  → If no capacity: queue + alert IO
```
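The host-selection step can be sketched as a pure function over host_capacity rows. This is a sketch under the rules above, not the GO implementation; the function name and row-dict shape are assumptions.

```python
# Illustrative GO spawn decision: filter hosts under their memory target
# (and with a free agent slot), pick the lowest mem_pct, or return None
# so the caller queues the request and alerts IO.

def choose_spawn_host(capacity_rows):
    eligible = [
        r for r in capacity_rows
        if r["mem_pct"] < r["target_mem"]          # under memory target
        and r["active_agents"] < r["max_agents"]   # has a free agent slot
    ]
    if not eligible:
        return None  # no capacity: queue + alert IO
    return min(eligible, key=lambda r: r["mem_pct"])["host_id"]
```

The `active_agents < max_agents` check is an assumption layered on top of the documented mem_pct filter, since the table tracks both.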

6. Context Rotation v2 (Bus-Native)

Context rotation moves from file-based handoff to Bus-native events.

6.1 Rotation Event

```json
{
  "type": "rotation",
  "agent_id": "PO-MVP1",
  "state_summary": { "current_ticket": "T1P-042", "progress": "tests passing, PR pending" },
  "next_task": "merge PR and move to T1P-043",
  "bus_cursor": 847291
}
```

6.2 Rotation Flow

  1. Agent detects context limit approaching
  2. Agent writes rotation event to Bus: POST /events
  3. Agent terminates
  4. Orchestrator (GO or PO) sees rotation event via bus_poll
  5. Orchestrator spawns replacement agent with state reference
  6. Replacement reads Bus state from cursor, resumes
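Steps 1-3 on the agent side can be sketched as follows. The event shape is the 6.1 schema; BUS_URL and the injected `post` callable are assumptions standing in for the real Bus client.

```python
import json

BUS_URL = "http://localhost:3000"  # assumption: local Bus address

def build_rotation_event(agent_id, state_summary, next_task, bus_cursor):
    """Assemble the rotation event exactly as in the 6.1 schema."""
    return {
        "type": "rotation",
        "agent_id": agent_id,
        "state_summary": state_summary,
        "next_task": next_task,
        "bus_cursor": bus_cursor,
    }

def rotate(agent_id, state_summary, next_task, bus_cursor, post):
    """Write the rotation event to the Bus (step 2: POST /events),
    then return so the caller can terminate the agent (step 3)."""
    event = build_rotation_event(agent_id, state_summary, next_task, bus_cursor)
    post(f"{BUS_URL}/events", json.dumps(event))
    return event
```

The replacement agent never sees this object directly; it reads the same event back from the Bus at `bus_cursor`.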

6.3 Crash Recovery

  1. Heartbeat cron detects missing heartbeat (agent not responding)
  2. Governor alerts GO via Bus
  3. GO reads last Bus state for that agent
  4. GO spawns replacement, passes last known state
  5. Replacement resumes from last checkpoint

No file-based handoff. No conversation memory dependency. Bus is the only state authority.


7. Dashboard

The Dashboard is Layer 0 infrastructure, not an agent.

  • Technology: Stateless React app
  • Data source: Bus REST API + PostgreSQL direct reads
  • Auth: OAuth/API keys per user (future; currently internal-only)
  • Multi-user: Views scoped by project membership
  • No agent needed: Dashboard reads data; it does not orchestrate

Dashboard views:

  • Agent map (who is running, where, what task)
  • Alert feed (L0-L5, filterable)
  • Capacity heatmap (per host)
  • Sprint board (per project, from PO state)
  • Cost tracker (per project, per agent, per model)
  • Idea pipeline (from IO captures)

8. Idea Lifecycle

Ideas are the first stage of the Project Template pipeline. They are not tickets — they are raw observations that may or may not become work.

8.1 Lifecycle Stages

```
Trigger (human observation)
  → Capture (IO records as Bus event)
  → Research (agent explores feasibility, market, prior art)
  → Brainstorm (architectural discussion, trade-off analysis)
  → Decision (human approves or rejects)
  → Blueprint (if approved, becomes Part N of project blueprint)
  → Tickets (work plan decomposition)
  → Execute (PO manages via sprint)
```

8.2 Idea Event Schema

```json
{
  "type": "idea",
  "source": "IO@Shai",
  "title": "Add multi-tenant support",
  "context": "Observed during client demo — they asked about team access",
  "priority_hint": "medium",
  "project_hint": "MVP1"
}
```

IO captures ideas without interrupting execution flow. GO reviews idea backlog during planning cycles.


9. Migration from v4.3

This architecture replaces the flat BM/MO/C0 model with a layered hierarchy. The migration is defined in transition_v5.yaml and proceeds in 7 phases:

  1. T1: Split GO from BM (separate global orchestration from host management)
  2. T2: Rename MO to HO@Mac1 (standardize host orchestrator naming)
  3. T3: Create PO role (project-scoped orchestration)
  4. T4: Create IO role (human interaction mediation)
  5. T5: Alert taxonomy in Bus DB (structured routing)
  6. T6: Capacity management (host metrics, spawn decisions)
  7. T7: Context rotation v2 (Bus-native state transfer)

Each phase is independently deployable. No phase requires all previous phases to be complete, though T1 should precede T3 (PO needs GO to exist).


10. Bus Degraded Mode

The Bus is permanent infrastructure, but transient unavailability (restart, network blip, deploy) must not halt agent execution. This section defines the expected behavior during Bus downtime.

10.1 Agent Behavior During Bus Unavailability

When an agent cannot reach the Bus:

  • Continue executing the current task. The agent does not pause, halt, or block on Bus availability. Work in progress is not abandoned.
  • Queue Bus writes locally (in-memory). Any event that would normally be written to the Bus (heartbeats, state updates, results, rotation events) is held in a local write queue for the duration of the outage.
  • Do not queue indefinitely. If the local queue exceeds 500 entries or the agent reaches context rotation while the Bus is still down, the agent writes its state to the handoff directory (~/STRUXIO_Workspace/STRUXIO_OS/struxio-control/state/handoff/) as a fallback checkpoint.

10.2 Reconnection and Flush

On Bus reconnection:

  1. Flush the local write queue to the Bus in order (oldest event first).
  2. Write a bus_reconnect event with the flush count and gap duration.
  3. Resume normal Bus-native operation.

No deduplication is required on flush — the Bus event log is append-only. Downstream consumers must handle idempotency for events that may have been delayed.
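The queue-and-flush behavior in 10.1-10.2 can be sketched as a small in-memory structure. The 500-entry limit and oldest-first order are from the text; the class name and the injected `write_to_bus` callable are assumptions.

```python
from collections import deque

QUEUE_LIMIT = 500  # per 10.1: beyond this, fall back to a handoff checkpoint

class DegradedModeQueue:
    """Local write queue held by an agent while the Bus is unreachable."""

    def __init__(self):
        self.queue = deque()

    def enqueue(self, event) -> bool:
        """Queue an event during an outage. Returns False once the limit is
        reached, signalling the agent to write a handoff-directory checkpoint."""
        if len(self.queue) >= QUEUE_LIMIT:
            return False
        self.queue.append(event)
        return True

    def flush(self, write_to_bus) -> int:
        """On reconnect, flush oldest-first. The returned count goes into
        the bus_reconnect event."""
        count = 0
        while self.queue:
            write_to_bus(self.queue.popleft())
            count += 1
        return count
```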

10.3 SSE Reconnection

SSE (Server-Sent Events) connections use the retry:5000 directive already implemented in the Bus. On connection drop:

  • The browser/client automatically retries after 5 seconds.
  • The Last-Event-ID header is sent on reconnect, allowing the Bus to resume the stream from the last acknowledged event.
  • No manual reconnection logic is required in the client.

10.4 REST Retry Policy

All REST calls from agents to the Bus use exponential backoff:

| Attempt | Delay |
| --- | --- |
| 1 (initial) | immediate |
| 2 (first retry) | 1 second |
| 3 (second retry) | 2 seconds |
| 4 (third retry) | 4 seconds |

After 3 retries (4 total attempts) without success, the write is placed in the local queue (see 10.1) and the agent continues executing. The failure is logged locally but does not raise an alert until the Bus is confirmed down for > 60 seconds (Governor detection window).
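A minimal sketch of this retry policy, assuming an injected `send` callable that raises `ConnectionError` on failure and an `on_give_up` hook that hands the event to the local queue; both names are illustrative, not the real agent API.

```python
import time

def post_with_backoff(send, event, on_give_up, sleep=time.sleep):
    """Immediate attempt plus three retries at 1s / 2s / 4s (10.4).
    On final failure the event goes to the local queue (10.1) and the
    agent keeps executing."""
    delays = [0, 1, 2, 4]  # attempt 1 is immediate
    for delay in delays:
        if delay:
            sleep(delay)
        try:
            send(event)
            return True
        except ConnectionError:
            continue  # logged locally; no alert until Governor window passes
    on_give_up(event)
    return False
```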

10.5 Governor Role During Bus Outage

The Governor cron (5-minute cycle) detects Bus unavailability via the /healthz endpoint. On detection:

  1. Governor logs the outage event to /opt/struxio/logs/governor.log.
  2. Governor attempts docker restart deploy-bus-1 after the first failed cycle.
  3. If Bus does not recover within 2 cycles (10 minutes), Governor raises an L0 alert via the filesystem alert queue (/opt/struxio/bus/alerts/).
  4. On Bus recovery, Governor logs the recovery event and the outage duration.

Agents are not expected to know about the Governor cycle. Their only obligation is the retry policy (10.4) and local queue (10.1).


11. Design Constraints

  • No agent owns state. The Bus + PostgreSQL own state. Agents read and write via Bus API.
  • No agent is irreplaceable. Any agent can be terminated and respawned from Bus state.
  • No cross-layer shortcuts. L3 agents do not talk to L0 infrastructure directly. They go through their HO.
  • No implicit communication. All agent-to-agent communication goes through the Bus. No direct function calls between agents.
  • Cost visibility is mandatory. Every agent reports token usage. Every PO reports project cost. GO reports total cost. IO presents cost to human.

11.1 Maximum Delegation Depth

The spawn hierarchy has a hard depth limit of 4 levels:

GO (L1) → PO (L2) → Specialist (L3) → Worker (L4)

  • No agent at L4 (Worker) may spawn additional agents.
  • No delegation path may exceed 4 hops from GO.
  • The Governor enforces this constraint via a spawn depth counter attached to every spawn request. Each spawn increments the counter; any spawn request with depth ≥ 4 is rejected with a spawn_depth_exceeded error event on the Bus.
  • If a Specialist requires delegation, it must escalate to its PO to spawn the Worker on its behalf — the Specialist does not spawn directly.
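The Governor-side check can be sketched as follows. The depth counter and the spawn_depth_exceeded error are from the rules above; the function name and the exact request/response shapes are assumptions.

```python
MAX_SPAWN_DEPTH = 4  # GO=1, PO=2, Specialist=3, Worker=4

def check_spawn(requester_depth: int) -> dict:
    """Approve or reject a spawn request based on its depth counter.
    A request whose counter has already reached 4 (a Worker) is rejected
    outright: rejected, not queued."""
    if requester_depth >= MAX_SPAWN_DEPTH:
        return {"ok": False, "error": "spawn_depth_exceeded"}
    return {"ok": True, "child_depth": requester_depth + 1}
```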

Rationale

Unbounded delegation creates supervision gaps, makes cost attribution unreliable, and produces runaway agent trees that the Governor cannot control. The 4-level limit reflects the actual operational hierarchy and prevents architectural drift.

Rule

Violations of the depth limit are non-recoverable at spawn time — the spawn is rejected, not queued. The requesting agent must escalate to its parent to re-route the work.


11A. Optimizer Architecture Note

The Optimizer (Dream Engine, Idle Maintenance, Stewards) operates as a set of background cron-like tasks, not a unified service. No shared interface exists — each component reads from and writes to the Bus independently. This is intentional: loose coupling over tight integration.

Implications:

  • Dream Engine, Idle Maintenance tasks, and Stewards are not coordinated by a shared Optimizer supervisor
  • Each component has its own trigger (idle window, schedule, threshold event)
  • Each component writes its proposals and outputs as Bus events
  • No single endpoint or API represents "the Optimizer" — queries go to individual components
  • Adding a new optimizer component requires only that it reads/writes Bus events; no shared interface to extend


12. Design Decisions

These are captured architectural decisions for Swarm v5.0. Each decision is final unless explicitly reopened via the decision log.

12.1 Authentication: User Auth vs Agent Auth

Decision: User authentication (STRUXIO.ai org OAuth) is separate from agent authentication (API keys / Bus tokens).

Rationale: Users are humans who log in to the Dashboard via browser. Agents are processes that authenticate headlessly. Conflating the two creates security complexity and operational fragility.

Rules:

  • Users log in to the Dashboard and Control Center via OAuth (STRUXIO.ai org)
  • Agents authenticate to the Bus via Bus tokens (issued per agent, stored in SOPS)
  • No agent ever holds a user OAuth token
  • No user-facing flow depends on agent tokens

12.2 Dashboard: Per-User Settings and Device Layout

Decision: Per-user settings are stored in the database. Dashboard layout includes a device attribute (desktop, tablet, phone). The last saved layout per device is restored on login.

Rationale: Users access the dashboard from multiple devices with different screen geometries. Persisting layout per device provides continuity without requiring manual reconfiguration.

Rules:

  • Layout state schema includes: user_id, device (enum: desktop | tablet | phone), layout_json, updated_at
  • On login, the Dashboard queries the user's saved layout for the detected device
  • If no saved layout exists for the device, a default preset is applied
  • Layout changes are debounced and auto-saved to the DB (no explicit save button required)
  • Device detection is based on viewport width breakpoints, not user-agent sniffing
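The lookup-with-fallback rule can be sketched as below. The breakpoint values (600 px / 1024 px) and the default preset names are assumptions; the schema fields come from the rules.

```python
# Illustrative per-device layout resolution for the Dashboard.
DEFAULT_PRESETS = {
    "desktop": "default-desktop",
    "tablet": "default-tablet",
    "phone": "default-phone",
}

def detect_device(viewport_width: int) -> str:
    """Breakpoint-based detection (no user-agent sniffing).
    Thresholds are assumed values, not part of the spec."""
    if viewport_width < 600:
        return "phone"
    if viewport_width < 1024:
        return "tablet"
    return "desktop"

def layout_for_login(saved_layouts: dict, user_id: str, viewport_width: int):
    """saved_layouts maps (user_id, device) -> layout_json.
    Falls back to the device's default preset when nothing is saved."""
    device = detect_device(viewport_width)
    return saved_layouts.get((user_id, device), DEFAULT_PRESETS[device])
```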

12.3 Bootstrap: Web Login as Primary Entry Point

Decision: Web login (Dashboard) is the primary entry point for all users. Terminal CLI is the fallback and bootstrap backdoor for scenarios where the system is down or not yet initialized.

Rationale: For normal operation, the web UI provides the governed, auditable, and user-friendly control surface. The CLI must remain functional as a backdoor for infrastructure-level recovery and initial bootstrapping.

Rules:

  • All production operational tasks are expected to flow through the Dashboard
  • CLI is not deprecated — it remains available and maintained as the bootstrap/recovery path
  • If the Bus or Dashboard is unreachable, CLI is the recovery method
  • CLI-based operations that change system state must still write to the Bus (or queue for sync) when the Bus comes back online
  • The bootstrap sequence (new server, fresh install) uses CLI exclusively until the Bus is up

12.4 HO Auto-Discovery

Decision: On activation, each HO automatically reads its host's memory and CPU metrics and reports them to the Bus without requiring manual configuration.

Rationale: Manual host configuration is error-prone and creates onboarding friction. Auto-discovery ensures the capacity table is always accurate from the moment an HO starts.

Rules:

  • On startup, HO reads: total RAM, available RAM, CPU count, CPU model, disk usage, running Docker containers
  • HO writes an initial host_capacity record to the Bus immediately after startup
  • HO continues reporting metrics every 60 seconds via Bus heartbeat event
  • If host specs change (e.g., RAM upgrade), the next HO restart auto-updates the record
  • HO does not require a max_agents config file — it calculates max_agents from available RAM and a per-agent memory estimate (default: 512 MB per agent slot)
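The max_agents derivation can be sketched as a one-liner. Combining the 512 MB per-slot estimate with the 50% host memory target from Layer 2 is an assumption about how the two rules compose; the spec only says max_agents is derived from available RAM and the per-agent estimate.

```python
PER_AGENT_MB = 512      # default per-agent memory estimate (per the rules)
TARGET_MEM_PCT = 50.0   # host memory target from Layer 2 (assumed to cap the budget)

def derive_max_agents(total_ram_mb: int) -> int:
    """Budget only the target fraction of RAM, then divide into agent slots."""
    budget_mb = total_ram_mb * (TARGET_MEM_PCT / 100.0)
    return int(budget_mb // PER_AGENT_MB)
```

For example, a 64 GB host (65536 MB) under a 50% target yields 64 agent slots.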

12.5 GO Warm Standby (I6)

Decision: GO uses a lease-based warm standby model. True leader election is not possible because Claude sessions cannot coordinate directly. Instead, a single-row go_lease table in Bus PostgreSQL acts as the authority.

Rationale: GO is the only singleton in the system with no supervisor above it. Without standby, a crashed GO session leaves the system without a spawn root until manual intervention. The lease model closes this gap with a 30-minute implementation that provides automatic recovery without distributed coordination complexity.

Model:

go_lease (id=1, holder, acquired_at, expires_at, session_id)
  • The active GO renews the lease every minute via POST /agents/000/lease with ttl=300.
  • If the lease has not been renewed for 5 minutes (expires_at < NOW()), it is considered expired.
  • The next GO session to start checks the lease and claims it if expired.
  • Only the current holder or an expired-lease claimer can update the row (enforced by the UPDATE WHERE clause).
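The renew/claim decision above can be modeled in a few lines. In production this is a single Postgres row updated behind an UPDATE ... WHERE guard; the in-memory dict here only illustrates the decision logic, and the function name is an assumption. The TTL values are from the text (renew every minute, ttl=300).

```python
from datetime import datetime, timedelta

TTL = timedelta(seconds=300)  # lease considered expired 5 minutes after last renewal

def try_lease(lease: dict, session_id: str, now: datetime) -> bool:
    """Renew if the caller holds the lease, claim if it has expired,
    otherwise refuse (the Bus returns 409 in this case)."""
    expired = lease["expires_at"] < now
    if lease["holder"] != session_id and not expired:
        return False  # active lease held by another session
    if lease["holder"] != session_id:
        lease["holder"] = session_id     # fresh claim after expiry
        lease["acquired_at"] = now
    lease["expires_at"] = now + TTL      # renew or claim extends the lease
    return True
```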

Bus endpoints:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /agents/000/lease | Return current holder, expires_at, seconds_remaining, expired flag |
| POST | /agents/000/lease | Renew (current holder) or claim (if expired). Returns 409 if lease is active and held by another. |

Heartbeat integration: /opt/struxio/scripts/agent_heartbeat.sh runs every 60 seconds via cron and calls POST /agents/000/lease after the agent register call. This ensures the lease is renewed as long as any GO is alive on the host.

Recovery sequence:

  1. GO session A crashes. Lease expires after 5 minutes.
  2. GO session B starts (new Claude Code session or manual restart).
  3. Session B calls POST /agents/000/lease — succeeds because expires_at < NOW().
  4. Session B is now the active GO. All Bus state is intact. Normal operation resumes.
  5. No manual intervention required.

Limitations: This model does not prevent two simultaneous active GOs during the TTL window. If a GO session goes zombie (still alive but not actively working), the lease renewal keeps it "official" until TTL expires. This is acceptable for the current scale — at most 1 active GO session is expected at any time.


13. Project Roadmap

The following is the priority-ordered roadmap for projects that XIOPro will manage. Each project builds on the previous one, creating a virtuous cycle where the platform improves itself.

| Phase | Project | Description | Dependencies |
| --- | --- | --- | --- |
| 1 | XIOPro Core | Stabilize the platform: complete v5.0 transition (T1-T7), deploy PO/IO/HO roles, implement RBAC, deploy governance breakers, establish observability. This is the foundation everything else depends on. | None |
| 2 | Template Builder | Meta-tool for building project templates. A researcher agent that analyzes a target domain and produces a complete template (stages, steps, gates, agent roles, resource defaults) calibrated to T1P standards. See Part 9, Section 5A. | XIOPro Core stable |
| 3 | AI Project Template | Use the Template Builder to create a model template for AI/software projects, using XIOPro's own architecture as the reference implementation. This template becomes the standard for all future IT projects. | Template Builder operational |
| 4 | XIOPro v6 Self-Build | Regenerate XIOPro using its own AI Project Template. The platform rebuilds itself through its own orchestration layer, validating the template and identifying gaps. This is the ultimate dogfooding exercise. | AI Project Template validated |
| 5 | MVP1 Composite | Launch MVP1 (Paperclip) as a composite project with 4 sub-projects: Platform (IT Project template), Marketing (Marketing template), Knowledge (Knowledge Expert template), Content (Content Creation template). See Part 9, Section 5B for composite project structure. | XIOPro Core stable, templates defined |

Roadmap Principles

  • Phase 1 is non-negotiable: Nothing else starts until XIOPro Core is stable. Attempting to run projects on an unstable platform wastes more time than it saves.
  • Phases 2-4 are the self-improvement loop: The platform gets better at managing projects by managing the project of improving itself.
  • Phase 5 is the first real external-facing project: By the time MVP1 launches, the platform has been tested on itself.
  • Phases can overlap: Phase 5 can begin in parallel with Phase 4 once the core templates are defined and PO orchestration is proven.

14. Operator-to-Project Ratios

14.1 Current Validated Ratio

Current validated ratio: 1 operator managing 2 projects with 12 agents.

This reflects the operational baseline as of v5.0: GO managing XIOPro Core and MVP1 simultaneously, with a combined agent pool of ~12 active agents across both POs.

14.2 Target Ratio

Target: 1 operator managing 5+ projects.

This requires:

  • Stable PO autonomy (PO handles L1/L2 alerts without GO involvement)
  • Reliable capacity management (HO auto-scaling without manual intervention)
  • IO filtering alerts to near-zero operator noise for stable projects
  • Dream Engine handling routine maintenance autonomously

14.3 Stress Testing Requirement

Stress testing is required before the 5-project target is declared achievable.

Minimum stress tests:

  • Simulate 5 concurrent POs with active sprint execution
  • Inject L2 alert storm (10+ simultaneous blocked tickets across projects)
  • Simulate HO capacity saturation and GO spawn-queuing behavior
  • Verify alert routing does not cross project boundaries (see Part 7, Section 10.4)
  • Verify cost attribution remains correct across all 5 projects under load

Results must be recorded before GO declares the 5-project target validated.


Changelog

| Version | Date | Author | Change |
| --- | --- | --- | --- |
| 5.0.4 | 2026-03-30 | GO | N16: Added Section 11A — Optimizer Architecture Note. Clarifies that Dream Engine, Idle Maintenance, and Stewards are independent cron-like tasks with no shared interface. Loose coupling over tight integration is intentional. |
| 5.0.5 | 2026-03-30 | GO | N11: Added Section 11.1 — Maximum Delegation Depth. Hard limit of 4 levels (GO → PO → Specialist → Worker). Governor enforces via spawn depth counter. Deeper spawning rejected at spawn time. |
| 5.0.6 | 2026-03-30 | GO | Round 2 review fixes: Layer 1 — changed "Owns Bus" to "Primary consumer of Bus" (Bus is Layer 0 infrastructure, not owned by GO). Layer 3 — added Master PO note for composite projects with sub-project PO coordination, referencing Part 9 Section 5B. |

End of Part 10