XIOPro Production Blueprint v5.0¶
Part 10 — Swarm Architecture v5.0: Multi-Project Agent Orchestration¶
1. Core Principle¶
The Bus is the brain; agents are the hands.
All state flows through Bus + PostgreSQL. Agents are ephemeral; the Bus is permanent.
No agent owns truth. The Bus owns truth. Agents read state, execute, report results, and terminate. If an agent dies, the Bus state survives and a replacement is spawned from that state.
This is the single most important architectural constraint in the system.
2. Layer Architecture¶
XIOPro operates as a four-layer orchestration stack. Each layer has a clear boundary of responsibility, a clear upward escalation path, and a clear downward delegation path.
Layer 0: Infrastructure (always on, no agents)¶
Infrastructure is not orchestrated — it is permanent. These services run whether or not any agent exists.
| Service | Role | Lifecycle |
|---|---|---|
| Bus (Fastify) | 74+ endpoints, SSE, REST events, message relay | Always on, Governor-managed |
| PostgreSQL | bus + devxio DBs, source of truth | Always on, Governor-managed |
| Governor cron | 5-min health checks, auto-restart of crashed services | Always on, systemd |
| Heartbeat cron | Agent liveness detection | Always on, systemd |
| Caddy | TLS termination, reverse proxy | Always on, systemd |
| Dashboard | Stateless web app, reads Bus + DB | Always on, Caddy-served |
Layer 0 has zero agent dependencies. If every agent in the system dies, Layer 0 continues running and preserves all state.
Layer 1: Global Orchestration¶
| Agent | Count | Role |
|---|---|---|
| GO (Global Orchestrator) | Exactly 1 | Primary consumer of Bus, spawns all other orchestrators, resolves cross-project conflicts |
GO is the singleton root of the agent hierarchy. There is one GO in the entire system, always. GO does not execute work directly — it delegates downward.
GO responsibilities:
- Spawn and monitor HOs, POs, IOs
- Resolve cross-project resource conflicts
- Maintain global capacity table
- Route L3+ alerts
- Enforce budget constraints across all projects
Layer 2: Host Orchestration¶
| Agent | Host | Role |
|---|---|---|
| HO@Server1 | Hetzner CPX62 | Docker, Git repos, containers, agent spawning |
| HO@Mac1 | Mac Studio M1 | Browser testing, SOPS keys, GPU, local ops |
| HO@Server2 (future) | TBD | Horizontal scaling |
| HO@Cloud1 (future) | TBD | Cloud burst capacity |
HO (Host Orchestrator) — one per host. Each HO:
- Reports capacity metrics to GO via Bus: host_id, cpu_pct, mem_pct, active_agents, max_agents
- Enforces local memory limits (target: 50% per host, configurable)
- Spawns/terminates agents on its host as directed by GO or PO
- Refuses new agents if over capacity target, reports constraint to GO
- Manages host-specific infrastructure (Docker containers, filesystem, cron)
Layer 3: Project Orchestration¶
| Agent | Project | Role |
|---|---|---|
| PO-MVP1 | Paperclip MVP1 | Sprint plan, ticket queue, velocity, specialist agents |
| PO-XIOPro | XIOPro platform | Platform architecture, self-building pipeline |
| PO-{future} | Any new project | Spawned by GO when project is created |
PO (Project Orchestrator) — one per active project. Each PO owns the Project Template lifecycle pipeline.
PO responsibilities:
- Own sprint plan, ticket queue, and velocity tracking
- Spawn and manage Contextual Agents (long-lived, project-scoped: branding, domain expert, architecture)
- Spawn and manage Ephemeral Agents (task-scoped: coder, researcher, ops, designer)
- Resolve L1-L2 alerts within project scope
- Escalate to GO for cross-project issues
Master PO for composite projects: For composite projects (projects with sub-projects linked via parent_project_id), the top-level PO acts as a Master PO. The Master PO coordinates sub-project POs, resolves cross-sub-project dependencies, and aggregates cost and velocity reporting. Sub-project POs operate autonomously within their scope but escalate to the Master PO for cross-boundary issues. See Part 9, Section 5B for composite project structure.
Layer 4: Interaction Orchestration¶
| Agent | User | Role |
|---|---|---|
| IO@Shai | Shai (founder) | Alert triage, human decisions, idea capture, RC session bridge |
| IO@Client1 (future) | External stakeholder | Scoped project visibility |
IO (Interaction Orchestrator) — one per human user. Each IO:
- Triages alerts by level and routes to the correct human
- Bridges RC (Remote Control) sessions between human and Bus
- Captures ideas from human observation and records as Bus events
- Presents summaries, cost reports, and decision requests
- Never executes work — only mediates between humans and the agent hierarchy
3. Agent Identity¶
Agent identity follows the pattern: Role @ Host.
The host is a deployment detail, not identity. The role defines behavior.
| Agent ID | Role | Notes |
|---|---|---|
| GO | Global Orchestrator | Lives on primary host, singleton |
| HO@Server1 | Host Orchestrator on Hetzner | Was part of GO's infra duties |
| HO@Mac1 | Host Orchestrator on Mac | Was MO |
| PO-MVP1 | Project Orchestrator for MVP1 | New role |
| PO-XIOPro | Project Orchestrator for XIOPro | New role |
| IO@Shai | Interaction Orchestrator for Shai | New role |
Migration from v4.3 naming:
- BM → split into GO + HO@Server1
- MO → renamed to HO@Mac1
- C0 → absorbed into IO@Shai
- Specialist agents → spawned by PO, not directly by GO
4. Alert Taxonomy¶
All alerts flow through the Bus. Each alert has a level that determines routing.
| Level | Category | Examples | Handler | Escalation |
|---|---|---|---|---|
| L0 | Infrastructure | Container down, disk full, OOM | Governor cron auto-resolves | → HO if unresolved |
| L1 | Execution | Test fail, build break, lint error | Specialist retries (3-strike) | → PO |
| L2 | Project | Blocked ticket, design decision needed | PO resolves | → GO |
| L3 | Cross-project | Resource conflict, priority clash | GO resolves | → IO |
| L4 | Human-required | Budget approval, DNS, API keys, branding | IO routes to human via RC | Blocks until resolved |
| L5 | Strategic | New project, pivot, scope change | IO captures, GO plans | Human decision required |
The 3-strike circuit breaker applies at L1: if a specialist fails 3 times on the same task, it halts and escalates to PO. PO may reassign, change approach, or escalate to GO.
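The 3-strike breaker at L1 amounts to a small per-task failure counter. A minimal Python sketch (class and method names are illustrative, not the shipped implementation):

```python
class StrikeBreaker:
    """3-strike circuit breaker for L1 execution failures (illustrative sketch).

    A specialist records each failure per task; on the third strike the
    breaker trips and the task must be halted and escalated to the PO.
    """

    MAX_STRIKES = 3

    def __init__(self):
        self.strikes = {}  # task_id -> consecutive failure count

    def record_failure(self, task_id: str) -> bool:
        """Return True when the specialist must halt and escalate to PO."""
        self.strikes[task_id] = self.strikes.get(task_id, 0) + 1
        return self.strikes[task_id] >= self.MAX_STRIKES

    def record_success(self, task_id: str) -> None:
        """A success resets the count for that task."""
        self.strikes.pop(task_id, None)
```

PO may then reassign, change approach, or escalate to GO, as described above.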
5. Capacity Management¶
Capacity is managed collaboratively between GO and HOs.
5.1 Host Capacity Table (Bus DB)¶
CREATE TABLE host_capacity (
host_id TEXT PRIMARY KEY,
cpu_pct REAL NOT NULL DEFAULT 0,
mem_pct REAL NOT NULL DEFAULT 0,
active_agents INTEGER NOT NULL DEFAULT 0,
max_agents INTEGER NOT NULL DEFAULT 10,
target_mem REAL NOT NULL DEFAULT 50.0,
last_report TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
5.2 Reporting Cycle¶
- Each HO reports metrics every 60 seconds via Bus event
- GO reads the capacity table before any spawn decision
- If no host has capacity, GO queues the spawn request and alerts IO
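Each 60-second report can be written as a single upsert against the `host_capacity` table from 5.1, so the Bus DB always holds exactly one fresh row per host. A sketch (the builder function and psycopg-style `%s` placeholders are assumptions; column names follow the 5.1 DDL):

```python
def build_capacity_report(host_id, cpu_pct, mem_pct, active_agents, max_agents=10):
    """Build the HO's periodic capacity upsert (illustrative sketch).

    ON CONFLICT keeps one row per host, refreshed on every report cycle.
    Returns (sql, params) for a parameterized execute.
    """
    sql = (
        "INSERT INTO host_capacity "
        "(host_id, cpu_pct, mem_pct, active_agents, max_agents, last_report) "
        "VALUES (%s, %s, %s, %s, %s, NOW()) "
        "ON CONFLICT (host_id) DO UPDATE SET "
        "cpu_pct = EXCLUDED.cpu_pct, mem_pct = EXCLUDED.mem_pct, "
        "active_agents = EXCLUDED.active_agents, "
        "max_agents = EXCLUDED.max_agents, last_report = NOW()"
    )
    params = (host_id, cpu_pct, mem_pct, active_agents, max_agents)
    return sql, params
```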
5.3 Spawn Decision Flow¶
GO receives spawn request
→ Query host_capacity WHERE mem_pct < target_mem
→ Select host with lowest mem_pct
→ Send spawn command to that host's HO via Bus
→ HO spawns agent, confirms via Bus
→ If no capacity: queue + alert IO
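The selection step in the flow above is a filter-then-minimum over the capacity table. A minimal Python sketch (dataclass and function names are illustrative; eligibility criteria follow 5.1 and the HO rules in Layer 2):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HostCapacity:
    """Mirror of the host_capacity row (5.1) used in the spawn decision."""
    host_id: str
    mem_pct: float
    target_mem: float
    active_agents: int
    max_agents: int

def select_spawn_host(hosts: List[HostCapacity]) -> Optional[HostCapacity]:
    """Pick the lowest-mem_pct host below its memory target and agent cap.

    Returns None when no host qualifies, in which case GO queues the
    spawn request and alerts IO.
    """
    eligible = [h for h in hosts
                if h.mem_pct < h.target_mem and h.active_agents < h.max_agents]
    if not eligible:
        return None  # queue + alert IO
    return min(eligible, key=lambda h: h.mem_pct)
```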
6. Context Rotation v2 (Bus-Native)¶
Context rotation moves from file-based handoff to Bus-native events.
6.1 Rotation Event¶
{
"type": "rotation",
"agent_id": "PO-MVP1",
"state_summary": { "current_ticket": "T1P-042", "progress": "tests passing, PR pending" },
"next_task": "merge PR and move to T1P-043",
"bus_cursor": 847291
}
6.2 Rotation Flow¶
- Agent detects context limit approaching
- Agent writes rotation event to Bus: POST /events
- Agent terminates
- Orchestrator (GO or PO) sees rotation event via bus_poll
- Orchestrator spawns replacement agent with state reference
- Replacement reads Bus state from cursor, resumes
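On the orchestrator side, the rotation event from 6.1 maps directly onto the replacement's spawn request. A sketch of that mapping (the spawn-request field names are assumptions; the rotation fields are from 6.1):

```python
def replacement_spawn_request(rotation_event: dict) -> dict:
    """Turn a 6.1 rotation event into a spawn request for the replacement.

    The replacement inherits the role identity, resumes reading Bus state
    from the recorded cursor, and starts on the declared next task.
    """
    return {
        "type": "spawn",
        "role": rotation_event["agent_id"],               # same role resumes
        "resume_from_cursor": rotation_event["bus_cursor"],
        "initial_task": rotation_event["next_task"],
    }
```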
6.3 Crash Recovery¶
- Heartbeat cron detects missing heartbeat (agent not responding)
- Governor alerts GO via Bus
- GO reads last Bus state for that agent
- GO spawns replacement, passes last known state
- Replacement resumes from last checkpoint
No file-based handoff. No conversation memory dependency. Bus is the only state authority.
7. Dashboard¶
The Dashboard is Layer 0 infrastructure, not an agent.
- Technology: Stateless React app
- Data source: Bus REST API + PostgreSQL direct reads
- Auth: OAuth/API keys per user (future; currently internal-only)
- Multi-user: Views scoped by project membership
- No agent needed: Dashboard reads data; it does not orchestrate
Dashboard views:
- Agent map (who is running, where, what task)
- Alert feed (L0-L5, filterable)
- Capacity heatmap (per host)
- Sprint board (per project, from PO state)
- Cost tracker (per project, per agent, per model)
- Idea pipeline (from IO captures)
8. Idea Lifecycle¶
Ideas are the first stage of the Project Template pipeline. They are not tickets — they are raw observations that may or may not become work.
8.1 Lifecycle Stages¶
Trigger (human observation)
→ Capture (IO records as Bus event)
→ Research (agent explores feasibility, market, prior art)
→ Brainstorm (architectural discussion, trade-off analysis)
→ Decision (human approves or rejects)
→ Blueprint (if approved, becomes Part N of project blueprint)
→ Tickets (work plan decomposition)
→ Execute (PO manages via sprint)
8.2 Idea Event Schema¶
{
"type": "idea",
"source": "IO@Shai",
"title": "Add multi-tenant support",
"context": "Observed during client demo — they asked about team access",
"priority_hint": "medium",
"project_hint": "MVP1"
}
IO captures ideas without interrupting execution flow. GO reviews idea backlog during planning cycles.
9. Migration from v4.3¶
This architecture replaces the flat BM/MO/C0 model with a layered hierarchy. The migration is defined in transition_v5.yaml and proceeds in 7 phases:
- T1: Split GO from BM (separate global orchestration from host management)
- T2: Rename MO to HO@Mac1 (standardize host orchestrator naming)
- T3: Create PO role (project-scoped orchestration)
- T4: Create IO role (human interaction mediation)
- T5: Alert taxonomy in Bus DB (structured routing)
- T6: Capacity management (host metrics, spawn decisions)
- T7: Context rotation v2 (Bus-native state transfer)
Each phase is independently deployable. No phase requires all previous phases to be complete, though T1 should precede T3 (PO needs GO to exist).
10. Bus Degraded Mode¶
The Bus is permanent infrastructure, but transient unavailability (restart, network blip, deploy) must not halt agent execution. This section defines the expected behavior during Bus downtime.
10.1 Agent Behavior During Bus Unavailability¶
When an agent cannot reach the Bus:
- Continue executing the current task. The agent does not pause, halt, or block on Bus availability. Work in progress is not abandoned.
- Queue Bus writes locally (in-memory). Any event that would normally be written to the Bus (heartbeats, state updates, results, rotation events) is held in a local write queue for the duration of the outage.
- Do not queue indefinitely. If the local queue exceeds 500 entries or the agent reaches context rotation while the Bus is still down, the agent writes its state to the handoff directory (~/STRUXIO_Workspace/STRUXIO_OS/struxio-control/state/handoff/) as a fallback checkpoint.
10.2 Reconnection and Flush¶
On Bus reconnection:
- Flush the local write queue to the Bus in order (oldest event first).
- Write a bus_reconnect event with the flush count and gap duration.
- Resume normal Bus-native operation.
No deduplication is required on flush — the Bus event log is append-only. Downstream consumers must handle idempotency for events that may have been delayed.
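The queue-and-flush behavior of 10.1 and 10.2 can be sketched as a small in-memory structure (class and field names are illustrative, not the shipped implementation):

```python
class LocalBusQueue:
    """In-memory queue of Bus writes held during an outage (sketch).

    Per 10.1, the queue holds events an agent could not write; past 500
    entries the agent must write a filesystem checkpoint. Per 10.2, on
    reconnect events flush oldest-first, followed by a bus_reconnect
    event carrying the flush count and gap duration.
    """

    MAX_ENTRIES = 500

    def __init__(self):
        self._events = []

    def enqueue(self, event: dict) -> None:
        self._events.append(event)

    @property
    def needs_checkpoint(self) -> bool:
        """True once the queue exceeds the 500-entry threshold."""
        return len(self._events) > self.MAX_ENTRIES

    def flush(self, gap_seconds: float) -> list:
        """Return all queued events oldest-first plus a trailing bus_reconnect."""
        flushed = self._events
        self._events = []
        flushed.append({
            "type": "bus_reconnect",
            "flush_count": len(flushed),  # queued events before this marker
            "gap_seconds": gap_seconds,
        })
        return flushed
```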
10.3 SSE Reconnection¶
SSE (Server-Sent Events) connections use the retry:5000 directive already implemented in the Bus. On connection drop:
- The browser/client automatically retries after 5 seconds.
- The Last-Event-ID header is sent on reconnect, allowing the Bus to resume the stream from the last acknowledged event.
- No manual reconnection logic is required in the client.
10.4 REST Retry Policy¶
All REST calls from agents to the Bus use exponential backoff:
| Attempt | Delay |
|---|---|
| 1 (initial) | immediate |
| 2 (first retry) | 1 second |
| 3 (second retry) | 2 seconds |
| 4 (third retry) | 4 seconds |
After 3 retries (4 total attempts) without success, the write is placed in the local queue (see 10.1) and the agent continues executing. The failure is logged locally but does not raise an alert until the Bus is confirmed down for > 60 seconds (Governor detection window).
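The schedule above can be sketched as follows (the `post` callable stands in for the real Bus client and is an assumption; the delay schedule matches the table):

```python
import time

def bus_write_with_retry(post, event, max_retries=3, sleep=time.sleep):
    """POST an event to the Bus with the 10.4 backoff schedule (sketch).

    Attempt 1 fires immediately; retries wait 1s, 2s, 4s. Returns the
    response on success, or None after 4 failed attempts, at which point
    the caller places the event in the local queue (10.1) and continues.
    `sleep` is injectable for testing.
    """
    delay = 1.0
    for attempt in range(max_retries + 1):
        try:
            return post(event)
        except Exception:
            if attempt == max_retries:
                return None  # hand the event to the local queue
            sleep(delay)
            delay *= 2.0
```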
10.5 Governor Role During Bus Outage¶
The Governor cron (5-minute cycle) detects Bus unavailability via the /healthz endpoint. On detection:
- Governor logs the outage event to /opt/struxio/logs/governor.log.
- Governor attempts docker restart deploy-bus-1 after the first failed cycle.
- If Bus does not recover within 2 cycles (10 minutes), Governor raises an L0 alert via the filesystem alert queue (/opt/struxio/bus/alerts/).
- On Bus recovery, Governor logs the recovery event and the outage duration.
Agents are not expected to know about the Governor cycle. Their only obligation is the retry policy (10.4) and local queue (10.1).
11. Design Constraints¶
- No agent owns state. The Bus + PostgreSQL own state. Agents read and write via Bus API.
- No agent is irreplaceable. Any agent can be terminated and respawned from Bus state.
- No cross-layer shortcuts. L3 agents do not talk to L0 infrastructure directly. They go through their HO.
- No implicit communication. All agent-to-agent communication goes through the Bus. No direct function calls between agents.
- Cost visibility is mandatory. Every agent reports token usage. Every PO reports project cost. GO reports total cost. IO presents cost to human.
11.1 Maximum Delegation Depth¶
The spawn hierarchy has a hard depth limit of 4 levels:
- No agent at L4 (Worker) may spawn additional agents.
- No delegation path may exceed 4 hops from GO.
- The Governor enforces this constraint via a spawn depth counter attached to every spawn request. Each spawn increments the counter; any spawn request with depth ≥ 4 is rejected with a spawn_depth_exceeded error event on the Bus.
- If a Specialist requires delegation, it must escalate to its PO to spawn the Worker on its behalf — the Specialist does not spawn directly.
Rationale¶
Unbounded delegation creates supervision gaps, makes cost attribution unreliable, and produces runaway agent trees that the Governor cannot control. The 4-level limit reflects the actual operational hierarchy and prevents architectural drift.
Rule¶
Violations of the depth limit are non-recoverable at spawn time — the spawn is rejected, not queued. The requesting agent must escalate to its parent to re-route the work.
11A. Optimizer Architecture Note¶
The Optimizer (Dream Engine, Idle Maintenance, Stewards) operates as a set of background cron-like tasks, not a unified service. No shared interface exists — each component reads from and writes to the Bus independently. This is intentional: loose coupling over tight integration.
Implications:
- Dream Engine, Idle Maintenance tasks, and Stewards are not coordinated by a shared Optimizer supervisor
- Each component has its own trigger (idle window, schedule, threshold event)
- Each component writes its proposals and outputs as Bus events
- No single endpoint or API represents "the Optimizer" — queries go to individual components
- Adding a new optimizer component requires only that it reads/writes Bus events; no shared interface to extend
12. Design Decisions¶
These are captured architectural decisions for Swarm v5.0. Each decision is final unless explicitly reopened via the decision log.
12.1 Authentication: User Auth vs Agent Auth¶
Decision: User authentication (STRUXIO.ai org OAuth) is separate from agent authentication (API keys / Bus tokens).
Rationale: Users are humans who log in to the Dashboard via browser. Agents are processes that authenticate headlessly. Conflating the two creates security complexity and operational fragility.
Rules:
- Users log in to the Dashboard and Control Center via OAuth (STRUXIO.ai org)
- Agents authenticate to the Bus via Bus tokens (issued per agent, stored in SOPS)
- No agent ever holds a user OAuth token
- No user-facing flow depends on agent tokens
12.2 Dashboard: Per-User Settings and Device Layout¶
Decision: Per-user settings are stored in the database. Dashboard layout includes a device attribute (desktop, tablet, phone). The last saved layout per device is restored on login.
Rationale: Users access the dashboard from multiple devices with different screen geometries. Persisting layout per device provides continuity without requiring manual reconfiguration.
Rules:
- Layout state schema includes: user_id, device (enum: desktop | tablet | phone), layout_json, updated_at
- On login, the Dashboard queries the user's saved layout for the detected device
- If no saved layout exists for the device, a default preset is applied
- Layout changes are debounced and auto-saved to the DB (no explicit save button required)
- Device detection is based on viewport width breakpoints, not user-agent sniffing
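The viewport-based classification can be sketched in a few lines. The specific breakpoint values (768 px, 1200 px) are assumptions, not decided values; only the three-way enum and width-based detection are from 12.2:

```python
def detect_device(viewport_width_px: int) -> str:
    """Classify a viewport into the 12.2 device enum (sketch).

    Breakpoint values are illustrative assumptions. Detection uses
    viewport width only, never user-agent sniffing.
    """
    if viewport_width_px < 768:
        return "phone"
    if viewport_width_px < 1200:
        return "tablet"
    return "desktop"
```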
12.3 Bootstrap: Web Login as Primary Entry Point¶
Decision: Web login (Dashboard) is the primary entry point for all users. Terminal CLI is the fallback and bootstrap backdoor for scenarios where the system is down or not yet initialized.
Rationale: For normal operation, the web UI provides the governed, auditable, and user-friendly control surface. The CLI must remain functional as a backdoor for infrastructure-level recovery and initial bootstrapping.
Rules:
- All production operational tasks are expected to flow through the Dashboard
- CLI is not deprecated — it remains available and maintained as the bootstrap/recovery path
- If the Bus or Dashboard is unreachable, CLI is the recovery method
- CLI-based operations that change system state must still write to the Bus (or queue for sync) when the Bus comes back online
- The bootstrap sequence (new server, fresh install) uses CLI exclusively until the Bus is up
12.4 HO Auto-Discovery¶
Decision: On activation, each HO automatically reads its host's memory and CPU metrics and reports them to the Bus without requiring manual configuration.
Rationale: Manual host configuration is error-prone and creates onboarding friction. Auto-discovery ensures the capacity table is always accurate from the moment an HO starts.
Rules:
- On startup, HO reads: total RAM, available RAM, CPU count, CPU model, disk usage, running Docker containers
- HO writes an initial host_capacity record to the Bus immediately after startup
- HO continues reporting metrics every 60 seconds via Bus heartbeat event
- If host specs change (e.g., RAM upgrade), the next HO restart auto-updates the record
- HO does not require a max_agents config file — it calculates max_agents from available RAM and a per-agent memory estimate (default: 512 MB per agent slot)
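The max_agents derivation can be sketched as follows. The exact formula is an assumption; the 512 MB per-slot default and the 50% memory target are from this section and Layer 2:

```python
def compute_max_agents(available_ram_mb: int,
                       per_agent_mb: int = 512,
                       mem_target_pct: float = 50.0) -> int:
    """Derive max_agents from available RAM (illustrative sketch).

    Only the host's memory-target share (default 50%) is budgeted for
    agent slots, at 512 MB per slot; at least one slot is always kept.
    """
    budget_mb = available_ram_mb * (mem_target_pct / 100.0)
    return max(1, int(budget_mb // per_agent_mb))
```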
12.5 GO Warm Standby (I6)¶
Decision: GO uses a lease-based warm standby model. True leader election is not possible because Claude sessions cannot coordinate directly. Instead, a single-row go_lease table in Bus PostgreSQL acts as the authority.
Rationale: GO is the only singleton in the system with no supervisor above it. Without standby, a crashed GO session leaves the system without a spawn root until manual intervention. The lease model closes this gap with a 30-minute implementation that provides automatic recovery without distributed coordination complexity.
Model:
- The active GO renews the lease every minute via POST /agents/000/lease with ttl=300.
- If the lease has not been renewed for 5 minutes (expires_at < NOW()), it is considered expired.
- The next GO session to start checks the lease and claims it if expired.
- Only the current holder or an expired-lease claimer can update the row (enforced by the UPDATE WHERE clause).
Bus endpoints:
| Method | Path | Purpose |
|---|---|---|
| GET | /agents/000/lease | Return current holder, expires_at, seconds_remaining, expired flag |
| POST | /agents/000/lease | Renew (current holder) or claim (if expired). Returns 409 if lease is active and held by another. |
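The guarded UPDATE behind the POST handler can be sketched as below. This assumes a single-row table keyed by id = 1; the go_lease columns beyond holder and expires_at, and the helper names, are illustrative:

```python
# Renew-or-claim in one statement: succeeds only for the current holder
# or when the lease has already expired (the UPDATE WHERE clause).
LEASE_CLAIM_SQL = """
UPDATE go_lease
   SET holder = %(holder)s,
       expires_at = NOW() + make_interval(secs => %(ttl)s)
 WHERE id = 1
   AND (holder = %(holder)s OR expires_at < NOW())
"""

def lease_status_code(rows_updated: int) -> int:
    """409 when the lease is active and held by another GO, else 200."""
    return 200 if rows_updated == 1 else 409
```

Because the claim condition lives in the WHERE clause, two competing sessions cannot both succeed: the row update is atomic, and the loser sees zero rows updated and receives 409.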
Heartbeat integration: /opt/struxio/scripts/agent_heartbeat.sh runs every 60 seconds via cron and calls POST /agents/000/lease after the agent register call. This ensures the lease is renewed as long as any GO is alive on the host.
Recovery sequence:
1. GO session A crashes. Lease expires after 5 minutes.
2. GO session B starts (new Claude Code session or manual restart).
3. Session B calls POST /agents/000/lease — succeeds because expires_at < NOW().
4. Session B is now the active GO. All Bus state is intact. Normal operation resumes.
5. No manual intervention required.
Limitations: This model does not prevent two simultaneous active GOs during the TTL window. If a GO session goes zombie (still alive but not actively working), the lease renewal keeps it "official" until TTL expires. This is acceptable for the current scale — at most 1 active GO session is expected at any time.
13. Project Roadmap¶
The following is the priority-ordered roadmap for projects that XIOPro will manage. Each project builds on the previous one, creating a virtuous cycle where the platform improves itself.
| Phase | Project | Description | Dependencies |
|---|---|---|---|
| 1 | XIOPro Core | Stabilize the platform: complete v5.0 transition (T1-T7), deploy PO/IO/HO roles, implement RBAC, deploy governance breakers, establish observability. This is the foundation everything else depends on. | None |
| 2 | Template Builder | Meta-tool for building project templates. A researcher agent that analyzes a target domain and produces a complete template (stages, steps, gates, agent roles, resource defaults) calibrated to T1P standards. See Part 9, Section 5A. | XIOPro Core stable |
| 3 | AI Project Template | Use the Template Builder to create a model template for AI/software projects, using XIOPro's own architecture as the reference implementation. This template becomes the standard for all future IT projects. | Template Builder operational |
| 4 | XIOPro v6 Self-Build | Regenerate XIOPro using its own AI Project Template. The platform rebuilds itself through its own orchestration layer, validating the template and identifying gaps. This is the ultimate dogfooding exercise. | AI Project Template validated |
| 5 | MVP1 Composite | Launch MVP1 (Paperclip) as a composite project with 4 sub-projects: Platform (IT Project template), Marketing (Marketing template), Knowledge (Knowledge Expert template), Content (Content Creation template). See Part 9, Section 5B for composite project structure. | XIOPro Core stable, templates defined |
Roadmap Principles¶
- Phase 1 is non-negotiable: Nothing else starts until XIOPro Core is stable. Attempting to run projects on an unstable platform wastes more time than it saves.
- Phases 2-4 are the self-improvement loop: The platform gets better at managing projects by managing the project of improving itself.
- Phase 5 is the first real external-facing project: By the time MVP1 launches, the platform has been tested on itself.
- Phases can overlap: Phase 5 can begin in parallel with Phase 4 once the core templates are defined and PO orchestration is proven.
14. Operator-to-Project Ratios¶
14.1 Current Validated Ratio¶
Current validated ratio: 1 operator managing 2 projects with 12 agents.
This reflects the operational baseline as of v5.0: GO managing XIOPro Core and MVP1 simultaneously, with a combined agent pool of ~12 active agents across both POs.
14.2 Target Ratio¶
Target: 1 operator managing 5+ projects.
This requires:
- Stable PO autonomy (PO handles L1/L2 alerts without GO involvement)
- Reliable capacity management (HO auto-scaling without manual intervention)
- IO filtering alerts to near-zero operator noise for stable projects
- Dream Engine handling routine maintenance autonomously
14.3 Stress Testing Requirement¶
Stress testing is required before the 5-project target is declared achievable.
Minimum stress tests:
- Simulate 5 concurrent POs with active sprint execution
- Inject L2 alert storm (10+ simultaneous blocked tickets across projects)
- Simulate HO capacity saturation and GO spawn-queuing behavior
- Verify alert routing does not cross project boundaries (see Part 7, Section 10.4)
- Verify cost attribution remains correct across all 5 projects under load
Results must be recorded before GO declares the 5-project target validated.
Changelog¶
| Version | Date | Author | Change |
|---|---|---|---|
| 5.0.4 | 2026-03-30 | GO | N16: Added Section 11A — Optimizer Architecture Note. Clarifies that Dream Engine, Idle Maintenance, and Stewards are independent cron-like tasks with no shared interface. Loose coupling over tight integration is intentional. |
| 5.0.5 | 2026-03-30 | GO | N11: Added Section 11.1 — Maximum Delegation Depth. Hard limit of 4 levels (GO → PO → Specialist → Worker). Governor enforces via spawn depth counter. Deeper spawning rejected at spawn time. |
| 5.0.6 | 2026-03-30 | GO | Round 2 review fixes: Layer 1 — changed "Owns Bus" to "Primary consumer of Bus" (Bus is Layer 0 infrastructure, not owned by GO). Layer 3 — added Master PO note for composite projects with sub-project PO coordination, referencing Part 9 Section 5B. |
End of Part 10