
XIOPro Production Blueprint v5.0

Part 2 — Architecture


1. Purpose of This Part

This document defines the structural architecture of XIOPro:

  • major layers
  • major roles and components
  • runtime topology
  • boundaries between concerns
  • environment roles
  • separation between XIOPro and future STRUXIO product runtime
  • the T1P implementation stack that makes the blueprint actually buildable

Part 1 defines why XIOPro exists. Part 2 defines what the machine is and what technology it is built with.


2. Architectural Thesis

XIOPro is a multi-layer agentic operating system.

It is not one app, one server, one chat, or one model router.

It is composed of:

  • a human interaction surface
  • a control/UI layer
  • an orchestration layer
  • a governed execution fabric
  • a knowledge and research substrate
  • a governance and optimization layer
  • a durable work graph/state layer
  • an infrastructure platform

The architecture must support:

  • continuous headless operation
  • recoverable execution
  • user collaboration
  • provider independence
  • governed evolution
  • future scale without redesign of core logic

3. High-Level Layer Stack

flowchart TD
    Human[User / Human Operator] --> UI[Web Control Center / Mobile Surface]
    UI --> Interaction[Interaction & ContextPrompting Layer]
    Interaction --> Orchestration[Orchestration Layer]
    Orchestration --> Domain[Domain Brain Layer]
    Domain --> Workers[Worker Layer]
    Orchestration --> Governance[Governance & Optimization Layer]
    Governance --> WorkGraph[Work Graph / ODM / State]
    Domain --> Knowledge[Knowledge & Research Layer]
    Workers --> Execution[Execution Targets / External Systems]
    Knowledge --> WorkGraph
    WorkGraph --> UI
    Knowledge --> UI

4. Architectural Layers

4.1 Human Interaction Layer

This is where the user interacts with XIOPro.

Inputs include:

  • exploratory conversation
  • execution-bound discussion
  • approvals
  • rejections
  • clarifications
  • file and image attachments
  • voice input
  • research requests
  • recovery decisions
  • module and routing choices where allowed

Outputs include:

  • tickets
  • decisions
  • constraints
  • clarified intent
  • approvals
  • durable human decision records

This layer must remain:

  • high-bandwidth
  • low-friction
  • mobile-capable
  • durable when it affects execution

4.2 UI Control Layer

The visual control surface of XIOPro.

Responsibilities:

  • display system state
  • host widget-based operator workspaces
  • support brain interaction
  • expose approvals, alerts, and traceability
  • show cost, module, and governance posture
  • expose research and knowledge surfaces
  • support intervention and recovery

The UI is web-based and widget-first. It must never become the only runtime path.


4.3 Orchestration Layer

The central coordinating intelligence that turns structured work into assigned execution.

Responsibilities:

  • create or refine work objects
  • read work graph state
  • assign tickets and tasks
  • coordinate brains and workers
  • preserve continuity across sessions
  • manage execution order
  • react to human gates
  • consume prompt packages from the prompt steward role
  • operate within governance and module constraints

The BrainMaster uses Ruflo (claude-flow) as the agent execution runtime. The orchestrator decides WHAT to execute. Ruflo decides HOW to spawn and manage agents.

This is the control spine of XIOPro.


4.4 Domain Brain Layer

This layer contains specialized long-lived or semi-long-lived brains.

Canonical examples:

  • Compliance (e.g., industry standards)
  • Engineering
  • Brand / Content
  • Finance / Business
  • DevOps / Research

Responsibilities:

  • domain reasoning
  • domain decomposition
  • review of worker outputs
  • knowledge contribution in domain
  • bounded execution through workers or direct action

This layer provides specialization without fragmentation.


4.5 Worker Layer

This layer contains short-lived, bounded, task-specific execution actors.

Responsibilities:

  • execute narrow work
  • run isolated subtasks
  • offload mechanical or lower-cost work
  • operate under parent supervision
  • remain replaceable and bounded

Workers should remain:

  • ephemeral
  • cheap when possible
  • explicitly constrained
  • easy to retire or replace

4.6 Work Graph / State Layer

This layer stores and relates operational objects such as:

  • topics
  • projects
  • sprints
  • tickets
  • tasks
  • activities
  • runtimes
  • sessions
  • escalations
  • human decisions
  • costs
  • alerts
  • evaluations
  • reflections
  • improvements

This is the operational memory and structure of XIOPro.

It is what turns AI behavior into a governed system.


4.7 Knowledge & Research Layer

This layer contains:

  • Librarian
  • rules
  • skills
  • activations
  • patterns
  • protocols
  • indexed documents
  • historical decisions
  • Research Center
  • NotebookLM-related workflows
  • Obsidian-facing structures
  • Hindsight and Dream-derived proposals

Responsibilities:

  • preserve intelligence
  • classify and retrieve documents
  • support research workflows
  • reduce repeated thinking
  • generate reusable knowledge and proposals
  • enable compounding system knowledge

4.8 Governance & Optimization Layer

This layer includes:

  • governor runtime governance
  • rule steward role — rule/skill stewardship
  • prompt steward role — ContextPrompting governance and inquiry discipline
  • module steward role — module portfolio governance and optimization
  • policy objects
  • breakers
  • approval logic
  • audit/event trails

Responsibilities:

  • protect runtime
  • enforce policy
  • govern prompting and assumptions
  • govern rules and activations
  • govern modules and subscriptions
  • surface anomalies and drift
  • preserve explainability and approval discipline

This layer does not replace execution. It protects and improves execution.


4.9 Execution Targets

This is where actions land in the material world:

  • repositories
  • documentation
  • infrastructure
  • APIs
  • rendered outputs
  • tickets and external systems
  • websites
  • research outputs
  • future product runtime services

This is the world XIOPro changes.


5. T1P Implementation Technology Decisions

5.1 Decision Principle

T1P must optimize for:

  • buildability
  • recoverability
  • low moving-part count
  • strong Python integration with the current agent/runtime environment
  • explicit web/mobile support
  • clear separation between control-plane state and UI presentation

The stack below is therefore a deliberate simplification, not a maximal architecture.


5.2 Frontend Stack

T1P frontend stack:

  • TypeScript
  • React 19
  • Next.js App Router
  • shadcn/ui
  • TanStack Query
  • React-Grid-Layout

Rule

The UI is a web application with widget-first composition. It is not the source of truth.

Critical Control Rule

Critical control-plane mutations should flow through the backend API layer, not through opaque frontend-only mutation paths.


5.3 Backend Stack

T1P backend stack:

  • Python 3.12+
  • FastAPI
  • Pydantic v2
  • SQLAlchemy 2
  • Alembic

This stack is the canonical implementation path for:

  • control APIs
  • ODM-backed services
  • scheduler/worker coordination
  • governance services
  • Research Center APIs
  • module telemetry and optimization services

5.4 Python Tooling & Environment Management

Canonical Python tooling:

  • uv for Python version management, environment creation, dependency locking, and tool/script execution

Expected project standards where applicable:

  • pyproject.toml
  • uv.lock
  • .python-version

Rule

uv is the default Python tooling layer for T1P.

It improves:

  • bootstrap speed
  • dependency sync
  • local/server consistency
  • reproducible environments
  • CI and deployment ergonomics

It does not replace the backend framework. It standardizes the Python workflow around it.


5.5 Primary Data Store

Canonical primary data store:

  • PostgreSQL 17.x

Rule

PostgreSQL is the authoritative state store for T1P.

It holds:

  • work graph state
  • sessions
  • escalations
  • human decisions
  • governance records
  • normalized cost telemetry
  • research task metadata
  • scheduler/job state where practical

Conservative Versioning Rule

Even if newer PostgreSQL majors are available, T1P should pin one explicit major version and avoid drifting during early implementation.

For T1P, the pinned target is:

  • PostgreSQL 17.x

5.6 Realtime and UI Update Transport

Default transport decisions:

  • REST/JSON over HTTPS for standard request/response APIs
  • Server-Sent Events (SSE) for one-way live updates to the UI
  • WebSocket only where true bidirectional interactive streaming is required

Use SSE For

  • alerts
  • activity/event feeds
  • cost pulse
  • approval updates
  • trace/status updates
  • research task progress
  • widget refresh streams

Use WebSocket Only For

  • live bidirectional conversation streaming when needed
  • terminal-like interactive traces
  • future cases that truly require two-way socket behavior

Rule

SSE is the default live-update mechanism for T1P because the UI mostly needs server-to-client streaming, not a general-purpose socket layer for every widget.


5.7 Background Execution & Async Backbone

T1P background execution model:

  • authoritative job and execution state in PostgreSQL
  • dedicated Python worker processes
  • scheduler-driven and API-triggered task dispatch
  • explicit polling / claim / update flow for jobs and runtime state

Rule

T1P uses PostgreSQL-backed job dispatch as its async backbone. No separate message broker is required.

No NATS, Redis-stream, or Kafka-style backbone is required in T1P.

The purpose is to keep the system buildable while the canonical work graph and execution flow become real.

Future Expansion Rule

A dedicated event backbone (NATS, Redis-stream, or similar) may be introduced later only if:

  • Postgres-backed dispatch becomes the bottleneck
  • event volume or fan-out justifies it
  • operational value clearly exceeds additional complexity

This is an explicit architectural decision, not an oversight.


5.8 XIOPro Control Bus

The XIOPro Control Bus is the unified communication, coordination, and intervention backbone.

It merges the persistence and cross-host reach of the existing Bus MCP with the orchestration concepts of Ruflo into a single always-on service that every agent and surface can reach.

Principle

Every agent talks to one service for everything: messaging, tasks, state, intervention, spawning.

The Control Bus is not a message broker. It is a stateful coordination service backed by PostgreSQL.

Architecture

graph TB
    subgraph ControlBus["XIOPro Control Bus"]
        REST["REST API :8088"]
        SSE["SSE Push :8089"]
        Worker["Background Worker"]
        REST --- SSE
        REST --- Worker
    end

    PG[("PostgreSQL")] --- REST
    PG --- Worker

    Agents["Domain Brains & Workers"] --> REST
    REST --> Agents
    SSE --> UI["Control Center UI"]
    SSE --> Agents
    Orchestrator["Orchestrator"] --> REST
    Founder["Founder via RC/UI"] --> REST

Capabilities

  • Messaging — POST /messages, GET /messages/poll — persistent async messaging between agents. Existing capability.
  • Push Delivery — SSE /events/{agent_id} — real-time push to agents and UI via Server-Sent Events. Eliminates polling delay.
  • Agent Registration — POST /agents/register, GET /agents — full agent registry with capabilities, host binding, resource requirements. Extends existing heartbeat.
  • Agent Heartbeat — POST /agents/{id}/heartbeat — liveness signal with current task, status, resource usage. Existing capability.
  • Task Orchestration — POST /tasks, PATCH /tasks/{id}, GET /tasks — create, assign, update, query tasks. Backed by ODM schema.
  • Intervention — POST /agents/{id}/pause, /resume, /terminate, /redirect — Founder or the governor can pause, resume, terminate, or redirect any agent. State persisted; agent checks intervention state each cycle.
  • Host Capacity — GET /hosts/{id}/capacity, POST /hosts/register — query available agent slots per host. Pre-spawn gate.
  • Agent Spawning — POST /agents/spawn — request agent spawn on a target host. Bus checks capacity, then triggers a Claude Code subprocess.
  • Cost Tracking — POST /costs, GET /costs/summary — record and query cost ledger entries per agent, task, ticket.
  • Governance Events — POST /alerts, POST /breakers/{id}/trigger — the governor emits alerts and breaker events through the Bus.
  • Agent Auto-Pickup — POST /agents/{id}/pickup, GET /agents/{id}/tasks — agent signals readiness and retrieves its next assigned task without orchestrator polling.
  • Agent Health — GET /agents/health — returns health state of all registered agents: status, last heartbeat, current task.
  • Agent Metrics — GET /agents/{id}/metrics, GET /agents/metrics/summary — per-agent metrics (tokens, tasks completed, cost) and system-wide summary.
  • CLI Services — GET /services, POST /services/{name}/run — 12 config-driven operational CLI services accessible via Bus API or devxio CLI.
  • Template Registry — GET /templates, POST /templates, GET /templates/{id} — 4 agent activation templates; create, list, and retrieve templates by ID.
  • Dashboard — GET /dashboard — single-call endpoint returning unified agent status, task summary, recent alerts, and cost pulse.
  • Message Search — GET /messages/search — full-text search across message history with filters for agent, topic, and date range.
  • Project Lifecycle — GET /projects, PATCH /projects/{id} — project list and lifecycle_phase updates (discovery / active / paused / complete).

Intervention Model

Intervention is a first-class capability, not a side effect.

intervention:
  id: string
  target_agent_id: string
  action: enum
    # pause | resume | terminate | redirect | constrain
  reason: string
  issued_by: string           # founder | 000_governor | 000_orchestrator
  issued_at: datetime
  acknowledged_at: datetime|null
  state: enum
    # pending | acknowledged | applied | rejected | expired
  expires_at: datetime|null

Agents must check for pending interventions:

  • on each activity cycle start
  • when polling for messages
  • via SSE push if connected

Push Delivery Model

SSE channels per agent and per surface:

GET /events/{agent_id}  → agent receives tasks, messages, interventions in real-time
GET /events/ui          → Control Center receives all state changes for live dashboard
GET /events/founder     → Founder receives alerts, approvals, escalations

Push eliminates the poll-only limitation. Agents no longer need to actively check — the Bus pushes to them.

Relationship to Ruflo

Ruflo remains the in-session execution runtime:

  • Spawns sub-agents within a Claude Code session
  • Manages agent lifecycle within session scope
  • Provides memory and coordination tools within session

The Control Bus is the cross-session coordination layer:

  • Persists all state in PostgreSQL
  • Reaches agents across hosts (Hetzner, Mac, future nodes)
  • Survives session restarts
  • Provides intervention and governance

Ruflo reports state UP to the Bus. The Bus does not depend on Ruflo.

Founder/UI → Control Bus → Orchestrator → Ruflo → Agents
                 ↑                          |
                 +---- state reports --------+

Current State and Migration

The existing Bus MCP (bus.struxio.ai) already provides:

  • REST API (port 8088)
  • SSE streaming (port 8089)
  • PostgreSQL persistence
  • OAuth 2.1 authentication
  • message send/poll/ack
  • presence heartbeats
  • Paperclip proxy

Migration path:

  1. Add SSE push channels (per-agent streams) — extends existing SSE
  2. Add intervention endpoints — new CRUD routes
  3. Add task orchestration endpoints — builds on ODM schema
  4. Add agent registration — extends existing heartbeat
  5. Add host capacity endpoints — reads Host Registry
  6. Add spawn endpoint — triggers processes on target hosts

Estimated effort: ~2 weeks. No rewrite — iterative extension of existing service.

See resources/DESIGN_rc_architecture.md for the Remote Control architecture design (Open WebUI evaluation, multi-provider chat routing, Prompt Composer integration with the Control Bus).

Rules

  • The Control Bus is always on. It must survive agent crashes, session restarts, and host reboots.
  • All cross-session state flows through the Bus. No agent-to-agent communication bypasses it.
  • Intervention commands take priority over normal message delivery.
  • The Bus does not execute work. It coordinates. Agents execute.
  • SSE push is the default delivery method. Polling remains as fallback for agents that cannot hold SSE connections.

SSE Reconnection Behavior

SSE connections between agents and the Bus are long-lived but not permanent. Agents must handle disconnections gracefully.

On SSE disconnect:

  • Agent immediately falls back to HTTP polling (bus_poll) for message delivery
  • Agent continues executing its current task without interruption

On reconnect:

  • Agent re-registers its SSE channel with the Bus
  • Agent resumes the event stream from its last known cursor position (Last-Event-ID header)
  • Any events received via polling during the disconnect are deduplicated by the agent using event IDs

Keepalive:

  • The Bus sends an SSE heartbeat comment (:keepalive) every 30 seconds on each active channel
  • If an agent receives no data (including keepalives) for 60 seconds, it considers the connection dropped and initiates reconnection

Reconnection timing:

  • First reconnect attempt: immediate
  • Subsequent attempts: exponential backoff (1s, 2s, 4s, max 30s)
  • After 5 failed reconnection attempts: agent stays on HTTP polling and logs a warning

Bus Crash Recovery & Restart Sequence

The Control Bus is a stateful coordination service. Its restart behavior must be explicit and predictable.

In-Flight Message Handling

All messages are persisted to PostgreSQL before acknowledgement is returned to the sender. On Bus crash, no acknowledged messages are lost. Messages in transit (sent but not yet persisted) will fail at the sender and must be retried by the sending agent.
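A sketch of the sender-side retry, with an injected `send` callable standing in for the HTTP POST (names are illustrative). Combined with a stable idempotency_key, retrying a request the Bus may already have persisted is safe:

```python
def send_with_retry(send, payload: dict, max_tries: int = 3):
    """Retry until the Bus confirms persistence (201 Created) or tries run out."""
    last_error = None
    for _ in range(max_tries):
        try:
            # send() returns only after the Bus has persisted the message;
            # a crash mid-request surfaces here as a connection error
            return send(payload)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```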

Restart Sequence
  1. PostgreSQL connection pool re-established
  2. Bus validates schema and migration state
  3. REST API endpoints become available (health check returns 200)
  4. SSE push channels are re-opened (no automatic client reconnection — clients must reconnect)
  5. Background worker resumes processing pending jobs from the jobs table
  6. Bus emits bus.restarted event on all SSE channels

Pending Intervention Recovery

On restart, the Bus scans the interventions table for interventions in pending or acknowledged state. These are re-queued for delivery. Interventions with expires_at in the past are moved to expired state. No intervention is silently dropped.
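The recovery scan can be sketched as a pure function over intervention records (field names follow the intervention schema; the real version runs as SQL against the interventions table):

```python
from datetime import datetime

def recover_interventions(rows: list[dict], now: datetime) -> tuple[list[dict], list[dict]]:
    """Split pending/acknowledged interventions into re-queue vs expire sets."""
    requeue, expired = [], []
    for iv in rows:
        if iv["state"] not in ("pending", "acknowledged"):
            continue                              # applied/rejected/expired: leave alone
        if iv["expires_at"] is not None and iv["expires_at"] < now:
            expired.append(iv)                    # moved to expired state
        else:
            requeue.append(iv)                    # re-queued for delivery
    return requeue, expired
```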

SSE Subscription Reconnection

SSE connections are stateless server-push streams. On Bus restart or network interruption:

  • Clients must detect connection loss (EventSource onerror or heartbeat timeout)
  • Clients reconnect using the same GET /events/{agent_id} endpoint
  • Clients resume from their last acknowledged cursor position (Last-Event-ID header or ?cursor= parameter)
  • The Bus replays any unacknowledged events from the persistent event log
Message Delivery Guarantees

The Bus provides at-least-once delivery:

  • Every message is persisted before the sender receives 201 Created
  • Consumers poll or receive via SSE and must acknowledge (bus_ack) after processing
  • Unacknowledged messages are re-delivered on the next poll or SSE reconnection
  • Consumers must be idempotent — processing the same message twice must produce the same result
  • The idempotency_key field on messages enables consumer-side deduplication

The Bus does NOT provide exactly-once delivery. Idempotent consumers are required.
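Consumer-side deduplication via idempotency_key can be sketched as a wrapper (illustrative; in production the seen-set would live in PostgreSQL, not process memory):

```python
def make_consumer(handler):
    """Wrap a message handler so redelivered messages are processed once."""
    seen: set[str] = set()
    def consume(message: dict) -> bool:
        key = message["idempotency_key"]
        if key in seen:
            return False                # duplicate delivery: skip, but still ack
        handler(message)
        seen.add(key)                   # record only after successful processing
        return True
    return consume
```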

Input Validation & Rate Limiting

All Bus API endpoints enforce input validation and rate limiting to prevent abuse and injection.

Input Validation
  • All request bodies are validated against JSON Schema before processing
  • Schema definitions are co-located with endpoint handlers and versioned with the API
  • Requests failing validation receive 400 Bad Request with a structured error body listing violations
  • Request size limit: 20 KB body maximum. Requests exceeding this receive 413 Payload Too Large
  • Attachment limit: 8 attachments per message
  • All text fields are HTML-escaped before storage to prevent stored XSS
  • SQL injection is prevented by parameterized queries (SQLAlchemy / node-postgres parameterized statements) — no string concatenation in queries
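The size and escaping rules can be sketched with the standard library. Limits come from the bullets above; the function and error shape are illustrative:

```python
import html

MAX_BODY_BYTES = 20 * 1024      # above this: 413 Payload Too Large
MAX_ATTACHMENTS = 8             # above this: 400 Bad Request

def validate_message(raw_body: bytes, attachments: list) -> tuple[int, dict]:
    """Return an (http_status, payload) pair for one inbound message."""
    if len(raw_body) > MAX_BODY_BYTES:
        return 413, {"error": "payload too large"}
    if len(attachments) > MAX_ATTACHMENTS:
        return 400, {"error": "too many attachments"}
    text = raw_body.decode("utf-8")
    return 201, {"body": html.escape(text)}   # escape before storage (stored XSS)
```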
Rate Limiting
  • Per-agent rate limit: 100 requests/second
  • Global rate limit: 1000 requests/second across all agents
  • Rate limit responses: 429 Too Many Requests with Retry-After header
  • SSE connections: 1 connection per agent per channel (enforced server-side)
  • Rate limit state is held in-memory (per-process) with optional Redis backing for multi-process deployments
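The per-agent limit maps naturally onto a token bucket held in process memory. A sketch (rate and burst figures from the rule above; the class itself is illustrative):

```python
class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float = 100.0, capacity: float = 100.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill by elapsed time, then spend one token if available (else 429)."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per agent id gives the 100 req/s limit; a shared bucket with rate=1000 gives the global limit.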
Enforcement Rule

Input validation and rate limiting are not optional middleware. They are required on every public and agent-facing endpoint from T1P onwards.

Data Access Rule: Bus API vs Direct Database

All agents access PostgreSQL through the Control Bus API by default. Direct database access is reserved for bulk/heavy operations that run local to the database host.

data_access_policy:
  default: bus_api
  # All agents, all hosts — use REST endpoints
  # Auth: OAuth via Bus
  # Latency: ~50-100ms per request
  # Suitable for: task CRUD, state reads, cost logging, queries < 100 rows

  bulk_local: direct_postgresql
  # Only agents running on the SAME HOST as PostgreSQL
  # Use case: imports > 100 rows, report generation, cleanup jobs,
  #           data migration, analytics, batch cost aggregation
  # Auth: local connection (Unix socket or localhost)
  # Latency: ~1-5ms per query

Orchestrator Spawn Rule for Bulk Operations

When the orchestrator receives a task that requires bulk database operations:

  1. Classify the task — is it normal CRUD (<100 rows) or bulk (>100 rows)?
  2. If bulk: spawn the agent on the database host (Hetzner), never on a remote host
  3. Agent connects locally (localhost:5432 or Unix socket), no network overhead
  4. Results flow back through the Bus API as normal

spawn_decision:
  task_type: bulk_import | bulk_report | data_cleanup | migration
  required_host: database_host        # must run on same machine as PostgreSQL
  access_method: direct_postgresql    # bypass Bus for data operations
  result_delivery: bus_api            # results still reported through Bus
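The classify-and-place decision reduces to a routing function. A sketch (the threshold and host names come from this section; the function itself is illustrative):

```python
BULK_ROW_THRESHOLD = 100

def plan_data_task(task_type: str, expected_rows: int) -> dict:
    """Decide access method and spawn host for a data-touching task."""
    bulk = expected_rows > BULK_ROW_THRESHOLD or task_type in (
        "bulk_import", "bulk_report", "data_cleanup", "migration",
    )
    if bulk:
        return {
            "required_host": "database_host",      # same machine as PostgreSQL
            "access_method": "direct_postgresql",  # local socket, no network hop
            "result_delivery": "bus_api",          # results still go through the Bus
        }
    return {
        "required_host": "any",
        "access_method": "bus_api",
        "result_delivery": "bus_api",
    }
```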
Examples

  • Agent reads its next task — Bus API — any host — normal operation, ~60ms is fine
  • Agent updates task status — Bus API — any host — normal operation
  • Morning brief queries 50 activities — Bus API — any host — small result set
  • Import 5000 research records — direct PostgreSQL — database host only — bulk: 500ms local vs 50s via API
  • Sprint cost report across all agents — direct PostgreSQL — database host only — aggregation query over thousands of rows
  • Nightly cleanup of orphaned sessions — direct PostgreSQL — database host only — scan + delete pattern
  • Knowledge Ledger batch write — direct PostgreSQL — database host only — high-volume append

Rule

Remote agents (Mac, future cloud nodes) NEVER get direct PostgreSQL access. If a remote agent needs bulk operations, the orchestrator spawns a local agent on the database host to do the work, then the local agent reports results back through the Bus.


5.9 Reverse Proxy and Edge

Canonical reverse proxy:

  • Caddy

Caddy should handle:

  • HTTPS termination
  • ingress routing
  • host/domain routing
  • simple edge policy
  • low-friction certificate management

5.10 Observability Stack

Canonical T1P observability stack:

  • OpenTelemetry for instrumentation
  • Prometheus for metrics and alert-compatible scraping
  • Grafana for dashboards, alert visualization, and operator inspection

Rule

Observability is not an optional enhancement. Every core service must emit enough signals to support:

  • recovery
  • alerting
  • cost/usage inspection
  • user-facing diagnosis

5.11 Testing Toolchain

Canonical T1P testing tools:

  • pytest for backend unit, integration, and workflow tests
  • Playwright for UI, browser, and mobile-surface end-to-end tests

Optional, not required on day one:

  • Vitest for frontend component/unit tests if UI complexity justifies it

Rule

T1P does not need a sprawling test-tool matrix.

It needs one strong backend runner and one strong browser/E2E runner first.


5.12 CLI Toolchain

Canonical CLI tools for T1P agent and operator workflows:

  • gh — GitHub CLI: repo, PR, issue, release management
  • jq — JSON processor: structured data extraction and transformation
  • yq — YAML processor: config and state file manipulation
  • uv — Python environment: version management, dependency locking, script execution
  • rg (ripgrep) — fast recursive search across codebases and knowledge files
  • fd — fast file finder: replacement for find with sane defaults
  • Ruflo (claude-flow) — agent execution runtime: spawning, coordination, memory
  • sops — secrets encryption: encrypt/decrypt secrets in config files
  • age — encryption backend for SOPS: key management
  • restic — backup: automated snapshots to Backblaze B2

Rule

CLI tools are the primary execution interface for agents. MCP wrappers may exist for discovery or integration, but CLI is the default for production pipelines.

See resources/CLI_TOOLS_ASSESSMENT.md for the full assessment of available CLI tools and their roles.

See resources/DESIGN_cli_services.md for the config-driven CLI services framework design (operational commands executable via Bus API or devxio CLI).


5.13 Implementation Form of Key Roles

For T1P, roles should be implemented pragmatically.

Orchestrator

Implementation form:

  • backend service / orchestration module
  • not a separate mystical agent surface

Governor

Implementation form:

  • backend governance service / policy engine
  • not only a prompt persona

Rule steward / prompt steward / module steward roles

T1P implementation form:

  • explicit services/modules with durable inputs and outputs
  • but allowed to begin as thin application services rather than fully independent distributed systems

Rule

The blueprint keeps the roles. T1P is allowed to implement them with fewer deployables than named roles.


5.14 Security & Session Handling Decision

T1P security stance:

  • operator-first
  • single-tenant or very low-user-count
  • strong infrastructure boundary
  • simple application auth over a strong network boundary

Recommended pattern:

  • Tailscale/private network for admin paths
  • application login/session for the web UI
  • no dependence on the UI for core execution safety
  • no frontend-only secret handling

Secrets at Rest

Secrets are encrypted at rest using SOPS + age. Key stored at ~/age-key.txt.

All configuration files containing secrets must use SOPS encryption. Plaintext secrets must never be committed to any repository.


5.15 Backup & Recovery

Automated backup via Restic to Backblaze B2, daily at 03:00 UTC.

Backup scope includes:

  • PostgreSQL database dumps
  • Configuration files
  • Knowledge repository content
  • State files

Retention policy is managed by Restic pruning rules.

See reference_backblaze.md for current B2 configuration details.


5.16 Final Technology Rule

The purpose of these decisions is to make the blueprint executable.

Any future stack change must be justified by:

  • clear operational gain
  • reduced risk
  • or proven scale pressure

Not by novelty.


6. Runtime Topology

6.1 Node A — Cloud Control Node

Primary always-on environment for:

  • orchestrator
  • governor
  • API/control services
  • PostgreSQL
  • scheduler/workers
  • LiteLLM/router where needed
  • runtime execution fabric
  • knowledge services
  • telemetry and backup jobs

6.2 Node B — Local Operator Node

Primary environment for:

  • user interaction
  • local CLI execution
  • fallback sessions
  • local knowledge access
  • manual validation
  • controlled local experiments

6.3 Node C — Future GPU / Model Node

Reserved for:

  • self-hosted model serving
  • embedding/indexing jobs
  • compute-intensive background work
  • isolated experimental inference

6.4 Node D — Future Product Runtime Node

Reserved for:

  • customer-facing STRUXIO runtime
  • product APIs
  • workloads isolated from XIOPro control-plane services

7. Current State and Evolution

7.1 Current State

As of 2026-03-28, the system operates with:

  • Node A: Hetzner CPX62 (16 vCPU AMD EPYC-Genoa, 30 GB RAM, 150 GB SSD) running 10 Docker containers (post-retirement of devxio-frontend, devxio-bridge, devxio-librarian, Neo4j)
  • Node B: Mac Studio (Mac Worker, agent 010) connected via Tailscale VPN
  • Orchestration: BrainMaster (agent 000) operating as proto-orchestrator
  • Messaging: Bus-based inter-agent messaging (PostgreSQL-backed)
  • Ticket tracking: Paperclip (to be superseded by ODM work graph)
  • Dashboard: dashboard.struxio.ai (React) — current operator UI
  • Knowledge: Hindsight running (localhost:8888/9999, Vectorize.io Docker)
  • Backup: Restic to Backblaze B2, daily 03:00 UTC
  • Secrets: SOPS + age encryption

7.2 Service Migration

Current services will transition to the XIOPro target architecture through a managed migration.

See resources/SERVICE_FATE_MAP_v4_2.md for the explicit mapping of:

  • services to keep as-is
  • services to evolve
  • services to retire
  • services to replace

No big-bang cutover. Old services run in parallel alongside new services until the new services are proven and functional parity is reached.

7.3 Target Direction

Move toward:

  • cleaner orchestrator identity
  • clearer governor identity
  • explicit stewardship roles (rule steward, prompt steward, module steward)
  • a DB-backed work graph
  • a Research Center built on the Librarian
  • a widget-based control center UI
  • a rationalized module portfolio and infrastructure model

8. XIOPro vs Product Runtime

XIOPro must remain conceptually separate from STRUXIO product runtime.

XIOPro

  • internal AI operating system
  • execution and governance substrate
  • research and knowledge system
  • user control center
  • internal optimization machine

STRUXIO Product Runtime

  • customer-facing APIs and services
  • product workloads
  • external runtime isolation
  • product-specific scaling and SLAs

XIOPro may build and operate product runtime, but it must not collapse into it.


9. Architectural Success Criteria

The architecture is successful when:

  • execution continues without UI
  • recovery is practical
  • layers stay separable
  • governance remains explicit
  • knowledge compounds instead of fragments
  • module optimization is real, not ad hoc
  • human collaboration does not break state integrity
  • future scale can happen without rethinking the whole machine

10. Final Statement

XIOPro is not one service.

It is a layered machine for governed execution, collaboration, research, and optimization.

If the layers stay clean, XIOPro remains buildable, recoverable, and evolvable. If they blur, the system regresses into expensive chat-shaped chaos.


Changelog

v5.0.0 (2026-03-28)

Changes from v4.1.0:

  • C2.1: Added Section 5.11 CLI Toolchain with canonical CLI tools table and rule
  • C2.2: Added SOPS + age secrets management to Section 5.13 (Security)
  • C2.3: Added Section 5.14 Backup & Recovery (Restic to Backblaze B2)
  • C2.4: Added service migration reference to Section 7.2, pointing to SERVICE_FATE_MAP_v4_2.md
  • C2.5: Made async backbone decision explicit in Section 5.7 — PostgreSQL-backed dispatch is a deliberate choice, not an omission
  • CX.1: Global naming fix — "Rufio" replaced with "Ruflo" throughout
  • CX.2: Version header updated to 4.2.0, last_updated to 2026-03-28
  • CX.3: Added this changelog section
  • CX.4: Added Current State subsection (Section 7.1) documenting existing infrastructure
  • Clarified 000/Ruflo relationship in Section 4.3
  • Renumbered sections 5.11-5.13 to 5.12-5.15 to accommodate CLI Toolchain insertion
  • Restructured Section 7 from "Current -> Target Evolution" to "Current State and Evolution" with explicit subsections

v5.0.2 (2026-03-28)

Agent naming migration to 3-digit unified IDs:

  • Replaced O00/O01/R01/P01/M01 with 000 role-based naming throughout
  • Replaced B1-B5 with 001-005 (domain brains)
  • Replaced M0 with 010 (Mac Worker)
  • Replaced BM with 000 (BrainMaster)
  • Updated Mermaid diagrams to use 3-digit IDs
  • Preserved Backblaze B2 references unchanged

v5.0.3 (2026-03-28)

Roles over numbers: Removed agent IDs from architectural descriptions, layer headers, diagrams, and role implementation sections. Current State section (7.1) uses "agent NNN" format for operational references. Blueprint describes WHAT roles do, not WHICH agent holds them.

v5.0.12 (2026-03-29)

Cross-references: Added pointer to resources/DESIGN_rc_architecture.md (Remote Control architecture design — Open WebUI evaluation, multi-provider chat routing, Prompt Composer integration) in Section 5.8 context. Added pointer to resources/DESIGN_cli_services.md (CLI services framework — config-driven operational commands via Bus API) in Section 5.12 context.

v5.0.13 (2026-03-29)

Batch BP update from recent tickets: Added 9 new Control Bus endpoint capabilities to Section 5.8 table — agent auto-pickup (/agents/{id}/pickup, /agents/{id}/tasks), agent health (/agents/health), agent metrics (/agents/{id}/metrics, /agents/metrics/summary), CLI services (/services), template registry (/templates), dashboard (/dashboard), message search (/messages/search), and project lifecycle (/projects).

v5.0.14 (2026-03-30)

Review fixes: C3: Added Bus Crash Recovery & Restart Sequence subsection to Section 5.8 — in-flight message handling, restart sequence (5 steps), pending intervention recovery, SSE reconnection behavior, at-least-once delivery guarantees with idempotent consumers. C8: Added Input Validation & Rate Limiting subsection to Section 5.8 — JSON Schema validation on all endpoints, 100 req/s per-agent and 1000 req/s global rate limits, 20KB body limit, 8 attachment limit, HTML escaping, parameterized SQL enforcement.

v5.0.15 (2026-03-30)

Round 2 review fix (SSE reconnection):

  • Section 5.8: Added SSE Reconnection Behavior subsection — documents fallback to HTTP polling on disconnect, reconnect from last cursor via Last-Event-ID, 30-second keepalive heartbeat, 60-second connection drop detection, exponential backoff reconnection timing.