
XIOPro Production Blueprint v5.0

Part 2 — Architecture


1. Purpose of This Part

This document defines the structural architecture of XIOPro:

  • major layers
  • major roles and components
  • runtime topology
  • boundaries between concerns
  • environment roles
  • separation between XIOPro and future STRUXIO product runtime
  • the T1P implementation stack that makes the blueprint actually buildable

Part 1 defines why XIOPro exists. Part 2 defines what the machine is and what technology it is built with.


2. Architectural Thesis

XIOPro is a multi-layer agentic operating system.

It is not one app, one server, one chat, or one model router.

It is composed of:

  • a human interaction surface
  • a control/UI layer
  • an orchestration layer
  • a governed execution fabric
  • a knowledge and research substrate
  • a governance and optimization layer
  • a durable work graph/state layer
  • an infrastructure platform

The architecture must support:

  • continuous headless operation
  • recoverable execution
  • user collaboration
  • provider independence
  • governed evolution
  • future scale without redesign of core logic

3. High-Level Layer Stack

flowchart TD
    Human[User / Human Operator] --> UI[Web Control Center / Mobile Surface]
    UI --> Interaction[Interaction & ContextPrompting Layer]
    Interaction --> Orchestration[Orchestration Layer]
    Orchestration --> Domain[Domain Brain Layer]
    Domain --> Workers[Worker Layer]
    Orchestration --> Governance[Governance & Optimization Layer]
    Governance --> WorkGraph[Work Graph / ODM / State]
    Domain --> Knowledge[Knowledge & Research Layer]
    Workers --> Execution[Execution Targets / External Systems]
    Knowledge --> WorkGraph
    WorkGraph --> UI
    Knowledge --> UI

4. Architectural Layers

4.1 Human Interaction Layer

This is where the user interacts with XIOPro.

Inputs include:

  • exploratory conversation
  • execution-bound discussion
  • approvals
  • rejections
  • clarifications
  • file and image attachments
  • voice input
  • research requests
  • recovery decisions
  • module and routing choices where allowed

Outputs include:

  • tickets
  • decisions
  • constraints
  • clarified intent
  • approvals
  • durable human decision records

This layer must remain:

  • high-bandwidth
  • low-friction
  • mobile-capable
  • durable when it affects execution

4.2 UI Control Layer

The visual control surface of XIOPro.

Responsibilities:

  • display system state
  • host widget-based operator workspaces
  • support brain interaction
  • expose approvals, alerts, and traceability
  • show cost, module, and governance posture
  • expose research and knowledge surfaces
  • support intervention and recovery

The UI is web-based and widget-first. It must never become the only runtime path.


4.3 Orchestration Layer

The central coordinating intelligence that turns structured work into assigned execution.

Responsibilities:

  • create or refine work objects
  • read work graph state
  • assign tickets and tasks
  • coordinate brains and workers
  • preserve continuity across sessions
  • manage execution order
  • react to human gates
  • consume prompt packages from the prompt steward role
  • operate within governance and module constraints

The BrainMaster uses Ruflo (claude-flow) as the agent execution runtime. The orchestrator decides WHAT to execute. Ruflo decides HOW to spawn and manage agents.

This is the control spine of XIOPro.


4.4 Domain Brain Layer

This layer contains specialized long-lived or semi-long-lived brains.

Canonical examples:

  • Compliance (e.g., industry standards)
  • Engineering
  • Brand / Content
  • Finance / Business
  • DevOps / Research

Responsibilities:

  • domain reasoning
  • domain decomposition
  • review of worker outputs
  • knowledge contribution in domain
  • bounded execution through workers or direct action

This layer provides specialization without fragmentation.


4.5 Worker Layer

This layer contains short-lived, bounded, task-specific execution actors.

Responsibilities:

  • execute narrow work
  • run isolated subtasks
  • offload mechanical or lower-cost work
  • operate under parent supervision
  • remain replaceable and bounded

Workers should remain:

  • ephemeral
  • cheap when possible
  • explicitly constrained
  • easy to retire or replace

4.6 Work Graph / State Layer

This layer stores and relates operational objects such as:

  • topics
  • projects
  • sprints
  • tickets
  • tasks
  • activities
  • runtimes
  • sessions
  • escalations
  • human decisions
  • costs
  • alerts
  • evaluations
  • reflections
  • improvements

This is the operational memory and structure of XIOPro.

It is what turns AI behavior into a governed system.


4.7 Knowledge & Research Layer

This layer contains:

  • Librarian
  • rules
  • skills
  • activations
  • patterns
  • protocols
  • indexed documents
  • historical decisions
  • Research Center
  • NotebookLM-related workflows
  • Obsidian-facing structures
  • Hindsight and Dream-derived proposals

Responsibilities:

  • preserve intelligence
  • classify and retrieve documents
  • support research workflows
  • reduce repeated thinking
  • generate reusable knowledge and proposals
  • enable compounding system knowledge

4.8 Governance & Optimization Layer

This layer includes:

  • governor runtime governance
  • rule steward role — rule/skill stewardship
  • prompt steward role — ContextPrompting governance and inquiry discipline
  • module steward role — module portfolio governance and optimization
  • policy objects
  • breakers
  • approval logic
  • audit/event trails

Responsibilities:

  • protect runtime
  • enforce policy
  • govern prompting and assumptions
  • govern rules and activations
  • govern modules and subscriptions
  • surface anomalies and drift
  • preserve explainability and approval discipline

This layer does not replace execution. It protects and improves execution.


4.9 Execution Targets

This is where actions land in the material world:

  • repositories
  • documentation
  • infrastructure
  • APIs
  • rendered outputs
  • tickets and external systems
  • websites
  • research outputs
  • future product runtime services

This is the world XIOPro changes.


5. T1P Implementation Technology Decisions

5.1 Decision Principle

T1P must optimize for:

  • buildability
  • recoverability
  • low moving-part count
  • strong Python integration with the current agent/runtime environment
  • explicit web/mobile support
  • clear separation between control-plane state and UI presentation

The stack below is therefore a deliberate simplification, not a maximal architecture.


5.2 Frontend Stack

T1P frontend stack:

  • TypeScript
  • React 19
  • Next.js App Router
  • shadcn/ui
  • TanStack Query
  • React-Grid-Layout

Rule

The UI is a web application with widget-first composition. It is not the source of truth.

Critical Control Rule

Critical control-plane mutations should flow through the backend API layer, not through opaque frontend-only mutation paths.


5.3 Backend Stack

T1P backend stack:

  • Python 3.12+
  • FastAPI
  • Pydantic v2
  • SQLAlchemy 2
  • Alembic

This stack is the canonical implementation path for:

  • control APIs
  • ODM-backed services
  • scheduler/worker coordination
  • governance services
  • Research Center APIs
  • module telemetry and optimization services

5.4 Python Tooling & Environment Management

Canonical Python tooling:

  • uv for Python version management, environment creation, dependency locking, and tool/script execution

Expected project standards where applicable:

  • pyproject.toml
  • uv.lock
  • .python-version

Rule

uv is the default Python tooling layer for T1P.

It improves:

  • bootstrap speed
  • dependency sync
  • local/server consistency
  • reproducible environments
  • CI and deployment ergonomics

It does not replace the backend framework. It standardizes the Python workflow around it.


5.5 Primary Data Store

Canonical primary data store:

  • PostgreSQL 17.x

Rule

PostgreSQL is the authoritative state store for T1P.

It holds:

  • work graph state
  • sessions
  • escalations
  • human decisions
  • governance records
  • normalized cost telemetry
  • research task metadata
  • scheduler/job state where practical

Conservative Versioning Rule

Even if newer PostgreSQL majors are available, T1P should pin one explicit major version and avoid drifting during early implementation.

For T1P, the pinned target is:

  • PostgreSQL 17.x

5.6 Realtime and UI Update Transport

Default transport decisions:

  • REST/JSON over HTTPS for standard request/response APIs
  • Server-Sent Events (SSE) for one-way live updates to the UI
  • WebSocket only where true bidirectional interactive streaming is required

Use SSE For

  • alerts
  • activity/event feeds
  • cost pulse
  • approval updates
  • trace/status updates
  • research task progress
  • widget refresh streams

Use WebSocket Only For

  • live bidirectional conversation streaming when needed
  • terminal-like interactive traces
  • future cases that truly require two-way socket behavior

Rule

SSE is the default live-update mechanism for T1P because the UI mostly needs server-to-client streaming, not a general-purpose socket layer for every widget.


5.7 Background Execution & Async Backbone

T1P background execution model:

  • authoritative job and execution state in PostgreSQL
  • dedicated Python worker processes
  • scheduler-driven and API-triggered task dispatch
  • explicit polling / claim / update flow for jobs and runtime state

Rule

T1P uses PostgreSQL-backed job dispatch as its async backbone. No separate message broker is required.

No NATS, Redis-stream, or Kafka-style backbone is required in T1P.

The purpose is to keep the system buildable while the canonical work graph and execution flow become real.

Future Expansion Rule

A dedicated event backbone (NATS, Redis-stream, or similar) may be introduced later only if:

  • Postgres-backed dispatch becomes the bottleneck
  • event volume or fan-out justifies it
  • operational value clearly exceeds additional complexity

This is an explicit architectural decision, not an oversight.


5.8 XIOPro Control Bus

The XIOPro Control Bus is the unified communication, coordination, and intervention backbone.

It merges the persistence and cross-host reach of the existing Bus MCP with the orchestration concepts of Ruflo into a single always-on service that every agent and surface can reach.

Principle

Every agent talks to one service for everything: messaging, tasks, state, intervention, spawning.

The Control Bus is not a message broker. It is a stateful coordination service backed by PostgreSQL.

Architecture

graph TB
    subgraph ControlBus["XIOPro Control Bus"]
        REST["REST API :8088"]
        SSE["SSE Push :8089"]
        Worker["Background Worker"]
        REST --- SSE
        REST --- Worker
    end

    PG[("PostgreSQL")] --- REST
    PG --- Worker

    Agents["Domain Brains & Workers"] --> REST
    REST --> Agents
    SSE --> UI["Control Center UI"]
    SSE --> Agents
    Orchestrator["Orchestrator"] --> REST
    Founder["Founder via RC/UI"] --> REST

Capabilities

  • Messaging — POST /messages, GET /messages/poll — persistent async messaging between agents. Existing capability.
  • Push Delivery — SSE /events/{agent_id} — real-time push to agents and UI via Server-Sent Events. Eliminates polling delay.
  • Agent Registration — POST /agents/register, GET /agents — full agent registry with capabilities, host binding, resource requirements. Extends existing heartbeat.
  • Agent Heartbeat — POST /agents/{id}/heartbeat — liveness signal with current task, status, resource usage. Existing capability.
  • Task Orchestration — POST /tasks, PATCH /tasks/{id}, GET /tasks — create, assign, update, query tasks. Backed by ODM schema.
  • Intervention — POST /agents/{id}/pause, /resume, /terminate, /redirect — Founder or the governor can pause, resume, terminate, or redirect any agent. State persisted; agent checks intervention state each cycle.
  • Host Capacity — GET /hosts/{id}/capacity, POST /hosts/register — query available agent slots per host. Pre-spawn gate.
  • Agent Spawning — POST /agents/spawn — request agent spawn on a target host. Bus checks capacity, then triggers a Claude Code subprocess.
  • Cost Tracking — POST /costs, GET /costs/summary — record and query cost ledger entries per agent, task, ticket.
  • Governance Events — POST /alerts, POST /breakers/{id}/trigger — the governor emits alerts and breaker events through the Bus.
  • Agent Auto-Pickup — POST /agents/{id}/pickup, GET /agents/{id}/tasks — agent signals readiness and retrieves its next assigned task without orchestrator polling.
  • Agent Health — GET /agents/health — returns health state of all registered agents: status, last heartbeat, current task.
  • Agent Metrics — GET /agents/{id}/metrics, GET /agents/metrics/summary — per-agent metrics (tokens, tasks completed, cost) and system-wide summary.
  • CLI Services — GET /services, POST /services/{name}/run — 12 config-driven operational CLI services accessible via Bus API or devxio CLI.
  • Template Registry — GET /templates, POST /templates, GET /templates/{id} — 4 agent activation templates; create, list, and retrieve templates by ID.
  • Dashboard — GET /dashboard — single-call endpoint returning unified agent status, task summary, recent alerts, and cost pulse.
  • Message Search — GET /messages/search — full-text search across message history with filters for agent, topic, and date range.
  • Project Lifecycle — GET /projects, PATCH /projects/{id} — project list and lifecycle_phase updates (discovery / active / paused / complete).

Intervention Model

Intervention is a first-class capability, not a side effect.

intervention:
  id: string
  target_agent_id: string
  action: enum
    # pause | resume | terminate | redirect | constrain
  reason: string
  issued_by: string           # founder | 000_governor | 000_orchestrator
  issued_at: datetime
  acknowledged_at: datetime|null
  state: enum
    # pending | acknowledged | applied | rejected | expired
  expires_at: datetime|null

Agents must check for pending interventions:

  • on each activity cycle start
  • when polling for messages
  • via SSE push if connected

Push Delivery Model

SSE channels per agent and per surface:

GET /events/{agent_id}  → agent receives tasks, messages, interventions in real-time
GET /events/ui          → Control Center receives all state changes for live dashboard
GET /events/founder     → Founder receives alerts, approvals, escalations

Push eliminates the poll-only limitation. Agents no longer need to actively check — the Bus pushes to them.

Relationship to Ruflo

Ruflo remains the in-session execution runtime:

  • Spawns sub-agents within a Claude Code session
  • Manages agent lifecycle within session scope
  • Provides memory and coordination tools within session

The Control Bus is the cross-session coordination layer:

  • Persists all state in PostgreSQL
  • Reaches agents across hosts (Hetzner, Mac, future nodes)
  • Survives session restarts
  • Provides intervention and governance

Ruflo reports state UP to the Bus. The Bus does not depend on Ruflo.

Founder/UI → Control Bus → Orchestrator → Ruflo → Agents
                 ↑                          |
                 +---- state reports --------+

Current State and Migration

The existing Bus MCP (bus.struxio.ai) already provides:

  • REST API (port 8088)
  • SSE streaming (port 8089)
  • PostgreSQL persistence
  • OAuth 2.1 authentication
  • message send/poll/ack
  • presence heartbeats
  • Paperclip proxy

Migration path:

  1. Add SSE push channels (per-agent streams) — extends existing SSE
  2. Add intervention endpoints — new CRUD routes
  3. Add task orchestration endpoints — builds on ODM schema
  4. Add agent registration — extends existing heartbeat
  5. Add host capacity endpoints — reads Host Registry
  6. Add spawn endpoint — triggers processes on target hosts

Estimated effort: ~2 weeks. No rewrite — iterative extension of existing service.

See resources/DESIGN_rc_architecture.md for the Remote Control architecture design (Open WebUI evaluation, multi-provider chat routing, Prompt Composer integration with the Control Bus).

Rules

  • The Control Bus is always on. It must survive agent crashes, session restarts, and host reboots.
  • All cross-session state flows through the Bus. No agent-to-agent communication bypasses it.
  • Intervention commands take priority over normal message delivery.
  • The Bus does not execute work. It coordinates. Agents execute.
  • SSE push is the default delivery method. Polling remains as fallback for agents that cannot hold SSE connections.

SSE Reconnection Behavior

SSE connections between agents and the Bus are long-lived but not permanent. Agents must handle disconnections gracefully.

On SSE disconnect:

  • Agent immediately falls back to HTTP polling (bus_poll) for message delivery
  • Agent continues executing its current task without interruption

On reconnect:

  • Agent re-registers its SSE channel with the Bus
  • Agent resumes the event stream from its last known cursor position (Last-Event-ID header)
  • Any events received via polling during the disconnect are deduplicated by the agent using event IDs

Keepalive:

  • The Bus sends an SSE heartbeat comment (:keepalive) every 30 seconds on each active channel
  • If an agent receives no data (including keepalives) for 60 seconds, it considers the connection dropped and initiates reconnection

Reconnection timing:

  • First reconnect attempt: immediate
  • Subsequent attempts: exponential backoff (1s, 2s, 4s, max 30s)
  • After 5 failed reconnection attempts: agent stays on HTTP polling and logs a warning

Bus Crash Recovery & Restart Sequence

The Control Bus is a stateful coordination service. Its restart behavior must be explicit and predictable.

In-Flight Message Handling

All messages are persisted to PostgreSQL before acknowledgement is returned to the sender. On Bus crash, no acknowledged messages are lost. Messages in transit (sent but not yet persisted) will fail at the sender and must be retried by the sending agent.
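A sketch of the sender-side retry, with an injected `send` callable standing in for the HTTP POST (names are illustrative). Combined with a stable idempotency_key, retrying a request the Bus may already have persisted is safe:

```python
def send_with_retry(send, payload: dict, max_tries: int = 3):
    """Retry until the Bus confirms persistence (201 Created) or tries run out."""
    last_error = None
    for _ in range(max_tries):
        try:
            # send() returns only after the Bus has persisted the message;
            # a crash mid-request surfaces here as a connection error
            return send(payload)
        except ConnectionError as exc:
            last_error = exc
    raise last_error
```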

Restart Sequence
  1. PostgreSQL connection pool re-established
  2. Bus validates schema and migration state
  3. REST API endpoints become available (health check returns 200)
  4. SSE push channels are re-opened (no automatic client reconnection — clients must reconnect)
  5. Background worker resumes processing pending jobs from the jobs table
  6. Bus emits bus.restarted event on all SSE channels

Pending Intervention Recovery

On restart, the Bus scans the interventions table for interventions in pending or acknowledged state. These are re-queued for delivery. Interventions with expires_at in the past are moved to expired state. No intervention is silently dropped.
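The recovery scan can be sketched as a pure function over intervention records (field names follow the intervention schema; the real version runs as SQL against the interventions table):

```python
from datetime import datetime

def recover_interventions(rows: list[dict], now: datetime) -> tuple[list[dict], list[dict]]:
    """Split pending/acknowledged interventions into re-queue vs expire sets."""
    requeue, expired = [], []
    for iv in rows:
        if iv["state"] not in ("pending", "acknowledged"):
            continue                              # applied/rejected/expired: leave alone
        if iv["expires_at"] is not None and iv["expires_at"] < now:
            expired.append(iv)                    # moved to expired state
        else:
            requeue.append(iv)                    # re-queued for delivery
    return requeue, expired
```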

SSE Subscription Reconnection

SSE connections are stateless server-push streams. On Bus restart or network interruption:

  • Clients must detect connection loss (EventSource onerror or heartbeat timeout)
  • Clients reconnect using the same GET /events/{agent_id} endpoint
  • Clients resume from their last acknowledged cursor position (Last-Event-ID header or ?cursor= parameter)
  • The Bus replays any unacknowledged events from the persistent event log
Message Delivery Guarantees

The Bus provides at-least-once delivery:

  • Every message is persisted before the sender receives 201 Created
  • Consumers poll or receive via SSE and must acknowledge (bus_ack) after processing
  • Unacknowledged messages are re-delivered on the next poll or SSE reconnection
  • Consumers must be idempotent — processing the same message twice must produce the same result
  • The idempotency_key field on messages enables consumer-side deduplication

The Bus does NOT provide exactly-once delivery. Idempotent consumers are required.
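Consumer-side deduplication via idempotency_key can be sketched as a wrapper (illustrative; in production the seen-set would live in PostgreSQL, not process memory):

```python
def make_consumer(handler):
    """Wrap a message handler so redelivered messages are processed once."""
    seen: set[str] = set()
    def consume(message: dict) -> bool:
        key = message["idempotency_key"]
        if key in seen:
            return False                # duplicate delivery: skip, but still ack
        handler(message)
        seen.add(key)                   # record only after successful processing
        return True
    return consume
```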

Input Validation & Rate Limiting

All Bus API endpoints enforce input validation and rate limiting to prevent abuse and injection.

Input Validation
  • All request bodies are validated against JSON Schema before processing
  • Schema definitions are co-located with endpoint handlers and versioned with the API
  • Requests failing validation receive 400 Bad Request with a structured error body listing violations
  • Request size limit: 20 KB body maximum. Requests exceeding this receive 413 Payload Too Large
  • Attachment limit: 8 attachments per message
  • All text fields are HTML-escaped before storage to prevent stored XSS
  • SQL injection is prevented by parameterized queries (SQLAlchemy / node-postgres parameterized statements) — no string concatenation in queries
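The size and escaping rules can be sketched with the standard library. Limits come from the bullets above; the function and error shape are illustrative:

```python
import html

MAX_BODY_BYTES = 20 * 1024      # above this: 413 Payload Too Large
MAX_ATTACHMENTS = 8             # above this: 400 Bad Request

def validate_message(raw_body: bytes, attachments: list) -> tuple[int, dict]:
    """Return an (http_status, payload) pair for one inbound message."""
    if len(raw_body) > MAX_BODY_BYTES:
        return 413, {"error": "payload too large"}
    if len(attachments) > MAX_ATTACHMENTS:
        return 400, {"error": "too many attachments"}
    text = raw_body.decode("utf-8")
    return 201, {"body": html.escape(text)}   # escape before storage (stored XSS)
```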
Rate Limiting
  • Per-agent rate limit: 100 requests/second
  • Global rate limit: 1000 requests/second across all agents
  • Rate limit responses: 429 Too Many Requests with Retry-After header
  • SSE connections: 1 connection per agent per channel (enforced server-side)
  • Rate limit state is held in-memory (per-process) with optional Redis backing for multi-process deployments
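The per-agent limit maps naturally onto a token bucket held in process memory. A sketch (rate and burst figures from the rule above; the class itself is illustrative):

```python
class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""
    def __init__(self, rate: float = 100.0, capacity: float = 100.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now: float) -> bool:
        """Refill by elapsed time, then spend one token if available (else 429)."""
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

One bucket per agent id gives the 100 req/s limit; a shared bucket with rate=1000 gives the global limit.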
Enforcement Rule

Input validation and rate limiting are not optional middleware. They are required on every public and agent-facing endpoint from T1P onwards.

Data Access Rule: Bus API vs Direct Database

All agents access PostgreSQL through the Control Bus API by default. Direct database access is reserved for bulk/heavy operations that run local to the database host.

data_access_policy:
  default: bus_api
  # All agents, all hosts — use REST endpoints
  # Auth: OAuth via Bus
  # Latency: ~50-100ms per request
  # Suitable for: task CRUD, state reads, cost logging, queries < 100 rows

  bulk_local: direct_postgresql
  # Only agents running on the SAME HOST as PostgreSQL
  # Use case: imports > 100 rows, report generation, cleanup jobs,
  #           data migration, analytics, batch cost aggregation
  # Auth: local connection (Unix socket or localhost)
  # Latency: ~1-5ms per query

Orchestrator Spawn Rule for Bulk Operations

When the orchestrator receives a task that requires bulk database operations:

  1. Classify the task — is it normal CRUD (<100 rows) or bulk (>100 rows)?
  2. If bulk: spawn the agent on the database host (Hetzner), never on a remote host
  3. Agent connects locally (localhost:5432 or Unix socket), no network overhead
  4. Results flow back through the Bus API as normal

spawn_decision:
  task_type: bulk_import | bulk_report | data_cleanup | migration
  required_host: database_host        # must run on same machine as PostgreSQL
  access_method: direct_postgresql    # bypass Bus for data operations
  result_delivery: bus_api            # results still reported through Bus
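The classify-and-place decision reduces to a routing function. A sketch (the threshold and host names come from this section; the function itself is illustrative):

```python
BULK_ROW_THRESHOLD = 100

def plan_data_task(task_type: str, expected_rows: int) -> dict:
    """Decide access method and spawn host for a data-touching task."""
    bulk = expected_rows > BULK_ROW_THRESHOLD or task_type in (
        "bulk_import", "bulk_report", "data_cleanup", "migration",
    )
    if bulk:
        return {
            "required_host": "database_host",      # same machine as PostgreSQL
            "access_method": "direct_postgresql",  # local socket, no network hop
            "result_delivery": "bus_api",          # results still go through the Bus
        }
    return {
        "required_host": "any",
        "access_method": "bus_api",
        "result_delivery": "bus_api",
    }
```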
Examples

  • Agent reads its next task — Bus API — any host — normal operation, ~60ms is fine
  • Agent updates task status — Bus API — any host — normal operation
  • Morning brief queries 50 activities — Bus API — any host — small result set
  • Import 5000 research records — direct PostgreSQL — database host only — bulk: 500ms local vs 50s via API
  • Sprint cost report across all agents — direct PostgreSQL — database host only — aggregation query over thousands of rows
  • Nightly cleanup of orphaned sessions — direct PostgreSQL — database host only — scan + delete pattern
  • Knowledge Ledger batch write — direct PostgreSQL — database host only — high-volume append

Rule

Remote agents (Mac, future cloud nodes) NEVER get direct PostgreSQL access. If a remote agent needs bulk operations, the orchestrator spawns a local agent on the database host to do the work, then the local agent reports results back through the Bus.


5.9 Reverse Proxy and Edge

Canonical reverse proxy:

  • Caddy

Caddy should handle:

  • HTTPS termination
  • ingress routing
  • host/domain routing
  • simple edge policy
  • low-friction certificate management

5.10 Observability Stack

Canonical T1P observability stack:

  • OpenTelemetry for instrumentation
  • Prometheus for metrics and alert-compatible scraping
  • Grafana for dashboards, alert visualization, and operator inspection

Rule

Observability is not an optional enhancement. Every core service must emit enough signals to support:

  • recovery
  • alerting
  • cost/usage inspection
  • user-facing diagnosis

5.11 Testing Toolchain

Canonical T1P testing tools:

  • pytest for backend unit, integration, and workflow tests
  • Playwright for UI, browser, and mobile-surface end-to-end tests

Optional, not required on day one:

  • Vitest for frontend component/unit tests if UI complexity justifies it

Rule

T1P does not need a sprawling test-tool matrix.

It needs one strong backend runner and one strong browser/E2E runner first.


5.12 CLI Toolchain

Canonical CLI tools for T1P agent and operator workflows:

  • gh — GitHub CLI: repo, PR, issue, release management
  • jq — JSON processor: structured data extraction and transformation
  • yq — YAML processor: config and state file manipulation
  • uv — Python environment: version management, dependency locking, script execution
  • rg (ripgrep) — fast recursive search across codebases and knowledge files
  • fd — fast file finder: replacement for find with sane defaults
  • Ruflo (claude-flow) — agent execution runtime: spawning, coordination, memory
  • sops — secrets encryption: encrypt/decrypt secrets in config files
  • age — encryption backend for SOPS: key management
  • restic — backup: automated snapshots to Backblaze B2

Rule

CLI tools are the primary execution interface for agents. MCP wrappers may exist for discovery or integration, but CLI is the default for production pipelines.

See resources/CLI_TOOLS_ASSESSMENT.md for the full assessment of available CLI tools and their roles.

See resources/DESIGN_cli_services.md for the config-driven CLI services framework design (operational commands executable via Bus API or devxio CLI).


5.13 Implementation Form of Key Roles

For T1P, roles should be implemented pragmatically.

Orchestrator

Implementation form:

  • backend service / orchestration module
  • not a separate mystical agent surface

Governor

Implementation form:

  • backend governance service / policy engine
  • not only a prompt persona

Rule steward / prompt steward / module steward roles

T1P implementation form:

  • explicit services/modules with durable inputs and outputs
  • but allowed to begin as thin application services rather than fully independent distributed systems

Rule

The blueprint keeps the roles. T1P is allowed to implement them with fewer deployables than named roles.


5.14 Security & Session Handling Decision

T1P security stance:

  • operator-first
  • single-tenant or very low-user-count
  • strong infrastructure boundary
  • simple application auth over a strong network boundary

Recommended pattern:

  • Tailscale/private network for admin paths
  • application login/session for the web UI
  • no dependence on the UI for core execution safety
  • no frontend-only secret handling

Secrets at Rest

Secrets are encrypted at rest using SOPS + age. Key stored at ~/age-key.txt.

All configuration files containing secrets must use SOPS encryption. Plaintext secrets must never be committed to any repository.


5.15 Backup & Recovery

Automated backup via Restic to Backblaze B2, daily at 03:00 UTC.

Backup scope includes:

  • PostgreSQL database dumps
  • Configuration files
  • Knowledge repository content
  • State files

Retention policy is managed by Restic pruning rules.

See reference_backblaze.md for current B2 configuration details.


5.16 Final Technology Rule

The purpose of these decisions is to make the blueprint executable.

Any future stack change must be justified by:

  • clear operational gain
  • reduced risk
  • or proven scale pressure

Not by novelty.


6. Runtime Topology

6.1 Node A — Cloud Control Node

Primary always-on environment for:

  • orchestrator
  • governor
  • API/control services
  • PostgreSQL
  • scheduler/workers
  • LiteLLM/router where needed
  • runtime execution fabric
  • knowledge services
  • telemetry and backup jobs

6.2 Node B — Local Operator Node

Primary environment for:

  • user interaction
  • local CLI execution
  • fallback sessions
  • local knowledge access
  • manual validation
  • controlled local experiments

6.3 Node C — Future GPU / Model Node

Reserved for:

  • self-hosted model serving
  • embedding/indexing jobs
  • compute-intensive background work
  • isolated experimental inference

6.4 Node D — Future Product Runtime Node

Reserved for:

  • customer-facing STRUXIO runtime
  • product APIs
  • workloads isolated from XIOPro control-plane services

7. Current State and Evolution

7.1 Current State

As of 2026-03-28, the system operates with:

  • Node A: Hetzner CPX62 (16 vCPU AMD EPYC-Genoa, 30 GB RAM, 150 GB SSD) running 10 Docker containers (post-retirement of devxio-frontend, devxio-bridge, devxio-librarian, Neo4j)
  • Node B: Mac Studio (Mac Worker, agent 010) connected via Tailscale VPN
  • Orchestration: BrainMaster (agent 000) operating as proto-orchestrator
  • Messaging: Bus-based inter-agent messaging (PostgreSQL-backed)
  • Ticket tracking: Paperclip (to be superseded by ODM work graph)
  • Dashboard: dashboard.struxio.ai (React) — current operator UI
  • Knowledge: Hindsight running (localhost:8888/9999, Vectorize.io Docker)
  • Backup: Restic to Backblaze B2, daily 03:00 UTC
  • Secrets: SOPS + age encryption

7.2 Service Migration

Current services will transition to the XIOPro target architecture through a managed migration.

See resources/SERVICE_FATE_MAP_v4_2.md for the explicit mapping of:

  • services to keep as-is
  • services to evolve
  • services to retire
  • services to replace

No big-bang cutover. Old services run in parallel alongside new services until the new services are proven and functional parity is reached.

7.3 Target Direction

Move toward:

  • cleaner orchestrator identity
  • clearer governor identity
  • explicit stewardship roles (rule steward, prompt steward, module steward)
  • a DB-backed work graph
  • a Research Center built on the Librarian
  • a widget-based control center UI
  • a rationalized module portfolio and infrastructure model

8. XIOPro vs Product Runtime

XIOPro must remain conceptually separate from STRUXIO product runtime.

XIOPro

  • internal AI operating system
  • execution and governance substrate
  • research and knowledge system
  • user control center
  • internal optimization machine

STRUXIO Product Runtime

  • customer-facing APIs and services
  • product workloads
  • external runtime isolation
  • product-specific scaling and SLAs

XIOPro may build and operate product runtime, but it must not collapse into it.


9. Architectural Success Criteria

The architecture is successful when:

  • execution continues without UI
  • recovery is practical
  • layers stay separable
  • governance remains explicit
  • knowledge compounds instead of fragments
  • module optimization is real, not ad hoc
  • human collaboration does not break state integrity
  • future scale can happen without rethinking the whole machine

10. Final Statement

XIOPro is not one service.

It is a layered machine for governed execution, collaboration, research, and optimization.

If the layers stay clean, XIOPro remains buildable, recoverable, and evolvable. If they blur, the system regresses into expensive chat-shaped chaos.


Changelog

v5.0.0 (2026-03-28)

Changes from v4.1.0:

  • C2.1: Added Section 5.11 CLI Toolchain with canonical CLI tools table and rule
  • C2.2: Added SOPS + age secrets management to Section 5.13 (Security)
  • C2.3: Added Section 5.14 Backup & Recovery (Restic to Backblaze B2)
  • C2.4: Added service migration reference to Section 7.2, pointing to SERVICE_FATE_MAP_v4_2.md
  • C2.5: Made async backbone decision explicit in Section 5.7 — PostgreSQL-backed dispatch is a deliberate choice, not an omission
  • CX.1: Global naming fix — "Rufio" replaced with "Ruflo" throughout
  • CX.2: Version header updated to 4.2.0, last_updated to 2026-03-28
  • CX.3: Added this changelog section
  • CX.4: Added Current State subsection (Section 7.1) documenting existing infrastructure
  • Clarified 000/Ruflo relationship in Section 4.3
  • Renumbered sections 5.11-5.13 to 5.12-5.15 to accommodate CLI Toolchain insertion
  • Restructured Section 7 from "Current -> Target Evolution" to "Current State and Evolution" with explicit subsections

v5.0.2 (2026-03-28)

Agent naming migration to 3-digit unified IDs:

  • Replaced O00/O01/R01/P01/M01 with 000 role-based naming throughout
  • Replaced B1-B5 with 001-005 (domain brains)
  • Replaced M0 with 010 (Mac Worker)
  • Replaced BM with 000 (BrainMaster)
  • Updated Mermaid diagrams to use 3-digit IDs
  • Preserved Backblaze B2 references unchanged

v5.0.3 (2026-03-28)

Roles over numbers: Removed agent IDs from architectural descriptions, layer headers, diagrams, and role implementation sections. Current State section (7.1) uses "agent NNN" format for operational references. Blueprint describes WHAT roles do, not WHICH agent holds them.

v5.0.12 (2026-03-29)

Cross-references: Added pointer to resources/DESIGN_rc_architecture.md (Remote Control architecture design — Open WebUI evaluation, multi-provider chat routing, Prompt Composer integration) in Section 5.8 context. Added pointer to resources/DESIGN_cli_services.md (CLI services framework — config-driven operational commands via Bus API) in Section 5.12 context.

v5.0.13 (2026-03-29)

Batch BP update from recent tickets: Added 9 new Control Bus endpoint capabilities to Section 5.8 table — agent auto-pickup (/agents/{id}/pickup, /agents/{id}/tasks), agent health (/agents/health), agent metrics (/agents/{id}/metrics, /agents/metrics/summary), CLI services (/services), template registry (/templates), dashboard (/dashboard), message search (/messages/search), and project lifecycle (/projects).

v5.0.14 (2026-03-30)

Review fixes: C3: Added Bus Crash Recovery & Restart Sequence subsection to Section 5.8 — in-flight message handling, restart sequence (5 steps), pending intervention recovery, SSE reconnection behavior, at-least-once delivery guarantees with idempotent consumers. C8: Added Input Validation & Rate Limiting subsection to Section 5.8 — JSON Schema validation on all endpoints, 100 req/s per-agent and 1000 req/s global rate limits, 20KB body limit, 8 attachment limit, HTML escaping, parameterized SQL enforcement.

v5.0.15 (2026-03-30)

Round 2 review fix (SSE reconnection):

  • Section 5.8: Added SSE Reconnection Behavior subsection — documents fallback to HTTP polling on disconnect, reconnect from last cursor via Last-Event-ID, 30-second keepalive heartbeat, 60-second connection drop detection, exponential backoff reconnection timing.