XIOPro Production Blueprint v5.0¶
Part 2 — Architecture¶
1. Purpose of This Part¶
This document defines the structural architecture of XIOPro:
- major layers
- major roles and components
- runtime topology
- boundaries between concerns
- environment roles
- separation between XIOPro and future STRUXIO product runtime
- the T1P implementation stack that makes the blueprint actually buildable
Part 1 defines why XIOPro exists. Part 2 defines what the machine is and what technology it is built with.
2. Architectural Thesis¶
XIOPro is a multi-layer agentic operating system.
It is not one app, one server, one chat, or one model router.
It is composed of:
- a human interaction surface
- a control/UI layer
- an orchestration layer
- a governed execution fabric
- a knowledge and research substrate
- a governance and optimization layer
- a durable work graph/state layer
- an infrastructure platform
The architecture must support:
- continuous headless operation
- recoverable execution
- user collaboration
- provider independence
- governed evolution
- future scale without redesign of core logic
3. High-Level Layer Stack¶
flowchart TD
Human[User / Human Operator] --> UI[Web Control Center / Mobile Surface]
UI --> Interaction[Interaction & ContextPrompting Layer]
Interaction --> Orchestration[Orchestration Layer]
Orchestration --> Domain[Domain Brain Layer]
Domain --> Workers[Worker Layer]
Orchestration --> Governance[Governance & Optimization Layer]
Governance --> WorkGraph[Work Graph / ODM / State]
Domain --> Knowledge[Knowledge & Research Layer]
Workers --> Execution[Execution Targets / External Systems]
Knowledge --> WorkGraph
WorkGraph --> UI
Knowledge --> UI
4. Architectural Layers¶
4.1 Human Interaction Layer¶
This is where the user interacts with XIOPro.
Inputs include:
- exploratory conversation
- execution-bound discussion
- approvals
- rejections
- clarifications
- file and image attachments
- voice input
- research requests
- recovery decisions
- module and routing choices where allowed
Outputs include:
- tickets
- decisions
- constraints
- clarified intent
- approvals
- durable human decision records
This layer must remain:
- high-bandwidth
- low-friction
- mobile-capable
- durable when it affects execution
4.2 UI Control Layer¶
The visual control surface of XIOPro.
Responsibilities:
- display system state
- host widget-based operator workspaces
- support brain interaction
- expose approvals, alerts, and traceability
- show cost, module, and governance posture
- expose research and knowledge surfaces
- support intervention and recovery
The UI is web-based and widget-first. It must never become the only runtime path.
4.3 Orchestration Layer¶
The central coordinating intelligence that turns structured work into assigned execution.
Responsibilities:
- create or refine work objects
- read work graph state
- assign tickets and tasks
- coordinate brains and workers
- preserve continuity across sessions
- manage execution order
- react to human gates
- consume prompt packages from the prompt steward role
- operate within governance and module constraints
The BrainMaster uses Ruflo (claude-flow) as the agent execution runtime. The orchestrator decides WHAT to execute. Ruflo decides HOW to spawn and manage agents.
This is the control spine of XIOPro.
4.4 Domain Brain Layer¶
This layer contains specialized long-lived or semi-long-lived brains.
Canonical examples:
- Compliance (e.g., industry standards)
- Engineering
- Brand / Content
- Finance / Business
- DevOps / Research
Responsibilities:
- domain reasoning
- domain decomposition
- review of worker outputs
- knowledge contribution in domain
- bounded execution through workers or direct action
This layer provides specialization without fragmentation.
4.5 Worker Layer¶
This layer contains short-lived, bounded, task-specific execution actors.
Responsibilities:
- execute narrow work
- run isolated subtasks
- offload mechanical or lower-cost work
- operate under parent supervision
- remain replaceable and bounded
Workers should remain:
- ephemeral
- cheap when possible
- explicitly constrained
- easy to retire or replace
4.6 Work Graph / State Layer¶
This layer stores and relates operational objects such as:
- topics
- projects
- sprints
- tickets
- tasks
- activities
- runtimes
- sessions
- escalations
- human decisions
- costs
- alerts
- evaluations
- reflections
- improvements
This is the operational memory and structure of XIOPro.
It is what turns AI behavior into a governed system.
4.7 Knowledge & Research Layer¶
This layer contains:
- Librarian
- rules
- skills
- activations
- patterns
- protocols
- indexed documents
- historical decisions
- Research Center
- NotebookLM-related workflows
- Obsidian-facing structures
- Hindsight and Dream-derived proposals
Responsibilities:
- preserve intelligence
- classify and retrieve documents
- support research workflows
- reduce repeated thinking
- generate reusable knowledge and proposals
- enable compounding system knowledge
4.8 Governance & Optimization Layer¶
This layer includes:
- governor runtime governance
- rule steward role — rule/skill stewardship
- prompt steward role — ContextPrompting governance and inquiry discipline
- module steward role — module portfolio governance and optimization
- policy objects
- breakers
- approval logic
- audit/event trails
Responsibilities:
- protect runtime
- enforce policy
- govern prompting and assumptions
- govern rules and activations
- govern modules and subscriptions
- surface anomalies and drift
- preserve explainability and approval discipline
This layer does not replace execution. It protects and improves execution.
4.9 Execution Targets¶
This is where actions land in the material world:
- repositories
- documentation
- infrastructure
- APIs
- rendered outputs
- tickets and external systems
- websites
- research outputs
- future product runtime services
This is the world XIOPro changes.
5. T1P Implementation Technology Decisions¶
5.1 Decision Principle¶
T1P must optimize for:
- buildability
- recoverability
- low moving-part count
- strong Python integration with the current agent/runtime environment
- explicit web/mobile support
- clear separation between control-plane state and UI presentation
The stack below is therefore a deliberate simplification, not a maximal architecture.
5.2 Frontend Stack¶
T1P frontend stack:
- TypeScript
- React 19
- Next.js App Router
- shadcn/ui
- TanStack Query
- React-Grid-Layout
Rule¶
The UI is a web application with widget-first composition. It is not the source of truth.
Critical Control Rule¶
Critical control-plane mutations should flow through the backend API layer, not through opaque frontend-only mutation paths.
5.3 Backend Stack¶
T1P backend stack:
- Python 3.12+
- FastAPI
- Pydantic v2
- SQLAlchemy 2
- Alembic
This stack is the canonical implementation path for:
- control APIs
- ODM-backed services
- scheduler/worker coordination
- governance services
- Research Center APIs
- module telemetry and optimization services
5.4 Python Tooling & Environment Management¶
Canonical Python tooling:
- uv for Python version management, environment creation, dependency locking, and tool/script execution
Expected project standards where applicable:
- pyproject.toml
- uv.lock
- .python-version
Rule¶
uv is the default Python tooling layer for T1P.
It improves:
- bootstrap speed
- dependency sync
- local/server consistency
- reproducible environments
- CI and deployment ergonomics
It does not replace the backend framework. It standardizes the Python workflow around it.
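Under these standards, a minimal `pyproject.toml` for a T1P backend service might look like the sketch below. The package list is illustrative; it mirrors the backend stack in 5.3 and pins nothing canonical:

```toml
[project]
name = "xiopro-control"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi",
    "pydantic>=2",
    "sqlalchemy>=2",
    "alembic",
]

[dependency-groups]
dev = ["pytest"]
```

With this in place, `uv sync` creates the environment and writes `uv.lock`, and `uv run pytest` executes inside it.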
5.5 Primary Data Store¶
Canonical primary data store:
- PostgreSQL 17.x
Rule¶
PostgreSQL is the authoritative state store for T1P.
It holds:
- work graph state
- sessions
- escalations
- human decisions
- governance records
- normalized cost telemetry
- research task metadata
- scheduler/job state where practical
Conservative Versioning Rule¶
Even if newer PostgreSQL majors are available, T1P should pin one explicit major version and avoid drifting during early implementation.
For T1P, the pinned target is:
- PostgreSQL 17.x
5.6 Realtime and UI Update Transport¶
Default transport decisions:
- REST/JSON over HTTPS for standard request/response APIs
- Server-Sent Events (SSE) for one-way live updates to the UI
- WebSocket only where true bidirectional interactive streaming is required
Use SSE For¶
- alerts
- activity/event feeds
- cost pulse
- approval updates
- trace/status updates
- research task progress
- widget refresh streams
Use WebSocket Only For¶
- live bidirectional conversation streaming when needed
- terminal-like interactive traces
- future cases that truly require two-way socket behavior
Rule¶
SSE is the default live-update mechanism for T1P because the UI mostly needs server-to-client streaming, not a general-purpose socket layer for every widget.
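The wire format behind this choice is small enough to sketch. The helper below frames one event per the SSE format; the id line is what later enables Last-Event-ID resumption after a reconnect. Event names and payload shape are illustrative:

```python
import json

def sse_frame(event_id: str, event: str, data: dict) -> str:
    """Serialize a single Server-Sent Event, terminated by a blank line."""
    return (
        f"id: {event_id}\n"      # cursor for Last-Event-ID resumption
        f"event: {event}\n"      # named event type, e.g. "alert"
        f"data: {json.dumps(data)}\n\n"
    )
```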
5.7 Background Execution & Async Backbone¶
T1P background execution model:
- authoritative job and execution state in PostgreSQL
- dedicated Python worker processes
- scheduler-driven and API-triggered task dispatch
- explicit polling / claim / update flow for jobs and runtime state
Rule¶
T1P uses PostgreSQL-backed job dispatch as its async backbone. No separate message broker is required: no NATS, Redis-stream, or Kafka-style backbone in T1P.
The purpose is to keep the system buildable while the canonical work graph and execution flow become real.
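The claim step of this flow can be sketched with PostgreSQL's FOR UPDATE SKIP LOCKED, which lets concurrent workers claim jobs safely without a broker. Table and column names below are illustrative, not the canonical ODM schema:

```python
# Claim one pending job atomically; concurrent workers skip locked rows.
CLAIM_SQL = """
UPDATE jobs
   SET status = 'running', claimed_by = %(worker)s, claimed_at = now()
 WHERE id = (SELECT id
               FROM jobs
              WHERE status = 'pending'
              ORDER BY priority DESC, created_at
                FOR UPDATE SKIP LOCKED
              LIMIT 1)
RETURNING id, payload;
"""

def claim_next_job(execute, worker_id: str):
    """Claim one pending job; returns (id, payload) or None if queue is empty.

    `execute` is any callable(sql, params) -> list of rows, e.g. a thin
    wrapper over a psycopg cursor, injected so the sketch stays testable.
    """
    rows = execute(CLAIM_SQL, {"worker": worker_id})
    return rows[0] if rows else None
```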
Future Expansion Rule¶
A dedicated event backbone (NATS, Redis-stream, or similar) may be introduced later only if:
- Postgres-backed dispatch becomes the bottleneck
- event volume or fan-out justifies it
- operational value clearly exceeds additional complexity
This is an explicit architectural decision, not an oversight.
5.8 XIOPro Control Bus¶
The XIOPro Control Bus is the unified communication, coordination, and intervention backbone.
It merges the persistence and cross-host reach of the existing Bus MCP with the orchestration concepts of Ruflo into a single always-on service that every agent and surface can reach.
Principle¶
Every agent talks to one service for everything: messaging, tasks, state, intervention, spawning.
The Control Bus is not a message broker. It is a stateful coordination service backed by PostgreSQL.
Architecture¶
graph TB
subgraph ControlBus["XIOPro Control Bus"]
REST["REST API :8088"]
SSE["SSE Push :8089"]
Worker["Background Worker"]
REST --- SSE
REST --- Worker
end
PG[("PostgreSQL")] --- REST
PG --- Worker
Agents["Domain Brains & Workers"] --> REST
REST --> Agents
SSE --> UI["Control Center UI"]
SSE --> Agents
Orchestrator["Orchestrator"] --> REST
Founder["Founder via RC/UI"] --> REST
Capabilities¶
| Capability | Endpoint Pattern | Description |
|---|---|---|
| Messaging | POST /messages, GET /messages/poll | Persistent async messaging between agents. Existing capability. |
| Push Delivery | SSE /events/{agent_id} | Real-time push to agents and UI via Server-Sent Events. Eliminates polling delay. |
| Agent Registration | POST /agents/register, GET /agents | Full agent registry with capabilities, host binding, resource requirements. Extends existing heartbeat. |
| Agent Heartbeat | POST /agents/{id}/heartbeat | Liveness signal with current task, status, resource usage. Existing capability. |
| Task Orchestration | POST /tasks, PATCH /tasks/{id}, GET /tasks | Create, assign, update, query tasks. Backed by ODM schema. |
| Intervention | POST /agents/{id}/pause, /resume, /terminate, /redirect | Founder or the governor can pause, resume, terminate, or redirect any agent. State persisted. Agent checks intervention state each cycle. |
| Host Capacity | GET /hosts/{id}/capacity, POST /hosts/register | Query available agent slots per host. Pre-spawn gate. |
| Agent Spawning | POST /agents/spawn | Request agent spawn on a target host. Bus checks capacity, then triggers Claude Code subprocess. |
| Cost Tracking | POST /costs, GET /costs/summary | Record and query cost ledger entries per agent, task, ticket. |
| Governance Events | POST /alerts, POST /breakers/{id}/trigger | The governor emits alerts and breaker events through the Bus. |
| Agent Auto-Pickup | POST /agents/{id}/pickup, GET /agents/{id}/tasks | Agent signals readiness and retrieves its next assigned task without orchestrator polling. |
| Agent Health | GET /agents/health | Returns health state of all registered agents: status, last heartbeat, current task. |
| Agent Metrics | GET /agents/{id}/metrics, GET /agents/metrics/summary | Per-agent metrics (tokens, tasks completed, cost) and system-wide summary. |
| CLI Services | GET /services, POST /services/{name}/run | 12 config-driven operational CLI services accessible via Bus API or devxio CLI. |
| Template Registry | GET /templates, POST /templates, GET /templates/{id} | 4 agent activation templates; create, list, and retrieve templates by ID. |
| Dashboard | GET /dashboard | Single-call endpoint returning unified agent status, task summary, recent alerts, and cost pulse. |
| Message Search | GET /messages/search | Full-text search across message history with filters for agent, topic, and date range. |
| Project Lifecycle | GET /projects, PATCH /projects/{id} | Project list and lifecycle_phase updates (discovery / active / paused / complete). |
Intervention Model¶
Intervention is a first-class capability, not a side effect.
intervention:
id: string
target_agent_id: string
action: enum
# pause | resume | terminate | redirect | constrain
reason: string
issued_by: string # founder | 000_governor | 000_orchestrator
issued_at: datetime
acknowledged_at: datetime|null
state: enum
# pending | acknowledged | applied | rejected | expired
expires_at: datetime|null
Agents must check for pending interventions:
- on each activity cycle start
- when polling for messages
- via SSE push if connected
Push Delivery Model¶
SSE channels per agent and per surface:
GET /events/{agent_id} → agent receives tasks, messages, interventions in real-time
GET /events/ui → Control Center receives all state changes for live dashboard
GET /events/founder → Founder receives alerts, approvals, escalations
Push eliminates the poll-only limitation. Agents no longer need to actively check — the Bus pushes to them.
Relationship to Ruflo¶
Ruflo remains the in-session execution runtime:
- Spawns sub-agents within a Claude Code session
- Manages agent lifecycle within session scope
- Provides memory and coordination tools within session
The Control Bus is the cross-session coordination layer:
- Persists all state in PostgreSQL
- Reaches agents across hosts (Hetzner, Mac, future nodes)
- Survives session restarts
- Provides intervention and governance
Ruflo reports state UP to the Bus. The Bus does not depend on Ruflo.
Current State and Migration¶
The existing Bus MCP (bus.struxio.ai) already provides:
- REST API (port 8088)
- SSE streaming (port 8089)
- PostgreSQL persistence
- OAuth 2.1 authentication
- Message send/poll/ack
- Presence heartbeats
- Paperclip proxy
Migration path:
1. Add SSE push channels (per-agent streams) — extends existing SSE
2. Add intervention endpoints — new CRUD routes
3. Add task orchestration endpoints — builds on ODM schema
4. Add agent registration — extends existing heartbeat
5. Add host capacity endpoints — reads Host Registry
6. Add spawn endpoint — triggers processes on target hosts
Estimated effort: ~2 weeks. No rewrite — iterative extension of existing service.
See resources/DESIGN_rc_architecture.md for the Remote Control architecture design (Open WebUI evaluation, multi-provider chat routing, Prompt Composer integration with the Control Bus).
Rules¶
- The Control Bus is always on. It must survive agent crashes, session restarts, and host reboots.
- All cross-session state flows through the Bus. No agent-to-agent communication bypasses it.
- Intervention commands take priority over normal message delivery.
- The Bus does not execute work. It coordinates. Agents execute.
- SSE push is the default delivery method. Polling remains as fallback for agents that cannot hold SSE connections.
SSE Reconnection Behavior¶
SSE connections between agents and the Bus are long-lived but not permanent. Agents must handle disconnections gracefully.
On SSE disconnect:
- Agent immediately falls back to HTTP polling (bus_poll) for message delivery
- Agent continues executing its current task without interruption
On reconnect:
- Agent re-registers its SSE channel with the Bus
- Agent resumes the event stream from its last known cursor position (Last-Event-ID header)
- Any events received via polling during the disconnect are deduplicated by the agent using event IDs
Keepalive:
- The Bus sends an SSE heartbeat comment (:keepalive) every 30 seconds on each active channel
- If an agent receives no data (including keepalives) for 60 seconds, it considers the connection dropped and initiates reconnection
Reconnection timing:
- First reconnect attempt: immediate
- Subsequent attempts: exponential backoff (1s, 2s, 4s, max 30s)
- After 5 failed reconnection attempts: agent stays on HTTP polling and logs a warning
Bus Crash Recovery & Restart Sequence¶
The Control Bus is a stateful coordination service. Its restart behavior must be explicit and predictable.
In-Flight Message Handling¶
All messages are persisted to PostgreSQL before acknowledgement is returned to the sender. On Bus crash, no acknowledged messages are lost. Messages in transit (sent but not yet persisted) will fail at the sender and must be retried by the sending agent.
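Sender-side behavior under this rule can be sketched as follows. `post` is an injected callable (path, body) -> status code standing in for the real HTTP client, and the attempt count is an assumption:

```python
def send_with_retry(post, message: dict, attempts: int = 3) -> bool:
    """Retry until the Bus confirms persistence; only 201 counts as sent."""
    for _ in range(attempts):
        try:
            status = post("/messages", message)
        except ConnectionError:
            continue          # Bus unreachable mid-flight; retry
        if status == 201:
            return True       # persisted before ack, safe to stop
    return False              # caller must surface the failure; message NOT sent
```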
Restart Sequence¶
- PostgreSQL connection pool re-established
- Bus validates schema and migration state
- REST API endpoints become available (health check returns 200)
- SSE push channels are re-opened (no automatic client reconnection — clients must reconnect)
- Background worker resumes processing pending jobs from the `jobs` table
- Bus emits `bus.restarted` event on all SSE channels
Pending Intervention Recovery¶
On restart, the Bus scans the interventions table for interventions in pending or acknowledged state. These are re-queued for delivery. Interventions with expires_at in the past are moved to expired state. No intervention is silently dropped.
SSE Subscription Reconnection¶
SSE connections are stateless server-push streams. On Bus restart or network interruption:
- Clients must detect connection loss (EventSource `onerror` or heartbeat timeout)
- Clients reconnect using the same `GET /events/{agent_id}` endpoint
- Clients resume from their last acknowledged cursor position (`Last-Event-ID` header or `?cursor=` parameter)
- The Bus replays any unacknowledged events from the persistent event log
Message Delivery Guarantees¶
The Bus provides at-least-once delivery:
- Every message is persisted before the sender receives `201 Created`
- Consumers poll or receive via SSE and must acknowledge (`bus_ack`) after processing
- Unacknowledged messages are re-delivered on the next poll or SSE reconnection
- Consumers must be idempotent — processing the same message twice must produce the same result
- The `idempotency_key` field on messages enables consumer-side deduplication
The Bus does NOT provide exactly-once delivery. Idempotent consumers are required.
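A minimal idempotent consumer built on the idempotency_key field might look like this. The in-memory seen-set is illustrative; a durable store (e.g. a processed-keys table) would be needed for deduplication to survive consumer restarts:

```python
class IdempotentConsumer:
    def __init__(self, handler):
        self.handler = handler       # business logic for one message
        self.seen: set[str] = set()  # keys already processed

    def process(self, message: dict) -> bool:
        """Handle a message once; duplicate deliveries are skipped."""
        key = message["idempotency_key"]
        if key in self.seen:
            return False             # at-least-once duplicate; safe to re-acknowledge
        self.handler(message)
        self.seen.add(key)           # record only after successful handling
        return True
```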
Input Validation & Rate Limiting¶
All Bus API endpoints enforce input validation and rate limiting to prevent abuse and injection.
Input Validation¶
- All request bodies are validated against JSON Schema before processing
- Schema definitions are co-located with endpoint handlers and versioned with the API
- Requests failing validation receive `400 Bad Request` with a structured error body listing violations
- Request size limit: 20 KB body maximum. Requests exceeding this receive `413 Payload Too Large`
- Attachment limit: 8 attachments per message
- All text fields are HTML-escaped before storage to prevent stored XSS
- SQL injection is prevented by parameterized queries (SQLAlchemy / node-postgres parameterized statements) — no string concatenation in queries
Rate Limiting¶
- Per-agent rate limit: 100 requests/second
- Global rate limit: 1000 requests/second across all agents
- Rate limit responses: `429 Too Many Requests` with `Retry-After` header
- SSE connections: 1 connection per agent per channel (enforced server-side)
- Rate limit state is held in-memory (per-process) with optional Redis backing for multi-process deployments
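The per-agent limit can be enforced with a simple in-memory token bucket, consistent with the per-process stance above. This is a sketch, not the Bus middleware itself; a burst size equal to the rate is an assumption:

```python
import time

class TokenBucket:
    def __init__(self, rate: float = 100.0, burst: float = 100.0):
        self.rate, self.burst = rate, burst
        self.tokens = burst            # start full
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Spend one token if available; False means respond 429."""
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```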
Enforcement Rule¶
Input validation and rate limiting are not optional middleware. They are required on every public and agent-facing endpoint from T1P onwards.
Data Access Rule: Bus API vs Direct Database¶
All agents access PostgreSQL through the Control Bus API by default. Direct database access is reserved for bulk/heavy operations that run local to the database host.
data_access_policy:
default: bus_api
# All agents, all hosts — use REST endpoints
# Auth: OAuth via Bus
# Latency: ~50-100ms per request
# Suitable for: task CRUD, state reads, cost logging, queries < 100 rows
bulk_local: direct_postgresql
# Only agents running on the SAME HOST as PostgreSQL
# Use case: imports > 100 rows, report generation, cleanup jobs,
# data migration, analytics, batch cost aggregation
# Auth: local connection (Unix socket or localhost)
# Latency: ~1-5ms per query
Orchestrator Spawn Rule for Bulk Operations¶
When the orchestrator receives a task that requires bulk database operations:
- Classify the task — is it normal CRUD (<100 rows) or bulk (>100 rows)?
- If bulk: spawn the agent on the database host (Hetzner), never on a remote host
- Agent connects locally — `localhost:5432` or Unix socket, no network overhead
- Results flow back through the Bus API as normal
spawn_decision:
task_type: bulk_import | bulk_report | data_cleanup | migration
required_host: database_host # must run on same machine as PostgreSQL
access_method: direct_postgresql # bypass Bus for data operations
result_delivery: bus_api # results still reported through Bus
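The classification above can be sketched as a small routing function. The 100-row threshold comes from the data_access_policy; the task field names and row-estimate heuristic are illustrative, not canonical ODM fields:

```python
BULK_TASK_TYPES = {"bulk_import", "bulk_report", "data_cleanup", "migration"}

def plan_spawn(task: dict) -> dict:
    """Decide host and data-access method for a task before spawning."""
    is_bulk = (
        task.get("task_type") in BULK_TASK_TYPES
        or task.get("row_estimate", 0) > 100
    )
    if is_bulk:
        # Bulk work runs next to PostgreSQL; results still go via the Bus.
        return {
            "required_host": "database_host",
            "access_method": "direct_postgresql",
            "result_delivery": "bus_api",
        }
    return {
        "required_host": "any",
        "access_method": "bus_api",
        "result_delivery": "bus_api",
    }
```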
Examples¶
| Task | Access Method | Host | Why |
|---|---|---|---|
| Agent reads its next task | Bus API | Any host | Normal operation, ~60ms is fine |
| Agent updates task status | Bus API | Any host | Normal operation |
| Morning brief queries 50 activities | Bus API | Any host | Small result set |
| Import 5000 research records | Direct PostgreSQL | Database host only | Bulk — 500ms local vs 50s via API |
| Sprint cost report across all agents | Direct PostgreSQL | Database host only | Aggregation query over thousands of rows |
| Nightly cleanup of orphaned sessions | Direct PostgreSQL | Database host only | Scan + delete pattern |
| Knowledge Ledger batch write | Direct PostgreSQL | Database host only | High-volume append |
Rule¶
Remote agents (Mac, future cloud nodes) NEVER get direct PostgreSQL access. If a remote agent needs bulk operations, the orchestrator spawns a local agent on the database host to do the work, then the local agent reports results back through the Bus.
5.9 Reverse Proxy and Edge¶
Canonical reverse proxy:
- Caddy
Caddy should handle:
- HTTPS termination
- ingress routing
- host/domain routing
- simple edge policy
- low-friction certificate management
5.10 Observability Stack¶
Canonical T1P observability stack:
- OpenTelemetry for instrumentation
- Prometheus for metrics and alert-compatible scraping
- Grafana for dashboards, alert visualization, and operator inspection
Rule¶
Observability is not an optional enhancement. Every core service must emit enough signals to support:
- recovery
- alerting
- cost/usage inspection
- user-facing diagnosis
5.11 Testing Toolchain¶
Canonical T1P testing tools:
- pytest for backend unit, integration, and workflow tests
- Playwright for UI, browser, and mobile-surface end-to-end tests
Optional, not required on day one:
- Vitest for frontend component/unit tests if UI complexity justifies it
Rule¶
T1P does not need a sprawling test-tool matrix.
It needs one strong backend runner and one strong browser/E2E runner first.
5.12 CLI Toolchain¶
Canonical CLI tools for T1P agent and operator workflows:
| Tool | Purpose |
|---|---|
| gh | GitHub CLI — repo, PR, issue, release management |
| jq | JSON processor — structured data extraction and transformation |
| yq | YAML processor — config and state file manipulation |
| uv | Python environment — version management, dependency locking, script execution |
| rg (ripgrep) | Fast recursive search across codebases and knowledge files |
| fd | Fast file finder — replacement for find with sane defaults |
| Ruflo (claude-flow) | Agent execution runtime — spawning, coordination, memory |
| sops | Secrets encryption — encrypt/decrypt secrets in config files |
| age | Encryption backend for SOPS — key management |
| restic | Backup — automated snapshots to Backblaze B2 |
Rule¶
CLI tools are the primary execution interface for agents. MCP wrappers may exist for discovery or integration, but CLI is the default for production pipelines.
See resources/CLI_TOOLS_ASSESSMENT.md for the full assessment of available CLI tools and their roles.
See resources/DESIGN_cli_services.md for the config-driven CLI services framework design (operational commands executable via Bus API or devxio CLI).
5.13 Implementation Form of Key Roles¶
For T1P, roles should be implemented pragmatically.
Orchestrator¶
Implementation form:
- backend service / orchestration module
- not a separate mystical agent surface
Governor¶
Implementation form:
- backend governance service / policy engine
- not only a prompt persona
Rule steward / prompt steward / module steward roles¶
T1P implementation form:
- explicit services/modules with durable inputs and outputs
- but allowed to begin as thin application services rather than fully independent distributed systems
Rule¶
The blueprint keeps the roles. T1P is allowed to implement them with fewer deployables than named roles.
5.14 Security & Session Handling Decision¶
T1P security stance:
- operator-first
- single-tenant or very low-user-count
- strong infrastructure boundary
- simple application auth over a strong network boundary
Recommended pattern:
- Tailscale/private network for admin paths
- application login/session for the web UI
- no dependence on the UI for core execution safety
- no frontend-only secret handling
Secrets at Rest¶
Secrets are encrypted at rest using SOPS + age.
Key stored at ~/age-key.txt.
All configuration files containing secrets must use SOPS encryption. Plaintext secrets must never be committed to any repository.
5.15 Backup & Recovery¶
Automated backup via Restic to Backblaze B2, daily at 03:00 UTC.
Backup scope includes:
- PostgreSQL database dumps
- Configuration files
- Knowledge repository content
- State files
Retention policy is managed by Restic pruning rules.
See reference_backblaze.md for current B2 configuration details.
5.16 Final Technology Rule¶
The purpose of these decisions is to make the blueprint executable.
Any future stack change must be justified by:
- clear operational gain
- reduced risk
- or proven scale pressure
Not by novelty.
6. Runtime Topology¶
6.1 Node A — Cloud Control Node¶
Primary always-on environment for:
- orchestrator
- governor
- API/control services
- PostgreSQL
- scheduler/workers
- LiteLLM/router where needed
- runtime execution fabric
- knowledge services
- telemetry and backup jobs
6.2 Node B — Local Operator Node¶
Primary environment for:
- user interaction
- local CLI execution
- fallback sessions
- local knowledge access
- manual validation
- controlled local experiments
6.3 Node C — Future GPU / Model Node¶
Reserved for:
- self-hosted model serving
- embedding/indexing jobs
- compute-intensive background work
- isolated experimental inference
6.4 Node D — Future Product Runtime Node¶
Reserved for:
- customer-facing STRUXIO runtime
- product APIs
- workloads isolated from XIOPro control-plane services
7. Current State and Evolution¶
7.1 Current State¶
As of 2026-03-28, the system operates with:
- Node A: Hetzner CPX62 (16 vCPU AMD EPYC-Genoa, 30 GB RAM, 150 GB SSD) running 10 Docker containers (post-retirement of devxio-frontend, devxio-bridge, devxio-librarian, Neo4j)
- Node B: Mac Studio (Mac Worker, agent 010) connected via Tailscale VPN
- Orchestration: BrainMaster (agent 000) operating as proto-orchestrator
- Messaging: Bus-based inter-agent messaging (PostgreSQL-backed)
- Ticket tracking: Paperclip (to be superseded by ODM work graph)
- Dashboard: dashboard.struxio.ai (React) — current operator UI
- Knowledge: Hindsight running (localhost:8888/9999, Vectorize.io Docker)
- Backup: Restic to Backblaze B2, daily 03:00 UTC
- Secrets: SOPS + age encryption
7.2 Service Migration¶
Current services will transition to the XIOPro target architecture through a managed migration.
See resources/SERVICE_FATE_MAP_v4_2.md for the explicit mapping of:
- services to keep as-is
- services to evolve
- services to retire
- services to replace
No big-bang cutover. Old services run in parallel alongside new services until the new services are proven and functional parity is reached.
7.3 Target Direction¶
Move toward:
- cleaner orchestrator identity
- clearer governor identity
- explicit stewardship roles (rule steward, prompt steward, module steward)
- a DB-backed work graph
- a Research Center built on the Librarian
- a widget-based control center UI
- a rationalized module portfolio and infrastructure model
8. XIOPro vs Product Runtime¶
XIOPro must remain conceptually separate from STRUXIO product runtime.
XIOPro¶
- internal AI operating system
- execution and governance substrate
- research and knowledge system
- user control center
- internal optimization machine
STRUXIO Product Runtime¶
- customer-facing APIs and services
- product workloads
- external runtime isolation
- product-specific scaling and SLAs
XIOPro may build and operate product runtime, but it must not collapse into it.
9. Architectural Success Criteria¶
The architecture is successful when:
- execution continues without UI
- recovery is practical
- layers stay separable
- governance remains explicit
- knowledge compounds instead of fragments
- module optimization is real, not ad hoc
- human collaboration does not break state integrity
- future scale can happen without rethinking the whole machine
10. Final Statement¶
XIOPro is not one service.
It is a layered machine for governed execution, collaboration, research, and optimization.
If the layers stay clean, XIOPro remains buildable, recoverable, and evolvable. If they blur, the system regresses into expensive chat-shaped chaos.
Changelog¶
v5.0.0 (2026-03-28)¶
Changes from v4.1.0:
- C2.1: Added Section 5.11 CLI Toolchain with canonical CLI tools table and rule
- C2.2: Added SOPS + age secrets management to Section 5.13 (Security)
- C2.3: Added Section 5.14 Backup & Recovery (Restic to Backblaze B2)
- C2.4: Added service migration reference to Section 7.2, pointing to SERVICE_FATE_MAP_v4_2.md
- C2.5: Made async backbone decision explicit in Section 5.7 — PostgreSQL-backed dispatch is a deliberate choice, not an omission
- CX.1: Global naming fix — "Rufio" replaced with "Ruflo" throughout
- CX.2: Version header updated to 4.2.0, last_updated to 2026-03-28
- CX.3: Added this changelog section
- CX.4: Added Current State subsection (Section 7.1) documenting existing infrastructure
- Clarified 000/Ruflo relationship in Section 4.3
- Renumbered sections 5.11-5.13 to 5.12-5.15 to accommodate CLI Toolchain insertion
- Restructured Section 7 from "Current -> Target Evolution" to "Current State and Evolution" with explicit subsections
v5.0.2 (2026-03-28)¶
Agent naming migration to 3-digit unified IDs:
- Replaced O00/O01/R01/P01/M01 with 000 role-based naming throughout
- Replaced B1-B5 with 001-005 (domain brains)
- Replaced M0 with 010 (Mac Worker)
- Replaced BM with 000 (BrainMaster)
- Updated Mermaid diagrams to use 3-digit IDs
- Preserved Backblaze B2 references unchanged
v5.0.3 (2026-03-28)¶
Roles over numbers: Removed agent IDs from architectural descriptions, layer headers, diagrams, and role implementation sections. Current State section (7.1) uses "agent NNN" format for operational references. Blueprint describes WHAT roles do, not WHICH agent holds them.
v5.0.12 (2026-03-29)¶
Cross-references: Added pointer to resources/DESIGN_rc_architecture.md (Remote Control architecture design — Open WebUI evaluation, multi-provider chat routing, Prompt Composer integration) in Section 5.8 context. Added pointer to resources/DESIGN_cli_services.md (CLI services framework — config-driven operational commands via Bus API) in Section 5.12 context.
v5.0.13 (2026-03-29)¶
Batch BP update from recent tickets: Added 9 new Control Bus endpoint capabilities to Section 5.8 table — agent auto-pickup (/agents/{id}/pickup, /agents/{id}/tasks), agent health (/agents/health), agent metrics (/agents/{id}/metrics, /agents/metrics/summary), CLI services (/services), template registry (/templates), dashboard (/dashboard), message search (/messages/search), and project lifecycle (/projects).
v5.0.14 (2026-03-30)¶
Review fixes: C3: Added Bus Crash Recovery & Restart Sequence subsection to Section 5.8 — in-flight message handling, restart sequence (5 steps), pending intervention recovery, SSE reconnection behavior, at-least-once delivery guarantees with idempotent consumers. C8: Added Input Validation & Rate Limiting subsection to Section 5.8 — JSON Schema validation on all endpoints, 100 req/s per-agent and 1000 req/s global rate limits, 20KB body limit, 8 attachment limit, HTML escaping, parameterized SQL enforcement.
v5.0.15 (2026-03-30)¶
Round 2 review fix (SSE reconnection):
- Section 5.8: Added SSE Reconnection Behavior subsection — documents fallback to HTTP polling on disconnect, reconnect from last cursor via Last-Event-ID, 30-second keepalive heartbeat, 60-second connection drop detection, exponential backoff reconnection timing.