XIOPro Production Blueprint v5.0

Part 8 — Infrastructure & Deployment Architecture


1. Purpose

Defines the concrete infrastructure baseline required to run XIOPro as a headless-first, recoverable, secure, and provider-independent execution system.

This part specifies:

  • runtime environments
  • node roles
  • service boundaries
  • deployment topology
  • network shape
  • storage surfaces
  • installation inventory
  • scaling direction
  • operational constraints

This document is not a cloud wishlist. It is the execution platform contract for XIOPro.


2. Infrastructure Thesis

Infrastructure must support all of the following simultaneously:

  • continuous headless operation
  • recoverable multi-agent execution
  • explicit control-plane separation
  • durable state persistence
  • low-friction founder intervention
  • provider-swappable model access
  • future expansion without redesign

Infrastructure exists to make the architectural rules real.


3. Infrastructure Principles

3.1 Headless First

All critical execution must continue without UI.

The UI may observe and control, but must never become the only runtime path.


3.2 Durable State First

No important execution state may live only inside:

  • a terminal tab
  • a provider chat window
  • a single container memory space
  • an agent-local temp file

Durable state must land in authoritative storage surfaces.


3.3 Replaceability

The infrastructure must allow replacement of:

  • model providers
  • agent runtimes
  • API gateway/router
  • UI
  • storage backends
  • observability stack

without invalidating the XIOPro operating model.


3.4 Logical Separation Before Physical Separation

Even when colocated on one server initially, the following concerns must remain logically separated:

  • control plane
  • execution fabric
  • governance
  • data/state
  • knowledge services
  • ingress/API
  • observability
  • backup/recovery

3.5 Recovery Is Native

Infrastructure must assume:

  • session crash
  • provider disconnect
  • container restart
  • host reboot
  • partial service outage
  • network interruption
  • founder disconnect

Recovery is not a future enhancement. It is a base requirement.


3.6 Security by Reduction

Prefer:

  • private network paths
  • minimum exposed ports
  • minimum standing privileges
  • minimum long-lived secrets
  • explicit auditability

4. Canonical Environment Model

4.1 PRD -- Production Runtime

Primary live environment for XIOPro system operation.

Contains:

  • orchestrator runtime
  • governor runtime
  • API/control services
  • PostgreSQL
  • scheduler/worker services
  • LiteLLM router
  • Ruflo swarm runtime
  • knowledge service backends
  • observability services
  • backup jobs

4.2 TST -- Integration Validation

Used to validate:

  • schema changes
  • orchestration behavior
  • recovery behavior
  • deployment updates
  • service compatibility

TST must be structurally similar to PRD, but can run with reduced scale.


4.3 DEV -- Builder / Experiment Zone

Used for:

  • agent experiments
  • rule iteration
  • local service development
  • migration rehearsal
  • safe breakage

4.4 LOC -- Local Operator Node

Primary founder workstation environment.

Contains or may contain:

  • RC-capable local execution surfaces
  • local knowledge access
  • local file operations
  • CLI diagnostics
  • fallback execution
  • future local models
  • operator utilities

LOC is not the production control plane, but it is an important resilience and intervention node.


5. Runtime Node Topology

5.1 Node A -- Cloud Control Node (Hetzner CPX62)

Primary always-on control and execution node.

Actual Hardware Specs (as of 2026-03-28)

| Spec | Value |
| --- | --- |
| Provider | Hetzner Cloud |
| Instance type | CPX62 (shared vCPU, AMD) |
| CPU | 16 vCPU AMD EPYC-Genoa |
| RAM | 30 GB |
| Storage | 150 GB SSD (NVMe) |
| OS | Ubuntu 24.04 LTS |
| Location | Hetzner EU |

Responsibilities:

  • orchestrator control
  • governance control
  • API ingress
  • work graph persistence
  • scheduling
  • runtime coordination
  • background execution
  • telemetry collection

5.2 Node B -- Local Operator Node (Mac Studio)

Connected via Tailscale VPN (encrypted mesh).

Responsibilities:

  • founder interaction
  • local CLI execution
  • fallback RC-capable sessions
  • local knowledge access
  • manual validation
  • future local inference experiments

5.3 Node C -- Future GPU / Model Node

Reserved for:

  • self-hosted model serving
  • heavier local inference
  • embedding jobs
  • batch processing
  • specialized isolated workloads

5.4 Node D -- Future Product Runtime Node

Reserved for:

  • STRUXIO product APIs
  • customer-facing runtime isolation
  • product workloads separated from XIOPro control plane

6. High-Level Infrastructure Overview

```mermaid
flowchart TD
    User[User / Local Operator Node] --> Ingress[Ingress / API Gateway]
    Ingress --> Control[Control Services]
    Control --> Orchestrator["Orchestrator"]
    Control --> Governor["Governor"]
    Orchestrator --> Ruflo[Ruflo Execution Fabric]
    Ruflo --> Surfaces[Execution Surfaces]
    Surfaces --> Providers[Model Providers / Local Models]
    Orchestrator --> DB[(PostgreSQL)]
    Governor --> DB
    Control --> DB
    Control --> Knowledge[Knowledge / Librarian Services]
    Control --> Telemetry[Logs / Metrics / Alerts]
    DB --> Backup[Backup & Recovery]
    Knowledge --> Backup
```

7. Service Architecture

7.1 Control Plane Services

Core services that maintain system state and coordination:

  • API service
  • orchestrator service
  • governor service
  • scheduler service
  • worker/queue consumers
  • RC/escalation broker

7.2 Execution Fabric Services

Services responsible for agent execution and provider interaction:

  • Ruflo agent swarm engine
  • LiteLLM router
  • execution adapters
  • CLI/runtime bridges
  • provider connectors

7.3 Data and Knowledge Services

Authoritative storage and retrieval services:

  • PostgreSQL
  • knowledge/librarian service
  • index refresh jobs
  • document/asset storage references

7.4 Operational Services

Cross-cutting operations services:

  • reverse proxy / ingress
  • secrets delivery
  • backup jobs
  • log pipeline
  • metrics exporter
  • alert delivery

8. Canonical Service Inventory

8.1 Ingress / Reverse Proxy

Role:

  • terminate TLS
  • route inbound traffic
  • expose minimal public surfaces
  • forward requests to internal services

Examples:

  • Caddy
  • Traefik
  • Nginx

8.2 API Service

Role:

  • main entry point for UI and CLI
  • authentication and authorization
  • session/control endpoints
  • work graph access
  • human escalation endpoints

8.3 Orchestrator Service

Role:

  • reads tickets/tasks/state
  • assigns work
  • selects execution path
  • manages continuity
  • coordinates domain/worker agents

8.4 Governor Service

Role:

  • monitors cost, health, anomalies, and risk
  • enforces policy actions
  • raises alerts and intervention requests
  • proposes optimization actions

8.5 Ruflo Runtime Service

Role:

  • agent spawning
  • sub-agent lifecycle management
  • bounded multi-agent execution
  • runtime coordination hooks

8.6 LiteLLM Router Service

Role:

  • provider abstraction
  • model routing
  • fallback routing
  • usage metering integration
  • future local-model routing
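
The fallback-routing behavior expected from the router layer can be sketched as follows. This is an illustrative sketch of the routing contract, not the actual LiteLLM API; the provider names and error model are assumptions.

```python
# Hypothetical sketch of fallback routing as the router layer is expected
# to behave: try providers in priority order, fail over on unavailability,
# and surface which provider actually served the call.
from dataclasses import dataclass

@dataclass
class RouteResult:
    provider: str   # provider that served the call
    attempts: int   # how many providers were tried

class ProviderUnavailable(Exception):
    pass

def route_with_fallback(call, providers):
    """Try each provider in priority order; raise only if all fail."""
    last_error = None
    for attempt, provider in enumerate(providers, start=1):
        try:
            call(provider)                 # provider-specific invocation
            return RouteResult(provider=provider, attempts=attempt)
        except ProviderUnavailable as err:
            last_error = err               # record and fall through
    raise RuntimeError(f"all providers failed: {last_error}")

# Usage: the primary is down, so the fallback serves the request.
def fake_call(provider):
    if provider == "primary/gpt":
        raise ProviderUnavailable("timeout")

result = route_with_fallback(fake_call, ["primary/gpt", "fallback/claude"])
```

In practice the usage-metering hook would also record which provider served each call, so fallback frequency feeds the telemetry described in 8.12.5.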

8.7 Scheduler / Background Worker Service

Role:

  • recurring jobs
  • dream windows
  • maintenance jobs
  • index refresh
  • backup execution
  • telemetry rollups

8.8 PostgreSQL Service

Authoritative store for:

  • ODM entities
  • runtime state
  • session state
  • escalation state
  • governance events
  • cost records
  • audit events
  • control metadata

8.8.1 Connection Pooling

Connection pooling via PgBouncer (or the built-in `pool_size`) is recommended once agent count exceeds 15. The current Fastify pool is configured as `{ max: 20 }`. Monitor via `GET /metrics` using the `struxio_db_pool_*` gauge family.

Rules:

  • Below 15 agents: Fastify built-in pool (max: 20) is sufficient
  • At 15+ agents: evaluate PgBouncer in transaction-pooling mode as a sidecar to the PostgreSQL container
  • Pool exhaustion events must be captured as governance alerts (warning level)
  • struxio_db_pool_active, struxio_db_pool_idle, and struxio_db_pool_waiting gauges must be emitted to the observability stack
  • PgBouncer configuration (if adopted) must be SOPS-encrypted and managed via the same secrets path as PostgreSQL credentials
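
The pool-exhaustion alert rule above can be sketched as a predicate over the gauge values. The gauge names follow the `struxio_db_pool_*` family; the `PoolStats` shape and the threshold logic are illustrative assumptions, not the actual monitoring code.

```python
# Minimal sketch of the pool-exhaustion governance alert: waiting requests
# while the pool is fully in use produce a warning-level event.
from dataclasses import dataclass

@dataclass
class PoolStats:
    active: int    # struxio_db_pool_active
    idle: int      # struxio_db_pool_idle
    waiting: int   # struxio_db_pool_waiting

def pool_alert(stats: PoolStats, max_size: int = 20):
    """Return a governance alert dict when the pool is exhausted, else None."""
    if stats.waiting > 0 and stats.active >= max_size:
        return {
            "level": "warning",
            "event": "db_pool_exhausted",
            "detail": f"{stats.waiting} waiting, {stats.active}/{max_size} in use",
        }
    return None
```

Wiring this predicate to the observability stack keeps pool exhaustion visible as a governance signal rather than a silent latency source.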

8.9 Knowledge / Librarian Service

Role:

  • ingest knowledge sources
  • classify/index content
  • maintain retrieval structures
  • support render/export/query workflows

8.10 Object Storage / Backup Surface

Primary uses:

  • database dumps
  • snapshots
  • compressed transcripts
  • recovery packages
  • exported artifacts

8.11 Observability Stack

Core outputs:

  • logs
  • metrics
  • health state
  • error events
  • alert signals
  • future traces

8.12 Module Portfolio Infrastructure Linkage

Purpose

Infrastructure must provide the real-world constraints and capabilities that make module portfolio governance credible.

The module steward can recommend and optimize modules only within an actual hosting envelope.

That means infrastructure must expose enough information for the portfolio layer to reason about:

  • subscription-backed module access
  • API-backed module access
  • self-hosted module feasibility
  • local vs cloud placement
  • resource ceilings
  • operational complexity
  • fallback paths

8.12.1 Infrastructure Inputs Required by the Module Steward

Part 8 should provide the module steward with at least:

  • available execution nodes
  • node class and role
  • approximate compute profile
  • memory profile
  • storage considerations
  • network posture
  • public vs private connectivity assumptions
  • allowed runtime surfaces
  • operational risk notes
  • recovery and observability readiness

This is necessary so "recommended module" can mean: recommended and actually runnable.


8.12.2 Hosting Feasibility Principle

A module should not be marked portfolio-approved for self-hosted or local use unless there is a credible hosting profile for it.

A credible hosting profile must include at least:

  • target environment
  • resource assumptions
  • deployment complexity notes
  • security notes
  • recovery notes
  • observability notes
  • fallback path if the hosting path fails

8.12.3 Local / Cloud / Hybrid Evaluation

The module steward should be able to evaluate candidate module options against at least these hosting classes:

  • local Mac execution
  • Hetzner primary control node
  • future dedicated GPU/model node
  • future isolated product runtime node
  • hybrid cloud/provider access

Each class carries different tradeoffs in:

  • quality
  • stability
  • trust
  • latency
  • bandwidth
  • compute pressure
  • operational complexity

8.12.4 Subscription and Surface Awareness

Infrastructure and module governance must stay aligned on where module access actually exists.

This includes awareness of:

  • provider API access paths
  • provider subscription-backed surfaces
  • local CLI/runtime adapters
  • routing-layer reachability
  • fallback availability during provider failure

This prevents recommending modules that cannot actually be reached from the required runtime surface.


8.12.5 Optimization Telemetry Requirement

Infrastructure should preserve enough telemetry for portfolio optimization over time.

Useful telemetry includes:

  • latency by module and task class
  • error/failure rate by module and access path
  • cost / usage by module
  • retry rate by module
  • fallback frequency
  • node pressure when self-hosted or local
  • bandwidth pressure where relevant

This allows the module steward to optimize with evidence, not intuition alone.


8.12.6 Adoption Rule

Infrastructure may support evaluation and comparison of new modules, subscriptions, and self-hosted options.

But infrastructure must not auto-adopt them.

Adoption still requires governed approval and a deliberate rollout decision.


8.12A Bus API Rate Limits

The Control Bus enforces rate limits to protect stability and ensure fair access across all actors. These limits are active in the current Bus implementation.

Default Limits

| Limit | Value | Notes |
| --- | --- | --- |
| Default request rate | 100 req/min per actor | Warning logged at threshold; already implemented |
| Burst allowance | 200 req/min per actor | Allowed for short bursts; throttled (429) after sustained burst |
| SSE connections | 1 connection per actor per channel | Reconnect replaces the prior connection; no parallel SSE streams |
| Event emission rate | 50 events/min per actor | Applies to POST /events; excess events are queued or dropped with warning |

Rules

  • Rate limits are applied per actor_id, not per IP or session.
  • Burst capacity (200 req/min) is available for up to 30 seconds before throttle kicks in.
  • Throttled requests receive HTTP 429 with a Retry-After header.
  • Rate limit violations are logged as Bus warning events and visible in the Dashboard alert feed.
  • SSE reconnect on rate-limited channels retries after the backoff window (see Section 10.4 retry policy).
  • These limits protect Bus and PostgreSQL from agent runaway — they are not negotiable per-actor.

Tuning Principle

Rate limits may be raised globally only if sustained Bus latency remains below 200ms after the increase. Individual actors may not self-negotiate higher limits — only the Governor may authorize a limit adjustment via a Bus configuration change.
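
One way to realize the 100 sustained / 200 burst policy is a per-actor token bucket: a steady refill at the sustained rate plus a bucket sized to absorb roughly 30 seconds of burst. The parameter choices below are illustrative, not the Bus's actual implementation.

```python
# Per-actor token-bucket limiter: refill at 100 req/min sustained,
# with burst headroom of ~30s at an extra 100 req/min (capacity 50).
import time

class ActorLimiter:
    RATE = 100 / 60.0      # sustained refill: 100 requests per minute
    CAPACITY = 50.0        # burst headroom above the sustained rate

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.buckets = {}  # actor_id -> (tokens, last_refill)

    def allow(self, actor_id: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(actor_id, (self.CAPACITY, now))
        tokens = min(self.CAPACITY, tokens + (now - last) * self.RATE)
        if tokens >= 1.0:
            self.buckets[actor_id] = (tokens - 1.0, now)
            return True                    # request proceeds
        self.buckets[actor_id] = (tokens, now)
        return False                       # caller responds 429 + Retry-After

# Usage with a frozen clock: 50 back-to-back requests pass, the 51st throttles.
t = [0.0]
limiter = ActorLimiter(clock=lambda: t[0])
results = [limiter.allow("agent-7") for _ in range(51)]
```

Because limits are keyed on `actor_id`, a runaway agent exhausts only its own bucket and cannot starve other actors.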


8.13 Repository, Filesystem & Storage Layout

Purpose

XIOPro needs an explicit filesystem and repository model.

Without it, the system may have strong logic but weak operational discipline.

This section defines where source-of-truth assets live, how they are separated, and which storage surfaces are authoritative for which classes of data.

Principle

Not all data belongs in the same place.

XIOPro should separate:

  • versioned source assets
  • runtime state
  • large artifacts
  • backups
  • local operator files
  • experimental or temporary material

This prevents confusion between:

  • what is canonical
  • what is generated
  • what is recoverable
  • what is disposable

8.13.1 Canonical Storage Classes

Git Repositories

Use Git repositories for:

  • source code
  • blueprints
  • rules
  • skills
  • activations
  • prompt templates
  • runbooks
  • deployment definitions
  • scripts
  • configuration templates

Git is the human-readable and auditable source of truth for versioned text-based assets.

PostgreSQL

Use PostgreSQL for:

  • ODM entities
  • tickets
  • tasks
  • activities
  • runtimes
  • sessions
  • escalations
  • human decisions
  • policy objects
  • governance events
  • cost/usage rollups
  • scheduler state
  • indexing metadata

PostgreSQL is the authoritative operational state store.

Object / Blob Storage

Use object storage for:

  • transcript snapshots
  • checkpoints
  • recovery bundles
  • exported artifacts
  • large generated files
  • retained log bundles
  • research exports where size or format justifies it

Object storage is for durable large artifacts, not for the primary source of truth of structured runtime state.

Local Operator Filesystem

The local founder/operator node may hold:

  • local clones of approved repos
  • local working notes
  • sandbox experiments
  • review/export materials
  • temporary staging files
  • local tool caches

Local operator storage is useful, but it is not authoritative unless content is committed or ingested properly.


8.13.2 Canonical Repository Topology

The T1P repository topology should align to the actual active STRUXIO repository family rather than a generic placeholder structure.

Canonical active repos:

  • struxio-os
  • struxio-logic
  • struxio-design
  • struxio-app
  • struxio-business
  • struxio-knowledge

A transitional repo may still exist for a limited period:

  • struxio-aibus

Reference repos may also exist for research or inspiration, but they are not part of the canonical operating core.

struxio-os

Primary control-plane and operations repo.

Holds:

  • infra
  • state
  • tickets
  • deployment
  • runbooks
  • control-layer operational files
  • bootstrap/update scripts
  • ops-facing automation

struxio-logic

Primary cognition / behavior repo.

Holds:

  • agents
  • rules
  • skills
  • prompts
  • logic-layer governance assets
  • activation and protocol assets where appropriate

struxio-design

Primary architecture / blueprint / research repo.

Holds:

  • blueprint parts
  • architecture records
  • system maps
  • evolution notes
  • product design
  • PRDs
  • research artifacts and synthesis outputs where text-first is appropriate

struxio-app

Primary product/application implementation repo.

Holds:

  • app/runtime code
  • APIs
  • product-facing implementation
  • product integration surfaces
  • E2E test surfaces

struxio-business

Primary business / legal / finance / strategy repo.

Holds:

  • business assets
  • legal materials
  • finance materials
  • strategy
  • brand and fundraising assets

struxio-knowledge

Primary knowledge / research / reference repo.

Holds:

  • research artifacts
  • curated reference material
  • knowledge ledger assets
  • synthesis outputs
  • topic-indexed knowledge files

struxio-aibus (Transitional / Legacy)

Not a permanent first-class pillar.

Plan:

  • identify still-valuable code or documents
  • migrate what remains useful into canonical repos
  • archive the repo once no longer operationally required

Rule

Part 8 repository topology must stay aligned with the canonical active repo family used by the work plan and migration model.


8.13.3 Filesystem Class Rules

Within any repo or managed storage surface, files should conceptually fall into these classes:

  • source
  • generated
  • runtime
  • archive
  • temp

Source

Human-maintained canonical inputs.

Examples:

  • code
  • rules
  • skills
  • blueprints
  • configs
  • runbooks

Generated

System-produced durable outputs.

Examples:

  • exports
  • compiled artifacts
  • evaluation reports
  • generated documentation
  • synthesized summaries

Generated assets should not silently replace source assets.

Runtime

Operationally live mutable state.

Examples:

  • DB data
  • active checkpoints
  • session snapshots
  • job state

Runtime state belongs in state stores, not committed source repos.

Archive

Longer-lived retained material not needed for active work.

Examples:

  • retired reports
  • older exports
  • superseded bundles
  • long-term retained incident artifacts

Temp

Disposable staging content.

Examples:

  • scratch files
  • transient downloads
  • in-progress experiment outputs
  • tool caches

Temp must never be treated as authoritative.


8.13.4 Authoritative Repo / State Rules

The system must be explicit about which surface is authoritative.

Rules:

  • text assets -> authoritative in Git
  • runtime operational state -> authoritative in PostgreSQL
  • large artifacts / checkpoints / exports -> authoritative in object storage where applicable
  • local machine files -> non-authoritative until committed or ingested

No agent should assume a local filesystem copy is canonical merely because it exists.
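
The authoritative-surface rules above can be expressed as a simple lookup that agents consult before trusting a copy of an asset. The class names mirror the rules; the function itself is illustrative, not an existing XIOPro API.

```python
# Authoritative-surface lookup: which storage surface is canonical
# for each asset class (None = never canonical until committed/ingested).
AUTHORITATIVE_SURFACE = {
    "text_asset": "git",                 # code, rules, blueprints, configs
    "runtime_state": "postgresql",       # ODM entities, sessions, escalations
    "large_artifact": "object_storage",  # checkpoints, exports, bundles
    "local_file": None,                  # non-authoritative until committed
}

def is_authoritative(asset_class: str, surface: str) -> bool:
    """True only when `surface` is the canonical home for `asset_class`."""
    return AUTHORITATIVE_SURFACE.get(asset_class) == surface
```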


8.13.5 Research & Knowledge Storage Note

Research-related material may live across:

  • Git-managed knowledge assets
  • PostgreSQL metadata/indexing
  • object storage exports
  • local review workspaces
  • Obsidian/NotebookLM connected surfaces

But the system must still preserve clear distinction between:

  • raw source material
  • curated knowledge
  • generated derivative outputs
  • scheduled research artifacts

8.14 Cost Telemetry & Attribution Pipeline

Purpose

Infrastructure must collect cost and usage signals from the moment an agent/runtime uses a module, and preserve them in a form that is:

  • attributable
  • queryable
  • enforceable
  • optimizable

This supports Part 3 cost propagation and Part 4/Part 7 runtime governance.

Principle

Cost must be captured both:

  • during execution
  • after execution

This requires a pipeline, not only a dashboard.


8.14.1 Collection Stages

Stage 1 -- Raw Usage Emission

Execution surfaces, routers, and adapters should emit raw usage events when work happens.

Typical sources:

  • LiteLLM/router usage records
  • provider API responses
  • local runtime counters
  • subscription-surface usage approximations where exact billing is delayed
  • worker/task metadata

Stage 2 -- Activity Attribution

Raw usage must be attributed to the correct operational scope.

Minimum attribution targets:

  • activity
  • session
  • agent runtime
  • task
  • ticket
  • execution surface
  • module/provider
  • environment

Stage 3 -- Normalization

Usage must be normalized into comparable records.

Useful normalized fields include:

```yaml
cost_event:
  event_id: string
  timestamp: datetime

  activity_id: string|null
  session_id: string|null
  agent_runtime_id: string|null
  task_id: string|null
  ticket_id: string|null

  module_id: string|null
  provider: string|null
  access_path: string|null
  # api | subscription | self_hosted | hybrid

  usage_units_in: float|null
  usage_units_out: float|null
  estimated_cost: float|null
  billed_cost: float|null
  currency: string|null

  latency_ms: int|null
  retries: int|null
  node_id: string|null
  notes: string|null
```

Stage 4 -- Rollup

Rollups should aggregate by at least:

  • activity
  • task
  • ticket
  • project
  • module/provider
  • access path
  • runtime surface
  • day / week / month
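
A Stage 4 rollup over normalized `cost_event` records can be sketched like this. Field names follow the schema in Stage 3; the aggregation keys shown (ticket and day) are one of the required rollup dimensions.

```python
# Sketch of a cost rollup: aggregate estimated cost and retries
# per (ticket_id, day) from normalized cost_event records.
from collections import defaultdict
from datetime import datetime

def rollup_by_ticket_day(events):
    """Aggregate estimated cost, retries, and event count per (ticket, day)."""
    totals = defaultdict(lambda: {"estimated_cost": 0.0, "retries": 0, "events": 0})
    for e in events:
        day = datetime.fromisoformat(e["timestamp"]).date().isoformat()
        key = (e["ticket_id"], day)
        totals[key]["estimated_cost"] += e.get("estimated_cost") or 0.0
        totals[key]["retries"] += e.get("retries") or 0
        totals[key]["events"] += 1
    return dict(totals)

# Usage on two events for the same ticket on the same day.
events = [
    {"timestamp": "2026-03-28T10:00:00", "ticket_id": "T-12",
     "estimated_cost": 0.40, "retries": 1},
    {"timestamp": "2026-03-28T11:30:00", "ticket_id": "T-12",
     "estimated_cost": 0.25, "retries": 0},
]
rolled = rollup_by_ticket_day(events)
```

In production this aggregation would run in SQL against the PostgreSQL cost tables; the sketch only shows the grouping shape.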

Stage 5 -- Governance Consumption

Rollups and anomaly signals should feed:

  • the governor
  • breaker policies
  • budget policies
  • module steward optimization analysis
  • reporting/UI layers later

8.14.2 Collection Requirements by Access Type

API-Based Module Use

Preferred collection source:

  • router/provider response metadata
  • request/response usage counters
  • billing approximation tables
  • later reconciliation with actual billed usage where available

Subscription-Based Module Use

Exact billing detail may be weaker or delayed.

Minimum requirement:

  • record which runtime used which subscription-backed surface
  • approximate scope and intensity of use
  • preserve task/runtime attribution
  • support strategic optimization even when exact per-call pricing is unavailable

Self-Hosted Module Use

Collect at least:

  • runtime used
  • node used
  • time consumed
  • compute/memory pressure
  • queue/wait cost proxy
  • power/capacity proxy where useful later

Self-hosted cost is not zero just because no API bill exists.


8.14.3 Storage Rule

Cost telemetry should be stored in PostgreSQL as normalized operational records and rollups.

Large raw logs may additionally land in log/object storage, but authoritative attribution must remain queryable from the operational store.


8.14.4 Validation Rule

A task is not considered fully cost-observable unless XIOPro can answer at least:

  • which module(s) were used
  • by which runtime/surface
  • for which task/ticket
  • with what estimated or billed cost signal
  • with what latency/retry profile

If this cannot be answered, cost governance is incomplete.
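
The validation rule can be sketched as a predicate over a normalized `cost_event` record. Field names follow the Stage 3 schema; the minimum field set chosen here is an illustrative reading of the rule, not the canonical check.

```python
# Sketch of the cost-observability check: module, runtime, and task must
# be attributed, and at least one cost signal must be present.
REQUIRED_FOR_OBSERVABILITY = ("module_id", "agent_runtime_id", "task_id")

def is_cost_observable(event: dict) -> bool:
    """True only when attribution and a cost signal both exist."""
    if any(event.get(f) is None for f in REQUIRED_FOR_OBSERVABILITY):
        return False
    return (event.get("estimated_cost") is not None
            or event.get("billed_cost") is not None)
```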


8.14.5 Final Rule

Cost is not "a later finance report".

It is a live infrastructure signal that must be captured at execution time and preserved for both governance and optimization.


9. Deployment Model

9.1 Initial T1P Deployment

Initial production baseline:

  • single Hetzner CPX62 primary node
  • Docker Compose or equivalent simple orchestrator
  • all core XIOPro services colocated
  • strict logical separation between services
  • reverse proxy in front
  • PostgreSQL persistent volume
  • scheduled backup jobs
  • private admin access only

This is acceptable because the current need is:

  • founder-scale operation
  • rapid iteration
  • recoverability
  • low complexity

It is not acceptable to let "single-node MVP" become "undefined production."


9.2 Initial Container Groups

Recommended initial groups:

  • ingress
  • api
  • orchestrator
  • governor
  • ruflo
  • litellm
  • scheduler
  • workers
  • postgres
  • knowledge
  • telemetry
  • backup

9.3 Scale-Out Direction

When required, scale along these lines:

  1. split ingress/API from control services
  2. split PostgreSQL onto stronger isolated storage node
  3. split worker/runtime services from control node
  4. add dedicated GPU/model node
  5. isolate product runtime from XIOPro runtime

9.4 Non-Goals for Initial Phase

Do not introduce yet unless proven necessary:

  • Kubernetes
  • distributed queue complexity beyond real need
  • service mesh
  • heavy graph infrastructure
  • multi-region architecture
  • premature HA theater

These may become valid later, but are not required for T1P execution readiness.


9.5 Initial Hardware Baseline

9.5.1 Node A -- Hetzner CPX62 (Actual Specs)

The current production server is a Hetzner CPX62:

| Spec | Value |
| --- | --- |
| CPU | 16 vCPU AMD EPYC-Genoa (shared) |
| RAM | 30 GB |
| Storage | 150 GB SSD (NVMe) |
| OS | Ubuntu 24.04 LTS |
| Docker | Docker Engine 29.2.1, Docker Compose |
| Network | Public IPv4, Tailscale VPN overlay |
| Python | 3.12.3 |
| Node.js | 20.20.1 |

Practical Sizing Principle

The initial node must be sized for control-plane reliability first, not for speculative future self-hosted model serving.

That means it must comfortably support:

  • orchestrator service
  • governor service
  • PostgreSQL
  • API / ingress
  • Ruflo
  • LiteLLM
  • scheduler / workers
  • observability
  • backup jobs

without sustained resource contention.

Initial Recommendation Logic

Choose a Hetzner class that prioritizes:

  • CPU consistency
  • RAM headroom
  • fast NVMe/SSD
  • stable Linux support
  • easy vertical upgrade path

Do not size Node A around local-model aspirations. If self-hosted inference becomes real, it belongs on Node C.


9.5.2 Node B -- Local Operator Node (Mac Studio)

Current role:

  • founder interaction
  • RC-capable local sessions
  • local CLI operations
  • local validation
  • local knowledge work
  • fallback execution

Connected via Tailscale VPN (encrypted mesh, Hetzner <-> Mac).

Recommended baseline:

  • stable workstation environment
  • local CLI toolchain
  • secure admin access to Node A
  • local backup for critical operator-side configs
  • optional local container tooling for test/fallback

9.5.3 Node C -- Future GPU / Self-Hosted Model Node

This node is optional and deferred.

It becomes justified only when one or more conditions are true:

  • self-hosted models materially improve privacy
  • unit economics justify dedicated inference
  • batch embedding/index workloads become heavy
  • provider dependence becomes strategically limiting
  • offline or degraded-network resilience becomes important

Until then, Node C remains a reserved architectural slot, not an implementation obligation.


9.5A Container Memory Budget (CPX62 -- 30 GB)

With the CPX62 at 30 GB RAM, the memory budget after retirement of stale services is:

| Category | Estimated RAM | Notes |
| --- | --- | --- |
| Docker containers (current, post-retirement) | ~2.25 GB | 10 containers after retiring devxio-frontend, devxio-bridge, devxio-librarian, graph_stack_neo4j (Neo4j deprecated -- both instances removed) |
| Agent processes (orchestrator + 2 brains typical) | ~2-3 GB | Claude Code sessions via Max20 |
| System / OS | ~2 GB | Ubuntu 24.04, systemd, journald, etc. |
| Available headroom | ~22-24 GB | |
| New XIOPro backend + UI (budget) | 4-6 GB | FastAPI backend, Next.js UI, workers |
| Remaining free | ~16-20 GB | Comfortable margin for spikes |

This gives substantial headroom for the new XIOPro services. The CPX62 is not a constraint for T1P.

Realistic Concurrent Agent Estimate

Each Claude Code agent process consumes approximately 300-500 MB of RAM. With the CPX62's 30 GB:

| Component | Estimated RAM |
| --- | --- |
| Services baseline (13 containers) | ~10 GB |
| System / OS | ~2 GB |
| Available for agents | ~18-20 GB |
| Agent process (each) | ~300-500 MB |
| Realistic concurrent agents | 8-10 (at ~500 MB each, with ~3-5 GB buffer for spikes) |

The realistic maximum is 8-10 concurrent agents on the current CPX62. This accounts for:

  • Worst-case agent memory (~500 MB each)
  • A 3-5 GB safety buffer for memory spikes, background jobs, and transient allocations
  • The 85% RAM utilization hard limit from Part 1, Section 4.10 (no agent spawning above 85%)

Previous estimates of higher agent counts assumed smaller agent footprints. This revised estimate reflects observed Claude Code process sizes in production.
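
The spawn gate implied by the 85% hard limit can be sketched as a single check. The 85% threshold and the ~500 MB worst-case footprint come from the text above; the function itself is illustrative, not the orchestrator's actual code.

```python
# Sketch of the agent-spawn gate: admit a new agent only if its
# worst-case footprint keeps RAM utilization under the 85% hard limit
# (Part 1, Section 4.10).
def spawn_allowed(used_gb: float, total_gb: float = 30.0,
                  agent_gb: float = 0.5, hard_limit: float = 0.85) -> bool:
    """True when spawning one more worst-case agent stays under the limit."""
    return (used_gb + agent_gb) / total_gb < hard_limit
```

A gate like this, checked before every spawn, is what keeps the concurrency estimate a budget rather than a hope.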

Budget Rule

If total container memory exceeds 15 GB sustained, investigate:

  • which containers can be retired or consolidated
  • whether any service is leaking memory
  • whether workload should move to a separate node

See resources/SERVICE_FATE_MAP_v4_2.md for the full current-to-target service transition plan.


9.6 Installation Bill of Materials (T1P)

9.6.1 Host-Level Baseline

Node A should install and configure:

  • Ubuntu LTS base OS
  • Docker Engine
  • Docker Compose or equivalent simple orchestrator
  • UFW or nftables firewall
  • Tailscale or equivalent secure overlay
  • SSH server with key-only auth
  • fail2ban if SSH remains publicly reachable
  • log rotation baseline
  • backup scripting/runtime support
  • system time sync
  • unattended or managed security update strategy

9.6.2 Core XIOPro Service Set

Initial service set:

  • ingress / reverse proxy
  • API service
  • orchestrator service
  • governor service
  • Ruflo runtime service
  • LiteLLM router service
  • scheduler service
  • worker service(s)
  • PostgreSQL service
  • knowledge / librarian service
  • telemetry / monitoring service(s)
  • backup service / scheduled jobs

9.6.3 Supporting Operational Components

Recommended supporting components:

  • TLS certificate automation
  • environment/secrets injection mechanism
  • deployment scripts / make targets / runbooks
  • uv-based Python version/dependency/tool management for Python services and scripts
  • backup restore scripts
  • database migration runner
  • health-check endpoints
  • metrics exporter(s)
  • alert delivery integration

9.6.4 Deferred / Optional Components

Do not install for T1P unless clearly justified:

  • Kubernetes
  • service mesh
  • heavy queue infrastructure
  • dedicated tracing stack if basic telemetry is enough
  • vector/graph infrastructure without proven usage
  • GPU inference stack on Node A

9.6A CLI Toolchain

XIOPro follows a CLI-first principle: prefer CLI tools over MCP wrappers where both exist. CLI pipelines are faster, more composable, and more debuggable.

See resources/CLI_TOOLS_ASSESSMENT.md for the full assessment with install instructions.

See resources/DESIGN_cli_services.md for the config-driven CLI services framework design (operational commands executable via Bus API or devxio CLI, including DNS management via Porkbun API and infrastructure management via Hetzner hcloud CLI).

Already Installed

| Tool | Version | Purpose |
| --- | --- | --- |
| tmux | 3.4 | Terminal multiplexer |
| ripgrep (rg) | 14.1.1 | Fast code/text search |

Must-Have (install in Phase 0)

| Tool | Purpose | Install |
| --- | --- | --- |
| gh | GitHub CLI -- PR, issue, Actions automation | Official apt repo |
| jq | JSON processor -- API response parsing, config manipulation | apt install jq |
| uv | Python package manager -- 10-100x faster than pip, replaces pip+venv+pyenv | curl installer |
| fzf | Fuzzy finder -- history search, file navigation, pipeline glue | apt install fzf |
| fd | Fast find -- file discovery, respects .gitignore | apt install fd-find |
| yq | YAML processor -- state file manipulation, Docker Compose queries | wget binary |
| direnv | Per-directory env vars -- project isolation, agent env scoping | apt install direnv |
| hcloud | Hetzner Cloud CLI -- server, network, firewall management | Official apt repo |

Nice-to-Have (install when convenient)

| Tool | Purpose |
| --- | --- |
| bat | Syntax-highlighted file viewing |
| delta | Better git diffs |
| lazygit | Visual git TUI |
| xh | Friendlier HTTP client |
| dust | Visual disk usage |
| btm (bottom) | Visual system monitor |
| llm (Simon Willison) | Ad-hoc LLM queries from terminal |

Skip

| Tool | Reason |
|------|--------|
| aider | Overlaps with Claude Code |
| aichat | Overlaps with Claude Code |
| jj (jujutsu) | Evaluate later; needs Rust toolchain |

Install Script

A bootstrap script is provided in resources/CLI_TOOLS_ASSESSMENT.md, Section "Recommended Install Script". Cost: zero (all tools are free and open-source). Disk: under 200 MB total.


9.7 Network Exposure Matrix

9.7.1 Principle

Every port and entry point must have an owner and justification.

No service should be reachable from the public internet unless:

  • it is operationally required
  • it is protected
  • it is documented

9.7.2 Publicly Exposed Surfaces

Allowed public exposure should normally be limited to:

  • HTTPS ingress endpoint
  • optional HTTP -> HTTPS redirect endpoint

Public exposure should not directly include:

  • PostgreSQL
  • internal runtime adapters
  • scheduler
  • worker services
  • observability admin surfaces
  • raw agent runtimes

9.7.3 Private / Overlay-Only Surfaces

Prefer private-only access for:

  • SSH administration
  • database administration
  • internal dashboards
  • recovery tooling
  • deployment control
  • backup administration
  • founder/operator maintenance access

This is where Tailscale or equivalent is strongly preferred.


9.7.4 Internal Service Communication

Internal services should communicate over:

  • private Docker network(s)
  • host-local interfaces where practical
  • explicit service credentials
  • service-to-service allow rules

The infrastructure should avoid a "flat trust" model.
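
As one illustration, Docker Compose can encode this directly: each service joins only the named private networks it needs. The service and network names below are illustrative, not mandated.

```yaml
# Sketch: per-concern private networks instead of one flat bridge.
networks:
  control_plane:
    internal: true      # no direct route to the public internet
  data_plane:
    internal: true

services:
  api:
    image: xiopro/api:latest          # illustrative image name
    networks: [control_plane, data_plane]
  orchestrator:
    networks: [control_plane]
  postgres:
    image: postgres:16
    networks: [data_plane]            # reachable only by data-plane peers
```

A service omitted from a network simply cannot reach peers on it, which enforces the allow rules at the network layer rather than by convention.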


9.8 Domain / DNS / Surface Allocation

9.8.1 Principle

Surface naming should reflect service boundaries, not historical accidents.

Recommended pattern:

  • main XIOPro control surface
  • optional API subdomain
  • optional RC/escalation subdomain
  • optional knowledge subdomain
  • optional product/runtime subdomains later

9.8.2 T1P Surface Recommendation

For T1P, it is acceptable to expose only one or two public surfaces:

  • primary XIOPro control endpoint
  • optional API endpoint if separation is useful

Everything else may remain internal/private until needed.

This keeps complexity, certificate handling, and attack surface lower.


9.8.3 DNS Records (Active as of 2026-03-29)

Domain registrar: Porkbun. DNS managed via Porkbun.

| Record | Type | Value | Purpose |
|--------|------|-------|---------|
| bus.struxio.ai | A | 89.167.96.154 | Control Bus REST + MCP API |
| dashboard.struxio.ai | A | 89.167.96.154 | Control Center UI |
| paperclip.struxio.ai | A | 89.167.96.154 | Paperclip issue tracker |
| tickets.struxio.ai | A | 89.167.96.154 | Ticket management surface |
| chat.struxio.ai | A | 89.167.96.154 | Open WebUI chat interface |
| *.struxio.ai | CNAME | pixie.porkbun.com | Wildcard — covers all subdomains not listed above |

Note: The wildcard CNAME means devxio.struxio.ai (and any other unlisted subdomain) resolves automatically via *.struxio.ai. Caddy just needs a site block to serve it.

Explicit A records take precedence over the wildcard CNAME for the five listed subdomains.

All public-facing subdomains are reverse-proxied through Caddy with automatic TLS (Let's Encrypt).
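
For illustration, one such Caddy site block might look like this; the upstream port is an assumption, and Caddy provisions and renews the Let's Encrypt certificate on its own.

```caddyfile
# Illustrative site block; TLS is obtained and renewed automatically.
bus.struxio.ai {
    reverse_proxy localhost:8080   # upstream port is an example
}
```

Adding a new public surface under the wildcard is then a one-block change plus the existing DNS wildcard, with no manual certificate handling.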


9.9 Access Path Matrix

9.9.1 Founder Admin Path

Used for:

  • infrastructure administration
  • recovery
  • deployment
  • secrets handling
  • emergency intervention

Preferred path:

  • private overlay network
  • key-based auth
  • auditable commands

9.9.2 System Service Path

Used for:

  • service-to-service calls
  • scheduled jobs
  • DB access by approved services
  • runtime adapter communication

Requirements:

  • scoped credentials
  • least privilege
  • revocable access
  • auditable configuration

9.9.3 Agent Runtime Path

Used for:

  • execution requests
  • provider/model calls
  • artifact production
  • bounded interaction with control/data services

Restrictions:

  • no broad infrastructure admin rights
  • no unrestricted DB access
  • no unrestricted secrets access
  • only approved tools/endpoints

9.9.4 Service Placement Matrix

Principle

Every service must have a default execution home.

This avoids accidental sprawl, unclear ownership, and unnecessary cross-node complexity.

Node A -- Cloud Control Node (Hetzner CPX62)

Node A should host the initial authoritative platform baseline:

  • ingress / reverse proxy
  • API service
  • orchestrator service
  • governor service
  • Ruflo runtime service
  • LiteLLM router service
  • scheduler service
  • core worker service(s)
  • PostgreSQL service
  • librarian / knowledge service
  • telemetry / monitoring baseline
  • backup job runner
  • deployment / migration runner

Node B -- Local Operator Node (Mac Studio)

Node B is the founder-operated local execution and intervention node.

It may host:

  • local CLI surfaces
  • RC-capable local sessions
  • local validation tooling
  • emergency operator tools
  • local knowledge access
  • safe sandbox experiments
  • optional local container tooling for test/fallback

Node B must not be treated as the authoritative production control plane.

Node C -- Future GPU / Model Node

Node C is optional and deferred.

If added later, it should host only specialized higher-weight workloads such as:

  • self-hosted model runtimes
  • embedding or indexing jobs
  • heavier background processing
  • isolated experimental inference services
  • other compute-intensive workloads that should not burden Node A

Node C should not be required for initial correctness.

Node D -- Future Product Runtime Node

Node D is optional and deferred.

If introduced later, it should host:

  • STRUXIO product APIs
  • customer-facing runtime services
  • product-specific workloads isolated from XIOPro control-plane services

Node D exists to preserve separation between XIOPro internal operations and future product runtime responsibilities.

Rule

If a service has no explicit placement decision, it defaults to Node A for T1P.


9.9.5 Interface / Port Exposure Classes

Principle

T1P does not require a full port catalog yet, but it does require deterministic exposure classes.

Every interface must belong to one of the following classes.

Class A -- Public Internet Facing

Allowed only when operationally justified.

Typical examples:

  • HTTPS ingress endpoint
  • optional HTTP redirect endpoint

Requirements:

  • protected by reverse proxy
  • TLS enabled
  • documented owner
  • monitored
  • minimal surface only

Class B -- Private Overlay Only

Accessible only through Tailscale or equivalent secure overlay.

Typical examples:

  • SSH administration
  • internal dashboard access
  • deployment controls
  • recovery tooling
  • admin-only APIs

Requirements:

  • key-based or equivalent strong auth
  • operator-only access
  • auditable usage

Class C -- Internal Service Network Only

Never publicly exposed.

Typical examples:

  • PostgreSQL
  • scheduler
  • worker coordination
  • Librarian internal interfaces
  • telemetry collectors
  • service-to-service APIs

Requirements:

  • private Docker/network namespace or host-local isolation
  • explicit service identity
  • least-privilege credentials

Class D -- Localhost / Node-Local Only

Only reachable on the owning node.

Typical examples:

  • migration runners
  • emergency maintenance helpers
  • temporary admin endpoints
  • local-only debug utilities

Requirements:

  • disabled by default unless needed
  • never exposed externally by accident

Final Rule

No interface may exist without:

  • exposure class
  • owning service
  • access method
  • justification
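
A lightweight way to satisfy this rule is a checked-in interface registry. The schema below is a sketch, not a mandated format; field names are illustrative.

```yaml
# Sketch: one registry entry per interface.
interfaces:
  - name: https_ingress
    class: A              # public internet facing
    owner: caddy
    access: "TLS via reverse proxy"
    justification: "primary control surface entry point"
  - name: postgres
    class: C              # internal service network only
    owner: postgresql
    access: "private Docker network, service credentials"
    justification: "authoritative state store"
  - name: migration_runner
    class: D              # localhost / node-local only
    owner: deploy tooling
    access: "node-local invocation"
    justification: "schema migrations during deploys"
```

A registry like this also makes the exposure classes auditable: any listening port with no entry is, by definition, out of policy.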

9.9.6 Secrets Ownership and Injection Rules

Principle

Secrets must be scoped by role, not shared broadly across the platform.

Secret Classes

Examples of secret classes include:

  • provider API credentials
  • router/provider integration secrets
  • database credentials
  • session signing/application secrets
  • backup/storage credentials
  • deployment credentials
  • notification/integration secrets

SOPS + age Secrets Encryption

Secrets are encrypted at rest using SOPS + age.

| Component | Details |
|-----------|---------|
| Encryption tool | SOPS (Secrets OPerationS) |
| Key backend | age (modern file encryption) |
| Key location | ~/age-key.txt on Node A |
| Encrypted files | .sops.yaml configs, encrypted env files |

SOPS + age provides:

  • encryption at rest for all secret files in Git and on disk
  • per-file or per-key encryption granularity
  • Git-friendly encrypted diffs (only values are encrypted, keys are visible)
  • no external key management service required (age key is file-based)
  • simple rotation: re-encrypt with new age key
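
In practice this is driven by a `.sops.yaml` creation rule checked into the repo. The sketch below uses a placeholder age public key; real recipients come from `age-keygen`.

```yaml
# Illustrative .sops.yaml: files matching the pattern are encrypted for
# the listed age recipient. The key shown is a placeholder, not real.
creation_rules:
  - path_regex: .*\.enc\.env$
    age: age1qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq
```

With a rule like this in place, editing a matching file via `sops <file>` encrypts values transparently on save, and decryption requires only that `SOPS_AGE_KEY_FILE` points at the private key.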

Ownership Rules

Founder / Operator Only

The founder or emergency operator path may control:

  • root infrastructure credentials
  • overlay administration
  • DNS/domain credentials
  • emergency recovery credentials
  • secret issuance / rotation authority
  • age key management

Platform Services

Approved control-plane services may receive only the secrets they require.

Examples:

  • API service -> app/session secrets, scoped DB access
  • orchestrator / governor -> scoped platform secrets only where operationally necessary
  • LiteLLM/router -> provider credentials required for routing
  • backup service -> backup target credentials

Agent Runtimes

Agent runtimes must not receive broad secret visibility.

They should only receive:

  • task-scoped credentials
  • provider access via approved broker/router path
  • temporary credentials where justified

They must not receive:

  • unrestricted production DB credentials
  • infrastructure root credentials
  • blanket secret bundles

Injection Rules

Approved methods for T1P:

  • environment injection at container/service start
  • mounted secret files with restricted permissions
  • managed secret loading wrapper
  • SOPS-decrypted values injected at deploy time

Not allowed:

  • plaintext secrets in Git
  • plaintext secrets in blueprint docs
  • secrets embedded in tickets
  • secrets stored in general application tables unless explicitly encrypted and justified

Rotation Rule

Any secret class that can affect:

  • provider spend
  • production data
  • recovery access
  • external exposure

must be rotatable without redesigning the platform.


9.9.7 Environment Separation Rules

Principle

T1P must distinguish clearly between:

  • local/dev
  • production cloud
  • recovery/emergency operation

Local / Dev Environment

Local/dev may be less durable, but must not silently share production authority.

Rules:

  • no default reuse of production secrets
  • no default connection to production database
  • no hidden dependency on founder machine availability
  • safe to destroy and recreate
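
direnv (from the must-have toolchain) is one low-friction way to enforce these rules per checkout. The variable names and dev values below are illustrative, not mandated.

```shell
# Illustrative .envrc for a local/dev checkout: dev-scoped values only,
# never production credentials or the production database URL.
export XIOPRO_ENV=dev
export DATABASE_URL="postgres://localhost:5432/xiopro_dev"
export LITELLM_BASE_URL="http://localhost:4000"   # local router, not prod
```

After `direnv allow`, these load on entering the directory and unload on leaving it, so production values never bleed into unrelated shells.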

Production Cloud Environment

Production cloud is the authoritative execution environment.

Rules:

  • persistent state lives here
  • scheduled automation lives here
  • recovery baseline is validated here
  • headless execution must function without local GUI dependency

Recovery / Emergency Path

Recovery path must exist even if the main control surface is unavailable.

Minimum expectation:

  • private overlay access works
  • key administrative commands are documented
  • restore path is tested
  • one founder/operator path remains usable during failure scenarios

Final Rule

No environment may depend on undocumented manual steps for core recovery, restart, or access.


9.10 T1P Deployment Acceptance Checklist

A T1P infrastructure deployment is not accepted unless all are true:

  • Node A can reboot and recover services predictably
  • PostgreSQL persistence is verified
  • backup job runs successfully
  • restore procedure is documented
  • HTTPS ingress works
  • non-essential public ports are closed
  • Tailscale/private admin path works
  • orchestrator, governor, API, PostgreSQL, Ruflo, LiteLLM, scheduler, and backup jobs are observable
  • one task can run end-to-end headlessly
  • one interruption/restart scenario has been tested

9.10.1 Final Rule

If an infrastructure component is installed, it must satisfy one of these:

  • needed now for headless execution
  • needed now for recovery/security/observability
  • required to prevent near-term rework

Otherwise, defer it.


9.11 Bootstrap, Startup & Controlled Update Lifecycle

9.11.1 Purpose

XIOPro must be able to start, restart, update, and recover deliberately.

A serious headless system cannot depend on "manual remembering" to become operational after:

  • host reboot
  • deployment change
  • schema update
  • service crash
  • secret rotation
  • version rollout

This section defines the minimum controlled lifecycle.


9.11.2 Bootstrap Principle

Bootstrap must be:

  • scripted
  • repeatable
  • environment-aware
  • observable
  • rollback-conscious

If startup requires undocumented manual steps, bootstrap is incomplete.


9.11.2A Python Environment Standard

Python-based XIOPro services and scripts should use uv as the default tooling layer for:

  • Python version management
  • environment creation
  • dependency sync
  • lockfile-driven reproducibility
  • tool and script execution during bootstrap/update

Expected standards where applicable:

  • pyproject.toml
  • uv.lock
  • .python-version

Rule

Bootstrap and update automation for Python services should prefer uv-based workflows over ad hoc pip/venv handling.

The goal is:

  • faster environment setup
  • reproducible sync across Mac and Hetzner
  • cleaner CI/deploy behavior
  • fewer environment drift problems
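
A typical uv-based service bootstrap, sketched as a function rather than invoked here; it assumes uv is installed, and the service module name is illustrative.

```shell
# Sketch: lockfile-driven Python bootstrap with uv (functions only,
# nothing is executed against a real environment here).
bootstrap_python_service() {
  uv python install                  # installs the version pinned in .python-version
  uv sync --frozen                   # install exactly what uv.lock specifies
  uv run python -m xiopro_service    # run inside the managed environment
}

echo "bootstrap order: uv python install -> uv sync --frozen -> uv run"
```

The `--frozen` flag makes the sync fail rather than silently resolve new versions, which is what keeps Mac and Hetzner environments identical.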

9.11.3 Cold Start Sequence

A first-time or rebuilt environment should follow this order:

  1. host baseline ready
  2. network/security baseline ready
  3. secrets delivery path ready
  4. storage surfaces reachable
  5. PostgreSQL initialized or restored
  6. schema migrations applied
  7. core control services started
  8. scheduler/background jobs started
  9. knowledge/index refresh checks run
  10. observability/health checks confirmed
  11. workload admission opened

Rule

The system should not accept normal execution until foundational dependencies pass health gates.


9.11.4 Warm Restart Sequence

For ordinary reboot or redeploy:

  1. preserve or verify durable state
  2. restart PostgreSQL and storage dependencies
  3. restart control services
  4. rebind runtime/scheduler state
  5. verify pending sessions / checkpoints
  6. verify alerting and telemetry
  7. reopen execution intake

Warm restart should prefer continuity over full rebuild.


9.11.5 Controlled Update Flow

Every significant update should support:

  • planned target version
  • preflight validation
  • backup / snapshot before change
  • migration step if needed
  • health verification after rollout
  • rollback path if checks fail

Minimum stages:

controlled_update_flow:
  - preflight
  - snapshot
  - deploy
  - migrate
  - verify
  - reopen
  - rollback_if_needed
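
The stage list above can be sketched as a thin shell driver; each stage here is a stub that a real deploy script would replace with actual work.

```shell
# Sketch: controlled update driver. Any failing stage triggers rollback
# and stops the flow; stage bodies are placeholders.
stage_preflight() { echo "preflight ok"; }
stage_snapshot()  { echo "snapshot taken"; }
stage_deploy()    { echo "new version deployed"; }
stage_migrate()   { echo "migrations applied"; }
stage_verify()    { echo "health gates passed"; }
stage_reopen()    { echo "admission reopened"; }
stage_rollback()  { echo "rolled back to prior known-good"; }

run_update() {
  local stage
  for stage in preflight snapshot deploy migrate verify reopen; do
    if ! "stage_${stage}"; then
      echo "stage ${stage} failed"
      stage_rollback
      return 1
    fi
  done
  echo "update complete"
}

run_update
```

Keeping the stage order in one driver makes "rollback_if_needed" a structural guarantee rather than something an operator has to remember under pressure.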

9.11.6 Preflight Checks

Before deployment or upgrade, the system should check at least:

  • target environment identity
  • available disk/RAM headroom
  • secrets availability
  • database reachability
  • migration compatibility
  • backup readiness
  • current health baseline
  • operator approval where required

9.11.7 Health Gates

Startup/update should define health gates for at least:

  • PostgreSQL
  • API service
  • orchestrator
  • governor
  • scheduler
  • LiteLLM/router path
  • Ruflo/runtime path
  • backup jobs
  • telemetry/alerts

If health gates fail, the system should remain in degraded or closed admission mode until reviewed.
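
One minimal shape for such gating is a helper that takes (gate name, check command) pairs and refuses admission unless all pass; the checks shown at the end are stand-ins for real probes like `pg_isready` or a curl against a health endpoint.

```shell
# Sketch: evaluate named health gates; admission stays closed on any failure.
evaluate_gates() {
  local failed=0 name cmd
  while [ "$#" -ge 2 ]; do
    name="$1"; cmd="$2"; shift 2
    if eval "$cmd" >/dev/null 2>&1; then
      echo "PASS ${name}"
    else
      echo "FAIL ${name}"
      failed=1
    fi
  done
  if [ "$failed" -eq 0 ]; then
    echo "admission: open"
  else
    echo "admission: closed"
  fi
}

# Real usage would look like:
#   evaluate_gates postgres "pg_isready -q" api "curl -fsS localhost:8080/health"
evaluate_gates postgres "true" api "true"
```

The important property is the final line: degraded mode is the default outcome of any failed gate, not an operator judgment call.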


9.11.8 Runtime Admission Control

After bootstrap or update, XIOPro should reopen work in controlled order.

Suggested order:

  1. read-only status visibility
  2. manual/operator access
  3. scheduler and maintenance jobs
  4. controlled task execution
  5. full execution intake

This prevents unstable startup from immediately turning into unstable work.


9.11.9 Version & Migration Discipline

The platform must keep clear record of:

  • deployed service versions
  • DB migration level
  • blueprint/runtime compatibility notes
  • last successful deployment
  • last successful restore drill
  • pending upgrade blockers

This is necessary for recovery and auditability.


9.11.10 Self-Restart vs Self-Mutation Rule

XIOPro should be able to:

  • restart services
  • rebind sessions
  • resume controlled execution
  • propose updates
  • assist in rollout preparation

But it must not silently self-mutate production behavior without governed approval.

Self-recovery is allowed. Unapproved self-redefinition is not.


9.11.11 Success Criteria

Bootstrap and update discipline is successful when:

  • a host can reboot without operational chaos
  • a fresh node can be built from runbooks/scripts
  • deployments are repeatable
  • migrations are not guesswork
  • rollback is realistic
  • post-update health is explicit before execution resumes

9.11.12 Orchestrator Launch Commands

XIOPro orchestrator surfaces are launched via the devxio CLI command:

| Command | Surface | Host | Effect |
|---------|---------|------|--------|
| devxio go or GO | Global Orchestrator | Hetzner | Starts the primary 24x7 orchestrator session. Reads CLAUDE.md, memory files, plan.yaml, and resumes execution. |
| devxio mo or MO | Mac Orchestrator | Mac Studio | Starts the Mac-local orchestrator. Handles Mac tasks, browser testing, local experiments. Reports to GO via Control Bus. |

Both surfaces can run simultaneously. GO is always the primary. See Part 4, Section 4.1A for the full naming convention and rules.


10. Backup & Recovery

10.1 Principle

Recovery is not a future enhancement. It is a required runtime property.

XIOPro must be able to recover from:

  • node failure
  • process crash
  • session loss
  • database corruption
  • bad deployment
  • accidental deletion
  • provider-side disruption
  • operator error

10.2 Backup Scope

All critical persistence surfaces must be covered.

10.2.1 PostgreSQL

Must back up:

  • ODM entities
  • tickets
  • tasks
  • activities
  • runtimes
  • sessions
  • escalation requests
  • human decisions
  • governance state
  • cost and telemetry aggregates
  • scheduler state

10.2.2 Git Repositories

Must preserve:

  • source code
  • rules
  • skills
  • blueprints
  • prompts
  • configuration templates
  • scripts

Git is already versioned, but mirror/backup copies are still required.

10.2.3 Object / Blob Storage

Must preserve:

  • transcript snapshots
  • exported artifacts
  • checkpoints
  • large outputs
  • recovery bundles
  • retained logs

10.2.4 Configuration & Infrastructure State

Must preserve:

  • environment templates
  • Docker compose files
  • reverse proxy config
  • firewall config
  • job schedules
  • deployment scripts
  • secret references
  • runbooks

Secrets themselves should not be dumped into general backups unless explicitly encrypted and controlled.


10.2A Restic Backup to Backblaze B2

Automated backup runs daily via Restic to Backblaze B2. Implemented and operational as of 2026-03-28.

| Parameter | Value |
|-----------|-------|
| Tool | Restic |
| Target | Backblaze B2 bucket (STRUXIO-ai) |
| Schedule | Daily at 03:00 UTC (cron) |
| Script | /opt/struxio/backup/backup.sh |
| Scope | Workspace, configs, scripts, PostgreSQL dumps |
| Encryption | Restic built-in (AES-256) |
| Credentials | SOPS-encrypted (backup_secrets.enc.env), loaded at runtime via age key |

Backup Process (3 steps)

  1. Decrypt credentials — SOPS decrypts B2 account ID, key, and restic password from backup_secrets.enc.env using age key at ~/age-key.txt
  2. Dump PostgreSQLpg_dump for Bus DB and Paperclip DB to /opt/struxio/backup/pg_dumps/. Files named with date suffix. 7-day local retention.
  3. Restic backup — backs up workspace, bus config, scripts, and pg_dumps to B2. Tags: daily, hetzner.
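
Condensed as bash functions, the three steps look roughly like this (nothing is invoked here; paths and tags come from this section, while the database names are placeholders for the actual Bus and Paperclip database names):

```shell
# Sketch of the daily backup flow. Function bodies assume sops, pg_dump,
# and restic are installed and configured as described in this section.
decrypt_credentials() {
  set -a   # export everything the decrypted env file defines
  . <(sops --decrypt /opt/struxio/backup/backup_secrets.enc.env)
  set +a
}

dump_databases() {
  local day; day="$(date +%F)"
  pg_dump bus_db       > "/opt/struxio/backup/pg_dumps/bus_${day}.dump"
  pg_dump paperclip_db > "/opt/struxio/backup/pg_dumps/paperclip_${day}.dump"
  # enforce the 7-day local retention for dumps
  find /opt/struxio/backup/pg_dumps -name '*.dump' -mtime +7 -delete
}

push_to_b2() {
  restic backup \
    /home/struxio/STRUXIO_Workspace /opt/struxio/bus \
    /opt/struxio/config /opt/struxio/scripts /opt/struxio/backup/pg_dumps \
    --tag daily --tag hetzner \
    --exclude node_modules --exclude .git --exclude '*.log' \
    --exclude __pycache__ --exclude .venv
}

echo "flow: decrypt_credentials -> dump_databases -> push_to_b2"
```

The ordering matters: credentials first (restic needs them), dumps second (so the restic run includes today's dump), upload last.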

What Is Backed Up

| Path | Content |
|------|---------|
| /home/struxio/STRUXIO_Workspace | All 7 Git repos |
| /opt/struxio/bus | Bus MCP source and config |
| /opt/struxio/config | System configuration |
| /opt/struxio/scripts | Operational scripts |
| /opt/struxio/backup/pg_dumps | Daily PostgreSQL dumps (Bus + Paperclip) |

Excluded: node_modules, .git, *.log, __pycache__, .venv

Security

  • No plaintext credentials — B2 account key and restic password are SOPS-encrypted at rest
  • Decryption requires the age private key (~/age-key.txt) which is not in any Git repo
  • Backup data is encrypted by Restic (AES-256) before upload to B2

Retention Policy

Restic prunes automatically after each backup:

  • keep 7 daily snapshots
  • keep 4 weekly snapshots
  • keep 6 monthly snapshots


10.3 Backup Cadence

Database

  • logical dump: at least daily
  • WAL archiving: continuous (see Section 10.3A)
  • pre-deploy snapshot: required before high-risk migrations

Git / Markdown

  • mirrored continuously through Git remote
  • daily off-platform mirror recommended

Object Storage

  • continuous durable write pattern preferred
  • lifecycle retention policy required

Config / Infra State

  • export on every significant infrastructure change
  • nightly snapshot of deployment definitions recommended

10.3A PostgreSQL WAL Archiving for Point-in-Time Recovery

Daily logical dumps (Section 10.2A) provide a 24-hour RPO. WAL (Write-Ahead Log) archiving reduces the RPO to 5 minutes by continuously shipping transaction logs to Backblaze B2.

Configuration

wal_archiving:
  archive_mode: "on"
  archive_command: "restic backup --stdin --stdin-filename %f --tag wal < %p"
  # Alternative direct B2 shipping:
  # archive_command: "b2 upload-file STRUXIO-ai wal/%f %p"
  wal_level: "replica"
  max_wal_senders: 3
  wal_keep_size: "1GB"

RPO Target

  • Target RPO: 5 minutes (down from 24 hours with daily dumps alone)
  • WAL segments are archived continuously as they complete (typically every few minutes under normal load)
  • Combined with the daily base backup, any point in time within retention can be restored

Archive Destination

WAL segments are shipped to Backblaze B2 alongside the daily Restic backups:

| Component | Destination | Retention |
|-----------|-------------|-----------|
| Daily base backup (pg_dump) | B2 via Restic (existing) | 7 daily, 4 weekly, 6 monthly |
| WAL segments | B2 via Restic or direct B2 upload | 7 days minimum |

Point-in-Time Restore Procedure

  1. Identify target time — determine the recovery point (e.g., "2026-03-30 14:30:00 UTC")
  2. Restore base backup — restore the most recent daily pg_dump that precedes the target time
  3. Download WAL segments — retrieve all WAL files from B2 between the base backup and the target time
  4. Configure recovery — set recovery_target_time in postgresql.conf (or recovery.conf for older versions)
  5. Start PostgreSQL in recovery mode — PostgreSQL replays WAL segments up to the target time
  6. Validate — verify table counts, recent data, and ODM entity integrity
  7. Promote — remove recovery configuration and restart as primary
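
Steps 4 and 5 can be staged as follows for PostgreSQL 12+. PGDATA here points at a scratch directory for illustration, and fetch_wal.sh is a hypothetical helper that retrieves one archived WAL segment from B2 (the inverse of whatever archive_command stored it as).

```shell
# Stage a point-in-time recovery: write the recovery settings and drop
# the recovery.signal marker that puts PostgreSQL into recovery mode.
PGDATA="${PGDATA:-$(mktemp -d)}"

cat >> "${PGDATA}/postgresql.conf" <<'EOF'
restore_command = '/opt/struxio/backup/fetch_wal.sh %f %p'
recovery_target_time = '2026-03-30 14:30:00 UTC'
recovery_target_action = 'promote'
EOF

# Presence of recovery.signal tells PostgreSQL to replay archived WAL
# up to recovery_target_time on next start, then promote.
touch "${PGDATA}/recovery.signal"
echo "recovery staged in ${PGDATA}"
```

On older versions the same settings live in recovery.conf instead, and no signal file is used.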

Monitoring

  • Alert if WAL archiving falls behind by more than 5 minutes (archive lag)
  • Alert if archive_command fails 3 consecutive times
  • Include WAL archive status in the daily backup health check

Rule

WAL archiving is required for T1P production. Daily logical dumps alone are insufficient for a system managing active tickets, tasks, and governance state.


10.4 Retention Policy

Minimum target policy:

  • daily backups: 14 days
  • weekly backups: 8 weeks
  • monthly backups: 12 months
  • critical milestone backups: retained until manually reviewed

Session checkpoints and recovery bundles may use shorter retention if cost requires it, but production recovery points must remain sufficient for incident handling.


10.5 Recovery Priorities

Recovery order must follow business value.

Priority 1

  • database integrity
  • orchestrator state
  • governor state
  • active ticket/task continuity

Priority 2

  • session checkpoint restoration
  • transcript recovery
  • scheduler recovery
  • API availability

Priority 3

  • observability dashboards
  • historical exports
  • non-critical mirrors

10.6 Recovery Targets

Initial T1P targets:

  • infrastructure RPO target: <= 5 minutes (with WAL archiving; <= 24 hours without)
  • operational DB restore target: same day
  • critical session recovery target: best effort via checkpoint + transcript snapshot
  • redeploy target after node loss: scripted and repeatable

These are initial targets, not final enterprise targets. The key requirement is that recovery must be rehearsable and explicit.


10.7 Session & Runtime Recovery

Recovery must align with Part 3 and Part 4 runtime semantics.

Infrastructure must support:

  • runtime restart without losing ticket linkage
  • session rebind when possible
  • replacement session creation when rebind fails
  • recovery escalation to human when continuity is uncertain
  • durable storage of context snapshots and transcript references

Infrastructure recovery is not complete unless runtime continuity is addressed.


10.8 Restore Drill Requirements

A restore drill must be executable from runbook.

Minimum Drill Scenarios

  1. PostgreSQL restore to clean environment
  2. full service restart from deployment definitions
  3. object storage recovery validation
  4. recovery of one interrupted active task
  5. rollback to prior known-good deployment

If recovery is not tested, it is not real.

Monthly Restore Drill Procedure

A restore drill must run at least once per calendar month. The drill validates that B2 backups are actually recoverable, not just present.

Drill Steps

restore_drill:
  cadence: "monthly (first week of month)"
  executor: "GO or designated ops agent"

  steps:
    1_download:
      action: "Download latest Restic snapshot from B2"
      command: "restic restore latest --target /tmp/restore_drill/"
      verify: "Files exist at /tmp/restore_drill/"

    2_restore_db:
      action: "Restore PostgreSQL dump to temporary database"
      command: "createdb restore_drill_db && pg_restore -d restore_drill_db /tmp/restore_drill/pg_dumps/latest.dump"
      verify: "Database created without errors"

    3_verify_tables:
      action: "Verify table counts match production"
      checks:
        - "SELECT count(*) FROM tickets -> within 5% of production count"
        - "SELECT count(*) FROM tasks -> within 5% of production count"
        - "SELECT count(*) FROM messages -> within 5% of production count"
        - "SELECT count(*) FROM agent_runtimes -> non-zero"

    4_verify_recent_data:
      action: "Verify data freshness"
      checks:
        - "SELECT max(created_at) FROM messages -> within 24 hours of drill time"
        - "SELECT max(created_at) FROM tasks -> within 24 hours of drill time"
        - "WAL recovery test: if WAL archiving active, verify PITR to specific timestamp"

    5_cleanup:
      action: "Remove temporary resources"
      commands:
        - "dropdb restore_drill_db"
        - "rm -rf /tmp/restore_drill/"

    6_record:
      action: "Record drill results"
      output:
        file: "state/restore_drills.yaml"
        fields:
          - drill_date
          - snapshot_id
          - snapshot_age_hours
          - tables_verified
          - table_count_drift_pct
          - data_freshness_hours
          - wal_pitr_tested (boolean)
          - pass_fail
          - notes
          - executor

Drill Success Criteria

  • All tables restored without errors
  • Table counts within 5% of production
  • Most recent data within 24 hours of drill time (within RPO)
  • WAL PITR test successful (if WAL archiving is active)
  • Drill completes in under 30 minutes

Drill Failure Response

  • If drill fails: create a critical governance alert (backup.restore_drill.failed)
  • Root cause must be identified and fixed before the next scheduled drill
  • Two consecutive drill failures trigger a human escalation to the founder

11. Security

11.1 Principle

XIOPro security must protect:

  • proprietary strategy
  • source code
  • execution control
  • credentials
  • knowledge assets
  • product plans
  • customer-sensitive material

Security must be practical, layered, and compatible with headless execution.


11.2 Security Posture

Initial production posture:

  • minimal public exposure
  • private overlay access first
  • least privilege by default
  • founder-controlled admin path
  • explicit service boundaries
  • auditable changes

Public internet exposure should be minimized to only what is operationally required.


11.3 Access Model

Primary roles:

  • founder_admin
  • system_service
  • agent_runtime
  • emergency_operator
  • read_only_observer

Rules:

  • agents do not receive broad admin privileges
  • infrastructure administration remains human-controlled
  • service-to-service access uses explicit credentials
  • emergency paths must be documented and separate from normal automation

11.4 Network Security Baseline

Recommended baseline:

  • Tailscale or equivalent private overlay for administrative access
  • SSH restricted to approved identities only (currently restricted to Tailscale)
  • firewall deny-by-default posture (UFW active)
  • only required inbound ports opened
  • internal services bound privately where possible
  • reverse proxy terminates TLS for exposed services

The preferred posture is:

  • private access first
  • public exposure second

11.5 Secrets Management

Secrets must never live as unmanaged plaintext in:

  • code repositories
  • markdown blueprints
  • shared chat messages
  • container images

Minimum standard:

  • use environment injection or secret manager pattern
  • separate secrets by environment
  • rotate high-value credentials
  • maintain inventory of critical secrets
  • use scoped provider keys where supported
  • encrypt secrets at rest using SOPS + age (see Section 9.9.6)

Recommended categories:

  • provider API credentials
  • GitHub tokens
  • database credentials
  • object storage credentials
  • Tailscale / network auth material
  • domain / DNS / TLS credentials

11.6 Service Isolation

Services must be logically isolated even if colocated.

Isolation baseline:

  • separate containers for major services
  • separate service credentials
  • no unnecessary shared writable volumes
  • DB access limited by service role
  • execution runtimes separated from core control services where practical

Agent runtimes should not have unrestricted access to all system internals.


11.7 Endpoint Protection & Host Hardening

Baseline host controls:

  • timely OS security updates
  • non-root routine operation
  • SSH key auth only
  • fail2ban or equivalent if internet-facing SSH remains enabled
  • UFW / nftables firewall policy
  • audit of installed packages and open ports
  • disk encryption where supported and operationally practical

11.8 Security Logging & Audit

Must record:

  • admin logins
  • deploy events
  • secret changes
  • permission changes
  • breaker-triggered shutdowns
  • emergency access usage
  • unusual agent privilege attempts

Security-relevant events must be reviewable from an audit trail.


11.9 Incident Response Baseline

Every critical environment must have a basic incident path:

  • detect
  • contain
  • preserve evidence
  • rotate credentials if needed
  • restore service safely
  • document root cause
  • update controls

A simple runbook is sufficient initially, but undocumented response is not acceptable.


11.10 Emergency Access, Out-of-Band Recovery & Memory Pressure Survival

Purpose

XIOPro must remain recoverable even when normal access paths fail.

This includes cases such as:

  • host memory exhaustion
  • service thrash or restart loops
  • accidental firewall lockout
  • Tailscale failure
  • SSH unavailability
  • broken deploy causing loss of normal admin path

This section defines the minimum emergency-access discipline.


11.10.1 Principle

Private overlay access and normal SSH are the preferred control paths.

But they are not sufficient as the only recovery plan.

Every critical environment must also have a documented out-of-band recovery path.


11.10.2 Required Access Layers

The environment should support these layers in order:

  1. normal private admin path
     • Tailscale or equivalent
     • SSH with key-only auth
     • normal deployment and maintenance workflow

  2. degraded emergency operator path
     • limited but documented recovery path
     • safe rollback of firewall/network changes
     • ability to stop unstable services

  3. out-of-band host access
     • provider console / rescue mode / equivalent
     • keyboard/layout-aware emergency instructions
     • ability to restore basic reachability without guessing

Rule

A host is not operationally safe if only one access path exists.


11.10.3 Memory Pressure Survival Rule

The system must assume that memory exhaustion can impair:

  • SSH responsiveness
  • Tailscale responsiveness
  • service health
  • logging
  • the ability to run normal recovery commands

Therefore Node A must reserve enough operational headroom to allow emergency access and controlled recovery.

Minimum policy:

  • avoid sizing Node A so tightly that ordinary bursts can fully consume memory
  • prefer explicit RAM headroom over theoretical maximum utilization
  • treat repeated OOM behavior as a production-severity signal
  • preserve the ability to stop or pause non-critical services under pressure
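The headroom policy above can be made checkable. A minimal sketch, assuming Linux `/proc/meminfo` as the source and a hypothetical 4 GB floor; the threshold is illustrative, not a XIOPro-mandated value.

```python
def memory_headroom_ok(meminfo_text: str, min_headroom_mb: int = 4096) -> bool:
    """Return True if MemAvailable (as reported in /proc/meminfo) leaves the
    required operational headroom. Unknown state is treated as pressure (fail safe)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])            # value is reported in kB
            return kb // 1024 >= min_headroom_mb
    return False  # MemAvailable missing: assume the worst

# Usage on the host itself:
#   with open("/proc/meminfo") as f:
#       ok = memory_headroom_ok(f.read())
```

A periodic check like this, wired to an alert when it returns False, turns "repeated OOM behavior is a production-severity signal" into something the system can actually raise.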

11.10.4 Emergency Recovery Controls

At minimum, the environment should support these emergency actions:

  • stop or pause non-essential containers/services
  • restore firewall/network path to a safe known baseline
  • restart only core control-plane services first
  • verify DB health before reopening broader execution
  • keep an emergency runbook for Hetzner console / rescue operations
  • keep known-good command snippets accessible outside the affected host

Examples of Core-First Recovery Order

  1. regain admin access
  2. verify disk and memory state
  3. stop unstable/non-essential services
  4. verify PostgreSQL
  5. restore API/orchestrator/governor path
  6. restore scheduler and workers
  7. reopen task admission gradually
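The core-first order above can be expressed as a gated restart loop: each service starts only after the previous one passes its health check, and recovery halts at the first blocker instead of reopening broader execution. A minimal sketch; the service names and callables are hypothetical placeholders.

```python
def core_first_recovery(order, start, healthy):
    """Start services strictly in dependency order; halt at the first one
    that fails its health check so broader execution is never reopened
    on top of an unhealthy core."""
    started = []
    for name in order:
        start(name)               # e.g. `docker start <name>` behind the scenes
        if not healthy(name):
            return started, name  # (recovered so far, blocking service)
        started.append(name)
    return started, None          # full recovery, no blocker

# Hypothetical Node A order mirroring the list above:
NODE_A_ORDER = ["postgres", "api", "orchestrator", "governor", "scheduler", "workers"]
```

The return value makes the stopping point explicit, which is exactly the evidence an emergency operator should log before attempting anything destructive.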

11.10.5 Firewall Safety Rule

Firewall changes must be governed like risky production changes.

Minimum practice:

  • keep a known-good baseline policy
  • document rollback steps
  • avoid permanent lockout risk from one bad rule push
  • test private admin path after material firewall changes
  • keep console-level rollback instructions documented

The goal is not perfect automation. The goal is avoiding avoidable lockout.


11.10.6 Emergency Operator Role

The emergency_operator role exists for major incident recovery.

This role is separate from normal automation and should be able to:

  • use out-of-band access when needed
  • execute documented recovery commands
  • restore reachability
  • preserve evidence before destructive actions
  • log all meaningful emergency interventions

11.10.7 Runbook Requirement

At least one explicit emergency runbook must exist for Node A covering:

  • Tailscale unavailable
  • SSH unavailable
  • firewall rollback
  • memory exhaustion / OOM stabilization
  • service stop order
  • provider console usage
  • post-incident verification checklist

An undocumented emergency procedure is not a real emergency procedure.


11.10.8 Acceptance Rule

Infrastructure is not accepted as production-capable unless the team can answer:

  • how do we access the host if Tailscale fails?
  • how do we recover if SSH is unresponsive?
  • how do we recover if firewall changes block normal access?
  • how do we stabilize the host if memory is exhausted?
  • what is the exact first-command sequence in provider console mode?

If these answers are not documented, the security model is incomplete.


12. Observability

12.1 Principle

If XIOPro cannot observe itself, it cannot govern itself.

Observability must support:

  • runtime visibility
  • recovery
  • cost control
  • debugging
  • safety decisions
  • future optimization

12.2 Required Signals

Minimum required signal families:

  • logs
  • metrics
  • health checks
  • heartbeats
  • alerts
  • audit events

Tracing is recommended but may be phased in later.


12.3 Logging Requirements

Logs must exist for:

  • API layer
  • orchestrator
  • governor
  • scheduler
  • runtime adapters
  • database-related failures
  • deployment actions
  • security events

Log requirements:

  • structured where possible
  • timestamped
  • correlated by request/session/task IDs where possible
  • retained according to environment policy
  • searchable during incidents
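The "structured, timestamped, correlated" requirements can be met with stdlib logging alone. A minimal sketch; the correlation field names (`task_id`, `session_id`) are illustrative, not a mandated schema.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON line with timestamp and correlation IDs."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,
            "msg": record.getMessage(),
            # correlation fields, attached via `extra=` at the call site
            "task_id": getattr(record, "task_id", None),
            "session_id": getattr(record, "session_id", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("orchestrator")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("task admitted", extra={"task_id": "T-123", "session_id": "S-9"})
```

One JSON object per line is what makes logs greppable and machine-searchable during incidents without a heavyweight logging stack.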

12.4 Metrics Requirements

Minimum operational metrics:

Platform

  • CPU
  • memory
  • disk
  • network
  • container restarts
  • process uptime

Runtime

  • active runtimes
  • active sessions
  • waiting human escalations
  • failed runs
  • retries
  • queue depth

Business/Execution

  • tickets in progress
  • tasks completed
  • task latency
  • session recovery count
  • human intervention count

Cost

  • provider cost estimate
  • per-runtime estimated spend
  • per-task estimated spend
  • infra cost trend

12.5 Health Model

Each core service must expose a health view:

  • healthy
  • degraded
  • blocked
  • failed

Minimum monitored services:

  • API
  • orchestrator
  • governor
  • database
  • scheduler
  • runtime adapter layer
  • reverse proxy
  • object storage connectivity
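Aggregating per-service states into one system verdict follows a simple rule: overall health is the worst state among monitored services. A minimal sketch using the four states above; the fail-safe treatment of "no signal" is an assumption, not a stated XIOPro rule.

```python
# Order matters: later states are strictly worse.
SEVERITY = ["healthy", "degraded", "blocked", "failed"]

def overall_health(components: dict) -> str:
    """System health is the worst state among monitored services.
    An empty component map means we have no signal, which is itself a failure."""
    if not components:
        return "failed"  # fail safe: absence of telemetry is not health
    return max(components.values(), key=SEVERITY.index)
```

This "worst-of" rule prevents a green dashboard from hiding a single blocked core service.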

12.6 Alerting Baseline

Alerts must be routed by severity.

Critical

  • database unavailable
  • orchestrator down
  • repeated session recovery failure
  • secret/security incident
  • runaway cost spike

Warning

  • elevated retry rate
  • queue backlog
  • degraded disk space
  • failed backup job
  • runtime adapter instability

Info

  • deploy complete
  • scheduled maintenance
  • non-critical optimization suggestions
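Severity routing can be sketched as a simple dispatch table. The channel names are hypothetical; the one deliberate choice shown is that an unknown severity escalates rather than disappearing silently.

```python
# Hypothetical channel names; real routing would target pager, UI, digest, etc.
ROUTES = {
    "critical": ["page_operator", "dashboard"],
    "warning":  ["dashboard", "daily_digest"],
    "info":     ["daily_digest"],
}

def route_alert(severity: str, message: str):
    """Fan an alert out to every channel registered for its severity.
    Unknown severities escalate to the critical channels: fail loud, not silent."""
    channels = ROUTES.get(severity, ROUTES["critical"])
    return [(channel, message) for channel in channels]
```

The escalation default matters: a typo in a severity string should page someone, not drop the alert.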

12.7 Dashboard Requirements

At minimum, the operator must be able to see:

  • system health
  • active runtimes
  • active sessions
  • waiting escalations
  • error count
  • recovery events
  • cost trend
  • backup status

This may begin with simple dashboards, but the signals themselves are mandatory.


12.8 Observability Storage & Retention

Explicit retention rules must be defined for:

  • operational logs
  • audit logs
  • metrics history
  • incident snapshots

Retention length may vary by cost, but critical incident analysis must remain possible.


13. Cost Strategy

13.1 Principle

Infrastructure cost must be:

  • visible
  • attributable
  • governable
  • optimized without harming reliability

Cost strategy is not only about lowering spend. It is about choosing the right cost for the right leverage.


13.2 Cost Categories

Track at least these categories:

  • hosting / compute
  • storage
  • network / bandwidth
  • backup retention
  • observability tooling
  • provider runtime/API spend
  • local hardware / future self-hosted capacity

13.3 Attribution Model

Infrastructure should support attribution by:

  • environment
  • node
  • service
  • runtime surface
  • ticket or project where practical

This enables the governor and the operator to answer:

  • what is expensive
  • why it is expensive
  • whether it is justified
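Attribution itself is a rollup over raw spend records. A minimal sketch, assuming records are dicts carrying the attribution dimensions as keys and a `usd` amount; none of this reflects an actual XIOPro schema.

```python
from collections import defaultdict

def attribute(spend_records, dimension):
    """Roll raw spend records up by one attribution dimension
    (environment, node, service, runtime surface, ticket...).
    Records missing the dimension land in an explicit 'unattributed' bucket."""
    totals = defaultdict(float)
    for rec in spend_records:
        totals[rec.get(dimension, "unattributed")] += rec["usd"]
    return dict(totals)

# Example: per-service rollup
records = [
    {"service": "orchestrator", "usd": 1.50},
    {"service": "orchestrator", "usd": 0.50},
    {"usd": 0.25},  # untagged spend surfaces explicitly instead of vanishing
]
```

Making the "unattributed" bucket visible is the point: spend that cannot be explained is exactly the spend the governor should question first.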

13.4 Cost Control Rules

Initial rules:

  • avoid idle heavyweight services without clear value
  • scale up only when signal justifies it
  • prefer simple colocated deployment before fragmentation
  • separate services only when risk, cost, or operational pressure justifies it
  • prune unused storage and log retention intentionally

13.5 Scale-Up Triggers

Infrastructure upgrade may be justified when one or more apply:

  • repeated CPU or memory saturation
  • queue growth impacting execution goals
  • session recovery degradation due to node pressure
  • observability overhead becoming material
  • self-hosted model experimentation requiring isolated compute
  • product workloads contaminating XIOPro control-plane stability

13.5A Scaling Triggers

The following specific conditions trigger a scaling evaluation. Meeting one trigger does not mandate action — it requires a deliberate review and decision. GO is responsible for raising the evaluation; the decision requires operator approval.

| Signal | Threshold | Evaluation Required |
|---|---|---|
| PostgreSQL write latency | > 50ms sustained at 10+ concurrent agents | Evaluate read replicas |
| Host memory | > 75% sustained (any host) | Add new host |
| Bus request latency | > 200ms p95 | Evaluate caching layer |
| Agent spawn queue depth | > 5 pending spawns | Distribute spawn load to additional hosts |
| Concurrent agent count | > 8 active simultaneously on a single host | Evaluate second host or reduce parallelism |
| Disk usage | > 80% on any data volume | Archive old activity partitions to B2; evaluate volume expansion |

Rules

  • Triggers are measured over a sustained window (minimum 5 minutes), not transient spikes.
  • A trigger that clears before review requires no action but should be logged.
  • Scaling adds operational complexity — it must be justified by signal, not by precaution.
  • GO reports trigger events via Bus alert (L3 or higher) so IO can route to the founder for decision.
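The sustained-window rule can be made precise: a trigger fires only when every sample in the window breaches the threshold, so a transient spike inside the window never qualifies. A minimal sketch; the sample cadence (how many samples make up 5 minutes) is left as a deployment assumption.

```python
def sustained_breach(samples, threshold, window=5):
    """True only if the last `window` consecutive samples all exceed the threshold.
    Fewer samples than the window means no verdict yet, so no trigger."""
    if len(samples) < window:
        return False
    return all(value > threshold for value in samples[-window:])

# Example: host memory percentage sampled once per minute, 75% threshold.
# A single 60->40 dip inside the window resets the sustained condition.
```

This is deliberately conservative: under Rule 1 above, a spike that clears before the window fills is logged, not acted on.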

13.6 Hetzner Upgrade Policy

Initial assumption:

  • one primary Hetzner CPX62 node is acceptable for T1P

Upgrade path should remain open for:

  • larger CPU / RAM node
  • split DB to dedicated node
  • split runtime workers from control plane
  • add dedicated GPU / model experimentation node later

No upgrade should be performed only because it feels more "serious". Upgrade must follow observed bottlenecks.


13.7 Self-Hosted Model Decision Rule

Future self-hosted model infrastructure should be evaluated only if it improves one or more of:

  • privacy posture
  • unit economics
  • latency
  • offline resilience
  • provider independence
  • special workload suitability

It should not be adopted merely because self-hosting sounds strategic.


14. Service Fate Map Reference

The transition from current services to v5.0 target architecture is documented in:

resources/SERVICE_FATE_MAP_v4_2.md

This resource maps every currently running service/container to its v5.0 fate:

  • KEEP: Caddy, PostgreSQL (upgrade), Hindsight, ISO 19650 engine (product code -- see MVP1_PRODUCT_SPEC.md), Tailscale, UFW, Restic, SOPS+age, Ruflo, Claude Code, AutoDream
  • KEEP + EVOLVE: Bus (-> API gateway/relay), LiteLLM (activate routing)
  • KEEP for now: Paperclip (until ODM parity), Tickets renderer, RC keepalive
  • REPLACE: Dashboard (-> Control Center)
  • RETIRE: devxio-frontend, devxio-bridge (stale pre-v3.1 code)
  • RETIRED (deprecated): devxio-librarian (631 MB Neo4j), graph_stack_neo4j (1.2 GB) -- both Neo4j instances stopped and removed

Retirement RAM Impact

Retiring stale services frees approximately 1.95 GB, leaving approximately 26 GB available for new XIOPro backend, UI, and worker services on the CPX62.

Parallel Operation Rule

During migration, old services (Bus, Paperclip, dashboard) run alongside new services. No big-bang cutover. Parallel-run until new services are proven and feature parity is reached.


15. Current State

As of 2026-03-28, the infrastructure layer is operational:

What exists today:

  • Hetzner CPX62 running Ubuntu 24.04 with 14 Docker containers (~4.2 GB RAM)
  • Caddy reverse proxy with TLS and basic auth
  • PostgreSQL (bus database, 44 MB)
  • XIOPro Control Bus (evolving from Bus MCP): REST API :8088, SSE Push :8089, OAuth 2.1, PostgreSQL-backed. Currently 107 MB. Being extended with push delivery, intervention, task orchestration, agent registration, host capacity, and spawn coordination (see Part 2, Section 5.8)
  • Paperclip issue tracker + DB (339 MB combined)
  • Hindsight memory system (1.06 GB, Vectorize.io Docker)
  • LiteLLM router (576 MB, not actively routing under Max20)
  • ISO 19650 engine (57 MB, product code -- see MVP1_PRODUCT_SPEC.md)
  • ~~Two Neo4j instances~~ (deprecated -- both stopped and removed, 1.83 GB freed)
  • Phase 1 React dashboard (11 MB)
  • Pre-v3.1 stale frontend + bridge (123 MB, candidates for immediate retirement)
  • Tailscale VPN mesh (Hetzner <-> Mac)
  • UFW firewall ACTIVE (SSH restricted to Tailscale 100.64.0.0/10, HTTP/HTTPS public, default deny incoming). Enabled 2026-03-28.
  • Root password set for emergency Hetzner console access
  • struxio user has sudo access
  • Restic backup to Backblaze B2 (daily 03:00 UTC)
  • SOPS + age for secret encryption
  • Git history cleaned: plaintext secrets purged from STRUXIO_OS repo history via git-filter-repo (2026-03-28). Only SOPS-encrypted versions remain.
  • Supply chain security: Socket.dev + GuardDog recommended for behavioral malicious package detection. Trivy for container scanning. pip-audit/npm-audit for CVE baseline.
  • RC keepalive cron (every 10 min)
  • Ruflo (claude-flow) for agent teams
  • Claude Code v2.1.86 with Max20 OAuth
  • AutoDream enabled (memory consolidation)
  • tmux 3.4, ripgrep 14.1.1 installed

What must be built/changed:

  • Install must-have CLI tools (gh, jq, uv, fzf, fd, yq, direnv)
  • Retire stale containers (devxio-frontend, devxio-bridge)
  • ~~Evaluate Neo4j instances for retirement~~ (done -- both retired, see Part 5 Section 12.1)
  • Add pg_dump to restic backup scope
  • Build new FastAPI backend + Next.js UI services
  • Upgrade PostgreSQL to become primary ODM state store
  • Evolve Bus into API gateway or keep as messaging relay

16. Infrastructure Success Criteria

Infrastructure is successful only if the following are true:

16.1 Reliability

  • core services start reproducibly
  • system can run continuously
  • failures are detectable
  • restart procedures are documented

16.2 Recoverability

  • backups exist and are valid
  • restore drill is executable
  • runtime/session recovery path is defined
  • bad deployments can be rolled back

16.3 Security

  • secrets are controlled
  • access is role-scoped
  • public exposure is minimized
  • audit trail exists for critical actions

16.4 Observability

  • core services emit useful telemetry
  • critical alerts reach the operator
  • cost and health are visible
  • incident diagnosis is possible without guesswork

16.5 Scalability

  • architecture can separate services without redesign
  • local node remains viable as fallback or augmentation
  • future GPU or product nodes can be added cleanly

16.6 Cost Discipline

  • infrastructure spend is explainable
  • upgrade decisions are signal-based
  • expensive idle complexity is avoided

Infrastructure that merely "runs" is not enough. It must be operable, recoverable, and governable.


17. Naming Conventions

All STRUXIO repositories, folders, and files follow a four-rule naming standard. These rules ensure consistency across GitHub, local disk, and internal structure.

17.0 General Principles

  1. Case-insensitive uniqueness: Never create two files or folders with the same name differing only by case. Uppercase in Mac root folders is for human readability only — the system must treat names as case-insensitive for search and deduplication.
  2. XIOPro and STRUXIO are proper names: Always written in uppercase. They are brand names with no abbreviation or meaning to decode — keep as-is everywhere.
  3. Mac vs Hetzner convention: Mac uses STRUXIO_ prefix on top-level folders for Finder readability. Hetzner uses the GitHub lowercase name (the git clone default). Both are valid — they map to the same repo (see Section 17.5).
  4. External tool names kept as-is: Third-party tool names (Neo4j, PostgreSQL, Caddy, Backblaze, Tailscale) retain their original casing in all documents.
  5. High-level folders are descriptive: Use full words — STRUXIO_Design (not STRUXIO_D), STRUXIO_Knowledge (not abbreviated). The folder name should explain what it contains.

17.1 Rule 1 — GitHub Repository Names

  • All lowercase.
  • Words separated by hyphens (-).
  • Must start with struxio-.

Examples: struxio-design, struxio-app, struxio-knowledge

17.2 Rule 2 — Local Top-Level Folders (Repos on Disk)

  • Mac: Start with STRUXIO_. Use underscores (_). CamelCase or logical uppercase for readability.
  • Hetzner: Use GitHub lowercase name as cloned (e.g., struxio-design). No renaming needed.
  • These represent the repos and are the exception to the lowercase rule on Mac.

Examples (Mac): STRUXIO_Design, STRUXIO_OS, STRUXIO_Knowledge, STRUXIO_DEVXIO_UI Examples (Hetzner): struxio-design, struxio-os, struxio-knowledge

17.3 Rule 3 — Structure Folders (Inside Repos)

  • All lowercase.
  • Words separated by underscores (_).
  • No spaces, no hyphens.

Examples: 02_devxio_architecture, blueprint_devxio_bl_v4_2_set, resources

The daily folder cleanup cron at 04:00 UTC enforces Rule 3 (lowercase structure folders).

17.4 Rule 4 — File Names

  • Start with a function/type prefix in UPPERCASE.
  • Rest uses appropriate casing for readability.

Examples: BLUEPRINT_XIOPro_v4_2_Part1_Foundations.md, SKILL_REGISTRY.yaml, REVIEW_final_freeze_v4_2.md, PLAN_iso19650_integration.md
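Rules 1, 3, and 4 are mechanical enough to validate automatically, as the 04:00 UTC cleanup cron already does for Rule 3. A minimal sketch of such checks; the regexes are my reading of the rules above, not the cron's actual implementation.

```python
import re

RULES = {
    "repo":   re.compile(r"^struxio(-[a-z0-9]+)+$"),      # Rule 1: lowercase, hyphens, struxio- prefix
    "folder": re.compile(r"^[a-z0-9]+(_[a-z0-9]+)*$"),    # Rule 3: lowercase, underscores only
    "file":   re.compile(r"^[A-Z][A-Z0-9]*_.+"),          # Rule 4: UPPERCASE function/type prefix
}

def check(name: str, rule: str) -> bool:
    """Return True if `name` satisfies the named convention."""
    return bool(RULES[rule].match(name))
```

Rule 2 (Mac top-level folders) is intentionally excluded: it allows CamelCase for readability and is the documented exception, so it resists a single regex.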

17.5 Repository Mapping

| GitHub Repo | Mac Folder | Hetzner Folder | Purpose |
|---|---|---|---|
| struxio-design | STRUXIO_Design | struxio-design | Architecture, blueprints, design docs |
| struxio-logic | STRUXIO_Logic | struxio-logic | Agent activations, rules, skills |
| struxio-os | STRUXIO_OS | STRUXIO_OS | State, tickets, engineering, infra |
| struxio-app | STRUXIO_App | struxio-app | Product code (see MVP1_PRODUCT_SPEC.md) |
| struxio-business | STRUXIO_Business | struxio-business | Business docs |
| struxio-knowledge | STRUXIO_Knowledge | struxio-knowledge | Knowledge vault, Obsidian sync |
| struxio-devxio-ai | STRUXIO_DEVXIO_UI | devxio-control-center | Control Center UI (Next.js) |
| struxio-aibus | STRUXIO_AIBUS | struxio-aibus | Bus MCP Server source |
| struxio-dashboard | STRUXIO_Dashboard | struxio-dashboard | Dashboard UI |
| struxio-tickets | STRUXIO_Tickets | struxio-tickets | Ticket tracking |

17.6 Operational Tools

| Tool | Command | Schedule | Purpose |
|---|---|---|---|
| Folder Naming Cleanup | /opt/struxio/scripts/folder_naming_cleanup.sh | Daily 04:00 UTC | Enforces Rule 3 (lowercase structure folders) |
| Workspace Graph | /opt/struxio/scripts/workspace_graph.sh | Daily 04:01 UTC | Generates STATE_workspace_graph.yaml — full folder/file map for agent navigation |

18. Final Statement

Infrastructure is the execution ground of XIOPro.

If this layer is weak:

  • runtime becomes fragile
  • recovery becomes guesswork
  • security becomes accidental
  • costs become opaque

If this layer is strong:

  • the system can run headless with confidence
  • failures can be absorbed and repaired
  • the founder can scale with less fear
  • future growth does not require rethinking everything

Changelog

| Version | Date | Author | Changes |
|---|---|---|---|
| 4.1.0 | 2026-03-27 | BM | Initial infrastructure blueprint |
| 4.2.0 | 2026-03-28 | BM | C8.1: Added actual Hetzner CPX62 specs (16 vCPU AMD EPYC-Genoa, 30GB RAM, 150GB SSD) to Section 5.1 and 9.5.1. C8.2: Added SOPS+age secrets encryption to Section 9.9.6. C8.3: Added Restic backup to Backblaze B2 section (10.2A). C8.4: Added service fate map reference (Section 14). C8.5: Added container memory budget (Section 9.5A). C8.6: Added CLI toolchain section (9.6A) referencing CLI_TOOLS_ASSESSMENT.md. CX.1: Global "Rufio" to "Ruflo" rename. CX.2: Updated version header to 4.2.0. CX.3: Added changelog. CX.4: Added current state section (Section 15). Renumbered success criteria to Section 16, final statement to Section 17. |
| 4.2.2 | 2026-03-28 | 000 | Agent naming migration: O00/O01 replaced with 000 (orchestrator role) / 000 (governor role). M01 replaced with module steward role. BM replaced with 000. Container group names updated from o00/o01 to orchestrator/governor. Backblaze B2 references preserved unchanged. Changelog author entries preserved as historical. |
| 4.2.3 | 2026-03-28 | 000 | Roles over numbers: Removed agent IDs from all architectural descriptions, section headers, diagrams, and service lists. Role names used throughout instead of agent numbers. |
| 4.2.7 | 2026-03-28 | BM | Neo4j deprecated: Both instances (devxio-librarian, graph_stack_neo4j) marked as retired/removed across Sections 9.5A, 14, 15. PostgreSQL + pgvector replaces all Neo4j use cases for T1P. |
| 4.2.11 | 2026-03-29 | BM | Added Section 9.11.12 (Orchestrator Launch Commands) — devxio go and devxio mo launch commands for GO and MO surfaces with cross-reference to Part 4, Section 4.1A. |
| 4.2.12 | 2026-03-29 | BM | Added Section 17 (Naming Conventions) — four-rule naming standard for repos, folders, and files with repository mapping table. Renumbered Final Statement to Section 18. |
| 4.2.13 | 2026-03-29 | BM | Updated Section 17 naming conventions: added Section 17.0 (General Principles — case-insensitive uniqueness, proper names, Mac vs Hetzner, tool names, descriptive folders). Updated 17.2 to distinguish Mac/Hetzner. Updated 17.5 mapping table with Hetzner column. Added 17.6 (Operational Tools — folder cleanup + workspace graph). |
| 4.2.14 | 2026-03-29 | BM | Cross-references: Added pointer to resources/DESIGN_cli_services.md in Section 9.6A (CLI services framework including Porkbun DNS and Hetzner hcloud). Added hcloud to Must-Have CLI tools table. |
| 5.0.1 | 2026-03-30 | GO | N22: Added Section 8.8.1 (Connection Pooling) — PgBouncer or built-in pool_size recommended at 15+ agents, current Fastify pool max: 20, struxio_db_pool_* gauge monitoring via GET /metrics, pool exhaustion = warning alert. |
| 5.0.2 | 2026-03-30 | GO | N8: Added Section 13.5A (Scaling Triggers) — four specific thresholds: PostgreSQL write latency > 50ms at 10+ agents → read replicas; host memory > 75% sustained → new host; Bus latency > 200ms → caching layer; spawn queue depth > 5 → distribute to additional hosts. N20: Added Section 8.12A (Bus API Rate Limits) — default 100 req/min per actor, burst 200 req/min throttled, 1 SSE connection per actor per channel, 50 events/min per actor. |
| 5.0.3 | 2026-03-30 | GO | C4: Added Section 10.3A (PostgreSQL WAL Archiving) — continuous WAL shipping to B2, RPO reduced from 24h to 5 minutes, archive_mode/archive_command config, point-in-time restore procedure (7 steps), monitoring rules. Updated Section 10.6 RPO target to reflect WAL archiving. C5: Expanded Section 10.8 (Restore Drill Requirements) — monthly restore drill procedure with 6-step checklist (download, restore, verify tables, verify freshness, cleanup, record), success criteria, failure response, results recorded in state/restore_drills.yaml. |
| 5.0.4 | 2026-03-30 | GO | I13: Revised agent count estimate in Section 9.5A — realistic max 8-10 concurrent agents on CPX62 (30 GB RAM). Each Claude Code process ~300-500 MB, services baseline ~10 GB, 3-5 GB safety buffer. Previous higher estimates reflected smaller assumed agent footprints. |
| 5.0.5 | 2026-03-30 | GO | N8 addendum: Added two scaling triggers to Section 13.5A — concurrent agent count > 8 per host → evaluate second host; disk usage > 80% → archive old partitions. |