XIOPro Production Blueprint v5.0¶
Part 8 — Infrastructure & Deployment Architecture¶
1. Purpose¶
Defines the concrete infrastructure baseline required to run XIOPro as a headless-first, recoverable, secure, and provider-independent execution system.
This part specifies:
- runtime environments
- node roles
- service boundaries
- deployment topology
- network shape
- storage surfaces
- installation inventory
- scaling direction
- operational constraints
This document is not a cloud wishlist. It is the execution platform contract for XIOPro.
2. Infrastructure Thesis¶
Infrastructure must support all of the following simultaneously:
- continuous headless operation
- recoverable multi-agent execution
- explicit control-plane separation
- durable state persistence
- low-friction founder intervention
- provider-swappable model access
- future expansion without redesign
Infrastructure exists to make the architectural rules real.
3. Infrastructure Principles¶
3.1 Headless First¶
All critical execution must continue without UI.
The UI may observe and control, but must never become the only runtime path.
3.2 Durable State First¶
No important execution state may live only inside:
- a terminal tab
- a provider chat window
- a single container memory space
- an agent-local temp file
Durable state must land in authoritative storage surfaces.
3.3 Replaceability¶
The infrastructure must allow replacement of:
- model providers
- agent runtimes
- API gateway/router
- UI
- storage backends
- observability stack
without invalidating the XIOPro operating model.
3.4 Logical Separation Before Physical Separation¶
Even when colocated on one server initially, the following concerns must remain logically separated:
- control plane
- execution fabric
- governance
- data/state
- knowledge services
- ingress/API
- observability
- backup/recovery
3.5 Recovery Is Native¶
Infrastructure must assume:
- session crash
- provider disconnect
- container restart
- host reboot
- partial service outage
- network interruption
- founder disconnect
Recovery is not a future enhancement. It is a base requirement.
3.6 Security by Reduction¶
Prefer:
- private network paths
- minimum exposed ports
- minimum standing privileges
- minimum long-lived secrets
- explicit auditability
4. Canonical Environment Model¶
4.1 PRD -- Production Runtime¶
Primary live environment for XIOPro system operation.
Contains:
- orchestrator runtime
- governor runtime
- API/control services
- PostgreSQL
- scheduler/worker services
- LiteLLM router
- Ruflo swarm runtime
- knowledge service backends
- observability services
- backup jobs
4.2 TST -- Integration Validation¶
Used to validate:
- schema changes
- orchestration behavior
- recovery behavior
- deployment updates
- service compatibility
TST must be structurally similar to PRD, but can run with reduced scale.
4.3 DEV -- Builder / Experiment Zone¶
Used for:
- agent experiments
- rule iteration
- local service development
- migration rehearsal
- safe breakage
4.4 LOC -- Local Operator Node¶
Primary founder workstation environment.
Contains or may contain:
- RC-capable local execution surfaces
- local knowledge access
- local file operations
- CLI diagnostics
- fallback execution
- future local models
- operator utilities
LOC is not the production control plane, but it is an important resilience and intervention node.
5. Runtime Node Topology¶
5.1 Node A -- Cloud Control Node (Hetzner CPX62)¶
Primary always-on control and execution node.
Actual Hardware Specs (as of 2026-03-28)¶
| Spec | Value |
|---|---|
| Provider | Hetzner Cloud |
| Instance type | CPX62 (shared vCPU, AMD) |
| CPU | 16 vCPU AMD EPYC-Genoa |
| RAM | 30 GB |
| Storage | 150 GB SSD (NVMe) |
| OS | Ubuntu 24.04 LTS |
| Location | Hetzner EU |
Responsibilities:
- orchestrator control
- governance control
- API ingress
- work graph persistence
- scheduling
- runtime coordination
- background execution
- telemetry collection
5.2 Node B -- Local Operator Node (Mac Studio)¶
Connected via Tailscale VPN (encrypted mesh).
Responsibilities:
- founder interaction
- local CLI execution
- fallback RC-capable sessions
- local knowledge access
- manual validation
- future local inference experiments
5.3 Node C -- Future GPU / Model Node¶
Reserved for:
- self-hosted model serving
- heavier local inference
- embedding jobs
- batch processing
- specialized isolated workloads
5.4 Node D -- Future Product Runtime Node¶
Reserved for:
- STRUXIO product APIs
- customer-facing runtime isolation
- product workloads separated from XIOPro control plane
6. High-Level Infrastructure Overview¶
```mermaid
flowchart TD
    User[User / Local Operator Node] --> Ingress[Ingress / API Gateway]
    Ingress --> Control[Control Services]
    Control --> Orchestrator["Orchestrator"]
    Control --> Governor["Governor"]
    Orchestrator --> Ruflo[Ruflo Execution Fabric]
    Ruflo --> Surfaces[Execution Surfaces]
    Surfaces --> Providers[Model Providers / Local Models]
    Orchestrator --> DB[(PostgreSQL)]
    Governor --> DB
    Control --> DB
    Control --> Knowledge[Knowledge / Librarian Services]
    Control --> Telemetry[Logs / Metrics / Alerts]
    DB --> Backup[Backup & Recovery]
    Knowledge --> Backup
```
7. Service Architecture¶
7.1 Control Plane Services¶
Core services that maintain system state and coordination:
- API service
- orchestrator service
- governor service
- scheduler service
- worker/queue consumers
- RC/escalation broker
7.2 Execution Fabric Services¶
Services responsible for agent execution and provider interaction:
- Ruflo agent swarm engine
- LiteLLM router
- execution adapters
- CLI/runtime bridges
- provider connectors
7.3 Data and Knowledge Services¶
Authoritative storage and retrieval services:
- PostgreSQL
- knowledge/librarian service
- index refresh jobs
- document/asset storage references
7.4 Operational Services¶
Cross-cutting operations services:
- reverse proxy / ingress
- secrets delivery
- backup jobs
- log pipeline
- metrics exporter
- alert delivery
8. Canonical Service Inventory¶
8.1 Ingress / Reverse Proxy¶
Role:
- terminate TLS
- route inbound traffic
- expose minimal public surfaces
- forward requests to internal services
Examples:
- Caddy
- Traefik
- Nginx
8.2 API Service¶
Role:
- main entry point for UI and CLI
- authentication and authorization
- session/control endpoints
- work graph access
- human escalation endpoints
8.3 Orchestrator Service¶
Role:
- reads tickets/tasks/state
- assigns work
- selects execution path
- manages continuity
- coordinates domain/worker agents
8.4 Governor Service¶
Role:
- monitors cost, health, anomalies, and risk
- enforces policy actions
- raises alerts and intervention requests
- proposes optimization actions
8.5 Ruflo Runtime Service¶
Role:
- agent spawning
- sub-agent lifecycle management
- bounded multi-agent execution
- runtime coordination hooks
8.6 LiteLLM Router Service¶
Role:
- provider abstraction
- model routing
- fallback routing
- usage metering integration
- future local-model routing
8.7 Scheduler / Background Worker Service¶
Role:
- recurring jobs
- dream windows
- maintenance jobs
- index refresh
- backup execution
- telemetry rollups
8.8 PostgreSQL Service¶
Authoritative store for:
- ODM entities
- runtime state
- session state
- escalation state
- governance events
- cost records
- audit events
- control metadata
8.8.1 Connection Pooling¶
Connection pooling via PgBouncer or the built-in `pool_size` is recommended when agent count exceeds 15. Current Fastify pool: `{ max: 20 }`. Monitor with `GET /metrics` using the `struxio_db_pool_*` gauge family.
Rules:
- Below 15 agents: the Fastify built-in pool (`max: 20`) is sufficient
- At 15+ agents: evaluate PgBouncer in transaction-pooling mode as a sidecar to the PostgreSQL container
- Pool exhaustion events must be captured as governance alerts (warning level)
- `struxio_db_pool_active`, `struxio_db_pool_idle`, and `struxio_db_pool_waiting` gauges must be emitted to the observability stack
- PgBouncer configuration (if adopted) must be SOPS-encrypted and managed via the same secrets path as PostgreSQL credentials
8.9 Knowledge / Librarian Service¶
Role:
- ingest knowledge sources
- classify/index content
- maintain retrieval structures
- support render/export/query workflows
8.10 Object Storage / Backup Surface¶
Primary uses:
- database dumps
- snapshots
- compressed transcripts
- recovery packages
- exported artifacts
8.11 Observability Stack¶
Core outputs:
- logs
- metrics
- health state
- error events
- alert signals
- future traces
8.12 Module Portfolio Infrastructure Linkage¶
Purpose¶
Infrastructure must provide the real-world constraints and capabilities that make module portfolio governance credible.
The module steward can recommend and optimize modules only within an actual hosting envelope.
That means infrastructure must expose enough information for the portfolio layer to reason about:
- subscription-backed module access
- API-backed module access
- self-hosted module feasibility
- local vs cloud placement
- resource ceilings
- operational complexity
- fallback paths
8.12.1 Infrastructure Inputs Required by the Module Steward¶
Part 8 should provide the module steward with at least:
- available execution nodes
- node class and role
- approximate compute profile
- memory profile
- storage considerations
- network posture
- public vs private connectivity assumptions
- allowed runtime surfaces
- operational risk notes
- recovery and observability readiness
This is necessary so "recommended module" can mean: recommended and actually runnable.
8.12.2 Hosting Feasibility Principle¶
A module should not be marked portfolio-approved for self-hosted or local use unless there is a credible hosting profile for it.
A credible hosting profile must include at least:
- target environment
- resource assumptions
- deployment complexity notes
- security notes
- recovery notes
- observability notes
- fallback path if the hosting path fails
8.12.3 Local / Cloud / Hybrid Evaluation¶
The module steward should be able to evaluate candidate module options against at least these hosting classes:
- local Mac execution
- Hetzner primary control node
- future dedicated GPU/model node
- future isolated product runtime node
- hybrid cloud/provider access
Each class carries different tradeoffs in:
- quality
- stability
- trust
- latency
- bandwidth
- compute pressure
- operational complexity
8.12.4 Subscription and Surface Awareness¶
Infrastructure and module governance must stay aligned on where module access actually exists.
This includes awareness of:
- provider API access paths
- provider subscription-backed surfaces
- local CLI/runtime adapters
- routing-layer reachability
- fallback availability during provider failure
This prevents recommending modules that cannot actually be reached from the required runtime surface.
8.12.5 Optimization Telemetry Requirement¶
Infrastructure should preserve enough telemetry for portfolio optimization over time.
Useful telemetry includes:
- latency by module and task class
- error/failure rate by module and access path
- cost / usage by module
- retry rate by module
- fallback frequency
- node pressure when self-hosted or local
- bandwidth pressure where relevant
This allows the module steward to optimize with evidence, not intuition alone.
8.12.6 Adoption Rule¶
Infrastructure may support evaluation and comparison of new modules, subscriptions, and self-hosted options.
But infrastructure must not auto-adopt them.
Adoption still requires governed approval and a deliberate rollout decision.
8.12A Bus API Rate Limits¶
The Control Bus enforces rate limits to protect stability and ensure fair access across all actors. These limits are active in the current Bus implementation.
Default Limits¶
| Limit | Value | Notes |
|---|---|---|
| Default request rate | 100 req/min per actor | Warning logged at threshold; already implemented |
| Burst allowance | 200 req/min per actor | Allowed for short bursts; throttled (429) after sustained burst |
| SSE connections | 1 connection per actor per channel | Reconnect replaces the prior connection; no parallel SSE streams |
| Event emission rate | 50 events/min per actor | Applies to POST /events; excess events are queued or dropped with warning |
Rules¶
- Rate limits are applied per `actor_id`, not per IP or session.
- Burst capacity (200 req/min) is available for up to 30 seconds before throttling kicks in.
- Throttled requests receive HTTP 429 with a `Retry-After` header.
- Rate limit violations are logged as Bus warning events and are visible in the Dashboard alert feed.
- SSE reconnect on rate-limited channels retries after the backoff window (see Section 10.4 retry policy).
- These limits protect Bus and PostgreSQL from agent runaway — they are not negotiable per-actor.
Tuning Principle¶
Rate limits may be raised globally only if sustained Bus latency remains below 200ms after the increase. Individual actors may not self-negotiate higher limits — only the Governor may authorize a limit adjustment via a Bus configuration change.
8.13 Repository, Filesystem & Storage Layout¶
Purpose¶
XIOPro needs an explicit filesystem and repository model.
Without it, the system may have strong logic but weak operational discipline.
This section defines where source-of-truth assets live, how they are separated, and which storage surfaces are authoritative for which classes of data.
Principle¶
Not all data belongs in the same place.
XIOPro should separate:
- versioned source assets
- runtime state
- large artifacts
- backups
- local operator files
- experimental or temporary material
This prevents confusion between:
- what is canonical
- what is generated
- what is recoverable
- what is disposable
8.13.1 Canonical Storage Classes¶
Git Repositories¶
Use Git repositories for:
- source code
- blueprints
- rules
- skills
- activations
- prompt templates
- runbooks
- deployment definitions
- scripts
- configuration templates
Git is the human-readable and auditable source of truth for versioned text-based assets.
PostgreSQL¶
Use PostgreSQL for:
- ODM entities
- tickets
- tasks
- activities
- runtimes
- sessions
- escalations
- human decisions
- policy objects
- governance events
- cost/usage rollups
- scheduler state
- indexing metadata
PostgreSQL is the authoritative operational state store.
Object / Blob Storage¶
Use object storage for:
- transcript snapshots
- checkpoints
- recovery bundles
- exported artifacts
- large generated files
- retained log bundles
- research exports where size or format justifies it
Object storage is for durable large artifacts, not for the primary source of truth of structured runtime state.
Local Operator Filesystem¶
The local founder/operator node may hold:
- local clones of approved repos
- local working notes
- sandbox experiments
- review/export materials
- temporary staging files
- local tool caches
Local operator storage is useful, but it is not authoritative unless content is committed or ingested properly.
8.13.2 Recommended Repository Topology¶
For T1P, repository topology should align with the actual active STRUXIO repository family rather than a generic placeholder structure.
Canonical active repos:
- `struxio-os`
- `struxio-logic`
- `struxio-design`
- `struxio-app`
- `struxio-business`
- `struxio-knowledge`
A transitional repo may still exist for a limited period:
- `struxio-aibus`
Reference repos may also exist for research or inspiration, but they are not part of the canonical operating core.
struxio-os¶
Primary control-plane and operations repo.
Holds:
- infra
- state
- tickets
- deployment
- runbooks
- control-layer operational files
- bootstrap/update scripts
- ops-facing automation
struxio-logic¶
Primary cognition / behavior repo.
Holds:
- agents
- rules
- skills
- prompts
- logic-layer governance assets
- activation and protocol assets where appropriate
struxio-design¶
Primary architecture / blueprint / research repo.
Holds:
- blueprint parts
- architecture records
- system maps
- evolution notes
- product design
- PRDs
- research artifacts and synthesis outputs where text-first is appropriate
struxio-app¶
Primary product/application implementation repo.
Holds:
- app/runtime code
- APIs
- product-facing implementation
- product integration surfaces
- E2E test surfaces
struxio-business¶
Primary business / legal / finance / strategy repo.
Holds:
- business assets
- legal materials
- finance materials
- strategy
- brand and fundraising assets
struxio-knowledge¶
Primary knowledge / research / reference repo.
Holds:
- research artifacts
- curated reference material
- knowledge ledger assets
- synthesis outputs
- topic-indexed knowledge files
struxio-aibus (Transitional / Legacy)¶
Not a permanent first-class pillar.
Plan:
- identify still-valuable code or documents
- migrate what remains useful into canonical repos
- archive the repo once no longer operationally required
Rule¶
Part 8 repository topology must stay aligned with the canonical active repo family used by the work plan and migration model.
8.13.3 Filesystem Class Rules¶
Within any repo or managed storage surface, files should conceptually fall into these classes:
- `source`
- `generated`
- `runtime`
- `archive`
- `temp`
Source¶
Human-maintained canonical inputs.
Examples:
- code
- rules
- skills
- blueprints
- configs
- runbooks
Generated¶
System-produced durable outputs.
Examples:
- exports
- compiled artifacts
- evaluation reports
- generated documentation
- synthesized summaries
Generated assets should not silently replace source assets.
Runtime¶
Operationally live mutable state.
Examples:
- DB data
- active checkpoints
- session snapshots
- job state
Runtime state belongs in state stores, not committed source repos.
Archive¶
Longer-lived retained material not needed for active work.
Examples:
- retired reports
- older exports
- superseded bundles
- long-term retained incident artifacts
Temp¶
Disposable staging content.
Examples:
- scratch files
- transient downloads
- in-progress experiment outputs
- tool caches
Temp must never be treated as authoritative.
8.13.4 Authoritative Repo / State Rules¶
The system must be explicit about which surface is authoritative.
Rules:
- text assets -> authoritative in Git
- runtime operational state -> authoritative in PostgreSQL
- large artifacts / checkpoints / exports -> authoritative in object storage where applicable
- local machine files -> non-authoritative until committed or ingested
No agent should assume a local filesystem copy is canonical merely because it exists.
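The authority rules of Section 8.13.4 can be encoded as a simple lookup that agents consult before trusting a surface. The asset-class labels below are illustrative; the surface mapping itself comes from the rules above.

```python
# Which surface is canonical for each asset class (per Section 8.13.4).
# None means "non-authoritative until committed or ingested".
AUTHORITATIVE_SURFACE = {
    "text_asset": "git",                 # code, blueprints, rules, configs
    "runtime_state": "postgresql",       # sessions, tickets, governance events
    "large_artifact": "object_storage",  # checkpoints, exports, bundles
    "local_file": None,                  # local copies are never canonical as-is
}

def is_authoritative(asset_class: str, surface: str) -> bool:
    """True only if the rules name this surface as canonical for the class."""
    return AUTHORITATIVE_SURFACE.get(asset_class) == surface
```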
8.13.5 Research & Knowledge Storage Note¶
Research-related material may live across:
- Git-managed knowledge assets
- PostgreSQL metadata/indexing
- object storage exports
- local review workspaces
- Obsidian/NotebookLM connected surfaces
But the system must still preserve clear distinction between:
- raw source material
- curated knowledge
- generated derivative outputs
- scheduled research artifacts
8.14 Cost Telemetry & Attribution Pipeline¶
Purpose¶
Infrastructure must collect cost and usage signals from the moment an agent/runtime uses a module, and preserve them in a form that is:
- attributable
- queryable
- enforceable
- optimizable
This supports Part 3 cost propagation and Part 4/Part 7 runtime governance.
Principle¶
Cost must be captured both:
- during execution
- after execution
This requires a pipeline, not only a dashboard.
8.14.1 Collection Stages¶
Stage 1 -- Raw Usage Emission¶
Execution surfaces, routers, and adapters should emit raw usage events when work happens.
Typical sources:
- LiteLLM/router usage records
- provider API responses
- local runtime counters
- subscription-surface usage approximations where exact billing is delayed
- worker/task metadata
Stage 2 -- Activity Attribution¶
Raw usage must be attributed to the correct operational scope.
Minimum attribution targets:
- activity
- session
- agent runtime
- task
- ticket
- execution surface
- module/provider
- environment
Stage 3 -- Normalization¶
Usage must be normalized into comparable records.
Useful normalized fields include:
```yaml
cost_event:
  event_id: string
  timestamp: datetime
  activity_id: string|null
  session_id: string|null
  agent_runtime_id: string|null
  task_id: string|null
  ticket_id: string|null
  module_id: string|null
  provider: string|null
  access_path: string|null   # api | subscription | self_hosted | hybrid
  usage_units_in: float|null
  usage_units_out: float|null
  estimated_cost: float|null
  billed_cost: float|null
  currency: string|null
  latency_ms: int|null
  retries: int|null
  node_id: string|null
  notes: string|null
```
Stage 4 -- Rollup¶
Rollups should aggregate by at least:
- activity
- task
- ticket
- project
- module/provider
- access path
- runtime surface
- day / week / month
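A minimal Stage 4 sketch: rolling normalized `cost_event` records up by module and day. Field names follow the Stage 3 schema; the rollup record shape itself is an assumption.

```python
from collections import defaultdict
from datetime import datetime

def rollup_by_module_day(events: list[dict]) -> dict[tuple[str, str], dict]:
    """Aggregate cost events into (module_id, ISO day) buckets."""
    rollups: dict[tuple[str, str], dict] = defaultdict(
        lambda: {"estimated_cost": 0.0, "events": 0, "retries": 0}
    )
    for ev in events:
        day = datetime.fromisoformat(ev["timestamp"]).date().isoformat()
        key = (ev.get("module_id") or "unknown", day)
        r = rollups[key]
        r["estimated_cost"] += ev.get("estimated_cost") or 0.0
        r["events"] += 1
        r["retries"] += ev.get("retries") or 0
    return dict(rollups)
```

The same shape generalizes to the other required axes (task, ticket, access path, runtime surface) by swapping the key function.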
Stage 5 -- Governance Consumption¶
Rollups and anomaly signals should feed:
- the governor
- breaker policies
- budget policies
- module steward optimization analysis
- reporting/UI layers later
8.14.2 Collection Requirements by Access Type¶
API-Based Module Use¶
Preferred collection source:
- router/provider response metadata
- request/response usage counters
- billing approximation tables
- later reconciliation with actual billed usage where available
Subscription-Based Module Use¶
Exact billing detail may be weaker or delayed.
Minimum requirement:
- record which runtime used which subscription-backed surface
- approximate scope and intensity of use
- preserve task/runtime attribution
- support strategic optimization even when exact per-call pricing is unavailable
Self-Hosted Module Use¶
Collect at least:
- runtime used
- node used
- time consumed
- compute/memory pressure
- queue/wait cost proxy
- power/capacity proxy where useful later
Self-hosted cost is not zero just because no API bill exists.
8.14.3 Storage Rule¶
Cost telemetry should be stored in PostgreSQL as normalized operational records and rollups.
Large raw logs may additionally land in log/object storage, but authoritative attribution must remain queryable from the operational store.
8.14.4 Validation Rule¶
A task is not considered fully cost-observable unless XIOPro can answer at least:
- which module(s) were used
- by which runtime/surface
- for which task/ticket
- with what estimated or billed cost signal
- with what latency/retry profile
If this cannot be answered, cost governance is incomplete.
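The validation rule above can be expressed as a check over a task's cost events. Field names follow the Stage 3 `cost_event` schema; which fields count as "required" here is a reasonable reading of the four questions, not a fixed spec.

```python
# A task is cost-observable only if every event answers: which module,
# via which access path/surface, for which task, with what latency profile,
# and carries at least one cost signal (estimated or billed).
REQUIRED = ("module_id", "access_path", "task_id", "latency_ms")

def is_cost_observable(events: list[dict]) -> bool:
    if not events:
        return False  # no events means no attribution at all
    for ev in events:
        if any(ev.get(field) is None for field in REQUIRED):
            return False
        if ev.get("estimated_cost") is None and ev.get("billed_cost") is None:
            return False
    return True
```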
8.14.5 Final Rule¶
Cost is not "a later finance report".
It is a live infrastructure signal that must be captured at execution time and preserved for both governance and optimization.
9. Deployment Model¶
9.1 Initial T1P Deployment¶
Initial production baseline:
- single Hetzner CPX62 primary node
- Docker Compose or equivalent simple orchestrator
- all core XIOPro services colocated
- strict logical separation between services
- reverse proxy in front
- PostgreSQL persistent volume
- scheduled backup jobs
- private admin access only
This is acceptable because the current need is:
- founder-scale operation
- rapid iteration
- recoverability
- low complexity
It is not acceptable to let "single-node MVP" become "undefined production."
9.2 Initial Container Groups¶
Recommended initial groups:
- `ingress`
- `api`
- `orchestrator`
- `governor`
- `ruflo`
- `litellm`
- `scheduler`
- `workers`
- `postgres`
- `knowledge`
- `telemetry`
- `backup`
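The container groups could be expressed as a Compose skeleton like the following. Image names, ports, and network names are illustrative assumptions, not mandated choices, and only three representative groups are shown; the point is the logical separation (only `ingress` publicly exposed, PostgreSQL on a private network with a persistent volume).

```yaml
services:
  ingress:
    image: caddy:2
    ports: ["80:80", "443:443"]   # the only publicly exposed surface
    networks: [edge, internal]
  api:
    build: ./api
    networks: [edge, internal]    # reachable only through ingress
  postgres:
    image: postgres:16
    volumes: ["pgdata:/var/lib/postgresql/data"]  # persistent volume
    networks: [internal]          # never exposed publicly
  # ...remaining groups (orchestrator, governor, ruflo, litellm, scheduler,
  # workers, knowledge, telemetry, backup) follow the same pattern.
networks:
  edge: {}
  internal:
    internal: true                # no public routing for internal-only services
volumes:
  pgdata: {}
```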
9.3 Scale-Out Direction¶
When required, scale along these lines:
- split ingress/API from control services
- split PostgreSQL onto stronger isolated storage node
- split worker/runtime services from control node
- add dedicated GPU/model node
- isolate product runtime from XIOPro runtime
9.4 Non-Goals for Initial Phase¶
Do not introduce yet unless proven necessary:
- Kubernetes
- distributed queue complexity beyond real need
- service mesh
- heavy graph infrastructure
- multi-region architecture
- premature HA theater
These may become valid later, but are not required for T1P execution readiness.
9.5 Initial Hardware Baseline¶
9.5.1 Node A -- Hetzner CPX62 (Actual Specs)¶
The current production server is a Hetzner CPX62:
| Spec | Value |
|---|---|
| CPU | 16 vCPU AMD EPYC-Genoa (shared) |
| RAM | 30 GB |
| Storage | 150 GB SSD (NVMe) |
| OS | Ubuntu 24.04 LTS |
| Docker | Docker Engine 29.2.1, Docker Compose |
| Network | Public IPv4, Tailscale VPN overlay |
| Python | 3.12.3 |
| Node.js | 20.20.1 |
Practical Sizing Principle¶
The initial node must be sized for control-plane reliability first, not for speculative future self-hosted model serving.
That means it must comfortably support:
- orchestrator service
- governor service
- PostgreSQL
- API / ingress
- Ruflo
- LiteLLM
- scheduler / workers
- observability
- backup jobs
without sustained resource contention.
Initial Recommendation Logic¶
Choose a Hetzner class that prioritizes:
- CPU consistency
- RAM headroom
- fast NVMe/SSD
- stable Linux support
- easy vertical upgrade path
Do not size Node A around local-model aspirations. If self-hosted inference becomes real, it belongs on Node C.
9.5.2 Node B -- Local Operator Node (Mac Studio)¶
Current role:
- founder interaction
- RC-capable local sessions
- local CLI operations
- local validation
- local knowledge work
- fallback execution
Connected via Tailscale VPN (encrypted mesh, Hetzner <-> Mac).
Recommended baseline:
- stable workstation environment
- local CLI toolchain
- secure admin access to Node A
- local backup for critical operator-side configs
- optional local container tooling for test/fallback
9.5.3 Node C -- Future GPU / Self-Hosted Model Node¶
This node is optional and deferred.
It becomes justified only when one or more conditions are true:
- self-hosted models materially improve privacy
- unit economics justify dedicated inference
- batch embedding/index workloads become heavy
- provider dependence becomes strategically limiting
- offline or degraded-network resilience becomes important
Until then, Node C remains a reserved architectural slot, not an implementation obligation.
9.5A Container Memory Budget (CPX62 -- 30 GB)¶
With the CPX62 at 30 GB RAM, the memory budget after retirement of stale services is:
| Category | Estimated RAM | Notes |
|---|---|---|
| Docker containers (current, post-retirement) | ~2.25 GB | 10 containers after retiring devxio-frontend, devxio-bridge, devxio-librarian, graph_stack_neo4j (Neo4j deprecated -- both instances removed) |
| Agent processes (orchestrator + 2 brains typical) | ~2-3 GB | Claude Code sessions via Max20 |
| System / OS | ~2 GB | Ubuntu 24.04, systemd, journald, etc. |
| Available headroom | ~22-24 GB | |
| New XIOPro backend + UI (budget) | 4-6 GB | FastAPI backend, Next.js UI, workers |
| Remaining free | ~16-20 GB | Comfortable margin for spikes |
This gives substantial headroom for the new XIOPro services. The CPX62 is not a constraint for T1P.
Realistic Concurrent Agent Estimate¶
Each Claude Code agent process consumes approximately 300-500 MB of RAM. With the CPX62's 30 GB:
| Component | Estimated RAM |
|---|---|
| Services baseline (13 containers) | ~10 GB |
| System / OS | ~2 GB |
| Available for agents | ~18-20 GB |
| Agent process (each) | ~300-500 MB |
| Realistic concurrent agents | 8-10 (at ~500 MB each, with ~3-5 GB buffer for spikes) |
The realistic maximum is 8-10 concurrent agents on the current CPX62. This accounts for:
- Worst-case agent memory (~500 MB each)
- A 3-5 GB safety buffer for memory spikes, background jobs, and transient allocations
- The 85% RAM utilization hard limit from Part 1, Section 4.10 (no agent spawning above 85%)
Previous estimates of higher agent counts assumed smaller agent footprints. This revised estimate reflects observed Claude Code process sizes in production.
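The spawn constraint behind these numbers can be stated directly: a new agent may be spawned only if its worst-case footprint still fits under the 85% RAM hard limit from Part 1, Section 4.10. The constants mirror the CPX62 estimates above; the function shape is an assumption.

```python
TOTAL_RAM_GB = 30.0        # CPX62
HARD_LIMIT = 0.85          # no agent spawning above 85% utilization
AGENT_WORST_CASE_GB = 0.5  # observed Claude Code process ceiling (~500 MB)

def can_spawn_agent(used_gb: float) -> bool:
    """Allow a spawn only if worst-case post-spawn usage stays under the cap."""
    projected = used_gb + AGENT_WORST_CASE_GB
    return projected / TOTAL_RAM_GB <= HARD_LIMIT
```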
Budget Rule¶
If total container memory exceeds 15 GB sustained, investigate:
- which containers can be retired or consolidated
- whether any service is leaking memory
- whether workload should move to a separate node
See resources/SERVICE_FATE_MAP_v4_2.md for the full current-to-target service transition plan.
9.6 Installation Bill of Materials (T1P)¶
9.6.1 Host-Level Baseline¶
Node A should install and configure:
- Ubuntu LTS base OS
- Docker Engine
- Docker Compose or equivalent simple orchestrator
- UFW or nftables firewall
- Tailscale or equivalent secure overlay
- SSH server with key-only auth
- fail2ban if SSH remains publicly reachable
- log rotation baseline
- backup scripting/runtime support
- system time sync
- unattended or managed security update strategy
9.6.2 Core XIOPro Service Set¶
Initial service set:
- ingress / reverse proxy
- API service
- orchestrator service
- governor service
- Ruflo runtime service
- LiteLLM router service
- scheduler service
- worker service(s)
- PostgreSQL service
- knowledge / librarian service
- telemetry / monitoring service(s)
- backup service / scheduled jobs
9.6.3 Supporting Operational Components¶
Recommended supporting components:
- TLS certificate automation
- environment/secrets injection mechanism
- deployment scripts / make targets / runbooks
- uv-based Python version/dependency/tool management for Python services and scripts
- backup restore scripts
- database migration runner
- health-check endpoints
- metrics exporter(s)
- alert delivery integration
9.6.4 Deferred / Optional Components¶
Do not install for T1P unless clearly justified:
- Kubernetes
- service mesh
- heavy queue infrastructure
- dedicated tracing stack if basic telemetry is enough
- vector/graph infrastructure without proven usage
- GPU inference stack on Node A
9.6A CLI Toolchain¶
XIOPro follows a CLI-first principle: prefer CLI tools over MCP wrappers where both exist. CLI pipelines are faster, more composable, and more debuggable.
See resources/CLI_TOOLS_ASSESSMENT.md for the full assessment with install instructions.
See resources/DESIGN_cli_services.md for the config-driven CLI services framework design (operational commands executable via Bus API or devxio CLI, including DNS management via Porkbun API and infrastructure management via Hetzner hcloud CLI).
Already Installed¶
| Tool | Version | Purpose |
|---|---|---|
| tmux | 3.4 | Terminal multiplexer |
| ripgrep (rg) | 14.1.1 | Fast code/text search |
Must-Have (install in Phase 0)¶
| Tool | Purpose | Install |
|---|---|---|
| gh | GitHub CLI -- PR, issue, Actions automation | Official apt repo |
| jq | JSON processor -- API response parsing, config manipulation | apt install jq |
| uv | Python package manager -- 10-100x faster than pip, replaces pip+venv+pyenv | curl installer |
| fzf | Fuzzy finder -- history search, file navigation, pipeline glue | apt install fzf |
| fd | Fast find -- file discovery, respects .gitignore | apt install fd-find |
| yq | YAML processor -- state file manipulation, Docker Compose queries | wget binary |
| direnv | Per-directory env vars -- project isolation, agent env scoping | apt install direnv |
| hcloud | Hetzner Cloud CLI -- server, network, firewall management | Official apt repo |
Nice-to-Have (install when convenient)¶
| Tool | Purpose |
|---|---|
| bat | Syntax-highlighted file viewing |
| delta | Better git diffs |
| lazygit | Visual git TUI |
| xh | Friendlier HTTP client |
| dust | Visual disk usage |
| btm (bottom) | Visual system monitor |
| llm (Simon Willison) | Ad-hoc LLM queries from terminal |
Skip¶
| Tool | Reason |
|---|---|
| aider | Overlaps with Claude Code |
| aichat | Overlaps with Claude Code |
| jj (jujutsu) | Evaluate later; needs Rust toolchain |
Install Script¶
A bootstrap script is provided in resources/CLI_TOOLS_ASSESSMENT.md, Section "Recommended Install Script".
Cost: zero (all tools are free and open-source). Disk: under 200 MB total.
9.7 Network Exposure Matrix¶
9.7.1 Principle¶
Every port and entry point must have an owner and justification.
No service should be reachable from the public internet unless:
- it is operationally required
- it is protected
- it is documented
9.7.2 Publicly Exposed Surfaces¶
Allowed public exposure should normally be limited to:
- HTTPS ingress endpoint
- optional HTTP -> HTTPS redirect endpoint
Public exposure should not directly include:
- PostgreSQL
- internal runtime adapters
- scheduler
- worker services
- observability admin surfaces
- raw agent runtimes
9.7.3 Private / Overlay-Only Surfaces¶
Prefer private-only access for:
- SSH administration
- database administration
- internal dashboards
- recovery tooling
- deployment control
- backup administration
- founder/operator maintenance access
This is where Tailscale or equivalent is strongly preferred.
9.7.4 Internal Service Communication¶
Internal services should communicate over:
- private Docker network(s)
- host-local interfaces where practical
- explicit service credentials
- service-to-service allow rules
The infrastructure should avoid a "flat trust" model.
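As one hedged sketch of this shape (service, image, and network names are illustrative, not the real XIOPro compose file), a Docker Compose fragment that keeps PostgreSQL and internal services off the public interface:

```yaml
# Illustrative compose fragment: only the ingress publishes ports;
# everything else talks over a private bridge network.
networks:
  xiopro_internal:
    internal: true              # no public routing from this network
  edge: {}

services:
  caddy:
    image: caddy:2
    ports: ["443:443", "80:80"] # the only publicly exposed service
    networks: [edge, xiopro_internal]
  api:
    image: xiopro/api:latest    # hypothetical image name
    networks: [xiopro_internal] # reachable only from other services
  postgres:
    image: postgres:16
    networks: [xiopro_internal] # no "ports:" entry -> never public
```

Per-service credentials and allow rules still apply on top of the network split; the private network is the floor, not the whole model.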
9.8 Domain / DNS / Surface Allocation¶
9.8.1 Principle¶
Surface naming should reflect service boundaries, not historical accidents.
Recommended pattern:
- main XIOPro control surface
- optional API subdomain
- optional RC/escalation subdomain
- optional knowledge subdomain
- optional product/runtime subdomains later
9.8.2 T1P Surface Recommendation¶
For T1P, it is acceptable to expose only one or two public surfaces:
- primary XIOPro control endpoint
- optional API endpoint if separation is useful
Everything else may remain internal/private until needed.
This keeps complexity, certificate handling, and attack surface lower.
9.8.3 DNS Records (Active as of 2026-03-29)¶
Domain registrar: Porkbun. DNS managed via Porkbun.
| Record | Type | Value | Purpose |
|---|---|---|---|
| bus.struxio.ai | A | 89.167.96.154 | Control Bus REST + MCP API |
| dashboard.struxio.ai | A | 89.167.96.154 | Control Center UI |
| paperclip.struxio.ai | A | 89.167.96.154 | Paperclip issue tracker |
| tickets.struxio.ai | A | 89.167.96.154 | Ticket management surface |
| chat.struxio.ai | A | 89.167.96.154 | Open WebUI chat interface |
| *.struxio.ai | CNAME | pixie.porkbun.com | Wildcard — covers all subdomains not listed above |
Note: The wildcard CNAME means devxio.struxio.ai (and any other unlisted subdomain) resolves automatically via *.struxio.ai. Caddy just needs a site block to serve it.
Explicit A records take precedence over the wildcard CNAME for the five listed subdomains.
All public-facing subdomains are reverse-proxied through Caddy with automatic TLS (Let's Encrypt).
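A minimal Caddyfile sketch of this pattern (hostnames from the table above; the upstream ports are illustrative placeholders, not the real service bindings):

```
# Caddy terminates TLS (Let's Encrypt) and reverse-proxies to internal services.
bus.struxio.ai {
    reverse_proxy localhost:8100       # illustrative upstream port
}
dashboard.struxio.ai {
    reverse_proxy localhost:8200       # illustrative upstream port
}
# Unlisted subdomains resolve via the wildcard CNAME, but Caddy serves
# them only when a matching site block exists.
```

Caddy obtains and renews certificates automatically for each site block, so adding a surface is one block plus a DNS entry (or the wildcard).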
9.9 Access Path Matrix¶
9.9.1 Founder Admin Path¶
Used for:
- infrastructure administration
- recovery
- deployment
- secrets handling
- emergency intervention
Preferred path:
- private overlay network
- key-based auth
- auditable commands
9.9.2 System Service Path¶
Used for:
- service-to-service calls
- scheduled jobs
- DB access by approved services
- runtime adapter communication
Requirements:
- scoped credentials
- least privilege
- revocable access
- auditable configuration
9.9.3 Agent Runtime Path¶
Used for:
- execution requests
- provider/model calls
- artifact production
- bounded interaction with control/data services
Restrictions:
- no broad infrastructure admin rights
- no unrestricted DB access
- no unrestricted secrets access
- only approved tools/endpoints
9.9.4 Service Placement Matrix¶
Principle¶
Every service must have a default execution home.
This avoids accidental sprawl, unclear ownership, and unnecessary cross-node complexity.
T1P Recommended Placement¶
Node A -- Cloud Control Node (Hetzner CPX62)¶
Node A should host the initial authoritative platform baseline:
- ingress / reverse proxy
- API service
- orchestrator service
- governor service
- Ruflo runtime service
- LiteLLM router service
- scheduler service
- core worker service(s)
- PostgreSQL service
- librarian / knowledge service
- telemetry / monitoring baseline
- backup job runner
- deployment / migration runner
Node B -- Local Operator Node (Mac Studio)¶
Node B is the founder-operated local execution and intervention node.
It may host:
- local CLI surfaces
- RC-capable local sessions
- local validation tooling
- emergency operator tools
- local knowledge access
- safe sandbox experiments
- optional local container tooling for test/fallback
Node B must not be treated as the authoritative production control plane.
Node C -- Future GPU / Model Node¶
Node C is optional and deferred.
If added later, it should host only specialized higher-weight workloads such as:
- self-hosted model runtimes
- embedding or indexing jobs
- heavier background processing
- isolated experimental inference services
- other compute-intensive workloads that should not burden Node A
Node C should not be required for initial correctness.
Node D -- Future Product Runtime Node¶
Node D is optional and deferred.
If introduced later, it should host:
- STRUXIO product APIs
- customer-facing runtime services
- product-specific workloads isolated from XIOPro control-plane services
Node D exists to preserve separation between XIOPro internal operations and future product runtime responsibilities.
Rule¶
If a service has no explicit placement decision, it defaults to Node A for T1P.
9.9.5 Interface / Port Exposure Classes¶
Principle¶
T1P does not require a full port catalog yet, but it does require deterministic exposure classes.
Every interface must belong to one of the following classes.
Class A -- Public Internet Facing¶
Allowed only when operationally justified.
Typical examples:
- HTTPS ingress endpoint
- optional HTTP redirect endpoint
Requirements:
- protected by reverse proxy
- TLS enabled
- documented owner
- monitored
- minimal surface only
Class B -- Private Overlay Only¶
Accessible only through Tailscale or equivalent secure overlay.
Typical examples:
- SSH administration
- internal dashboard access
- deployment controls
- recovery tooling
- admin-only APIs
Requirements:
- key-based or equivalent strong auth
- operator-only access
- auditable usage
Class C -- Internal Service Network Only¶
Never publicly exposed.
Typical examples:
- PostgreSQL
- scheduler
- worker coordination
- Librarian internal interfaces
- telemetry collectors
- service-to-service APIs
Requirements:
- private Docker/network namespace or host-local isolation
- explicit service identity
- least-privilege credentials
Class D -- Localhost / Node-Local Only¶
Only reachable on the owning node.
Typical examples:
- migration runners
- emergency maintenance helpers
- temporary admin endpoints
- local-only debug utilities
Requirements:
- disabled by default unless needed
- never exposed externally by accident
Final Rule¶
No interface may exist without:
- exposure class
- owning service
- access method
- justification
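One way to make the Final Rule auditable is a small interface registry kept under version control; a hedged sketch (the file name and field names are illustrative, not an existing artifact):

```yaml
# interfaces.yaml (illustrative) -- every interface declares its exposure
# class, owning service, access method, and justification.
interfaces:
  - name: https_ingress
    class: A                  # public internet facing
    owner: caddy
    access: "TLS via reverse proxy"
    justification: "primary control surface"
  - name: postgres
    class: C                  # internal service network only
    owner: postgresql
    access: "private Docker network, scoped credentials"
    justification: "authoritative state store"
  - name: ssh_admin
    class: B                  # private overlay only
    owner: host
    access: "Tailscale + key-based auth"
    justification: "founder admin path"
```

An interface missing from the registry is, by definition, in violation of the Final Rule.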
9.9.6 Secrets Ownership and Injection Rules¶
Principle¶
Secrets must be scoped by role, not shared broadly across the platform.
Secret Classes¶
Examples of secret classes include:
- provider API credentials
- router/provider integration secrets
- database credentials
- session signing/application secrets
- backup/storage credentials
- deployment credentials
- notification/integration secrets
SOPS + age Secrets Encryption¶
Secrets are encrypted at rest using SOPS + age.
| Component | Details |
|---|---|
| Encryption tool | SOPS (Secrets OPerationS) |
| Key backend | age (modern file encryption) |
| Key location | ~/age-key.txt on Node A |
| Encrypted files | .sops.yaml configs, encrypted env files |
SOPS + age provides:
- encryption at rest for all secret files in Git and on disk
- per-file or per-key encryption granularity
- Git-friendly encrypted diffs (only values are encrypted, keys are visible)
- no external key management service required (age key is file-based)
- simple rotation: re-encrypt with new age key
Ownership Rules¶
Founder / Operator Only¶
The founder or emergency operator path may control:
- root infrastructure credentials
- overlay administration
- DNS/domain credentials
- emergency recovery credentials
- secret issuance / rotation authority
- age key management
Platform Services¶
Approved control-plane services may receive only the secrets they require.
Examples:
- API service -> app/session secrets, scoped DB access
- orchestrator / governor -> scoped platform secrets only where operationally necessary
- LiteLLM/router -> provider credentials required for routing
- backup service -> backup target credentials
Agent Runtimes¶
Agent runtimes must not receive broad secret visibility.
They should only receive:
- task-scoped credentials
- provider access via approved broker/router path
- temporary credentials where justified
They must not receive:
- unrestricted production DB credentials
- infrastructure root credentials
- blanket secret bundles
Injection Rules¶
Approved methods for T1P:
- environment injection at container/service start
- mounted secret files with restricted permissions
- managed secret loading wrapper
- SOPS-decrypted values injected at deploy time
Not allowed:
- plaintext secrets in Git
- plaintext secrets in blueprint docs
- secrets embedded in tickets
- secrets stored in general application tables unless explicitly encrypted and justified
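A hedged command sketch of the SOPS-decrypted deploy-time path (file names are illustrative; `sops exec-env` decrypts into the child process environment only, so plaintext never lands on disk):

```shell
# Encrypt an env file for the repo (recipient is the Node A age public key)
sops --encrypt --age age1examplepublickey secrets.env > secrets.enc.env

# Inject decrypted values only into the deploy command's environment
sops exec-env secrets.enc.env 'docker compose up -d'
```

This keeps the encrypted file committable while the decrypted values exist only for the lifetime of the deploy command.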
Rotation Rule¶
Any secret class that can affect:
- provider spend
- production data
- recovery access
- external exposure
must be rotatable without redesigning the platform.
9.9.7 Environment Separation Rules¶
Principle¶
T1P must distinguish clearly between:
- local/dev
- production cloud
- recovery/emergency operation
Local / Dev Environment¶
Local/dev may be less durable, but must not silently share production authority.
Rules:
- no default reuse of production secrets
- no default connection to production database
- no hidden dependency on founder machine availability
- safe to destroy and recreate
Production Cloud Environment¶
Production cloud is the authoritative execution environment.
Rules:
- persistent state lives here
- scheduled automation lives here
- recovery baseline is validated here
- headless execution must function without local GUI dependency
Recovery / Emergency Path¶
Recovery path must exist even if the main control surface is unavailable.
Minimum expectation:
- private overlay access works
- key administrative commands are documented
- restore path is tested
- one founder/operator path remains usable during failure scenarios
Final Rule¶
No environment may depend on undocumented manual steps for core recovery, restart, or access.
9.10 T1P Deployment Acceptance Checklist¶
A T1P infrastructure deployment is not accepted unless all are true:
- Node A can reboot and recover services predictably
- PostgreSQL persistence is verified
- backup job runs successfully
- restore procedure is documented
- HTTPS ingress works
- non-essential public ports are closed
- Tailscale/private admin path works
- orchestrator, governor, API, PostgreSQL, Ruflo, LiteLLM, scheduler, and backup jobs are observable
- one task can run end-to-end headlessly
- one interruption/restart scenario has been tested
9.10.1 Final Rule¶
If an infrastructure component is installed, it must satisfy one of these:
- needed now for headless execution
- needed now for recovery/security/observability
- required to prevent near-term rework
Otherwise, defer it.
9.11 Bootstrap, Startup & Controlled Update Lifecycle¶
9.11.1 Purpose¶
XIOPro must be able to start, restart, update, and recover deliberately.
A serious headless system cannot depend on "manual remembering" to become operational after:
- host reboot
- deployment change
- schema update
- service crash
- secret rotation
- version rollout
This section defines the minimum controlled lifecycle.
9.11.2 Bootstrap Principle¶
Bootstrap must be:
- scripted
- repeatable
- environment-aware
- observable
- rollback-conscious
If startup requires undocumented manual steps, bootstrap is incomplete.
9.11.2A Python Environment Standard¶
Python-based XIOPro services and scripts should use uv as the default tooling layer for:
- Python version management
- environment creation
- dependency sync
- lockfile-driven reproducibility
- tool and script execution during bootstrap/update
Expected standards where applicable:
- `pyproject.toml`
- `uv.lock`
- `.python-version`
Rule¶
Bootstrap and update automation for Python services should prefer uv-based workflows over ad hoc pip/venv handling.
The goal is:
- faster environment setup
- reproducible sync across Mac and Hetzner
- cleaner CI/deploy behavior
- fewer environment drift problems
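A hedged example of the preferred uv workflow during bootstrap (the service module name is a placeholder):

```shell
# Install the interpreter version declared in .python-version
uv python install

# Sync the environment exactly from uv.lock (fails if the lock is stale)
uv sync --frozen

# Run the service inside the managed environment
uv run python -m xiopro_service   # placeholder module name
```

The same three commands behave identically on the Mac and on Hetzner, which is the reproducibility point above.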
9.11.3 Cold Start Sequence¶
A first-time or rebuilt environment should follow this order:
- host baseline ready
- network/security baseline ready
- secrets delivery path ready
- storage surfaces reachable
- PostgreSQL initialized or restored
- schema migrations applied
- core control services started
- scheduler/background jobs started
- knowledge/index refresh checks run
- observability/health checks confirmed
- workload admission opened
Rule¶
The system should not accept normal execution until foundational dependencies pass health gates.
9.11.4 Warm Restart Sequence¶
For ordinary reboot or redeploy:
- preserve or verify durable state
- restart PostgreSQL and storage dependencies
- restart control services
- rebind runtime/scheduler state
- verify pending sessions / checkpoints
- verify alerting and telemetry
- reopen execution intake
Warm restart should prefer continuity over full rebuild.
9.11.5 Controlled Update Flow¶
Every significant update should support:
- planned target version
- preflight validation
- backup / snapshot before change
- migration step if needed
- health verification after rollout
- rollback path if checks fail
Minimum stages:
controlled_update_flow:
- preflight
- snapshot
- deploy
- migrate
- verify
- reopen
- rollback_if_needed
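The stages above can be sketched as a small driver; a minimal Python sketch, assuming stage implementations are injected by the deploy tooling (names mirror the `controlled_update_flow` list):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class UpdateFlow:
    # Stage implementations are injected; each returns True on success.
    stages: Dict[str, Callable[[], bool]]
    log: List[Tuple[str, bool]] = field(default_factory=list)

    def run(self) -> bool:
        for name in ("preflight", "snapshot", "deploy", "migrate", "verify"):
            ok = self.stages[name]()
            self.log.append((name, ok))
            if not ok:
                # Any failed stage rolls back before intake reopens.
                self.stages["rollback_if_needed"]()
                self.log.append(("rollback_if_needed", True))
                return False
        self.stages["reopen"]()
        self.log.append(("reopen", True))
        return True
```

The ordering matters: snapshot precedes deploy so rollback always has a restore point, and reopen only runs after verify passes.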
9.11.6 Preflight Checks¶
Before deployment or upgrade, the system should check at least:
- target environment identity
- available disk/RAM headroom
- secrets availability
- database reachability
- migration compatibility
- backup readiness
- current health baseline
- operator approval where required
9.11.7 Health Gates¶
Startup/update should define health gates for at least:
- PostgreSQL
- API service
- orchestrator
- governor
- scheduler
- LiteLLM/router path
- Ruflo/runtime path
- backup jobs
- telemetry/alerts
If health gates fail, the system should remain in degraded or closed admission mode until reviewed.
9.11.8 Runtime Admission Control¶
After bootstrap or update, XIOPro should reopen work in controlled order.
Suggested order:
- read-only status visibility
- manual/operator access
- scheduler and maintenance jobs
- controlled task execution
- full execution intake
This prevents unstable startup from immediately turning into unstable work.
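A minimal sketch of this gating, assuming admission advances one stage per healthy check cycle (stage labels are illustrative):

```python
from typing import List, Optional

# Suggested reopen order from this section.
ADMISSION_STAGES: List[str] = [
    "read_only_status",
    "operator_access",
    "scheduler_and_maintenance",
    "controlled_task_execution",
    "full_intake",
]

def next_admission_stage(current: Optional[str], healthy: bool) -> Optional[str]:
    """Advance one admission stage per healthy cycle; hold on degraded health."""
    if not healthy:
        return current  # never widen admission while health gates fail
    if current is None:
        return ADMISSION_STAGES[0]
    i = ADMISSION_STAGES.index(current)
    return ADMISSION_STAGES[min(i + 1, len(ADMISSION_STAGES) - 1)]
```

The key property is that a failing health gate freezes admission where it is rather than rolling it forward.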
9.11.9 Version & Migration Discipline¶
The platform must keep clear record of:
- deployed service versions
- DB migration level
- blueprint/runtime compatibility notes
- last successful deployment
- last successful restore drill
- pending upgrade blockers
This is necessary for recovery and auditability.
9.11.10 Self-Restart vs Self-Mutation Rule¶
XIOPro should be able to:
- restart services
- rebind sessions
- resume controlled execution
- propose updates
- assist in rollout preparation
But it must not silently self-mutate production behavior without governed approval.
Self-recovery is allowed. Unapproved self-redefinition is not.
9.11.11 Success Criteria¶
Bootstrap and update discipline is successful when:
- a host can reboot without operational chaos
- a fresh node can be built from runbooks/scripts
- deployments are repeatable
- migrations are not guesswork
- rollback is realistic
- post-update health is explicit before execution resumes
9.11.12 Orchestrator Launch Commands¶
XIOPro orchestrator surfaces are launched via the devxio CLI command:
| Command | Surface | Host | Effect |
|---|---|---|---|
| `devxio go` or GO | Global Orchestrator | Hetzner | Starts the primary 24x7 orchestrator session. Reads CLAUDE.md, memory files, plan.yaml, and resumes execution. |
| `devxio mo` or MO | Mac Orchestrator | Mac Studio | Starts the Mac-local orchestrator. Handles Mac tasks, browser testing, local experiments. Reports to GO via Control Bus. |
Both surfaces can run simultaneously. GO is always the primary. See Part 4, Section 4.1A for the full naming convention and rules.
10. Backup & Recovery¶
10.1 Principle¶
Recovery is not a future enhancement. It is a required runtime property.
XIOPro must be able to recover from:
- node failure
- process crash
- session loss
- database corruption
- bad deployment
- accidental deletion
- provider-side disruption
- operator error
10.2 Backup Scope¶
All critical persistence surfaces must be covered.
10.2.1 PostgreSQL¶
Must back up:
- ODM entities
- tickets
- tasks
- activities
- runtimes
- sessions
- escalation requests
- human decisions
- governance state
- cost and telemetry aggregates
- scheduler state
10.2.2 Git Repositories¶
Must preserve:
- source code
- rules
- skills
- blueprints
- prompts
- configuration templates
- scripts
Git is already versioned, but mirror/backup copies are still required.
10.2.3 Object / Blob Storage¶
Must preserve:
- transcript snapshots
- exported artifacts
- checkpoints
- large outputs
- recovery bundles
- retained logs
10.2.4 Configuration & Infrastructure State¶
Must preserve:
- environment templates
- Docker compose files
- reverse proxy config
- firewall config
- job schedules
- deployment scripts
- secret references
- runbooks
Secrets themselves should not be dumped into general backups unless explicitly encrypted and controlled.
10.2A Restic Backup to Backblaze B2¶
Automated backup runs daily via Restic to Backblaze B2. Implemented and operational as of 2026-03-28.
| Parameter | Value |
|---|---|
| Tool | Restic |
| Target | Backblaze B2 bucket (STRUXIO-ai) |
| Schedule | Daily at 03:00 UTC (cron) |
| Script | /opt/struxio/backup/backup.sh |
| Scope | Workspace, configs, scripts, PostgreSQL dumps |
| Encryption | Restic built-in (AES-256) |
| Credentials | SOPS-encrypted (backup_secrets.enc.env), loaded at runtime via age key |
Backup Process (3 steps)¶
- Decrypt credentials — SOPS decrypts the B2 account ID, key, and restic password from `backup_secrets.enc.env` using the age key at `~/age-key.txt`
- Dump PostgreSQL — `pg_dump` runs for the Bus DB and Paperclip DB, writing to `/opt/struxio/backup/pg_dumps/`. Files are named with a date suffix. 7-day local retention.
- Restic backup — backs up workspace, bus config, scripts, and pg_dumps to B2. Tags: daily, hetzner.
What Is Backed Up¶
| Path | Content |
|---|---|
| /home/struxio/STRUXIO_Workspace | All 7 Git repos |
| /opt/struxio/bus | Bus MCP source and config |
| /opt/struxio/config | System configuration |
| /opt/struxio/scripts | Operational scripts |
| /opt/struxio/backup/pg_dumps | Daily PostgreSQL dumps (Bus + Paperclip) |
Excluded: node_modules, .git, *.log, __pycache__, .venv
Security¶
- No plaintext credentials — B2 account key and restic password are SOPS-encrypted at rest
- Decryption requires the age private key (`~/age-key.txt`), which is not in any Git repo
- Backup data is encrypted by Restic (AES-256) before upload to B2
Retention Policy¶
Restic prunes automatically after each backup:
- keep 7 daily snapshots
- keep 4 weekly snapshots
- keep 6 monthly snapshots
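This policy corresponds to a restic invocation along these lines (repository and credential environment assumed to be loaded first, as in backup.sh):

```shell
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 6 \
  --prune    # delete unreferenced data after forgetting snapshots
```

Without `--prune`, forgotten snapshots stop being listed but their data still occupies the B2 bucket.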
10.3 Backup Cadence¶
Database¶
- logical dump: at least daily
- WAL archiving: continuous (see Section 10.3A)
- pre-deploy snapshot: required before high-risk migrations
Git / Markdown¶
- mirrored continuously through Git remote
- daily off-platform mirror recommended
Object Storage¶
- continuous durable write pattern preferred
- lifecycle retention policy required
Config / Infra State¶
- export on every significant infrastructure change
- nightly snapshot of deployment definitions recommended
10.3A PostgreSQL WAL Archiving for Point-in-Time Recovery¶
Daily logical dumps (Section 10.2A) provide a 24-hour RPO. WAL (Write-Ahead Log) archiving reduces the RPO to 5 minutes by continuously shipping transaction logs to Backblaze B2.
Configuration¶
wal_archiving:
archive_mode: "on"
archive_command: "restic backup --stdin --stdin-filename %f --tag wal < %p"
# Alternative direct B2 shipping:
# archive_command: "b2 upload-file STRUXIO-ai wal/%f %p"
wal_level: "replica"
max_wal_senders: 3
wal_keep_size: "1GB"
RPO Target¶
- Target RPO: 5 minutes (down from 24 hours with daily dumps alone)
- WAL segments are archived continuously as they complete (typically every few minutes under normal load)
- Combined with the daily base backup, any point in time within retention can be restored
Archive Destination¶
WAL segments are shipped to Backblaze B2 alongside the daily Restic backups:
| Component | Destination | Retention |
|---|---|---|
| Daily base backup (pg_dump) | B2 via Restic (existing) | 7 daily, 4 weekly, 6 monthly |
| WAL segments | B2 via Restic or direct B2 upload | 7 days minimum |
Point-in-Time Restore Procedure¶
- Identify target time — determine the recovery point (e.g., "2026-03-30 14:30:00 UTC")
- Restore base backup — restore the most recent daily pg_dump that precedes the target time
- Download WAL segments — retrieve all WAL files from B2 between the base backup and the target time
- Configure recovery — set `recovery_target_time` in `postgresql.conf` (or `recovery.conf` for older versions)
- Start PostgreSQL in recovery mode — PostgreSQL replays WAL segments up to the target time
- Validate — verify table counts, recent data, and ODM entity integrity
- Promote — remove recovery configuration and restart as primary
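Steps 4-5 can be sketched as recovery settings (PostgreSQL 12+; the `restore_command` shown is an assumption about how segments would be pulled back from the Restic/B2 archive, not a verified pipeline):

```
# postgresql.conf recovery settings (illustrative)
restore_command = 'restic dump latest wal/%f > %p'   # fetch one archived WAL segment
recovery_target_time = '2026-03-30 14:30:00 UTC'
recovery_target_action = 'pause'   # hold before promotion so data can be validated
# then create an empty recovery.signal file in the data directory and start PostgreSQL
```

With `recovery_target_action = 'pause'`, the validate step runs against a paused replica before promotion makes the recovery irreversible.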
Monitoring¶
- Alert if WAL archiving falls behind by more than 5 minutes (archive lag)
- Alert if archive_command fails 3 consecutive times
- Include WAL archive status in the daily backup health check
Rule¶
WAL archiving is required for T1P production. Daily logical dumps alone are insufficient for a system managing active tickets, tasks, and governance state.
10.4 Retention Policy¶
Minimum target policy:
- daily backups: 14 days
- weekly backups: 8 weeks
- monthly backups: 12 months
- critical milestone backups: retained until manually reviewed
Session checkpoints and recovery bundles may use shorter retention if cost requires it, but production recovery points must remain sufficient for incident handling.
10.5 Recovery Priorities¶
Recovery order must follow business value.
Priority 1¶
- database integrity
- orchestrator state
- governor state
- active ticket/task continuity
Priority 2¶
- session checkpoint restoration
- transcript recovery
- scheduler recovery
- API availability
Priority 3¶
- observability dashboards
- historical exports
- non-critical mirrors
10.6 Recovery Targets¶
Initial T1P targets:
- infrastructure RPO target: <= 5 minutes (with WAL archiving; <= 24 hours without)
- operational DB restore target: same day
- critical session recovery target: best effort via checkpoint + transcript snapshot
- redeploy target after node loss: scripted and repeatable
These are initial targets, not final enterprise targets. The key requirement is that recovery must be rehearsable and explicit.
10.7 Session & Runtime Recovery¶
Recovery must align with Part 3 and Part 4 runtime semantics.
Infrastructure must support:
- runtime restart without losing ticket linkage
- session rebind when possible
- replacement session creation when rebind fails
- recovery escalation to human when continuity is uncertain
- durable storage of context snapshots and transcript references
Infrastructure recovery is not complete unless runtime continuity is addressed.
10.8 Restore Drill Requirements¶
A restore drill must be executable from runbook.
Minimum Drill Scenarios¶
- PostgreSQL restore to clean environment
- full service restart from deployment definitions
- object storage recovery validation
- recovery of one interrupted active task
- rollback to prior known-good deployment
If recovery is not tested, it is not real.
Monthly Restore Drill Procedure¶
A restore drill must run at least once per calendar month. The drill validates that B2 backups are actually recoverable, not just present.
Drill Steps¶
restore_drill:
cadence: "monthly (first week of month)"
executor: "GO or designated ops agent"
steps:
1_download:
action: "Download latest Restic snapshot from B2"
command: "restic restore latest --target /tmp/restore_drill/"
verify: "Files exist at /tmp/restore_drill/"
2_restore_db:
action: "Restore PostgreSQL dump to temporary database"
command: "createdb restore_drill_db && pg_restore -d restore_drill_db /tmp/restore_drill/pg_dumps/latest.dump"
verify: "Database created without errors"
3_verify_tables:
action: "Verify table counts match production"
checks:
- "SELECT count(*) FROM tickets — within 5% of production count"
- "SELECT count(*) FROM tasks — within 5% of production count"
- "SELECT count(*) FROM messages — within 5% of production count"
- "SELECT count(*) FROM agent_runtimes — non-zero"
4_verify_recent_data:
action: "Verify data freshness"
checks:
- "SELECT max(created_at) FROM messages — within 24 hours of drill time"
- "SELECT max(created_at) FROM tasks — within 24 hours of drill time"
- "WAL recovery test: if WAL archiving active, verify PITR to specific timestamp"
5_cleanup:
action: "Remove temporary resources"
commands:
- "dropdb restore_drill_db"
- "rm -rf /tmp/restore_drill/"
6_record:
action: "Record drill results"
output:
file: "state/restore_drills.yaml"
fields:
- drill_date
- snapshot_id
- snapshot_age_hours
- tables_verified
- table_count_drift_pct
- data_freshness_hours
- wal_pitr_tested (boolean)
- pass_fail
- notes
- executor
Drill Success Criteria¶
- All tables restored without errors
- Table counts within 5% of production
- Most recent data within 24 hours of drill time (within RPO)
- WAL PITR test successful (if WAL archiving is active)
- Drill completes in under 30 minutes
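The "within 5%" criterion is a simple drift check; a minimal sketch:

```python
def within_drift(production_count: int, restored_count: int,
                 max_pct: float = 5.0) -> bool:
    """Drill check: restored table count within max_pct percent of production."""
    if production_count == 0:
        return restored_count == 0
    drift_pct = abs(production_count - restored_count) / production_count * 100.0
    return drift_pct <= max_pct
```

The same check applied per table feeds the `table_count_drift_pct` field recorded in `state/restore_drills.yaml`.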
Drill Failure Response¶
- If a drill fails: create a critical governance alert (`backup.restore_drill.failed`)
- Root cause must be identified and fixed before the next scheduled drill
- Two consecutive drill failures trigger a human escalation to the founder
11. Security¶
11.1 Principle¶
XIOPro security must protect:
- proprietary strategy
- source code
- execution control
- credentials
- knowledge assets
- product plans
- customer-sensitive material
Security must be practical, layered, and compatible with headless execution.
11.2 Security Posture¶
Initial production posture:
- minimal public exposure
- private overlay access first
- least privilege by default
- founder-controlled admin path
- explicit service boundaries
- auditable changes
Public internet exposure should be minimized to only what is operationally required.
11.3 Access Model¶
Primary roles:
- founder_admin
- system_service
- agent_runtime
- emergency_operator
- read_only_observer
Rules:
- agents do not receive broad admin privileges
- infrastructure administration remains human-controlled
- service-to-service access uses explicit credentials
- emergency paths must be documented and separate from normal automation
11.4 Network Security Baseline¶
Recommended baseline:
- Tailscale or equivalent private overlay for administrative access
- SSH restricted to approved identities only (currently restricted to Tailscale)
- firewall deny-by-default posture (UFW active)
- only required inbound ports opened
- internal services bound privately where possible
- reverse proxy terminates TLS for exposed services
The preferred posture is:
- private access first
- public exposure second
11.5 Secrets Management¶
Secrets must never live as unmanaged plaintext in:
- code repositories
- markdown blueprints
- shared chat messages
- container images
Minimum standard:
- use environment injection or secret manager pattern
- separate secrets by environment
- rotate high-value credentials
- maintain inventory of critical secrets
- use scoped provider keys where supported
- encrypt secrets at rest using SOPS + age (see Section 9.9.6)
Recommended categories:
- provider API credentials
- GitHub tokens
- database credentials
- object storage credentials
- Tailscale / network auth material
- domain / DNS / TLS credentials
11.6 Service Isolation¶
Services must be logically isolated even if colocated.
Isolation baseline:
- separate containers for major services
- separate service credentials
- no unnecessary shared writable volumes
- DB access limited by service role
- execution runtimes separated from core control services where practical
Agent runtimes should not have unrestricted access to all system internals.
11.7 Endpoint Protection & Host Hardening¶
Baseline host controls:
- timely OS security updates
- non-root routine operation
- SSH key auth only
- fail2ban or equivalent if internet-facing SSH remains enabled
- UFW / nftables firewall policy
- audit of installed packages and open ports
- disk encryption where supported and operationally practical
11.8 Security Logging & Audit¶
Must record:
- admin logins
- deploy events
- secret changes
- permission changes
- breaker-triggered shutdowns
- emergency access usage
- unusual agent privilege attempts
Security-relevant events must be reviewable from an audit trail.
11.9 Incident Response Baseline¶
Every critical environment must have a basic incident path:
- detect
- contain
- preserve evidence
- rotate credentials if needed
- restore service safely
- document root cause
- update controls
A simple runbook is sufficient initially, but undocumented response is not acceptable.
11.10 Emergency Access, Out-of-Band Recovery & Memory Pressure Survival¶
Purpose¶
XIOPro must remain recoverable even when normal access paths fail.
This includes cases such as:
- host memory exhaustion
- service thrash or restart loops
- accidental firewall lockout
- Tailscale failure
- SSH unavailability
- broken deploy causing loss of normal admin path
This section defines the minimum emergency-access discipline.
11.10.1 Principle¶
Private overlay access and normal SSH are the preferred control paths.
But they are not sufficient as the only recovery plan.
Every critical environment must also have a documented out-of-band recovery path.
11.10.2 Required Access Layers¶
The environment should support these layers in order:
- normal private admin path
  - Tailscale or equivalent
  - SSH with key-only auth
  - normal deployment and maintenance workflow
- degraded emergency operator path
  - limited but documented recovery path
  - safe rollback of firewall/network changes
  - ability to stop unstable services
- out-of-band host access
  - provider console / rescue mode / equivalent
  - keyboard/layout-aware emergency instructions
  - ability to restore basic reachability without guessing
Rule¶
A host is not operationally safe if only one access path exists.
11.10.3 Memory Pressure Survival Rule¶
The system must assume that memory exhaustion can impair:
- SSH responsiveness
- Tailscale responsiveness
- service health
- logging
- the ability to run normal recovery commands
Therefore Node A must reserve enough operational headroom to allow emergency access and controlled recovery.
Minimum policy:
- avoid sizing Node A so tightly that ordinary bursts can fully consume memory
- prefer explicit RAM headroom over theoretical maximum utilization
- treat repeated OOM behavior as a production-severity signal
- preserve the ability to stop or pause non-critical services under pressure
11.10.4 Emergency Recovery Controls¶
At minimum, the environment should support these emergency actions:
- stop or pause non-essential containers/services
- restore firewall/network path to a safe known baseline
- restart only core control-plane services first
- verify DB health before reopening broader execution
- keep an emergency runbook for Hetzner console / rescue operations
- keep known-good command snippets accessible outside the affected host
Examples of Core-First Recovery Order¶
- regain admin access
- verify disk and memory state
- stop unstable/non-essential services
- verify PostgreSQL
- restore API/orchestrator/governor path
- restore scheduler and workers
- reopen task admission gradually
11.10.5 Firewall Safety Rule¶
Firewall changes must be governed like risky production changes.
Minimum practice:
- keep a known-good baseline policy
- document rollback steps
- avoid permanent lockout risk from one bad rule push
- test private admin path after material firewall changes
- keep console-level rollback instructions documented
The goal is not perfect automation. The goal is avoiding avoidable lockout.
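A hedged sketch of a known-good UFW baseline matching the exposure classes in 9.9.5 (the interface name assumes the default Tailscale device):

```shell
# Known-good baseline: deny inbound by default, allow overlay admin + HTTPS only
ufw default deny incoming
ufw default allow outgoing
ufw allow in on tailscale0        # Class B: private overlay admin path
ufw allow 80,443/tcp              # Class A: HTTPS ingress + redirect
ufw enable
```

Rolling back to this baseline is the documented recovery step after any rule push that risks lockout.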
11.10.6 Emergency Operator Role¶
The emergency_operator role exists for major incident recovery.
This role is separate from normal automation and should be able to:
- use out-of-band access when needed
- execute documented recovery commands
- restore reachability
- preserve evidence before destructive actions
- log all meaningful emergency interventions
11.10.7 Runbook Requirement¶
At least one explicit emergency runbook must exist for Node A covering:
- Tailscale unavailable
- SSH unavailable
- firewall rollback
- memory exhaustion / OOM stabilization
- service stop order
- provider console usage
- post-incident verification checklist
An undocumented emergency procedure is not a real emergency procedure.
11.10.8 Acceptance Rule¶
Infrastructure is not accepted as production-capable unless the team can answer:
- how do we access the host if Tailscale fails?
- how do we recover if SSH is unresponsive?
- how do we recover if firewall changes block normal access?
- how do we stabilize the host if memory is exhausted?
- what is the exact first-command sequence in provider console mode?
If these answers are not documented, the security model is incomplete.
12. Observability¶
12.1 Principle¶
If XIOPro cannot observe itself, it cannot govern itself.
Observability must support:
- runtime visibility
- recovery
- cost control
- debugging
- safety decisions
- future optimization
12.2 Required Signals¶
Minimum required signal families:
- logs
- metrics
- health checks
- heartbeats
- alerts
- audit events
Tracing is recommended but may be phased in later.
12.3 Logging Requirements¶
Logs must exist for:
- API layer
- orchestrator
- governor
- scheduler
- runtime adapters
- database-related failures
- deployment actions
- security events
Log requirements:
- structured where possible
- timestamped
- correlated by request/session/task IDs where possible
- retained according to environment policy
- searchable during incidents
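One way to meet these requirements with only the standard library is a JSON log formatter that emits structured, timestamped records carrying correlation IDs. The field names below are illustrative, not a mandated schema.

```python
# Structured, timestamped, correlated logging with the stdlib only.
# Field names ("service", "request_id", "task_id") are an assumed schema.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "task_id": getattr(record, "task_id", None),
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("xiopro")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Correlation IDs travel via the `extra` mechanism of stdlib logging
log.info("session recovered", extra={"service": "orchestrator",
                                     "request_id": "r-123", "task_id": "t-9"})
```

Because every record is one JSON object per line, the output stays greppable and searchable during incidents without any additional tooling.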
12.4 Metrics Requirements¶
Minimum operational metrics:
Platform¶
- CPU
- memory
- disk
- network
- container restarts
- process uptime
Runtime¶
- active runtimes
- active sessions
- waiting human escalations
- failed runs
- retries
- queue depth
Business/Execution¶
- tickets in progress
- tasks completed
- task latency
- session recovery count
- human intervention count
Cost¶
- provider cost estimate
- per-runtime estimated spend
- per-task estimated spend
- infra cost trend
12.5 Health Model¶
Each core service must expose a health view:
- healthy
- degraded
- blocked
- failed
Minimum monitored services:
- API
- orchestrator
- governor
- database
- scheduler
- runtime adapter layer
- reverse proxy
- object storage connectivity
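A minimal sketch of the four-state model: overall system health is the worst state reported by any monitored service. The four states and the service names come from this section; the worst-of rollup rule is an assumption for the example.

```python
# Four-state health model with a worst-of rollup (rollup rule assumed).
from enum import IntEnum

class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    BLOCKED = 2
    FAILED = 3

def system_health(services: dict[str, Health]) -> Health:
    """Overall health is the worst state any core service reports.
    No data at all is treated as FAILED, not as healthy."""
    return max(services.values(), default=Health.FAILED)

states = {
    "api": Health.HEALTHY,
    "orchestrator": Health.HEALTHY,
    "database": Health.DEGRADED,
    "scheduler": Health.HEALTHY,
}
print(system_health(states).name)  # DEGRADED
```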
12.6 Alerting Baseline¶
Alerts must be routed by severity.
Critical¶
- database unavailable
- orchestrator down
- repeated session recovery failure
- secret/security incident
- runaway cost spike
Warning¶
- elevated retry rate
- queue backlog
- degraded disk space
- failed backup job
- runtime adapter instability
Info¶
- deploy complete
- scheduled maintenance
- non-critical optimization suggestions
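Severity routing can be as small as a lookup table. The routing targets below are placeholders; this section only mandates that alerts route by severity, not these channels. One deliberate choice is shown: unknown severities escalate rather than disappear.

```python
# Hedged sketch of severity-based alert routing; targets are placeholders.
ROUTES = {
    "critical": "page_operator",  # e.g. database unavailable, cost spike
    "warning": "ops_channel",     # e.g. queue backlog, failed backup
    "info": "log_only",           # e.g. deploy complete
}

def route_alert(severity: str) -> str:
    # Fail-safe: an unrecognized severity is escalated, never dropped.
    return ROUTES.get(severity, "page_operator")

print(route_alert("warning"))   # ops_channel
print(route_alert("unknown"))   # page_operator (fail-safe)
```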
12.7 Dashboard Requirements¶
At minimum, the operator must be able to see:
- system health
- active runtimes
- active sessions
- waiting escalations
- error count
- recovery events
- cost trend
- backup status
This may begin with simple dashboards, but the signals themselves are mandatory.
12.8 Observability Storage & Retention¶
Explicit retention rules must exist for:
- operational logs
- audit logs
- metrics history
- incident snapshots
Retention length may vary by cost, but critical incident analysis must remain possible.
13. Cost Strategy¶
13.1 Principle¶
Infrastructure cost must be:
- visible
- attributable
- governable
- optimized without harming reliability
Cost strategy is not only about lowering spend. It is about choosing the right cost for the right leverage.
13.2 Cost Categories¶
Track at least these categories:
- hosting / compute
- storage
- network / bandwidth
- backup retention
- observability tooling
- provider runtime/API spend
- local hardware / future self-hosted capacity
13.3 Attribution Model¶
Infrastructure should support attribution by:
- environment
- node
- service
- runtime surface
- ticket or project where practical
This enables the governor and the operator to answer:
- what is expensive
- why it is expensive
- whether it is justified
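Attribution reduces to tagging every spend record with the dimensions above and rolling up by any one of them. The record fields and dollar amounts below are made up for illustration.

```python
# Illustrative cost attribution rollup; records and amounts are invented.
from collections import defaultdict

records = [
    {"env": "prod", "node": "node-a", "service": "orchestrator", "usd": 4.20},
    {"env": "prod", "node": "node-a", "service": "runtime", "usd": 11.50},
    {"env": "dev",  "node": "node-a", "service": "runtime", "usd": 1.30},
]

def attribute(records, dimension: str) -> dict[str, float]:
    """Sum spend by any attribution dimension (env, node, service, ...)."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[dimension]] += r["usd"]
    return dict(totals)

print(attribute(records, "service"))
print(attribute(records, "env"))
```

The same records answer "what is expensive" under every dimension, which is what lets the governor and the operator decide whether the spend is justified.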
13.4 Cost Control Rules¶
Initial rules:
- avoid idle heavyweight services without clear value
- scale up only when signal justifies it
- prefer simple colocated deployment before fragmentation
- separate services only when risk, cost, or operational pressure justifies it
- prune unused storage and log retention intentionally
13.5 Scale-Up Triggers¶
Infrastructure upgrade may be justified when one or more apply:
- repeated CPU or memory saturation
- queue growth impacting execution goals
- session recovery degradation due to node pressure
- observability overhead becoming material
- self-hosted model experimentation requiring isolated compute
- product workloads contaminating XIOPro control-plane stability
13.5A Scaling Triggers¶
The following specific conditions trigger a scaling evaluation. Meeting one trigger does not mandate action — it requires a deliberate review and decision. GO is responsible for raising the evaluation; the decision requires operator approval.
| Signal | Threshold | Evaluation Required |
|---|---|---|
| PostgreSQL write latency | > 50ms sustained at 10+ concurrent agents | Evaluate read replicas |
| Host memory | > 75% sustained (any host) | Add new host |
| Bus request latency | > 200ms p95 | Evaluate caching layer |
| Agent spawn queue depth | > 5 pending spawns | Distribute spawn load to additional hosts |
| Concurrent agent count | > 8 active simultaneously on a single host | Evaluate second host or reduce parallelism |
| Disk usage | > 80% on any data volume | Archive old activity partitions to B2; evaluate volume expansion |
Rules¶
- Triggers are measured over a sustained window (minimum 5 minutes), not transient spikes.
- A trigger that clears before review requires no action but should be logged.
- Scaling adds operational complexity — it must be justified by signal, not by precaution.
- GO reports trigger events via Bus alert (L3 or higher) so IO can route to the founder for decision.
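The sustained-window rule above can be sketched as a predicate over a sample history: a trigger fires only when the signal stays above its threshold for the whole window. The one-minute sampling cadence in the example is an assumption; the blueprint only mandates a minimum five-minute window.

```python
# Sustained-window trigger check; sampling cadence is an assumption.
def sustained_breach(samples: list[float], threshold: float,
                     window: int) -> bool:
    """True only if the last `window` samples all exceed `threshold`."""
    if len(samples) < window:
        return False  # not enough history to call the breach sustained
    return all(s > threshold for s in samples[-window:])

# Host memory > 75% sustained over five 1-minute samples -> fires
mem = [72, 76, 78, 77, 76, 79]
print(sustained_breach(mem, 75.0, 5))    # True

# A single transient spike does not fire
spike = [60, 95, 60, 60, 60, 60]
print(sustained_breach(spike, 75.0, 5))  # False
```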
13.6 Hetzner Upgrade Policy¶
Initial assumption:
- one primary Hetzner CPX62 node is acceptable for T1P
Upgrade path should remain open for:
- larger CPU / RAM node
- split DB to dedicated node
- split runtime workers from control plane
- add dedicated GPU / model experimentation node later
No upgrade should be performed only because it feels more "serious". Upgrade must follow observed bottlenecks.
13.7 Self-Hosted Model Decision Rule¶
Future self-hosted model infrastructure should be evaluated only if it improves one or more of:
- privacy posture
- unit economics
- latency
- offline resilience
- provider independence
- special workload suitability
It should not be adopted merely because self-hosting sounds strategic.
14. Service Fate Map Reference¶
The transition from current services to v5.0 target architecture is documented in:
resources/SERVICE_FATE_MAP_v4_2.md
This resource maps every currently running service/container to its v5.0 fate:
- KEEP: Caddy, PostgreSQL (upgrade), Hindsight, ISO 19650 engine (product code -- see MVP1_PRODUCT_SPEC.md), Tailscale, UFW, Restic, SOPS+age, Ruflo, Claude Code, AutoDream
- KEEP + EVOLVE: Bus (-> API gateway/relay), LiteLLM (activate routing)
- KEEP for now: Paperclip (until ODM parity), Tickets renderer, RC keepalive
- REPLACE: Dashboard (-> Control Center)
- RETIRE: devxio-frontend, devxio-bridge (stale pre-v3.1 code)
- RETIRED (deprecated): devxio-librarian (631 MB Neo4j), graph_stack_neo4j (1.2 GB) -- both Neo4j instances stopped and removed
Retirement RAM Impact¶
Retiring stale services frees approximately 1.95 GB, leaving approximately 26 GB available for new XIOPro backend, UI, and worker services on the CPX62.
Parallel Operation Rule¶
During migration, old services (Bus, Paperclip, dashboard) run alongside new services. No big-bang cutover. Parallel-run until new services are proven and feature parity is reached.
15. Current State¶
As of 2026-03-28, the infrastructure layer is operational:
What exists today:
- Hetzner CPX62 running Ubuntu 24.04 with 14 Docker containers (~4.2 GB RAM)
- Caddy reverse proxy with TLS and basic auth
- PostgreSQL (bus database, 44 MB)
- XIOPro Control Bus (evolving from Bus MCP): REST API :8088, SSE Push :8089, OAuth 2.1, PostgreSQL-backed. Currently 107 MB. Being extended with push delivery, intervention, task orchestration, agent registration, host capacity, and spawn coordination (see Part 2, Section 5.8)
- Paperclip issue tracker + DB (339 MB combined)
- Hindsight memory system (1.06 GB, Vectorize.io Docker)
- LiteLLM router (576 MB, not actively routing under Max20)
- ISO 19650 engine (57 MB, product code -- see MVP1_PRODUCT_SPEC.md)
- ~~Two Neo4j instances~~ (deprecated -- both stopped and removed, 1.83 GB freed)
- Phase 1 React dashboard (11 MB)
- Pre-v3.1 stale frontend + bridge (123 MB, candidates for immediate retirement)
- Tailscale VPN mesh (Hetzner <-> Mac)
- UFW firewall ACTIVE (SSH restricted to Tailscale 100.64.0.0/10, HTTP/HTTPS public, default deny incoming). Enabled 2026-03-28.
- Root password set for emergency Hetzner console access
- struxio user has sudo access
- Restic backup to Backblaze B2 (daily 03:00 UTC)
- SOPS + age for secret encryption
- Git history cleaned: plaintext secrets purged from STRUXIO_OS repo history via git-filter-repo (2026-03-28). Only SOPS-encrypted versions remain.
- Supply chain security: Socket.dev + GuardDog recommended for behavioral malicious package detection. Trivy for container scanning. pip-audit/npm-audit for CVE baseline.
- RC keepalive cron (every 10 min)
- Ruflo (claude-flow) for agent teams
- Claude Code v2.1.86 with Max20 OAuth
- AutoDream enabled (memory consolidation)
- tmux 3.4, ripgrep 14.1.1 installed
What must be built/changed:
- Install must-have CLI tools (gh, jq, uv, fzf, fd, yq, direnv)
- Retire stale containers (devxio-frontend, devxio-bridge)
- ~~Evaluate Neo4j instances for retirement~~ (done -- both retired, see Part 5 Section 12.1)
- Add pg_dump to restic backup scope
- Build new FastAPI backend + Next.js UI services
- Upgrade PostgreSQL to become primary ODM state store
- Evolve Bus into API gateway or keep as messaging relay
16. Infrastructure Success Criteria¶
Infrastructure is successful only if the following are true:
16.1 Reliability¶
- core services start reproducibly
- system can run continuously
- failures are detectable
- restart procedures are documented
16.2 Recoverability¶
- backups exist and are valid
- restore drill is executable
- runtime/session recovery path is defined
- bad deployments can be rolled back
16.3 Security¶
- secrets are controlled
- access is role-scoped
- public exposure is minimized
- audit trail exists for critical actions
16.4 Observability¶
- core services emit useful telemetry
- critical alerts reach the operator
- cost and health are visible
- incident diagnosis is possible without guesswork
16.5 Scalability¶
- architecture can separate services without redesign
- local node remains viable as fallback or augmentation
- future GPU or product nodes can be added cleanly
16.6 Cost Discipline¶
- infrastructure spend is explainable
- upgrade decisions are signal-based
- expensive idle complexity is avoided
Infrastructure that merely "runs" is not enough. It must be operable, recoverable, and governable.
17. Naming Conventions¶
All STRUXIO repositories, folders, and files follow a four-rule naming standard. These rules ensure consistency across GitHub, local disk, and internal structure.
17.0 General Principles¶
- Case-insensitive uniqueness: Never create two files or folders with the same name differing only by case. Uppercase in Mac root folders is for human readability only — the system must treat names as case-insensitive for search and deduplication.
- XIOPro and STRUXIO are proper names: Always written in uppercase. They are brand names with no abbreviation or meaning to decode — keep as-is everywhere.
- Mac vs Hetzner convention: Mac uses the STRUXIO_ prefix on top-level folders for Finder readability. Hetzner uses the GitHub lowercase name (the git clone default). Both are valid — they map to the same repo (see Section 17.5).
- External tool names kept as-is: Third-party tool names (Neo4j, PostgreSQL, Caddy, Backblaze, Tailscale) retain their original casing in all documents.
- High-level folders are descriptive: Use full words — STRUXIO_Design (not STRUXIO_D), STRUXIO_Knowledge (not abbreviated). The folder name should explain what it contains.
17.1 Rule 1 — GitHub Repository Names¶
- All lowercase.
- Words separated by hyphens (-).
- Must start with struxio-.
Examples: struxio-design, struxio-app, struxio-knowledge
17.2 Rule 2 — Local Top-Level Folders (Repos on Disk)¶
- Mac: Start with STRUXIO_. Use underscores (_). CamelCase or logical uppercase for readability.
- Hetzner: Use the GitHub lowercase name as cloned (e.g., struxio-design). No renaming needed.
- These represent the repos and are the exception to the lowercase rule on Mac.
Examples (Mac): STRUXIO_Design, STRUXIO_OS, STRUXIO_Knowledge, STRUXIO_DEVXIO_UI
Examples (Hetzner): struxio-design, struxio-os, struxio-knowledge
17.3 Rule 3 — Structure Folders (Inside Repos)¶
- All lowercase.
- Words separated by underscores (_).
- No spaces, no hyphens.
Examples: 02_devxio_architecture, blueprint_devxio_bl_v4_2_set, resources
The daily folder cleanup cron at 04:00 UTC enforces Rule 3 (lowercase structure folders).
17.4 Rule 4 — File Names¶
- Start with a function/type prefix in UPPERCASE.
- Rest uses appropriate casing for readability.
Examples: BLUEPRINT_XIOPro_v4_2_Part1_Foundations.md, SKILL_REGISTRY.yaml, REVIEW_final_freeze_v4_2.md, PLAN_iso19650_integration.md
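Rules 1, 3, and 4 are mechanical enough to validate with regexes. The patterns below are one reading of the written rules, not an official linter; the daily cleanup script is the authoritative enforcement for Rule 3.

```python
# Hedged validators for Rules 1, 3, and 4 (one reading of the rules).
import re

def valid_repo_name(name: str) -> bool:
    # Rule 1: all lowercase, hyphen-separated, must start with "struxio-"
    return re.fullmatch(r"struxio-[a-z0-9]+(-[a-z0-9]+)*", name) is not None

def valid_structure_folder(name: str) -> bool:
    # Rule 3: all lowercase, underscore-separated, no spaces or hyphens
    return re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)*", name) is not None

def valid_file_name(name: str) -> bool:
    # Rule 4: UPPERCASE function/type prefix followed by an underscore
    return re.match(r"[A-Z][A-Z0-9]*_", name) is not None

assert valid_repo_name("struxio-design")
assert not valid_repo_name("STRUXIO_Design")       # Mac folder, not a repo name
assert valid_structure_folder("02_devxio_architecture")
assert valid_file_name("SKILL_REGISTRY.yaml")
```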
17.5 Repository Mapping¶
| GitHub Repo | Mac Folder | Hetzner Folder | Purpose |
|---|---|---|---|
| struxio-design | STRUXIO_Design | struxio-design | Architecture, blueprints, design docs |
| struxio-logic | STRUXIO_Logic | struxio-logic | Agent activations, rules, skills |
| struxio-os | STRUXIO_OS | STRUXIO_OS | State, tickets, engineering, infra |
| struxio-app | STRUXIO_App | struxio-app | Product code (see MVP1_PRODUCT_SPEC.md) |
| struxio-business | STRUXIO_Business | struxio-business | Business docs |
| struxio-knowledge | STRUXIO_Knowledge | struxio-knowledge | Knowledge vault, Obsidian sync |
| struxio-devxio-ai | STRUXIO_DEVXIO_UI | devxio-control-center | Control Center UI (Next.js) |
| struxio-aibus | STRUXIO_AIBUS | struxio-aibus | Bus MCP Server source |
| struxio-dashboard | STRUXIO_Dashboard | struxio-dashboard | Dashboard UI |
| struxio-tickets | STRUXIO_Tickets | struxio-tickets | Ticket tracking |
17.6 Operational Tools¶
| Tool | Command | Schedule | Purpose |
|---|---|---|---|
| Folder Naming Cleanup | /opt/struxio/scripts/folder_naming_cleanup.sh | Daily 04:00 UTC | Enforces Rule 3 (lowercase structure folders) |
| Workspace Graph | /opt/struxio/scripts/workspace_graph.sh | Daily 04:01 UTC | Generates STATE_workspace_graph.yaml — full folder/file map for agent navigation |
18. Final Statement¶
Infrastructure is the execution ground of XIOPro.
If this layer is weak:
- runtime becomes fragile
- recovery becomes guesswork
- security becomes accidental
- costs become opaque
If this layer is strong:
- the system can run headless with confidence
- failures can be absorbed and repaired
- the founder can scale with less fear
- future growth does not require rethinking everything
Changelog¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 4.1.0 | 2026-03-27 | BM | Initial infrastructure blueprint |
| 4.2.0 | 2026-03-28 | BM | C8.1: Added actual Hetzner CPX62 specs (16 vCPU AMD EPYC-Genoa, 30GB RAM, 150GB SSD) to Section 5.1 and 9.5.1. C8.2: Added SOPS+age secrets encryption to Section 9.9.6. C8.3: Added Restic backup to Backblaze B2 section (10.2A). C8.4: Added service fate map reference (Section 14). C8.5: Added container memory budget (Section 9.5A). C8.6: Added CLI toolchain section (9.6A) referencing CLI_TOOLS_ASSESSMENT.md. CX.1: Global "Rufio" to "Ruflo" rename. CX.2: Updated version header to 4.2.0. CX.3: Added changelog. CX.4: Added current state section (Section 15). Renumbered success criteria to Section 16, final statement to Section 17. |
| 4.2.2 | 2026-03-28 | 000 | Agent naming migration: O00/O01 replaced with 000 (orchestrator role) / 000 (governor role). M01 replaced with module steward role. BM replaced with 000. Container group names updated from o00/o01 to orchestrator/governor. Backblaze B2 references preserved unchanged. Changelog author entries preserved as historical. |
| 4.2.3 | 2026-03-28 | 000 | Roles over numbers: Removed agent IDs from all architectural descriptions, section headers, diagrams, and service lists. Role names used throughout instead of agent numbers. |
| 4.2.7 | 2026-03-28 | BM | Neo4j deprecated: Both instances (devxio-librarian, graph_stack_neo4j) marked as retired/removed across Sections 9.5A, 14, 15. PostgreSQL + pgvector replaces all Neo4j use cases for T1P. |
| 4.2.11 | 2026-03-29 | BM | Added Section 9.11.12 (Orchestrator Launch Commands) — devxio go and devxio mo launch commands for GO and MO surfaces with cross-reference to Part 4, Section 4.1A. |
| 4.2.12 | 2026-03-29 | BM | Added Section 17 (Naming Conventions) — four-rule naming standard for repos, folders, and files with repository mapping table. Renumbered Final Statement to Section 18. |
| 4.2.13 | 2026-03-29 | BM | Updated Section 17 naming conventions: added Section 17.0 (General Principles — case-insensitive uniqueness, proper names, Mac vs Hetzner, tool names, descriptive folders). Updated 17.2 to distinguish Mac/Hetzner. Updated 17.5 mapping table with Hetzner column. Added 17.6 (Operational Tools — folder cleanup + workspace graph). |
| 4.2.14 | 2026-03-29 | BM | Cross-references: Added pointer to resources/DESIGN_cli_services.md in Section 9.6A (CLI services framework including Porkbun DNS and Hetzner hcloud). Added hcloud to Must-Have CLI tools table. |
| 5.0.1 | 2026-03-30 | GO | N22: Added Section 8.8.1 (Connection Pooling) -- PgBouncer or built-in pool_size recommended at 15+ agents, current Fastify pool max: 20, struxio_db_pool_* gauge monitoring via GET /metrics, pool exhaustion = warning alert. |
| 5.0.2 | 2026-03-30 | GO | N8: Added Section 13.5A (Scaling Triggers) — four specific thresholds: PostgreSQL write latency > 50ms at 10+ agents → read replicas; host memory > 75% sustained → new host; Bus latency > 200ms → caching layer; spawn queue depth > 5 → distribute to additional hosts. N20: Added Section 8.12A (Bus API Rate Limits) — default 100 req/min per actor, burst 200 req/min throttled, 1 SSE connection per actor per channel, 50 events/min per actor. |
| 5.0.3 | 2026-03-30 | GO | C4: Added Section 10.3A (PostgreSQL WAL Archiving) — continuous WAL shipping to B2, RPO reduced from 24h to 5 minutes, archive_mode/archive_command config, point-in-time restore procedure (7 steps), monitoring rules. Updated Section 10.6 RPO target to reflect WAL archiving. C5: Expanded Section 10.8 (Restore Drill Requirements) — monthly restore drill procedure with 6-step checklist (download, restore, verify tables, verify freshness, cleanup, record), success criteria, failure response, results recorded in state/restore_drills.yaml. |
| 5.0.4 | 2026-03-30 | GO | I13: Revised agent count estimate in Section 9.5A — realistic max 8-10 concurrent agents on CPX62 (30 GB RAM). Each Claude Code process ~300-500 MB, services baseline ~10 GB, 3-5 GB safety buffer. Previous higher estimates reflected smaller assumed agent footprints. |
| 5.0.5 | 2026-03-30 | GO | N8 addendum: Added two scaling triggers to Section 13.5A — concurrent agent count > 8 per host → evaluate second host; disk usage > 80% → archive old partitions. |