XIOPro Production Blueprint v5.0¶
Part 8 — Infrastructure & Deployment Architecture¶
1. Purpose¶
Defines the concrete infrastructure baseline required to run XIOPro as a headless-first, recoverable, secure, and provider-independent execution system.
This part specifies:
- runtime environments
- node roles
- service boundaries
- deployment topology
- network shape
- storage surfaces
- installation inventory
- scaling direction
- operational constraints
This document is not a cloud wishlist. It is the execution platform contract for XIOPro.
2. Infrastructure Thesis¶
Infrastructure must support all of the following simultaneously:
- continuous headless operation
- recoverable multi-agent execution
- explicit control-plane separation
- durable state persistence
- low-friction founder intervention
- provider-swappable model access
- future expansion without redesign
Infrastructure exists to make the architectural rules real.
3. Infrastructure Principles¶
3.1 Headless First¶
All critical execution must continue without UI.
The UI may observe and control, but must never become the only runtime path.
3.2 Durable State First¶
No important execution state may live only inside:
- a terminal tab
- a provider chat window
- a single container memory space
- an agent-local temp file
Durable state must land in authoritative storage surfaces.
3.3 Replaceability¶
The infrastructure must allow replacement of:
- model providers
- agent runtimes
- API gateway/router
- UI
- storage backends
- observability stack
without invalidating the XIOPro operating model.
3.4 Logical Separation Before Physical Separation¶
Even when colocated on one server initially, the following concerns must remain logically separated:
- control plane
- execution fabric
- governance
- data/state
- knowledge services
- ingress/API
- observability
- backup/recovery
3.5 Recovery Is Native¶
Infrastructure must assume:
- session crash
- provider disconnect
- container restart
- host reboot
- partial service outage
- network interruption
- founder disconnect
Recovery is not a future enhancement. It is a base requirement.
3.6 Security by Reduction¶
Prefer:
- private network paths
- minimum exposed ports
- minimum standing privileges
- minimum long-lived secrets
- explicit auditability
4. Canonical Environment Model¶
4.1 PRD -- Production Runtime¶
Primary live environment for XIOPro system operation.
Contains:
- orchestrator runtime
- governor runtime
- API/control services
- PostgreSQL
- scheduler/worker services
- LiteLLM router
- Ruflo swarm runtime
- knowledge service backends
- observability services
- backup jobs
4.2 TST -- Integration Validation¶
Used to validate:
- schema changes
- orchestration behavior
- recovery behavior
- deployment updates
- service compatibility
TST must be structurally similar to PRD, but can run with reduced scale.
4.3 DEV -- Builder / Experiment Zone¶
Used for:
- agent experiments
- rule iteration
- local service development
- migration rehearsal
- safe breakage
4.4 LOC -- Local Operator Node¶
Primary founder workstation environment.
Contains or may contain:
- RC-capable local execution surfaces
- local knowledge access
- local file operations
- CLI diagnostics
- fallback execution
- future local models
- operator utilities
LOC is not the production control plane, but it is an important resilience and intervention node.
5. Runtime Node Topology¶
5.1 Node A -- Cloud Control Node (Hetzner CPX62)¶
Primary always-on control and execution node.
Actual Hardware Specs (as of 2026-03-28)¶
| Spec | Value |
|---|---|
| Provider | Hetzner Cloud |
| Instance type | CPX62 (shared vCPU, AMD) |
| CPU | 16 vCPU AMD EPYC-Genoa |
| RAM | 30 GB |
| Storage | 150 GB SSD (NVMe) |
| OS | Ubuntu 24.04 LTS |
| Location | Hetzner EU |
Responsibilities:
- orchestrator control
- governance control
- API ingress
- work graph persistence
- scheduling
- runtime coordination
- background execution
- telemetry collection
5.2 Node B -- Local Operator Node (Mac Studio)¶
Connected via Tailscale VPN (encrypted mesh).
Responsibilities:
- founder interaction
- local CLI execution
- fallback RC-capable sessions
- local knowledge access
- manual validation
- future local inference experiments
5.3 Node C -- Future GPU / Model Node¶
Reserved for:
- self-hosted model serving
- heavier local inference
- embedding jobs
- batch processing
- specialized isolated workloads
5.4 Node D -- Future Product Runtime Node¶
Reserved for:
- STRUXIO product APIs
- customer-facing runtime isolation
- product workloads separated from XIOPro control plane
6. High-Level Infrastructure Overview¶
```mermaid
flowchart TD
    User[User / Local Operator Node] --> Ingress[Ingress / API Gateway]
    Ingress --> Control[Control Services]
    Control --> Orchestrator["Orchestrator"]
    Control --> Governor["Governor"]
    Orchestrator --> Ruflo[Ruflo Execution Fabric]
    Ruflo --> Surfaces[Execution Surfaces]
    Surfaces --> Providers[Model Providers / Local Models]
    Orchestrator --> DB[(PostgreSQL)]
    Governor --> DB
    Control --> DB
    Control --> Knowledge[Knowledge / Librarian Services]
    Control --> Telemetry[Logs / Metrics / Alerts]
    DB --> Backup[Backup & Recovery]
    Knowledge --> Backup
```
7. Service Architecture¶
7.1 Control Plane Services¶
Core services that maintain system state and coordination:
- API service
- orchestrator service
- governor service
- scheduler service
- worker/queue consumers
- RC/escalation broker
7.2 Execution Fabric Services¶
Services responsible for agent execution and provider interaction:
- Ruflo agent swarm engine
- LiteLLM router
- execution adapters
- CLI/runtime bridges
- provider connectors
7.3 Data and Knowledge Services¶
Authoritative storage and retrieval services:
- PostgreSQL
- knowledge/librarian service
- index refresh jobs
- document/asset storage references
7.4 Operational Services¶
Cross-cutting operations services:
- reverse proxy / ingress
- secrets delivery
- backup jobs
- log pipeline
- metrics exporter
- alert delivery
8. Canonical Service Inventory¶
8.1 Ingress / Reverse Proxy¶
Role:
- terminate TLS
- route inbound traffic
- expose minimal public surfaces
- forward requests to internal services
Examples:
- Caddy
- Traefik
- Nginx
8.2 API Service¶
Role:
- main entry point for UI and CLI
- authentication and authorization
- session/control endpoints
- work graph access
- human escalation endpoints
8.3 Orchestrator Service¶
Role:
- reads tickets/tasks/state
- assigns work
- selects execution path
- manages continuity
- coordinates domain/worker agents
8.4 Governor Service¶
Role:
- monitors cost, health, anomalies, and risk
- enforces policy actions
- raises alerts and intervention requests
- proposes optimization actions
8.5 Ruflo Runtime Service¶
Role:
- agent spawning
- sub-agent lifecycle management
- bounded multi-agent execution
- runtime coordination hooks
8.6 LiteLLM Router Service¶
Role:
- provider abstraction
- model routing
- fallback routing
- usage metering integration
- future local-model routing
8.7 Scheduler / Background Worker Service¶
Role:
- recurring jobs
- dream windows
- maintenance jobs
- index refresh
- backup execution
- telemetry rollups
8.8 PostgreSQL Service¶
Authoritative store for:
- ODM entities
- runtime state
- session state
- escalation state
- governance events
- cost records
- audit events
- control metadata
8.8.1 Connection Pooling¶
Connection pooling via PgBouncer or the built-in `pool_size` is recommended when agent count exceeds 15. Current Fastify pool: `{ max: 20 }`. Monitor with `GET /metrics` using the `struxio_db_pool_*` gauge family.
Rules:
- Below 15 agents: the Fastify built-in pool (`max: 20`) is sufficient
- At 15+ agents: evaluate PgBouncer in transaction-pooling mode as a sidecar to the PostgreSQL container
- Pool exhaustion events must be captured as governance alerts (warning level)
- `struxio_db_pool_active`, `struxio_db_pool_idle`, and `struxio_db_pool_waiting` gauges must be emitted to the observability stack
- PgBouncer configuration (if adopted) must be SOPS-encrypted and managed via the same secrets path as PostgreSQL credentials
8.9 Knowledge / Librarian Service¶
Role:
- ingest knowledge sources
- classify/index content
- maintain retrieval structures
- support render/export/query workflows
8.10 Object Storage / Backup Surface¶
Primary uses:
- database dumps
- snapshots
- compressed transcripts
- recovery packages
- exported artifacts
8.11 Observability Stack¶
Core outputs:
- logs
- metrics
- health state
- error events
- alert signals
- future traces
8.12 Module Portfolio Infrastructure Linkage¶
Purpose¶
Infrastructure must provide the real-world constraints and capabilities that make module portfolio governance credible.
The module steward can recommend and optimize modules only within an actual hosting envelope.
That means infrastructure must expose enough information for the portfolio layer to reason about:
- subscription-backed module access
- API-backed module access
- self-hosted module feasibility
- local vs cloud placement
- resource ceilings
- operational complexity
- fallback paths
8.12.1 Infrastructure Inputs Required by the Module Steward¶
Part 8 should provide the module steward with at least:
- available execution nodes
- node class and role
- approximate compute profile
- memory profile
- storage considerations
- network posture
- public vs private connectivity assumptions
- allowed runtime surfaces
- operational risk notes
- recovery and observability readiness
This is necessary so "recommended module" can mean: recommended and actually runnable.
8.12.2 Hosting Feasibility Principle¶
A module should not be marked portfolio-approved for self-hosted or local use unless there is a credible hosting profile for it.
A credible hosting profile must include at least:
- target environment
- resource assumptions
- deployment complexity notes
- security notes
- recovery notes
- observability notes
- fallback path if the hosting path fails
8.12.3 Local / Cloud / Hybrid Evaluation¶
The module steward should be able to evaluate candidate module options against at least these hosting classes:
- local Mac execution
- Hetzner primary control node
- future dedicated GPU/model node
- future isolated product runtime node
- hybrid cloud/provider access
Each class carries different tradeoffs in:
- quality
- stability
- trust
- latency
- bandwidth
- compute pressure
- operational complexity
8.12.4 Subscription and Surface Awareness¶
Infrastructure and module governance must stay aligned on where module access actually exists.
This includes awareness of:
- provider API access paths
- provider subscription-backed surfaces
- local CLI/runtime adapters
- routing-layer reachability
- fallback availability during provider failure
This prevents recommending modules that cannot actually be reached from the required runtime surface.
8.12.5 Optimization Telemetry Requirement¶
Infrastructure should preserve enough telemetry for portfolio optimization over time.
Useful telemetry includes:
- latency by module and task class
- error/failure rate by module and access path
- cost / usage by module
- retry rate by module
- fallback frequency
- node pressure when self-hosted or local
- bandwidth pressure where relevant
This allows the module steward to optimize with evidence, not intuition alone.
8.12.6 Adoption Rule¶
Infrastructure may support evaluation and comparison of new modules, subscriptions, and self-hosted options.
But infrastructure must not auto-adopt them.
Adoption still requires governed approval and a deliberate rollout decision.
8.12A Bus API Rate Limits¶
The Control Bus enforces rate limits to protect stability and ensure fair access across all actors. These limits are active in the current Bus implementation.
Default Limits¶
| Limit | Value | Notes |
|---|---|---|
| Default request rate | 100 req/min per actor | Warning logged at threshold; already implemented |
| Burst allowance | 200 req/min per actor | Allowed for short bursts; throttled (429) after sustained burst |
| SSE connections | 1 connection per actor per channel | Reconnect replaces the prior connection; no parallel SSE streams |
| Event emission rate | 50 events/min per actor | Applies to POST /events; excess events are queued or dropped with warning |
Rules¶
- Rate limits are applied per `actor_id`, not per IP or session.
- Burst capacity (200 req/min) is available for up to 30 seconds before throttling kicks in.
- Throttled requests receive HTTP 429 with a `Retry-After` header.
- Rate limit violations are logged as Bus warning events and are visible in the Dashboard alert feed.
- SSE reconnect on rate-limited channels retries after the backoff window (see Section 10.4 retry policy).
- These limits protect Bus and PostgreSQL from agent runaway — they are not negotiable per-actor.
Tuning Principle¶
Rate limits may be raised globally only if sustained Bus latency remains below 200ms after the increase. Individual actors may not self-negotiate higher limits — only the Governor may authorize a limit adjustment via a Bus configuration change.
8.13 Repository, Filesystem & Storage Layout¶
Purpose¶
XIOPro needs an explicit filesystem and repository model.
Without it, the system may have strong logic but weak operational discipline.
This section defines where source-of-truth assets live, how they are separated, and which storage surfaces are authoritative for which classes of data.
Principle¶
Not all data belongs in the same place.
XIOPro should separate:
- versioned source assets
- runtime state
- large artifacts
- backups
- local operator files
- experimental or temporary material
This prevents confusion between:
- what is canonical
- what is generated
- what is recoverable
- what is disposable
8.13.1 Canonical Storage Classes¶
Git Repositories¶
Use Git repositories for:
- source code
- blueprints
- rules
- skills
- activations
- prompt templates
- runbooks
- deployment definitions
- scripts
- configuration templates
Git is the human-readable and auditable source of truth for versioned text-based assets.
PostgreSQL¶
Use PostgreSQL for:
- ODM entities
- tickets
- tasks
- activities
- runtimes
- sessions
- escalations
- human decisions
- policy objects
- governance events
- cost/usage rollups
- scheduler state
- indexing metadata
PostgreSQL is the authoritative operational state store.
Object / Blob Storage¶
Use object storage for:
- transcript snapshots
- checkpoints
- recovery bundles
- exported artifacts
- large generated files
- retained log bundles
- research exports where size or format justifies it
Object storage is for durable large artifacts, not for the primary source of truth of structured runtime state.
Local Operator Filesystem¶
The local founder/operator node may hold:
- local clones of approved repos
- local working notes
- sandbox experiments
- review/export materials
- temporary staging files
- local tool caches
Local operator storage is useful, but it is not authoritative unless content is committed or ingested properly.
8.13.2 Recommended Repository Topology¶
For T1P, repository topology should align with the actual active STRUXIO repository family rather than a generic placeholder structure.
Canonical active repos:
- `struxio-os`
- `struxio-logic`
- `struxio-design`
- `struxio-app`
- `struxio-business`
- `struxio-knowledge`
A transitional repo may still exist for a limited period:
- `struxio-aibus`
Reference repos may also exist for research or inspiration, but they are not part of the canonical operating core.
struxio-os¶
Primary control-plane and operations repo.
Holds:
- infra
- state
- tickets
- deployment
- runbooks
- control-layer operational files
- bootstrap/update scripts
- ops-facing automation
struxio-logic¶
Primary cognition / behavior repo.
Holds:
- agents
- rules
- skills
- prompts
- logic-layer governance assets
- activation and protocol assets where appropriate
struxio-design¶
Primary architecture / blueprint / research repo.
Holds:
- blueprint parts
- architecture records
- system maps
- evolution notes
- product design
- PRDs
- research artifacts and synthesis outputs where text-first is appropriate
struxio-app¶
Primary product/application implementation repo.
Holds:
- app/runtime code
- APIs
- product-facing implementation
- product integration surfaces
- E2E test surfaces
struxio-business¶
Primary business / legal / finance / strategy repo.
Holds:
- business assets
- legal materials
- finance materials
- strategy
- brand and fundraising assets
struxio-knowledge¶
Primary knowledge / research / reference repo.
Holds:
- research artifacts
- curated reference material
- knowledge ledger assets
- synthesis outputs
- topic-indexed knowledge files
struxio-aibus (Transitional / Legacy)¶
Not a permanent first-class pillar.
Plan:
- identify still-valuable code or documents
- migrate what remains useful into canonical repos
- archive the repo once no longer operationally required
Rule¶
Part 8 repository topology must stay aligned with the canonical active repo family used by the work plan and migration model.
8.13.3 Filesystem Class Rules¶
Within any repo or managed storage surface, files should conceptually fall into these classes:
- `source`
- `generated`
- `runtime`
- `archive`
- `temp`
Source¶
Human-maintained canonical inputs.
Examples:
- code
- rules
- skills
- blueprints
- configs
- runbooks
Generated¶
System-produced durable outputs.
Examples:
- exports
- compiled artifacts
- evaluation reports
- generated documentation
- synthesized summaries
Generated assets should not silently replace source assets.
Runtime¶
Operationally live mutable state.
Examples:
- DB data
- active checkpoints
- session snapshots
- job state
Runtime state belongs in state stores, not committed source repos.
Archive¶
Longer-lived retained material not needed for active work.
Examples:
- retired reports
- older exports
- superseded bundles
- long-term retained incident artifacts
Temp¶
Disposable staging content.
Examples:
- scratch files
- transient downloads
- in-progress experiment outputs
- tool caches
Temp must never be treated as authoritative.
8.13.4 Authoritative Repo / State Rules¶
The system must be explicit about which surface is authoritative.
Rules:
- text assets -> authoritative in Git
- runtime operational state -> authoritative in PostgreSQL
- large artifacts / checkpoints / exports -> authoritative in object storage where applicable
- local machine files -> non-authoritative until committed or ingested
No agent should assume a local filesystem copy is canonical merely because it exists.
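The authority rules of Section 8.13.4 can be encoded as a simple lookup that agents consult before trusting a surface. The asset-class labels below are illustrative; the surface mapping itself comes from the rules above.

```python
# Which surface is canonical for each asset class (per Section 8.13.4).
# None means "non-authoritative until committed or ingested".
AUTHORITATIVE_SURFACE = {
    "text_asset": "git",                 # code, blueprints, rules, configs
    "runtime_state": "postgresql",       # sessions, tickets, governance events
    "large_artifact": "object_storage",  # checkpoints, exports, bundles
    "local_file": None,                  # local copies are never canonical as-is
}

def is_authoritative(asset_class: str, surface: str) -> bool:
    """True only if the rules name this surface as canonical for the class."""
    return AUTHORITATIVE_SURFACE.get(asset_class) == surface
```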
8.13.5 Research & Knowledge Storage Note¶
Research-related material may live across:
- Git-managed knowledge assets
- PostgreSQL metadata/indexing
- object storage exports
- local review workspaces
- Obsidian/NotebookLM connected surfaces
But the system must still preserve clear distinction between:
- raw source material
- curated knowledge
- generated derivative outputs
- scheduled research artifacts
8.14 Cost Telemetry & Attribution Pipeline¶
Purpose¶
Infrastructure must collect cost and usage signals from the moment an agent/runtime uses a module, and preserve them in a form that is:
- attributable
- queryable
- enforceable
- optimizable
This supports Part 3 cost propagation and Part 4/Part 7 runtime governance.
Principle¶
Cost must be captured both:
- during execution
- after execution
This requires a pipeline, not only a dashboard.
8.14.1 Collection Stages¶
Stage 1 -- Raw Usage Emission¶
Execution surfaces, routers, and adapters should emit raw usage events when work happens.
Typical sources:
- LiteLLM/router usage records
- provider API responses
- local runtime counters
- subscription-surface usage approximations where exact billing is delayed
- worker/task metadata
Stage 2 -- Activity Attribution¶
Raw usage must be attributed to the correct operational scope.
Minimum attribution targets:
- activity
- session
- agent runtime
- task
- ticket
- execution surface
- module/provider
- environment
Stage 3 -- Normalization¶
Usage must be normalized into comparable records.
Useful normalized fields include:
```yaml
cost_event:
  event_id: string
  timestamp: datetime
  activity_id: string|null
  session_id: string|null
  agent_runtime_id: string|null
  task_id: string|null
  ticket_id: string|null
  module_id: string|null
  provider: string|null
  access_path: string|null   # api | subscription | self_hosted | hybrid
  usage_units_in: float|null
  usage_units_out: float|null
  estimated_cost: float|null
  billed_cost: float|null
  currency: string|null
  latency_ms: int|null
  retries: int|null
  node_id: string|null
  notes: string|null
```
Stage 4 -- Rollup¶
Rollups should aggregate by at least:
- activity
- task
- ticket
- project
- module/provider
- access path
- runtime surface
- day / week / month
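A minimal Stage 4 sketch: rolling normalized `cost_event` records up by module and day. Field names follow the Stage 3 schema; the rollup record shape itself is an assumption.

```python
from collections import defaultdict
from datetime import datetime

def rollup_by_module_day(events: list[dict]) -> dict[tuple[str, str], dict]:
    """Aggregate cost events into (module_id, ISO day) buckets."""
    rollups: dict[tuple[str, str], dict] = defaultdict(
        lambda: {"estimated_cost": 0.0, "events": 0, "retries": 0}
    )
    for ev in events:
        day = datetime.fromisoformat(ev["timestamp"]).date().isoformat()
        key = (ev.get("module_id") or "unknown", day)
        r = rollups[key]
        r["estimated_cost"] += ev.get("estimated_cost") or 0.0
        r["events"] += 1
        r["retries"] += ev.get("retries") or 0
    return dict(rollups)
```

The same shape generalizes to the other required axes (task, ticket, access path, runtime surface) by swapping the key function.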
Stage 5 -- Governance Consumption¶
Rollups and anomaly signals should feed:
- the governor
- breaker policies
- budget policies
- module steward optimization analysis
- reporting/UI layers later
8.14.2 Collection Requirements by Access Type¶
API-Based Module Use¶
Preferred collection source:
- router/provider response metadata
- request/response usage counters
- billing approximation tables
- later reconciliation with actual billed usage where available
Subscription-Based Module Use¶
Exact billing detail may be weaker or delayed.
Minimum requirement:
- record which runtime used which subscription-backed surface
- approximate scope and intensity of use
- preserve task/runtime attribution
- support strategic optimization even when exact per-call pricing is unavailable
Self-Hosted Module Use¶
Collect at least:
- runtime used
- node used
- time consumed
- compute/memory pressure
- queue/wait cost proxy
- power/capacity proxy where useful later
Self-hosted cost is not zero just because no API bill exists.
8.14.3 Storage Rule¶
Cost telemetry should be stored in PostgreSQL as normalized operational records and rollups.
Large raw logs may additionally land in log/object storage, but authoritative attribution must remain queryable from the operational store.
8.14.4 Validation Rule¶
A task is not considered fully cost-observable unless XIOPro can answer at least:
- which module(s) were used
- by which runtime/surface
- for which task/ticket
- with what estimated or billed cost signal
- with what latency/retry profile
If this cannot be answered, cost governance is incomplete.
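The validation rule above can be expressed as a check over a task's cost events. Field names follow the Stage 3 `cost_event` schema; which fields count as "required" here is a reasonable reading of the four questions, not a fixed spec.

```python
# A task is cost-observable only if every event answers: which module,
# via which access path/surface, for which task, with what latency profile,
# and carries at least one cost signal (estimated or billed).
REQUIRED = ("module_id", "access_path", "task_id", "latency_ms")

def is_cost_observable(events: list[dict]) -> bool:
    if not events:
        return False  # no events means no attribution at all
    for ev in events:
        if any(ev.get(field) is None for field in REQUIRED):
            return False
        if ev.get("estimated_cost") is None and ev.get("billed_cost") is None:
            return False
    return True
```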
8.14.5 Final Rule¶
Cost is not "a later finance report".
It is a live infrastructure signal that must be captured at execution time and preserved for both governance and optimization.
9. Deployment Model¶
9.1 Initial T1P Deployment¶
Initial production baseline:
- single Hetzner CPX62 primary node
- Docker Compose or equivalent simple orchestrator
- all core XIOPro services colocated
- strict logical separation between services
- reverse proxy in front
- PostgreSQL persistent volume
- scheduled backup jobs
- private admin access only
This is acceptable because the current need is:
- founder-scale operation
- rapid iteration
- recoverability
- low complexity
It is not acceptable to let "single-node MVP" become "undefined production."
9.2 Initial Container Groups¶
Recommended initial groups:
- `ingress`
- `api`
- `orchestrator`
- `governor`
- `ruflo`
- `litellm`
- `scheduler`
- `workers`
- `postgres`
- `knowledge`
- `telemetry`
- `backup`
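The container groups could be expressed as a Compose skeleton like the following. Image names, ports, and network names are illustrative assumptions, not mandated choices, and only three representative groups are shown; the point is the logical separation (only `ingress` publicly exposed, PostgreSQL on a private network with a persistent volume).

```yaml
services:
  ingress:
    image: caddy:2
    ports: ["80:80", "443:443"]   # the only publicly exposed surface
    networks: [edge, internal]
  api:
    build: ./api
    networks: [edge, internal]    # reachable only through ingress
  postgres:
    image: postgres:16
    volumes: ["pgdata:/var/lib/postgresql/data"]  # persistent volume
    networks: [internal]          # never exposed publicly
  # ...remaining groups (orchestrator, governor, ruflo, litellm, scheduler,
  # workers, knowledge, telemetry, backup) follow the same pattern.
networks:
  edge: {}
  internal:
    internal: true                # no public routing for internal-only services
volumes:
  pgdata: {}
```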
9.3 Scale-Out Direction¶
When required, scale along these lines:
- split ingress/API from control services
- split PostgreSQL onto stronger isolated storage node
- split worker/runtime services from control node
- add dedicated GPU/model node
- isolate product runtime from XIOPro runtime
9.4 Non-Goals for Initial Phase¶
Do not introduce yet unless proven necessary:
- Kubernetes
- distributed queue complexity beyond real need
- service mesh
- heavy graph infrastructure
- multi-region architecture
- premature HA theater
These may become valid later, but are not required for T1P execution readiness.
9.5 Initial Hardware Baseline¶
9.5.1 Node A -- Hetzner CPX62 (Actual Specs)¶
The current production server is a Hetzner CPX62:
| Spec | Value |
|---|---|
| CPU | 16 vCPU AMD EPYC-Genoa (shared) |
| RAM | 30 GB |
| Storage | 150 GB SSD (NVMe) |
| OS | Ubuntu 24.04 LTS |
| Docker | Docker Engine 29.2.1, Docker Compose |
| Network | Public IPv4, Tailscale VPN overlay |
| Python | 3.12.3 |
| Node.js | 20.20.1 |
Practical Sizing Principle¶
The initial node must be sized for control-plane reliability first, not for speculative future self-hosted model serving.
That means it must comfortably support:
- orchestrator service
- governor service
- PostgreSQL
- API / ingress
- Ruflo
- LiteLLM
- scheduler / workers
- observability
- backup jobs
without sustained resource contention.
Initial Recommendation Logic¶
Choose a Hetzner class that prioritizes:
- CPU consistency
- RAM headroom
- fast NVMe/SSD
- stable Linux support
- easy vertical upgrade path
Do not size Node A around local-model aspirations. If self-hosted inference becomes real, it belongs on Node C.
9.5.2 Node B -- Local Operator Node (Mac Studio)¶
Current role:
- founder interaction
- RC-capable local sessions
- local CLI operations
- local validation
- local knowledge work
- fallback execution
Connected via Tailscale VPN (encrypted mesh, Hetzner <-> Mac).
Recommended baseline:
- stable workstation environment
- local CLI toolchain
- secure admin access to Node A
- local backup for critical operator-side configs
- optional local container tooling for test/fallback
9.5.3 Node C -- Future GPU / Self-Hosted Model Node¶
This node is optional and deferred.
It becomes justified only when one or more conditions are true:
- self-hosted models materially improve privacy
- unit economics justify dedicated inference
- batch embedding/index workloads become heavy
- provider dependence becomes strategically limiting
- offline or degraded-network resilience becomes important
Until then, Node C remains a reserved architectural slot, not an implementation obligation.
9.5A Container Memory Budget (CPX62 -- 30 GB)¶
With the CPX62 at 30 GB RAM, the memory budget after retirement of stale services is:
| Category | Estimated RAM | Notes |
|---|---|---|
| Docker containers (current, post-retirement) | ~2.25 GB | 10 containers after retiring devxio-frontend, devxio-bridge, devxio-librarian, graph_stack_neo4j (Neo4j deprecated -- both instances removed) |
| Agent processes (orchestrator + 2 brains typical) | ~2-3 GB | Claude Code sessions via Max20 |
| System / OS | ~2 GB | Ubuntu 24.04, systemd, journald, etc. |
| Available headroom | ~22-24 GB | |
| New XIOPro backend + UI (budget) | 4-6 GB | FastAPI backend, Next.js UI, workers |
| Remaining free | ~16-20 GB | Comfortable margin for spikes |
This gives substantial headroom for the new XIOPro services. The CPX62 is not a constraint for T1P.
Realistic Concurrent Agent Estimate¶
Each Claude Code agent process consumes approximately 300-500 MB of RAM. With the CPX62's 30 GB:
| Component | Estimated RAM |
|---|---|
| Services baseline (13 containers) | ~10 GB |
| System / OS | ~2 GB |
| Available for agents | ~18-20 GB |
| Agent process (each) | ~300-500 MB |
| Realistic concurrent agents | 8-10 (at ~500 MB each, with ~3-5 GB buffer for spikes) |
The realistic maximum is 8-10 concurrent agents on the current CPX62. This accounts for:
- Worst-case agent memory (~500 MB each)
- A 3-5 GB safety buffer for memory spikes, background jobs, and transient allocations
- The 85% RAM utilization hard limit from Part 1, Section 4.10 (no agent spawning above 85%)
Previous estimates of higher agent counts assumed smaller agent footprints. This revised estimate reflects observed Claude Code process sizes in production.
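The spawn constraint behind these numbers can be stated directly: a new agent may be spawned only if its worst-case footprint still fits under the 85% RAM hard limit from Part 1, Section 4.10. The constants mirror the CPX62 estimates above; the function shape is an assumption.

```python
TOTAL_RAM_GB = 30.0        # CPX62
HARD_LIMIT = 0.85          # no agent spawning above 85% utilization
AGENT_WORST_CASE_GB = 0.5  # observed Claude Code process ceiling (~500 MB)

def can_spawn_agent(used_gb: float) -> bool:
    """Allow a spawn only if worst-case post-spawn usage stays under the cap."""
    projected = used_gb + AGENT_WORST_CASE_GB
    return projected / TOTAL_RAM_GB <= HARD_LIMIT
```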
Budget Rule¶
If total container memory exceeds 15 GB sustained, investigate:
- which containers can be retired or consolidated
- whether any service is leaking memory
- whether workload should move to a separate node
See resources/SERVICE_FATE_MAP_v4_2.md for the full current-to-target service transition plan.
9.6 Installation Bill of Materials (T1P)¶
9.6.1 Host-Level Baseline¶
Node A should install and configure:
- Ubuntu LTS base OS
- Docker Engine
- Docker Compose or equivalent simple orchestrator
- UFW or nftables firewall
- Tailscale or equivalent secure overlay
- SSH server with key-only auth
- fail2ban if SSH remains publicly reachable
- log rotation baseline
- backup scripting/runtime support
- system time sync
- unattended or managed security update strategy
9.6.2 Core XIOPro Service Set¶
Initial service set:
- ingress / reverse proxy
- API service
- orchestrator service
- governor service
- Ruflo runtime service
- LiteLLM router service
- scheduler service
- worker service(s)
- PostgreSQL service
- knowledge / librarian service
- telemetry / monitoring service(s)
- backup service / scheduled jobs
9.6.3 Supporting Operational Components¶
Recommended supporting components:
- TLS certificate automation
- environment/secrets injection mechanism
- deployment scripts / make targets / runbooks
- uv-based Python version/dependency/tool management for Python services and scripts
- backup restore scripts
- database migration runner
- health-check endpoints
- metrics exporter(s)
- alert delivery integration
9.6.4 Deferred / Optional Components¶
Do not install for T1P unless clearly justified:
- Kubernetes
- service mesh
- heavy queue infrastructure
- dedicated tracing stack if basic telemetry is enough
- vector/graph infrastructure without proven usage
- GPU inference stack on Node A
9.6A CLI Toolchain¶
XIOPro follows a CLI-first principle: prefer CLI tools over MCP wrappers where both exist. CLI pipelines are faster, more composable, and more debuggable.
See resources/CLI_TOOLS_ASSESSMENT.md for the full assessment with install instructions.
See resources/DESIGN_cli_services.md for the config-driven CLI services framework design (operational commands executable via Bus API or devxio CLI, including DNS management via Porkbun API and infrastructure management via Hetzner hcloud CLI).
Already Installed¶
| Tool | Version | Purpose |
|---|---|---|
| tmux | 3.4 | Terminal multiplexer |
| ripgrep (rg) | 14.1.1 | Fast code/text search |
Must-Have (install in Phase 0)¶
| Tool | Purpose | Install |
|---|---|---|
| gh | GitHub CLI -- PR, issue, Actions automation | Official apt repo |
| jq | JSON processor -- API response parsing, config manipulation | apt install jq |
| uv | Python package manager -- 10-100x faster than pip, replaces pip+venv+pyenv | curl installer |
| fzf | Fuzzy finder -- history search, file navigation, pipeline glue | apt install fzf |
| fd | Fast find -- file discovery, respects .gitignore | apt install fd-find |
| yq | YAML processor -- state file manipulation, Docker Compose queries | wget binary |
| direnv | Per-directory env vars -- project isolation, agent env scoping | apt install direnv |
| hcloud | Hetzner Cloud CLI -- server, network, firewall management | Official apt repo |
Nice-to-Have (install when convenient)¶
| Tool | Purpose |
|---|---|
| bat | Syntax-highlighted file viewing |
| delta | Better git diffs |
| lazygit | Visual git TUI |
| xh | Friendlier HTTP client |
| dust | Visual disk usage |
| btm (bottom) | Visual system monitor |
| llm (Simon Willison) | Ad-hoc LLM queries from terminal |
Skip¶
| Tool | Reason |
|---|---|
| aider | Overlaps with Claude Code |
| aichat | Overlaps with Claude Code |
| jj (jujutsu) | Evaluate later; needs Rust toolchain |
Install Script¶
A bootstrap script is provided in resources/CLI_TOOLS_ASSESSMENT.md, Section "Recommended Install Script".
Cost: zero (all tools are free and open-source). Disk: under 200 MB total.
9.7 Network Exposure Matrix¶
9.7.1 Principle¶
Every port and entry point must have an owner and justification.
No service should be reachable from the public internet unless:
- it is operationally required
- it is protected
- it is documented
9.7.2 Publicly Exposed Surfaces¶
Allowed public exposure should normally be limited to:
- HTTPS ingress endpoint
- optional HTTP -> HTTPS redirect endpoint
Public exposure should not directly include:
- PostgreSQL
- internal runtime adapters
- scheduler
- worker services
- observability admin surfaces
- raw agent runtimes
9.7.3 Private / Overlay-Only Surfaces¶
Prefer private-only access for:
- SSH administration
- database administration
- internal dashboards
- recovery tooling
- deployment control
- backup administration
- founder/operator maintenance access
This is where Tailscale or equivalent is strongly preferred.
9.7.4 Internal Service Communication¶
Internal services should communicate over:
- private Docker network(s)
- host-local interfaces where practical
- explicit service credentials
- service-to-service allow rules
The infrastructure should avoid a "flat trust" model.
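As one hedged sketch of this shape (service, image, and network names are illustrative, not the real XIOPro compose file), a Docker Compose fragment that keeps PostgreSQL and internal services off the public interface:

```yaml
# Illustrative compose fragment: only the ingress publishes ports;
# everything else talks over a private bridge network.
networks:
  xiopro_internal:
    internal: true              # no public routing from this network
  edge: {}

services:
  caddy:
    image: caddy:2
    ports: ["443:443", "80:80"] # the only publicly exposed service
    networks: [edge, xiopro_internal]
  api:
    image: xiopro/api:latest    # hypothetical image name
    networks: [xiopro_internal] # reachable only from other services
  postgres:
    image: postgres:16
    networks: [xiopro_internal] # no "ports:" entry -> never public
```

Per-service credentials and allow rules still apply on top of the network split; the private network is the floor, not the whole model.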
9.8 Domain / DNS / Surface Allocation¶
9.8.1 Principle¶
Surface naming should reflect service boundaries, not historical accidents.
Recommended pattern:
- main XIOPro control surface
- optional API subdomain
- optional RC/escalation subdomain
- optional knowledge subdomain
- optional product/runtime subdomains later
9.8.2 T1P Surface Recommendation¶
For T1P, it is acceptable to expose only one or two public surfaces:
- primary XIOPro control endpoint
- optional API endpoint if separation is useful
Everything else may remain internal/private until needed.
This keeps complexity, certificate handling, and attack surface lower.
9.8.3 DNS Records (Active as of 2026-03-29)¶
Domain registrar: Porkbun. DNS managed via Porkbun.
| Record | Type | Value | Purpose |
|---|---|---|---|
| bus.struxio.ai | A | 89.167.96.154 | Control Bus REST + MCP API |
| dashboard.struxio.ai | A | 89.167.96.154 | Control Center UI |
| paperclip.struxio.ai | A | 89.167.96.154 | Paperclip issue tracker |
| tickets.struxio.ai | A | 89.167.96.154 | Ticket management surface |
| chat.struxio.ai | A | 89.167.96.154 | Open WebUI chat interface |
| *.struxio.ai | CNAME | pixie.porkbun.com | Wildcard — covers all subdomains not listed above |
Note: The wildcard CNAME means devxio.struxio.ai (and any other unlisted subdomain) resolves automatically via *.struxio.ai. Caddy just needs a site block to serve it.
Explicit A records take precedence over the wildcard CNAME for the five listed subdomains.
All public-facing subdomains are reverse-proxied through Caddy with automatic TLS (Let's Encrypt).
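A minimal Caddyfile sketch of this pattern (hostnames from the table above; the upstream ports are illustrative placeholders, not the real service bindings):

```
# Caddy terminates TLS (Let's Encrypt) and reverse-proxies to internal services.
bus.struxio.ai {
    reverse_proxy localhost:8100       # illustrative upstream port
}
dashboard.struxio.ai {
    reverse_proxy localhost:8200       # illustrative upstream port
}
# Unlisted subdomains resolve via the wildcard CNAME, but Caddy serves
# them only when a matching site block exists.
```

Caddy obtains and renews certificates automatically for each site block, so adding a surface is one block plus a DNS entry (or the wildcard).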
9.9 Access Path Matrix¶
9.9.1 Founder Admin Path¶
Used for:
- infrastructure administration
- recovery
- deployment
- secrets handling
- emergency intervention
Preferred path:
- private overlay network
- key-based auth
- auditable commands
9.9.2 System Service Path¶
Used for:
- service-to-service calls
- scheduled jobs
- DB access by approved services
- runtime adapter communication
Requirements:
- scoped credentials
- least privilege
- revocable access
- auditable configuration
9.9.3 Agent Runtime Path¶
Used for:
- execution requests
- provider/model calls
- artifact production
- bounded interaction with control/data services
Restrictions:
- no broad infrastructure admin rights
- no unrestricted DB access
- no unrestricted secrets access
- only approved tools/endpoints
9.9.4 Service Placement Matrix¶
Principle¶
Every service must have a default execution home.
This avoids accidental sprawl, unclear ownership, and unnecessary cross-node complexity.
T1P Recommended Placement¶
Node A -- Cloud Control Node (Hetzner CPX62)¶
Node A should host the initial authoritative platform baseline:
- ingress / reverse proxy
- API service
- orchestrator service
- governor service
- Ruflo runtime service
- LiteLLM router service
- scheduler service
- core worker service(s)
- PostgreSQL service
- librarian / knowledge service
- telemetry / monitoring baseline
- backup job runner
- deployment / migration runner
Node B -- Local Operator Node (Mac Studio)¶
Node B is the founder-operated local execution and intervention node.
It may host:
- local CLI surfaces
- RC-capable local sessions
- local validation tooling
- emergency operator tools
- local knowledge access
- safe sandbox experiments
- optional local container tooling for test/fallback
Node B must not be treated as the authoritative production control plane.
Node C -- Future GPU / Model Node¶
Node C is optional and deferred.
If added later, it should host only specialized higher-weight workloads such as:
- self-hosted model runtimes
- embedding or indexing jobs
- heavier background processing
- isolated experimental inference services
- other compute-intensive workloads that should not burden Node A
Node C should not be required for initial correctness.
Node D -- Future Product Runtime Node¶
Node D is optional and deferred.
If introduced later, it should host:
- STRUXIO product APIs
- customer-facing runtime services
- product-specific workloads isolated from XIOPro control-plane services
Node D exists to preserve separation between XIOPro internal operations and future product runtime responsibilities.
Rule¶
If a service has no explicit placement decision, it defaults to Node A for T1P.
9.9.5 Interface / Port Exposure Classes¶
Principle¶
T1P does not require a full port catalog yet, but it does require deterministic exposure classes.
Every interface must belong to one of the following classes.
Class A -- Public Internet Facing¶
Allowed only when operationally justified.
Typical examples:
- HTTPS ingress endpoint
- optional HTTP redirect endpoint
Requirements:
- protected by reverse proxy
- TLS enabled
- documented owner
- monitored
- minimal surface only
Class B -- Private Overlay Only¶
Accessible only through Tailscale or equivalent secure overlay.
Typical examples:
- SSH administration
- internal dashboard access
- deployment controls
- recovery tooling
- admin-only APIs
Requirements:
- key-based or equivalent strong auth
- operator-only access
- auditable usage
Class C -- Internal Service Network Only¶
Never publicly exposed.
Typical examples:
- PostgreSQL
- scheduler
- worker coordination
- Librarian internal interfaces
- telemetry collectors
- service-to-service APIs
Requirements:
- private Docker/network namespace or host-local isolation
- explicit service identity
- least-privilege credentials
Class D -- Localhost / Node-Local Only¶
Only reachable on the owning node.
Typical examples:
- migration runners
- emergency maintenance helpers
- temporary admin endpoints
- local-only debug utilities
Requirements:
- disabled by default unless needed
- never exposed externally by accident
Final Rule¶
No interface may exist without:
- exposure class
- owning service
- access method
- justification
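One way to make the Final Rule auditable is a small interface registry kept under version control; a hedged sketch (the file name and field names are illustrative, not an existing artifact):

```yaml
# interfaces.yaml (illustrative) -- every interface declares its exposure
# class, owning service, access method, and justification.
interfaces:
  - name: https_ingress
    class: A                  # public internet facing
    owner: caddy
    access: "TLS via reverse proxy"
    justification: "primary control surface"
  - name: postgres
    class: C                  # internal service network only
    owner: postgresql
    access: "private Docker network, scoped credentials"
    justification: "authoritative state store"
  - name: ssh_admin
    class: B                  # private overlay only
    owner: host
    access: "Tailscale + key-based auth"
    justification: "founder admin path"
```

An interface missing from the registry is, by definition, in violation of the Final Rule.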
9.9.6 Secrets Ownership and Injection Rules¶
Principle¶
Secrets must be scoped by role, not shared broadly across the platform.
Secret Classes¶
Examples of secret classes include:
- provider API credentials
- router/provider integration secrets
- database credentials
- session signing/application secrets
- backup/storage credentials
- deployment credentials
- notification/integration secrets
SOPS + age Secrets Encryption¶
Secrets are encrypted at rest using SOPS + age.
| Component | Details |
|---|---|
| Encryption tool | SOPS (Secrets OPerationS) |
| Key backend | age (modern file encryption) |
| Key location | ~/age-key.txt on Node A |
| Encrypted files | .sops.yaml configs, encrypted env files |
SOPS + age provides:
- encryption at rest for all secret files in Git and on disk
- per-file or per-key encryption granularity
- Git-friendly encrypted diffs (only values are encrypted, keys are visible)
- no external key management service required (age key is file-based)
- simple rotation: re-encrypt with new age key
Ownership Rules¶
Founder / Operator Only¶
The founder or emergency operator path may control:
- root infrastructure credentials
- overlay administration
- DNS/domain credentials
- emergency recovery credentials
- secret issuance / rotation authority
- age key management
Platform Services¶
Approved control-plane services may receive only the secrets they require.
Examples:
- API service -> app/session secrets, scoped DB access
- orchestrator / governor -> scoped platform secrets only where operationally necessary
- LiteLLM/router -> provider credentials required for routing
- backup service -> backup target credentials
Agent Runtimes¶
Agent runtimes must not receive broad secret visibility.
They should only receive:
- task-scoped credentials
- provider access via approved broker/router path
- temporary credentials where justified
They must not receive:
- unrestricted production DB credentials
- infrastructure root credentials
- blanket secret bundles
Injection Rules¶
Approved methods for T1P:
- environment injection at container/service start
- mounted secret files with restricted permissions
- managed secret loading wrapper
- SOPS-decrypted values injected at deploy time
Not allowed:
- plaintext secrets in Git
- plaintext secrets in blueprint docs
- secrets embedded in tickets
- secrets stored in general application tables unless explicitly encrypted and justified
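A hedged command sketch of the SOPS-decrypted deploy-time path (file names are illustrative; `sops exec-env` decrypts into the child process environment only, so plaintext never lands on disk):

```shell
# Encrypt an env file for the repo (recipient is the Node A age public key)
sops --encrypt --age age1examplepublickey secrets.env > secrets.enc.env

# Inject decrypted values only into the deploy command's environment
sops exec-env secrets.enc.env 'docker compose up -d'
```

This keeps the encrypted file committable while the decrypted values exist only for the lifetime of the deploy command.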
Rotation Rule¶
Any secret class that can affect:
- provider spend
- production data
- recovery access
- external exposure
must be rotatable without redesigning the platform.
9.9.7 Environment Separation Rules¶
Principle¶
T1P must distinguish clearly between:
- local/dev
- production cloud
- recovery/emergency operation
Local / Dev Environment¶
Local/dev may be less durable, but must not silently share production authority.
Rules:
- no default reuse of production secrets
- no default connection to production database
- no hidden dependency on founder machine availability
- safe to destroy and recreate
Production Cloud Environment¶
Production cloud is the authoritative execution environment.
Rules:
- persistent state lives here
- scheduled automation lives here
- recovery baseline is validated here
- headless execution must function without local GUI dependency
Recovery / Emergency Path¶
Recovery path must exist even if the main control surface is unavailable.
Minimum expectation:
- private overlay access works
- key administrative commands are documented
- restore path is tested
- one founder/operator path remains usable during failure scenarios
Final Rule¶
No environment may depend on undocumented manual steps for core recovery, restart, or access.
9.10 T1P Deployment Acceptance Checklist¶
A T1P infrastructure deployment is not accepted unless all are true:
- Node A can reboot and recover services predictably
- PostgreSQL persistence is verified
- backup job runs successfully
- restore procedure is documented
- HTTPS ingress works
- non-essential public ports are closed
- Tailscale/private admin path works
- orchestrator, governor, API, PostgreSQL, Ruflo, LiteLLM, scheduler, and backup jobs are observable
- one task can run end-to-end headlessly
- one interruption/restart scenario has been tested
9.10.1 Final Rule¶
If an infrastructure component is installed, it must satisfy one of these:
- needed now for headless execution
- needed now for recovery/security/observability
- required to prevent near-term rework
Otherwise, defer it.
9.11 Bootstrap, Startup & Controlled Update Lifecycle¶
9.11.1 Purpose¶
XIOPro must be able to start, restart, update, and recover deliberately.
A serious headless system cannot depend on "manual remembering" to become operational after:
- host reboot
- deployment change
- schema update
- service crash
- secret rotation
- version rollout
This section defines the minimum controlled lifecycle.
9.11.2 Bootstrap Principle¶
Bootstrap must be:
- scripted
- repeatable
- environment-aware
- observable
- rollback-conscious
If startup requires undocumented manual steps, bootstrap is incomplete.
9.11.2A Python Environment Standard¶
Python-based XIOPro services and scripts should use uv as the default tooling layer for:
- Python version management
- environment creation
- dependency sync
- lockfile-driven reproducibility
- tool and script execution during bootstrap/update
Expected standards where applicable:
- `pyproject.toml`
- `uv.lock`
- `.python-version`
Rule¶
Bootstrap and update automation for Python services should prefer uv-based workflows over ad hoc pip/venv handling.
The goal is:
- faster environment setup
- reproducible sync across Mac and Hetzner
- cleaner CI/deploy behavior
- fewer environment drift problems
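A hedged example of the preferred uv workflow during bootstrap (the service module name is a placeholder):

```shell
# Install the interpreter version declared in .python-version
uv python install

# Sync the environment exactly from uv.lock (fails if the lock is stale)
uv sync --frozen

# Run the service inside the managed environment
uv run python -m xiopro_service   # placeholder module name
```

The same three commands behave identically on the Mac and on Hetzner, which is the reproducibility point above.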
9.11.3 Cold Start Sequence¶
A first-time or rebuilt environment should follow this order:
- host baseline ready
- network/security baseline ready
- secrets delivery path ready
- storage surfaces reachable
- PostgreSQL initialized or restored
- schema migrations applied
- core control services started
- scheduler/background jobs started
- knowledge/index refresh checks run
- observability/health checks confirmed
- workload admission opened
Rule¶
The system should not accept normal execution until foundational dependencies pass health gates.
9.11.4 Warm Restart Sequence¶
For ordinary reboot or redeploy:
- preserve or verify durable state
- restart PostgreSQL and storage dependencies
- restart control services
- rebind runtime/scheduler state
- verify pending sessions / checkpoints
- verify alerting and telemetry
- reopen execution intake
Warm restart should prefer continuity over full rebuild.
9.11.5 Controlled Update Flow¶
Every significant update should support:
- planned target version
- preflight validation
- backup / snapshot before change
- migration step if needed
- health verification after rollout
- rollback path if checks fail
Minimum stages:
controlled_update_flow:
- preflight
- snapshot
- deploy
- migrate
- verify
- reopen
- rollback_if_needed
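The stages above can be sketched as a small driver; a minimal Python sketch, assuming stage implementations are injected by the deploy tooling (names mirror the `controlled_update_flow` list):

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class UpdateFlow:
    # Stage implementations are injected; each returns True on success.
    stages: Dict[str, Callable[[], bool]]
    log: List[Tuple[str, bool]] = field(default_factory=list)

    def run(self) -> bool:
        for name in ("preflight", "snapshot", "deploy", "migrate", "verify"):
            ok = self.stages[name]()
            self.log.append((name, ok))
            if not ok:
                # Any failed stage rolls back before intake reopens.
                self.stages["rollback_if_needed"]()
                self.log.append(("rollback_if_needed", True))
                return False
        self.stages["reopen"]()
        self.log.append(("reopen", True))
        return True
```

The ordering matters: snapshot precedes deploy so rollback always has a restore point, and reopen only runs after verify passes.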
9.11.6 Preflight Checks¶
Before deployment or upgrade, the system should check at least:
- target environment identity
- available disk/RAM headroom
- secrets availability
- database reachability
- migration compatibility
- backup readiness
- current health baseline
- operator approval where required
9.11.7 Health Gates¶
Startup/update should define health gates for at least:
- PostgreSQL
- API service
- orchestrator
- governor
- scheduler
- LiteLLM/router path
- Ruflo/runtime path
- backup jobs
- telemetry/alerts
If health gates fail, the system should remain in degraded or closed admission mode until reviewed.
9.11.8 Runtime Admission Control¶
After bootstrap or update, XIOPro should reopen work in controlled order.
Suggested order:
- read-only status visibility
- manual/operator access
- scheduler and maintenance jobs
- controlled task execution
- full execution intake
This prevents unstable startup from immediately turning into unstable work.
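A minimal sketch of this gating, assuming admission advances one stage per healthy check cycle (stage labels are illustrative):

```python
from typing import List, Optional

# Suggested reopen order from this section.
ADMISSION_STAGES: List[str] = [
    "read_only_status",
    "operator_access",
    "scheduler_and_maintenance",
    "controlled_task_execution",
    "full_intake",
]

def next_admission_stage(current: Optional[str], healthy: bool) -> Optional[str]:
    """Advance one admission stage per healthy cycle; hold on degraded health."""
    if not healthy:
        return current  # never widen admission while health gates fail
    if current is None:
        return ADMISSION_STAGES[0]
    i = ADMISSION_STAGES.index(current)
    return ADMISSION_STAGES[min(i + 1, len(ADMISSION_STAGES) - 1)]
```

The key property is that a failing health gate freezes admission where it is rather than rolling it forward.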
9.11.9 Version & Migration Discipline¶
The platform must keep clear record of:
- deployed service versions
- DB migration level
- blueprint/runtime compatibility notes
- last successful deployment
- last successful restore drill
- pending upgrade blockers
This is necessary for recovery and auditability.
9.11.10 Self-Restart vs Self-Mutation Rule¶
XIOPro should be able to:
- restart services
- rebind sessions
- resume controlled execution
- propose updates
- assist in rollout preparation
But it must not silently self-mutate production behavior without governed approval.
Self-recovery is allowed. Unapproved self-redefinition is not.
9.11.11 Success Criteria¶
Bootstrap and update discipline is successful when:
- a host can reboot without operational chaos
- a fresh node can be built from runbooks/scripts
- deployments are repeatable
- migrations are not guesswork
- rollback is realistic
- post-update health is explicit before execution resumes
9.11.12 Orchestrator Launch Commands¶
XIOPro orchestrator surfaces are launched via the devxio CLI command:
| Command | Surface | Host | Effect |
|---|---|---|---|
| `devxio go` or GO | Global Orchestrator | Hetzner | Starts the primary 24x7 orchestrator session. Reads CLAUDE.md, memory files, plan.yaml, and resumes execution. |
| `devxio mo` or MO | Mac Orchestrator | Mac Studio | Starts the Mac-local orchestrator. Handles Mac tasks, browser testing, local experiments. Reports to GO via Control Bus. |
Both surfaces can run simultaneously. GO is always the primary. See Part 4, Section 4.1A for the full naming convention and rules.
10. Backup & Recovery¶
10.1 Principle¶
Recovery is not a future enhancement. It is a required runtime property.
XIOPro must be able to recover from:
- node failure
- process crash
- session loss
- database corruption
- bad deployment
- accidental deletion
- provider-side disruption
- operator error
10.2 Backup Scope¶
All critical persistence surfaces must be covered.
10.2.1 PostgreSQL¶
Must back up:
- ODM entities
- tickets
- tasks
- activities
- runtimes
- sessions
- escalation requests
- human decisions
- governance state
- cost and telemetry aggregates
- scheduler state
10.2.2 Git Repositories¶
Must preserve:
- source code
- rules
- skills
- blueprints
- prompts
- configuration templates
- scripts
Git is already versioned, but mirror/backup copies are still required.
10.2.3 Object / Blob Storage¶
Must preserve:
- transcript snapshots
- exported artifacts
- checkpoints
- large outputs
- recovery bundles
- retained logs
10.2.4 Configuration & Infrastructure State¶
Must preserve:
- environment templates
- Docker compose files
- reverse proxy config
- firewall config
- job schedules
- deployment scripts
- secret references
- runbooks
Secrets themselves should not be dumped into general backups unless explicitly encrypted and controlled.
10.2A Restic Backup to Backblaze B2¶
Automated backup runs daily via Restic to Backblaze B2. Implemented and operational as of 2026-03-28.
| Parameter | Value |
|---|---|
| Tool | Restic |
| Target | Backblaze B2 bucket (STRUXIO-ai) |
| Schedule | Daily at 03:00 UTC (cron) |
| Script | /opt/struxio/backup/backup.sh |
| Scope | Workspace, configs, scripts, PostgreSQL dumps |
| Encryption | Restic built-in (AES-256) |
| Credentials | SOPS-encrypted (backup_secrets.enc.env), loaded at runtime via age key |
Backup Process (3 steps)¶
- Decrypt credentials — SOPS decrypts the B2 account ID, key, and restic password from `backup_secrets.enc.env` using the age key at `~/age-key.txt`
- Dump PostgreSQL — `pg_dump` runs for the Bus DB and Paperclip DB, writing to `/opt/struxio/backup/pg_dumps/`. Files are named with a date suffix. 7-day local retention.
- Restic backup — backs up workspace, bus config, scripts, and pg_dumps to B2. Tags: daily, hetzner.
What Is Backed Up¶
| Path | Content |
|---|---|
| /home/struxio/STRUXIO_Workspace | All 7 Git repos |
| /opt/struxio/bus | Bus MCP source and config |
| /opt/struxio/config | System configuration |
| /opt/struxio/scripts | Operational scripts |
| /opt/struxio/backup/pg_dumps | Daily PostgreSQL dumps (Bus + Paperclip) |
Excluded: node_modules, .git, *.log, __pycache__, .venv
Security¶
- No plaintext credentials — B2 account key and restic password are SOPS-encrypted at rest
- Decryption requires the age private key (`~/age-key.txt`), which is not in any Git repo
- Backup data is encrypted by Restic (AES-256) before upload to B2
Retention Policy¶
Restic prunes automatically after each backup:
- keep 7 daily snapshots
- keep 4 weekly snapshots
- keep 6 monthly snapshots
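This policy corresponds to a restic invocation along these lines (repository and credential environment assumed to be loaded first, as in backup.sh):

```shell
restic forget \
  --keep-daily 7 \
  --keep-weekly 4 \
  --keep-monthly 6 \
  --prune    # delete unreferenced data after forgetting snapshots
```

Without `--prune`, forgotten snapshots stop being listed but their data still occupies the B2 bucket.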
10.3 Backup Cadence¶
Database¶
- logical dump: at least daily
- WAL archiving: continuous (see Section 10.3A)
- pre-deploy snapshot: required before high-risk migrations
Git / Markdown¶
- mirrored continuously through Git remote
- daily off-platform mirror recommended
Object Storage¶
- continuous durable write pattern preferred
- lifecycle retention policy required
Config / Infra State¶
- export on every significant infrastructure change
- nightly snapshot of deployment definitions recommended
10.3A PostgreSQL WAL Archiving for Point-in-Time Recovery¶
Daily logical dumps (Section 10.2A) provide a 24-hour RPO. WAL (Write-Ahead Log) archiving reduces the RPO to 5 minutes by continuously shipping transaction logs to Backblaze B2.
Configuration¶
wal_archiving:
archive_mode: "on"
archive_command: "restic backup --stdin --stdin-filename %f --tag wal < %p"
# Alternative direct B2 shipping:
# archive_command: "b2 upload-file STRUXIO-ai wal/%f %p"
wal_level: "replica"
max_wal_senders: 3
wal_keep_size: "1GB"
RPO Target¶
- Target RPO: 5 minutes (down from 24 hours with daily dumps alone)
- WAL segments are archived continuously as they complete (typically every few minutes under normal load)
- Combined with the daily base backup, any point in time within retention can be restored
Archive Destination¶
WAL segments are shipped to Backblaze B2 alongside the daily Restic backups:
| Component | Destination | Retention |
|---|---|---|
| Daily base backup (pg_dump) | B2 via Restic (existing) | 7 daily, 4 weekly, 6 monthly |
| WAL segments | B2 via Restic or direct B2 upload | 7 days minimum |
Point-in-Time Restore Procedure¶
- Identify target time — determine the recovery point (e.g., "2026-03-30 14:30:00 UTC")
- Restore base backup — restore the most recent daily pg_dump that precedes the target time
- Download WAL segments — retrieve all WAL files from B2 between the base backup and the target time
- Configure recovery — set `recovery_target_time` in `postgresql.conf` (or `recovery.conf` for older versions)
- Start PostgreSQL in recovery mode — PostgreSQL replays WAL segments up to the target time
- Validate — verify table counts, recent data, and ODM entity integrity
- Promote — remove recovery configuration and restart as primary
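Steps 4-5 can be sketched as recovery settings (PostgreSQL 12+; the `restore_command` shown is an assumption about how segments would be pulled back from the Restic/B2 archive, not a verified pipeline):

```
# postgresql.conf recovery settings (illustrative)
restore_command = 'restic dump latest wal/%f > %p'   # fetch one archived WAL segment
recovery_target_time = '2026-03-30 14:30:00 UTC'
recovery_target_action = 'pause'   # hold before promotion so data can be validated
# then create an empty recovery.signal file in the data directory and start PostgreSQL
```

With `recovery_target_action = 'pause'`, the validate step runs against a paused replica before promotion makes the recovery irreversible.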
Monitoring¶
- Alert if WAL archiving falls behind by more than 5 minutes (archive lag)
- Alert if archive_command fails 3 consecutive times
- Include WAL archive status in the daily backup health check
Rule¶
WAL archiving is required for T1P production. Daily logical dumps alone are insufficient for a system managing active tickets, tasks, and governance state.
10.4 Retention Policy¶
Minimum target policy:
- daily backups: 14 days
- weekly backups: 8 weeks
- monthly backups: 12 months
- critical milestone backups: retained until manually reviewed
Session checkpoints and recovery bundles may use shorter retention if cost requires it, but production recovery points must remain sufficient for incident handling.
10.5 Recovery Priorities¶
Recovery order must follow business value.
Priority 1¶
- database integrity
- orchestrator state
- governor state
- active ticket/task continuity
Priority 2¶
- session checkpoint restoration
- transcript recovery
- scheduler recovery
- API availability
Priority 3¶
- observability dashboards
- historical exports
- non-critical mirrors
10.6 Recovery Targets¶
Initial T1P targets:
- infrastructure RPO target: <= 5 minutes (with WAL archiving; <= 24 hours without)
- operational DB restore target: same day
- critical session recovery target: best effort via checkpoint + transcript snapshot
- redeploy target after node loss: scripted and repeatable
These are initial targets, not final enterprise targets. The key requirement is that recovery must be rehearsable and explicit.
10.7 Session & Runtime Recovery¶
Recovery must align with Part 3 and Part 4 runtime semantics.
Infrastructure must support:
- runtime restart without losing ticket linkage
- session rebind when possible
- replacement session creation when rebind fails
- recovery escalation to human when continuity is uncertain
- durable storage of context snapshots and transcript references
Infrastructure recovery is not complete unless runtime continuity is addressed.
10.8 Restore Drill Requirements¶
A restore drill must be executable from runbook.
Minimum Drill Scenarios¶
- PostgreSQL restore to clean environment
- full service restart from deployment definitions
- object storage recovery validation
- recovery of one interrupted active task
- rollback to prior known-good deployment
If recovery is not tested, it is not real.
Monthly Restore Drill Procedure¶
A restore drill must run at least once per calendar month. The drill validates that B2 backups are actually recoverable, not just present.
Drill Steps¶
restore_drill:
cadence: "monthly (first week of month)"
executor: "GO or designated ops agent"
steps:
1_download:
action: "Download latest Restic snapshot from B2"
command: "restic restore latest --target /tmp/restore_drill/"
verify: "Files exist at /tmp/restore_drill/"
2_restore_db:
action: "Restore PostgreSQL dump to temporary database"
command: "createdb restore_drill_db && pg_restore -d restore_drill_db /tmp/restore_drill/pg_dumps/latest.dump"
verify: "Database created without errors"
3_verify_tables:
action: "Verify table counts match production"
checks:
- "SELECT count(*) FROM tickets — within 5% of production count"
- "SELECT count(*) FROM tasks — within 5% of production count"
- "SELECT count(*) FROM messages — within 5% of production count"
- "SELECT count(*) FROM agent_runtimes — non-zero"
4_verify_recent_data:
action: "Verify data freshness"
checks:
- "SELECT max(created_at) FROM messages — within 24 hours of drill time"
- "SELECT max(created_at) FROM tasks — within 24 hours of drill time"
- "WAL recovery test: if WAL archiving active, verify PITR to specific timestamp"
5_cleanup:
action: "Remove temporary resources"
commands:
- "dropdb restore_drill_db"
- "rm -rf /tmp/restore_drill/"
6_record:
action: "Record drill results"
output:
file: "state/restore_drills.yaml"
fields:
- drill_date
- snapshot_id
- snapshot_age_hours
- tables_verified
- table_count_drift_pct
- data_freshness_hours
- wal_pitr_tested (boolean)
- pass_fail
- notes
- executor
Drill Success Criteria¶
- All tables restored without errors
- Table counts within 5% of production
- Most recent data within 24 hours of drill time (within RPO)
- WAL PITR test successful (if WAL archiving is active)
- Drill completes in under 30 minutes
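The "within 5%" criterion is a simple drift check; a minimal sketch:

```python
def within_drift(production_count: int, restored_count: int,
                 max_pct: float = 5.0) -> bool:
    """Drill check: restored table count within max_pct percent of production."""
    if production_count == 0:
        return restored_count == 0
    drift_pct = abs(production_count - restored_count) / production_count * 100.0
    return drift_pct <= max_pct
```

The same check applied per table feeds the `table_count_drift_pct` field recorded in `state/restore_drills.yaml`.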
Drill Failure Response¶
- If a drill fails: create a critical governance alert (`backup.restore_drill.failed`)
- Root cause must be identified and fixed before the next scheduled drill
- Two consecutive drill failures trigger a human escalation to the founder
11. Security¶
11.1 Principle¶
XIOPro security must protect:
- proprietary strategy
- source code
- execution control
- credentials
- knowledge assets
- product plans
- customer-sensitive material
Security must be practical, layered, and compatible with headless execution.
11.2 Security Posture¶
Initial production posture:
- minimal public exposure
- private overlay access first
- least privilege by default
- founder-controlled admin path
- explicit service boundaries
- auditable changes
Public internet exposure should be minimized to only what is operationally required.
11.3 Access Model¶
Primary roles:
- founder_admin
- system_service
- agent_runtime
- emergency_operator
- read_only_observer
Rules:
- agents do not receive broad admin privileges
- infrastructure administration remains human-controlled
- service-to-service access uses explicit credentials
- emergency paths must be documented and separate from normal automation
11.4 Network Security Baseline¶
Recommended baseline:
- Tailscale or equivalent private overlay for administrative access
- SSH restricted to approved identities only (currently restricted to Tailscale)
- firewall deny-by-default posture (UFW active)
- only required inbound ports opened
- internal services bound privately where possible
- reverse proxy terminates TLS for exposed services
The preferred posture is:
- private access first
- public exposure second
11.5 Secrets Management¶
Secrets must never live as unmanaged plaintext in:
- code repositories
- markdown blueprints
- shared chat messages
- container images
Minimum standard:
- use environment injection or secret manager pattern
- separate secrets by environment
- rotate high-value credentials
- maintain inventory of critical secrets
- use scoped provider keys where supported
- encrypt secrets at rest using SOPS + age (see Section 9.9.6)
Recommended categories:
- provider API credentials
- GitHub tokens
- database credentials
- object storage credentials
- Tailscale / network auth material
- domain / DNS / TLS credentials
11.6 Service Isolation¶
Services must be logically isolated even if colocated.
Isolation baseline:
- separate containers for major services
- separate service credentials
- no unnecessary shared writable volumes
- DB access limited by service role
- execution runtimes separated from core control services where practical
Agent runtimes should not have unrestricted access to all system internals.
11.7 Endpoint Protection & Host Hardening¶
Baseline host controls:
- timely OS security updates
- non-root routine operation
- SSH key auth only
- fail2ban or equivalent if internet-facing SSH remains enabled
- UFW / nftables firewall policy
- audit of installed packages and open ports
- disk encryption where supported and operationally practical
11.8 Security Logging & Audit¶
Must record:
- admin logins
- deploy events
- secret changes
- permission changes
- breaker-triggered shutdowns
- emergency access usage
- unusual agent privilege attempts
Security-relevant events must be reviewable from an audit trail.
11.9 Incident Response Baseline¶
Every critical environment must have a basic incident path:
- detect
- contain
- preserve evidence
- rotate credentials if needed
- restore service safely
- document root cause
- update controls
A simple runbook is sufficient initially, but undocumented response is not acceptable.
11.10 Emergency Access, Out-of-Band Recovery & Memory Pressure Survival¶
Purpose¶
XIOPro must remain recoverable even when normal access paths fail.
This includes cases such as:
- host memory exhaustion
- service thrash or restart loops
- accidental firewall lockout
- Tailscale failure
- SSH unavailability
- broken deploy causing loss of normal admin path
This section defines the minimum emergency-access discipline.
11.10.1 Principle¶
Private overlay access and normal SSH are the preferred control paths.
But they are not sufficient as the only recovery plan.
Every critical environment must also have a documented out-of-band recovery path.
11.10.2 Required Access Layers¶
The environment should support these layers in order:
- normal private admin path
  - Tailscale or equivalent
  - SSH with key-only auth
  - normal deployment and maintenance workflow
- degraded emergency operator path
  - limited but documented recovery path
  - safe rollback of firewall/network changes
  - ability to stop unstable services
- out-of-band host access
  - provider console / rescue mode / equivalent
  - keyboard/layout-aware emergency instructions
  - ability to restore basic reachability without guessing
Rule¶
A host is not operationally safe if only one access path exists.
11.10.3 Memory Pressure Survival Rule¶
The system must assume that memory exhaustion can impair:
- SSH responsiveness
- Tailscale responsiveness
- service health
- logging
- the ability to run normal recovery commands
Therefore Node A must reserve enough operational headroom to allow emergency access and controlled recovery.
Minimum policy:
- avoid sizing Node A so tightly that ordinary bursts can fully consume memory
- prefer explicit RAM headroom over theoretical maximum utilization
- treat repeated OOM behavior as a production-severity signal
- preserve the ability to stop or pause non-critical services under pressure
11.10.4 Emergency Recovery Controls¶
At minimum, the environment should support these emergency actions:
- stop or pause non-essential containers/services
- restore firewall/network path to a safe known baseline
- restart only core control-plane services first
- verify DB health before reopening broader execution
- keep an emergency runbook for Hetzner console / rescue operations
- keep known-good command snippets accessible outside the affected host
Examples of Core-First Recovery Order¶
- regain admin access
- verify disk and memory state
- stop unstable/non-essential services
- verify PostgreSQL
- restore API/orchestrator/governor path
- restore scheduler and workers
- reopen task admission gradually
11.10.5 Firewall Safety Rule¶
Firewall changes must be governed like risky production changes.
Minimum practice:
- keep a known-good baseline policy
- document rollback steps
- avoid permanent lockout risk from one bad rule push
- test private admin path after material firewall changes
- keep console-level rollback instructions documented
The goal is not perfect automation. The goal is avoiding avoidable lockout.
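A hedged sketch of a known-good UFW baseline matching the exposure classes in 9.9.5 (the interface name assumes the default Tailscale device):

```shell
# Known-good baseline: deny inbound by default, allow overlay admin + HTTPS only
ufw default deny incoming
ufw default allow outgoing
ufw allow in on tailscale0        # Class B: private overlay admin path
ufw allow 80,443/tcp              # Class A: HTTPS ingress + redirect
ufw enable
```

Rolling back to this baseline is the documented recovery step after any rule push that risks lockout.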
11.10.6 Emergency Operator Role¶
The emergency_operator role exists for major incident recovery.
This role is separate from normal automation and should be able to:
- use out-of-band access when needed
- execute documented recovery commands
- restore reachability
- preserve evidence before destructive actions
- log all meaningful emergency interventions
11.10.7 Runbook Requirement¶
At least one explicit emergency runbook must exist for Node A covering:
- Tailscale unavailable
- SSH unavailable
- firewall rollback
- memory exhaustion / OOM stabilization
- service stop order
- provider console usage
- post-incident verification checklist
An undocumented emergency procedure is not a real emergency procedure.
11.10.8 Acceptance Rule¶
Infrastructure is not accepted as production-capable unless the team can answer:
- how do we access the host if Tailscale fails?
- how do we recover if SSH is unresponsive?
- how do we recover if firewall changes block normal access?
- how do we stabilize the host if memory is exhausted?
- what is the exact first-command sequence in provider console mode?
If these answers are not documented, the security model is incomplete.
12. Observability¶
12.1 Principle¶
If XIOPro cannot observe itself, it cannot govern itself.
Observability must support:
- runtime visibility
- recovery
- cost control
- debugging
- safety decisions
- future optimization
12.2 Required Signals¶
Minimum required signal families:
- logs
- metrics
- health checks
- heartbeats
- alerts
- audit events
Tracing is recommended but may be phased in later.
12.3 Logging Requirements¶
Logs must exist for:
- API layer
- orchestrator
- governor
- scheduler
- runtime adapters
- database-related failures
- deployment actions
- security events
Log requirements:
- structured where possible
- timestamped
- correlated by request/session/task IDs where possible
- retained according to environment policy
- searchable during incidents
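One way to meet these requirements with only the standard library is a JSON log formatter that emits structured, timestamped records carrying correlation IDs. The field names below are illustrative, not a mandated schema.

```python
# Structured, timestamped, correlated logging with the stdlib only.
# Field names ("service", "request_id", "task_id") are an assumed schema.
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "request_id": getattr(record, "request_id", None),
            "task_id": getattr(record, "task_id", None),
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("xiopro")
log.addHandler(handler)
log.setLevel(logging.INFO)

# Correlation IDs travel via the `extra` mechanism of stdlib logging
log.info("session recovered", extra={"service": "orchestrator",
                                     "request_id": "r-123", "task_id": "t-9"})
```

Because every record is one JSON object per line, the output stays greppable and searchable during incidents without any additional tooling.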
12.4 Metrics Requirements¶
Minimum operational metrics:
Platform¶
- CPU
- memory
- disk
- network
- container restarts
- process uptime
Runtime¶
- active runtimes
- active sessions
- waiting human escalations
- failed runs
- retries
- queue depth
Business/Execution¶
- tickets in progress
- tasks completed
- task latency
- session recovery count
- human intervention count
Cost¶
- provider cost estimate
- per-runtime estimated spend
- per-task estimated spend
- infra cost trend
12.5 Health Model¶
Each core service must expose a health view:
- healthy
- degraded
- blocked
- failed
Minimum monitored services:
- API
- orchestrator
- governor
- database
- scheduler
- runtime adapter layer
- reverse proxy
- object storage connectivity
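A minimal sketch of the four-state model: overall system health is the worst state reported by any monitored service. The four states and the service names come from this section; the worst-of rollup rule is an assumption for the example.

```python
# Four-state health model with a worst-of rollup (rollup rule assumed).
from enum import IntEnum

class Health(IntEnum):
    HEALTHY = 0
    DEGRADED = 1
    BLOCKED = 2
    FAILED = 3

def system_health(services: dict[str, Health]) -> Health:
    """Overall health is the worst state any core service reports.
    No data at all is treated as FAILED, not as healthy."""
    return max(services.values(), default=Health.FAILED)

states = {
    "api": Health.HEALTHY,
    "orchestrator": Health.HEALTHY,
    "database": Health.DEGRADED,
    "scheduler": Health.HEALTHY,
}
print(system_health(states).name)  # DEGRADED
```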
12.6 Alerting Baseline¶
Alerts must be routed by severity.
Critical¶
- database unavailable
- orchestrator down
- repeated session recovery failure
- secret/security incident
- runaway cost spike
Warning¶
- elevated retry rate
- queue backlog
- degraded disk space
- failed backup job
- runtime adapter instability
Info¶
- deploy complete
- scheduled maintenance
- non-critical optimization suggestions
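Severity routing can be as small as a lookup table. The routing targets below are placeholders; this section only mandates that alerts route by severity, not these channels. One deliberate choice is shown: unknown severities escalate rather than disappear.

```python
# Hedged sketch of severity-based alert routing; targets are placeholders.
ROUTES = {
    "critical": "page_operator",  # e.g. database unavailable, cost spike
    "warning": "ops_channel",     # e.g. queue backlog, failed backup
    "info": "log_only",           # e.g. deploy complete
}

def route_alert(severity: str) -> str:
    # Fail-safe: an unrecognized severity is escalated, never dropped.
    return ROUTES.get(severity, "page_operator")

print(route_alert("warning"))   # ops_channel
print(route_alert("unknown"))   # page_operator (fail-safe)
```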
12.7 Dashboard Requirements¶
At minimum, the operator must be able to see:
- system health
- active runtimes
- active sessions
- waiting escalations
- error count
- recovery events
- cost trend
- backup status
This may begin with simple dashboards, but the signals themselves are mandatory.
12.8 Observability Storage & Retention¶
Explicit retention rules must exist for:
- operational logs
- audit logs
- metrics history
- incident snapshots
Retention length may vary by cost, but critical incident analysis must remain possible.
13. Cost Strategy¶
13.1 Principle¶
Infrastructure cost must be:
- visible
- attributable
- governable
- optimized without harming reliability
Cost strategy is not only about lowering spend. It is about choosing the right cost for the right leverage.
13.2 Cost Categories¶
Track at least these categories:
- hosting / compute
- storage
- network / bandwidth
- backup retention
- observability tooling
- provider runtime/API spend
- local hardware / future self-hosted capacity
13.3 Attribution Model¶
Infrastructure should support attribution by:
- environment
- node
- service
- runtime surface
- ticket or project where practical
This enables the governor and the operator to answer:
- what is expensive
- why it is expensive
- whether it is justified
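Attribution reduces to tagging every spend record with the dimensions above and rolling up by any one of them. The record fields and dollar amounts below are made up for illustration.

```python
# Illustrative cost attribution rollup; records and amounts are invented.
from collections import defaultdict

records = [
    {"env": "prod", "node": "node-a", "service": "orchestrator", "usd": 4.20},
    {"env": "prod", "node": "node-a", "service": "runtime", "usd": 11.50},
    {"env": "dev",  "node": "node-a", "service": "runtime", "usd": 1.30},
]

def attribute(records, dimension: str) -> dict[str, float]:
    """Sum spend by any attribution dimension (env, node, service, ...)."""
    totals: dict[str, float] = defaultdict(float)
    for r in records:
        totals[r[dimension]] += r["usd"]
    return dict(totals)

print(attribute(records, "service"))
print(attribute(records, "env"))
```

The same records answer "what is expensive" under every dimension, which is what lets the governor and the operator decide whether the spend is justified.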
13.4 Cost Control Rules¶
Initial rules:
- avoid idle heavyweight services without clear value
- scale up only when signal justifies it
- prefer simple colocated deployment before fragmentation
- separate services only when risk, cost, or operational pressure justifies it
- prune unused storage and log retention intentionally
13.5 Scale-Up Triggers¶
Infrastructure upgrade may be justified when one or more apply:
- repeated CPU or memory saturation
- queue growth impacting execution goals
- session recovery degradation due to node pressure
- observability overhead becoming material
- self-hosted model experimentation requiring isolated compute
- product workloads contaminating XIOPro control-plane stability
13.5A Scaling Triggers¶
The following specific conditions trigger a scaling evaluation. Meeting one trigger does not mandate action — it requires a deliberate review and decision. GO is responsible for raising the evaluation; the decision requires operator approval.
| Signal | Threshold | Evaluation Required |
|---|---|---|
| PostgreSQL write latency | > 50ms sustained at 10+ concurrent agents | Evaluate read replicas |
| Host memory | > 75% sustained (any host) | Add new host |
| Bus request latency | > 200ms p95 | Evaluate caching layer |
| Agent spawn queue depth | > 5 pending spawns | Distribute spawn load to additional hosts |
| Concurrent agent count | > 8 active simultaneously on a single host | Evaluate second host or reduce parallelism |
| Disk usage | > 80% on any data volume | Archive old activity partitions to B2; evaluate volume expansion |
Rules¶
- Triggers are measured over a sustained window (minimum 5 minutes), not transient spikes.
- A trigger that clears before review requires no action but should be logged.
- Scaling adds operational complexity — it must be justified by signal, not by precaution.
- GO reports trigger events via Bus alert (L3 or higher) so IO can route to the founder for decision.
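The sustained-window rule above can be sketched as a predicate over a sample history: a trigger fires only when the signal stays above its threshold for the whole window. The one-minute sampling cadence in the example is an assumption; the blueprint only mandates a minimum five-minute window.

```python
# Sustained-window trigger check; sampling cadence is an assumption.
def sustained_breach(samples: list[float], threshold: float,
                     window: int) -> bool:
    """True only if the last `window` samples all exceed `threshold`."""
    if len(samples) < window:
        return False  # not enough history to call the breach sustained
    return all(s > threshold for s in samples[-window:])

# Host memory > 75% sustained over five 1-minute samples -> fires
mem = [72, 76, 78, 77, 76, 79]
print(sustained_breach(mem, 75.0, 5))    # True

# A single transient spike does not fire
spike = [60, 95, 60, 60, 60, 60]
print(sustained_breach(spike, 75.0, 5))  # False
```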
13.6 Hetzner Upgrade Policy¶
Initial assumption:
- one primary Hetzner CPX62 node is acceptable for T1P
Upgrade path should remain open for:
- larger CPU / RAM node
- split DB to dedicated node
- split runtime workers from control plane
- add dedicated GPU / model experimentation node later
No upgrade should be performed only because it feels more "serious". Upgrade must follow observed bottlenecks.
13.7 Self-Hosted Model Decision Rule¶
Future self-hosted model infrastructure should be evaluated only if it improves one or more of:
- privacy posture
- unit economics
- latency
- offline resilience
- provider independence
- special workload suitability
It should not be adopted merely because self-hosting sounds strategic.
14. Service Fate Map Reference¶
The transition from current services to v5.0 target architecture is documented in:
resources/SERVICE_FATE_MAP_v4_2.md
This resource maps every currently running service/container to its v5.0 fate:
- KEEP: Caddy, PostgreSQL (upgrade), Hindsight, ISO 19650 engine (product code -- see MVP1_PRODUCT_SPEC.md), Tailscale, UFW, Restic, SOPS+age, Ruflo, Claude Code, AutoDream
- KEEP + EVOLVE: Bus (-> API gateway/relay), LiteLLM (activate routing)
- KEEP for now: Paperclip (until ODM parity), Tickets renderer, RC keepalive
- REPLACE: Dashboard (-> Control Center)
- RETIRE: devxio-frontend, devxio-bridge (stale pre-v3.1 code)
- RETIRED (deprecated): devxio-librarian (631 MB Neo4j), graph_stack_neo4j (1.2 GB) -- both Neo4j instances stopped and removed
Retirement RAM Impact¶
Retiring stale services frees approximately 1.95 GB, leaving approximately 26 GB available for new XIOPro backend, UI, and worker services on the CPX62.
Parallel Operation Rule¶
During migration, old services (Bus, Paperclip, dashboard) run alongside new services. No big-bang cutover. Parallel-run until new services are proven and feature parity is reached.
15. Current State¶
As of 2026-03-28, the infrastructure layer is operational:
What exists today:
- Hetzner CPX62 running Ubuntu 24.04 with 14 Docker containers (~4.2 GB RAM)
- Caddy reverse proxy with TLS and basic auth
- PostgreSQL (bus database, 44 MB)
- XIOPro Control Bus (evolving from Bus MCP): REST API :8088, SSE Push :8089, OAuth 2.1, PostgreSQL-backed. Currently 107 MB. Being extended with push delivery, intervention, task orchestration, agent registration, host capacity, and spawn coordination (see Part 2, Section 5.8)
- Paperclip issue tracker + DB (339 MB combined)
- Hindsight memory system (1.06 GB, Vectorize.io Docker)
- LiteLLM router (576 MB, not actively routing under Max20)
- ISO 19650 engine (57 MB, product code -- see MVP1_PRODUCT_SPEC.md)
- ~~Two Neo4j instances~~ (deprecated -- both stopped and removed, 1.83 GB freed)
- Phase 1 React dashboard (11 MB)
- Pre-v3.1 stale frontend + bridge (123 MB, candidates for immediate retirement)
- Tailscale VPN mesh (Hetzner <-> Mac)
- UFW firewall ACTIVE (SSH restricted to Tailscale 100.64.0.0/10, HTTP/HTTPS public, default deny incoming). Enabled 2026-03-28.
- Root password set for emergency Hetzner console access
- struxio user has sudo access
- Restic backup to Backblaze B2 (daily 03:00 UTC)
- SOPS + age for secret encryption
- Git history cleaned: plaintext secrets purged from STRUXIO_OS repo history via git-filter-repo (2026-03-28). Only SOPS-encrypted versions remain.
- Supply chain security: Socket.dev + GuardDog recommended for behavioral malicious package detection. Trivy for container scanning. pip-audit/npm-audit for CVE baseline.
- RC keepalive cron (every 10 min)
- Ruflo (claude-flow) for agent teams
- Claude Code v2.1.86 with Max20 OAuth
- AutoDream enabled (memory consolidation)
- tmux 3.4, ripgrep 14.1.1 installed
What must be built/changed:
- Install must-have CLI tools (gh, jq, uv, fzf, fd, yq, direnv)
- Retire stale containers (devxio-frontend, devxio-bridge)
- ~~Evaluate Neo4j instances for retirement~~ (done -- both retired, see Part 5 Section 12.1)
- Add pg_dump to restic backup scope
- Build new FastAPI backend + Next.js UI services
- Upgrade PostgreSQL to become primary ODM state store
- Evolve Bus into API gateway or keep as messaging relay
16. Infrastructure Success Criteria¶
Infrastructure is successful only if the following are true:
16.1 Reliability¶
- core services start reproducibly
- system can run continuously
- failures are detectable
- restart procedures are documented
16.2 Recoverability¶
- backups exist and are valid
- restore drill is executable
- runtime/session recovery path is defined
- bad deployments can be rolled back
16.3 Security¶
- secrets are controlled
- access is role-scoped
- public exposure is minimized
- audit trail exists for critical actions
16.4 Observability¶
- core services emit useful telemetry
- critical alerts reach the operator
- cost and health are visible
- incident diagnosis is possible without guesswork
16.5 Scalability¶
- architecture can separate services without redesign
- local node remains viable as fallback or augmentation
- future GPU or product nodes can be added cleanly
16.6 Cost Discipline¶
- infrastructure spend is explainable
- upgrade decisions are signal-based
- expensive idle complexity is avoided
Infrastructure that merely "runs" is not enough. It must be operable, recoverable, and governable.
17. Naming Conventions¶
All STRUXIO repositories, folders, and files follow a four-rule naming standard. These rules ensure consistency across GitHub, local disk, and internal structure.
17.0 General Principles¶
- Case-insensitive uniqueness: Never create two files or folders with the same name differing only by case. Uppercase in Mac root folders is for human readability only — the system must treat names as case-insensitive for search and deduplication.
- XIOPro and STRUXIO are proper names: Always written in uppercase. They are brand names with no abbreviation or meaning to decode — keep as-is everywhere.
- Mac vs Hetzner convention: Mac uses the STRUXIO_ prefix on top-level folders for Finder readability. Hetzner uses the GitHub lowercase name (the git clone default). Both are valid — they map to the same repo (see Section 17.5).
- External tool names kept as-is: Third-party tool names (Neo4j, PostgreSQL, Caddy, Backblaze, Tailscale) retain their original casing in all documents.
- High-level folders are descriptive: Use full words — STRUXIO_Design (not STRUXIO_D), STRUXIO_Knowledge (not abbreviated). The folder name should explain what it contains.
17.1 Rule 1 — GitHub Repository Names¶
- All lowercase.
- Words separated by hyphens (-).
- Must start with struxio-.
Examples: struxio-design, struxio-app, struxio-knowledge
17.2 Rule 2 — Local Top-Level Folders (Repos on Disk)¶
- Mac: Start with STRUXIO_. Use underscores (_). CamelCase or logical uppercase for readability.
- Hetzner: Use the GitHub lowercase name as cloned (e.g., struxio-design). No renaming needed.
- These represent the repos and are the exception to the lowercase rule on Mac.
Examples (Mac): STRUXIO_Design, STRUXIO_OS, STRUXIO_Knowledge, STRUXIO_DEVXIO_UI
Examples (Hetzner): struxio-design, struxio-os, struxio-knowledge
17.3 Rule 3 — Structure Folders (Inside Repos)¶
- All lowercase.
- Words separated by underscores (_).
- No spaces, no hyphens.
Examples: 02_devxio_architecture, blueprint_devxio_bl_v4_2_set, resources
The daily folder cleanup cron at 04:00 UTC enforces Rule 3 (lowercase structure folders).
17.4 Rule 4 — File Names¶
- Start with a function/type prefix in UPPERCASE.
- Rest uses appropriate casing for readability.
Examples: BLUEPRINT_XIOPro_v4_2_Part1_Foundations.md, SKILL_REGISTRY.yaml, REVIEW_final_freeze_v4_2.md, PLAN_iso19650_integration.md
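Rules 1, 3, and 4 are mechanical enough to validate with regexes. The patterns below are one reading of the written rules, not an official linter; the daily cleanup script is the authoritative enforcement for Rule 3.

```python
# Hedged validators for Rules 1, 3, and 4 (one reading of the rules).
import re

def valid_repo_name(name: str) -> bool:
    # Rule 1: all lowercase, hyphen-separated, must start with "struxio-"
    return re.fullmatch(r"struxio-[a-z0-9]+(-[a-z0-9]+)*", name) is not None

def valid_structure_folder(name: str) -> bool:
    # Rule 3: all lowercase, underscore-separated, no spaces or hyphens
    return re.fullmatch(r"[a-z0-9]+(_[a-z0-9]+)*", name) is not None

def valid_file_name(name: str) -> bool:
    # Rule 4: UPPERCASE function/type prefix followed by an underscore
    return re.match(r"[A-Z][A-Z0-9]*_", name) is not None

assert valid_repo_name("struxio-design")
assert not valid_repo_name("STRUXIO_Design")       # Mac folder, not a repo name
assert valid_structure_folder("02_devxio_architecture")
assert valid_file_name("SKILL_REGISTRY.yaml")
```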
17.5 Repository Mapping¶
| GitHub Repo | Mac Folder | Hetzner Folder | Purpose |
|---|---|---|---|
| struxio-design | STRUXIO_Design | struxio-design | Architecture, blueprints, design docs |
| struxio-logic | STRUXIO_Logic | struxio-logic | Agent activations, rules, skills |
| struxio-os | STRUXIO_OS | STRUXIO_OS | State, tickets, engineering, infra |
| struxio-app | STRUXIO_App | struxio-app | Product code (see MVP1_PRODUCT_SPEC.md) |
| struxio-business | STRUXIO_Business | struxio-business | Business docs |
| struxio-knowledge | STRUXIO_Knowledge | struxio-knowledge | Knowledge vault, Obsidian sync |
| struxio-devxio-ai | STRUXIO_DEVXIO_UI | devxio-control-center | Control Center UI (Next.js) |
| struxio-aibus | STRUXIO_AIBUS | struxio-aibus | Bus MCP Server source |
| struxio-dashboard | STRUXIO_Dashboard | struxio-dashboard | Dashboard UI |
| struxio-tickets | STRUXIO_Tickets | struxio-tickets | Ticket tracking |
17.6 Operational Tools¶
| Tool | Command | Schedule | Purpose |
|---|---|---|---|
| Folder Naming Cleanup | /opt/struxio/scripts/folder_naming_cleanup.sh | Daily 04:00 UTC | Enforces Rule 3 (lowercase structure folders) |
| Workspace Graph | /opt/struxio/scripts/workspace_graph.sh | Daily 04:01 UTC | Generates STATE_workspace_graph.yaml — full folder/file map for agent navigation |
18. Final Statement¶
Infrastructure is the execution ground of XIOPro.
If this layer is weak:
- runtime becomes fragile
- recovery becomes guesswork
- security becomes accidental
- costs become opaque
If this layer is strong:
- the system can run headless with confidence
- failures can be absorbed and repaired
- the founder can scale with less fear
- future growth does not require rethinking everything
Changelog¶
| Version | Date | Author | Changes |
|---|---|---|---|
| 4.1.0 | 2026-03-27 | BM | Initial infrastructure blueprint |
| 4.2.0 | 2026-03-28 | BM | C8.1: Added actual Hetzner CPX62 specs (16 vCPU AMD EPYC-Genoa, 30GB RAM, 150GB SSD) to Section 5.1 and 9.5.1. C8.2: Added SOPS+age secrets encryption to Section 9.9.6. C8.3: Added Restic backup to Backblaze B2 section (10.2A). C8.4: Added service fate map reference (Section 14). C8.5: Added container memory budget (Section 9.5A). C8.6: Added CLI toolchain section (9.6A) referencing CLI_TOOLS_ASSESSMENT.md. CX.1: Global "Rufio" to "Ruflo" rename. CX.2: Updated version header to 4.2.0. CX.3: Added changelog. CX.4: Added current state section (Section 15). Renumbered success criteria to Section 16, final statement to Section 17. |
| 4.2.2 | 2026-03-28 | 000 | Agent naming migration: O00/O01 replaced with 000 (orchestrator role) / 000 (governor role). M01 replaced with module steward role. BM replaced with 000. Container group names updated from o00/o01 to orchestrator/governor. Backblaze B2 references preserved unchanged. Changelog author entries preserved as historical. |
| 4.2.3 | 2026-03-28 | 000 | Roles over numbers: Removed agent IDs from all architectural descriptions, section headers, diagrams, and service lists. Role names used throughout instead of agent numbers. |
| 4.2.7 | 2026-03-28 | BM | Neo4j deprecated: Both instances (devxio-librarian, graph_stack_neo4j) marked as retired/removed across Sections 9.5A, 14, 15. PostgreSQL + pgvector replaces all Neo4j use cases for T1P. |
| 4.2.11 | 2026-03-29 | BM | Added Section 9.11.12 (Orchestrator Launch Commands) — devxio go and devxio mo launch commands for GO and MO surfaces with cross-reference to Part 4, Section 4.1A. |
| 4.2.12 | 2026-03-29 | BM | Added Section 17 (Naming Conventions) — four-rule naming standard for repos, folders, and files with repository mapping table. Renumbered Final Statement to Section 18. |
| 4.2.13 | 2026-03-29 | BM | Updated Section 17 naming conventions: added Section 17.0 (General Principles — case-insensitive uniqueness, proper names, Mac vs Hetzner, tool names, descriptive folders). Updated 17.2 to distinguish Mac/Hetzner. Updated 17.5 mapping table with Hetzner column. Added 17.6 (Operational Tools — folder cleanup + workspace graph). |
| 4.2.14 | 2026-03-29 | BM | Cross-references: Added pointer to resources/DESIGN_cli_services.md in Section 9.6A (CLI services framework including Porkbun DNS and Hetzner hcloud). Added hcloud to Must-Have CLI tools table. |
| 5.0.1 | 2026-03-30 | GO | N22: Added Section 8.8.1 (Connection Pooling) -- PgBouncer or built-in pool_size recommended at 15+ agents, current Fastify pool max: 20, struxio_db_pool_* gauge monitoring via GET /metrics, pool exhaustion = warning alert. |
| 5.0.2 | 2026-03-30 | GO | N8: Added Section 13.5A (Scaling Triggers) — four specific thresholds: PostgreSQL write latency > 50ms at 10+ agents → read replicas; host memory > 75% sustained → new host; Bus latency > 200ms → caching layer; spawn queue depth > 5 → distribute to additional hosts. N20: Added Section 8.12A (Bus API Rate Limits) — default 100 req/min per actor, burst 200 req/min throttled, 1 SSE connection per actor per channel, 50 events/min per actor. |
| 5.0.3 | 2026-03-30 | GO | C4: Added Section 10.3A (PostgreSQL WAL Archiving) — continuous WAL shipping to B2, RPO reduced from 24h to 5 minutes, archive_mode/archive_command config, point-in-time restore procedure (7 steps), monitoring rules. Updated Section 10.6 RPO target to reflect WAL archiving. C5: Expanded Section 10.8 (Restore Drill Requirements) — monthly restore drill procedure with 6-step checklist (download, restore, verify tables, verify freshness, cleanup, record), success criteria, failure response, results recorded in state/restore_drills.yaml. |
| 5.0.4 | 2026-03-30 | GO | I13: Revised agent count estimate in Section 9.5A — realistic max 8-10 concurrent agents on CPX62 (30 GB RAM). Each Claude Code process ~300-500 MB, services baseline ~10 GB, 3-5 GB safety buffer. Previous higher estimates reflected smaller assumed agent footprints. |
| 5.0.5 | 2026-03-30 | GO | N8 addendum: Added two scaling triggers to Section 13.5A — concurrent agent count > 8 per host → evaluate second host; disk usage > 80% → archive old partitions. |