Decisions Log

Key architectural, hardware, and network decisions made during the AORXI homelab build are recorded here in chronological order. The vault documents under vault/ are the source of truth; this page synthesizes the most significant choices and reversals for quick reference. Entries marked "(corrects earlier assumption)" indicate a decision that reversed or refined a prior position.

Key Decisions

| Date | Decision | Notes / Why |
|---|---|---|
| 2026-06-06 | NIC inventory finalized: 6 multiport 10G NICs total, all allocated, zero spare. 5× Intel X710-T4 → Site B (one per sb-cmp-0x node); 1× XL710 quad-port (i40e) already installed in sa-stor-01. 2× Intel X550-T2 → sa-cmp-01 and sa-cmp-02 (one port each). | Corrects an earlier draft that listed a spare X710-T4. No reserve NIC exists. See Per-Site Inventory. |
| 2026-06-06 | Hostname scheme standardized: {site}-{role}-{num} (e.g. sa-edge-01, sb-cmp-03, sa-fw-01). Cluster names sa-pve / sb-pve. | Legacy names (pve-a-edge, etc.) purged from all vault docs. Single canonical form. |
| 2026-06-06 | Site A switch wiring: Proxmox mgmt (VLAN 20) + IPMI (VLAN 10) offloaded to sa-sw-02/sa-sw-03 (access tier); dedicated Corosync (VLAN 25) on sa-edge-01 onboard 1G #2 into sa-sw-01 (core). Fios 2 Gbps WAN on sa-edge-01 10G #1. | Offloading mgmt/IPMI frees core-switch ports on sa-sw-01 for dedicated per-node Corosync links. |
| 2026-06-06 | Site B per-node wiring locked: onboard 1G #1 = mgmt (VLAN 20); 1G #2 = Corosync (VLAN 25, no GW, dedicated); onboard 10G = Ceph cluster (VLAN 65, no GW); X710-T4 p1 = VM Services (VLAN 30), p2 = Kubernetes Nodes / VIPs (VLANs 40/50), p3 = Ceph public (VLAN 60), p4 = Backup / Replication (VLAN 90). Single links, no LACP. sb-edge-01 WAN on 1G #1 (1 Gbps). | FN8TP SFP+ ports left unused. X710-T4 trunk layout splits Ceph public from Ceph cluster for independent path control. |
| 2026-06-07 | sa-stor-01 Corosync: dedicated 2nd onboard GbE assumption dropped. VLAN 25 rides tagged on XL710 p1 alongside VLAN 30; sa-sw-01 p7 freed to spare. | (Corrects earlier assumption.) Rack inspection confirmed only one 1GbE port available for mgmt; the second onboard port is the AQC107 10G, unsuitable for Corosync. |
| 2026-06-07 | sb-edge-01 WAN port corrected to onboard 1G #1 (not 10G #1). | (Corrects earlier doc error.) 10G #1 is already assigned to Corosync; using it for WAN would be a port collision. |
| 2026-06-07 | DNS: 4× Technitium VMs on VLAN 30 (sa-dns-01 at 10.10.30.10 as primary; AXFR to sa-dns-02, sb-dns-01, sb-dns-02). Internal zone: core.aorxi.io. Certificates via Let's Encrypt DNS-01 (Cloudflare); no private CA needed for web services. | A subdomain of the owned public domain (aorxi.io) eliminates a private CA for browser-trusted certs while keeping the internal zone authoritative and off the public internet. |
| 2026-06-07 | Certificate strategy: Let's Encrypt DNS-01 as the primary path; private step-ca at sa-ca-01 (10.10.30.30) for IPMI BMCs (cannot use ACME) and future internal mTLS. | Two-path strategy: ACME for everything browser-facing, step-ca only for devices that cannot run ACME. Avoids bloating the private CA footprint. |
| 2026-06-08 | sa-ap-01 (UniFi U7 Pro XGS, WiFi 7) added to sa-sw-03 port 3 (10G RJ45, PoE++ 802.3bt ≈ 29 W). sa-sw-02 and sa-sw-03 uplinks moved from copper RJ45 to SFP+ DAC (sa-sw-01 combo slots 15F/16F). sa-sw-01 p12/p13 now copper spares; emergency-admin access relocated to p12. AP port profile: native VLAN 10, tagged VLANs 100/110/120. | The XG uplinks previously occupied the PoE-capable RJ45 ports that the AP requires. Switching uplinks to SFP+ DAC freed two PoE RJ45 ports without adding switches. |
| 2026-06-21 | Switch naming standardized: {site}-sw-{num} applied across all docs. Mapping: sa-sw-01 (Netgear XS716T core), sa-sw-02 (UniFi XG 6 PoE access #1), sa-sw-03 (UniFi XG 6 PoE access #2), sb-sw-01 (Netgear XS748T core), sb-sw-02 (UniFi USW 24 PoE access). | Aligns switch naming with the {site}-{role}-{num} convention already used for compute, edge, and storage nodes. 28 vault files updated. |
| 2026-06-21 | Address-block convention formalized for every routed /24: .1 = OPNsense gateway; .2–.9 = network infra (switches, APs); .10–.39 = physical host interfaces (host-octet); .40–.49 = service VMs (PBS, DNS); .50–.199 = DHCP pool / additional static services; .200–.254 = VIPs / MetalLB. 64 host /32s verified, zero collisions. | Replaces placeholder "convention-derived" markers with concrete addresses across all IP tables. See IP Tables. |
| 2026-06-24 | sa-stor-01 (5049A-T / X11SPA-T) onboard NIC layout corrected: 1× Intel i210 1GbE (red RJ45 = Proxmox mgmt, vmbr0) + 1× Aquantia AQC107 10GbE (atlantic driver). AQC107 must not carry mgmt or Corosync; Corosync stays tagged on XL710 p1. | (Corrects earlier "only one onboard GbE" assumption.) ServeTheHome rear-panel photo confirmed both ports. The atlantic driver has a documented history of instability under Linux/Proxmox; AQC107 kept for non-critical data only or left unused. |
| 2026-06-24 | node_exporter binds to each host's VLAN 20 Proxmox mgmt IP, not 0.0.0.0. Configured via --web.listen-address in /etc/default/prometheus-node-exporter. | Host metrics are management-plane data. Per the addressing policy, hosts receive an L3 address only on the infra VLANs they terminate (10, 20, 25, 60, 65, 90); VLAN 80 is the monitoring-stack network, not a host IP. |
| 2026-06-25 | Corosync on sa-edge-01 confirmed: stays on sa-sw-01 (core switch) via onboard 1G #2. Access switches sa-sw-02/sa-sw-03 must never carry VLAN 25. | VLAN 25 is forbidden on access-tier uplinks by design. Routing Corosync via an access switch would share bandwidth with client/IoT/guest WiFi, introducing jitter that can cause false fencing events. |
| 2026-06-25 | DNS infrastructure runs as Proxmox VMs, not inside Kubernetes. | DNS is a layer-0 dependency: Proxmox, OPNsense, Proxmox Backup Server (PBS), ArgoCD, cert-manager, and etcd all need name resolution to start. Running DNS inside Kubernetes creates a circular bootstrap problem: a node reboot or CNI fault takes down resolution site-wide, including recovery tooling. |
| 2026-06-25 | OPNsense VM (sa-fw-01) hardware config: VirtIO SCSI disk (scsi0), 32 GB thin-provisioned, Discard enabled; 4 vCPUs (type host), NUMA off (single-socket Xeon D); 8 GB RAM fixed, ballooning off. KSM disabled on both E200 hosts (sa-edge-01, sb-edge-01) via systemctl disable --now ksmtuned ksm. | Ballooning off for deterministic firewall performance. KSM disabled to eliminate cross-VM memory side-channel risk on a security appliance; the memory it would save is negligible. |
| 2026-06-25 | MTU policy locked: vmbr1 (WAN) and vmbr2 (LAN trunk) stay at 1500. Jumbo MTU 9000 end-to-end only on dedicated storage data paths (VLANs 60, 65, 90). Corosync (VLAN 25) must never use jumbo frames. | A VLAN-aware bridge carries a single MTU; the OPNsense trunk also carries internet-bound traffic limited to 1500. Jumbo on Corosync risks MTU mismatches that silently drop heartbeats. |
| 2026-06-26 | Ansible hardening: replaced the devsec.hardening collection with hand-rolled, STIG-mapped roles (one role per concern: ssh, sysctl, pam, auditd, packages, firewall). Each task tagged with the STIG control ID it implements. Audit via OpenSCAP / SCAP Security Guide (SSG), read-only. Compliance posture: STIG-aligned, not certified. | No DISA STIG exists for Debian/Proxmox; closest analogues are the Ubuntu STIG and generic OS SRG. devsec.hardening had opaque defaults that conflicted with Proxmox-safety requirements (root SSH key login, no vfat blacklist, IP forwarding). |
| 2026-06-26 | Proxmox firewall posture: pve-firewall ships default OFF on all nodes; enable per-host after cluster formation, not during bootstrap. | OPNsense handles perimeter enforcement. pve-firewall is a second layer added post-cluster where it can be applied and tested without impacting bootstrap networking or Corosync. |
| 2026-06-26 | Ansible / Pulumi boundary: Ansible scope = host OS only (hardening, baseline, day-2 ops); it never touches VMs, containers, or PVE API objects. Pulumi (aorxi/) scope = all VM/CT/PVE API provisioning. Root CLAUDE.md is the single source of truth for IPs and hostnames; Ansible inventory mirrors it, never invents values. | Hard boundary prevents two tools managing the same surface and creating IaC drift. |
| 2026-06-26 | Monorepo: ansible/ and aorxi/ merged into the homelab repo as subdirs via git filter-repo path rewrite (full history preserved). Both are no longer standalone git repos. | Single clone covers all automation; simplifies cross-repo context and CI. |
| 2026-06-26 | Kubernetes pod/service CIDRs corrected: pods 10.128.0.0/14; services 172.30.x.0/16 (stay within 172.16.0.0/12). Previous draft used 10.110.x.0 / 10.120.x.0, which overlapped the IoT (VLAN 110) and Guest WiFi (VLAN 120) subnets. | Hard rule: pod and service CIDRs must not overlap any real LAN or VPN range. Do not use 172.32.x.x, which is outside RFC 1918. |
| 2026-06-28 | Ceph release pinned to tentacle (not squid). Repository: Proxmox ceph-tentacle no-subscription repo (download.proxmox.com/debian/ceph-tentacle). Ansible baseline role rewrites ceph.sources to tentacle; packages role blacklists the Ceph stack from unattended-upgrades so versions stay Proxmox-owned. | Tentacle is the current stable Ceph release aligned with the Proxmox version in use. The no-subscription repo avoids the enterprise license requirement; the "No valid subscription" popup is cosmetic only. |
| 2026-06-28 | One self-hosted UniFi OS Server (UOS) controller for the whole lab: sa-uos-01 on sa-edge-01, final IP 10.10.10.40 (VLAN 10). Pinned UOS 5.1.19, Ubuntu 24.04 VM (4 vCPU / 8 GB / 64 GB, Podman). Adopts standalone UniFi gear only, never the Gateway Max / USG Pro, which stay self-managed as bootstrap/fallback. No second controller at Site B; its gear adopts over WireGuard (L3 adoption). | VLAN 10 placement puts the controller on the same L2 as the devices it manages, so adoption needs no inter-VLAN firewall rule. UOS cannot manage Cloud Gateways, and cloned controllers reuse Site Manager tokens, so rebuild fresh via Pulumi instead of cloning. |
| 2026-06-30 | OPNsense Phase-1 provisioning method: opnsense/provision/ Pulumi (project aorxi-opnsense) builds sa-fw-01 as a 2-NIC VM from a pinned FreeBSD cloud image; cloud-init writes a seed /conf/config.xml then runs opnsense-bootstrap.sh -y -r <release> to convert FreeBSD → OPNsense in place. Phase-1 app config (NAT + firewall) applied by the opnsense/config/ Ansible subproject; base config (hostname, interfaces, API credentials) stays seed-side. | OPNsense ships no cloud image of its own; bootstrapping a FreeBSD image in place yields an API-ready firewall that is fully rebuildable from the repo, with a clean split between provisioning (Pulumi) and app config (Ansible). |
| 2026-07-01 | Repo reorganized into self-contained subprojects: aorxi/ slimmed to platform/ (Pulumi project name stays aorxi), ansible/ renamed baseline/, and each concern owns provisioning and configuration (unifi/provision+unifi/config, opnsense/provision+opnsense/config). Shared Pulumi building blocks (Vm, CloudImage) extracted to core/ (aorxi_core). Secrets standardized: gitignored repo-root .env.local is the single source of truth (PULUMI_CONFIG_PASSPHRASE, PROXMOX_VE_*, ANSIBLE_VAULT_PASSWORD); per-project .env.local / .vault_pass files removed. | No live resources existed at reorg time, so no Pulumi state migration was needed; new stacks start fresh. One secrets file ends passphrase sprawl; Ansible reads it via scripts/vault-pass.sh. |
| 2026-07-02 | OpenBao secrets manager (provision-only phase): one independent OpenBao 2.5.4 instance per site on the edge E200s: sa-bao-01 (10.10.30.40) and sb-bao-01 (10.20.30.40), VLAN 30. No stretched Raft; cross-site DR via Raft snapshots over the backup path. Cross-site transit auto-unseal (each site unseals via the other over WireGuard); cold-start deadlock break-glass = seal-migrate to Shamir with recovery keys from the password manager. Interim: Site B non-operational, so sa-bao-01 runs standalone on Shamir manual unseal; the site-b stack stays parked at enabled: false. Bootstrap tier stays in root .env.local, because bao can't hold the secrets that build bao. | OpenBao has no cross-site replication (a Vault Enterprise feature), and a 2-node Raft cluster stretched over WireGuard loses quorum as soon as either node or the tunnel fails, so the design mirrors the one-cluster-per-site rule. Transit auto-unseal gives unattended reboots without storing unseal keys on disk. |
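
The hostname scheme from the 2026-06-06 entry lends itself to a mechanical check. The following is a minimal sketch, not code from the repo: the regular expression and the `parse_hostname` helper are hypothetical, and the pattern assumes only the two site codes (`sa`, `sb`) and two-digit numbering seen in the hostnames listed in this log.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern for {site}-{role}-{num}; the site codes and the
# two-digit number width are assumptions inferred from names in this log.
HOSTNAME_RE = re.compile(r"^(?P<site>s[ab])-(?P<role>[a-z]+)-(?P<num>\d{2})$")


def parse_hostname(name: str) -> Optional[Dict[str, str]]:
    """Return the site/role/num parts, or None if the name is non-canonical."""
    match = HOSTNAME_RE.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    for candidate in ("sa-edge-01", "sb-cmp-03", "sa-sw-02", "pve-a-edge"):
        print(candidate, "->", parse_hostname(candidate))
```

A check like this could run against an inventory file to flag legacy names such as pve-a-edge before they reappear in documentation.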
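
The address-block convention from the 2026-06-21 entry can be expressed as a lookup on the last octet of an address. This is an illustrative sketch only: the function name `classify_address` and the short role labels are hypothetical, and it assumes every address belongs to a routed /24 as the entry describes.

```python
import ipaddress

# (first octet, last octet, role) per the 2026-06-21 address-block convention.
# The role labels are shorthand chosen for this sketch.
ADDRESS_BLOCKS = [
    (1, 1, "gateway"),
    (2, 9, "network-infra"),
    (10, 39, "physical-host"),
    (40, 49, "service-vm"),
    (50, 199, "dhcp-or-static"),
    (200, 254, "vip-metallb"),
]


def classify_address(addr: str) -> str:
    """Map an IPv4 address in a routed /24 to its block in the convention."""
    host_octet = ipaddress.IPv4Address(addr).packed[-1]
    for first, last, role in ADDRESS_BLOCKS:
        if first <= host_octet <= last:
            return role
    # .0 and .255 are the network and broadcast addresses of a /24.
    return "reserved"


if __name__ == "__main__":
    for addr in ("10.10.30.1", "10.10.10.40", "10.20.30.40", "10.10.30.250"):
        print(addr, "->", classify_address(addr))
```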
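
The hard rule in the 2026-06-26 CIDR entry (stay inside RFC 1918, never overlap a real LAN range) can be checked with Python's standard `ipaddress` module. In this sketch the `LAN_RANGES` list is an assumption: it contains only /24s inferred from host addresses that appear in this log, not the full IP tables, and `172.30.0.0/16` stands in for the service range written as 172.30.x.0/16 above.

```python
import ipaddress
from typing import List

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

# Assumed sample of LAN subnets, inferred from addresses in this log.
LAN_RANGES = ["10.10.10.0/24", "10.10.30.0/24", "10.20.30.0/24"]


def check_cluster_cidr(cidr: str, lan_ranges: List[str]) -> List[str]:
    """Return a list of rule violations for a pod or service CIDR."""
    net = ipaddress.ip_network(cidr)
    problems = []
    if not any(net.subnet_of(private) for private in RFC1918):
        problems.append(f"{net} is outside RFC 1918")
    for lan in lan_ranges:
        if net.overlaps(ipaddress.ip_network(lan)):
            problems.append(f"{net} overlaps LAN range {lan}")
    return problems


if __name__ == "__main__":
    for cidr in ("10.128.0.0/14", "172.30.0.0/16", "172.32.0.0/16", "10.10.0.0/16"):
        print(cidr, "->", check_cluster_cidr(cidr, LAN_RANGES) or "ok")
```

Running a check of this kind against the complete IP tables before choosing cluster CIDRs would catch the sort of overlap the earlier draft introduced.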
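
The MTU policy from the 2026-06-25 entry reduces to a small rule: VLANs 60, 65, and 90 use 9000, everything else (including Corosync on VLAN 25) stays at 1500. A minimal sketch of an audit helper follows; `expected_mtu` and `mtu_violations` are hypothetical names, and the input is assumed to be a mapping of VLAN ID to configured MTU gathered by some other means.

```python
from typing import Dict, List

# Storage data-path VLANs that run jumbo frames end-to-end per the policy.
JUMBO_VLANS = {60, 65, 90}
JUMBO_MTU = 9000
DEFAULT_MTU = 1500


def expected_mtu(vlan: int) -> int:
    """MTU the policy prescribes for a VLAN; Corosync (25) falls to 1500."""
    return JUMBO_MTU if vlan in JUMBO_VLANS else DEFAULT_MTU


def mtu_violations(configured: Dict[int, int]) -> List[int]:
    """Return the sorted VLAN IDs whose configured MTU differs from policy."""
    return sorted(vlan for vlan, mtu in configured.items()
                  if mtu != expected_mtu(vlan))


if __name__ == "__main__":
    observed = {20: 1500, 25: 9000, 60: 9000, 65: 9000, 90: 1500}
    print("VLANs violating MTU policy:", mtu_violations(observed))
```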

Source and currency

Decisions are sourced from the vault/Chat Summaries/ session logs (2026-06-06 → 2026-07-02) and dated notes in root CLAUDE.md. The vault is the source of truth — update vault/project-instructions.md and the relevant numbered doc whenever a decision changes.

  • Design Principles — the hard architecture rules distilled from these decisions
  • Architecture Overview — current two-site model reflecting all confirmed decisions
  • Reference — IP tables, port tables, hardware BOM, and this decisions log
