From 1262e03e218e6346c6897a92c3276bba7de6618a Mon Sep 17 00:00:00 2001
From: Petru Paler
Date: Tue, 21 Oct 2025 00:05:44 +0100
Subject: [PATCH] Cluster changes writeup.

---
 docs/CLUSTER_REVAMP.md | 1717 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 1717 insertions(+)
 create mode 100644 docs/CLUSTER_REVAMP.md

diff --git a/docs/CLUSTER_REVAMP.md b/docs/CLUSTER_REVAMP.md
new file mode 100644
index 0000000..1a96d4c
--- /dev/null
+++ b/docs/CLUSTER_REVAMP.md
@@ -0,0 +1,1717 @@
# Cluster Architecture Revamp

**Status**: Planning complete, ready for review and refinement

## Key Decisions

✅ **Replication**: 5-minute intervals (incremental btrfs send)
✅ **WordPress**: Currently syncthing → will use `/data/services` via NFS
✅ **Media**: Only media.hcl needs `/data/media`, constrained to fractal
✅ **Unifi**: Floating (no constraint needed)
✅ **Sunny**: Standalone, ethereum data stays local (not replicated)
✅ **Quorum**: 5 servers (c1, c2, c3, fractal, zippy)
✅ **NFS Failover**: Via Consul DNS (`services.service.consul`)

## Table of Contents
1. [End State Architecture](#end-state-architecture)
2. [Migration Steps](#migration-steps)
3. [Service Catalog](#service-catalog)
4. [Failover Procedures](#failover-procedures)

---

## End State Architecture

### Cluster Topology

**5-Server Quorum (Consul + Nomad server+client):**
- **c1, c2, c3**: Cattle nodes - x86_64, run most stateless workloads
- **fractal**: Storage node - x86_64, 6x spinning drives, runs media workloads
- **zippy**: Stateful anchor - x86_64, runs database workloads (via affinity), primary NFS server

**Standalone Nodes (not in quorum):**
- **sunny**: x86_64, ethereum node + staking, base NixOS configs only
- **chilly**: x86_64, Home Assistant VM, base NixOS configs only

**Quorum Math:**
- 5 servers → quorum requires 3 healthy nodes
- Can tolerate 2 simultaneous failures
- Bootstrap expect: 3

### Storage Architecture

**Primary Storage (zippy):**
- `/persist/services` - btrfs subvolume
  - Contains: mysql, postgres, redis, clickhouse, mongodb, app data
  - Exported via NFS to: `services.service.consul:/persist/services`
  - Replicated via **btrfs send** to c1 and c2 every **5 minutes** (incremental)

**Standby Storage (c1, c2):**
- `/persist/services-standby` - btrfs subvolume
  - Receives replicated snapshots from zippy via incremental btrfs send
  - Can be promoted to `/persist/services` and exported as NFS during failover
  - Maximum data loss: **5 minutes** (last replication interval)

**Standalone Storage (sunny):**
- `/persist/ethereum` - local btrfs subvolume (or similar)
  - Contains: ethereum blockchain data, staking keys
  - **NOT replicated** - too large/expensive to replicate a full ethereum node
  - Backed up via kopia to fractal (if feasible/needed)

**Media Storage (fractal):**
- `/data/media` - existing spinning drive storage
  - Exported via Samba (existing)
  - Mounted on c1, c2, c3 via CIFS (existing)
  - Local access on fractal for media workloads

**Shared Storage (fractal):**
- `/data/shared` - existing spinning drive storage
  - Exported via Samba (existing)
  - Mounted on c1, c2, c3 via CIFS (existing)

### Network Services

**NFS Primary (zippy):**
```nix
services.nfs.server = {
  enable = true;
  exports = ''
    /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
  '';
};

services.consul.extraConfig.services = [{
  name = "services";
  port = 2049;
  checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
}];
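# Note: clients mount via the `services.service.consul` DNS name rather than a
# hostname. Whichever node registers this Consul service and passes the TCP
# health check is the one returned, which is what makes the failover to a
# promoted standby transparent to NFS clients.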
```

**NFS Client (all nodes):**
```nix
fileSystems."/data/services" = {
  device = "services.service.consul:/persist/services";
  fsType = "nfs";
  options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
};
```

**Samba Exports (fractal - existing):**
- `//fractal/media` → `/data/media`
- `//fractal/shared` → `/data/shared`

### Nomad Job Placement Strategy

**Affinity-based (prefer zippy, allow c1/c2):**
- mysql, postgres, redis - stateful databases
- Run on zippy normally, can fail over to c1/c2 if zippy is down

**Constrained (must run on fractal):**
- **media.hcl** - radarr, sonarr, bazarr, plex, qbittorrent
  - Reason: Heavy /data/media access, benefits from local storage
- **prometheus.hcl** - metrics database with 30d retention
  - Reason: Large time-series data, spinning disks OK, saves SSD space
- **loki.hcl** - log aggregation with 31d retention
  - Reason: Large log data, spinning disks OK
- **clickhouse.hcl** - analytics database for plausible
  - Reason: Large time-series data, spinning disks OK

**Floating (can run anywhere on c1/c2/c3/fractal/zippy):**
- All other services, including:
  - traefik, authentik, web apps
  - **grafana** (small data, just dashboards/config, queries prometheus for metrics)
  - vector (system job, runs everywhere)
- Nomad schedules based on resources and constraints

### Data Migration

**Path changes needed in Nomad jobs:**
- `/data/compute/appdata/*` → `/data/services/*`
- `/data/compute/config/*` → `/data/services/*`
- `/data/sync/wordpress` → `/data/services/wordpress`

**No changes needed:**
- `/data/media/*` - stays the same (local on fractal, CIFS mount elsewhere; used only by media services)
- `/data/shared/*` - stays the same (CIFS mount from fractal)

**Deprecated after migration:**
- `/data/sync/wordpress` - currently managed by syncthing to avoid slow GlusterFS
  - Will be replaced by NFS mount at `/data/services/wordpress`
  - Syncthing configuration for this can be removed
  - Final sync: copy from syncthing to `/persist/services/wordpress` on zippy before cutover

---

## Migration Steps

**Important path simplification note:**
- All service paths use `/data/services/*` directly (not `/data/services/appdata/*`)
- Example: `/data/compute/appdata/mysql` → `/data/services/mysql`
- Simpler, cleaner, easier to manage

### Phase 0: Preparation
**Duration: 1-2 hours**

1. **Backup everything**
   ```bash
   # On all nodes, ensure kopia backups are current
   kopia snapshot list

   # Backup glusterfs data manually
   rsync -av /data/compute/ /backup/compute-pre-migration/
   ```

2. **Document current state**
   ```bash
   # Save current nomad job list
   nomad job status -json > /backup/nomad-jobs-pre-migration.json

   # Save consul service catalog
   consul catalog services > /backup/consul-services-pre-migration.txt
   ```

3. 
**Review this document**
   - Verify all services are cataloged
   - Confirm priority assignments
   - Adjust as needed

### Phase 1: Convert fractal to NixOS
**Duration: 6-8 hours**

**Current state:**
- Proxmox on ZFS
- System pool: `rpool` (~500GB, will be wiped)
- Data pools (preserved):
  - `double1` - 3.6T (homes, shared)
  - `double2` - 7.2T (backup - kopia repo, PBS)
  - `double3` - 17T (media, torrent)
- Services: Samba (homes, shared, media), Kopia server, PBS
- Bind mounts: `/data/{homes,shared,media,torrent}` → ZFS datasets

**Goal:** Fresh NixOS on rpool, preserve data pools, join cluster

#### Step-by-step procedure:

**1. Pre-migration documentation**
  ```bash
  # Save fractal's ZFS layout (script is created locally, run remotely via stdin)
  cat > /tmp/detect-zfs.sh << 'EOF'
#!/bin/bash
echo "=== ZFS Pools ==="
zpool status

echo -e "\n=== ZFS Datasets ==="
zfs list -o name,mountpoint,used,avail,mounted -r double1 double2 double3

echo -e "\n=== Bind mounts ==="
grep double /etc/fstab

echo -e "\n=== Data directories ==="
ls -la /data/

echo -e "\n=== Samba users/groups ==="
getent group shared compute
getent passwd compute
EOF
  ssh fractal bash < /tmp/detect-zfs.sh > /backup/fractal-zfs-layout.txt

  # Save samba config
  scp fractal:/etc/samba/smb.conf /backup/fractal-smb.conf

  # Save kopia certs and config
  scp -r fractal:~/kopia-certs /backup/fractal-kopia-certs/
  scp fractal:~/.config/kopia/repository.config /backup/fractal-kopia-repository.config

  # Verify kopia backups are current
  ssh fractal "kopia snapshot list --all"
  ```

**2. Stop services on fractal**
  ```bash
  ssh fractal "systemctl stop smbd nmbd kopia"
  # Don't stop PBS yet (in case we need to restore)
  ```

**3. Install NixOS**
  - Boot NixOS installer USB
  - **IMPORTANT**: Do NOT touch double1, double2, double3 during install!
  - Install only on `rpool` (or create new pool if needed)

  ```bash
  # In NixOS installer
  # Option A: Reuse rpool (wipe and recreate)
  zpool destroy rpool

  # Option B: Use different disk if available
  # Then follow standard NixOS btrfs install on that disk
  ```

  - Use standard encrypted btrfs layout (matching other hosts)
  - Minimal install first, will add cluster configs later

**4. First boot - import ZFS pools**
  ```bash
  # SSH into fresh NixOS install

  # Import pools (read-only first, to be safe)
  zpool import -f -o readonly=on double1
  zpool import -f -o readonly=on double2
  zpool import -f -o readonly=on double3

  # Verify datasets
  zfs list -r double1 double2 double3

  # Example output should show:
  # double1/homes
  # double1/shared
  # double2/backup
  # double3/media
  # double3/torrent

  # If everything looks good, export and reimport read-write
  zpool export double1 double2 double3
  zpool import double1
  zpool import double2
  zpool import double3

  # Set ZFS mountpoints (if needed)
  # These may already be set from Proxmox
  zfs set mountpoint=/double1 double1
  zfs set mountpoint=/double2 double2
  zfs set mountpoint=/double3 double3
  ```

**5. Create fractal NixOS configuration**
  ```nix
  # hosts/fractal/default.nix
  { config, pkgs, ... 
}:
  {
    imports = [
      ../../common/encrypted-btrfs-layout.nix
      ../../common/global
      ../../common/cluster-node.nix # Consul + Nomad (will add in step 7)
      ../../common/nomad.nix # Both server and client
      ./hardware.nix
    ];

    networking.hostName = "fractal";

    # ZFS support; extraPools generates zfs-import-<pool>.service units that
    # run before zfs-mount.service, so no extra ordering glue is needed
    boot.supportedFilesystems = [ "zfs" ];
    boot.zfs.extraPools = [ "double1" "double2" "double3" ];

    # Bind mounts for /data (matching Proxmox setup)
    fileSystems."/data/homes" = {
      device = "/double1/homes";
      fsType = "none";
      options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
    };

    fileSystems."/data/shared" = {
      device = "/double1/shared";
      fsType = "none";
      options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
    };

    fileSystems."/data/media" = {
      device = "/double3/media";
      fsType = "none";
      options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
    };

    fileSystems."/data/torrent" = {
      device = "/double3/torrent";
      fsType = "none";
      options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
    };

    fileSystems."/backup" = {
      device = "/double2/backup";
      fsType = "none";
      options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
    };

    # Create data directory structure
    systemd.tmpfiles.rules = [
      "d /data 0755 root root -"
    ];

    # Users and groups for samba
    users.groups.shared = { gid = 1001; };
    users.groups.compute = { gid = 1002; };
    users.users.compute = {
      isSystemUser = true;
      uid = 1002;
      group = "compute";
    };

    # Ensure ppetru is in shared group
    users.users.ppetru.extraGroups = [ "shared" ];

    # Samba server
    services.samba = {
      enable = true;
      openFirewall = true;

      extraConfig = ''
        workgroup = WORKGROUP
        server string = fractal
        netbios name = fractal
        security = user
        map to guest = bad user
      '';

      shares = {
        homes = {
          comment = "Home Directories";
          browseable = "no";
          path = "/data/homes/%S";
          "read only" = "no";
        };

        shared = {
          path = "/data/shared";
          "read only" = "no";
          browseable = "yes";
          "guest ok" = "no";
          "create mask" = "0775";
          "directory mask" = "0775";
          "force group" = "+shared";
        };

        media = {
          path = "/data/media";
          "read only" = "no";
          browseable = "yes";
          "guest ok" = "no";
          "create mask" = "0755";
          "directory mask" = "0755";
        };
      };
    };

    # Kopia backup server (cert paths match the persisted directory below,
    # which is where the pre-migration backup step restores them)
    systemd.services.kopia-server = {
      description = "Kopia Backup Server";
      wantedBy = [ "multi-user.target" ];
      after = [ "network.target" "zfs-mount.service" ];

      serviceConfig = {
        User = "ppetru";
        Group = "users";
        ExecStart = ''
          ${pkgs.kopia}/bin/kopia server start \
            --address 0.0.0.0:51515 \
            --tls-cert-file /home/ppetru/kopia-certs/kopia.cert \
            --tls-key-file /home/ppetru/kopia-certs/kopia.key
        '';
        Restart = "on-failure";
      };
    };

    # Kopia nightly snapshot (from cron)
    systemd.services.kopia-snapshot = {
      description = "Kopia snapshot of homes and shared";
      serviceConfig = {
        Type = "oneshot";
        User = "ppetru";
        Group = "users";
        ExecStart = ''
          ${pkgs.kopia}/bin/kopia --config-file=/home/ppetru/.config/kopia/repository.config \
            snapshot create /data/homes /data/shared \
            --log-level=warning --no-progress
        '';
      };
    };

    systemd.timers.kopia-snapshot = {
      wantedBy = [ "timers.target" ];
      timerConfig = {
        OnCalendar = "22:47";
        Persistent = true;
      };
    };

    # Keep kopia config and certs persistent
    environment.persistence."/persist" = {
      directories = [
        "/home/ppetru/.config/kopia"
        "/home/ppetru/kopia-certs"
      ];
    };

    networking.firewall.allowedTCPPorts = [
      139 445 # Samba
      51515 # Kopia
    ];
    networking.firewall.allowedUDPPorts = [
      137 138 # Samba
    ];
  }
  ```

**6. Deploy initial config (without cluster)**
  ```bash
  # First, deploy without cluster-node.nix to verify storage works
  # Comment out cluster-node import temporarily

  deploy -s '.#fractal'

  # Verify mounts
  ssh fractal "df -h | grep data"
  ssh fractal "ls -la /data/"

  # Test samba
  smbclient -L fractal -U ppetru

  # Test kopia
  ssh fractal "systemctl status kopia-server"
  ```

**7. Join cluster (add to quorum)**
  ```bash
  # Uncomment cluster-node.nix import in fractal config
  # Update all cluster configs for 5-server quorum (see step 8 below)

  deploy # Deploy to all nodes

  # Verify quorum
  consul members
  nomad server members
  ```

**8. Update cluster configs for 5-server quorum**
  ```nix
  # common/consul.nix
  servers = ["c1" "c2" "c3" "fractal" "zippy"];
  bootstrap_expect = 3;

  # common/nomad.nix
  servers = ["c1" "c2" "c3" "fractal" "zippy"];
  bootstrap_expect = 3;
  ```

**9. Verify fractal is fully operational**
  ```bash
  # Check all services
  ssh fractal "systemctl status samba-smbd kopia-server kopia-snapshot.timer"

  # Verify ZFS pools
  ssh fractal "zpool status"
  ssh fractal "zfs list"

  # Test accessing shares from another node
  ssh c1 "ls /data/media /data/shared"

  # Verify kopia clients can still connect
  kopia repository status --server=https://fractal:51515

  # Check nomad can see fractal
  nomad node status | grep fractal

  # Verify quorum
  consul members # Should see c1, c2, c3, fractal
  nomad server members # Should see 4 servers (zippy joins in Phase 2)
  ```

### Phase 2: Setup zippy storage layer
**Duration: 2-3 hours**

**Goal:** Prepare zippy for NFS server role, setup replication

1. **Create btrfs subvolume on zippy**
   ```bash
   ssh zippy
   sudo btrfs subvolume create /persist/services
   sudo chown ppetru:users /persist/services
   ```

2. **Update zippy configuration**
   ```nix
   # hosts/zippy/default.nix
   imports = [
     ../../common/encrypted-btrfs-layout.nix
     ../../common/global
     ../../common/cluster-node.nix # Adds to quorum
     ../../common/nomad.nix
     ./hardware.nix
   ];

   # NFS server
   services.nfs.server = {
     enable = true;
     exports = ''
       /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
     '';
   };

   # Consul service registration for NFS
   services.consul.extraConfig.services = [{
     name = "services";
     port = 2049;
     checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
   }];

   # Btrfs replication to standbys (incremental after the first full send).
   # Runs as root and needs root SSH access to c1 for `btrfs receive`.
   systemd.services.replicate-to-c1 = {
     description = "Replicate /persist/services to c1";
     path = [ pkgs.btrfs-progs pkgs.openssh pkgs.coreutils pkgs.findutils ];
     script = ''
       set -euo pipefail

       # Snapshot names embed a sortable timestamp, so sort by name rather
       # than `ls -t` (a snapshot's mtime reflects the source subvolume,
       # not the moment the snapshot was taken)
       btrfs subvolume snapshot -r /persist/services "/persist/services@$(date +%Y%m%d-%H%M%S)"
       LATEST=$(ls -d /persist/services@* | sort | tail -1)

       # Get previous snapshot for incremental send
       PREV=$(ls -d /persist/services@* | sort | tail -2 | head -1)

       # First run: full send.
       # Subsequent runs: incremental with -p (parent); the parent snapshot
       # must already exist on the receiver.
       if [ "$LATEST" != "$PREV" ]; then
         btrfs send -p "$PREV" "$LATEST" | ssh c1 "btrfs receive /persist/services-standby/"
       else
         # Only one snapshot exists: first run, full send
         btrfs send "$LATEST" | ssh c1 "btrfs receive /persist/services-standby/"
       fi

       # Cleanup old snapshots (keep last 24 hours on the sender; the newest
       # snapshot always survives as the parent for the next incremental)
       find /persist/ -maxdepth 1 -name 'services@*' -mtime +1 \
         -exec btrfs subvolume delete {} \;
     '';
   };

   systemd.timers.replicate-to-c1 = {
     wantedBy = [ "timers.target" ];
     timerConfig = {
       OnCalendar = "*:0/5"; # Every 5 minutes (incremental after first full send)
       Persistent = true;
     };
   };

   # Same for c2
   systemd.services.replicate-to-c2 = { ... };
   systemd.timers.replicate-to-c2 = { ... };
   ```

3. **Setup standby storage on c1 and c2**
   ```bash
   # On c1 and c2
   ssh c1 sudo btrfs subvolume create /persist/services-standby
   ssh c2 sudo btrfs subvolume create /persist/services-standby

   # Replicated snapshots will arrive as /persist/services-standby/services@<timestamp>
   ```

4. **Deploy and verify**
   ```bash
   deploy -s '.#zippy'

   # Verify NFS export
   showmount -e zippy

   # Verify Consul registration
   dig @localhost -p 8600 services.service.consul
   ```

5. **Verify quorum is now 5 servers**
   ```bash
   consul members # Should show c1, c2, c3, fractal, zippy
   nomad server members
   ```

### Phase 3: Migrate from GlusterFS to NFS
**Duration: 3-4 hours**

**Goal:** Move all data, update mounts, remove GlusterFS

1. **Initial copy from GlusterFS to zippy** (services still running)
   ```bash
   # On any node with /data/compute mounted. The appdata/config levels are
   # flattened away, per the path simplification note above.
   # (Anything under /data/compute besides appdata/ and config/ should be
   # reviewed and copied manually.)
   rsync -av --progress /data/compute/appdata/ zippy:/persist/services/
   rsync -av --progress /data/compute/config/ zippy:/persist/services/

   # Verify
   ssh zippy du -sh /persist/services
   ```

2. **Stop all Nomad jobs temporarily and run a final sync**
   ```bash
   # Get list of running jobs
   nomad job status | grep running | awk '{print $1}' > /tmp/running-jobs.txt

   # Stop all (they'll be restarted with updated paths in Phase 4)
   cat /tmp/running-jobs.txt | xargs -I {} nomad job stop {}

   # With all writers stopped, re-run the sync so zippy's copy is consistent;
   # this only transfers what changed since the initial copy
   rsync -av --progress /data/compute/appdata/ zippy:/persist/services/
   rsync -av --progress /data/compute/config/ zippy:/persist/services/
   ```

3. **Update all nodes to mount NFS**
   ```nix
   # Update common/glusterfs-client.nix → common/nfs-client.nix
   # OR update common/cluster-node.nix to import nfs-client instead

   fileSystems."/data/services" = {
     device = "services.service.consul:/persist/services";
     fsType = "nfs";
     options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
   };

   # Remove old GlusterFS mount
   # fileSystems."/data/compute" = ... # DELETE
   ```

4. **Deploy updated configs**
   ```bash
   deploy -s '.#c1' '.#c2' '.#c3' '.#fractal' '.#zippy'
   ```

5. **Verify NFS mounts**
   ```bash
   for host in c1 c2 c3 fractal zippy; do
     ssh $host "df -h | grep services"
   done
   ```

6. **Remove GlusterFS from cluster**
   ```bash
   # On c1 (or any gluster server)
   gluster volume stop compute
   gluster volume delete compute

   # On all nodes
   for host in c1 c2 c3; do
     ssh $host "sudo systemctl stop glusterd; sudo systemctl disable glusterd"
   done
   ```

7. **Remove GlusterFS from NixOS configs**
   ```nix
   # common/compute-node.nix - remove ./glusterfs.nix import
   # Deploy again
   deploy
   ```

### Phase 4: Update and redeploy Nomad jobs
**Duration: 2-4 hours**

**Goal:** Update all Nomad job paths, add constraints/affinities, redeploy

1. **Update job specs** (see Service Catalog below for details; a bulk path rewrite is sketched there)
   - Change `/data/compute` → `/data/services`, dropping the appdata/config levels
   - Add constraints for media jobs → fractal
   - Add affinities for database jobs → zippy

2. 
**Deploy critical services first** + ```bash + # Core infrastructure + nomad run services/mysql.hcl + nomad run services/postgres.hcl + nomad run services/redis.hcl + nomad run services/traefik.hcl + nomad run services/authentik.hcl + + # Verify + nomad job status mysql + consul catalog services + ``` + +3. **Deploy high-priority services** + ```bash + nomad run services/prometheus.hcl + nomad run services/grafana.hcl + nomad run services/loki.hcl + nomad run services/vector.hcl + + nomad run services/unifi.hcl + nomad run services/gitea.hcl + ``` + +4. **Deploy medium-priority services** + ```bash + # See service catalog for full list + nomad run services/wordpress.hcl + nomad run services/ghost.hcl + nomad run services/wiki.hcl + # ... etc + ``` + +5. **Deploy low-priority services** + ```bash + nomad run services/media.hcl # Will run on fractal due to constraint + # ... etc + ``` + +6. **Verify all services healthy** + ```bash + nomad job status + consul catalog services + # Check traefik dashboard for health + ``` + +### Phase 5: Convert sunny to NixOS (Optional, can defer) +**Duration: 6-10 hours (split across 2 stages)** + +**Current state:** +- Proxmox with ~1.5TB ethereum node data +- 2x LXC containers: besu (execution client), lighthouse (consensus beacon) +- 1x VM: Rocketpool smartnode (docker containers for validator, node, MEV-boost, etc.) +- Running in "hybrid mode" - managing own execution/consensus, rocketpool manages the rest + +**Goal:** Get sunny on NixOS quickly, preserve ethereum data, defer "perfect" native setup + +--- + +#### Stage 1: Quick NixOS Migration (containers) +**Duration: 6-8 hours** +**Goal:** NixOS + containerized ethereum stack, minimal disruption + +**1. Pre-migration backup and documentation** + ```bash + # Document current setup + ssh sunny "pct list" > /backup/sunny-containers.txt + ssh sunny "qm list" > /backup/sunny-vms.txt + + # Find ethereum data locations in LXC containers + ssh sunny "pct config BESU_CT_ID" > /backup/sunny-besu-config.txt + ssh sunny "pct config LIGHTHOUSE_CT_ID" > /backup/sunny-lighthouse-config.txt + + # Document rocketpool VM volumes + ssh sunny "qm config ROCKETPOOL_VM_ID" > /backup/sunny-rocketpool-config.txt + + # Estimate ethereum data size + ssh sunny "du -sh /path/to/besu/data" + ssh sunny "du -sh /path/to/lighthouse/data" + + # Backup rocketpool config (docker-compose, wallet keys, etc.) + # This is in the VM - need to access and backup critical files + ``` + +**2. Extract ethereum data from containers/VM** + ```bash + # Stop ethereum services to get consistent state + # (This will pause validation! Plan for attestation penalties) + + # Copy besu data out of LXC + ssh sunny "pct stop BESU_CT_ID" + rsync -av --progress sunny:/var/lib/lxc/BESU_CT_ID/rootfs/path/to/besu/ /backup/sunny-besu-data/ + + # Copy lighthouse data out of LXC + ssh sunny "pct stop LIGHTHOUSE_CT_ID" + rsync -av --progress sunny:/var/lib/lxc/LIGHTHOUSE_CT_ID/rootfs/path/to/lighthouse/ /backup/sunny-lighthouse-data/ + + # Copy rocketpool data out of VM + # This includes validator keys, wallet, node config + # Access VM and copy out: ~/.rocketpool/data + ``` + +**3. Install NixOS on sunny** + - Fresh install with btrfs + impermanence + - Create large `/persist/ethereum` for 1.5TB+ data + - **DO NOT** try to resync from network (takes weeks!) + +**4. 
Restore ethereum data to NixOS**
  ```bash
  # After NixOS install, copy data back
  ssh sunny "mkdir -p /persist/ethereum/{besu,lighthouse,rocketpool}"

  rsync -av --progress /backup/sunny-besu-data/ sunny:/persist/ethereum/besu/
  rsync -av --progress /backup/sunny-lighthouse-data/ sunny:/persist/ethereum/lighthouse/
  # Rocketpool data copied later
  ```

**5. Create sunny NixOS config (container-based)**
  ```nix
  # hosts/sunny/default.nix
  { config, pkgs, ... }:
  {
    imports = [
      ../../common/encrypted-btrfs-layout.nix
      ../../common/global
      ./hardware.nix
    ];

    networking.hostName = "sunny";

    # NO cluster-node import - standalone for now
    # Can add to quorum later if desired

    # Container runtime
    virtualisation.podman = {
      enable = true;
      dockerCompat = true; # Provides 'docker' command
      # docker-compose (used for rocketpool below) talks to the docker
      # socket, so expose podman's docker-compatible socket as well
      dockerSocket.enable = true;
      defaultNetwork.settings.dns_enabled = true;
    };

    # Besu execution client (container)
    virtualisation.oci-containers.containers.besu = {
      image = "hyperledger/besu:latest";
      volumes = [
        "/persist/ethereum/besu:/var/lib/besu"
      ];
      ports = [
        "8545:8545" # HTTP RPC
        "8546:8546" # WebSocket RPC
        "30303:30303" # P2P
      ];
      # The engine API (8551) is deliberately not published; lighthouse
      # reaches it over the container network
      cmd = [
        "--data-path=/var/lib/besu"
        "--rpc-http-enabled=true"
        "--rpc-http-host=0.0.0.0"
        "--rpc-ws-enabled=true"
        "--rpc-ws-host=0.0.0.0"
        "--engine-rpc-enabled=true"
        "--engine-host-allowlist=*"
        "--engine-jwt-secret=/var/lib/besu/jwt.hex"
        # Add other besu flags as needed
      ];
      autoStart = true;
    };

    # Lighthouse beacon client (container)
    virtualisation.oci-containers.containers.lighthouse-beacon = {
      image = "sigp/lighthouse:latest";
      volumes = [
        "/persist/ethereum/lighthouse:/data"
        "/persist/ethereum/besu/jwt.hex:/jwt.hex:ro"
      ];
      ports = [
        "5052:5052" # HTTP API
        "9000:9000" # P2P
      ];
      cmd = [
        "lighthouse"
        "beacon"
        "--datadir=/data"
        "--http"
        "--http-address=0.0.0.0"
        "--execution-endpoint=http://besu:8551"
        "--execution-jwt=/jwt.hex"
        # Add other lighthouse flags
      ];
      dependsOn = [ "besu" ];
      autoStart = true;
    };

    # Rocketpool stack (podman-compose for multi-container setup)
    # TODO: This requires converting docker-compose to NixOS config
    # For now, can run docker-compose via systemd service
    systemd.services.rocketpool = {
      description = "Rocketpool Smartnode Stack";
      # oci-containers units are named podman-<name>.service
      after = [ "podman-besu.service" "podman-lighthouse-beacon.service" ];
      wantedBy = [ "multi-user.target" ];

      serviceConfig = {
        Type = "oneshot";
        RemainAfterExit = "yes";
        WorkingDirectory = "/persist/ethereum/rocketpool";
        ExecStart = "${pkgs.docker-compose}/bin/docker-compose up -d";
        ExecStop = "${pkgs.docker-compose}/bin/docker-compose down";
      };
    };

    # Ethereum data lives directly under /persist/ethereum (see step 4),
    # so no impermanence mapping is needed for it

    # Firewall for ethereum
    networking.firewall = {
      allowedTCPPorts = [
        30303 # Besu P2P
        9000 # Lighthouse P2P
        # Add rocketpool ports
      ];
      allowedUDPPorts = [
        30303 # Besu P2P
        9000 # Lighthouse P2P
      ];
    };
  }
  ```

**6. Setup rocketpool docker-compose on NixOS**
  ```bash
  # After NixOS is running, restore rocketpool config
  ssh sunny "mkdir -p /persist/ethereum/rocketpool"

  # Copy rocketpool data (wallet, keys, config)
  rsync -av /backup/sunny-rocketpool-data/ sunny:/persist/ethereum/rocketpool/

  # Create docker-compose.yml for rocketpool stack
  # Based on rocketpool hybrid mode docs
  # This runs: validator, node software, MEV-boost, prometheus, etc.
  # Connects to your besu + lighthouse containers
  ```

**7. 
Deploy and test**
  ```bash
  deploy -s '.#sunny'

  # Verify containers are running
  ssh sunny "podman ps"

  # Check besu sync status
  ssh sunny "curl -X POST -H 'Content-Type: application/json' --data '{\"jsonrpc\":\"2.0\",\"method\":\"eth_syncing\",\"params\":[],\"id\":1}' http://localhost:8545"

  # Check lighthouse sync status
  ssh sunny "curl http://localhost:5052/eth/v1/node/syncing"

  # Monitor rocketpool
  ssh sunny "cd /persist/ethereum/rocketpool && docker-compose logs -f"
  ```

**8. Monitor and stabilize**
  - Ethereum should resume from where it left off (not resync!)
  - Validation will resume once the beacon is synced
  - May have missed a few attestations during migration (minor penalty)

---

#### Stage 2: Native NixOS Services (Future)
**Duration: TBD (do this later when time permits)**
**Goal:** Convert to native NixOS services using ethereum-nix

**Why defer this:**
- Complex (rocketpool not fully packaged for Nix)
- Current container setup works fine
- Can migrate incrementally (besu → native, then lighthouse, etc.)
- No downtime once Stage 1 is stable

**When ready:**
1. Research ethereum-nix support for besu + lighthouse + rocketpool
2. Test on separate machine first
3. Migrate one service at a time with minimal downtime
4. Document in separate migration plan

**For now:** Stage 1 gets sunny on NixOS with base configs, managed declaratively, just using containers instead of native services.

### Phase 6: Verification and cleanup
**Duration: 1 hour**

1. **Test failover procedure** (see Failover Procedures below)

2. **Verify backups are working**
   ```bash
   kopia snapshot list
   # Check that /persist/services is being backed up
   ```

3. **Update documentation**
   - Update README.md
   - Document new architecture
   - Update stateful-commands.txt

4. **Clean up old GlusterFS data**
   ```bash
   # Only after verifying everything works!
   for host in c1 c2 c3; do
     ssh $host "sudo rm -rf /persist/glusterfs"
   done
   ```

---

## Service Catalog

**Legend:**
- **Priority**: CRITICAL (must be up) / HIGH (important) / MEDIUM (nice to have) / LOW (can wait) / DEPRECATED (no longer used; decommission instead of migrating)
- **Target**: Where it should run (constraint or affinity)
- **Data**: What data it needs access to
- **Changes**: What needs updating in the .hcl file

### Core Infrastructure

#### mysql
- **File**: `services/mysql.hcl`
- **Priority**: CRITICAL
- **Current**: Uses `/data/compute/appdata/mysql`
- **Target**: Affinity for zippy, allow c1/c2
- **Data**: `/data/services/mysql` (NFS from zippy)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/mysql` → `/data/services/mysql`
  - ✏️ Add affinity:
    ```hcl
    affinity {
      attribute = "${node.unique.name}"
      value = "zippy"
      weight = 100
    }
    ```
  - ✏️ Add constraint to allow fallback:
    ```hcl
    constraint {
      attribute = "${node.unique.name}"
      operator = "regexp"
      value = "zippy|c1|c2"
    }
    ```
- **Notes**: Core database, needs to stay up. Consul DNS `mysql.service.consul` unchanged.
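Putting the pieces together, a minimal sketch of how the new volume path, affinity, and fallback constraint might sit in the job file (the datacenter name, image, and container mount target are illustrative, not taken from the actual mysql.hcl):

```hcl
job "mysql" {
  datacenters = ["dc1"] # illustrative

  group "mysql" {
    # Soft preference for zippy; the weight biases scheduling without
    # preventing placement elsewhere
    affinity {
      attribute = "${node.unique.name}"
      value     = "zippy"
      weight    = 100
    }

    # Hard limit to the nodes that can serve /data/services (primary or
    # promoted standby)
    constraint {
      attribute = "${node.unique.name}"
      operator  = "regexp"
      value     = "zippy|c1|c2"
    }

    task "mysql" {
      driver = "docker"

      config {
        image = "mysql:8" # illustrative
        volumes = [
          "/data/services/mysql:/var/lib/mysql", # NFS-backed host path
        ]
      }
    }
  }
}
```

The same affinity/constraint pair applies to postgres and redis below.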

#### postgres
- **File**: `services/postgres.hcl`
- **Priority**: CRITICAL
- **Current**: Uses `/data/compute/appdata/postgres`, `/data/compute/appdata/pgadmin`
- **Target**: Affinity for zippy, allow c1/c2
- **Data**: `/data/services/postgres`, `/data/services/pgadmin` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/*`
  - ✏️ Add affinity and constraint (same as mysql)
- **Notes**: Core database for authentik, gitea, plausible, netbox, etc.

#### redis
- **File**: `services/redis.hcl`
- **Priority**: CRITICAL
- **Current**: Uses `/data/compute/appdata/redis`
- **Target**: Affinity for zippy, allow c1/c2
- **Data**: `/data/services/redis` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/redis` → `/data/services/redis`
  - ✏️ Add affinity and constraint (same as mysql)
- **Notes**: Used by authentik, wordpress. Should co-locate with databases.

#### traefik
- **File**: `services/traefik.hcl`
- **Priority**: CRITICAL
- **Current**: Uses `/data/compute/config/traefik`
- **Target**: Float on c1/c2/c3 (keepalived handles HA)
- **Data**: `/data/services/traefik` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/config/traefik` → `/data/services/traefik`
- **Notes**: Reverse proxy, has keepalived for VIP failover. Critical for all web access.

#### authentik
- **File**: `services/authentik.hcl`
- **Priority**: CRITICAL
- **Current**: No persistent volumes (stateless, uses postgres/redis)
- **Target**: Float on c1/c2/c3
- **Data**: None (uses postgres.service.consul, redis.service.consul)
- **Changes**: None needed
- **Notes**: SSO for most services. Must stay up.

### Monitoring Stack

#### prometheus
- **File**: `services/prometheus.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/prometheus`
- **Target**: Constrain to fractal (large time-series data on spinning disks; see placement strategy above)
- **Data**: `/data/services/prometheus` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/prometheus` → `/data/services/prometheus`
  - ✏️ Add constraint pinning it to fractal (same form as media.hcl)
- **Notes**: Metrics database. Important for monitoring but not critical for services.

#### grafana
- **File**: `services/grafana.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/grafana`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/grafana` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/grafana` → `/data/services/grafana`
- **Notes**: Monitoring UI. Small data (dashboards/config); depends on prometheus.

#### loki
- **File**: `services/loki.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/loki`
- **Target**: Constrain to fractal (large log data on spinning disks; see placement strategy above)
- **Data**: `/data/services/loki` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/loki` → `/data/services/loki`
  - ✏️ Add constraint pinning it to fractal (same form as media.hcl)
- **Notes**: Log aggregation. Important for debugging.

#### vector
- **File**: `services/vector.hcl`
- **Priority**: MEDIUM
- **Current**: No persistent volumes, type=system (runs on all nodes)
- **Target**: System job (runs everywhere)
- **Data**: None (ephemeral logs, ships to loki)
- **Changes**:
  - ❓ Check if the glusterfs log path is still needed: `/var/log/glusterfs:/var/log/glusterfs:ro`
  - ✏️ Remove glusterfs log collection after GlusterFS is removed
- **Notes**: Log shipper. Can tolerate downtime.
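Most of the catalog entries are the same mechanical path substitution, so a bulk rewrite is worth sketching (run from the repo root and review `git diff` before deploying; a few jobs such as wordpress and media need hand edits beyond this):

```bash
# Rewrite the common path prefixes across all job specs. The appdata/config
# levels are dropped entirely, per the path simplification note above.
sed -i \
  -e 's|/data/compute/appdata/|/data/services/|g' \
  -e 's|/data/compute/config/|/data/services/|g' \
  -e 's|/data/sync/wordpress|/data/services/wordpress|g' \
  services/*.hcl
```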

### Databases (Specialized)

#### clickhouse
- **File**: `services/clickhouse.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/clickhouse`
- **Target**: Constrain to fractal (large time-series data on spinning disks; see placement strategy above)
- **Data**: `/data/services/clickhouse` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/clickhouse` → `/data/services/clickhouse`
  - ✏️ Add constraint pinning it to fractal (same form as media.hcl)
- **Notes**: Used by plausible. Large time-series data. Important but can be recreated.

#### mongodb
- **File**: `services/unifi.hcl` (embedded in unifi job)
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/unifi/mongodb`
- **Target**: Floats with unifi
- **Data**: `/data/services/unifi/mongodb` (NFS)
- **Changes**: See unifi below
- **Notes**: Only used by unifi. Should stay with the unifi controller.

### Web Applications

#### wordpress
- **File**: `services/wordpress.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/sync/wordpress` (syncthing-managed to avoid slow GlusterFS)
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/wordpress` (NFS from zippy)
- **Changes**:
  - ✏️ Volume path: `/data/sync/wordpress` → `/data/services/wordpress`
  - 📋 **Before cutover**: Copy data from syncthing to zippy: `rsync -av /data/sync/wordpress/ zippy:/persist/services/wordpress/` (see the cutover sketch below)
  - 📋 **After migration**: Remove syncthing configuration for wordpress sync
- **Notes**: Production website. Important but can tolerate brief downtime during migration.

#### ghost
- **File**: `services/ghost.hcl`
- **Priority**: DEPRECATED - no longer used, should be wiped
- **Current**: Uses `/data/compute/appdata/ghost`
- **Target**: n/a (job will be removed)
- **Data**: n/a
- **Changes**: None - remove the job and wipe the data instead of migrating it
- **Notes**: Blog platform (alo.land). Decommission rather than migrate.

#### gitea
- **File**: `services/gitea.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/gitea/data`, `/data/compute/appdata/gitea/config`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/gitea/*` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/gitea/*` → `/data/services/gitea/*`
- **Notes**: Git server. Contains code repositories. Important.

#### wiki (tiddlywiki)
- **File**: `services/wiki.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/wiki` via host volume mount
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/wiki` (NFS)
- **Changes**:
  - ✏️ Volume mount path in `volume_mount` blocks
  - ⚠️ Uses `exec` driver with host volumes - verify NFS mount works with this
- **Notes**: Multiple tiddlywiki instances. Personal wikis. Can tolerate downtime.

#### code-server
- **File**: `services/code-server.hcl`
- **Priority**: LOW
- **Current**: Uses `/data/compute/appdata/code`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/code` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/code` → `/data/services/code`
- **Notes**: Web IDE. Low priority, for development only.
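The wordpress entry above is the one migration with an extra data hop; a sketch of the full cutover sequence, assuming the job file has already been updated to the new path:

```bash
# Stop the job so nothing writes to the syncthing copy during the final sync
nomad job stop wordpress

# Final copy from the syncthing-managed directory to zippy's primary storage
rsync -av --progress /data/sync/wordpress/ zippy:/persist/services/wordpress/

# Redeploy with the volume now pointing at /data/services/wordpress
nomad job run services/wordpress.hcl

# Once verified, remove the wordpress folder from syncthing's config on the
# nodes that currently sync it
```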

#### beancount (fava)
- **File**: `services/beancount.hcl`
- **Priority**: MEDIUM
- **Current**: Uses `/data/compute/appdata/beancount`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/beancount` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/beancount` → `/data/services/beancount`
- **Notes**: Finance tracking.

#### adminer
- **File**: `services/adminer.hcl`
- **Priority**: LOW
- **Current**: Stateless
- **Target**: Float on c1/c2/c3
- **Data**: None
- **Changes**: None needed
- **Notes**: Database admin UI. Only needed for maintenance.

#### plausible
- **File**: `services/plausible.hcl`
- **Priority**: HIGH
- **Current**: Stateless (uses postgres and clickhouse)
- **Target**: Float on c1/c2/c3
- **Data**: None (uses postgres.service.consul, clickhouse.service.consul)
- **Changes**: None needed
- **Notes**: Website analytics.

#### evcc
- **File**: `services/evcc.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/evcc/evcc.yaml`, `/data/compute/appdata/evcc/evcc`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/evcc/*` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/evcc/*` → `/data/services/evcc/*`
- **Notes**: EV charging controller. Important for daily use.

#### vikunja
- **File**: `services/vikunja.hcl` (assumed to exist based on README)
- **Priority**: DEPRECATED - no longer used, should be deleted
- **Current**: Likely uses `/data/compute/appdata/vikunja`
- **Target**: n/a (job will be removed)
- **Data**: n/a
- **Changes**: None - delete the job and its data instead of migrating
- **Notes**: Task management. Decommission rather than migrate.

#### leantime
- **File**: `services/leantime.hcl`
- **Priority**: DEPRECATED - no longer used, should be deleted
- **Current**: Likely uses `/data/compute/appdata/leantime`
- **Target**: n/a (job will be removed)
- **Data**: n/a
- **Changes**: None - delete the job and its data instead of migrating
- **Notes**: Project management. Decommission rather than migrate.

### Network Infrastructure

#### unifi
- **File**: `services/unifi.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/unifi/data`, `/data/compute/appdata/unifi/mongodb`
- **Target**: Float on c1/c2/c3/fractal/zippy
- **Data**: `/data/services/unifi/*` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/unifi/*` → `/data/services/unifi/*`
- **Notes**: UniFi network controller. Critical for network management. Has keepalived VIP for stable inform address. Floating is fine.

### Media Stack

#### media (radarr, sonarr, bazarr, plex, qbittorrent)
- **File**: `services/media.hcl`
- **Priority**: MEDIUM
- **Current**: Uses `/data/compute/appdata/radarr`, `/data/compute/appdata/sonarr`, etc. and `/data/media`
- **Target**: **MUST run on fractal** (local /data/media access)
- **Data**:
  - `/data/services/radarr` etc. (NFS) - config data
  - `/data/media` (local disk on fractal; other nodes only see it via CIFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/*`
  - ✏️ **Add constraint**:
    ```hcl
    constraint {
      attribute = "${node.unique.name}"
      value = "fractal"
    }
    ```
- **Notes**: Heavy I/O to /data/media. Must run on fractal for performance. Has keepalived VIP.
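After redeploying, a quick way to confirm the placement rules above actually took effect (job names as in this catalog; the Node IDs in the output map to hostnames via `nomad node status`):

```bash
# media/prometheus/loki/clickhouse should land on fractal; the databases
# should normally land on zippy
for job in media prometheus loki clickhouse mysql postgres redis; do
  echo "== $job"
  nomad job status "$job" | awk '/^Allocations/,0'
done

# Map the Node IDs printed above to hostnames
nomad node status
```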

### Utility Services

#### weewx
- **File**: `services/weewx.hcl`
- **Priority**: HIGH
- **Current**: Likely uses `/data/compute/appdata/weewx`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/weewx` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/weewx`
- **Notes**: Weather station.

#### maps
- **File**: `services/maps.hcl`
- **Priority**: MEDIUM
- **Current**: Likely uses `/data/compute/appdata/maps`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/maps` (NFS) plus `/data/shared` (CIFS from fractal)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/maps`; ensure `/data/shared` is mounted (per the Q&A below, maps needs /data/shared, not /data/media)
- **Notes**: Map tiles.

#### netbox
- **File**: `services/netbox.hcl`
- **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/netbox`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/netbox` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/netbox`
- **Notes**: IPAM/DCIM. Low priority, for documentation.

#### farmos
- **File**: `services/farmos.hcl`
- **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/farmos`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/farmos` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/farmos`
- **Notes**: Farm management. Low priority.

#### urbit
- **File**: `services/urbit.hcl`
- **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/urbit`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/urbit` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/urbit`
- **Notes**: Urbit node. Experimental, low priority.

#### webodm
- **File**: `services/webodm.hcl`
- **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/webodm`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/webodm` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/webodm`
- **Notes**: Drone imagery processing. Low priority. Does not need /data/media (confirmed in the Q&A below).

#### velutrack
- **File**: `services/velutrack.hcl`
- **Priority**: LOW
- **Current**: Likely minimal state
- **Target**: Float on c1/c2/c3
- **Data**: Minimal
- **Changes**: Verify if any volume paths need updating
- **Notes**: Vehicle tracking. Low priority.

#### resol-gateway
- **File**: `services/resol-gateway.hcl`
- **Priority**: HIGH
- **Current**: Likely minimal state
- **Target**: Float on c1/c2/c3
- **Data**: Minimal
- **Changes**: Verify if any volume paths need updating
- **Notes**: Solar thermal controller.

#### igsync
- **File**: `services/igsync.hcl`
- **Priority**: MEDIUM
- **Current**: Likely uses `/data/compute/appdata/igsync`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/igsync` (NFS); does not need /data/media (confirmed in the Q&A below)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/igsync`
- **Notes**: Instagram sync. Low priority.

#### jupyter
- **File**: `services/jupyter.hcl`
- **Priority**: LOW
- **Current**: Stateless or minimal state
- **Target**: Float on c1/c2/c3
- **Data**: Minimal
- **Changes**: Verify if any volume paths need updating
- **Notes**: Notebook server. Low priority, for experimentation.
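Several entries above and below are hedged with "likely uses ..."; rather than guessing, the actual paths can be pulled straight from the job specs:

```bash
# List every old-path reference in the job files to confirm which jobs still
# point at GlusterFS/syncthing paths (run from the repo root)
grep -n -E '/data/(compute|sync|media|shared)' services/*.hcl | sort
```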

#### whoami
- **File**: `services/whoami.hcl`
- **Priority**: LOW
- **Current**: Stateless
- **Target**: Float on c1/c2/c3
- **Data**: None
- **Changes**: None needed
- **Notes**: Test service. Can be stopped during migration.

#### tiddlywiki (if separate from wiki.hcl)
- **File**: `services/tiddlywiki.hcl`
- **Priority**: MEDIUM
- **Current**: Likely same as wiki.hcl
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/tiddlywiki` (NFS)
- **Changes**: Same as wiki.hcl
- **Notes**: May be a duplicate of wiki.hcl.

### Backup Jobs

#### mysql-backup
- **File**: `services/mysql-backup.hcl`
- **Priority**: HIGH
- **Current**: Likely writes to `/data/compute` or `/data/shared`
- **Target**: Float on c1/c2/c3
- **Data**: Should write to `/data/shared` (backed up to fractal)
- **Changes**:
  - ✏️ Verify backup destination, should be `/data/shared/backups/mysql`
- **Notes**: Important for disaster recovery. Should run regularly.

#### postgres-backup
- **File**: `services/postgres-backup.hcl`
- **Priority**: HIGH
- **Current**: Likely writes to `/data/compute` or `/data/shared`
- **Target**: Float on c1/c2/c3
- **Data**: Should write to `/data/shared` (backed up to fractal)
- **Changes**:
  - ✏️ Verify backup destination, should be `/data/shared/backups/postgres`
- **Notes**: Important for disaster recovery. Should run regularly.

#### wordpress-backup
- **File**: `services/wordpress-backup.hcl`
- **Priority**: MEDIUM
- **Current**: Likely writes to `/data/compute` or `/data/shared`
- **Target**: Float on c1/c2/c3
- **Data**: Should write to `/data/shared` (backed up to fractal)
- **Changes**:
  - ✏️ Verify backup destination
- **Notes**: Periodic backup job.

---

## Failover Procedures

### NFS Server Failover (zippy → c1 or c2)

**When to use:** zippy is down and not coming back soon

**Prerequisites:**
- c1 and c2 have been receiving btrfs snapshots from zippy
- Last successful replication is recent - with the 5-minute schedule it should be under 10 minutes old (verify timestamps)

**Procedure:**

1. **Choose standby node** (c1 or c2)
   ```bash
   # Check replication freshness (snapshot names embed their timestamp,
   # so the name sort is the freshness order)
   ssh c1 "ls /persist/services-standby/ | sort | tail -3"
   ssh c2 "ls /persist/services-standby/ | sort | tail -3"

   # Choose the one with the most recent snapshot
   # For this example, we'll use c1
   ```

2. **On standby node (c1), promote standby to primary**
   ```bash
   ssh c1

   # Stop NFS client mount (if running)
   sudo systemctl stop data-services.mount

   # Find latest received snapshot
   LATEST=$(ls -d /persist/services-standby/services@* | sort | tail -1)

   # Create writable subvolume from the read-only snapshot
   sudo btrfs subvolume snapshot $LATEST /persist/services

   # Verify
   ls -la /persist/services
   ```

3. **Deploy c1-nfs-server configuration**
   ```bash
   # From your workstation
   deploy -s '.#c1-nfs-server'

   # This activates:
   # - NFS server on c1
   # - Consul service registration for "services"
   # - Firewall rules
   ```

4. **On c1, verify NFS is running**
   ```bash
   ssh c1
   sudo systemctl status nfs-server
   showmount -e localhost
   dig @localhost -p 8600 services.service.consul # Should show c1's IP
   ```

5. **On other nodes, remount NFS**
   ```bash
   # Nodes should auto-remount via Consul DNS, but you can force it
   # (zippy is down and rejoins the mount when it comes back):
   for host in c2 c3 fractal; do
     ssh $host "sudo systemctl restart data-services.mount"
   done
   ```

6. **Verify Nomad jobs are healthy**
   ```bash
   nomad job status mysql
   nomad job status postgres
   # Check all critical services
   ```

7. 
**Update monitoring/alerts**
   - Note in documentation that c1 is now primary NFS server
   - Set up alert to remember to fail back to zippy when it's repaired

**Recovery Time Objective (RTO):** ~10-15 minutes

**Recovery Point Objective (RPO):** Last snapshot interval (**5 minutes** max)

### Failing Back to zippy

**When to use:** zippy is repaired and ready to resume primary role

**Procedure:**

1. **Sync data from c1 back to zippy**
   ```bash
   # On c1 (current primary)
   sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)
   FAILBACK=$(ls -d /persist/services@failback-* | sort | tail -1)

   # Move zippy's stale pre-failure subvolume out of the way first
   ssh zippy "sudo mv /persist/services /persist/services-stale"

   sudo btrfs send $FAILBACK | ssh zippy "sudo btrfs receive /persist/"

   # On zippy, make it writable
   ssh zippy "sudo btrfs subvolume snapshot /persist/$(basename $FAILBACK) /persist/services"

   # The incremental chain no longer matches after a failover; delete the old
   # snapshots on both ends so replication restarts with a clean full send
   ssh zippy 'sudo btrfs subvolume delete /persist/services@* || true'
   ssh c1 'sudo btrfs subvolume delete /persist/services-standby/services@* || true'
   ssh c2 'sudo btrfs subvolume delete /persist/services-standby/services@* || true'

   # Once everything is verified, the stale copy can be deleted:
   # ssh zippy "sudo btrfs subvolume delete /persist/services-stale"
   ```

2. **Deploy zippy back to NFS server role**
   ```bash
   deploy -s '.#zippy'
   # Consul will register services.service.consul → zippy again
   # The first replication run after failback will be a full send
   ```

3. **Demote c1 back to standby**
   ```bash
   deploy -s '.#c1'
   # This removes NFS server, restores NFS client mount
   ```

4. **Verify all nodes are mounting from zippy**
   ```bash
   dig @c1 -p 8600 services.service.consul # Should show zippy's IP

   for host in c1 c2 c3 fractal; do
     ssh $host "df -h | grep services"
   done
   ```

### Database Job Failover (automatic via Nomad)

**When to use:** zippy is down, database jobs need to run elsewhere

**What happens automatically:**
1. Nomad detects zippy is unhealthy
2. Jobs with constraint `zippy|c1|c2` are rescheduled to c1 or c2
3. Jobs start on new node, accessing `/data/services` (now via NFS from the promoted standby)

**Manual intervention needed:**
- None if NFS failover completed successfully
- If jobs are stuck: `nomad job stop mysql && nomad job run services/mysql.hcl`

**What to check:**
```bash
nomad job status mysql
nomad job status postgres
nomad job status redis

# The Allocations table in each status output shows where each instance
# landed - verify c1 or c2, not zippy (map Node IDs via `nomad node status`)
```

### Complete Cluster Failure (lose quorum)

**Scenario:** 3 or more servers go down, quorum lost

**Prevention:** This is why we have 5 servers (need 3 for quorum)

**Recovery:**
1. **Bring up at least 3 servers** (any 3 of c1, c2, c3, fractal, zippy)
2. **If that's not possible, recover the raft state manually:**
   ```bash
   # On one surviving server, remove the dead peers
   consul force-leave <dead-node-name>
   nomad operator raft list-peers
   nomad operator raft remove-peer -peer-address=<dead-peer-address>

   # Worst case, follow the documented peers.json outage-recovery
   # procedure for Consul (and its Nomad equivalent) to force a new quorum
   ```
3. **Restore from backups** (worst case)

---

## Post-Migration Verification Checklist

- [ ] All 5 servers in quorum: `consul members` shows c1, c2, c3, fractal, zippy
- [ ] NFS mounts working: `df -h | grep services` on all nodes
- [ ] Btrfs replication running: Check systemd timers on zippy
- [ ] Critical services up: mysql, postgres, redis, traefik, authentik
- [ ] Monitoring working: Prometheus, Grafana, Loki accessible
- [ ] Media stack on fractal: `nomad job status media` shows its allocations on fractal
- [ ] Database jobs on zippy: `nomad job status mysql` (etc.) shows allocations on zippy
- [ ] Consul DNS working: `dig @localhost -p 8600 services.service.consul`
- [ ] Backups running: Kopia snapshots include `/persist/services`
- [ ] GlusterFS removed: No glusterfs processes, volumes deleted
- [ ] Documentation updated: README.md, architecture diagrams

---

## Rollback Plan

**If migration fails catastrophically:**

1. **Stop all new Nomad jobs**
   ```bash
   # Stop (and purge) each migrated job
   nomad job status | awk 'NR>1 {print $1}' | xargs -I {} nomad job stop -purge {}
   ```

2. 
**Restore GlusterFS mounts**
   ```bash
   # On all nodes, re-enable GlusterFS client
   deploy # With old configs
   ```

3. **Restart old Nomad jobs**
   ```bash
   # With old paths pointing to /data/compute (old versions from git)
   for f in services/*.hcl; do nomad run "$f"; done
   ```

4. **Restore data if needed**
   ```bash
   rsync -av /backup/compute-pre-migration/ /data/compute/
   ```

**Important:** Keep GlusterFS running until Phase 4 is complete and verified!

---

## Questions Answered

1. ✅ **Where is `/data/sync/wordpress` mounted from?**
   - **Answer**: Syncthing-managed to avoid slow GlusterFS
   - **Action**: Migrate to `/data/services/wordpress`, remove syncthing config

2. ✅ **Which services use `/data/media` directly?**
   - **Answer**: Only media.hcl (radarr, sonarr, plex, qbittorrent)
   - **Action**: Constrain media.hcl to fractal; everything else uses the CIFS mount

3. ✅ **Do we want unifi on fractal or floating?**
   - **Answer**: Floating is fine
   - **Action**: No constraint needed

4. ✅ **What's the plan for sunny's existing data?**
   - **Answer**: Ethereum data stays local, not replicated (too expensive)
   - **Action**: Back up and restore it during the NixOS conversion - do not resync from the network, which would take weeks (see Phase 5)

## Questions Still to Answer

1. **Backup retention for btrfs snapshots?**
   - Current plan: Keep 24 hours of snapshots on zippy
   - Is this enough? Or do we want more for safety?
   - This should be fine -- snapshots are just for hot recovery. More/older backups are kept via kopia on fractal.

2. **c1-nfs-server vs c1 config - same host, different configs?**
   - Recommendation: Use same hostname, different flake output
   - `c1` = normal config with NFS client
   - `c1-nfs-server` = variant with NFS server enabled
   - Both in flake.nix, deploy appropriate one based on role
   - Answer: recommendation makes sense (see the sketch at the end of this document).

3. **Should we verify webodm, igsync, maps don't need /data/media access?**
   - Neither webodm nor igsync needs /data/media
   - maps needs /data/shared

---

## Timeline Estimate

**Total duration: 15-22 hours** (can be split across multiple sessions), plus 6-10 hours for the optional sunny conversion

- Phase 0 (Prep): 1-2 hours
- Phase 1 (fractal): 6-8 hours
- Phase 2 (zippy storage): 2-3 hours
- Phase 3 (GlusterFS → NFS): 3-4 hours
- Phase 4 (Nomad jobs): 2-4 hours
- Phase 5 (sunny): 6-10 hours (optional, can be done later)
- Phase 6 (Cleanup): 1 hour

**Suggested schedule:**
- **Day 1**: Phases 0-1 (fractal conversion, establish quorum)
- **Day 2**: Phases 2-3 (zippy storage, data migration)
- **Day 3**: Phase 4 (Nomad job updates and deployment)
- **Day 4**: Phases 5-6 (sunny + cleanup) or take a break and do later

**Maintenance windows needed:**
- Phase 3: ~1 hour downtime (all services stopped during data migration)
- Phase 4: Rolling (services come back up as redeployed)
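
---

For question 2 in "Questions Still to Answer", a minimal sketch of the two-outputs-per-host idea; the `nfs-server-promoted.nix` module name and the bare `nixosSystem` calls are illustrative, not the repo's actual flake helpers:

```nix
# flake.nix (excerpt)
nixosConfigurations = {
  # Normal role: NFS client mounting services.service.consul
  c1 = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./hosts/c1 ];
  };

  # Failover role: same host config plus the promoted NFS server module
  # (exports /persist/services and registers the Consul "services" service).
  # Deploy with: deploy -s '.#c1-nfs-server'
  "c1-nfs-server" = nixpkgs.lib.nixosSystem {
    system = "x86_64-linux";
    modules = [ ./hosts/c1 ./common/nfs-server-promoted.nix ];
  };
};
```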