# Cluster Architecture Revamp **Status**: Planning complete, ready for review and refinement ## Key Decisions ✅ **Replication**: 5-minute intervals (incremental btrfs send) ✅ **WordPress**: Currently syncthing → will use `/data/services` via NFS ✅ **Media**: Only media.hcl needs `/data/media`, constrained to fractal ✅ **Unifi**: Floating (no constraint needed) ✅ **Sunny**: Standalone, ethereum data stays local (not replicated) ✅ **Quorum**: 5 servers (c1, c2, c3, fractal, zippy) ✅ **NFS Failover**: Via Consul DNS (`services.service.consul`) ## Table of Contents 1. [End State Architecture](#end-state-architecture) 2. [Migration Steps](#migration-steps) 3. [Service Catalog](#service-catalog) 4. [Failover Procedures](#failover-procedures) --- ## End State Architecture ### Cluster Topology **5-Server Quorum (Consul + Nomad server+client):** - **c1, c2, c3**: Cattle nodes - x86_64, run most stateless workloads - **fractal**: Storage node - x86_64, 6x spinning drives, runs media workloads - **zippy**: Stateful anchor - x86_64, runs database workloads (via affinity), primary NFS server **Standalone Nodes (not in quorum):** - **sunny**: x86_64, ethereum node + staking, base NixOS configs only - **chilly**: x86_64, Home Assistant VM, base NixOS configs only **Quorum Math:** - 5 servers → quorum requires 3 healthy nodes - Can tolerate 2 simultaneous failures - Bootstrap expect: 3 ### Storage Architecture **Primary Storage (zippy):** - `/persist/services` - btrfs subvolume - Contains: mysql, postgres, redis, clickhouse, mongodb, app data - Exported via NFS to: `services.service.consul:/persist/services` - Replicated via **btrfs send** to c1 and c2 every **5 minutes** (incremental) **Standby Storage (c1, c2):** - `/persist/services-standby` - btrfs subvolume - Receives replicated snapshots from zippy via incremental btrfs send - Can be promoted to `/persist/services` and exported as NFS during failover - Maximum data loss: **5 minutes** (last replication interval) **Standalone Storage (sunny):** - `/persist/ethereum` - local btrfs subvolume (or similar) - Contains: ethereum blockchain data, staking keys - **NOT replicated** - too large/expensive to replicate full ethereum node - Backed up via kopia to fractal (if feasible/needed) **Media Storage (fractal):** - `/data/media` - existing spinning drive storage - Exported via Samba (existing) - Mounted on c1, c2, c3 via CIFS (existing) - Local access on fractal for media workloads **Shared Storage (fractal):** - `/data/shared` - existing spinning drive storage - Exported via Samba (existing) - Mounted on c1, c2, c3 via CIFS (existing) ### Network Services **NFS Primary (zippy):** ```nix services.nfs.server = { enable = true; exports = '' /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash) ''; }; services.consul.extraConfig.services = [{ name = "services"; port = 2049; checks = [{ tcp = "localhost:2049"; interval = "30s"; }]; }]; ``` **NFS Client (all nodes):** ```nix fileSystems."/data/services" = { device = "services.service.consul:/persist/services"; fsType = "nfs"; options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ]; }; ``` **Samba Exports (fractal - existing):** - `//fractal/media` → `/data/media` - `//fractal/shared` → `/data/shared` ### Nomad Job Placement Strategy **Affinity-based (prefer zippy, allow c1/c2):** - mysql, postgres, redis - stateful databases - Run on zippy normally, can failover to c1/c2 if zippy down **Constrained (must run on fractal):** - **media.hcl** - radarr, sonarr, bazarr, plex, 
qbittorrent
  - Reason: Heavy /data/media access, benefits from local storage
- **prometheus.hcl** - metrics database with 30d retention
  - Reason: Large time-series data, spinning disks OK, saves SSD space
- **loki.hcl** - log aggregation with 31d retention
  - Reason: Large log data, spinning disks OK
- **clickhouse.hcl** - analytics database for plausible
  - Reason: Large time-series data, spinning disks OK

**Floating (can run anywhere on c1/c2/c3/fractal/zippy):**
- All other services including:
  - traefik, authentik, web apps
  - **grafana** (small data, just dashboards/config, queries prometheus for metrics)
  - databases (mysql, postgres, redis) - placed via the zippy affinity above, so they prefer zippy but may land on c1/c2
  - vector (system job, runs everywhere)
- Nomad schedules based on resources and constraints

### Data Migration

**Path changes needed in Nomad jobs:**
- `/data/compute/appdata/*` → `/data/services/*`
- `/data/compute/config/*` → `/data/services/*`
- `/data/sync/wordpress` → `/data/services/wordpress`

**No changes needed:**
- `/data/media/*` - stays the same (CIFS mount from fractal, used only by media services)
- `/data/shared/*` - stays the same (CIFS mount from fractal)

**Deprecated after migration:**
- `/data/sync/wordpress` - currently managed by syncthing to avoid slow GlusterFS
  - Will be replaced by NFS mount at `/data/services/wordpress`
  - Syncthing configuration for this can be removed
  - Final sync: copy from syncthing to `/persist/services/wordpress` on zippy before cutover

---

## Migration Steps

**Important path simplification note:**
- All service paths use `/data/services/*` directly (not `/data/services/appdata/*`)
- Example: `/data/compute/appdata/mysql` → `/data/services/mysql`
- Simpler, cleaner, easier to manage

### Phase 0: Preparation

**Duration: 1-2 hours**

1. **Backup everything**
   ```bash
   # On all nodes, ensure kopia backups are current
   kopia snapshot list

   # Backup glusterfs data manually
   rsync -av /data/compute/ /backup/compute-pre-migration/
   ```

2. **Document current state**
   ```bash
   # Save current nomad job list
   nomad job status -json > /backup/nomad-jobs-pre-migration.json

   # Save consul service catalog
   consul catalog services > /backup/consul-services-pre-migration.txt
   ```

3. **Review this document**
   - Verify all services are cataloged
   - Confirm priority assignments
   - Adjust as needed

### Phase 1: Convert fractal to NixOS

**Duration: 6-8 hours**

**Current state:**
- Proxmox on ZFS
- System pool: `rpool` (~500GB, will be wiped)
- Data pools (preserved):
  - `double1` - 3.6T (homes, shared)
  - `double2` - 7.2T (backup - kopia repo, PBS)
  - `double3` - 17T (media, torrent)
- Services: Samba (homes, shared, media), Kopia server, PBS
- Bind mounts: `/data/{homes,shared,media,torrent}` → ZFS datasets

**Goal:** Fresh NixOS on rpool, preserve data pools, join cluster

#### Step-by-step procedure:

**1.
Pre-migration documentation** ```bash # On fractal, save ZFS layout cat > /tmp/detect-zfs.sh << 'EOF' #!/bin/bash echo "=== ZFS Pools ===" zpool status echo -e "\n=== ZFS Datasets ===" zfs list -o name,mountpoint,used,avail,mounted -r double1 double2 double3 echo -e "\n=== Bind mounts ===" cat /etc/fstab | grep double echo -e "\n=== Data directories ===" ls -la /data/ echo -e "\n=== Samba users/groups ===" getent group shared compute getent passwd compute EOF chmod +x /tmp/detect-zfs.sh ssh fractal /tmp/detect-zfs.sh > /backup/fractal-zfs-layout.txt # Save samba config scp fractal:/etc/samba/smb.conf /backup/fractal-smb.conf # Save kopia certs and config scp -r fractal:~/kopia-certs /backup/fractal-kopia-certs/ scp fractal:~/.config/kopia/repository.config /backup/fractal-kopia-repository.config # Verify kopia backups are current ssh fractal "kopia snapshot list --all" ``` **2. Stop services on fractal** ```bash ssh fractal "systemctl stop smbd nmbd kopia" # Don't stop PBS yet (in case we need to restore) ``` **3. Install NixOS** - Boot NixOS installer USB - **IMPORTANT**: Do NOT touch double1, double2, double3 during install! - Install only on `rpool` (or create new pool if needed) ```bash # In NixOS installer # Option A: Reuse rpool (wipe and recreate) zpool destroy rpool # Option B: Use different disk if available # Then follow standard NixOS btrfs install on that disk ``` - Use standard encrypted btrfs layout (matching other hosts) - Minimal install first, will add cluster configs later **4. First boot - import ZFS pools** ```bash # SSH into fresh NixOS install # Import pools (read-only first, to be safe) zpool import -f -o readonly=on double1 zpool import -f -o readonly=on double2 zpool import -f -o readonly=on double3 # Verify datasets zfs list -r double1 double2 double3 # Example output should show: # double1/homes # double1/shared # double2/backup # double3/media # double3/torrent # If everything looks good, export and reimport read-write zpool export double1 double2 double3 zpool import double1 zpool import double2 zpool import double3 # Set ZFS mountpoints (if needed) # These may already be set from Proxmox zfs set mountpoint=/double1 double1 zfs set mountpoint=/double2 double2 zfs set mountpoint=/double3 double3 ``` **5. Create fractal NixOS configuration** ```nix # hosts/fractal/default.nix { config, pkgs, ... 
}: { imports = [ ../../common/encrypted-btrfs-layout.nix ../../common/global ../../common/cluster-node.nix # Consul + Nomad (will add in step 7) ../../common/nomad.nix # Both server and client ./hardware.nix ]; networking.hostName = "fractal"; # ZFS support boot.supportedFilesystems = [ "zfs" ]; boot.zfs.extraPools = [ "double1" "double2" "double3" ]; # Ensure ZFS pools are imported before mounting systemd.services.zfs-import.wantedBy = [ "multi-user.target" ]; # Bind mounts for /data (matching Proxmox setup) fileSystems."/data/homes" = { device = "/double1/homes"; fsType = "none"; options = [ "bind" "x-systemd.requires=zfs-mount.service" ]; }; fileSystems."/data/shared" = { device = "/double1/shared"; fsType = "none"; options = [ "bind" "x-systemd.requires=zfs-mount.service" ]; }; fileSystems."/data/media" = { device = "/double3/media"; fsType = "none"; options = [ "bind" "x-systemd.requires=zfs-mount.service" ]; }; fileSystems."/data/torrent" = { device = "/double3/torrent"; fsType = "none"; options = [ "bind" "x-systemd.requires=zfs-mount.service" ]; }; fileSystems."/backup" = { device = "/double2/backup"; fsType = "none"; options = [ "bind" "x-systemd.requires=zfs-mount.service" ]; }; # Create data directory structure systemd.tmpfiles.rules = [ "d /data 0755 root root -" ]; # Users and groups for samba users.groups.shared = { gid = 1001; }; users.groups.compute = { gid = 1002; }; users.users.compute = { isSystemUser = true; uid = 1002; group = "compute"; }; # Ensure ppetru is in shared group users.users.ppetru.extraGroups = [ "shared" ]; # Samba server services.samba = { enable = true; openFirewall = true; extraConfig = '' workgroup = WORKGROUP server string = fractal netbios name = fractal security = user map to guest = bad user ''; shares = { homes = { comment = "Home Directories"; browseable = "no"; path = "/data/homes/%S"; "read only" = "no"; }; shared = { path = "/data/shared"; "read only" = "no"; browseable = "yes"; "guest ok" = "no"; "create mask" = "0775"; "directory mask" = "0775"; "force group" = "+shared"; }; media = { path = "/data/media"; "read only" = "no"; browseable = "yes"; "guest ok" = "no"; "create mask" = "0755"; "directory mask" = "0755"; }; }; }; # Kopia backup server systemd.services.kopia-server = { description = "Kopia Backup Server"; wantedBy = [ "multi-user.target" ]; after = [ "network.target" "zfs-mount.service" ]; serviceConfig = { User = "ppetru"; Group = "users"; ExecStart = '' ${pkgs.kopia}/bin/kopia server start \ --address 0.0.0.0:51515 \ --tls-cert-file /persist/kopia-certs/kopia.cert \ --tls-key-file /persist/kopia-certs/kopia.key ''; Restart = "on-failure"; }; }; # Kopia nightly snapshot (from cron) systemd.services.kopia-snapshot = { description = "Kopia snapshot of homes and shared"; serviceConfig = { Type = "oneshot"; User = "ppetru"; Group = "users"; ExecStart = '' ${pkgs.kopia}/bin/kopia --config-file=/home/ppetru/.config/kopia/repository.config \ snapshot create /data/homes /data/shared \ --log-level=warning --no-progress ''; }; }; systemd.timers.kopia-snapshot = { wantedBy = [ "timers.target" ]; timerConfig = { OnCalendar = "22:47"; Persistent = true; }; }; # Keep kopia config and certs persistent environment.persistence."/persist" = { directories = [ "/home/ppetru/.config/kopia" "/home/ppetru/kopia-certs" ]; }; networking.firewall.allowedTCPPorts = [ 139 445 # Samba 51515 # Kopia ]; networking.firewall.allowedUDPPorts = [ 137 138 # Samba ]; } ``` **6. 
Deploy initial config (without cluster)**

```bash
# First, deploy without cluster-node.nix to verify storage works
# Comment out cluster-node import temporarily
deploy -s '.#fractal'

# Verify mounts
ssh fractal "df -h | grep data"
ssh fractal "ls -la /data/"

# Test samba
smbclient -L fractal -U ppetru

# Test kopia
ssh fractal "systemctl status kopia-server"
```

**7. Join cluster (add to quorum)**

```bash
# Uncomment cluster-node.nix import in fractal config
# Update all cluster configs for 5-server quorum (see step 8 below)
deploy   # Deploy to all nodes

# Verify quorum
consul members
nomad server members
```

**8. Update cluster configs for 5-server quorum**

```nix
# common/consul.nix
servers = ["c1" "c2" "c3" "fractal" "zippy"];
bootstrap_expect = 3;

# common/nomad.nix
servers = ["c1" "c2" "c3" "fractal" "zippy"];
bootstrap_expect = 3;
```

**9. Verify fractal is fully operational**

```bash
# Check all services
ssh fractal "systemctl status samba kopia-server kopia-snapshot.timer"

# Verify ZFS pools
ssh fractal "zpool status"
ssh fractal "zfs list"

# Test accessing shares from another node
ssh c1 "ls /data/media /data/shared"

# Verify kopia clients can still connect
kopia repository status --server=https://fractal:51515

# Check nomad can see fractal
nomad node status | grep fractal

# Verify quorum
consul members        # Should see c1, c2, c3, fractal (zippy joins in Phase 2)
nomad server members  # Should see 4 servers
```

### Phase 2: Setup zippy storage layer

**Duration: 2-3 hours**

**Goal:** Prepare zippy for NFS server role, setup replication

1. **Create btrfs subvolume on zippy**
   ```bash
   ssh zippy
   sudo btrfs subvolume create /persist/services
   sudo chown ppetru:users /persist/services
   ```

2. **Update zippy configuration**
   ```nix
   # hosts/zippy/default.nix
   imports = [
     ../../common/encrypted-btrfs-layout.nix
     ../../common/global
     ../../common/cluster-node.nix  # Adds to quorum
     ../../common/nomad.nix
     ./hardware.nix
   ];

   # NFS server
   services.nfs.server = {
     enable = true;
     exports = ''
       /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
     '';
   };

   # Consul service registration for NFS
   services.consul.extraConfig.services = [{
     name = "services";
     port = 2049;
     checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
   }];

   # Btrfs replication to standbys (incremental after first full send)
   systemd.services.replicate-to-c1 = {
     description = "Replicate /persist/services to c1";
     script = ''
       ${pkgs.btrfs-progs}/bin/btrfs subvolume snapshot -r /persist/services /persist/services@$(date +%Y%m%d-%H%M%S)
       LATEST=$(ls -t /persist/services@* | head -1)
       # Get previous snapshot for incremental send
       PREV=$(ls -t /persist/services@* | head -2 | tail -1)

       # First run: full send. Subsequent: incremental with -p (parent).
       # Snapshots are received into the standby subvolume so the failover
       # procedure can find them under /persist/services-standby/.
       if [ "$LATEST" != "$PREV" ]; then
         ${pkgs.btrfs-progs}/bin/btrfs send -p $PREV $LATEST | ${pkgs.openssh}/bin/ssh c1 "${pkgs.btrfs-progs}/bin/btrfs receive /persist/services-standby/"
       else
         # First snapshot, full send
         ${pkgs.btrfs-progs}/bin/btrfs send $LATEST | ${pkgs.openssh}/bin/ssh c1 "${pkgs.btrfs-progs}/bin/btrfs receive /persist/services-standby/"
       fi

       # Cleanup old snapshots (keep last 24 hours on sender)
       find /persist/services@* -mtime +1 -exec ${pkgs.btrfs-progs}/bin/btrfs subvolume delete {} \;
     '';
   };
   systemd.timers.replicate-to-c1 = {
     wantedBy = [ "timers.target" ];
     timerConfig = {
       OnCalendar = "*:0/5";  # Every 5 minutes (incremental after first full send)
       Persistent = true;
     };
   };

   # Same for c2
   systemd.services.replicate-to-c2 = { ... };
   systemd.timers.replicate-to-c2 = { ... };
   ```
3. **Setup standby storage on c1 and c2**
   ```bash
   # On c1 and c2
   ssh c1 sudo btrfs subvolume create /persist/services-standby
   ssh c2 sudo btrfs subvolume create /persist/services-standby
   ```

4. **Deploy and verify**
   ```bash
   deploy -s '.#zippy'

   # Verify NFS export
   showmount -e zippy

   # Verify Consul registration
   dig @localhost -p 8600 services.service.consul
   ```

5. **Verify quorum is now 5 servers**
   ```bash
   consul members  # Should show c1, c2, c3, fractal, zippy
   nomad server members
   ```

### Phase 3: Migrate from GlusterFS to NFS

**Duration: 3-4 hours**

**Goal:** Move all data, update mounts, remove GlusterFS

1. **Copy data from GlusterFS to zippy** (initial copy, jobs still running)
   ```bash
   # On any node with /data/compute mounted.
   # Copy the contents of appdata/ and config/ directly into /persist/services
   # so each service lands at the flattened path (e.g. /persist/services/mysql),
   # matching the path simplification note above.
   rsync -av --progress /data/compute/appdata/ zippy:/persist/services/
   rsync -av --progress /data/compute/config/ zippy:/persist/services/

   # Verify
   ssh zippy du -sh /persist/services
   ```

2. **Stop all Nomad jobs and do a final delta sync**
   ```bash
   # Get list of running jobs
   nomad job status | grep running | awk '{print $1}' > /tmp/running-jobs.txt

   # Stop all (they'll be restarted with updated paths in Phase 4)
   cat /tmp/running-jobs.txt | xargs -I {} nomad job stop {}

   # Re-run the step 1 rsync now that writes have stopped, to pick up anything
   # written since the initial copy
   rsync -av --progress /data/compute/appdata/ zippy:/persist/services/
   rsync -av --progress /data/compute/config/ zippy:/persist/services/
   ```

3. **Update all nodes to mount NFS**
   ```nix
   # Update common/glusterfs-client.nix → common/nfs-client.nix
   # OR update common/cluster-node.nix to import nfs-client instead

   fileSystems."/data/services" = {
     device = "services.service.consul:/persist/services";
     fsType = "nfs";
     options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
   };

   # Remove old GlusterFS mount
   # fileSystems."/data/compute" = ...  # DELETE
   ```

4. **Deploy updated configs**
   ```bash
   deploy -s '.#c1' '.#c2' '.#c3' '.#fractal' '.#zippy'
   ```

5. **Verify NFS mounts**
   ```bash
   for host in c1 c2 c3 fractal zippy; do
     ssh $host "df -h | grep services"
   done
   ```

6. **Remove GlusterFS from cluster**
   ```bash
   # On c1 (or any gluster server)
   gluster volume stop compute
   gluster volume delete compute

   # On all nodes
   for host in c1 c2 c3; do
     ssh $host "sudo systemctl stop glusterd; sudo systemctl disable glusterd"
   done
   ```

7. **Remove GlusterFS from NixOS configs**
   ```nix
   # common/compute-node.nix - remove ./glusterfs.nix import
   # Deploy again
   deploy
   ```

### Phase 4: Update and redeploy Nomad jobs

**Duration: 2-4 hours**

**Goal:** Update all Nomad job paths, add constraints/affinities, redeploy

1. **Update job specs** (see Service Catalog below for details)
   - Change `/data/compute` → `/data/services`
   - Add constraints for media jobs → fractal
   - Add affinities for database jobs → zippy

2. **Deploy critical services first**
   ```bash
   # Core infrastructure
   nomad run services/mysql.hcl
   nomad run services/postgres.hcl
   nomad run services/redis.hcl
   nomad run services/traefik.hcl
   nomad run services/authentik.hcl

   # Verify
   nomad job status mysql
   consul catalog services
   ```

3. **Deploy high-priority services**
   ```bash
   nomad run services/prometheus.hcl
   nomad run services/grafana.hcl
   nomad run services/loki.hcl
   nomad run services/vector.hcl
   nomad run services/unifi.hcl
   nomad run services/gitea.hcl
   ```

4. **Deploy medium-priority services**
   ```bash
   # See service catalog for full list
   nomad run services/wordpress.hcl
   nomad run services/ghost.hcl
   nomad run services/wiki.hcl
   # ... etc
   ```

5. **Deploy low-priority services**
   ```bash
   nomad run services/media.hcl  # Will run on fractal due to constraint
   # ... etc
   ```

6.
**Verify all services healthy** ```bash nomad job status consul catalog services # Check traefik dashboard for health ``` ### Phase 5: Convert sunny to NixOS (Optional, can defer) **Duration: 6-10 hours (split across 2 stages)** **Current state:** - Proxmox with ~1.5TB ethereum node data - 2x LXC containers: besu (execution client), lighthouse (consensus beacon) - 1x VM: Rocketpool smartnode (docker containers for validator, node, MEV-boost, etc.) - Running in "hybrid mode" - managing own execution/consensus, rocketpool manages the rest **Goal:** Get sunny on NixOS quickly, preserve ethereum data, defer "perfect" native setup --- #### Stage 1: Quick NixOS Migration (containers) **Duration: 6-8 hours** **Goal:** NixOS + containerized ethereum stack, minimal disruption **1. Pre-migration backup and documentation** ```bash # Document current setup ssh sunny "pct list" > /backup/sunny-containers.txt ssh sunny "qm list" > /backup/sunny-vms.txt # Find ethereum data locations in LXC containers ssh sunny "pct config BESU_CT_ID" > /backup/sunny-besu-config.txt ssh sunny "pct config LIGHTHOUSE_CT_ID" > /backup/sunny-lighthouse-config.txt # Document rocketpool VM volumes ssh sunny "qm config ROCKETPOOL_VM_ID" > /backup/sunny-rocketpool-config.txt # Estimate ethereum data size ssh sunny "du -sh /path/to/besu/data" ssh sunny "du -sh /path/to/lighthouse/data" # Backup rocketpool config (docker-compose, wallet keys, etc.) # This is in the VM - need to access and backup critical files ``` **2. Extract ethereum data from containers/VM** ```bash # Stop ethereum services to get consistent state # (This will pause validation! Plan for attestation penalties) # Copy besu data out of LXC ssh sunny "pct stop BESU_CT_ID" rsync -av --progress sunny:/var/lib/lxc/BESU_CT_ID/rootfs/path/to/besu/ /backup/sunny-besu-data/ # Copy lighthouse data out of LXC ssh sunny "pct stop LIGHTHOUSE_CT_ID" rsync -av --progress sunny:/var/lib/lxc/LIGHTHOUSE_CT_ID/rootfs/path/to/lighthouse/ /backup/sunny-lighthouse-data/ # Copy rocketpool data out of VM # This includes validator keys, wallet, node config # Access VM and copy out: ~/.rocketpool/data ``` **3. Install NixOS on sunny** - Fresh install with btrfs + impermanence - Create large `/persist/ethereum` for 1.5TB+ data - **DO NOT** try to resync from network (takes weeks!) **4. Restore ethereum data to NixOS** ```bash # After NixOS install, copy data back ssh sunny "mkdir -p /persist/ethereum/{besu,lighthouse,rocketpool}" rsync -av --progress /backup/sunny-besu-data/ sunny:/persist/ethereum/besu/ rsync -av --progress /backup/sunny-lighthouse-data/ sunny:/persist/ethereum/lighthouse/ # Rocketpool data copied later ``` **5. Create sunny NixOS config (container-based)** ```nix # hosts/sunny/default.nix { config, pkgs, ... 
}: { imports = [ ../../common/encrypted-btrfs-layout.nix ../../common/global ./hardware.nix ]; networking.hostName = "sunny"; # NO cluster-node import - standalone for now # Can add to quorum later if desired # Container runtime virtualisation.podman = { enable = true; dockerCompat = true; # Provides 'docker' command defaultNetwork.settings.dns_enabled = true; }; # Besu execution client (container) virtualisation.oci-containers.containers.besu = { image = "hyperledger/besu:latest"; volumes = [ "/persist/ethereum/besu:/var/lib/besu" ]; ports = [ "8545:8545" # HTTP RPC "8546:8546" # WebSocket RPC "30303:30303" # P2P ]; cmd = [ "--data-path=/var/lib/besu" "--rpc-http-enabled=true" "--rpc-http-host=0.0.0.0" "--rpc-ws-enabled=true" "--rpc-ws-host=0.0.0.0" "--engine-rpc-enabled=true" "--engine-host-allowlist=*" "--engine-jwt-secret=/var/lib/besu/jwt.hex" # Add other besu flags as needed ]; autoStart = true; }; # Lighthouse beacon client (container) virtualisation.oci-containers.containers.lighthouse-beacon = { image = "sigp/lighthouse:latest"; volumes = [ "/persist/ethereum/lighthouse:/data" "/persist/ethereum/besu/jwt.hex:/jwt.hex:ro" ]; ports = [ "5052:5052" # HTTP API "9000:9000" # P2P ]; cmd = [ "lighthouse" "beacon" "--datadir=/data" "--http" "--http-address=0.0.0.0" "--execution-endpoint=http://besu:8551" "--execution-jwt=/jwt.hex" # Add other lighthouse flags ]; dependsOn = [ "besu" ]; autoStart = true; }; # Rocketpool stack (podman-compose for multi-container setup) # TODO: This requires converting docker-compose to NixOS config # For now, can run docker-compose via systemd service systemd.services.rocketpool = { description = "Rocketpool Smartnode Stack"; after = [ "podman.service" "lighthouse-beacon.service" ]; wantedBy = [ "multi-user.target" ]; serviceConfig = { Type = "oneshot"; RemainAfterExit = "yes"; WorkingDirectory = "/persist/ethereum/rocketpool"; ExecStart = "${pkgs.docker-compose}/bin/docker-compose up -d"; ExecStop = "${pkgs.docker-compose}/bin/docker-compose down"; }; }; # Ensure ethereum data persists environment.persistence."/persist" = { directories = [ "/persist/ethereum" ]; }; # Firewall for ethereum networking.firewall = { allowedTCPPorts = [ 30303 # Besu P2P 9000 # Lighthouse P2P # Add rocketpool ports ]; allowedUDPPorts = [ 30303 # Besu P2P 9000 # Lighthouse P2P ]; }; } ``` **6. Setup rocketpool docker-compose on NixOS** ```bash # After NixOS is running, restore rocketpool config ssh sunny "mkdir -p /persist/ethereum/rocketpool" # Copy rocketpool data (wallet, keys, config) rsync -av /backup/sunny-rocketpool-data/ sunny:/persist/ethereum/rocketpool/ # Create docker-compose.yml for rocketpool stack # Based on rocketpool hybrid mode docs # This runs: validator, node software, MEV-boost, prometheus, etc. # Connects to your besu + lighthouse containers ``` **7. Deploy and test** ```bash deploy -s '.#sunny' # Verify containers are running ssh sunny "podman ps" # Check besu sync status ssh sunny "curl -X POST -H 'Content-Type: application/json' --data '{\"jsonrpc\":\"2.0\",\"method\":\"eth_syncing\",\"params\":[],\"id\":1}' http://localhost:8545" # Check lighthouse sync status ssh sunny "curl http://localhost:5052/eth/v1/node/syncing" # Monitor rocketpool ssh sunny "cd /persist/ethereum/rocketpool && docker-compose logs -f" ``` **8. Monitor and stabilize** - Ethereum should resume from where it left off (not resync!) 
- Validation will resume once beacon is sync'd - May have missed a few attestations during migration (minor penalty) --- #### Stage 2: Native NixOS Services (Future) **Duration: TBD (do this later when time permits)** **Goal:** Convert to native NixOS services using ethereum-nix **Why defer this:** - Complex (rocketpool not fully packaged for Nix) - Current container setup works fine - Can migrate incrementally (besu → native, then lighthouse, etc.) - No downtime once Stage 1 is stable **When ready:** 1. Research ethereum-nix support for besu + lighthouse + rocketpool 2. Test on separate machine first 3. Migrate one service at a time with minimal downtime 4. Document in separate migration plan **For now:** Stage 1 gets sunny on NixOS with base configs, managed declaratively, just using containers instead of native services. ### Phase 6: Verification and cleanup **Duration: 1 hour** 1. **Test failover procedure** (see Failover Procedures below) 2. **Verify backups are working** ```bash kopia snapshot list # Check that /persist/services is being backed up ``` 3. **Update documentation** - Update README.md - Document new architecture - Update stateful-commands.txt 4. **Clean up old GlusterFS data** ```bash # Only after verifying everything works! for host in c1 c2 c3; do ssh $host "sudo rm -rf /persist/glusterfs" done ``` --- ## Service Catalog **Legend:** - **Priority**: CRITICAL (must be up) / HIGH (important) / MEDIUM (nice to have) / LOW (can wait) - **Target**: Where it should run (constraint or affinity) - **Data**: What data it needs access to - **Changes**: What needs updating in the .hcl file ### Core Infrastructure #### mysql - **File**: `services/mysql.hcl` - **Priority**: CRITICAL - **Current**: Uses `/data/compute/appdata/mysql` - **Target**: Affinity for zippy, allow c1/c2 - **Data**: `/data/services/mysql` (NFS from zippy) - **Changes**: - ✏️ Volume path: `/data/compute/appdata/mysql` → `/data/services/mysql` - ✏️ Add affinity: ```hcl affinity { attribute = "${node.unique.name}" value = "zippy" weight = 100 } ``` - ✏️ Add constraint to allow fallback: ```hcl constraint { attribute = "${node.unique.name}" operator = "regexp" value = "zippy|c1|c2" } ``` - **Notes**: Core database, needs to stay up. Consul DNS `mysql.service.consul` unchanged. #### postgres - **File**: `services/postgres.hcl` - **Priority**: CRITICAL - **Current**: Uses `/data/compute/appdata/postgres`, `/data/compute/appdata/pgadmin` - **Target**: Affinity for zippy, allow c1/c2 - **Data**: `/data/services/postgres`, `/data/services/pgadmin` (NFS) - **Changes**: - ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/*` - ✏️ Add affinity and constraint (same as mysql) - **Notes**: Core database for authentik, gitea, plausible, netbox, etc. #### redis - **File**: `services/redis.hcl` - **Priority**: CRITICAL - **Current**: Uses `/data/compute/appdata/redis` - **Target**: Affinity for zippy, allow c1/c2 - **Data**: `/data/services/redis` (NFS) - **Changes**: - ✏️ Volume path: `/data/compute/appdata/redis` → `/data/services/redis` - ✏️ Add affinity and constraint (same as mysql) - **Notes**: Used by authentik, wordpress. Should co-locate with databases. 
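Once the three database jobs above are redeployed with the zippy affinity, a quick check confirms where each allocation landed and that Consul DNS still resolves the services. This is a sketch; it assumes the jobs keep their existing Consul service names (`mysql`, `postgres`, `redis`) and that Consul DNS is available on port 8600 as used elsewhere in this plan.

```bash
# Sketch: confirm placement and DNS for the stateful jobs after redeploy.
for job in mysql postgres redis; do
  echo "== $job =="
  nomad job status "$job" | grep -A 5 '^Allocations'   # Node column should normally show zippy
  dig +short @localhost -p 8600 "$job.service.consul"  # should resolve to the node running it
done
```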
#### traefik
- **File**: `services/traefik.hcl`
- **Priority**: CRITICAL
- **Current**: Uses `/data/compute/config/traefik`
- **Target**: Float on c1/c2/c3 (keepalived handles HA)
- **Data**: `/data/services/traefik` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/config/traefik` → `/data/services/traefik`
- **Notes**: Reverse proxy, has keepalived for VIP failover. Critical for all web access.

#### authentik
- **File**: `services/authentik.hcl`
- **Priority**: CRITICAL
- **Current**: No persistent volumes (stateless, uses postgres/redis)
- **Target**: Float on c1/c2/c3
- **Data**: None (uses postgres.service.consul, redis.service.consul)
- **Changes**: None needed
- **Notes**: SSO for most services. Must stay up.

### Monitoring Stack

#### prometheus
- **File**: `services/prometheus.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/prometheus`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/prometheus` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/prometheus` → `/data/services/prometheus`
- **Notes**: Metrics database. Important for monitoring but not critical for services.

#### grafana
- **File**: `services/grafana.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/grafana`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/grafana` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/grafana` → `/data/services/grafana`
- **Notes**: Monitoring UI. Depends on prometheus.

#### loki
- **File**: `services/loki.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/loki`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/loki` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/loki` → `/data/services/loki`
- **Notes**: Log aggregation. Important for debugging.

#### vector
- **File**: `services/vector.hcl`
- **Priority**: MEDIUM
- **Current**: No persistent volumes, type=system (runs on all nodes)
- **Target**: System job (runs everywhere)
- **Data**: None (ephemeral logs, ships to loki)
- **Changes**:
  - ❓ Check if glusterfs log path is still needed: `/var/log/glusterfs:/var/log/glusterfs:ro`
  - ✏️ Remove glusterfs log collection after GlusterFS is removed
- **Notes**: Log shipper. Can tolerate downtime.

### Databases (Specialized)

#### clickhouse
- **File**: `services/clickhouse.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/clickhouse`
- **Target**: Affinity for zippy (large dataset), allow c1/c2/c3
- **Data**: `/data/services/clickhouse` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/clickhouse` → `/data/services/clickhouse`
  - ✏️ Add affinity for zippy (optional, but helps with performance)
- **Notes**: Used by plausible. Large time-series data. Important but can be recreated.

#### mongodb
- **File**: `services/unifi.hcl` (embedded in unifi job)
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/unifi/mongodb`
- **Target**: Float on c1/c2/c3 (with unifi)
- **Data**: `/data/services/unifi/mongodb` (NFS)
- **Changes**: See unifi below
- **Notes**: Only used by unifi. Should stay with unifi controller.
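Before working through the remaining entries, it can help to audit which job files still reference the old GlusterFS and syncthing paths. A rough sketch, assuming the job specs live under `services/` as referenced throughout this plan:

```bash
# List job files that still reference the old paths, to track progress.
grep -l -e '/data/compute' -e '/data/sync' services/*.hcl | sort

# Show the exact lines for one job before editing it:
grep -n -e '/data/compute' -e '/data/sync' services/mysql.hcl
```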
### Web Applications

#### wordpress
- **File**: `services/wordpress.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/sync/wordpress` (syncthing-managed to avoid slow GlusterFS)
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/wordpress` (NFS from zippy)
- **Changes**:
  - ✏️ Volume path: `/data/sync/wordpress` → `/data/services/wordpress`
  - 📋 **Before cutover**: Copy data from syncthing to zippy: `rsync -av /data/sync/wordpress/ zippy:/persist/services/wordpress/` (see the cutover sketch at the end of this section)
  - 📋 **After migration**: Remove syncthing configuration for wordpress sync
- **Notes**: Production website. Important but can tolerate brief downtime during migration.

#### ghost
- **File**: `services/ghost.hcl`
- **Priority**: no longer used, should wipe
- **Current**: Uses `/data/compute/appdata/ghost`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/ghost` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/ghost` → `/data/services/ghost` (skip if wiping)
- **Notes**: Blog platform (alo.land). No longer used; remove rather than migrate.

#### gitea
- **File**: `services/gitea.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/gitea/data`, `/data/compute/appdata/gitea/config`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/gitea/*` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/gitea/*` → `/data/services/gitea/*`
- **Notes**: Git server. Contains code repositories. Important.

#### wiki (tiddlywiki)
- **File**: `services/wiki.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/wiki` via host volume mount
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/wiki` (NFS)
- **Changes**:
  - ✏️ Volume mount path in `volume_mount` blocks
  - ⚠️ Uses `exec` driver with host volumes - verify NFS mount works with this
- **Notes**: Multiple tiddlywiki instances. Personal wikis. Can tolerate downtime.

#### code-server
- **File**: `services/code-server.hcl`
- **Priority**: LOW
- **Current**: Uses `/data/compute/appdata/code`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/code` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/code` → `/data/services/code`
- **Notes**: Web IDE. Low priority, for development only.

#### beancount (fava)
- **File**: `services/beancount.hcl`
- **Priority**: MEDIUM
- **Current**: Uses `/data/compute/appdata/beancount`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/beancount` (NFS)
- **Changes**:
  - ✏️ Volume path: `/data/compute/appdata/beancount` → `/data/services/beancount`
- **Notes**: Finance tracking.

#### adminer
- **File**: `services/adminer.hcl`
- **Priority**: LOW
- **Current**: Stateless
- **Target**: Float on c1/c2/c3
- **Data**: None
- **Changes**: None needed
- **Notes**: Database admin UI. Only needed for maintenance.

#### plausible
- **File**: `services/plausible.hcl`
- **Priority**: HIGH
- **Current**: Stateless (uses postgres and clickhouse)
- **Target**: Float on c1/c2/c3
- **Data**: None (uses postgres.service.consul, clickhouse.service.consul)
- **Changes**: None needed
- **Notes**: Website analytics for the production sites. Brief downtime is acceptable.

#### evcc
- **File**: `services/evcc.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/evcc/evcc.yaml`, `/data/compute/appdata/evcc/evcc`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/evcc/*` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/evcc/*` → `/data/services/evcc/*`
- **Notes**: EV charging controller. Important for daily use.
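The wordpress cutover referenced above, as one possible sequence. This is a sketch: run the rsync from a node that still has the syncthing-managed copy at `/data/sync/wordpress`, and note that `--delete` makes the target an exact mirror (drop it if unsure).

```bash
# Sketch of the wordpress cutover: stop writes, final sync to zippy, redeploy.
nomad job stop wordpress                                    # stop writes first
rsync -av --delete /data/sync/wordpress/ zippy:/persist/services/wordpress/
ssh zippy "ls /persist/services/wordpress | head"           # sanity check
nomad run services/wordpress.hcl                            # updated spec mounts /data/services/wordpress
```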
#### vikunja
- **File**: `services/vikunja.hcl` (assumed to exist based on README)
- **Priority**: no longer used, should delete
- **Current**: Likely uses `/data/compute/appdata/vikunja`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/vikunja` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/vikunja`
- **Notes**: Task management. Low priority.

#### leantime
- **File**: `services/leantime.hcl`
- **Priority**: no longer used, should delete
- **Current**: Likely uses `/data/compute/appdata/leantime`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/leantime` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/leantime`
- **Notes**: Project management. Low priority.

### Network Infrastructure

#### unifi
- **File**: `services/unifi.hcl`
- **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/unifi/data`, `/data/compute/appdata/unifi/mongodb`
- **Target**: Float on c1/c2/c3/fractal/zippy
- **Data**: `/data/services/unifi/*` (NFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/unifi/*` → `/data/services/unifi/*`
- **Notes**: UniFi network controller. Critical for network management. Has keepalived VIP for stable inform address. Floating is fine.

### Media Stack

#### media (radarr, sonarr, bazarr, plex, qbittorrent)
- **File**: `services/media.hcl`
- **Priority**: MEDIUM
- **Current**: Uses `/data/compute/appdata/radarr`, `/data/compute/appdata/sonarr`, etc. and `/data/media`
- **Target**: **MUST run on fractal** (local /data/media access)
- **Data**:
  - `/data/services/radarr` (NFS) - config data
  - `/data/media` (local disk on fractal; other nodes reach it via CIFS)
- **Changes**:
  - ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/*`
  - ✏️ **Add constraint**:
    ```hcl
    constraint {
      attribute = "${node.unique.name}"
      value     = "fractal"
    }
    ```
- **Notes**: Heavy I/O to /data/media. Must run on fractal for performance. Has keepalived VIP. (See the quick check at the end of this subsection.)

### Utility Services

#### weewx
- **File**: `services/weewx.hcl`
- **Priority**: HIGH
- **Current**: Likely uses `/data/compute/appdata/weewx`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/weewx` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/weewx`
- **Notes**: Weather station data collection.

#### maps
- **File**: `services/maps.hcl`
- **Priority**: MEDIUM
- **Current**: Likely uses `/data/compute/appdata/maps`
- **Target**: Float on c1/c2/c3 (or fractal if large tile data)
- **Data**: `/data/services/maps` (NFS) or `/data/media/maps` if large
- **Changes**:
  - ✏️ Volume paths: Check data size, may want to move to /data/media
- **Notes**: Map tiles. Low priority.

#### netbox
- **File**: `services/netbox.hcl`
- **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/netbox`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/netbox` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/netbox`
- **Notes**: IPAM/DCIM. Low priority, for documentation.

#### farmos
- **File**: `services/farmos.hcl`
- **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/farmos`
- **Target**: Float on c1/c2/c3
- **Data**: `/data/services/farmos` (NFS)
- **Changes**:
  - ✏️ Volume paths: Update to `/data/services/farmos`
- **Notes**: Farm management. Low priority.
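Several entries above and below hinge on whether a job actually touches `/data/media` (the decision above is that only media.hcl does). A quick check settles it and also confirms the media job's placement once the fractal constraint is in place; sketch only, assuming job specs under `services/`:

```bash
# Verify only media.hcl references /data/media, and that media landed on fractal.
grep -l '/data/media' services/*.hcl    # expected output: services/media.hcl only
nomad job status media                  # allocation table should show fractal as the node
```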
#### urbit - **File**: `services/urbit.hcl` - **Priority**: LOW - **Current**: Likely uses `/data/compute/appdata/urbit` - **Target**: Float on c1/c2/c3 - **Data**: `/data/services/urbit` (NFS) - **Changes**: - ✏️ Volume paths: Update to `/data/services/urbit` - **Notes**: Urbit node. Experimental, low priority. #### webodm - **File**: `services/webodm.hcl` - **Priority**: LOW - **Current**: Likely uses `/data/compute/appdata/webodm` - **Target**: Float on c1/c2/c3 (or fractal if processing large imagery from /data/media) - **Data**: `/data/services/webodm` (NFS) - **Changes**: - ✏️ Volume paths: Update to `/data/services/webodm` - 🤔 May benefit from running on fractal if it processes files from /data/media - **Notes**: Drone imagery processing. Low priority. #### velutrack - **File**: `services/velutrack.hcl` - **Priority**: LOW - **Current**: Likely minimal state - **Target**: Float on c1/c2/c3 - **Data**: Minimal - **Changes**: Verify if any volume paths need updating - **Notes**: Vehicle tracking. Low priority. #### resol-gateway - **File**: `services/resol-gateway.hcl` - **Priority**: HIGH - **Current**: Likely minimal state - **Target**: Float on c1/c2/c3 - **Data**: Minimal - **Changes**: Verify if any volume paths need updating - **Notes**: Solar thermal controller. Low priority. #### igsync - **File**: `services/igsync.hcl` - **Priority**: MEDIUM - **Current**: Likely uses `/data/compute/appdata/igsync` or `/data/media` - **Target**: Float on c1/c2/c3 (or fractal if storing to /data/media) - **Data**: Check if it writes to `/data/media` or `/data/services` - **Changes**: - ✏️ Volume paths: Verify and update - **Notes**: Instagram sync. Low priority. #### jupyter - **File**: `services/jupyter.hcl` - **Priority**: LOW - **Current**: Stateless or minimal state - **Target**: Float on c1/c2/c3 - **Data**: Minimal - **Changes**: Verify if any volume paths need updating - **Notes**: Notebook server. Low priority, for experimentation. #### whoami - **File**: `services/whoami.hcl` - **Priority**: LOW - **Current**: Stateless - **Target**: Float on c1/c2/c3 - **Data**: None - **Changes**: None needed - **Notes**: Test service. Can be stopped during migration. #### tiddlywiki (if separate from wiki.hcl) - **File**: `services/tiddlywiki.hcl` - **Priority**: MEDIUM - **Current**: Likely same as wiki.hcl - **Target**: Float on c1/c2/c3 - **Data**: `/data/services/tiddlywiki` (NFS) - **Changes**: Same as wiki.hcl - **Notes**: May be duplicate of wiki.hcl. ### Backup Jobs #### mysql-backup - **File**: `services/mysql-backup.hcl` - **Priority**: HIGH - **Current**: Likely writes to `/data/compute` or `/data/shared` - **Target**: Float on c1/c2/c3 - **Data**: Should write to `/data/shared` (backed up to fractal) - **Changes**: - ✏️ Verify backup destination, should be `/data/shared/backups/mysql` - **Notes**: Important for disaster recovery. Should run regularly. #### postgres-backup - **File**: `services/postgres-backup.hcl` - **Priority**: HIGH - **Current**: Likely writes to `/data/compute` or `/data/shared` - **Target**: Float on c1/c2/c3 - **Data**: Should write to `/data/shared` (backed up to fractal) - **Changes**: - ✏️ Verify backup destination, should be `/data/shared/backups/postgres` - **Notes**: Important for disaster recovery. Should run regularly. 
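For the two backup jobs above, the dump step is expected to look roughly like the sketch below, writing to `/data/shared` so the results land on fractal's kopia-backed storage. Hostnames and flags are illustrative; the real jobs keep their existing credentials and connection settings.

```bash
# Sketch of the dump step for mysql-backup / postgres-backup.
BACKUP_DIR=/data/shared/backups
DATE=$(date +%F)
mkdir -p "$BACKUP_DIR"/mysql "$BACKUP_DIR"/postgres

# Dump MySQL via its Consul service name
mysqldump -h mysql.service.consul --all-databases --single-transaction \
  | gzip > "$BACKUP_DIR/mysql/all-$DATE.sql.gz"

# Dump all PostgreSQL databases via its Consul service name
pg_dumpall -h postgres.service.consul -U postgres \
  | gzip > "$BACKUP_DIR/postgres/all-$DATE.sql.gz"
```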
#### wordpress-backup
- **File**: `services/wordpress-backup.hcl`
- **Priority**: MEDIUM
- **Current**: Likely writes to `/data/compute` or `/data/shared`
- **Target**: Float on c1/c2/c3
- **Data**: Should write to `/data/shared` (backed up to fractal)
- **Changes**:
  - ✏️ Verify backup destination
- **Notes**: Periodic backup job.

---

## Failover Procedures

### NFS Server Failover (zippy → c1 or c2)

**When to use:** zippy is down and not coming back soon

**Prerequisites:**
- c1 and c2 have been receiving btrfs snapshots from zippy
- Last successful replication is recent (replication runs every 5 minutes - verify the snapshot timestamps)

**Procedure:**

1. **Choose standby node** (c1 or c2)
   ```bash
   # Check replication freshness
   ssh c1 "ls -lt /persist/services-standby/services@* | head -5"
   ssh c2 "ls -lt /persist/services-standby/services@* | head -5"

   # Choose the one with most recent snapshot
   # For this example, we'll use c1
   ```

2. **On standby node (c1), promote standby to primary**
   ```bash
   ssh c1

   # Stop the NFS client automount/mount (if active)
   sudo systemctl stop data-services.automount data-services.mount

   # Find latest snapshot
   LATEST=$(ls -t /persist/services-standby/services@* | head -1)

   # Create writable subvolume from snapshot
   sudo btrfs subvolume snapshot $LATEST /persist/services

   # Verify
   ls -la /persist/services
   ```

3. **Deploy c1-nfs-server configuration**
   ```bash
   # From your workstation
   deploy -s '.#c1-nfs-server'

   # This activates:
   # - NFS server on c1
   # - Consul service registration for "services"
   # - Firewall rules
   ```

4. **On c1, verify NFS is running**
   ```bash
   ssh c1
   sudo systemctl status nfs-server
   showmount -e localhost
   dig @localhost -p 8600 services.service.consul  # Should show c1's IP
   ```

5. **On other nodes, remount NFS**
   ```bash
   # Nodes should auto-remount via Consul DNS, but you can force it:
   for host in c2 c3 fractal zippy; do
     ssh $host "sudo systemctl restart data-services.mount"
   done
   ```

6. **Verify Nomad jobs are healthy**
   ```bash
   nomad job status mysql
   nomad job status postgres
   # Check all critical services
   ```

7. **Update monitoring/alerts**
   - Note in documentation that c1 is now primary NFS server
   - Set up alert to remember to fail back to zippy when it's repaired

**Recovery Time Objective (RTO):** ~10-15 minutes
**Recovery Point Objective (RPO):** Last snapshot interval (**5 minutes** max)

### Failing Back to zippy

**When to use:** zippy is repaired and ready to resume primary role

**Prerequisites:** schedule a brief maintenance window and stop (or quiesce) the jobs writing to `/data/services`, so the snapshot sent back to zippy is consistent.

**Procedure:**

1. **Sync data from c1 back to zippy**
   ```bash
   # On c1 (current primary)
   sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)
   FAILBACK=$(ls -t /persist/services@failback-* | head -1)
   sudo btrfs send $FAILBACK | ssh zippy "sudo btrfs receive /persist/"

   # On zippy, move the stale pre-failure copy aside, then make the received
   # snapshot writable
   ssh zippy "sudo mv /persist/services /persist/services-stale-$(date +%Y%m%d)"
   ssh zippy "sudo btrfs subvolume snapshot /persist/$(basename $FAILBACK) /persist/services"
   ```

2. **Deploy zippy back to NFS server role**
   ```bash
   deploy -s '.#zippy'
   # Consul will register services.service.consul → zippy again
   ```

3. **Demote c1 back to standby**
   ```bash
   deploy -s '.#c1'
   # This removes NFS server, restores NFS client mount
   ```

4. **Verify all nodes are mounting from zippy**
   ```bash
   dig @c1 -p 8600 services.service.consul  # Should show zippy's IP
   for host in c1 c2 c3 fractal; do
     ssh $host "df -h | grep services"
   done
   ```

### Database Job Failover (automatic via Nomad)

**When to use:** zippy is down, database jobs need to run elsewhere

**What happens automatically:**
1. Nomad detects zippy is unhealthy
2. Jobs with constraint `zippy|c1|c2` are rescheduled to c1 or c2
3.
Jobs start on new node, accessing `/data/services` (now via NFS from promoted standby) **Manual intervention needed:** - None if NFS failover completed successfully - If jobs are stuck: `nomad job stop mysql && nomad job run services/mysql.hcl` **What to check:** ```bash nomad job status mysql nomad job status postgres nomad job status redis # Verify they're running on c1 or c2, not zippy nomad alloc status ``` ### Complete Cluster Failure (lose quorum) **Scenario:** 3 or more servers go down, quorum lost **Prevention:** This is why we have 5 servers (need 3 for quorum) **Recovery:** 1. **Bring up at least 3 servers** (any 3 from c1, c2, c3, fractal, zippy) 2. **If that's not possible, bootstrap new cluster:** ```bash # On one surviving server, force bootstrap consul force-leave nomad operator raft list-peers nomad operator raft remove-peer ``` 3. **Restore from backups** (worst case) --- ## Post-Migration Verification Checklist - [ ] All 5 servers in quorum: `consul members` shows c1, c2, c3, fractal, zippy - [ ] NFS mounts working: `df -h | grep services` on all nodes - [ ] Btrfs replication running: Check systemd timers on zippy - [ ] Critical services up: mysql, postgres, redis, traefik, authentik - [ ] Monitoring working: Prometheus, Grafana, Loki accessible - [ ] Media stack on fractal: `nomad alloc status` shows media job on fractal - [ ] Database jobs on zippy: `nomad alloc status` shows mysql/postgres on zippy - [ ] Consul DNS working: `dig @localhost -p 8600 services.service.consul` - [ ] Backups running: Kopia snapshots include `/persist/services` - [ ] GlusterFS removed: No glusterfs processes, volumes deleted - [ ] Documentation updated: README.md, architecture diagrams --- ## Rollback Plan **If migration fails catastrophically:** 1. **Stop all new Nomad jobs** ```bash nomad job stop -purge ``` 2. **Restore GlusterFS mounts** ```bash # On all nodes, re-enable GlusterFS client deploy # With old configs ``` 3. **Restart old Nomad jobs** ```bash # With old paths pointing to /data/compute nomad run services/*.hcl # Old versions from git ``` 4. **Restore data if needed** ```bash rsync -av /backup/compute-pre-migration/ /data/compute/ ``` **Important:** Keep GlusterFS running until Phase 4 is complete and verified! --- ## Questions Answered 1. ✅ **Where is `/data/sync/wordpress` mounted from?** - **Answer**: Syncthing-managed to avoid slow GlusterFS - **Action**: Migrate to `/data/services/wordpress`, remove syncthing config 2. ✅ **Which services use `/data/media` directly?** - **Answer**: Only media.hcl (radarr, sonarr, plex, qbittorrent) - **Action**: Constrain media.hcl to fractal, everything else uses CIFS mount 3. ✅ **Do we want unifi on fractal or floating?** - **Answer**: Floating is fine - **Action**: No constraint needed 4. ✅ **What's the plan for sunny's existing data?** - **Answer**: Ethereum data stays local, not replicated (too expensive) - **Action**: Either backup/restore or resync from network during NixOS conversion ## Questions Still to Answer 1. **Backup retention for btrfs snapshots?** - Current plan: Keep 24 hours of snapshots on zippy - Is this enough? Or do we want more for safety? - This should be fine -- snapshots are just for hot recovery. More/older backups are kept via kopia on fractal. 2. 
**c1-nfs-server vs c1 config - same host, different configs?**
   - Recommendation: Use same hostname, different flake output
     - `c1` = normal config with NFS client
     - `c1-nfs-server` = variant with NFS server enabled
     - Both in flake.nix, deploy appropriate one based on role
   - Answer: recommendation makes sense.

3. **Should we verify webodm, igsync, maps don't need /data/media access?**
   - none of them needs /data/media
   - maps needs /data/shared

---

## Timeline Estimate

**Total duration: 15-22 hours** excluding the optional sunny conversion (add 6-10 hours for Phase 5); can be split across multiple sessions

- Phase 0 (Prep): 1-2 hours
- Phase 1 (fractal): 6-8 hours
- Phase 2 (zippy storage): 2-3 hours
- Phase 3 (GlusterFS → NFS): 3-4 hours
- Phase 4 (Nomad jobs): 2-4 hours
- Phase 5 (sunny): 6-10 hours (optional, can be done later)
- Phase 6 (Cleanup): 1 hour

**Suggested schedule:**
- **Day 1**: Phases 0-1 (fractal conversion, establish quorum)
- **Day 2**: Phases 2-3 (zippy storage, data migration)
- **Day 3**: Phase 4 (Nomad job updates and deployment)
- **Day 4**: Phases 5-6 (sunny + cleanup) or take a break and do later

**Maintenance windows needed:**
- Phase 3: ~1 hour downtime (all services stopped during data migration)
- Phase 4: Rolling (services come back up as redeployed)
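A quick smoke test to run at the end of each maintenance window, covering the main items from the post-migration checklist. This is a sketch; it assumes passwordless SSH to all nodes and the service/job names used throughout this plan.

```bash
# Sketch: end-of-window smoke test.
echo "== quorum ==" && consul members && nomad server members

echo "== NFS primary ==" && dig +short @localhost -p 8600 services.service.consul

echo "== /data/services on every node =="
for host in c1 c2 c3 fractal zippy; do
  ssh "$host" "df -h /data/services | tail -1"   # also triggers the automount
done

echo "== critical jobs =="
for job in mysql postgres redis traefik authentik; do
  echo "-- $job --" && nomad job status "$job" | head -n 5
done
```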