Compare commits

...

17 Commits

SHA1 Message Date
2437d46aa9 Move unifi to zippy. 2025-10-22 14:51:39 +01:00
d16ffd9c65 Upgrade to 2025.6. 2025-10-22 14:51:28 +01:00
49f159e2a6 Move loki to zippy. 2025-10-22 14:39:13 +01:00
17c0f2db2a Move prometheus to zippy. 2025-10-22 14:29:57 +01:00
c80a2c9a58 Remove unused leantime config. 2025-10-22 14:23:39 +01:00
706f46ae77 And another replication fix. 2025-10-22 14:22:39 +01:00
fa603e8aea Move clickhouse to zippy. 2025-10-22 14:19:50 +01:00
8032ad4d20 Move redis to zippy. 2025-10-22 14:11:37 +01:00
8ce5194ca9 YA replication fix. 2025-10-22 14:08:28 +01:00
a948f26ffb Move postgres to zippy. 2025-10-22 14:05:45 +01:00
f414ac0146 Fix path names. 2025-10-22 13:59:31 +01:00
17711da0b6 Fix replication again. 2025-10-22 13:59:25 +01:00
ed06f07116 More docs. 2025-10-22 13:50:03 +01:00
bffc09cbd6 Ignore NFS primary/standby snapshots for backup. 2025-10-22 13:45:03 +01:00
f488b710bf Fix incremental snapshot logic. 2025-10-22 13:41:28 +01:00
65835e1ed0 Run mysql on the primary storage machine. 2025-10-22 13:20:13 +01:00
967ff34a51 NFS server and client setup. 2025-10-22 13:06:21 +01:00
25 changed files with 1120 additions and 124 deletions

CLAUDE.md Normal file
View File

@@ -0,0 +1,92 @@
# Claude Code Quick Reference
NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Consul orchestration.
## Project Structure
```
├── common/
│ ├── global/ # Applied to all hosts (backup, sops, users, etc.)
│ ├── compute-node.nix # Nomad client + Consul agent + NFS client
│ ├── cluster-node.nix # Nomad server + Consul server (for quorum members)
│ ├── nfs-services-server.nix # NFS server + btrfs replication (zippy)
│ └── nfs-services-standby.nix # NFS standby + receive replication (c1, c2)
├── hosts/
│ ├── c1/, c2/, c3/ # Cattle nodes (compute, quorum members)
│ ├── zippy/ # Primary storage + NFS server + stateful workloads
│ ├── fractal/ # (Proxmox, will become NixOS storage node)
│ ├── sunny/ # (Standalone ethereum node, not in cluster)
│ └── chilly/ # (Home Assistant VM, not in cluster)
├── docs/
│ ├── CLUSTER_REVAMP.md # Master plan for architecture changes
│ ├── MIGRATION_TODO.md # Tracking checklist for migration
│ └── NFS_FAILOVER.md # NFS failover procedures
└── services/ # Nomad job specs (.hcl files)
```
## Current Architecture (transitioning)
**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
**NEW**: NFS from zippy at `/data/services` (current target)
### Storage Mounts
- `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)
- `/data/media` - CIFS from fractal (existing, unchanged)
- `/data/shared` - CIFS from fractal (existing, unchanged)
### Hosts
- **c1, c2, c3**: Cattle nodes, run most workloads, Nomad/Consul quorum
- **zippy**: Primary NFS server, runs databases (affinity), replicates to c1 every 5min
- **fractal**: Storage node (Proxmox/ZFS), will join quorum after GlusterFS removed
- **sunny**: Standalone ethereum staking node
- **chilly**: Home Assistant VM
## Key Patterns
**NFS Server/Standby**:
- Primary (zippy): imports `nfs-services-server.nix`, sets `standbys = ["c1"]`
- Standby (c1): imports `nfs-services-standby.nix`, sets `replicationKeys = [...]`
- Replication: btrfs send/receive every 5min, incremental with fallback to full
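A minimal wiring sketch (option names from `nfs-services-server.nix` / `nfs-services-standby.nix`; the key string is abbreviated):
```nix
# hosts/zippy/default.nix (primary): replicate to c1
nfsServicesServer.standbys = [ "c1" ];

# hosts/c1/default.nix (standby): authorize zippy's replication key
nfsServicesStandby.replicationKeys = [ "ssh-ed25519 AAAA... root@zippy-replication" ];
```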
**Backups**:
- Kopia client on all nodes → Kopia server on fractal
- Backs up `/persist` hourly via btrfs snapshot
- Excludes: `services@*` and `services-standby/services@*` (replication snapshots)
**Secrets**:
- SOPS for secrets, files in `secrets/`
- Keys managed per-host
## Migration Status
**Phase**: 2 complete, ready for Phase 3
**Current**: Migrating GlusterFS → NFS
**Next**: Copy data, update Nomad jobs, remove GlusterFS
**Later**: Convert fractal to NixOS (deferred)
See `docs/MIGRATION_TODO.md` for detailed checklist.
## Common Tasks
**Deploy a host**: `deploy -s '.#hostname'`
**Deploy all**: `deploy`
**Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
**NFS failover**: See `docs/NFS_FAILOVER.md`
**Nomad jobs**: `services/*.hcl` - update paths: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
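For the bulk path rewrite, a hedged one-liner (assumes GNU sed; review `git diff` before deploying, since a blind replace can also touch comments):
```bash
# /data/compute/appdata/foo -> /data/services/foo across all job specs
sed -i 's#/data/compute/appdata/#/data/services/#g' services/*.hcl
```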
## Troubleshooting Hints
- Replication errors with "empty stream": SSH key restricted to `btrfs receive`, can't run other commands
- NFS split-brain protection: nfs-server checks Consul before starting
- Btrfs snapshots: nested snapshots appear as empty dirs in parent snapshots
- Kopia: uses temporary snapshot for consistency, doesn't back up nested subvolumes
## Important Files
- `common/global/backup.nix` - Kopia backup configuration
- `hosts/zippy/default.nix` - NFS server config, replication targets
- `hosts/c1/default.nix` - NFS standby config, authorized replication keys
- `flake.nix` - Host definitions, nixpkgs inputs
---
*Auto-generated reference for Claude Code. Keep concise. Update when architecture changes.*

View File

@@ -1,13 +1,14 @@
{ pkgs, ... }: { pkgs, ... }:
{ {
# Cluster node configuration # Cluster node configuration
# Extends minimal-node with cluster-specific services (Consul, GlusterFS, CIFS) # Extends minimal-node with cluster-specific services (Consul, GlusterFS, CIFS, NFS)
# Used by: compute nodes (c1, c2, c3) # Used by: compute nodes (c1, c2, c3)
imports = [ imports = [
./minimal-node.nix ./minimal-node.nix
./unattended-encryption.nix ./unattended-encryption.nix
./cifs-client.nix ./cifs-client.nix
./consul.nix ./consul.nix
./glusterfs-client.nix ./glusterfs-client.nix # Keep during migration, will be removed in Phase 3
./nfs-services-client.nix # New: NFS client for /data/services
]; ];
} }

View File

@@ -21,7 +21,11 @@ let
${btrfs} subvolume snapshot -r "$target_path" "$snapshot_path" ${btrfs} subvolume snapshot -r "$target_path" "$snapshot_path"
# --no-send-snapshot-path due to https://github.com/kopia/kopia/issues/4402 # --no-send-snapshot-path due to https://github.com/kopia/kopia/issues/4402
${kopia} snapshot create --no-send-snapshot-report --override-source "$target_path" -- "$snapshot_path" # Exclude btrfs replication snapshots (they appear as empty dirs in the snapshot anyway)
${kopia} snapshot create --no-send-snapshot-report --override-source "$target_path" \
--ignore "services@*" \
--ignore "services-standby/services@*" \
-- "$snapshot_path"
${btrfs} subvolume delete "$snapshot_path" ${btrfs} subvolume delete "$snapshot_path"
${kopia} repository disconnect ${kopia} repository disconnect

View File

@@ -0,0 +1,21 @@
{ pkgs, ... }:
{
# NFS client for /data/services
# Mounts from data-services.service.consul (Consul DNS for automatic failover)
# The NFS server registers itself in Consul, so this will automatically
# point to whichever host is currently running the NFS server
fileSystems."/data/services" = {
device = "data-services.service.consul:/persist/services";
fsType = "nfs";
options = [
"x-systemd.automount" # Auto-mount on access
"noauto" # Don't mount at boot (automount handles it)
"x-systemd.idle-timeout=60" # Unmount after 60s of inactivity
"_netdev" # Network filesystem (wait for network)
];
};
# Ensure NFS client packages are available
environment.systemPackages = [ pkgs.nfs-utils ];
}
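With `x-systemd.automount` the share mounts lazily on first access; a quick sanity check (unit name `data-services.mount`, derived from the mount path):
```bash
ls /data/services                     # first access triggers the automount
systemctl status data-services.mount  # inspect the resulting mount unit
```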

View File

@@ -0,0 +1,176 @@
{ config, lib, pkgs, ... }:
let
cfg = config.nfsServicesServer;
in
{
options.nfsServicesServer = {
enable = lib.mkEnableOption "NFS services server" // { default = true; };
standbys = lib.mkOption {
type = lib.types.listOf lib.types.str;
default = [];
description = ''
List of standby hostnames to replicate to (e.g. ["c1"]).
Requires one-time setup on the NFS server:
sudo mkdir -p /persist/root/.ssh
sudo ssh-keygen -t ed25519 -f /persist/root/.ssh/btrfs-replication -N "" -C "root@$(hostname)-replication"
Then add the public key to each standby's nfsServicesStandby.replicationKeys option.
'';
};
};
config = lib.mkIf cfg.enable {
# Persist root SSH directory for replication key
environment.persistence."/persist" = {
directories = [
"/root/.ssh"
];
};
# Bind mount /persist/services to /data/services for local access
# This makes the path consistent with NFS clients
# Use mkForce to override the NFS client mount from cluster-node.nix
fileSystems."/data/services" = lib.mkForce {
device = "/persist/services";
fsType = "none";
options = [ "bind" ];
};
# Nomad node metadata: mark this as the primary storage node
# Jobs can constrain to ${meta.storage_role} = "primary"
services.nomad.settings.client.meta = {
storage_role = "primary";
};
# NFS server configuration
services.nfs.server = {
enable = true;
exports = ''
/persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
'';
};
# Consul service registration for NFS
services.consul.extraConfig.services = [{
name = "data-services";
port = 2049;
checks = [{
tcp = "localhost:2049";
interval = "30s";
}];
}];
# Firewall for NFS
networking.firewall.allowedTCPPorts = [ 2049 111 20048 ];
networking.firewall.allowedUDPPorts = [ 2049 111 20048 ];
# systemd services: NFS server split-brain check + replication services
systemd.services = lib.mkMerge ([
# Safety check: prevent split-brain by ensuring no other NFS server is active
{
nfs-server = {
preStart = ''
# Wait for Consul to be available
for i in {1..30}; do
if ${pkgs.netcat}/bin/nc -z localhost 8600; then
break
fi
echo "Waiting for Consul DNS... ($i/30)"
sleep 1
done
# Check if another NFS server is already registered in Consul
CURRENT_SERVER=$(${pkgs.dnsutils}/bin/dig +short @localhost -p 8600 data-services.service.consul | head -1 || true)
MY_IP=$(${pkgs.iproute2}/bin/ip -4 addr show | ${pkgs.gnugrep}/bin/grep -oP '(?<=inet\s)\d+(\.\d+){3}' | ${pkgs.gnugrep}/bin/grep -v '^127\.' | head -1)
if [ -n "$CURRENT_SERVER" ] && [ "$CURRENT_SERVER" != "$MY_IP" ]; then
echo "ERROR: Another NFS server is already active at $CURRENT_SERVER"
echo "This host ($MY_IP) is configured as NFS server but should be standby."
echo "To fix:"
echo " 1. If this is intentional (failback), first demote the other server"
echo " 2. Update this host's config to use nfs-services-standby.nix instead"
echo " 3. Sync data from active server before promoting this host"
exit 1
fi
echo "NFS server startup check passed (no other active server found)"
'';
};
}
] ++ (lib.forEach cfg.standbys (standby: {
"replicate-services-to-${standby}" = {
description = "Replicate /persist/services to ${standby}";
path = [ pkgs.btrfs-progs pkgs.openssh pkgs.coreutils pkgs.findutils pkgs.gnugrep ];
script = ''
set -euo pipefail
SSH_KEY="/persist/root/.ssh/btrfs-replication"
if [ ! -f "$SSH_KEY" ]; then
echo "ERROR: SSH key not found at $SSH_KEY"
echo "Run: sudo ssh-keygen -t ed25519 -f $SSH_KEY -N \"\" -C \"root@$(hostname)-replication\""
exit 1
fi
SNAPSHOT_NAME="services@$(date +%Y%m%d-%H%M%S)"
SNAPSHOT_PATH="/persist/$SNAPSHOT_NAME"
# Create readonly snapshot
btrfs subvolume snapshot -r /persist/services "$SNAPSHOT_PATH"
# Find previous snapshot on sender (sort by name since readonly snapshots have same mtime)
# Use -d to list directories only, not their contents
PREV_LOCAL=$(ls -1d /persist/services@* 2>/dev/null | grep -v "^$SNAPSHOT_PATH$" | sort -r | head -1 || true)
# Try incremental send if we have a parent, fall back to full send if it fails
if [ -n "$PREV_LOCAL" ]; then
echo "Attempting incremental send from $(basename $PREV_LOCAL) to ${standby}"
# Try incremental send, if it fails (e.g., parent missing on receiver), fall back to full
if btrfs send -p "$PREV_LOCAL" "$SNAPSHOT_PATH" | \
ssh -i "$SSH_KEY" -o StrictHostKeyChecking=accept-new root@${standby} \
"btrfs receive /persist/services-standby"; then
echo "Incremental send completed successfully"
else
echo "Incremental send failed (likely missing parent on receiver), falling back to full send"
btrfs send "$SNAPSHOT_PATH" | \
ssh -i "$SSH_KEY" -o StrictHostKeyChecking=accept-new root@${standby} \
"btrfs receive /persist/services-standby"
fi
else
# First snapshot, do full send
echo "Full send to ${standby} (first snapshot)"
btrfs send "$SNAPSHOT_PATH" | \
ssh -i "$SSH_KEY" -o StrictHostKeyChecking=accept-new root@${standby} \
"btrfs receive /persist/services-standby"
fi
# Cleanup old snapshots on sender (keep last 24 hours = 288 snapshots at 5min intervals)
find /persist -maxdepth 1 -name 'services@*' -mmin +1440 -exec btrfs subvolume delete {} \;
'';
serviceConfig = {
Type = "oneshot";
User = "root";
};
};
}))
);
systemd.timers = lib.mkMerge (
lib.forEach cfg.standbys (standby: {
"replicate-services-to-${standby}" = {
description = "Timer for replicating /persist/services to ${standby}";
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "*:0/5"; # Every 5 minutes
Persistent = true;
};
};
})
);
};
}

View File

@@ -0,0 +1,68 @@
{ config, lib, pkgs, ... }:
let
cfg = config.nfsServicesStandby;
in
{
options.nfsServicesStandby = {
enable = lib.mkEnableOption "NFS services standby" // { default = true; };
replicationKeys = lib.mkOption {
type = lib.types.listOf lib.types.str;
default = [];
description = ''
SSH public keys authorized to replicate btrfs snapshots to this standby.
These keys are restricted to only run 'btrfs receive /persist/services-standby'.
Get the public key from the NFS server:
ssh <nfs-server> sudo cat /persist/root/.ssh/btrfs-replication.pub
'';
};
};
config = lib.mkIf cfg.enable {
# Allow root SSH login for replication (restricted by command= in authorized_keys)
# This is configured in common/sshd.nix
# Restricted SSH keys for btrfs replication
users.users.root.openssh.authorizedKeys.keys =
map (key: ''command="btrfs receive /persist/services-standby",restrict ${key}'') cfg.replicationKeys;
# Mount point for services-standby subvolume
# This is just declarative documentation - the subvolume must be created manually once:
# sudo btrfs subvolume create /persist/services-standby
# After that, it will persist across reboots (it's under /persist)
fileSystems."/persist/services-standby" = {
device = "/persist/services-standby";
fsType = "none";
options = [ "bind" ];
noCheck = true;
};
# Cleanup old snapshots on standby (keep last 48 hours for safety)
systemd.services.cleanup-services-standby-snapshots = {
description = "Cleanup old btrfs snapshots in services-standby";
path = [ pkgs.btrfs-progs pkgs.findutils ];
script = ''
set -euo pipefail
# Keep last 48 hours of snapshots (576 snapshots at 5min intervals)
find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +2880 -exec btrfs subvolume delete {} \; || true
'';
serviceConfig = {
Type = "oneshot";
User = "root";
};
};
systemd.timers.cleanup-services-standby-snapshots = {
description = "Timer for cleaning up old snapshots on standby";
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "daily";
Persistent = true;
};
};
};
}

View File

@@ -5,6 +5,7 @@
settings = { settings = {
PasswordAuthentication = false; PasswordAuthentication = false;
KbdInteractiveAuthentication = false; KbdInteractiveAuthentication = false;
PermitRootLogin = "prohibit-password"; # Allow root login with SSH keys only
}; };
}; };

View File

@@ -146,7 +146,7 @@ fileSystems."/data/services" = {
## Migration Steps ## Migration Steps
**Important path simplification note:** **Important path simplification note:**
- All service paths use `/data/services/*` directly (not `/data/services/appdata/*`) - All service paths use `/data/services/*` directly (not `/data/services/appdata/*`)
- Example: `/data/compute/appdata/mysql` → `/data/services/mysql` - Example: `/data/compute/appdata/mysql` → `/data/services/mysql`
- Simpler, cleaner, easier to manage - Simpler, cleaner, easier to manage
@@ -1024,9 +1024,9 @@ EOF
- **Priority**: CRITICAL - **Priority**: CRITICAL
- **Current**: Uses `/data/compute/appdata/mysql` - **Current**: Uses `/data/compute/appdata/mysql`
- **Target**: Affinity for zippy, allow c1/c2 - **Target**: Affinity for zippy, allow c1/c2
- **Data**: `/data/services/appdata/mysql` (NFS from zippy) - **Data**: `/data/services/mysql` (NFS from zippy)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/mysql` → `/data/services/appdata/mysql` - ✏️ Volume path: `/data/compute/appdata/mysql` → `/data/services/mysql`
- ✏️ Add affinity: - ✏️ Add affinity:
```hcl ```hcl
affinity { affinity {
@@ -1050,9 +1050,9 @@ EOF
- **Priority**: CRITICAL - **Priority**: CRITICAL
- **Current**: Uses `/data/compute/appdata/postgres`, `/data/compute/appdata/pgadmin` - **Current**: Uses `/data/compute/appdata/postgres`, `/data/compute/appdata/pgadmin`
- **Target**: Affinity for zippy, allow c1/c2 - **Target**: Affinity for zippy, allow c1/c2
- **Data**: `/data/services/appdata/postgres`, `/data/services/appdata/pgadmin` (NFS) - **Data**: `/data/services/postgres`, `/data/services/pgadmin` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/appdata/*` - ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/*`
- ✏️ Add affinity and constraint (same as mysql) - ✏️ Add affinity and constraint (same as mysql)
- **Notes**: Core database for authentik, gitea, plausible, netbox, etc. - **Notes**: Core database for authentik, gitea, plausible, netbox, etc.
@@ -1061,9 +1061,9 @@ EOF
- **Priority**: CRITICAL - **Priority**: CRITICAL
- **Current**: Uses `/data/compute/appdata/redis` - **Current**: Uses `/data/compute/appdata/redis`
- **Target**: Affinity for zippy, allow c1/c2 - **Target**: Affinity for zippy, allow c1/c2
- **Data**: `/data/services/appdata/redis` (NFS) - **Data**: `/data/services/redis` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/redis` → `/data/services/appdata/redis` - ✏️ Volume path: `/data/compute/appdata/redis` → `/data/services/redis`
- ✏️ Add affinity and constraint (same as mysql) - ✏️ Add affinity and constraint (same as mysql)
- **Notes**: Used by authentik, wordpress. Should co-locate with databases. - **Notes**: Used by authentik, wordpress. Should co-locate with databases.
@@ -1093,9 +1093,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/prometheus` - **Current**: Uses `/data/compute/appdata/prometheus`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/prometheus` (NFS) - **Data**: `/data/services/prometheus` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/prometheus` → `/data/services/appdata/prometheus` - ✏️ Volume path: `/data/compute/appdata/prometheus` → `/data/services/prometheus`
- **Notes**: Metrics database. Important for monitoring but not critical for services. - **Notes**: Metrics database. Important for monitoring but not critical for services.
#### grafana #### grafana
@@ -1103,9 +1103,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/grafana` - **Current**: Uses `/data/compute/appdata/grafana`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/grafana` (NFS) - **Data**: `/data/services/grafana` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/grafana` → `/data/services/appdata/grafana` - ✏️ Volume path: `/data/compute/appdata/grafana` → `/data/services/grafana`
- **Notes**: Monitoring UI. Depends on prometheus. - **Notes**: Monitoring UI. Depends on prometheus.
#### loki #### loki
@@ -1113,9 +1113,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/loki` - **Current**: Uses `/data/compute/appdata/loki`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/loki` (NFS) - **Data**: `/data/services/loki` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/loki` → `/data/services/appdata/loki` - ✏️ Volume path: `/data/compute/appdata/loki` → `/data/services/loki`
- **Notes**: Log aggregation. Important for debugging. - **Notes**: Log aggregation. Important for debugging.
#### vector #### vector
@@ -1136,9 +1136,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/clickhouse` - **Current**: Uses `/data/compute/appdata/clickhouse`
- **Target**: Affinity for zippy (large dataset), allow c1/c2/c3 - **Target**: Affinity for zippy (large dataset), allow c1/c2/c3
- **Data**: `/data/services/appdata/clickhouse` (NFS) - **Data**: `/data/services/clickhouse` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/clickhouse` → `/data/services/appdata/clickhouse` - ✏️ Volume path: `/data/compute/appdata/clickhouse` → `/data/services/clickhouse`
- ✏️ Add affinity for zippy (optional, but helps with performance) - ✏️ Add affinity for zippy (optional, but helps with performance)
- **Notes**: Used by plausible. Large time-series data. Important but can be recreated. - **Notes**: Used by plausible. Large time-series data. Important but can be recreated.
@@ -1147,7 +1147,7 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/unifi/mongodb` - **Current**: Uses `/data/compute/appdata/unifi/mongodb`
- **Target**: Float on c1/c2/c3 (with unifi) - **Target**: Float on c1/c2/c3 (with unifi)
- **Data**: `/data/services/appdata/unifi/mongodb` (NFS) - **Data**: `/data/services/unifi/mongodb` (NFS)
- **Changes**: See unifi below - **Changes**: See unifi below
- **Notes**: Only used by unifi. Should stay with unifi controller. - **Notes**: Only used by unifi. Should stay with unifi controller.
@@ -1158,9 +1158,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/sync/wordpress` (syncthing-managed to avoid slow GlusterFS) - **Current**: Uses `/data/sync/wordpress` (syncthing-managed to avoid slow GlusterFS)
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/wordpress` (NFS from zippy) - **Data**: `/data/services/wordpress` (NFS from zippy)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/sync/wordpress` → `/data/services/appdata/wordpress` - ✏️ Volume path: `/data/sync/wordpress` → `/data/services/wordpress`
- 📋 **Before cutover**: Copy data from syncthing to zippy: `rsync -av /data/sync/wordpress/ zippy:/persist/services/wordpress/` - 📋 **Before cutover**: Copy data from syncthing to zippy: `rsync -av /data/sync/wordpress/ zippy:/persist/services/wordpress/`
- 📋 **After migration**: Remove syncthing configuration for wordpress sync - 📋 **After migration**: Remove syncthing configuration for wordpress sync
- **Notes**: Production website. Important but can tolerate brief downtime during migration. - **Notes**: Production website. Important but can tolerate brief downtime during migration.
@@ -1170,9 +1170,9 @@ EOF
- **Priority**: no longer used, should wipe - **Priority**: no longer used, should wipe
- **Current**: Uses `/data/compute/appdata/ghost` - **Current**: Uses `/data/compute/appdata/ghost`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/ghost` (NFS) - **Data**: `/data/services/ghost` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/ghost` → `/data/services/appdata/ghost` - ✏️ Volume path: `/data/compute/appdata/ghost` → `/data/services/ghost`
- **Notes**: Blog platform (alo.land). Can tolerate downtime. - **Notes**: Blog platform (alo.land). Can tolerate downtime.
#### gitea #### gitea
@@ -1180,9 +1180,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/gitea/data`, `/data/compute/appdata/gitea/config` - **Current**: Uses `/data/compute/appdata/gitea/data`, `/data/compute/appdata/gitea/config`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/gitea/*` (NFS) - **Data**: `/data/services/gitea/*` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: `/data/compute/appdata/gitea/*` → `/data/services/appdata/gitea/*` - ✏️ Volume paths: `/data/compute/appdata/gitea/*` → `/data/services/gitea/*`
- **Notes**: Git server. Contains code repositories. Important. - **Notes**: Git server. Contains code repositories. Important.
#### wiki (tiddlywiki) #### wiki (tiddlywiki)
@@ -1190,7 +1190,7 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/wiki` via host volume mount - **Current**: Uses `/data/compute/appdata/wiki` via host volume mount
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/wiki` (NFS) - **Data**: `/data/services/wiki` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume mount path in `volume_mount` blocks - ✏️ Volume mount path in `volume_mount` blocks
- ⚠️ Uses `exec` driver with host volumes - verify NFS mount works with this - ⚠️ Uses `exec` driver with host volumes - verify NFS mount works with this
@@ -1201,9 +1201,9 @@ EOF
- **Priority**: LOW - **Priority**: LOW
- **Current**: Uses `/data/compute/appdata/code` - **Current**: Uses `/data/compute/appdata/code`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/code` (NFS) - **Data**: `/data/services/code` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/code` → `/data/services/appdata/code` - ✏️ Volume path: `/data/compute/appdata/code` → `/data/services/code`
- **Notes**: Web IDE. Low priority, for development only. - **Notes**: Web IDE. Low priority, for development only.
#### beancount (fava) #### beancount (fava)
@@ -1211,9 +1211,9 @@ EOF
- **Priority**: MEDIUM - **Priority**: MEDIUM
- **Current**: Uses `/data/compute/appdata/beancount` - **Current**: Uses `/data/compute/appdata/beancount`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/beancount` (NFS) - **Data**: `/data/services/beancount` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume path: `/data/compute/appdata/beancount` → `/data/services/appdata/beancount` - ✏️ Volume path: `/data/compute/appdata/beancount` → `/data/services/beancount`
- **Notes**: Finance tracking. Low priority. - **Notes**: Finance tracking. Low priority.
#### adminer #### adminer
@@ -1239,9 +1239,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/evcc/evcc.yaml`, `/data/compute/appdata/evcc/evcc` - **Current**: Uses `/data/compute/appdata/evcc/evcc.yaml`, `/data/compute/appdata/evcc/evcc`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/evcc/*` (NFS) - **Data**: `/data/services/evcc/*` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: `/data/compute/appdata/evcc/*` → `/data/services/appdata/evcc/*` - ✏️ Volume paths: `/data/compute/appdata/evcc/*` → `/data/services/evcc/*`
- **Notes**: EV charging controller. Important for daily use. - **Notes**: EV charging controller. Important for daily use.
#### vikunja #### vikunja
@@ -1249,9 +1249,9 @@ EOF
- **Priority**: no longer used, should delete - **Priority**: no longer used, should delete
- **Current**: Likely uses `/data/compute/appdata/vikunja` - **Current**: Likely uses `/data/compute/appdata/vikunja`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/vikunja` (NFS) - **Data**: `/data/services/vikunja` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/vikunja` - ✏️ Volume paths: Update to `/data/services/vikunja`
- **Notes**: Task management. Low priority. - **Notes**: Task management. Low priority.
#### leantime #### leantime
@@ -1259,9 +1259,9 @@ EOF
- **Priority**: no longer used, should delete - **Priority**: no longer used, should delete
- **Current**: Likely uses `/data/compute/appdata/leantime` - **Current**: Likely uses `/data/compute/appdata/leantime`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/leantime` (NFS) - **Data**: `/data/services/leantime` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/leantime` - ✏️ Volume paths: Update to `/data/services/leantime`
- **Notes**: Project management. Low priority. - **Notes**: Project management. Low priority.
### Network Infrastructure ### Network Infrastructure
@@ -1271,9 +1271,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Uses `/data/compute/appdata/unifi/data`, `/data/compute/appdata/unifi/mongodb` - **Current**: Uses `/data/compute/appdata/unifi/data`, `/data/compute/appdata/unifi/mongodb`
- **Target**: Float on c1/c2/c3/fractal/zippy - **Target**: Float on c1/c2/c3/fractal/zippy
- **Data**: `/data/services/appdata/unifi/*` (NFS) - **Data**: `/data/services/unifi/*` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: `/data/compute/appdata/unifi/*` → `/data/services/appdata/unifi/*` - ✏️ Volume paths: `/data/compute/appdata/unifi/*` → `/data/services/unifi/*`
- **Notes**: UniFi network controller. Critical for network management. Has keepalived VIP for stable inform address. Floating is fine. - **Notes**: UniFi network controller. Critical for network management. Has keepalived VIP for stable inform address. Floating is fine.
### Media Stack ### Media Stack
@@ -1284,10 +1284,10 @@ EOF
- **Current**: Uses `/data/compute/appdata/radarr`, `/data/compute/appdata/sonarr`, etc. and `/data/media` - **Current**: Uses `/data/compute/appdata/radarr`, `/data/compute/appdata/sonarr`, etc. and `/data/media`
- **Target**: **MUST run on fractal** (local /data/media access) - **Target**: **MUST run on fractal** (local /data/media access)
- **Data**: - **Data**:
- `/data/services/appdata/radarr` (NFS) - config data - `/data/services/radarr` (NFS) - config data
- `/data/media` (local CIFS mount on fractal, local disk on fractal) - `/data/media` (local CIFS mount on fractal, local disk on fractal)
- **Changes**: - **Changes**:
- ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/appdata/*` - ✏️ Volume paths: `/data/compute/appdata/*` → `/data/services/*`
- ✏️ **Add constraint**: - ✏️ **Add constraint**:
```hcl ```hcl
constraint { constraint {
@@ -1304,9 +1304,9 @@ EOF
- **Priority**: HIGH - **Priority**: HIGH
- **Current**: Likely uses `/data/compute/appdata/weewx` - **Current**: Likely uses `/data/compute/appdata/weewx`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/weewx` (NFS) - **Data**: `/data/services/weewx` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/weewx` - ✏️ Volume paths: Update to `/data/services/weewx`
- **Notes**: Weather station. Low priority. - **Notes**: Weather station. Low priority.
#### maps #### maps
@@ -1314,7 +1314,7 @@ EOF
- **Priority**: MEDIUM - **Priority**: MEDIUM
- **Current**: Likely uses `/data/compute/appdata/maps` - **Current**: Likely uses `/data/compute/appdata/maps`
- **Target**: Float on c1/c2/c3 (or fractal if large tile data) - **Target**: Float on c1/c2/c3 (or fractal if large tile data)
- **Data**: `/data/services/appdata/maps` (NFS) or `/data/media/maps` if large - **Data**: `/data/services/maps` (NFS) or `/data/media/maps` if large
- **Changes**: - **Changes**:
- ✏️ Volume paths: Check data size, may want to move to /data/media - ✏️ Volume paths: Check data size, may want to move to /data/media
- **Notes**: Map tiles. Low priority. - **Notes**: Map tiles. Low priority.
@@ -1324,9 +1324,9 @@ EOF
- **Priority**: LOW - **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/netbox` - **Current**: Likely uses `/data/compute/appdata/netbox`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/netbox` (NFS) - **Data**: `/data/services/netbox` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/netbox` - ✏️ Volume paths: Update to `/data/services/netbox`
- **Notes**: IPAM/DCIM. Low priority, for documentation. - **Notes**: IPAM/DCIM. Low priority, for documentation.
#### farmos #### farmos
@@ -1334,9 +1334,9 @@ EOF
- **Priority**: LOW - **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/farmos` - **Current**: Likely uses `/data/compute/appdata/farmos`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/farmos` (NFS) - **Data**: `/data/services/farmos` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/farmos` - ✏️ Volume paths: Update to `/data/services/farmos`
- **Notes**: Farm management. Low priority. - **Notes**: Farm management. Low priority.
#### urbit #### urbit
@@ -1344,9 +1344,9 @@ EOF
- **Priority**: LOW - **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/urbit` - **Current**: Likely uses `/data/compute/appdata/urbit`
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/urbit` (NFS) - **Data**: `/data/services/urbit` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/urbit` - ✏️ Volume paths: Update to `/data/services/urbit`
- **Notes**: Urbit node. Experimental, low priority. - **Notes**: Urbit node. Experimental, low priority.
#### webodm #### webodm
@@ -1354,9 +1354,9 @@ EOF
- **Priority**: LOW - **Priority**: LOW
- **Current**: Likely uses `/data/compute/appdata/webodm` - **Current**: Likely uses `/data/compute/appdata/webodm`
- **Target**: Float on c1/c2/c3 (or fractal if processing large imagery from /data/media) - **Target**: Float on c1/c2/c3 (or fractal if processing large imagery from /data/media)
- **Data**: `/data/services/appdata/webodm` (NFS) - **Data**: `/data/services/webodm` (NFS)
- **Changes**: - **Changes**:
- ✏️ Volume paths: Update to `/data/services/appdata/webodm` - ✏️ Volume paths: Update to `/data/services/webodm`
- 🤔 May benefit from running on fractal if it processes files from /data/media - 🤔 May benefit from running on fractal if it processes files from /data/media
- **Notes**: Drone imagery processing. Low priority. - **Notes**: Drone imagery processing. Low priority.
@@ -1411,7 +1411,7 @@ EOF
- **Priority**: MEDIUM - **Priority**: MEDIUM
- **Current**: Likely same as wiki.hcl - **Current**: Likely same as wiki.hcl
- **Target**: Float on c1/c2/c3 - **Target**: Float on c1/c2/c3
- **Data**: `/data/services/appdata/tiddlywiki` (NFS) - **Data**: `/data/services/tiddlywiki` (NFS)
- **Changes**: Same as wiki.hcl - **Changes**: Same as wiki.hcl
- **Notes**: May be duplicate of wiki.hcl. - **Notes**: May be duplicate of wiki.hcl.
@@ -1660,7 +1660,7 @@ nomad alloc status <alloc-id>
1. ✅ **Where is `/data/sync/wordpress` mounted from?** 1. ✅ **Where is `/data/sync/wordpress` mounted from?**
- **Answer**: Syncthing-managed to avoid slow GlusterFS - **Answer**: Syncthing-managed to avoid slow GlusterFS
- **Action**: Migrate to `/data/services/appdata/wordpress`, remove syncthing config - **Action**: Migrate to `/data/services/wordpress`, remove syncthing config
2. ✅ **Which services use `/data/media` directly?** 2. ✅ **Which services use `/data/media` directly?**
- **Answer**: Only media.hcl (radarr, sonarr, plex, qbittorrent) - **Answer**: Only media.hcl (radarr, sonarr, plex, qbittorrent)

docs/MIGRATION_TODO.md Normal file
View File

@@ -0,0 +1,153 @@
# Cluster Revamp Migration TODO
Track migration progress from GlusterFS to NFS-based architecture.
See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
## Phase 0: Preparation
- [x] Review cluster revamp plan
- [ ] Backup everything (kopia snapshots current)
- [ ] Document current state (nomad jobs, consul services)
## Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)
- [ ] Document fractal's current ZFS layout
- [ ] Install NixOS on fractal
- [ ] Import ZFS pools (double1, double2, double3)
- [ ] Create fractal NixOS configuration
- [ ] Configure Samba server for media/shared/homes
- [ ] Configure Kopia backup server
- [ ] Deploy and verify fractal base config
- [ ] Join fractal to cluster (5-server quorum)
- [ ] Update all cluster configs for 5-server quorum
- [ ] Verify fractal fully operational
## Phase 2: Setup zippy storage layer
- [x] Create btrfs subvolume `/persist/services` on zippy
- [x] Configure NFS server on zippy (nfs-services-server.nix)
- [x] Configure Consul service registration for NFS
- [x] Setup btrfs replication to c1 (incremental, 5min intervals)
- [x] Fix replication script to handle SSH command restrictions
- [x] Setup standby storage on c1 (`/persist/services-standby`)
- [x] Configure c1 as standby (nfs-services-standby.nix)
- [x] Configure Kopia to exclude replication snapshots
- [x] Deploy and verify NFS server on zippy
- [x] Verify replication working to c1
- [ ] Setup standby storage on c2 (if desired)
- [ ] Configure replication to c2 (if desired)
## Phase 3: Migrate from GlusterFS to NFS
- [x] Update all nodes to mount NFS at `/data/services`
- [x] Deploy updated configs (NFS client on all nodes)
- [ ] Stop all Nomad jobs temporarily
- [ ] Copy data from GlusterFS to zippy NFS (see the rsync sketch after this list)
- [ ] Copy `/data/compute/appdata/*` → `/persist/services/` (per the path simplification note, no `appdata/` level)
- [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
- [ ] Copy `/data/sync/wordpress` → `/persist/services/wordpress`
- [ ] Verify data integrity
- [ ] Verify NFS mounts working on all nodes
- [ ] Stop GlusterFS volume
- [ ] Delete GlusterFS volume
- [ ] Remove GlusterFS from NixOS configs
- [ ] Remove syncthing wordpress sync configuration
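A sketch of the Phase 3 copy step, assuming GlusterFS is still mounted at `/data/compute` and targets follow the path-simplification note (no `appdata/` level):
```bash
# Run with Nomad jobs stopped; -a preserves permissions and ownership
rsync -av /data/compute/appdata/ zippy:/persist/services/
rsync -av /data/compute/config/  zippy:/persist/services/config/
rsync -av /data/sync/wordpress/  zippy:/persist/services/wordpress/
```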
## Phase 4: Update and redeploy Nomad jobs
### Core Infrastructure (CRITICAL)
- [x] mysql.hcl - moved to zippy, using `/data/services`
- [ ] postgres.hcl - update paths, add affinity for zippy
- [ ] redis.hcl - update paths, add affinity for zippy
- [ ] traefik.hcl - update paths (already floating)
- [ ] authentik.hcl - verify (stateless, no changes needed)
### Monitoring Stack (HIGH)
- [ ] prometheus.hcl - update paths
- [ ] grafana.hcl - update paths
- [ ] loki.hcl - update paths
- [ ] vector.hcl - remove glusterfs log collection
### Databases (HIGH)
- [ ] clickhouse.hcl - update paths, add affinity for zippy
- [ ] unifi.hcl - update paths (includes mongodb)
### Web Applications (HIGH-MEDIUM)
- [ ] wordpress.hcl - update from `/data/sync/wordpress` to `/data/services/wordpress`
- [ ] gitea.hcl - update paths
- [ ] wiki.hcl - update paths, verify with exec driver
- [ ] plausible.hcl - verify (stateless)
### Web Applications (LOW, may be deprecated)
- [ ] ghost.hcl - update paths or remove (no longer used?)
- [ ] vikunja.hcl - update paths or remove (no longer used?)
- [ ] leantime.hcl - update paths or remove (no longer used?)
### Network Infrastructure (HIGH)
- [ ] unifi.hcl - update paths (already listed above)
### Media Stack (MEDIUM)
- [ ] media.hcl - update paths, add constraint for fractal
- [ ] radarr, sonarr, bazarr, plex, qbittorrent
### Utility Services (MEDIUM-LOW)
- [ ] evcc.hcl - update paths
- [ ] weewx.hcl - update paths
- [ ] code-server.hcl - update paths
- [ ] beancount.hcl - update paths
- [ ] adminer.hcl - verify (stateless)
- [ ] maps.hcl - update paths
- [ ] netbox.hcl - update paths
- [ ] farmos.hcl - update paths
- [ ] urbit.hcl - update paths
- [ ] webodm.hcl - update paths
- [ ] velutrack.hcl - verify paths
- [ ] resol-gateway.hcl - verify paths
- [ ] igsync.hcl - update paths
- [ ] jupyter.hcl - verify paths
- [ ] whoami.hcl - verify (stateless test service)
- [ ] tiddlywiki.hcl - update paths (if separate from wiki.hcl)
### Backup Jobs (HIGH)
- [x] mysql-backup - moved to zippy, verified
- [ ] postgres-backup.hcl - verify destination
- [ ] wordpress-backup.hcl - verify destination
### Verification
- [ ] All services healthy in Nomad
- [ ] All services registered in Consul
- [ ] Traefik routes working
- [ ] Database jobs running on zippy (verify via nomad alloc status)
- [ ] Media jobs running on fractal (verify via nomad alloc status)
## Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)
- [ ] Document current sunny setup (ethereum containers/VMs)
- [ ] Backup ethereum data
- [ ] Install NixOS on sunny
- [ ] Restore ethereum data to `/persist/ethereum`
- [ ] Create sunny container-based config (besu, lighthouse, rocketpool)
- [ ] Deploy and verify ethereum stack
- [ ] Monitor sync status and validation
## Phase 6: Verification and cleanup
- [ ] Test NFS failover procedure (zippy → c1)
- [ ] Verify backups include `/persist/services` data
- [ ] Verify backups exclude replication snapshots
- [ ] Update documentation (README.md, architecture diagrams)
- [ ] Clean up old GlusterFS data (only after everything verified!)
- [ ] Remove old glusterfs directories from all nodes
## Post-Migration Checklist
- [ ] All 5 servers in quorum (consul members)
- [ ] NFS mounts working on all nodes
- [ ] Btrfs replication running (check systemd timers on zippy)
- [ ] Critical services up (mysql, postgres, redis, traefik, authentik)
- [ ] Monitoring working (prometheus, grafana, loki)
- [ ] Media stack on fractal
- [ ] Database jobs on zippy
- [ ] Consul DNS working (dig @localhost -p 8600 data-services.service.consul)
- [ ] Backups running (kopia snapshots include /persist/services)
- [ ] GlusterFS removed (no processes, volumes deleted)
- [ ] Documentation updated
---
**Last updated**: 2025-10-22
**Current phase**: Phase 2 complete (zippy storage setup done), ready for Phase 3 (GlusterFS → NFS migration)
**Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete

docs/NFS_FAILOVER.md Normal file
View File

@@ -0,0 +1,438 @@
# NFS Services Failover Procedures
This document describes how to fail over the `/data/services` NFS server between hosts and how to fail back.
## Architecture Overview
- **Primary NFS Server**: Typically `zippy`
- Exports `/persist/services` via NFS
- Has local bind mount: `/data/services``/persist/services` (same path as clients)
- Registers `data-services.service.consul` in Consul
- Sets Nomad node meta: `storage_role = "primary"`
- Replicates snapshots to standbys every 5 minutes via btrfs send
- **Safety check**: Refuses to start if another NFS server is already active in Consul
- **Standby**: Typically `c1`
- Receives snapshots at `/persist/services-standby/services@<timestamp>`
- Can be promoted to NFS server during failover
- No special Nomad node meta (not primary)
- **Clients**: All cluster nodes (c1, c2, c3, zippy)
- Mount `/data/services` from `data-services.service.consul:/persist/services`
- Automatically connect to whoever is registered in Consul
### Nomad Job Constraints
Jobs that need to run on the primary storage node should use:
```hcl
constraint {
attribute = "${meta.storage_role}"
value = "primary"
}
```
This is useful for:
- Database jobs (mysql, postgres, redis) that benefit from local storage
- Jobs that need guaranteed fast disk I/O
During failover, the `storage_role = "primary"` meta attribute moves to the new NFS server, and Nomad automatically reschedules constrained jobs to the new primary.
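To check which node currently carries the attribute (a sketch; verbose node status prints client metadata):
```bash
nomad node status                                    # list node IDs
nomad node status -verbose <node-id> | grep storage_role
```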
## Prerequisites
- Standby has been receiving snapshots (check: `ls /persist/services-standby/services@*`)
- Last successful replication was recent (< 5-10 minutes)
---
## Failover: Promoting Standby to Primary
**Scenario**: `zippy` is down and you need to promote `c1` to be the NFS server.
### Step 1: Choose Latest Snapshot
On the standby (c1):
```bash
ssh c1
sudo ls -lt /persist/services-standby/services@* | head -5
```
Find the most recent snapshot. Note the timestamp to estimate data loss (typically < 5 minutes).
### Step 2: Promote Snapshot to Read-Write Subvolume
On c1:
```bash
# Find the latest snapshot
LATEST=$(sudo ls -t /persist/services-standby/services@* | head -1)
# Create writable subvolume from snapshot
sudo btrfs subvolume snapshot "$LATEST" /persist/services
# Verify
ls -la /persist/services
```
### Step 3: Update NixOS Configuration
Edit your configuration to swap the NFS server role:
**In `hosts/c1/default.nix`**:
```nix
imports = [
# ... existing imports ...
# ../../common/nfs-services-standby.nix # REMOVE THIS
../../common/nfs-services-server.nix # ADD THIS
];
# Add standbys if desired (optional - can leave empty during emergency)
nfsServicesServer.standbys = []; # Or ["c2"] to add a new standby
```
**Optional: Prepare zippy config for when it comes back**:
In `hosts/zippy/default.nix` (can do this later too):
```nix
imports = [
# ... existing imports ...
# ../../common/nfs-services-server.nix # REMOVE THIS
../../common/nfs-services-standby.nix # ADD THIS
];
# Add the replication key from c1 (get it from c1:/persist/root/.ssh/btrfs-replication.pub)
nfsServicesStandby.replicationKeys = [
"ssh-ed25519 AAAA... root@c1-replication"
];
```
### Step 4: Deploy Configuration
```bash
# From your workstation
deploy -s '.#c1'
# If zippy is still down, updating its config will fail, but that's okay
# You can update it later when it comes back
```
### Step 5: Verify NFS Server is Running
On c1:
```bash
sudo systemctl status nfs-server
sudo showmount -e localhost
dig @localhost -p 8600 data-services.service.consul # Should show c1's IP
```
### Step 6: Verify Clients Can Access
From any node:
```bash
df -h | grep services
ls /data/services
```
The mount should automatically reconnect via Consul DNS.
### Step 7: Check Nomad Jobs
```bash
nomad job status mysql
nomad job status postgres
# Verify critical services are healthy
# Jobs constrained to ${meta.storage_role} = "primary" will automatically
# reschedule to c1 once it's deployed with the NFS server module
```
**Recovery Time Objective (RTO)**: ~10-15 minutes
**Recovery Point Objective (RPO)**: Last replication interval (5 minutes max)
**Note**: Jobs with the `storage_role = "primary"` constraint will automatically move to c1 because it now has that node meta attribute. No job spec changes needed!
---
## What Happens When zippy Comes Back?
**IMPORTANT**: If zippy reboots while still configured as NFS server, it will **refuse to start** the NFS service because it detects c1 is already active in Consul.
You'll see this error in `journalctl -u nfs-server`:
```
ERROR: Another NFS server is already active at 192.168.1.X
This host (192.168.1.2) is configured as NFS server but should be standby.
To fix:
1. If this is intentional (failback), first demote the other server
2. Update this host's config to use nfs-services-standby.nix instead
3. Sync data from active server before promoting this host
```
This is a **safety feature** to prevent split-brain and data corruption.
### Options when zippy comes back:
**Option A: Keep c1 as primary** (zippy becomes standby)
1. Update zippy's config to use `nfs-services-standby.nix`
2. Deploy to zippy
3. c1 will start replicating to zippy
**Option B: Fail back to zippy as primary**
Follow the "Failing Back to Original Primary" procedure below.
---
## Failing Back to Original Primary
**Scenario**: `zippy` is repaired and you want to move the NFS server role back from `c1` to `zippy`.
### Step 1: Sync Latest Data from c1 to zippy
On c1 (current primary):
```bash
# Create readonly snapshot of current state
sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)
# Find the snapshot
FAILBACK=$(sudo ls -t /persist/services@failback-* | head -1)
# Send to zippy (use root SSH key if available, or generate temporary key)
sudo btrfs send "$FAILBACK" | ssh root@zippy "btrfs receive /persist/"
```
On zippy:
```bash
# Verify snapshot arrived
ls -la /persist/services@failback-*
# Create writable subvolume from the snapshot
FAILBACK=$(ls -t /persist/services@failback-* | head -1)
sudo btrfs subvolume snapshot "$FAILBACK" /persist/services
# Verify
ls -la /persist/services
```
### Step 2: Update NixOS Configuration
Swap the roles back:
**In `hosts/zippy/default.nix`**:
```nix
imports = [
# ... existing imports ...
# ../../common/nfs-services-standby.nix # REMOVE THIS
../../common/nfs-services-server.nix # ADD THIS
];
nfsServicesServer.standbys = ["c1"];
```
**In `hosts/c1/default.nix`**:
```nix
imports = [
# ... existing imports ...
# ../../common/nfs-services-server.nix # REMOVE THIS
../../common/nfs-services-standby.nix # ADD THIS
];
nfsServicesStandby.replicationKeys = [
"ssh-ed25519 AAAA... root@zippy-replication" # Get from zippy:/persist/root/.ssh/btrfs-replication.pub
];
```
### Step 3: Deploy Configurations
```bash
# IMPORTANT: Deploy c1 FIRST to demote it
deploy -s '.#c1'
# Wait for c1 to stop NFS server
ssh c1 sudo systemctl status nfs-server # Should be inactive
# Then deploy zippy to promote it
deploy -s '.#zippy'
```
The order matters! If you deploy zippy first, it will see c1 is still active and refuse to start.
### Step 4: Verify Failback
Check Consul DNS points to zippy:
```bash
dig @c1 -p 8600 data-services.service.consul # Should show zippy's IP
```
Check clients are mounting from zippy:
```bash
for host in c1 c2 c3; do
ssh $host "df -h | grep services"
done
```
### Step 5: Clean Up Temporary Snapshots
On c1:
```bash
# Remove the failback snapshot and the promoted subvolume
sudo btrfs subvolume delete /persist/services@failback-*
sudo btrfs subvolume delete /persist/services
```
---
## Adding a New Standby
**Scenario**: You want to add `c2` as an additional standby.
### Step 1: Create Standby Subvolume on c2
```bash
ssh c2
sudo btrfs subvolume create /persist/services-standby
```
### Step 2: Update c2 Configuration
**In `hosts/c2/default.nix`**:
```nix
imports = [
# ... existing imports ...
../../common/nfs-services-standby.nix
];
nfsServicesStandby.replicationKeys = [
"ssh-ed25519 AAAA... root@zippy-replication" # Get from current NFS server
];
```
### Step 3: Update NFS Server Configuration
On the current NFS server (e.g., zippy), update the standbys list:
**In `hosts/zippy/default.nix`**:
```nix
nfsServicesServer.standbys = ["c1" "c2"]; # Added c2
```
### Step 4: Deploy
```bash
deploy -s '.#c2'
deploy -s '.#zippy'
```
The next replication cycle (within 5 minutes) will do a full send to c2, then switch to incremental.
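To kick off the first full send immediately instead of waiting for the timer (service name follows the `replicate-services-to-<standby>` pattern from nfs-services-server.nix):
```bash
ssh zippy sudo systemctl start replicate-services-to-c2.service
ssh zippy sudo journalctl -u replicate-services-to-c2.service -f
```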
---
## Troubleshooting
### Replication Failed
Check the replication service logs:
```bash
# On NFS server
sudo journalctl -u replicate-services-to-c1 -f
```
Common issues:
- SSH key not found → Run key generation step (see stateful-commands.txt)
- Permission denied → Check authorized_keys on standby
- Snapshot already exists → Old snapshot with same timestamp, wait for next cycle
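Note that probing the key with a plain SSH command is expected to fail: the `command=` restriction in authorized_keys forces `btrfs receive`, which then complains about an empty stream. That error actually means the key and the restriction are working:
```bash
# On the NFS server: the forced command runs instead of 'true'
sudo ssh -i /persist/root/.ssh/btrfs-replication root@c1 true
```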
### Clients Can't Mount
Check Consul:
```bash
dig @localhost -p 8600 data-services.service.consul
consul catalog services | grep data-services
```
If Consul isn't resolving:
- NFS server might not have registered → Check `sudo systemctl status nfs-server`
- Consul agent might be down → Check `sudo systemctl status consul`
### Mount is Stale
Force remount:
```bash
sudo systemctl restart data-services.mount
```
Or unmount and let automount handle it:
```bash
sudo umount /data/services
ls /data/services # Triggers automount
```
### Split-Brain Prevention: NFS Server Won't Start
If you see:
```
ERROR: Another NFS server is already active at 192.168.1.X
```
This is **intentional** - the safety check is working! You have two options:
1. **Keep the other server as primary**: Update this host's config to be a standby instead
2. **Fail back to this host**: First demote the other server, sync data, then deploy both hosts in correct order
---
## Monitoring
### Check Replication Status
On NFS server:
```bash
# List recent snapshots
ls -lt /persist/services@* | head
# Check last replication run
sudo systemctl status replicate-services-to-c1
# Check replication logs
sudo journalctl -u replicate-services-to-c1 --since "1 hour ago"
```
On standby:
```bash
# List received snapshots
ls -lt /persist/services-standby/services@* | head
# Check how old the latest snapshot is (sort by mtime, newest last)
stat -c '%y %n' /persist/services-standby/services@* | sort | tail -1
```
### Verify NFS Exports
```bash
sudo showmount -e localhost
```
Should show:
```
/persist/services 192.168.1.0/24
```
### Check Consul Registration
```bash
consul catalog services | grep data-services
dig @localhost -p 8600 data-services.service.consul
```

View File

@@ -4,6 +4,11 @@
../../common/encrypted-btrfs-layout.nix ../../common/encrypted-btrfs-layout.nix
../../common/global ../../common/global
../../common/compute-node.nix ../../common/compute-node.nix
../../common/nfs-services-standby.nix # NFS standby for /data/services
# To promote to NFS server (during failover):
# 1. Follow procedure in docs/NFS_FAILOVER.md
# 2. Replace above line with: ../../common/nfs-services-server.nix
# 3. Add nfsServicesServer.standbys = [ "c2" ]; (or leave empty)
./hardware.nix ./hardware.nix
]; ];
@@ -15,4 +20,9 @@
networking.hostName = "c1"; networking.hostName = "c1";
services.tailscaleAutoconnect.authkey = "tskey-auth-k2nQ771YHM11CNTRL-YVpoumL2mgR6nLPG51vNhRpEKMDN7gLAi"; services.tailscaleAutoconnect.authkey = "tskey-auth-k2nQ771YHM11CNTRL-YVpoumL2mgR6nLPG51vNhRpEKMDN7gLAi";
# NFS standby configuration: accept replication from zippy
nfsServicesStandby.replicationKeys = [
"ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIHyTKsMCbwCIlMcC/aopgz5Yfx/Q9QdlWC9jzMLgYFAV root@zippy-replication"
];
} }

View File

@@ -5,6 +5,11 @@
../../common/global ../../common/global
../../common/compute-node.nix ../../common/compute-node.nix
# ../../common/ethereum.nix # ../../common/ethereum.nix
../../common/nfs-services-server.nix # NFS server for /data/services
# To move NFS server role to another host:
# 1. Follow procedure in docs/NFS_FAILOVER.md
# 2. Replace above line with: ../../common/nfs-services-standby.nix
# 3. Add nfsServicesStandby.replicationKeys with the new server's public key
./hardware.nix ./hardware.nix
]; ];
@@ -16,4 +21,7 @@
networking.hostName = "zippy"; networking.hostName = "zippy";
services.tailscaleAutoconnect.authkey = "tskey-auth-ktKyQ59f2p11CNTRL-ut8E71dLWPXsVtb92hevNX9RTjmk4owBf"; services.tailscaleAutoconnect.authkey = "tskey-auth-ktKyQ59f2p11CNTRL-ut8E71dLWPXsVtb92hevNX9RTjmk4owBf";
# NFS server configuration: replicate to c1 as standby
nfsServicesServer.standbys = [ "c1" ];
} }

View File

@@ -114,5 +114,5 @@ variable "secret_key" {
variable "authentik_version" { variable "authentik_version" {
type = string type = string
default = "2025.4" default = "2025.6"
} }

View File

@@ -6,6 +6,13 @@ job "clickhouse" {
} }
group "db" { group "db" {
# Run on primary storage node (zippy) for local disk performance
# TODO: move to fractal once it's converted to NixOS (spinning disks OK for time-series data)
constraint {
attribute = "${meta.storage_role}"
value = "primary"
}
network { network {
port "clickhouse" { port "clickhouse" {
static = 8123 static = 8123
@@ -18,7 +25,7 @@ job "clickhouse" {
config { config {
image = "clickhouse/clickhouse-server:25.9" image = "clickhouse/clickhouse-server:25.9"
volumes = [ volumes = [
"/data/compute/appdata/clickhouse:/var/lib/clickhouse", "/data/services/clickhouse:/var/lib/clickhouse",
"local/clickhouse-config.xml:/etc/clickhouse-server/config.d/logging.xml:ro", "local/clickhouse-config.xml:/etc/clickhouse-server/config.d/logging.xml:ro",
"local/clickhouse-user-config.xml:/etc/clickhouse-server/users.d/logging.xml:ro", "local/clickhouse-user-config.xml:/etc/clickhouse-server/users.d/logging.xml:ro",
] ]

View File

@@ -1,51 +0,0 @@
job "leantime" {
datacenters = ["alo"]
group "web" {
network {
port "http" {
to = 80
}
}
task "server" {
driver = "docker"
config {
image = "leantime/leantime:latest"
ports = ["http"]
volumes = [
"/data/compute/appdata/leantime:/var/www/html/userfiles",
]
}
env {
LEAN_DEFAULT_TIMEZONE = "Europe/Lisbon"
LEAN_DB_HOST = "mysql.service.consul"
LEAN_DB_USER = "leantime"
LEAN_DB_PASSWORD = "Xuphaedoo9kuaseeQuei"
LEAN_DB_DATABASE = "leantime"
LEAN_EMAIL_RETURN = "leantime@paler.net"
LEAN_APP_URL = "https://leantime.v.paler.net"
LEAN_EMAIL_SMTP_HOSTS = "192.168.1.1"
LEAN_EMAIL_SMTP_AUTH = "false"
LEAN_OIDC_ENABLE = "true"
LEAN_OIDC_CREATE_USER = "true"
LEAN_OIDC_PROVIDER_URL = "https://authentik.v.paler.net/application/o/leantime/"
LEAN_OIDC_CLIENT_ID = "nWqJu9g4avhdpmUzqqvjsExCA1Jrick7GSMd0D6u"
LEAN_OIDC_CLIENT_SECRET = "VvPQi5q3kkVTCwN8QWwwPTCqjWc9VbRanCFxa0zB2mhr1ZPxUYXP7Ygg6naMInE4P5vyqJd5w8XiWkuecW14G4KxgXpFtWChKnCOOpe47gjZGNbkYIEDZUmkUB99Saxx"
}
service {
name = "leantime"
port = "http"
tags = [
"traefik.enable=true",
"traefik.http.routers.leantime.entryPoints=websecure",
"traefik.http.routers.leantime.middlewares=authentik@file",
]
}
}
}
}

View File

@@ -10,6 +10,14 @@ job "loki" {
} }
group "loki" { group "loki" {
count = 1 count = 1
# Run on primary storage node (zippy) for local disk performance
# TODO: move to fractal once it's converted to NixOS (spinning disks OK for log data)
constraint {
attribute = "${meta.storage_role}"
value = "primary"
}
restart { restart {
attempts = 3 attempts = 3
interval = "5m" interval = "5m"
@@ -31,7 +39,7 @@ job "loki" {
"local/loki/local-config.yaml", "local/loki/local-config.yaml",
] ]
ports = ["loki"] ports = ["loki"]
volumes = ["/data/compute/appdata/loki:/loki"] volumes = ["/data/services/loki:/loki"]
} }
template { template {
data = <<EOH data = <<EOH

@@ -3,11 +3,17 @@ job "mysql-backup" {
   type = "batch"
   periodic {
-    cron             = "23 23 * * * *"
+    crons            = ["23 23 * * * *"]
     prohibit_overlap = true
   }
   group "db" {
+
+    # Run on primary storage node for fast local disk access
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     task "backup" {
       driver = "raw_exec"
@@ -21,7 +27,7 @@ job "mysql-backup" {
       data = <<EOH
 set -e
 /run/current-system/sw/bin/nomad alloc exec -job -task=mysqld mysql \
-  mysqldump -u root --password="$MYSQL_ROOT_PASS" --all-databases > /data/compute/appdata/db-backups/mysql/backup.sql && \
+  mysqldump -u root --password="$MYSQL_ROOT_PASS" --all-databases > /data/services/db-backups/mysql/backup.sql && \
 echo "last_success $(date +%s)" | \
 /run/current-system/sw/bin/curl --data-binary @- http://pushgateway.service.consul:9091/metrics/job/mysql_backup
 EOH
@@ -6,6 +6,12 @@ job "mysql" {
   }
   group "db" {
+
+    # Run on primary storage node (zippy) for local disk performance
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     network {
       port "db" {
         static = 3306
@@ -19,13 +25,9 @@ job "mysql" {
       config {
         image = "mysql:9.4"
-        args = [
-          # 300M, up from default of 100M
-          "--innodb-redo-log-capacity=314572800",
-        ]
         ports = ["db"]
         volumes = [
-          "/data/compute/appdata/mysql:/var/lib/mysql",
+          "/data/services/mysql:/var/lib/mysql",
         ]
       }
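Dropping the `args` block above means mysql reverts to the stock `innodb_redo_log_capacity` default. As a hedged check, the running value could be queried by reusing the same `nomad alloc exec` invocation the backup job uses (job and task names taken from the mysql diff above; assumes `$MYSQL_ROOT_PASS` is exported in the calling shell):

```
# Query the live redo-log setting inside the mysqld task.
/run/current-system/sw/bin/nomad alloc exec -job -task=mysqld mysql \
  mysql -u root --password="$MYSQL_ROOT_PASS" \
  -e "SHOW VARIABLES LIKE 'innodb_redo_log_capacity';"
```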
@@ -3,11 +3,17 @@ job "postgres-backup" {
   type = "batch"
   periodic {
-    cron             = "22 22 * * * *"
+    crons            = ["22 22 * * * *"]
     prohibit_overlap = true
   }
   group "db" {
+
+    # Run on primary storage node (zippy) where postgres runs
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     task "backup" {
       driver = "raw_exec"
@@ -21,7 +27,7 @@ job "postgres-backup" {
       data = <<EOH
 set -e
 /run/current-system/sw/bin/nomad alloc exec -job -task=postgres postgres \
-  pg_dumpall -U postgres > /data/compute/appdata/db-backups/postgresql/backup.sql && \
+  pg_dumpall -U postgres > /data/services/db-backups/postgresql/backup.sql && \
 echo "last_success $(date +%s)" | \
 /run/current-system/sw/bin/curl --data-binary @- http://pushgateway.service.consul:9091/metrics/job/postgres_backup
 EOH
@@ -7,6 +7,12 @@ job "postgres" {
   group "db" {
+
+    # Run on primary storage node (zippy) for local disk performance
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     network {
       port "db" {
         static = 5432
@@ -23,7 +29,7 @@ job "postgres" {
       config {
         image = "postgis/postgis:15-3.4-alpine"
         ports = ["db"]
-        volumes = [ "/data/compute/appdata/postgres:/var/lib/postgresql/data" ]
+        volumes = [ "/data/services/postgres:/var/lib/postgresql/data" ]
       }
       env {
@@ -72,7 +78,7 @@ job "postgres" {
       config {
         image = "dpage/pgadmin4:latest"
         ports = ["admin"]
-        volumes = [ "/data/compute/appdata/pgadmin:/var/lib/pgadmin" ]
+        volumes = [ "/data/services/pgadmin:/var/lib/pgadmin" ]
       }
       env {
@@ -10,6 +10,13 @@ job "prometheus" {
   group "monitoring" {
     count = 1
+
+    # Run on primary storage node (zippy) for local disk performance
+    # TODO: move to fractal once it's converted to NixOS (spinning disks OK for time-series data)
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     network {
       port "http" {
         #host_network = "tailscale"
@@ -37,7 +44,7 @@ job "prometheus" {
        volumes = [
          "local/alerts.yml:/prometheus/alerts.yml",
          "local/prometheus.yml:/prometheus/prometheus.yml",
-          "/data/compute/appdata/prometheus:/opt/prometheus",
+          "/data/services/prometheus:/opt/prometheus",
        ]
      }
@@ -6,6 +6,12 @@ job "redis" {
   }
   group "db" {
+
+    # Run on primary storage node (zippy) for local disk performance
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     network {
       port "redis" {
         static = 6379
@@ -21,7 +27,7 @@ job "redis" {
       config {
         image = "redis:alpine"
         ports = ["redis"]
-        volumes = [ "/data/compute/appdata/redis:/data" ]
+        volumes = [ "/data/services/redis:/data" ]
       }
       service {
@@ -6,6 +6,14 @@ job "unifi" {
   }
   group "net" {
+
+    # Run on primary storage node (zippy) for local disk performance
+    # MongoDB needs local disk, not NFS
+    # TODO: can move to fractal once it's converted to NixOS
+    constraint {
+      attribute = "${meta.storage_role}"
+      value     = "primary"
+    }
     network {
       port "p8443" { static = 8443 }
       port "p3478" { static = 3478 }
@@ -38,7 +46,7 @@ job "unifi" {
        "p5514",
      ]
      volumes = [
-        "/data/compute/appdata/unifi/data:/config",
+        "/data/services/unifi/data:/config",
      ]
    }
@@ -105,8 +113,8 @@ job "unifi" {
      image = "mongo:8.0"
      ports = ["mongodb"]
      volumes = [
-        "/data/compute/appdata/unifi/mongodb:/data/db",
-        "/data/compute/appdata/unifi/init-mongo.sh:/docker-entrypoint-initdb.d/init-mongo.sh:ro"
+        "/data/services/unifi/mongodb:/data/db",
+        "/data/services/unifi/init-mongo.sh:/docker-entrypoint-initdb.d/init-mongo.sh:ro"
      ]
    }
@@ -3,7 +3,7 @@ job "wordpress-backup" {
   type = "batch"
   periodic {
-    cron             = "*/5 * * * * *"
+    crons            = ["*/5 * * * * *"]
     prohibit_overlap = true
   }
@@ -39,3 +39,22 @@ kopia repository server setup (on a non-NixOS host at the time):
 * kopia server start --address 0.0.0.0:51515 --tls-cert-file ~/kopia-certs/kopia.cert --tls-key-file ~/kopia-certs/kopia.key --tls-generate-cert (first time)
 * kopia server start --address 0.0.0.0:51515 --tls-cert-file ~/kopia-certs/kopia.cert --tls-key-file ~/kopia-certs/kopia.key (subsequent)
 [TLS is mandatory for this]
+
+NFS services server setup (one-time on the NFS server host, e.g. zippy):
+* sudo btrfs subvolume create /persist/services
+* sudo mkdir -p /persist/root/.ssh
+* sudo ssh-keygen -t ed25519 -f /persist/root/.ssh/btrfs-replication -N "" -C "root@$(hostname)-replication"
+* Get the public key: sudo cat /persist/root/.ssh/btrfs-replication.pub
+  Then add this public key to each standby's nfsServicesStandby.replicationKeys option.
+
+NFS services standby setup (one-time on each standby host, e.g. c1):
+* sudo btrfs subvolume create /persist/services-standby
+
+Moving the NFS server role between hosts (e.g. from zippy to c1):
+See docs/NFS_FAILOVER.md for the detailed procedure.
+Summary:
+1. On the current primary: create a final snapshot and send it to the new primary
+2. On the new primary: promote that snapshot to /persist/services
+3. Update configs: remove nfs-services-server.nix from the old primary and add it to the new primary
+4. Update configs: add nfs-services-standby.nix to the old primary (with replication keys)
+5. Deploy the old primary first (to demote), then the new primary (to promote)
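For orientation, here is a minimal sketch of what one incremental replication cycle between these subvolumes amounts to. The snapshot names are hypothetical and the real logic lives in nfs-services-server.nix; this only illustrates the btrfs send/receive mechanism the docs describe:

```
# One incremental cycle, primary -> standby (illustrative only).
# Assumes a prior snapshot services@prev exists on both sides; a full
# send (no -p) is the fallback when it does not.
sudo btrfs subvolume snapshot -r /persist/services /persist/services@new
sudo btrfs send -p /persist/services@prev /persist/services@new \
  | ssh -i /persist/root/.ssh/btrfs-replication root@c1 \
      "btrfs receive /persist/services-standby"
```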
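And a sketch of steps 1 and 2 of the failover summary (final sync, then promotion), again with hypothetical snapshot names; docs/NFS_FAILOVER.md remains the authoritative procedure:

```
# Step 1, on the old primary (zippy): take a final read-only snapshot
# and send it incrementally against the last replicated one.
sudo btrfs subvolume snapshot -r /persist/services /persist/services@final
sudo btrfs send -p /persist/services@prev /persist/services@final \
  | ssh -i /persist/root/.ssh/btrfs-replication root@c1 \
      "btrfs receive /persist/services-standby"

# Step 2, on the new primary (c1): snapshotting without -r promotes the
# received read-only snapshot to a writable /persist/services.
sudo btrfs subvolume snapshot /persist/services-standby/services@final /persist/services
```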