Compare commits
6 Commits
b608e110c9...dd8fee0ecb
| Author | SHA1 | Date |
|---|---|---|
| | dd8fee0ecb | |
| | a2b54be875 | |
| | ccf6154ba0 | |
| | bd5988dfbc | |
| | a57fc9107b | |
| | a7dce7cfb9 | |
.gitignore (vendored, 1 line changed)

@@ -2,3 +2,4 @@
.tmp
result
.aider*
.claude

CLAUDE.md (19 lines changed)

@@ -30,10 +30,7 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons
└── services/ # Nomad job specs (.hcl files)
```

## Current Architecture (transitioning)

**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
**NEW**: NFS from zippy at `/data/services` (current target)
## Current Architecture

### Storage Mounts
- `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)

@@ -86,26 +83,18 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons

## Migration Status

**Phase**: 4 in progress (20/35 services migrated)
**Current**: Migrating services from GlusterFS → NFS
**Next**: Finish migrating remaining services, update host volumes, remove GlusterFS
**Later**: Convert fractal to NixOS (deferred)
**Phase 3 & 4**: COMPLETE! GlusterFS removed, all services on NFS
**Next**: Convert fractal to NixOS (deferred)

See `docs/MIGRATION_TODO.md` for detailed checklist.

**IMPORTANT**: When working on migration tasks:
1. Always update `docs/MIGRATION_TODO.md` after completing each service migration
2. Update both the individual service checklist AND the summary counts at the bottom
3. Pattern: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
4. Migration workflow per service: stop → copy data → edit config → start → update MIGRATION_TODO.md

## Common Tasks

**Deploy a host**: `deploy -s '.#hostname'`
**Deploy all**: `deploy`
**Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
**NFS failover**: See `docs/NFS_FAILOVER.md`
**Nomad jobs**: `services/*.hcl` - update paths: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
**Nomad jobs**: `services/*.hcl` - service data stored at `/data/services/<service-name>`

## Troubleshooting Hints

@@ -8,7 +8,10 @@
./unattended-encryption.nix
./cifs-client.nix
./consul.nix
./glusterfs-client.nix # Keep during migration, will be removed in Phase 3
./nfs-services-client.nix # New: NFS client for /data/services
];

# Wait for eno1 to be routable before considering network online
# (hosts with different primary interfaces should override this)
systemd.network.wait-online.extraArgs = [ "--interface=eno1:routable" ];
}

@@ -1,13 +0,0 @@
{ pkgs, ... }:
{
environment.systemPackages = [ pkgs.glusterfs ];

fileSystems."/data/compute" = {
device = "192.168.1.71:/compute";
fsType = "glusterfs";
options = [
"backup-volfile-servers=192.168.1.72:192.168.1.73"
"_netdev"
];
};
}

@@ -1,24 +0,0 @@
{
pkgs,
config,
lib,
...
}:
{
services.glusterfs = {
enable = true;
};

environment.persistence."/persist".directories = [ "/var/lib/glusterd" ];

# TODO: each volume needs its own port starting at 49152
networking.firewall.allowedTCPPorts = [
24007
24008
24009
49152
49153
49154
49155
];
}

@@ -9,12 +9,17 @@
# The mount is established at boot time and persists - no auto-unmount.
# This prevents issues with Docker bind mounts seeing empty automount stubs.

imports = [
./wait-for-dns-ready.nix
];

fileSystems."/data/services" = {
device = "data-services.service.consul:/persist/services";
fsType = "nfs";
options = [
"nofail" # Don't block boot if mount fails
"x-systemd.mount-timeout=30s" # Timeout for mount attempts
"x-systemd.after=wait-for-dns-ready.service" # Wait for DNS to actually work
"_netdev" # Network filesystem (wait for network)
];
};
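One way to spot-check the resulting mount on a node after deployment (a sketch; the mount path and Consul name come from the hunk above):

```bash
# Confirm /data/services is an NFS mount with the expected source and options.
findmnt /data/services -o SOURCE,FSTYPE,OPTIONS

# Confirm the Consul DNS name the mount depends on resolves on this node.
getent hosts data-services.service.consul
```
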
@@ -39,15 +39,15 @@ in
noCheck = true;
};

# Cleanup old snapshots on standby (keep last 48 hours for safety)
# Cleanup old snapshots on standby (keep last 4 hours for HA failover)
systemd.services.cleanup-services-standby-snapshots = {
description = "Cleanup old btrfs snapshots in services-standby";
path = [ pkgs.btrfs-progs pkgs.findutils ];

script = ''
set -euo pipefail
# Keep last 48 hours of snapshots (576 snapshots at 5min intervals)
find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +2880 -exec btrfs subvolume delete {} \; || true
# Keep last 4 hours of snapshots (48 snapshots at 5min intervals)
find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +240 -exec btrfs subvolume delete {} \; || true
'';

serviceConfig = {
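A quick sanity check of the tightened retention on the standby host (a sketch; the path and snapshot naming come from the script above, and a 4-hour window at 5-minute intervals should leave roughly 48 snapshots):

```bash
# Count the standby snapshots; expect on the order of 48 under the 4h policy.
ls -1d /persist/services-standby/services@* | wc -l

# List the newest few to confirm replication is still running.
ls -1dt /persist/services-standby/services@* | head -n 5
```
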
common/wait-for-dns-ready.nix (new file, 55 lines)

@@ -0,0 +1,55 @@
{ pkgs, ... }:
{
# Service to wait for DNS resolution to be actually functional
# This is needed because network-online.target and wait-online.service
# don't guarantee DNS works - they only check that interfaces are configured.
#
# Problem: NFS mounts using Consul DNS names (data-services.service.consul)
# fail at boot because DNS resolution isn't ready even though network is "online"
#
# Solution: Actively test DNS resolution before considering network truly ready

systemd.services.wait-for-dns-ready = {
description = "Wait for DNS resolution to be functional";
after = [
"systemd-networkd-wait-online.service"
"systemd-resolved.service"
"network-online.target"
];
wants = [ "network-online.target" ];
wantedBy = [ "multi-user.target" ];

serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = pkgs.writeShellScript "wait-for-dns-ready" ''
# Test DNS resolution by attempting to resolve data-services.service.consul
# This ensures the full DNS path works: interface → gateway → Consul DNS

echo "Waiting for DNS resolution to be ready..."

for i in {1..30}; do
# Use getent which respects /etc/nsswitch.conf and systemd-resolved
if ${pkgs.glibc.bin}/bin/getent hosts data-services.service.consul >/dev/null 2>&1; then
echo "DNS ready: data-services.service.consul resolved successfully"
exit 0
fi

# Also test a public DNS name to distinguish between general DNS failure
# vs Consul-specific issues (helpful for debugging)
if ! ${pkgs.glibc.bin}/bin/getent hosts www.google.com >/dev/null 2>&1; then
echo "Attempt $i/30: General DNS not working yet, waiting..."
else
echo "Attempt $i/30: General DNS works but Consul DNS not ready yet, waiting..."
fi

sleep 1
done

echo "Warning: DNS not fully ready after 30 seconds"
echo "NFS mounts with 'nofail' option will handle this gracefully"
exit 0 # Don't block boot - let nofail mounts handle DNS failures
'';
};
};
}
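To confirm on a rebooted node that the DNS wait actually gated the NFS mount, one possible check (a sketch; the unit names come from the configuration above, and `/data/services` corresponds to the systemd mount unit `data-services.mount`):

```bash
# Show whether the DNS wait ran and succeeded on this boot.
systemctl status wait-for-dns-ready.service

# Interleave its log with the NFS mount unit to verify the ordering held.
journalctl -b -u wait-for-dns-ready.service -u data-services.mount --no-pager
```
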
@@ -37,17 +37,17 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
## Phase 3: Migrate from GlusterFS to NFS
- [x] Update all nodes to mount NFS at `/data/services`
- [x] Deploy updated configs (NFS client on all nodes)
- [ ] Stop all Nomad jobs temporarily
- [ ] Copy data from GlusterFS to zippy NFS
- [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
- [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
- [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
- [ ] Verify data integrity
- [ ] Verify NFS mounts working on all nodes
- [ ] Stop GlusterFS volume
- [ ] Delete GlusterFS volume
- [ ] Remove GlusterFS from NixOS configs
- [ ] Remove syncthing wordpress sync configuration
- [x] Stop all Nomad jobs temporarily
- [x] Copy data from GlusterFS to zippy NFS
- [x] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
- [x] Copy `/data/compute/config/*` → `/persist/services/config/`
- [x] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
- [x] Verify data integrity
- [x] Verify NFS mounts working on all nodes
- [x] Stop GlusterFS volume
- [x] Delete GlusterFS volume
- [x] Remove GlusterFS from NixOS configs
- [x] Remove syncthing wordpress sync configuration (no longer used)

## Phase 4: Update and redeploy Nomad jobs

@@ -125,8 +125,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
- [ ] Verify backups include `/persist/services` data
- [ ] Verify backups exclude replication snapshots
- [ ] Update documentation (README.md, architecture diagrams)
- [ ] Clean up old GlusterFS data (only after everything verified!)
- [ ] Remove old glusterfs directories from all nodes
- [x] Clean up old GlusterFS data (only after everything verified!)
- [x] Remove old glusterfs directories from all nodes

## Post-Migration Checklist
- [ ] All 5 servers in quorum (consul members)
@@ -143,8 +143,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.

---

**Last updated**: 2025-10-23 22:30
**Current phase**: Phase 4 complete! All services migrated to NFS
**Last updated**: 2025-10-25
**Current phase**: Phase 3 & 4 complete! GlusterFS removed, all services on NFS
**Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete

## Migration Summary

flake.nix (17 lines changed)

@@ -66,13 +66,8 @@
mkHost =
system: profile: modules:
let
# Auto-import profile-specific module based on profile parameter
profileModule =
if profile == "server" then ./common/server-node.nix
else if profile == "workstation" then ./common/workstation-node.nix
else if profile == "desktop" then ./common/desktop-node.nix
else if profile == "cloud" then ./common/cloud-node.nix
else null;
# Profile parameter is only used by home-manager for user environment
# NixOS system configuration is handled via explicit imports in host configs
in
nixpkgs.lib.nixosSystem {
system = system;
@@ -105,7 +100,7 @@
};
};
}
] ++ nixpkgs.lib.optional (profileModule != null) profileModule ++ modules;
] ++ modules;
specialArgs = {
inherit inputs self;
};
@@ -136,9 +131,9 @@
in
{
nixosConfigurations = {
c1 = mkHost "x86_64-linux" "server" [ ./hosts/c1 ];
c2 = mkHost "x86_64-linux" "server" [ ./hosts/c2 ];
c3 = mkHost "x86_64-linux" "server" [ ./hosts/c3 ];
c1 = mkHost "x86_64-linux" "minimal" [ ./hosts/c1 ];
c2 = mkHost "x86_64-linux" "minimal" [ ./hosts/c2 ];
c3 = mkHost "x86_64-linux" "minimal" [ ./hosts/c3 ];
alo-cloud-1 = mkHost "aarch64-linux" "cloud" [ ./hosts/alo-cloud-1 ];
zippy = mkHost "x86_64-linux" "minimal" [
ethereum-nix.nixosModules.default
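With the profile auto-import removed, a quick evaluation pass over the hosts visible in this hunk helps confirm every configuration still evaluates before deploying (a sketch using the standard nix CLI; only the host names shown above are listed):

```bash
# Evaluate each host's system closure without building it; a host config that
# still relied on an auto-imported profile module surfaces here as an eval error.
for h in c1 c2 c3 alo-cloud-1 zippy; do
  nix eval ".#nixosConfigurations.$h.config.system.build.toplevel.drvPath" >/dev/null \
    && echo "$h: evaluates"
done
```
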
@@ -29,8 +29,7 @@

persistence."/persist/home/ppetru" = {
directories = [
".cache/nix"
".cache/nix-index"
".cache/"
".claude/"
".codex/"
".config/io.datasette.llm/"
@@ -41,7 +40,9 @@
".ssh"
"projects"
];
files = [ ];
files = [
".claude.json"
];
allowOther = true;
};
};

@@ -11,7 +11,6 @@ in
{
packages = workstationProfile.packages ++ desktopPkgs;
environment.persistence."/persist/home/ppetru".directories = [
".cache"
".config/google-chrome"
];
}

home/profiles/minimal.nix (new file, 5 lines)

@@ -0,0 +1,5 @@
{ pkgs }:
{
# Minimal profile: reuses server.nix for basic package list
packages = (import ./server.nix { inherit pkgs; }).packages;
}

home/programs/minimal.nix (new file, 5 lines)

@@ -0,0 +1,5 @@
{ pkgs, ... }:
{
# Minimal profile: reuses server.nix for basic CLI programs
imports = [ ./server.nix ];
}

@@ -2,6 +2,7 @@
{
imports = [
../../common/global
../../common/cloud-node.nix # Minimal system with Consul
./hardware.nix
./reverse-proxy.nix
];

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
../../common/nomad-server.nix # Consul + Nomad server mode
../../common/glusterfs.nix # GlusterFS server (temp during migration)
../../common/nfs-services-standby.nix # NFS standby for /data/services
# To promote to NFS server (during failover):
# 1. Follow procedure in docs/NFS_FAILOVER.md

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
../../common/nomad-server.nix # Consul + Nomad server mode
../../common/glusterfs.nix # GlusterFS server (temp during migration)
./hardware.nix
];

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
../../common/nomad-server.nix # Consul + Nomad server mode
../../common/glusterfs.nix # GlusterFS server (temp during migration)
../../common/binary-cache-server.nix
./hardware.nix
];

@@ -8,6 +8,7 @@
imports = [
../../common/encrypted-btrfs-layout.nix
../../common/global
../../common/workstation-node.nix # Dev tools (deploy-rs, docker, nix-ld)
../../common/cluster-member.nix # Consul + storage clients
../../common/cluster-tools.nix # Nomad CLI (no service)
./hardware.nix
@@ -25,6 +26,8 @@

networking.useNetworkd = true;
systemd.network.enable = true;
# Wait for br0 to be routable before considering network online
systemd.network.wait-online.extraArgs = [ "--interface=br0:routable" ];
# not useful and potentially a security loophole
services.resolved.llmnr = "false";

@@ -3,6 +3,7 @@
imports = [
../../common/encrypted-btrfs-layout.nix
../../common/global
../../common/desktop-node.nix # Hyprland + GUI environment
../../common/cluster-member.nix # Consul + storage clients
../../common/cluster-tools.nix # Nomad CLI (no service)
./hardware.nix

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
# NOTE: zippy is NOT a server - no nomad-server.nix import
../../common/glusterfs.nix # GlusterFS server (temp during migration)
# ../../common/ethereum.nix
../../common/nfs-services-server.nix # NFS server for /data/services
# To move NFS server role to another host:

@@ -1,33 +1,9 @@
glusterfs setup on c1:
* for h in c1 c2 c3; do ssh $h sudo mkdir /persist/glusterfs/compute; done
* gluster peer probe c2
* gluster peer probe c3
* gluster volume create compute replica 3 c{1,2,3}:/persist/glusterfs/compute/brick1
* gluster volume start compute
* gluster volume bitrot compute enable

mysql credentials
* Put secrets/mysql_root_password into a Nomad var named secrets/mysql.root_password

postgres credentials
* Put secrets/postgres_password into a Nomad var named secrets/postgresql.postgres_password

adding a new gluster node to the compute volume, with c3 having failed:
(instructions from https://icicimov.github.io/blog/high-availability/Replacing-GlusterFS-failed-node/)
* zippy: sudo mkdir /persist/glusterfs/compute -p
* c1: gluster peer probe 192.168.1.2 (by IP because zippy resolved to a tailscale address)
* c1: gluster volume replace-brick compute c3:/persist/glusterfs/compute/brick1 192.168.1.2:/persist/glusterfs/compute/brick1 commit force
* c1: gluster volume heal compute full
* c1: gluster peer detach c3

same to then later replace 192.168.1.2 with 192.168.1.73

replacing failed / reinstalled gluster volume (c1 in this case). all commands on c2:
* gluster volume remove-brick compute replica 2 c1:/persist/glusterfs/compute/brick1 force
* gluster peer detach c1
* gluster peer probe 192.168.1.71 (not c1 because switching to IPs to avoid DNS/tailscale issues)
* gluster volume add-brick compute replica 3 192.168.1.71:/persist/glusterfs/compute/brick1

kopia repository server setup (on a non-NixOS host at the time):
* kopia repository create filesystem --path /backup/persist
* kopia repository connect filesystem --path=/backup/persist
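The credential notes above map onto the Nomad variables CLI roughly as follows (a sketch; the split of each variable into path `secrets/mysql` / `secrets/postgresql` with items `root_password` / `postgres_password` is an assumption about how the jobs template them in):

```bash
# Assumed path/key layout; adjust to whatever the job templates actually read.
nomad var put secrets/mysql root_password="$(cat secrets/mysql_root_password)"
nomad var put secrets/postgresql postgres_password="$(cat secrets/postgres_password)"
```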