Compare commits


6 Commits

22 changed files with 112 additions and 114 deletions

.gitignore
View File

@@ -2,3 +2,4 @@
 .tmp
 result
 .aider*
+.claude

View File

@@ -30,10 +30,7 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons
 └── services/ # Nomad job specs (.hcl files)
 ```
-## Current Architecture (transitioning)
-**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
-**NEW**: NFS from zippy at `/data/services` (current target)
+## Current Architecture
 ### Storage Mounts
 - `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)
@@ -86,26 +83,18 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons
 ## Migration Status
-**Phase**: 4 in progress (20/35 services migrated)
-**Current**: Migrating services from GlusterFS → NFS
-**Next**: Finish migrating remaining services, update host volumes, remove GlusterFS
-**Later**: Convert fractal to NixOS (deferred)
+**Phase 3 & 4**: COMPLETE! GlusterFS removed, all services on NFS
+**Next**: Convert fractal to NixOS (deferred)
 See `docs/MIGRATION_TODO.md` for detailed checklist.
-**IMPORTANT**: When working on migration tasks:
-1. Always update `docs/MIGRATION_TODO.md` after completing each service migration
-2. Update both the individual service checklist AND the summary counts at the bottom
-3. Pattern: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
-4. Migration workflow per service: stop → copy data → edit config → start → update MIGRATION_TODO.md
 ## Common Tasks
 **Deploy a host**: `deploy -s '.#hostname'`
 **Deploy all**: `deploy`
 **Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
 **NFS failover**: See `docs/NFS_FAILOVER.md`
-**Nomad jobs**: `services/*.hcl` - update paths: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
+**Nomad jobs**: `services/*.hcl` - service data stored at `/data/services/<service-name>`
 ## Troubleshooting Hints

View File

@@ -8,7 +8,10 @@
     ./unattended-encryption.nix
     ./cifs-client.nix
     ./consul.nix
-    ./glusterfs-client.nix # Keep during migration, will be removed in Phase 3
     ./nfs-services-client.nix # New: NFS client for /data/services
   ];
+  # Wait for eno1 to be routable before considering network online
+  # (hosts with different primary interfaces should override this)
+  systemd.network.wait-online.extraArgs = [ "--interface=eno1:routable" ];
 }
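The new wait-online default assumes `eno1` is the primary interface; per the comment above, hosts with a different primary interface are expected to override it (the bridged host later in this diff uses `--interface=br0:routable`). A minimal sketch of such an override, assuming `lib.mkForce` is wanted so the list is replaced rather than merged with the default:

```nix
# Sketch only: host-level override for a machine whose primary interface is br0.
# mkForce is an assumption here - it replaces the module default instead of merging with it.
{ lib, ... }:
{
  systemd.network.wait-online.extraArgs = lib.mkForce [ "--interface=br0:routable" ];
}
```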

View File

@@ -1,13 +0,0 @@
-{ pkgs, ... }:
-{
-  environment.systemPackages = [ pkgs.glusterfs ];
-  fileSystems."/data/compute" = {
-    device = "192.168.1.71:/compute";
-    fsType = "glusterfs";
-    options = [
-      "backup-volfile-servers=192.168.1.72:192.168.1.73"
-      "_netdev"
-    ];
-  };
-}

View File

@@ -1,24 +0,0 @@
-{
-  pkgs,
-  config,
-  lib,
-  ...
-}:
-{
-  services.glusterfs = {
-    enable = true;
-  };
-  environment.persistence."/persist".directories = [ "/var/lib/glusterd" ];
-  # TODO: each volume needs its own port starting at 49152
-  networking.firewall.allowedTCPPorts = [
-    24007
-    24008
-    24009
-    49152
-    49153
-    49154
-    49155
-  ];
-}

View File

@@ -9,12 +9,17 @@
 # The mount is established at boot time and persists - no auto-unmount.
 # This prevents issues with Docker bind mounts seeing empty automount stubs.
+  imports = [
+    ./wait-for-dns-ready.nix
+  ];
   fileSystems."/data/services" = {
     device = "data-services.service.consul:/persist/services";
     fsType = "nfs";
     options = [
       "nofail" # Don't block boot if mount fails
       "x-systemd.mount-timeout=30s" # Timeout for mount attempts
+      "x-systemd.after=wait-for-dns-ready.service" # Wait for DNS to actually work
       "_netdev" # Network filesystem (wait for network)
     ];
   };

View File

@@ -39,15 +39,15 @@ in
     noCheck = true;
   };
-  # Cleanup old snapshots on standby (keep last 48 hours for safety)
+  # Cleanup old snapshots on standby (keep last 4 hours for HA failover)
   systemd.services.cleanup-services-standby-snapshots = {
     description = "Cleanup old btrfs snapshots in services-standby";
     path = [ pkgs.btrfs-progs pkgs.findutils ];
     script = ''
       set -euo pipefail
-      # Keep last 48 hours of snapshots (576 snapshots at 5min intervals)
-      find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +2880 -exec btrfs subvolume delete {} \; || true
+      # Keep last 4 hours of snapshots (48 snapshots at 5min intervals)
+      find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +240 -exec btrfs subvolume delete {} \; || true
     '';
     serviceConfig = {
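For reference, the new numbers are self-consistent: snapshots land every 5 minutes, so a 4-hour window holds 4 × 60 / 5 = 48 snapshots, and `-mmin +240` deletes anything whose mtime is older than 240 minutes (4 hours), matching the updated comment.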

View File

@@ -0,0 +1,55 @@
+{ pkgs, ... }:
+{
+  # Service to wait for DNS resolution to be actually functional
+  # This is needed because network-online.target and wait-online.service
+  # don't guarantee DNS works - they only check that interfaces are configured.
+  #
+  # Problem: NFS mounts using Consul DNS names (data-services.service.consul)
+  # fail at boot because DNS resolution isn't ready even though network is "online"
+  #
+  # Solution: Actively test DNS resolution before considering network truly ready
+  systemd.services.wait-for-dns-ready = {
+    description = "Wait for DNS resolution to be functional";
+    after = [
+      "systemd-networkd-wait-online.service"
+      "systemd-resolved.service"
+      "network-online.target"
+    ];
+    wants = [ "network-online.target" ];
+    wantedBy = [ "multi-user.target" ];
+    serviceConfig = {
+      Type = "oneshot";
+      RemainAfterExit = true;
+      ExecStart = pkgs.writeShellScript "wait-for-dns-ready" ''
+        # Test DNS resolution by attempting to resolve data-services.service.consul
+        # This ensures the full DNS path works: interface → gateway → Consul DNS
+        echo "Waiting for DNS resolution to be ready..."
+        for i in {1..30}; do
+          # Use getent which respects /etc/nsswitch.conf and systemd-resolved
+          if ${pkgs.glibc.bin}/bin/getent hosts data-services.service.consul >/dev/null 2>&1; then
+            echo "DNS ready: data-services.service.consul resolved successfully"
+            exit 0
+          fi
+          # Also test a public DNS name to distinguish between general DNS failure
+          # vs Consul-specific issues (helpful for debugging)
+          if ! ${pkgs.glibc.bin}/bin/getent hosts www.google.com >/dev/null 2>&1; then
+            echo "Attempt $i/30: General DNS not working yet, waiting..."
+          else
+            echo "Attempt $i/30: General DNS works but Consul DNS not ready yet, waiting..."
+          fi
+          sleep 1
+        done
+        echo "Warning: DNS not fully ready after 30 seconds"
+        echo "NFS mounts with 'nofail' option will handle this gracefully"
+        exit 0 # Don't block boot - let nofail mounts handle DNS failures
+      '';
+    };
+  };
+}
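The NFS client module earlier in this diff consumes this unit through the `x-systemd.after=wait-for-dns-ready.service` mount option. Any other unit that needs Consul DNS at startup could order itself after the readiness check the same way; a minimal sketch (the service name below is hypothetical):

```nix
# Sketch: a hypothetical unit that should only start once Consul DNS resolves.
{
  systemd.services.my-consul-dns-consumer = {
    after = [ "wait-for-dns-ready.service" ];
    wants = [ "wait-for-dns-ready.service" ];
    # ... rest of the unit definition ...
  };
}
```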

View File

@@ -37,17 +37,17 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
 ## Phase 3: Migrate from GlusterFS to NFS
 - [x] Update all nodes to mount NFS at `/data/services`
 - [x] Deploy updated configs (NFS client on all nodes)
-- [ ] Stop all Nomad jobs temporarily
-- [ ] Copy data from GlusterFS to zippy NFS
-- [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
-- [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
-- [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
-- [ ] Verify data integrity
-- [ ] Verify NFS mounts working on all nodes
-- [ ] Stop GlusterFS volume
-- [ ] Delete GlusterFS volume
-- [ ] Remove GlusterFS from NixOS configs
-- [ ] Remove syncthing wordpress sync configuration
+- [x] Stop all Nomad jobs temporarily
+- [x] Copy data from GlusterFS to zippy NFS
+- [x] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
+- [x] Copy `/data/compute/config/*` → `/persist/services/config/`
+- [x] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
+- [x] Verify data integrity
+- [x] Verify NFS mounts working on all nodes
+- [x] Stop GlusterFS volume
+- [x] Delete GlusterFS volume
+- [x] Remove GlusterFS from NixOS configs
+- [x] Remove syncthing wordpress sync configuration (no longer used)
 ## Phase 4: Update and redeploy Nomad jobs
@@ -125,8 +125,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
 - [ ] Verify backups include `/persist/services` data
 - [ ] Verify backups exclude replication snapshots
 - [ ] Update documentation (README.md, architecture diagrams)
-- [ ] Clean up old GlusterFS data (only after everything verified!)
-- [ ] Remove old glusterfs directories from all nodes
+- [x] Clean up old GlusterFS data (only after everything verified!)
+- [x] Remove old glusterfs directories from all nodes
 ## Post-Migration Checklist
 - [ ] All 5 servers in quorum (consul members)
@@ -143,8 +143,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
 ---
-**Last updated**: 2025-10-23 22:30
-**Current phase**: Phase 4 complete! All services migrated to NFS
+**Last updated**: 2025-10-25
+**Current phase**: Phase 3 & 4 complete! GlusterFS removed, all services on NFS
 **Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete
 ## Migration Summary

View File

@@ -66,13 +66,8 @@
     mkHost =
       system: profile: modules:
       let
-        # Auto-import profile-specific module based on profile parameter
-        profileModule =
-          if profile == "server" then ./common/server-node.nix
-          else if profile == "workstation" then ./common/workstation-node.nix
-          else if profile == "desktop" then ./common/desktop-node.nix
-          else if profile == "cloud" then ./common/cloud-node.nix
-          else null;
+        # Profile parameter is only used by home-manager for user environment
+        # NixOS system configuration is handled via explicit imports in host configs
       in
       nixpkgs.lib.nixosSystem {
         system = system;
@@ -105,7 +100,7 @@
             };
           };
         }
-      ] ++ nixpkgs.lib.optional (profileModule != null) profileModule ++ modules;
+      ] ++ modules;
       specialArgs = {
         inherit inputs self;
       };
@@ -136,9 +131,9 @@
     in
     {
       nixosConfigurations = {
-        c1 = mkHost "x86_64-linux" "server" [ ./hosts/c1 ];
-        c2 = mkHost "x86_64-linux" "server" [ ./hosts/c2 ];
-        c3 = mkHost "x86_64-linux" "server" [ ./hosts/c3 ];
+        c1 = mkHost "x86_64-linux" "minimal" [ ./hosts/c1 ];
+        c2 = mkHost "x86_64-linux" "minimal" [ ./hosts/c2 ];
+        c3 = mkHost "x86_64-linux" "minimal" [ ./hosts/c3 ];
         alo-cloud-1 = mkHost "aarch64-linux" "cloud" [ ./hosts/alo-cloud-1 ];
         zippy = mkHost "x86_64-linux" "minimal" [
           ethereum-nix.nixosModules.default
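Taken together, these flake changes mean the profile string now only selects the home-manager environment, while each host pulls in its NixOS profile module explicitly - which is what the host-config hunks later in this diff do (importing `../../common/workstation-node.nix`, `../../common/desktop-node.nix`, or `../../common/cloud-node.nix` directly). A sketch of the resulting pattern, using a hypothetical host name:

```nix
# flake.nix (inside nixosConfigurations) - the profile string ("minimal" here)
# now only affects the home-manager side of the build
{
  somehost = mkHost "x86_64-linux" "minimal" [ ./hosts/somehost ];
}

# hosts/somehost/default.nix - the NixOS side of the profile is imported explicitly
{
  imports = [
    ../../common/global
    ../../common/workstation-node.nix # Dev tools (deploy-rs, docker, nix-ld)
    ./hardware.nix
  ];
}
```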

View File

@@ -29,8 +29,7 @@
persistence."/persist/home/ppetru" = { persistence."/persist/home/ppetru" = {
directories = [ directories = [
".cache/nix" ".cache/"
".cache/nix-index"
".claude/" ".claude/"
".codex/" ".codex/"
".config/io.datasette.llm/" ".config/io.datasette.llm/"
@@ -41,7 +40,9 @@
".ssh" ".ssh"
"projects" "projects"
]; ];
files = [ ]; files = [
".claude.json"
];
allowOther = true; allowOther = true;
}; };
}; };

View File

@@ -11,7 +11,6 @@ in
 {
   packages = workstationProfile.packages ++ desktopPkgs;
   environment.persistence."/persist/home/ppetru".directories = [
-    ".cache"
     ".config/google-chrome"
   ];
 }

View File

@@ -0,0 +1,5 @@
+{ pkgs }:
+{
+  # Minimal profile: reuses server.nix for basic package list
+  packages = (import ./server.nix { inherit pkgs; }).packages;
+}

View File

@@ -0,0 +1,5 @@
+{ pkgs, ... }:
+{
+  # Minimal profile: reuses server.nix for basic CLI programs
+  imports = [ ./server.nix ];
+}

View File

@@ -2,6 +2,7 @@
 {
   imports = [
     ../../common/global
+    ../../common/cloud-node.nix # Minimal system with Consul
     ./hardware.nix
     ./reverse-proxy.nix
   ];

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     ../../common/nomad-server.nix # Consul + Nomad server mode
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     ../../common/nfs-services-standby.nix # NFS standby for /data/services
     # To promote to NFS server (during failover):
     # 1. Follow procedure in docs/NFS_FAILOVER.md

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     ../../common/nomad-server.nix # Consul + Nomad server mode
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     ./hardware.nix
   ];

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     ../../common/nomad-server.nix # Consul + Nomad server mode
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     ../../common/binary-cache-server.nix
     ./hardware.nix
   ];

View File

@@ -8,6 +8,7 @@
   imports = [
     ../../common/encrypted-btrfs-layout.nix
     ../../common/global
+    ../../common/workstation-node.nix # Dev tools (deploy-rs, docker, nix-ld)
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/cluster-tools.nix # Nomad CLI (no service)
     ./hardware.nix
@@ -25,6 +26,8 @@
   networking.useNetworkd = true;
   systemd.network.enable = true;
+  # Wait for br0 to be routable before considering network online
+  systemd.network.wait-online.extraArgs = [ "--interface=br0:routable" ];
   # not useful and potentially a security loophole
   services.resolved.llmnr = "false";

View File

@@ -3,6 +3,7 @@
   imports = [
     ../../common/encrypted-btrfs-layout.nix
     ../../common/global
+    ../../common/desktop-node.nix # Hyprland + GUI environment
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/cluster-tools.nix # Nomad CLI (no service)
     ./hardware.nix

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     # NOTE: zippy is NOT a server - no nomad-server.nix import
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     # ../../common/ethereum.nix
     ../../common/nfs-services-server.nix # NFS server for /data/services
     # To move NFS server role to another host:

View File

@@ -1,33 +1,9 @@
-glusterfs setup on c1:
-* for h in c1 c2 c3; do ssh $h sudo mkdir /persist/glusterfs/compute; done
-* gluster peer probe c2
-* gluster peer probe c3
-* gluster volume create compute replica 3 c{1,2,3}:/persist/glusterfs/compute/brick1
-* gluster volume start compute
-* gluster volume bitrot compute enable
 mysql credentials
 * Put secrets/mysql_root_password into a Nomad var named secrets/mysql.root_password
 postgres credentials
 * Put secrets/postgres_password into a Nomad var named secrets/postgresql.postgres_password
-adding a new gluster node to the compute volume, with c3 having failed:
-(instructions from https://icicimov.github.io/blog/high-availability/Replacing-GlusterFS-failed-node/)
-* zippy: sudo mkdir /persist/glusterfs/compute -p
-* c1: gluster peer probe 192.168.1.2 (by IP because zippy resolved to a tailscale address)
-* c1: gluster volume replace-brick compute c3:/persist/glusterfs/compute/brick1 192.168.1.2:/persist/glusterfs/compute/brick1 commit force
-* c1: gluster volume heal compute full
-* c1: gluster peer detach c3
-same to then later replace 192.168.1.2 with 192.168.1.73
-replacing failed / reinstalled gluster volume (c1 in this case). all commands on c2:
-* gluster volume remove-brick compute replica 2 c1:/persist/glusterfs/compute/brick1 force
-* gluster peer detach c1
-* gluster peer probe 192.168.1.71 (not c1 because switching to IPs to avoid DNS/tailscale issues)
-* gluster volume add-brick compute replica 3 192.168.1.71:/persist/glusterfs/compute/brick1
 kopia repository server setup (on a non-NixOS host at the time):
 * kopia repository create filesystem --path /backup/persist
 * kopia repository connect filesystem --path=/backup/persist