Compare commits


6 Commits

22 changed files with 112 additions and 114 deletions

.gitignore
View File

@@ -2,3 +2,4 @@
 .tmp
 result
 .aider*
+.claude

View File

@@ -30,10 +30,7 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons
 └── services/ # Nomad job specs (.hcl files)
 ```
-## Current Architecture (transitioning)
-**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
-**NEW**: NFS from zippy at `/data/services` (current target)
+## Current Architecture
 ### Storage Mounts
 - `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)
@@ -86,26 +83,18 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons
 ## Migration Status
-**Phase**: 4 in progress (20/35 services migrated)
-**Current**: Migrating services from GlusterFS → NFS
-**Next**: Finish migrating remaining services, update host volumes, remove GlusterFS
-**Later**: Convert fractal to NixOS (deferred)
+**Phase 3 & 4**: COMPLETE! GlusterFS removed, all services on NFS
+**Next**: Convert fractal to NixOS (deferred)
 See `docs/MIGRATION_TODO.md` for detailed checklist.
-**IMPORTANT**: When working on migration tasks:
-1. Always update `docs/MIGRATION_TODO.md` after completing each service migration
-2. Update both the individual service checklist AND the summary counts at the bottom
-3. Pattern: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
-4. Migration workflow per service: stop → copy data → edit config → start → update MIGRATION_TODO.md
 ## Common Tasks
 **Deploy a host**: `deploy -s '.#hostname'`
 **Deploy all**: `deploy`
 **Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
 **NFS failover**: See `docs/NFS_FAILOVER.md`
-**Nomad jobs**: `services/*.hcl` - update paths: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
+**Nomad jobs**: `services/*.hcl` - service data stored at `/data/services/<service-name>`
 ## Troubleshooting Hints

View File

@@ -8,7 +8,10 @@
     ./unattended-encryption.nix
     ./cifs-client.nix
     ./consul.nix
-    ./glusterfs-client.nix # Keep during migration, will be removed in Phase 3
     ./nfs-services-client.nix # New: NFS client for /data/services
   ];
+  # Wait for eno1 to be routable before considering network online
+  # (hosts with different primary interfaces should override this)
+  systemd.network.wait-online.extraArgs = [ "--interface=eno1:routable" ];
 }
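The new wait-online default assumes `eno1` is the primary interface; per the comment above, hosts with a different primary interface are expected to override it (the bridged host later in this diff uses `--interface=br0:routable`). A minimal sketch of such an override, assuming `lib.mkForce` is wanted so the list is replaced rather than merged with the default:

```nix
# Sketch only: host-level override for a machine whose primary interface is br0.
# mkForce is an assumption here - it replaces the module default instead of merging with it.
{ lib, ... }:
{
  systemd.network.wait-online.extraArgs = lib.mkForce [ "--interface=br0:routable" ];
}
```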

View File

@@ -1,13 +0,0 @@
-{ pkgs, ... }:
-{
-  environment.systemPackages = [ pkgs.glusterfs ];
-  fileSystems."/data/compute" = {
-    device = "192.168.1.71:/compute";
-    fsType = "glusterfs";
-    options = [
-      "backup-volfile-servers=192.168.1.72:192.168.1.73"
-      "_netdev"
-    ];
-  };
-}

View File

@@ -1,24 +0,0 @@
-{
-  pkgs,
-  config,
-  lib,
-  ...
-}:
-{
-  services.glusterfs = {
-    enable = true;
-  };
-  environment.persistence."/persist".directories = [ "/var/lib/glusterd" ];
-  # TODO: each volume needs its own port starting at 49152
-  networking.firewall.allowedTCPPorts = [
-    24007
-    24008
-    24009
-    49152
-    49153
-    49154
-    49155
-  ];
-}

View File

@@ -9,12 +9,17 @@
 # The mount is established at boot time and persists - no auto-unmount.
 # This prevents issues with Docker bind mounts seeing empty automount stubs.
+  imports = [
+    ./wait-for-dns-ready.nix
+  ];
   fileSystems."/data/services" = {
     device = "data-services.service.consul:/persist/services";
     fsType = "nfs";
     options = [
       "nofail" # Don't block boot if mount fails
       "x-systemd.mount-timeout=30s" # Timeout for mount attempts
+      "x-systemd.after=wait-for-dns-ready.service" # Wait for DNS to actually work
       "_netdev" # Network filesystem (wait for network)
     ];
   };

View File

@@ -39,15 +39,15 @@ in
     noCheck = true;
   };
-  # Cleanup old snapshots on standby (keep last 48 hours for safety)
+  # Cleanup old snapshots on standby (keep last 4 hours for HA failover)
   systemd.services.cleanup-services-standby-snapshots = {
     description = "Cleanup old btrfs snapshots in services-standby";
     path = [ pkgs.btrfs-progs pkgs.findutils ];
     script = ''
       set -euo pipefail
-      # Keep last 48 hours of snapshots (576 snapshots at 5min intervals)
-      find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +2880 -exec btrfs subvolume delete {} \; || true
+      # Keep last 4 hours of snapshots (48 snapshots at 5min intervals)
+      find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +240 -exec btrfs subvolume delete {} \; || true
     '';
     serviceConfig = {
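For reference, the new numbers are self-consistent: snapshots land every 5 minutes, so a 4-hour window holds 4 × 60 / 5 = 48 snapshots, and `-mmin +240` deletes anything whose mtime is older than 240 minutes (4 hours), matching the updated comment.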

View File

@@ -0,0 +1,55 @@
+{ pkgs, ... }:
+{
+  # Service to wait for DNS resolution to be actually functional
+  # This is needed because network-online.target and wait-online.service
+  # don't guarantee DNS works - they only check that interfaces are configured.
+  #
+  # Problem: NFS mounts using Consul DNS names (data-services.service.consul)
+  # fail at boot because DNS resolution isn't ready even though network is "online"
+  #
+  # Solution: Actively test DNS resolution before considering network truly ready
+  systemd.services.wait-for-dns-ready = {
+    description = "Wait for DNS resolution to be functional";
+    after = [
+      "systemd-networkd-wait-online.service"
+      "systemd-resolved.service"
+      "network-online.target"
+    ];
+    wants = [ "network-online.target" ];
+    wantedBy = [ "multi-user.target" ];
+    serviceConfig = {
+      Type = "oneshot";
+      RemainAfterExit = true;
+      ExecStart = pkgs.writeShellScript "wait-for-dns-ready" ''
+        # Test DNS resolution by attempting to resolve data-services.service.consul
+        # This ensures the full DNS path works: interface → gateway → Consul DNS
+        echo "Waiting for DNS resolution to be ready..."
+        for i in {1..30}; do
+          # Use getent which respects /etc/nsswitch.conf and systemd-resolved
+          if ${pkgs.glibc.bin}/bin/getent hosts data-services.service.consul >/dev/null 2>&1; then
+            echo "DNS ready: data-services.service.consul resolved successfully"
+            exit 0
+          fi
+          # Also test a public DNS name to distinguish between general DNS failure
+          # vs Consul-specific issues (helpful for debugging)
+          if ! ${pkgs.glibc.bin}/bin/getent hosts www.google.com >/dev/null 2>&1; then
+            echo "Attempt $i/30: General DNS not working yet, waiting..."
+          else
+            echo "Attempt $i/30: General DNS works but Consul DNS not ready yet, waiting..."
+          fi
+          sleep 1
+        done
+        echo "Warning: DNS not fully ready after 30 seconds"
+        echo "NFS mounts with 'nofail' option will handle this gracefully"
+        exit 0 # Don't block boot - let nofail mounts handle DNS failures
+      '';
+    };
+  };
+}
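The NFS client module earlier in this diff consumes this unit through the `x-systemd.after=wait-for-dns-ready.service` mount option. Any other unit that needs Consul DNS at startup could order itself after the readiness check the same way; a minimal sketch (the service name below is hypothetical):

```nix
# Sketch: a hypothetical unit that should only start once Consul DNS resolves.
{
  systemd.services.my-consul-dns-consumer = {
    after = [ "wait-for-dns-ready.service" ];
    wants = [ "wait-for-dns-ready.service" ];
    # ... rest of the unit definition ...
  };
}
```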

View File

@@ -37,17 +37,17 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
 ## Phase 3: Migrate from GlusterFS to NFS
 - [x] Update all nodes to mount NFS at `/data/services`
 - [x] Deploy updated configs (NFS client on all nodes)
-- [ ] Stop all Nomad jobs temporarily
-- [ ] Copy data from GlusterFS to zippy NFS
-- [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
-- [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
-- [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
-- [ ] Verify data integrity
-- [ ] Verify NFS mounts working on all nodes
-- [ ] Stop GlusterFS volume
-- [ ] Delete GlusterFS volume
-- [ ] Remove GlusterFS from NixOS configs
-- [ ] Remove syncthing wordpress sync configuration
+- [x] Stop all Nomad jobs temporarily
+- [x] Copy data from GlusterFS to zippy NFS
+- [x] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
+- [x] Copy `/data/compute/config/*` → `/persist/services/config/`
+- [x] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
+- [x] Verify data integrity
+- [x] Verify NFS mounts working on all nodes
+- [x] Stop GlusterFS volume
+- [x] Delete GlusterFS volume
+- [x] Remove GlusterFS from NixOS configs
+- [x] Remove syncthing wordpress sync configuration (no longer used)
 ## Phase 4: Update and redeploy Nomad jobs
@@ -125,8 +125,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
 - [ ] Verify backups include `/persist/services` data
 - [ ] Verify backups exclude replication snapshots
 - [ ] Update documentation (README.md, architecture diagrams)
-- [ ] Clean up old GlusterFS data (only after everything verified!)
-- [ ] Remove old glusterfs directories from all nodes
+- [x] Clean up old GlusterFS data (only after everything verified!)
+- [x] Remove old glusterfs directories from all nodes
 ## Post-Migration Checklist
 - [ ] All 5 servers in quorum (consul members)
@@ -143,8 +143,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
 ---
-**Last updated**: 2025-10-23 22:30
-**Current phase**: Phase 4 complete! All services migrated to NFS
+**Last updated**: 2025-10-25
+**Current phase**: Phase 3 & 4 complete! GlusterFS removed, all services on NFS
 **Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete
 ## Migration Summary

View File

@@ -66,13 +66,8 @@
     mkHost =
       system: profile: modules:
       let
-        # Auto-import profile-specific module based on profile parameter
-        profileModule =
-          if profile == "server" then ./common/server-node.nix
-          else if profile == "workstation" then ./common/workstation-node.nix
-          else if profile == "desktop" then ./common/desktop-node.nix
-          else if profile == "cloud" then ./common/cloud-node.nix
-          else null;
+        # Profile parameter is only used by home-manager for user environment
+        # NixOS system configuration is handled via explicit imports in host configs
       in
       nixpkgs.lib.nixosSystem {
         system = system;
@@ -105,7 +100,7 @@
             };
           };
         }
-      ] ++ nixpkgs.lib.optional (profileModule != null) profileModule ++ modules;
+      ] ++ modules;
       specialArgs = {
         inherit inputs self;
       };
@@ -136,9 +131,9 @@
     in
     {
       nixosConfigurations = {
-        c1 = mkHost "x86_64-linux" "server" [ ./hosts/c1 ];
-        c2 = mkHost "x86_64-linux" "server" [ ./hosts/c2 ];
-        c3 = mkHost "x86_64-linux" "server" [ ./hosts/c3 ];
+        c1 = mkHost "x86_64-linux" "minimal" [ ./hosts/c1 ];
+        c2 = mkHost "x86_64-linux" "minimal" [ ./hosts/c2 ];
+        c3 = mkHost "x86_64-linux" "minimal" [ ./hosts/c3 ];
         alo-cloud-1 = mkHost "aarch64-linux" "cloud" [ ./hosts/alo-cloud-1 ];
         zippy = mkHost "x86_64-linux" "minimal" [
           ethereum-nix.nixosModules.default
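Taken together, these flake changes mean the profile string now only selects the home-manager environment, while each host pulls in its NixOS profile module explicitly - which is what the host-config hunks later in this diff do (importing `../../common/workstation-node.nix`, `../../common/desktop-node.nix`, or `../../common/cloud-node.nix` directly). A sketch of the resulting pattern, using a hypothetical host name:

```nix
# flake.nix (inside nixosConfigurations) - the profile string ("minimal" here)
# now only affects the home-manager side of the build
{
  somehost = mkHost "x86_64-linux" "minimal" [ ./hosts/somehost ];
}

# hosts/somehost/default.nix - the NixOS side of the profile is imported explicitly
{
  imports = [
    ../../common/global
    ../../common/workstation-node.nix # Dev tools (deploy-rs, docker, nix-ld)
    ./hardware.nix
  ];
}
```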

View File

@@ -29,8 +29,7 @@
persistence."/persist/home/ppetru" = { persistence."/persist/home/ppetru" = {
directories = [ directories = [
".cache/nix" ".cache/"
".cache/nix-index"
".claude/" ".claude/"
".codex/" ".codex/"
".config/io.datasette.llm/" ".config/io.datasette.llm/"
@@ -41,7 +40,9 @@
".ssh" ".ssh"
"projects" "projects"
]; ];
files = [ ]; files = [
".claude.json"
];
allowOther = true; allowOther = true;
}; };
}; };

View File

@@ -11,7 +11,6 @@ in
 {
   packages = workstationProfile.packages ++ desktopPkgs;
   environment.persistence."/persist/home/ppetru".directories = [
-    ".cache"
     ".config/google-chrome"
   ];
 }

View File

@@ -0,0 +1,5 @@
+{ pkgs }:
+{
+  # Minimal profile: reuses server.nix for basic package list
+  packages = (import ./server.nix { inherit pkgs; }).packages;
+}

View File

@@ -0,0 +1,5 @@
+{ pkgs, ... }:
+{
+  # Minimal profile: reuses server.nix for basic CLI programs
+  imports = [ ./server.nix ];
+}

View File

@@ -2,6 +2,7 @@
 {
   imports = [
     ../../common/global
+    ../../common/cloud-node.nix # Minimal system with Consul
     ./hardware.nix
     ./reverse-proxy.nix
   ];

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     ../../common/nomad-server.nix # Consul + Nomad server mode
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     ../../common/nfs-services-standby.nix # NFS standby for /data/services
     # To promote to NFS server (during failover):
     # 1. Follow procedure in docs/NFS_FAILOVER.md

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     ../../common/nomad-server.nix # Consul + Nomad server mode
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     ./hardware.nix
   ];

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     ../../common/nomad-server.nix # Consul + Nomad server mode
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     ../../common/binary-cache-server.nix
     ./hardware.nix
   ];

View File

@@ -8,6 +8,7 @@
   imports = [
     ../../common/encrypted-btrfs-layout.nix
     ../../common/global
+    ../../common/workstation-node.nix # Dev tools (deploy-rs, docker, nix-ld)
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/cluster-tools.nix # Nomad CLI (no service)
     ./hardware.nix
@@ -25,6 +26,8 @@
   networking.useNetworkd = true;
   systemd.network.enable = true;
+  # Wait for br0 to be routable before considering network online
+  systemd.network.wait-online.extraArgs = [ "--interface=br0:routable" ];
   # not useful and potentially a security loophole
   services.resolved.llmnr = "false";

View File

@@ -3,6 +3,7 @@
   imports = [
     ../../common/encrypted-btrfs-layout.nix
     ../../common/global
+    ../../common/desktop-node.nix # Hyprland + GUI environment
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/cluster-tools.nix # Nomad CLI (no service)
     ./hardware.nix

View File

@@ -6,7 +6,6 @@
     ../../common/cluster-member.nix # Consul + storage clients
     ../../common/nomad-worker.nix # Nomad client (runs jobs)
     # NOTE: zippy is NOT a server - no nomad-server.nix import
-    ../../common/glusterfs.nix # GlusterFS server (temp during migration)
     # ../../common/ethereum.nix
     ../../common/nfs-services-server.nix # NFS server for /data/services
     # To move NFS server role to another host:

View File

@@ -1,33 +1,9 @@
-glusterfs setup on c1:
-* for h in c1 c2 c3; do ssh $h sudo mkdir /persist/glusterfs/compute; done
-* gluster peer probe c2
-* gluster peer probe c3
-* gluster volume create compute replica 3 c{1,2,3}:/persist/glusterfs/compute/brick1
-* gluster volume start compute
-* gluster volume bitrot compute enable
 mysql credentials
 * Put secrets/mysql_root_password into a Nomad var named secrets/mysql.root_password
 postgres credentials
 * Put secrets/postgres_password into a Nomad var named secrets/postgresql.postgres_password
-adding a new gluster node to the compute volume, with c3 having failed:
-(instructions from https://icicimov.github.io/blog/high-availability/Replacing-GlusterFS-failed-node/)
-* zippy: sudo mkdir /persist/glusterfs/compute -p
-* c1: gluster peer probe 192.168.1.2 (by IP because zippy resolved to a tailscale address)
-* c1: gluster volume replace-brick compute c3:/persist/glusterfs/compute/brick1 192.168.1.2:/persist/glusterfs/compute/brick1 commit force
-* c1: gluster volume heal compute full
-* c1: gluster peer detach c3
-same to then later replace 192.168.1.2 with 192.168.1.73
-replacing failed / reinstalled gluster volume (c1 in this case). all commands on c2:
-* gluster volume remove-brick compute replica 2 c1:/persist/glusterfs/compute/brick1 force
-* gluster peer detach c1
-* gluster peer probe 192.168.1.71 (not c1 because switching to IPs to avoid DNS/tailscale issues)
-* gluster volume add-brick compute replica 3 192.168.1.71:/persist/glusterfs/compute/brick1
 kopia repository server setup (on a non-NixOS host at the time):
 * kopia repository create filesystem --path /backup/persist
 * kopia repository connect filesystem --path=/backup/persist