Compare commits
6 Commits
b608e110c9...dd8fee0ecb
| Author | SHA1 | Date |
|---|---|---|
| | dd8fee0ecb | |
| | a2b54be875 | |
| | ccf6154ba0 | |
| | bd5988dfbc | |
| | a57fc9107b | |
| | a7dce7cfb9 | |
.gitignore (vendored, 1 line changed)

@@ -2,3 +2,4 @@
.tmp
result
.aider*
.claude

CLAUDE.md (19 lines changed)

@@ -30,10 +30,7 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons
└── services/ # Nomad job specs (.hcl files)
```

## Current Architecture (transitioning)

**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
**NEW**: NFS from zippy at `/data/services` (current target)
## Current Architecture

### Storage Mounts
- `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)

@@ -86,26 +83,18 @@ NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Cons

## Migration Status

**Phase**: 4 in progress (20/35 services migrated)
**Current**: Migrating services from GlusterFS → NFS
**Next**: Finish migrating remaining services, update host volumes, remove GlusterFS
**Later**: Convert fractal to NixOS (deferred)
**Phase 3 & 4**: COMPLETE! GlusterFS removed, all services on NFS
**Next**: Convert fractal to NixOS (deferred)

See `docs/MIGRATION_TODO.md` for detailed checklist.

**IMPORTANT**: When working on migration tasks:
1. Always update `docs/MIGRATION_TODO.md` after completing each service migration
2. Update both the individual service checklist AND the summary counts at the bottom
3. Pattern: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
4. Migration workflow per service: stop → copy data → edit config → start → update MIGRATION_TODO.md

## Common Tasks

**Deploy a host**: `deploy -s '.#hostname'`
**Deploy all**: `deploy`
**Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
**NFS failover**: See `docs/NFS_FAILOVER.md`
**Nomad jobs**: `services/*.hcl` - update paths: `/data/compute/appdata/foo` → `/data/services/foo` (NOT `/data/services/appdata/foo`!)
**Nomad jobs**: `services/*.hcl` - service data stored at `/data/services/<service-name>`

## Troubleshooting Hints

@@ -8,7 +8,10 @@
./unattended-encryption.nix
./cifs-client.nix
./consul.nix
./glusterfs-client.nix # Keep during migration, will be removed in Phase 3
./nfs-services-client.nix # New: NFS client for /data/services
];

# Wait for eno1 to be routable before considering network online
# (hosts with different primary interfaces should override this)
systemd.network.wait-online.extraArgs = [ "--interface=eno1:routable" ];
}

@@ -1,13 +0,0 @@
{ pkgs, ... }:
{
environment.systemPackages = [ pkgs.glusterfs ];

fileSystems."/data/compute" = {
device = "192.168.1.71:/compute";
fsType = "glusterfs";
options = [
"backup-volfile-servers=192.168.1.72:192.168.1.73"
"_netdev"
];
};
}

@@ -1,24 +0,0 @@
{
pkgs,
config,
lib,
...
}:
{
services.glusterfs = {
enable = true;
};

environment.persistence."/persist".directories = [ "/var/lib/glusterd" ];

# TODO: each volume needs its own port starting at 49152
networking.firewall.allowedTCPPorts = [
24007
24008
24009
49152
49153
49154
49155
];
}

@@ -9,12 +9,17 @@
# The mount is established at boot time and persists - no auto-unmount.
# This prevents issues with Docker bind mounts seeing empty automount stubs.

imports = [
./wait-for-dns-ready.nix
];

fileSystems."/data/services" = {
device = "data-services.service.consul:/persist/services";
fsType = "nfs";
options = [
"nofail" # Don't block boot if mount fails
"x-systemd.mount-timeout=30s" # Timeout for mount attempts
"x-systemd.after=wait-for-dns-ready.service" # Wait for DNS to actually work
"_netdev" # Network filesystem (wait for network)
];
};
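One way to spot-check the resulting mount on a node after deployment (a sketch; the mount path and Consul name come from the hunk above):

```bash
# Confirm /data/services is an NFS mount with the expected source and options.
findmnt /data/services -o SOURCE,FSTYPE,OPTIONS

# Confirm the Consul DNS name the mount depends on resolves on this node.
getent hosts data-services.service.consul
```
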
@@ -39,15 +39,15 @@ in
noCheck = true;
};

# Cleanup old snapshots on standby (keep last 48 hours for safety)
# Cleanup old snapshots on standby (keep last 4 hours for HA failover)
systemd.services.cleanup-services-standby-snapshots = {
description = "Cleanup old btrfs snapshots in services-standby";
path = [ pkgs.btrfs-progs pkgs.findutils ];

script = ''
set -euo pipefail
# Keep last 48 hours of snapshots (576 snapshots at 5min intervals)
find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +2880 -exec btrfs subvolume delete {} \; || true
# Keep last 4 hours of snapshots (48 snapshots at 5min intervals)
find /persist/services-standby -maxdepth 1 -name 'services@*' -mmin +240 -exec btrfs subvolume delete {} \; || true
'';

serviceConfig = {
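A quick sanity check of the tightened retention on the standby host (a sketch; the path and snapshot naming come from the script above, and a 4-hour window at 5-minute intervals should leave roughly 48 snapshots):

```bash
# Count the standby snapshots; expect on the order of 48 under the 4h policy.
ls -1d /persist/services-standby/services@* | wc -l

# List the newest few to confirm replication is still running.
ls -1dt /persist/services-standby/services@* | head -n 5
```
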
common/wait-for-dns-ready.nix (new file, 55 lines)

@@ -0,0 +1,55 @@
{ pkgs, ... }:
{
# Service to wait for DNS resolution to be actually functional
# This is needed because network-online.target and wait-online.service
# don't guarantee DNS works - they only check that interfaces are configured.
#
# Problem: NFS mounts using Consul DNS names (data-services.service.consul)
# fail at boot because DNS resolution isn't ready even though network is "online"
#
# Solution: Actively test DNS resolution before considering network truly ready

systemd.services.wait-for-dns-ready = {
description = "Wait for DNS resolution to be functional";
after = [
"systemd-networkd-wait-online.service"
"systemd-resolved.service"
"network-online.target"
];
wants = [ "network-online.target" ];
wantedBy = [ "multi-user.target" ];

serviceConfig = {
Type = "oneshot";
RemainAfterExit = true;
ExecStart = pkgs.writeShellScript "wait-for-dns-ready" ''
# Test DNS resolution by attempting to resolve data-services.service.consul
# This ensures the full DNS path works: interface → gateway → Consul DNS

echo "Waiting for DNS resolution to be ready..."

for i in {1..30}; do
# Use getent which respects /etc/nsswitch.conf and systemd-resolved
if ${pkgs.glibc.bin}/bin/getent hosts data-services.service.consul >/dev/null 2>&1; then
echo "DNS ready: data-services.service.consul resolved successfully"
exit 0
fi

# Also test a public DNS name to distinguish between general DNS failure
# vs Consul-specific issues (helpful for debugging)
if ! ${pkgs.glibc.bin}/bin/getent hosts www.google.com >/dev/null 2>&1; then
echo "Attempt $i/30: General DNS not working yet, waiting..."
else
echo "Attempt $i/30: General DNS works but Consul DNS not ready yet, waiting..."
fi

sleep 1
done

echo "Warning: DNS not fully ready after 30 seconds"
echo "NFS mounts with 'nofail' option will handle this gracefully"
exit 0 # Don't block boot - let nofail mounts handle DNS failures
'';
};
};
}
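To confirm on a rebooted node that the DNS wait actually gated the NFS mount, one possible check (a sketch; the unit names come from the configuration above, and `/data/services` corresponds to the systemd mount unit `data-services.mount`):

```bash
# Show whether the DNS wait ran and succeeded on this boot.
systemctl status wait-for-dns-ready.service

# Interleave its log with the NFS mount unit to verify the ordering held.
journalctl -b -u wait-for-dns-ready.service -u data-services.mount --no-pager
```
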
@@ -37,17 +37,17 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
## Phase 3: Migrate from GlusterFS to NFS
- [x] Update all nodes to mount NFS at `/data/services`
- [x] Deploy updated configs (NFS client on all nodes)
- [ ] Stop all Nomad jobs temporarily
- [ ] Copy data from GlusterFS to zippy NFS
- [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
- [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
- [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
- [ ] Verify data integrity
- [ ] Verify NFS mounts working on all nodes
- [ ] Stop GlusterFS volume
- [ ] Delete GlusterFS volume
- [ ] Remove GlusterFS from NixOS configs
- [ ] Remove syncthing wordpress sync configuration
- [x] Stop all Nomad jobs temporarily
- [x] Copy data from GlusterFS to zippy NFS
- [x] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
- [x] Copy `/data/compute/config/*` → `/persist/services/config/`
- [x] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
- [x] Verify data integrity
- [x] Verify NFS mounts working on all nodes
- [x] Stop GlusterFS volume
- [x] Delete GlusterFS volume
- [x] Remove GlusterFS from NixOS configs
- [x] Remove syncthing wordpress sync configuration (no longer used)

## Phase 4: Update and redeploy Nomad jobs

@@ -125,8 +125,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
- [ ] Verify backups include `/persist/services` data
- [ ] Verify backups exclude replication snapshots
- [ ] Update documentation (README.md, architecture diagrams)
- [ ] Clean up old GlusterFS data (only after everything verified!)
- [ ] Remove old glusterfs directories from all nodes
- [x] Clean up old GlusterFS data (only after everything verified!)
- [x] Remove old glusterfs directories from all nodes

## Post-Migration Checklist
- [ ] All 5 servers in quorum (consul members)
@@ -143,8 +143,8 @@ See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.

---

**Last updated**: 2025-10-23 22:30
**Current phase**: Phase 4 complete! All services migrated to NFS
**Last updated**: 2025-10-25
**Current phase**: Phase 3 & 4 complete! GlusterFS removed, all services on NFS
**Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete

## Migration Summary

flake.nix (17 lines changed)

@@ -66,13 +66,8 @@
mkHost =
system: profile: modules:
let
# Auto-import profile-specific module based on profile parameter
profileModule =
if profile == "server" then ./common/server-node.nix
else if profile == "workstation" then ./common/workstation-node.nix
else if profile == "desktop" then ./common/desktop-node.nix
else if profile == "cloud" then ./common/cloud-node.nix
else null;
# Profile parameter is only used by home-manager for user environment
# NixOS system configuration is handled via explicit imports in host configs
in
nixpkgs.lib.nixosSystem {
system = system;
@@ -105,7 +100,7 @@
};
};
}
] ++ nixpkgs.lib.optional (profileModule != null) profileModule ++ modules;
] ++ modules;
specialArgs = {
inherit inputs self;
};
@@ -136,9 +131,9 @@
in
{
nixosConfigurations = {
c1 = mkHost "x86_64-linux" "server" [ ./hosts/c1 ];
c2 = mkHost "x86_64-linux" "server" [ ./hosts/c2 ];
c3 = mkHost "x86_64-linux" "server" [ ./hosts/c3 ];
c1 = mkHost "x86_64-linux" "minimal" [ ./hosts/c1 ];
c2 = mkHost "x86_64-linux" "minimal" [ ./hosts/c2 ];
c3 = mkHost "x86_64-linux" "minimal" [ ./hosts/c3 ];
alo-cloud-1 = mkHost "aarch64-linux" "cloud" [ ./hosts/alo-cloud-1 ];
zippy = mkHost "x86_64-linux" "minimal" [
ethereum-nix.nixosModules.default
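With the profile auto-import removed, a quick evaluation pass over the hosts visible in this hunk helps confirm every configuration still evaluates before deploying (a sketch using the standard nix CLI; only the host names shown above are listed):

```bash
# Evaluate each host's system closure without building it; a host config that
# still relied on an auto-imported profile module surfaces here as an eval error.
for h in c1 c2 c3 alo-cloud-1 zippy; do
  nix eval ".#nixosConfigurations.$h.config.system.build.toplevel.drvPath" >/dev/null \
    && echo "$h: evaluates"
done
```
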
@@ -29,8 +29,7 @@

persistence."/persist/home/ppetru" = {
directories = [
".cache/nix"
".cache/nix-index"
".cache/"
".claude/"
".codex/"
".config/io.datasette.llm/"
@@ -41,7 +40,9 @@
".ssh"
"projects"
];
files = [ ];
files = [
".claude.json"
];
allowOther = true;
};
};

@@ -11,7 +11,6 @@ in
{
packages = workstationProfile.packages ++ desktopPkgs;
environment.persistence."/persist/home/ppetru".directories = [
".cache"
".config/google-chrome"
];
}

home/profiles/minimal.nix (new file, 5 lines)

@@ -0,0 +1,5 @@
{ pkgs }:
{
# Minimal profile: reuses server.nix for basic package list
packages = (import ./server.nix { inherit pkgs; }).packages;
}

home/programs/minimal.nix (new file, 5 lines)

@@ -0,0 +1,5 @@
{ pkgs, ... }:
{
# Minimal profile: reuses server.nix for basic CLI programs
imports = [ ./server.nix ];
}

@@ -2,6 +2,7 @@
{
imports = [
../../common/global
../../common/cloud-node.nix # Minimal system with Consul
./hardware.nix
./reverse-proxy.nix
];

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
../../common/nomad-server.nix # Consul + Nomad server mode
../../common/glusterfs.nix # GlusterFS server (temp during migration)
../../common/nfs-services-standby.nix # NFS standby for /data/services
# To promote to NFS server (during failover):
# 1. Follow procedure in docs/NFS_FAILOVER.md

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
../../common/nomad-server.nix # Consul + Nomad server mode
../../common/glusterfs.nix # GlusterFS server (temp during migration)
./hardware.nix
];

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
../../common/nomad-server.nix # Consul + Nomad server mode
../../common/glusterfs.nix # GlusterFS server (temp during migration)
../../common/binary-cache-server.nix
./hardware.nix
];

@@ -8,6 +8,7 @@
imports = [
../../common/encrypted-btrfs-layout.nix
../../common/global
../../common/workstation-node.nix # Dev tools (deploy-rs, docker, nix-ld)
../../common/cluster-member.nix # Consul + storage clients
../../common/cluster-tools.nix # Nomad CLI (no service)
./hardware.nix
@@ -25,6 +26,8 @@

networking.useNetworkd = true;
systemd.network.enable = true;
# Wait for br0 to be routable before considering network online
systemd.network.wait-online.extraArgs = [ "--interface=br0:routable" ];
# not useful and potentially a security loophole
services.resolved.llmnr = "false";

@@ -3,6 +3,7 @@
imports = [
../../common/encrypted-btrfs-layout.nix
../../common/global
../../common/desktop-node.nix # Hyprland + GUI environment
../../common/cluster-member.nix # Consul + storage clients
../../common/cluster-tools.nix # Nomad CLI (no service)
./hardware.nix

@@ -6,7 +6,6 @@
../../common/cluster-member.nix # Consul + storage clients
../../common/nomad-worker.nix # Nomad client (runs jobs)
# NOTE: zippy is NOT a server - no nomad-server.nix import
../../common/glusterfs.nix # GlusterFS server (temp during migration)
# ../../common/ethereum.nix
../../common/nfs-services-server.nix # NFS server for /data/services
# To move NFS server role to another host:

@@ -1,33 +1,9 @@
glusterfs setup on c1:
* for h in c1 c2 c3; do ssh $h sudo mkdir /persist/glusterfs/compute; done
* gluster peer probe c2
* gluster peer probe c3
* gluster volume create compute replica 3 c{1,2,3}:/persist/glusterfs/compute/brick1
* gluster volume start compute
* gluster volume bitrot compute enable

mysql credentials
* Put secrets/mysql_root_password into a Nomad var named secrets/mysql.root_password

postgres credentials
* Put secrets/postgres_password into a Nomad var named secrets/postgresql.postgres_password

adding a new gluster node to the compute volume, with c3 having failed:
(instructions from https://icicimov.github.io/blog/high-availability/Replacing-GlusterFS-failed-node/)
* zippy: sudo mkdir /persist/glusterfs/compute -p
* c1: gluster peer probe 192.168.1.2 (by IP because zippy resolved to a tailscale address)
* c1: gluster volume replace-brick compute c3:/persist/glusterfs/compute/brick1 192.168.1.2:/persist/glusterfs/compute/brick1 commit force
* c1: gluster volume heal compute full
* c1: gluster peer detach c3

same to then later replace 192.168.1.2 with 192.168.1.73

replacing failed / reinstalled gluster volume (c1 in this case). all commands on c2:
* gluster volume remove-brick compute replica 2 c1:/persist/glusterfs/compute/brick1 force
* gluster peer detach c1
* gluster peer probe 192.168.1.71 (not c1 because switching to IPs to avoid DNS/tailscale issues)
* gluster volume add-brick compute replica 3 192.168.1.71:/persist/glusterfs/compute/brick1

kopia repository server setup (on a non-NixOS host at the time):
* kopia repository create filesystem --path /backup/persist
* kopia repository connect filesystem --path=/backup/persist
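The credential notes above map onto the Nomad variables CLI roughly as follows (a sketch; the split of each variable into path `secrets/mysql` / `secrets/postgresql` with items `root_password` / `postgres_password` is an assumption about how the jobs template them in):

```bash
# Assumed path/key layout; adjust to whatever the job templates actually read.
nomad var put secrets/mysql root_password="$(cat secrets/mysql_root_password)"
nomad var put secrets/postgresql postgres_password="$(cat secrets/postgres_password)"
```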