diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 0000000..a59ac72
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,92 @@
+# Claude Code Quick Reference
+
+NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Consul orchestration.
+
+## Project Structure
+
+```
+├── common/
+│   ├── global/                   # Applied to all hosts (backup, sops, users, etc.)
+│   ├── compute-node.nix          # Nomad client + Consul agent + NFS client
+│   ├── cluster-node.nix          # Nomad server + Consul server (for quorum members)
+│   ├── nfs-services-server.nix   # NFS server + btrfs replication (zippy)
+│   └── nfs-services-standby.nix  # NFS standby + receive replication (c1, c2)
+├── hosts/
+│   ├── c1/, c2/, c3/             # Cattle nodes (compute, quorum members)
+│   ├── zippy/                    # Primary storage + NFS server + stateful workloads
+│   ├── fractal/                  # (Proxmox, will become NixOS storage node)
+│   ├── sunny/                    # (Standalone ethereum node, not in cluster)
+│   └── chilly/                   # (Home Assistant VM, not in cluster)
+├── docs/
+│   ├── CLUSTER_REVAMP.md         # Master plan for architecture changes
+│   ├── MIGRATION_TODO.md         # Tracking checklist for migration
+│   └── NFS_FAILOVER.md           # NFS failover procedures
+└── services/                     # Nomad job specs (.hcl files)
+```
+
+## Current Architecture (transitioning)
+
+**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
+**NEW**: NFS from zippy at `/data/services` (current target)
+
+### Storage Mounts
+- `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)
+- `/data/media` - CIFS from fractal (existing, unchanged)
+- `/data/shared` - CIFS from fractal (existing, unchanged)
+
+### Hosts
+- **c1, c2, c3**: Cattle nodes, run most workloads, Nomad/Consul quorum
+- **zippy**: Primary NFS server, runs databases (affinity), replicates to c1 every 5min
+- **fractal**: Storage node (Proxmox/ZFS), will join quorum after GlusterFS removed
+- **sunny**: Standalone ethereum staking node
+- **chilly**: Home Assistant VM
+
+## Key Patterns
+
+**NFS Server/Standby**:
+- Primary (zippy): imports `nfs-services-server.nix`, sets `standbys = ["c1"]`
+- Standby (c1): imports `nfs-services-standby.nix`, sets `replicationKeys = [...]`
+- Replication: btrfs send/receive every 5min, incremental with fallback to full
+
+**Backups**:
+- Kopia client on all nodes → Kopia server on fractal
+- Backs up `/persist` hourly via btrfs snapshot
+- Excludes: `services@*` and `services-standby/services@*` (replication snapshots)
+
+**Secrets**:
+- SOPS for secrets, files in `secrets/`
+- Keys managed per-host
+
+## Migration Status
+
+**Phase**: 2 complete, ready for Phase 3
+**Current**: Migrating GlusterFS → NFS
+**Next**: Copy data, update Nomad jobs, remove GlusterFS
+**Later**: Convert fractal to NixOS (deferred)
+
+See `docs/MIGRATION_TODO.md` for detailed checklist.
+
+## Common Tasks
+
+**Deploy a host**: `deploy -s '.#hostname'`
+**Deploy all**: `deploy`
+**Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
+**NFS failover**: See `docs/NFS_FAILOVER.md`
+**Nomad jobs**: `services/*.hcl` - update paths from `/data/compute` → `/data/services`
+
+## Troubleshooting Hints
+
+- Replication errors with "empty stream": SSH key restricted to `btrfs receive`, can't run other commands
+- NFS split-brain protection: nfs-server checks Consul before starting
+- Btrfs snapshots: nested snapshots appear as empty dirs in parent snapshots
+- Kopia: uses temporary snapshot for consistency, doesn't back up nested subvolumes
+
+## Important Files
+
+- `common/global/backup.nix` - Kopia backup configuration
+- `hosts/zippy/default.nix` - NFS server config, replication targets
+- `hosts/c1/default.nix` - NFS standby config, authorized replication keys
+- `flake.nix` - Host definitions, nixpkgs inputs
+
+---
+*Auto-generated reference for Claude Code. Keep concise. Update when architecture changes.*
diff --git a/docs/MIGRATION_TODO.md b/docs/MIGRATION_TODO.md
new file mode 100644
index 0000000..d9f6310
--- /dev/null
+++ b/docs/MIGRATION_TODO.md
@@ -0,0 +1,153 @@
+# Cluster Revamp Migration TODO
+
+Track migration progress from GlusterFS to NFS-based architecture.
+See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.
+
+## Phase 0: Preparation
+- [x] Review cluster revamp plan
+- [ ] Backup everything (kopia snapshots current)
+- [ ] Document current state (nomad jobs, consul services)
+
+## Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)
+- [ ] Document fractal's current ZFS layout
+- [ ] Install NixOS on fractal
+- [ ] Import ZFS pools (double1, double2, double3)
+- [ ] Create fractal NixOS configuration
+- [ ] Configure Samba server for media/shared/homes
+- [ ] Configure Kopia backup server
+- [ ] Deploy and verify fractal base config
+- [ ] Join fractal to cluster (5-server quorum)
+- [ ] Update all cluster configs for 5-server quorum
+- [ ] Verify fractal fully operational
+
+## Phase 2: Setup zippy storage layer
+- [x] Create btrfs subvolume `/persist/services` on zippy
+- [x] Configure NFS server on zippy (nfs-services-server.nix)
+- [x] Configure Consul service registration for NFS
+- [x] Setup btrfs replication to c1 (incremental, 5min intervals)
+- [x] Fix replication script to handle SSH command restrictions
+- [x] Setup standby storage on c1 (`/persist/services-standby`)
+- [x] Configure c1 as standby (nfs-services-standby.nix)
+- [x] Configure Kopia to exclude replication snapshots
+- [x] Deploy and verify NFS server on zippy
+- [x] Verify replication working to c1
+- [ ] Setup standby storage on c2 (if desired)
+- [ ] Configure replication to c2 (if desired)
+
+## Phase 3: Migrate from GlusterFS to NFS
+- [x] Update all nodes to mount NFS at `/data/services`
+- [x] Deploy updated configs (NFS client on all nodes)
+- [ ] Stop all Nomad jobs temporarily
+- [ ] Copy data from GlusterFS to zippy NFS
+  - [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
+  - [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
+  - [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
+  - [ ] Verify data integrity
+- [ ] Verify NFS mounts working on all nodes
+- [ ] Stop GlusterFS volume
+- [ ] Delete GlusterFS volume
+- [ ] Remove GlusterFS from NixOS configs
+- [ ] Remove syncthing wordpress sync configuration
+
+## Phase 4: Update and redeploy Nomad jobs
+
+### Core Infrastructure (CRITICAL)
+- [x] mysql.hcl - moved to zippy, using `/data/services`
+- [ ] postgres.hcl - update paths, add affinity for zippy
+- [ ] redis.hcl - update paths, add affinity for zippy
+- [ ] traefik.hcl - update paths (already floating)
+- [ ] authentik.hcl - verify (stateless, no changes needed)
+
+### Monitoring Stack (HIGH)
+- [ ] prometheus.hcl - update paths
+- [ ] grafana.hcl - update paths
+- [ ] loki.hcl - update paths
+- [ ] vector.hcl - remove glusterfs log collection
+
+### Databases (HIGH)
+- [ ] clickhouse.hcl - update paths, add affinity for zippy
+- [ ] unifi.hcl - update paths (includes mongodb)
+
+### Web Applications (HIGH-MEDIUM)
+- [ ] wordpress.hcl - update from `/data/sync/wordpress` to `/data/services/appdata/wordpress`
+- [ ] gitea.hcl - update paths
+- [ ] wiki.hcl - update paths, verify with exec driver
+- [ ] plausible.hcl - verify (stateless)
+
+### Web Applications (LOW, may be deprecated)
+- [ ] ghost.hcl - update paths or remove (no longer used?)
+- [ ] vikunja.hcl - update paths or remove (no longer used?)
+- [ ] leantime.hcl - update paths or remove (no longer used?)
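+
+A typical "update paths, add affinity for zippy" change looks roughly like the sketch below (hypothetical excerpt — the job/group/task names, image tag, and volume layout are assumptions; the real specs in `services/` may structure mounts differently):
+
+```hcl
+job "postgres" {
+  group "db" {
+    # Prefer zippy so the database runs next to the data it serves,
+    # while still allowing placement elsewhere during failover
+    affinity {
+      attribute = "${node.unique.name}"
+      value     = "zippy"
+      weight    = 100
+    }
+
+    task "postgres" {
+      driver = "docker"
+
+      config {
+        image = "postgres:16"
+        volumes = [
+          # old GlusterFS path: "/data/compute/appdata/postgres:/var/lib/postgresql/data"
+          "/data/services/appdata/postgres:/var/lib/postgresql/data",
+        ]
+      }
+    }
+  }
+}
+```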
+
+### Network Infrastructure (HIGH)
+- [ ] unifi.hcl - update paths (already listed above)
+
+### Media Stack (MEDIUM)
+- [ ] media.hcl - update paths, add constraint for fractal
+  - [ ] radarr, sonarr, bazarr, plex, qbittorrent
+
+### Utility Services (MEDIUM-LOW)
+- [ ] evcc.hcl - update paths
+- [ ] weewx.hcl - update paths
+- [ ] code-server.hcl - update paths
+- [ ] beancount.hcl - update paths
+- [ ] adminer.hcl - verify (stateless)
+- [ ] maps.hcl - update paths
+- [ ] netbox.hcl - update paths
+- [ ] farmos.hcl - update paths
+- [ ] urbit.hcl - update paths
+- [ ] webodm.hcl - update paths
+- [ ] velutrack.hcl - verify paths
+- [ ] resol-gateway.hcl - verify paths
+- [ ] igsync.hcl - update paths
+- [ ] jupyter.hcl - verify paths
+- [ ] whoami.hcl - verify (stateless test service)
+- [ ] tiddlywiki.hcl - update paths (if separate from wiki.hcl)
+
+### Backup Jobs (HIGH)
+- [x] mysql-backup - moved to zippy, verified
+- [ ] postgres-backup.hcl - verify destination
+- [ ] wordpress-backup.hcl - verify destination
+
+### Verification
+- [ ] All services healthy in Nomad
+- [ ] All services registered in Consul
+- [ ] Traefik routes working
+- [ ] Database jobs running on zippy (verify via nomad alloc status)
+- [ ] Media jobs running on fractal (verify via nomad alloc status)
+
+## Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)
+- [ ] Document current sunny setup (ethereum containers/VMs)
+- [ ] Backup ethereum data
+- [ ] Install NixOS on sunny
+- [ ] Restore ethereum data to `/persist/ethereum`
+- [ ] Create sunny container-based config (besu, lighthouse, rocketpool)
+- [ ] Deploy and verify ethereum stack
+- [ ] Monitor sync status and validation
+
+## Phase 6: Verification and cleanup
+- [ ] Test NFS failover procedure (zippy → c1)
+- [ ] Verify backups include `/persist/services` data
+- [ ] Verify backups exclude replication snapshots
+- [ ] Update documentation (README.md, architecture diagrams)
+- [ ] Clean up old GlusterFS data (only after everything verified!)
+- [ ] Remove old glusterfs directories from all nodes
+
+## Post-Migration Checklist
+- [ ] All 5 servers in quorum (consul members)
+- [ ] NFS mounts working on all nodes
+- [ ] Btrfs replication running (check systemd timers on zippy)
+- [ ] Critical services up (mysql, postgres, redis, traefik, authentik)
+- [ ] Monitoring working (prometheus, grafana, loki)
+- [ ] Media stack on fractal
+- [ ] Database jobs on zippy
+- [ ] Consul DNS working (dig @localhost -p 8600 data-services.service.consul)
+- [ ] Backups running (kopia snapshots include /persist/services)
+- [ ] GlusterFS removed (no processes, volumes deleted)
+- [ ] Documentation updated
+
+---
+
+**Last updated**: 2025-10-22
+**Current phase**: Phase 2 complete (zippy storage setup done), ready for Phase 3 (GlusterFS → NFS migration)
+**Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete
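+
+For reference, the zippy → c1 primary/standby wiring that the replication and failover checks above depend on looks roughly like this (a minimal sketch — the `standbys`/`replicationKeys` names come from CLAUDE.md, but the exact option paths and the key string are assumptions; `hosts/zippy/default.nix` and `hosts/c1/default.nix` hold the real definitions):
+
+```nix
+# Sketch of hosts/c1/default.nix — the standby side. zippy's config mirrors this
+# with `imports = [ ../../common/nfs-services-server.nix ]` and `standbys = [ "c1" ]`.
+{
+  imports = [ ../../common/nfs-services-standby.nix ];
+
+  # Public keys zippy may use for btrfs send/receive; each is command-restricted
+  # to `btrfs receive`, so the replication script must not run anything else over SSH
+  # (the source of the "empty stream" errors noted in CLAUDE.md's Troubleshooting Hints).
+  replicationKeys = [ "ssh-ed25519 AAAA... root@zippy" ];
+}
+```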