More docs.

CLAUDE.md (new file)
# Claude Code Quick Reference

NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Consul orchestration.

## Project Structure

```
├── common/
│   ├── global/                  # Applied to all hosts (backup, sops, users, etc.)
│   ├── compute-node.nix         # Nomad client + Consul agent + NFS client
│   ├── cluster-node.nix         # Nomad server + Consul server (for quorum members)
│   ├── nfs-services-server.nix  # NFS server + btrfs replication (zippy)
│   └── nfs-services-standby.nix # NFS standby + receive replication (c1, c2)
├── hosts/
│   ├── c1/, c2/, c3/            # Cattle nodes (compute, quorum members)
│   ├── zippy/                   # Primary storage + NFS server + stateful workloads
│   ├── fractal/                 # (Proxmox, will become NixOS storage node)
│   ├── sunny/                   # (Standalone Ethereum node, not in cluster)
│   └── chilly/                  # (Home Assistant VM, not in cluster)
├── docs/
│   ├── CLUSTER_REVAMP.md        # Master plan for architecture changes
│   ├── MIGRATION_TODO.md        # Tracking checklist for migration
│   └── NFS_FAILOVER.md          # NFS failover procedures
└── services/                    # Nomad job specs (.hcl files)
```

## Current Architecture (transitioning)

**OLD**: GlusterFS on c1/c2/c3 at `/data/compute` (being phased out)
**NEW**: NFS from zippy at `/data/services` (current target)

### Storage Mounts

- `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby; see the checks below)
- `/data/media` - CIFS from fractal (existing, unchanged)
- `/data/shared` - CIFS from fractal (existing, unchanged)
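
To sanity-check these mounts on a node, a quick look via Consul DNS and `findmnt` works. A sketch using standard tooling, not a script from this repo (port 8600 is Consul's default DNS port):

```bash
# Resolve the current NFS primary through Consul DNS.
dig +short @localhost -p 8600 data-services.service.consul

# Confirm /data/services is an NFS mount and which server backs it.
findmnt -t nfs,nfs4 /data/services

# The CIFS mounts from fractal should be present and unchanged.
findmnt -t cifs /data/media
findmnt -t cifs /data/shared
```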

### Hosts

- **c1, c2, c3**: Cattle nodes, run most workloads, Nomad/Consul quorum
- **zippy**: Primary NFS server, runs databases (via affinity), replicates to c1 every 5min
- **fractal**: Storage node (Proxmox/ZFS), will join the quorum after GlusterFS is removed
- **sunny**: Standalone Ethereum staking node
- **chilly**: Home Assistant VM
## Key Patterns

**NFS Server/Standby**:
- Primary (zippy): imports `nfs-services-server.nix`, sets `standbys = ["c1"]`
- Standby (c1): imports `nfs-services-standby.nix`, sets `replicationKeys = [...]`
- Replication: btrfs send/receive every 5min, incremental with fallback to full (sketched below)
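
A minimal sketch of that loop, assuming hypothetical paths and the `services@<timestamp>` snapshot naming implied by the Kopia excludes; the real script lives in `nfs-services-server.nix` and may differ:

```bash
#!/usr/bin/env bash
set -euo pipefail

SRC=/persist/services          # replicated subvolume on zippy
SNAPDIR=/persist               # where read-only snapshots live
STANDBY=c1                     # from `standbys = ["c1"]`
NEW="services@$(date +%Y%m%d-%H%M%S)"

# Take a new read-only snapshot to send.
btrfs subvolume snapshot -r "$SRC" "$SNAPDIR/$NEW"

# Most recent previous snapshot becomes the incremental parent, if any.
PREV=$(ls -d "$SNAPDIR"/services@* 2>/dev/null | grep -v "$NEW" | sort | tail -n1 || true)

# The standby's SSH key is locked to `btrfs receive` via a forced command,
# so nothing else can be run over this connection (see Troubleshooting).
if [ -n "$PREV" ] && btrfs send -p "$PREV" "$SNAPDIR/$NEW" \
    | ssh "$STANDBY" btrfs receive /persist/services-standby; then
  echo "incremental send ok (parent: $PREV)"
else
  # Fallback: full send when there is no parent or the incremental failed.
  btrfs send "$SNAPDIR/$NEW" | ssh "$STANDBY" btrfs receive /persist/services-standby
fi
```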

**Backups**:
- Kopia client on all nodes → Kopia server on fractal
- Backs up `/persist` hourly via a temporary btrfs snapshot
- Excludes `services@*` and `services-standby/services@*` (replication snapshots; see the sketch below)
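
For illustration, the same excludes expressed as Kopia CLI ignore rules; the actual settings are declared in `common/global/backup.nix`, so treat this as a sketch:

```bash
# Hypothetical CLI form of the exclude rules; in this repo they are
# configured declaratively, not via the kopia CLI.
kopia policy set /persist \
  --add-ignore 'services@*' \
  --add-ignore 'services-standby/services@*'

# Inspect what a snapshot would pick up before relying on it.
kopia snapshot estimate /persist
```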

**Secrets**:
- SOPS for secrets, files in `secrets/` (usage sketch below)
- Keys managed per-host
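
Typical day-to-day usage; the file name `zippy.yaml` below is a hypothetical example:

```bash
# Edit an encrypted file in place: sops decrypts into $EDITOR and
# re-encrypts on save.
sops secrets/zippy.yaml

# Re-encrypt after adding a new host key to .sops.yaml.
sops updatekeys secrets/zippy.yaml
```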

## Migration Status

**Phase**: 2 complete, ready for Phase 3
**Current**: Migrating GlusterFS → NFS
**Next**: Copy data, update Nomad jobs, remove GlusterFS
**Later**: Convert fractal to NixOS (deferred)

See `docs/MIGRATION_TODO.md` for the detailed checklist.

## Common Tasks

**Deploy a host**: `deploy -s '.#hostname'`
**Deploy all**: `deploy`
**Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
**NFS failover**: See `docs/NFS_FAILOVER.md`
**Nomad jobs**: `services/*.hcl` - update paths from `/data/compute` → `/data/services` (see the sketch below)
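
A bulk rewrite of the old mount prefix could look like this; it's a sketch, so review the diff before resubmitting jobs:

```bash
# Preview which job specs still reference the old GlusterFS path.
grep -rl '/data/compute' services/

# Rewrite the mount prefix in place, keeping a .bak copy of each file.
sed -i.bak 's|/data/compute|/data/services|g' services/*.hcl

# Then resubmit each changed job, e.g.:
# nomad job run services/postgres.hcl
```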

## Troubleshooting Hints

- Replication errors with "empty stream": the replication SSH key is restricted to `btrfs receive` and can't run other commands
- NFS split-brain protection: the nfs-server unit checks Consul before starting (see the sketch below)
- Btrfs snapshots: nested snapshots appear as empty dirs in parent snapshots
- Kopia: uses a temporary snapshot for consistency; doesn't back up nested subvolumes
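
Roughly what that pre-start check does. This is a sketch: the real logic is in `nfs-services-server.nix`, and the `jq` ownership test here is an assumption:

```bash
#!/usr/bin/env bash
# Refuse to start nfs-server if Consul says another node currently
# serves data-services. Service name comes from this repo's docs;
# the "first instance wins" test is illustrative.
set -euo pipefail

OWNER=$(curl -sf http://localhost:8500/v1/catalog/service/data-services \
  | jq -r '.[0].Node // empty')

if [ -n "$OWNER" ] && [ "$OWNER" != "$(hostname)" ]; then
  echo "data-services is already served by $OWNER; refusing to start" >&2
  exit 1
fi
```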

## Important Files

- `common/global/backup.nix` - Kopia backup configuration
- `hosts/zippy/default.nix` - NFS server config, replication targets
- `hosts/c1/default.nix` - NFS standby config, authorized replication keys
- `flake.nix` - Host definitions, nixpkgs inputs

---

*Auto-generated reference for Claude Code. Keep concise. Update when the architecture changes.*

docs/MIGRATION_TODO.md (new file)

# Cluster Revamp Migration TODO

Track migration progress from GlusterFS to the NFS-based architecture.
See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.

## Phase 0: Preparation
- [x] Review cluster revamp plan
- [ ] Backup everything (kopia snapshots current)
- [ ] Document current state (nomad jobs, consul services; see the sketch below)
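
One way to capture that state with the standard nomad/consul CLIs; the output file names are arbitrary:

```bash
# Dump the current Nomad job list and each job's definition.
nomad job status > state-nomad-jobs.txt
for job in $(nomad job status -short | awk 'NR>1 {print $1}'); do
  nomad job inspect "$job" > "state-job-$job.json"
done

# Dump registered Consul services and cluster membership.
consul catalog services > state-consul-services.txt
consul members > state-consul-members.txt
```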

## Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)
- [ ] Document fractal's current ZFS layout
- [ ] Install NixOS on fractal
- [ ] Import ZFS pools (double1, double2, double3; see the sketch below)
- [ ] Create fractal NixOS configuration
- [ ] Configure Samba server for media/shared/homes
- [ ] Configure Kopia backup server
- [ ] Deploy and verify fractal base config
- [ ] Join fractal to cluster (5-server quorum)
- [ ] Update all cluster configs for 5-server quorum
- [ ] Verify fractal fully operational
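
Re-importing the pools after the reinstall is standard zpool work (pool names from the item above; `-f` is only needed if the pools weren't exported cleanly):

```bash
# Scan for importable pools left over from the Proxmox install.
zpool import

# Import each pool by name; -f forces import if they were not exported.
for pool in double1 double2 double3; do
  zpool import -f "$pool"
done

# Confirm datasets are visible.
zfs list
```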

## Phase 2: Setup zippy storage layer
- [x] Create btrfs subvolume `/persist/services` on zippy
- [x] Configure NFS server on zippy (nfs-services-server.nix)
- [x] Configure Consul service registration for NFS
- [x] Setup btrfs replication to c1 (incremental, 5min intervals)
- [x] Fix replication script to handle SSH command restrictions
- [x] Setup standby storage on c1 (`/persist/services-standby`)
- [x] Configure c1 as standby (nfs-services-standby.nix)
- [x] Configure Kopia to exclude replication snapshots
- [x] Deploy and verify NFS server on zippy
- [x] Verify replication working to c1
- [ ] Setup standby storage on c2 (if desired)
- [ ] Configure replication to c2 (if desired)

## Phase 3: Migrate from GlusterFS to NFS
- [x] Update all nodes to mount NFS at `/data/services`
- [x] Deploy updated configs (NFS client on all nodes)
- [ ] Stop all Nomad jobs temporarily
- [ ] Copy data from GlusterFS to zippy NFS (see the sketch below)
  - [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
  - [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
  - [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
- [ ] Verify data integrity
- [ ] Verify NFS mounts working on all nodes
- [ ] Stop GlusterFS volume
- [ ] Delete GlusterFS volume
- [ ] Remove GlusterFS from NixOS configs
- [ ] Remove syncthing wordpress sync configuration
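
An rsync-based sketch of the copies, assuming a node where both the old Gluster mount and zippy's `/persist/services` are reachable; flags and exact invocation are illustrative:

```bash
# Run while Nomad jobs are stopped and the GlusterFS mount is still
# reachable. -aHAX preserves hardlinks, ACLs, and xattrs.
rsync -aHAX --info=progress2 /data/compute/appdata/ /persist/services/appdata/
rsync -aHAX --info=progress2 /data/compute/config/  /persist/services/config/
rsync -aHAX --info=progress2 /data/sync/wordpress/  /persist/services/appdata/wordpress/

# Spot-check integrity: a checksum dry-run should report no differences.
rsync -anc /data/compute/appdata/ /persist/services/appdata/ | head
```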

## Phase 4: Update and redeploy Nomad jobs

### Core Infrastructure (CRITICAL)
- [x] mysql.hcl - moved to zippy, using `/data/services`
- [ ] postgres.hcl - update paths, add affinity for zippy
- [ ] redis.hcl - update paths, add affinity for zippy
- [ ] traefik.hcl - update paths (already floating)
- [ ] authentik.hcl - verify (stateless, no changes needed)

### Monitoring Stack (HIGH)
- [ ] prometheus.hcl - update paths
- [ ] grafana.hcl - update paths
- [ ] loki.hcl - update paths
- [ ] vector.hcl - remove glusterfs log collection

### Databases (HIGH)
- [ ] clickhouse.hcl - update paths, add affinity for zippy
- [ ] unifi.hcl - update paths (includes mongodb)

### Web Applications (HIGH-MEDIUM)
- [ ] wordpress.hcl - update from `/data/sync/wordpress` to `/data/services/appdata/wordpress`
- [ ] gitea.hcl - update paths
- [ ] wiki.hcl - update paths, verify with exec driver
- [ ] plausible.hcl - verify (stateless)

### Web Applications (LOW, may be deprecated)
- [ ] ghost.hcl - update paths or remove (no longer used?)
- [ ] vikunja.hcl - update paths or remove (no longer used?)
- [ ] leantime.hcl - update paths or remove (no longer used?)

### Network Infrastructure (HIGH)
- [ ] unifi.hcl - update paths (already covered above under Databases)

### Media Stack (MEDIUM)
- [ ] media.hcl - update paths, add constraint for fractal
  - [ ] radarr, sonarr, bazarr, plex, qbittorrent

### Utility Services (MEDIUM-LOW)
- [ ] evcc.hcl - update paths
- [ ] weewx.hcl - update paths
- [ ] code-server.hcl - update paths
- [ ] beancount.hcl - update paths
- [ ] adminer.hcl - verify (stateless)
- [ ] maps.hcl - update paths
- [ ] netbox.hcl - update paths
- [ ] farmos.hcl - update paths
- [ ] urbit.hcl - update paths
- [ ] webodm.hcl - update paths
- [ ] velutrack.hcl - verify paths
- [ ] resol-gateway.hcl - verify paths
- [ ] igsync.hcl - update paths
- [ ] jupyter.hcl - verify paths
- [ ] whoami.hcl - verify (stateless test service)
- [ ] tiddlywiki.hcl - update paths (if separate from wiki.hcl)

### Backup Jobs (HIGH)
- [x] mysql-backup - moved to zippy, verified
- [ ] postgres-backup.hcl - verify destination
- [ ] wordpress-backup.hcl - verify destination

### Verification
- [ ] All services healthy in Nomad
- [ ] All services registered in Consul
- [ ] Traefik routes working
- [ ] Database jobs running on zippy (verify via nomad alloc status; see the sketch below)
- [ ] Media jobs running on fractal (verify via nomad alloc status)
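
Verifying placement is a matter of reading the node out of the allocation table; `mysql` below is just an example job name:

```bash
# The Allocations table at the end of "job status" lists the node for
# each allocation of the job.
nomad job status mysql

# Cluster-level sanity checks for quorum and client health.
nomad node status
consul members
```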

## Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)
- [ ] Document current sunny setup (Ethereum containers/VMs)
- [ ] Backup Ethereum data
- [ ] Install NixOS on sunny
- [ ] Restore Ethereum data to `/persist/ethereum`
- [ ] Create sunny container-based config (besu, lighthouse, rocketpool)
- [ ] Deploy and verify Ethereum stack
- [ ] Monitor sync status and validation

## Phase 6: Verification and cleanup
- [ ] Test NFS failover procedure (zippy → c1)
- [ ] Verify backups include `/persist/services` data
- [ ] Verify backups exclude replication snapshots
- [ ] Update documentation (README.md, architecture diagrams)
- [ ] Clean up old GlusterFS data (only after everything is verified!)
- [ ] Remove old glusterfs directories from all nodes

## Post-Migration Checklist
- [ ] All 5 servers in quorum (`consul members`)
- [ ] NFS mounts working on all nodes
- [ ] Btrfs replication running (check systemd timers on zippy)
- [ ] Critical services up (mysql, postgres, redis, traefik, authentik)
- [ ] Monitoring working (prometheus, grafana, loki)
- [ ] Media stack on fractal
- [ ] Database jobs on zippy
- [ ] Consul DNS working (`dig @localhost -p 8600 data-services.service.consul`)
- [ ] Backups running (kopia snapshots include `/persist/services`)
- [ ] GlusterFS removed (no processes, volumes deleted)
- [ ] Documentation updated

---

**Last updated**: 2025-10-22
**Current phase**: Phase 2 complete (zippy storage setup done), ready for Phase 3 (GlusterFS → NFS migration)
**Note**: Phase 1 (fractal NixOS conversion) is deferred until after the GlusterFS migration is complete