# Cluster Revamp Migration TODO
Track migration progress from GlusterFS to NFS-based architecture. See CLUSTER_REVAMP.md for detailed procedures.
## Phase 0: Preparation
- Review cluster revamp plan
- Backup everything (kopia snapshots current)
- Document current state (nomad jobs, consul services)
## Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)
- Document fractal's current ZFS layout
- Install NixOS on fractal
- Import ZFS pools (double1, double2, double3)
- Create fractal NixOS configuration
- Configure Samba server for media/shared/homes (ZFS/Samba sketch after this list)
- Configure Kopia backup server
- Deploy and verify fractal base config
- Join fractal to cluster (5-server quorum)
- Update all cluster configs for 5-server quorum
- Verify fractal fully operational
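
The fractal conversion above bundles several NixOS modules. As a point of reference, here is a minimal sketch of the ZFS import and Samba share pieces, assuming the pools mount under /double1–/double3 and the NixOS 24.05+ Samba option layout; the hostId, share paths, and Kopia wiring are placeholders, not fractal's real config:

```nix
# fractal.nix (excerpt) - sketch only; hostId, paths and share names are assumptions
{ ... }:
{
  # Import the existing pools at boot instead of re-creating them
  networking.hostId = "deadbeef";        # ZFS requires a hostId; placeholder value
  boot.supportedFilesystems = [ "zfs" ];
  boot.zfs.extraPools = [ "double1" "double2" "double3" ];

  # Samba shares for media/shared/homes (settings layout from NixOS 24.05+)
  services.samba = {
    enable = true;
    openFirewall = true;
    settings = {
      media  = { path = "/double1/media";  "read only" = "no"; };
      shared = { path = "/double2/shared"; "read only" = "no"; };
      homes  = { path = "/double3/homes";  "read only" = "no"; };
    };
  };

  # The Kopia backup server would be wired up separately, e.g. a systemd unit
  # running `kopia server start` (no dedicated NixOS module assumed here).
}
```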
## Phase 2: Setup zippy storage layer
- Create btrfs subvolume `/persist/services` on zippy
- Configure NFS server on zippy (nfs-services-server.nix; NFS/Consul sketch after this list)
- Configure Consul service registration for NFS
- Setup btrfs replication to c1 (incremental, 5min intervals)
- Fix replication script to handle SSH command restrictions
- Setup standby storage on c1 (`/persist/services-standby`)
- Configure c1 as standby (nfs-services-standby.nix)
- Configure Kopia to exclude replication snapshots
- Deploy and verify NFS server on zippy
- Verify replication working to c1
- Setup standby storage on c2 (if desired)
- Configure replication to c2 (if desired)
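
For the NFS server and Consul registration items above, a minimal sketch of what nfs-services-server.nix could contain, assuming the export lives at /persist/services, clients sit on a single LAN subnet, and the Consul service name data-services that the post-migration checklist queries; the export options, subnet, and firewall rule are assumptions:

```nix
# nfs-services-server.nix (sketch, not the deployed config)
{ ... }:
{
  # Export the btrfs subvolume that backs /data/services on the clients
  services.nfs.server = {
    enable = true;
    exports = ''
      /persist/services 10.0.0.0/24(rw,no_subtree_check,no_root_squash)
    '';
  };
  networking.firewall.allowedTCPPorts = [ 2049 ];

  # Register the export in Consul so clients can resolve
  # data-services.service.consul (service name taken from the checklist)
  services.consul.extraConfig.services = [
    {
      name = "data-services";
      port = 2049;
      checks = [ { tcp = "localhost:2049"; interval = "30s"; } ];
    }
  ];
}
```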
## Phase 3: Migrate from GlusterFS to NFS
- Update all nodes to mount NFS at `/data/services` (client mount sketch after this list)
- Deploy updated configs (NFS client on all nodes)
- Stop all Nomad jobs temporarily
- Copy data from GlusterFS to zippy NFS
  - Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
  - Copy `/data/compute/config/*` → `/persist/services/config/`
  - Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
  - Verify data integrity
- Verify NFS mounts working on all nodes
- Stop GlusterFS volume
- Delete GlusterFS volume
- Remove GlusterFS from NixOS configs
- Remove syncthing wordpress sync configuration (no longer used)
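
A minimal sketch of the client side referenced in the first item above, assuming the export path is /persist/services and the server is reachable by the hostname zippy (mounting via data-services.service.consul would also work once Consul DNS is wired into the resolver); the mount options are assumptions:

```nix
# NFS client mount for /data/services on every node (sketch)
{ ... }:
{
  boot.supportedFilesystems = [ "nfs" ];

  fileSystems."/data/services" = {
    device = "zippy:/persist/services";   # or data-services.service.consul:/persist/services
    fsType = "nfs";
    options = [
      "nfsvers=4.2"
      "x-systemd.automount"   # mount on first access so boot doesn't block on zippy
      "noatime"
    ];
  };
}
```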
## Phase 4: Update and redeploy Nomad jobs
### Core Infrastructure (CRITICAL)
- mysql.hcl - moved to zippy, using `/data/services`
- postgres.hcl - migrated to `/data/services`
- redis.hcl - migrated to `/data/services`
- traefik.hcl - migrated to `/data/services`
- authentik.hcl - stateless, no changes needed
### Monitoring Stack (HIGH)
- prometheus.hcl - migrated to `/data/services`
- grafana.hcl - migrated to `/data/services` (2025-10-23)
- loki.hcl - migrated to `/data/services`
- vector.hcl - removed glusterfs log collection (2025-10-23)
### Databases (HIGH)
- clickhouse.hcl - migrated to `/data/services`
- unifi.hcl - migrated to `/data/services` (includes mongodb)
### Web Applications (HIGH-MEDIUM)
- wordpress.hcl - migrated to `/data/services`
- gitea.hcl - migrated to `/data/services` (2025-10-23)
- wiki.hcl - migrated to `/data/services` (2025-10-23)
- plausible.hcl - stateless, no changes needed
### Web Applications (LOW, may be deprecated)
- vikunja.hcl - migrated to `/data/services` (2025-10-23, not running)
### Media Stack (MEDIUM)
- media.hcl - migrated to `/data/services`
### Utility Services (MEDIUM-LOW)
- evcc.hcl - migrated to `/data/services`
- weewx.hcl - migrated to `/data/services` (2025-10-23)
- code-server.hcl - migrated to `/data/services`
- beancount.hcl - migrated to `/data/services`
- adminer.hcl - stateless, no changes needed
- maps.hcl - migrated to `/data/services`
- netbox.hcl - migrated to `/data/services`
- farmos.hcl - migrated to `/data/services` (2025-10-23)
- urbit.hcl - migrated to `/data/services`
- webodm.hcl - migrated to `/data/services` (2025-10-23, not running)
- velutrack.hcl - migrated to `/data/services`
- resol-gateway.hcl - migrated to `/data/services` (2025-10-23)
- igsync.hcl - migrated to `/data/services` (2025-10-23)
- jupyter.hcl - migrated to `/data/services` (2025-10-23, not running)
- whoami.hcl - stateless test service, no changes needed
### Backup Jobs (HIGH)
- mysql-backup - moved to zippy, verified
- postgres-backup.hcl - migrated to `/data/services`
### Host Volume Definitions (CRITICAL)
- common/nomad.nix - consolidated `appdata` and `code` volumes into single `services` volume (2025-10-23); host volume sketch below
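
A sketch of what the consolidated host volume in common/nomad.nix could look like, assuming the freeform services.nomad.settings option of the NixOS Nomad module; the read_only flag is an assumption:

```nix
# common/nomad.nix (excerpt) - single "services" host volume replacing appdata/code
{ ... }:
{
  services.nomad.settings.client.host_volume.services = {
    path = "/data/services";   # the NFS mount from Phase 3
    read_only = false;
  };
}
```

Individual .hcl jobs then request it with a `volume` block using `type = "host"` and `source = "services"`, plus the matching `volume_mount` in each task.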
### Verification
- All services healthy in Nomad
- All services registered in Consul
- Traefik routes working
- Database jobs running on zippy (verify via nomad alloc status)
- Media jobs running on fractal (verify via nomad alloc status)
## Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)
- Document current sunny setup (ethereum containers/VMs)
- Backup ethereum data
- Install NixOS on sunny
- Restore ethereum data to `/persist/ethereum`
- Create sunny container-based config (besu, lighthouse, rocketpool); container sketch after this list
- Deploy and verify ethereum stack
- Monitor sync status and validation
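
For the container-based config item above, a rough sketch using the NixOS oci-containers module; the image tags, volume paths under /persist/ethereum, and the omitted client flags/JWT wiring are all assumptions, and Rocket Pool's smartnode stack would need its own containers:

```nix
# sunny.nix (excerpt) - sketch only; images, paths and flags are placeholders
{ ... }:
{
  virtualisation.oci-containers = {
    backend = "podman";
    containers = {
      besu = {
        image = "hyperledger/besu:latest";
        volumes = [ "/persist/ethereum/besu:/opt/besu/data" ];
        extraOptions = [ "--network=host" ];
        # execution-client flags (data path, engine JWT, RPC ports) omitted
      };
      lighthouse = {
        image = "sigp/lighthouse:latest";
        volumes = [ "/persist/ethereum/lighthouse:/root/.lighthouse" ];
        extraOptions = [ "--network=host" ];
        # beacon-node flags (execution endpoint, JWT, checkpoint sync) omitted
      };
    };
  };
}
```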
## Phase 6: Verification and cleanup
- Test NFS failover procedure (zippy → c1)
- Verify backups include `/persist/services` data
- Verify backups exclude replication snapshots
- Update documentation (README.md, architecture diagrams)
- Clean up old GlusterFS data (only after everything verified!)
- Remove old glusterfs directories from all nodes
## Post-Migration Checklist
- All 5 servers in quorum (consul members)
- NFS mounts working on all nodes
- Btrfs replication running (check systemd timers on zippy)
- Critical services up (mysql, postgres, redis, traefik, authentik)
- Monitoring working (prometheus, grafana, loki)
- Media stack on fractal
- Database jobs on zippy
- Consul DNS working (dig @localhost -p 8600 data-services.service.consul)
- Backups running (kopia snapshots include /persist/services)
- GlusterFS removed (no processes, volumes deleted)
- Documentation updated
Last updated: 2025-10-25
Current phase: Phase 3 & 4 complete! GlusterFS removed, all services on NFS
Note: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete
## Migration Summary
All services migrated to /data/services (30 total):
mysql, mysql-backup, postgres, postgres-backup, redis, clickhouse, prometheus, grafana, loki, vector, unifi, wordpress, gitea, wiki, traefik, evcc, weewx, netbox, farmos, webodm, jupyter, vikunja, urbit, code-server, beancount, velutrack, maps, media, resol-gateway, igsync
Stateless/no changes needed (4 services): authentik, adminer, plausible, whoami
Configuration changes:
- common/nomad.nix: consolidated `appdata` and `code` volumes into single `services` volume
- vector.hcl: removed glusterfs log collection