alo-cluster/docs/MIGRATION_TODO.md
2025-10-22 13:59:31 +01:00

Cluster Revamp Migration TODO

Track migration progress from GlusterFS to the NFS-based architecture. See CLUSTER_REVAMP.md for detailed procedures.

Phase 0: Preparation

  • Review cluster revamp plan
  • Back up everything (Kopia snapshots current)
  • Document current state (Nomad jobs, Consul services)

Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)

  • Document fractal's current ZFS layout
  • Install NixOS on fractal
  • Import ZFS pools (double1, double2, double3)
  • Create fractal NixOS configuration
  • Configure Samba server for media/shared/homes
  • Configure Kopia backup server
  • Deploy and verify fractal base config
  • Join fractal to cluster (5-server quorum)
  • Update all cluster configs for 5-server quorum
  • Verify fractal fully operational

Phase 2: Set up zippy storage layer

  • Create btrfs subvolume /persist/services on zippy
  • Configure NFS server on zippy (nfs-services-server.nix)
  • Configure Consul service registration for NFS
  • Set up btrfs replication to c1 (incremental, 5-minute intervals)
  • Fix replication script to handle SSH command restrictions
  • Set up standby storage on c1 (/persist/services-standby)
  • Configure c1 as standby (nfs-services-standby.nix)
  • Configure Kopia to exclude replication snapshots
  • Deploy and verify NFS server on zippy
  • Verify replication working to c1
  • Set up standby storage on c2 (if desired)
  • Configure replication to c2 (if desired)
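A minimal sketch of one incremental replication step, assuming the usual btrfs pattern: take a read-only snapshot of /persist/services, then stream only the difference from the previous snapshot with `btrfs send -p PARENT NEW | ssh c1 btrfs receive DEST`. The snapshot paths and function names are hypothetical, and this is not the replication script already running on zippy. If the SSH key used on c1 is restricted to a forced command, `receive_cmd` would have to match whatever that key permits.

```python
import subprocess
from typing import List, Optional

# Hypothetical sketch of one incremental btrfs replication step (zippy -> c1).


def snapshot_cmd(source: str, snapshot_path: str) -> List[str]:
    # btrfs send only accepts read-only subvolumes, hence -r.
    return ["btrfs", "subvolume", "snapshot", "-r", source, snapshot_path]


def send_cmd(snapshot_path: str, parent: Optional[str]) -> List[str]:
    # With -p, only the changes since the parent snapshot are streamed;
    # the parent must already exist on the receiving side.
    cmd = ["btrfs", "send"]
    if parent is not None:
        cmd += ["-p", parent]
    return cmd + [snapshot_path]


def receive_cmd(host: str, dest_dir: str) -> List[str]:
    # Assumes plain ssh access; a forced command on the key would override this.
    return ["ssh", host, "btrfs", "receive", dest_dir]


def replicate(source: str, snapshot_path: str, parent: Optional[str],
              host: str, dest_dir: str) -> None:
    subprocess.run(snapshot_cmd(source, snapshot_path), check=True)
    sender = subprocess.Popen(send_cmd(snapshot_path, parent), stdout=subprocess.PIPE)
    receiver = subprocess.Popen(receive_cmd(host, dest_dir), stdin=sender.stdout)
    sender.stdout.close()  # lets send get SIGPIPE if receive exits early
    recv_rc = receiver.wait()
    send_rc = sender.wait()
    if recv_rc != 0 or send_rc != 0:
        raise RuntimeError("btrfs send/receive pipeline failed")
```

A timer firing every 5 minutes would call `replicate(...)` with the newest snapshot as `snapshot_path` and the previous one as `parent`; the very first run passes `parent=None` and sends a full stream.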

Phase 3: Migrate from GlusterFS to NFS

  • Update all nodes to mount NFS at /data/services
  • Deploy updated configs (NFS client on all nodes)
  • Stop all Nomad jobs temporarily
  • Copy data from GlusterFS to zippy NFS
    • Copy /data/compute/appdata/* → /persist/services/appdata/
    • Copy /data/compute/config/* → /persist/services/config/
    • Copy /data/sync/wordpress → /persist/services/appdata/wordpress
    • Verify data integrity
  • Verify NFS mounts working on all nodes
  • Stop GlusterFS volume
  • Delete GlusterFS volume
  • Remove GlusterFS from NixOS configs
  • Remove syncthing wordpress sync configuration
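For the "Verify data integrity" step, one option is to compare a SHA-256 digest of every file in each source tree with its copy. This sketch assumes both trees are readable from the machine running it (for example zippy, with the GlusterFS volume still mounted at its old paths); `COPY_PLAN` restates the copy list above, and the function names are hypothetical. It compares file contents only, so ownership and permissions would still need a separate check.

```python
import hashlib
from pathlib import Path
from typing import Dict, List

# Restates the Phase 3 copy list (source on GlusterFS -> destination on zippy).
COPY_PLAN = {
    "/data/compute/appdata": "/persist/services/appdata",
    "/data/compute/config": "/persist/services/config",
    "/data/sync/wordpress": "/persist/services/appdata/wordpress",
}


def file_digests(root: Path) -> Dict[str, str]:
    """Map each regular file under root (relative POSIX path) to its SHA-256 digest."""
    digests: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue  # symlinks, directories and special files are skipped
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def compare_trees(src: Path, dst: Path) -> List[str]:
    """Relative paths present in src that are missing from dst or differ in content."""
    src_digests = file_digests(src)
    dst_digests = file_digests(dst)
    return sorted(p for p, d in src_digests.items() if dst_digests.get(p) != d)


def verify_plan(plan: Dict[str, str] = COPY_PLAN) -> Dict[str, List[str]]:
    """Run compare_trees for every pair; an empty list means that copy matches."""
    return {src: compare_trees(Path(src), Path(dst)) for src, dst in plan.items()}
```

Extra files that exist only on the destination (such as the wordpress directory inside appdata) are deliberately not reported, since only the source side defines what must have been copied.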

Phase 4: Update and redeploy Nomad jobs

Core Infrastructure (CRITICAL)

  • mysql.hcl - moved to zippy, using /data/services
  • postgres.hcl - update paths, add affinity for zippy
  • redis.hcl - update paths, add affinity for zippy
  • traefik.hcl - update paths (already floating)
  • authentik.hcl - verify (stateless, no changes needed)

Monitoring Stack (HIGH)

  • prometheus.hcl - update paths
  • grafana.hcl - update paths
  • loki.hcl - update paths
  • vector.hcl - remove glusterfs log collection

Databases (HIGH)

  • clickhouse.hcl - update paths, add affinity for zippy
  • unifi.hcl - update paths (includes mongodb)

Web Applications (HIGH-MEDIUM)

  • wordpress.hcl - update from /data/sync/wordpress to /data/services/wordpress
  • gitea.hcl - update paths
  • wiki.hcl - update paths, verify with exec driver
  • plausible.hcl - verify (stateless)

Web Applications (LOW, may be deprecated)

  • ghost.hcl - update paths or remove (no longer used?)
  • vikunja.hcl - update paths or remove (no longer used?)
  • leantime.hcl - update paths or remove (no longer used?)

Network Infrastructure (HIGH)

  • unifi.hcl - update paths (already listed above)

Media Stack (MEDIUM)

  • media.hcl - update paths, add constraint for fractal
    • radarr, sonarr, bazarr, plex, qbittorrent

Utility Services (MEDIUM-LOW)

  • evcc.hcl - update paths
  • weewx.hcl - update paths
  • code-server.hcl - update paths
  • beancount.hcl - update paths
  • adminer.hcl - verify (stateless)
  • maps.hcl - update paths
  • netbox.hcl - update paths
  • farmos.hcl - update paths
  • urbit.hcl - update paths
  • webodm.hcl - update paths
  • velutrack.hcl - verify paths
  • resol-gateway.hcl - verify paths
  • igsync.hcl - update paths
  • jupyter.hcl - verify paths
  • whoami.hcl - verify (stateless test service)
  • tiddlywiki.hcl - update paths (if separate from wiki.hcl)

Backup Jobs (HIGH)

  • mysql-backup - moved to zippy, verified
  • postgres-backup.hcl - verify destination
  • wordpress-backup.hcl - verify destination
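To track which job files still need their paths updated, a small scan can list every line in the .hcl files that still mentions the GlusterFS-era prefixes (/data/compute/, /data/sync/) or the word glusterfs (for the vector.hcl log collection). This is a sketch; the job directory and function name are hypothetical, and the plain substring match will also flag comments.

```python
from pathlib import Path
from typing import Dict, List

# Markers that should disappear from job files once Phase 4 is done.
STALE_MARKERS = ("/data/compute/", "/data/sync/", "glusterfs")


def stale_references(job_dir: Path) -> Dict[str, List[int]]:
    """Map each .hcl file in job_dir to the 1-based line numbers that still match a marker."""
    hits: Dict[str, List[int]] = {}
    for hcl in sorted(job_dir.glob("*.hcl")):
        for lineno, line in enumerate(hcl.read_text().splitlines(), start=1):
            if any(marker in line.lower() for marker in STALE_MARKERS):
                hits.setdefault(hcl.name, []).append(lineno)
    return hits
```

Running it against the directory that holds the job files gives a to-do list per file; an empty dict means no known old paths remain.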

Verification

  • All services healthy in Nomad
  • All services registered in Consul
  • Traefik routes working
  • Database jobs running on zippy (verify via nomad alloc status)
  • Media jobs running on fractal (verify via nomad alloc status)
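The two placement checks can also be scripted against the Nomad HTTP API, whose `/v1/job/<job>/allocations` endpoint returns allocation stubs that include `NodeName` and `ClientStatus`. This sketch assumes the default agent address (http://localhost:4646), no ACL token, Nomad node names equal to the hostnames (zippy, fractal), and job IDs matching the .hcl file names; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

NOMAD_ADDR = "http://localhost:4646"  # assumption: local agent, default port, no ACL token


def job_allocations(job_id: str, addr: str = NOMAD_ADDR) -> List[Dict]:
    """Fetch the allocation stubs for one job from the Nomad HTTP API."""
    with urllib.request.urlopen(f"{addr}/v1/job/{job_id}/allocations") as resp:
        return json.load(resp)


def check_placement(allocs: List[Dict], expected_node: str) -> Tuple[int, List[str]]:
    """Return (running allocations on expected_node, IDs of running ones elsewhere)."""
    # The endpoint also returns finished/failed allocations, so keep only running ones.
    running = [a for a in allocs if a.get("ClientStatus") == "running"]
    on_node = sum(1 for a in running if a.get("NodeName") == expected_node)
    elsewhere = [a["ID"] for a in running if a.get("NodeName") != expected_node]
    return on_node, elsewhere
```

A job passes when the count is at least 1 and the list is empty, for example `check_placement(job_allocations("postgres"), "zippy")`.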

Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)

  • Document current sunny setup (ethereum containers/VMs)
  • Backup ethereum data
  • Install NixOS on sunny
  • Restore ethereum data to /persist/ethereum
  • Create sunny container-based config (besu, lighthouse, rocketpool)
  • Deploy and verify ethereum stack
  • Monitor sync status and validation

Phase 6: Verification and cleanup

  • Test NFS failover procedure (zippy → c1)
  • Verify backups include /persist/services data
  • Verify backups exclude replication snapshots
  • Update documentation (README.md, architecture diagrams)
  • Clean up old GlusterFS data (only after everything verified!)
  • Remove old glusterfs directories from all nodes

Post-Migration Checklist

  • All 5 servers in quorum (consul members)
  • NFS mounts working on all nodes
  • Btrfs replication running (check systemd timers on zippy)
  • Critical services up (mysql, postgres, redis, traefik, authentik)
  • Monitoring working (prometheus, grafana, loki)
  • Media stack on fractal
  • Database jobs on zippy
  • Consul DNS working (dig @localhost -p 8600 data-services.service.consul)
  • Backups running (kopia snapshots include /persist/services)
  • GlusterFS removed (no processes, volumes deleted)
  • Documentation updated
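Two of these items can be checked through the local Consul agent's HTTP API instead of the CLI: `/v1/status/peers` lists the Raft peers (one entry per server, so five are expected), and `/v1/health/service/data-services?passing` lists the nodes with a passing `data-services` registration, the same service the `dig` check queries. This sketch assumes the default agent address (http://localhost:8500) and no ACL token; the helper names are hypothetical.

```python
import json
import urllib.request
from typing import Dict, List

CONSUL_ADDR = "http://localhost:8500"  # assumption: local agent, default port, no ACL token


def consul_get(path: str, addr: str = CONSUL_ADDR):
    """GET a Consul HTTP API path and decode the JSON body."""
    with urllib.request.urlopen(addr + path) as resp:
        return json.load(resp)


def summarize(peers: List[str], health: List[Dict]) -> Dict:
    """Reduce the raw API responses to the facts the checklist asks about."""
    return {
        "raft_peers": len(peers),
        "data_services_nodes": sorted(entry["Node"]["Node"] for entry in health),
    }


def post_migration_check(expected_servers: int = 5) -> Dict:
    """Query the local agent and flag whether all servers are Raft peers."""
    report = summarize(
        consul_get("/v1/status/peers"),
        consul_get("/v1/health/service/data-services?passing"),
    )
    report["all_servers_in_raft"] = report["raft_peers"] == expected_servers
    return report
```

A healthy report would presumably show five Raft peers and zippy (or c1 after a failover) as the node serving data-services.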

Last updated: 2025-10-22
Current phase: Phase 2 complete (zippy storage setup done), ready for Phase 3 (GlusterFS → NFS migration)
Note: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete