Cluster Revamp Migration TODO

Track migration progress from GlusterFS to NFS-based architecture. See CLUSTER_REVAMP.md for detailed procedures.

Phase 0: Preparation

  • Review cluster revamp plan
  • Back up everything (Kopia snapshots are current)
  • Document current state (nomad jobs, consul services)

Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)

  • Document fractal's current ZFS layout
  • Install NixOS on fractal
  • Import ZFS pools (double1, double2, double3)
  • Create fractal NixOS configuration
  • Configure Samba server for media/shared/homes
  • Configure Kopia backup server
  • Deploy and verify fractal base config
  • Join fractal to cluster (5-server quorum)
  • Update all cluster configs for 5-server quorum
  • Verify fractal fully operational

Phase 2: Set up zippy storage layer

  • Create btrfs subvolume /persist/services on zippy
  • Configure NFS server on zippy (nfs-services-server.nix)
  • Configure Consul service registration for NFS
  • Set up btrfs replication to c1 (incremental, 5-minute intervals)
  • Fix replication script to handle SSH command restrictions
  • Set up standby storage on c1 (/persist/services-standby)
  • Configure c1 as standby (nfs-services-standby.nix)
  • Configure Kopia to exclude replication snapshots
  • Deploy and verify NFS server on zippy
  • Verify replication working to c1
  • Set up standby storage on c2 (if desired)
  • Configure replication to c2 (if desired)
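
The incremental replication step above can be sketched as a dry run that only prints the commands for one cycle. This is a minimal sketch, not the actual replication script: the snapshot paths passed in and the `c1` SSH host alias are assumptions, and the SSH command restrictions mentioned in the list are not modelled.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one replication cycle instead of running them.
# Assumed (not from the checklist): snapshot paths are chosen by the caller,
# and "c1" is a working SSH host alias.
set -eu

SRC=/persist/services            # subvolume named in the checklist
DEST=/persist/services-standby   # standby location on c1 named in the checklist
STANDBY_HOST=c1

# replication_plan PREV_SNAPSHOT NEW_SNAPSHOT
# PREV_SNAPSHOT may be empty for the first (full) transfer.
replication_plan() {
  prev="$1"
  new="$2"
  # btrfs send only accepts read-only snapshots, hence -r.
  echo "btrfs subvolume snapshot -r $SRC $new"
  if [ -n "$prev" ]; then
    # Incremental: send only the difference against the previous snapshot.
    echo "btrfs send -p $prev $new | ssh $STANDBY_HOST btrfs receive $DEST"
  else
    # First run: full send of the new snapshot.
    echo "btrfs send $new | ssh $STANDBY_HOST btrfs receive $DEST"
  fi
}
```

Calling `replication_plan "" /persist/snap-a` prints a full send, and `replication_plan /persist/snap-a /persist/snap-b` prints an incremental one (both snapshot paths are hypothetical). The previous snapshot has to remain on both hosts until the next cycle, because `btrfs send -p` needs it as the parent.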

Phase 3: Migrate from GlusterFS to NFS

  • Update all nodes to mount NFS at /data/services
  • Deploy updated configs (NFS client on all nodes)
  • Stop all Nomad jobs temporarily
  • Copy data from GlusterFS to zippy NFS
    • Copy /data/compute/appdata/* → /persist/services/appdata/
    • Copy /data/compute/config/* → /persist/services/config/
    • Copy /data/sync/wordpress → /persist/services/appdata/wordpress
    • Verify data integrity
  • Verify NFS mounts working on all nodes
  • Stop GlusterFS volume
  • Delete GlusterFS volume
  • Remove GlusterFS from NixOS configs
  • Remove syncthing wordpress sync configuration
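
One way to approach the "Verify data integrity" step is to compare checksum manifests of the source and destination trees before stopping the GlusterFS volume. The sketch below assumes GNU findutils and coreutils (`sort -z`, `xargs -0 -r`, `sha256sum`). It compares file contents and relative paths only, not ownership or permissions.

```shell
#!/bin/sh
# Sketch: compare two directory trees by content checksum.
set -eu

# manifest DIR
# Prints "checksum  ./relative/path" for every regular file, sorted by path.
manifest() {
  ( cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum )
}

# verify_copy SRC DST
# Succeeds only if both trees contain the same files with identical contents.
verify_copy() {
  [ "$(manifest "$1")" = "$(manifest "$2")" ]
}
```

For example, `verify_copy /data/compute/config /persist/services/config && echo OK` would confirm the config tree after the copy, assuming both paths are readable from the node running the check.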

Phase 4: Update and redeploy Nomad jobs

Core Infrastructure (CRITICAL)

  • mysql.hcl - moved to zippy, using /data/services
  • postgres.hcl - migrated to /data/services
  • redis.hcl - migrated to /data/services
  • traefik.hcl - migrated to /data/services
  • authentik.hcl - stateless, no changes needed

Monitoring Stack (HIGH)

  • prometheus.hcl - migrated to /data/services
  • grafana.hcl - migrated to /data/services (2025-10-23)
  • loki.hcl - migrated to /data/services
  • vector.hcl - removed glusterfs log collection (2025-10-23)

Databases (HIGH)

  • clickhouse.hcl - migrated to /data/services
  • unifi.hcl - migrated to /data/services (includes mongodb)

Web Applications (HIGH-MEDIUM)

  • wordpress.hcl - migrated to /data/services
  • gitea.hcl - migrated to /data/services (2025-10-23)
  • wiki.hcl - migrated to /data/services (2025-10-23)
  • plausible.hcl - stateless, no changes needed

Web Applications (LOW, may be deprecated)

  • vikunja.hcl - migrated to /data/services (2025-10-23, not running)

Media Stack (MEDIUM)

  • media.hcl - migrated to /data/services

Utility Services (MEDIUM-LOW)

  • evcc.hcl - migrated to /data/services
  • weewx.hcl - migrated to /data/services (2025-10-23)
  • code-server.hcl - migrated to /data/services
  • beancount.hcl - migrated to /data/services
  • adminer.hcl - stateless, no changes needed
  • maps.hcl - migrated to /data/services
  • netbox.hcl - migrated to /data/services
  • farmos.hcl - migrated to /data/services (2025-10-23)
  • urbit.hcl - migrated to /data/services
  • webodm.hcl - migrated to /data/services (2025-10-23, not running)
  • velutrack.hcl - migrated to /data/services
  • resol-gateway.hcl - migrated to /data/services (2025-10-23)
  • igsync.hcl - migrated to /data/services (2025-10-23)
  • jupyter.hcl - migrated to /data/services (2025-10-23, not running)
  • whoami.hcl - stateless test service, no changes needed

Backup Jobs (HIGH)

  • mysql-backup - moved to zippy, verified
  • postgres-backup.hcl - migrated to /data/services

Host Volume Definitions (CRITICAL)

  • common/nomad.nix - consolidated appdata and code volumes into single services volume (2025-10-23)

Verification

  • All services healthy in Nomad
  • All services registered in Consul
  • Traefik routes working
  • Database jobs running on zippy (verify via nomad alloc status)
  • Media jobs running on fractal (verify via nomad alloc status)
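
The placement checks above can be partly scripted by extracting the node name from `nomad alloc status`. This sketch assumes the command's human-readable output contains a `Node Name = <host>` line, which may differ between Nomad versions. The helper names are hypothetical.

```shell
#!/bin/sh
# Sketch: confirm that an allocation runs on the expected node.

# alloc_node_name
# Reads `nomad alloc status` output on stdin and prints the "Node Name" value.
alloc_node_name() {
  awk -F' *= *' '$1 == "Node Name" { print $2; exit }'
}

# alloc_on_node ALLOC_ID EXPECTED_NODE
# Succeeds if the allocation is placed on EXPECTED_NODE.
alloc_on_node() {
  actual="$(nomad alloc status "$1" | alloc_node_name)"
  [ "$actual" = "$2" ]
}
```

Usage would look like `alloc_on_node <alloc-id> zippy` for a database job, with the allocation ID taken from `nomad job status <job>`.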

Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)

  • Document current sunny setup (ethereum containers/VMs)
  • Back up ethereum data
  • Install NixOS on sunny
  • Restore ethereum data to /persist/ethereum
  • Create sunny container-based config (besu, lighthouse, rocketpool)
  • Deploy and verify ethereum stack
  • Monitor sync status and validation

Phase 6: Verification and cleanup

  • Test NFS failover procedure (zippy → c1)
  • Verify backups include /persist/services data
  • Verify backups exclude replication snapshots
  • Update documentation (README.md, architecture diagrams)
  • Clean up old GlusterFS data (only after everything verified!)
  • Remove old glusterfs directories from all nodes

Post-Migration Checklist

  • All 5 servers in quorum (consul members)
  • NFS mounts working on all nodes
  • Btrfs replication running (check systemd timers on zippy)
  • Critical services up (mysql, postgres, redis, traefik, authentik)
  • Monitoring working (prometheus, grafana, loki)
  • Media stack on fractal
  • Database jobs on zippy
  • Consul DNS working (dig @localhost -p 8600 data-services.service.consul)
  • Backups running (kopia snapshots include /persist/services)
  • GlusterFS removed (no processes, volumes deleted)
  • Documentation updated
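
Several items in this checklist can be scripted as a small smoke test. The sketch below covers only the checks whose commands appear in this document or are standard tools (`findmnt`, `pgrep`). The helper names are hypothetical, and the items that need local knowledge (timer names, Kopia policies, job placement) are left as manual steps.

```shell
#!/bin/sh
# Sketch of a post-migration smoke test. Each check prints PASS or FAIL.

FAILED=0

# check DESCRIPTION COMMAND [ARGS...]
# Runs the command silently and records whether it succeeded.
check() {
  desc="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $desc"
  else
    echo "FAIL  $desc"
    FAILED=$((FAILED + 1))
  fi
}

# dig exits 0 even for an empty answer, so test for a non-empty result instead.
dns_resolves() {
  [ -n "$(dig +short @localhost -p 8600 "$1")" ]
}

# pgrep exits 1 only when nothing matched, so require exactly that status.
gluster_gone() {
  pgrep gluster >/dev/null 2>&1
  [ "$?" -eq 1 ]
}

run_post_migration_checks() {
  check "Consul agent answers (consul members)" consul members
  check "NFS mounted at /data/services" findmnt -t nfs,nfs4 /data/services
  check "Consul DNS resolves data-services" dns_resolves data-services.service.consul
  check "No GlusterFS processes" gluster_gone
  [ "$FAILED" -eq 0 ]
}
```

Running `run_post_migration_checks` on each node gives a quick pass/fail summary. Note that `consul members` succeeding only shows the local agent is reachable; it does not confirm that all five servers are in quorum.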

Last updated: 2025-10-23 22:30
Current phase: Phase 4 complete! All services migrated to NFS.
Note: Phase 1 (fractal NixOS conversion) is deferred until after the GlusterFS migration is complete.

Migration Summary

All services migrated to /data/services (30 total): mysql, mysql-backup, postgres, postgres-backup, redis, clickhouse, prometheus, grafana, loki, vector, unifi, wordpress, gitea, wiki, traefik, evcc, weewx, netbox, farmos, webodm, jupyter, vikunja, urbit, code-server, beancount, velutrack, maps, media, resol-gateway, igsync

Stateless/no changes needed (4 services): authentik, adminer, plausible, whoami

Configuration changes:

  • common/nomad.nix: consolidated appdata and code volumes into single services volume
  • vector.hcl: removed glusterfs log collection