# Cluster Revamp Migration TODO

Track migration progress from GlusterFS to an NFS-based architecture.
See [CLUSTER_REVAMP.md](./CLUSTER_REVAMP.md) for detailed procedures.

## Phase 0: Preparation
- [x] Review cluster revamp plan
- [ ] Back up everything (confirm Kopia snapshots are current)
- [ ] Document current state (Nomad jobs, Consul services)

## Phase 1: Convert fractal to NixOS (DEFERRED - do after GlusterFS migration)
- [ ] Document fractal's current ZFS layout
- [ ] Install NixOS on fractal
- [ ] Import ZFS pools (double1, double2, double3)
- [ ] Create fractal NixOS configuration
- [ ] Configure Samba server for media/shared/homes
- [ ] Configure Kopia backup server
- [ ] Deploy and verify fractal base config
- [ ] Join fractal to cluster (5-server quorum)
- [ ] Update all cluster configs for 5-server quorum
- [ ] Verify fractal fully operational

## Phase 2: Set up zippy storage layer
- [x] Create btrfs subvolume `/persist/services` on zippy
- [x] Configure NFS server on zippy (`nfs-services-server.nix`)
- [x] Configure Consul service registration for NFS
- [x] Set up btrfs replication to c1 (incremental, 5-minute intervals)
- [x] Fix replication script to handle SSH command restrictions
- [x] Set up standby storage on c1 (`/persist/services-standby`)
- [x] Configure c1 as standby (`nfs-services-standby.nix`)
- [x] Configure Kopia to exclude replication snapshots
- [x] Deploy and verify NFS server on zippy
- [x] Verify replication working to c1
- [ ] Set up standby storage on c2 (if desired)
- [ ] Configure replication to c2 (if desired)

## Phase 3: Migrate from GlusterFS to NFS
- [x] Update all nodes to mount NFS at `/data/services`
- [x] Deploy updated configs (NFS client on all nodes)
- [ ] Stop all Nomad jobs temporarily
- [ ] Copy data from GlusterFS to zippy NFS
  - [ ] Copy `/data/compute/appdata/*` → `/persist/services/appdata/`
  - [ ] Copy `/data/compute/config/*` → `/persist/services/config/`
  - [ ] Copy `/data/sync/wordpress` → `/persist/services/appdata/wordpress`
- [ ] Verify data integrity
- [ ] Verify NFS mounts working on all nodes
- [ ] Stop GlusterFS volume
- [ ] Delete GlusterFS volume
- [ ] Remove GlusterFS from NixOS configs
- [ ] Remove Syncthing wordpress sync configuration

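
The copy and integrity steps above could be scripted roughly as follows. This is a minimal sketch, not the procedure from CLUSTER_REVAMP.md: it assumes the script runs as root on zippy with GlusterFS still mounted at `/data/compute` and `/data/sync`, that all Nomad jobs are already stopped, and that `rsync` is installed. The function names and the `RUN` dry-run switch are illustrative.

```shell
#!/bin/sh
# Sketch only: copy the GlusterFS trees from the checklist onto zippy's
# NFS export directory, then compare source and destination.
set -eu

# Set RUN=echo to print the commands instead of executing them.
RUN=${RUN:-}

# Source and destination pairs, taken from the checklist.
pairs() {
  cat <<'EOF'
/data/compute/appdata /persist/services/appdata
/data/compute/config /persist/services/config
/data/sync/wordpress /persist/services/appdata/wordpress
EOF
}

copy_all() {
  pairs | while read -r src dst; do
    $RUN mkdir -p "$dst"
    # -a archive mode, -H keep hard links; the trailing slashes copy the
    # directory contents (including dotfiles) rather than the directory itself.
    $RUN rsync -aH --numeric-ids "$src/" "$dst/"
  done
}

verify_all() {
  pairs | while read -r src dst; do
    # Dry run with full checksums: each itemized line is a path that differs.
    $RUN rsync -aH --numeric-ids --checksum --dry-run --itemize-changes "$src/" "$dst/"
  done
}
```

Setting `RUN=echo` before calling `copy_all` previews the commands without touching any data. `verify_all` does not use `--delete`, so files that exist only on the destination are not reported.
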
## Phase 4: Update and redeploy Nomad jobs

### Core Infrastructure (CRITICAL)
- [x] mysql.hcl - moved to zippy, using `/data/services`
- [x] postgres.hcl - migrated to `/data/services`
- [x] redis.hcl - migrated to `/data/services`
- [x] traefik.hcl - migrated to `/data/services`
- [x] authentik.hcl - stateless, no changes needed

### Monitoring Stack (HIGH)
- [x] prometheus.hcl - migrated to `/data/services`
- [x] grafana.hcl - migrated to `/data/services` (2025-10-23)
- [x] loki.hcl - migrated to `/data/services`
- [ ] vector.hcl - needs update to remove GlusterFS log collection (lines 26 and 101-109)

### Databases (HIGH)
- [x] clickhouse.hcl - migrated to `/data/services`
- [x] unifi.hcl - migrated to `/data/services` (includes mongodb)

### Web Applications (HIGH-MEDIUM)
- [x] wordpress.hcl - migrated to `/data/services`
- [x] gitea.hcl - migrated to `/data/services` (2025-10-23)
- [ ] wiki.hcl - uses `appdata` volume (points to `/data/compute/appdata`)
- [x] plausible.hcl - stateless, no changes needed
- [ ] tiddlywiki.hcl - uses `appdata` volume (points to `/data/compute/appdata`)

### Web Applications (LOW, may be deprecated)
- [x] vikunja.hcl - migrated to `/data/services` (2025-10-23, not running)

### Media Stack (MEDIUM)
- [x] media.hcl - migrated to `/data/services`

### Utility Services (MEDIUM-LOW)
- [x] evcc.hcl - migrated to `/data/services`
- [x] weewx.hcl - migrated to `/data/services` (2025-10-23)
- [x] code-server.hcl - migrated to `/data/services`
- [x] beancount.hcl - migrated to `/data/services`
- [x] adminer.hcl - stateless, no changes needed
- [x] maps.hcl - migrated to `/data/services`
- [x] netbox.hcl - migrated to `/data/services`
- [x] farmos.hcl - migrated to `/data/services` (2025-10-23)
- [x] urbit.hcl - migrated to `/data/services`
- [x] webodm.hcl - migrated to `/data/services` (2025-10-23, not running)
- [x] velutrack.hcl - migrated to `/data/services`
- [ ] resol-gateway.hcl - uses `code` volume (points to `/data/compute/code`)
- [ ] igsync.hcl - uses `appdata` volume (points to `/data/compute/appdata`)
- [x] jupyter.hcl - migrated to `/data/services` (2025-10-23, not running)
- [x] whoami.hcl - stateless test service, no changes needed

### Backup Jobs (HIGH)
- [x] mysql-backup - moved to zippy, verified
- [x] postgres-backup.hcl - migrated to `/data/services`

### Host Volume Definitions (CRITICAL)
- [ ] common/nomad.nix - update host_volume paths from `/data/compute/{appdata,code}` to `/data/services/{appdata,code}`

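
Before and after editing the host volume definitions, it can help to list every remaining reference to the old path. The following is a minimal sketch, assuming the Nomad job files (`*.hcl`) and Nix modules (`*.nix`) live under one directory tree; the function name `find_old_paths` is illustrative.

```shell
#!/bin/sh
# Sketch only: list every line under a directory that still mentions the
# old GlusterFS-backed path. Prints the matches and returns 1 if any exist;
# prints nothing and returns 0 when the tree is clean.
find_old_paths() {
  dir=${1:-.}
  # /dev/null as an extra operand makes grep always print the file name.
  matches=$(find "$dir" -type f \( -name '*.hcl' -o -name '*.nix' \) \
    -exec grep -n '/data/compute' /dev/null {} + || true)
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
    return 1
  fi
  return 0
}
```

Run from the repository root as `find_old_paths .`. Jobs that reference a host volume by name (such as `appdata`) will not match in their `.hcl` file, only in the Nix file that defines the volume path.
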
### Verification
- [ ] All services healthy in Nomad
- [ ] All services registered in Consul
- [ ] Traefik routes working
- [ ] Database jobs running on zippy (verify via `nomad alloc status`)
- [ ] Media jobs running on fractal (verify via `nomad alloc status`)

## Phase 5: Convert sunny to NixOS (OPTIONAL - can defer)
- [ ] Document current sunny setup (Ethereum containers/VMs)
- [ ] Back up Ethereum data
- [ ] Install NixOS on sunny
- [ ] Restore Ethereum data to `/persist/ethereum`
- [ ] Create sunny container-based config (besu, lighthouse, rocketpool)
- [ ] Deploy and verify Ethereum stack
- [ ] Monitor sync status and validation

## Phase 6: Verification and cleanup
- [ ] Test NFS failover procedure (zippy → c1)
- [ ] Verify backups include `/persist/services` data
- [ ] Verify backups exclude replication snapshots
- [ ] Update documentation (README.md, architecture diagrams)
- [ ] Clean up old GlusterFS data (only after everything is verified!)
- [ ] Remove old GlusterFS directories from all nodes

## Post-Migration Checklist
- [ ] All 5 servers in quorum (`consul members`)
- [ ] NFS mounts working on all nodes
- [ ] Btrfs replication running (check systemd timers on zippy)
- [ ] Critical services up (mysql, postgres, redis, traefik, authentik)
- [ ] Monitoring working (prometheus, grafana, loki)
- [ ] Media stack on fractal
- [ ] Database jobs on zippy
- [ ] Consul DNS working (`dig @localhost -p 8600 data-services.service.consul`)
- [ ] Backups running (Kopia snapshots include `/persist/services`)
- [ ] GlusterFS removed (no processes, volumes deleted)
- [ ] Documentation updated

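
Several of the checks above can be scripted. The following is a minimal sketch rather than a complete health check: it assumes `consul`, `nomad`, `dig`, `mountpoint`, and `pgrep` are on `PATH`, that Nomad job names match their file names (for example `mysql` for mysql.hcl), and that `consul members` prints status and type in its third and fourth columns. It counts alive Consul servers, which approximates but does not prove raft quorum. All helper names are illustrative.

```shell
#!/bin/sh
# Sketch of a post-migration health check. Helper names are illustrative.
set -u

failures=0

# Run a command quietly and report PASS or FAIL under the given label.
check() {
  label=$1
  shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $label"
  else
    echo "FAIL  $label"
    failures=$((failures + 1))
  fi
}

# Count alive servers in `consul members` output read from stdin.
# Assumes the column order: Node Address Status Type ...
count_alive_servers() {
  awk '$3 == "alive" && $4 == "server" { n++ } END { print n + 0 }'
}

five_servers_alive() {
  [ "$(consul members | count_alive_servers)" -eq 5 ]
}

# True when Consul DNS returns at least one address for the given name.
dns_resolves() {
  [ -n "$(dig @localhost -p 8600 +short "$1")" ]
}

# True when the named Nomad job reports "Status = running".
job_running() {
  nomad job status "$1" | grep -Eq '^Status[[:space:]]*=[[:space:]]*running'
}

# True when no process name contains "gluster".
no_gluster_processes() {
  ! pgrep gluster >/dev/null
}

post_migration_checks() {
  check "5 Consul servers alive" five_servers_alive
  check "NFS mounted at /data/services" mountpoint -q /data/services
  check "Consul DNS resolves data-services" dns_resolves data-services.service.consul
  for job in mysql postgres redis traefik authentik prometheus grafana loki; do
    check "Nomad job $job running" job_running "$job"
  done
  check "no GlusterFS processes" no_gluster_processes
  [ "$failures" -eq 0 ]
}
```

Calling `post_migration_checks` on a cluster node prints one PASS or FAIL line per check and returns non-zero if any check failed. The replication timer, backup, and job placement items are left as manual checks here.
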
---

**Last updated**: 2025-10-23 21:16
**Current phase**: Phase 4 in progress (26/35 services migrated, 4 stateless with no changes needed; 4 host-volume services and the vector.hcl config update remaining)
**Note**: Phase 1 (fractal NixOS conversion) deferred until after GlusterFS migration is complete

## Migration Summary

**Already migrated to `/data/services` (26 services):**
mysql, mysql-backup, postgres, postgres-backup, redis, clickhouse, prometheus, grafana, loki, unifi, wordpress, gitea, traefik, evcc, weewx, netbox, farmos, webodm, jupyter, vikunja, urbit, code-server, beancount, velutrack, maps, media

**Still need migration (4 services using host volumes):**
- wiki (appdata), tiddlywiki (appdata), igsync (appdata), resol-gateway (code)
- These require updating the common/nomad.nix host_volume definitions first

**Stateless/no changes needed (4 services):**
authentik, adminer, plausible, whoami

**Configuration updates needed:**
- vector.hcl: remove GlusterFS log collection
- common/nomad.nix: update host_volume paths