Cluster Architecture Revamp
Status: Planning complete, ready for review and refinement
Key Decisions
✅ Replication: 5-minute intervals (incremental btrfs send)
✅ WordPress: Currently syncthing → will use /data/services via NFS
✅ Media: Only media.hcl needs /data/media, constrained to fractal
✅ Unifi: Floating (no constraint needed)
✅ Sunny: Standalone, ethereum data stays local (not replicated)
✅ Quorum: 5 servers (c1, c2, c3, fractal, zippy)
✅ NFS Failover: Via Consul DNS (services.service.consul)
Table of Contents
End State Architecture
Cluster Topology
5-Server Quorum (Consul + Nomad server+client):
- c1, c2, c3: Cattle nodes - x86_64, run most stateless workloads
- fractal: Storage node - x86_64, 6x spinning drives, runs media workloads
- zippy: Stateful anchor - x86_64, runs database workloads (via affinity), primary NFS server
Standalone Nodes (not in quorum):
- sunny: x86_64, ethereum node + staking, base NixOS configs only
- chilly: x86_64, Home Assistant VM, base NixOS configs only
Quorum Math:
- 5 servers → quorum requires 3 healthy nodes
- Can tolerate 2 simultaneous failures
- Bootstrap expect: 3
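Once all five servers have joined, quorum health can be verified with the standard operator commands (assumes CLI access on a server node and no ACL tokens required):
consul operator raft list-peers   # expect 5 voters and exactly one leader
nomad operator raft list-peers    # same five servers on the Nomad side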
Storage Architecture
Primary Storage (zippy):
- /persist/services - btrfs subvolume
- Contains: mysql, postgres, redis, clickhouse, mongodb, app data
- Exported via NFS as services.service.consul:/persist/services
- Replicated via btrfs send to c1 and c2 every 5 minutes (incremental)
Standby Storage (c1, c2):
- /persist/services-standby - btrfs subvolume
- Receives replicated snapshots from zippy via incremental btrfs send
- Can be promoted to /persist/services and exported via NFS during failover
- Maximum data loss: 5 minutes (last replication interval)
Standalone Storage (sunny):
- /persist/ethereum - local btrfs subvolume (or similar)
- Contains: ethereum blockchain data, staking keys
- NOT replicated - too large/expensive to replicate a full ethereum node
- Backed up via kopia to fractal (if feasible/needed)
Media Storage (fractal):
- /data/media - existing spinning drive storage
- Exported via Samba (existing)
- Mounted on c1, c2, c3 via CIFS (existing)
- Local access on fractal for media workloads
Shared Storage (fractal):
- /data/shared - existing spinning drive storage
- Exported via Samba (existing)
- Mounted on c1, c2, c3 via CIFS (existing)
Network Services
NFS Primary (zippy):
services.nfs.server = {
enable = true;
exports = ''
/persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
'';
};
services.consul.extraConfig.services = [{
name = "services";
port = 2049;
checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
}];
NFS Client (all nodes):
fileSystems."/data/services" = {
device = "services.service.consul:/persist/services";
fsType = "nfs";
options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
};
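To sanity-check a client, the Consul DNS name and the systemd automount can be exercised directly (unit names follow systemd's path escaping for /data/services):
# Resolve the NFS server through Consul DNS
dig @localhost -p 8600 +short services.service.consul
# Trigger the automount and confirm the share is reachable
ls /data/services
systemctl status data-services.automount data-services.mount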
Samba Exports (fractal - existing):
- //fractal/media → /data/media
- //fractal/shared → /data/shared
Nomad Job Placement Strategy
Affinity-based (prefer zippy, allow c1/c2):
- mysql, postgres, redis - stateful databases
- Run on zippy normally, can failover to c1/c2 if zippy down
Constrained (must run on fractal):
- media.hcl - radarr, sonarr, bazarr, plex, qbittorrent
- Reason: Heavy /data/media access, benefits from local storage
- prometheus.hcl - metrics database with 30d retention
- Reason: Large time-series data, spinning disks OK, saves SSD space
- loki.hcl - log aggregation with 31d retention
- Reason: Large log data, spinning disks OK
- clickhouse.hcl - analytics database for plausible
- Reason: Large time-series data, spinning disks OK
Floating (can run anywhere on c1/c2/c3/fractal/zippy):
- All other services including:
- traefik, authentik, web apps
- grafana (small data, just dashboards/config, queries prometheus for metrics)
- databases (mysql, postgres, redis)
- vector (system job, runs everywhere)
- Nomad schedules based on resources and constraints
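Placement can be previewed before anything is redeployed; nomad job plan is a dry run of the scheduler (job file names as used in the catalog below):
nomad job plan services/media.hcl   # should place on fractal (constraint)
nomad job plan services/mysql.hcl   # should prefer zippy (affinity), fall back to c1/c2
# Note: exit code 1 just means the plan contains changes, not an error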
Data Migration
Path changes needed in Nomad jobs:
- /data/compute/appdata/* → /data/services/*
- /data/compute/config/* → /data/services/*
- /data/sync/wordpress → /data/services/wordpress
No changes needed:
- /data/media/* - stays the same (CIFS mount from fractal, used only by media services)
- /data/shared/* - stays the same (CIFS mount from fractal)
Deprecated after migration:
- /data/sync/wordpress - currently managed by syncthing to avoid slow GlusterFS
  - Will be replaced by NFS mount at /data/services/wordpress
  - Syncthing configuration for this can be removed
  - Final sync: copy from syncthing to /persist/services/wordpress on zippy before cutover
Migration Steps
Important path simplification note:
- All service paths use /data/services/* directly (not /data/services/appdata/*)
- Example: /data/compute/appdata/mysql → /data/services/mysql
- Simpler, cleaner, easier to manage
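A rough sketch of the bulk path rewrite across job specs (illustrative sed patterns only; appdata and wordpress are mechanical, but /data/compute/config/* paths should be reviewed case by case, and the diff eyeballed before deploying):
cd services/
grep -rl '/data/compute/appdata' . | xargs sed -i 's#/data/compute/appdata/#/data/services/#g'
grep -rl '/data/sync/wordpress'  . | xargs sed -i 's#/data/sync/wordpress#/data/services/wordpress#g'
git diff   # review every change before running the jobs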
Phase 0: Preparation
Duration: 1-2 hours
1. Backup everything
# On all nodes, ensure kopia backups are current
kopia snapshot list
# Backup glusterfs data manually
rsync -av /data/compute/ /backup/compute-pre-migration/
2. Document current state
# Save current nomad job list
nomad job status -json > /backup/nomad-jobs-pre-migration.json
# Save consul service catalog
consul catalog services > /backup/consul-services-pre-migration.txt
3. Review this document
- Verify all services are cataloged
- Confirm priority assignments
- Adjust as needed
Phase 1: Convert fractal to NixOS
Duration: 6-8 hours
Current state:
- Proxmox on ZFS
- System pool: rpool (~500GB, will be wiped)
- Data pools (preserved):
  - double1 - 3.6T (homes, shared)
  - double2 - 7.2T (backup - kopia repo, PBS)
  - double3 - 17T (media, torrent)
- Services: Samba (homes, shared, media), Kopia server, PBS
- Bind mounts: /data/{homes,shared,media,torrent} → ZFS datasets
Goal: Fresh NixOS on rpool, preserve data pools, join cluster
Step-by-step procedure:
1. Pre-migration documentation
# On fractal, save ZFS layout
cat > /tmp/detect-zfs.sh << 'EOF'
#!/bin/bash
echo "=== ZFS Pools ==="
zpool status
echo -e "\n=== ZFS Datasets ==="
zfs list -o name,mountpoint,used,avail,mounted -r double1 double2 double3
echo -e "\n=== Bind mounts ==="
cat /etc/fstab | grep double
echo -e "\n=== Data directories ==="
ls -la /data/
echo -e "\n=== Samba users/groups ==="
getent group shared compute
getent passwd compute
EOF
chmod +x /tmp/detect-zfs.sh
ssh fractal /tmp/detect-zfs.sh > /backup/fractal-zfs-layout.txt
# Save samba config
scp fractal:/etc/samba/smb.conf /backup/fractal-smb.conf
# Save kopia certs and config
scp -r fractal:~/kopia-certs /backup/fractal-kopia-certs/
scp fractal:~/.config/kopia/repository.config /backup/fractal-kopia-repository.config
# Verify kopia backups are current
ssh fractal "kopia snapshot list --all"
2. Stop services on fractal
ssh fractal "systemctl stop smbd nmbd kopia"
# Don't stop PBS yet (in case we need to restore)
3. Install NixOS
- Boot NixOS installer USB
- IMPORTANT: Do NOT touch double1, double2, double3 during install!
- Install only on rpool (or create a new pool if needed)
# In NixOS installer
# Option A: Reuse rpool (wipe and recreate)
zpool destroy rpool
# Option B: Use different disk if available
# Then follow standard NixOS btrfs install on that disk
- Use standard encrypted btrfs layout (matching other hosts)
- Minimal install first, will add cluster configs later
4. First boot - import ZFS pools
# SSH into fresh NixOS install
# Import pools (read-only first, to be safe)
zpool import -f -o readonly=on double1
zpool import -f -o readonly=on double2
zpool import -f -o readonly=on double3
# Verify datasets
zfs list -r double1 double2 double3
# Example output should show:
# double1/homes
# double1/shared
# double2/backup
# double3/media
# double3/torrent
# If everything looks good, export and reimport read-write
zpool export double1 double2 double3
zpool import double1
zpool import double2
zpool import double3
# Set ZFS mountpoints (if needed)
# These may already be set from Proxmox
zfs set mountpoint=/double1 double1
zfs set mountpoint=/double2 double2
zfs set mountpoint=/double3 double3
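Before layering the NixOS config on top, a quick pool health check is cheap (standard zpool/zfs commands):
zpool status -x                      # should print "all pools are healthy"
zfs get -r mountpoint,mounted double1 double2 double3 | grep -v '@'
df -h /double1 /double2 /double3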
5. Create fractal NixOS configuration
# hosts/fractal/default.nix
{ config, pkgs, ... }:
{
imports = [
../../common/encrypted-btrfs-layout.nix
../../common/global
../../common/cluster-node.nix # Consul + Nomad (will add in step 7)
../../common/nomad.nix # Both server and client
./hardware.nix
];
networking.hostName = "fractal";
# ZFS support
boot.supportedFilesystems = [ "zfs" ];
boot.zfs.extraPools = [ "double1" "double2" "double3" ];
# Ensure ZFS pools are imported before mounting
systemd.services.zfs-import.wantedBy = [ "multi-user.target" ];
# Bind mounts for /data (matching Proxmox setup)
fileSystems."/data/homes" = {
device = "/double1/homes";
fsType = "none";
options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
};
fileSystems."/data/shared" = {
device = "/double1/shared";
fsType = "none";
options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
};
fileSystems."/data/media" = {
device = "/double3/media";
fsType = "none";
options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
};
fileSystems."/data/torrent" = {
device = "/double3/torrent";
fsType = "none";
options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
};
fileSystems."/backup" = {
device = "/double2/backup";
fsType = "none";
options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
};
# Create data directory structure
systemd.tmpfiles.rules = [
"d /data 0755 root root -"
];
# Users and groups for samba
users.groups.shared = { gid = 1001; };
users.groups.compute = { gid = 1002; };
users.users.compute = {
isSystemUser = true;
uid = 1002;
group = "compute";
};
# Ensure ppetru is in shared group
users.users.ppetru.extraGroups = [ "shared" ];
# Samba server
services.samba = {
enable = true;
openFirewall = true;
extraConfig = ''
workgroup = WORKGROUP
server string = fractal
netbios name = fractal
security = user
map to guest = bad user
'';
shares = {
homes = {
comment = "Home Directories";
browseable = "no";
path = "/data/homes/%S";
"read only" = "no";
};
shared = {
path = "/data/shared";
"read only" = "no";
browseable = "yes";
"guest ok" = "no";
"create mask" = "0775";
"directory mask" = "0775";
"force group" = "+shared";
};
media = {
path = "/data/media";
"read only" = "no";
browseable = "yes";
"guest ok" = "no";
"create mask" = "0755";
"directory mask" = "0755";
};
};
};
# Kopia backup server
systemd.services.kopia-server = {
description = "Kopia Backup Server";
wantedBy = [ "multi-user.target" ];
after = [ "network.target" "zfs-mount.service" ];
serviceConfig = {
User = "ppetru";
Group = "users";
ExecStart = ''
${pkgs.kopia}/bin/kopia server start \
--address 0.0.0.0:51515 \
--tls-cert-file /persist/kopia-certs/kopia.cert \
--tls-key-file /persist/kopia-certs/kopia.key
'';
Restart = "on-failure";
};
};
# Kopia nightly snapshot (from cron)
systemd.services.kopia-snapshot = {
description = "Kopia snapshot of homes and shared";
serviceConfig = {
Type = "oneshot";
User = "ppetru";
Group = "users";
ExecStart = ''
${pkgs.kopia}/bin/kopia --config-file=/home/ppetru/.config/kopia/repository.config \
snapshot create /data/homes /data/shared \
--log-level=warning --no-progress
'';
};
};
systemd.timers.kopia-snapshot = {
wantedBy = [ "timers.target" ];
timerConfig = {
OnCalendar = "22:47";
Persistent = true;
};
};
# Keep kopia config and certs persistent
environment.persistence."/persist" = {
directories = [
"/home/ppetru/.config/kopia"
"/home/ppetru/kopia-certs"
];
};
networking.firewall.allowedTCPPorts = [
139 445 # Samba
51515 # Kopia
];
networking.firewall.allowedUDPPorts = [
137 138 # Samba
];
}
6. Deploy initial config (without cluster)
# First, deploy without cluster-node.nix to verify storage works
# Comment out cluster-node import temporarily
deploy -s '.#fractal'
# Verify mounts
ssh fractal "df -h | grep data"
ssh fractal "ls -la /data/"
# Test samba
smbclient -L fractal -U ppetru
# Test kopia
ssh fractal "systemctl status kopia-server"
7. Join cluster (add to quorum)
# Uncomment cluster-node.nix import in fractal config
# Update all cluster configs for 5-server quorum
# (See step 3 in existing Phase 1 docs)
deploy # Deploy to all nodes
# Verify quorum
consul members
nomad server members
8. Update cluster configs for 5-server quorum
# common/consul.nix
servers = ["c1" "c2" "c3" "fractal" "zippy"];
bootstrap_expect = 3;
# common/nomad.nix
servers = ["c1" "c2" "c3" "fractal" "zippy"];
bootstrap_expect = 3;
9. Verify fractal is fully operational
# Check all services
ssh fractal "systemctl status samba kopia-server kopia-snapshot.timer"
# Verify ZFS pools
ssh fractal "zpool status"
ssh fractal "zfs list"
# Test accessing shares from another node
ssh c1 "ls /data/media /data/shared"
# Verify kopia clients can still connect
kopia repository status --server=https://fractal:51515
# Check nomad can see fractal
nomad node status | grep fractal
# Verify quorum
consul members # Should see c1, c2, c3, fractal (zippy joins the quorum in Phase 2)
nomad server members # Should see 4 servers
Phase 2: Setup zippy storage layer
Duration: 2-3 hours
Goal: Prepare zippy for NFS server role, setup replication
1. Create btrfs subvolume on zippy
ssh zippy
sudo btrfs subvolume create /persist/services
sudo chown ppetru:users /persist/services
2. Update zippy configuration
# hosts/zippy/default.nix
imports = [
  ../../common/encrypted-btrfs-layout.nix
  ../../common/global
  ../../common/cluster-node.nix # Adds to quorum
  ../../common/nomad.nix
  ./hardware.nix
];
# NFS server
services.nfs.server = {
  enable = true;
  exports = ''
    /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
  '';
};
# Consul service registration for NFS
services.consul.extraConfig.services = [{
  name = "services";
  port = 2049;
  checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
}];
# Btrfs replication to standbys (incremental after first full send)
systemd.services.replicate-to-c1 = {
  description = "Replicate /persist/services to c1";
  script = ''
    ${pkgs.btrfs-progs}/bin/btrfs subvolume snapshot -r /persist/services /persist/services@$(date +%Y%m%d-%H%M%S)
    LATEST=$(ls -t /persist/services@* | head -1)
    # Get previous snapshot for incremental send
    PREV=$(ls -t /persist/services@* | head -2 | tail -1)
    # First run: full send. Subsequent: incremental with -p (parent)
    if [ "$LATEST" != "$PREV" ]; then
      ${pkgs.btrfs-progs}/bin/btrfs send -p $PREV $LATEST | ${pkgs.openssh}/bin/ssh c1 "${pkgs.btrfs-progs}/bin/btrfs receive /persist/"
    else
      # First snapshot, full send
      ${pkgs.btrfs-progs}/bin/btrfs send $LATEST | ${pkgs.openssh}/bin/ssh c1 "${pkgs.btrfs-progs}/bin/btrfs receive /persist/"
    fi
    # Cleanup old snapshots (keep last 24 hours on sender)
    # -maxdepth 0 so only the snapshot roots themselves are considered
    find /persist/services@* -maxdepth 0 -mtime +1 -exec ${pkgs.btrfs-progs}/bin/btrfs subvolume delete {} \;
  '';
};
systemd.timers.replicate-to-c1 = {
  wantedBy = [ "timers.target" ];
  timerConfig = {
    OnCalendar = "*:0/5"; # Every 5 minutes (incremental after first full send)
    Persistent = true;
  };
};
# Same for c2
systemd.services.replicate-to-c2 = { ... };
systemd.timers.replicate-to-c2 = { ... };
3. Setup standby storage on c1 and c2
# On c1 and c2
ssh c1 sudo btrfs subvolume create /persist/services-standby
ssh c2 sudo btrfs subvolume create /persist/services-standby
4. Deploy and verify
deploy -s '.#zippy'
# Verify NFS export
showmount -e zippy
# Verify Consul registration
dig @localhost -p 8600 services.service.consul
5. Verify quorum is now 5 servers
consul members # Should show c1, c2, c3, fractal, zippy
nomad server members
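Once the timers are active, a quick way to confirm replication is actually flowing (assumes the snapshot naming and receive path used by the replication service above):
# Timers firing every 5 minutes?
ssh zippy "systemctl list-timers 'replicate-to-*'"
# Newest snapshot on zippy vs what the standbys have received
ssh zippy "ls -t /persist/ | grep 'services@' | head -1"
ssh c1    "ls -t /persist/ | grep 'services@' | head -1"
ssh c2    "ls -t /persist/ | grep 'services@' | head -1"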
Phase 3: Migrate from GlusterFS to NFS
Duration: 3-4 hours
Goal: Move all data, update mounts, remove GlusterFS
1. Copy data from GlusterFS to zippy
# On any node with /data/compute mounted
rsync -av --progress /data/compute/ zippy:/persist/services/
# Verify
ssh zippy du -sh /persist/services
2. Update all nodes to mount NFS
# Update common/glusterfs-client.nix → common/nfs-client.nix
# OR update common/cluster-node.nix to import nfs-client instead
fileSystems."/data/services" = {
  device = "services.service.consul:/persist/services";
  fsType = "nfs";
  options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
};
# Remove old GlusterFS mount
# fileSystems."/data/compute" = ... # DELETE
3. Deploy updated configs
deploy -s '.#c1' '.#c2' '.#c3' '.#fractal' '.#zippy'
4. Verify NFS mounts
for host in c1 c2 c3 fractal zippy; do
  ssh $host "df -h | grep services"
done
5. Stop all Nomad jobs temporarily
# Get list of running jobs
nomad job status | grep running | awk '{print $1}' > /tmp/running-jobs.txt
# Stop all (they'll be restarted with updated paths in Phase 4)
cat /tmp/running-jobs.txt | xargs -I {} nomad job stop {}
6. Remove GlusterFS from cluster
# On c1 (or any gluster server)
gluster volume stop compute
gluster volume delete compute
# On all nodes
for host in c1 c2 c3; do
  ssh $host "sudo systemctl stop glusterd; sudo systemctl disable glusterd"
done
7. Remove GlusterFS from NixOS configs
# common/compute-node.nix - remove ./glusterfs.nix import
# Deploy again
deploy
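Before Phase 4, a small read/write smoke test from every node confirms the NFS mount is usable, not just present (temporary files, removed immediately):
for host in c1 c2 c3 fractal zippy; do
  ssh $host 'touch /data/services/.rw-test-$(hostname) && rm /data/services/.rw-test-$(hostname)' \
    && echo "$host: read/write OK"
done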
Phase 4: Update and redeploy Nomad jobs
Duration: 2-4 hours
Goal: Update all Nomad job paths, add constraints/affinities, redeploy
1. Update job specs (see Service Catalog below for details)
- Change /data/compute → /data/services
- Add constraints for media jobs → fractal
- Add affinities for database jobs → zippy
2. Deploy critical services first
# Core infrastructure
nomad run services/mysql.hcl
nomad run services/postgres.hcl
nomad run services/redis.hcl
nomad run services/traefik.hcl
nomad run services/authentik.hcl
# Verify
nomad job status mysql
consul catalog services
3. Deploy high-priority services
nomad run services/prometheus.hcl
nomad run services/grafana.hcl
nomad run services/loki.hcl
nomad run services/vector.hcl
nomad run services/unifi.hcl
nomad run services/gitea.hcl
4. Deploy medium-priority services
# See service catalog for full list
nomad run services/wordpress.hcl
nomad run services/ghost.hcl
nomad run services/wiki.hcl
# ... etc
5. Deploy low-priority services
nomad run services/media.hcl # Will run on fractal due to constraint
# ... etc
6. Verify all services healthy
nomad job status
consul catalog services
# Check traefik dashboard for health
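After each wave, a quick sweep catches anything that didn't come back cleanly (column positions assume the current Nomad CLI output; jq assumed to be available):
# Jobs whose status isn't "running" (periodic/batch jobs legitimately show "dead" after completing)
nomad job status | awk 'NR>1 && $4 != "running" {print $1, $4}'
# Consul health checks currently critical
curl -s http://localhost:8500/v1/health/state/critical | jq -r '.[] | "\(.ServiceName) \(.CheckID)"'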
Phase 5: Convert sunny to NixOS (Optional, can defer)
Duration: 6-10 hours (split across 2 stages)
Current state:
- Proxmox with ~1.5TB ethereum node data
- 2x LXC containers: besu (execution client), lighthouse (consensus beacon)
- 1x VM: Rocketpool smartnode (docker containers for validator, node, MEV-boost, etc.)
- Running in "hybrid mode" - managing own execution/consensus, rocketpool manages the rest
Goal: Get sunny on NixOS quickly, preserve ethereum data, defer "perfect" native setup
Stage 1: Quick NixOS Migration (containers)
Duration: 6-8 hours
Goal: NixOS + containerized ethereum stack, minimal disruption
1. Pre-migration backup and documentation
# Document current setup
ssh sunny "pct list" > /backup/sunny-containers.txt
ssh sunny "qm list" > /backup/sunny-vms.txt
# Find ethereum data locations in LXC containers
ssh sunny "pct config BESU_CT_ID" > /backup/sunny-besu-config.txt
ssh sunny "pct config LIGHTHOUSE_CT_ID" > /backup/sunny-lighthouse-config.txt
# Document rocketpool VM volumes
ssh sunny "qm config ROCKETPOOL_VM_ID" > /backup/sunny-rocketpool-config.txt
# Estimate ethereum data size
ssh sunny "du -sh /path/to/besu/data"
ssh sunny "du -sh /path/to/lighthouse/data"
# Backup rocketpool config (docker-compose, wallet keys, etc.)
# This is in the VM - need to access and backup critical files
2. Extract ethereum data from containers/VM
# Stop ethereum services to get consistent state
# (This will pause validation! Plan for attestation penalties)
# Copy besu data out of LXC
ssh sunny "pct stop BESU_CT_ID"
rsync -av --progress sunny:/var/lib/lxc/BESU_CT_ID/rootfs/path/to/besu/ /backup/sunny-besu-data/
# Copy lighthouse data out of LXC
ssh sunny "pct stop LIGHTHOUSE_CT_ID"
rsync -av --progress sunny:/var/lib/lxc/LIGHTHOUSE_CT_ID/rootfs/path/to/lighthouse/ /backup/sunny-lighthouse-data/
# Copy rocketpool data out of VM
# This includes validator keys, wallet, node config
# Access VM and copy out: ~/.rocketpool/data
3. Install NixOS on sunny
- Fresh install with btrfs + impermanence
- Create a large /persist/ethereum for 1.5TB+ of data
- DO NOT try to resync from the network (takes weeks!)
4. Restore ethereum data to NixOS
# After NixOS install, copy data back
ssh sunny "mkdir -p /persist/ethereum/{besu,lighthouse,rocketpool}"
rsync -av --progress /backup/sunny-besu-data/ sunny:/persist/ethereum/besu/
rsync -av --progress /backup/sunny-lighthouse-data/ sunny:/persist/ethereum/lighthouse/
# Rocketpool data copied later
5. Create sunny NixOS config (container-based)
# hosts/sunny/default.nix
{ config, pkgs, ... }:
{
imports = [
../../common/encrypted-btrfs-layout.nix
../../common/global
./hardware.nix
];
networking.hostName = "sunny";
# NO cluster-node import - standalone for now
# Can add to quorum later if desired
# Container runtime
virtualisation.podman = {
enable = true;
dockerCompat = true; # Provides 'docker' command
defaultNetwork.settings.dns_enabled = true;
};
# Besu execution client (container)
virtualisation.oci-containers.containers.besu = {
image = "hyperledger/besu:latest";
volumes = [
"/persist/ethereum/besu:/var/lib/besu"
];
ports = [
"8545:8545" # HTTP RPC
"8546:8546" # WebSocket RPC
"30303:30303" # P2P
];
cmd = [
"--data-path=/var/lib/besu"
"--rpc-http-enabled=true"
"--rpc-http-host=0.0.0.0"
"--rpc-ws-enabled=true"
"--rpc-ws-host=0.0.0.0"
"--engine-rpc-enabled=true"
"--engine-host-allowlist=*"
"--engine-jwt-secret=/var/lib/besu/jwt.hex"
# Add other besu flags as needed
];
autoStart = true;
};
# Lighthouse beacon client (container)
virtualisation.oci-containers.containers.lighthouse-beacon = {
image = "sigp/lighthouse:latest";
volumes = [
"/persist/ethereum/lighthouse:/data"
"/persist/ethereum/besu/jwt.hex:/jwt.hex:ro"
];
ports = [
"5052:5052" # HTTP API
"9000:9000" # P2P
];
cmd = [
"lighthouse"
"beacon"
"--datadir=/data"
"--http"
"--http-address=0.0.0.0"
"--execution-endpoint=http://besu:8551"
"--execution-jwt=/jwt.hex"
# Add other lighthouse flags
];
dependsOn = [ "besu" ];
autoStart = true;
};
# Rocketpool stack (podman-compose for multi-container setup)
# TODO: This requires converting docker-compose to NixOS config
# For now, can run docker-compose via systemd service
systemd.services.rocketpool = {
description = "Rocketpool Smartnode Stack";
after = [ "podman.service" "lighthouse-beacon.service" ];
wantedBy = [ "multi-user.target" ];
serviceConfig = {
Type = "oneshot";
RemainAfterExit = "yes";
WorkingDirectory = "/persist/ethereum/rocketpool";
ExecStart = "${pkgs.docker-compose}/bin/docker-compose up -d";
ExecStop = "${pkgs.docker-compose}/bin/docker-compose down";
};
};
# Ensure ethereum data persists
environment.persistence."/persist" = {
directories = [
"/persist/ethereum"
];
};
# Firewall for ethereum
networking.firewall = {
allowedTCPPorts = [
30303 # Besu P2P
9000 # Lighthouse P2P
# Add rocketpool ports
];
allowedUDPPorts = [
30303 # Besu P2P
9000 # Lighthouse P2P
];
};
}
6. Setup rocketpool docker-compose on NixOS
# After NixOS is running, restore rocketpool config
ssh sunny "mkdir -p /persist/ethereum/rocketpool"
# Copy rocketpool data (wallet, keys, config)
rsync -av /backup/sunny-rocketpool-data/ sunny:/persist/ethereum/rocketpool/
# Create docker-compose.yml for rocketpool stack
# Based on rocketpool hybrid mode docs
# This runs: validator, node software, MEV-boost, prometheus, etc.
# Connects to your besu + lighthouse containers
7. Deploy and test
deploy -s '.#sunny'
# Verify containers are running
ssh sunny "podman ps"
# Check besu sync status
ssh sunny "curl -X POST -H 'Content-Type: application/json' --data '{\"jsonrpc\":\"2.0\",\"method\":\"eth_syncing\",\"params\":[],\"id\":1}' http://localhost:8545"
# Check lighthouse sync status
ssh sunny "curl http://localhost:5052/eth/v1/node/syncing"
# Monitor rocketpool
ssh sunny "cd /persist/ethereum/rocketpool && docker-compose logs -f"
8. Monitor and stabilize
- Ethereum should resume from where it left off (not resync!)
- Validation will resume once beacon is sync'd
- May have missed a few attestations during migration (minor penalty)
Stage 2: Native NixOS Services (Future)
Duration: TBD (do this later when time permits)
Goal: Convert to native NixOS services using ethereum-nix
Why defer this:
- Complex (rocketpool not fully packaged for Nix)
- Current container setup works fine
- Can migrate incrementally (besu → native, then lighthouse, etc.)
- No downtime once Stage 1 is stable
When ready:
- Research ethereum-nix support for besu + lighthouse + rocketpool
- Test on separate machine first
- Migrate one service at a time with minimal downtime
- Document in separate migration plan
For now: Stage 1 gets sunny on NixOS with base configs, managed declaratively, just using containers instead of native services.
Phase 6: Verification and cleanup
Duration: 1 hour
1. Test failover procedure (see Failover Procedures below)
2. Verify backups are working
kopia snapshot list
# Check that /persist/services is being backed up
3. Update documentation
- Update README.md
- Document new architecture
- Update stateful-commands.txt
4. Clean up old GlusterFS data
# Only after verifying everything works!
for host in c1 c2 c3; do
  ssh $host "sudo rm -rf /persist/glusterfs"
done
Service Catalog
Legend:
- Priority: CRITICAL (must be up) / HIGH (important) / MEDIUM (nice to have) / LOW (can wait)
- Target: Where it should run (constraint or affinity)
- Data: What data it needs access to
- Changes: What needs updating in the .hcl file
Core Infrastructure
mysql
- File: services/mysql.hcl
- Priority: CRITICAL
- Current: Uses /data/compute/appdata/mysql
- Target: Affinity for zippy, allow c1/c2
- Data: /data/services/mysql (NFS from zippy)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/mysql → /data/services/mysql
  - ✏️ Add affinity: affinity { attribute = "${node.unique.name}" value = "zippy" weight = 100 }
  - ✏️ Add constraint to allow fallback: constraint { attribute = "${node.unique.name}" operator = "regexp" value = "zippy|c1|c2" }
- Notes: Core database, needs to stay up. Consul DNS mysql.service.consul unchanged.
postgres
- File: services/postgres.hcl
- Priority: CRITICAL
- Current: Uses /data/compute/appdata/postgres, /data/compute/appdata/pgadmin
- Target: Affinity for zippy, allow c1/c2
- Data: /data/services/postgres, /data/services/pgadmin (NFS)
- Changes:
  - ✏️ Volume paths: /data/compute/appdata/* → /data/services/*
  - ✏️ Add affinity and constraint (same as mysql)
- Notes: Core database for authentik, gitea, plausible, netbox, etc.
redis
- File: services/redis.hcl
- Priority: CRITICAL
- Current: Uses /data/compute/appdata/redis
- Target: Affinity for zippy, allow c1/c2
- Data: /data/services/redis (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/redis → /data/services/redis
  - ✏️ Add affinity and constraint (same as mysql)
- Notes: Used by authentik, wordpress. Should co-locate with databases.
traefik
- File: services/traefik.hcl
- Priority: CRITICAL
- Current: Uses /data/compute/config/traefik
- Target: Float on c1/c2/c3 (keepalived handles HA)
- Data: /data/services/config/traefik (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/config/traefik → /data/services/config/traefik
- Notes: Reverse proxy, has keepalived for VIP failover. Critical for all web access.
authentik
- File: services/authentik.hcl
- Priority: CRITICAL
- Current: No persistent volumes (stateless, uses postgres/redis)
- Target: Float on c1/c2/c3
- Data: None (uses postgres.service.consul, redis.service.consul)
- Changes: None needed
- Notes: SSO for most services. Must stay up.
Monitoring Stack
prometheus
- File: services/prometheus.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/prometheus
- Target: Float on c1/c2/c3
- Data: /data/services/prometheus (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/prometheus → /data/services/prometheus
- Notes: Metrics database. Important for monitoring but not critical for services.
grafana
- File: services/grafana.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/grafana
- Target: Float on c1/c2/c3
- Data: /data/services/grafana (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/grafana → /data/services/grafana
- Notes: Monitoring UI. Depends on prometheus.
loki
- File: services/loki.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/loki
- Target: Float on c1/c2/c3
- Data: /data/services/loki (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/loki → /data/services/loki
- Notes: Log aggregation. Important for debugging.
vector
- File: services/vector.hcl
- Priority: MEDIUM
- Current: No persistent volumes, type=system (runs on all nodes)
- Target: System job (runs everywhere)
- Data: None (ephemeral logs, ships to loki)
- Changes:
  - ❓ Check if the glusterfs log path is still needed: /var/log/glusterfs:/var/log/glusterfs:ro
  - ✏️ Remove glusterfs log collection after GlusterFS is removed
- Notes: Log shipper. Can tolerate downtime.
Databases (Specialized)
clickhouse
- File: services/clickhouse.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/clickhouse
- Target: Affinity for zippy (large dataset), allow c1/c2/c3
- Data: /data/services/clickhouse (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/clickhouse → /data/services/clickhouse
  - ✏️ Add affinity for zippy (optional, but helps with performance)
- Notes: Used by plausible. Large time-series data. Important but can be recreated.
mongodb
- File: services/unifi.hcl (embedded in unifi job)
- Priority: HIGH
- Current: Uses /data/compute/appdata/unifi/mongodb
- Target: Float on c1/c2/c3 (with unifi)
- Data: /data/services/unifi/mongodb (NFS)
- Changes: See unifi below
- Notes: Only used by unifi. Should stay with unifi controller.
Web Applications
wordpress
- File: services/wordpress.hcl
- Priority: HIGH
- Current: Uses /data/sync/wordpress (syncthing-managed to avoid slow GlusterFS)
- Target: Float on c1/c2/c3
- Data: /data/services/wordpress (NFS from zippy)
- Changes:
  - ✏️ Volume path: /data/sync/wordpress → /data/services/wordpress
  - 📋 Before cutover: copy data from syncthing to zippy: rsync -av /data/sync/wordpress/ zippy:/persist/services/wordpress/
  - 📋 After migration: remove syncthing configuration for wordpress sync
- Notes: Production website. Important but can tolerate brief downtime during migration.
ghost
- File: services/ghost.hcl
- Priority: no longer used, should wipe
- Current: Uses /data/compute/appdata/ghost
- Target: Float on c1/c2/c3
- Data: /data/services/ghost (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/ghost → /data/services/ghost
- Notes: Blog platform (alo.land). Can tolerate downtime.
gitea
- File: services/gitea.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/gitea/data, /data/compute/appdata/gitea/config
- Target: Float on c1/c2/c3
- Data: /data/services/gitea/* (NFS)
- Changes:
  - ✏️ Volume paths: /data/compute/appdata/gitea/* → /data/services/gitea/*
- Notes: Git server. Contains code repositories. Important.
wiki (tiddlywiki)
- File: services/wiki.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/wiki via host volume mount
- Target: Float on c1/c2/c3
- Data: /data/services/wiki (NFS)
- Changes:
  - ✏️ Volume mount path in volume_mount blocks
  - ⚠️ Uses exec driver with host volumes - verify NFS mount works with this
- Notes: Multiple tiddlywiki instances. Personal wikis. Can tolerate downtime.
code-server
- File: services/code-server.hcl
- Priority: LOW
- Current: Uses /data/compute/appdata/code
- Target: Float on c1/c2/c3
- Data: /data/services/code (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/code → /data/services/code
- Notes: Web IDE. Low priority, for development only.
beancount (fava)
- File: services/beancount.hcl
- Priority: MEDIUM
- Current: Uses /data/compute/appdata/beancount
- Target: Float on c1/c2/c3
- Data: /data/services/beancount (NFS)
- Changes:
  - ✏️ Volume path: /data/compute/appdata/beancount → /data/services/beancount
- Notes: Finance tracking. Low priority.
adminer
- File: services/adminer.hcl
- Priority: LOW
- Current: Stateless
- Target: Float on c1/c2/c3
- Data: None
- Changes: None needed
- Notes: Database admin UI. Only needed for maintenance.
plausible
- File: services/plausible.hcl
- Priority: HIGH
- Current: Stateless (uses postgres and clickhouse)
- Target: Float on c1/c2/c3
- Data: None (uses postgres.service.consul, clickhouse.service.consul)
- Changes: None needed
- Notes: Website analytics. Nice to have but not critical.
evcc
- File: services/evcc.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/evcc/evcc.yaml, /data/compute/appdata/evcc/evcc
- Target: Float on c1/c2/c3
- Data: /data/services/evcc/* (NFS)
- Changes:
  - ✏️ Volume paths: /data/compute/appdata/evcc/* → /data/services/evcc/*
- Notes: EV charging controller. Important for daily use.
vikunja
- File: services/vikunja.hcl (assumed to exist based on README)
- Priority: no longer used, should delete
- Current: Likely uses /data/compute/appdata/vikunja
- Target: Float on c1/c2/c3
- Data: /data/services/vikunja (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/vikunja
- Notes: Task management. Low priority.
leantime
- File: services/leantime.hcl
- Priority: no longer used, should delete
- Current: Likely uses /data/compute/appdata/leantime
- Target: Float on c1/c2/c3
- Data: /data/services/leantime (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/leantime
- Notes: Project management. Low priority.
Network Infrastructure
unifi
- File: services/unifi.hcl
- Priority: HIGH
- Current: Uses /data/compute/appdata/unifi/data, /data/compute/appdata/unifi/mongodb
- Target: Float on c1/c2/c3/fractal/zippy
- Data: /data/services/unifi/* (NFS)
- Changes:
  - ✏️ Volume paths: /data/compute/appdata/unifi/* → /data/services/unifi/*
- Notes: UniFi network controller. Critical for network management. Has keepalived VIP for stable inform address. Floating is fine.
Media Stack
media (radarr, sonarr, bazarr, plex, qbittorrent)
- File: services/media.hcl
- Priority: MEDIUM
- Current: Uses /data/compute/appdata/radarr, /data/compute/appdata/sonarr, etc. and /data/media
- Target: MUST run on fractal (local /data/media access)
- Data:
  - /data/services/radarr (NFS) - config data
  - /data/media (local disk on fractal; CIFS mount on other nodes)
- Changes:
  - ✏️ Volume paths: /data/compute/appdata/* → /data/services/*
  - ✏️ Add constraint: constraint { attribute = "${node.unique.name}" value = "fractal" }
- Notes: Heavy I/O to /data/media. Must run on fractal for performance. Has keepalived VIP.
Utility Services
weewx
- File: services/weewx.hcl
- Priority: HIGH
- Current: Likely uses /data/compute/appdata/weewx
- Target: Float on c1/c2/c3
- Data: /data/services/weewx (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/weewx
- Notes: Weather station. Low priority.
maps
- File: services/maps.hcl
- Priority: MEDIUM
- Current: Likely uses /data/compute/appdata/maps
- Target: Float on c1/c2/c3 (or fractal if large tile data)
- Data: /data/services/maps (NFS) or /data/media/maps if large
- Changes:
  - ✏️ Volume paths: check data size, may want to move to /data/media
- Notes: Map tiles. Low priority.
netbox
- File: services/netbox.hcl
- Priority: LOW
- Current: Likely uses /data/compute/appdata/netbox
- Target: Float on c1/c2/c3
- Data: /data/services/netbox (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/netbox
- Notes: IPAM/DCIM. Low priority, for documentation.
farmos
- File: services/farmos.hcl
- Priority: LOW
- Current: Likely uses /data/compute/appdata/farmos
- Target: Float on c1/c2/c3
- Data: /data/services/farmos (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/farmos
- Notes: Farm management. Low priority.
urbit
- File: services/urbit.hcl
- Priority: LOW
- Current: Likely uses /data/compute/appdata/urbit
- Target: Float on c1/c2/c3
- Data: /data/services/urbit (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/urbit
- Notes: Urbit node. Experimental, low priority.
webodm
- File: services/webodm.hcl
- Priority: LOW
- Current: Likely uses /data/compute/appdata/webodm
- Target: Float on c1/c2/c3 (or fractal if processing large imagery from /data/media)
- Data: /data/services/webodm (NFS)
- Changes:
  - ✏️ Volume paths: update to /data/services/webodm
  - 🤔 May benefit from running on fractal if it processes files from /data/media
- Notes: Drone imagery processing. Low priority.
velutrack
- File: services/velutrack.hcl
- Priority: LOW
- Current: Likely minimal state
- Target: Float on c1/c2/c3
- Data: Minimal
- Changes: Verify if any volume paths need updating
- Notes: Vehicle tracking. Low priority.
resol-gateway
- File: services/resol-gateway.hcl
- Priority: HIGH
- Current: Likely minimal state
- Target: Float on c1/c2/c3
- Data: Minimal
- Changes: Verify if any volume paths need updating
- Notes: Solar thermal controller. Low priority.
igsync
- File: services/igsync.hcl
- Priority: MEDIUM
- Current: Likely uses /data/compute/appdata/igsync or /data/media
- Target: Float on c1/c2/c3 (or fractal if storing to /data/media)
- Data: Check if it writes to /data/media or /data/services
- Changes:
  - ✏️ Volume paths: verify and update
- Notes: Instagram sync. Low priority.
jupyter
- File: services/jupyter.hcl
- Priority: LOW
- Current: Stateless or minimal state
- Target: Float on c1/c2/c3
- Data: Minimal
- Changes: Verify if any volume paths need updating
- Notes: Notebook server. Low priority, for experimentation.
whoami
- File: services/whoami.hcl
- Priority: LOW
- Current: Stateless
- Target: Float on c1/c2/c3
- Data: None
- Changes: None needed
- Notes: Test service. Can be stopped during migration.
tiddlywiki (if separate from wiki.hcl)
- File: services/tiddlywiki.hcl
- Priority: MEDIUM
- Current: Likely same as wiki.hcl
- Target: Float on c1/c2/c3
- Data: /data/services/tiddlywiki (NFS)
- Changes: Same as wiki.hcl
- Notes: May be duplicate of wiki.hcl.
Backup Jobs
mysql-backup
- File: services/mysql-backup.hcl
- Priority: HIGH
- Current: Likely writes to /data/compute or /data/shared
- Target: Float on c1/c2/c3
- Data: Should write to /data/shared (backed up to fractal)
- Changes:
  - ✏️ Verify backup destination, should be /data/shared/backups/mysql
- Notes: Important for disaster recovery. Should run regularly.
postgres-backup
- File: services/postgres-backup.hcl
- Priority: HIGH
- Current: Likely writes to /data/compute or /data/shared
- Target: Float on c1/c2/c3
- Data: Should write to /data/shared (backed up to fractal)
- Changes:
  - ✏️ Verify backup destination, should be /data/shared/backups/postgres
- Notes: Important for disaster recovery. Should run regularly.
wordpress-backup
- File: services/wordpress-backup.hcl
- Priority: MEDIUM
- Current: Likely writes to /data/compute or /data/shared
- Target: Float on c1/c2/c3
- Data: Should write to /data/shared (backed up to fractal)
- Changes:
  - ✏️ Verify backup destination
- Notes: Periodic backup job.
Failover Procedures
NFS Server Failover (zippy → c1 or c2)
When to use: zippy is down and not coming back soon
Prerequisites:
- c1 and c2 have been receiving btrfs snapshots from zippy
- Last successful replication was < 1 hour ago (verify timestamps)
Procedure:
1. Choose standby node (c1 or c2)
# Check replication freshness
ssh c1 "ls -lt /persist/services-standby@* | head -5"
ssh c2 "ls -lt /persist/services-standby@* | head -5"
# Choose the one with the most recent snapshot
# For this example, we'll use c1
2. On standby node (c1), promote standby to primary
ssh c1
# Stop NFS client mount (if running)
sudo systemctl stop data-services.mount
# Find latest snapshot
LATEST=$(ls -t /persist/services-standby@* | head -1)
# Create writable subvolume from snapshot
sudo btrfs subvolume snapshot $LATEST /persist/services
# Verify
ls -la /persist/services
3. Deploy c1-nfs-server configuration
# From your workstation
deploy -s '.#c1-nfs-server'
# This activates:
# - NFS server on c1
# - Consul service registration for "services"
# - Firewall rules
4. On c1, verify NFS is running
ssh c1
sudo systemctl status nfs-server
showmount -e localhost
dig @localhost -p 8600 services.service.consul # Should show c1's IP
5. On other nodes, remount NFS
# Nodes should auto-remount via Consul DNS, but you can force it:
for host in c2 c3 fractal zippy; do
  ssh $host "sudo systemctl restart data-services.mount"
done
6. Verify Nomad jobs are healthy
nomad job status mysql
nomad job status postgres
# Check all critical services
7. Update monitoring/alerts
- Note in documentation that c1 is now the primary NFS server
- Set up an alert as a reminder to fail back to zippy once it's repaired
Recovery Time Objective (RTO): ~10-15 minutes
Recovery Point Objective (RPO): Last snapshot interval (5 minutes max)
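To keep that RPO honest day to day, a freshness check like the following could run from cron or a Nomad periodic job (a sketch; the 10-minute threshold and the snapshot naming/receive path are assumptions matching the replication setup above):
#!/usr/bin/env bash
# Warn if the newest replicated snapshot on a standby is older than 10 minutes.
for host in c1 c2; do
  latest=$(ssh "$host" "ls -t /persist/ | grep 'services@' | head -1")
  [ -n "$latest" ] || { echo "WARNING: no replicated snapshot found on $host"; continue; }
  ts=${latest#services@}   # e.g. 20250101-120500
  age=$(( $(date +%s) - $(date -d "${ts:0:8} ${ts:9:2}:${ts:11:2}:${ts:13:2}" +%s) ))
  if [ "$age" -gt 600 ]; then
    echo "WARNING: $host replication lag ${age}s (latest: $latest)"
  fi
done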
Failing Back to zippy
When to use: zippy is repaired and ready to resume primary role
Procedure:
1. Sync data from c1 back to zippy
# On c1 (current primary)
sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)
FAILBACK=$(ls -t /persist/services@failback-* | head -1)
sudo btrfs send $FAILBACK | ssh zippy "sudo btrfs receive /persist/"
# On zippy, make it writable
ssh zippy "sudo btrfs subvolume snapshot /persist/$(basename $FAILBACK) /persist/services"
2. Deploy zippy back to NFS server role
deploy -s '.#zippy'
# Consul will register services.service.consul → zippy again
3. Demote c1 back to standby
deploy -s '.#c1'
# This removes the NFS server and restores the NFS client mount
4. Verify all nodes are mounting from zippy
dig @c1 -p 8600 services.service.consul # Should show zippy's IP
for host in c1 c2 c3 fractal; do
  ssh $host "df -h | grep services"
done
Database Job Failover (automatic via Nomad)
When to use: zippy is down, database jobs need to run elsewhere
What happens automatically:
- Nomad detects zippy is unhealthy
- Jobs with constraint zippy|c1|c2 are rescheduled to c1 or c2
- Jobs start on the new node, accessing /data/services (now via NFS from the promoted standby)
Manual intervention needed:
- None if NFS failover completed successfully
- If jobs are stuck: nomad job stop mysql && nomad job run services/mysql.hcl
What to check:
nomad job status mysql
nomad job status postgres
nomad job status redis
# Verify they're running on c1 or c2, not zippy
nomad alloc status <alloc-id>
Complete Cluster Failure (lose quorum)
Scenario: 3 or more servers go down, quorum lost
Prevention: This is why we have 5 servers (need 3 for quorum)
Recovery:
- Bring up at least 3 servers (any 3 from c1, c2, c3, fractal, zippy)
- If that's not possible, bootstrap a new cluster:
# On one surviving server, force bootstrap
consul force-leave <failed-node>
nomad operator raft list-peers
nomad operator raft remove-peer <failed-peer>
- Restore from backups (worst case)
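After quorum is restored by either route, confirm leadership before restarting workloads (standard operator commands):
consul operator raft list-peers
nomad operator raft list-peers
consul info | grep 'leader ='
nomad server members   # every remaining server should show "alive"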
Post-Migration Verification Checklist
- All 5 servers in quorum: consul members shows c1, c2, c3, fractal, zippy
- NFS mounts working: df -h | grep services on all nodes
- Btrfs replication running: check systemd timers on zippy
- Critical services up: mysql, postgres, redis, traefik, authentik
- Monitoring working: Prometheus, Grafana, Loki accessible
- Media stack on fractal: nomad alloc status shows the media job on fractal
- Database jobs on zippy: nomad alloc status shows mysql/postgres on zippy
- Consul DNS working: dig @localhost -p 8600 services.service.consul
- Backups running: Kopia snapshots include /persist/services
- GlusterFS removed: no glusterfs processes, volumes deleted
- Documentation updated: README.md, architecture diagrams
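Most of this checklist can be driven from a workstation in one pass; a rough sketch using the hostnames and paths from this plan:
echo "== quorum =="
consul members && nomad server members
echo "== NFS mounts on every node =="
for h in c1 c2 c3 fractal zippy; do ssh "$h" "df -h /data/services" && echo "$h: OK"; done
echo "== Consul DNS =="
dig @localhost -p 8600 +short services.service.consul
echo "== replication timers =="
ssh zippy "systemctl list-timers 'replicate-to-*'"
echo "== critical jobs =="
for j in mysql postgres redis traefik authentik; do nomad job status "$j" | head -n 5; done
echo "== GlusterFS really gone =="
for h in c1 c2 c3; do ssh "$h" "pgrep -x glusterd || echo no glusterd"; done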
Rollback Plan
If migration fails catastrophically:
1. Stop all new Nomad jobs
nomad job stop -purge <new-jobs>
2. Restore GlusterFS mounts
# On all nodes, re-enable the GlusterFS client
deploy # With old configs
3. Restart old Nomad jobs
# With old paths pointing to /data/compute
nomad run services/*.hcl # Old versions from git
4. Restore data if needed
rsync -av /backup/compute-pre-migration/ /data/compute/
Important: Keep GlusterFS running until Phase 4 is complete and verified!
Questions Answered
- ✅ Where is /data/sync/wordpress mounted from?
  - Answer: Syncthing-managed to avoid slow GlusterFS
  - Action: Migrate to /data/services/wordpress, remove syncthing config
- ✅ Which services use /data/media directly?
  - Answer: Only media.hcl (radarr, sonarr, plex, qbittorrent)
  - Action: Constrain media.hcl to fractal, everything else uses CIFS mount
- ✅ Do we want unifi on fractal or floating?
  - Answer: Floating is fine
  - Action: No constraint needed
- ✅ What's the plan for sunny's existing data?
  - Answer: Ethereum data stays local, not replicated (too expensive)
  - Action: Either backup/restore or resync from network during NixOS conversion
Questions Still to Answer
- Backup retention for btrfs snapshots?
  - Current plan: keep 24 hours of snapshots on zippy
  - Is this enough? Or do we want more for safety?
  - This should be fine -- snapshots are just for hot recovery. More/older backups are kept via kopia on fractal.
- c1-nfs-server vs c1 config - same host, different configs?
  - Recommendation: use the same hostname, different flake outputs
    - c1 = normal config with NFS client
    - c1-nfs-server = variant with NFS server enabled
    - Both in flake.nix, deploy the appropriate one based on role
  - Answer: recommendation makes sense.
- Should we verify webodm, igsync, maps don't need /data/media access?
  - Neither webodm nor igsync needs /data/media
  - maps needs /data/shared
Timeline Estimate
Total duration: 15-22 hours excluding optional Phase 5 (can be split across multiple sessions)
- Phase 0 (Prep): 1-2 hours
- Phase 1 (fractal): 6-8 hours
- Phase 2 (zippy storage): 2-3 hours
- Phase 3 (GlusterFS → NFS): 3-4 hours
- Phase 4 (Nomad jobs): 2-4 hours
- Phase 5 (sunny): 6-8 hours for Stage 1 (optional, can be done later)
- Phase 6 (Cleanup): 1 hour
Suggested schedule:
- Day 1: Phases 0-1 (fractal conversion, establish quorum)
- Day 2: Phases 2-3 (zippy storage, data migration)
- Day 3: Phase 4 (Nomad job updates and deployment)
- Day 4: Phases 5-6 (sunny + cleanup) or take a break and do later
Maintenance windows needed:
- Phase 3: ~1 hour downtime (all services stopped during data migration)
- Phase 4: Rolling (services come back up as redeployed)