Cluster Architecture Revamp

Status: Planning complete, ready for review and refinement

Key Decisions

  • Replication: 5-minute intervals (incremental btrfs send)
  • WordPress: Currently syncthing → will use /data/services via NFS
  • Media: Only media.hcl needs /data/media, constrained to fractal
  • Unifi: Floating (no constraint needed)
  • Sunny: Standalone, ethereum data stays local (not replicated)
  • Quorum: 5 servers (c1, c2, c3, fractal, zippy)
  • NFS Failover: Via Consul DNS (services.service.consul)

Table of Contents

  1. End State Architecture
  2. Migration Steps
  3. Service Catalog
  4. Failover Procedures

End State Architecture

Cluster Topology

5-Server Quorum (Consul + Nomad server+client):

  • c1, c2, c3: Cattle nodes - x86_64, run most stateless workloads
  • fractal: Storage node - x86_64, 6x spinning drives, runs media workloads
  • zippy: Stateful anchor - x86_64, runs database workloads (via affinity), primary NFS server

Standalone Nodes (not in quorum):

  • sunny: x86_64, ethereum node + staking, base NixOS configs only
  • chilly: x86_64, Home Assistant VM, base NixOS configs only

Quorum Math:

  • 5 servers → quorum requires 3 healthy nodes
  • Can tolerate 2 simultaneous failures
  • Bootstrap expect: 3
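
Once all five servers have joined, the raft voter count can be double-checked from any server (standard Consul/Nomad CLI):

# each list should show 5 voters once the full quorum is up
consul operator raft list-peers
nomad operator raft list-peers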

Storage Architecture

Primary Storage (zippy):

  • /persist/services - btrfs subvolume
    • Contains: mysql, postgres, redis, clickhouse, mongodb, app data
    • Exported via NFS to: services.service.consul:/persist/services
    • Replicated via btrfs send to c1 and c2 every 5 minutes (incremental)

Standby Storage (c1, c2):

  • /persist/services-standby - btrfs subvolume
    • Receives replicated snapshots from zippy via incremental btrfs send
    • Can be promoted to /persist/services and exported as NFS during failover
    • Maximum data loss: 5 minutes (last replication interval)

Standalone Storage (sunny):

  • /persist/ethereum - local btrfs subvolume (or similar)
    • Contains: ethereum blockchain data, staking keys
    • NOT replicated - too large/expensive to replicate full ethereum node
    • Backed up via kopia to fractal (if feasible/needed)
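
If that kopia backup is pursued, one option is simply pointing sunny at the existing kopia server on fractal — a rough sketch, not a decision (staking keys may deserve a separate, more careful backup path):

# on sunny (hypothetical; only if backing up the chain data is deemed worthwhile)
kopia repository connect server --url https://fractal:51515
kopia snapshot create /persist/ethereum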

Media Storage (fractal):

  • /data/media - existing spinning drive storage
    • Exported via Samba (existing)
    • Mounted on c1, c2, c3 via CIFS (existing)
    • Local access on fractal for media workloads

Shared Storage (fractal):

  • /data/shared - existing spinning drive storage
    • Exported via Samba (existing)
    • Mounted on c1, c2, c3 via CIFS (existing)

Network Services

NFS Primary (zippy):

services.nfs.server = {
  enable = true;
  exports = ''
    /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
  '';
};

services.consul.extraConfig.services = [{
  name = "services";
  port = 2049;
  checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
}];

NFS Client (all nodes):

fileSystems."/data/services" = {
  device = "services.service.consul:/persist/services";
  fsType = "nfs";
  options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
};
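
To confirm the failover-aware mount on a client, resolve the Consul DNS name and poke the automount, for example:

# on any client node
dig @localhost -p 8600 services.service.consul +short   # returns the current NFS server's IP
ls /data/services                                        # first access triggers the systemd automount
df -h /data/services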

Samba Exports (fractal - existing):

  • //fractal/media → /data/media
  • //fractal/shared → /data/shared

Nomad Job Placement Strategy

Affinity-based (prefer zippy, allow c1/c2):

  • mysql, postgres, redis - stateful databases
  • Run on zippy normally, can failover to c1/c2 if zippy down

Constrained (must run on fractal):

  • media.hcl - radarr, sonarr, bazarr, plex, qbittorrent
    • Reason: Heavy /data/media access, benefits from local storage
  • prometheus.hcl - metrics database with 30d retention
    • Reason: Large time-series data, spinning disks OK, saves SSD space
  • loki.hcl - log aggregation with 31d retention
    • Reason: Large log data, spinning disks OK
  • clickhouse.hcl - analytics database for plausible
    • Reason: Large time-series data, spinning disks OK

Floating (can run anywhere on c1/c2/c3/fractal/zippy):

  • All other services including:
    • traefik, authentik, web apps
    • grafana (small data, just dashboards/config, queries prometheus for metrics)
    • databases (mysql, postgres, redis) — float within zippy/c1/c2 via the affinity and constraint above
    • vector (system job, runs everywhere)
  • Nomad schedules based on resources and constraints

Data Migration

Path changes needed in Nomad jobs:

  • /data/compute/appdata/* → /data/services/*
  • /data/compute/config/* → /data/services/*
  • /data/sync/wordpress → /data/services/wordpress

No changes needed:

  • /data/media/* - stays the same (CIFS mount from fractal, used only by media services)
  • /data/shared/* - stays the same (CIFS mount from fractal)

Deprecated after migration:

  • /data/sync/wordpress - currently managed by syncthing to avoid slow GlusterFS
    • Will be replaced by NFS mount at /data/services/wordpress
    • Syncthing configuration for this can be removed
    • Final sync: copy from syncthing to /persist/services/wordpress on zippy before cutover

Migration Steps

Important path simplification note:

  • All service paths use /data/services/* directly (not /data/services/appdata/*)
  • Example: /data/compute/appdata/mysql → /data/services/mysql
  • Simpler, cleaner, easier to manage
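
Since most of the Phase 4 edits are mechanical path rewrites, a quick sed pass over the job files can do the bulk of the work — a sketch only, assuming the appdata/config/sync layout above; review the diff before committing:

# from the repo root; double-check with git diff afterwards
sed -i \
  -e 's|/data/compute/appdata/|/data/services/|g' \
  -e 's|/data/compute/config/|/data/services/|g' \
  -e 's|/data/sync/wordpress|/data/services/wordpress|g' \
  services/*.hcl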

Phase 0: Preparation

Duration: 1-2 hours

  1. Backup everything

    # On all nodes, ensure kopia backups are current
    kopia snapshot list
    
    # Backup glusterfs data manually
    rsync -av /data/compute/ /backup/compute-pre-migration/
    
  2. Document current state

    # Save current nomad job list
    nomad job status -json > /backup/nomad-jobs-pre-migration.json
    
    # Save consul service catalog
    consul catalog services > /backup/consul-services-pre-migration.txt
    
  3. Review this document

    • Verify all services are cataloged
    • Confirm priority assignments
    • Adjust as needed

Phase 1: Convert fractal to NixOS

Duration: 6-8 hours

Current state:

  • Proxmox on ZFS
  • System pool: rpool (~500GB, will be wiped)
  • Data pools (preserved):
    • double1 - 3.6T (homes, shared)
    • double2 - 7.2T (backup - kopia repo, PBS)
    • double3 - 17T (media, torrent)
  • Services: Samba (homes, shared, media), Kopia server, PBS
  • Bind mounts: /data/{homes,shared,media,torrent} → ZFS datasets

Goal: Fresh NixOS on rpool, preserve data pools, join cluster

Step-by-step procedure:

1. Pre-migration documentation

# From your workstation: capture fractal's current ZFS layout
cat > /tmp/detect-zfs.sh << 'EOF'
#!/bin/bash
echo "=== ZFS Pools ==="
zpool status

echo -e "\n=== ZFS Datasets ==="
zfs list -o name,mountpoint,used,avail,mounted -r double1 double2 double3

echo -e "\n=== Bind mounts ==="
cat /etc/fstab | grep double

echo -e "\n=== Data directories ==="
ls -la /data/

echo -e "\n=== Samba users/groups ==="
getent group shared compute
getent passwd compute
EOF
chmod +x /tmp/detect-zfs.sh
scp /tmp/detect-zfs.sh fractal:/tmp/detect-zfs.sh
ssh fractal /tmp/detect-zfs.sh > /backup/fractal-zfs-layout.txt

# Save samba config
scp fractal:/etc/samba/smb.conf /backup/fractal-smb.conf

# Save kopia certs and config
scp -r fractal:~/kopia-certs /backup/fractal-kopia-certs/
scp fractal:~/.config/kopia/repository.config /backup/fractal-kopia-repository.config

# Verify kopia backups are current
ssh fractal "kopia snapshot list --all"

2. Stop services on fractal

ssh fractal "systemctl stop smbd nmbd kopia"
# Don't stop PBS yet (in case we need to restore)

3. Install NixOS

  • Boot NixOS installer USB
  • IMPORTANT: Do NOT touch double1, double2, double3 during install!
  • Install only on rpool (or create new pool if needed)
# In NixOS installer
# Option A: Reuse rpool (wipe and recreate)
zpool destroy rpool

# Option B: Use different disk if available
# Then follow standard NixOS btrfs install on that disk
  • Use standard encrypted btrfs layout (matching other hosts)
  • Minimal install first, will add cluster configs later

4. First boot - import ZFS pools

# SSH into fresh NixOS install

# Import pools (read-only first, to be safe)
zpool import -f -o readonly=on double1
zpool import -f -o readonly=on double2
zpool import -f -o readonly=on double3

# Verify datasets
zfs list -r double1 double2 double3

# Example output should show:
# double1/homes
# double1/shared
# double2/backup
# double3/media
# double3/torrent

# If everything looks good, export and reimport read-write
zpool export double1 double2 double3
zpool import double1
zpool import double2
zpool import double3

# Set ZFS mountpoints (if needed)
# These may already be set from Proxmox
zfs set mountpoint=/double1 double1
zfs set mountpoint=/double2 double2
zfs set mountpoint=/double3 double3

5. Create fractal NixOS configuration

# hosts/fractal/default.nix
{ config, pkgs, ... }:
{
  imports = [
    ../../common/encrypted-btrfs-layout.nix
    ../../common/global
    ../../common/cluster-node.nix  # Consul + Nomad (will add in step 7)
    ../../common/nomad.nix  # Both server and client
    ./hardware.nix
  ];

  networking.hostName = "fractal";

  # ZFS support
  boot.supportedFilesystems = [ "zfs" ];
  boot.zfs.extraPools = [ "double1" "double2" "double3" ];

  # Ensure ZFS pools are imported before mounting
  systemd.services.zfs-import.wantedBy = [ "multi-user.target" ];

  # Bind mounts for /data (matching Proxmox setup)
  fileSystems."/data/homes" = {
    device = "/double1/homes";
    fsType = "none";
    options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
  };

  fileSystems."/data/shared" = {
    device = "/double1/shared";
    fsType = "none";
    options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
  };

  fileSystems."/data/media" = {
    device = "/double3/media";
    fsType = "none";
    options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
  };

  fileSystems."/data/torrent" = {
    device = "/double3/torrent";
    fsType = "none";
    options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
  };

  fileSystems."/backup" = {
    device = "/double2/backup";
    fsType = "none";
    options = [ "bind" "x-systemd.requires=zfs-mount.service" ];
  };

  # Create data directory structure
  systemd.tmpfiles.rules = [
    "d /data 0755 root root -"
  ];

  # Users and groups for samba
  users.groups.shared = { gid = 1001; };
  users.groups.compute = { gid = 1002; };
  users.users.compute = {
    isSystemUser = true;
    uid = 1002;
    group = "compute";
  };

  # Ensure ppetru is in shared group
  users.users.ppetru.extraGroups = [ "shared" ];

  # Samba server
  services.samba = {
    enable = true;
    openFirewall = true;

    extraConfig = ''
      workgroup = WORKGROUP
      server string = fractal
      netbios name = fractal
      security = user
      map to guest = bad user
    '';

    shares = {
      homes = {
        comment = "Home Directories";
        browseable = "no";
        path = "/data/homes/%S";
        "read only" = "no";
      };

      shared = {
        path = "/data/shared";
        "read only" = "no";
        browseable = "yes";
        "guest ok" = "no";
        "create mask" = "0775";
        "directory mask" = "0775";
        "force group" = "+shared";
      };

      media = {
        path = "/data/media";
        "read only" = "no";
        browseable = "yes";
        "guest ok" = "no";
        "create mask" = "0755";
        "directory mask" = "0755";
      };
    };
  };

  # Kopia backup server
  systemd.services.kopia-server = {
    description = "Kopia Backup Server";
    wantedBy = [ "multi-user.target" ];
    after = [ "network.target" "zfs-mount.service" ];

    serviceConfig = {
      User = "ppetru";
      Group = "users";
      ExecStart = ''
        ${pkgs.kopia}/bin/kopia server start \
          --address 0.0.0.0:51515 \
          --tls-cert-file /home/ppetru/kopia-certs/kopia.cert \
          --tls-key-file /home/ppetru/kopia-certs/kopia.key
      '';
      Restart = "on-failure";
    };
  };

  # Kopia nightly snapshot (from cron)
  systemd.services.kopia-snapshot = {
    description = "Kopia snapshot of homes and shared";
    serviceConfig = {
      Type = "oneshot";
      User = "ppetru";
      Group = "users";
      ExecStart = ''
        ${pkgs.kopia}/bin/kopia --config-file=/home/ppetru/.config/kopia/repository.config \
          snapshot create /data/homes /data/shared \
          --log-level=warning --no-progress
      '';
    };
  };

  systemd.timers.kopia-snapshot = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "22:47";
      Persistent = true;
    };
  };

  # Keep kopia config and certs persistent
  environment.persistence."/persist" = {
    directories = [
      "/home/ppetru/.config/kopia"
      "/home/ppetru/kopia-certs"
    ];
  };

  networking.firewall.allowedTCPPorts = [
    139 445  # Samba
    51515    # Kopia
  ];
  networking.firewall.allowedUDPPorts = [
    137 138  # Samba
  ];
}

6. Deploy initial config (without cluster)

# First, deploy without cluster-node.nix to verify storage works
# Comment out cluster-node import temporarily

deploy -s '.#fractal'

# Verify mounts
ssh fractal "df -h | grep data"
ssh fractal "ls -la /data/"

# Test samba
smbclient -L fractal -U ppetru

# Test kopia
ssh fractal "systemctl status kopia-server"

7. Join cluster (add to quorum)

# Uncomment cluster-node.nix import in fractal config
# Update all cluster configs for 5-server quorum
# (See step 3 in existing Phase 1 docs)

deploy  # Deploy to all nodes

# Verify quorum
consul members
nomad server members

8. Update cluster configs for 5-server quorum

# common/consul.nix
servers = ["c1" "c2" "c3" "fractal" "zippy"];
bootstrap_expect = 3;

# common/nomad.nix
servers = ["c1" "c2" "c3" "fractal" "zippy"];
bootstrap_expect = 3;

9. Verify fractal is fully operational

# Check all services
ssh fractal "systemctl status samba kopia-server kopia-snapshot.timer"

# Verify ZFS pools
ssh fractal "zpool status"
ssh fractal "zfs list"

# Test accessing shares from another node
ssh c1 "ls /data/media /data/shared"

# Verify kopia clients can still connect
kopia repository status --server=https://fractal:51515

# Check nomad can see fractal
nomad node status | grep fractal

# Verify quorum
consul members  # Should see c1, c2, c3, fractal
nomad server members  # Should see 4 servers

Phase 2: Setup zippy storage layer

Duration: 2-3 hours

Goal: Prepare zippy for NFS server role, setup replication

  1. Create btrfs subvolume on zippy

    ssh zippy
    sudo btrfs subvolume create /persist/services
    sudo chown ppetru:users /persist/services
    
  2. Update zippy configuration

    # hosts/zippy/default.nix
    imports = [
      ../../common/encrypted-btrfs-layout.nix
      ../../common/global
      ../../common/cluster-node.nix  # Adds to quorum
      ../../common/nomad.nix
      ./hardware.nix
    ];
    
    # NFS server
    services.nfs.server = {
      enable = true;
      exports = ''
        /persist/services 192.168.1.0/24(rw,sync,no_subtree_check,no_root_squash)
      '';
    };
    
    # Consul service registration for NFS
    services.consul.extraConfig.services = [{
      name = "services";
      port = 2049;
      checks = [{ tcp = "localhost:2049"; interval = "30s"; }];
    }];
    
    # Btrfs replication to standbys (incremental after first full send)
    systemd.services.replicate-to-c1 = {
      description = "Replicate /persist/services to c1";
      script = ''
        ${pkgs.btrfs-progs}/bin/btrfs subvolume snapshot -r /persist/services /persist/services@$(date +%Y%m%d-%H%M%S)
        # -d: list the snapshot directories themselves, not their contents
        LATEST=$(ls -dt /persist/services@* | head -1)

        # Get previous snapshot for incremental send
        PREV=$(ls -dt /persist/services@* | head -2 | tail -1)

        # First snapshot: full send. Subsequent runs: incremental with -p (parent)
        if [ "$LATEST" != "$PREV" ]; then
          ${pkgs.btrfs-progs}/bin/btrfs send -p $PREV $LATEST | ${pkgs.openssh}/bin/ssh c1 "${pkgs.btrfs-progs}/bin/btrfs receive /persist/services-standby/"
        else
          # First snapshot, full send
          ${pkgs.btrfs-progs}/bin/btrfs send $LATEST | ${pkgs.openssh}/bin/ssh c1 "${pkgs.btrfs-progs}/bin/btrfs receive /persist/services-standby/"
        fi

        # Cleanup old snapshots on the sender (keep last 24 hours)
        # -maxdepth 1 so only the snapshot subvolumes match, not files inside them
        find /persist -maxdepth 1 -name 'services@*' -mtime +1 -exec ${pkgs.btrfs-progs}/bin/btrfs subvolume delete {} \;
        # After the first timer run, verify that snapshots arrive on the standbys
        # (see the quick check at the end of this phase)
      '';
    };
    
    systemd.timers.replicate-to-c1 = {
      wantedBy = [ "timers.target" ];
      timerConfig = {
        OnCalendar = "*:0/5";  # Every 5 minutes (incremental after first full send)
        Persistent = true;
      };
    };
    
    # Same for c2
    systemd.services.replicate-to-c2 = { ... };
    systemd.timers.replicate-to-c2 = { ... };
    
  3. Setup standby storage on c1 and c2

    # On c1 and c2
    ssh c1 sudo btrfs subvolume create /persist/services-standby
    ssh c2 sudo btrfs subvolume create /persist/services-standby
    
  4. Deploy and verify

    deploy -s '.#zippy'
    
    # Verify NFS export
    showmount -e zippy
    
    # Verify Consul registration
    dig @localhost -p 8600 services.service.consul
    
  5. Verify quorum is now 5 servers

    consul members  # Should show c1, c2, c3, fractal, zippy
    nomad server members
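
After the first replication timer fires, it is worth confirming that snapshots are actually landing on the standbys (the quick check referenced in the replication script above):

# on zippy: snapshots created by the replication timer
ls -ldt /persist/services@* | head -3

# on c1/c2: received read-only snapshots under the standby subvolume
ssh c1 "ls -ldt /persist/services-standby/services@* | head -3"
ssh c1 "sudo btrfs subvolume list -o /persist/services-standby"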
    

Phase 3: Migrate from GlusterFS to NFS

Duration: 3-4 hours

Goal: Move all data, update mounts, remove GlusterFS

  1. Copy data from GlusterFS to zippy

    # On any node with /data/compute mounted
    # Flatten per the path mapping above: appdata/* and config/* → /persist/services/*
    rsync -av --progress /data/compute/appdata/ zippy:/persist/services/
    rsync -av --progress /data/compute/config/ zippy:/persist/services/
    
    # Verify
    ssh zippy du -sh /persist/services
    
  2. Update all nodes to mount NFS

    # Update common/glusterfs-client.nix → common/nfs-client.nix
    # OR update common/cluster-node.nix to import nfs-client instead
    
    fileSystems."/data/services" = {
      device = "services.service.consul:/persist/services";
      fsType = "nfs";
      options = [ "x-systemd.automount" "noauto" "x-systemd.idle-timeout=60" ];
    };
    
    # Remove old GlusterFS mount
    # fileSystems."/data/compute" = ...  # DELETE
    
  3. Deploy updated configs

    deploy -s '.#c1' '.#c2' '.#c3' '.#fractal' '.#zippy'
    
  4. Verify NFS mounts

    for host in c1 c2 c3 fractal zippy; do
      ssh $host "df -h | grep services"
    done
    
  5. Stop all Nomad jobs temporarily

    # Get list of running jobs
    nomad job status | grep running | awk '{print $1}' > /tmp/running-jobs.txt
    
    # Stop all (they'll be restarted with updated paths in Phase 4)
    cat /tmp/running-jobs.txt | xargs -I {} nomad job stop {}
    
  6. Remove GlusterFS from cluster

    # On c1 (or any gluster server)
    gluster volume stop compute
    gluster volume delete compute
    
    # On all nodes
    for host in c1 c2 c3; do
      ssh $host "sudo systemctl stop glusterd; sudo systemctl disable glusterd"
    done
    
  7. Remove GlusterFS from NixOS configs

    # common/compute-node.nix - remove ./glusterfs.nix import
    # Deploy again
    deploy
    

Phase 4: Update and redeploy Nomad jobs

Duration: 2-4 hours

Goal: Update all Nomad job paths, add constraints/affinities, redeploy

  1. Update job specs (see Service Catalog below for details)

    • Change /data/compute → /data/services
    • Add constraints for media jobs → fractal
    • Add affinities for database jobs → zippy
  2. Deploy critical services first

    # Core infrastructure
    nomad run services/mysql.hcl
    nomad run services/postgres.hcl
    nomad run services/redis.hcl
    nomad run services/traefik.hcl
    nomad run services/authentik.hcl
    
    # Verify
    nomad job status mysql
    consul catalog services
    
  3. Deploy high-priority services

    nomad run services/prometheus.hcl
    nomad run services/grafana.hcl
    nomad run services/loki.hcl
    nomad run services/vector.hcl
    
    nomad run services/unifi.hcl
    nomad run services/gitea.hcl
    
  4. Deploy medium-priority services

    # See service catalog for full list
    nomad run services/wordpress.hcl
    nomad run services/ghost.hcl
    nomad run services/wiki.hcl
    # ... etc
    
  5. Deploy low-priority services

    nomad run services/media.hcl  # Will run on fractal due to constraint
    # ... etc
    
  6. Verify all services healthy

    nomad job status
    consul catalog services
    # Check traefik dashboard for health
    

Phase 5: Convert sunny to NixOS (Optional, can defer)

Duration: 6-10 hours (split across 2 stages)

Current state:

  • Proxmox with ~1.5TB ethereum node data
  • 2x LXC containers: besu (execution client), lighthouse (consensus beacon)
  • 1x VM: Rocketpool smartnode (docker containers for validator, node, MEV-boost, etc.)
  • Running in "hybrid mode" - managing own execution/consensus, rocketpool manages the rest

Goal: Get sunny on NixOS quickly, preserve ethereum data, defer "perfect" native setup


Stage 1: Quick NixOS Migration (containers)

Duration: 6-8 hours

Goal: NixOS + containerized ethereum stack, minimal disruption

1. Pre-migration backup and documentation

# Document current setup
ssh sunny "pct list" > /backup/sunny-containers.txt
ssh sunny "qm list" > /backup/sunny-vms.txt

# Find ethereum data locations in LXC containers
ssh sunny "pct config BESU_CT_ID" > /backup/sunny-besu-config.txt
ssh sunny "pct config LIGHTHOUSE_CT_ID" > /backup/sunny-lighthouse-config.txt

# Document rocketpool VM volumes
ssh sunny "qm config ROCKETPOOL_VM_ID" > /backup/sunny-rocketpool-config.txt

# Estimate ethereum data size
ssh sunny "du -sh /path/to/besu/data"
ssh sunny "du -sh /path/to/lighthouse/data"

# Backup rocketpool config (docker-compose, wallet keys, etc.)
# This is in the VM - need to access and backup critical files

2. Extract ethereum data from containers/VM

# Stop ethereum services to get consistent state
# (This will pause validation! Plan for attestation penalties)

# Copy besu data out of LXC
ssh sunny "pct stop BESU_CT_ID"
rsync -av --progress sunny:/var/lib/lxc/BESU_CT_ID/rootfs/path/to/besu/ /backup/sunny-besu-data/

# Copy lighthouse data out of LXC
ssh sunny "pct stop LIGHTHOUSE_CT_ID"
rsync -av --progress sunny:/var/lib/lxc/LIGHTHOUSE_CT_ID/rootfs/path/to/lighthouse/ /backup/sunny-lighthouse-data/

# Copy rocketpool data out of VM
# This includes validator keys, wallet, node config
# Access VM and copy out: ~/.rocketpool/data

3. Install NixOS on sunny

  • Fresh install with btrfs + impermanence
  • Create large /persist/ethereum for 1.5TB+ data
  • DO NOT try to resync from network (takes weeks!)
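
A minimal sketch of preparing that subvolume after first boot (assuming the same /persist btrfs layout as the other hosts):

# on sunny, once the fresh NixOS install is up
sudo btrfs subvolume create /persist/ethereum
df -h /persist   # confirm there is room for the ~1.5TB of chain data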

4. Restore ethereum data to NixOS

# After NixOS install, copy data back
ssh sunny "mkdir -p /persist/ethereum/{besu,lighthouse,rocketpool}"

rsync -av --progress /backup/sunny-besu-data/ sunny:/persist/ethereum/besu/
rsync -av --progress /backup/sunny-lighthouse-data/ sunny:/persist/ethereum/lighthouse/
# Rocketpool data copied later

5. Create sunny NixOS config (container-based)

# hosts/sunny/default.nix
{ config, pkgs, ... }:
{
  imports = [
    ../../common/encrypted-btrfs-layout.nix
    ../../common/global
    ./hardware.nix
  ];

  networking.hostName = "sunny";

  # NO cluster-node import - standalone for now
  # Can add to quorum later if desired

  # Container runtime
  virtualisation.podman = {
    enable = true;
    dockerCompat = true;  # Provides 'docker' command
    defaultNetwork.settings.dns_enabled = true;
  };

  # Besu execution client (container)
  virtualisation.oci-containers.containers.besu = {
    image = "hyperledger/besu:latest";
    volumes = [
      "/persist/ethereum/besu:/var/lib/besu"
    ];
    ports = [
      "8545:8545"   # HTTP RPC
      "8546:8546"   # WebSocket RPC
      "30303:30303" # P2P
    ];
    cmd = [
      "--data-path=/var/lib/besu"
      "--rpc-http-enabled=true"
      "--rpc-http-host=0.0.0.0"
      "--rpc-ws-enabled=true"
      "--rpc-ws-host=0.0.0.0"
      "--engine-rpc-enabled=true"
      "--engine-host-allowlist=*"
      "--engine-jwt-secret=/var/lib/besu/jwt.hex"
      # Add other besu flags as needed
    ];
    autoStart = true;
  };

  # Lighthouse beacon client (container)
  virtualisation.oci-containers.containers.lighthouse-beacon = {
    image = "sigp/lighthouse:latest";
    volumes = [
      "/persist/ethereum/lighthouse:/data"
      "/persist/ethereum/besu/jwt.hex:/jwt.hex:ro"
    ];
    ports = [
      "5052:5052"   # HTTP API
      "9000:9000"   # P2P
    ];
    cmd = [
      "lighthouse"
      "beacon"
      "--datadir=/data"
      "--http"
      "--http-address=0.0.0.0"
      "--execution-endpoint=http://besu:8551"
      "--execution-jwt=/jwt.hex"
      # Add other lighthouse flags
    ];
    dependsOn = [ "besu" ];
    autoStart = true;
  };

  # Rocketpool stack (podman-compose for multi-container setup)
  # TODO: This requires converting docker-compose to NixOS config
  # For now, can run docker-compose via systemd service
  systemd.services.rocketpool = {
    description = "Rocketpool Smartnode Stack";
    # oci-containers units are named podman-<name>.service
    after = [ "podman-besu.service" "podman-lighthouse-beacon.service" ];
    wantedBy = [ "multi-user.target" ];

    serviceConfig = {
      Type = "oneshot";
      RemainAfterExit = "yes";
      WorkingDirectory = "/persist/ethereum/rocketpool";
      # NOTE: docker-compose expects a docker socket; with podman this likely needs
      # virtualisation.podman.dockerSocket.enable = true (or use podman-compose instead)
      ExecStart = "${pkgs.docker-compose}/bin/docker-compose up -d";
      ExecStop = "${pkgs.docker-compose}/bin/docker-compose down";
    };
  };

  # Ethereum data is written directly to /persist/ethereum (see the container
  # volumes above), so it already survives reboots; no environment.persistence
  # entry is needed for it.

  # Firewall for ethereum
  networking.firewall = {
    allowedTCPPorts = [
      30303  # Besu P2P
      9000   # Lighthouse P2P
      # Add rocketpool ports
    ];
    allowedUDPPorts = [
      30303  # Besu P2P
      9000   # Lighthouse P2P
    ];
  };
}

6. Setup rocketpool docker-compose on NixOS

# After NixOS is running, restore rocketpool config
ssh sunny "mkdir -p /persist/ethereum/rocketpool"

# Copy rocketpool data (wallet, keys, config)
rsync -av /backup/sunny-rocketpool-data/ sunny:/persist/ethereum/rocketpool/

# Create docker-compose.yml for rocketpool stack
# Based on rocketpool hybrid mode docs
# This runs: validator, node software, MEV-boost, prometheus, etc.
# Connects to your besu + lighthouse containers

7. Deploy and test

deploy -s '.#sunny'

# Verify containers are running
ssh sunny "podman ps"

# Check besu sync status
ssh sunny "curl -X POST -H 'Content-Type: application/json' --data '{\"jsonrpc\":\"2.0\",\"method\":\"eth_syncing\",\"params\":[],\"id\":1}' http://localhost:8545"

# Check lighthouse sync status
ssh sunny "curl http://localhost:5052/eth/v1/node/syncing"

# Monitor rocketpool
ssh sunny "cd /persist/ethereum/rocketpool && docker-compose logs -f"

8. Monitor and stabilize

  • Ethereum should resume from where it left off (not resync!)
  • Validation will resume once beacon is sync'd
  • May have missed a few attestations during migration (minor penalty)

Stage 2: Native NixOS Services (Future)

Duration: TBD (do this later when time permits)

Goal: Convert to native NixOS services using ethereum-nix

Why defer this:

  • Complex (rocketpool not fully packaged for Nix)
  • Current container setup works fine
  • Can migrate incrementally (besu → native, then lighthouse, etc.)
  • No downtime once Stage 1 is stable

When ready:

  1. Research ethereum-nix support for besu + lighthouse + rocketpool
  2. Test on separate machine first
  3. Migrate one service at a time with minimal downtime
  4. Document in separate migration plan

For now: Stage 1 gets sunny on NixOS with base configs, managed declaratively, just using containers instead of native services.

Phase 6: Verification and cleanup

Duration: 1 hour

  1. Test failover procedure (see Failover Procedures below)

  2. Verify backups are working

    kopia snapshot list
    # Check that /persist/services is being backed up
    
  3. Update documentation

    • Update README.md
    • Document new architecture
    • Update stateful-commands.txt
  4. Clean up old GlusterFS data

    # Only after verifying everything works!
    for host in c1 c2 c3; do
      ssh $host "sudo rm -rf /persist/glusterfs"
    done
    

Service Catalog

Legend:

  • Priority: CRITICAL (must be up) / HIGH (important) / MEDIUM (nice to have) / LOW (can wait)
  • Target: Where it should run (constraint or affinity)
  • Data: What data it needs access to
  • Changes: What needs updating in the .hcl file

Core Infrastructure

mysql

  • File: services/mysql.hcl
  • Priority: CRITICAL
  • Current: Uses /data/compute/appdata/mysql
  • Target: Affinity for zippy, allow c1/c2
  • Data: /data/services/mysql (NFS from zippy)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/mysql → /data/services/mysql
    • ✏️ Add affinity:
      affinity {
        attribute = "${node.unique.name}"
        value     = "zippy"
        weight    = 100
      }
      
    • ✏️ Add constraint to allow fallback:
      constraint {
        attribute = "${node.unique.name}"
        operator  = "regexp"
        value     = "zippy|c1|c2"
      }
      
  • Notes: Core database, needs to stay up. Consul DNS mysql.service.consul unchanged.

postgres

  • File: services/postgres.hcl
  • Priority: CRITICAL
  • Current: Uses /data/compute/appdata/postgres, /data/compute/appdata/pgadmin
  • Target: Affinity for zippy, allow c1/c2
  • Data: /data/services/postgres, /data/services/pgadmin (NFS)
  • Changes:
    • ✏️ Volume paths: /data/compute/appdata/* → /data/services/*
    • ✏️ Add affinity and constraint (same as mysql)
  • Notes: Core database for authentik, gitea, plausible, netbox, etc.

redis

  • File: services/redis.hcl
  • Priority: CRITICAL
  • Current: Uses /data/compute/appdata/redis
  • Target: Affinity for zippy, allow c1/c2
  • Data: /data/services/redis (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/redis → /data/services/redis
    • ✏️ Add affinity and constraint (same as mysql)
  • Notes: Used by authentik, wordpress. Should co-locate with databases.

traefik

  • File: services/traefik.hcl
  • Priority: CRITICAL
  • Current: Uses /data/compute/config/traefik
  • Target: Float on c1/c2/c3 (keepalived handles HA)
  • Data: /data/services/config/traefik (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/config/traefik → /data/services/config/traefik
  • Notes: Reverse proxy, has keepalived for VIP failover. Critical for all web access.

authentik

  • File: services/authentik.hcl
  • Priority: CRITICAL
  • Current: No persistent volumes (stateless, uses postgres/redis)
  • Target: Float on c1/c2/c3
  • Data: None (uses postgres.service.consul, redis.service.consul)
  • Changes: None needed
  • Notes: SSO for most services. Must stay up.

Monitoring Stack

prometheus

  • File: services/prometheus.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/prometheus
  • Target: Float on c1/c2/c3
  • Data: /data/services/prometheus (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/prometheus → /data/services/prometheus
  • Notes: Metrics database. Important for monitoring but not critical for services.

grafana

  • File: services/grafana.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/grafana
  • Target: Float on c1/c2/c3
  • Data: /data/services/grafana (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/grafana → /data/services/grafana
  • Notes: Monitoring UI. Depends on prometheus.

loki

  • File: services/loki.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/loki
  • Target: Float on c1/c2/c3
  • Data: /data/services/loki (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/loki → /data/services/loki
  • Notes: Log aggregation. Important for debugging.

vector

  • File: services/vector.hcl
  • Priority: MEDIUM
  • Current: No persistent volumes, type=system (runs on all nodes)
  • Target: System job (runs everywhere)
  • Data: None (ephemeral logs, ships to loki)
  • Changes:
    • Check if glusterfs log path is still needed: /var/log/glusterfs:/var/log/glusterfs:ro
    • ✏️ Remove glusterfs log collection after GlusterFS is removed
  • Notes: Log shipper. Can tolerate downtime.

Databases (Specialized)

clickhouse

  • File: services/clickhouse.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/clickhouse
  • Target: Affinity for zippy (large dataset), allow c1/c2/c3
  • Data: /data/services/clickhouse (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/clickhouse → /data/services/clickhouse
    • ✏️ Add affinity for zippy (optional, but helps with performance)
  • Notes: Used by plausible. Large time-series data. Important but can be recreated.

mongodb

  • File: services/unifi.hcl (embedded in unifi job)
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/unifi/mongodb
  • Target: Float on c1/c2/c3 (with unifi)
  • Data: /data/services/unifi/mongodb (NFS)
  • Changes: See unifi below
  • Notes: Only used by unifi. Should stay with unifi controller.

Web Applications

wordpress

  • File: services/wordpress.hcl
  • Priority: HIGH
  • Current: Uses /data/sync/wordpress (syncthing-managed to avoid slow GlusterFS)
  • Target: Float on c1/c2/c3
  • Data: /data/services/wordpress (NFS from zippy)
  • Changes:
    • ✏️ Volume path: /data/sync/wordpress → /data/services/wordpress
    • 📋 Before cutover: Copy data from syncthing to zippy: rsync -av /data/sync/wordpress/ zippy:/persist/services/wordpress/
    • 📋 After migration: Remove syncthing configuration for wordpress sync
  • Notes: Production website. Important but can tolerate brief downtime during migration.

ghost

  • File: services/ghost.hcl
  • Priority: no longer used, should wipe
  • Current: Uses /data/compute/appdata/ghost
  • Target: Float on c1/c2/c3
  • Data: /data/services/ghost (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/ghost → /data/services/ghost
  • Notes: Blog platform (alo.land). Can tolerate downtime.

gitea

  • File: services/gitea.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/gitea/data, /data/compute/appdata/gitea/config
  • Target: Float on c1/c2/c3
  • Data: /data/services/gitea/* (NFS)
  • Changes:
    • ✏️ Volume paths: /data/compute/appdata/gitea/* → /data/services/gitea/*
  • Notes: Git server. Contains code repositories. Important.

wiki (tiddlywiki)

  • File: services/wiki.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/wiki via host volume mount
  • Target: Float on c1/c2/c3
  • Data: /data/services/wiki (NFS)
  • Changes:
    • ✏️ Volume mount path in volume_mount blocks
    • ⚠️ Uses exec driver with host volumes - verify NFS mount works with this
  • Notes: Multiple tiddlywiki instances. Personal wikis. Can tolerate downtime.

code-server

  • File: services/code-server.hcl
  • Priority: LOW
  • Current: Uses /data/compute/appdata/code
  • Target: Float on c1/c2/c3
  • Data: /data/services/code (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/code → /data/services/code
  • Notes: Web IDE. Low priority, for development only.

beancount (fava)

  • File: services/beancount.hcl
  • Priority: MEDIUM
  • Current: Uses /data/compute/appdata/beancount
  • Target: Float on c1/c2/c3
  • Data: /data/services/beancount (NFS)
  • Changes:
    • ✏️ Volume path: /data/compute/appdata/beancount → /data/services/beancount
  • Notes: Finance tracking. Low priority.

adminer

  • File: services/adminer.hcl
  • Priority: LOW
  • Current: Stateless
  • Target: Float on c1/c2/c3
  • Data: None
  • Changes: None needed
  • Notes: Database admin UI. Only needed for maintenance.

plausible

  • File: services/plausible.hcl
  • Priority: HIGH
  • Current: Stateless (uses postgres and clickhouse)
  • Target: Float on c1/c2/c3
  • Data: None (uses postgres.service.consul, clickhouse.service.consul)
  • Changes: None needed
  • Notes: Website analytics. Nice to have but not critical.

evcc

  • File: services/evcc.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/evcc/evcc.yaml, /data/compute/appdata/evcc/evcc
  • Target: Float on c1/c2/c3
  • Data: /data/services/evcc/* (NFS)
  • Changes:
    • ✏️ Volume paths: /data/compute/appdata/evcc/* → /data/services/evcc/*
  • Notes: EV charging controller. Important for daily use.

vikunja

  • File: services/vikunja.hcl (assumed to exist based on README)
  • Priority: no longer used, should delete
  • Current: Likely uses /data/compute/appdata/vikunja
  • Target: Float on c1/c2/c3
  • Data: /data/services/vikunja (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/vikunja
  • Notes: Task management. Low priority.

leantime

  • File: services/leantime.hcl
  • Priority: no longer used, should delete
  • Current: Likely uses /data/compute/appdata/leantime
  • Target: Float on c1/c2/c3
  • Data: /data/services/leantime (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/leantime
  • Notes: Project management. Low priority.

Network Infrastructure

unifi

  • File: services/unifi.hcl
  • Priority: HIGH
  • Current: Uses /data/compute/appdata/unifi/data, /data/compute/appdata/unifi/mongodb
  • Target: Float on c1/c2/c3/fractal/zippy
  • Data: /data/services/unifi/* (NFS)
  • Changes:
    • ✏️ Volume paths: /data/compute/appdata/unifi/* → /data/services/unifi/*
  • Notes: UniFi network controller. Critical for network management. Has keepalived VIP for stable inform address. Floating is fine.

Media Stack

media (radarr, sonarr, bazarr, plex, qbittorrent)

  • File: services/media.hcl
  • Priority: MEDIUM
  • Current: Uses /data/compute/appdata/radarr, /data/compute/appdata/sonarr, etc. and /data/media
  • Target: MUST run on fractal (local /data/media access)
  • Data:
    • /data/services/radarr (NFS) - config data
    • /data/media (local disk on fractal; CIFS mount on the other nodes)
  • Changes:
    • ✏️ Volume paths: /data/compute/appdata/* → /data/services/*
    • ✏️ Add constraint:
      constraint {
        attribute = "${node.unique.name}"
        value     = "fractal"
      }
      
  • Notes: Heavy I/O to /data/media. Must run on fractal for performance. Has keepalived VIP.

Utility Services

weewx

  • File: services/weewx.hcl
  • Priority: HIGH
  • Current: Likely uses /data/compute/appdata/weewx
  • Target: Float on c1/c2/c3
  • Data: /data/services/weewx (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/weewx
  • Notes: Weather station data logger.

maps

  • File: services/maps.hcl
  • Priority: MEDIUM
  • Current: Likely uses /data/compute/appdata/maps
  • Target: Float on c1/c2/c3 (or fractal if large tile data)
  • Data: /data/services/maps (NFS) or /data/media/maps if large
  • Changes:
    • ✏️ Volume paths: Check data size, may want to move to /data/media
  • Notes: Map tiles. Low priority.

netbox

  • File: services/netbox.hcl
  • Priority: LOW
  • Current: Likely uses /data/compute/appdata/netbox
  • Target: Float on c1/c2/c3
  • Data: /data/services/netbox (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/netbox
  • Notes: IPAM/DCIM. Low priority, for documentation.

farmos

  • File: services/farmos.hcl
  • Priority: LOW
  • Current: Likely uses /data/compute/appdata/farmos
  • Target: Float on c1/c2/c3
  • Data: /data/services/farmos (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/farmos
  • Notes: Farm management. Low priority.

urbit

  • File: services/urbit.hcl
  • Priority: LOW
  • Current: Likely uses /data/compute/appdata/urbit
  • Target: Float on c1/c2/c3
  • Data: /data/services/urbit (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/urbit
  • Notes: Urbit node. Experimental, low priority.

webodm

  • File: services/webodm.hcl
  • Priority: LOW
  • Current: Likely uses /data/compute/appdata/webodm
  • Target: Float on c1/c2/c3 (or fractal if processing large imagery from /data/media)
  • Data: /data/services/webodm (NFS)
  • Changes:
    • ✏️ Volume paths: Update to /data/services/webodm
    • 🤔 May benefit from running on fractal if it processes files from /data/media
  • Notes: Drone imagery processing. Low priority.

velutrack

  • File: services/velutrack.hcl
  • Priority: LOW
  • Current: Likely minimal state
  • Target: Float on c1/c2/c3
  • Data: Minimal
  • Changes: Verify if any volume paths need updating
  • Notes: Vehicle tracking. Low priority.

resol-gateway

  • File: services/resol-gateway.hcl
  • Priority: HIGH
  • Current: Likely minimal state
  • Target: Float on c1/c2/c3
  • Data: Minimal
  • Changes: Verify if any volume paths need updating
  • Notes: Solar thermal controller.

igsync

  • File: services/igsync.hcl
  • Priority: MEDIUM
  • Current: Likely uses /data/compute/appdata/igsync or /data/media
  • Target: Float on c1/c2/c3 (or fractal if storing to /data/media)
  • Data: Check if it writes to /data/media or /data/services
  • Changes:
    • ✏️ Volume paths: Verify and update
  • Notes: Instagram sync. Low priority.

jupyter

  • File: services/jupyter.hcl
  • Priority: LOW
  • Current: Stateless or minimal state
  • Target: Float on c1/c2/c3
  • Data: Minimal
  • Changes: Verify if any volume paths need updating
  • Notes: Notebook server. Low priority, for experimentation.

whoami

  • File: services/whoami.hcl
  • Priority: LOW
  • Current: Stateless
  • Target: Float on c1/c2/c3
  • Data: None
  • Changes: None needed
  • Notes: Test service. Can be stopped during migration.

tiddlywiki (if separate from wiki.hcl)

  • File: services/tiddlywiki.hcl
  • Priority: MEDIUM
  • Current: Likely same as wiki.hcl
  • Target: Float on c1/c2/c3
  • Data: /data/services/tiddlywiki (NFS)
  • Changes: Same as wiki.hcl
  • Notes: May be duplicate of wiki.hcl.

Backup Jobs

mysql-backup

  • File: services/mysql-backup.hcl
  • Priority: HIGH
  • Current: Likely writes to /data/compute or /data/shared
  • Target: Float on c1/c2/c3
  • Data: Should write to /data/shared (backed up to fractal)
  • Changes:
    • ✏️ Verify backup destination, should be /data/shared/backups/mysql
  • Notes: Important for disaster recovery. Should run regularly.

postgres-backup

  • File: services/postgres-backup.hcl
  • Priority: HIGH
  • Current: Likely writes to /data/compute or /data/shared
  • Target: Float on c1/c2/c3
  • Data: Should write to /data/shared (backed up to fractal)
  • Changes:
    • ✏️ Verify backup destination, should be /data/shared/backups/postgres
  • Notes: Important for disaster recovery. Should run regularly.

wordpress-backup

  • File: services/wordpress-backup.hcl
  • Priority: MEDIUM
  • Current: Likely writes to /data/compute or /data/shared
  • Target: Float on c1/c2/c3
  • Data: Should write to /data/shared (backed up to fractal)
  • Changes:
    • ✏️ Verify backup destination
  • Notes: Periodic backup job.

Failover Procedures

NFS Server Failover (zippy → c1 or c2)

When to use: zippy is down and not coming back soon

Prerequisites:

  • c1 and c2 have been receiving btrfs snapshots from zippy
  • Last successful replication was < 1 hour ago (verify timestamps)

Procedure:

  1. Choose standby node (c1 or c2)

    # Check replication freshness
    ssh c1 "ls -ldt /persist/services-standby/services@* | head -5"
    ssh c2 "ls -ldt /persist/services-standby/services@* | head -5"
    
    # Choose the one with most recent snapshot
    # For this example, we'll use c1
    
  2. On standby node (c1), promote standby to primary

    ssh c1
    
    # Stop NFS client mount (if running)
    sudo systemctl stop data-services.mount
    
    # Find latest snapshot
    LATEST=$(ls -dt /persist/services-standby/services@* | head -1)
    
    # Create writable subvolume from snapshot
    sudo btrfs subvolume snapshot $LATEST /persist/services
    
    # Verify
    ls -la /persist/services
    
  3. Deploy c1-nfs-server configuration

    # From your workstation
    deploy -s '.#c1-nfs-server'
    
    # This activates:
    # - NFS server on c1
    # - Consul service registration for "services"
    # - Firewall rules
    
  4. On c1, verify NFS is running

    ssh c1
    sudo systemctl status nfs-server
    showmount -e localhost
    dig @localhost -p 8600 services.service.consul  # Should show c1's IP
    
  5. On other nodes, remount NFS

    # Nodes should auto-remount via Consul DNS, but you can force it:
    for host in c2 c3 fractal zippy; do
      ssh $host "sudo systemctl restart data-services.mount"
    done
    
  6. Verify Nomad jobs are healthy

    nomad job status mysql
    nomad job status postgres
    # Check all critical services
    
  7. Update monitoring/alerts

    • Note in documentation that c1 is now primary NFS server
    • Set up alert to remember to fail back to zippy when it's repaired

Recovery Time Objective (RTO): ~10-15 minutes

Recovery Point Objective (RPO): Last snapshot interval (5 minutes max)

Failing Back to zippy

When to use: zippy is repaired and ready to resume primary role

Procedure:

  1. Sync data from c1 back to zippy

    # On c1 (current primary)
    sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)
    FAILBACK=$(ls -dt /persist/services@failback-* | head -1)
    sudo btrfs send $FAILBACK | ssh zippy "sudo btrfs receive /persist/"
    
    # On zippy, move the stale subvolume aside, then make the received snapshot writable
    ssh zippy "sudo mv /persist/services /persist/services.pre-failback"
    ssh zippy "sudo btrfs subvolume snapshot /persist/$(basename $FAILBACK) /persist/services"
    
  2. Deploy zippy back to NFS server role

    deploy -s '.#zippy'
    # Consul will register services.service.consul → zippy again
    
  3. Demote c1 back to standby

    deploy -s '.#c1'
    # This removes NFS server, restores NFS client mount
    
  4. Verify all nodes are mounting from zippy

    dig @c1 -p 8600 services.service.consul  # Should show zippy's IP
    
    for host in c1 c2 c3 fractal; do
      ssh $host "df -h | grep services"
    done
    

Database Job Failover (automatic via Nomad)

When to use: zippy is down, database jobs need to run elsewhere

What happens automatically:

  1. Nomad detects zippy is unhealthy
  2. Jobs with constraint zippy|c1|c2 are rescheduled to c1 or c2
  3. Jobs start on new node, accessing /data/services (now via NFS from promoted standby)

Manual intervention needed:

  • None if NFS failover completed successfully
  • If jobs are stuck: nomad job stop mysql && nomad job run services/mysql.hcl

What to check:

nomad job status mysql
nomad job status postgres
nomad job status redis

# Verify they're running on c1 or c2, not zippy
nomad alloc status <alloc-id>

Complete Cluster Failure (lose quorum)

Scenario: 3 or more servers go down, quorum lost

Prevention: This is why we have 5 servers (need 3 for quorum)

Recovery:

  1. Bring up at least 3 servers (any 3 from c1, c2, c3, fractal, zippy)
  2. If that's not possible, remove the failed peers so the surviving servers can re-elect a leader:
    # On one surviving server, force bootstrap
    consul force-leave <failed-node>
    nomad operator raft list-peers
    nomad operator raft remove-peer <failed-peer>
    
  3. Restore from backups (worst case)

Post-Migration Verification Checklist

  • All 5 servers in quorum: consul members shows c1, c2, c3, fractal, zippy
  • NFS mounts working: df -h | grep services on all nodes
  • Btrfs replication running: Check systemd timers on zippy
  • Critical services up: mysql, postgres, redis, traefik, authentik
  • Monitoring working: Prometheus, Grafana, Loki accessible
  • Media stack on fractal: nomad alloc status shows media job on fractal
  • Database jobs on zippy: nomad alloc status shows mysql/postgres on zippy
  • Consul DNS working: dig @localhost -p 8600 services.service.consul
  • Backups running: Kopia snapshots include /persist/services
  • GlusterFS removed: No glusterfs processes, volumes deleted
  • Documentation updated: README.md, architecture diagrams
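
Most of these checks are scriptable; a rough helper that walks the list (assumes passwordless ssh to every node and the consul/nomad/dig CLIs on the machine running it):

#!/usr/bin/env bash
# rough post-migration check for the list above
set -u

consul members                                           # expect c1 c2 c3 fractal zippy
dig @localhost -p 8600 services.service.consul +short    # NFS service via Consul DNS
ssh zippy "systemctl list-timers 'replicate-to-*'"       # btrfs replication timers firing

for host in c1 c2 c3 fractal zippy; do
  echo "== $host =="
  ssh "$host" "df -h /data/services | tail -1"           # NFS mount present
  ssh "$host" "pgrep -a glusterd && echo 'WARNING: glusterd still running' || true"
done

for job in mysql postgres redis traefik authentik prometheus grafana loki media; do
  nomad job status "$job" | head -4                      # critical jobs registered
done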

Rollback Plan

If migration fails catastrophically:

  1. Stop all new Nomad jobs

    nomad job stop -purge <new-jobs>
    
  2. Restore GlusterFS mounts

    # On all nodes, re-enable GlusterFS client
    deploy  # With old configs
    
  3. Restart old Nomad jobs

    # With old paths pointing to /data/compute
    nomad run services/*.hcl  # Old versions from git
    
  4. Restore data if needed

    rsync -av /backup/compute-pre-migration/ /data/compute/
    

Important: Keep GlusterFS running until Phase 4 is complete and verified!


Questions Answered

  1. Where is /data/sync/wordpress mounted from?

    • Answer: Syncthing-managed to avoid slow GlusterFS
    • Action: Migrate to /data/services/wordpress, remove syncthing config
  2. Which services use /data/media directly?

    • Answer: Only media.hcl (radarr, sonarr, plex, qbittorrent)
    • Action: Constrain media.hcl to fractal, everything else uses CIFS mount
  3. Do we want unifi on fractal or floating?

    • Answer: Floating is fine
    • Action: No constraint needed
  4. What's the plan for sunny's existing data?

    • Answer: Ethereum data stays local, not replicated (too expensive)
    • Action: Either backup/restore or resync from network during NixOS conversion

Questions Still to Answer

  1. Backup retention for btrfs snapshots?

    • Current plan: Keep 24 hours of snapshots on zippy
    • Is this enough? Or do we want more for safety?
    • This should be fine -- snapshots are just for hot recovery. More/older backups are kept via kopia on fractal.
  2. c1-nfs-server vs c1 config - same host, different configs?

    • Recommendation: Use same hostname, different flake output
    • c1 = normal config with NFS client
    • c1-nfs-server = variant with NFS server enabled
    • Both in flake.nix, deploy appropriate one based on role
    • Answer: recommendation makes sense.
  3. Should we verify webodm, igsync, maps don't need /data/media access?

    • none of them needs /data/media
    • maps needs /data/shared

Timeline Estimate

Total duration: 15-22 hours excluding the optional Phase 5 (can be split across multiple sessions)

  • Phase 0 (Prep): 1-2 hours
  • Phase 1 (fractal): 6-8 hours
  • Phase 2 (zippy storage): 2-3 hours
  • Phase 3 (GlusterFS → NFS): 3-4 hours
  • Phase 4 (Nomad jobs): 2-4 hours
  • Phase 5 (sunny): 6-10 hours (optional, can be done later)
  • Phase 6 (Cleanup): 1 hour

Suggested schedule:

  • Day 1: Phases 0-1 (fractal conversion, establish quorum)
  • Day 2: Phases 2-3 (zippy storage, data migration)
  • Day 3: Phase 4 (Nomad job updates and deployment)
  • Day 4: Phases 5-6 (sunny + cleanup) or take a break and do later

Maintenance windows needed:

  • Phase 3: ~1 hour downtime (all services stopped during data migration)
  • Phase 4: Rolling (services come back up as redeployed)