Claude Code Quick Reference

NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Consul orchestration.

Project Structure

├── common/
│   ├── global/          # Applied to all hosts (backup, sops, users, etc.)
│   ├── compute-node.nix # Nomad client + Consul agent + NFS client
│   ├── cluster-node.nix # Nomad server + Consul server (for quorum members)
│   ├── nfs-services-server.nix   # NFS server + btrfs replication (zippy)
│   └── nfs-services-standby.nix  # NFS standby + receive replication (c1, c2)
├── hosts/
│   ├── c1/, c2/, c3/    # Cattle nodes (compute, quorum members)
│   ├── zippy/           # Primary storage + NFS server + stateful workloads
│   ├── fractal/         # (Proxmox, will become NixOS storage node)
│   ├── sunny/           # (Standalone ethereum node, not in cluster)
│   └── chilly/          # (Home Assistant VM, not in cluster)
├── docs/
│   ├── CLUSTER_REVAMP.md    # Master plan for architecture changes
│   ├── MIGRATION_TODO.md    # Tracking checklist for migration
│   └── NFS_FAILOVER.md      # NFS failover procedures
└── services/            # Nomad job specs (.hcl files)
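
A sketch of how a cattle node composes these modules (illustrative only; the real hosts/c3/default.nix will differ):

```nix
# hosts/c3/default.nix (illustrative, not the real file)
{ ... }: {
  imports = [
    ../../common/global            # assumes global/ exposes a default.nix aggregator
    ../../common/compute-node.nix  # Nomad client + Consul agent + NFS client
    ../../common/cluster-node.nix  # Nomad server + Consul server (quorum member)
  ];
  networking.hostName = "c3";
}
```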

Current Architecture (transitioning)

  • OLD: GlusterFS on c1/c2/c3 at /data/compute (being phased out)
  • NEW: NFS from zippy at /data/services (current target)

Storage Mounts

  • /data/services - NFS from data-services.service.consul (zippy primary, c1 standby)
  • /data/media - CIFS from fractal (existing, unchanged)
  • /data/shared - CIFS from fractal (existing, unchanged)
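
A rough sketch of the /data/services client mount in Nix (the export path and mount options are assumptions, not copied from compute-node.nix):

```nix
{
  fileSystems."/data/services" = {
    device = "data-services.service.consul:/export/services";  # export path is an assumption
    fsType = "nfs";
    options = [ "nfsvers=4.2" "noatime" "_netdev" ];           # illustrative options
  };
}
```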

Hosts

  • c1, c2, c3: Cattle nodes, run most workloads, Nomad/Consul quorum
  • zippy: Primary NFS server, runs databases (affinity), replicates to c1 every 5min
  • fractal: Storage node (Proxmox/ZFS), will join quorum after GlusterFS removed
  • sunny: Standalone ethereum staking node
  • chilly: Home Assistant VM

Key Patterns

NFS Server/Standby:

  • Primary (zippy): imports nfs-services-server.nix, sets standbys = ["c1"]
  • Standby (c1): imports nfs-services-standby.nix, sets replicationKeys = [...]
  • Replication: btrfs send/receive every 5min, incremental with fallback to full
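
Sketch of the two sides of this pattern (option names follow the bullets above; the exact option namespace and the key value are assumptions):

```nix
# hosts/zippy/default.nix (primary, sketch)
{
  imports = [ ../../common/nfs-services-server.nix ];
  standbys = [ "c1" ];   # replicate /data/services to c1 every 5min
}

# hosts/c1/default.nix (standby, sketch)
{
  imports = [ ../../common/nfs-services-standby.nix ];
  replicationKeys = [ "ssh-ed25519 AAAA... replication@zippy" ];  # placeholder key
}
```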

Backups:

  • Kopia client on all nodes → Kopia server on fractal
  • Backs up /persist hourly via btrfs snapshot
  • Excludes: services@* and services-standby/services@* (replication snapshots)
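
The real logic lives in common/global/backup.nix; below is a minimal sketch of the snapshot-then-backup idea (service name and snapshot path are placeholders, and it presumes the Kopia repository on fractal is already connected):

```nix
{ pkgs, ... }: {
  systemd.services.kopia-backup = {
    path = [ pkgs.btrfs-progs pkgs.kopia ];
    script = ''
      # remove any leftover snapshot from a failed previous run
      btrfs subvolume delete /persist/.kopia-snap 2>/dev/null || true
      # temporary read-only snapshot of /persist for a consistent view
      btrfs subvolume snapshot -r /persist /persist/.kopia-snap
      # nested subvolumes (services@*, services-standby/services@*) appear as
      # empty dirs here; the real config also excludes them explicitly
      kopia snapshot create /persist/.kopia-snap
      btrfs subvolume delete /persist/.kopia-snap
    '';
    startAt = "hourly";
  };
}
```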

Secrets:

  • SOPS for secrets, files in secrets/
  • Keys managed per-host
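
Typical per-host wiring, assuming the sops-nix module (file names and the example secret are placeholders):

```nix
{
  sops.defaultSopsFile = ../../secrets/common.yaml;  # placeholder file name
  sops.age.keyFile = "/var/lib/sops-nix/key.txt";    # per-host key, assumed location
  sops.secrets."kopia/repo-password" = { };          # example entry, decrypted under /run/secrets/
}
```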

Migration Status

  • Phase: 4 in progress (20/35 services migrated)
  • Current: Migrating services from GlusterFS → NFS
  • Next: Finish migrating remaining services, update host volumes, remove GlusterFS
  • Later: Convert fractal to NixOS (deferred)

See docs/MIGRATION_TODO.md for detailed checklist.

IMPORTANT: When working on migration tasks:

  1. Always update docs/MIGRATION_TODO.md after completing each service migration
  2. Update both the individual service checklist AND the summary counts at the bottom
  3. Pattern: /data/compute/appdata/foo → /data/services/foo (NOT /data/services/appdata/foo!)
  4. Migration workflow per service: stop → copy data → edit config → start → update MIGRATION_TODO.md

Common Tasks

  • Deploy a host: deploy -s '.#hostname'
  • Deploy all: deploy
  • Check replication: ssh zippy journalctl -u replicate-services-to-c1.service -f
  • NFS failover: See docs/NFS_FAILOVER.md
  • Nomad jobs: services/*.hcl - update paths: /data/compute/appdata/foo → /data/services/foo (NOT /data/services/appdata/foo!)

Troubleshooting Hints

  • Replication errors with "empty stream": the SSH key is restricted to btrfs receive and can't run other commands (see the sketch after this list)
  • NFS split-brain protection: nfs-server checks Consul before starting
  • Btrfs snapshots: nested snapshots appear as empty dirs in parent snapshots
  • Kopia: uses temporary snapshot for consistency, doesn't back up nested subvolumes
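
For the "empty stream" hint above: the standby pins zippy's replication key to a single forced command, so anything other than btrfs receive fails. A sketch of that restriction (user, key, and receive path are placeholders):

```nix
{ pkgs, ... }: {
  # Only `btrfs receive` into the standby path is allowed for this key;
  # any other command over it produces the "empty stream" style errors.
  users.users.root.openssh.authorizedKeys.keys = [
    ''command="${pkgs.btrfs-progs}/bin/btrfs receive /persist/services-standby",restrict ssh-ed25519 AAAA... replication@zippy''
  ];
}
```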

Important Files

  • common/global/backup.nix - Kopia backup configuration
  • hosts/zippy/default.nix - NFS server config, replication targets
  • hosts/c1/default.nix - NFS standby config, authorized replication keys
  • flake.nix - Host definitions, nixpkgs inputs
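
Rough shape of the host wiring in flake.nix, assuming the deploy tool is deploy-rs (inferred from deploy -s '.#hostname'); inputs, branch, and attribute details are illustrative:

```nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-unstable";  # branch is an assumption
    deploy-rs.url = "github:serokell/deploy-rs";
  };

  outputs = { self, nixpkgs, deploy-rs, ... }: {
    nixosConfigurations.zippy = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [ ./hosts/zippy ];
    };

    deploy.nodes.zippy = {
      hostname = "zippy";
      profiles.system.path =
        deploy-rs.lib.x86_64-linux.activate.nixos self.nixosConfigurations.zippy;
    };
  };
}
```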

Auto-generated reference for Claude Code. Keep concise. Update when architecture changes.