alo-cluster/CLAUDE.md
2025-10-25 08:51:29 +01:00

# Claude Code Quick Reference
NixOS cluster configuration using flakes. Homelab infrastructure with Nomad/Consul orchestration.
## Project Structure
```
├── common/
│ ├── global/ # Applied to all hosts (backup, sops, users, etc.)
│ ├── minimal-node.nix # Base (ssh, user, boot, impermanence)
│ ├── cluster-member.nix # Consul + storage clients (NFS/CIFS/GlusterFS)
│ ├── nomad-worker.nix # Nomad client (runs jobs) + Docker + NFS deps
│ ├── nomad-server.nix # Enables Consul + Nomad server mode
│ ├── cluster-tools.nix # Just CLI tools (nomad, wander, damon)
│ ├── workstation-node.nix # Dev tools (wget, deploy-rs, docker, nix-ld)
│ ├── desktop-node.nix # Hyprland + GUI environment
│ ├── nfs-services-server.nix # NFS server + btrfs replication (zippy)
│ └── nfs-services-standby.nix # NFS standby + receive replication (c1)
├── hosts/
│ ├── c1/, c2/, c3/ # Cattle nodes (quorum + workers)
│ ├── zippy/ # Primary storage + NFS server + worker (not quorum)
│ ├── chilly/ # Home Assistant VM + cluster member (Consul only)
│ ├── sparky/ # Desktop + cluster member (Consul only)
│ ├── fractal/ # (Proxmox, will become NixOS storage node)
│ └── sunny/ # (Standalone ethereum node, not in cluster)
├── docs/
│ ├── CLUSTER_REVAMP.md # Master plan for architecture changes
│ ├── MIGRATION_TODO.md # Tracking checklist for migration
│ └── NFS_FAILOVER.md # NFS failover procedures
└── services/ # Nomad job specs (.hcl files)
```
## Current Architecture
### Storage Mounts
- `/data/services` - NFS from `data-services.service.consul` (zippy primary, c1 standby)
- `/data/media` - CIFS from fractal (existing, unchanged)
- `/data/shared` - CIFS from fractal (existing, unchanged)
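
For orientation, a client-side NFS mount such as `/data/services` could be declared with the standard NixOS `fileSystems` option. This is only a minimal sketch: the export path (`/`) and the mount options are placeholders assumed for illustration, not values taken from this repo.

```nix
# Illustrative sketch only; not the repo's actual mount definition.
{ ... }:
{
  fileSystems."/data/services" = {
    # The Consul DNS name follows whichever node is currently serving NFS.
    # The export path after the colon is a placeholder (assumption).
    device = "data-services.service.consul:/";
    fsType = "nfs";
    # Assumed options: mount lazily on first access, and treat it as a network mount.
    options = [ "x-systemd.automount" "noauto" "_netdev" ];
  };
}
```

Mounting by service name rather than by hostname is what would let clients follow a failover from zippy to c1 without a config change.
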
### Hosts
- **c1, c2, c3**: Cattle nodes, run most workloads, Nomad/Consul quorum members
- **zippy**: Primary NFS server, runs workloads (affinity), NOT quorum, replicates to c1 every 5min
- **chilly**: Home Assistant VM, cluster member (Consul agent + CLI tools), no workloads
- **sparky**: Desktop/laptop, cluster member (Consul agent + CLI tools), no workloads
- **fractal**: Storage node (Proxmox/ZFS), will join quorum after GlusterFS removed
- **sunny**: Standalone ethereum staking node (not in cluster)
## Config Architecture
**Modular role-based configs** (compose as needed):
- `minimal-node.nix` - Base for all systems (SSH, user, boot, impermanence)
- `cluster-member.nix` - Consul agent + shared storage mounts (no Nomad)
- `nomad-worker.nix` - Nomad client to run jobs (requires cluster-member)
- `nomad-server.nix` - Enables Consul + Nomad server mode (for quorum members)
- `cluster-tools.nix` - Just CLI tools (no services)
**Machine type configs** (via flake profile):
- `workstation-node.nix` - Dev tools (deploy-rs, docker, nix-ld, emulation)
- `desktop-node.nix` - Extends workstation + Hyprland/GUI
**Host composition examples**:
- c1/c2/c3: `cluster-member + nomad-worker + nomad-server` (quorum + runs jobs)
- zippy: `cluster-member + nomad-worker` (runs jobs, not quorum)
- chilly/sparky: `cluster-member + cluster-tools` (Consul + CLI only)
**Key insight**: Profiles (workstation/desktop) no longer imply cluster membership. Hosts explicitly declare roles via imports.
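
The composition above might look roughly like this in a host file. This is a sketch, not the actual contents of `hosts/zippy/default.nix`: the relative paths assume the layout shown under Project Structure, and any other imports the real file has (hardware config, NFS server role, etc.) are left out.

```nix
# hosts/zippy/default.nix (illustrative sketch, assumed layout)
{ ... }:
{
  imports = [
    ../../common/cluster-member.nix # Consul agent + shared storage mounts
    ../../common/nomad-worker.nix   # Nomad client; requires cluster-member

    # A quorum member (c1/c2/c3) would additionally import:
    # ../../common/nomad-server.nix

    # A Consul-only host (chilly/sparky) would swap nomad-worker for:
    # ../../common/cluster-tools.nix
  ];
}
```

Because each role is an explicit import, what a host does in the cluster can be read directly from its import list.
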
## Key Patterns
**NFS Server/Standby**:
- Primary (zippy): imports `nfs-services-server.nix`, sets `standbys = ["c1"]`
- Standby (c1): imports `nfs-services-standby.nix`, sets `replicationKeys = [...]`
- Replication: btrfs send/receive every 5min, incremental with fallback to full
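
A sketch of the primary side, under stated assumptions: only the module file name and the `standbys` attribute come from this document. The `nfsServicesServer` option namespace is hypothetical, so check `common/nfs-services-server.nix` for the real option path before copying this.

```nix
# hosts/zippy/default.nix (illustrative sketch)
{ ... }:
{
  imports = [ ../../common/nfs-services-server.nix ];

  # `nfsServicesServer` is a HYPOTHETICAL namespace used for illustration;
  # only the `standbys` attribute name is taken from this reference.
  nfsServicesServer.standbys = [ "c1" ];
}
```

The standby (c1) mirrors this by importing `nfs-services-standby.nix` and setting `replicationKeys` to the keys allowed to send replication streams.
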
**Backups**:
- Kopia client on all nodes → Kopia server on fractal
- Backs up `/persist` hourly via btrfs snapshot
- Excludes: `services@*` and `services-standby/services@*` (replication snapshots)
**Secrets**:
- SOPS for secrets, files in `secrets/`
- Keys managed per-host
## Migration Status
- **Phases 3 & 4**: Complete. GlusterFS removed; all services are on NFS.
- **Next**: Convert fractal to NixOS (deferred).

See `docs/MIGRATION_TODO.md` for the detailed checklist.
## Common Tasks
- **Deploy a host**: `deploy -s '.#hostname'`
- **Deploy all**: `deploy`
- **Check replication**: `ssh zippy journalctl -u replicate-services-to-c1.service -f`
- **NFS failover**: See `docs/NFS_FAILOVER.md`
- **Nomad jobs**: `services/*.hcl` - service data stored at `/data/services/<service-name>`
## Troubleshooting Hints
- Replication errors mentioning "empty stream": the replication SSH key is restricted to `btrfs receive`, so it cannot run any other command
- NFS split-brain protection: nfs-server checks Consul before starting
- Btrfs snapshots: nested snapshots appear as empty dirs in parent snapshots
- Kopia: uses temporary snapshot for consistency, doesn't back up nested subvolumes
## Important Files
- `common/global/backup.nix` - Kopia backup configuration
- `hosts/zippy/default.nix` - NFS server config, replication targets
- `hosts/c1/default.nix` - NFS standby config, authorized replication keys
- `flake.nix` - Host definitions, nixpkgs inputs
---
*Auto-generated reference for Claude Code. Keep concise. Update when architecture changes.*