# NFS Services Failover Procedures

This document describes how to fail over the `/data/services` NFS server between hosts and how to fail back.

## Architecture Overview

- **Primary NFS Server**: Typically `zippy`
  - Exports `/persist/services` via NFS
  - Has local bind mount: `/data/services` → `/persist/services` (same path as clients)
  - Registers `data-services.service.consul` in Consul
  - Sets Nomad node meta: `storage_role = "primary"`
  - Replicates snapshots to standbys every 5 minutes via btrfs send
  - **Safety check**: Refuses to start if another NFS server is already active in Consul
- **Standby**: Typically `c1`
  - Receives snapshots at `/persist/services-standby/services@`
  - Can be promoted to NFS server during failover
  - No special Nomad node meta (not primary)
- **Clients**: All cluster nodes (c1, c2, c3, zippy)
  - Mount `/data/services` from `data-services.service.consul:/persist/services`
  - Automatically connect to whoever is registered in Consul
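For reference, what a client does is equivalent to the manual steps below. This is only an illustration of the mount path through Consul DNS; the actual client mounts are set up declaratively (with automount), and any NFS mount options are deliberately left out here:

```bash
# Resolve the current NFS server via the local Consul agent's DNS interface
dig @localhost -p 8600 +short data-services.service.consul

# Mount the export the same way the managed client mount does
# (illustrative only; the real mount is configured in NixOS with automount)
sudo mount -t nfs data-services.service.consul:/persist/services /data/services
```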
### Nomad Job Constraints

Jobs that need to run on the primary storage node should use:

```hcl
constraint {
  attribute = "${meta.storage_role}"
  value     = "primary"
}
```

This is useful for:

- Database jobs (mysql, postgres, redis) that benefit from local storage
- Jobs that need guaranteed fast disk I/O

During failover, the `storage_role = "primary"` meta attribute moves to the new NFS server, and Nomad automatically reschedules constrained jobs to the new primary.

## Prerequisites

- Standby has been receiving snapshots (check: `ls /persist/services-standby/services@*`)
- Last successful replication was recent (< 5-10 minutes)

---

## Failover: Promoting Standby to Primary

**Scenario**: `zippy` is down and you need to promote `c1` to be the NFS server.

### Step 1: Choose Latest Snapshot

On the standby (c1):

```bash
ssh c1
sudo ls -ldt /persist/services-standby/services@* | head -5
```

Find the most recent snapshot. Note the timestamp to estimate data loss (typically < 5 minutes).

### Step 2: Promote Snapshot to Read-Write Subvolume

On c1:

```bash
# Find the latest snapshot
LATEST=$(sudo ls -dt /persist/services-standby/services@* | head -1)

# Create writable subvolume from snapshot
sudo btrfs subvolume snapshot "$LATEST" /persist/services

# Verify
ls -la /persist/services
```

### Step 3: Update NixOS Configuration

Edit your configuration to swap the NFS server role:

**In `hosts/c1/default.nix`**:

```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-standby.nix  # REMOVE THIS
  ../../common/nfs-services-server.nix     # ADD THIS
];

# Add standbys if desired (optional - can leave empty during emergency)
nfsServicesServer.standbys = [];  # Or ["c2"] to add a new standby
```

**Optional: Prepare zippy config for when it comes back**:

In `hosts/zippy/default.nix` (can do this later too):

```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-server.nix  # REMOVE THIS
  ../../common/nfs-services-standby.nix   # ADD THIS
];

# Add the replication key from c1 (get it from c1:/persist/root/.ssh/btrfs-replication.pub)
nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@c1-replication"
];
```

### Step 4: Deploy Configuration

```bash
# From your workstation
deploy -s '.#c1'

# If zippy is still down, updating its config will fail, but that's okay
# You can update it later when it comes back
```

### Step 5: Verify NFS Server is Running

On c1:

```bash
sudo systemctl status nfs-server
sudo showmount -e localhost
dig @localhost -p 8600 data-services.service.consul  # Should show c1's IP
```

### Step 6: Verify Clients Can Access

From any node:

```bash
df -h | grep services
ls /data/services
```

The mount should automatically reconnect via Consul DNS.

### Step 7: Check Nomad Jobs

```bash
nomad job status mysql
nomad job status postgres
# Verify critical services are healthy

# Jobs constrained to ${meta.storage_role} = "primary" will automatically
# reschedule to c1 once it's deployed with the NFS server module
```

**Recovery Time Objective (RTO)**: ~10-15 minutes

**Recovery Point Objective (RPO)**: Last replication interval (5 minutes max)

**Note**: Jobs with the `storage_role = "primary"` constraint will automatically move to c1 because it now has that node meta attribute. No job spec changes needed!

---

## What Happens When zippy Comes Back?

**IMPORTANT**: If zippy reboots while still configured as NFS server, it will **refuse to start** the NFS service because it detects c1 is already active in Consul.

You'll see this error in `journalctl -u nfs-server`:

```
ERROR: Another NFS server is already active at 192.168.1.X
This host (192.168.1.2) is configured as NFS server but should be standby.
To fix:
  1. If this is intentional (failback), first demote the other server
  2. Update this host's config to use nfs-services-standby.nix instead
  3. Sync data from active server before promoting this host
```

This is a **safety feature** to prevent split-brain and data corruption.
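Conceptually, the check compares the address Consul currently returns for `data-services.service.consul` with this host's own address and aborts startup on a mismatch. The sketch below is only an illustration of that logic, not the module's actual pre-start script, and the IP-detection command is an assumption:

```bash
# Simplified sketch of the split-brain guard (not the module's actual script)
ACTIVE=$(dig @localhost -p 8600 +short data-services.service.consul | head -1)
MY_IP=$(ip -4 -o addr show scope global | awk '{print $4}' | cut -d/ -f1 | head -1)

if [ -n "$ACTIVE" ] && [ "$ACTIVE" != "$MY_IP" ]; then
  echo "ERROR: Another NFS server is already active at $ACTIVE" >&2
  exit 1
fi
```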
### Options when zippy comes back:

**Option A: Keep c1 as primary** (zippy becomes standby)

1. Update zippy's config to use `nfs-services-standby.nix`
2. Deploy to zippy
3. c1 will start replicating to zippy

**Option B: Fail back to zippy as primary**

Follow the "Failing Back to Original Primary" procedure below.

---

## Failing Back to Original Primary

**Scenario**: `zippy` is repaired and you want to move the NFS server role back from `c1` to `zippy`.

### Step 1: Sync Latest Data from c1 to zippy

On c1 (current primary):

```bash
# Create readonly snapshot of current state
sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)

# Find the snapshot
FAILBACK=$(sudo ls -dt /persist/services@failback-* | head -1)

# Send to zippy (use root SSH key if available, or generate temporary key)
sudo btrfs send "$FAILBACK" | ssh root@zippy "btrfs receive /persist/"
```

On zippy:

```bash
# Verify snapshot arrived
ls -la /persist/services@failback-*

# Create writable subvolume from the snapshot
FAILBACK=$(ls -dt /persist/services@failback-* | head -1)
sudo btrfs subvolume snapshot "$FAILBACK" /persist/services

# Verify
ls -la /persist/services
```

### Step 2: Update NixOS Configuration

Swap the roles back:

**In `hosts/zippy/default.nix`**:

```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-standby.nix  # REMOVE THIS
  ../../common/nfs-services-server.nix     # ADD THIS
];

nfsServicesServer.standbys = ["c1"];
```

**In `hosts/c1/default.nix`**:

```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-server.nix  # REMOVE THIS
  ../../common/nfs-services-standby.nix   # ADD THIS
];

nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from zippy:/persist/root/.ssh/btrfs-replication.pub
];
```

### Step 3: Deploy Configurations

```bash
# IMPORTANT: Deploy c1 FIRST to demote it
deploy -s '.#c1'

# Wait for c1 to stop NFS server
ssh c1 sudo systemctl status nfs-server  # Should be inactive

# Then deploy zippy to promote it
deploy -s '.#zippy'
```

The order matters! If you deploy zippy first, it will see c1 is still active and refuse to start.

### Step 4: Verify Failback

Check Consul DNS points to zippy:

```bash
dig @c1 -p 8600 data-services.service.consul  # Should show zippy's IP
```

Check clients are mounting from zippy:

```bash
for host in c1 c2 c3; do
  ssh $host "df -h | grep services"
done
```

### Step 5: Clean Up Temporary Snapshots

On c1:

```bash
# Remove the failback snapshot and the promoted subvolume
sudo btrfs subvolume delete /persist/services@failback-*
sudo btrfs subvolume delete /persist/services
```

---

## Adding a New Standby

**Scenario**: You want to add `c2` as an additional standby.

### Step 1: Create Standby Subvolume on c2

```bash
ssh c2
sudo btrfs subvolume create /persist/services-standby
```

### Step 2: Update c2 Configuration

**In `hosts/c2/default.nix`**:

```nix
imports = [
  # ... existing imports ...
  ../../common/nfs-services-standby.nix
];

nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from current NFS server
];
```

### Step 3: Update NFS Server Configuration

On the current NFS server (e.g., zippy), update the standbys list:

**In `hosts/zippy/default.nix`**:

```nix
nfsServicesServer.standbys = ["c1" "c2"];  # Added c2
```

### Step 4: Deploy

```bash
deploy -s '.#c2'
deploy -s '.#zippy'
```

The next replication cycle (within 5 minutes) will do a full send to c2, then switch to incremental.
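For orientation, one replication cycle conceptually looks like the sketch below. This is not the module's actual script; the snapshot naming, the parent-selection logic, and the `root@` SSH target are assumptions based on the paths used elsewhere in this document:

```bash
# Sketch of one replication cycle from the NFS server to a standby (run as root)
STANDBY=c2
NEW="/persist/services@$(date +%Y%m%d-%H%M%S)"
PARENT=$(ls -dt /persist/services@* 2>/dev/null | head -1)

# Read-only snapshot of the live data
btrfs subvolume snapshot -r /persist/services "$NEW"

if [ -n "$PARENT" ]; then
  # Incremental send based on the previous snapshot (which must also exist on the standby)
  btrfs send -p "$PARENT" "$NEW" | ssh "root@$STANDBY" "btrfs receive /persist/services-standby/"
else
  # First cycle for this standby: full send
  btrfs send "$NEW" | ssh "root@$STANDBY" "btrfs receive /persist/services-standby/"
fi
```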
---

## Troubleshooting

### Replication Failed

Check the replication service logs:

```bash
# On NFS server
sudo journalctl -u replicate-services-to-c1 -f
```

Common issues:

- SSH key not found → Run key generation step (see stateful-commands.txt)
- Permission denied → Check authorized_keys on standby
- Snapshot already exists → Old snapshot with same timestamp, wait for next cycle

### Clients Can't Mount

Check Consul:

```bash
dig @localhost -p 8600 data-services.service.consul
consul catalog services | grep data-services
```

If Consul isn't resolving:

- NFS server might not have registered → Check `sudo systemctl status nfs-server`
- Consul agent might be down → Check `sudo systemctl status consul`

### Mount is Stale

Force remount:

```bash
sudo systemctl restart data-services.mount
```

Or unmount and let automount handle it:

```bash
sudo umount /data/services
ls /data/services  # Triggers automount
```

### Split-Brain Prevention: NFS Server Won't Start

If you see:

```
ERROR: Another NFS server is already active at 192.168.1.X
```

This is **intentional** - the safety check is working! You have two options:

1. **Keep the other server as primary**: Update this host's config to be a standby instead
2. **Fail back to this host**: First demote the other server, sync data, then deploy both hosts in the correct order

---

## Monitoring

### Check Replication Status

On NFS server:

```bash
# List recent snapshots
ls -ldt /persist/services@* | head

# Check last replication run
sudo systemctl status replicate-services-to-c1

# Check replication logs
sudo journalctl -u replicate-services-to-c1 --since "1 hour ago"
```

On standby:

```bash
# List received snapshots
ls -ldt /persist/services-standby/services@* | head

# Check how old the latest snapshot is
stat $(ls -dt /persist/services-standby/services@* | head -1) | grep Modify
```

### Verify NFS Exports

```bash
sudo showmount -e localhost
```

Should show:

```
/persist/services 192.168.1.0/24
```

### Check Consul Registration

```bash
consul catalog services | grep data-services
dig @localhost -p 8600 data-services.service.consul
```
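The snapshot-age check above can be scripted so a stale standby is easy to spot. A minimal sketch, intended to be run on a standby; the 10-minute threshold is an arbitrary choice, not something the replication module enforces:

```bash
#!/usr/bin/env bash
# Warn if the newest snapshot received on this standby looks stale
LATEST=$(ls -dt /persist/services-standby/services@* 2>/dev/null | head -1)

if [ -z "$LATEST" ]; then
  echo "WARNING: no snapshots received yet" >&2
  exit 1
fi

AGE=$(( $(date +%s) - $(stat -c %Y "$LATEST") ))
echo "Latest snapshot: $LATEST (${AGE}s old)"

if [ "$AGE" -gt 600 ]; then
  echo "WARNING: replication may be stale (older than 10 minutes)" >&2
  exit 1
fi
```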