NFS server and client setup.

2025-10-22 13:06:21 +01:00
parent 1262e03e21
commit 967ff34a51
9 changed files with 739 additions and 2 deletions
--- a/docs/NFS_FAILOVER.md
+++ b/docs/NFS_FAILOVER.md
@@ -0,0 +1,438 @@
+# NFS Services Failover Procedures
+
+This document describes how to fail over the `/data/services` NFS server between hosts and how to fail back.
+
+## Architecture Overview
+
+- **Primary NFS Server**: Typically `zippy`
+  - Exports `/persist/services` via NFS
+  - Has local bind mount: `/data/services` → `/persist/services` (same path as clients)
+  - Registers `data-services.service.consul` in Consul
+  - Sets Nomad node meta: `storage_role = "primary"`
+  - Replicates snapshots to standbys every 5 minutes via btrfs send
+  - **Safety check**: Refuses to start if another NFS server is already active in Consul
+
+- **Standby**: Typically `c1`
+  - Receives snapshots at `/persist/services-standby/services@<timestamp>`
+  - Can be promoted to NFS server during failover
+  - No special Nomad node meta (not primary)
+
+- **Clients**: All cluster nodes (c1, c2, c3, zippy)
+  - Mount `/data/services` from `data-services.service.consul:/persist/services`
+  - Automatically connect to whoever is registered in Consul
+
+### Nomad Job Constraints
+
+Jobs that need to run on the primary storage node should use:
+
+```hcl
+constraint {
+  attribute = "${meta.storage_role}"
+  value     = "primary"
+}
+```
+
+This is useful for:
+- Database jobs (mysql, postgres, redis) that benefit from local storage
+- Jobs that need guaranteed fast disk I/O
+
+During failover, the `storage_role = "primary"` meta attribute moves to the new NFS server, and Nomad automatically reschedules constrained jobs to the new primary.
+
+## Prerequisites
+
+- Standby has been receiving snapshots (check: `ls /persist/services-standby/services@*`)
+- The last successful replication is recent (within the last 5-10 minutes; replication runs every 5 minutes)
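+
+The recency prerequisite can be checked from the snapshot name. The following is a minimal sketch, not part of the repository: `snapshot_age_seconds` is a hypothetical helper, and it assumes snapshot names end in `@YYYYmmdd-HHMMSS` in the host's local time (the format the failback commands in this document use). Adjust the parsing if the replication job names snapshots differently. It relies on GNU `date -d`.
+
+```bash
+# Hypothetical helper: print the age in seconds of a snapshot, judged by its name.
+# ASSUMPTION: the name ends in @YYYYmmdd-HHMMSS (local time).
+snapshot_age_seconds() {
+  local name=$1 now=$2            # now = current time as epoch seconds
+  local ts=${name##*@}            # keep the part after the last '@'
+  local d=${ts%%-*} t=${ts##*-}   # split into date and time digits
+  local epoch
+  epoch=$(date -d "${d:0:4}-${d:4:2}-${d:6:2} ${t:0:2}:${t:2:2}:${t:4:2}" +%s) || return 1
+  echo $(( now - epoch ))
+}
+
+# Usage on the standby: report the newest snapshot (by name) and its age.
+if latest=$(ls -d /persist/services-standby/services@* 2>/dev/null | tail -1) && [ -n "$latest" ]; then
+  echo "latest snapshot: $latest ($(snapshot_age_seconds "$latest" "$(date +%s)")s old)"
+fi
+```
+
+An age well above 600 seconds suggests replication has stalled and the data loss on failover will be larger than usual.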
+
+---
+
+## Failover: Promoting Standby to Primary
+
+**Scenario**: `zippy` is down and you need to promote `c1` to be the NFS server.
+
+### Step 1: Choose Latest Snapshot
+
+On the standby (c1):
+
+```bash
+ssh c1
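+# -d lists the snapshot directories themselves, not their contents;
+# names sort chronologically if the timestamp format is zero-padded
+sudo ls -d /persist/services-standby/services@* | tail -5
+```
+
+The most recent snapshot is the last one listed. Note its timestamp to estimate data loss (typically < 5 minutes).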
+
+### Step 2: Promote Snapshot to Read-Write Subvolume
+
+On c1:
+
+```bash
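+# Find the latest snapshot (-d: list the directories, not their contents;
+# assumes the timestamp in the name sorts chronologically)
+LATEST=$(sudo ls -d /persist/services-standby/services@* | tail -1)
+
+# Create writable subvolume from snapshot
+# (/persist/services must not already exist, or the snapshot is created inside it)
+sudo btrfs subvolume snapshot "$LATEST" /persist/services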
+
+# Verify
+ls -la /persist/services
+```
+
+### Step 3: Update NixOS Configuration
+
+Edit your configuration to swap the NFS server role:
+
+**In `hosts/c1/default.nix`**:
+```nix
+imports = [
+  # ... existing imports ...
+  # ../../common/nfs-services-standby.nix  # REMOVE THIS
+  ../../common/nfs-services-server.nix     # ADD THIS
+];
+
+# Add standbys if desired (optional - can leave empty during emergency)
+nfsServicesServer.standbys = [];  # Or ["c2"] to add a new standby
+```
+
+**Optional: Prepare zippy config for when it comes back**:
+
+In `hosts/zippy/default.nix` (can do this later too):
+```nix
+imports = [
+  # ... existing imports ...
+  # ../../common/nfs-services-server.nix   # REMOVE THIS
+  ../../common/nfs-services-standby.nix    # ADD THIS
+];
+
+# Add the replication key from c1 (get it from c1:/persist/root/.ssh/btrfs-replication.pub)
+nfsServicesStandby.replicationKeys = [
+  "ssh-ed25519 AAAA... root@c1-replication"
+];
+```
+
+### Step 4: Deploy Configuration
+
+```bash
+# From your workstation
+deploy -s '.#c1'
+
+# If zippy is still down, updating its config will fail, but that's okay
+# You can update it later when it comes back
+```
+
+### Step 5: Verify NFS Server is Running
+
+On c1:
+
+```bash
+sudo systemctl status nfs-server
+sudo showmount -e localhost
+dig @localhost -p 8600 data-services.service.consul  # Should show c1's IP
+```
+
+### Step 6: Verify Clients Can Access
+
+From any node:
+
+```bash
+df -h | grep services
+ls /data/services
+```
+
+The mount should automatically reconnect via Consul DNS.
+
+### Step 7: Check Nomad Jobs
+
+```bash
+nomad job status mysql
+nomad job status postgres
+# Verify critical services are healthy
+
+# Jobs constrained to ${meta.storage_role} = "primary" will automatically
+# reschedule to c1 once it's deployed with the NFS server module
+```
+
+**Recovery Time Objective (RTO)**: ~10-15 minutes
+**Recovery Point Objective (RPO)**: Up to one replication interval (about 5 minutes), provided the last replication run succeeded
+
+**Note**: Jobs with the `storage_role = "primary"` constraint will automatically move to c1 because it now has that node meta attribute. No job spec changes needed!
+
+---
+
+## What Happens When zippy Comes Back?
+
+**IMPORTANT**: If zippy reboots while still configured as NFS server, it will **refuse to start** the NFS service because it detects c1 is already active in Consul.
+
+You'll see this error in `journalctl -u nfs-server`:
+
+```
+ERROR: Another NFS server is already active at 192.168.1.X
+This host (192.168.1.2) is configured as NFS server but should be standby.
+To fix:
+  1. If this is intentional (failback), first demote the other server
+  2. Update this host's config to use nfs-services-standby.nix instead
+  3. Sync data from active server before promoting this host
+```
+
+This is a **safety feature** to prevent split-brain and data corruption.
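+
+The safety check itself lives in the NFS server module, which is not shown in this document. As a rough illustration of the decision it describes, the sketch below compares this host's IP against the addresses Consul returns for the service. The function name `other_active_server` and the example IP are hypothetical, and the real check may work differently.
+
+```bash
+# Hypothetical sketch: given this host's IP and the IPs registered for the
+# service (one per line, possibly empty), print the first IP that belongs to
+# another host and return 0. Return 1 if no other server is registered.
+other_active_server() {
+  local own_ip=$1 registered=$2 ip
+  while IFS= read -r ip; do
+    [ -z "$ip" ] && continue
+    if [ "$ip" != "$own_ip" ]; then
+      echo "$ip"
+      return 0
+    fi
+  done <<< "$registered"
+  return 1
+}
+
+# Usage (illustrative, not executed here):
+#   registered=$(dig +short @localhost -p 8600 data-services.service.consul)
+#   if other=$(other_active_server "192.168.1.2" "$registered"); then
+#     echo "ERROR: Another NFS server is already active at $other" >&2
+#     exit 1
+#   fi
+```
+
+An empty result from Consul, or a result that only contains this host's own address, lets the service start.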
+
+### Options When zippy Comes Back
+
+**Option A: Keep c1 as primary** (zippy becomes standby)
+1. Update zippy's config to use `nfs-services-standby.nix`
+2. Deploy to zippy
+3. c1 will start replicating to zippy
+
+**Option B: Fail back to zippy as primary**
+Follow the "Failing Back to Original Primary" procedure below.
+
+---
+
+## Failing Back to Original Primary
+
+**Scenario**: `zippy` is repaired and you want to move the NFS server role back from `c1` to `zippy`.
+
+### Step 1: Sync Latest Data from c1 to zippy
+
+On c1 (current primary):
+
+```bash
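+# Create readonly snapshot of current state
+# (writes that land on c1 after this point are not carried over to zippy)
+sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)
+
+# Find the snapshot (-d lists the snapshot itself, not its contents)
+FAILBACK=$(sudo ls -d /persist/services@failback-* | tail -1)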
+
+# Send to zippy (use root SSH key if available, or generate temporary key)
+sudo btrfs send "$FAILBACK" | ssh root@zippy "btrfs receive /persist/"
+```
+
+On zippy:
+
+```bash
+# Verify snapshot arrived
+ls -la /persist/services@failback-*
+
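+# Create writable subvolume from the snapshot
+# (if a stale /persist/services from before the failover still exists, move it
+# aside or delete it first; otherwise the snapshot is created inside it)
+FAILBACK=$(ls -d /persist/services@failback-* | tail -1)
+sudo btrfs subvolume snapshot "$FAILBACK" /persist/services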
+
+# Verify
+ls -la /persist/services
+```
+
+### Step 2: Update NixOS Configuration
+
+Swap the roles back:
+
+**In `hosts/zippy/default.nix`**:
+```nix
+imports = [
+  # ... existing imports ...
+  # ../../common/nfs-services-standby.nix  # REMOVE THIS
+  ../../common/nfs-services-server.nix     # ADD THIS
+];
+
+nfsServicesServer.standbys = ["c1"];
+```
+
+**In `hosts/c1/default.nix`**:
+```nix
+imports = [
+  # ... existing imports ...
+  # ../../common/nfs-services-server.nix   # REMOVE THIS
+  ../../common/nfs-services-standby.nix    # ADD THIS
+];
+
+nfsServicesStandby.replicationKeys = [
+  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from zippy:/persist/root/.ssh/btrfs-replication.pub
+];
+```
+
+### Step 3: Deploy Configurations
+
+```bash
+# IMPORTANT: Deploy c1 FIRST to demote it
+deploy -s '.#c1'
+
+# Wait for c1 to stop NFS server
+ssh c1 sudo systemctl status nfs-server  # Should be inactive
+
+# Then deploy zippy to promote it
+deploy -s '.#zippy'
+```
+
+The order matters! If you deploy zippy first, it will see c1 is still active and refuse to start.
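+
+The wait between the two deploys can be scripted instead of checked by hand. This is a minimal sketch under stated assumptions: `wait_for_nfs_stopped` is a hypothetical helper, and the attempt count and delay are arbitrary defaults. It polls `systemctl is-active nfs-server` over SSH and succeeds once the unit reports a state other than `active` or `activating`.
+
+```bash
+# Hypothetical helper: poll until nfs-server on the given host has stopped.
+# Returns 0 once the unit is no longer active, 1 if it is still active (or
+# the host could not be queried) after all attempts.
+wait_for_nfs_stopped() {
+  local host=$1 attempts=${2:-30} delay=${3:-2} state i
+  for (( i = 0; i < attempts; i++ )); do
+    # is-active prints the unit state and exits non-zero unless it is active
+    state=$(ssh "$host" systemctl is-active nfs-server || true)
+    if [ -n "$state" ] && [ "$state" != "active" ] && [ "$state" != "activating" ]; then
+      return 0
+    fi
+    sleep "$delay"
+  done
+  return 1
+}
+
+# Usage (illustrative, not executed here):
+#   deploy -s '.#c1' && wait_for_nfs_stopped c1 && deploy -s '.#zippy'
+```
+
+An empty reply (for example, when SSH cannot connect) is treated as unknown, so the function keeps waiting rather than letting the zippy deploy proceed.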
+
+### Step 4: Verify Failback
+
+Check Consul DNS points to zippy:
+
+```bash
+dig @c1 -p 8600 data-services.service.consul  # Should show zippy's IP
+```
+
+Check clients are mounting from zippy:
+
+```bash
+for host in c1 c2 c3; do
+  ssh "$host" "df -h | grep services"
+done
+```
+
+### Step 5: Clean Up Temporary Snapshots
+
+On c1:
+
+```bash
+# Remove the failback snapshot and the promoted subvolume
+sudo btrfs subvolume delete /persist/services@failback-*
+sudo btrfs subvolume delete /persist/services
+```
+
+---
+
+## Adding a New Standby
+
+**Scenario**: You want to add `c2` as an additional standby.
+
+### Step 1: Create Standby Subvolume on c2
+
+```bash
+ssh c2
+sudo btrfs subvolume create /persist/services-standby
+```
+
+### Step 2: Update c2 Configuration
+
+**In `hosts/c2/default.nix`**:
+```nix
+imports = [
+  # ... existing imports ...
+  ../../common/nfs-services-standby.nix
+];
+
+nfsServicesStandby.replicationKeys = [
+  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from current NFS server
+];
+```
+
+### Step 3: Update NFS Server Configuration
+
+On the current NFS server (e.g., zippy), update the standbys list:
+
+**In `hosts/zippy/default.nix`**:
+```nix
+nfsServicesServer.standbys = ["c1" "c2"];  # Added c2
+```
+
+### Step 4: Deploy
+
+```bash
+deploy -s '.#c2'
+deploy -s '.#zippy'
+```
+
+The next replication cycle (within 5 minutes) will do a full send to c2, then switch to incremental.
+
+---
+
+## Troubleshooting
+
+### Replication Failed
+
+Check the replication service logs:
+
+```bash
+# On NFS server
+sudo journalctl -u replicate-services-to-c1 -f
+```
+
+Common issues:
+- SSH key not found → Run key generation step (see stateful-commands.txt)
+- Permission denied → Check authorized_keys on standby
+- Snapshot already exists → Old snapshot with same timestamp, wait for next cycle
+
+### Clients Can't Mount
+
+Check Consul:
+
+```bash
+dig @localhost -p 8600 data-services.service.consul
+consul catalog services | grep data-services
+```
+
+If Consul isn't resolving:
+- NFS server might not have registered → Check `sudo systemctl status nfs-server`
+- Consul agent might be down → Check `sudo systemctl status consul`
+
+### Mount is Stale
+
+Force remount:
+
+```bash
+sudo systemctl restart data-services.mount
+```
+
+Or unmount and let automount handle it:
+
+```bash
+sudo umount /data/services
+ls /data/services  # Triggers automount
+```
+
+### Split-Brain Prevention: NFS Server Won't Start
+
+If you see:
+```
+ERROR: Another NFS server is already active at 192.168.1.X
+```
+
+This is **intentional** - the safety check is working! You have two options:
+
+1. **Keep the other server as primary**: Update this host's config to be a standby instead
+2. **Fail back to this host**: Sync data from the active server, then deploy the other server first (to demote it) and this host second, as described in "Failing Back to Original Primary"
+
+---
+
+## Monitoring
+
+### Check Replication Status
+
+On NFS server:
+
+```bash
+# List recent snapshots (-d: the snapshots themselves, not their contents; newest last by name)
+ls -dl /persist/services@* | tail
+
+# Check last replication run
+sudo systemctl status replicate-services-to-c1
+
+# Check replication logs
+sudo journalctl -u replicate-services-to-c1 --since "1 hour ago"
+```
+
+On standby:
+
+```bash
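+# List received snapshots (-d: the snapshots themselves, not their contents; newest last by name)
+ls -dl /persist/services-standby/services@* | tail
+
+# Check when the latest snapshot was created on this host
+LATEST=$(ls -d /persist/services-standby/services@* | tail -1)
+sudo btrfs subvolume show "$LATEST" | grep -i 'creation time'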
+```
+
+### Verify NFS Exports
+
+```bash
+sudo showmount -e localhost
+```
+
+The export list should include:
+```
+/persist/services 192.168.1.0/24
+```
+
+### Check Consul Registration
+
+```bash
+consul catalog services | grep data-services
+dig @localhost -p 8600 data-services.service.consul
+```