alo-cluster/docs/NFS_FAILOVER.md

NFS Services Failover Procedures

This document describes how to fail over the /data/services NFS server between hosts and how to fail back.

Architecture Overview

  • Primary NFS Server: Typically zippy

    • Exports /persist/services via NFS
    • Has local bind mount: /data/services → /persist/services (same path as clients)
    • Registers data-services.service.consul in Consul
    • Sets Nomad node meta: storage_role = "primary"
    • Replicates snapshots to standbys every 5 minutes via btrfs send
    • Safety check: Refuses to start if another NFS server is already active in Consul
  • Standby: Typically c1

    • Receives snapshots at /persist/services-standby/services@<timestamp>
    • Can be promoted to NFS server during failover
    • No special Nomad node meta (not primary)
  • Clients: All cluster nodes (c1, c2, c3, zippy)

    • Mount /data/services from data-services.service.consul:/persist/services
    • Automatically connect to whoever is registered in Consul
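
Since clients always resolve the server through Consul, the quickest way to see which host is currently serving is to ask Consul DNS directly. A minimal helper sketch (assumes the local Consul agent answers DNS on the default port 8600; the helper name is illustrative, not part of the modules):

```shell
# current_nfs_server: hypothetical helper, not part of the NixOS modules.
# Prints the IP Consul currently advertises for the NFS service.
current_nfs_server() {
  dig +short @localhost -p 8600 data-services.service.consul | head -1
}
```

Run it on any node; the result should match the host that carries storage_role = "primary".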

Nomad Job Constraints

Jobs that need to run on the primary storage node should use:

constraint {
  attribute = "${meta.storage_role}"
  value     = "primary"
}

This is useful for:

  • Database jobs (mysql, postgres, redis) that benefit from local storage
  • Jobs that need guaranteed fast disk I/O

During failover, the storage_role = "primary" meta attribute moves to the new NFS server, and Nomad automatically reschedules constrained jobs to the new primary.
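
To confirm where the meta attribute currently lives, you can ask Nomad from the node itself. `-self` and `-verbose` are standard `nomad node status` flags, though the exact output layout varies between Nomad versions, so treat the grep pattern as an assumption:

```shell
# is_storage_primary: hypothetical helper; succeeds if this node's Nomad
# meta advertises storage_role = primary in the verbose status output.
is_storage_primary() {
  nomad node status -self -verbose | grep -q 'storage_role.*primary'
}
```

Usage: `is_storage_primary && echo "this node is the primary"`.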

Prerequisites

  • Standby has been receiving snapshots (check: ls /persist/services-standby/services@*)
  • Last successful replication is recent (within the last 5-10 minutes)
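
Both prerequisites can be checked with one hypothetical helper. The path and `services@<timestamp>` naming follow the layout above, and `stat -c` is GNU coreutils (which NixOS provides):

```shell
# latest_snapshot_age_min: hypothetical pre-flight check for the standby.
# Prints the age of the newest received snapshot in minutes, or fails if
# no snapshots are present.
latest_snapshot_age_min() {
  local dir=${1:-/persist/services-standby}
  local latest now mtime
  # -d keeps ls from descending into the snapshot subvolumes
  latest=$(ls -dt "$dir"/services@* 2>/dev/null | head -1)
  [ -n "$latest" ] || { echo "no snapshots found in $dir" >&2; return 1; }
  now=$(date +%s)
  mtime=$(stat -c %Y "$latest")
  echo $(( (now - mtime) / 60 ))
}
```

Anything well above the 5-minute replication interval deserves investigation before you promote.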

Failover: Promoting Standby to Primary

Scenario: zippy is down and you need to promote c1 to be the NFS server.

Step 1: Choose Latest Snapshot

On the standby (c1):

ssh c1
sudo ls -ldt /persist/services-standby/services@* | head -5

Find the most recent snapshot. Note the timestamp to estimate data loss (typically < 5 minutes).

Step 2: Promote Snapshot to Read-Write Subvolume

On c1:

# Find the latest snapshot
LATEST=$(sudo ls -dt /persist/services-standby/services@* | head -1)

# Create writable subvolume from snapshot
sudo btrfs subvolume snapshot "$LATEST" /persist/services

# Verify
ls -la /persist/services
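
For a stronger check than a directory listing, btrfs can report the read-only property directly; a snapshot taken without `-r` should be writable. A small sketch (the helper name is hypothetical):

```shell
# check_promoted_rw: hypothetical helper; succeeds if the subvolume at $1
# is read-write (btrfs prints "ro=false" for writable subvolumes).
check_promoted_rw() {
  [ "$(sudo btrfs property get "$1" ro)" = "ro=false" ]
}
```

Usage: `check_promoted_rw /persist/services && echo "writable, safe to export"`.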

Step 3: Update NixOS Configuration

Edit your configuration to swap the NFS server role:

In hosts/c1/default.nix:

imports = [
  # ... existing imports ...
  # ../../common/nfs-services-standby.nix  # REMOVE THIS
  ../../common/nfs-services-server.nix     # ADD THIS
];

# Add standbys if desired (optional - can leave empty during emergency)
nfsServicesServer.standbys = [];  # Or ["c2"] to add a new standby

Optional: Prepare zippy config for when it comes back:

In hosts/zippy/default.nix (can do this later too):

imports = [
  # ... existing imports ...
  # ../../common/nfs-services-server.nix   # REMOVE THIS
  ../../common/nfs-services-standby.nix    # ADD THIS
];

# Add the replication key from c1 (get it from c1:/persist/root/.ssh/btrfs-replication.pub)
nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@c1-replication"
];

Step 4: Deploy Configuration

# From your workstation
deploy -s '.#c1'

# If zippy is still down, updating its config will fail, but that's okay
# You can update it later when it comes back

Step 5: Verify NFS Server is Running

On c1:

sudo systemctl status nfs-server
sudo showmount -e localhost
dig @localhost -p 8600 data-services.service.consul  # Should show c1's IP

Step 6: Verify Clients Can Access

From any node:

df -h | grep services
ls /data/services

The mount should automatically reconnect via Consul DNS.

Step 7: Check Nomad Jobs

nomad job status mysql
nomad job status postgres
# Verify critical services are healthy

# Jobs constrained to ${meta.storage_role} = "primary" will automatically
# reschedule to c1 once it's deployed with the NFS server module

Recovery Time Objective (RTO): ~10-15 minutes
Recovery Point Objective (RPO): last replication interval (5 minutes max)

Note: Jobs with the storage_role = "primary" constraint will automatically move to c1 because it now has that node meta attribute. No job spec changes needed!
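
One way to spot-check that a constrained job followed the meta attribute is to look for a running allocation on the new primary. This sketch greps `nomad job status` output; the column layout differs across Nomad versions, so treat it as a starting point rather than a robust parser:

```shell
# job_runs_on: hypothetical helper; succeeds if `nomad job status` shows a
# running allocation on a line mentioning the given node.
job_runs_on() {
  local job=$1 node=$2
  nomad job status "$job" | grep "$node" | grep -q running
}
```

Usage: `job_runs_on mysql c1 && echo "mysql followed the primary"`.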


What Happens When zippy Comes Back?

IMPORTANT: If zippy reboots while still configured as NFS server, it will refuse to start the NFS service because it detects c1 is already active in Consul.

You'll see this error in journalctl -u nfs-server:

ERROR: Another NFS server is already active at 192.168.1.X
This host (192.168.1.2) is configured as NFS server but should be standby.
To fix:
  1. If this is intentional (failback), first demote the other server
  2. Update this host's config to use nfs-services-standby.nix instead
  3. Sync data from active server before promoting this host

This is a safety feature to prevent split-brain and data corruption.
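
The check itself amounts to comparing the IP Consul advertises against this host's own address before starting nfsd. A simplified sketch of that logic (the real check lives in nfs-services-server.nix; the function name here is illustrative):

```shell
# refuse_if_active_elsewhere: simplified sketch of the pre-start guard.
# $1 is this host's own IP; fails if Consul points somewhere else.
refuse_if_active_elsewhere() {
  local me=$1 active
  active=$(dig +short @localhost -p 8600 data-services.service.consul | head -1)
  if [ -n "$active" ] && [ "$active" != "$me" ]; then
    echo "ERROR: Another NFS server is already active at $active" >&2
    return 1
  fi
}
```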

Options when zippy comes back:

Option A: Keep c1 as primary (zippy becomes standby)

  1. Update zippy's config to use nfs-services-standby.nix
  2. Deploy to zippy
  3. c1 will start replicating to zippy

Option B: Fail back to zippy as primary Follow the "Failing Back to Original Primary" procedure below.


Failing Back to Original Primary

Scenario: zippy is repaired and you want to move the NFS server role back from c1 to zippy.

Step 1: Sync Latest Data from c1 to zippy

On c1 (current primary):

# Create readonly snapshot of current state
sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)

# Find the snapshot
FAILBACK=$(sudo ls -dt /persist/services@failback-* | head -1)

# Send to zippy (use root SSH key if available, or generate temporary key)
sudo btrfs send "$FAILBACK" | ssh root@zippy "btrfs receive /persist/"

On zippy:

# Verify snapshot arrived
ls -la /persist/services@failback-*

# Create writable subvolume from the snapshot
FAILBACK=$(ls -dt /persist/services@failback-* | head -1)
sudo btrfs subvolume snapshot "$FAILBACK" /persist/services

# Verify
ls -la /persist/services

Step 2: Update NixOS Configuration

Swap the roles back:

In hosts/zippy/default.nix:

imports = [
  # ... existing imports ...
  # ../../common/nfs-services-standby.nix  # REMOVE THIS
  ../../common/nfs-services-server.nix     # ADD THIS
];

nfsServicesServer.standbys = ["c1"];

In hosts/c1/default.nix:

imports = [
  # ... existing imports ...
  # ../../common/nfs-services-server.nix   # REMOVE THIS
  ../../common/nfs-services-standby.nix    # ADD THIS
];

nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from zippy:/persist/root/.ssh/btrfs-replication.pub
];

Step 3: Deploy Configurations

# IMPORTANT: Deploy c1 FIRST to demote it
deploy -s '.#c1'

# Wait for c1 to stop NFS server
ssh c1 sudo systemctl status nfs-server  # Should be inactive

# Then deploy zippy to promote it
deploy -s '.#zippy'

The order matters! If you deploy zippy first, it will see c1 is still active and refuse to start.
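
If you run failbacks often, the ordering can be enforced mechanically. A hypothetical wrapper around the commands above (`failback_to` is not an existing script; `deploy` is deploy-rs as used throughout):

```shell
# failback_to: hypothetical wrapper enforcing demote-before-promote.
# Usage: failback_to c1 zippy
failback_to() {
  local old=$1 new=$2
  deploy -s ".#$old" || return 1
  # Wait up to ~2 minutes for the old primary to stop serving NFS.
  for _ in $(seq 1 24); do
    ssh "$old" systemctl is-active --quiet nfs-server || break
    sleep 5
  done
  if ssh "$old" systemctl is-active --quiet nfs-server; then
    echo "nfs-server still active on $old; aborting" >&2
    return 1
  fi
  deploy -s ".#$new"
}
```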

Step 4: Verify Failback

Check Consul DNS points to zippy:

dig @c1 -p 8600 data-services.service.consul  # Should show zippy's IP

Check clients are mounting from zippy:

for host in c1 c2 c3; do
  ssh $host "df -h | grep services"
done

Step 5: Clean Up Temporary Snapshots

On c1:

# Remove the failback snapshot and the promoted subvolume
sudo btrfs subvolume delete /persist/services@failback-*
sudo btrfs subvolume delete /persist/services

Adding a New Standby

Scenario: You want to add c2 as an additional standby.

Step 1: Create Standby Subvolume on c2

ssh c2
sudo btrfs subvolume create /persist/services-standby

Step 2: Update c2 Configuration

In hosts/c2/default.nix:

imports = [
  # ... existing imports ...
  ../../common/nfs-services-standby.nix
];

nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from current NFS server
];

Step 3: Update NFS Server Configuration

On the current NFS server (e.g., zippy), update the standbys list:

In hosts/zippy/default.nix:

nfsServicesServer.standbys = ["c1" "c2"];  # Added c2

Step 4: Deploy

deploy -s '.#c2'
deploy -s '.#zippy'

The next replication cycle (within 5 minutes) will do a full send to c2, then switch to incremental.
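
For reference, one replication cycle boils down to a read-only snapshot plus a btrfs send, incremental whenever a parent snapshot exists. A simplified, hypothetical rendition of what the generated service does (the real logic lives in nfs-services-server.nix):

```shell
# replicate_once: simplified sketch of a single replication cycle to one
# standby. Snapshot naming mirrors the services@<timestamp> convention.
replicate_once() {
  local standby=$1
  local snap parent
  snap=/persist/services@$(date +%Y%m%d-%H%M%S)
  btrfs subvolume snapshot -r /persist/services "$snap"
  # The second-newest snapshot, if any, serves as the incremental parent.
  parent=$(ls -dt /persist/services@* 2>/dev/null | sed -n 2p)
  if [ -n "$parent" ]; then
    btrfs send -p "$parent" "$snap"
  else
    btrfs send "$snap"
  fi | ssh "root@$standby" "btrfs receive /persist/services-standby/"
}
```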


Troubleshooting

Replication Failed

Check the replication service logs:

# On NFS server
sudo journalctl -u replicate-services-to-c1 -f

Common issues:

  • SSH key not found → Run key generation step (see stateful-commands.txt)
  • Permission denied → Check authorized_keys on standby
  • Snapshot already exists → Old snapshot with same timestamp, wait for next cycle

Clients Can't Mount

Check Consul:

dig @localhost -p 8600 data-services.service.consul
consul catalog services | grep data-services

If Consul isn't resolving:

  • NFS server might not have registered → Check sudo systemctl status nfs-server
  • Consul agent might be down → Check sudo systemctl status consul

Mount is Stale

Force remount:

sudo systemctl restart data-services.mount

Or unmount and let automount handle it:

sudo umount /data/services
ls /data/services  # Triggers automount

Split-Brain Prevention: NFS Server Won't Start

If you see:

ERROR: Another NFS server is already active at 192.168.1.X

This is intentional - the safety check is working! You have two options:

  1. Keep the other server as primary: Update this host's config to be a standby instead
  2. Fail back to this host: First demote the other server, sync data, then deploy both hosts in correct order

Monitoring

Check Replication Status

On NFS server:

# List recent snapshots
ls -ldt /persist/services@* | head

# Check last replication run
sudo systemctl status replicate-services-to-c1

# Check replication logs
sudo journalctl -u replicate-services-to-c1 --since "1 hour ago"

On standby:

# List received snapshots
ls -ldt /persist/services-standby/services@* | head

# Check how old the latest snapshot is
stat -c %y "$(ls -dt /persist/services-standby/services@* | head -1)"

Verify NFS Exports

sudo showmount -e localhost

Should show:

/persist/services 192.168.1.0/24

Check Consul Registration

consul catalog services | grep data-services
dig @localhost -p 8600 data-services.service.consul