# NFS Services Failover Procedures

This document describes how to fail over the `/data/services` NFS server between hosts and how to fail back.

## Architecture Overview

- **Primary NFS Server**: Typically `zippy`
  - Exports `/persist/services` via NFS
  - Has local bind mount: `/data/services` → `/persist/services` (same path as clients)
  - Registers `data-services.service.consul` in Consul
  - Sets Nomad node meta: `storage_role = "primary"`
  - Replicates snapshots to standbys every 5 minutes via `btrfs send`
  - **Safety check**: Refuses to start if another NFS server is already active in Consul

- **Standby**: Typically `c1`
  - Receives snapshots at `/persist/services-standby/services@<timestamp>`
  - Can be promoted to NFS server during failover
  - No special Nomad node meta (not primary)

- **Clients**: All cluster nodes (c1, c2, c3, zippy)
  - Mount `/data/services` from `data-services.service.consul:/persist/services`
  - Automatically connect to whoever is registered in Consul

### Nomad Job Constraints

Jobs that need to run on the primary storage node should use:

```hcl
constraint {
  attribute = "${meta.storage_role}"
  value     = "primary"
}
```

This is useful for:
- Database jobs (mysql, postgres, redis) that benefit from local storage
- Jobs that need guaranteed fast disk I/O

During failover, the `storage_role = "primary"` meta attribute moves to the new NFS server, and Nomad automatically reschedules constrained jobs to the new primary.

## Prerequisites

- Standby has been receiving snapshots (check: `ls -d /persist/services-standby/services@*`)
- The last successful replication is recent (within the last 5-10 minutes)
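
The freshness check can be scripted. The sketch below is a rough illustration, assuming the `<timestamp>` suffix uses the same `%Y%m%d-%H%M%S` format as the failback snapshots later in this document and that GNU `date` is available. `snapshot_age_seconds` is a hypothetical helper, not part of the cluster tooling.

```bash
# Hypothetical helper: age in seconds of a snapshot named like services@20240101-120000.
# ASSUMPTION: the suffix after "@" is %Y%m%d-%H%M%S in the host's local time zone.
# $1 = snapshot path or name, $2 = current epoch seconds (defaults to now).
# Returns 2 if the suffix cannot be parsed.
snapshot_age_seconds() {
  local name now ts iso snap
  name=$1
  now=${2:-$(date +%s)}
  ts=${name##*@}
  # Rewrite 20240101-120000 as "2024-01-01 12:00:00" so date can parse it
  iso=$(printf '%s\n' "$ts" | sed -E 's/^([0-9]{4})([0-9]{2})([0-9]{2})-([0-9]{2})([0-9]{2})([0-9]{2})$/\1-\2-\3 \4:\5:\6/')
  snap=$(date -d "$iso" +%s 2>/dev/null) || return 2
  echo $((now - snap))
}

# Usage on the standby, with LATEST set to the newest snapshot path:
#   [ "$(snapshot_age_seconds "$LATEST")" -lt 600 ] && echo "replication is fresh"
```

The 600-second threshold mirrors the 5-10 minute guideline above; adjust it if the replication interval changes.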

---

## Failover: Promoting Standby to Primary

**Scenario**: `zippy` is down and you need to promote `c1` to be the NFS server.

### Step 1: Choose Latest Snapshot

On the standby (c1):

```bash
ssh c1
sudo ls -ldt /persist/services-standby/services@* | head -5
```

Find the most recent snapshot. Note the timestamp to estimate data loss (typically < 5 minutes).

### Step 2: Promote Snapshot to Read-Write Subvolume

On c1:

```bash
# Find the latest snapshot (-d lists the snapshots themselves, not their contents)
LATEST=$(sudo ls -dt /persist/services-standby/services@* | head -1)

# Create writable subvolume from snapshot
sudo btrfs subvolume snapshot "$LATEST" /persist/services

# Verify
ls -la /persist/services
```
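
Step 2 picks the snapshot by directory modification time. On a btrfs snapshot that time may reflect the source subvolume rather than when the snapshot was taken, so a name-based selection is a possible alternative. A minimal sketch, assuming the `<timestamp>` suffix sorts chronologically as text (true for formats such as `%Y%m%d-%H%M%S`); `latest_by_name` is a hypothetical helper:

```bash
# Hypothetical helper: print the entry with the greatest name from a list on stdin.
# ASSUMPTION: snapshot names sort chronologically when compared as plain text.
latest_by_name() {
  LC_ALL=C sort | tail -n 1
}

# Usage on the standby (paths from this document):
#   LATEST=$(sudo ls -d /persist/services-standby/services@* | latest_by_name)
```

Compare the result with the `ls` listing from Step 1 before promoting it.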

### Step 3: Update NixOS Configuration

Edit your configuration to swap the NFS server role:

**In `hosts/c1/default.nix`**:
```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-standby.nix  # REMOVE THIS
  ../../common/nfs-services-server.nix     # ADD THIS
];

# Add standbys if desired (optional - can leave empty during emergency)
nfsServicesServer.standbys = [];  # Or ["c2"] to add a new standby
```

**Optional: Prepare zippy config for when it comes back**:

In `hosts/zippy/default.nix` (can do this later too):
```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-server.nix   # REMOVE THIS
  ../../common/nfs-services-standby.nix    # ADD THIS
];

# Add the replication key from c1 (get it from c1:/persist/root/.ssh/btrfs-replication.pub)
nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@c1-replication"
];
```

### Step 4: Deploy Configuration

```bash
# From your workstation
deploy -s '.#c1'

# If zippy is still down, updating its config will fail, but that's okay
# You can update it later when it comes back
```

### Step 5: Verify NFS Server is Running

On c1:

```bash
sudo systemctl status nfs-server
sudo showmount -e localhost
dig @localhost -p 8600 data-services.service.consul  # Should show c1's IP
```
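
To make the DNS check scriptable, the sketch below succeeds only when the service resolves to exactly one address and that address is the expected one. `resolves_only_to` and the `C1_IP` variable are hypothetical; set `C1_IP` to c1's real address.

```bash
# Hypothetical helper: read resolved addresses on stdin (one per line) and
# succeed only if there is exactly one distinct address and it equals $1.
resolves_only_to() {
  local expected addrs count
  expected=$1
  addrs=$(awk 'NF' | sort -u)
  count=$(printf '%s\n' "$addrs" | awk 'NF { n++ } END { print n + 0 }')
  [ "$count" -eq 1 ] && [ "$addrs" = "$expected" ]
}

# Usage on c1 (C1_IP is a placeholder for c1's address):
#   dig +short @localhost -p 8600 data-services.service.consul | resolves_only_to "$C1_IP" \
#     && echo "Consul points at c1" || echo "unexpected registration"
```

More than one address in the answer would suggest that two hosts are registered at once, which deserves a look before continuing.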

### Step 6: Verify Clients Can Access

From any node:

```bash
df -h | grep services
ls /data/services
```

The mount should automatically reconnect via Consul DNS.

### Step 7: Check Nomad Jobs

```bash
nomad job status mysql
nomad job status postgres
# Verify critical services are healthy

# Jobs constrained to ${meta.storage_role} = "primary" will automatically
# reschedule to c1 once it's deployed with the NFS server module
```

**Recovery Time Objective (RTO)**: ~10-15 minutes

**Recovery Point Objective (RPO)**: Last replication interval (5 minutes max)

**Note**: Jobs with the `storage_role = "primary"` constraint will automatically move to c1 because it now has that node meta attribute. No job spec changes needed!

---

## What Happens When zippy Comes Back?

**IMPORTANT**: If zippy reboots while still configured as NFS server, it will **refuse to start** the NFS service because it detects c1 is already active in Consul.

You'll see this error in `journalctl -u nfs-server`:

```
ERROR: Another NFS server is already active at 192.168.1.X
This host (192.168.1.2) is configured as NFS server but should be standby.
To fix:
  1. If this is intentional (failback), first demote the other server
  2. Update this host's config to use nfs-services-standby.nix instead
  3. Sync data from active server before promoting this host
```

This is a **safety feature** to prevent split-brain and data corruption.
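
The module's actual check is not shown in this document. As an illustration only, the decision it describes could look like the sketch below: given this host's address and the addresses currently registered for the service, refuse when any other host is registered. The function name and the way addresses are obtained are assumptions.

```bash
# Illustrative only; NOT the real check from nfs-services-server.nix.
# $1 = this host's address; stdin = addresses currently registered in Consul.
# Returns 1 (refuse to start) if any other address is registered, 0 otherwise.
nfs_start_allowed() {
  local self other
  self=$1
  other=$(awk -v self="$self" 'NF && $1 != self { print $1; exit }')
  if [ -n "$other" ]; then
    echo "ERROR: Another NFS server is already active at $other" >&2
    return 1
  fi
  return 0
}

# Usage (192.168.1.2 is the address shown for this host in the error above):
#   dig +short @localhost -p 8600 data-services.service.consul | nfs_start_allowed 192.168.1.2
```

An empty answer or an answer containing only this host's own address lets the start proceed in this sketch.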

### Options When zippy Comes Back

**Option A: Keep c1 as primary** (zippy becomes standby)
1. Update zippy's config to use `nfs-services-standby.nix`
2. Deploy to zippy
3. c1 will start replicating to zippy

**Option B: Fail back to zippy as primary**

Follow the "Failing Back to Original Primary" procedure below.

---

## Failing Back to Original Primary

**Scenario**: `zippy` is repaired and you want to move the NFS server role back from `c1` to `zippy`.

### Step 1: Sync Latest Data from c1 to zippy

On c1 (current primary):

```bash
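# Create a read-only snapshot of the current state.
# Writes that reach c1 after this point are not carried over to zippy.
sudo btrfs subvolume snapshot -r /persist/services /persist/services@failback-$(date +%Y%m%d-%H%M%S)

# Find the snapshot (-d lists the snapshot itself, not its contents)
FAILBACK=$(sudo ls -dt /persist/services@failback-* | head -1)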

# Send to zippy (use root SSH key if available, or generate temporary key)
sudo btrfs send "$FAILBACK" | ssh root@zippy "btrfs receive /persist/"
```

On zippy:

```bash
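# Verify snapshot arrived
ls -ld /persist/services@failback-*

# Create writable subvolume from the snapshot.
# If an old /persist/services still exists here, move it aside first;
# otherwise btrfs creates the new snapshot inside that directory.
FAILBACK=$(ls -dt /persist/services@failback-* | head -1)
sudo btrfs subvolume snapshot "$FAILBACK" /persist/services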

# Verify
ls -la /persist/services
```

### Step 2: Update NixOS Configuration

Swap the roles back:

**In `hosts/zippy/default.nix`**:
```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-standby.nix  # REMOVE THIS
  ../../common/nfs-services-server.nix     # ADD THIS
];

nfsServicesServer.standbys = ["c1"];
```

**In `hosts/c1/default.nix`**:
```nix
imports = [
  # ... existing imports ...
  # ../../common/nfs-services-server.nix   # REMOVE THIS
  ../../common/nfs-services-standby.nix    # ADD THIS
];

nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from zippy:/persist/root/.ssh/btrfs-replication.pub
];
```

### Step 3: Deploy Configurations

```bash
# IMPORTANT: Deploy c1 FIRST to demote it
deploy -s '.#c1'

# Wait for c1 to stop NFS server
ssh c1 sudo systemctl status nfs-server  # Should be inactive

# Then deploy zippy to promote it
deploy -s '.#zippy'
```

The order matters! If you deploy zippy first, it will see c1 is still active and refuse to start.
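
The wait between the two deploys can be automated rather than checked by hand. A minimal sketch, assuming SSH access to c1 from the workstation; `wait_until` and `nfs_stopped_on` are hypothetical helpers:

```bash
# Hypothetical helper: retry a command until it succeeds or the timeout passes.
# $1 = timeout in seconds, $2 = seconds between attempts, rest = command to run.
wait_until() {
  local timeout interval waited
  timeout=$1; interval=$2; shift 2
  waited=0
  until "$@"; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep "$interval"
    waited=$((waited + interval))
  done
}

# Hypothetical check: true once nfs-server on host $1 reports inactive or failed.
# An unreachable host yields an empty state and therefore counts as "not stopped".
nfs_stopped_on() {
  local state
  state=$(ssh "$1" systemctl is-active nfs-server 2>/dev/null || true)
  [ "$state" = "inactive" ] || [ "$state" = "failed" ]
}

# Usage from the workstation, after deploying c1:
#   wait_until 120 5 nfs_stopped_on c1 && deploy -s '.#zippy'
```

Treating an unreachable c1 as "still running" is a deliberately cautious choice: zippy is only promoted once c1 has positively reported that its NFS server stopped.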

### Step 4: Verify Failback

Check Consul DNS points to zippy:

```bash
dig @c1 -p 8600 data-services.service.consul  # Should show zippy's IP
```

Check clients are mounting from zippy:

```bash
for host in c1 c2 c3; do
  ssh "$host" "df -h | grep services"
done
```
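
`df` prints the mount source as configured (the Consul name), so it does not show which host actually answered. On Linux the resolved server address normally appears as the `addr=` mount option. A minimal sketch, assuming `findmnt` from util-linux is available on the clients; `nfs_server_addr` is a hypothetical helper:

```bash
# Hypothetical helper: extract the server address from a comma-separated
# NFS mount option string read on stdin. Only a bare addr= option matches,
# so clientaddr= and mountaddr= are ignored.
nfs_server_addr() {
  tr ',' '\n' | sed -n 's/^addr=//p'
}

# Usage from the workstation:
#   for host in c1 c2 c3; do
#     addr=$(ssh "$host" findmnt -n -o OPTIONS /data/services | nfs_server_addr)
#     echo "$host: ${addr:-not mounted}"
#   done
```

Each client should report zippy's address once it has remounted.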

### Step 5: Clean Up Temporary Snapshots

On c1:

```bash
# Remove the failback snapshot and the promoted subvolume
sudo btrfs subvolume delete /persist/services@failback-*
sudo btrfs subvolume delete /persist/services
```

---

## Adding a New Standby

**Scenario**: You want to add `c2` as an additional standby.

### Step 1: Create Standby Subvolume on c2

```bash
ssh c2
sudo btrfs subvolume create /persist/services-standby
```

### Step 2: Update c2 Configuration

**In `hosts/c2/default.nix`**:
```nix
imports = [
  # ... existing imports ...
  ../../common/nfs-services-standby.nix
];

nfsServicesStandby.replicationKeys = [
  "ssh-ed25519 AAAA... root@zippy-replication"  # Get from current NFS server
];
```

### Step 3: Update NFS Server Configuration

On the current NFS server (e.g., zippy), update the standbys list:

**In `hosts/zippy/default.nix`**:
```nix
nfsServicesServer.standbys = ["c1" "c2"];  # Added c2
```

### Step 4: Deploy

```bash
deploy -s '.#c2'
deploy -s '.#zippy'
```

The next replication cycle (within 5 minutes) will do a full send to c2, then switch to incremental.
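
The replication service's own logic is not reproduced here. As a rough illustration of the difference, a full send transfers the whole snapshot, while an incremental send uses `btrfs send -p <parent>` to transfer only the changes since a snapshot both sides already hold. The helper below only prints the command it would run; its name and the example snapshot names are hypothetical.

```bash
# Hypothetical helper: print the send command for a snapshot.
# $1 = snapshot to send, $2 = optional parent snapshot already on the standby.
# With a parent the send is incremental, otherwise it is a full send.
send_command() {
  if [ -n "${2:-}" ]; then
    printf 'btrfs send -p %s %s\n' "$2" "$1"
  else
    printf 'btrfs send %s\n' "$1"
  fi
}

# First cycle for a new standby: no common parent yet, so a full send
send_command /persist/services@new
# Later cycles: incremental against the previous snapshot
send_command /persist/services@new /persist/services@old
```

The first full send to a new standby can take much longer than later incremental cycles, depending on the amount of data.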

---

## Troubleshooting

### Replication Failed

Check the replication service logs:

```bash
# On NFS server
sudo journalctl -u replicate-services-to-c1 -f
```

Common issues:
- SSH key not found → Run key generation step (see stateful-commands.txt)
- Permission denied → Check authorized_keys on standby
- Snapshot already exists → Old snapshot with same timestamp, wait for next cycle

### Clients Can't Mount

Check Consul:

```bash
dig @localhost -p 8600 data-services.service.consul
consul catalog services | grep data-services
```

If Consul isn't resolving:
- NFS server might not have registered → Check `sudo systemctl status nfs-server`
- Consul agent might be down → Check `sudo systemctl status consul`

### Mount is Stale

Force remount:

```bash
sudo systemctl restart data-services.mount
```

Or unmount and let automount handle it:

```bash
sudo umount /data/services
ls /data/services  # Triggers automount
```
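
A stale mount often shows up as a command that hangs rather than one that fails cleanly. The sketch below probes the path with a time limit so a script can decide whether to remount. It assumes `timeout` and `stat` from coreutils; `mount_responds` is a hypothetical helper, and a badly hung mount may still block in some cases.

```bash
# Hypothetical helper: succeed if the path answers a stat call in time.
# $1 = path to probe, $2 = time limit in seconds (default 5).
mount_responds() {
  timeout -s KILL "${2:-5}" stat "$1" >/dev/null 2>&1
}

# Usage on a client:
#   mount_responds /data/services || sudo systemctl restart data-services.mount
```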

### Split-Brain Prevention: NFS Server Won't Start

If you see:
```
ERROR: Another NFS server is already active at 192.168.1.X
```

This is **intentional** - the safety check is working! You have two options:

1. **Keep the other server as primary**: Update this host's config to be a standby instead
2. **Fail back to this host**: Sync data from the active server, then deploy the other host first (to demote it) and this host second, as described in "Failing Back to Original Primary"

---

## Monitoring

### Check Replication Status

On NFS server:

```bash
# List recent snapshots
ls -ldt /persist/services@* | head

# Check last replication run
sudo systemctl status replicate-services-to-c1

# Check replication logs
sudo journalctl -u replicate-services-to-c1 --since "1 hour ago"
```

On standby:

```bash
# List received snapshots
ls -ldt /persist/services-standby/services@* | head

# Check how old the latest snapshot is
stat "$(ls -dt /persist/services-standby/services@* | head -1)" | grep Modify
```

### Verify NFS Exports

```bash
sudo showmount -e localhost
```

Should show:
```
/persist/services 192.168.1.0/24
```

### Check Consul Registration

```bash
consul catalog services | grep data-services
dig @localhost -p 8600 data-services.service.consul
```