alo-cluster/docs/HOMELAB_AGENT.md

# ABOUTME: Vision and design document for an AI agent that manages the homelab cluster.
# ABOUTME: Covers emergent capabilities, technical approach, and implementation strategy.

# Homelab Agent: Vision and Design

## The Core Idea

Not automation. Not "LLM-powered autocomplete for infrastructure." Emergent capabilities.

The same shift Claude Code brought to programming: you describe outcomes, it handles implementation. You become a "product manager" for your infrastructure instead of an "infrastructure engineer."

The cluster stops being infrastructure you manage and becomes an environment that responds to intent.

## What Makes This Different From Automation

**Automation**: "If disk > 90%, delete old logs"

**Emergent**: "Disk is 95% full. What's using space? ...Postgres WAL. Can I safely checkpoint? Last backup was 2h ago, load is low, yes. Running checkpoint... down to 60%. I should note that WAL retention might need tuning."

The difference:
- Novel problem-solving (not pattern matching)
- Contextual safety reasoning
- Adaptation to the specific situation
- Learning for the future

## Examples of Genuinely New Capabilities

### 1. Intent-Driven Infrastructure

> "I want to run Synapse for Matrix"

Agent figures out: Nomad job spec, storage location, Traefik routing, TLS, Consul registration, backup config. Creates it, deploys it, validates it.

You don't need to know Nomad job format or Traefik labels. You describe the outcome.

### 2. Proactive Evolution (The Best One)

The agent doesn't wait for problems or instructions:

- "Synapse 1.98 has a security fix. I've tested it in a local build, no config changes needed. Deploy?"
- "Your NFS server has been primary for 47 days. Want me to test failover to make sure it still works?"
- "I noticed arr services all have the same resource limits but Sonarr consistently uses more. Adjusted."
- "There's a new NixOS module for Traefik that simplifies your current setup. Here's the diff."

Not monitoring. Stewardship.

### 3. The Cluster Has Opinions

> You: "I want to add Plex"
>
> Agent: "You already have Jellyfin, which does the same thing. If you want Plex specifically for its mobile app, I can set it up to share Jellyfin's media library. Or if you want to switch entirely, I can migrate watch history. What's the actual goal?"

Not a command executor. A collaborator that understands your system.

### 4. "Bring This Into the Cluster"

You're running something in Docker on a random VM:

> "Bring this into the cluster"

Agent: connects, inspects, figures out dependencies, writes Nomad job, sets up storage, migrates data, routes traffic, validates, decommissions old instance.

You didn't need to know how.

### 5. Cross-Cutting Changes

> "Add authentication to all public-facing services"

Agent identifies which services are public, understands the auth setup (Pocket ID + traefik-oidc-auth), modifies each service's config, tests that auth works.

Single coherent change across everything, without knowing every service yourself.

### 6. Emergent Debugging

Not runbooks. Actual reasoning:

> "The blog is slow"

Agent checks service health (fine), node resources (fine), network latency (fine), database queries (ah, slow query), traces to missing index, adds index, validates performance improved.

Solved a problem nobody wrote a runbook for.

### 7. Architecture Exploration

> "What if we added a third Nomad server for better quorum?"

Agent reasons about current topology, generates the config, identifies what would change, shows blast radius. Thinking partner for infrastructure decisions.
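
The arithmetic behind that question can be sketched in a few lines. Nomad servers use Raft consensus, so a majority of voting servers must be reachable. The function names below are hypothetical, and the two-to-three-server comparison is an assumption about the starting topology, not a statement about this cluster.

```bash
#!/usr/bin/env bash
# Raft majority arithmetic for N voting servers.
# quorum_size and failures_tolerated are illustrative names, not Nomad commands.

quorum_size() {
  local servers=$1
  echo $(( servers / 2 + 1 ))
}

failures_tolerated() {
  local servers=$1
  echo $(( servers - (servers / 2 + 1) ))
}

# Assumed scenario: growing from two servers to three.
for n in 2 3; do
  echo "$n servers: quorum $(quorum_size "$n"), tolerates $(failures_tolerated "$n") failure(s)"
done
```

Under that assumption the quorum size stays at two, but the cluster goes from tolerating no server failures to tolerating one.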

## Why Nix Makes This Possible

Traditional infrastructure: state is scattered and implicit. Nix: everything is declared.

- **Full system understanding** - agent can read the flake and understand EVERYTHING
- **Safe experimentation** - build without deploying, rollback trivially
- **Reproducibility** - "what was the state 3 days ago?" can be rebuilt exactly
- **Composition** - agent can generate valid configs that compose correctly
- **The ecosystem** - 80k+ packages, thousands of modules the agent can navigate

> "I want a VPN that works with my phone"

Agent knows Nix, finds WireGuard module, configures it, generates QR codes, opens firewall. You didn't learn WireGuard.

## The Validation Pattern

Just like code has linting and tests, infrastructure actions need validation:

| Phase | Code | Infrastructure |
|-------|------|----------------|
| Static | Lint, typecheck | Config parses, secrets exist, no port conflicts |
| Pre-flight | — | Cluster healthy, dependencies up, quorum intact |
| Post-action | Unit tests | Service started, health checks pass, metrics flowing |
| Invariants | CI | NFS mounted, Consul quorum, replication current |

The agent can take actions confidently because it validates outcomes.
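
One way the phases in the table could be chained is sketched below: run each phase in order and stop at the first failure. This is a minimal sketch, not an existing tool. `run_phases` and every phase function are hypothetical names, and the phase bodies are placeholders where real checks (for example `nix build` or `nomad plan`) would go.

```bash
#!/usr/bin/env bash
# Run validation phases in order and stop at the first failure.
# All function names here are hypothetical; phase bodies are placeholders.

run_phases() {
  local phase
  for phase in "$@"; do
    if "$phase"; then
      echo "PASS $phase"
    else
      echo "FAIL $phase"
      return 1
    fi
  done
}

static_checks()      { true; }  # placeholder: config parses, no port conflicts
preflight_checks()   { true; }  # placeholder: cluster healthy, quorum intact
apply_change()       { true; }  # placeholder: the action being validated
post_action_checks() { true; }  # placeholder: service started, health checks pass
invariant_checks()   { true; }  # placeholder: NFS mounted, replication current

run_phases static_checks preflight_checks apply_change \
  post_action_checks invariant_checks
```

Because `run_phases` returns non-zero as soon as a phase fails, `apply_change` never runs when the static or pre-flight checks fail.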

## The Reality Check

Some of this works today. Some would fail spectacularly. Some would fail silently and idiotically. Just like Claude Code for coding.

Therefore:
- Tight loop with the human operator
- Assume the human is competent and knowledgeable
- Agent amplifies expertise, doesn't replace it
- Escalate when uncertain

## Technical Approach

### Runtime: Claude Code (Not Agent SDK)

Two options were considered:

| Tool | Works with Pro/Max subscription | Works with API billing |
|------|---------------------------------|------------------------|
| Claude Code CLI | Yes | Yes |
| Claude Agent SDK | No | Yes (required) |

Claude Code can use an existing Max subscription; the Agent SDK requires separate API billing.

For v1, use Claude Code as the runtime:

```bash
claude --print "prompt" \
  --allowedTools "Bash,Read,Edit" \
  --permission-mode acceptEdits
```

Graduate to Agent SDK later if limitations are hit.

### Trigger Architecture

On-demand Claude Code sessions, triggered by:
- **Timer** - periodic health/sanity check
- **Alert** - alertmanager webhook
- **Event** - systemd OnFailure, consul watch
- **Manual** - invoke with a goal

Each trigger provides context and a goal. Claude Code does the rest.
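
A minimal sketch of how a trigger might combine its inputs into a single prompt follows. `build_prompt` is a hypothetical helper, and the `## Trigger` heading and the sample unit name are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Assemble the prompt a trigger hands to Claude Code.
# build_prompt is a hypothetical helper, not part of any existing tool.

build_prompt() {
  local trigger=$1 context=$2 goal=$3
  printf '## Trigger\n%s\n\n## Context\n%s\n\n## Goal\n%s\n' \
    "$trigger" "$context" "$goal"
}

# Example: an OnFailure trigger passing the failed unit (sample values).
build_prompt "on-failure" "unit: example.service (failed)" \
  "Investigate the failure. Fix if safe, otherwise escalate."
```

The result would then be passed to `claude --print "$(build_prompt ...)"`, so every trigger type shares one entry point and differs only in the context and goal it supplies.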

### Structure

```
agent/
├── triggers/
│   ├── scheduled-check       # systemd timer
│   ├── on-alert              # webhook handler
│   └── on-failure            # systemd OnFailure target
├── gather-context.sh         # snapshot of cluster state
└── goals/
    ├── health-check.md       # verify health, fix if safe
    ├── incident.md           # investigate alert, fix or escalate
    └── proactive.md          # look for improvements
```

### Example: Scheduled Health Check

```bash
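#!/usr/bin/env bash
set -euo pipefail

# Run from agent/ so the relative paths below resolve
# (assumes this script lives in agent/triggers/).
cd "$(dirname "$0")/.."

CONTEXT=$(./gather-context.sh)
GOAL=$(cat goals/health-check.md)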

claude --print "
## Context
$CONTEXT

## Goal
$GOAL

## Constraints
- You can read any file in this repo
- You can run nomad/consul/systemctl commands
- You can edit Nix/HCL files and run deploy
- Before destructive actions, validate with nix build or nomad plan
- If uncertain about safety, output a summary and stop
"
```

### Context Gathering

```bash
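#!/usr/bin/env bash
# Snapshot of cluster state. Failures are tolerated and stderr is captured,
# so one unreachable service doesn't hide the rest of the picture.

echo "=== Nomad Jobs ==="
nomad job status 2>&1 || echo "(nomad unavailable)"

echo "=== Consul Members ==="
consul members 2>&1 || echo "(consul unavailable)"

echo "=== Failed Systemd Units ==="
systemctl --failed --no-pager

echo "=== Recent Errors (last hour) ==="
journalctl --since "1 hour ago" -p err --no-pager | tail -100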
```

## Edge Cases and the Nix Promise

The NixOS promise mostly works, but sometimes doesn't:
- Mount option changes that require reboot
- Transition states where switch fails even if end state is correct
- Partial application where switch "succeeds" but change didn't take effect

This is where the agent adds value: it can detect when a change needs special handling, apply the appropriate strategy, and verify the change actually took effect.
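
The "verify the change actually took effect" step can be made concrete for the mount-option case. The sketch below checks that every expected option appears in the live option list. `options_applied` is a hypothetical helper, and the option strings in the usage lines are sample values, not taken from this cluster.

```bash
#!/usr/bin/env bash
# Check that every expected mount option appears in the live option list.
# options_applied is a hypothetical helper; both arguments are comma-separated.

options_applied() {
  local actual=$1 expected=$2 opt
  local IFS=,
  for opt in $expected; do
    case ",$actual," in
      *",$opt,"*) ;;                          # option present, keep checking
      *) echo "missing: $opt"; return 1 ;;    # declared but not in effect
    esac
  done
}

# Live options would come from something like: findmnt -no OPTIONS <mountpoint>
options_applied "rw,relatime,vers=4.2,hard" "vers=4.2,hard" && echo "applied"
options_applied "rw,relatime,vers=4.1,hard" "vers=4.2,hard" || echo "reboot may be required"
```

Wrapping both strings in commas before matching avoids false positives on substrings, so `hard` does not match an option such as `nohard`.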

## Capturing Knowledge

Document edge cases as they're discovered:

```markdown
## CIFS/NFS mount option changes
Switch may fail or succeed without effect. Strategy:
1. Try normal deploy
2. If mount options don't match after, reboot required
3. If deploy fails with mount busy, local switch + reboot
```

The agent reads this, uses it as context, but can also reason about novel situations.
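
How those notes reach the agent is left open above. One possibility, sketched here, is to concatenate them into a single context section. The directory layout (one Markdown file per edge case) and the `collect_knowledge` name are both assumptions.

```bash
#!/usr/bin/env bash
# Concatenate edge-case notes into one block of prompt context.
# collect_knowledge and the one-file-per-note layout are assumptions.

collect_knowledge() {
  local dir=$1 file
  for file in "$dir"/*.md; do
    [ -e "$file" ] || continue        # empty directory: the glob did not match
    printf '### %s\n' "$(basename "$file" .md)"
    cat "$file"
    printf '\n'
  done
}
```

A trigger script could then append `$(collect_knowledge knowledge/)` to the context it already gathers, where `knowledge/` is a hypothetical directory name.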

## Path to CI/CD

Eventually: push to main triggers deploy via agent.

```
push to main
     |
build all configs (mechanical)
     |
agent: "what changed? is this safe to auto-deploy?"
     |
├─ clean change -> deploy, validate, done
├─ needs reboot -> deploy, schedule reboot, validate after
├─ risky change -> notify for manual approval
└─ failed -> diagnose, retry with different strategy, or escalate
     |
post-deploy verification
     |
notification
```

The agent is the intelligence layer on top of mechanical CI/CD.
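
The branch in the middle of that flow could be kept mechanical even though the assessment itself comes from the agent. The sketch below maps a verdict to the next pipeline steps. The verdict strings and step names are hypothetical, chosen to mirror the diagram.

```bash
#!/usr/bin/env bash
# Map the agent's assessment of a change to the next pipeline steps.
# Verdict strings and step names are hypothetical, mirroring the diagram.

next_step() {
  case $1 in
    clean)        echo "deploy validate" ;;
    needs-reboot) echo "deploy schedule-reboot validate" ;;
    risky)        echo "notify-for-approval" ;;
    failed)       echo "diagnose" ;;
    *)            echo "escalate" ;;   # unrecognized verdict: never auto-deploy
  esac
}

for verdict in clean needs-reboot risky failed unknown; do
  printf '%-12s -> %s\n' "$verdict" "$(next_step "$verdict")"
done
```

Defaulting to `escalate` means malformed or unexpected agent output cannot trigger a deploy on its own.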

## Research: What Others Are Doing (January 2026)

### Existing Projects & Approaches

**n8n + Ollama Stack**
The most common pattern is n8n (workflow orchestration) + Ollama (local LLM). Webhooks from
monitoring (Netdata/Prometheus) trigger AI-assisted diagnosis. Philosophy from one practitioner:
"train an employee, not a bot" — build trust, gradually grant autonomy.

Sources:
- [Virtualization Howto: Self-Healing Home Lab](https://www.virtualizationhowto.com/2025/10/how-i-built-a-self-healing-home-lab-that-fixes-itself/)
- [addROM: AI Agent for Homelab with n8n](https://addrom.com/unleashing-the-power-of-an-ai-agent-for-homelab-management-with-n8n/)

**Local Infrastructure Agent (Kelcode)**
Architecture: user question → tool router → query processor → LLM response. Connects to
Kubernetes, Prometheus, Harbor Registry.

Key insight: "The AI's output definition must be perfectly synchronized with the software
it's trying to use." Their K8s tool failed because the prompt generated kubectl commands
while the code expected structured data objects.

Uses phi4-mini via Ollama for routing decisions after testing multiple models.

Source: [Kelcode: Building a Homelab Agentic Ecosystem](https://kelcode.co.uk/building-a-homelab-agentic-ecosystem-part1/)

**nixai**
AI assistant specifically for NixOS. Searches NixOS Wiki, Nixpkgs Manual, nix.dev, Home Manager
docs. Diagnoses issues from piped logs/errors. Privacy-first: defaults to local Ollama.

Limited scope — helper tool, not autonomous agent. But shows NixOS-specific tooling is possible.

Source: [NixOS Discourse: Introducing nixai](https://discourse.nixos.org/t/introducing-nixai-your-ai-powered-nixos-companion/65168)

**AI-Friendly Infrastructure (The Merino Wolf)**
Key insight: make infrastructure "AI-friendly" through structured documentation. CLAUDE.md
provides comprehensive context — "structured knowledge transfer."

Lessons:
- "Context investment pays dividends" — comprehensive documentation is the most valuable asset
- Layered infrastructure design mirrors how both humans and AI think
- Rule-based guidance enforces safety practices automatically

Source: [The Merino Wolf: AI-Powered Homelab](https://themerinowolf.com/posts/ai-powered-homelab/)

**Claude Code Infrastructure Patterns**
Solves the "skills don't activate automatically" problem using hooks (UserPromptSubmit,
PostToolUse) plus skill-rules.json for auto-activation.

500-line rule with progressive disclosure: main file for high-level guidance, resource files
for deep dives. Claude loads materials incrementally as needed.

Persistence pattern across context resets using three-file structures (plan, context, tasks).

Born from 6 months managing TypeScript microservices (50k+ lines).

Source: [diet103/claude-code-infrastructure-showcase](https://github.com/diet103/claude-code-infrastructure-showcase)

### Patterns That Work

- Local LLMs (Ollama) + workflow orchestration (n8n) is the popular stack
- Start with read-only/diagnostic agents, gradually add write access
- Pre-approved command lists for safety (e.g., 50 validated bash commands max)
- Structured documentation as foundation — AI is only as good as its context
- Multi-step tool use: agent plans, then executes steps, observing results
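
The pre-approved command list pattern might look like the sketch below: a command passes only if it matches an approved prefix and contains no shell metacharacters. The prefixes are examples rather than a vetted list, and `is_allowed` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Allow a command only if it starts with a pre-approved prefix.
# The prefixes below are examples, not a vetted list.

ALLOWED_PREFIXES=(
  "nomad job status"
  "nomad job plan"
  "consul members"
  "systemctl --failed"
)

is_allowed() {
  local cmd=$1 prefix
  local unsafe='*[;|&$`<>()]*'   # pattern matching shell metacharacters
  # Reject chaining, substitution, and redirection outright.
  case $cmd in
    $unsafe) return 1 ;;
  esac
  for prefix in "${ALLOWED_PREFIXES[@]}"; do
    case $cmd in
      "$prefix"|"$prefix "*) return 0 ;;   # exact match or prefix plus arguments
    esac
  done
  return 1
}

is_allowed "nomad job status blog" && echo "allowed"
is_allowed "nomad job stop blog"   || echo "blocked"
```

This is deliberately conservative and incomplete: it does not inspect arguments, so a real list would also need to decide which flags of each approved command are safe.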

### What's Missing in the Space

- Nobody's doing true "emergent capabilities" yet — mostly tool routing
- Most projects are Kubernetes/Docker focused, not NixOS
- Few examples of proactive stewardship (our example #2)
- Limited examples of agents that understand the whole system coherently

### Community Skepticism

From Reddit discussions: there is real skepticism about running LLM agents in production.
LLMs can automate specific tasks, but they often need a human for complex decisions.

This supports our approach: tight loop with a competent human, not autonomous operation.

### The Gap We'd Fill

- NixOS-native agent leveraging declarative config as source of truth
- True emergence — not just tool routing, but reasoning about novel situations
- Proactive evolution, not just reactive troubleshooting
- Tight human loop with a competent operator

## Next Steps

1. Build trigger infrastructure (systemd timer, basic webhook handler)
2. Write context gathering scripts
3. Define goal prompts for common scenarios
4. Test with scheduled health checks
5. Iterate based on what works and what doesn't
6. Document edge cases as they're discovered
7. Gradually expand scope as confidence grows