diff --git a/docs/HOMELAB_AGENT.md b/docs/HOMELAB_AGENT.md
new file mode 100644
index 0000000..651087e
--- /dev/null
+++ b/docs/HOMELAB_AGENT.md
@@ -0,0 +1,354 @@
+# ABOUTME: Vision and design document for an AI agent that manages the homelab cluster.
+# ABOUTME: Covers emergent capabilities, technical approach, and implementation strategy.
+
+# Homelab Agent: Vision and Design
+
+## The Core Idea
+
+Not automation. Not "LLM-powered autocomplete for infrastructure." Emergent capabilities.
+
+The same shift Claude Code brought to programming: you describe outcomes, it handles implementation. You become a "product manager" for your infrastructure instead of an "infrastructure engineer."
+
+The cluster stops being infrastructure you manage and becomes an environment that responds to intent.
+
+## What Makes This Different From Automation
+
+**Automation**: "If disk > 90%, delete old logs"
+
+**Emergent**: "Disk is 95% full. What's using space? ...Postgres WAL. Can I safely checkpoint? Last backup was 2h ago, load is low, yes. Running checkpoint... down to 60%. I should note that WAL retention might need tuning."
+
+The difference:
+- Novel problem-solving (not pattern matching)
+- Contextual safety reasoning
+- Adaptation to the specific situation
+- Learning for the future
+
+## Examples of Genuinely New Capabilities
+
+### 1. Intent-Driven Infrastructure
+
+> "I want to run Synapse for Matrix"
+
+Agent figures out: Nomad job spec, storage location, Traefik routing, TLS, Consul registration, backup config. Creates it, deploys it, validates it.
+
+You don't need to know Nomad job format or Traefik labels. You describe the outcome.
+
+### 2. Proactive Evolution (The Best One)
+
+The agent doesn't wait for problems or instructions:
+
+- "Synapse 1.98 has a security fix. I've tested it in a local build, no config changes needed. Deploy?"
+- "Your NFS server has been primary for 47 days. Want me to test failover to make sure it still works?"
+- "I noticed arr services all have the same resource limits but Sonarr consistently uses more. Adjusted."
+- "There's a new NixOS module for Traefik that simplifies your current setup. Here's the diff."
+
+Not monitoring. Stewardship.
+
+### 3. The Cluster Has Opinions
+
+> You: "I want to add Plex"
+>
+> Agent: "You already have Jellyfin, which does the same thing. If you want Plex specifically for its mobile app, I can set it up to share Jellyfin's media library. Or if you want to switch entirely, I can migrate watch history. What's the actual goal?"
+
+Not a command executor. A collaborator that understands your system.
+
+### 4. "Bring This Into the Cluster"
+
+You're running something in Docker on a random VM:
+
+> "Bring this into the cluster"
+
+Agent: connects, inspects, figures out dependencies, writes Nomad job, sets up storage, migrates data, routes traffic, validates, decommissions old instance.
+
+You didn't need to know how.
+
+### 5. Cross-Cutting Changes
+
+> "Add authentication to all public-facing services"
+
+Agent identifies which services are public, understands the auth setup (Pocket ID + traefik-oidc-auth), modifies each service's config, tests that auth works.
+
+Single coherent change across everything, without knowing every service yourself.
+
+### 6. Emergent Debugging
+
+Not runbooks. Actual reasoning:
+
+> "The blog is slow"
+
+Agent checks service health (fine), node resources (fine), network latency (fine), database queries (ah, slow query), traces to missing index, adds index, validates performance improved.
+
+Solved a problem nobody wrote a runbook for.
+
+### 7. Architecture Exploration
+
+> "What if we added a third Nomad server for better quorum?"
+
+Agent reasons about current topology, generates the config, identifies what would change, shows blast radius. Thinking partner for infrastructure decisions.
+
+## Why Nix Makes This Possible
+
+Traditional infrastructure: state is scattered and implicit.
Nix: everything is declared.
+
+- **Full system understanding** - agent can read the flake and understand EVERYTHING
+- **Safe experimentation** - build without deploying, rollback trivially
+- **Reproducibility** - "what was the state 3 days ago?" can be rebuilt exactly
+- **Composition** - agent can generate valid configs that compose correctly
+- **The ecosystem** - 80k+ packages, thousands of modules the agent can navigate
+
+> "I want a VPN that works with my phone"
+
+Agent knows Nix, finds WireGuard module, configures it, generates QR codes, opens firewall. You didn't learn WireGuard.
+
+## The Validation Pattern
+
+Just like code has linting and tests, infrastructure actions need validation:
+
+| Phase | Code | Infrastructure |
+|-------|------|----------------|
+| Static | Lint, typecheck | Config parses, secrets exist, no port conflicts |
+| Pre-flight | — | Cluster healthy, dependencies up, quorum intact |
+| Post-action | Unit tests | Service started, health checks pass, metrics flowing |
+| Invariants | CI | NFS mounted, Consul quorum, replication current |
+
+The agent can take actions confidently because it validates outcomes.
+
+## The Reality Check
+
+Some of this works today. Some would fail spectacularly. Some would fail silently and idiotically. Just like Claude Code for coding.
+
+Therefore:
+- Tight loop with the human operator
+- Assume the human is competent and knowledgeable
+- Agent amplifies expertise, doesn't replace it
+- Escalate when uncertain
+
+## Technical Approach
+
+### Runtime: Claude Code (Not Agent SDK)
+
+Two options were considered:
+
+| Tool | Pro/Max Subscription | API Billing |
+|------|---------------------|-------------|
+| Claude Code CLI | Yes | Yes |
+| Claude Agent SDK | No | Required |
+
+Claude Code can use existing Max subscription. Agent SDK requires separate API billing.
+
+For v1, use Claude Code as the runtime:
+
+```bash
+claude --print "prompt" \
+  --allowedTools "Bash,Read,Edit" \
+  --permission-mode acceptEdits
+```
+
+Graduate to Agent SDK later if limitations are hit.
+
+### Trigger Architecture
+
+On-demand Claude Code sessions, triggered by:
+- **Timer** - periodic health/sanity check
+- **Alert** - alertmanager webhook
+- **Event** - systemd OnFailure, consul watch
+- **Manual** - invoke with a goal
+
+Each trigger provides context and a goal. Claude Code does the rest.
+
+### Structure
+
+```
+agent/
+├── triggers/
+│   ├── scheduled-check     # systemd timer
+│   ├── on-alert            # webhook handler
+│   └── on-failure          # systemd OnFailure target
+├── gather-context.sh       # snapshot of cluster state
+└── goals/
+    ├── health-check.md     # verify health, fix if safe
+    ├── incident.md         # investigate alert, fix or escalate
+    └── proactive.md        # look for improvements
+```
+
+### Example: Scheduled Health Check
+
+```bash
+#!/usr/bin/env bash
+CONTEXT=$(./gather-context.sh)
+GOAL=$(cat goals/health-check.md)
+
+claude --print "
+## Context
+$CONTEXT
+
+## Goal
+$GOAL
+
+## Constraints
+- You can read any file in this repo
+- You can run nomad/consul/systemctl commands
+- You can edit Nix/HCL files and run deploy
+- Before destructive actions, validate with nix build or nomad plan
+- If uncertain about safety, output a summary and stop
+"
+```
+
+### Context Gathering
+
+```bash
+#!/usr/bin/env bash
+echo "=== Nomad Jobs ==="
+nomad job status
+
+echo "=== Consul Members ==="
+consul members
+
+echo "=== Failed Systemd Units ==="
+systemctl --failed
+
+echo "=== Recent Errors (last hour) ==="
+journalctl --since "1 hour ago" -p err --no-pager | tail -100
+```
+
+## Edge Cases and the Nix Promise
+
+The NixOS promise mostly works, but sometimes doesn't:
+- Mount option changes that require reboot
+- Transition states where switch fails even if end state is correct
+- Partial application where switch "succeeds" but change didn't take effect
+
+This is
where the agent adds value: it can detect when a change needs special handling, apply the appropriate strategy, and verify the change actually took effect.
+
+## Capturing Knowledge
+
+Document edge cases as they're discovered:
+
+```markdown
+## CIFS/NFS mount option changes
+Switch may fail or succeed without effect. Strategy:
+1. Try normal deploy
+2. If mount options don't match after, reboot required
+3. If deploy fails with mount busy, local switch + reboot
+```
+
+The agent reads this, uses it as context, but can also reason about novel situations.
+
+## Path to CI/CD
+
+Eventually: push to main triggers deploy via agent.
+
+```
+push to main
+    |
+build all configs (mechanical)
+    |
+agent: "what changed? is this safe to auto-deploy?"
+    |
+├─ clean change -> deploy, validate, done
+├─ needs reboot -> deploy, schedule reboot, validate after
+├─ risky change -> notify for manual approval
+└─ failed -> diagnose, retry with different strategy, or escalate
+    |
+post-deploy verification
+    |
+notification
+```
+
+The agent is the intelligence layer on top of mechanical CI/CD.
+
+## Research: What Others Are Doing (January 2026)
+
+### Existing Projects & Approaches
+
+**n8n + Ollama Stack**
+The most common pattern is n8n (workflow orchestration) + Ollama (local LLM). Webhooks from
+monitoring (Netdata/Prometheus) trigger AI-assisted diagnosis. Philosophy from one practitioner:
+"train an employee, not a bot" — build trust, gradually grant autonomy.
+
+Sources:
+- [Virtualization Howto: Self-Healing Home Lab](https://www.virtualizationhowto.com/2025/10/how-i-built-a-self-healing-home-lab-that-fixes-itself/)
+- [addROM: AI Agent for Homelab with n8n](https://addrom.com/unleashing-the-power-of-an-ai-agent-for-homelab-management-with-n8n/)
+
+**Local Infrastructure Agent (Kelcode)**
+Architecture: user question → tool router → query processor → LLM response. Connects to
+Kubernetes, Prometheus, Harbor Registry.
+
+Key insight: "The AI's output definition must be perfectly synchronized with the software
+it's trying to use." Their K8s tool failed because the prompt generated kubectl commands
+while the code expected structured data objects.
+
+Uses phi4-mini via Ollama for routing decisions after testing multiple models.
+
+Source: [Kelcode: Building a Homelab Agentic Ecosystem](https://kelcode.co.uk/building-a-homelab-agentic-ecosystem-part1/)
+
+**nixai**
+AI assistant specifically for NixOS. Searches NixOS Wiki, Nixpkgs Manual, nix.dev, Home Manager
+docs. Diagnoses issues from piped logs/errors. Privacy-first: defaults to local Ollama.
+
+Limited scope — helper tool, not autonomous agent. But shows NixOS-specific tooling is possible.
+
+Source: [NixOS Discourse: Introducing nixai](https://discourse.nixos.org/t/introducing-nixai-your-ai-powered-nixos-companion/65168)
+
+**AI-Friendly Infrastructure (The Merino Wolf)**
+Key insight: make infrastructure "AI-friendly" through structured documentation. CLAUDE.md
+provides comprehensive context — "structured knowledge transfer."
+
+Lessons:
+- "Context investment pays dividends" — comprehensive documentation is the most valuable asset
+- Layered infrastructure design mirrors how both humans and AI think
+- Rule-based guidance enforces safety practices automatically
+
+Source: [The Merino Wolf: AI-Powered Homelab](https://themerinowolf.com/posts/ai-powered-homelab/)
+
+**Claude Code Infrastructure Patterns**
+Solves "skills don't activate automatically" problem using hooks (UserPromptSubmit, PostToolUse)
++ skill-rules.json for auto-activation.
+
+500-line rule with progressive disclosure: main file for high-level guidance, resource files
+for deep dives. Claude loads materials incrementally as needed.
+
+Persistence pattern across context resets using three-file structures (plan, context, tasks).
+
+Born from 6 months managing TypeScript microservices (50k+ lines).
+
+Source: [diet103/claude-code-infrastructure-showcase](https://github.com/diet103/claude-code-infrastructure-showcase)
+
+### Patterns That Work
+
+- Local LLMs (Ollama) + workflow orchestration (n8n) is the popular stack
+- Start with read-only/diagnostic agents, gradually add write access
+- Pre-approved command lists for safety (e.g., 50 validated bash commands max)
+- Structured documentation as foundation — AI is only as good as its context
+- Multi-step tool use: agent plans, then executes steps, observing results
+
+### What's Missing in the Space
+
+- Nobody's doing true "emergent capabilities" yet — mostly tool routing
+- Most projects are Kubernetes/Docker focused, not NixOS
+- Few examples of proactive stewardship (our example #2)
+- Limited examples of agents that understand the whole system coherently
+
+### Community Skepticism
+
+From Reddit discussions: doubts exist about using LLM agents in production. Although LLMs can
+automate specific tasks, they frequently need human involvement for intricate decision-making.
+
+This validates our approach: tight loop with a competent human, not autonomous operation.
+
+### The Gap We'd Fill
+
+- NixOS-native agent leveraging declarative config as source of truth
+- True emergence — not just tool routing, but reasoning about novel situations
+- Proactive evolution, not just reactive troubleshooting
+- Tight human loop with a competent operator
+
+## Next Steps
+
+1. Build trigger infrastructure (systemd timer, basic webhook handler)
+2. Write context gathering scripts
+3. Define goal prompts for common scenarios
+4. Test with scheduled health checks
+5. Iterate based on what works and what doesn't
+6. Document edge cases as they're discovered
+7. Gradually expand scope as confidence grows
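
The "pre-approved command lists" and "start with read-only" patterns listed under Patterns That Work could back the first few steps above. What follows is only a minimal bash sketch under stated assumptions: the `ALLOWED_PREFIXES` entries and the `is_allowed` helper are hypothetical names introduced for illustration, not part of the design in this document, and the list itself is not vetted.

```shell
#!/usr/bin/env bash
# Hypothetical allowlist of read-only command prefixes (illustrative only;
# the entries mirror the commands used in the context-gathering script).
ALLOWED_PREFIXES=(
  "nomad job status"
  "consul members"
  "systemctl --failed"
  "journalctl"
)

# is_allowed CMD
# Returns 0 if CMD is exactly an allowed prefix, or an allowed prefix
# followed by a space and arguments. Returns 1 otherwise, and also for any
# command containing shell metacharacters or a newline, so that chained or
# substituted commands cannot ride along behind an allowed prefix.
is_allowed() {
  local cmd="$1" prefix
  local meta='[;|&$<>`]'
  if [[ "$cmd" =~ $meta || "$cmd" == *$'\n'* ]]; then
    return 1
  fi
  for prefix in "${ALLOWED_PREFIXES[@]}"; do
    if [[ "$cmd" == "$prefix" || "$cmd" == "$prefix "* ]]; then
      return 0
    fi
  done
  return 1
}
```

A trigger script might call `is_allowed "$proposed"` on each command the agent proposes and escalate to the operator when it returns non-zero. Write access would then be granted by adding entries as confidence grows, in line with step 7.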