# ABOUTME: Vision and design document for an AI agent that manages the homelab cluster.
# ABOUTME: Covers emergent capabilities, technical approach, and implementation strategy.

# Homelab Agent: Vision and Design

## The Core Idea

Not automation. Not "LLM-powered autocomplete for infrastructure." Emergent capabilities.

The same shift Claude Code brought to programming: you describe outcomes, it handles implementation. You become a "product manager" for your infrastructure instead of an "infrastructure engineer."

The cluster stops being infrastructure you manage and becomes an environment that responds to intent.

## What Makes This Different From Automation

**Automation**: "If disk > 90%, delete old logs"

**Emergent**: "Disk is 95% full. What's using space? ...Postgres WAL. Can I safely checkpoint? Last backup was 2h ago, load is low, yes. Running checkpoint... down to 60%. I should note that WAL retention might need tuning."

The difference:
- Novel problem-solving (not pattern matching)
- Contextual safety reasoning
- Adaptation to the specific situation
- Learning for the future

## Examples of Genuinely New Capabilities

### 1. Intent-Driven Infrastructure

> "I want to run Synapse for Matrix"

Agent figures out: Nomad job spec, storage location, Traefik routing, TLS, Consul registration, backup config. Creates it, deploys it, validates it.

You don't need to know Nomad job format or Traefik labels. You describe the outcome.

### 2. Proactive Evolution (The Best One)

The agent doesn't wait for problems or instructions:

- "Synapse 1.98 has a security fix. I've tested it in a local build, no config changes needed. Deploy?"
- "Your NFS server has been primary for 47 days. Want me to test failover to make sure it still works?"
- "I noticed arr services all have the same resource limits but Sonarr consistently uses more. Adjusted."
- "There's a new NixOS module for Traefik that simplifies your current setup. Here's the diff."

Not monitoring. Stewardship.

### 3. The Cluster Has Opinions

> You: "I want to add Plex"
>
> Agent: "You already have Jellyfin, which does the same thing. If you want Plex specifically for its mobile app, I can set it up to share Jellyfin's media library. Or if you want to switch entirely, I can migrate watch history. What's the actual goal?"

Not a command executor. A collaborator that understands your system.

### 4. "Bring This Into the Cluster"

You're running something in Docker on a random VM:

> "Bring this into the cluster"

Agent: connects, inspects, figures out dependencies, writes Nomad job, sets up storage, migrates data, routes traffic, validates, decommissions old instance.

You didn't need to know how.

### 5. Cross-Cutting Changes

> "Add authentication to all public-facing services"

Agent identifies which services are public, understands the auth setup (Pocket ID + traefik-oidc-auth), modifies each service's config, tests that auth works.

Single coherent change across everything, without knowing every service yourself.

### 6. Emergent Debugging

Not runbooks. Actual reasoning:

> "The blog is slow"

Agent checks service health (fine), node resources (fine), network latency (fine), database queries (ah, slow query), traces to missing index, adds index, validates performance improved.

Solved a problem nobody wrote a runbook for.

### 7. Architecture Exploration

> "What if we added a third Nomad server for better quorum?"

Agent reasons about current topology, generates the config, identifies what would change, shows blast radius. Thinking partner for infrastructure decisions.
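
Part of the quorum question is plain arithmetic. Nomad and Consul servers use Raft, where a majority of `floor(n/2) + 1` servers must agree. A minimal sketch of that calculation, with illustrative function names that are not part of any tool:

```bash
#!/usr/bin/env bash
# Raft quorum arithmetic. Function names are illustrative only.

# Votes needed for a majority among N servers: floor(N/2) + 1.
quorum_size() {
  local servers=$1
  echo $(( servers / 2 + 1 ))
}

# Servers that can fail while a majority is still reachable.
failure_tolerance() {
  local servers=$1
  echo $(( servers - (servers / 2 + 1) ))
}

for n in 1 2 3 5; do
  echo "servers=$n quorum=$(quorum_size "$n") tolerates=$(failure_tolerance "$n")"
done
```

Two servers tolerate no failures, while three tolerate one. That is the kind of fact the agent could surface alongside the generated config.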

## Why Nix Makes This Possible

Traditional infrastructure: state is scattered and implicit. Nix: everything is declared.

- **Full system understanding** - agent can read the flake and understand EVERYTHING
- **Safe experimentation** - build without deploying, rollback trivially
- **Reproducibility** - "what was the state 3 days ago?" can be rebuilt exactly
- **Composition** - agent can generate valid configs that compose correctly
- **The ecosystem** - 80k+ packages, thousands of modules the agent can navigate

> "I want a VPN that works with my phone"

Agent knows Nix, finds WireGuard module, configures it, generates QR codes, opens firewall. You didn't learn WireGuard.

## The Validation Pattern

Just like code has linting and tests, infrastructure actions need validation:

| Phase | Code | Infrastructure |
|-------|------|----------------|
| Static | Lint, typecheck | Config parses, secrets exist, no port conflicts |
| Pre-flight | n/a | Cluster healthy, dependencies up, quorum intact |
| Post-action | Unit tests | Service started, health checks pass, metrics flowing |
| Invariants | CI | NFS mounted, Consul quorum, replication current |

The agent can take actions confidently because it validates outcomes.
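
The phases in the table can be sketched as a small wrapper that refuses to act unless the earlier phases pass. This is a sketch under assumptions, not the implementation: `run_phase`, `validated_action`, and the three placeholder checks are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch of phased validation. Phase and check names are hypothetical.

# Run every check in a phase, report each result, and fail the phase
# if any check fails.
run_phase() {
  local phase=$1; shift
  local check failed=0
  for check in "$@"; do
    if "$check" >/dev/null 2>&1; then
      echo "[$phase] ok: $check"
    else
      echo "[$phase] FAILED: $check"
      failed=1
    fi
  done
  return "$failed"
}

# Placeholder checks. Real ones might wrap `nomad job validate`,
# `consul members`, or a service health endpoint.
config_parses()   { true; }
cluster_healthy() { true; }
service_started() { true; }

# Act only if the static and pre-flight phases pass, then verify the outcome.
validated_action() {
  run_phase static config_parses || return 1
  run_phase pre-flight cluster_healthy || return 1
  "$@" || return 1
  run_phase post-action service_started
}

validated_action echo "deploying (placeholder action)"
```

Invariant checks would run on a schedule rather than per action, so they are left out here.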

## The Reality Check

Some of this works today. Some would fail spectacularly. Some would fail silently and idiotically. Just like Claude Code for coding.

Therefore:
- Tight loop with the human operator
- Assume the human is competent and knowledgeable
- Agent amplifies expertise, doesn't replace it
- Escalate when uncertain

## Technical Approach

### Runtime: Claude Code (Not Agent SDK)

Two options were considered:

| Tool | Pro/Max Subscription | API Billing |
|------|---------------------|-------------|
| Claude Code CLI | Yes | Optional |
| Claude Agent SDK | No | Required |

Claude Code can use an existing Max subscription. The Agent SDK requires separate API billing.

For v1, use Claude Code as the runtime:

```bash
# Non-interactive run: print the response and exit
claude --print "prompt" \
  --allowedTools "Bash,Read,Edit" \
  --permission-mode acceptEdits
```

Graduate to the Agent SDK later if limitations are hit.

### Trigger Architecture

On-demand Claude Code sessions, triggered by:
- **Timer** - periodic health/sanity check
- **Alert** - alertmanager webhook
- **Event** - systemd OnFailure, consul watch
- **Manual** - invoke with a goal

Each trigger provides context and a goal. Claude Code does the rest.

### Structure

```
agent/
├── triggers/
│   ├── scheduled-check    # systemd timer
│   ├── on-alert           # webhook handler
│   └── on-failure         # systemd OnFailure target
├── gather-context.sh      # snapshot of cluster state
└── goals/
    ├── health-check.md    # verify health, fix if safe
    ├── incident.md        # investigate alert, fix or escalate
    └── proactive.md       # look for improvements
```
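
One way the triggers could select a goal file from the layout above. The mapping is an assumption: the document does not specify which trigger uses which goal, and `goal_for_trigger` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: map a trigger kind to a goal file. The mapping is an assumption.

goal_for_trigger() {
  case "$1" in
    timer)       echo "goals/health-check.md" ;;
    alert|event) echo "goals/incident.md" ;;
    # Manual runs name their own goal file, defaulting to the proactive goal.
    manual)      echo "${2:-goals/proactive.md}" ;;
    *)           echo "unknown trigger: $1" >&2; return 1 ;;
  esac
}

goal_for_trigger timer
goal_for_trigger alert
```

Keeping the mapping in one function means each script under `triggers/` only has to pass its kind and any context it received.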

### Example: Scheduled Health Check

```bash
#!/usr/bin/env bash
# Run from the agent/ directory so the relative paths below resolve.
set -euo pipefail

# Snapshot of cluster state and the goal prompt for this trigger
CONTEXT=$(./gather-context.sh)
GOAL=$(cat goals/health-check.md)

# Same tool and permission flags as the runtime example above
claude --print "
## Context
$CONTEXT

## Goal
$GOAL

## Constraints
- You can read any file in this repo
- You can run nomad/consul/systemctl commands
- You can edit Nix/HCL files and run deploy
- Before destructive actions, validate with nix build or nomad plan
- If uncertain about safety, output a summary and stop
" \
  --allowedTools "Bash,Read,Edit" \
  --permission-mode acceptEdits
```

### Context Gathering

```bash
#!/usr/bin/env bash
# Prints a plain-text snapshot of cluster state for the agent prompt.

# Merge stderr into stdout so tool errors also reach the agent
exec 2>&1

echo "=== Nomad Jobs ==="
nomad job status

echo "=== Consul Members ==="
consul members

echo "=== Failed Systemd Units ==="
systemctl --failed --no-pager

echo "=== Recent Errors (last hour) ==="
journalctl --since "1 hour ago" -p err --no-pager | tail -n 100
```
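
A possible refinement of the script above, sketched under assumptions: a `section` helper (hypothetical name) that caps each section's length and reports a missing or failing tool inline instead of leaving a gap in the snapshot. `MAX_LINES` and its default are also assumptions.

```bash
#!/usr/bin/env bash
# Sketch: a section helper for context gathering.

MAX_LINES=${MAX_LINES:-100}  # per-section cap; an arbitrary default

# Print a header, then the command's combined output capped at MAX_LINES.
# A missing or failing command is reported inline and never aborts the snapshot.
section() {
  local title=$1; shift
  local output status
  echo "=== $title ==="
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "(skipped: $1 not found)"
    return 0
  fi
  output=$("$@" 2>&1) && status=0 || status=$?
  if [ -n "$output" ]; then
    printf '%s\n' "$output" | tail -n "$MAX_LINES"
  fi
  if [ "$status" -ne 0 ]; then
    echo "(exit status $status)"
  fi
  return 0
}

# Harmless live example; cluster usage would look like:
#   section "Nomad Jobs" nomad job status
section "Kernel" uname -s
```

Capping each section keeps one noisy source, such as the journal, from crowding everything else out of the prompt.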

## Edge Cases and the Nix Promise

The NixOS promise mostly works, but sometimes doesn't:
- Mount option changes that require reboot
- Transition states where switch fails even if end state is correct
- Partial application where switch "succeeds" but change didn't take effect

This is where the agent adds value: it can detect when a change needs special handling, apply the appropriate strategy, and verify the change actually took effect.
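
Verifying that a change took effect can be very concrete. For the mount option case, a minimal sketch that reads the kernel's view from `/proc/mounts` instead of trusting the switch result. `mount_has_option` is a hypothetical helper and the example path is made up.

```bash
#!/usr/bin/env bash
# Sketch: check whether a mount actually carries an expected option.
# Reads /proc/mounts format: device mountpoint fstype options dump pass.
# Mount points containing spaces appear escaped (\040) in that file.

mount_has_option() {
  local target=$1 wanted=$2 mounts_file=${3:-/proc/mounts}
  local dev mnt fstype opts rest current=""
  while read -r dev mnt fstype opts rest; do
    # Keep the last entry for the target: later mounts shadow earlier ones.
    if [ "$mnt" = "$target" ]; then
      current=$opts
    fi
  done < "$mounts_file"
  # Wrap in commas so "ro" cannot match inside another option name.
  case ",$current," in
    *",$wanted,"*) return 0 ;;
    *)             return 1 ;;
  esac
}

# Example (hypothetical path):
#   mount_has_option /mnt/media noatime || echo "option not active, reboot may be required"
```

The optional third argument exists so the check can be exercised against a saved copy of the mounts table.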

## Capturing Knowledge

Document edge cases as they're discovered:

```markdown
## CIFS/NFS mount option changes
Switch may fail or succeed without effect. Strategy:
1. Try normal deploy
2. If mount options don't match after, reboot required
3. If deploy fails with mount busy, local switch + reboot
```

The agent reads this, uses it as context, but can also reason about novel situations.
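
How the notes reach the agent is not specified. One option is to append them to the gathered context. A sketch, assuming the notes live as Markdown files in a `knowledge/` directory (a hypothetical location), with `knowledge_context` as a hypothetical helper:

```bash
#!/usr/bin/env bash
# Sketch: emit captured edge-case notes for inclusion in the prompt.
# The knowledge/ directory name is an assumption.

knowledge_context() {
  local dir=${1:-knowledge} file
  for file in "$dir"/*.md; do
    [ -e "$file" ] || continue   # directory missing or empty
    echo "=== Known edge cases: $(basename "$file") ==="
    cat "$file"
    echo
  done
}

# Possible use in a trigger script:
#   CONTEXT="$(./gather-context.sh)"$'\n'"$(knowledge_context knowledge)"
```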

## Path to CI/CD

Eventually: push to main triggers deploy via agent.

```
push to main
    │
build all configs (mechanical)
    │
agent: "what changed? is this safe to auto-deploy?"
    │
    ├─ clean change -> deploy, validate, done
    ├─ needs reboot -> deploy, schedule reboot, validate after
    ├─ risky change -> notify for manual approval
    └─ failed -> diagnose, retry with different strategy, or escalate
    │
post-deploy verification
    │
notification
```

The agent is the intelligence layer on top of mechanical CI/CD.
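
The branch step in that flow can stay mechanical once the agent has classified the change. A sketch with hypothetical classification labels and step names. Steps are printed rather than executed, and anything unrecognized goes to the human.

```bash
#!/usr/bin/env bash
# Sketch of the dispatch step. Labels and step names are hypothetical.

plan_for() {
  case "$1" in
    clean)        echo "deploy validate" ;;
    needs-reboot) echo "deploy schedule-reboot validate" ;;
    risky)        echo "notify-for-approval" ;;
    failed)       echo "diagnose retry-or-escalate" ;;
    *)            echo "escalate" ;;   # unknown classification: hand to the human
  esac
}

for class in clean needs-reboot risky failed; do
  echo "$class -> $(plan_for "$class")"
done
```

Defaulting to escalation matches the "escalate when uncertain" rule from the reality check.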

## Research: What Others Are Doing (January 2026)

### Existing Projects & Approaches

**n8n + Ollama Stack**
The most common pattern is n8n (workflow orchestration) + Ollama (local LLM). Webhooks from
monitoring (Netdata/Prometheus) trigger AI-assisted diagnosis. Philosophy from one practitioner:
"train an employee, not a bot" - build trust, gradually grant autonomy.

Sources:
- [Virtualization Howto: Self-Healing Home Lab](https://www.virtualizationhowto.com/2025/10/how-i-built-a-self-healing-home-lab-that-fixes-itself/)
- [addROM: AI Agent for Homelab with n8n](https://addrom.com/unleashing-the-power-of-an-ai-agent-for-homelab-management-with-n8n/)

**Local Infrastructure Agent (Kelcode)**
Architecture: user question → tool router → query processor → LLM response. Connects to
Kubernetes, Prometheus, Harbor Registry.

Key insight: "The AI's output definition must be perfectly synchronized with the software
it's trying to use." Their K8s tool failed because the prompt generated kubectl commands
while the code expected structured data objects.

Uses phi4-mini via Ollama for routing decisions after testing multiple models.

Source: [Kelcode: Building a Homelab Agentic Ecosystem](https://kelcode.co.uk/building-a-homelab-agentic-ecosystem-part1/)

**nixai**
AI assistant specifically for NixOS. Searches NixOS Wiki, Nixpkgs Manual, nix.dev, Home Manager
docs. Diagnoses issues from piped logs/errors. Privacy-first: defaults to local Ollama.

Limited scope - helper tool, not autonomous agent. But shows NixOS-specific tooling is possible.

Source: [NixOS Discourse: Introducing nixai](https://discourse.nixos.org/t/introducing-nixai-your-ai-powered-nixos-companion/65168)

**AI-Friendly Infrastructure (The Merino Wolf)**
Key insight: make infrastructure "AI-friendly" through structured documentation. CLAUDE.md
provides comprehensive context - "structured knowledge transfer."

Lessons:
- "Context investment pays dividends" - comprehensive documentation is the most valuable asset
- Layered infrastructure design mirrors how both humans and AI think
- Rule-based guidance enforces safety practices automatically

Source: [The Merino Wolf: AI-Powered Homelab](https://themerinowolf.com/posts/ai-powered-homelab/)

**Claude Code Infrastructure Patterns**
Solves "skills don't activate automatically" problem using hooks (UserPromptSubmit,
PostToolUse) + skill-rules.json for auto-activation.

500-line rule with progressive disclosure: main file for high-level guidance, resource files
for deep dives. Claude loads materials incrementally as needed.

Persistence pattern across context resets using three-file structures (plan, context, tasks).

Born from 6 months managing TypeScript microservices (50k+ lines).

Source: [diet103/claude-code-infrastructure-showcase](https://github.com/diet103/claude-code-infrastructure-showcase)

### Patterns That Work

- Local LLMs (Ollama) + workflow orchestration (n8n) is the popular stack
- Start with read-only/diagnostic agents, gradually add write access
- Pre-approved command lists for safety (e.g., 50 validated bash commands max)
- Structured documentation as foundation - AI is only as good as its context
- Multi-step tool use: agent plans, then executes steps, observing results
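
The pre-approved command list pattern is easy to sketch. The entries below are examples only, and `is_allowed` is a hypothetical helper. A real list would be reviewed command by command.

```bash
#!/usr/bin/env bash
# Sketch: a pre-approved command list. Entries are examples only.

ALLOWED_PREFIXES=(
  "nomad job status"
  "nomad job plan"
  "consul members"
  "systemctl --failed"
  "systemctl status"
)

# Succeeds only if the command equals an approved prefix or extends it with
# further arguments, and contains no shell metacharacters.
is_allowed() {
  local cmd=$1 prefix
  local unsafe='[;&|<>$`()]'   # chaining, redirects, substitution, subshells
  if [[ $cmd =~ $unsafe ]] || [[ $cmd == *$'\n'* ]]; then
    return 1
  fi
  for prefix in "${ALLOWED_PREFIXES[@]}"; do
    if [ "$cmd" = "$prefix" ] || [[ "$cmd" == "$prefix "* ]]; then
      return 0
    fi
  done
  return 1
}

is_allowed "nomad job status blog" && echo "allowed"
is_allowed "nomad job stop blog"   || echo "rejected"
```

Prefix matching keeps the list short and still blocks anything not explicitly approved. It fits the "start read-only, gradually add write access" pattern, since the list can grow entry by entry.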

### What's Missing in the Space

- Nobody's doing true "emergent capabilities" yet - mostly tool routing
- Most projects are Kubernetes/Docker focused, not NixOS
- Few examples of proactive stewardship (our example #2)
- Limited examples of agents that understand the whole system coherently

### Community Skepticism

From Reddit discussions: doubts exist about using LLM agents in production. Although LLMs can
automate specific tasks, they frequently need human involvement for intricate decision-making.

This validates our approach: tight loop with a competent human, not autonomous operation.

### The Gap We'd Fill

- NixOS-native agent leveraging declarative config as source of truth
- True emergence - not just tool routing, but reasoning about novel situations
- Proactive evolution, not just reactive troubleshooting
- Tight human loop with a competent operator

## Next Steps

1. Build trigger infrastructure (systemd timer, basic webhook handler)
2. Write context gathering scripts
3. Define goal prompts for common scenarios
4. Test with scheduled health checks
5. Iterate based on what works and what doesn't
6. Document edge cases as they're discovered
7. Gradually expand scope as confidence grows