AI ideas.
docs/HOMELAB_AGENT.md
@@ -0,0 +1,354 @@
# ABOUTME: Vision and design document for an AI agent that manages the homelab cluster.
# ABOUTME: Covers emergent capabilities, technical approach, and implementation strategy.

# Homelab Agent: Vision and Design

## The Core Idea

Not automation. Not "LLM-powered autocomplete for infrastructure." Emergent capabilities.

The same shift Claude Code brought to programming: you describe outcomes, it handles implementation. You become a "product manager" for your infrastructure instead of an "infrastructure engineer."

The cluster stops being infrastructure you manage and becomes an environment that responds to intent.

## What Makes This Different From Automation

**Automation**: "If disk > 90%, delete old logs"

**Emergent**: "Disk is 95% full. What's using space? ...Postgres WAL. Can I safely checkpoint? Last backup was 2h ago, load is low, yes. Running checkpoint... down to 60%. I should note that WAL retention might need tuning."

The difference:
- Novel problem-solving (not pattern matching)
- Contextual safety reasoning
- Adaptation to the specific situation
- Learning for the future
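
For contrast, the automation rule quoted above amounts to a fixed threshold check. A minimal sketch, where the function name `disk_rule`, the action labels, and the `df` one-liner in the comment are illustrative assumptions rather than anything that exists in the cluster:

```bash
#!/usr/bin/env bash
# Hypothetical static rule: threshold and action names are illustrative only.
disk_rule() {
  local used_pct="$1"   # integer percentage of disk in use
  if (( used_pct > 90 )); then
    echo "delete-old-logs"   # fires without asking what is using the space
  else
    echo "noop"
  fi
}

# Example: feed it the usage of / as reported by df in POSIX output format.
# disk_rule "$(df -P / | awk 'NR==2 {gsub("%","",$5); print $5}')"
```

The rule reacts the same way whether the space is held by logs or by Postgres WAL, which is the gap the emergent example is meant to close.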

## Examples of Genuinely New Capabilities

### 1. Intent-Driven Infrastructure

> "I want to run Synapse for Matrix"

Agent figures out: Nomad job spec, storage location, Traefik routing, TLS, Consul registration, backup config. Creates it, deploys it, validates it.

You don't need to know Nomad job format or Traefik labels. You describe the outcome.

### 2. Proactive Evolution (The Best One)

The agent doesn't wait for problems or instructions:

- "Synapse 1.98 has a security fix. I've tested it in a local build, no config changes needed. Deploy?"
- "Your NFS server has been primary for 47 days. Want me to test failover to make sure it still works?"
- "I noticed arr services all have the same resource limits but Sonarr consistently uses more. Adjusted."
- "There's a new NixOS module for Traefik that simplifies your current setup. Here's the diff."

Not monitoring. Stewardship.

### 3. The Cluster Has Opinions

> You: "I want to add Plex"
>
> Agent: "You already have Jellyfin, which does the same thing. If you want Plex specifically for its mobile app, I can set it up to share Jellyfin's media library. Or if you want to switch entirely, I can migrate watch history. What's the actual goal?"

Not a command executor. A collaborator that understands your system.

### 4. "Bring This Into the Cluster"

You're running something in Docker on a random VM:

> "Bring this into the cluster"

Agent: connects, inspects, figures out dependencies, writes Nomad job, sets up storage, migrates data, routes traffic, validates, decommissions old instance.

You didn't need to know how.

### 5. Cross-Cutting Changes

> "Add authentication to all public-facing services"

Agent identifies which services are public, understands the auth setup (Pocket ID + traefik-oidc-auth), modifies each service's config, tests that auth works.

Single coherent change across everything, without knowing every service yourself.

### 6. Emergent Debugging

Not runbooks. Actual reasoning:

> "The blog is slow"

Agent checks service health (fine), node resources (fine), network latency (fine), database queries (ah, slow query), traces to missing index, adds index, validates performance improved.

Solved a problem nobody wrote a runbook for.

### 7. Architecture Exploration

> "What if we added a third Nomad server for better quorum?"

Agent reasons about current topology, generates the config, identifies what would change, shows blast radius. Thinking partner for infrastructure decisions.

## Why Nix Makes This Possible

Traditional infrastructure: state is scattered and implicit. Nix: everything is declared.

- **Full system understanding** - agent can read the flake and understand EVERYTHING
- **Safe experimentation** - build without deploying, rollback trivially
- **Reproducibility** - "what was the state 3 days ago?" can be rebuilt exactly
- **Composition** - agent can generate valid configs that compose correctly
- **The ecosystem** - 80k+ packages, thousands of modules the agent can navigate
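
The "build without deploying, rollback trivially" point maps onto standard `nixos-rebuild` subcommands. A sketch, assuming the flake sits in the current directory and that the host name passed in (here `myhost`) matches a `nixosConfigurations` entry; the `run` helper and the `DRY_RUN` variable are hypothetical conveniences, not part of this repo:

```bash
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [[ "${DRY_RUN:-0}" == "1" ]]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Evaluate and build the system closure without activating it.
build_only() {
  run nixos-rebuild build --flake ".#$1"
}

# Switch back to the previous system generation.
rollback() {
  run nixos-rebuild switch --rollback
}
```

With `DRY_RUN=1 build_only myhost` the agent can show what it would run before touching the machine. A cluster that deploys remotely would substitute its own deploy command here.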

> "I want a VPN that works with my phone"

Agent knows Nix, finds WireGuard module, configures it, generates QR codes, opens firewall. You didn't learn WireGuard.

## The Validation Pattern

Just like code has linting and tests, infrastructure actions need validation:

| Phase | Code | Infrastructure |
|-------|------|----------------|
| Static | Lint, typecheck | Config parses, secrets exist, no port conflicts |
| Pre-flight | n/a | Cluster healthy, dependencies up, quorum intact |
| Post-action | Unit tests | Service started, health checks pass, metrics flowing |
| Invariants | CI | NFS mounted, Consul quorum, replication current |

The agent can take actions confidently because it validates outcomes.
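
One way to sketch the phases in the table is a small runner that executes named checks in order and stops at the first failure. The check functions, `JOB_FILE`, and `JOB_NAME` below are hypothetical placeholders; the `nomad` and `consul` subcommands are real, but which checks matter for this cluster is an assumption:

```bash
# Run every check in a phase; stop and report at the first failure.
run_phase() {
  local phase="$1"; shift
  local check
  for check in "$@"; do
    if ! "$check"; then
      echo "FAIL [$phase] $check"
      return 1
    fi
  done
  echo "OK [$phase]"
}

# Hypothetical checks; the job file and job name are placeholders.
static_config_parses() {
  nomad job validate "${JOB_FILE:-service.nomad.hcl}" >/dev/null
}
preflight_consul_has_leader() {
  consul operator raft list-peers | grep -q leader
}
post_service_running() {
  nomad job status "${JOB_NAME:-service}" | grep -q '^Status *= *running'
}
```

A deploy would then be bracketed as `run_phase static static_config_parses && run_phase pre-flight preflight_consul_has_leader`, followed by the change itself and `run_phase post-action post_service_running`.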

## The Reality Check

Some of this works today. Some would fail spectacularly. Some would fail silently and idiotically. Just like Claude Code for coding.

Therefore:
- Tight loop with the human operator
- Assume the human is competent and knowledgeable
- Agent amplifies expertise, doesn't replace it
- Escalate when uncertain

## Technical Approach

### Runtime: Claude Code (Not Agent SDK)

Two options were considered:

| Tool | Pro/Max Subscription | API Billing |
|------|---------------------|-------------|
| Claude Code CLI | Yes | Yes |
| Claude Agent SDK | No | Required |

Claude Code can use an existing Max subscription. The Agent SDK requires separate API billing.

For v1, use Claude Code as the runtime:

```bash
# --print runs non-interactively and exits, which suits timers and webhooks.
claude --print "prompt" \
  --allowedTools "Bash,Read,Edit" \
  --permission-mode acceptEdits
```

Graduate to the Agent SDK later if limitations are hit.

### Trigger Architecture

On-demand Claude Code sessions, triggered by:
- **Timer** - periodic health/sanity check
- **Alert** - alertmanager webhook
- **Event** - systemd OnFailure, consul watch
- **Manual** - invoke with a goal

Each trigger provides context and a goal. Claude Code does the rest.
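
A sketch of how a trigger could be turned into a goal. The goal file names follow the `agent/goals/` layout in this document, but the mapping itself (for example, that alerts and events share `incident.md`) and the function name `goal_for_trigger` are assumptions:

```bash
# Map a trigger kind to a goal file. The mapping is an assumption.
goal_for_trigger() {
  case "$1" in
    timer)       echo "goals/health-check.md" ;;
    alert|event) echo "goals/incident.md" ;;
    manual)      echo "${2:?manual trigger needs a goal file}" ;;
    *)           echo "unknown trigger: $1" >&2; return 1 ;;
  esac
}
```

Each trigger script would call this once, for example `goal_for_trigger alert`, and pass the resulting file's contents to Claude Code along with the gathered context.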

### Structure

```
agent/
├── triggers/
│   ├── scheduled-check       # systemd timer
│   ├── on-alert              # webhook handler
│   └── on-failure            # systemd OnFailure target
├── gather-context.sh         # snapshot of cluster state
└── goals/
    ├── health-check.md       # verify health, fix if safe
    ├── incident.md           # investigate alert, fix or escalate
    └── proactive.md          # look for improvements
```

### Example: Scheduled Health Check

```bash
#!/usr/bin/env bash
set -euo pipefail

# Assumes this script lives in agent/triggers/; run from agent/ so the
# relative paths below resolve regardless of the caller's working directory.
cd "$(dirname "${BASH_SOURCE[0]}")/.."

CONTEXT=$(./gather-context.sh)
GOAL=$(cat goals/health-check.md)

claude --print "
## Context
$CONTEXT

## Goal
$GOAL

## Constraints
- You can read any file in this repo
- You can run nomad/consul/systemctl commands
- You can edit Nix/HCL files and run deploy
- Before destructive actions, validate with nix build or nomad plan
- If uncertain about safety, output a summary and stop
"
```

### Context Gathering

```bash
#!/usr/bin/env bash
# Snapshot of cluster state for the agent prompt. No `set -e` on purpose:
# a failing tool should not hide the output of the others.

echo "=== Nomad Jobs ==="
nomad job status 2>&1

echo "=== Consul Members ==="
consul members 2>&1

echo "=== Failed Systemd Units ==="
systemctl --failed --no-pager

echo "=== Recent Errors (last hour) ==="
journalctl --since "1 hour ago" -p err --no-pager | tail -100
```

## Edge Cases and the Nix Promise

The NixOS promise mostly works, but sometimes doesn't:
- Mount option changes that require reboot
- Transition states where switch fails even if end state is correct
- Partial application where switch "succeeds" but change didn't take effect

This is where the agent adds value: it can detect when a change needs special handling, apply the appropriate strategy, and verify the change actually took effect.
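
Verifying that a change "actually took effect" can be as simple as comparing declared state with live state. A sketch for the mount-option case, assuming the wanted options are available as a comma-separated string; the function names and the `/mnt/media` path in the usage note are hypothetical:

```bash
# Succeeds only if every wanted option appears in the actual option list.
options_match() {
  local wanted="$1" actual="$2" opt
  local IFS=','
  for opt in $wanted; do
    case ",$actual," in
      *",$opt,"*) ;;                              # option is present
      *) echo "missing: $opt"; return 1 ;;
    esac
  done
  return 0
}

# Read the live options for a mount point (findmnt ships with util-linux).
live_options() {
  findmnt -no OPTIONS "$1"
}
```

After a switch, something like `options_match "noatime,vers=4.2" "$(live_options /mnt/media)" || echo "reboot required"` would catch the partial-application case, where the switch reports success but the mount is unchanged.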

## Capturing Knowledge

Document edge cases as they're discovered:

```markdown
## CIFS/NFS mount option changes
Switch may fail or succeed without effect. Strategy:
1. Try normal deploy
2. If mount options don't match after, reboot required
3. If deploy fails with mount busy, local switch + reboot
```

The agent reads this, uses it as context, but can also reason about novel situations.

## Path to CI/CD

Eventually: push to main triggers deploy via agent.

```
push to main
    |
build all configs (mechanical)
    |
agent: "what changed? is this safe to auto-deploy?"
    |
    ├─ clean change -> deploy, validate, done
    ├─ needs reboot -> deploy, schedule reboot, validate after
    ├─ risky change -> notify for manual approval
    └─ failed -> diagnose, retry with different strategy, or escalate
    |
post-deploy verification
    |
notification
```

The agent is the intelligence layer on top of mechanical CI/CD.
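
The branch in the middle of that flow can be sketched as a dispatcher that turns the agent's classification into the mechanical steps that follow. The classification labels and step names are assumptions chosen to mirror the diagram:

```bash
# Turn the agent's classification into the next pipeline steps.
next_steps() {
  case "$1" in
    clean)        echo "deploy validate" ;;
    needs-reboot) echo "deploy schedule-reboot validate" ;;
    risky)        echo "notify-for-approval" ;;
    failed)       echo "diagnose" ;;
    *)            echo "escalate" ;;   # anything unrecognised goes to a human
  esac
}
```

Defaulting unknown classifications to `escalate` keeps the mechanical layer conservative when the agent's output is not one of the expected labels.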

## Research: What Others Are Doing (January 2026)

### Existing Projects & Approaches

**n8n + Ollama Stack**
The most common pattern is n8n (workflow orchestration) + Ollama (local LLM). Webhooks from
monitoring (Netdata/Prometheus) trigger AI-assisted diagnosis. Philosophy from one practitioner:
"train an employee, not a bot": build trust, gradually grant autonomy.

Sources:
- [Virtualization Howto: Self-Healing Home Lab](https://www.virtualizationhowto.com/2025/10/how-i-built-a-self-healing-home-lab-that-fixes-itself/)
- [addROM: AI Agent for Homelab with n8n](https://addrom.com/unleashing-the-power-of-an-ai-agent-for-homelab-management-with-n8n/)

**Local Infrastructure Agent (Kelcode)**
Architecture: user question → tool router → query processor → LLM response. Connects to
Kubernetes, Prometheus, Harbor Registry.

Key insight: "The AI's output definition must be perfectly synchronized with the software
it's trying to use." Their K8s tool failed because the prompt generated kubectl commands
while the code expected structured data objects.

Uses phi4-mini via Ollama for routing decisions after testing multiple models.

Source: [Kelcode: Building a Homelab Agentic Ecosystem](https://kelcode.co.uk/building-a-homelab-agentic-ecosystem-part1/)

**nixai**
AI assistant specifically for NixOS. Searches NixOS Wiki, Nixpkgs Manual, nix.dev, Home Manager
docs. Diagnoses issues from piped logs/errors. Privacy-first: defaults to local Ollama.

Limited scope: a helper tool, not an autonomous agent. But it shows NixOS-specific tooling is possible.

Source: [NixOS Discourse: Introducing nixai](https://discourse.nixos.org/t/introducing-nixai-your-ai-powered-nixos-companion/65168)

**AI-Friendly Infrastructure (The Merino Wolf)**
Key insight: make infrastructure "AI-friendly" through structured documentation. CLAUDE.md
provides comprehensive context as "structured knowledge transfer."

Lessons:
- "Context investment pays dividends": comprehensive documentation is the most valuable asset
- Layered infrastructure design mirrors how both humans and AI think
- Rule-based guidance enforces safety practices automatically

Source: [The Merino Wolf: AI-Powered Homelab](https://themerinowolf.com/posts/ai-powered-homelab/)

**Claude Code Infrastructure Patterns**
Solves the "skills don't activate automatically" problem using hooks (UserPromptSubmit, PostToolUse)
plus skill-rules.json for auto-activation.

500-line rule with progressive disclosure: main file for high-level guidance, resource files
for deep dives. Claude loads materials incrementally as needed.

Persistence pattern across context resets using three-file structures (plan, context, tasks).

Born from 6 months managing TypeScript microservices (50k+ lines).

Source: [diet103/claude-code-infrastructure-showcase](https://github.com/diet103/claude-code-infrastructure-showcase)

### Patterns That Work

- Local LLMs (Ollama) + workflow orchestration (n8n) is the popular stack
- Start with read-only/diagnostic agents, gradually add write access
- Pre-approved command lists for safety (e.g., 50 validated bash commands max)
- Structured documentation as foundation: AI is only as good as its context
- Multi-step tool use: agent plans, then executes steps, observing results
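
The pre-approved command list is the easiest of these patterns to sketch. The entries in `ALLOWED` and the function name `is_allowed` are hypothetical; the point is that a proposed command must match a known prefix and must not contain shell metacharacters that could chain further commands:

```bash
# Hypothetical allowlist: one permitted command prefix per entry.
ALLOWED=(
  "nomad job status"
  "nomad job plan"
  "consul members"
  "systemctl --failed"
)

is_allowed() {
  local cmd="$1" prefix
  # Refuse anything that could chain, redirect, or substitute commands.
  case "$cmd" in
    *[\;\|\&\$\`\<\>]*|*$'\n'*) return 1 ;;
  esac
  for prefix in "${ALLOWED[@]}"; do
    # Accept the exact command, or the prefix followed by arguments.
    if [[ "$cmd" == "$prefix" || "$cmd" == "$prefix "* ]]; then
      return 0
    fi
  done
  return 1
}
```

A read-only agent would pass every proposed command through `is_allowed` before executing it, and write access would be granted later by extending the list.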

### What's Missing in the Space

- Nobody's doing true "emergent capabilities" yet; it's mostly tool routing
- Most projects are Kubernetes/Docker focused, not NixOS
- Few examples of proactive stewardship (our example #2)
- Limited examples of agents that understand the whole system coherently

### Community Skepticism

Reddit discussions show doubts about using LLM agents in production: LLMs can automate
specific tasks, but they frequently need human involvement for intricate decisions.

This validates our approach: tight loop with a competent human, not autonomous operation.

### The Gap We'd Fill

- NixOS-native agent leveraging declarative config as source of truth
- True emergence: not just tool routing, but reasoning about novel situations
- Proactive evolution, not just reactive troubleshooting
- Tight human loop with a competent operator

## Next Steps

1. Build trigger infrastructure (systemd timer, basic webhook handler)
2. Write context gathering scripts
3. Define goal prompts for common scenarios
4. Test with scheduled health checks
5. Iterate based on what works and what doesn't
6. Document edge cases as they're discovered
7. Gradually expand scope as confidence grows