AI ideas.

2026-01-03 10:38:47 +00:00
parent d71408b567
commit 3b8cd7b742
1 changed files with 354 additions and 0 deletions
--- a/docs/HOMELAB_AGENT.md
+++ b/docs/HOMELAB_AGENT.md
@@ -0,0 +1,354 @@
+# ABOUTME: Vision and design document for an AI agent that manages the homelab cluster.
+# ABOUTME: Covers emergent capabilities, technical approach, and implementation strategy.
+
+# Homelab Agent: Vision and Design
+
+## The Core Idea
+
+Not automation. Not "LLM-powered autocomplete for infrastructure." Emergent capabilities.
+
+The same shift Claude Code brought to programming: you describe outcomes, it handles implementation. You become a "product manager" for your infrastructure instead of an "infrastructure engineer."
+
+The cluster stops being infrastructure you manage and becomes an environment that responds to intent.
+
+## What Makes This Different From Automation
+
+**Automation**: "If disk > 90%, delete old logs"
+
+**Emergent**: "Disk is 95% full. What's using space? ...Postgres WAL. Can I safely checkpoint? Last backup was 2h ago, load is low, yes. Running checkpoint... down to 60%. I should note that WAL retention might need tuning."
+
+The difference:
+- Novel problem-solving (not pattern matching)
+- Contextual safety reasoning
+- Adaptation to the specific situation
+- Learning for the future
+
+## Examples of Genuinely New Capabilities
+
+### 1. Intent-Driven Infrastructure
+
+> "I want to run Synapse for Matrix"
+
+Agent figures out: Nomad job spec, storage location, Traefik routing, TLS, Consul registration, backup config. Creates it, deploys it, validates it.
+
+You don't need to know Nomad job format or Traefik labels. You describe the outcome.
+
+### 2. Proactive Evolution (The Best One)
+
+The agent doesn't wait for problems or instructions:
+
+- "Synapse 1.98 has a security fix. I've tested it in a local build, no config changes needed. Deploy?"
+- "Your NFS server has been primary for 47 days. Want me to test failover to make sure it still works?"
+- "I noticed the *arr services all have the same resource limits, but Sonarr consistently uses more. Adjusted."
+- "There's a new NixOS module for Traefik that simplifies your current setup. Here's the diff."
+
+Not monitoring. Stewardship.
+
+### 3. The Cluster Has Opinions
+
+> You: "I want to add Plex"
+>
+> Agent: "You already have Jellyfin, which does the same thing. If you want Plex specifically for its mobile app, I can set it up to share Jellyfin's media library. Or if you want to switch entirely, I can migrate watch history. What's the actual goal?"
+
+Not a command executor. A collaborator that understands your system.
+
+### 4. "Bring This Into the Cluster"
+
+You're running something in Docker on a random VM:
+
+> "Bring this into the cluster"
+
+Agent: connects, inspects, figures out dependencies, writes Nomad job, sets up storage, migrates data, routes traffic, validates, decommissions old instance.
+
+You didn't need to know how.
+
+### 5. Cross-Cutting Changes
+
+> "Add authentication to all public-facing services"
+
+Agent identifies which services are public, understands the auth setup (Pocket ID + traefik-oidc-auth), modifies each service's config, tests that auth works.
+
+Single coherent change across everything, without knowing every service yourself.
+
+### 6. Emergent Debugging
+
+Not runbooks. Actual reasoning:
+
+> "The blog is slow"
+
+Agent checks service health (fine), node resources (fine), network latency (fine), database queries (ah, slow query), traces to missing index, adds index, validates performance improved.
+
+Solved a problem nobody wrote a runbook for.
+
+### 7. Architecture Exploration
+
+> "What if we added a third Nomad server for better quorum?"
+
+Agent reasons about current topology, generates the config, identifies what would change, shows blast radius. Thinking partner for infrastructure decisions.
+
+## Why Nix Makes This Possible
+
+Traditional infrastructure: state is scattered and implicit. Nix: everything is declared.
+
+- **Full system understanding** - agent can read the flake and understand EVERYTHING
+- **Safe experimentation** - build without deploying, rollback trivially
+- **Reproducibility** - "what was the state 3 days ago?" can be rebuilt exactly
+- **Composition** - agent can generate valid configs that compose correctly
+- **The ecosystem** - 80k+ packages, thousands of modules the agent can navigate
+
+> "I want a VPN that works with my phone"
+
+Agent knows Nix, finds WireGuard module, configures it, generates QR codes, opens firewall. You didn't learn WireGuard.
+
+## The Validation Pattern
+
+Just like code has linting and tests, infrastructure actions need validation:
+
+| Phase | Code | Infrastructure |
+|-------|------|----------------|
+| Static | Lint, typecheck | Config parses, secrets exist, no port conflicts |
+| Pre-flight | — | Cluster healthy, dependencies up, quorum intact |
+| Post-action | Unit tests | Service started, health checks pass, metrics flowing |
+| Invariants | CI | NFS mounted, Consul quorum, replication current |
+
+The agent can take actions confidently because it validates outcomes.
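+
+A minimal sketch of how these phases could wrap a single action, assuming each phase is a shell function or command supplied by the caller. `run_validated` and the function names in the usage comment are hypothetical, not part of any existing tooling.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical sketch: run one action through the phases in the table above.
+# Each argument is the name of a command or function provided by the caller.
+run_validated() {
+  local static_check=$1 preflight=$2 action=$3 post_check=$4 invariants=$5
+
+  "$static_check" || { echo "static check failed; nothing was changed"; return 1; }
+  "$preflight"    || { echo "pre-flight failed; nothing was changed"; return 2; }
+  "$action"       || { echo "action failed; escalate"; return 3; }
+  "$post_check"   || { echo "post-action check failed; roll back or escalate"; return 4; }
+  "$invariants"   || { echo "invariant broken after action; escalate"; return 5; }
+  echo "action validated"
+}
+
+# Example wiring (all five names are placeholders):
+#   run_validated check_config check_cluster deploy_job check_service check_invariants
+```
+
+The distinct return codes let a caller tell "nothing was changed" (1 and 2) apart from "something changed and needs attention" (3 to 5).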
+
+## The Reality Check
+
+Some of this works today. Some would fail spectacularly. Some would fail silently and idiotically. Just like Claude Code for coding.
+
+Therefore:
+- Tight loop with the human operator
+- Assume the human is competent and knowledgeable
+- Agent amplifies expertise, doesn't replace it
+- Escalate when uncertain
+
+## Technical Approach
+
+### Runtime: Claude Code (Not Agent SDK)
+
+Two options were considered:
+
+| Tool | Works with Pro/Max subscription | Works with API billing |
+|------|---------------------------------|------------------------|
+| Claude Code CLI | Yes | Yes |
+| Claude Agent SDK | No | Yes (required) |
+
+Claude Code can use an existing Max subscription. The Agent SDK requires separate API billing.
+
+For v1, use Claude Code as the runtime:
+
+```bash
+claude --print "prompt" \
+  --allowedTools "Bash,Read,Edit" \
+  --permission-mode acceptEdits
+```
+
+Graduate to the Agent SDK later if Claude Code's limitations get in the way.
+
+### Trigger Architecture
+
+On-demand Claude Code sessions, triggered by:
+- **Timer** - periodic health/sanity check
+- **Alert** - alertmanager webhook
+- **Event** - systemd OnFailure, consul watch
+- **Manual** - invoke with a goal
+
+Each trigger provides context and a goal. Claude Code does the rest.
+
+### Structure
+
+```
+agent/
+├── triggers/
+│   ├── scheduled-check       # systemd timer
+│   ├── on-alert              # webhook handler
+│   └── on-failure            # systemd OnFailure target
+├── gather-context.sh         # snapshot of cluster state
+└── goals/
+    ├── health-check.md       # verify health, fix if safe
+    ├── incident.md           # investigate alert, fix or escalate
+    └── proactive.md          # look for improvements
+```
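+
+One way the triggers could pick a goal file, as a rough sketch. `goal_for_trigger` is a hypothetical helper, and routing both alerts and events to `goals/incident.md` is an assumption; only the file names come from the layout above.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical mapping from trigger type to goal file (the pairings are assumptions).
+goal_for_trigger() {
+  case "$1" in
+    timer)       echo "goals/health-check.md" ;;
+    alert|event) echo "goals/incident.md" ;;
+    manual)
+      # A manual run must name its own goal file.
+      [ -n "${2:-}" ] || { echo "manual trigger needs a goal file" >&2; return 1; }
+      echo "$2" ;;
+    *)
+      echo "unknown trigger: $1" >&2
+      return 1 ;;
+  esac
+}
+
+# Example: GOAL=$(cat "$(goal_for_trigger timer)")
+```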
+
+### Example: Scheduled Health Check
+
+```bash
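+#!/usr/bin/env bash
+# Assumes this script lives in agent/triggers/; run from agent/ so the relative paths resolve.
+set -euo pipefail
+cd "$(dirname "$0")/.."
+
+CONTEXT=$(./gather-context.sh)
+GOAL=$(cat goals/health-check.md)
+
+# Tool flags match the v1 invocation shown earlier; without them a --print run cannot act.
+claude --print "
+## Context
+$CONTEXT
+
+## Goal
+$GOAL
+
+## Constraints
+- You can read any file in this repo
+- You can run nomad/consul/systemctl commands
+- You can edit Nix/HCL files and run deploy
+- Before destructive actions, validate with nix build or nomad plan
+- If uncertain about safety, output a summary and stop
+" \
+  --allowedTools "Bash,Read,Edit" \
+  --permission-mode acceptEdits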
+```
+
+### Context Gathering
+
+```bash
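+#!/usr/bin/env bash
+# Fold stderr into the snapshot so an unreachable Nomad or Consul shows up as context.
+exec 2>&1
+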
+echo "=== Nomad Jobs ==="
+nomad job status
+
+echo "=== Consul Members ==="
+consul members
+
+echo "=== Failed Systemd Units ==="
+systemctl --failed
+
+echo "=== Recent Errors (last hour) ==="
+journalctl --since "1 hour ago" -p err --no-pager | tail -100
+```
+
+## Edge Cases and the Nix Promise
+
+The NixOS promise (declare the system, switch, and get exactly that state) mostly holds, but not always:
+- Mount option changes that require reboot
+- Transition states where switch fails even if end state is correct
+- Partial application where switch "succeeds" but change didn't take effect
+
+This is where the agent adds value: it can detect when a change needs special handling, apply the appropriate strategy, and verify the change actually took effect.
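+
+The "verify the change actually took effect" step could look roughly like this for mount options. `options_match` is a hypothetical helper, and the path and options in the usage comment are placeholders. The kernel may report options in a normalized form, so a real check would probably need per-option handling.
+
+```bash
+#!/usr/bin/env bash
+# Succeeds only if every option in the expected comma-separated list
+# also appears in the actual comma-separated list.
+options_match() {
+  local expected=$1 actual=$2 opt
+  local IFS=,
+  for opt in $expected; do
+    case ",$actual," in
+      *",$opt,"*) ;;   # option is present
+      *) echo "missing mount option: $opt"; return 1 ;;
+    esac
+  done
+}
+
+# Example (placeholder path and options):
+#   options_match "rw,vers=4.2" "$(findmnt -n -o OPTIONS /mnt/media)" || echo "reboot required"
+```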
+
+## Capturing Knowledge
+
+Document edge cases as they're discovered:
+
+```markdown
+## CIFS/NFS mount option changes
+Switch may fail or succeed without effect. Strategy:
+1. Try normal deploy
+2. If mount options don't match after, reboot required
+3. If deploy fails with mount busy, local switch + reboot
+```
+
+The agent reads this, uses it as context, but can also reason about novel situations.
+
+## Path to CI/CD
+
+Eventually: push to main triggers deploy via agent.
+
+```
+push to main
+     |
+build all configs (mechanical)
+     |
+agent: "what changed? is this safe to auto-deploy?"
+     |
+├─ clean change -> deploy, validate, done
+├─ needs reboot -> deploy, schedule reboot, validate after
+├─ risky change -> notify for manual approval
+└─ failed -> diagnose, retry with different strategy, or escalate
+     |
+post-deploy verification
+     |
+notification
+```
+
+The agent is the intelligence layer on top of mechanical CI/CD.
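+
+The branch point in the flow above could be expressed as a small dispatch on the agent's verdict. This is a sketch only: `next_step` and the step names are hypothetical, and sending an unrecognized verdict to manual approval is an assumption in the spirit of "escalate when uncertain".
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical dispatch: map the agent's verdict to the ordered steps to run.
+next_step() {
+  case "$1" in
+    clean)        echo "deploy validate" ;;
+    needs-reboot) echo "deploy schedule-reboot validate" ;;
+    risky)        echo "notify-for-approval" ;;
+    failed)       echo "diagnose retry-or-escalate" ;;
+    *)            echo "notify-for-approval" ;;   # unknown verdict: hand off to the human
+  esac
+}
+
+# Example: for step in $(next_step needs-reboot); do echo "running $step"; done
+```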
+
+## Research: What Others Are Doing (January 2026)
+
+### Existing Projects & Approaches
+
+**n8n + Ollama Stack**
+The most common pattern is n8n (workflow orchestration) + Ollama (local LLM). Webhooks from
+monitoring (Netdata/Prometheus) trigger AI-assisted diagnosis. Philosophy from one practitioner:
+"train an employee, not a bot" — build trust, gradually grant autonomy.
+
+Sources:
+- [Virtualization Howto: Self-Healing Home Lab](https://www.virtualizationhowto.com/2025/10/how-i-built-a-self-healing-home-lab-that-fixes-itself/)
+- [addROM: AI Agent for Homelab with n8n](https://addrom.com/unleashing-the-power-of-an-ai-agent-for-homelab-management-with-n8n/)
+
+**Local Infrastructure Agent (Kelcode)**
+Architecture: user question → tool router → query processor → LLM response. Connects to
+Kubernetes, Prometheus, Harbor Registry.
+
+Key insight: "The AI's output definition must be perfectly synchronized with the software
+it's trying to use." Their K8s tool failed because the prompt generated kubectl commands
+while the code expected structured data objects.
+
+Uses phi4-mini via Ollama for routing decisions after testing multiple models.
+
+Source: [Kelcode: Building a Homelab Agentic Ecosystem](https://kelcode.co.uk/building-a-homelab-agentic-ecosystem-part1/)
+
+**nixai**
+AI assistant specifically for NixOS. Searches NixOS Wiki, Nixpkgs Manual, nix.dev, Home Manager
+docs. Diagnoses issues from piped logs/errors. Privacy-first: defaults to local Ollama.
+
+Limited scope — helper tool, not autonomous agent. But shows NixOS-specific tooling is possible.
+
+Source: [NixOS Discourse: Introducing nixai](https://discourse.nixos.org/t/introducing-nixai-your-ai-powered-nixos-companion/65168)
+
+**AI-Friendly Infrastructure (The Merino Wolf)**
+Key insight: make infrastructure "AI-friendly" through structured documentation. CLAUDE.md
+provides comprehensive context — "structured knowledge transfer."
+
+Lessons:
+- "Context investment pays dividends" — comprehensive documentation is the most valuable asset
+- Layered infrastructure design mirrors how both humans and AI think
+- Rule-based guidance enforces safety practices automatically
+
+Source: [The Merino Wolf: AI-Powered Homelab](https://themerinowolf.com/posts/ai-powered-homelab/)
+
+**Claude Code Infrastructure Patterns**
+Solves the "skills don't activate automatically" problem using hooks (UserPromptSubmit, PostToolUse)
+plus a skill-rules.json file for auto-activation.
+
+500-line rule with progressive disclosure: main file for high-level guidance, resource files
+for deep dives. Claude loads materials incrementally as needed.
+
+Persistence pattern across context resets using three-file structures (plan, context, tasks).
+
+Born from 6 months managing TypeScript microservices (50k+ lines).
+
+Source: [diet103/claude-code-infrastructure-showcase](https://github.com/diet103/claude-code-infrastructure-showcase)
+
+### Patterns That Work
+
+- Local LLMs (Ollama) + workflow orchestration (n8n) is the popular stack
+- Start with read-only/diagnostic agents, gradually add write access
+- Pre-approved command lists for safety (e.g., 50 validated bash commands max)
+- Structured documentation as foundation — AI is only as good as its context
+- Multi-step tool use: agent plans, then executes steps, observing results
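+
+A pre-approved command list could be checked along these lines. This is a sketch, not a sandbox: `ALLOWED_PREFIXES`, its entries, and `is_allowed` are all hypothetical, and prefix matching plus a metacharacter filter is only one possible policy.
+
+```bash
+#!/usr/bin/env bash
+# Hypothetical allowlist of command prefixes the agent may run (entries are examples only).
+ALLOWED_PREFIXES=(
+  "nomad job status"
+  "consul members"
+  "systemctl --failed"
+  "journalctl "
+)
+
+# Succeeds if the candidate command starts with an approved prefix.
+is_allowed() {
+  local cmd=$1 prefix
+  local meta='[;|&$`<>]'
+  # Reject shell metacharacters so an approved prefix cannot smuggle in a second command.
+  [[ $cmd == *$meta* ]] && return 1
+  for prefix in "${ALLOWED_PREFIXES[@]}"; do
+    [[ $cmd == "$prefix"* ]] && return 0
+  done
+  return 1
+}
+
+# Example: is_allowed "$proposed" && echo "ok to run" || echo "escalate"
+```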
+
+### What's Missing in the Space
+
+- None of the projects surveyed here shows true "emergent capabilities" yet; they mostly do tool routing
+- Most projects are Kubernetes/Docker focused, not NixOS
+- Few examples of proactive stewardship (our example #2)
+- Limited examples of agents that understand the whole system coherently
+
+### Community Skepticism
+
+From Reddit discussions: commenters doubt that LLM agents are ready for production. LLMs can
+automate specific tasks, but they often need a human for complex decisions.
+
+This validates our approach: tight loop with a competent human, not autonomous operation.
+
+### The Gap We'd Fill
+
+- NixOS-native agent leveraging declarative config as source of truth
+- True emergence — not just tool routing, but reasoning about novel situations
+- Proactive evolution, not just reactive troubleshooting
+- Tight human loop with a competent operator
+
+## Next Steps
+
+1. Build trigger infrastructure (systemd timer, basic webhook handler)
+2. Write context gathering scripts
+3. Define goal prompts for common scenarios
+4. Test with scheduled health checks
+5. Iterate based on what works and what doesn't
+6. Document edge cases as they're discovered
+7. Gradually expand scope as confidence grows