Martin Dobberstein
AI-Assisted Coding Community · March 2026
That's it. That was the entire brief.
What came back: 7 Architectural Decision Records, a full implementation plan, and a working system — in one afternoon.
What happens when you treat an AI like a capable junior developer?
The task: Build AI-powered infrastructure monitoring for a 4-node Kubernetes cluster.
Time budget: One Saturday afternoon. ~45 minutes of my active attention.
The AI wrote 7 Architectural Decision Records before writing any code. I reviewed every one.
| # | Decision | AI Proposed | I Pushed Back | Final |
|---|---|---|---|---|
| 1 | Stack choice | PLG (Prometheus+Loki+Grafana) | ✅ Agreed — lighter than ELK | PLG |
| 2 | Where to run | Same cluster + extra node | ✅ Accepted risk for homelab | Same cluster |
| 3 | Deployment | Helm | ❌ "Too much magic" | Kustomize |
| 4 | Alerting | AI only | ❌ "What if AI is down?" | Hybrid |
| 5 | Sessions | Main session | ❌ "Don't pollute my chat" | Isolated |
| 6 | State tracking | ConfigMap | ❌ "Keep it simple" | memory/*.json |
| 7 | Access | SSH tunnel | ❌ "Breaks AI access" | Public + TLS + auth |
5 out of 7 decisions were changed through review. That's not overhead — that's the value.
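Decision 6 deserves a concrete picture: state lives in flat JSON files instead of a ConfigMap. A minimal sketch of what that could look like (the file names and schema here are my assumptions, not from the session; ADR 6 only specifies `memory/*.json`):

```python
import json
import time
from pathlib import Path

STATE_DIR = Path("memory")  # hypothetical location for the *.json state files

def save_state(name: str, data: dict) -> Path:
    """Persist check state as a small JSON file instead of a ConfigMap."""
    STATE_DIR.mkdir(exist_ok=True)
    path = STATE_DIR / f"{name}.json"
    payload = {**data, "updated_at": time.time()}  # stamp each write
    path.write_text(json.dumps(payload, indent=2))
    return path

def load_state(name: str) -> dict:
    """Read state back; an absent file simply means 'no history yet'."""
    path = STATE_DIR / f"{name}.json"
    return json.loads(path.read_text()) if path.exists() else {}
```

No API server round-trips, no RBAC, trivially diffable: exactly the "keep it simple" trade the review pushed for.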
Me: "I want monitoring. But I don't want to maintain alert rules."
Otto: "What if AI interprets the metrics instead of static rules?"
Me: "But what about obvious stuff — node down, disk full? That shouldn't depend on AI being available."
- Static rules for the obvious failures: works even if the AI is offline
- AI interpretation on top: context humans don't have time for
Otto: "For Grafana access: NodePort + SSH tunnel is most secure."
Me: "Wait — how would the AI layer query Prometheus then?"
Otto: "...good point. It would need a tunnel for every query."
Me: "TLS + 32-character random password. That's ~192 bits of entropy. Good enough."
The lesson: The AI optimized for security. I optimized for the actual use case.
This is why human review matters — even when the AI's suggestion sounds reasonable.
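The entropy arithmetic checks out: 32 characters from a 64-symbol alphabet carry 32 × log2(64) = 192 bits. A sketch of generating such a password with Python's stdlib (my choice of generator, not necessarily what the session used):

```python
import math
import secrets

# 24 random bytes encode to exactly 32 URL-safe base64 characters,
# so the string carries 24 * 8 = 192 bits of entropy.
password = secrets.token_urlsafe(24)
assert len(password) == 32

# Same figure from the alphabet view: 64 symbols per character.
bits = 32 * math.log2(64)
print(bits)  # 192.0
```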
```mermaid
flowchart TB
subgraph Internet[" "]
User["👤 Browser"]
AI["🤖 OpenClaw AI"]
end
subgraph Ingress["☁️ Ingress — TLS + Basic Auth"]
GrafanaURL["grafana.gutsch.it"]
PromURL["prometheus.gutsch.it"]
end
subgraph Cluster["⎈ Kubernetes Cluster"]
subgraph Monitoring["Monitoring Stack"]
Prometheus[("📊 Prometheus")]
Loki[("📜 Loki")]
Grafana["📈 Grafana"]
AlertManager["🚨 AlertManager"]
end
subgraph Collection["Collection Layer — DaemonSets"]
Promtail["Promtail"]
NodeExp["node-exporter"]
KSM["kube-state-metrics"]
end
subgraph Apps["Applications"]
CB["circleback-webhook<br/>/metrics"]
end
end
User --> GrafanaURL --> Grafana
AI --> PromURL --> Prometheus
Prometheus -.->|scrape| NodeExp
Prometheus -.->|scrape| KSM
Prometheus -.->|scrape| CB
Prometheus -->|rules| AlertManager
Promtail -->|push| Loki
Grafana -->|query| Prometheus
Grafana -->|query| Loki
```
| Component | Role |
|---|---|
| Prometheus | Metrics collection, 20GB PVC |
| Loki | Log aggregation, 20GB PVC |
| Grafana | Dashboards, 1GB PVC |
| AlertManager | Critical alerts |
| node-exporter | DaemonSet × 4 nodes |
| Promtail | DaemonSet × 4 nodes |
| kube-state-metrics | K8s object metrics |
| AI Task | Implementation |
|---|---|
| Hourly check | Haiku (~$0.01/day) |
| Weekly review | Sonnet (~$0.10/week) |
| Escalation | Threshold → analysis |
| Delivery | Telegram alerts |
10+ components wired together. ~45 min of my time. ~€6/month ongoing.
Me: "Will Haiku be clever enough to know what's 'interesting'?"
Otto: "Honest answer: probably not. Let's not ask it to judge — just check explicit thresholds."
Use the right model for the job. Cheap for routine, smart for analysis.
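The split between dumb checks and smart analysis can be made very literal: the hourly pass evaluates explicit thresholds only, and the expensive model is invoked just when something breaches. A sketch under my own assumptions (the threshold values mirror the check table later in the deck; the escalation hook is hypothetical):

```python
# Hypothetical hourly check: explicit thresholds only, no judgment calls.
THRESHOLDS = {
    "memory_percent": 70.0,   # escalate above 70%
    "p95_latency_ms": 500.0,  # escalate above 500 ms
    "pod_restarts": 0,        # escalate on any restart
}

def check(metrics: dict) -> list[str]:
    """Return the list of breached thresholds; empty means all clear."""
    breaches = [
        f"{name}={metrics[name]} exceeds {limit}"
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    ]
    if metrics.get("nodes_down", 0) > 0:
        breaches.append(f"{metrics['nodes_down']} node(s) down")
    return breaches

# Only a non-empty result would trigger the smarter model for interpretation.
sample = {"memory_percent": 49.0, "p95_latency_ms": 9.5,
          "pod_restarts": 0, "nodes_down": 0}
print(check(sample))  # []
```

Haiku never has to decide what is "interesting"; it only reports which explicit lines were crossed.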
| Item | Cost |
|---|---|
| This session (~135k tokens, Opus) | ~$3-5 |
| Hourly monitoring (Haiku) | ~$0.01/day |
| Weekly analysis (Sonnet) | ~$0.10/week |
| Infrastructure (CX22 node) | ~€6/month |
Total: Less than a coffee for the AI work. €6.50/month ongoing.
Compare to: Datadog, New Relic, or your own time writing YAML.
First automated check ran at 19:00, one hour after setup:
| Metric | Result | Threshold | Status |
|---|---|---|---|
| Nodes | 4/4 up | any down | ✅ |
| Memory | 27-49% | > 70% | ✅ |
| P95 latency | 9.5ms | > 500ms | ✅ |
| Pod restarts | 0 | > 0 | ✅ |
System monitoring itself. No human in the loop.
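For reference, an hourly checker can pull these numbers straight from Prometheus over HTTP. The `/api/v1/query` instant-query endpoint is standard Prometheus; the base URL matches the architecture diagram, but the PromQL formulations below are my own sketch, not the session's:

```python
from urllib.parse import urlencode

PROM_URL = "https://prometheus.gutsch.it"  # from the architecture diagram

# PromQL roughly matching the four checks in the table above (my wording).
QUERIES = {
    "nodes_up": "sum(up{job='node-exporter'})",
    "memory_percent": ("100 * (1 - node_memory_MemAvailable_bytes"
                       " / node_memory_MemTotal_bytes)"),
    "pod_restarts": "sum(increase(kube_pod_container_status_restarts_total[1h]))",
}

def query_url(promql: str) -> str:
    """Build an instant-query URL for Prometheus's HTTP API."""
    return f"{PROM_URL}/api/v1/query?{urlencode({'query': promql})}"

print(query_url(QUERIES["nodes_up"]))
```

This is also why ADR 7 mattered: with an SSH tunnel, every one of these requests would have needed a tunnel first.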
That was the build. But the real question is:
Let's see what happens when things go wrong. Live.
Let's kill a worker node for real, via the Hetzner API.
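The Hetzner Cloud API documents a `POST /v1/servers/{id}/actions/poweroff` action for exactly this. A hedged sketch of pulling the plug (the server ID is a placeholder, and the session may have used the `hcloud` CLI instead):

```python
import os
import urllib.request

API = "https://api.hetzner.cloud/v1"
SERVER_ID = 12345678  # placeholder, not the real node's ID
TOKEN = os.environ.get("HCLOUD_TOKEN", "")

def poweroff_request(server_id: int) -> urllib.request.Request:
    """Build the POST that hard powers-off a Hetzner Cloud server."""
    return urllib.request.Request(
        f"{API}/servers/{server_id}/actions/poweroff",
        method="POST",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )

req = poweroff_request(SERVER_ID)
print(req.full_url)
# urllib.request.urlopen(req)  # uncomment with a real token and server ID
```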
The AI doesn't just detect — it explains what happened, assesses severity, and recommends action.
🚨 Infrastructure Alert (10:39 AM)
Node Down: worker-node-2 (10.1.0.5) is unreachable
Severity: Medium — cluster degraded but functional. 2 remaining workers handling load.
Other metrics:
✅ Memory: 38-53% across remaining nodes
✅ P95 response time: 9.5ms (excellent)
✅ No error rate spikes
Recommended action: Check worker-node-2 via Hetzner console. If unresponsive, may need restart. Non-urgent if workloads are healthy.
Compare this to a PagerDuty alert: `FIRING: NodeDown worker-node-2 [critical]`
"Is this spike normal?" 🤷
"Worker-node-3 went offline at 11:23. Kubernetes rescheduled 4 pods. Memory on remaining nodes jumped to 68%. Disk is fine. No data loss — Prometheus PVC was retained. I'd bring the node back soon to avoid pressure."
Context + interpretation + recommendation
Cost: well under $1/month for AI tokens (Haiku hourly + Sonnet weekly, per the cost table above)
Value: An engineer who reads every metric, every hour, and never gets bored
This isn't replacing monitoring tools — it's adding an interpretation layer on top of what you already have.
This isn't about AI replacing you. It's about amplifying your Saturday afternoon.
The tools are ready. The patterns work. Go build something.
Session log: ~/clawd/projects/infra-monitoring/SESSION-LOG.md
Live system: https://grafana.gutsch.it
This deck: https://openclaw-presentation.gutsch.it/observability
Built during a Saturday afternoon of chores
with Claude (via OpenClaw)
Martin Dobberstein
@gutschilla