AI-Powered Infrastructure Monitoring

A Saturday Afternoon Conversation

Martin Dobberstein
AI-Assisted Coding Community · March 2026

It Started With This

I want monitoring for my Kubernetes cluster. No Prometheus alerting rules. No AlertManager config. I don't want to maintain that. Instead: the AI interprets the metrics and alerts me conversationally — in plain language, via Telegram. Build it on my private Hetzner infra.

That's it. That was the entire brief.

What came back: 7 Architectural Decision Records, a full implementation plan, and a working system — in one afternoon.

The Process

What happens when you treat an AI like a capable junior developer?

The task: Build AI-powered infrastructure monitoring for a 4-node Kubernetes cluster.
Time budget: One Saturday afternoon. ~45 minutes of my active attention.

7 Decisions — All Reviewed

The AI wrote 7 Architectural Decision Records before writing any code. I reviewed every one.

#   Decision        AI Proposed                      I Pushed Back                  Final
1   Stack choice    PLG (Prometheus+Loki+Grafana)    ✅ Agreed — lighter than ELK    PLG
2   Where to run    Same cluster + extra node        ✅ Accepted risk for homelab    Same cluster
3   Deployment      Helm                             ❌ "Too much magic"             Kustomize
4   Alerting        AI only                          ❌ "What if AI is down?"        Hybrid
5   Sessions        Main session                     ❌ "Don't pollute my chat"      Isolated
6   State tracking  ConfigMap                        ❌ "Keep it simple"             memory/*.json
7   Access          SSH tunnel                       ❌ "Breaks AI access"           Public + TLS + auth

5 out of 7 decisions were changed through review. That's not overhead — that's the value.

ADR #4: Hybrid Alerting

The most important decision

Me: "I want monitoring. But I don't want to maintain alert rules."

Otto: "What if AI interprets the metrics instead of static rules?"

Me: "But what about obvious stuff — node down, disk full? That shouldn't depend on AI being available."

🚨 AlertManager (traditional)

  • Node down
  • Disk > 90%
  • Memory > 95%
  • CrashLoopBackOff > 5 min
  • Cert expiring < 3 days

Works even if AI is offline

🤖 AI Layer (interpretation)

  • "Disk grew 15% in 24h — unusual"
  • "Error rate up 3× but still low absolute"
  • "CPU spike correlates with deploy 2h ago"
  • Weekly trend analysis
  • Capacity planning

Context humans don't have time for

ADR #7: Access — When "Secure" Breaks the Use Case

Otto: "For Grafana access: NodePort + SSH tunnel is most secure."

Me: "Wait — how would the AI layer query Prometheus then?"

Otto: "...good point. It would need a tunnel for every query."

Me: "TLS + 32-character random password. That's ~192 bits of entropy. Good enough."

The lesson: The AI optimized for security. I optimized for the actual use case.
This is why human review matters — even when the AI's suggestion sounds reasonable.

Security layers: TLS (Let's Encrypt) → nginx basic auth (192-bit) → Grafana auth

AI access:

$ curl -u admin:$PASS https://prometheus.gutsch.it/api/v1/query?query=up
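The curl call translates directly into the few lines of Python the AI layer needs. The URL matches the one above; the credentials are placeholders, `count_up` is a hypothetical helper, and the response shape is the standard Prometheus /api/v1/query format:

```python
# Query Prometheus's HTTP API through the nginx basic-auth layer (ADR #7)
# and count healthy scrape targets. URL and credentials are placeholders.
import base64
import json
import urllib.parse
import urllib.request

def fetch_up(base_url: str, user: str, password: str) -> dict:
    """Run the `up` query against Prometheus (makes a network call)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def count_up(response: dict) -> tuple[int, int]:
    """(healthy, total) scrape targets from an /api/v1/query `up` result."""
    results = response["data"]["result"]
    healthy = sum(1 for r in results if r["value"][1] == "1")
    return healthy, len(results)

# Parsing a canned response (shape matches the Prometheus API):
sample = {"status": "success",
          "data": {"resultType": "vector",
                   "result": [{"metric": {"instance": "10.1.0.4:9100"},
                               "value": [1700000000, "1"]},
                              {"metric": {"instance": "10.1.0.5:9100"},
                               "value": [1700000000, "0"]}]}}
print(count_up(sample))  # (1, 2)
```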

The Architecture

    flowchart TB
        subgraph Internet[" "]
            User["👤 Browser"]
            AI["🤖 OpenClaw AI"]
        end
        
        subgraph Ingress["☁️ Ingress — TLS + Basic Auth"]
            GrafanaURL["grafana.gutsch.it"]
            PromURL["prometheus.gutsch.it"]
        end
        
        subgraph Cluster["⎈ Kubernetes Cluster"]
            subgraph Monitoring["Monitoring Stack"]
                Prometheus[("📊 Prometheus")]
                Loki[("📜 Loki")]
                Grafana["📈 Grafana"]
                AlertManager["🚨 AlertManager"]
            end
            
            subgraph Collection["Collection Layer — DaemonSets"]
                Promtail["Promtail"]
                NodeExp["node-exporter"]
                KSM["kube-state-metrics"]
            end
            
            subgraph Apps["Applications"]
                CB["circleback-webhook\n/metrics"]
            end
        end
        
        User --> GrafanaURL --> Grafana
        AI --> PromURL --> Prometheus
        
        Prometheus -.->|scrape| NodeExp
        Prometheus -.->|scrape| KSM
        Prometheus -.->|scrape| CB
        Prometheus -->|rules| AlertManager
        
        Promtail -->|push| Loki
        Grafana -->|query| Prometheus
        Grafana -->|query| Loki
    

What Actually Got Built

The Stack (~3 GB RAM total)

Prometheus          Metrics collection, 20GB PVC
Loki                Log aggregation, 20GB PVC
Grafana             Dashboards, 1GB PVC
AlertManager        Critical alerts
node-exporter       DaemonSet × 4 nodes
Promtail            DaemonSet × 4 nodes
kube-state-metrics  K8s object metrics

The AI Layer

Hourly check   Haiku (~$0.01/day)
Weekly review  Sonnet (~$0.10/week)
Escalation     Threshold → analysis
Delivery       Telegram alerts

Debugging the AI Solved

  • Talos version mismatch (75 min)
  • Missing StorageClass
  • PodSecurity policy blocks
  • Scrape target discovery

10+ components wired together. ~45 min of my time. ~€6/month ongoing.

The Two-Model Insight

Me: "Will Haiku be clever enough to know what's 'interesting'?"

Otto: "Honest answer: probably not. Let's not ask it to judge — just check explicit thresholds."

☁️ Haiku (cheap, hourly): "disk > 70%? memory > 70%?"
    ├── yes ──▶ 🧠 Sonnet: "Analyze & alert" ──▶ 📱 Telegram
    └── no  ──▶ 😴 silent

Use the right model for the job. Cheap for routine, smart for analysis.
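The escalation logic above is small enough to sketch. Everything here is illustrative (the names, the 70% thresholds, the stubbed Sonnet call); the point is that the cheap step is deterministic and only breaches spend analysis tokens:

```python
# Two-model escalation sketch: a cheap hourly step checks explicit
# thresholds; only breaches escalate to the expensive model.
# Names and threshold values are illustrative assumptions.

THRESHOLDS = {"disk_pct": 70.0, "memory_pct": 70.0}

def cheap_check(metrics: dict) -> list[str]:
    """Return the breached thresholds (the hourly 'Haiku' step)."""
    return [k for k, limit in THRESHOLDS.items() if metrics.get(k, 0.0) > limit]

def analyze_and_alert(metrics: dict, breaches: list[str]) -> str:
    # Placeholder for the Sonnet analysis + Telegram delivery.
    return f"alert: {', '.join(breaches)}"

def monitor_tick(metrics: dict) -> str:
    breaches = cheap_check(metrics)
    if not breaches:
        return "silent"                          # 😴 no analysis tokens spent
    return analyze_and_alert(metrics, breaches)  # 🧠 escalate

print(monitor_tick({"disk_pct": 45.0, "memory_pct": 50.0}))  # silent
print(monitor_tick({"disk_pct": 82.0, "memory_pct": 50.0}))  # alert: disk_pct
```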

What It Cost

Item                               Cost
This session (~135k tokens, Opus)  ~$3-5
Hourly monitoring (Haiku)          ~$0.01/day
Weekly analysis (Sonnet)           ~$0.10/week
Infrastructure (CX22 node)         ~€6/month

Total: Less than a coffee for the AI work. €6.50/month ongoing.

Compare to: Datadog, New Relic, or your own time writing YAML.

It Actually Works

First automated check ran at 19:00, one hour after setup:

Metric        Result   Threshold  Status
Nodes         4/4 up   any down   ✅
Memory        27-49%   > 70%      ✅
P95 latency   9.5ms    > 500ms    ✅
Pod restarts  0        > 0        ✅

System monitoring itself. No human in the loop.

Fast Forward: Two Weeks Later

That was the build. But the real question is:

Does it actually help?

Let's see what happens when things go wrong. Live.

🔴 Live Demo

The Setup

What's running:

  • 4-node Kubernetes cluster (Hetzner)
  • Real websites: abi-2000.de, woerterkant.de
  • PLG monitoring stack
  • AI check every 2 minutes (demo mode)

What you're seeing:

  • 📱 Left: Chat with Otto (TUI)
  • 💬 Right: Telegram — automated alerts
  • AI reports arrive in plain English
  • No dashboards to interpret

┌──────────────────────┐    ┌──────────────────────┐
│  Terminal (TUI)      │    │  Telegram            │
│                      │    │                      │
│  "Otto, how's the    │    │  🤖 All clear:       │
│  cluster?"           │    │  4 nodes up, mem     │
│                      │    │  27-49%, latency     │
│  → Live conversation │    │  9.5ms, 0 restarts   │
│                      │    │                      │
└──────────────────────┘    └──────────────────────┘

🔴 Demo: Node Goes Down

Let's kill a worker node. For real — via Hetzner API:

$ hcloud server poweroff worker-node-2
Server worker-node-2 powered off

What the AI sees (next 2-min check):

The AI doesn't just detect — it explains what happened, assesses severity, and recommends action.

🔴 The Alert

What arrives on Telegram

🚨 Infrastructure Alert (10:39 AM)

Node Down: worker-node-2 (10.1.0.5) is unreachable

Severity: Medium — cluster degraded but functional. 2 remaining workers handling load.

Other metrics:
✅ Memory: 38-53% across remaining nodes
✅ P95 response time: 9.5ms (excellent)
✅ No error rate spikes

Recommended action: Check worker-node-2 via Hetzner console. If unresponsive, may need restart. Non-urgent if workloads are healthy.

Compare this to a PagerDuty alert: FIRING: NodeDown worker-node-2 [critical]
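On the delivery side, the Telegram Bot API's sendMessage method takes a chat_id and a text. A minimal sketch with placeholder token and chat ID (error handling omitted; only the request is built here, sending it is the commented step):

```python
# Build a Telegram Bot API sendMessage request for an alert like the one
# above. <BOT_TOKEN> and <CHAT_ID> are placeholders you'd get from
# @BotFather and your own chat.
import json
import urllib.request

def send_telegram(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a sendMessage request; pass it to urlopen() to actually send."""
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=json.dumps({"chat_id": chat_id, "text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = send_telegram("<BOT_TOKEN>", "<CHAT_ID>",
                    "🚨 Node Down: worker-node-2 (10.1.0.5) is unreachable")
# urllib.request.urlopen(req) would deliver it
print(req.full_url.endswith("/sendMessage"))  # True
```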

Dashboard vs. AI Summary

Which one do you want at 3 AM?

📊 Grafana Dashboard

  • 14 panels of graphs
  • Need to know which metrics matter
  • Correlate spikes across panels
  • Interpret the numbers yourself
  • Context? Check 3 other dashboards

"Is this spike normal?" 🤷

🤖 AI Summary

"Worker-node-3 went offline at 11:23. Kubernetes rescheduled 4 pods. Memory on remaining nodes jumped to 68%. Disk is fine. No data loss — Prometheus PVC was retained. I'd bring the node back soon to avoid pressure."

Context + interpretation + recommendation

What This Could Mean at Work

The pattern scales:

Cost: ~$3/month for AI tokens (Haiku hourly + Sonnet weekly)
Value: An engineer who reads every metric, every hour, and never gets bored

This isn't replacing monitoring tools — it's adding an interpretation layer on top of what you already have.

Try This Yourself

What made it work:

  1. Real problem, real constraints — budget, preferences, existing infra
  2. Review and challenge — don't just accept; push back
  3. Use the pauses — AI thinking time is your time
  4. Trust the boring parts — YAML, configs, boilerplate
  5. Own the decisions — architecture is still your job

This isn't about AI replacing you. It's about amplifying your Saturday afternoon.

What's Next

For this project:

For you to explore:

The tools are ready. The patterns work. Go build something.

Questions?

Session log: ~/clawd/projects/infra-monitoring/SESSION-LOG.md
Live system: https://grafana.gutsch.it
This deck: https://openclaw-presentation.gutsch.it/observability

Thank You

Built during a Saturday afternoon of chores
with Claude (via OpenClaw)

Martin Dobberstein
@gutschilla