AI-Powered Infrastructure Monitoring

A Saturday Afternoon Conversation

Martin Dobberstein
AI-Assisted Coding Community · March 2026

It Started With This

I want monitoring for my Kubernetes cluster. No Prometheus alerting rules. No AlertManager config. I don't want to maintain that. Instead: the AI interprets the metrics and alerts me conversationally — in plain language, via Telegram. Build it on my private Hetzner infra.

That's it. That was the entire brief.

What came back: 7 Architectural Decision Records, a full implementation plan, and a working system — in one afternoon.

The Process

What happens when you treat an AI like a capable junior developer?

The task: Build AI-powered infrastructure monitoring for a 4-node Kubernetes cluster.
Time budget: One Saturday afternoon. ~45 minutes of my active attention.

7 Decisions — All Reviewed

The AI wrote 7 Architectural Decision Records before writing any code. I reviewed every one.

#   Decision        AI Proposed                      I Pushed Back                  Final
1   Stack choice    PLG (Prometheus+Loki+Grafana)    ✅ Agreed — lighter than ELK    PLG
2   Where to run    Same cluster + extra node        ✅ Accepted risk for homelab    Same cluster
3   Deployment      Helm                             ❌ "Too much magic"             Kustomize
4   Alerting        AI only                          ❌ "What if AI is down?"        Hybrid
5   Sessions        Main session                     ❌ "Don't pollute my chat"      Isolated
6   State tracking  ConfigMap                        ❌ "Keep it simple"             memory/*.json
7   Access          SSH tunnel                       ❌ "Breaks AI access"           Public + TLS + auth

5 out of 7 decisions were changed through review. That's not overhead — that's the value.

ADR #4: Hybrid Alerting

The most important decision

Me: "I want monitoring. But I don't want to maintain alert rules."

Otto: "What if AI interprets the metrics instead of static rules?"

Me: "But what about obvious stuff — node down, disk full? That shouldn't depend on AI being available."

🚨 AlertManager (traditional)

  • Node down
  • Disk > 90%
  • Memory > 95%
  • CrashLoopBackOff > 5 min
  • Cert expiring < 3 days

Works even if AI is offline

🤖 AI Layer (interpretation)

  • "Disk grew 15% in 24h — unusual"
  • "Error rate up 3× but still low absolute"
  • "CPU spike correlates with deploy 2h ago"
  • Weekly trend analysis
  • Capacity planning

Context humans don't have time for

ADR #7: Access — When "Secure" Breaks the Use Case

Otto: "For Grafana access: NodePort + SSH tunnel is most secure."

Me: "Wait — how would the AI layer query Prometheus then?"

Otto: "...good point. It would need a tunnel for every query."

Me: "TLS + 32-character random password. That's ~192 bits of entropy. Good enough."

The lesson: The AI optimized for security. I optimized for the actual use case.
This is why human review matters — even when the AI's suggestion sounds reasonable.

Security layers: TLS (Let's Encrypt) → nginx basic auth (192-bit) → Grafana auth

AI access:

$ curl -u admin:$PASS https://prometheus.gutsch.it/api/v1/query?query=up
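The curl call translates directly into the few lines of Python the AI layer needs. The URL matches the one above; the credentials are placeholders, `count_up` is a hypothetical helper, and the response shape is the standard Prometheus /api/v1/query format:

```python
# Query Prometheus's HTTP API through the nginx basic-auth layer (ADR #7)
# and count healthy scrape targets. URL and credentials are placeholders.
import base64
import json
import urllib.parse
import urllib.request

def fetch_up(base_url: str, user: str, password: str) -> dict:
    """Run the `up` query against Prometheus (makes a network call)."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def count_up(response: dict) -> tuple[int, int]:
    """(healthy, total) scrape targets from an /api/v1/query `up` result."""
    results = response["data"]["result"]
    healthy = sum(1 for r in results if r["value"][1] == "1")
    return healthy, len(results)

# Parsing a canned response (shape matches the Prometheus API):
sample = {"status": "success",
          "data": {"resultType": "vector",
                   "result": [{"metric": {"instance": "10.1.0.4:9100"},
                               "value": [1700000000, "1"]},
                              {"metric": {"instance": "10.1.0.5:9100"},
                               "value": [1700000000, "0"]}]}}
print(count_up(sample))  # (1, 2)
```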

The Architecture

    flowchart TB
        subgraph Internet[" "]
            User["👤 Browser"]
            AI["🤖 OpenClaw AI"]
        end
        
        subgraph Ingress["☁️ Ingress — TLS + Basic Auth"]
            GrafanaURL["grafana.gutsch.it"]
            PromURL["prometheus.gutsch.it"]
        end
        
        subgraph Cluster["⎈ Kubernetes Cluster"]
            subgraph Monitoring["Monitoring Stack"]
                Prometheus[("📊 Prometheus")]
                Loki[("📜 Loki")]
                Grafana["📈 Grafana"]
                AlertManager["🚨 AlertManager"]
            end
            
            subgraph Collection["Collection Layer — DaemonSets"]
                Promtail["Promtail"]
                NodeExp["node-exporter"]
                KSM["kube-state-metrics"]
            end
            
            subgraph Apps["Applications"]
                CB["circleback-webhook\n/metrics"]
            end
        end
        
        User --> GrafanaURL --> Grafana
        AI --> PromURL --> Prometheus
        
        Prometheus -.->|scrape| NodeExp
        Prometheus -.->|scrape| KSM
        Prometheus -.->|scrape| CB
        Prometheus -->|rules| AlertManager
        
        Promtail -->|push| Loki
        Grafana -->|query| Prometheus
        Grafana -->|query| Loki
    

What Actually Got Built

The Stack (~3 GB RAM total)

Prometheus          Metrics collection, 20GB PVC
Loki                Log aggregation, 20GB PVC
Grafana             Dashboards, 1GB PVC
AlertManager        Critical alerts
node-exporter       DaemonSet × 4 nodes
Promtail            DaemonSet × 4 nodes
kube-state-metrics  K8s object metrics

The AI Layer

Hourly check   Haiku (~$0.01/day)
Weekly review  Sonnet (~$0.10/week)
Escalation     Threshold → analysis
Delivery       Telegram alerts

Debugging the AI Solved

  • Talos version mismatch (75 min)
  • Missing StorageClass
  • PodSecurity policy blocks
  • Scrape target discovery

10+ components wired together. ~45 min of my time. ~€6/month ongoing.

The Two-Model Insight

Me: "Will Haiku be clever enough to know what's 'interesting'?"

Otto: "Honest answer: probably not. Let's not ask it to judge — just check explicit thresholds."

☁️ Haiku (cheap, hourly): "disk > 70%? memory > 70%?"
    ├── yes ──▶ 🧠 Sonnet: "Analyze & alert" ──▶ 📱 Telegram
    └── no  ──▶ 😴 silent

Use the right model for the job. Cheap for routine, smart for analysis.
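The escalation logic above is small enough to sketch. Everything here is illustrative (the names, the 70% thresholds, the stubbed Sonnet call); the point is that the cheap step is deterministic and only breaches spend analysis tokens:

```python
# Two-model escalation sketch: a cheap hourly step checks explicit
# thresholds; only breaches escalate to the expensive model.
# Names and threshold values are illustrative assumptions.

THRESHOLDS = {"disk_pct": 70.0, "memory_pct": 70.0}

def cheap_check(metrics: dict) -> list[str]:
    """Return the breached thresholds (the hourly 'Haiku' step)."""
    return [k for k, limit in THRESHOLDS.items() if metrics.get(k, 0.0) > limit]

def analyze_and_alert(metrics: dict, breaches: list[str]) -> str:
    # Placeholder for the Sonnet analysis + Telegram delivery.
    return f"alert: {', '.join(breaches)}"

def monitor_tick(metrics: dict) -> str:
    breaches = cheap_check(metrics)
    if not breaches:
        return "silent"                          # 😴 no analysis tokens spent
    return analyze_and_alert(metrics, breaches)  # 🧠 escalate

print(monitor_tick({"disk_pct": 45.0, "memory_pct": 50.0}))  # silent
print(monitor_tick({"disk_pct": 82.0, "memory_pct": 50.0}))  # alert: disk_pct
```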

What It Cost

Item                               Cost
This session (~135k tokens, Opus)  ~$3-5
Hourly monitoring (Haiku)          ~$0.01/day
Weekly analysis (Sonnet)           ~$0.10/week
Infrastructure (CX22 node)         ~€6/month

Total: Less than a coffee for the AI work. €6.50/month ongoing.

Compare to: Datadog, New Relic, or your own time writing YAML.

It Actually Works

First automated check ran at 19:00, one hour after setup:

Metric        Result   Threshold  Status
Nodes         4/4 up   any down   ✅
Memory        27-49%   > 70%      ✅
P95 latency   9.5ms    > 500ms    ✅
Pod restarts  0        > 0        ✅

System monitoring itself. No human in the loop.

Fast Forward: Two Weeks Later

That was the build. But the real question is:

Does it actually help?

Let's see what happens when things go wrong. Live.

🔴 Live Demo

The Setup

What's running:

  • 4-node Kubernetes cluster (Hetzner)
  • Real websites: abi-2000.de, woerterkant.de
  • PLG monitoring stack
  • AI check every 2 minutes (demo mode)

What you're seeing:

  • 📱 Left: Chat with Otto (TUI)
  • 💬 Right: Telegram — automated alerts
  • AI reports arrive in plain English
  • No dashboards to interpret

┌──────────────────────┐    ┌──────────────────────┐
│  Terminal (TUI)      │    │  Telegram            │
│                      │    │                      │
│  "Otto, how's the    │    │  🤖 All clear:       │
│  cluster?"           │    │  4 nodes up, mem     │
│                      │    │  27-49%, latency     │
│  → Live conversation │    │  9.5ms, 0 restarts   │
│                      │    │                      │
└──────────────────────┘    └──────────────────────┘

🔴 Demo: Node Goes Down

Let's kill a worker node. For real — via Hetzner API:

$ hcloud server poweroff worker-node-2
Server worker-node-2 powered off

What the AI sees (next 2-min check):

The AI doesn't just detect — it explains what happened, assesses severity, and recommends action.

🔴 The Alert

What arrives on Telegram

🚨 Infrastructure Alert (10:39 AM)

Node Down: worker-node-2 (10.1.0.5) is unreachable

Severity: Medium — cluster degraded but functional. 2 remaining workers handling load.

Other metrics:
✅ Memory: 38-53% across remaining nodes
✅ P95 response time: 9.5ms (excellent)
✅ No error rate spikes

Recommended action: Check worker-node-2 via Hetzner console. If unresponsive, may need restart. Non-urgent if workloads are healthy.

Compare this to a PagerDuty alert: FIRING: NodeDown worker-node-2 [critical]
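On the delivery side, the Telegram Bot API's sendMessage method takes a chat_id and a text. A minimal sketch with placeholder token and chat ID (error handling omitted; only the request is built here, sending it is the commented step):

```python
# Build a Telegram Bot API sendMessage request for an alert like the one
# above. <BOT_TOKEN> and <CHAT_ID> are placeholders you'd get from
# @BotFather and your own chat.
import json
import urllib.request

def send_telegram(token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Build a sendMessage request; pass it to urlopen() to actually send."""
    return urllib.request.Request(
        f"https://api.telegram.org/bot{token}/sendMessage",
        data=json.dumps({"chat_id": chat_id, "text": text}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = send_telegram("<BOT_TOKEN>", "<CHAT_ID>",
                    "🚨 Node Down: worker-node-2 (10.1.0.5) is unreachable")
# urllib.request.urlopen(req) would deliver it
print(req.full_url.endswith("/sendMessage"))  # True
```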

Dashboard vs. AI Summary

Which one do you want at 3 AM?

📊 Grafana Dashboard

  • 14 panels of graphs
  • Need to know which metrics matter
  • Correlate spikes across panels
  • Interpret the numbers yourself
  • Context? Check 3 other dashboards

"Is this spike normal?" 🤷

🤖 AI Summary

"Worker-node-3 went offline at 11:23. Kubernetes rescheduled 4 pods. Memory on remaining nodes jumped to 68%. Disk is fine. No data loss — Prometheus PVC was retained. I'd bring the node back soon to avoid pressure."

Context + interpretation + recommendation

What This Could Mean at Work

The pattern scales:

Cost: ~$3/month for AI tokens (Haiku hourly + Sonnet weekly)
Value: An engineer who reads every metric, every hour, and never gets bored

This isn't replacing monitoring tools — it's adding an interpretation layer on top of what you already have.

Try This Yourself

What made it work:

  1. Real problem, real constraints — budget, preferences, existing infra
  2. Review and challenge — don't just accept; push back
  3. Use the pauses — AI thinking time is your time
  4. Trust the boring parts — YAML, configs, boilerplate
  5. Own the decisions — architecture is still your job

This isn't about AI replacing you. It's about amplifying your Saturday afternoon.

What's Next

For this project:

For you to explore:

The tools are ready. The patterns work. Go build something.

Questions?

Session log: ~/clawd/projects/infra-monitoring/SESSION-LOG.md
Live system: https://grafana.gutsch.it
This deck: https://openclaw-presentation.gutsch.it/observability

Thank You

Built during a Saturday afternoon of chores
with Claude (via OpenClaw)

Martin Dobberstein
@gutschilla