Guide · Intermediate · February 21, 2026 · 10 min read · 15 min hands-on

Homelab Monitoring in 2026: Prometheus vs Uptime Kuma vs n8n + AI

A real comparison of three approaches to monitoring 10-30 Docker containers on 1-3 servers. What each tool actually costs, what it misses, and when to pick it.

homelab · monitoring · prometheus · uptime-kuma · n8n · docker · devops

Running 15 Docker containers on a single box is fine until it isn't. You don't find out something broke because you were watching dashboards — you find out because something else stopped working, or a Discord message bounced, or you tried to load a page and got a timeout. By then whatever failed has been down for hours.

I wanted to know before that. Not "the server is unreachable" level alerting. I wanted to know that my n8n container was restarting every 20 minutes, that inodes were at 78% and trending up, that memory pressure had been climbing for three days before anything actually crashed.

There are real tools for this. I tried two of them before building the third.


Prometheus + Grafana + Node Exporter + Alertmanager

This is the infrastructure world's answer to monitoring. It is also correct. If you work in production DevOps you've used some version of this stack and it genuinely is the right answer at scale.

The pitch is solid: Prometheus scrapes metrics from exporters on a configurable interval, stores them as time-series data with per-second granularity, and provides a query language (PromQL) for slicing and aggregating everything. Grafana sits on top for dashboards. Node Exporter runs on each machine and exposes system-level metrics — CPU, memory, disk, network, filesystem stats, temperature on supported hardware. Alertmanager handles the routing, grouping, deduplication, and delivery of alerts.

The community around this stack is massive. There are exporters for virtually everything — databases, container runtimes, JVM apps, network hardware, cloud providers. If you want to monitor it, there is probably an exporter already written. Grafana has thousands of community dashboards; you can import a Node Exporter dashboard in about 30 seconds and immediately have something beautiful showing you the last 24 hours of CPU and memory per host.

Kubernetes is where this stack genuinely shines. The Prometheus Operator, service monitors, kube-state-metrics — the integration is deep and the tooling is mature. If you're running k8s at any scale, Prometheus is not optional.

For a homelab with two servers and 15 containers, I found the calculus different.

The minimum setup is four additional containers: Prometheus, Grafana, Node Exporter (one per monitored host), and Alertmanager. Prometheus alone needs ~500MB of RAM with default retention settings; more if you push retention out to 30 days or start adding many targets. That is not nothing on a machine that is already running your actual workloads.

Configuration is real work. prometheus.yml needs scrape targets. Alerting rules live in separate YAML files with a specific syntax. Grafana needs a datasource configured, then dashboards either built by hand or imported and then modified to match your specific setup. Alertmanager has its own configuration file with routes, receivers, group timing, inhibit rules. None of it is hard, exactly, but it is a lot of moving parts to maintain — and every update to any of the four containers is a chance for something to drift.
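
To make that concrete, the scrape side of a minimal setup looks roughly like this. Hostnames, ports, and paths are placeholders, not a drop-in config:

# Minimal prometheus.yml sketch: two Node Exporter targets, a rule file glob,
# and an Alertmanager endpoint. Hostnames and paths are placeholders.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

rule_files:
  - /etc/prometheus/rules/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["server1:9100", "server2:9100"]  # node_exporter on each host
EOF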

I ran this stack for six months. The dashboards were genuinely beautiful. I am not being sarcastic — there is something satisfying about a well-configured Grafana dashboard showing you 30 days of memory pressure across two hosts. I checked them about once a week.

The alerts were where it fell apart for me. The first alert I configured was a CPU threshold — anything above 85% for 5 minutes. My backup cron job runs at 2AM and pegs a core for 12 minutes. So I got paged at 2AM, realized it was the backup, and suppressed the alert with a time-based silence in Alertmanager. Then I forgot to remove the silence. Then the alert never fired again when the CPU actually spiked during a real incident two months later.

(This is a very common alerting anti-pattern and there are correct ways to handle it — recording rules, inhibition, better alert semantics. I know. The point is that getting alerting right in Alertmanager for a single-person homelab requires real tuning effort, and the feedback loop for validating that your alerts actually fire when they should is long and painful.)
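
For scale, that first CPU alert is only a handful of lines of rule file. The YAML is not the hard part; the semantics are. A sketch, assuming Node Exporter metric names and the 85%-for-5-minutes numbers from above:

# The 85%-for-5-minutes CPU alert as a rule file sketch. Assumes Node Exporter
# metrics; thresholds and labels are examples, not a tuned production rule.
cat > /etc/prometheus/rules/cpu.yml <<'EOF'
groups:
  - name: host-cpu
    rules:
      - alert: HighCpuUsage
        # average non-idle CPU across all cores, over a 5-minute window
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU above 85% for 5 minutes on {{ $labels.instance }}"
EOF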

The other alert failure mode is the opposite: too noisy. Container restarts are interesting when they're unexpected and noise when they're expected. Distinguishing those cases in PromQL requires knowing which containers restart normally and which don't, encoding that as rules, and keeping those rules updated as your stack changes.

I am not saying Prometheus alerting is bad. For a team with dedicated on-call, runbooks, and time to tune alerting rules, it is the right tool. For one person checking a homelab, the overhead of maintaining the alerting layer started to exceed the value it provided.

When Prometheus is the right choice: You're running more than three servers. You need per-second metric granularity for performance debugging. You want historical dashboards spanning months. You're running Kubernetes. You have time to tune alert rules and maintain the stack.


Uptime Kuma

Single container. About 50MB of RAM. Beautiful UI. Dead simple to set up.

Uptime Kuma monitors endpoints — HTTP(S), TCP, DNS, ping, Docker containers via the Docker socket, and a few other protocols. You add a monitor, configure a check interval (as low as 20 seconds), wire up notification channels (Discord, Slack, Telegram, PagerDuty, email, ~90 others), and you're done in 10 minutes.
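
The whole install is one container, along these lines, using the upstream defaults for image, port, and data volume (the Docker socket mount is only needed if you want container status monitors):

# Uptime Kuma as one container. Image, port, and data volume are the upstream
# defaults; mount the Docker socket only for container status monitors.
docker run -d --name uptime-kuma \
  --restart unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  -v /var/run/docker.sock:/var/run/docker.sock \
  louislam/uptime-kuma:1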

The status page feature is genuinely useful for public-facing services. If you host anything that other people depend on, a Kuma status page gives them a single place to check when things are slow.

For Docker monitoring specifically, Kuma connects to the socket and can tell you which containers are running, stopped, or restarting. That alone is worth having — it catches the silent restart loops that don't obviously surface anywhere else.

What it doesn't do: system metrics. There is no CPU, memory, disk, inode, or temperature data. Kuma tells you "the service is responding" or "the service is not responding." It does not tell you that memory has been climbing for 72 hours and the service is one OOM kill away from going down.

No trend analysis either. Each check is independent. You get up/down status, response time, and cert expiry. That is the designed scope, and within that scope it executes well.

There is no AI interpretation layer. "Your Nginx container is down" is the alert. Diagnosing why is entirely on you.

When Uptime Kuma is the right choice: You care primarily about endpoint reachability. You want a status page. You need external monitoring (Kuma can monitor URLs from outside your network if exposed). Fast setup, minimal overhead, zero maintenance. Excellent tool for what it does.


n8n + SSH + AI

This is what I built. Two SSH commands, a Code node that parses the output, an LLM that compares current state against seven days of stored history, and a Discord message with a summary and any fix commands worth knowing.

The data collection is more thorough than you'd expect from two SSH commands. One ssh command runs a compound shell script that collects real CPU utilization % (via /proc/stat delta sampling with a 1-second sleep — actual core usage, not just load average), load average, all mounted filesystems via df -P (with tmpfs/udev filtered out), network I/O from /proc/net/dev (RX/TX bytes and error counts), top 5 processes by CPU, zombie process count, failed systemd services, established network connections, memory, swap, inode usage per filesystem, and temperature. A second command runs docker stats --no-stream and docker ps for per-container CPU, memory, restart count, and health status — plus docker system df for ecosystem health: total image size, volume size, build cache, reclaimable space, dangling image count, and unused volume count. Over 30 metrics from two SSH connections, in about two seconds per host.
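
To give a flavor of the collection side, here is a trimmed sketch of a few of those collectors. It is not the workflow's exact script, just the same ideas in minimal form:

#!/usr/bin/env bash
# Trimmed sketch of a few of the collectors described above -- not the
# workflow's exact script, just the same ideas in minimal form.

# Real CPU utilization: two /proc/stat samples one second apart, then the
# busy-time delta over the total delta (load average alone hides this).
read -r _cpu u1 n1 s1 i1 w1 q1 sq1 st1 _rest < /proc/stat
sleep 1
read -r _cpu u2 n2 s2 i2 w2 q2 sq2 st2 _rest < /proc/stat
busy=$(( (u2 + n2 + s2 + q2 + sq2 + st2) - (u1 + n1 + s1 + q1 + sq1 + st1) ))
idle=$(( (i2 + w2) - (i1 + w1) ))
echo "cpu_pct=$(( 100 * busy / (busy + idle) ))"

# Load average, memory, swap
cat /proc/loadavg
free -m

# Disk and inode usage per real filesystem (tmpfs/devtmpfs filtered out)
df -P -x tmpfs -x devtmpfs
df -Pi -x tmpfs -x devtmpfs

# Top 5 processes by CPU, zombie count, failed systemd units
ps -eo pid,comm,%cpu --sort=-%cpu | head -n 6
echo "zombies=$(ps -eo stat= | grep -c '^Z')"
echo "failed_units=$(systemctl --failed --no-legend | wc -l)"

# Established TCP connections and per-interface byte/error counters
ss -t state established | tail -n +2 | wc -l
cat /proc/net/dev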

The Code node normalizes that into a structured JSON object with 16 fields, written to Google Sheets as one row per run. That's the history. Seven days of daily snapshots per host.

Then the LLM gets the current snapshot plus the seven-day history. Not the raw data — a compact summary: "disk usage on /var/lib/docker went from 61% seven days ago to 74% today." The LLM is configured as a senior DevOps engineer persona and returns structured JSON: a status, headline, severity-tagged findings (each with a specific CLI fix command), trend summary, and a top recommendation. Every finding is actionable — no vague advice, just commands you can copy-paste.
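
The field names below are illustrative labels for those pieces, not the workflow's exact schema, but the shape of the response is roughly this:

# Illustrative digest payload -- field names are my labels for the pieces
# described above, not the workflow's exact schema.
cat <<'EOF'
{
  "status": "warning",
  "headline": "Disk growth on /var/lib/docker is the main risk this week",
  "findings": [
    {
      "severity": "HIGH",
      "finding": "/var/lib/docker grew from 61% to 74% over 7 days",
      "fix": "docker system prune -af --volumes"
    }
  ],
  "trend_summary": "Memory flat, CPU normal, disk climbing roughly 2% per day",
  "top_recommendation": "Prune unused images and dangling layers before the weekend"
}
EOF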

The output is useful in a way alert thresholds aren't. Instead of "disk usage exceeded 80%," you get something like: "At the current rate of growth, /var/lib/docker will hit 90% in approximately 8 days. Run docker system prune -af --volumes to reclaim 12GB immediately." That is a different class of information.

The Discord delivery is a 4-embed dashboard, not a single message. Embed 1 is the status header with health score, CPU, memory, disk, swap, and container counts as inline fields. Embed 2 shows severity-tagged findings (HIGH / MED / LOW) with fix commands in code blocks. Embed 3 covers the Docker ecosystem, network stats, trends, and top recommendation. Embed 4 is a footer with timestamp, hostname, execution duration, and API cost. It reads like a real monitoring dashboard, not a wall of text.

There are also 5-minute critical checks — a separate execution path in the same workflow that checks thresholds with no AI and fires immediately to Discord if something is actively wrong.
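
In shell terms, the critical path is the equivalent of something like this (the workflow does it with n8n nodes rather than a script; the webhook variable and thresholds here are placeholders):

# Shell equivalent of the 5-minute critical check: hard thresholds, no LLM,
# immediate Discord webhook. WEBHOOK_URL and the numbers are placeholders.
disk_pct=$(df -P / | awk 'NR==2 {gsub("%", ""); print $5}')
mem_pct=$(free | awk '/^Mem:/ {printf "%d", $3 * 100 / $2}')

if [ "$disk_pct" -gt 90 ] || [ "$mem_pct" -gt 95 ]; then
  curl -s -H "Content-Type: application/json" \
    -d "{\"content\": \"CRITICAL on $(hostname): disk ${disk_pct}%, memory ${mem_pct}%\"}" \
    "$WEBHOOK_URL"
fi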

No agents installed on the monitored server. SSH access is all it needs. This matters if you're monitoring machines you don't fully control, or if you want to keep the monitored host clean.

Running cost: under $1/month. The LLM calls are small (maybe 2,000 tokens per daily digest across all hosts), and that's assuming you're using OpenAI. If you run Ollama locally, the external API cost drops to zero.

The trade-offs are real. No per-second granularity — daily snapshots plus 5-minute health checks. No dashboards. If you want to see a graph of memory usage over the past week, this doesn't do that. The LLM describes trends; it doesn't render them visually.

There's also an external API dependency baked in. If OpenAI has an outage, the daily digest either fails or falls back to raw data without interpretation. I handle this with a simple try/catch in the Code node — if the LLM call fails, the digest still sends, just without trend analysis. Raw data is better than nothing.

Server discovery is manual. There's no dynamic target discovery like Prometheus service discovery. Each monitored host gets an explicit entry in the workflow. For 1-3 machines, this is fine. For 20 machines, it starts to feel like maintenance work.

I want to be clear about one thing: building this took about half a day. Most of that was writing and testing the shell parsing logic, not the n8n configuration. If you're already running n8n, adding this workflow is genuinely lightweight.

When this approach makes sense: 1-3 servers, you want daily intelligence and trend detection without running a separate observability stack, you're already using n8n, you're comfortable with SSH access for collection.


The Actual Comparison

Feature                  | Prometheus Stack             | Uptime Kuma            | n8n + AI
Setup time               | 2-4 hours                    | 10 minutes             | ~30 minutes
RAM overhead             | ~500MB-1GB                   | ~50MB                  | 0 (runs in existing n8n)
New containers           | 4                            | 1                      | 0
System metrics           | Yes (via exporters)          | No                     | Yes (via SSH)
Docker metrics           | Yes (cAdvisor)               | Status only            | Status + per-container stats + ecosystem health (disk waste, dangling images, reclaimable space)
Trend analysis           | Manual (PromQL + Grafana)    | No                     | AI-powered, automatic
Alert quality            | Highly configurable          | Up/Down only           | Severity-tagged findings with CLI fix commands
Dashboards               | Yes (Grafana)                | Status page            | No
Per-second granularity   | Yes                          | Configurable intervals | No
External API dependency  | No                           | No                     | Yes (LLM API)
Agent install required   | Yes (Node Exporter per host) | No (Docker socket)     | No (SSH only)
Running cost             | Free                         | Free                   | ~$0.50-1/month
Ongoing maintenance      | Medium-high                  | Low                    | Low
Best for                 | 3+ servers, K8s, production  | Endpoint monitoring    | 1-3 servers, daily digest

My Actual Setup

I run all three.

Prometheus + Grafana covers my production infrastructure — a separate set of machines where I actually need historical dashboards, per-second debugging capability, and properly tuned alerts. The maintenance overhead is worth it there because the stakes are higher and I have real incidents that require real investigation.

Uptime Kuma monitors external endpoints: public URLs, API health checks, SSL certificate expiry, and a few TCP ports I care about. It is the first thing I check when something feels slow. It is also what I'd point people to when they want to know if a service is up.

The n8n workflow is my morning briefing. I wake up, I check Discord, I see a digest that tells me whether anything is trending in a bad direction. Not "something is down right now" — that's what Kuma handles — but "here's what changed over the last 24 hours and here's what's worth watching." It has caught two slow memory leaks and one log volume explosion before any threshold alert would have triggered.

Treating these as competitors misses the point. They're measuring different things on different timescales with different output formats. The question isn't which one is best — it's which one fits what you're actually trying to know.


The n8n workflow is free on the Creator Hub. Link coming soon.


Series: Homelab Health Dashboard · Part 3 of 3
← Previous: How to Monitor Your Homelab Server with n8n and AI (Step-by-Step Setup)


Related Product

Homelab Health Dashboard

32-node n8n workflow that monitors your homelab via SSH, analyzes metrics with AI, and delivers a 4-embed health dashboard to Discord every morning. Includes 5-minute critical alerts.


Read Next

Build Log · 11 min

I Built a Homelab Health Dashboard with n8n and AI — SSH Metrics, Docker Stats, and a Daily Digest That Actually Tells You Something

Build log of creating a 32-node n8n workflow that SSHes into your homelab, collects 30+ system and Docker metrics, analyzes them with GPT-4o-mini, and delivers a 4-embed health dashboard to Discord — plus 5-minute critical alerts.

Tutorial · 11 min

How to Monitor Your Homelab Server with n8n and AI (Step-by-Step Setup)

Step-by-step guide to setting up a 32-node n8n workflow that SSHes into your homelab, collects 30+ metrics, analyzes them with GPT-4o-mini, and delivers a 4-embed health dashboard to Discord every morning.

Tutorial · 21 min

Deploy a Complete Homelab Monitoring Stack with Docker Compose: Grafana, Prometheus, Loki, and 23 Alert Rules

Step-by-step tutorial for deploying a 9-service monitoring stack on any Linux server. Prometheus for metrics, Loki for logs, Grafana for dashboards, Alertmanager for notifications, plus Proxmox and Uptime Kuma. One docker compose up and you have 7 dashboards and 23 pre-configured alert rules.