Running 15 Docker containers on a single box is fine until it isn't. You don't find out something broke because you were watching dashboards — you find out because something else stopped working, or a Discord message bounced, or you tried to load a page and got a timeout. By then whatever failed has been down for hours.
I wanted to know before that. Not "the server is unreachable" level alerting. I wanted to know that my n8n container was restarting every 20 minutes, that inodes were at 78% and trending up, that memory pressure had been climbing for three days before anything actually crashed.
There are real tools for this. I tried two of them before building the third.
Prometheus + Grafana + Node Exporter + Alertmanager
This is the infrastructure world's answer to monitoring. It is also correct. If you work in production DevOps you've used some version of this stack and it genuinely is the right answer at scale.
The pitch is solid: Prometheus scrapes metrics from exporters on a configurable interval, stores them as time-series data (typically at 15-60 second resolution, down to one second if you need it), and provides a query language (PromQL) for slicing and aggregating everything. Grafana sits on top for dashboards. Node Exporter runs on each machine and exposes system-level metrics — CPU, memory, disk, network, filesystem stats, temperature on supported hardware. Alertmanager handles the routing, grouping, deduplication, and delivery of alerts.
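For a sense of what that looks like on the wire, Node Exporter serves plain-text metrics on port 9100 by default; a quick peek at a few of the standard series (the host here is a placeholder):

```bash
# Node Exporter exposes plain-text metrics on :9100 by default.
# A few of the system-level series Prometheus would scrape from it:
curl -s http://localhost:9100/metrics | grep -E \
  '^(node_load1|node_memory_MemAvailable_bytes|node_filesystem_avail_bytes)'
```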
The community around this stack is massive. There are exporters for virtually everything — databases, container runtimes, JVM apps, network hardware, cloud providers. If you want to monitor it, there is probably an exporter already written. Grafana has thousands of community dashboards; you can import a Node Exporter dashboard in about 30 seconds and immediately have something beautiful showing you the last 24 hours of CPU and memory per host.
Kubernetes is where this stack genuinely shines. The Prometheus Operator, service monitors, kube-state-metrics — the integration is deep and the tooling is mature. If you're running k8s at any scale, Prometheus is not optional.
For a homelab with two servers and 15 containers, I found the calculus different.
The minimum setup is four additional containers: Prometheus, Grafana, Node Exporter (one per monitored host), and Alertmanager. Prometheus alone needs ~500MB of RAM with default retention settings; more if you push retention out to 30 days or start adding many targets. That is not nothing on a machine that is already running your actual workloads.
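To make that footprint concrete, here is roughly what the minimum stack looks like as plain docker run commands. This is a sketch, not a tuned deployment: images and ports are the upstream defaults, and the two config files it references still have to be written by hand.

```bash
# Prometheus itself; prometheus.yml (scrape targets) has to exist first.
docker run -d --name prometheus -p 9090:9090 \
  -v "$PWD/prometheus.yml:/etc/prometheus/prometheus.yml" prom/prometheus

# Grafana; datasource and dashboards still get configured in the UI.
docker run -d --name grafana -p 3000:3000 grafana/grafana

# Node Exporter, one per monitored host; host mounts so it sees real stats.
docker run -d --name node-exporter --net=host --pid=host \
  -v /:/host:ro,rslave prom/node-exporter --path.rootfs=/host

# Alertmanager; routes, receivers, and silences live in alertmanager.yml.
docker run -d --name alertmanager -p 9093:9093 \
  -v "$PWD/alertmanager.yml:/etc/alertmanager/alertmanager.yml" prom/alertmanager
```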
Configuration is real work. prometheus.yml needs scrape targets. Alerting rules live in separate YAML files with a specific syntax. Grafana needs a datasource configured, then dashboards either built by hand or imported and then modified to match your specific setup. Alertmanager has its own configuration file with routes, receivers, group timing, inhibit rules. None of it is hard, exactly, but it is a lot of moving parts to maintain — and every update to any of the four containers is a chance for something to drift.
I ran this stack for six months. The dashboards were genuinely beautiful. I am not being sarcastic — there is something satisfying about a well-configured Grafana dashboard showing you 30 days of memory pressure across two hosts. I checked them about once a week.
The alerts were where it fell apart for me. The first alert I configured was a CPU threshold — anything above 85% for 5 minutes. My backup cron job runs at 2AM and pegs a core for 12 minutes. So I got paged at 2AM, realized it was the backup, and suppressed the alert with a time-based silence in Alertmanager. Then I forgot to remove the silence. Then the alert never fired again when the CPU actually spiked during a real incident two months later.
(This is a very common alerting anti-pattern and there are correct ways to handle it — recording rules, inhibition, better alert semantics. I know. The point is that getting alerting right in Alertmanager for a single-person homelab requires real tuning effort, and the feedback loop for validating that your alerts actually fire when they should is long and painful.)
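For what it's worth, the silence itself is a one-liner with amtool, which is part of why it's so easy to create and then forget about. The alert name, instance, and duration here are illustrative:

```bash
# amtool is Alertmanager's bundled CLI. A silence like this takes seconds
# to create and nothing reminds you it exists; names and values are examples.
amtool --alertmanager.url=http://localhost:9093 silence add \
  alertname=HighCPU instance=homelab-01:9100 \
  --duration=720h --comment="2AM backup pegs a core for 12 minutes"
```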
The other alert failure mode is the opposite: too noisy. Container restarts are interesting when they're unexpected and noise when they're expected. Distinguishing those cases in PromQL requires knowing which containers restart normally and which don't, encoding that as rules, and keeping those rules updated as your stack changes.
I am not saying Prometheus alerting is bad. For a team with dedicated on-call, runbooks, and time to tune alerting rules, it is the right tool. For one person checking a homelab, the overhead of maintaining the alerting layer started to exceed the value it provided.
When Prometheus is the right choice: You're running more than three servers. You need per-second metric granularity for performance debugging. You want historical dashboards spanning months. You're running Kubernetes. You have time to tune alert rules and maintain the stack.
Uptime Kuma
Single container. About 50MB of RAM. Beautiful UI. Dead simple to set up.
Uptime Kuma monitors endpoints — HTTP(S), TCP, DNS, ping, Docker containers via the Docker socket, and a few other protocols. You add a monitor, configure a check interval (as low as 20 seconds), wire up notification channels (Discord, Slack, Telegram, PagerDuty, email, ~90 others), and you're done in 10 minutes.
The status page feature is genuinely useful for public-facing services. If you host anything that other people depend on, a Kuma status page gives them a single place to check when things are slow.
For Docker monitoring specifically, Kuma connects to the socket and can tell you which containers are running, stopped, or restarting. That alone is worth having — it catches the silent restart loops that don't obviously surface anywhere else.
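The only setup wrinkle for the Docker monitors is mounting the socket when you start the container. A sketch of the usual single-container deployment, assuming the standard image and data volume:

```bash
# Uptime Kuma in one container. The socket mount is only needed if you want
# it to watch local Docker containers in addition to endpoints.
docker run -d --name uptime-kuma --restart unless-stopped \
  -p 3001:3001 \
  -v uptime-kuma:/app/data \
  -v /var/run/docker.sock:/var/run/docker.sock \
  louislam/uptime-kuma:1
```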
What it doesn't do: system metrics. There is no CPU, memory, disk, inode, or temperature data. Kuma tells you "the service is responding" or "the service is not responding." It does not tell you that memory has been climbing for 72 hours and the service is one OOM kill away from going down.
No trend analysis either. Each check is independent. You get up/down status, response time, and cert expiry. That is the designed scope, and within that scope it executes well.
There is no AI interpretation layer. "Your Nginx container is down" is the alert. Diagnosing why is entirely on you.
When Uptime Kuma is the right choice: You care primarily about endpoint reachability. You want a status page. You need external monitoring (a Kuma instance hosted outside your network can check your exposed services from the outside). Fast setup, minimal overhead, zero maintenance. Excellent tool for what it does.
n8n + SSH + AI
This is what I built. Two SSH commands, a Code node that parses the output, an LLM that compares current state against seven days of stored history, and a Discord message with a summary and any fix commands worth knowing.
The data collection is more thorough than you'd expect from two SSH commands. The first command runs a compound shell script that collects real CPU utilization % (via /proc/stat delta sampling with a 1-second sleep — actual core usage, not just load average), load average, all mounted filesystems via df -P (with tmpfs/udev filtered out), network I/O from /proc/net/dev (RX/TX bytes and error counts), top 5 processes by CPU, zombie process count, failed systemd services, established network connections, memory, swap, inode usage per filesystem, and temperature. The second runs docker stats --no-stream and docker ps for per-container CPU, memory, restart count, and health status, plus docker system df for ecosystem health: total image size, volume size, build cache, reclaimable space, dangling image count, and unused volume count. That's over 30 metrics from two SSH connections, in about two seconds per host.
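The exact script in the workflow is longer and labels its output for parsing, but a condensed sketch of what those two commands collect looks roughly like this:

```bash
#!/bin/bash
# Condensed sketch of the collection; the real script gathers more fields
# and formats them for the n8n Code node to parse.

# Real CPU utilization: two /proc/stat samples one second apart.
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
idle=$(( i2 - i1 ))
echo "CPU_PCT=$(( 100 * busy / (busy + idle) ))"

echo "LOAD=$(cut -d' ' -f1-3 /proc/loadavg)"
free -m | awk '/^Mem:/  {print "MEM_MB="$3"/"$2}
               /^Swap:/ {print "SWAP_MB="$3"/"$2}'

# Disk and inode usage per filesystem, skipping tmpfs/udev.
df -P  -x tmpfs -x devtmpfs | tail -n +2
df -Pi -x tmpfs -x devtmpfs | tail -n +2

# Network counters, zombies, failed units, connections, temperature.
awk 'NR > 2 {print "NET", $1, "RX="$2, "RXERR="$4, "TX="$10, "TXERR="$12}' /proc/net/dev
echo "ZOMBIES=$(ps -eo stat= | grep -c '^Z')"
systemctl --failed --no-legend
echo "CONNS=$(ss -t state established | tail -n +2 | wc -l)"
cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null

# Top 5 processes by CPU.
ps -eo pid,comm,%cpu --sort=-%cpu | head -6

# The second SSH connection handles the Docker side:
docker stats --no-stream --format '{{.Name}};{{.CPUPerc}};{{.MemUsage}}'
docker ps -a --format '{{.Names}};{{.Status}}'
docker system df
```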
The Code node normalizes that into a structured record with 16 fields — one row per run, appended to Google Sheets. That's the history: seven days of daily snapshots per host.
Then the LLM gets the current snapshot plus the seven-day history. Not the raw data — a compact summary: "disk usage on /var/lib/docker went from 61% seven days ago to 74% today." The LLM is configured as a senior DevOps engineer persona and returns structured JSON: a status, headline, severity-tagged findings (each with a specific CLI fix command), trend summary, and a top recommendation. Every finding is actionable — no vague advice, just commands you can copy-paste.
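Outside of n8n, the equivalent is one chat-completion request. The model, prompt wording, and JSON field names below are illustrative stand-ins, not the workflow's exact prompt:

```bash
# Illustrative stand-in for the workflow's LLM call: send the compact summary,
# ask for structured JSON back. Model, prompt, and field names are examples.
summary="disk on /var/lib/docker went from 61% to 74% over 7 days; memory steady; n8n restarted 3x"
curl -s https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d @- <<EOF
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    { "role": "system",
      "content": "You are a senior DevOps engineer. Return JSON with: status, headline, findings (each with severity and a fix_command), trend_summary, top_recommendation." },
    { "role": "user", "content": "$summary" }
  ]
}
EOF
```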
The output is useful in a way alert thresholds aren't. Instead of "disk usage exceeded 80%," you get something like: "At the current rate of growth, /var/lib/docker will hit 90% in approximately 8 days. Run docker system prune -af --volumes to reclaim 12GB immediately." That is a different class of information.
The Discord delivery is a 4-embed dashboard, not a single message. Embed 1 is the status header with health score, CPU, memory, disk, swap, and container counts as inline fields. Embed 2 shows severity-tagged findings (HIGH / MED / LOW) with fix commands in code blocks. Embed 3 covers the Docker ecosystem, network stats, trends, and top recommendation. Embed 4 is a footer with timestamp, hostname, execution duration, and API cost. It reads like a real monitoring dashboard, not a wall of text.
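Discord webhooks take those embeds as plain JSON in a single POST. A trimmed sketch of the first two embeds, with placeholder values:

```bash
# Trimmed sketch: the real digest sends four embeds; values are placeholders.
curl -s -X POST "$DISCORD_WEBHOOK_URL" -H "Content-Type: application/json" -d '{
  "embeds": [
    {
      "title": "homelab-01 | HEALTHY (score 92/100)",
      "color": 3066993,
      "fields": [
        { "name": "CPU",        "value": "14%", "inline": true },
        { "name": "Memory",     "value": "61%", "inline": true },
        { "name": "Disk /",     "value": "74%", "inline": true },
        { "name": "Containers", "value": "15 running / 0 restarting", "inline": true }
      ]
    },
    {
      "title": "Findings",
      "color": 15105570,
      "description": "**MED** /var/lib/docker trending up, ~8 days to 90%\nFix: `docker system prune -af --volumes`"
    }
  ]
}'
```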
There are also 5-minute critical checks — a separate execution path in the same workflow that checks thresholds with no AI and fires an alert to Discord immediately if something is actively wrong.
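The logic on that path is deliberately simple: hard thresholds, no history, no LLM. A minimal sketch of the disk check, with an example mount point and threshold:

```bash
# No-AI critical path: hard threshold, immediate webhook post.
# The mount point and 90% threshold are examples.
usage=$(df -P /var/lib/docker | awk 'NR==2 { gsub("%", "", $5); print $5 }')
if [ "$usage" -ge 90 ]; then
  curl -s -X POST "$DISCORD_WEBHOOK_URL" -H "Content-Type: application/json" \
    -d "{\"content\": \"CRITICAL: /var/lib/docker at ${usage}% on $(hostname)\"}"
fi
```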
No agents installed on the monitored server. SSH access is all it needs. This matters if you're monitoring machines you don't fully control, or if you want to keep the monitored host clean.
Running cost: under $1/month. The LLM calls are small (maybe 2,000 tokens per daily digest across all hosts), and that's assuming you're using OpenAI. If you run Ollama locally, the external API cost drops to zero.
The trade-offs are real. No per-second granularity — daily snapshots plus 5-minute health checks. No dashboards. If you want to see a graph of memory usage over the past week, this doesn't do that. The LLM describes trends; it doesn't render them visually.
There's also an external API dependency baked in. If OpenAI has an outage, the daily digest either fails or falls back to raw data without interpretation. I handle this with a simple try/catch in the Code node — if the LLM call fails, the digest still sends, just without trend analysis. Raw data is better than nothing.
Server discovery is manual. There's no dynamic target discovery like Prometheus service discovery. Each monitored host gets an explicit entry in the workflow. For 1-3 machines, this is fine. For 20 machines, it starts to feel like maintenance work.
I want to be clear about one thing: building this took about half a day. Most of that was writing and testing the shell parsing logic, not the n8n configuration. If you're already running n8n, adding this workflow is genuinely lightweight.
When this approach makes sense: 1-3 servers, you want daily intelligence and trend detection without running a separate observability stack, you're already using n8n, you're comfortable with SSH access for collection.
The Actual Comparison
| Feature | Prometheus Stack | Uptime Kuma | n8n + AI |
|---|---|---|---|
| Setup time | 2-4 hours | 10 minutes | ~30 minutes (prebuilt workflow); ~half a day from scratch |
| RAM overhead | ~500MB-1GB | ~50MB | 0 (runs in existing n8n) |
| New containers | 4 | 1 | 0 |
| System metrics | Yes (via exporters) | No | Yes (via SSH) |
| Docker metrics | Yes (cAdvisor) | Status only | Status + per-container stats + ecosystem health (disk waste, dangling images, reclaimable space) |
| Trend analysis | Manual (PromQL + Grafana) | No | AI-powered, automatic |
| Alert quality | Highly configurable | Up/Down only | Severity-tagged findings with CLI fix commands |
| Dashboards | Yes (Grafana) | Status page | No |
| Per-second granularity | Yes | No (20-second minimum interval) | No |
| External API dependency | No | No | Yes (LLM API) |
| Agent install required | Yes (Node Exporter per host) | No (Docker socket) | No (SSH only) |
| Running cost | Free | Free | ~$0.50-1/month |
| Ongoing maintenance | Medium-high | Low | Low |
| Best for | 3+ servers, K8s, production | Endpoint monitoring | 1-3 servers, daily digest |
My Actual Setup
I run all three.
Prometheus + Grafana covers my production infrastructure — a separate set of machines where I actually need historical dashboards, per-second debugging capability, and properly tuned alerts. The maintenance overhead is worth it there because the stakes are higher and I have real incidents that require real investigation.
Uptime Kuma monitors external endpoints: public URLs, API health checks, SSL certificate expiry, and a few TCP ports I care about. It is the first thing I check when something feels slow. It is also what I'd point people to when they want to know if a service is up.
The n8n workflow is my morning briefing. I wake up, I check Discord, I see a digest that tells me whether anything is trending in a bad direction. Not "something is down right now" — that's what Kuma handles — but "here's what changed over the last 24 hours and here's what's worth watching." It has caught two slow memory leaks and one log volume explosion before any threshold alert would have triggered.
Treating these as competitors misses the point. They're measuring different things on different timescales with different output formats. The question isn't which one is best — it's which one fits what you're actually trying to know.
The n8n workflow is free on the Creator Hub. Link coming soon.