I Built a 9-Service Homelab Monitoring Stack and Shipped It as a Product — Here's the Full Build Log
My homelab had no monitoring. None. Seventeen running containers, a Proxmox cluster with VMs and LXCs, 33% of a 96 GB disk in use, and I could not tell you at what time of day memory peaked, whether any container had restarted in the last week, or how quickly my disk was filling. I found out about problems by noticing something felt slow, SSHing in, running htop, and squinting at it like a doctor from the 1800s checking for fever by touching a forehead.
The standard advice is "just run Grafana and Prometheus." Great. Except setting that up from scratch means wrangling Prometheus scrape configs, writing alerting rules, building Grafana dashboards panel by panel, figuring out how Loki and Promtail talk to each other, dealing with retention settings, and all the other configuration that nobody tells you takes longer than the actual monitoring decision. I searched for complete, ready-to-deploy monitoring stacks. Plenty of blog posts with docker-compose snippets. Almost none shipped dashboards. Zero shipped alert rules. None included log aggregation.
So I built one. This is the full, chronological story of that build — every phase, every decision, every error, and the one debugging session that made me question the entire container monitoring ecosystem.
The Starting Point
The host machine: Ubuntu 24.04 on 4 cores, 7.8 GB RAM, 62 GB free disk. Docker 29.2.1 with Compose v5.0.2. Already running an API, dashboard, database, Redis, n8n, and Caddy on this box. Ports 22, 80, 443, 3000, 4000, 5432, 5678, 6379 all occupied.
My target: a self-contained monitoring stack that runs alongside existing services without conflicts, costs nothing to operate (no cloud APIs, no SaaS subscriptions), and is ready to use in under 15 minutes from a cold start.
Nine services in one compose file:
- Grafana (dashboards)
- Prometheus (metrics collection)
- Loki (log aggregation)
- Promtail (log shipping)
- Node Exporter (host system metrics)
- cAdvisor (container metrics)
- Alertmanager (alert routing)
- PVE Exporter (Proxmox metrics)
- Uptime Kuma (uptime monitoring)
Total RAM budget: about 800 MB to 1.2 GB. Zero ongoing cost.
Phase 1: The Core Three — Prometheus, Grafana, Node Exporter
I wrote all the config files upfront — every YAML, every provisioning file, every alert rule — even for services I would not start until later phases. The docker-compose.yml references all 9 services, so having valid config files ready prevents Compose from complaining about missing bind mounts.
Every configurable value goes through .env. Grafana port, admin credentials, Prometheus retention, Loki burst size, PVE host IP, Discord webhook URL. The compose file is full of ${VAR:-default} syntax — I wrote about these patterns and eight others separately. The goal: users should never need to edit YAML. Copy .env.example, fill in their values, run docker compose up -d.
services:
  grafana:
    image: grafana/grafana:11.5.2
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/system-overview.json
All image versions pinned. Not :latest. Grafana 11.5.2, Prometheus v3.2.1, Node Exporter v1.9.0. If someone deploys this six months from now, they get the exact versions I tested.
Started just three services for phase 1: Prometheus, Node Exporter, Grafana.
docker compose up -d prometheus node-exporter grafana
Checked Prometheus targets: prometheus UP, node-exporter UP, grafana UP. Three green dots. Three auto-provisioned datasources in Grafana (Prometheus, Loki, Alertmanager — all configured, Loki and Alertmanager just not reachable yet).
No errors. Quick phase.
(I have gotten suspicious of phases that go this smoothly. The universe tends to balance the ledger in the next phase.)
Phase 2: cAdvisor and the Docker 29 Containerd Nightmare
This was the worst stretch of the entire build.
Started cAdvisor v0.51.0. It pulled fast — only 21 MB. Container came up in privileged mode with the standard mount list: /, /var/run, /sys, /var/lib/docker, /dev/disk, /dev/kmsg.
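For context, the service block is close to the upstream cAdvisor compose example. A sketch of what that looks like, with the image tag from this phase; the exact mount targets are assumed from that upstream example rather than copied from the kit:

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.51.0
    privileged: true
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    devices:
      - /dev/kmsg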
Prometheus immediately showed cAdvisor as UP. Good sign.
Then I looked at the container logs.
Seventy-two error lines. Every single one saying the same thing:
failed to identify the read-write layer ID for container "abc123..."
Over and over. One error per container on the host, repeated on every housekeeping cycle. Thirteen containers, most of them throwing it several times.
What Was Actually Happening
cAdvisor discovers Docker containers by reading the overlay filesystem metadata at /var/lib/docker/image/overlayfs/layerdb/mounts/<container-id>/mount-id. This is how it maps container IDs to their filesystem layers so it can report per-container disk usage.
Docker 29.2.1 on this host uses the containerd snapshotter storage driver (io.containerd.snapshotter.v1). With this driver, the /var/lib/docker/image/ directory does not exist. At all.
I listed /var/lib/docker/:
buildkit containers engine-id network plugins rootfs runtimes swarm tmp volumes
No image/ directory. The entire filesystem layer database that cAdvisor depends on is gone. Not moved. Not renamed. Just absent.
Five Attempts, All Failed
Attempt 1: --docker_only=true flag. Theory: force cAdvisor to use only the Docker API, skip filesystem discovery. Reality: it still tried the layer lookup internally. Same error.
Attempt 2: --containerd=/run/containerd/containerd.sock. Theory: tell cAdvisor to talk to containerd directly for layer info. Reality: the socket was accessible, cAdvisor connected, but the Docker container handler still ran its own layer lookup independently. Same error.
Attempt 3: Upgraded to cAdvisor v0.52.1 (latest at the time). Same. Exact. Error. The containerd snapshotter support simply is not in cAdvisor's codebase yet.
Attempt 4: Created dummy mount-id files manually at the expected path. I grabbed a container ID, created the directory structure, wrote a mount-id file pointing to the correct overlay path. cAdvisor detected that one container's filesystem. But there were thirteen containers, the IDs rotate on recreation, and this is obviously insane to maintain.
Attempt 5: Removed --docker_only, let cAdvisor fall back to pure cgroup discovery without Docker container mapping. It found the systemd cgroups fine, but Docker containers showed up as anonymous cgroup paths instead of named containers. The metrics existed but were useless without the name and image labels that make container monitoring actually useful.
The Decision
I sat back and thought about who would actually use this product kit.
The majority of homelab users run Docker 24 through 27. Docker 28 and 29 with the containerd snapshotter are bleeding-edge. On Docker 24-27, cAdvisor works perfectly out of the box — full per-container metrics with name, image, and compose project labels. The bug is upstream in cAdvisor, tracked in their GitHub issues, and will presumably be fixed when containerd snapshotters become the default everywhere.
Ship cAdvisor as-is. Document the Docker 28+ limitation honestly in the troubleshooting guide. On standard Docker installs, users get full container monitoring. On Docker 29 with containerd snapshotters, they get cgroup-level metrics without container labels. Not ideal. Not broken.
I also added flags to reduce cAdvisor's resource footprint:
command:
  - "--housekeeping_interval=30s"
  - "--disable_metrics=advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,referenced_memory,resctrl,tcp,udp"
Drops memory usage from about 128 MB to 80 MB. Cuts metric cardinality by roughly 40%. You do not need TCP connection stats from cAdvisor when Node Exporter already provides them.
I could have spent another hour trying to write a custom exporter, or replaced cAdvisor with Docker daemon metrics (which do not expose per-container data), or just removed container monitoring entirely. None of those options serve the user better than shipping the standard tool with an honest explanation of its current limitation.
Phase 3: Loki and Promtail — The First-Boot Flood
Loki 3.4.2 pulled at 34 MB. Promtail 3.4.2 at 62 MB. Loki takes about 30 seconds to become ready — there is an ingester warmup period where it logs "waiting for 15s after being ready" before accepting writes. Important detail for the product docs: tell users to wait, it is not stuck.
Promtail depends on Loki's health check, so it started automatically once Loki was ready.
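That ordering is plain Compose mechanics: Loki exposes a health check and Promtail waits on it. A minimal sketch, assuming the check hits Loki's /ready endpoint; the exact test command in the shipped file may differ:

  loki:
    image: grafana/loki:3.4.2
    healthcheck:
      test: ["CMD-SHELL", "wget -q --spider http://localhost:3100/ready || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 12

  promtail:
    image: grafana/promtail:3.4.2
    depends_on:
      loki:
        condition: service_healthy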
Then the errors started.
Rate Limiting on First Boot
Promtail found every Docker container log file on the host and tried to ingest all of them at once. Eight thousand plus lines per batch. Loki responded with two types of errors:
429 — Ingestion rate limit exceeded. Loki defaults to 4 MB/s ingestion rate. Promtail was sending 1 MB batches faster than that. On a host with existing containers that have been running for days, the initial log backfill easily exceeds 4 MB/s.
400 — Timestamp too old. Docker container logs older than 7 days (Loki's retention period) get rejected. Not an error per se — Loki correctly refusing stale data. But the log output looks alarming if you do not know what you are looking at.
Both are transient. Self-resolving. Promtail catches up to the current log position within about 30 seconds, after which the steady-state ingestion rate is well under 1 MB/s.
I bumped the Loki rate limits anyway:
ingestion_rate_mb: 10
ingestion_burst_size_mb: 20
Users with large existing Docker deployments will hit the same first-boot burst. Better to handle it gracefully than to ship a product where the first thing users see is a wall of 429 errors.
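Both settings live under limits_config in the Loki config file. A sketch of how that section can look; the last two lines reflect Loki's defaults for old-sample rejection (the source of the 400s), not something I changed:

limits_config:
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
  reject_old_samples: true          # default; produces the 400 "timestamp too old" responses
  reject_old_samples_max_age: 168h  # 7 days, matching the retention window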
Ubuntu 24.04 Has No Syslog
This one caught me off guard. My Promtail config included a scrape target for /var/log/syslog and /var/log/auth.log. Standard Linux log files. Present on every Debian and Ubuntu system since forever.
Except Ubuntu 24.04 uses journald exclusively. There is no rsyslog. /var/log/syslog does not exist. Neither does /var/log/auth.log.
(I should have known this. I have been running 24.04 on this machine for months. It just never came up because I had not needed to read syslog directly until now.)
Added journald as a scrape source in Promtail:
- job_name: journal
  journal:
    max_age: 12h
    labels:
      job: systemd-journal
  relabel_configs:
    - source_labels: ['__journal__systemd_unit']
      target_label: 'unit'
Added the required volume mounts to the compose file:
volumes:
  - /run/log/journal:/run/log/journal:ro
  - /etc/machine-id:/etc/machine-id:ro
The max_age: 12h is intentional. Without it, Promtail reads the entire journal history on first boot, which on a system that has been running for months could be gigabytes. Twelve hours of history is enough for the initial ingestion to be useful without causing another rate-limit flood.
The final Promtail config supports both file-based logs (Debian, older Ubuntu, CentOS) and journald (Ubuntu 24.04+, Fedora, Arch). Missing files get silently skipped. Works everywhere.
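The file-based side is a plain static_configs job pointing Promtail at paths; targets that do not exist simply yield nothing. A sketch, with job and label names that are illustrative rather than the shipped ones:

  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: syslog
          __path__: /var/log/syslog
      - targets: [localhost]
        labels:
          job: authlog
          __path__: /var/log/auth.log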
Verification: Loki labels showed Docker container logs from 13 containers plus journal entries from docker.service, systemd-networkd.service, and init.scope. Everything flowing.
Phase 4: 23 Alert Rules
Three rule files. Twenty-three rules total.
host-alerts.yml — 12 rules for the underlying system. CPU at 80% (warning) and 95% (critical). Memory same thresholds. Disk at 80/90%. Swap at 50% (on a homelab box, any significant swap usage means something is wrong). System load above 2x CPU cores for 10 minutes. Network traffic above 100 MB/s for 10 minutes. Clock skew above 50ms.
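For the shape of those threshold rules, here is a sketch of the CPU warning using the standard Node Exporter idle-rate idiom; the exact expression in host-alerts.yml may differ:

- alert: HostHighCpuUsage
  expr: 100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "CPU usage above 80% for 5 minutes"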
The one I am most proud of: DiskWillFillIn24Hours.
- alert: DiskWillFillIn24Hours
  expr: |
    predict_linear(
      node_filesystem_avail_bytes{
        fstype!~"tmpfs|overlay|squashfs"
      }[6h], 24 * 3600
    ) < 0
  for: 30m
predict_linear takes 6 hours of disk usage data and extrapolates 24 hours into the future. If the projected available space goes below zero, the alert fires. This catches the scenario where disk usage is only at 60% right now but growing at a rate that will fill it by tomorrow morning. A static 85% threshold on a 4 TB drive would not fire until you have already lost most of your buffer.
I also set the for: duration on this rule to 30 minutes instead of the 5 minutes used on other rules. Disk fill predictions are noisy during spikes — a large Docker build temporarily eating disk space would trigger a false positive on a shorter window.
container-alerts.yml — 6 rules. Container down (not seen in 5 minutes), high CPU (80%), high memory (80% of limit), OOM kill (immediate, no for: delay), excessive restarts (more than 2 in 15 minutes), cAdvisor itself down.
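As one example from that file, the OOM rule can be as small as a single increase() over cAdvisor's kill counter. A sketch, with metric and label names assuming cAdvisor's defaults rather than quoting the shipped rule:

- alert: ContainerOOMKilled
  expr: increase(container_oom_events_total{name=~".+"}[5m]) > 0
  labels:
    severity: critical
  annotations:
    summary: "Container {{ $labels.name }} was OOM killed"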
proxmox-alerts.yml — 5 rules. Separate file. Users without Proxmox can delete it instead of commenting out individual rules.
Restarted Prometheus. All 23 rules loaded. Two alerts immediately entered pending state: HighSwapUsage (the host was at 51% swap — 7.8 GB RAM on a box running everything is tight) and DiskWillFillIn24Hours (the prediction needed more than a few minutes of data to stabilize, so it went pending on thin data). Both expected and correct.
Tested the full pipeline: fired a manual test alert via the Alertmanager v2 API. It appeared in Alertmanager with correct labels, correct severity routing. The pipeline works: Prometheus evaluates rules, sends to Alertmanager, Alertmanager routes to receivers.
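The routing half of that pipeline is a short alertmanager.yml. A sketch of the severity split, assuming the Discord webhook from .env ends up in a discord_configs receiver (Alertmanager supports Discord natively since v0.25); how the URL actually gets injected is a kit detail, so the values below are placeholders:

route:
  receiver: discord-default
  group_by: ['alertname', 'severity']
  routes:
    - matchers:
        - severity = "critical"
      receiver: discord-critical
      repeat_interval: 1h

receivers:
  - name: discord-default
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/REPLACE_ME"
  - name: discord-critical
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/REPLACE_ME"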
Phase 5: Proxmox Integration and the URL Encoding Surprise
The PVE Exporter container itself was straightforward. Pull the image, pass the Proxmox host IP and credentials via environment variables, done. It connected to the Proxmox API at 10.10.10.101:8006 and got a 401 Unauthorized because the monitoring user had not been created on the Proxmox side yet. Expected.
The surprise was in Prometheus.
I had written prometheus.yml with ${PVE_HOST} in the target URL, assuming Docker Compose would substitute it. Prometheus config files do not support Docker Compose ${VAR} substitution. They are read directly by the Prometheus binary inside the container. The literal string ${PVE_HOST} got URL-encoded to $%7bpve_host and sent as an HTTP request to nowhere.
This is one of those things that seems obvious in retrospect. Prometheus is not Compose. Compose substitutes variables in its own YAML file. Mount a config file into a container, and it is read raw.
The fix: a setup.sh script that reads .env and runs sed to replace a PVE_TARGET placeholder in prometheus.yml with the actual IP. Simple, idempotent, runs once during initial setup.
PVE_HOST=$(grep PVE_HOST .env | cut -d '=' -f2)
sed -i "s/PVE_TARGET/$PVE_HOST/g" prometheus/prometheus.yml
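For context, the scrape job being patched follows prometheus-pve-exporter's documented multi-target pattern, with PVE_TARGET standing in until setup.sh replaces it. The exporter hostname and port below assume the compose service defaults rather than quoting the kit:

  - job_name: proxmox
    metrics_path: /pve
    params:
      module: [default]
    static_configs:
      - targets: ['PVE_TARGET']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: pve-exporter:9221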
I also wrote a 5 KB PROXMOX-SETUP.md guide for creating the monitoring user and API token on Proxmox, since that is a multi-step process involving user creation, role assignment, and token generation with privilege separation disabled. The full Proxmox monitoring setup — exporter, scrape config, dashboard, alerts — is covered in a separate guide.
Phase 6: Uptime Kuma
Pulled the image (190 MB, the largest in the stack by far — ironic for the simplest service). Started on port 3001. Health check passed.
Uptime Kuma is standalone. No Prometheus integration, no Grafana dashboard needed. It has its own web UI where you configure HTTP/TCP/DNS/ping monitors with built-in notification channels. I included it because it fills a gap the Prometheus-based stack does not cover: external uptime checks with status pages. Blackbox Exporter is the Prometheus-native alternative, but its UI story is "build more Grafana dashboards." Uptime Kuma is just better for that specific job.
All 9 services running.
Phase 7: Seven Dashboards, Sixty-Eight Panels
This was the densest phase. Seven JSON dashboard files, 68 total panels. I covered the full deployment steps in the companion tutorial, so here I will focus on design decisions.
System Overview (12 panels) — CPU usage gauge, memory gauge, disk gauge, uptime stat. Then per-mode CPU area chart (user/system/iowait/steal stacked), load averages with a CPU core reference line, memory breakdown (used/buffers/cached/available stacked), swap usage, disk space per mountpoint, disk I/O, network traffic, network errors. This is the "glance at it once a day" dashboard.
Docker Containers (11 panels) — Container count, total CPU and memory stats at the top. Per-container CPU stacked area, CPU throttling percentage, a memory table with name/usage/limit/percentage and color thresholds, network I/O per container, block I/O per container. All queries filter on name=~".+" so only named containers appear (excludes temporary build containers).
Disk Health (9 panels) — predict_linear gets a visual home here. Three projection lines: 24 hours (solid), 48 hours (dashed), 7 days (dotted). You can see at a glance when each mountpoint will fill. Default time range is 6 hours instead of the typical 24 hours — shorter windows make the prediction lines more visually distinct.
Logs Explorer (7 panels, Loki datasource) — Log volume stacked by container, error count via regex (error|fatal|panic), Docker container log panel, log level pie chart, error rate per container, journal log panel, journal error panel. The two log panels are native Grafana log visualizations with syntax highlighting and expandable detail views.
I assigned stable UIDs to each provisioned datasource (prometheus, loki, alertmanager) and referenced those UIDs in every dashboard panel. This matters more than you would think — if you leave the UID empty and Grafana falls back to the default datasource, your Loki log queries get routed to Prometheus. Prometheus sees the LogQL pipe operator (|~) and returns "invalid character '|'." Cryptic error, obvious cause once you know to look.
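The UIDs get pinned in the datasource provisioning file. A sketch, with service names and ports as they would resolve on the compose network; details may differ from the shipped file:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    uid: loki
    url: http://loki:3100
  - name: Alertmanager
    type: alertmanager
    uid: alertmanager
    url: http://alertmanager:9093
    jsonData:
      implementation: prometheus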
Restarted Grafana. All 7 dashboards auto-provisioned. System Overview set as the home dashboard. Everything rendered with live data from the running stack.
Phase 8: Polish and Packaging
The product needs to work for someone who is not me.
Four documentation files totaling 37 KB: README with architecture diagram and quick start, TROUBLESHOOTING covering every issue from the build (the cAdvisor section gets the most real estate), CUSTOMIZATION for extending the stack with remote hosts and custom alert rules, and PROXMOX-SETUP for the multi-step API token creation process.
Two scripts: deploy-node-exporter.sh for installing Node Exporter on remote machines as a systemd service (multi-arch, security-hardened), and backup.sh for timestamped config + Grafana DB archives with 5-backup rotation.
Removed the local .env (never ship real credentials), restored the PVE_TARGET placeholder in prometheus.yml, made scripts executable, ran bash -n syntax validation.
Twenty-six files. All checked.
Post-Deploy Testing — The Parts That Look Right Until You Check
I deployed the full stack to a separate machine for a clean test. Everything came up. Dashboards loaded. No errors in the compose logs. I almost called it done.
Then I actually looked at the Proxmox dashboard.
PVE Exporter Labels Are Not What the Docs Suggest
The Proxmox Cluster dashboard showed two things: node status (online) and total guest count (13). Every other panel — node CPU, node memory, guest CPU, guest memory, storage — was empty. No data.
My PromQL queries all filtered on {object="nodes"} or {object=~"qemu|lxc"}. I had written them based on older PVE Exporter documentation and community dashboard examples. Reasonable assumption. Wrong assumption.
I queried Prometheus directly for the actual label structure:
curl -s 'http://localhost:9090/api/v1/query?query=pve_cpu_usage_ratio' | jq '.data.result[0].metric'
No object label. The PVE Exporter uses an id label with hierarchical paths: node/NXS-AURORA, qemu/1002, lxc/2104, storage/NXS-AURORA/aurora-lvm. Every panel that filtered on object matched nothing.
Fixed all 10 panels to use id regex selectors:
# Old (matches nothing):
pve_cpu_usage_ratio{object="nodes"} * 100
# New (works):
pve_cpu_usage_ratio{id=~"node/.*"} * 100 * on(id) group_left(name) pve_node_info
The group_left(name) join pulls the friendly hostname from pve_node_info so the legend shows "NXS-AURORA" instead of "node/NXS-AURORA." Same pattern for guests — join with pve_guest_info to get VM names instead of VMID numbers.
Another discovery: pve_storage_usage_bytes and pve_storage_size_bytes do not exist. The actual metrics are pve_disk_usage_bytes and pve_disk_size_bytes, which cover both guest disks and storage pools. You filter to storage pools with {id=~"storage/.*"}.
Promtail's Docker Job Label Isn't Automatic
The Logs Explorer dashboard had a different failure mode. Every panel that used {job="docker"} showed "invalid character '|'." Confusing error for a LogQL query.
Two bugs stacked on top of each other.
First: Promtail's docker_sd_configs does not automatically set a job label from the job_name field. Unlike static_configs where labels propagate from the config, Docker service discovery requires an explicit relabel rule. Without it, Docker container logs arrive in Loki with no job label at all.
# This alone does NOT set job="docker":
- job_name: docker
  docker_sd_configs:
    - host: unix:///var/run/docker.sock
  # You need an explicit relabel:
  relabel_configs:
    - replacement: "docker"
      target_label: "job"
Second: the "invalid character '|'" error was not actually about Loki rejecting the query. It was Grafana routing the query to Prometheus instead of Loki. The empty datasource UID in the dashboard JSON caused Grafana to fall back to the default datasource, which was Prometheus. Prometheus received a LogQL query containing the |~ pipe operator and rightfully complained about the | character.
The Deploy Script Assumes Too Much
The Node Exporter deploy script (deploy-node-exporter.sh) failed on first run. Port 9100 was already in use — by the monitoring stack's own Docker Node Exporter container.
This is actually the common case. The machine running the monitoring stack already has Node Exporter as a Docker container. The deploy script is meant for remote machines. But nothing in the script said that, and there was no port conflict detection.
Added three things: a --port flag for custom ports, ss-based port conflict detection that identifies the process, and specific messaging when the conflict is a Docker container ("you don't need this script on the monitoring host — the Docker container already collects host metrics").
The Final Tally
| Metric | Value |
|---|---|
| Services | 9 |
| Dashboards | 7 |
| Dashboard panels | 68 |
| Alert rules | 23 |
| Config files | 19 |
| Documentation pages | 4 (37 KB total) |
| Scripts | 2 |
| Errors hit | 8 (cAdvisor layer bug, Loki rate limit, Loki timestamp rejection, missing syslog, Prometheus env var, PVE Exporter label mismatch, Grafana datasource UID routing, Promtail missing job label) |
| Monthly cost | $0 |
| RAM overhead | ~800 MB - 1.2 GB |
| Disk usage (Prometheus 15d + Loki 7d) | ~3-8 GB |
The cAdvisor debugging ate a disproportionate chunk of the build for a problem that turned out to be unsolvable without patching cAdvisor's source code. Post-deploy testing took longer than any individual build phase — and caught three bugs that would have shipped to every user. Prometheus and Grafana just work when you give them correct config files. The problem is that "correct" often means "tested against real data," not "looks right in a text editor." Alertmanager is the most reliable piece of software in the entire stack (I have never seen it fail in any meaningful way across any project).
What I Would Do Differently
Test cAdvisor on Docker 24-27 first, then 28+. I happened to build this on a Docker 29 host, which led me straight into the containerd snapshotter bug. If I had started on Docker 25, I would have had a clean cAdvisor experience and then discovered the limitation later during compatibility testing rather than during the initial build. The debugging would have been calmer and more systematic instead of "why is this broken on my first attempt."
Start with the alert rules, not the dashboards. Dashboards are visual confirmation. Alerts are operational value. I should have written the 23 rules in phase 2 and built dashboards last. The rules would have caught real issues earlier in the build process (the swap alert did catch high swap usage, which was useful to know about).
Include a health check dashboard. A meta-dashboard showing the health of the monitoring stack itself — is Prometheus scraping all targets? Is Loki accepting writes? I check Prometheus targets manually. A dashboard for that would close the loop.
Deploy to a separate machine before calling it done. Three of the eight bugs were invisible on my dev machine. The PVE Exporter label mismatch only showed up when real Proxmox data flowed through dashboards I had only tested with promtool syntax validation. The deploy script port conflict only surfaced when running on a machine that already had the monitoring stack. Testing on the same machine you built on is not testing — it is confirmation.
What surprised me most: how much of the work is not the services themselves. Prometheus, Grafana, Loki — they install in seconds and run fine. The value is in the configuration around them. The 23 alert rules with sane thresholds. The 68 dashboard panels with correct PromQL. The first-boot rate limit tuning. The Ubuntu 24.04 journald support. The predict_linear disk fill projection. The .env-driven compose file that does not require YAML editing. That is the product. The Docker images are free. The configuration is what costs time.
The complete Homelab Monitoring Stack is available as a free download — just enter your email. You get the 9-service Docker Compose file, all 7 Grafana dashboards (68 panels), 23 alert rules, 4 documentation guides, 2 deployment scripts, and a .env-driven setup that takes about 15 minutes from download to live dashboards. Zero ongoing cost — fully self-hosted, no API keys, no subscriptions.