Deploy a Complete Homelab Monitoring Stack with Docker Compose: Grafana, Prometheus, Loki, and 23 Alert Rules
Already downloaded the Homelab Monitoring Stack kit? All the files below are pre-configured in your download. This tutorial walks through building from scratch so you understand what each piece does and can customize it. Follow along to learn the stack, or use it as a reference when you need to modify something.
Most homelab monitoring setups start with Grafana and Prometheus, then slowly bolt on pieces over the next six months. Loki for logs. Alertmanager for notifications. cAdvisor for container metrics. Each one needs its own config, its own data source, its own debugging session.
I built this stack as a single Docker Compose deployment. Nine services, seven dashboards, twenty-three alert rules, and the entire thing comes up with one command. The build log covers the full story of how I designed it and the 8 errors I hit along the way. This tutorial walks through deploying it phase by phase so you understand what each piece does and can troubleshoot it yourself.
The full architecture:
                       +-------------+
                       |   Grafana   |  (Visualization, 7 dashboards)
                       +------+------+
                              |
           +------------------+------------------+
           |                  |                  |
    +------+------+    +------+------+    +------+------+
    | Prometheus  |    |    Loki     |    | Alertmanager|
    +------+------+    +------+------+    +------+------+
           |                  |                  |
    +------+------+    +------+------+      Notifications
    |  Scrapers   |    |  Promtail   |      (Discord/Slack/Email)
    +------+------+    +-------------+
           |
   +-------+---------+---------------+
   |                 |               |
Node Exporter     cAdvisor     PVE Exporter
(host metrics)    (Docker)     (Proxmox VMs)

    +---------------+
    |  Uptime Kuma  |  (Standalone, HTTP/TCP/DNS checks)
    +---------------+
Prerequisites
Hardware:
- 2 GB RAM minimum (the full stack uses 800 MB to 1.2 GB)
- 10 GB free disk space (Prometheus defaults to 10 GB retention, Loki adds 1-3 GB)
- Any x86_64 Linux server -- bare metal, VM, or LXC container
Software:
- Docker 24+ with Compose v2. Check with docker compose version. If you see docker-compose (with the hyphen), you have v1, which will not work with this compose file.
- Git (optional, for cloning the config repo)
Tested on: Ubuntu 22.04, Ubuntu 24.04, Debian 12. Should work on Fedora, Arch, and any distro with Docker and systemd. The Promtail config handles both syslog and journald, so it adapts to whatever your distro uses.
If you are running Docker 28 or newer, read the cAdvisor note in Phase 2. There is a known upstream issue with the containerd snapshotter that affects container-level metrics.
Phase 1: Core Stack -- Prometheus, Grafana, Node Exporter
Create a project directory and set up the config files. I will show you every file you need to create, in order.
Directory Structure
mkdir -p homelab-monitoring/{prometheus/alerts,loki,promtail,alertmanager,grafana/provisioning/datasources,grafana/provisioning/dashboards,grafana/dashboards}
cd homelab-monitoring
The Docker Compose File
This is the backbone. Nine services, all with health checks, restart policies, and memory limits:
services:
  grafana:
    image: grafana/grafana:11.5.2
    container_name: monitoring-grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_AUTH_ANONYMOUS_ENABLED=${GRAFANA_ANONYMOUS_ENABLED:-false}
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/system-overview.json
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    networks:
      - monitoring
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 256M

  prometheus:
    image: prom/prometheus:v3.2.1
    container_name: monitoring-prometheus
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION:-15d}"
      - "--storage.tsdb.retention.size=${PROMETHEUS_RETENTION_SIZE:-10GB}"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
    networks:
      - monitoring
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M

  node-exporter:
    image: prom/node-exporter:v1.9.0
    container_name: monitoring-node-exporter
    ports:
      - "${NODE_EXPORTER_PORT:-9100}:9100"
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/host"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro,rslave
    pid: host
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 64M
I am showing only three services here. The full compose file has all nine, but you don't need to deploy them all at once. The remaining services (Loki, Promtail, cAdvisor, Alertmanager, PVE Exporter, Uptime Kuma) follow the same pattern and get added in later phases.
Every service has:
- Pinned image versions. No :latest tags. When you deploy this in six months, you get the same tested versions.
- Health checks. Compose can gate dependent services on health status. Promtail waits for Loki to be healthy before starting.
- Memory limits. The whole stack fits in about 1.2 GB. Without limits, Prometheus alone can eat 2 GB on a busy host.
- Named volumes. Persistent data survives container recreation.
I wrote about 9 Docker Compose patterns from this stack in a separate article — covers env defaults, health checks, memory limits, restart policies, and the other small decisions that make compose files portable.
A volumes and networks block at the bottom ties it all together:
volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  alertmanager-data:
  uptime-kuma-data:
  promtail-positions:

networks:
  monitoring:
    name: monitoring
    driver: bridge
Environment Configuration
Create .env from the example:
cp .env.example .env
nano .env
The .env.example has every configurable value with comments:
# --- Grafana ---
GRAFANA_PORT=3000
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=CHANGE_ME_grafana # Change this
GRAFANA_ANONYMOUS_ENABLED=false
# --- Prometheus ---
PROMETHEUS_PORT=9090
PROMETHEUS_RETENTION=15d
PROMETHEUS_RETENTION_SIZE=10GB
# --- Loki ---
LOKI_PORT=3100
LOKI_RETENTION=168h # 7 days
At minimum, change GRAFANA_ADMIN_PASSWORD. Everything else has sane defaults. If port 3000 is taken on your host (common if you run other services), change GRAFANA_PORT to something else. I used 3030 on my box because another service already had 3000.
Prometheus Configuration
Create prometheus/prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/alerts/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          host: "monitoring-host"

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "alertmanager"
    static_configs:
      - targets: ["alertmanager:9093"]

  - job_name: "loki"
    static_configs:
      - targets: ["loki:3100"]

  - job_name: "grafana"
    static_configs:
      - targets: ["grafana:3000"]
All targets use Docker service names (node-exporter:9100, not localhost:9100). Docker's internal DNS resolves these automatically. This is portable -- the config works on any machine without IP changes.
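If you want to sanity-check that resolution, a quick optional test (not part of the kit) is to fetch a few lines of the Node Exporter endpoint from inside the Prometheus container, using the same service name Prometheus scrapes:

# Pull metrics over the Docker network, exactly as Prometheus does
docker exec monitoring-prometheus wget -qO- http://node-exporter:9100/metrics | head -n 5

If that prints metric lines, service-name resolution and the monitoring network are working.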
Grafana Datasource Provisioning
Create grafana/provisioning/datasources/datasources.yml:
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    editable: false

  - name: Alertmanager
    type: alertmanager
    uid: alertmanager
    access: proxy
    url: http://alertmanager:9093
    editable: false
    jsonData:
      implementation: prometheus
The uid fields are important. Dashboards reference datasources by UID, not by name. If you leave them out, Grafana auto-generates random UIDs and the pre-built dashboard panels won't find their datasources. Worse: queries meant for Loki get routed to Prometheus (the default), and Prometheus chokes on LogQL syntax with a confusing "invalid character" error.
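For reference, this is roughly what that UID reference looks like inside a dashboard JSON panel -- a trimmed, illustrative fragment rather than a complete panel definition:

"datasource": { "type": "loki", "uid": "loki" },
"targets": [{ "refId": "A", "expr": "{job=\"docker\"}" }]

Because the panel pins both the type and the uid, the query always lands on the Loki datasource provisioned above, never on the default.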
This auto-provisions all three datasources on Grafana's first boot. No clicking through the UI to add them manually.
Dashboard Provisioning
Create grafana/provisioning/dashboards/dashboards.yml:
apiVersion: 1

providers:
  - name: "Homelab"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false
Any .json file you drop in grafana/dashboards/ gets auto-imported. The 30-second update interval means changes appear quickly during development.
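If you want to try the auto-import with a community dashboard, grafana.com exposes a download endpoint you can point straight at that folder. The ID below is just an example (1860 is the popular Node Exporter Full board), and community dashboards often use a templated datasource, so you may need to edit their datasource fields to the provisioned UIDs:

# Download a community dashboard into the auto-provisioned folder
curl -L -o grafana/dashboards/node-exporter-full.json \
  https://grafana.com/api/dashboards/1860/revisions/latest/download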
Start Phase 1
docker compose up -d prometheus node-exporter grafana
Start only these three services for now. The compose file references config files for Loki, Alertmanager, and Promtail, so those files need to exist as placeholders (even empty ones work -- Docker just needs the path to be valid for the bind mount).
Create the placeholders if you haven't written the full configs yet:
touch loki/loki.yml promtail/promtail.yml alertmanager/alertmanager.yml
Verify Phase 1
Check Prometheus targets:
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"health"|"job"'
You should see:
"job": "prometheus" ... "health": "up"
"job": "node-exporter" ... "health": "up"
"job": "grafana" ... "health": "up"
The alertmanager, cadvisor, loki, and proxmox jobs will show as "down." Expected.
Open Grafana at http://your-server-ip:3000 (or whatever port you set). Log in with the credentials from your .env. The three datasources should already be listed under Connections > Data sources. No panels yet -- dashboards come in Phase 7.
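If you're on a headless box and can't open a browser yet, the provisioning can also be checked over Grafana's HTTP API -- substitute the admin password and port from your .env:

curl -s -u admin:YOUR_PASSWORD http://localhost:3000/api/datasources | python3 -m json.tool | grep '"name"'

You should see Prometheus, Loki, and Alertmanager listed.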
Phase 2: Container Monitoring -- cAdvisor
cAdvisor exposes per-container CPU, memory, network, and disk I/O metrics to Prometheus. It runs in privileged mode because it needs access to cgroup data and the Docker socket.
Add the cAdvisor service to your compose file (or just docker compose up -d cadvisor if you already have the full compose):
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.51.0
    container_name: monitoring-cadvisor
    ports:
      - "${CADVISOR_PORT:-8080}:8080"
    command:
      - "--housekeeping_interval=30s"
      - "--disable_metrics=advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,referenced_memory,resctrl,tcp,udp"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    devices:
      - /dev/kmsg
    privileged: true
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 128M
The --disable_metrics flag cuts out collectors you don't need in a homelab. This reduces cAdvisor's memory usage from about 128 MB to 80 MB and cuts metric cardinality by roughly 40%.
docker compose up -d cadvisor
Verify the target:
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"cadvisor"'
Should show "health": "up".
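If you're curious what that reduction looks like on your own host, you can count the series Prometheus currently holds for the cadvisor job (an optional spot check; the number depends entirely on how many containers you run):

curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count({job="cadvisor"})' | python3 -m json.tool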
Docker 28+ Users: Read This
If you're running Docker 28 or newer (check with docker --version), cAdvisor v0.51.0 has a known incompatibility with the containerd snapshotter storage driver. You will see errors like:
failed to identify the read-write layer ID for container "abc123"
cAdvisor still starts, still serves metrics, and the Prometheus target shows UP. But per-container labels (name, image) won't be attached to the metrics, which means the Docker Containers dashboard won't show individual container names. The cgroup-level CPU and memory data still flows.
This is an upstream cAdvisor bug, not a config issue. I tried v0.52.1 too -- same problem. If you're on Docker 24-27, everything works perfectly. On Docker 28+, you get host-level metrics from cAdvisor but lose per-container naming. Node Exporter and Prometheus still give you full host visibility regardless.
(I spent a while trying workarounds -- --docker_only=true, mounting the containerd socket, creating dummy mount-id files. None of them fixed it. Sometimes you just have to document the limitation and move on.)
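One quick way to tell whether your host is on the containerd snapshotter at all is the storage driver reported by docker info. As I understand it, the classic setup reports overlay2, while the containerd image store shows up as overlayfs with a containerd snapshotter driver type:

# Which storage driver is Docker using?
docker info --format 'Storage driver: {{.Driver}}'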
Phase 3: Log Aggregation -- Loki and Promtail
Prometheus handles metrics. Loki handles logs. The pipeline is: Promtail collects logs from Docker containers and system journals, ships them to Loki, and Grafana queries Loki to display them.
Loki Configuration
Create loki/loki.yml:
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 168h
  max_query_length: 721h
  max_query_series: 500
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem
Two things to note. The ingestion_rate_mb: 10 and ingestion_burst_size_mb: 20 are higher than Loki's defaults (4 and 6). I bumped these because on first boot, Promtail reads ALL existing Docker container logs at once. With the default limits, you get a flood of 429 rate-limit errors for the first minute. 10 MB/s handles the initial burst without issues.
The retention_period: 168h keeps logs for 7 days. For a homelab, that is usually plenty. Bump it to 336h (14 days) or 720h (30 days) if you have the disk space.
Promtail Configuration
Create promtail/promtail.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: warn

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # --- Docker container logs ---
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - replacement: "docker"
        target_label: "job"
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.+)"
        target_label: "container"
      - source_labels: ["__meta_docker_container_label_com_docker_compose_service"]
        target_label: "service"
      - source_labels: ["__meta_docker_container_label_com_docker_compose_project"]
        target_label: "project"

  # --- System logs (syslog-based distros) ---
  - job_name: system
    static_configs:
      - targets: ["localhost"]
        labels:
          job: syslog
          host: monitoring-host
          __path__: /var/log/syslog
      - targets: ["localhost"]
        labels:
          job: authlog
          host: monitoring-host
          __path__: /var/log/auth.log

  # --- journald logs (Ubuntu 24.04+, Fedora, Arch) ---
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: journal
        host: monitoring-host
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"
      - source_labels: ["__journal__hostname"]
        target_label: "hostname"
      - source_labels: ["__journal_priority_keyword"]
        target_label: "severity"
One gotcha: docker_sd_configs does not automatically set a job label from the job_name field. That first relabel rule (replacement: "docker", target_label: "job") is required. Without it, Grafana dashboard queries that filter on {job="docker"} match nothing, and you get cryptic errors instead of logs.
The Promtail config supports both syslog-based distros (Debian, Ubuntu before 24.04) and journald-based distros (Ubuntu 24.04+, Fedora, Arch). Missing log files are silently skipped. On Ubuntu 24.04, /var/log/syslog does not exist -- Promtail just ignores that entry and reads from journald instead.
The max_age: 12h on the journal scrape prevents Promtail from reading weeks of journal history on first boot. You only want recent entries.
The Promtail service in the compose file needs these volume mounts to access both Docker logs and system journals:
  promtail:
    image: grafana/promtail:3.4.2
    container_name: monitoring-promtail
    command: -config.file=/etc/promtail/promtail.yml
    volumes:
      - ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /run/log/journal:/run/log/journal:ro
      - /etc/machine-id:/etc/machine-id:ro
      - promtail-positions:/tmp
    networks:
      - monitoring
    depends_on:
      loki:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 128M
Start Loki and Promtail
docker compose up -d loki
Wait 30 seconds. Loki takes a moment to warm up its ingester -- you will see a log line about "waiting for 15s after being ready." Be patient.
Then start Promtail:
docker compose up -d promtail
Verify Logs Are Flowing
Check Loki labels:
curl -s http://localhost:3100/loki/api/v1/labels | python3 -m json.tool
You should see labels like container, service, project (from Docker), unit, hostname (from journald), and job.
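To pull a few actual log lines back out -- a quick end-to-end check of the Promtail-to-Loki path -- you can query Loki directly. If you don't pass a time range, query_range defaults to roughly the last hour:

curl -s -G http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="docker"}' \
  --data-urlencode 'limit=5' | python3 -m json.tool | head -n 40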
What to expect on first boot: The first 30-60 seconds will produce some errors in Promtail's logs -- 429 (rate limit) and 400 (timestamp too old) responses from Loki. This is normal. Promtail is trying to send all existing Docker container logs at once. It catches up quickly and the errors stop. Don't restart anything.
Phase 4: Alerting -- Alertmanager and 23 Rules
This phase adds Alertmanager for notification routing and three files of Prometheus alert rules.
Alertmanager Configuration
Create alertmanager/alertmanager.yml:
global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "critical"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "default"
      repeat_interval: 4h

receivers:
  # Notifications are NOT configured by default.
  # Alerts will fire and appear in the Alerts Dashboard, but won't reach
  # Discord/Slack/email until you set up a receiver.
  # See "Adding Notification Channels" below.
  - name: "default"
    # webhook_configs:
    #   - url: "http://localhost:9095"
    #     send_resolved: true
  - name: "critical"
    # webhook_configs:
    #   - url: "http://localhost:9095"
    #     send_resolved: true

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
The routing logic: critical alerts repeat every hour, warnings repeat every 4 hours. The inhibit rule is important -- if a CriticalCpuUsage alert fires, it suppresses the HighCpuUsage warning for the same instance. Otherwise you get two notifications for the same problem.
To actually receive notifications, you need to point the receiver URLs at a real webhook. See the "Adding Notification Channels" section at the end for Discord, Slack, and email setup.
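Before restarting Alertmanager after any edit, it is worth validating the file. The official image ships amtool, so -- assuming the config is mounted at the image's default path of /etc/alertmanager/alertmanager.yml -- you can lint it in place:

docker exec monitoring-alertmanager amtool check-config /etc/alertmanager/alertmanager.yml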
Alert Rules -- Host Monitoring
Create prometheus/alerts/host-alerts.yml. Here are the key rules:
groups:
  - name: host-alerts
    rules:
      - alert: HighCpuUsage
        expr: >
          100 - (avg by(instance)
          (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: CriticalCpuUsage
        expr: >
          100 - (avg by(instance)
          (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"

      - alert: DiskWillFillIn24Hours
        expr: >
          predict_linear(
            node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}[6h],
            24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk predicted to fill within 24 hours"
          description: "Based on the last 6 hours of growth, the root filesystem will be full in less than 24 hours."

      - alert: NodeDown
        expr: up{job=~"node-exporter|remote-nodes"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: HighSwapUsage
        expr: >
          (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes))
          * 100 > 50
        for: 10m
        labels:
          severity: warning

      - alert: ClockSkew
        expr: abs(node_timex_offset_seconds) > 0.05
        for: 5m
        labels:
          severity: warning
The full file has 12 rules covering CPU (80% and 95%), memory (80% and 95%), disk (80%, 90%, and predictive fill), node availability, swap, system load, network traffic, and clock drift.
The DiskWillFillIn24Hours rule is the most interesting. Instead of alerting when disk hits 85% (which might be fine on a 4 TB drive), it uses predict_linear over the last 6 hours of growth to estimate when the filesystem will actually hit zero. Much more useful in practice.
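You can preview exactly what the rule sees by pasting the inner expression into the Prometheus UI (Graph tab at http://localhost:9090). Dividing by 1e9 turns the projected free space into gigabytes; anything that dips below zero means the alert would fire:

predict_linear(node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}[6h], 24 * 3600) / 1e9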
All rules use for: durations of at least 2 minutes. Without them, a 30-second CPU spike during a backup would fire an alert. Five minutes is the sweet spot for homelab use -- long enough to filter transient spikes, short enough to catch real problems.
Alert Rules -- Container Monitoring
Create prometheus/alerts/container-alerts.yml with 6 rules:
groups:
  - name: container-alerts
    rules:
      - alert: ContainerDown
        expr: >
          absent(container_last_seen{name=~".+"})
          or (time() - container_last_seen{name=~".+"}) > 300
        for: 2m
        labels:
          severity: warning

      - alert: ContainerOomKill
        expr: increase(container_oom_events_total{name=~".+"}[5m]) > 0
        for: 0m
        labels:
          severity: critical

      - alert: ContainerRestarting
        expr: increase(container_start_time_seconds{name=~".+"}[15m]) > 2
        for: 0m
        labels:
          severity: warning
ContainerOomKill fires immediately (no for: delay) because an OOM kill is always worth knowing about. ContainerRestarting catches crash loops -- more than 2 restarts in 15 minutes means something is broken.
Alert Rules -- Proxmox
Create prometheus/alerts/proxmox-alerts.yml with 5 rules. If you are not running Proxmox, skip this file entirely -- rules that reference metrics Prometheus never scrapes simply never fire, so nothing breaks without it.
groups:
  - name: proxmox-alerts
    rules:
      - alert: ProxmoxGuestStopped
        expr: pve_up{id=~"qemu/.*|lxc/.*"} * on(id) group_left(name, node) pve_guest_info == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxmox guest {{ $labels.name }} is not running"

      - alert: ProxmoxNodeDown
        expr: pve_up{id=~"node/.*"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: ProxmoxStorageHigh
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) * 100 > 85
        for: 10m
        labels:
          severity: warning
I put the Proxmox rules in their own file so you can delete it if you don't use Proxmox. One file removal, no commenting out individual rules.
Start Alertmanager and Reload Prometheus
docker compose up -d alertmanager
docker compose restart prometheus
Prometheus needs a restart to load the new rule files. Verify they loaded:
curl -s http://localhost:9090/api/v1/rules | python3 -c 'import json,sys; print(len(json.load(sys.stdin)["data"]["groups"]))'
Should return 3 (three rule groups: host-alerts, container-alerts, proxmox-alerts). A plain grep for "name" over-counts here because every individual rule also carries a name field, so the count goes through python instead.
Check for any rules currently firing:
curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool | grep '"alertname"'
Don't be surprised if HighSwapUsage is already pending. On a box with limited RAM running 9+ containers, 50% swap usage is not unusual.
Test the full alert pipeline by sending a manual alert to Alertmanager:
curl -X POST http://localhost:9093/api/v2/alerts \
-H 'Content-Type: application/json' \
-d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test alert from tutorial"}}]'
Check it appeared: curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool. If you have notification channels configured, you should get a notification within 30 seconds.
Phase 5: Proxmox Monitoring
Skip this phase if you don't run Proxmox. Nothing else depends on it.
The PVE Exporter queries the Proxmox API and exposes metrics about nodes, VMs, LXC containers, and storage. Prometheus scrapes the exporter, not Proxmox directly. I wrote a deeper guide on the Proxmox monitoring setup covering the dashboard design, all 5 alert rules, and multi-cluster support.
Create the Monitoring User on Proxmox
On your Proxmox host (SSH in or use the web shell). The full Proxmox API token guide covers troubleshooting and the web UI alternative if you prefer clicking over typing:
# Create a dedicated monitoring user
pveum user add monitoring@pve --comment "Monitoring read-only"
# Create a role with audit-only permissions
pveum role add monitoring -privs "VM.Audit,Datastore.Audit,Sys.Audit,SDN.Audit"
# Assign the role at the root level (covers all nodes, VMs, storage)
pveum aclmod / -user monitoring@pve -role monitoring
# Create an API token (save the output -- you only see the secret once)
pveum user token add monitoring@pve monitoring --privsep 0
The last command outputs something like:
full-tokenid: monitoring@pve!monitoring
value: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
Copy that value -- it goes in your .env.
Configure the Exporter
In your .env:
PVE_HOST=192.168.1.100 # Your Proxmox IP
PVE_USER=monitoring@pve
PVE_TOKEN_NAME=monitoring
PVE_TOKEN_VALUE=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PVE_VERIFY_SSL=false # Self-signed certs are fine
Handle the Prometheus Config Quirk
Prometheus config files do not support ${VAR} substitution. The Docker Compose ${PVE_HOST} syntax works in docker-compose.yml but not in prometheus.yml. The Proxmox scrape job needs the actual IP in the target parameter.
The product kit includes a setup.sh script that reads your .env and replaces the PVE_TARGET placeholder in prometheus.yml:
#!/bin/bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/.env"
PROM_CONFIG="$SCRIPT_DIR/prometheus/prometheus.yml"

if [ -n "${PVE_HOST:-}" ] && [ "$PVE_HOST" != "192.168.1.100" ]; then
  sed -i "s/PVE_TARGET/${PVE_HOST}/g" "$PROM_CONFIG"
  echo "Prometheus: Proxmox target set to ${PVE_HOST}"
fi
Or just manually edit prometheus.yml and replace PVE_TARGET with your Proxmox IP. Either way works.
The Proxmox scrape job in prometheus.yml:
- job_name: "proxmox"
metrics_path: /pve
params:
module: [default]
cluster: ["1"]
node: ["1"]
target: ["PVE_TARGET"] # Replace with your Proxmox IP
static_configs:
- targets: ["pve-exporter:9221"]
Start the PVE Exporter
./setup.sh # or manually edit prometheus.yml
docker compose up -d pve-exporter
docker compose restart prometheus
Verify:
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A3 '"proxmox"'
If the target shows "up," your API token works and metrics are flowing. If you see 401 Unauthorized in the PVE Exporter logs, double-check the token value and make sure --privsep 0 was used when creating it.
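You can also probe the exporter directly, bypassing Prometheus, which makes auth problems much easier to spot. Swap in your own PVE_HOST value for the example IP:

curl -s "http://localhost:9221/pve?target=192.168.1.100&module=default" | head -n 20

A working token returns pve_* metrics; a bad token surfaces the authentication error right in the response or the exporter logs.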
For multi-node Proxmox clusters, the single API token works across all nodes -- the exporter discovers them automatically through the cluster API.
Phase 6: Uptime Kuma
Uptime Kuma is a standalone uptime monitoring tool with its own web UI. I included it in the stack because it fills a gap: Prometheus monitors what is running, Uptime Kuma monitors what is reachable. HTTP endpoints, TCP ports, DNS records, ping checks.
  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: monitoring-uptime-kuma
    ports:
      - "${UPTIME_KUMA_PORT:-3001}:3001"
    volumes:
      - uptime-kuma-data:/app/data
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 256M
docker compose up -d uptime-kuma
Open http://your-server-ip:3001. Uptime Kuma's first-boot screen asks you to create an admin account. After that, add monitors through its UI -- click "Add New Monitor" and enter a URL, port, or IP.
No Prometheus integration needed. Uptime Kuma has its own notification system (Discord, Slack, Telegram, email, and about 90 others built in). It is genuinely the easiest part of this entire stack.
All 9 services running. Under two minutes for this phase.
Phase 7: The 7 Dashboards
With data flowing into Prometheus and Loki, the dashboards come alive. The product kit includes 7 pre-built Grafana dashboards as JSON files in grafana/dashboards/. They auto-provision on boot.
If you are building manually, you can import community dashboards from grafana.com/grafana/dashboards or build your own. Here is what each dashboard in the kit covers:
1. System Overview (12 panels)
The home dashboard. CPU gauge with green/yellow/red thresholds (0/70/90%), memory gauge, disk usage, uptime stat. Below that: CPU usage by mode (stacked area chart showing user, system, iowait, steal), system load averages (1m/5m/15m) with a CPU core count reference line, memory breakdown (used/buffers/cached/available), swap usage, disk space per mountpoint, disk I/O throughput, network traffic (filtered to exclude lo, veth*, br-*, docker*), and network errors.
2. Docker Containers (11 panels)
Container count, total CPU %, total memory usage as stat panels. Per-container CPU stacked by name, CPU throttling events, a memory table with name/usage/limit/percentage (color-coded), per-container network I/O, block I/O read/write.
3. Proxmox Cluster (10 panels)
Node status table with online/offline color mapping, guest count, cluster-wide CPU and memory stats, per-node resource graphs, per-guest CPU and memory (filtered to VMs and LXC containers), storage usage bar gauge with color thresholds, and a storage details table.
4. Disk Health (9 panels)
Disk usage bar gauge excluding tmpfs/overlay/snap mounts, root free space, disk space over time, IOPS, throughput, IO utilization percentage, read/write latency, and the most useful panel: disk fill prediction using predict_linear with three projection lines (24 hours, 48 hours, 7 days) rendered with dashed/dotted styles.
5. Network (10 panels)
Total bandwidth in/out, active TCP connection count, per-interface bandwidth, TCP connection states (ESTABLISHED, TIME_WAIT, etc.), socket state breakdown, network errors and drops per interface, ICMP messages.
6. Alerts Dashboard (9 panels)
Firing alert count, pending alert count (both with or vector(0) fallback so they show 0 instead of "No data"), alert history over time, firing alerts table with severity color-coding, alert groups from Alertmanager, active and expired silences, and a configured rules table.
7. Logs Explorer (7 panels, Loki datasource)
Log volume by container (stacked bar), error count using regex matching (error|fatal|panic), Docker container log viewer, log levels pie chart, error rate by container, journal log viewer, journal errors panel.
Total across all dashboards: 68 visualization panels.
All dashboards reference provisioned datasource UIDs (prometheus, loki, alertmanager) that match the datasource provisioning config. This ensures Loki queries go to Loki and Prometheus queries go to Prometheus — Grafana's default datasource fallback only resolves to one type, so mixed-datasource stacks need explicit UIDs.
Phase 8: Verify Everything
At this point you should have all 9 containers running:
docker compose ps
Expected output:
NAME                        STATUS    PORTS
monitoring-alertmanager     Up        0.0.0.0:9093->9093/tcp
monitoring-cadvisor         Up        0.0.0.0:8080->8080/tcp
monitoring-grafana          Up        0.0.0.0:3000->3000/tcp
monitoring-loki             Up        0.0.0.0:3100->3100/tcp
monitoring-node-exporter    Up        0.0.0.0:9100->9100/tcp
monitoring-prometheus       Up        0.0.0.0:9090->9090/tcp
monitoring-promtail         Up
monitoring-pve-exporter     Up        0.0.0.0:9221->9221/tcp
monitoring-uptime-kuma      Up        0.0.0.0:3001->3001/tcp
Run through these checks:
Prometheus targets: curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep '"health"' -- expect 6-7 "up" targets (all except proxmox if you haven't configured credentials).
Alert rules loaded: curl -s http://localhost:9090/api/v1/rules | python3 -c 'import json,sys; print(sum(len(g["rules"]) for g in json.load(sys.stdin)["data"]["groups"]))' -- should return 23. (The rules API labels each rule with "name" rather than "alertname", so counting is more reliable through python than grep.)
Loki receiving logs: curl -s http://localhost:3100/loki/api/v1/labels -- should list labels like container, service, job.
Grafana dashboards: Open Grafana, click the dashboards icon. All 7 should be listed. The System Overview should show live CPU, memory, and disk data.
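If you prefer a single pass over the core HTTP health endpoints, a small convenience loop like this (not part of the kit; adjust the Grafana port if you changed it) prints a status code per service:

# Expect 200 from each endpoint
for url in \
  http://localhost:9090/-/healthy \
  http://localhost:9093/-/healthy \
  http://localhost:3100/ready \
  http://localhost:3000/api/health; do
  printf '%-40s %s\n' "$url" "$(curl -s -o /dev/null -w '%{http_code}' "$url")"
done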
Resource Usage
With all 9 services running on my Ubuntu 24.04 box:
| Service | RAM Usage |
|---|---|
| Grafana | ~120 MB |
| Prometheus | ~250 MB |
| Loki | ~90 MB |
| Promtail | ~45 MB |
| Node Exporter | ~15 MB |
| cAdvisor | ~80 MB |
| Alertmanager | ~20 MB |
| PVE Exporter | ~30 MB |
| Uptime Kuma | ~150 MB |
| Total | ~800 MB |
Disk usage: Prometheus writes about 200-500 MB per day depending on metric cardinality (how many containers you run, how many scrape targets). With 15-day retention and 10 GB max, it self-manages. Loki uses 1-3 GB for 7 days of logs.
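To see the equivalent numbers on your own host, Docker's built-in tooling covers both sides: docker stats for live memory per container, and docker system df -v for how much space the named volumes have accumulated:

docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
docker system df -v | grep -E 'prometheus-data|loki-data|grafana-data'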
Monthly cost: $0. Everything is self-hosted, no API calls, no subscriptions.
Adding Notification Channels
The Alertmanager config ships with placeholder webhook URLs. Here is how to wire up real notifications.
Discord
- In your Discord server: Server Settings > Integrations > Webhooks > New Webhook
- Copy the webhook URL
- Point a receiver at it. Alertmanager 0.25 and later speaks Discord natively via discord_configs, so no adapter container is needed. In alertmanager.yml:
receivers:
  - name: "default"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/YOUR_ID/YOUR_TOKEN"
        send_resolved: true
If you are pinned to an older Alertmanager, use a small adapter container such as alertmanager-discord instead -- the generic webhook_configs payload is not in the shape Discord's webhook API expects, so pointing it straight at the Discord URL will not work.
Slack
Slack has native support in Alertmanager:
receivers:
  - name: "default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
        channel: "#monitoring"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true
Email
Alertmanager also sends email natively over SMTP:
global:
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_from: "[email protected]"
  smtp_auth_username: "[email protected]"
  smtp_auth_password: "your-app-password"

receivers:
  - name: "default"
    email_configs:
      - to: "[email protected]"
        send_resolved: true
For Gmail, you need an App Password (not your regular password). Go to Google Account > Security > 2-Step Verification > App passwords.
After changing alertmanager.yml, restart Alertmanager:
docker compose restart alertmanager
What to Customize
Once the stack is running, these are the things most worth changing:
Add remote hosts. Install Node Exporter on other machines (the product kit includes a deploy-node-exporter.sh script that installs it as a systemd service on any Linux box). Then add them to prometheus.yml:
- job_name: "remote-nodes"
static_configs:
- targets: ["192.168.1.50:9100"]
labels:
host: "nas"
- targets: ["192.168.1.51:9100"]
labels:
host: "plex-server"
Restart Prometheus after adding targets.
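Because the compose file starts Prometheus with --web.enable-lifecycle, a hot reload also works -- no restart needed -- after editing prometheus.yml or the alert rule files:

curl -X POST http://localhost:9090/-/reload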
Change retention. Edit .env: PROMETHEUS_RETENTION=30d for a month of metrics, LOKI_RETENTION=336h for 14 days of logs. Longer retention = more disk usage.
Tune alert thresholds. The 80%/95% CPU thresholds and 50% swap threshold work for general-purpose servers. If you run Plex or compile code, you might want to raise the CPU thresholds. Edit the expressions in prometheus/alerts/host-alerts.yml.
Add custom alert rules. Drop a new .yml file in prometheus/alerts/. Prometheus picks it up on restart. The format is the same as the existing rule files. A common addition:
  - alert: SSLCertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "SSL certificate expiring within 14 days"
(This requires Blackbox Exporter, which is not in the base stack but straightforward to add.)
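If you do want to add it, a minimal sketch might look like the following. Treat it as a starting point rather than part of the tested kit -- the image tag and the http_2xx module name are assumptions to verify against the Blackbox Exporter docs:

  # docker-compose.yml -- assumed image tag, pin whatever is current
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: monitoring-blackbox-exporter
    networks:
      - monitoring
    restart: unless-stopped

  # prometheus.yml -- probe HTTPS endpoints through the exporter
  - job_name: "blackbox-https"
    metrics_path: /probe
    params:
      module: [http_2xx]  # module from the exporter's default config
    static_configs:
      - targets: ["https://example.com"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115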
Enable anonymous Grafana access for wall-mounted dashboards. In .env: GRAFANA_ANONYMOUS_ENABLED=true. Anonymous users get read-only Viewer access.
Skip the Setup -- Get the Pre-Built Kit
This tutorial covers the core deployment. The Homelab Monitoring Stack kit includes everything pre-configured and tested:
- 9 Docker Compose services with pinned versions, health checks, and resource limits
- 7 Grafana dashboards (68 panels total) auto-provisioned on first boot
- 23 alert rules across 3 rule files, tuned for homelab workloads
- 4 documentation files: README, Troubleshooting, Customization, and Proxmox Setup
- 2 scripts: remote Node Exporter installer and backup utility
- Complete .env.example with every configurable value documented
Three commands to a fully monitored homelab: copy .env, run setup, docker compose up -d.
Free -- skip the debugging, get straight to monitoring.
Get the Homelab Monitoring Stack
Built and documented by Dyllan at nxsi.io. Every config, command, and error in this tutorial comes from the real deployment I ran while building the product kit. The resource numbers, the cAdvisor Docker 28 bug, the Loki rate-limiting on first boot -- all real.