Tutorial · Intermediate · February 21, 2026 · 21 min read · 24 min hands-on

Deploy a Complete Homelab Monitoring Stack with Docker Compose: Grafana, Prometheus, Loki, and 23 Alert Rules

Step-by-step tutorial for deploying a 9-service monitoring stack on any Linux server. Prometheus for metrics, Loki for logs, Grafana for dashboards, Alertmanager for notifications, plus Proxmox and Uptime Kuma. One docker compose up and you have 7 dashboards and 23 pre-configured alert rules.

homelab · monitoring · grafana · prometheus · loki · docker-compose · tutorial

Already downloaded the Homelab Monitoring Stack kit? All the files below are pre-configured in your download. This tutorial walks through building from scratch so you understand what each piece does and can customize it. Follow along to learn the stack, or use it as a reference when you need to modify something.

Most homelab monitoring setups start with Grafana and Prometheus, then slowly bolt on pieces over the next six months. Loki for logs. Alertmanager for notifications. cAdvisor for container metrics. Each one needs its own config, its own data source, its own debugging session.

I built this stack as a single Docker Compose deployment. Nine services, seven dashboards, twenty-three alert rules, and the entire thing comes up with one command. The build log covers the full story of how I designed it and the 8 errors I hit along the way. This tutorial walks through deploying it phase by phase so you understand what each piece does and can troubleshoot it yourself.

The full architecture:

                    +-----------+
                    |  Grafana  |  (Visualization, 7 dashboards)
                    +-----+-----+
                          |
          +---------------+---------------+
          |               |               |
    +-----+------+   +----+----+    +-----+--------+
    | Prometheus |   |  Loki   |    | Alertmanager |
    +-----+------+   +----+----+    +-----+--------+
          |               |               |
    +-----+------+   +----+----+     Notifications
    |  Scrapers  |   | Promtail|     (Discord/Slack/Email)
    +-----+------+   +---------+
          |
  +-------+-------+----------+
  |               |          |
Node Exporter  cAdvisor  PVE Exporter
(host metrics) (Docker)  (Proxmox VMs)

  +---------------+
  |  Uptime Kuma  |  (Standalone, HTTP/TCP/DNS checks)
  +---------------+

Prerequisites

Hardware:

  • 2 GB RAM minimum (the full stack uses 800 MB to 1.2 GB)
  • 10 GB free disk space (Prometheus defaults to 10 GB retention, Loki adds 1-3 GB)
  • Any x86_64 Linux server -- bare metal, VM, or LXC container

Software:

  • Docker 24+ with Compose v2. Check with docker compose version. If only the hyphenated docker-compose command exists, you are likely on Compose v1, which will not work with this compose file.
  • Git (optional, for cloning the config repo)

Tested on: Ubuntu 22.04, Ubuntu 24.04, Debian 12. Should work on Fedora, Arch, and any distro with Docker and systemd. The Promtail config handles both syslog and journald, so it adapts to whatever your distro uses.

If you are running Docker 28 or newer, read the cAdvisor note in Phase 2. There is a known upstream issue with the containerd snapshotter that affects container-level metrics.


Phase 1: Core Stack -- Prometheus, Grafana, Node Exporter

Create a project directory and set up the config files. I will show you every file you need to create, in order.

Directory Structure

mkdir -p homelab-monitoring/{prometheus/alerts,loki,promtail,alertmanager,grafana/provisioning/datasources,grafana/provisioning/dashboards,grafana/dashboards}
cd homelab-monitoring

The Docker Compose File

This is the backbone. Nine services, all with health checks, restart policies, and memory limits:

services:
  grafana:
    image: grafana/grafana:11.5.2
    container_name: monitoring-grafana
    ports:
      - "${GRAFANA_PORT:-3000}:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
      - GF_AUTH_ANONYMOUS_ENABLED=${GRAFANA_ANONYMOUS_ENABLED:-false}
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Viewer
      - GF_DASHBOARDS_DEFAULT_HOME_DASHBOARD_PATH=/var/lib/grafana/dashboards/system-overview.json
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    networks:
      - monitoring
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/api/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 256M

  prometheus:
    image: prom/prometheus:v3.2.1
    container_name: monitoring-prometheus
    ports:
      - "${PROMETHEUS_PORT:-9090}:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION:-15d}"
      - "--storage.tsdb.retention.size=${PROMETHEUS_RETENTION_SIZE:-10GB}"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alerts:/etc/prometheus/alerts:ro
    networks:
      - monitoring
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          memory: 512M

  node-exporter:
    image: prom/node-exporter:v1.9.0
    container_name: monitoring-node-exporter
    ports:
      - "${NODE_EXPORTER_PORT:-9100}:9100"
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/host"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/host:ro,rslave
    pid: host
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 64M

I am showing only three services here. The full compose file has all nine, but you don't need to deploy them all at once. The remaining services (Loki, Promtail, cAdvisor, Alertmanager, PVE Exporter, Uptime Kuma) follow the same pattern and get added in later phases.

Every service has:

  • Pinned image versions. No :latest tags. When you deploy this in six months, you get the same tested versions.
  • Health checks. Compose can gate dependent services on health status. Promtail waits for Loki to be healthy before starting.
  • Memory limits. The whole stack fits in about 1.2 GB. Without limits, Prometheus alone can eat 2 GB on a busy host.
  • Named volumes. Persistent data survives container recreation.

I wrote about 9 Docker Compose patterns from this stack in a separate article — covers env defaults, health checks, memory limits, restart policies, and the other small decisions that make compose files portable.

A volumes and networks block at the bottom ties it all together:

volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  alertmanager-data:
  uptime-kuma-data:
  promtail-positions:

networks:
  monitoring:
    name: monitoring
    driver: bridge

Environment Configuration

Create .env from the example:

cp .env.example .env
nano .env

The .env.example has every configurable value with comments:

# --- Grafana ---
GRAFANA_PORT=3000
GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=CHANGE_ME_grafana  # Change this
GRAFANA_ANONYMOUS_ENABLED=false

# --- Prometheus ---
PROMETHEUS_PORT=9090
PROMETHEUS_RETENTION=15d
PROMETHEUS_RETENTION_SIZE=10GB

# --- Loki ---
LOKI_PORT=3100
LOKI_RETENTION=168h    # 7 days

At minimum, change GRAFANA_ADMIN_PASSWORD. Everything else has sane defaults. If port 3000 is taken on your host (common if you run other services), change GRAFANA_PORT to something else. I used 3030 on my box because another service already had 3000.
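
Not sure whether a port is already taken? A quick check before editing .env (assumes ss from iproute2, which ships with most modern distros):

# Anything already listening on 3000?
ss -tlnp | grep ':3000 ' || echo "port 3000 is free"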

Prometheus Configuration

Create prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - /etc/prometheus/alerts/*.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          host: "monitoring-host"

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "alertmanager"
    static_configs:
      - targets: ["alertmanager:9093"]

  - job_name: "loki"
    static_configs:
      - targets: ["loki:3100"]

  - job_name: "grafana"
    static_configs:
      - targets: ["grafana:3000"]

All targets use Docker service names (node-exporter:9100, not localhost:9100). Docker's internal DNS resolves these automatically. This is portable -- the config works on any machine without IP changes.
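
You can watch this resolution work once the containers are up (after the Start Phase 1 step below). The Prometheus image ships BusyBox wget -- the same binary its health check uses -- so you can fetch a neighbor's metrics by service name from inside the network:

docker exec monitoring-prometheus wget -qO- http://node-exporter:9100/metrics | head -n 5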

Grafana Datasource Provisioning

Create grafana/provisioning/datasources/datasources.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    uid: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

  - name: Loki
    type: loki
    uid: loki
    access: proxy
    url: http://loki:3100
    editable: false

  - name: Alertmanager
    type: alertmanager
    uid: alertmanager
    access: proxy
    url: http://alertmanager:9093
    editable: false
    jsonData:
      implementation: prometheus

The uid fields are important. Dashboards reference datasources by UID, not by name. If you leave them out, Grafana auto-generates random UIDs and the pre-built dashboard panels won't find their datasources. Worse: queries meant for Loki get routed to Prometheus (the default), and Prometheus chokes on LogQL syntax with a confusing "invalid character" error.

This auto-provisions all three datasources on Grafana's first boot. No clicking through the UI to add them manually.

Dashboard Provisioning

Create grafana/provisioning/dashboards/dashboards.yml:

apiVersion: 1

providers:
  - name: "Homelab"
    orgId: 1
    folder: ""
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: false

Any .json file you drop in grafana/dashboards/ gets auto-imported. The 30-second update interval means changes appear quickly during development.

Start Phase 1

docker compose up -d prometheus node-exporter grafana

Start only these three services. The compose file references config files for Loki, Alertmanager, and Promtail, so those files need to exist as placeholders (even empty ones work -- Docker just needs a valid path for the bind mount).

Create the placeholders if you haven't written the full configs yet:

touch loki/loki.yml promtail/promtail.yml alertmanager/alertmanager.yml

Verify Phase 1

Check Prometheus targets:

curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -E '"health"|"job"'

You should see:

"job": "prometheus" ... "health": "up"
"job": "node-exporter" ... "health": "up"
"job": "grafana" ... "health": "up"

The alertmanager, cadvisor, loki, and proxmox jobs will show as "down." Expected.

Open Grafana at http://your-server-ip:3000 (or whatever port you set). Log in with the credentials from your .env. The three datasources should already be listed under Connections > Data sources. No panels yet -- dashboards come in Phase 7.
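
If you prefer the terminal to the UI, the same datasource check works against Grafana's HTTP API -- substitute the admin credentials from your .env:

curl -s -u admin:CHANGE_ME_grafana http://localhost:3000/api/datasources | python3 -m json.tool | grep -E '"name"|"uid"'

You should see the three provisioned datasources with the exact UIDs prometheus, loki, and alertmanager.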


Phase 2: Container Monitoring -- cAdvisor

cAdvisor exposes per-container CPU, memory, network, and disk I/O metrics to Prometheus. It runs in privileged mode because it needs access to cgroup data and the Docker socket.

Add the cAdvisor service to your compose file (or just docker compose up -d cadvisor if you already have the full compose):

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.51.0
    container_name: monitoring-cadvisor
    ports:
      - "${CADVISOR_PORT:-8080}:8080"
    command:
      - "--housekeeping_interval=30s"
      - "--disable_metrics=advtcp,cpu_topology,cpuset,hugetlb,memory_numa,process,referenced_memory,resctrl,tcp,udp"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /dev/disk:/dev/disk:ro
    devices:
      - /dev/kmsg
    privileged: true
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 128M

The --disable_metrics flag cuts out collectors you don't need in a homelab. This reduces cAdvisor's memory usage from about 128 MB to 80 MB and cuts metric cardinality by roughly 40%.

docker compose up -d cadvisor

Verify the target:

curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A2 '"cadvisor"'

Should show "health": "up".
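
Two quick sanity checks against cAdvisor's own endpoint: the first counts how many container_* series it exposes (the exact number depends on how many containers you run), the second shows whether per-container name labels are populated, which matters for the Docker 28+ note below:

curl -s http://localhost:8080/metrics | grep -c '^container_'
curl -s http://localhost:8080/metrics | grep 'container_memory_usage_bytes' | head -n 3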

Docker 28+ Users: Read This

If you're running Docker 28 or newer (check with docker --version), cAdvisor v0.51.0 has a known incompatibility with the containerd snapshotter storage driver. You will see errors like:

failed to identify the read-write layer ID for container "abc123"

cAdvisor still starts, still serves metrics, and the Prometheus target shows UP. But per-container labels (name, image) won't be attached to the metrics, which means the Docker Containers dashboard won't show individual container names. The cgroup-level CPU and memory data still flows.

This is an upstream cAdvisor bug, not a config issue. I tried v0.52.1 too -- same problem. If you're on Docker 24-27, everything works perfectly. On Docker 28+, you get host-level metrics from cAdvisor but lose per-container naming. Node Exporter and Prometheus still give you full host visibility regardless.

(I spent a while trying workarounds -- --docker_only=true, mounting the containerd socket, creating dummy mount-id files. None of them fixed it. Sometimes you just have to document the limitation and move on.)


Phase 3: Log Aggregation -- Loki and Promtail

Prometheus handles metrics. Loki handles logs. The pipeline is: Promtail collects logs from Docker containers and system journals, ships them to Loki, and Grafana queries Loki to display them.

Loki Configuration

Create loki/loki.yml:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  log_level: warn

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: "2024-01-01"
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 168h
  max_query_length: 721h
  max_query_series: 500
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

compactor:
  working_directory: /loki/compactor
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  delete_request_store: filesystem

Two things to note. The ingestion_rate_mb: 10 and ingestion_burst_size_mb: 20 are higher than Loki's defaults (4 and 6). I bumped these because on first boot, Promtail reads ALL existing Docker container logs at once. With the default limits, you get a flood of 429 rate-limit errors for the first minute. 10 MB/s handles the initial burst without issues.

The retention_period: 168h keeps logs for 7 days. For a homelab, that is usually plenty. Bump it to 336h (14 days) or 720h (30 days) if you have the disk space.

Promtail Configuration

Create promtail/promtail.yml:

server:
  http_listen_port: 9080
  grpc_listen_port: 0
  log_level: warn

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # --- Docker container logs ---
  - job_name: docker
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 5s
    relabel_configs:
      - replacement: "docker"
        target_label: "job"
      - source_labels: ["__meta_docker_container_name"]
        regex: "/(.+)"
        target_label: "container"
      - source_labels: ["__meta_docker_container_label_com_docker_compose_service"]
        target_label: "service"
      - source_labels: ["__meta_docker_container_label_com_docker_compose_project"]
        target_label: "project"

  # --- System logs (syslog-based distros) ---
  - job_name: system
    static_configs:
      - targets: ["localhost"]
        labels:
          job: syslog
          host: monitoring-host
          __path__: /var/log/syslog
      - targets: ["localhost"]
        labels:
          job: authlog
          host: monitoring-host
          __path__: /var/log/auth.log

  # --- journald logs (Ubuntu 24.04+, Fedora, Arch) ---
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: journal
        host: monitoring-host
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"
      - source_labels: ["__journal__hostname"]
        target_label: "hostname"
      - source_labels: ["__journal_priority_keyword"]
        target_label: "severity"

One gotcha: docker_sd_configs does not automatically set a job label from the job_name field. That first relabel rule (replacement: "docker", target_label: "job") is required. Without it, Grafana dashboard queries that filter on {job="docker"} match nothing, and you get cryptic errors instead of logs.

The Promtail config supports both syslog-based distros (Debian, Ubuntu before 24.04) and journald-based distros (Ubuntu 24.04+, Fedora, Arch). Missing log files are silently skipped. On Ubuntu 24.04, /var/log/syslog does not exist -- Promtail just ignores that entry and reads from journald instead.
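
If you are curious which path your host actually takes, a quick check (purely informational -- the config copes with either answer):

test -f /var/log/syslog && echo "rsyslog-style /var/log/syslog present" || echo "no /var/log/syslog -- journald only"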

The max_age: 12h on the journal scrape prevents Promtail from reading weeks of journal history on first boot. You only want recent entries.

The Promtail service in the compose file needs these volume mounts to access both Docker logs and system journals:

  promtail:
    image: grafana/promtail:3.4.2
    container_name: monitoring-promtail
    command: -config.file=/etc/promtail/promtail.yml
    volumes:
      - ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - /run/log/journal:/run/log/journal:ro
      - /etc/machine-id:/etc/machine-id:ro
      - promtail-positions:/tmp
    networks:
      - monitoring
    depends_on:
      loki:
        condition: service_healthy
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 128M

Start Loki and Promtail

docker compose up -d loki

Wait 30 seconds. Loki takes a moment to warm up its ingester -- you will see a log line about "waiting for 15s after being ready." Be patient.

Then start Promtail:

docker compose up -d promtail

Verify Logs Are Flowing

Check Loki labels:

curl -s http://localhost:3100/loki/api/v1/labels | python3 -m json.tool

You should see labels like container, service, project (from Docker), unit, hostname (from journald), and job.
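
To pull a few real log lines back out -- roughly what Grafana does behind the Explore view -- query Loki's HTTP API directly. Without explicit start/end parameters, query_range defaults to the last hour:

curl -G -s http://localhost:3100/loki/api/v1/query_range \
  --data-urlencode 'query={job="docker"}' \
  --data-urlencode 'limit=5' | python3 -m json.tool | head -n 40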

What to expect on first boot: The first 30-60 seconds will produce some errors in Promtail's logs -- 429 (rate limit) and 400 (timestamp too old) responses from Loki. This is normal. Promtail is trying to send all existing Docker container logs at once. It catches up quickly and the errors stop. Don't restart anything.


Phase 4: Alerting -- Alertmanager and 23 Rules

This phase adds Alertmanager for notification routing and three files of Prometheus alert rules.

Alertmanager Configuration

Create alertmanager/alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ["alertname", "severity"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default"
  routes:
    - match:
        severity: critical
      receiver: "critical"
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: "default"
      repeat_interval: 4h

receivers:
  # Notifications are NOT configured by default.
  # Alerts will fire and appear in the Alerts Dashboard, but won't reach
  # Discord/Slack/email until you set up a receiver.
  # See "Adding Notification Channels" below.
  - name: "default"
    # webhook_configs:
    #   - url: "http://localhost:9095"
    #     send_resolved: true

  - name: "critical"
    # webhook_configs:
    #   - url: "http://localhost:9095"
    #     send_resolved: true

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]

The routing logic: critical alerts repeat every hour, warnings repeat every 4 hours. The inhibit rule is important -- if a CriticalCpuUsage alert fires, it suppresses the HighCpuUsage warning for the same instance. Otherwise you get two notifications for the same problem.

To actually receive notifications, you need to point the receiver URLs at a real webhook. See the "Adding Notification Channels" section at the end for Discord, Slack, and email setup.

Alert Rules -- Host Monitoring

Create prometheus/alerts/host-alerts.yml. Here are the key rules:

groups:
  - name: host-alerts
    rules:
      - alert: HighCpuUsage
        expr: >
          100 - (avg by(instance)
          (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current: {{ $value | printf \"%.1f\" }}%)"

      - alert: CriticalCpuUsage
        expr: >
          100 - (avg by(instance)
          (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"

      - alert: DiskWillFillIn24Hours
        expr: >
          predict_linear(
            node_filesystem_avail_bytes{mountpoint="/",fstype!="tmpfs"}[6h],
            24 * 3600
          ) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Disk predicted to fill within 24 hours"
          description: "Based on the last 6 hours of growth, the root filesystem will be full in less than 24 hours."

      - alert: NodeDown
        expr: up{job=~"node-exporter|remote-nodes"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: HighSwapUsage
        expr: >
          (1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes))
          * 100 > 50
        for: 10m
        labels:
          severity: warning

      - alert: ClockSkew
        expr: abs(node_timex_offset_seconds) > 0.05
        for: 5m
        labels:
          severity: warning

The full file has 12 rules covering CPU (80% and 95%), memory (80% and 95%), disk (80%, 90%, and predictive fill), node availability, swap, system load, network traffic, and clock drift.

The DiskWillFillIn24Hours rule is the most interesting. Instead of alerting when disk hits 85% (which might be fine on a 4 TB drive), it uses predict_linear over the last 6 hours of growth to estimate when the filesystem will actually hit zero. Much more useful in practice.

All rules use for: durations of at least 2 minutes. Without them, a 30-second CPU spike during a backup would fire an alert. Five minutes is the sweet spot for homelab use -- long enough to filter transient spikes, short enough to catch real problems.
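
Before restarting anything, you can lint the rule file with promtool, which is bundled in the Prometheus image. The alerts directory is already mounted read-only at /etc/prometheus/alerts (see the compose file in Phase 1), so the new file is visible inside the running container:

docker compose exec prometheus promtool check rules /etc/prometheus/alerts/host-alerts.yml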

Alert Rules -- Container Monitoring

Create prometheus/alerts/container-alerts.yml with 6 rules:

groups:
  - name: container-alerts
    rules:
      - alert: ContainerDown
        expr: >
          absent(container_last_seen{name=~".+"})
          or (time() - container_last_seen{name=~".+"}) > 300
        for: 2m
        labels:
          severity: warning

      - alert: ContainerOomKill
        expr: increase(container_oom_events_total{name=~".+"}[5m]) > 0
        for: 0m
        labels:
          severity: critical

      - alert: ContainerRestarting
        expr: increase(container_start_time_seconds{name=~".+"}[15m]) > 2
        for: 0m
        labels:
          severity: warning

ContainerOomKill fires immediately (no for: delay) because an OOM kill is always worth knowing about. ContainerRestarting catches crash loops -- more than 2 restarts in 15 minutes means something is broken.

Alert Rules -- Proxmox

Create prometheus/alerts/proxmox-alerts.yml with 5 rules. If you are not running Proxmox, skip this file entirely -- rules that query metrics which are never scraped simply never fire, so nothing else breaks.

groups:
  - name: proxmox-alerts
    rules:
      - alert: ProxmoxGuestStopped
        expr: pve_up{id=~"qemu/.*|lxc/.*"} * on(id) group_left(name, node) pve_guest_info == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Proxmox guest {{ $labels.name }} is not running"

      - alert: ProxmoxNodeDown
        expr: pve_up{id=~"node/.*"} == 0
        for: 2m
        labels:
          severity: critical

      - alert: ProxmoxStorageHigh
        expr: (pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) * 100 > 85
        for: 10m
        labels:
          severity: warning

I put the Proxmox rules in their own file so you can delete it if you don't use Proxmox. One file removal, no commenting out individual rules.

Start Alertmanager and Reload Prometheus

docker compose up -d alertmanager
docker compose restart prometheus

Prometheus needs a restart to load the new rule files. Verify they loaded:

curl -s http://localhost:9090/api/v1/rules | python3 -m json.tool | grep -E '"name": "(host|container|proxmox)-alerts"'

You should see three lines, one per rule group: host-alerts, container-alerts, proxmox-alerts. (A plain grep -c '"name"' overshoots, because every individual rule also has a name field.)

Check for any rules currently firing:

curl -s http://localhost:9090/api/v1/alerts | python3 -m json.tool | grep '"alertname"'

Don't be surprised if HighSwapUsage is already pending. On a box with limited RAM running 9+ containers, 50% swap usage is not unusual.

Test the full alert pipeline by sending a manual alert to Alertmanager:

curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning"},"annotations":{"summary":"Test alert from tutorial"}}]'

Check it appeared: curl -s http://localhost:9093/api/v2/alerts | python3 -m json.tool. If you have notification channels configured, you should get a notification within 30 seconds.


Phase 5: Proxmox Monitoring

Skip this phase if you don't run Proxmox. Nothing else depends on it.

The PVE Exporter queries the Proxmox API and exposes metrics about nodes, VMs, LXC containers, and storage. Prometheus scrapes the exporter, not Proxmox directly. I wrote a deeper guide on the Proxmox monitoring setup covering the dashboard design, all 5 alert rules, and multi-cluster support.

Create the Monitoring User on Proxmox

Run these commands on your Proxmox host (SSH in or use the web shell). The full Proxmox API token guide covers troubleshooting and the web UI alternative if you prefer clicking over typing:

# Create a dedicated monitoring user
pveum user add monitoring@pve --comment "Monitoring read-only"

# Create a role with audit-only permissions
pveum role add monitoring -privs "VM.Audit,Datastore.Audit,Sys.Audit,SDN.Audit"

# Assign the role at the root level (covers all nodes, VMs, storage)
pveum aclmod / -user monitoring@pve -role monitoring

# Create an API token (save the output -- you only see the secret once)
pveum user token add monitoring@pve monitoring --privsep 0

The last command outputs something like:

full-tokenid: monitoring@pve!monitoring
value: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx

Copy that value -- it goes in your .env.

Configure the Exporter

In your .env:

PVE_HOST=192.168.1.100        # Your Proxmox IP
PVE_USER=monitoring@pve
PVE_TOKEN_NAME=monitoring
PVE_TOKEN_VALUE=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PVE_VERIFY_SSL=false           # Self-signed certs are fine

Handle the Prometheus Config Quirk

Prometheus config files do not support ${VAR} substitution. The Docker Compose ${PVE_HOST} syntax works in docker-compose.yml but not in prometheus.yml. The Proxmox scrape job needs the actual IP in the target parameter.

The product kit includes a setup.sh script that reads your .env and replaces the PVE_TARGET placeholder in prometheus.yml:

#!/bin/bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
source "$SCRIPT_DIR/.env"

PROM_CONFIG="$SCRIPT_DIR/prometheus/prometheus.yml"
if [ -n "${PVE_HOST:-}" ] && [ "$PVE_HOST" != "192.168.1.100" ]; then
    sed -i "s/PVE_TARGET/${PVE_HOST}/g" "$PROM_CONFIG"
    echo "Prometheus: Proxmox target set to ${PVE_HOST}"
fi

Or just manually edit prometheus.yml and replace PVE_TARGET with your Proxmox IP. Either way works.

The Proxmox scrape job in prometheus.yml:

  - job_name: "proxmox"
    metrics_path: /pve
    params:
      module: [default]
      cluster: ["1"]
      node: ["1"]
      target: ["PVE_TARGET"]   # Replace with your Proxmox IP
    static_configs:
      - targets: ["pve-exporter:9221"]

Start the PVE Exporter

./setup.sh   # or manually edit prometheus.yml
docker compose up -d pve-exporter
docker compose restart prometheus

Verify:

curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep -A3 '"proxmox"'

If the target shows "up," your API token works and metrics are flowing. If you see 401 Unauthorized in the PVE Exporter logs, double-check the token value and make sure --privsep 0 was used when creating it.

For multi-node Proxmox clusters, the single API token works across all nodes -- the exporter discovers them automatically through the cluster API.
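
If the Prometheus target stays down and you want to isolate whether the problem is the exporter or the scrape config, hit the exporter directly from the host. This uses the same path and parameters as the scrape job above -- swap in your own Proxmox IP:

curl -s "http://localhost:9221/pve?target=192.168.1.100&module=default&cluster=1&node=1" | head -n 20

A healthy response is plain Prometheus metrics prefixed with pve_; an error here points at the exporter or the token, not at Prometheus.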


Phase 6: Uptime Kuma

Uptime Kuma is a standalone uptime monitoring tool with its own web UI. I included it in the stack because it fills a gap: Prometheus monitors what is running, Uptime Kuma monitors what is reachable. HTTP endpoints, TCP ports, DNS records, ping checks.

  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: monitoring-uptime-kuma
    ports:
      - "${UPTIME_KUMA_PORT:-3001}:3001"
    volumes:
      - uptime-kuma-data:/app/data
    networks:
      - monitoring
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 256M

docker compose up -d uptime-kuma

Open http://your-server-ip:3001. Uptime Kuma's first-boot screen asks you to create an admin account. After that, add monitors through its UI -- click "Add New Monitor" and enter a URL, port, or IP.

No Prometheus integration needed. Uptime Kuma has its own notification system (Discord, Slack, Telegram, email, and about 90 others built in). It is genuinely the easiest part of this entire stack.

All 9 services running. Under two minutes for this phase.


Phase 7: The 7 Dashboards

With data flowing into Prometheus and Loki, the dashboards come alive. The product kit includes 7 pre-built Grafana dashboards as JSON files in grafana/dashboards/. They auto-provision on boot.

If you are building manually, you can import community dashboards from grafana.com/grafana/dashboards or build your own. Here is what each dashboard in the kit covers:

1. System Overview (12 panels)

The home dashboard. CPU gauge with green/yellow/red thresholds (0/70/90%), memory gauge, disk usage, uptime stat. Below that: CPU usage by mode (stacked area chart showing user, system, iowait, steal), system load averages (1m/5m/15m) with a CPU core count reference line, memory breakdown (used/buffers/cached/available), swap usage, disk space per mountpoint, disk I/O throughput, network traffic (filtered to exclude lo, veth*, br-*, docker*), and network errors.

2. Docker Containers (11 panels)

Container count, total CPU %, total memory usage as stat panels. Per-container CPU stacked by name, CPU throttling events, a memory table with name/usage/limit/percentage (color-coded), per-container network I/O, block I/O read/write.

3. Proxmox Cluster (10 panels)

Node status table with online/offline color mapping, guest count, cluster-wide CPU and memory stats, per-node resource graphs, per-guest CPU and memory (filtered to VMs and LXC containers), storage usage bar gauge with color thresholds, and a storage details table.

4. Disk Health (9 panels)

Disk usage bar gauge excluding tmpfs/overlay/snap mounts, root free space, disk space over time, IOPS, throughput, IO utilization percentage, read/write latency, and the most useful panel: disk fill prediction using predict_linear with three projection lines (24 hours, 48 hours, 7 days) rendered with dashed/dotted styles.

5. Network (10 panels)

Total bandwidth in/out, active TCP connection count, per-interface bandwidth, TCP connection states (ESTABLISHED, TIME_WAIT, etc.), socket state breakdown, network errors and drops per interface, ICMP messages.

6. Alerts Dashboard (9 panels)

Firing alert count, pending alert count (both with or vector(0) fallback so they show 0 instead of "No data"), alert history over time, firing alerts table with severity color-coding, alert groups from Alertmanager, active and expired silences, and a configured rules table.

7. Logs Explorer (7 panels, Loki datasource)

Log volume by container (stacked bar), error count using regex matching (error|fatal|panic), Docker container log viewer, log levels pie chart, error rate by container, journal log viewer, journal errors panel.

Total across all dashboards: 68 visualization panels.

All dashboards reference provisioned datasource UIDs (prometheus, loki, alertmanager) that match the datasource provisioning config. This ensures Loki queries go to Loki and Prometheus queries go to Prometheus — Grafana's default datasource fallback only resolves to one type, so mixed-datasource stacks need explicit UIDs.


Phase 8: Verify Everything

At this point you should have all 9 containers running:

docker compose ps

Expected output:

NAME                      STATUS    PORTS
monitoring-alertmanager   Up        0.0.0.0:9093->9093/tcp
monitoring-cadvisor       Up        0.0.0.0:8080->8080/tcp
monitoring-grafana        Up        0.0.0.0:3000->3000/tcp
monitoring-loki           Up        0.0.0.0:3100->3100/tcp
monitoring-node-exporter  Up        0.0.0.0:9100->9100/tcp
monitoring-prometheus     Up        0.0.0.0:9090->9090/tcp
monitoring-promtail       Up
monitoring-pve-exporter   Up        0.0.0.0:9221->9221/tcp
monitoring-uptime-kuma    Up        0.0.0.0:3001->3001/tcp

Run through these checks:

Prometheus targets: curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | grep '"health"' -- expect 6-7 "up" targets (all except proxmox if you haven't configured credentials).

Alert rules loaded: curl -s http://localhost:9090/api/v1/rules | python3 -c "import json,sys; print(sum(len(g['rules']) for g in json.load(sys.stdin)['data']['groups']))" -- should print 23. (Counting "alertname" strings in the raw output only counts alerts that are currently firing or pending, not the rules that are loaded.)

Loki receiving logs: curl -s http://localhost:3100/loki/api/v1/labels -- should list labels like container, service, job.

Grafana dashboards: Open Grafana, click the dashboards icon. All 7 should be listed. The System Overview should show live CPU, memory, and disk data.

Resource Usage

With all 9 services running on my Ubuntu 24.04 box:

Service         RAM Usage
Grafana         ~120 MB
Prometheus      ~250 MB
Loki            ~90 MB
Promtail        ~45 MB
Node Exporter   ~15 MB
cAdvisor        ~80 MB
Alertmanager    ~20 MB
PVE Exporter    ~30 MB
Uptime Kuma     ~150 MB
Total           ~800 MB

Disk usage: Prometheus writes about 200-500 MB per day depending on metric cardinality (how many containers you run, how many scrape targets). With 15-day retention and 10 GB max, it self-manages. Loki uses 1-3 GB for 7 days of logs.
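
To see what that actually amounts to on your box, measure the volumes directly. The first command works because the prom/prometheus image is BusyBox-based; the second sizes the named volumes from the host (the names carry the compose project prefix):

docker exec monitoring-prometheus du -sh /prometheus
docker system df -v | grep -E 'prometheus-data|loki-data|grafana-data'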

Monthly cost: $0. Everything is self-hosted, no API calls, no subscriptions.


Adding Notification Channels

The Alertmanager config ships with placeholder webhook URLs. Here is how to wire up real notifications.

Discord

  1. In your Discord server: Server Settings > Integrations > Webhooks > New Webhook
  2. Copy the webhook URL
  3. Wire it into Alertmanager. Alertmanager v0.25 and newer speaks Discord natively via discord_configs. On older versions you need a translation layer such as the alertmanager-discord container, because Discord rejects Alertmanager's generic webhook payload.

With native support, the receiver in alertmanager.yml looks like this:

receivers:
  - name: "default"
    discord_configs:
      - webhook_url: "https://discord.com/api/webhooks/YOUR_ID/YOUR_TOKEN"
        send_resolved: true

Slack

Slack has native support in Alertmanager:

receivers:
  - name: "default"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
        channel: "#monitoring"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
        send_resolved: true

Email

global:
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_from: "[email protected]"
  smtp_auth_username: "[email protected]"
  smtp_auth_password: "your-app-password"

receivers:
  - name: "default"
    email_configs:
      - to: "[email protected]"
        send_resolved: true

For Gmail, you need an App Password (not your regular password). Go to Google Account > Security > 2-Step Verification > App passwords.

After changing alertmanager.yml, restart Alertmanager:

docker compose restart alertmanager
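
Whichever channel you pick, it is worth validating the edited file before (or right after) the restart. The Alertmanager image bundles amtool; the container path below assumes the config is mounted at /etc/alertmanager/alertmanager.yml -- adjust it to match your compose file's mount:

docker compose exec alertmanager amtool check-config /etc/alertmanager/alertmanager.yml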

What to Customize

Once the stack is running, these are the things most worth changing:

Add remote hosts. Install Node Exporter on other machines (the product kit includes a deploy-node-exporter.sh script that installs it as a systemd service on any Linux box). Then add them to prometheus.yml:

  - job_name: "remote-nodes"
    static_configs:
      - targets: ["192.168.1.50:9100"]
        labels:
          host: "nas"
      - targets: ["192.168.1.51:9100"]
        labels:
          host: "plex-server"

Restart Prometheus after adding targets.
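
Because the compose file starts Prometheus with --web.enable-lifecycle, a hot reload works instead of a full restart -- handy when you are iterating on scrape targets or alert rules:

curl -X POST http://localhost:9090/-/reload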

Change retention. Edit .env: PROMETHEUS_RETENTION=30d for a month of metrics, LOKI_RETENTION=336h for 14 days of logs. Longer retention = more disk usage.

Tune alert thresholds. The 80%/95% CPU thresholds and 50% swap threshold work for general-purpose servers. If you run Plex or compile code, you might want to raise the CPU thresholds. Edit the expressions in prometheus/alerts/host-alerts.yml.

Add custom alert rules. Drop a new .yml file in prometheus/alerts/. Prometheus picks it up on restart. The format is the same as the existing rule files. A common addition:

- alert: SSLCertExpiringSoon
  expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "SSL certificate expiring within 14 days"

(This requires Blackbox Exporter, which is not in the base stack but straightforward to add.)

Enable anonymous Grafana access for wall-mounted dashboards. In .env: GRAFANA_ANONYMOUS_ENABLED=true. Anonymous users get read-only Viewer access.


Skip the Setup -- Get the Pre-Built Kit

This tutorial covers the core deployment. The Homelab Monitoring Stack kit includes everything pre-configured and tested:

  • 9 Docker Compose services with pinned versions, health checks, and resource limits
  • 7 Grafana dashboards (68 panels total) auto-provisioned on first boot
  • 23 alert rules across 3 rule files, tuned for homelab workloads
  • 4 documentation files: README, Troubleshooting, Customization, and Proxmox Setup
  • 2 scripts: remote Node Exporter installer and backup utility
  • Complete .env.example with every configurable value documented

Three commands to a fully monitored homelab: copy .env, run setup, docker compose up -d.

Free -- skip the debugging, get straight to monitoring.

Get the Homelab Monitoring Stack


Built and documented by Dyllan at nxsi.io. Every config, command, and error in this tutorial comes from the real deployment I ran while building the product kit. The resource numbers, the cAdvisor Docker 28 bug, the Loki rate-limiting on first boot -- all real.
