Observability

Prometheus

Metrics collection and alerting. Pairs with Grafana for pipeline monitoring.

Prometheus pulls (scrapes) metrics from your services on a schedule, stores them as time series data, and evaluates alert rules against them. Nothing pushes to it. Your services expose a /metrics endpoint, Prometheus hits it every N seconds, done.
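For reference, a /metrics endpoint is just plain text in the Prometheus exposition format. A minimal example of what a scrape returns (the metric here is made up for illustration):

# HELP http_requests_total Total HTTP requests handled, by method and status code.
# TYPE http_requests_total counter
http_requests_total{method="get",code="200"} 1027
http_requests_total{method="post",code="500"} 3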

Quick Setup (Docker Compose)

prometheus:
  image: prom/prometheus:v3.2.1
  container_name: prometheus
  ports:
    - "9090:9090"
  command:
    - "--config.file=/etc/prometheus/prometheus.yml"
    - "--storage.tsdb.path=/prometheus"
    - "--storage.tsdb.retention.time=15d"
  volumes:
    - prometheus-data:/prometheus
    - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
    - ./prometheus/alerts:/etc/prometheus/alerts:ro
  restart: unless-stopped

volumes:
  prometheus-data:

Create the config file:

# prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

Create the alerts directory (the compose file mounts it) and start the container:

mkdir -p prometheus/alerts
docker compose up -d prometheus

Open http://your-server:9090 to verify. Type up in the expression box and hit Execute -- you should see up{job="prometheus"} 1.
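If you'd rather check from a shell, Prometheus also exposes a health endpoint and an HTTP query API -- a quick sketch, swapping in your server's address for localhost:

curl -s http://localhost:9090/-/healthy
# should report the server as healthy
curl -s "http://localhost:9090/api/v1/query?query=up"
# JSON result for the same up query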

Adding Scrape Targets

Every service you want to monitor needs a scrape job. Common ones:

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

Node Exporter gives you host-level metrics (CPU, memory, disk, network). cAdvisor gives you per-container metrics. Between the two, you can monitor basically everything on a Docker host.
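A minimal sketch of the two exporter services, added alongside the prometheus service in the same Compose file (the image tags are examples -- pin whatever current versions you prefer):

node-exporter:
  image: prom/node-exporter:v1.9.0
  container_name: node-exporter
  command:
    - "--path.rootfs=/host"
  volumes:
    - /:/host:ro,rslave
  restart: unless-stopped

cadvisor:
  image: gcr.io/cadvisor/cadvisor:v0.49.1
  container_name: cadvisor
  volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:ro
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
  restart: unless-stopped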

After editing prometheus.yml, reload the config:

docker compose restart prometheus

Or send a SIGHUP if you don't want downtime:

docker exec prometheus kill -HUP 1
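Another option: add --web.enable-lifecycle to the command flags in the Compose file, and you can reload over HTTP instead:

curl -X POST http://localhost:9090/-/reload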

Alert Rules

Put alert rule files in prometheus/alerts/ and reference them in the config:

# Add to prometheus.yml
rule_files:
  - "/etc/prometheus/alerts/*.yml"

Example alert rule file:

# prometheus/alerts/host-alerts.yml
groups:
  - name: host
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage above 85% for 5 minutes"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Root filesystem has less than 15% free space"

Alerts fire in Prometheus but notifications go through Alertmanager -- a separate service that handles deduplication, grouping, silencing, and routing to Discord/Slack/email. That's covered in the Homelab Monitoring Stack Tutorial.
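For reference, pointing Prometheus at an Alertmanager instance is one block in prometheus.yml -- this sketch assumes an alertmanager container on the same Docker network; the Alertmanager side itself is out of scope here:

# Add to prometheus.yml once Alertmanager is running
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]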

Configuration Notes

  • Retention: Default is 15 days. For homelab use, that's plenty. Bump it with --storage.tsdb.retention.time=30d if you want a month. Each day of retention costs roughly 1-2 MB per scrape target.
  • Scrape interval: 15s is the standard default. Going below 10s generates a lot of data without much benefit for infrastructure monitoring. Go above 60s and you'll miss short spikes.
  • Relabeling: Prometheus has a powerful relabeling system for manipulating labels before storage. You probably don't need it until you do, and then you really need it. The relabel_config docs are worth bookmarking; there's a small sketch after this list.
  • Federation: If you run Prometheus on multiple machines, one instance can scrape another's /federate endpoint. Useful for multi-host setups without a central time series database.
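As a taste of relabeling, here's a sketch that strips the port from the instance label, so the node-exporter job from above shows up as node-exporter instead of node-exporter:9100:

# in prometheus.yml, under scrape_configs:
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
    relabel_configs:
      - source_labels: [__address__]
        regex: '([^:]+):\d+'
        target_label: instance
        replacement: '$1'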

Troubleshooting

Target shows "DOWN" in Status > Targets -- The scrape target isn't reachable. Check that the container is running, the port is correct, and both containers share a Docker network. To debug from inside the Prometheus container, use wget -qO- http://target:port/metrics (the image is BusyBox-based, so there's no curl).

"out of order sample" errors -- Two Prometheus instances are scraping the same target, or the system clock jumped. Don't run duplicate scrapers.

Storage growing faster than expected -- High-cardinality labels (like unique request IDs or user IDs in metric labels) create massive time series counts. Use prometheus_tsdb_head_series to check your active series count. Anything above 100k on a homelab setup is suspicious.
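To see which metrics are responsible, this PromQL counts active series per metric name (it can be heavy on a big server, but is fine at homelab scale):

topk(10, count by (__name__) ({__name__=~".+"}))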