
Grafana Backup Dashboard

Connect the Homelab Backup Stack's Prometheus metrics to Grafana to visualize backup health (scores, ages, sizes, restore verification status) and storage trends alongside your existing infrastructure monitoring.

This guide assumes you're running both the backup stack and a Grafana/Prometheus setup (like the Homelab Monitoring Stack). If you're running a different Prometheus/Grafana installation, adapt the Docker Compose paths accordingly.

How the Data Flows

backup-metrics.sh (cron, every 5 min)
    ↓ writes
/var/lib/node_exporter/backup.prom (textfile on host)
    ↓ mounted into
Node Exporter (textfile collector)
    ↓ scraped by
Prometheus (every 15s)
    ↓ queried by
Grafana (dashboard panels)

The textfile collector is Node Exporter's mechanism for ingesting custom metrics. You write a .prom file in Prometheus exposition format, Node Exporter serves it on its /metrics endpoint, and Prometheus scrapes it like any other metric.
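To make the format concrete, here is a minimal shell sketch of what a script feeding the textfile collector might print. The `emit_header` and `emit_sample` helpers are hypothetical and not taken from backup-metrics.sh; only the metric name and labels come from the sample output shown later in this guide.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# Print the HELP and TYPE comment lines. Call once per metric name.
emit_header() {
    printf '# HELP %s %s\n' "$1" "$2"
    printf '# TYPE %s gauge\n' "$1"
}

# Print one sample line: name{labels} value
emit_sample() {
    printf '%s{%s} %s\n' "$1" "$2" "$3"
}

emit_header backup_health_score "Health score 0-100"
emit_sample backup_health_score 'service="nxsi-postgres",profile="database"' 100
```

The header is split from the sample because a metric with several label sets should carry its HELP and TYPE lines only once, followed by one sample line per service.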

Step 1: Create the Textfile Directory

On your host machine:

sudo mkdir -p /var/lib/node_exporter
sudo chown "$(id -un):$(id -gn)" /var/lib/node_exporter

Step 2: Enable the Textfile Collector in Node Exporter

Edit your monitoring stack's docker-compose.yml and update the node-exporter service:

node-exporter:
  image: prom/node-exporter:v1.9.0
  container_name: monitoring-node-exporter
  command:
    - "--path.procfs=/host/proc"
    - "--path.sysfs=/host/sys"
    - "--path.rootfs=/host"
    - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    - "--collector.textfile.directory=/textfile"    # ADD THIS LINE
  volumes:
    - /proc:/host/proc:ro
    - /sys:/host/sys:ro
    - /:/host:ro,rslave
    - /var/lib/node_exporter:/textfile:ro           # ADD THIS LINE

Restart Node Exporter:

docker compose up -d node-exporter

Step 3: Set Up the Cron Export

Add a cron job that runs backup-metrics.sh every 5 minutes and writes the output to the textfile directory:

crontab -e

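Add this line (adjust the path to your backup stack installation). It writes to a temporary file and renames it only if the script succeeds, so Node Exporter never reads a half-written file. The textfile collector only picks up files ending in .prom, so the .tmp file is ignored:

*/5 * * * * /opt/homelab-backup-stack/scripts/backup-metrics.sh > /var/lib/node_exporter/backup.prom.tmp 2>/dev/null && mv /var/lib/node_exporter/backup.prom.tmp /var/lib/node_exporter/backup.prom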

Run it once manually to verify:

/opt/homelab-backup-stack/scripts/backup-metrics.sh > /var/lib/node_exporter/backup.prom
cat /var/lib/node_exporter/backup.prom

You should see metrics like:

backup_repo_size_bytes 501010509
backup_repo_snapshots 8
backup_last_success_timestamp{service="nxsi-postgres",profile="database"} 1771463558
backup_health_score{service="nxsi-postgres",profile="database"} 100
backup_verify_last_result{service="nxsi-postgres",profile="database"} 1
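If the output looks off, a rough format check can point at the offending line. The sketch below is an approximation, not a full parser: `bad_prom_lines` is a hypothetical helper that prints every line that is neither blank, a comment, nor a simple `name{labels} value` sample. It does not handle timestamps or special values such as NaN. For an authoritative check, `promtool check metrics` reads exposition text on stdin.

```shell
#!/bin/sh
# Hypothetical helper: print lines that do not look like valid exposition text.
# Exit status follows grep: 0 if suspicious lines were found, 1 if none.
bad_prom_lines() {
    grep -Ev '^(#.*|[[:space:]]*|[a-zA-Z_:][a-zA-Z0-9_:]*([{][^}]*[}])? [-+]?[0-9][0-9.eE+-]*)$' "$1"
}

# Example: report problems in the file written by the cron job.
# The path is the one used throughout this guide.
prom_file=/var/lib/node_exporter/backup.prom
if [ -r "$prom_file" ] && bad_prom_lines "$prom_file"; then
    echo "The lines above do not look like valid metrics" >&2
fi
```

A stray error message that lands in the file (for example, from a failing command inside the script) would show up here.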

Step 4: Verify Prometheus Is Scraping

Wait a minute for Prometheus to scrape, then check:

curl -s 'http://localhost:9090/api/v1/query?query=backup_health_score' | python3 -m json.tool

If you see results with your service names, the pipeline is working.

You can also check the Prometheus UI at http://your-server:9090 -- type backup_ in the expression box and you should see autocomplete suggestions for all backup metrics.
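If Prometheus returns nothing, it can help to check the hop before it: whether Node Exporter is serving the metrics at all. A small sketch, assuming Node Exporter's default port 9100 is reachable from where you run it (the compose snippet in Step 2 doesn't publish a port, so you may need to run this from inside the monitoring network). `only_backup_samples` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: keep only backup_* sample lines from exposition text
# on stdin, dropping comments and every other metric.
only_backup_samples() {
    grep -E '^backup_[a-zA-Z0-9_]+[{ ]'
}

# Offline demonstration with inline sample text.
printf '%s\n' \
    '# HELP backup_health_score Health score 0-100' \
    'node_load1 0.5' \
    'backup_repo_snapshots 8' | only_backup_samples
```

Against a live exporter, the usage would be `curl -s http://localhost:9100/metrics | only_backup_samples`. If backup.prom has content but this prints nothing, the likely suspects are the volume mount or the `--collector.textfile.directory` flag.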

Step 5: Create the Grafana Dashboard

Open Grafana (default: http://your-server:3000) and create a new dashboard. Add panels for each metric group below.

Panel 1: Backup Health Score (Gauge)

Shows each service's 0-100 health score as a colored gauge.

Query:

backup_health_score

Panel type: Gauge

Settings:

  • Min: 0, Max: 100
  • Thresholds: 0 = red, 50 = yellow, 80 = green
  • Legend: {{service}}
  • Title: "Backup Health Score"

Panel 2: Time Since Last Backup (Stat)

Shows how long ago each service was backed up. Stale backups stand out immediately.

Query:

backup_age_seconds

Panel type: Stat

Settings:

  • Unit: seconds (s) -- Grafana auto-formats to "2h 15m" etc.
  • Thresholds: 0 = green, 86400 = yellow (>24h), 172800 = red (>48h)
  • Legend: {{service}}
  • Title: "Time Since Last Backup"
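
The threshold bands above can be sketched in shell to make the boundaries explicit. This is an illustration only: `age_status` is a hypothetical helper, it assumes integer ages, and it assumes a value at or above a threshold takes that threshold's color. It also handles -1 (never backed up) separately, since a negative value would otherwise likely fall into the base color and look healthy.

```shell
#!/bin/sh
# Hypothetical helper: map a backup age in seconds to a status band.
age_status() {
    if [ "$1" -lt 0 ]; then
        echo never          # -1 means the service has never been backed up
    elif [ "$1" -ge 172800 ]; then
        echo red            # 48 hours or more
    elif [ "$1" -ge 86400 ]; then
        echo yellow         # 24 hours or more
    else
        echo green
    fi
}

age_status 8100             # 2h 15m
```

If you want never-backed-up services to stand out in the panel too, one option is a value mapping for -1, the same way Panel 3 maps its result codes.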

Panel 3: Verification Status (Stat)

Shows PASS/FAIL/NEVER for each service's last restore verification.

Query:

backup_verify_last_result

Panel type: Stat

Settings:

  • Value mappings: 1 = "PASS" (green), 0 = "FAIL" (red), -1 = "NEVER" (yellow)
  • Legend: {{service}}
  • Title: "Verification Status"

Panel 4: Backup Size per Service (Bar Chart)

Shows how much data each service's latest snapshot contains.

Query:

backup_last_size_bytes

Panel type: Bar chart

Settings:

  • Unit: bytes (decbytes) -- Grafana auto-formats to MB/GB
  • Legend: {{service}}
  • Title: "Latest Backup Size"

Panel 5: Snapshot Count (Stat)

Total snapshots per service. Useful for spotting retention issues.

Query:

backup_snapshot_count

Panel type: Stat

Settings:

  • Legend: {{service}}
  • Title: "Snapshot Count"

Panel 6: Repository Total Size (Stat)

Single value showing total repository size across all snapshots after dedup.

Query:

backup_repo_size_bytes

Panel type: Stat

Settings:

  • Unit: bytes (decbytes)
  • Title: "Repository Size (after dedup)"

Panel 7: Backup Age Trend (Time Series)

Track backup freshness over time. Useful for catching cron failures -- the sawtooth pattern (age resets to 0 at each backup, then climbs) should be regular.

Query:

backup_age_seconds

Panel type: Time series

Settings:

  • Unit: seconds
  • Legend: {{service}}
  • Title: "Backup Age Over Time"

Panel 8: Repository Growth (Time Series)

Track total repository size over time. Sudden jumps indicate new services or dedup failures. Steady growth is normal.

Query:

backup_repo_size_bytes

Panel type: Time series

Settings:

  • Unit: bytes (decbytes)
  • Title: "Repository Size Trend"

Step 6: Set Up Alerts (Optional)

If you want Grafana or Prometheus to alert on backup health, add an alert rule.

Prometheus Alert Rule

Add to your monitoring stack's prometheus/alerts/backup-alerts.yml:

groups:
  - name: backup-alerts
    rules:
      - alert: BackupHealthLow
        expr: backup_health_score < 50
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Backup health low for {{ $labels.service }}"
          description: "{{ $labels.service }} backup health score is {{ $value }}/100"

      - alert: BackupStale
        expr: backup_age_seconds > 172800
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Backup stale for {{ $labels.service }}"
          description: "{{ $labels.service }} hasn't been backed up in over 48 hours"

      - alert: BackupVerifyFailed
        expr: backup_verify_last_result == 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Backup verification failed for {{ $labels.service }}"
          description: "{{ $labels.service }} backup failed restore verification"

Add the file to your Prometheus config's rule_files list and restart Prometheus:

docker compose restart prometheus

These alerts route through Alertmanager, so they reach the same Discord, Slack, or email channels you already configured for infrastructure alerts.
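
Before relying on the rules, you may want to see which of them would match your current metrics. The sketch below is a rough offline approximation that applies the three expressions to a textfile with awk. It is not how Prometheus evaluates rules: it ignores the `for:` durations and assumes each sample sits on one line with no spaces inside the labels. `dry_run_alerts` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print "<AlertName> <series> <value>" for every sample
# in the given textfile that matches one of the three alert expressions.
dry_run_alerts() {
    awk '
        /^#/ || NF < 2 { next }     # skip comments and malformed lines
        {
            value = $NF + 0
            if ($1 ~ /^backup_health_score/ && value < 50)
                print "BackupHealthLow " $1 " " value
            else if ($1 ~ /^backup_age_seconds/ && value > 172800)
                print "BackupStale " $1 " " value
            else if ($1 ~ /^backup_verify_last_result/ && value == 0)
                print "BackupVerifyFailed " $1 " " value
        }
    ' "$1"
}

# Example against the file written by the cron job (path from this guide).
if [ -r /var/lib/node_exporter/backup.prom ]; then
    dry_run_alerts /var/lib/node_exporter/backup.prom
fi
```

Note that a service with `backup_age_seconds` of -1 (never backed up) matches none of these expressions, so consider whether you want a separate rule for that case. To validate the rule file's syntax itself, `promtool check rules` accepts the file path as an argument.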

Dashboard Layout

A suggested layout for a single Grafana row:

┌─────────────────┬───────────────────┬────────────────────┐
│  Health Score   │  Time Since Last  │  Verify Status     │
│  (Gauge)        │  Backup (Stat)    │  (Stat)            │
├─────────────────┴───────────────────┴────────────────────┤
│  Latest Backup Size (Bar Chart)     │  Snapshots / Repo  │
│                                     │  Size (Stats)      │
├─────────────────────────────────────┴────────────────────┤
│  Backup Age Trend (Time Series)                          │
├──────────────────────────────────────────────────────────┤
│  Repository Growth (Time Series)                         │
└──────────────────────────────────────────────────────────┘

Eight panels, one row. Place this below your existing System Overview or Docker Containers dashboard rows, or create a dedicated "Backups" dashboard.

Available Metrics Reference

All metrics exported by backup-metrics.sh:

Metric                         Type   Labels            Description
backup_last_success_timestamp  gauge  service, profile  Unix timestamp of last successful backup
backup_last_size_bytes         gauge  service, profile  Size of last backup snapshot in bytes
backup_snapshot_count          gauge  service, profile  Total snapshots for this service
backup_health_score            gauge  service, profile  Health score 0-100
backup_verify_last_result      gauge  service, profile  Last verify result: 1=pass, 0=fail, -1=never
backup_age_seconds             gauge  service, profile  Seconds since last backup (-1 if never)
backup_repo_size_bytes         gauge  (none)            Total repository size in bytes
backup_repo_snapshots          gauge  (none)            Total snapshot count in repository

This guide connects the Homelab Backup Stack with the Homelab Monitoring Stack. Both are available at nxsi.io.