Monitoring Proxmox with Grafana and Prometheus: A Practical Setup
Proxmox ships with built-in monitoring. RRD graphs right in the web UI. CPU, memory, disk, network -- it's all there.
And for a single node with a handful of VMs, it's fine. You glance at it when something feels slow, maybe notice a memory spike, move on. But the moment you want to answer questions like "which VM has been eating the most CPU over the last two weeks" or "is my ZFS pool fill rate going to be a problem next month," those built-in graphs fall apart. No historical query language. No cross-node comparison. No alerting that reaches your phone at 2 AM when a production VM dies.
External monitoring fixes all of this. Prometheus gives you a real query language. Grafana gives you dashboards that don't require logging into the Proxmox UI. And alert rules mean you find out about problems before your users do. This guide focuses specifically on the Proxmox integration — the full stack tutorial covers the complete 9-service deployment including log aggregation with Loki and container monitoring with cAdvisor.
How the Pieces Fit Together
The data flow is straightforward: Proxmox exposes an API, an exporter translates it into Prometheus metrics, and Grafana visualizes the result.
Proxmox API (:8006) → PVE Exporter (:9221) → Prometheus (:9090) → Grafana (:3000)
The PVE Exporter is the critical piece. It's a small Python service that authenticates against the Proxmox API, pulls cluster/node/guest/storage data, and reformats it as Prometheus metrics on a /pve endpoint. Prometheus scrapes that endpoint on a schedule (every 15 seconds by default), stores the time series, and Grafana queries Prometheus to render dashboards.
One detail that saves headaches: the PVE Exporter queries the Proxmox API, not individual node agents. A single exporter instance can pull data for an entire cluster through any one node. You don't need an exporter per node.
The Docker service config for the exporter is minimal:
pve-exporter:
  image: prompve/prometheus-pve-exporter:3.8.1
  container_name: monitoring-pve-exporter
  ports:
    - "9221:9221"
  environment:
    - PVE_USER=${PVE_USER:-monitoring@pve}
    - PVE_TOKEN_NAME=${PVE_TOKEN_NAME:-monitoring}
    - PVE_TOKEN_VALUE=${PVE_TOKEN_VALUE}
    - PVE_VERIFY_SSL=${PVE_VERIFY_SSL:-false}
  restart: unless-stopped
  deploy:
    resources:
      limits:
        memory: 64M
64MB memory limit. The exporter is lightweight -- it makes API calls and reformats the response. No local storage, no background processing.
Setting Up the Proxmox API Token
This is where most people hit their first authentication error. The exporter needs an API token, and Proxmox's permission model trips people up if you're used to simpler systems. I wrote a standalone step-by-step guide with both CLI and web UI instructions if you want the full walkthrough with troubleshooting.
Four commands, run on the Proxmox host:
# 1. Create a dedicated monitoring user in the local (pve) realm
pveum user add monitoring@pve -comment "Monitoring read-only user"
# 2. Create a role with only audit (read-only) privileges
pveum role add monitoring -privs "VM.Audit,Datastore.Audit,Sys.Audit,SDN.Audit"
# 3. Assign the role at the root path (covers everything)
pveum aclmod / -user monitoring@pve -role monitoring
# 4. Create an API token with privilege separation DISABLED
pveum user token add monitoring@pve monitoring --privsep 0
The last command outputs the token secret. Copy it. You won't see it again.
The privilege separation gotcha -- this is the one that will cost you 30 minutes of debugging if you miss it. By default, Proxmox creates API tokens with privilege separation enabled. That means the token gets its own empty permission set, independent of the user's permissions. Your monitoring user has VM.Audit, but the token has nothing. Result: 401 Unauthorized, and the error message won't tell you why.
The fix is --privsep 0 on the CLI or unchecking "Privilege Separation" in the web UI. With privsep off, the token inherits the user's permissions directly.
If you prefer the web UI: Datacenter > Permissions > API Tokens > Add. Set User to monitoring@pve, Token ID to monitoring, uncheck Privilege Separation.
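Either way, two quick checks on the Proxmox host (recent Proxmox versions) confirm everything is in place before moving on:

# The token should be listed under the monitoring user
pveum user token list monitoring@pve

# The monitoring role should be assigned at the root path
pveum acl list | grep monitoring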
Once you have the token, drop these values into your .env:
PVE_HOST=10.10.10.101
PVE_USER=monitoring@pve
PVE_TOKEN_NAME=monitoring
PVE_TOKEN_VALUE=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
PVE_VERIFY_SSL=false
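Before touching the Prometheus config, it's worth bringing up just the exporter and confirming it can reach the Proxmox API. A quick sanity check, assuming the service name and host IP from above:

docker compose up -d pve-exporter
# Should return a long list of pve_* metrics; an error or empty response here
# means an auth or network problem between the exporter and Proxmox
curl "http://localhost:9221/pve?target=10.10.10.101"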
Prometheus Scrape Configuration
Prometheus doesn't scrape the PVE Exporter the way it scrapes most targets. Instead of hitting a /metrics endpoint, it requests /pve with query parameters that tell the exporter which Proxmox host to query. This multi-target pattern is common in the exporter ecosystem (Blackbox Exporter and SNMP Exporter work the same way) but it confuses people who expect a simple static_configs block.
The scrape config:
- job_name: "proxmox"
  metrics_path: /pve
  params:
    module: [default]
    cluster: ["1"]
    node: ["1"]
    target: ["10.10.10.101"]
  static_configs:
    - targets: ["pve-exporter:9221"]
What's happening here: Prometheus connects to pve-exporter:9221, requests the path /pve?module=default&cluster=1&node=1&target=10.10.10.101. The exporter receives this, calls the Proxmox API at 10.10.10.101:8006, and returns the results as Prometheus metrics. The static_configs target is the exporter address, not the Proxmox host. The params.target is the Proxmox host.
The Environment Variable Problem
I hit something annoying during my build. Prometheus config files do not support Docker Compose-style ${VAR} substitution. I initially put ${PVE_HOST} in the target parameter and Prometheus URL-encoded the literal string to $%7bpve_host. No error, no warning. Just a silently broken scrape target pointing at a nonsense hostname.
The standard community fix: a setup script that reads .env and does sed replacement before you start the stack. My setup.sh replaces the placeholder PVE_TARGET in prometheus.yml with the actual IP from your .env file:
sed -i "s/PVE_TARGET/${PVE_HOST}/g" "$PROM_CONFIG"
Not elegant. Effective.
(I looked at using envsubst as a Docker entrypoint wrapper and building a custom Prometheus image, but both felt like overkill for one substitution. The sed approach works and doesn't change the image.)
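For context, the relevant part of the script is only a few lines. A minimal sketch, assuming .env defines PVE_HOST and prometheus.yml ships with the literal placeholder PVE_TARGET (the config path is illustrative):

#!/usr/bin/env bash
set -euo pipefail

# Pull PVE_HOST (and the other variables) from .env
source .env

# Swap the placeholder for the real Proxmox IP before the stack starts
PROM_CONFIG="prometheus/prometheus.yml"   # adjust to your layout
sed -i "s/PVE_TARGET/${PVE_HOST}/g" "$PROM_CONFIG"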
The Proxmox Cluster Dashboard
The Grafana dashboard I built has 10 panels organized across 4 rows. Each row maps to a layer of the Proxmox hierarchy.
Row 1 -- Cluster Status. Four panels for the high-level view:
- Node Status -- a table showing each Proxmox node with an Online/Offline indicator. Uses pve_node_info with value mappings: 1 = green "Online", 0 = red "Offline". Filterable columns.
- Total Guests -- stat panel, count(pve_guest_info). The number across all nodes.
- Total CPU Usage -- stat panel with sparkline, avg(pve_cpu_usage_ratio{id=~"node/.*"}) * 100. Green/yellow/red thresholds at 0/60/85%.
- Total Memory Usage -- stat panel with sparkline. sum(usage) / sum(size) across all nodes, filtered by {id=~"node/.*"}. Thresholds at 0/70/90%.
This first row is what you glance at. Everything green, move on. Something red, scroll down.
Row 2 -- Nodes. Two time series panels showing per-node CPU and memory usage over time. These filter on {id=~"node/.*"} to match physical Proxmox hosts. A group_left(name) join with pve_node_info pulls the friendly hostname into the legend. Legend tables with mean and max calcs at the bottom. This is where you spot one node running hotter than the others and decide to migrate a VM.
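For reference, the per-node CPU query follows the standard info-metric join pattern. A sketch, assuming the metric names the PVE Exporter exposes:

# Per-node CPU percentage, with the friendly hostname joined in from pve_node_info
pve_cpu_usage_ratio{id=~"node/.*"} * 100
  * on(id) group_left(name) pve_node_info

Because pve_node_info always has the value 1, multiplying by it changes nothing numerically; group_left(name) just copies the name label onto each CPU series, which is what the legend template then displays.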
Row 3 -- VMs & Containers. Two time series panels for guest-level CPU and memory. Filtered with {id=~"qemu/.*|lxc/.*"} to show only VMs and LXC containers, not nodes or storage. Each panel joins with pve_guest_info so the legend shows VM names instead of numeric IDs. The memory panel uses raw bytes (not percentages) because LXC containers don't always report a memory limit, which makes percentage calculation unreliable.
Row 4 -- Storage. A horizontal bar gauge showing percentage used per storage pool, color-coded with green/yellow/red thresholds at 0/70/85%. Uses pve_disk_usage_bytes{id=~"storage/.*"} and pve_disk_size_bytes{id=~"storage/.*"} with a pve_storage_info join for friendly pool names. Next to it, a details table with storage name, used bytes, and total bytes. The table sorts by "Used" descending so your fullest pools are always at the top.
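The bar gauge query is the same join applied to the storage metrics, roughly:

# Percent used per storage pool, with pool names joined in from pve_storage_info
(pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) * 100
  * on(id) group_left(name) pve_storage_info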
All panels reference the provisioned prometheus datasource UID. This explicit reference ensures Proxmox queries go to Prometheus and not to whichever datasource Grafana happens to use as the default.
Alert Rules
Five Proxmox-specific alert rules. Each is designed to catch a different failure mode, and they're in a separate proxmox-alerts.yml file so people who don't run Proxmox can delete it without touching their other rules.
ProxmoxGuestStopped -- fires when any VM or LXC has been down for 5 minutes. The pve_up metric is 0 when a guest is stopped, and the group_left join pulls the friendly VM name and node from pve_guest_info. The 5-minute for: duration avoids false alarms during planned reboots or migrations.
- alert: ProxmoxGuestStopped
  expr: pve_up{id=~"qemu/.*|lxc/.*"} * on(id) group_left(name, node) pve_guest_info == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Proxmox guest {{ $labels.name }} is not running"
    description: >
      VM/LXC {{ $labels.name }} ({{ $labels.id }}) on node
      {{ $labels.node }} has been stopped for more than 5 minutes.
ProxmoxNodeDown -- critical severity, 2-minute threshold. If a node drops offline, you want to know immediately. Two minutes gives enough buffer for brief network blips without masking real outages.
PveExporterDown -- warning severity, 5 minutes. This catches the case where the exporter itself is unhealthy -- network issues to the Proxmox API, bad credentials, or the exporter container crashing. Without this rule, you'd have a blind spot: your Proxmox monitoring would silently stop working and no alert would fire because there are no metrics to evaluate.
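Sketches of these two expressions, assuming pve_up is also reported for node/... ids (it is for guests, per the rule above) and that the scrape job is named proxmox as configured earlier:

- alert: ProxmoxNodeDown
  # Assumes pve_up is exported for node/... ids, mirroring the guest rule
  expr: pve_up{id=~"node/.*"} * on(id) group_left(name) pve_node_info == 0
  for: 2m
  labels:
    severity: critical

- alert: PveExporterDown
  # Prometheus's own scrape health for the "proxmox" job defined earlier
  expr: up{job="proxmox"} == 0
  for: 5m
  labels:
    severity: warning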
ProxmoxStorageHigh (85%) and ProxmoxStorageCritical (95%) -- two-tier storage alerts. The 85% warning gives you time to clean up or expand. The 95% critical means you're about to run out. The critical rule uses a 5-minute for: duration (vs 10 minutes for the warning) because at 95%, every minute counts.
The expression for both uses pve_disk_usage_bytes{id=~"storage/.*"} and pve_disk_size_bytes{id=~"storage/.*"}. The {id=~"storage/.*"} filter is important -- without it, the query would also match VM disk metrics. PVE Exporter uses pve_disk_* for both storage pools and guest disks, distinguished only by the id label prefix.
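A sketch of the two-tier pair, using those metrics and the for: durations described above:

- alert: ProxmoxStorageHigh
  expr: pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"} > 0.85
  for: 10m
  labels:
    severity: warning

- alert: ProxmoxStorageCritical
  expr: pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"} > 0.95
  for: 5m
  labels:
    severity: critical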
Multi-Node and Multi-Cluster Monitoring
A common question: do I need one PVE Exporter per Proxmox node?
No. The exporter queries the Proxmox API, which returns cluster-wide data through any single node. Point the exporter at one node and you get metrics for all nodes, all VMs, all storage pools in that cluster. The cluster: ["1"] and node: ["1"] parameters in the Prometheus config enable cluster-level and node-level metric collection respectively.
For a single Proxmox cluster, the setup is what I showed above. One exporter, one scrape job. Done.
For multiple independent Proxmox installations (separate clusters that aren't joined), you add additional scrape jobs in prometheus.yml with different target parameters:
- job_name: "proxmox-site2"
  metrics_path: /pve
  params:
    module: [default]
    cluster: ["1"]
    node: ["1"]
    target: ["192.168.2.100"]
  static_configs:
    - targets: ["pve-exporter:9221"]
Same exporter instance, different target. The exporter acts as a proxy -- Prometheus tells it which Proxmox host to query on each scrape. You can monitor 5 different Proxmox clusters through one exporter container. Each site needs its own API token configured on that Proxmox installation, but the PVE Exporter uses the token from its environment variables. If your sites need different credentials, you'd run one exporter per credential set.
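If you do end up needing per-site credentials, the second exporter is just another copy of the compose service with its own env vars and published port; the site2 scrape job then points its static_configs at that service instead. A sketch with illustrative names:

pve-exporter-site2:
  image: prompve/prometheus-pve-exporter:3.8.1
  ports:
    - "9222:9221"
  environment:
    # Separate token for the second site -- variable names are placeholders
    - PVE_USER=${SITE2_PVE_USER:-monitoring@pve}
    - PVE_TOKEN_NAME=${SITE2_PVE_TOKEN_NAME:-monitoring}
    - PVE_TOKEN_VALUE=${SITE2_PVE_TOKEN_VALUE}
    - PVE_VERIFY_SSL=${PVE_VERIFY_SSL:-false}
  restart: unless-stopped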
For Grafana, the dashboard's panels template their legends on the node label, so nodes from different clusters automatically appear as separate series. No dashboard changes needed.
Gotchas and Debugging Checklist
A collection of things that will save you time.
Self-signed SSL certificates. Every default Proxmox install uses a self-signed cert. If PVE_VERIFY_SSL is true (or unset), the exporter will reject the connection with an SSL error. Set it to false in your .env. If you've set up Let's Encrypt or a custom CA on Proxmox, then you can flip it to true.
The 401 debugging checklist. When you get 401 Unauthorized from the PVE Exporter logs, work through this in order:
- Does the user exist? pveum user list | grep monitoring
- Does the token exist? Check Datacenter > Permissions > API Tokens in the web UI
- Is PVE_TOKEN_VALUE the secret (UUID-format string), not the token ID? People mix these up constantly
- Is privilege separation disabled on the token? Re-create with --privsep 0 if unsure
- Does the user have permissions at /? Run pveum acl list and verify
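If all five check out and you still see 401s, test the token directly against the API from the monitoring host. A sketch, substituting your host and the secret from step 4 (the header format is user@realm!tokenid=secret):

# A successful call returns JSON with the Proxmox version; a 401 here means
# the token itself is the problem, not the exporter or Prometheus
curl -k -H "Authorization: PVEAPIToken=monitoring@pve!monitoring=YOUR_TOKEN_SECRET" \
  https://10.10.10.101:8006/api2/json/version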
Token auth vs password auth. The PVE Exporter supports both. Token auth (what I've shown) is the right choice: password auth means storing the Proxmox user's actual password in plaintext, whereas a token can be revoked independently and doesn't expire unless you set an expiry. There's no reason to use password auth for monitoring.
Metrics exist but VMs are missing. The monitoring role needs VM.Audit. If you created the role without it, or if the ACL assignment is on a specific path instead of /, some guests won't appear. The fix is to re-run pveum aclmod / -user monitoring@pve -role monitoring to ensure root-level access.
Port 8006 not reachable. The PVE Exporter connects to the Proxmox API on port 8006 over HTTPS. If the monitoring host and Proxmox host are on different subnets or VLANs, check firewall rules. A quick test: curl -k https://YOUR_PVE_HOST:8006/api2/json/version should return a JSON response with the Proxmox version.
Summary
The full Proxmox monitoring pipeline, pattern by pattern:
- PVE Exporter translates the Proxmox API into Prometheus metrics. One instance covers an entire cluster.
- API token with privsep disabled gives the exporter read-only access. Four audit privileges, nothing writable.
- Prometheus scrapes /pve with a target parameter, not /metrics. The exporter is a proxy, not a direct metrics source.
- setup.sh handles the Prometheus config templating because Prometheus doesn't support environment variables in its YAML config.
- 10-panel dashboard covers cluster status, per-node resources, per-guest resources, and storage usage. Auto-provisions on first boot.
- 5 alert rules catch stopped guests, downed nodes, exporter failures, and storage thresholds at two tiers.
- Multi-cluster works through a single exporter by adding scrape jobs with different target parameters.
If you want the dashboard JSON, the alert rules, the PVE Exporter config, and the full Prometheus scrape configuration pre-wired and ready to deploy, the Homelab Monitoring Stack kit (free download) includes all of it alongside 6 other dashboards, 23 total alert rules, and a 9-service Docker Compose setup. Copy .env.example, fill in your values, run docker compose up -d.