Most Docker Compose files I see in homelab repos are fragile. They work on the author's machine, break on yours, and give you zero information about why. The patterns below come from a 9-service monitoring stack I built and tested on a box with 8 GB of RAM. Every snippet is pulled from that real compose file — nothing hypothetical.
Environment Variables With Defaults
The single most impactful pattern for making a compose file portable: ${VAR:-default} syntax. Users who care about customization edit .env. Everyone else gets working defaults without touching YAML.
ports:
  - "${GRAFANA_PORT:-3000}:3000"
environment:
  - GF_SECURITY_ADMIN_USER=${GRAFANA_ADMIN_USER:-admin}
  - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin}
  - GF_AUTH_ANONYMOUS_ENABLED=${GRAFANA_ANONYMOUS_ENABLED:-false}
command:
  - "--storage.tsdb.retention.time=${PROMETHEUS_RETENTION:-15d}"
  - "--storage.tsdb.retention.size=${PROMETHEUS_RETENTION_SIZE:-10GB}"
The monitoring stack has 9 services with roughly 20 configurable values. Not one requires editing docker-compose.yml. Copy .env.example to .env, change what matters, run docker compose up -d. That .env.example becomes the documentation — inline comments explain every variable:
# How long to keep metrics (default: 15 days)
PROMETHEUS_RETENTION=15d
# Max disk space for metrics (default: 10GB)
PROMETHEUS_RETENTION_SIZE=10GB
Ship an .env.example, never a populated .env. Users who get a pre-filled .env will inevitably run with your test credentials.
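A quick way to sanity-check the result: docker compose config is a standard Compose subcommand that renders the final file with every ${VAR:-default} already interpolated, so you can see exactly which values a fresh clone would run with:
docker compose config
With no .env present, the output shows the fallbacks; with one, it shows your overrides.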
Health Checks on Every Service
Without health checks, Docker has no idea whether your container is actually functioning or just technically alive with a running PID. The difference matters when other services depend on it.
Grafana's health check:
healthcheck:
  test: ["CMD", "wget", "--spider", "-q", "http://localhost:3000/api/health"]
  interval: 30s
  timeout: 5s
  retries: 3
Prometheus uses its dedicated health endpoint:
healthcheck:
  test: ["CMD", "wget", "--spider", "-q", "http://localhost:9090/-/healthy"]
  interval: 30s
  timeout: 5s
  retries: 3
Uptime Kuma is the awkward one — no wget or curl in the image, just Node.js:
healthcheck:
  test: ["CMD-SHELL", "node -e \"const http = require('http'); http.get('http://localhost:3001/api/status-page/heartbeat', (r) => { process.exit(r.statusCode === 200 ? 0 : 1) }).on('error', () => process.exit(1))\""]
  interval: 30s
  timeout: 5s
  retries: 3
Ugly. But it works, and that's the point — you use whatever HTTP client the image gives you. Alpine-based images have wget. Debian-based often have curl. Node images have node. Check what's available before reaching for apt-get install in a health check command (I've seen people do this; don't).
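If you're not sure what a given image ships, a quick probe from the host answers it (substitute the image you actually run; this assumes the image has a shell at all, which distroless images don't):
docker run --rm --entrypoint sh <image> -c 'for c in wget curl node; do command -v $c; done'
Whatever prints a path is what you build the health check around.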
The interval: 30s with retries: 3 means Docker marks a container unhealthy after three consecutive failed checks, roughly 90 seconds of sustained failure. Tight enough to catch real problems, loose enough to survive a slow Loki ingester warmup.
Memory Limits
If you're running on a 4-8 GB homelab box, one misbehaving container can OOM-kill everything. Memory limits turn a full-system crash into one container restarting.
deploy:
  resources:
    limits:
      memory: 256M
The actual limits from the monitoring stack:
| Service | Memory Limit | Why |
|---|---|---|
| Prometheus | 512M | Largest consumer — TSDB, WAL, query execution |
| Grafana | 256M | Dashboard rendering, plugin loading |
| Loki | 256M | Log ingestion and TSDB |
| Uptime Kuma | 256M | Node.js runtime + SQLite |
| cAdvisor | 128M | Kernel metrics collection |
| Promtail | 128M | Log tailing and shipping |
| Node Exporter | 64M | Minimal — reads /proc and /sys |
| Alertmanager | 64M | Lightweight alert routing |
| PVE Exporter | 64M | Python + API polling |
Total budget: ~1.7 GB for 9 services. That leaves breathing room on an 8 GB machine running other things.
I arrived at these numbers by running the stack under load and watching actual consumption. Prometheus peaked around 380 MB with 15 days of retention and 7 scrape targets. The 512M limit gives it headroom without letting a runaway query eat all your RAM.
(An aside: I initially set Promtail to 64M and it kept getting killed during first boot when it tried to ingest all existing Docker logs at once. 128M fixed it. Your move, first-boot burst ingestion.)
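If you want to repeat the sizing exercise on your own stack, docker stats shows live per-container memory usage against the configured limit (the template fields are Docker's built-in ones):
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}"
Run it while the stack is under realistic load, then set each limit with some headroom above the observed peak.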
Named Volumes vs. Bind Mounts
Two different volume patterns serve two different purposes. The compose file uses both, deliberately.
Named volumes for persistent data — things the service writes and manages:
volumes:
  - grafana-data:/var/lib/grafana
  - prometheus-data:/prometheus
  - loki-data:/loki
  - alertmanager-data:/alertmanager
  - uptime-kuma-data:/app/data
  - promtail-positions:/tmp
Bind mounts for config files you author and the service reads:
volumes:
  - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
  - ./prometheus/alerts:/etc/prometheus/alerts:ro
  - ./grafana/provisioning:/etc/grafana/provisioning:ro
  - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
Named volumes survive docker compose down and get managed by Docker's storage driver. Bind mounts point at files you version control. Mixing these up — putting Prometheus data in a bind mount, or configs in a named volume — creates pain that shows up weeks later when you try to back up or redeploy.
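One piece the snippets above don't show: named volumes also have to be declared at the top level of the compose file before services can mount them. A minimal sketch, using the same names as the mounts above:
volumes:
  grafana-data:
  prometheus-data:
  loki-data:
  alertmanager-data:
  uptime-kuma-data:
  promtail-positions:
Leaving the values empty tells Docker to create them with the default local driver.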
Read-Only Config Mounts
Every bind-mounted config file has :ro appended:
volumes:
  - ./loki/loki.yml:/etc/loki/loki.yml:ro
  - ./promtail/promtail.yml:/etc/promtail/promtail.yml:ro
  - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
Two reasons. First, defense in depth — a compromised container can't modify its own config to do something you didn't intend. Second, it makes intent explicit. When I see :ro I know that file flows one direction: host into container, never back.
Node Exporter takes this further with read-only mounts of the host filesystem:
volumes:
  - /proc:/host/proc:ro
  - /sys:/host/sys:ro
  - /:/host:ro,rslave
That :ro,rslave on the root mount matters — rslave propagates mount events from the host into the container (so new mounts appear), but :ro prevents the container from writing anything back. Node Exporter needs to see your filesystems to report disk usage. It does not need write access.
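Those mounts only matter if Node Exporter is told to read from them. The usual pairing (the flags are Node Exporter's own; the values assume the mount paths above):
command:
  - "--path.procfs=/host/proc"
  - "--path.sysfs=/host/sys"
  - "--path.rootfs=/host"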
Dedicated Networks
networks:
  monitoring:
    name: monitoring
    driver: bridge
Every service joins this network:
networks:
  - monitoring
The name: monitoring is deliberate. Without it, Docker prefixes the project directory name — you'd get homelab-monitoring_monitoring or whatever your folder is called. Explicit names make docker network inspect monitoring predictable.
Inside this network, services resolve each other by name. Grafana's datasource config references http://prometheus:9090 and http://loki:3100. No IPs. If Prometheus restarts and gets a different internal IP, everything still works.
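That DNS-by-service-name behavior is what the provisioned datasources lean on. A trimmed sketch of what a file under grafana/provisioning/datasources/ can look like (fields reduced to the essentials; your layout may differ):
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100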
A common mistake: using the default bridge network for everything. When your monitoring stack, your media stack, and your home automation stack all share a network, every container can talk to every other container. Dedicated networks provide isolation. Your Prometheus doesn't need to reach your Plex server.
Restart Policies
Every service uses the same policy:
restart: unless-stopped
Not restart: always. The difference: unless-stopped respects manual docker stop commands. If you stop Grafana to debug something, it stays stopped. always would restart it immediately, fighting you.
Not on-failure either. on-failure only restarts on non-zero exit codes. If a service gets OOM-killed, the exit code is 137 (SIGKILL) — on-failure handles that. But some services exit cleanly with code 0 during transient issues and need to come back. unless-stopped covers both cases.
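When a container does die and you want to know why, docker inspect exposes both the exit code and whether the kernel OOM-killed it (the container name here is one from this stack):
docker inspect --format '{{.State.ExitCode}} OOMKilled={{.State.OOMKilled}}' monitoring-loki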
One line. Boring. Prevents you from waking up to a dead monitoring stack because Loki crashed at 2 AM and nobody restarted it.
Container Naming
container_name: monitoring-grafana
container_name: monitoring-prometheus
container_name: monitoring-loki
container_name: monitoring-promtail
container_name: monitoring-node-exporter
container_name: monitoring-cadvisor
container_name: monitoring-alertmanager
container_name: monitoring-pve-exporter
container_name: monitoring-uptime-kuma
The monitoring- prefix does two things. docker ps becomes scannable — you can instantly tell which stack a container belongs to. And it prevents name collisions: container names are global on a Docker host, so if both your monitoring stack and a dev stack declare container_name: grafana, Docker refuses to start the second one. Prefixed names eliminate the ambiguity.
docker ps --filter name=monitoring- gives you just this stack. Fast.
Dependency Ordering With Health Conditions
Promtail ships logs to Loki. If Promtail starts before Loki is ready, log batches fail and get retried — noisy, wasteful, and alarming if you're watching the logs.
promtail:
  depends_on:
    loki:
      condition: service_healthy
This is the only dependency declaration in the stack because it's the only one that genuinely matters. Prometheus scrapes targets on an interval — if a target isn't up yet, the scrape fails silently and succeeds on the next cycle. Grafana queries Prometheus on demand — if Prometheus isn't ready, the dashboard shows "No data" and refreshes automatically. These services tolerate startup ordering gracefully.
Promtail is different. It discovers existing Docker log files on boot and immediately starts shipping thousands of lines. Sending that burst to a Loki instance that's still initializing its TSDB produces a wall of 503 errors. The condition: service_healthy gates Promtail startup until Loki's /ready endpoint returns 200.
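One prerequisite worth spelling out: condition: service_healthy only does something if Loki defines a healthcheck of its own. A sketch against that /ready endpoint, assuming the image ships wget (swap in whatever HTTP client it actually provides, per the health-check section above):
healthcheck:
  test: ["CMD-SHELL", "wget --spider -q http://localhost:3100/ready || exit 1"]
  interval: 30s
  timeout: 5s
  retries: 3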
Don't add depends_on between services that handle missing peers gracefully on their own. You'll just slow down your startup for no reason.
All Nine Patterns
- ${VAR:-default} everywhere — zero YAML editing for users
- Health checks on every service — using whatever HTTP client the image provides
- Memory limits — total stack budget around 1.7 GB for 9 services
- Named volumes for data, bind mounts for config — clean separation of concerns
- :ro on all config mounts — config flows host to container, never back
- Dedicated named network — service DNS, stack isolation, predictable naming
- unless-stopped restart policy — respects manual stops, covers all failure modes
- Stack-prefixed container names — monitoring-* for instant docker ps filtering
- depends_on with condition: service_healthy — only where startup order actually matters
None of these are exotic. Most are one or two lines. I have been guilty of skipping half of them in personal projects and regretting it six months later when I try to redeploy on a different box. The difference between a compose file that works on your machine and one that works on anyone's machine is usually just these small, boring decisions compounding.
The Homelab Monitoring Stack kit uses every pattern above across a production-tested 9-service Docker Compose file — Grafana, Prometheus, Loki, Alertmanager, Node Exporter, cAdvisor, Promtail, PVE Exporter, and Uptime Kuma. Pre-built dashboards, 23 alert rules, and documentation included. Free download, yours to modify.