I spent a weekend building a server health dashboard for my homelab. Nine servers, SSH access to each, Discord alerts when something goes wrong. The n8n workflow I ended up with is not complicated — but it required solving five distinct problems that I haven't seen written up anywhere. These are the patterns that made it work.
Each one is self-contained. Take what applies.
Pattern 1: One SSH Connection Per Server, Not One Per Metric
The obvious approach when you're wiring up your first monitoring workflow: one SSH node for CPU, another for memory, another for disk. Easy to read. Easy to debug. Also wrong.
Every SSH node opens a new TCP connection, negotiates keys, and authenticates. On a fast LAN with passwordless key auth that's maybe 80-150ms per connection. Over a WAN link or to a Pi on a slow network, you're looking at 300-800ms each. Multiply that by five metrics across nine servers and your "lightweight monitoring" workflow is now a multi-minute blocking chain.
The fix: compound commands. One SSH node. One connection. All metrics.
echo "HOSTNAME=$(hostname)"
echo "UPTIME=$(uptime)"
echo "CPU_LOAD=$(cat /proc/loadavg)"
echo "CORES=$(nproc)"
echo "MEMORY=$(free -m | grep Mem)"
echo "SWAP=$(free -m | grep Swap)"
echo "TEMP=$(cat /sys/class/thermal/thermal_zone0/temp 2>/dev/null || echo N/A)"
echo "ZOMBIES=$(ps aux | awk '{print $8}' | grep -c Z 2>/dev/null || echo 0)"
echo "FAILED_SERVICES=$(systemctl --failed --no-legend 2>/dev/null | wc -l)"
echo "CONNECTIONS=$(ss -tun state established 2>/dev/null | wc -l)"
# Real CPU utilization via /proc/stat delta sampling (1s sleep)
CPU1=$(cat /proc/stat | head -1); sleep 1; CPU2=$(cat /proc/stat | head -1)
echo "CPU_STAT1=$CPU1"
echo "CPU_STAT2=$CPU2"
# All mounted filesystems (filtered)
echo "---FS---"
df -P | grep -vE 'tmpfs|udev|Filesystem'
echo "---NET---"
cat /proc/net/dev | grep -v 'Inter\|face'
echo "---TOP5CPU---"
ps aux --sort=-%cpu | head -6
The big change from a basic version: real CPU utilization via /proc/stat delta sampling. Take a snapshot, sleep 1 second, take another, and calculate the actual percentage of CPU time spent doing work. This is fundamentally different from load average — load average tells you how many processes are waiting, CPU% tells you how busy the cores actually are. A server can show 0.5 load average while running at 95% CPU on a single-core task. The delta sampling catches that.
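For reference, the delta arithmetic in the downstream Code node looks roughly like this. A sketch, assuming the two snapshots arrive as the CPU_STAT1 and CPU_STAT2 strings produced by the compound command; field positions follow the /proc/stat layout:

// Sketch: derive CPU% from two /proc/stat snapshots (CPU_STAT1 / CPU_STAT2 from the compound command)
function cpuPercent(stat1, stat2) {
  // Each snapshot looks like: "cpu  user nice system idle iowait irq softirq steal ..."
  const parse = (line) => line.trim().split(/\s+/).slice(1).map(Number);
  const a = parse(stat1);
  const b = parse(stat2);
  const totalDelta = b.reduce((s, v) => s + v, 0) - a.reduce((s, v) => s + v, 0);
  const idleDelta = (b[3] + (b[4] || 0)) - (a[3] + (a[4] || 0)); // idle + iowait
  return totalDelta > 0 ? Math.round(((totalDelta - idleDelta) / totalDelta) * 100) : 0;
}

Called with the parsed CPU_STAT1 and CPU_STAT2 values, it returns an integer percentage of non-idle CPU time over the one-second window.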
The ---FS---, ---NET---, and ---TOP5CPU--- markers are section delimiters. The downstream Code node splits on these markers to parse each data type separately. df -P with the tmpfs|udev filter gives you every real filesystem without the noise. /proc/net/dev gives you RX/TX bytes and error counts per interface.
Each KEY=VALUE line in the first section is parsed the same way as before. The 2>/dev/null || echo N/A pattern on temperature handles machines without thermal sensors. The 2>/dev/null || echo 0 on zombies and failed services handles systems where those checks might not apply.
Parsing this in a downstream Code node is straightforward:
const output = $input.first().json.stdout;
// Parse only the KEY=VALUE section (everything before the first ---FS--- marker)
const lines = output.split('---FS---')[0].split('\n').filter(l => l.includes('='));
const metrics = {};
for (const line of lines) {
  const [key, ...rest] = line.split('=');
  metrics[key.trim()] = rest.join('=').trim();
}
return [{ json: metrics }];
The rest.join('=') handles values that contain = signs, which uptime output sometimes does. Learned that the hard way on the first test run.
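The section markers make the rest of the parse mechanical. A sketch of the splitting step, assuming the same output variable as above; column positions follow df -P and /proc/net/dev, and the variable names are just for illustration:

// Sketch: split the raw output on the section markers, then parse each block
const [, afterFs] = output.split('---FS---');
const [fsBlock, afterNet] = afterFs.split('---NET---');
const [netBlock, topBlock] = afterNet.split('---TOP5CPU---'); // topBlock keeps the ps aux head -6 text as-is

const filesystems = fsBlock.trim().split('\n').map(line => {
  const f = line.trim().split(/\s+/); // df -P: Filesystem, 1024-blocks, Used, Available, Capacity, Mounted on
  return { device: f[0], usedPct: parseInt(f[4]), mount: f[5] };
});

const interfaces = netBlock.trim().split('\n').map(line => {
  const [name, data] = line.split(':');
  const f = data.trim().split(/\s+/); // /proc/net/dev: RX bytes at index 0, TX bytes at index 8
  return { iface: name.trim(), rxBytes: Number(f[0]), txBytes: Number(f[8]) };
});

Each parsed array then feeds the scoring and formatting steps downstream.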
Docker requires a separate SSH node because you need the Docker daemon and the output format is different:
docker ps -a --format '{{.Names}}|{{.Status}}|{{.Image}}'
docker stats --no-stream --format '{{.Names}}|{{.CPUPerc}}|{{.MemPerc}}'
docker inspect --format '{{.Name}}|{{.RestartCount}}|{{.State.Health.Status}}' $(docker ps -q) 2>/dev/null
docker system df --format '{{.Type}}|{{.Size}}|{{.Reclaimable}}'
Pipe-delimited output instead of KEY=VALUE — easier to split into structured arrays. The $(docker ps -q) in the inspect command limits it to running containers, and the 2>/dev/null keeps that line quiet when nothing is running. The docker system df line feeds the Docker reclaimable-space deduction in Pattern 2.
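Turning those lines back into objects is one split('|') per line. A sketch for the docker ps output (psOutput here is a stand-in for that command's portion of the node's stdout; the stats and inspect lines parse the same way):

// Sketch: pipe-delimited docker ps lines -> objects keyed by container name
const containers = {};
for (const line of psOutput.trim().split('\n')) {
  if (!line.includes('|')) continue; // skip blanks or stray error text
  const [name, status, image] = line.split('|');
  containers[name] = { name, status, image, running: status.startsWith('Up') };
}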
A common trap: n8n's expression validator flags {{.Names}} in SSH node parameters as invalid template syntax. The red underline is wrong. The command runs correctly at execution time — n8n validates expressions but doesn't execute them in test mode, so the Go template syntax passes through to the shell unmodified. Ignore the validator warning on this one.
Two SSH nodes per server instead of nine. The workflow runs in under two seconds per server versus what would have been twelve-plus seconds with individual metric nodes.
Pattern 2: Deduction-Based Health Scoring
Additive scoring — start at 0, add points for good states — sounds clean until you try to write the weights. What's a healthy CPU worth? How do you add "container is up" across fifteen containers of different importance? You end up in a philosophy discussion instead of shipping.
Deduction scoring cuts through it. Start at 100. Subtract for problems. The score represents "how much is wrong" rather than "how much is right," which maps directly to how operators actually think about servers.
The deduction table I settled on after the first week of real data — and then expanded significantly once I started collecting more metrics:
| Condition | Deduction |
|---|---|
| Disk usage >90% | -30 |
| Disk usage >80% | -15 |
| Memory usage >95% | -25 |
| Memory usage >85% | -10 |
| CPU utilization >80% | -20 |
| CPU utilization >50% | -10 |
| CPU load >1.5x core count | -10 |
| CPU load >1x core count | -5 |
| Inode usage >90% | -20 |
| Inode usage >80% | -10 |
| Swap usage >80% | -10 |
| Swap usage >50% | -5 |
| Container restart count >5 | -15 |
| Container restart count >0 | -5 |
| Each container in stopped/failed state | -10 |
| Zombie processes >0 | -5 |
| Failed systemd services >0 | -10 |
| Docker reclaimable >50% | -5 |
The weighting encodes operational judgment. Disk at 90% is catastrophic — you lose writes, logs stop rotating, databases corrupt. Real CPU utilization (measured via /proc/stat delta, not load average) at 80% means your cores are genuinely saturated. That's why disk deductions are still the highest, but CPU% now carries real weight too. Running out of inodes is rarer but equally catastrophic, and most monitoring setups ignore it entirely.
The newer entries catch problems I kept missing with the original table. Swap usage creeping up means memory pressure is real but hasn't hit the OOM killer yet — early warning. Zombie processes and failed systemd services are the kind of thing you never check manually until something breaks downstream. Docker reclaimable space above 50% means you're hoarding dead images and build cache — not urgent, but worth a docker system prune before it contributes to a disk alert.
Container restarts over 5 is the threshold where "it's flapping" becomes "something is genuinely wrong." Below that it could be a deploy, a dependency race, a one-time OOM — but even a single restart now gets a small deduction so it shows up in the score.
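In the Code node this becomes a chain of guarded subtractions. A condensed sketch, treating each threshold pair as tiers where only the worse one applies (one reasonable reading of the table); metric names are illustrative:

// Sketch: deduction-based scoring — start at 100, subtract per condition from the table
function healthScore(m) {
  let score = 100;
  const hit = (cond, points) => { if (cond) score -= points; };

  hit(m.diskPct > 90, 30);
  hit(m.diskPct > 80 && m.diskPct <= 90, 15);
  hit(m.memPct > 95, 25);
  hit(m.memPct > 85 && m.memPct <= 95, 10);
  hit(m.cpuPct > 80, 20);
  hit(m.cpuPct > 50 && m.cpuPct <= 80, 10);
  hit(m.load1 > m.cores * 1.5, 10);
  hit(m.load1 > m.cores && m.load1 <= m.cores * 1.5, 5);
  hit(m.inodePct > 90, 20);
  hit(m.inodePct > 80 && m.inodePct <= 90, 10);
  hit(m.swapPct > 80, 10);
  hit(m.swapPct > 50 && m.swapPct <= 80, 5);
  hit(m.maxRestarts > 5, 15);
  hit(m.maxRestarts > 0 && m.maxRestarts <= 5, 5);
  score -= 10 * (m.stoppedContainers || 0); // -10 per stopped/failed container
  hit(m.zombies > 0, 5);
  hit(m.failedServices > 0, 10);
  hit(m.dockerReclaimablePct > 50, 5);

  return Math.max(score, 0); // floor at 0 (an assumption; negative scores aren't meaningful)
}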
The score maps to a Discord embed color in the notification node:
function getScoreColor(score) {
  if (score >= 80) return 3066993;  // Green
  if (score >= 50) return 16776960; // Yellow
  return 15158332;                  // Red
}
These are Discord's integer color values. The embed uses the color to make the severity immediately readable without opening the message.
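For reference, the payload is just an embed object with that color dropped in. A minimal sketch (score, cpuPct, memPct, and worstDiskPct are stand-ins for values computed earlier; the field layout follows Discord's embed schema):

// Sketch: minimal Discord embed payload built around the score color
const embed = {
  title: `${metrics.HOSTNAME} health: ${score}/100`,
  color: getScoreColor(score),
  fields: [
    { name: 'CPU', value: `${cpuPct}%`, inline: true },
    { name: 'Memory', value: `${memPct}%`, inline: true },
    { name: 'Disk (worst)', value: `${worstDiskPct}%`, inline: true },
  ],
  timestamp: new Date().toISOString(),
};
return [{ json: { embeds: [embed] } }];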
One thing I adjusted after seeing real data: I originally had separate thresholds for memory at 70%, 80%, and 90%. Three thresholds created too much noise. Most servers run hot on memory by design — Linux caches aggressively, JVM heaps get pre-allocated, databases take what you give them. I collapsed it to two thresholds and the alert-to-actual-problem ratio improved significantly.
Pattern 3: Two Triggers, One Workflow
My monitoring has two different cadences. Daily at 7 AM: full digest with AI analysis, trend comparisons against the previous seven days, container health summary. Every 5 minutes, around the clock: quick threshold check, alert to Discord only if something is critical.
The obvious structure is two separate workflows. I didn't do that, and I'm glad.
n8n supports multiple trigger nodes in a single workflow. Each trigger starts a completely independent execution path. They share nothing at runtime — not state, not variables, not Config nodes. But they share the same workflow definition, which means one place to update node configurations, one credential to manage, one workflow to version.
The structure looks like this: two Schedule Trigger nodes at the top of the canvas, each fanning out into its own branch. The daily path runs through thirteen nodes — SSH collection, parsing, AI analysis, history lookup, Discord format, send. The 5-minute path runs through seven — SSH collection, threshold check, conditional Discord send.
The honest trade-off: you can't share a Config node across triggers. Config nodes are read at the start of execution, and each trigger starts its own execution from scratch. So the daily path has a Config node with all the threshold values and server list. The 5-minute path hardcodes its threshold values directly in the Code node, with inline comments explaining what each number means.
// Threshold values — kept here because Config node can't be shared across trigger paths
const DISK_CRITICAL = 90; // percent — beyond this we lose write capability
const MEM_CRITICAL = 95; // percent — above this the OOM killer wakes up
const LOAD_MULTIPLIER = 1.5; // load average / core count — 1.5x sustained is real saturation
const RESTART_THRESHOLD = 5; // restarts — below this could be normal deploys
This is a small violation of DRY that I've made peace with. The alternative — two separate workflows with duplicated SSH collection logic — is worse.
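For completeness, the check in the 5-minute path is a handful of comparisons against those constants, and the branch only reaches the Discord node when something crosses a line. A sketch; the metric names on m are illustrative:

// Sketch: 5-minute path — collect critical breaches, emit nothing if all clear
const alerts = [];
if (m.diskPct > DISK_CRITICAL)            alerts.push(`Disk at ${m.diskPct}%`);
if (m.memPct > MEM_CRITICAL)              alerts.push(`Memory at ${m.memPct}%`);
if (m.load1 > m.cores * LOAD_MULTIPLIER)  alerts.push(`Load ${m.load1} on ${m.cores} cores`);
if (m.maxRestarts > RESTART_THRESHOLD)    alerts.push(`Container restarted ${m.maxRestarts} times`);

// Returning an empty array stops the branch, so the Discord node only fires on a breach
return alerts.length ? [{ json: { server: m.hostname, alerts } }] : [];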
Pattern 4: Chain, Not Agent
When I added AI analysis to the daily digest, my first instinct was to use an AI Agent node. That instinct was wrong, and it would have added latency and cost for no benefit.
AI Agents in n8n are for tasks that need tools, multi-step reasoning, or context accumulation across turns. Analyzing a structured JSON object and producing a formatted health report is none of those things. It's a single shot: one structured input, one formatted output.
The Basic LLM Chain node (chainLlm) is the right tool. No tool loop overhead. No token budget consumed by tool descriptions. No risk of the agent deciding to take unexpected actions. One prompt in, one response out.
The system message is where the actual work happens. The AI is told to act as a senior DevOps engineer and return structured JSON — not free-form text:
You are a senior DevOps engineer performing a daily health review.
You receive JSON containing current server metrics and 7-day history.
Respond with ONLY valid JSON in this exact structure:
{
  "status": "healthy|warning|critical",
  "headline": "One sentence overall status",
  "findings": [
    {
      "severity": "HIGH|MED|LOW",
      "title": "Short finding title",
      "detail": "What's happening and why it matters",
      "command": "exact CLI command to investigate or fix"
    }
  ],
  "trend_summary": "2-3 sentences on 7-day trends, or 'Insufficient history' if < 3 days",
  "top_recommendation": "Single most important action to take"
}
Rules:
- Every finding MUST include a specific CLI command. No generic advice.
- Order findings by severity (HIGH first).
- Maximum 6 findings. If more exist, prioritize by impact.
- No markdown. No preamble. No explanation outside the JSON.
The structured JSON output is a deliberate upgrade from the free-form text in earlier versions. Every finding comes with a severity tag and a CLI command you can copy-paste. The downstream Code node parses the JSON directly — no regex splitting on section headers, no hoping the LLM kept the format consistent. JSON.parse() either works or it doesn't.
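The parse step is a few lines, plus one guard for the occasional model that wraps its answer in a code fence despite the rules. A sketch; the output field name depends on the chain node version:

// Sketch: parse the LLM's JSON response, tolerating a stray markdown fence
const raw = $input.first().json.text ?? $input.first().json.output; // field name varies by node version
const cleaned = String(raw).replace(/^```(?:json)?\s*/i, '').replace(/```\s*$/, '').trim();

let analysis;
try {
  analysis = JSON.parse(cleaned);
} catch (e) {
  // Surface a readable failure instead of a cryptic downstream error
  throw new Error(`LLM did not return valid JSON: ${e.message}\n${cleaned.slice(0, 200)}`);
}
return [{ json: analysis }];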
The "senior DevOps engineer" persona matters more than it sounds. Without it, the LLM gives you cautious, hedging advice. With it, you get direct recommendations: "run docker system prune -af to reclaim 12GB" instead of "you may want to consider cleaning up unused Docker resources."
Agents add real value for tasks that genuinely need reasoning loops. For batch analysis of structured data, they're overhead.
Pattern 5: Graceful First-Run Handling
Small thing. Matters more than it sounds.
On first import of the workflow, the Google Sheets node that fetches historical health data has nothing to read. The sheet is empty. The default behavior is to fail, which means anyone who downloads this workflow sees an error before they've collected a single data point.
The fix is two node settings:
- On Error: Continue (regular output)
- Always Output Data: true
With these set, an empty or missing sheet passes through as an empty array instead of throwing an error. The downstream Code node checks:
const history = $('Get History').all();
const hasHistory = history.length > 0 && history[0].json.date !== undefined;
// Adjust the prompt based on data availability
const trendInstruction = hasHistory
  ? `Historical data from the past ${history.length} days is provided in the historical_data field.`
  : `No historical data is available yet. Analyze current state only. Set trend_summary to "Insufficient history".`;
This is the difference between a workflow that works on import and one that requires a setup ritual. The first-time user doesn't have to know to seed the sheet with placeholder data or read a setup section explaining why it fails.
(I will admit I did not think of this on the first build. A friend imported it, pinged me that it was broken, and I felt appropriately embarrassed.)
The same pattern applies to any node that reads external state that might not exist yet: Airtable, Notion, a webhook that hasn't fired, a database table that's empty on first run. Set continueRegularOutput on error, output empty data, check for it downstream.
All Five Patterns
- Compound SSH commands — one connection per server, labeled KEY=VALUE output plus section-delimited blocks (---FS---, ---NET---, ---TOP5CPU---), parsed in a single Code node. Collects real CPU% via /proc/stat delta sampling, all mounted filesystems, network I/O, zombie processes, failed services, and more. Docker gets its own second SSH call with pipe-delimited output plus docker system df for ecosystem health.
- Deduction-based health scoring — start at 100, subtract by severity across 18 conditions. Disk failures still weighted highest because running out of disk is catastrophic. CPU% (real utilization, not load average) now carries proper weight. Swap, zombies, failed services, and Docker reclaimable space catch the long-tail problems. Map score to Discord embed color for at-a-glance severity.
- Dual-trigger architecture — two Schedule Triggers in one workflow, separate execution paths. Daily path does full AI analysis with 13 nodes; 5-minute path does lightweight threshold checks with 7 nodes. Config nodes can't be shared, so the fast path hardcodes its thresholds inline with comments.
- LLM Chain for structured analysis — Basic LLM Chain, not an AI Agent, for single-shot structured analysis. System message defines a senior DevOps engineer persona and requires structured JSON output: {status, headline, findings[{severity, title, detail, command}], trend_summary, top_recommendation}. Every finding includes a CLI fix command. No free-form text to parse.
- Graceful first-run handling — continueRegularOutput + alwaysOutputData: true on any node reading external state that might not exist yet. Downstream Code node detects empty input and adjusts behavior accordingly.
The complete workflow JSON — with all five patterns wired together, the Discord formatting, the Google Sheets history integration, and the AI analysis chain — is available in the nXsi homelab health dashboard product kit. It includes a setup guide covering the SSH key configuration, the Sheets schema, and the Discord webhook setup.