Monitoring Your Homelab: Prometheus + Grafana Stack Complete Guide 2026
You are running a homelab. Maybe it is three mini PCs, a NAS, and a Raspberry Pi. Maybe it is a Proxmox cluster with a dozen VMs and fifty Docker containers. Either way, you have no idea how much RAM is actually being used, whether that disk is filling up, or why Jellyfin stutters every Tuesday evening. You find out about problems when your partner tells you the media server is down. Again.
Monitoring fixes this. Not enterprise-grade, six-figure-contract monitoring. Homelab monitoring: a stack that runs on a single machine, scrapes metrics from everything in your network, visualizes them in beautiful dashboards, and sends you an alert before things break instead of after.
The standard stack in 2026 is the same one that has dominated for years, because it works: Prometheus for metric collection and storage, Grafana for visualization and dashboards, Node Exporter for hardware and OS metrics, and Alertmanager for routing alerts to your phone, email, or Discord server.
This guide deploys the entire stack with Docker Compose, configures it to monitor your Docker hosts, VMs, network equipment, and containers, builds useful dashboards, and sets up alerts that actually tell you something actionable.
Table of Contents
- TL;DR
- Architecture Overview
- Step 1: Docker Compose for the Full Stack
- Step 2: Prometheus Configuration
- Step 3: Alertmanager Configuration
- Step 4: Alert Rules
- Step 5: Deploy and Verify
- Step 6: Grafana Setup and Data Sources
- Building Dashboards
- Monitoring Docker Containers
- Monitoring Proxmox and Virtual Machines
- Monitoring Network Equipment
- Storage and Retention
- Securing the Stack
- Advanced: Loki for Log Aggregation
- Troubleshooting Common Issues
- FAQ
- Conclusion
TL;DR
- Prometheus scrapes metrics from targets (servers, containers, network devices) at regular intervals and stores them as time-series data.
- Grafana connects to Prometheus and visualizes metrics in customizable dashboards.
- Node Exporter runs on each machine and exposes hardware/OS metrics (CPU, RAM, disk, network) for Prometheus to scrape.
- Alertmanager receives alerts from Prometheus and routes them to email, Slack, Discord, PagerDuty, or any webhook.
- cAdvisor exposes per-container resource metrics (CPU, memory, network I/O) from Docker.
- The entire stack runs in Docker Compose and uses about 500 MB-1 GB of RAM depending on the number of targets.
- Deploy the stack on one node. Run Node Exporter on every node you want to monitor.
Architecture Overview
+-------------------+ +-------------------+ +-------------------+
| Server Node 1 | | Server Node 2 | | Server Node 3 |
| | | | | |
| [Node Exporter] | | [Node Exporter] | | [Node Exporter] |
| [cAdvisor] | | [cAdvisor] | | [cAdvisor] |
| port 9100, 8080 | | port 9100, 8080 | | port 9100, 8080 |
+---------+---------+ +---------+---------+ +---------+---------+
| | |
+------------+------------+------------+------------+
|
+--------v---------+
| Monitor Node |
| |
| [Prometheus] | <--- scrapes all exporters
| [Grafana] | <--- visualizes Prometheus data
| [Alertmanager] | <--- routes alerts
| [Node Exporter] |
| [cAdvisor] |
+------------------+
Prometheus operates on a pull model: it reaches out to targets and scrapes metrics at defined intervals (default: 15 seconds). This is different from push-based systems like InfluxDB/Telegraf where agents push data to a central server. The pull model means you configure targets in Prometheus, and Prometheus handles the rest.
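Because the model is pull-based, you can preview exactly what Prometheus will collect by requesting an exporter's /metrics endpoint yourself. A quick spot check against one of your hosts (the IP is a placeholder for a machine running Node Exporter):
curl -s http://192.168.1.11:9100/metrics | grep -E "^node_load1|^node_memory_MemAvailable_bytes"
If this returns a couple of metric lines, Prometheus will be able to scrape that host too.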
Step 1: Docker Compose for the Full Stack
Create a directory for your monitoring stack:
mkdir -p /opt/monitoring/{prometheus,grafana,alertmanager}
Here is the complete Docker Compose file:
# /opt/monitoring/docker-compose.yml
services:
# -------------------------------------------------------------------
# Prometheus: metric collection and storage
# -------------------------------------------------------------------
prometheus:
image: prom/prometheus:latest
container_name: prometheus
restart: unless-stopped
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=90d"
- "--storage.tsdb.retention.size=10GB"
- "--web.enable-lifecycle"
- "--web.enable-admin-api"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
- prometheus_data:/prometheus
ports:
- "9090:9090"
networks:
- monitoring
extra_hosts:
- "host.docker.internal:host-gateway"
# -------------------------------------------------------------------
# Grafana: visualization and dashboards
# -------------------------------------------------------------------
grafana:
image: grafana/grafana:latest
container_name: grafana
restart: unless-stopped
environment:
GF_SECURITY_ADMIN_USER: admin
GF_SECURITY_ADMIN_PASSWORD: "CHANGE_ME_STRONG_PASSWORD"
GF_USERS_ALLOW_SIGN_UP: "false"
GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"
GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel"
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning:ro
ports:
- "3000:3000"
networks:
- monitoring
depends_on:
- prometheus
# -------------------------------------------------------------------
# Alertmanager: alert routing and notification
# -------------------------------------------------------------------
alertmanager:
image: prom/alertmanager:latest
container_name: alertmanager
restart: unless-stopped
command:
- "--config.file=/etc/alertmanager/alertmanager.yml"
- "--storage.path=/alertmanager"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
- alertmanager_data:/alertmanager
ports:
- "9093:9093"
networks:
- monitoring
# -------------------------------------------------------------------
# Node Exporter: hardware and OS metrics
# -------------------------------------------------------------------
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- "--path.procfs=/host/proc"
- "--path.rootfs=/rootfs"
- "--path.sysfs=/host/sys"
- "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
ports:
- "9100:9100"
networks:
- monitoring
pid: host
# -------------------------------------------------------------------
# cAdvisor: container metrics
# -------------------------------------------------------------------
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /cgroup:/cgroup:ro
ports:
- "8080:8080"
networks:
- monitoring
networks:
monitoring:
driver: bridge
volumes:
prometheus_data:
grafana_data:
alertmanager_data:
Node Exporter on Additional Hosts
On every other machine you want to monitor, deploy Node Exporter and cAdvisor (if running Docker). You can use a minimal Compose file:
# /opt/node-exporter/docker-compose.yml (on each additional host)
services:
node-exporter:
image: prom/node-exporter:latest
container_name: node-exporter
restart: unless-stopped
command:
- "--path.procfs=/host/proc"
- "--path.rootfs=/rootfs"
- "--path.sysfs=/host/sys"
- "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
ports:
- "9100:9100"
pid: host
cadvisor:
image: gcr.io/cadvisor/cadvisor:latest
container_name: cadvisor
restart: unless-stopped
privileged: true
devices:
- /dev/kmsg:/dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /cgroup:/cgroup:ro
ports:
- "8080:8080"
Deploy on each host:
docker compose up -d
Step 2: Prometheus Configuration
The Prometheus configuration file defines what to scrape, how often, and where to find alert rules.
# /opt/monitoring/prometheus/prometheus.yml
global:
scrape_interval: 15s # How often to scrape targets
evaluation_interval: 15s # How often to evaluate alert rules
scrape_timeout: 10s # Timeout for each scrape request
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
# Alert rule files
rule_files:
- "alert_rules.yml"
# Scrape configurations
scrape_configs:
# -------------------------------------------------------------------
# Prometheus self-monitoring
# -------------------------------------------------------------------
- job_name: "prometheus"
static_configs:
- targets: ["localhost:9090"]
labels:
instance: "monitor-node"
# -------------------------------------------------------------------
# Node Exporter: system metrics from all hosts
# -------------------------------------------------------------------
- job_name: "node-exporter"
static_configs:
- targets: ["node-exporter:9100"]
labels:
instance: "monitor-node"
location: "rack-1"
- targets: ["192.168.1.11:9100"]
labels:
instance: "server-2"
location: "rack-1"
- targets: ["192.168.1.12:9100"]
labels:
instance: "server-3"
location: "rack-1"
- targets: ["192.168.1.20:9100"]
labels:
instance: "nas"
location: "rack-1"
# -------------------------------------------------------------------
# cAdvisor: container metrics from all Docker hosts
# -------------------------------------------------------------------
- job_name: "cadvisor"
static_configs:
- targets: ["cadvisor:8080"]
labels:
instance: "monitor-node"
- targets: ["192.168.1.11:8080"]
labels:
instance: "server-2"
- targets: ["192.168.1.12:8080"]
labels:
instance: "server-3"
# -------------------------------------------------------------------
# Docker daemon metrics (if enabled)
# -------------------------------------------------------------------
- job_name: "docker"
static_configs:
- targets: ["host.docker.internal:9323"]
labels:
instance: "monitor-node"
# -------------------------------------------------------------------
# Traefik metrics (if Traefik is your reverse proxy)
# -------------------------------------------------------------------
- job_name: "traefik"
static_configs:
- targets: ["traefik:8082"]
labels:
instance: "traefik"
# -------------------------------------------------------------------
# Application-specific exporters
# -------------------------------------------------------------------
# Uncomment and configure as needed:
# - job_name: "postgres"
# static_configs:
# - targets: ["postgres-exporter:9187"]
# labels:
# instance: "main-db"
# - job_name: "redis"
# static_configs:
# - targets: ["redis-exporter:9121"]
# labels:
# instance: "main-redis"
# - job_name: "nginx"
# static_configs:
# - targets: ["nginx-exporter:9113"]
# labels:
# instance: "nginx"
Understanding Scrape Configuration
Each job_name groups related targets. The static_configs section lists the IP:port combinations where Prometheus can scrape metrics. Labels like instance and location are attached to every metric from that target, making it easy to filter and group data in Grafana.
For the docker job, you need to enable Docker daemon metrics. Add this to /etc/docker/daemon.json on each Docker host:
{
"metrics-addr": "0.0.0.0:9323",
"experimental": true
}
Then restart Docker:
sudo systemctl restart docker
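If the daemon metrics are enabled correctly, the engine answers on port 9323. A quick check (the grep simply narrows the output to the engine's own metric families):
curl -s http://localhost:9323/metrics | grep "^engine_daemon" | head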
For Traefik metrics, add these command flags to your Traefik container:
command:
- "--metrics.prometheus=true"
- "--metrics.prometheus.entryPoint=metrics"
- "--entrypoints.metrics.address=:8082"
Step 3: Alertmanager Configuration
Alertmanager receives alerts from Prometheus and routes them to notification channels. Here is a configuration that sends alerts to email and Discord:
# /opt/monitoring/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
smtp_from: "alerts@yourdomain.com"
smtp_smarthost: "smtp.gmail.com:587"
smtp_auth_username: "your_email@gmail.com"
smtp_auth_password: "your_app_password"
smtp_require_tls: true
# Notification templates
templates:
- "/etc/alertmanager/templates/*.tmpl"
# Routing tree
route:
group_by: ["alertname", "instance"]
group_wait: 30s # Wait before sending first notification
group_interval: 5m # Wait before sending updates for same group
repeat_interval: 4h # Re-send if alert is still firing
receiver: "email" # Default receiver
routes:
# Critical alerts go to all channels immediately
- match:
severity: critical
receiver: "all-channels"
group_wait: 10s
repeat_interval: 1h
# Warning alerts go to Discord only
- match:
severity: warning
receiver: "discord"
repeat_interval: 12h
# Notification receivers
receivers:
- name: "email"
email_configs:
- to: "you@yourdomain.com"
send_resolved: true
headers:
subject: "[Homelab Alert] {{ .GroupLabels.alertname }}"
- name: "discord"
webhook_configs:
- url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
send_resolved: true
- name: "all-channels"
email_configs:
- to: "you@yourdomain.com"
send_resolved: true
webhook_configs:
- url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
send_resolved: true
# Inhibition rules (suppress less severe alerts when critical ones fire)
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
equal: ["alertname", "instance"]
Setting Up Discord Webhooks
Discord is a popular alert destination for homelab users because it is free, has mobile push notifications, and supports rich message formatting. Alertmanager supports Discord natively via discord_configs (used above); if you run an older release without it, you will need a webhook translation bridge instead.
- Open your Discord server settings.
- Go to Integrations > Webhooks.
- Click “New Webhook” and copy the webhook URL.
- Paste the URL into the Alertmanager configuration.
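With the webhook URL in place and the stack restarted, you can push a synthetic alert straight into Alertmanager to confirm the route works, without waiting for Prometheus to fire anything:
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H "Content-Type: application/json" \
  -d '[{"labels":{"alertname":"TestAlert","severity":"warning","instance":"test"},"annotations":{"summary":"Test alert","description":"Safe to ignore."}}]'
Because the severity label is warning, the routing tree above delivers it to the Discord receiver after the 30-second group_wait.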
Setting Up Slack (Alternative)
If you prefer Slack:
receivers:
- name: "slack"
slack_configs:
- api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
channel: "#homelab-alerts"
send_resolved: true
title: '{{ .GroupLabels.alertname }}'
text: >-
{{ range .Alerts }}
*Alert:* {{ .Annotations.summary }}
*Description:* {{ .Annotations.description }}
*Severity:* {{ .Labels.severity }}
*Instance:* {{ .Labels.instance }}
{{ end }}
Setting Up Ntfy (Self-Hosted Alternative)
Ntfy is a self-hosted push notification service that pairs well with a monitoring stack:
receivers:
- name: "ntfy"
webhook_configs:
- url: "https://ntfy.yourdomain.com/homelab-alerts"
send_resolved: true
http_config:
basic_auth:
username: "alertmanager"
password: "your_ntfy_password"
Step 4: Alert Rules
Alert rules define the conditions under which Prometheus fires alerts. Here is a comprehensive set of rules for homelab monitoring:
# /opt/monitoring/prometheus/alert_rules.yml
groups:
# -------------------------------------------------------------------
# Host alerts: CPU, memory, disk, network
# -------------------------------------------------------------------
- name: host_alerts
rules:
- alert: HighCpuUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 85% for more than 10 minutes. Current value: {{ $value | printf \"%.1f\" }}%"
- alert: CriticalCpuUsage
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Critical CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 95% for more than 5 minutes. Current value: {{ $value | printf \"%.1f\" }}%"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 85% for more than 10 minutes. Current value: {{ $value | printf \"%.1f\" }}%"
- alert: CriticalMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Critical memory usage on {{ $labels.instance }}"
description: "Memory usage is above 95% for more than 5 minutes. Current value: {{ $value | printf \"%.1f\" }}%"
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
for: 15m
labels:
severity: warning
annotations:
summary: "Disk space low on {{ $labels.instance }}"
description: "Root partition is {{ $value | printf \"%.1f\" }}% full on {{ $labels.instance }}"
- alert: DiskSpaceCritical
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space critical on {{ $labels.instance }}"
description: "Root partition is {{ $value | printf \"%.1f\" }}% full on {{ $labels.instance }}. Immediate action required."
- alert: DiskWillFillIn24Hours
expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
for: 1h
labels:
severity: critical
annotations:
summary: "Disk will fill within 24 hours on {{ $labels.instance }}"
description: "Based on current growth rate, the root partition on {{ $labels.instance }} will be full within 24 hours."
- alert: HighNetworkTraffic
expr: rate(node_network_receive_bytes_total{device!="lo"}[5m]) > 100000000
for: 15m
labels:
severity: warning
annotations:
summary: "High network traffic on {{ $labels.instance }}"
description: "Network interface {{ $labels.device }} is receiving more than 100 MB/s for 15+ minutes."
- alert: SystemdServiceFailed
expr: node_systemd_unit_state{state="failed"} == 1
for: 5m
labels:
severity: warning
annotations:
summary: "Systemd service failed on {{ $labels.instance }}"
description: "Service {{ $labels.name }} has been in failed state for more than 5 minutes."
- alert: HostRebooted
expr: (node_time_seconds - node_boot_time_seconds) < 600
for: 0m
labels:
severity: info
annotations:
summary: "Host {{ $labels.instance }} was recently rebooted"
description: "{{ $labels.instance }} booted less than 10 minutes ago."
# -------------------------------------------------------------------
# Container alerts
# -------------------------------------------------------------------
- name: container_alerts
rules:
- alert: ContainerHighCpu
expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high CPU"
description: "Container {{ $labels.name }} on {{ $labels.instance }} is using more than 80% CPU for 10+ minutes."
- alert: ContainerHighMemory
expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 85
for: 10m
labels:
severity: warning
annotations:
summary: "Container {{ $labels.name }} high memory"
description: "Container {{ $labels.name }} is using {{ $value | printf \"%.1f\" }}% of its memory limit."
      - alert: ContainerRestarting
        expr: changes(container_start_time_seconds{name!=""}[15m]) > 2
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is restart-looping"
          description: "Container {{ $labels.name }} on {{ $labels.instance }} has started more than twice in the last 15 minutes."
# -------------------------------------------------------------------
# Prometheus self-monitoring alerts
# -------------------------------------------------------------------
- name: prometheus_alerts
rules:
- alert: PrometheusTargetDown
expr: up == 0
for: 5m
labels:
severity: critical
annotations:
summary: "Prometheus target down: {{ $labels.job }}/{{ $labels.instance }}"
description: "Prometheus cannot reach {{ $labels.instance }} (job: {{ $labels.job }}) for more than 5 minutes."
- alert: PrometheusHighMemory
expr: process_resident_memory_bytes{job="prometheus"} > 2147483648
for: 15m
labels:
severity: warning
annotations:
summary: "Prometheus using more than 2 GB RAM"
description: "Prometheus memory usage: {{ $value | humanize }}. Consider reducing retention or number of targets."
- alert: PrometheusTsdbHighCardinality
expr: prometheus_tsdb_head_series > 500000
for: 1h
labels:
severity: warning
annotations:
summary: "Prometheus has high time series count"
description: "Prometheus is tracking {{ $value }} time series. High cardinality impacts performance and storage."
How Alert Rules Work
Each rule has four parts:
- expr: A PromQL expression that evaluates to true when the condition is met.
- for: How long the condition must be true before the alert fires. This prevents alerts on brief spikes.
- labels: Metadata attached to the alert, used by Alertmanager for routing.
- annotations: Human-readable descriptions used in notifications.
The predict_linear function in the DiskWillFillIn24Hours rule is one of the most useful features in Prometheus. Instead of alerting when disk is 90% full (which might be fine for months), it looks at the growth trend over the last 6 hours and alerts if the disk will fill within 24 hours. This catches runaway log files and data ingestion issues before they become outages.
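A syntax error in either file will keep Prometheus from loading it, so validate both before deploying. The prom/prometheus image includes promtool; a quick check using the paths from this guide:
docker run --rm --entrypoint promtool \
  -v /opt/monitoring/prometheus:/etc/prometheus:ro \
  prom/prometheus:latest check config /etc/prometheus/prometheus.yml
docker run --rm --entrypoint promtool \
  -v /opt/monitoring/prometheus:/etc/prometheus:ro \
  prom/prometheus:latest check rules /etc/prometheus/alert_rules.yml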
Step 5: Deploy and Verify
Set up Grafana provisioning for auto-configured data sources:
mkdir -p /opt/monitoring/grafana/provisioning/{datasources,dashboards}
Create the Prometheus data source provisioning file:
# /opt/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: false
Now deploy everything:
cd /opt/monitoring
docker compose up -d
# Wait 30 seconds for everything to start, then check
docker compose ps
# Verify Prometheus is scraping targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | head -50
# Verify Alertmanager is running
curl -s http://localhost:9093/api/v2/status | python3 -m json.tool
# Check Prometheus logs for any config errors
docker compose logs prometheus | tail -20
Access the web interfaces:
- Prometheus: http://your-server-ip:9090
- Grafana: http://your-server-ip:3000 (default login: admin / your configured password)
- Alertmanager: http://your-server-ip:9093
In Prometheus, navigate to Status > Targets. You should see all your configured targets with their current state (UP or DOWN). If any target shows DOWN, verify that the exporter is running on that host and that the port is accessible from the monitoring node.
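The same check is scriptable through the query API. This returns only the targets Prometheus currently considers down (ideally an empty result list):
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=up == 0' | python3 -m json.tool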
Step 6: Grafana Setup and Data Sources
If you used the provisioning file above, Prometheus is already configured as a data source. If not, add it manually:
- Log in to Grafana.
- Navigate to Connections > Data Sources > Add data source.
- Select Prometheus.
- Set the URL to http://prometheus:9090.
- Click “Save & Test”.
Importing Community Dashboards
One of Grafana’s best features is its library of community dashboards. Instead of building everything from scratch, import pre-built dashboards by their ID:
- In Grafana, click Dashboards > New > Import.
- Enter the dashboard ID and click “Load”.
- Select your Prometheus data source and click “Import”.
Essential dashboard IDs:
| Dashboard | ID | Description |
|---|---|---|
| Node Exporter Full | 1860 | Comprehensive host metrics with CPU, memory, disk, network graphs |
| Docker Container Monitoring | 893 | Per-container CPU, memory, and network metrics from cAdvisor |
| Prometheus Stats | 2 | Prometheus self-monitoring (scrape duration, target count, TSDB stats) |
| Alertmanager Overview | 9578 | Alert counts, notification rates, silences |
| Traefik Dashboard | 17346 | Request rates, response codes, latency (if using Traefik) |
The Node Exporter Full dashboard (ID 1860) is the single most useful dashboard for homelab monitoring. It gives you everything you need to understand the health of each machine at a glance.
Building Dashboards
Community dashboards cover most needs, but building custom dashboards teaches you PromQL and lets you create views tailored to your setup.
PromQL Basics
PromQL is Prometheus’s query language. Here are the queries you will use most often:
CPU usage percentage:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
This calculates the percentage of time the CPU was NOT idle over the last 5 minutes.
Memory usage percentage:
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk usage percentage:
(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100
Network traffic (bytes per second):
rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])
Container CPU usage:
rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100
Container memory usage:
container_memory_usage_bytes{name!=""}
Building a Homelab Overview Dashboard
Here is how to build a single-pane-of-glass dashboard for your homelab:
- Create a new dashboard in Grafana.
- Add a Stat panel for total nodes:
  - Query: count(up{job="node-exporter"} == 1)
  - Title: “Nodes Online”
  - Threshold: green at 3 (or however many nodes you have), red below that.
- Add a Stat panel for total containers:
  - Query: count(container_last_seen{name!=""})
  - Title: “Running Containers”
- Add a Table panel for per-node status:
  - Query 1: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) (CPU %)
  - Query 2: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 (Memory %)
  - Query 3: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 (Disk %)
  - Use Transform > Merge to combine queries into a single table.
  - Add value thresholds: green < 70, yellow < 85, red >= 85.
- Add Time Series panels for trends:
  - CPU over time per node
  - Memory over time per node
  - Network traffic per node
  - Disk I/O per node
- Add a panel for container resource usage:
  - Top 10 containers by CPU: topk(10, rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100)
  - Top 10 containers by memory: topk(10, container_memory_usage_bytes{name!=""})
- Save the dashboard and set it as your home dashboard in Grafana settings.
Dashboard Variables
Make your dashboards interactive with template variables:
- Go to Dashboard Settings > Variables.
- Add a variable named instance with the query label_values(node_uname_info, instance).
- Use $instance in your panel queries to filter by the selected node.
This lets you switch between nodes in a dropdown instead of creating separate dashboards for each host.
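Hand-built dashboards live in Grafana's database, which the grafana_data volume already persists. If you would rather keep them as files (and in Git), Grafana can also load dashboards from the provisioning directory mounted in the Compose file. A minimal provider definition, assuming you drop exported dashboard JSON files into the same directory:
# /opt/monitoring/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: "homelab"
    orgId: 1
    folder: "Homelab"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards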
Monitoring Docker Containers
cAdvisor provides detailed per-container metrics, but the Docker daemon itself also exposes useful metrics.
Docker Daemon Metrics
Enable metrics in /etc/docker/daemon.json:
{
"metrics-addr": "0.0.0.0:9323",
"experimental": true
}
This exposes metrics about the Docker engine itself: number of running containers, image pull times, build durations, and more.
Useful Container Queries
Containers sorted by memory usage:
sort_desc(container_memory_usage_bytes{name!=""})
Container restarts (last 24 hours):
changes(container_start_time_seconds{name!=""}[24h])
Container network I/O:
rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])
Container disk I/O:
rate(container_fs_reads_bytes_total{name!=""}[5m])
rate(container_fs_writes_bytes_total{name!=""}[5m])
Monitoring Proxmox and Virtual Machines
If you run Proxmox VE, there are two approaches to monitoring.
Option 1: Node Exporter Inside Each VM
Install Node Exporter in each VM just like a physical host. This gives you the same granularity as bare-metal monitoring. The VM does not know it is virtual, and Node Exporter reports hardware metrics from the VM’s perspective.
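For guests that do not run Docker, the static binary plus a systemd unit works just as well. A sketch for an amd64 Linux VM; set the version variable to the current release from the node_exporter releases page:
NE_VERSION=1.8.2  # check https://github.com/prometheus/node_exporter/releases for the latest
curl -fsSL -o /tmp/node_exporter.tar.gz \
  "https://github.com/prometheus/node_exporter/releases/download/v${NE_VERSION}/node_exporter-${NE_VERSION}.linux-amd64.tar.gz"
tar -xzf /tmp/node_exporter.tar.gz -C /tmp
sudo mv "/tmp/node_exporter-${NE_VERSION}.linux-amd64/node_exporter" /usr/local/bin/
sudo useradd --system --no-create-home --shell /usr/sbin/nologin node_exporter || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
Then add the VM's IP and port 9100 to the node-exporter job in prometheus.yml like any other host.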
Option 2: Proxmox VE Exporter
The Proxmox VE Exporter scrapes the Proxmox API and exposes metrics about all VMs and containers managed by Proxmox, without needing anything installed inside the guests.
# Add to docker-compose.yml
pve-exporter:
image: prompve/prometheus-pve-exporter:latest
container_name: pve-exporter
restart: unless-stopped
environment:
PVE_USER: "prometheus@pve"
PVE_PASSWORD: "your_pve_password"
PVE_VERIFY_SSL: "false"
ports:
- "9221:9221"
networks:
- monitoring
Create a monitoring user in Proxmox:
# On the Proxmox host
pveum user add prometheus@pve --password your_pve_password
pveum aclmod / -user prometheus@pve -role PVEAuditor
Add the scrape config to Prometheus:
- job_name: "proxmox"
static_configs:
- targets: ["pve-exporter:9221"]
params:
target: ["192.168.1.100"] # Your Proxmox host IP
metrics_path: /pve
This gives you VM CPU, memory, disk, and network metrics without installing anything in the VMs, plus Proxmox cluster health metrics.
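Before wiring it into Prometheus, confirm the exporter can actually reach the Proxmox API (same target IP as in the scrape config above):
curl -s "http://localhost:9221/pve?target=192.168.1.100" | head -20
If this returns pve_* metrics, the Prometheus job will work too.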
Recommended Approach
Use both. The Proxmox exporter gives you the host-level view (how much of the physical resources each VM consumes), while Node Exporter inside VMs gives you the guest-level view (how the VM perceives its own resources). Both perspectives are useful.
Monitoring Network Equipment
SNMP Exporter for Switches and Routers
Most managed network switches and routers expose metrics via SNMP. The Prometheus SNMP Exporter translates SNMP data into Prometheus metrics.
# Add to docker-compose.yml
snmp-exporter:
image: prom/snmp-exporter:latest
container_name: snmp-exporter
restart: unless-stopped
volumes:
- ./snmp/snmp.yml:/etc/snmp_exporter/snmp.yml:ro
ports:
- "9116:9116"
networks:
- monitoring
Add the scrape config for each SNMP device:
- job_name: "snmp"
static_configs:
- targets:
- "192.168.1.1" # Router
- "192.168.1.2" # Managed switch
metrics_path: /snmp
params:
auth: ["public_v2"]
module: ["if_mib"]
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: snmp-exporter:9116
This gives you interface traffic, error counts, and port status for managed switches and routers.
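The snmp.yml mounted above defines the SNMP modules and credentials; the snmp_exporter project ships a generated default plus a generator tool for building your own. To confirm a device answers before involving Prometheus, probe it through the exporter the same way the relabeling does (the module and auth names must match your snmp.yml; the auth parameter applies to recent snmp_exporter releases):
curl -s "http://localhost:9116/snmp?target=192.168.1.1&module=if_mib&auth=public_v2" | head -20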
Blackbox Exporter for Uptime Monitoring
The Blackbox Exporter probes endpoints via HTTP, TCP, ICMP, or DNS and reports whether they are up and how fast they respond.
# Add to docker-compose.yml
blackbox-exporter:
image: prom/blackbox-exporter:latest
container_name: blackbox-exporter
restart: unless-stopped
volumes:
- ./blackbox/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
ports:
- "9115:9115"
networks:
- monitoring
# /opt/monitoring/blackbox/blackbox.yml
modules:
http_2xx:
prober: http
timeout: 10s
http:
valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
valid_status_codes: [200]
follow_redirects: true
preferred_ip_protocol: ip4
icmp:
prober: icmp
timeout: 5s
tcp_connect:
prober: tcp
timeout: 5s
# Add to prometheus.yml scrape_configs
- job_name: "blackbox-http"
metrics_path: /probe
params:
module: ["http_2xx"]
static_configs:
- targets:
- "https://grafana.yourdomain.com"
- "https://nextcloud.yourdomain.com"
- "https://jellyfin.yourdomain.com"
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
- job_name: "blackbox-ping"
metrics_path: /probe
params:
module: ["icmp"]
static_configs:
- targets:
- "192.168.1.1" # Router
- "192.168.1.2" # Switch
- "1.1.1.1" # Internet connectivity check
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
replacement: blackbox-exporter:9115
The Blackbox Exporter is invaluable for catching issues that internal metrics miss: DNS resolution failures, SSL certificate expiry, and network connectivity problems between your homelab and the internet.
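Because the http_2xx module records TLS data for HTTPS targets, two small rules turn the uptime and certificate-expiry checks into alerts. A sketch in the same format as Step 4; append it under groups: in alert_rules.yml:
  - name: blackbox_alerts
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Probe failing for {{ $labels.instance }}"
          description: "Blackbox probe of {{ $labels.instance }} has failed for more than 5 minutes."
      - alert: SslCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate expiring soon on {{ $labels.instance }}"
          description: "The certificate for {{ $labels.instance }} expires in less than 14 days."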
Storage and Retention
Prometheus stores time-series data on local disk. The amount of storage depends on:
- Number of time series: each unique combination of metric name and labels is one series.
- Scrape interval: more frequent scrapes = more data points.
- Retention period: how long to keep data.
Estimating Storage Requirements
A rough formula:
Storage per day ≈ (number of series) * (samples per series per day) * (~2 bytes per sample on disk)
For a typical homelab with 3 nodes, 20 containers, and a 15-second scrape interval:
- ~5,000 time series (Node Exporter produces ~500 series per node, cAdvisor ~100 per container)
- 5,760 samples per series per day (86,400 seconds / 15)
- ~5,000 * 5,760 * 2 bytes ≈ 55 MB per day
The 2-bytes-per-sample figure already reflects the TSDB's compression; raw samples are 16 bytes each before Prometheus compresses them.
With 90-day retention, that is roughly 5 GB. Even a large homelab with 10 nodes and 100 containers lands around 15-20 GB for 90 days. Storage is rarely a concern at homelab scale, but keep --storage.tsdb.retention.size set as a safety net.
Configuring Retention
Set both time and size limits in the Prometheus command flags:
command:
- "--storage.tsdb.retention.time=90d"
- "--storage.tsdb.retention.size=10GB"
Prometheus enforces both limits and deletes the oldest data as soon as either one is reached.
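To see how much space the TSDB is actually using, either check the named volume on disk (the volume name assumes the Compose project lives in /opt/monitoring, so the project prefix is monitoring) or ask Prometheus itself:
sudo du -sh "$(docker volume inspect -f '{{ .Mountpoint }}' monitoring_prometheus_data)"
curl -s -G http://localhost:9090/api/v1/query --data-urlencode 'query=prometheus_tsdb_storage_blocks_bytes' | python3 -m json.tool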
Long-Term Storage
If you want to keep metrics longer than 90 days (for capacity planning or historical analysis), consider:
- Thanos: adds long-term storage to Prometheus using object storage (S3, MinIO). Complex but powerful.
- VictoriaMetrics: a Prometheus-compatible TSDB that uses less memory and disk. Can act as a drop-in remote storage backend.
For most homelabs, 90 days of local retention is more than enough.
Securing the Stack
By default, Prometheus and Alertmanager expose their web UIs with no authentication at all, and Grafana is protected only by its own login. On a private homelab network this is often acceptable. If you expose these services externally, secure them.
Put Everything Behind Authelia
If you followed our Authelia guide, add the Authelia middleware to each service:
# Grafana
labels:
- "traefik.http.routers.grafana.middlewares=authelia@docker"
# Prometheus
labels:
- "traefik.http.routers.prometheus.middlewares=authelia@docker"
# Alertmanager
labels:
- "traefik.http.routers.alertmanager.middlewares=authelia@docker"
Restrict Network Access
If not using a reverse proxy, bind services to localhost and use SSH tunnels:
ports:
- "127.0.0.1:9090:9090" # Prometheus
- "127.0.0.1:3000:3000" # Grafana
- "127.0.0.1:9093:9093" # Alertmanager
Access them via SSH tunnel:
ssh -L 3000:localhost:3000 -L 9090:localhost:9090 user@your-server
Firewall Rules
Allow metric scraping ports (9100, 8080) only from the monitoring node:
# On each monitored host, allow only the monitoring node
sudo ufw allow from 192.168.1.10 to any port 9100 proto tcp
sudo ufw allow from 192.168.1.10 to any port 8080 proto tcp
Advanced: Loki for Log Aggregation
Metrics tell you what is happening. Logs tell you why. Grafana Loki is a log aggregation system designed to pair with the Prometheus + Grafana stack.
# Add to docker-compose.yml
loki:
image: grafana/loki:latest
container_name: loki
restart: unless-stopped
command: -config.file=/etc/loki/loki-config.yml
volumes:
- ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
- loki_data:/loki
ports:
- "3100:3100"
networks:
- monitoring
promtail:
image: grafana/promtail:latest
container_name: promtail
restart: unless-stopped
command: -config.file=/etc/promtail/promtail-config.yml
volumes:
- ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml:ro
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
networks:
- monitoring
depends_on:
- loki
volumes:
loki_data:
# /opt/monitoring/loki/loki-config.yml
auth_enabled: false
server:
http_listen_port: 3100
common:
path_prefix: /loki
storage:
filesystem:
chunks_directory: /loki/chunks
rules_directory: /loki/rules
replication_factor: 1
ring:
kvstore:
store: inmemory
schema_config:
configs:
- from: 2024-01-01
store: tsdb
object_store: filesystem
schema: v13
index:
prefix: index_
period: 24h
limits_config:
retention_period: 30d
# /opt/monitoring/promtail/promtail-config.yml
server:
http_listen_port: 9080
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki:3100/loki/api/v1/push
scrape_configs:
# System logs
- job_name: syslog
static_configs:
- targets: ["localhost"]
labels:
job: syslog
__path__: /var/log/syslog
# Docker container logs
- job_name: docker
static_configs:
- targets: ["localhost"]
labels:
job: docker
__path__: /var/lib/docker/containers/*/*-json.log
pipeline_stages:
- docker: {}
- json:
expressions:
stream: stream
attrs: attrs
tag: attrs.tag
- labels:
stream:
tag:
Add Loki as a data source in Grafana (Connections > Data Sources > Add > Loki, URL: http://loki:3100). Then you can query logs alongside metrics in the same dashboard, correlating spikes in CPU or errors with the log entries that explain them.
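LogQL, Loki's query language, looks like PromQL with a log stream selector in front. Two examples that work with the Promtail labels configured above:
{job="docker"} |= "error"
rate({job="syslog"} |= "oom" [5m])
The first shows all container log lines containing "error"; the second graphs the per-second rate of syslog lines mentioning "oom", which you can overlay on a memory panel.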
Troubleshooting Common Issues
Prometheus Target Shows “DOWN”
Check connectivity:
curl http://target-ip:9100/metrics
If this fails, the exporter is not running or a firewall is blocking the port.
Check from the Prometheus container:
docker exec -it prometheus wget -qO- http://target-ip:9100/metrics | head -5
If this fails but the curl from the host works, it is a Docker networking issue. Ensure the monitoring network can reach the target.
Grafana Dashboard Shows “No Data”
- Verify the data source is configured and test passes (Connections > Data Sources > Prometheus > Save & Test).
- Check the time range. Grafana defaults to “Last 6 hours” — if you just deployed, there may not be enough data yet. Set it to “Last 15 minutes”.
- Run the query directly in Prometheus at http://localhost:9090/graph to verify data exists.
Alerts Not Firing
- Check that alert rules are loaded: Prometheus > Status > Rules.
- Verify the alert expression in the Prometheus graph UI. It should return results.
- Check Alertmanager is reachable: Prometheus > Status > Runtime & Build Information > Alertmanagers.
- Check Alertmanager logs: docker compose logs alertmanager.
High Memory Usage by Prometheus
Prometheus memory usage scales with the number of time series. If it is using too much RAM:
- Check cardinality with the prometheus_tsdb_head_series metric.
- Reduce scrape targets or increase scrape intervals for less important targets.
- Use metric_relabel_configs to drop high-cardinality metrics you do not need.
# Example: drop per-container network and block I/O metrics you do not need
# (add under the cadvisor job in prometheus.yml, at the same level as static_configs)
metric_relabel_configs:
- source_labels: [__name__]
regex: "container_(network|blkio).*"
action: drop
cAdvisor High CPU Usage
cAdvisor can use significant CPU on hosts with many containers. Reduce its housekeeping interval:
cadvisor:
command:
- "--housekeeping_interval=30s"
- "--docker_only=true"
- "--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory"
FAQ
How much RAM does the full stack use?
On a typical homelab setup: Prometheus ~200-500 MB, Grafana ~100-200 MB, Alertmanager ~30 MB, Node Exporter ~15 MB, cAdvisor ~50-100 MB. Total: about 400 MB to 850 MB depending on the number of targets and dashboards. Adding Loki adds another 200-500 MB.
Can I monitor Windows machines?
Yes. Use windows_exporter (formerly WMI Exporter) on Windows hosts. It exposes CPU, memory, disk, and network metrics in Prometheus format on port 9182. Add it to your Prometheus scrape config like any other target.
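The corresponding scrape job looks like any other static target (the IP and instance label are placeholders for your machine):
  - job_name: "windows"
    static_configs:
      - targets: ["192.168.1.30:9182"]
        labels:
          instance: "windows-desktop"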
Should I use InfluxDB + Telegraf instead?
The InfluxDB + Telegraf + Grafana (TIG) stack is a valid alternative. InfluxDB uses a push model (Telegraf pushes data), while Prometheus uses pull (Prometheus scrapes targets). Prometheus has a larger ecosystem of exporters and is the CNCF standard. InfluxDB has better support for high-cardinality data and custom application metrics. For homelab monitoring of infrastructure, Prometheus is the more popular and better-supported choice.
How do I monitor a remote site (like a VPS)?
You have three options:
- VPN: Connect the VPS to your homelab network via WireGuard or Tailscale. Prometheus scrapes through the VPN tunnel.
- Federation: Run a small Prometheus instance on the VPS that scrapes local targets. Your main Prometheus federates (pulls) aggregated metrics from the remote instance.
- Push gateway: Exporters push metrics to a Prometheus Push Gateway on the VPS, which your main Prometheus scrapes.
Option 1 (VPN) is simplest for a single VPS.
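If you go with federation (option 2), the main Prometheus pulls selected series from the remote instance through its /federate endpoint. A sketch of the scrape job, assuming the VPS Prometheus is reachable on port 9090 over the VPN or behind an authenticated proxy:
  - job_name: "federate-vps"
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{job="node-exporter"}'
    static_configs:
      - targets: ["vps.yourdomain.com:9090"]
        labels:
          site: "vps"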
Can Grafana send alerts directly, without Alertmanager?
Yes. Grafana has its own built-in alerting system that can evaluate queries and send notifications. However, for Prometheus-based setups, using Alertmanager is recommended because it supports grouping, deduplication, silencing, and inhibition. Grafana alerting is better suited for alerts based on non-Prometheus data sources.
How do I back up Grafana dashboards?
Grafana stores dashboards in its database (/var/lib/grafana/grafana.db). Back up this file, or better, use Grafana’s built-in JSON export:
# Export all dashboards using the API
curl -s "http://admin:password@localhost:3000/api/search" | \
python3 -c "import sys,json; [print(d['uid']) for d in json.load(sys.stdin)]" | \
while read uid; do
curl -s "http://admin:password@localhost:3000/api/dashboards/uid/$uid" > "dashboard-$uid.json"
done
For a GitOps approach, store dashboards as JSON in a Git repository and use Grafana’s provisioning to load them automatically.
Do I need all four components (Prometheus, Grafana, Node Exporter, Alertmanager)?
At minimum, you need Prometheus and Node Exporter to collect and store metrics. Grafana adds visualization, and Alertmanager adds notifications. You can start with just Prometheus + Node Exporter + Grafana and add Alertmanager later when you know what conditions you want to be alerted about.
Conclusion
A monitoring stack transforms your homelab from a collection of machines you hope are working into a system you know is working. The combination of Prometheus, Grafana, Node Exporter, and Alertmanager has become the standard not because it is the only option, but because it is well-documented, well-integrated, and genuinely useful at every scale from three Raspberry Pis to a thousand-node data center.
Start simple. Deploy the Docker Compose stack from this guide, import the Node Exporter Full dashboard (ID 1860), and look at your homelab’s metrics for the first time. You will immediately notice patterns you never knew existed: the nightly backup that spikes CPU at 3 AM, the container that slowly leaks memory over a week, the disk that is 85% full because Docker images are piling up.
Then add alerts for the things that actually matter: disk filling up, nodes going offline, containers crash-looping. Do not alert on everything. Alert on conditions that require human action. If an alert fires and you look at it and shrug, it should not be an alert.
Over time, expand the stack. Add the Blackbox Exporter to monitor external endpoints. Add the SNMP Exporter if you have managed network gear. Add Loki for log aggregation. Build custom dashboards for the services you care about most.
The monitoring stack itself is low-maintenance once deployed. Prometheus is rock-solid. Grafana upgrades cleanly. Node Exporter rarely needs attention. The ongoing work is tuning alert thresholds and building dashboards that surface the information you actually need. That work pays for itself the first time you catch a problem before it becomes an outage.