Monitoring Your Homelab: Prometheus + Grafana Stack Complete Guide 2026

prometheus grafana · homelab monitoring · self-hosted monitoring · docker monitoring

You are running a homelab. Maybe it is three mini PCs, a NAS, and a Raspberry Pi. Maybe it is a Proxmox cluster with a dozen VMs and fifty Docker containers. Either way, you have no idea how much RAM is actually being used, whether that disk is filling up, or why Jellyfin stutters every Tuesday evening. You find out about problems when your partner tells you the media server is down. Again.

Monitoring fixes this. Not enterprise-grade, six-figure-contract monitoring. Homelab monitoring: a stack that runs on a single machine, scrapes metrics from everything in your network, visualizes them in beautiful dashboards, and sends you an alert before things break instead of after.

The standard stack in 2026 is the same one that has dominated for years, because it works: Prometheus for metric collection and storage, Grafana for visualization and dashboards, Node Exporter for hardware and OS metrics, and Alertmanager for routing alerts to your phone, email, or Discord server.

This guide deploys the entire stack with Docker Compose, configures it to monitor your Docker hosts, VMs, network equipment, and containers, builds useful dashboards, and sets up alerts that actually tell you something actionable.

TL;DR

  • Prometheus scrapes metrics from targets (servers, containers, network devices) at regular intervals and stores them as time-series data.
  • Grafana connects to Prometheus and visualizes metrics in customizable dashboards.
  • Node Exporter runs on each machine and exposes hardware/OS metrics (CPU, RAM, disk, network) for Prometheus to scrape.
  • Alertmanager receives alerts from Prometheus and routes them to email, Slack, Discord, PagerDuty, or any webhook.
  • cAdvisor exposes per-container resource metrics (CPU, memory, network I/O) from Docker.
  • The entire stack runs in Docker Compose and uses about 500 MB-1 GB of RAM depending on the number of targets.
  • Deploy the stack on one node. Run Node Exporter on every node you want to monitor.

Architecture Overview

+-------------------+     +-------------------+     +-------------------+
|   Server Node 1   |     |   Server Node 2   |     |   Server Node 3   |
|                   |     |                   |     |                   |
|  [Node Exporter]  |     |  [Node Exporter]  |     |  [Node Exporter]  |
|  [cAdvisor]       |     |  [cAdvisor]       |     |  [cAdvisor]       |
|  port 9100, 8080  |     |  port 9100, 8080  |     |  port 9100, 8080  |
+---------+---------+     +---------+---------+     +---------+---------+
          |                         |                         |
          +------------+------------+------------+------------+
                       |
              +--------v---------+
              |   Monitor Node   |
              |                  |
              |  [Prometheus]    | <--- scrapes all exporters
              |  [Grafana]       | <--- visualizes Prometheus data
              |  [Alertmanager]  | <--- routes alerts
              |  [Node Exporter] |
              |  [cAdvisor]      |
              +------------------+

Prometheus operates on a pull model: it reaches out to targets and scrapes metrics at defined intervals (default: 15 seconds). This is different from push-based systems like InfluxDB/Telegraf where agents push data to a central server. The pull model means you configure targets in Prometheus, and Prometheus handles the rest.
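
To see exactly what Prometheus pulls, you can hit an exporter's metrics endpoint yourself. A quick illustration against a Node Exporter instance (the output lines below are abbreviated examples, not a full dump):

# Any Prometheus "target" is just an HTTP endpoint serving plain-text metrics
curl -s http://localhost:9100/metrics | grep "^node_cpu_seconds_total" | head -2
# Example output (values will differ):
#   node_cpu_seconds_total{cpu="0",mode="idle"} 123456.78
#   node_cpu_seconds_total{cpu="0",mode="user"} 2345.67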

Step 1: Docker Compose for the Full Stack

Create a directory for your monitoring stack:

mkdir -p /opt/monitoring/{prometheus,grafana,alertmanager}

Here is the complete Docker Compose file:

# /opt/monitoring/docker-compose.yml
services:
  # -------------------------------------------------------------------
  # Prometheus: metric collection and storage
  # -------------------------------------------------------------------
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=90d"
      - "--storage.tsdb.retention.size=10GB"
      - "--web.enable-lifecycle"
      - "--web.enable-admin-api"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    networks:
      - monitoring
    extra_hosts:
      - "host.docker.internal:host-gateway"

  # -------------------------------------------------------------------
  # Grafana: visualization and dashboards
  # -------------------------------------------------------------------
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    restart: unless-stopped
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: "CHANGE_ME_STRONG_PASSWORD"
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: "https://grafana.yourdomain.com"
      GF_INSTALL_PLUGINS: "grafana-clock-panel,grafana-piechart-panel"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    ports:
      - "3000:3000"
    networks:
      - monitoring
    depends_on:
      - prometheus

  # -------------------------------------------------------------------
  # Alertmanager: alert routing and notification
  # -------------------------------------------------------------------
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    command:
      - "--config.file=/etc/alertmanager/alertmanager.yml"
      - "--storage.path=/alertmanager"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    networks:
      - monitoring

  # -------------------------------------------------------------------
  # Node Exporter: hardware and OS metrics
  # -------------------------------------------------------------------
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    pid: host

  # -------------------------------------------------------------------
  # cAdvisor: container metrics
  # -------------------------------------------------------------------
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro
    ports:
      - "8080:8080"
    networks:
      - monitoring

networks:
  monitoring:
    driver: bridge

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

Node Exporter on Additional Hosts

On every other machine you want to monitor, deploy Node Exporter and cAdvisor (if running Docker). You can use a minimal Compose file:

# /opt/node-exporter/docker-compose.yml (on each additional host)
services:
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    command:
      - "--path.procfs=/host/proc"
      - "--path.rootfs=/rootfs"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    pid: host

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    restart: unless-stopped
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro
    ports:
      - "8080:8080"

Deploy on each host:

docker compose up -d

Step 2: Prometheus Configuration

The Prometheus configuration file defines what to scrape, how often, and where to find alert rules.

# /opt/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s       # How often to scrape targets
  evaluation_interval: 15s   # How often to evaluate alert rules
  scrape_timeout: 10s        # Timeout for each scrape request

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Alert rule files
rule_files:
  - "alert_rules.yml"

# Scrape configurations
scrape_configs:
  # -------------------------------------------------------------------
  # Prometheus self-monitoring
  # -------------------------------------------------------------------
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
        labels:
          instance: "monitor-node"

  # -------------------------------------------------------------------
  # Node Exporter: system metrics from all hosts
  # -------------------------------------------------------------------
  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]
        labels:
          instance: "monitor-node"
          location: "rack-1"

      - targets: ["192.168.1.11:9100"]
        labels:
          instance: "server-2"
          location: "rack-1"

      - targets: ["192.168.1.12:9100"]
        labels:
          instance: "server-3"
          location: "rack-1"

      - targets: ["192.168.1.20:9100"]
        labels:
          instance: "nas"
          location: "rack-1"

  # -------------------------------------------------------------------
  # cAdvisor: container metrics from all Docker hosts
  # -------------------------------------------------------------------
  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]
        labels:
          instance: "monitor-node"

      - targets: ["192.168.1.11:8080"]
        labels:
          instance: "server-2"

      - targets: ["192.168.1.12:8080"]
        labels:
          instance: "server-3"

  # -------------------------------------------------------------------
  # Docker daemon metrics (if enabled)
  # -------------------------------------------------------------------
  - job_name: "docker"
    static_configs:
      - targets: ["host.docker.internal:9323"]
        labels:
          instance: "monitor-node"

  # -------------------------------------------------------------------
  # Traefik metrics (if Traefik is your reverse proxy)
  # -------------------------------------------------------------------
  - job_name: "traefik"
    static_configs:
      - targets: ["traefik:8082"]
        labels:
          instance: "traefik"

  # -------------------------------------------------------------------
  # Application-specific exporters
  # -------------------------------------------------------------------
  # Uncomment and configure as needed:

  # - job_name: "postgres"
  #   static_configs:
  #     - targets: ["postgres-exporter:9187"]
  #       labels:
  #         instance: "main-db"

  # - job_name: "redis"
  #   static_configs:
  #     - targets: ["redis-exporter:9121"]
  #       labels:
  #         instance: "main-redis"

  # - job_name: "nginx"
  #   static_configs:
  #     - targets: ["nginx-exporter:9113"]
  #       labels:
  #         instance: "nginx"

Understanding Scrape Configuration

Each job_name groups related targets. The static_configs section lists the IP:port combinations where Prometheus can scrape metrics. Labels like instance and location are attached to every metric from that target, making it easy to filter and group data in Grafana.
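
Those labels pay off as soon as you query. Two illustrative PromQL queries using the instance and location labels from the config above (the metric names are standard Node Exporter metrics):

# Memory usage (%) averaged across everything labeled rack-1
avg by (location) ((1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

# Free disk space on the NAS root filesystem, in bytes
node_filesystem_avail_bytes{instance="nas", mountpoint="/"}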

For the docker job, you need to enable Docker daemon metrics. Add this to /etc/docker/daemon.json on each Docker host:

{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}

Then restart Docker:

sudo systemctl restart docker

For Traefik metrics, add these command flags to your Traefik container:

command:
  - "--metrics.prometheus=true"
  - "--metrics.prometheus.entryPoint=metrics"
  - "--entrypoints.metrics.address=:8082"

Step 3: Alertmanager Configuration

Alertmanager receives alerts from Prometheus and routes them to notification channels. Here is a configuration that sends alerts to email and Discord:

# /opt/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: "alerts@yourdomain.com"
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_auth_username: "your_email@gmail.com"
  smtp_auth_password: "your_app_password"
  smtp_require_tls: true

# Notification templates
templates:
  - "/etc/alertmanager/templates/*.tmpl"

# Routing tree
route:
  group_by: ["alertname", "instance"]
  group_wait: 30s         # Wait before sending first notification
  group_interval: 5m      # Wait before sending updates for same group
  repeat_interval: 4h     # Re-send if alert is still firing
  receiver: "email"       # Default receiver

  routes:
    # Critical alerts go to all channels immediately
    - match:
        severity: critical
      receiver: "all-channels"
      group_wait: 10s
      repeat_interval: 1h

    # Warning alerts go to Discord only
    - match:
        severity: warning
      receiver: "discord"
      repeat_interval: 12h

# Notification receivers
receivers:
  - name: "email"
    email_configs:
      - to: "you@yourdomain.com"
        send_resolved: true
        headers:
          subject: "[Homelab Alert] {{ .GroupLabels.alertname }}"

  - name: "discord"
    webhook_configs:
      - url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
        send_resolved: true

  - name: "all-channels"
    email_configs:
      - to: "you@yourdomain.com"
        send_resolved: true
    webhook_configs:
      - url: "https://discord.com/api/webhooks/YOUR_WEBHOOK_ID/YOUR_WEBHOOK_TOKEN"
        send_resolved: true

# Inhibition rules (suppress less severe alerts when critical ones fire)
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: ["alertname", "instance"]

Setting Up Discord Webhooks

Discord is a popular alert destination for homelab users because it is free, has mobile notifications, and supports rich message formatting. Alertmanager 0.25 and later includes a native Discord receiver (the discord_configs block used above); older releases need a small translation proxy such as alertmanager-discord, because Discord does not accept Alertmanager's generic webhook payload.

  1. Open your Discord server settings.
  2. Go to Integrations > Webhooks.
  3. Click “New Webhook” and copy the webhook URL.
  4. Paste the URL into the Alertmanager configuration.

Setting Up Slack (Alternative)

If you prefer Slack:

receivers:
  - name: "slack"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
        channel: "#homelab-alerts"
        send_resolved: true
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Instance:* {{ .Labels.instance }}
          {{ end }}

Setting Up Ntfy (Self-Hosted Alternative)

Ntfy is a self-hosted push notification service that pairs well with a monitoring stack:

receivers:
  - name: "ntfy"
    webhook_configs:
      - url: "https://ntfy.yourdomain.com/homelab-alerts"
        send_resolved: true
        http_config:
          basic_auth:
            username: "alertmanager"
            password: "your_ntfy_password"

Step 4: Alert Rules

Alert rules define the conditions under which Prometheus fires alerts. Here is a comprehensive set of rules for homelab monitoring:

# /opt/monitoring/prometheus/alert_rules.yml
groups:
  # -------------------------------------------------------------------
  # Host alerts: CPU, memory, disk, network
  # -------------------------------------------------------------------
  - name: host_alerts
    rules:
      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 10 minutes. Current value: {{ $value | printf \"%.1f\" }}%"

      - alert: CriticalCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 95% for more than 5 minutes. Current value: {{ $value | printf \"%.1f\" }}%"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% for more than 10 minutes. Current value: {{ $value | printf \"%.1f\" }}%"

      - alert: CriticalMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 95% for more than 5 minutes. Current value: {{ $value | printf \"%.1f\" }}%"

      - alert: DiskSpaceLow
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Root partition is {{ $value | printf \"%.1f\" }}% full on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"
          description: "Root partition is {{ $value | printf \"%.1f\" }}% full on {{ $labels.instance }}. Immediate action required."

      - alert: DiskWillFillIn24Hours
        expr: predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) < 0
        for: 1h
        labels:
          severity: critical
        annotations:
          summary: "Disk will fill within 24 hours on {{ $labels.instance }}"
          description: "Based on current growth rate, the root partition on {{ $labels.instance }} will be full within 24 hours."

      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total{device!="lo"}[5m]) > 100000000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High network traffic on {{ $labels.instance }}"
          description: "Network interface {{ $labels.device }} is receiving more than 100 MB/s for 15+ minutes."

      - alert: SystemdServiceFailed
        expr: node_systemd_unit_state{state="failed"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Systemd service failed on {{ $labels.instance }}"
          description: "Service {{ $labels.name }} has been in failed state for more than 5 minutes."

      - alert: HostRebooted
        expr: (node_time_seconds - node_boot_time_seconds) < 600
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Host {{ $labels.instance }} was recently rebooted"
          description: "{{ $labels.instance }} booted less than 10 minutes ago."

  # -------------------------------------------------------------------
  # Container alerts
  # -------------------------------------------------------------------
  - name: container_alerts
    rules:
      - alert: ContainerHighCpu
        expr: rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high CPU"
          description: "Container {{ $labels.name }} on {{ $labels.instance }} is using more than 80% CPU for 10+ minutes."

      - alert: ContainerHighMemory
        expr: (container_memory_usage_bytes{name!=""} / container_spec_memory_limit_bytes{name!=""}) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} high memory"
          description: "Container {{ $labels.name }} is using {{ $value | printf \"%.1f\" }}% of its memory limit."

      - alert: ContainerRestarting
        expr: increase(container_last_seen{name!=""}[10m]) == 0 and increase(container_start_time_seconds{name!=""}[10m]) > 0
        for: 0m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is restart-looping"
          description: "Container {{ $labels.name }} on {{ $labels.instance }} appears to be restarting repeatedly."

  # -------------------------------------------------------------------
  # Prometheus self-monitoring alerts
  # -------------------------------------------------------------------
  - name: prometheus_alerts
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus target down: {{ $labels.job }}/{{ $labels.instance }}"
          description: "Prometheus cannot reach {{ $labels.instance }} (job: {{ $labels.job }}) for more than 5 minutes."

      - alert: PrometheusHighMemory
        expr: process_resident_memory_bytes{job="prometheus"} > 2147483648
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus using more than 2 GB RAM"
          description: "Prometheus memory usage: {{ $value | humanize }}. Consider reducing retention or number of targets."

      - alert: PrometheusTsdbHighCardinality
        expr: prometheus_tsdb_head_series > 500000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Prometheus has high time series count"
          description: "Prometheus is tracking {{ $value }} time series. High cardinality impacts performance and storage."

How Alert Rules Work

Each rule has four parts:

  • expr: A PromQL expression that evaluates to true when the condition is met.
  • for: How long the condition must be true before the alert fires. This prevents alerts on brief spikes.
  • labels: Metadata attached to the alert, used by Alertmanager for routing.
  • annotations: Human-readable descriptions used in notifications.

The predict_linear function in the DiskWillFillIn24Hours rule is one of the most useful features in Prometheus. Instead of alerting when disk is 90% full (which might be fine for months), it looks at the growth trend over the last 6 hours and alerts if the disk will fill within 24 hours. This catches runaway log files and data ingestion issues before they become outages.
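
You can preview what predict_linear would do before relying on the alert. A query like this in the Prometheus expression browser (Graph tab) shows the projected free space 24 hours out, in GB; a negative value means the disk is predicted to be full by then:

predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24 * 3600) / 1e9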

Step 5: Deploy and Verify

Set up Grafana provisioning for auto-configured data sources:

mkdir -p /opt/monitoring/grafana/provisioning/{datasources,dashboards}

Create the Prometheus data source provisioning file:

# /opt/monitoring/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Now deploy everything:

cd /opt/monitoring
docker compose up -d

# Wait 30 seconds for everything to start, then check
docker compose ps

# Verify Prometheus is scraping targets
curl -s http://localhost:9090/api/v1/targets | python3 -m json.tool | head -50

# Verify Alertmanager is running
curl -s http://localhost:9093/api/v2/status | python3 -m json.tool

# Check Prometheus logs for any config errors
docker compose logs prometheus | tail -20
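
# Optional: validate the config and alert rules with promtool
# (shipped alongside Prometheus in the official image)
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml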

Access the web interfaces:

  • Prometheus: http://your-server-ip:9090
  • Grafana: http://your-server-ip:3000 (default login: admin / your configured password)
  • Alertmanager: http://your-server-ip:9093

In Prometheus, navigate to Status > Targets. You should see all your configured targets with their current state (UP or DOWN). If any target shows DOWN, verify that the exporter is running on that host and that the port is accessible from the monitoring node.

Step 6: Grafana Setup and Data Sources

If you used the provisioning file above, Prometheus is already configured as a data source. If not, add it manually:

  1. Log in to Grafana.
  2. Navigate to Connections > Data Sources > Add data source.
  3. Select Prometheus.
  4. Set the URL to http://prometheus:9090.
  5. Click “Save & Test”.

Importing Community Dashboards

One of Grafana’s best features is its library of community dashboards. Instead of building everything from scratch, import pre-built dashboards by their ID:

  1. In Grafana, click Dashboards > New > Import.
  2. Enter the dashboard ID and click “Load”.
  3. Select your Prometheus data source and click “Import”.

Essential dashboard IDs:

  • Node Exporter Full (ID 1860): comprehensive host metrics with CPU, memory, disk, and network graphs
  • Docker Container Monitoring (ID 893): per-container CPU, memory, and network metrics from cAdvisor
  • Prometheus Stats (ID 2): Prometheus self-monitoring (scrape duration, target count, TSDB stats)
  • Alertmanager Overview (ID 9578): alert counts, notification rates, silences
  • Traefik Dashboard (ID 17346): request rates, response codes, latency (if using Traefik)

The Node Exporter Full dashboard (ID 1860) is the single most useful dashboard for homelab monitoring. It gives you everything you need to understand the health of each machine at a glance.

Building Dashboards

Community dashboards cover most needs, but building custom dashboards teaches you PromQL and lets you create views tailored to your setup.

PromQL Basics

PromQL is Prometheus’s query language. Here are the queries you will use most often:

CPU usage percentage:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

This calculates the percentage of time the CPU was NOT idle over the last 5 minutes.

Memory usage percentage:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk usage percentage:

(1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100

Network traffic (bytes per second):

rate(node_network_receive_bytes_total{device="eth0"}[5m])
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

Container CPU usage:

rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100

Container memory usage:

container_memory_usage_bytes{name!=""}

Building a Homelab Overview Dashboard

Here is how to build a single-pane-of-glass dashboard for your homelab:

  1. Create a new dashboard in Grafana.

  2. Add a Stat panel for total nodes:

    • Query: count(up{job="node-exporter"} == 1)
    • Title: “Nodes Online”
    • Threshold: green at 3 (or however many nodes you have), red below that.
  3. Add a Stat panel for total containers:

    • Query: count(container_last_seen{name!=""})
    • Title: “Running Containers”
  4. Add a Table panel for per-node status:

    • Query 1: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) (CPU %)
    • Query 2: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 (Memory %)
    • Query 3: (1 - (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"})) * 100 (Disk %)
    • Use Transform > Merge to combine queries into a single table.
    • Add value thresholds: green < 70, yellow < 85, red >= 85.
  5. Add Time Series panels for trends:

    • CPU over time per node
    • Memory over time per node
    • Network traffic per node
    • Disk I/O per node
  6. Add a panel for container resource usage:

    • Top 10 containers by CPU: topk(10, rate(container_cpu_usage_seconds_total{name!=""}[5m]) * 100)
    • Top 10 containers by memory: topk(10, container_memory_usage_bytes{name!=""})
  7. Save the dashboard and set it as your home dashboard in Grafana settings.

Dashboard Variables

Make your dashboards interactive with template variables:

  1. Go to Dashboard Settings > Variables.
  2. Add a variable named instance with query: label_values(node_uname_info, instance).
  3. Use $instance in your panel queries to filter by the selected node.

This lets you switch between nodes in a dropdown instead of creating separate dashboards for each host.
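
For example, a CPU panel that respects the dropdown might use a query like this (the =~ regex matcher keeps it working when the variable allows multiple selections):

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle", instance=~"$instance"}[5m])) * 100)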

Monitoring Docker Containers

cAdvisor provides detailed per-container metrics, but the Docker daemon itself also exposes useful metrics.

Docker Daemon Metrics

Enable metrics in /etc/docker/daemon.json:

{
  "metrics-addr": "0.0.0.0:9323",
  "experimental": true
}

This exposes metrics about the Docker engine itself: number of running containers, image pull times, build durations, and more.
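
As a quick sanity check, these queries read the daemon's own gauges; engine_daemon_container_states_containers is one of the metrics dockerd serves on that endpoint (exact metric coverage varies by Docker version):

# Running containers as seen by the Docker engine
engine_daemon_container_states_containers{state="running"}

# All states at once, summed per host
sum by (instance, state) (engine_daemon_container_states_containers)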

Useful Container Queries

Containers sorted by memory usage:

sort_desc(container_memory_usage_bytes{name!=""})

Container restart count (last 24 hours):

changes(container_start_time_seconds{name!=""}[24h])

Container network I/O:

rate(container_network_receive_bytes_total{name!=""}[5m])
rate(container_network_transmit_bytes_total{name!=""}[5m])

Container disk I/O:

rate(container_fs_reads_bytes_total{name!=""}[5m])
rate(container_fs_writes_bytes_total{name!=""}[5m])

Monitoring Proxmox and Virtual Machines

If you run Proxmox VE, there are two approaches to monitoring.

Option 1: Node Exporter Inside Each VM

Install Node Exporter in each VM just like a physical host. This gives you the same granularity as bare-metal monitoring. The VM does not know it is virtual, and Node Exporter reports hardware metrics from the VM’s perspective.
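
If you would rather not run Docker inside the VM just for monitoring, a minimal native install sketch looks like this (assumes an amd64 Linux guest; check the node_exporter releases page for the current version number):

VERSION=1.8.2   # verify against https://github.com/prometheus/node_exporter/releases
wget https://github.com/prometheus/node_exporter/releases/download/v${VERSION}/node_exporter-${VERSION}.linux-amd64.tar.gz
tar xzf node_exporter-${VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${VERSION}.linux-amd64/node_exporter /usr/local/bin/

# Minimal systemd unit so the exporter starts on boot
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network-online.target

[Service]
User=nobody
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter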

Option 2: Proxmox VE Exporter

The Proxmox VE Exporter scrapes the Proxmox API and exposes metrics about all VMs and containers managed by Proxmox, without needing anything installed inside the guests.

# Add to docker-compose.yml
  pve-exporter:
    image: prompve/prometheus-pve-exporter:latest
    container_name: pve-exporter
    restart: unless-stopped
    environment:
      PVE_USER: "prometheus@pve"
      PVE_PASSWORD: "your_pve_password"
      PVE_VERIFY_SSL: "false"
    ports:
      - "9221:9221"
    networks:
      - monitoring

Create a monitoring user in Proxmox:

# On the Proxmox host
pveum user add prometheus@pve --password your_pve_password
pveum aclmod / -user prometheus@pve -role PVEAuditor

Add the scrape config to Prometheus:

  - job_name: "proxmox"
    static_configs:
      - targets: ["pve-exporter:9221"]
    params:
      target: ["192.168.1.100"]  # Your Proxmox host IP
    metrics_path: /pve

This gives you VM CPU, memory, disk, and network metrics without installing anything in the VMs, plus Proxmox cluster health metrics.

Use both. The Proxmox exporter gives you the host-level view (how much of the physical resources each VM consumes), while Node Exporter inside VMs gives you the guest-level view (how the VM perceives its own resources). Both perspectives are useful.

Monitoring Network Equipment

SNMP Exporter for Switches and Routers

Most managed network switches and routers expose metrics via SNMP. The Prometheus SNMP Exporter translates SNMP data into Prometheus metrics.

# Add to docker-compose.yml
  snmp-exporter:
    image: prom/snmp-exporter:latest
    container_name: snmp-exporter
    restart: unless-stopped
    volumes:
      - ./snmp/snmp.yml:/etc/snmp_exporter/snmp.yml:ro
    ports:
      - "9116:9116"
    networks:
      - monitoring

Add the scrape config for each SNMP device:

  - job_name: "snmp"
    static_configs:
      - targets:
          - "192.168.1.1"    # Router
          - "192.168.1.2"    # Managed switch
    metrics_path: /snmp
    params:
      auth: ["public_v2"]
      module: ["if_mib"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: snmp-exporter:9116

This gives you interface traffic, error counts, and port status for managed switches and routers.
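
To confirm the exporter can actually talk to a device, probe it by hand; this mirrors what the relabel config above makes Prometheus do (assumes the device answers SNMP v2c with the public community and that public_v2 is defined in snmp.yml):

curl -s "http://localhost:9116/snmp?target=192.168.1.1&auth=public_v2&module=if_mib" | head -20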

Blackbox Exporter for Uptime Monitoring

The Blackbox Exporter probes endpoints via HTTP, TCP, ICMP, or DNS and reports whether they are up and how fast they respond.

# Add to docker-compose.yml
  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    restart: unless-stopped
    volumes:
      - ./blackbox/blackbox.yml:/etc/blackbox_exporter/config.yml:ro
    ports:
      - "9115:9115"
    networks:
      - monitoring
# /opt/monitoring/blackbox/blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 10s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      follow_redirects: true
      preferred_ip_protocol: ip4

  icmp:
    prober: icmp
    timeout: 5s

  tcp_connect:
    prober: tcp
    timeout: 5s
# Add to prometheus.yml scrape_configs
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: ["http_2xx"]
    static_configs:
      - targets:
          - "https://grafana.yourdomain.com"
          - "https://nextcloud.yourdomain.com"
          - "https://jellyfin.yourdomain.com"
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

  - job_name: "blackbox-ping"
    metrics_path: /probe
    params:
      module: ["icmp"]
    static_configs:
      - targets:
          - "192.168.1.1"    # Router
          - "192.168.1.2"    # Switch
          - "1.1.1.1"        # Internet connectivity check
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

The Blackbox Exporter is invaluable for catching issues that internal metrics miss: DNS resolution failures, SSL certificate expiry, and network connectivity problems between your homelab and the internet.
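
Two alert rules that pair naturally with these probes, written as an additional group for alert_rules.yml (probe_success and probe_ssl_earliest_cert_expiry are standard Blackbox Exporter metrics; the thresholds are just a starting point):

  - name: blackbox_alerts
    rules:
      - alert: EndpointDown
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Probe failing for {{ $labels.instance }}"
          description: "Blackbox probe of {{ $labels.instance }} has been failing for more than 5 minutes."

      - alert: SslCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires soon"
          description: "The certificate presented by {{ $labels.instance }} expires in less than 14 days."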

Storage and Retention

Prometheus stores time-series data on local disk. The amount of storage depends on:

  • Number of time series: each unique combination of metric name and labels is one series.
  • Scrape interval: more frequent scrapes = more data points.
  • Retention period: how long to keep data.

Estimating Storage Requirements

A rough formula:

Storage per day = (number of series) * (scrapes per day) * (~2 bytes per compressed sample on disk)

For a typical homelab with 3 nodes, 20 containers, and a 15-second scrape interval:

  • ~5,000 time series (Node Exporter produces ~500 series per node, cAdvisor ~100 per container)
  • 5,760 scrapes per day (86,400 seconds / 15)
  • ~5,000 * 5,760 * 2 bytes ≈ 55-60 MB per day
  • The ~2 bytes per sample figure already reflects Prometheus's on-disk compression (the official guidance is 1-2 bytes per sample), so no further reduction applies

With 90-day retention, that is roughly 5 GB. Even a larger homelab with 10 nodes and 100 containers lands in the 15-20 GB range over 90 days, which is where the 10 GB size-based retention cap from the Compose file earns its keep. Either way, storage is not a serious constraint for homelab-scale monitoring.
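
Rather than estimating, you can ask Prometheus directly once it has been running for a day or two; these are built-in self-monitoring metrics:

# Active time series right now
prometheus_tsdb_head_series

# Samples ingested per second
rate(prometheus_tsdb_head_samples_appended_total[5m])

# On-disk size of persisted blocks, in bytes
prometheus_tsdb_storage_blocks_bytes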

Configuring Retention

Set both time and size limits in the Prometheus command flags:

command:
  - "--storage.tsdb.retention.time=90d"
  - "--storage.tsdb.retention.size=10GB"

Prometheus will keep data until either limit is reached, whichever comes first.

Long-Term Storage

If you want to keep metrics longer than 90 days (for capacity planning or historical analysis), consider:

  • Thanos: adds long-term storage to Prometheus using object storage (S3, MinIO). Complex but powerful.
  • VictoriaMetrics: a Prometheus-compatible TSDB that uses less memory and disk. Can act as a drop-in remote storage backend (see the remote_write sketch below).
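
If you go the VictoriaMetrics route, the wiring is a short addition to prometheus.yml; this sketch assumes a VictoriaMetrics container named victoriametrics on the same Docker network, listening on its default port:

# Add at the top level of prometheus.yml
remote_write:
  - url: "http://victoriametrics:8428/api/v1/write"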

For most homelabs, 90 days of local retention is more than enough.

Securing the Stack

By default, Prometheus, Grafana, and Alertmanager are accessible without authentication (except Grafana, which has its own login). In a homelab on a private network, this is often acceptable. If you expose these services externally, secure them.

Put Everything Behind Authelia

If you followed our Authelia guide, add the Authelia middleware to each service:

# Grafana
labels:
  - "traefik.http.routers.grafana.middlewares=authelia@docker"

# Prometheus
labels:
  - "traefik.http.routers.prometheus.middlewares=authelia@docker"

# Alertmanager
labels:
  - "traefik.http.routers.alertmanager.middlewares=authelia@docker"

Restrict Network Access

If not using a reverse proxy, bind services to localhost and use SSH tunnels:

ports:
  - "127.0.0.1:9090:9090"  # Prometheus
  - "127.0.0.1:3000:3000"  # Grafana
  - "127.0.0.1:9093:9093"  # Alertmanager

Access them via SSH tunnel:

ssh -L 3000:localhost:3000 -L 9090:localhost:9090 user@your-server

Firewall Rules

Allow metric scraping ports (9100, 8080) only from the monitoring node:

# On each monitored host, allow only the monitoring node
sudo ufw allow from 192.168.1.10 to any port 9100 proto tcp
sudo ufw allow from 192.168.1.10 to any port 8080 proto tcp

Advanced: Loki for Log Aggregation

Metrics tell you what is happening. Logs tell you why. Grafana Loki is a log aggregation system designed to pair with the Prometheus + Grafana stack.

# Add to docker-compose.yml
  loki:
    image: grafana/loki:latest
    container_name: loki
    restart: unless-stopped
    command: -config.file=/etc/loki/loki-config.yml
    volumes:
      - ./loki/loki-config.yml:/etc/loki/loki-config.yml:ro
      - loki_data:/loki
    ports:
      - "3100:3100"
    networks:
      - monitoring

  promtail:
    image: grafana/promtail:latest
    container_name: promtail
    restart: unless-stopped
    command: -config.file=/etc/promtail/promtail-config.yml
    volumes:
      - ./promtail/promtail-config.yml:/etc/promtail/promtail-config.yml:ro
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
    networks:
      - monitoring
    depends_on:
      - loki

volumes:
  loki_data:
# /opt/monitoring/loki/loki-config.yml
auth_enabled: false

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

limits_config:
  retention_period: 30d
# /opt/monitoring/promtail/promtail-config.yml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  # System logs
  - job_name: syslog
    static_configs:
      - targets: ["localhost"]
        labels:
          job: syslog
          __path__: /var/log/syslog

  # Docker container logs
  - job_name: docker
    static_configs:
      - targets: ["localhost"]
        labels:
          job: docker
          __path__: /var/lib/docker/containers/*/*-json.log
    pipeline_stages:
      - docker: {}
      - json:
          expressions:
            stream: stream
            attrs: attrs
            tag: attrs.tag
      - labels:
          stream:
          tag:

Add Loki as a data source in Grafana (Connections > Data Sources > Add > Loki, URL: http://loki:3100). Then you can query logs alongside metrics in the same dashboard, correlating spikes in CPU or errors with the log entries that explain them.
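
Once Loki is wired up, queries use LogQL rather than PromQL. Two illustrative examples against the Promtail jobs defined above (the tag label comes from the pipeline stage in promtail-config.yml):

# All container log lines mentioning "error"
{job="docker"} |= "error"

# Per-container rate of error lines over the last 5 minutes
sum by (tag) (rate({job="docker"} |= "error" [5m]))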

Troubleshooting Common Issues

Prometheus Target Shows “DOWN”

Check connectivity:

curl http://target-ip:9100/metrics

If this fails, the exporter is not running or a firewall is blocking the port.

Check from the Prometheus container:

docker exec -it prometheus wget -qO- http://target-ip:9100/metrics | head -5

If this fails but the curl from the host works, it is a Docker networking issue. Ensure the monitoring network can reach the target.

Grafana Dashboard Shows “No Data”

  1. Verify the data source is configured and test passes (Connections > Data Sources > Prometheus > Save & Test).
  2. Check the time range. Grafana defaults to “Last 6 hours” — if you just deployed, there may not be enough data yet. Set it to “Last 15 minutes”.
  3. Run the query directly in Prometheus at http://localhost:9090/graph to verify data exists.

Alerts Not Firing

  1. Check that alert rules are loaded: Prometheus > Status > Rules.
  2. Verify the alert expression in the Prometheus graph UI. It should return results.
  3. Check Alertmanager is reachable: Prometheus > Status > Runtime & Build Information > Alertmanagers.
  4. Check Alertmanager logs: docker compose logs alertmanager.

High Memory Usage by Prometheus

Prometheus memory usage scales with the number of time series. If it is using too much RAM:

  1. Check cardinality: prometheus_tsdb_head_series metric.
  2. Reduce scrape targets or increase scrape intervals for less important targets.
  3. Use metric_relabel_configs to drop high-cardinality metrics you do not need.

# Example: drop container network and block-I/O series you do not need.
# This goes inside the relevant scrape job in prometheus.yml (for example the
# cadvisor job), at the same indentation level as static_configs.
metric_relabel_configs:
  - source_labels: [__name__]
    regex: "container_(network|blkio).*"
    action: drop

cAdvisor High CPU Usage

cAdvisor can use significant CPU on hosts with many containers. Reduce its housekeeping interval:

cadvisor:
  command:
    - "--housekeeping_interval=30s"
    - "--docker_only=true"
    - "--disable_metrics=percpu,sched,tcp,udp,disk,diskIO,hugetlb,referenced_memory"

FAQ

How much RAM does the full stack use?

On a typical homelab setup: Prometheus ~200-500 MB, Grafana ~100-200 MB, Alertmanager ~30 MB, Node Exporter ~15 MB, cAdvisor ~50-100 MB. Total: about 400 MB to 850 MB depending on the number of targets and dashboards. Adding Loki adds another 200-500 MB.

Can I monitor Windows machines?

Yes. Use windows_exporter (formerly WMI Exporter) on Windows hosts. It exposes CPU, memory, disk, and network metrics in Prometheus format on port 9182. Add it to your Prometheus scrape config like any other target.
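
The scrape config is the same shape as any other job; a sketch with a placeholder IP:

  - job_name: "windows"
    static_configs:
      - targets: ["192.168.1.30:9182"]
        labels:
          instance: "windows-desktop"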

Should I use InfluxDB + Telegraf instead?

The InfluxDB + Telegraf + Grafana (TIG) stack is a valid alternative. InfluxDB uses a push model (Telegraf pushes data), while Prometheus uses pull (Prometheus scrapes targets). Prometheus has a larger ecosystem of exporters and is the CNCF standard. InfluxDB has better support for high-cardinality data and custom application metrics. For homelab monitoring of infrastructure, Prometheus is the more popular and better-supported choice.

How do I monitor a remote site (like a VPS)?

You have three options:

  1. VPN: Connect the VPS to your homelab network via WireGuard or Tailscale. Prometheus scrapes through the VPN tunnel.
  2. Federation: Run a small Prometheus instance on the VPS that scrapes local targets. Your main Prometheus federates (pulls) aggregated metrics from the remote instance.
  3. Push gateway: Exporters push metrics to a Prometheus Push Gateway on the VPS, which your main Prometheus scrapes.

Option 1 (VPN) is simplest for a single VPS.
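
If you choose federation instead, the central Prometheus scrapes the remote instance's /federate endpoint; a sketch with a placeholder hostname (the match[] selector controls which series are pulled):

  - job_name: "federate-vps"
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job=~".+"}'
    static_configs:
      - targets: ["vps.example.com:9090"]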

Can Grafana send alerts directly, without Alertmanager?

Yes. Grafana has its own built-in alerting system that can evaluate queries and send notifications. However, for Prometheus-based setups, using Alertmanager is recommended because it supports grouping, deduplication, silencing, and inhibition. Grafana alerting is better suited for alerts based on non-Prometheus data sources.

How do I back up Grafana dashboards?

Grafana stores dashboards in its database (/var/lib/grafana/grafana.db). Back up this file, or better, use Grafana’s built-in JSON export:

# Export all dashboards using the API
curl -s "http://admin:password@localhost:3000/api/search" | \
  python3 -c "import sys,json; [print(d['uid']) for d in json.load(sys.stdin)]" | \
  while read uid; do
    curl -s "http://admin:password@localhost:3000/api/dashboards/uid/$uid" > "dashboard-$uid.json"
  done

For a GitOps approach, store dashboards as JSON in a Git repository and use Grafana’s provisioning to load them automatically.
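
A minimal dashboard provisioning provider for that approach might look like this, assuming you keep the exported JSON files inside the provisioning directory already mounted by the Compose file:

# /opt/monitoring/grafana/provisioning/dashboards/homelab.yml
apiVersion: 1

providers:
  - name: "homelab"
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards/json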

Do I need all four components (Prometheus, Grafana, Node Exporter, Alertmanager)?

At minimum, you need Prometheus and Node Exporter to collect and store metrics. Grafana adds visualization, and Alertmanager adds notifications. You can start with just Prometheus + Node Exporter + Grafana and add Alertmanager later when you know what conditions you want to be alerted about.

Conclusion

A monitoring stack transforms your homelab from a collection of machines you hope are working into a system you know is working. The combination of Prometheus, Grafana, Node Exporter, and Alertmanager has become the standard not because it is the only option, but because it is well-documented, well-integrated, and genuinely useful at every scale from three Raspberry Pis to a thousand-node data center.

Start simple. Deploy the Docker Compose stack from this guide, import the Node Exporter Full dashboard (ID 1860), and look at your homelab’s metrics for the first time. You will immediately notice patterns you never knew existed: the nightly backup that spikes CPU at 3 AM, the container that slowly leaks memory over a week, the disk that is 85% full because Docker images are piling up.

Then add alerts for the things that actually matter: disk filling up, nodes going offline, containers crash-looping. Do not alert on everything. Alert on conditions that require human action. If an alert fires and you look at it and shrug, it should not be an alert.

Over time, expand the stack. Add the Blackbox Exporter to monitor external endpoints. Add the SNMP Exporter if you have managed network gear. Add Loki for log aggregation. Build custom dashboards for the services you care about most.

The monitoring stack itself is low-maintenance once deployed. Prometheus is rock-solid. Grafana upgrades cleanly. Node Exporter rarely needs attention. The ongoing work is tuning alert thresholds and building dashboards that surface the information you actually need. That work pays for itself the first time you catch a problem before it becomes an outage.