HomeLab Monitor

One small container for your home lab — GPU & local-AI, Docker containers, systemd services and host health, all on one page. Everything is discovered automatically; nothing is hardcoded.

Fleet

Burn Rate: today, 7 days, 30 days

The Lab's Engine: GPU, CPU, RAM

Compute history

Event & Insight Feed

What's Costing You

Containers

Services

Diagnostics

AI workload · MCP · setup & requirements

AI agent (MCP)

A read-only MCP server is built in, so Claude — or any MCP client — can connect and explore this homelab with full dashboard parity, right down to what each model and container costs to run.

Connect with the Claude CLI (then ask it about your fleet):

claude mcp add --transport http homelab …

MCP docs →

🩺 Setup & requirements

What the monitor can see on this hub, and how to enable anything that's missing. The dashboard keeps running even when optional pieces aren't mounted — no GPU required.

Requirement	Status	Detail

🎮 GPU right now

VRAM allocation

📋 Services on the GPU (selected range)

Service	Peak	Avg	% time

📊 VRAM by service over time

Stacked = VRAM each service holds. Dashed = capacity. Red bands = VRAM pressure; ▼ markers = out-of-memory events.

⚡ GPU utilization, power & temperature

🗓️ Busy hours — when your lab burns power

Each cell is the average total draw (GPU + CPU + DRAM) for that local day-of-week and hour, over the chosen window. Colour scales by cost-rate when a tariff is set, otherwise by power. Sparse cells are dimmed — hover for the exact figures.
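The bucketing described above can be sketched in a few lines. This is a minimal illustration, not the monitor's actual code: the function names, the sample tuple layout and the price_per_kwh parameter are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def busy_hours(samples):
    """Average total draw per (weekday, hour) cell.

    samples: iterable of (local_datetime, gpu_w, cpu_w, dram_w) tuples
    (a hypothetical layout). Returns {(weekday, hour): (avg_watts, count)}
    with weekday 0 = Monday.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, gpu_w, cpu_w, dram_w in samples:
        cell = (ts.weekday(), ts.hour)
        sums[cell] += gpu_w + cpu_w + dram_w  # total draw for this sample
        counts[cell] += 1
    return {cell: (sums[cell] / counts[cell], counts[cell]) for cell in sums}


def cell_value(avg_watts, price_per_kwh=None):
    """Value used for colouring: cost per hour with a tariff, else watts."""
    if price_per_kwh is None:
        return avg_watts
    return avg_watts / 1000.0 * price_per_kwh
```

The per-cell sample count is returned alongside the average so that a caller could dim sparse cells, as the heatmap does.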

🧾 Breakdown by process / container / model

Name	Kind	Avg W	Energy	Cost

Click a row to drill into its power & cost over time. GPU split by VRAM share; CPU split by CPU-time share of the measured package power.
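As a rough illustration of the split described above, the sketch below divides GPU power by VRAM share and CPU package power by CPU-time share, then sums the two per name. The function names and input dictionaries are hypothetical; the monitor's real attribution may handle edge cases differently.

```python
def split_by_share(total_watts, shares):
    """Split total_watts across names in proportion to their share values."""
    total_share = sum(shares.values())
    if total_share <= 0:
        return {name: 0.0 for name in shares}
    return {name: total_watts * value / total_share
            for name, value in shares.items()}


def attribute_power(gpu_watts, vram_mb, cpu_pkg_watts, cpu_seconds):
    """Combine the GPU split (by VRAM held) and CPU split (by CPU time).

    vram_mb and cpu_seconds map a process/container/model name to its
    VRAM in MB and its CPU time in seconds over the same interval.
    """
    gpu = split_by_share(gpu_watts, vram_mb)
    cpu = split_by_share(cpu_pkg_watts, cpu_seconds)
    names = set(gpu) | set(cpu)
    return {name: gpu.get(name, 0.0) + cpu.get(name, 0.0) for name in names}
```

For example, a service holding 6000 MB of 8000 MB allocated VRAM would be charged three quarters of the GPU draw under this scheme, whatever its actual utilisation was.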

🧪 Runs — pushed from your notebooks & MLflow, priced with real GPU energy

Run	Source	Status	Started	Duration	Metrics	Energy	Cost

🔬 Auto-detected on the hub — training processes running now (no push needed)

📈 GPU activity sessions

Started	Duration	Peak util	Peak VRAM	Avg power	Energy	Cost

Contiguous GPU-busy periods reconstructed from power & utilisation history — each is a training run, batch job or inference burst. Energy is integrated from GPU power; cost uses your kWh price.
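Integrating energy from power samples can be sketched with the trapezoidal rule. This is an assumed approach for illustration only; the sample format (Unix seconds, watts) and the function names are not taken from the monitor.

```python
def energy_kwh(samples):
    """Trapezoidal integral of (unix_seconds, watts) samples, in kWh.

    samples must be sorted by time; fewer than two samples yields 0.0.
    """
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(samples, samples[1:]):
        joules += (w0 + w1) / 2.0 * (t1 - t0)  # mean power x elapsed seconds
    return joules / 3_600_000.0  # 1 kWh = 3.6 million joules


def session_cost(samples, price_per_kwh):
    """Cost of one session at a flat price per kWh."""
    return energy_kwh(samples) * price_per_kwh
```

A session that holds a steady 300 W for one hour integrates to 0.3 kWh, which at a price of 0.25 per kWh comes to 0.075.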

📊 Results

Click a column to sort, a row to see its context sweep below, or tick rows to overlay them in Compare. Re-run stores a fresh row so you can track changes over time.

Push notifications when something noteworthy happens. Configured here: no env vars, no config files. Any channel below (Discord, ntfy, Telegram, email, Slack, or a generic webhook) can be used; all are optional. Alerts are edge-triggered (one ping per state change) so they don't spam.
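Edge-triggering means a notification fires on a change of state, not on every poll that finds the same state. A minimal sketch of the idea follows; the class name, the "ok" baseline and the state strings are assumptions for the example, not the notifier's actual implementation.

```python
class EdgeTrigger:
    """Remember the last state per key and report only transitions."""

    def __init__(self, baseline="ok"):
        self._baseline = baseline  # state assumed for keys never seen before
        self._last = {}

    def update(self, key, state):
        """Return True when state differs from the previous state for key."""
        previous = self._last.get(key, self._baseline)
        self._last[key] = state
        return state != previous
```

Polling a container that goes ok, unhealthy, unhealthy, ok, ok would produce two notifications under this scheme: one when it turns unhealthy and one when it recovers.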

Alerts enabled — turn the notifier on once a channel below is filled in.

Discord webhook URL

ntfy.sh topic

Topic is required; server defaults to https://ntfy.sh. Self-hosted ntfy works too.
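ntfy publishes a message when it receives an HTTP POST to the server URL followed by the topic. The sketch below builds such a request with the default-server behaviour described above; build_ntfy_request is a hypothetical helper, and the request is only constructed here, not sent.

```python
from urllib.request import Request

DEFAULT_NTFY_SERVER = "https://ntfy.sh"


def build_ntfy_request(topic, message, title=None, server=None):
    """Build a POST request that publishes message to an ntfy topic.

    server falls back to the public ntfy.sh instance when not given;
    pass a self-hosted base URL to use your own server.
    """
    if not topic:
        raise ValueError("ntfy topic is required")
    base = (server or DEFAULT_NTFY_SERVER).rstrip("/")
    headers = {"Title": title} if title else {}  # keep the title ASCII
    return Request(
        f"{base}/{topic}",
        data=message.encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send the notification.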

Telegram bot token

Telegram chat ID

Required with a bot token. Message your bot once, then read chat.id from getUpdates.
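After you message the bot, the Telegram Bot API's getUpdates method (https://api.telegram.org/bot<token>/getUpdates) returns JSON in which each message carries a chat object. The helper below is a hypothetical sketch that pulls the distinct chat IDs out of such a response body; it does no network access itself.

```python
import json


def chat_ids_from_updates(payload):
    """Collect distinct chat ids from a getUpdates JSON body, in order seen."""
    data = json.loads(payload)
    if not data.get("ok"):
        return []
    seen = []
    for update in data.get("result", []):
        # Only plain messages are inspected; other update types are skipped.
        chat = update.get("message", {}).get("chat", {})
        chat_id = chat.get("id")
        if chat_id is not None and chat_id not in seen:
            seen.append(chat_id)
    return seen
```

For a private chat with the bot, the single ID returned is the value to enter as the Telegram chat ID.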

Email (SMTP)

SMTP host

Port

TLS

From address

To address

SMTP user (optional)

SMTP password (optional)

Slack webhook URL

Generic webhook URL

POSTs JSON with {level, title, detail, host} — works with Teams, Gotify, n8n…
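On the receiving side, a consumer of the generic webhook might validate the four documented fields before acting on an alert. The sketch below assumes the body is a flat JSON object with those keys; parse_alert and format_alert are hypothetical names, and the value types are not specified by the page.

```python
import json

REQUIRED_FIELDS = ("level", "title", "detail", "host")


def parse_alert(body):
    """Decode a webhook body and check the four documented fields exist."""
    alert = json.loads(body)
    if not isinstance(alert, dict):
        raise ValueError("alert payload must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in alert]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return alert


def format_alert(alert):
    """Render an alert as a single line for logs or a chat message."""
    level = str(alert["level"]).upper()
    return f"[{level}] {alert['host']}: {alert['title']} ({alert['detail']})"
```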

Minimum severity

Disk alert threshold (%)

Triggers: container unhealthy/exited non-zero/dead, systemd unit failed, GPU VRAM pressure (free < PRESSURE_FREE_MB), GPU OOM events, and disks crossing the threshold above.

📋 Routing rules

Per-service rules control which channels receive which alerts. If no rule matches, the default (all configured channels) applies. Wildcards (*, ?) work in the pattern column.
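The matching described above can be sketched with shell-style wildcards from Python's fnmatch module. The rule dictionaries mirror the table columns (kind, pattern, channel, min level, on), but the field names, the severity names and the choice to suppress an alert that falls below a matching rule's minimum level are assumptions for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical severity names, ordered from least to most severe.
LEVELS = {"info": 0, "warning": 1, "critical": 2}


def route(alert, rules, configured_channels):
    """Return the sorted list of channels that should receive alert.

    With no matching enabled rule, every configured channel is used.
    """
    matched = [
        rule for rule in rules
        if rule["on"]
        and rule["kind"] == alert["kind"]
        and fnmatchcase(alert["name"], rule["pattern"])
    ]
    if not matched:
        return sorted(configured_channels)
    return sorted({
        rule["channel"] for rule in matched
        if LEVELS[alert["level"]] >= LEVELS[rule["min_level"]]
        and rule["channel"] in configured_channels
    })
```

With a rule sending container db-* at warning or above to Telegram, a critical alert for db-postgres goes to Telegram only, while an alert for a container named web falls through to all configured channels.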

Kind	Pattern	Channel	Min level	On

📰 Daily brief

A scheduled “is the lab OK?” digest — the calm counterpart to alerts. Sent once a day to one channel: email gets the full HTML report, chat channels get a compact summary with the things that need attention.

Daily brief enabled — send once a day at the time below.

Send at (local time)

Send to channel

Theme

Only channels you've configured above appear here — configure one and it shows up automatically.

Settings here save automatically when you leave a field. Preview and Send test use your saved settings.

Public status page

A shareable, read-only status page for the checks you mark public: no auth, and no host internals leaked.

Public status page enabled: available at /public without needing a restart.

Push run/session data from Jupyter, Colab or Kaggle (or mirror MLflow), and each run comes back priced with the real GPU energy it used. Writes need an API key; dashboard reads are open on your LAN.

🔑 API keys

Create one key per machine/notebook so you can track and revoke each independently. Keys are stored hashed — the secret is shown once, at creation.
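Storing only a hash and showing the secret once is a common pattern. A minimal sketch of it follows, assuming SHA-256 over a random token; the hlm_ prefix comes from the quickstart on this page, while the hash choice, token length and function names are assumptions and may not match what the monitor does.

```python
import hashlib
import hmac
import secrets

KEY_PREFIX = "hlm_"  # prefix seen in the quickstart's key value


def create_key():
    """Return (secret, digest): show secret once, store only digest."""
    secret = KEY_PREFIX + secrets.token_urlsafe(32)
    digest = hashlib.sha256(secret.encode("utf-8")).hexdigest()
    return secret, digest


def verify_key(presented, stored_digest):
    """Check a presented key against the stored digest in constant time."""
    digest = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(digest, stored_digest)
```

Because only the digest is kept, a lost secret cannot be recovered; the key would have to be revoked and a new one created.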

New key name

Expires in

Name	Key	Created	Expires	Last used	Runs

Download client (homelab_run.py)

Quickstart in a notebook:

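import homelab_run as homelab

# Point the client at this hub and authenticate with an API key created above.
homelab.configure(url="http://THIS-HOST:9800", key="hlm_…")

# train(), batch_tokens and step_seconds come from your own training loop.
with homelab.run("my-finetune", params={"lr": 2e-4}) as r:
    for step, loss in enumerate(train()):
        r.log_metric("loss", loss, step=step)
        r.log_metric("tokens_per_sec", batch_tokens / step_seconds, step=step)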

Colab/Kaggle run in the cloud — point url at a Tailscale/ngrok address that reaches this hub.

🟢 MLflow (pull)

Mirror runs from an MLflow tracking server — they appear under Runs with GPU cost attached. Synced every ~5 min.

MLflow tracking URI

MLflow bearer token (optional)

🌐 Network

Self-hosted & open source • single container • no Prometheus/Grafana required • history downsampled on read. GPU services attributed via /proc/<pid>/cgroup + the Docker API.
Ideas welcome — see CONTRIBUTING.md · issues & PRs always open.

🧠 AI workload

🎮 Per-GPU

💰 Power & cost

Drilldown

📦 Installed models

⚡ GPU power & utilisation during this run

🧰 Notebooks & tools

📊 Benchmark Lab: measure what actually fits & how fast, then keep the result

📈 Context sweep

🆚 Compare

📦 Docker containers

🧩 systemd services

💾 Backup & Restore

🌐 Hosts

🖥️ System

🧠 Top processes

🧠 Memory map: containers & services

🖥️ CPU, RAM & load

💾 Disk I/O Throughput

📶 Throughput

📊 Top talkers: containers

🛡️ Security

🛰️ Uptime checks