CGR Labs Day 3: Observability

Ciprian Ciprian · 6 min read

I’m building a system to experiment with local LLMs and host my apps. I’m using a dedicated Hetzner server running a Kubernetes cluster. I’ve always wanted to run Kubernetes on bare metal and escape the managed-K8s golden handcuffs. We’re also running AI inference on CPU (slow as it is), so squeezing out efficiency matters.

Yesterday I got the production-ai service running with llama.cpp and Qwen3-30B at ~12 tokens/second. Today’s goal: establish comprehensive observability so I can see exactly how well the models use the available resources and whether my performance tuning actually does anything.

What we’re building

┌─────────────────────────────────────────────────────────────────┐
│                      K3s CLUSTER                                │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Web Apps   │  │  AI Models   │  │   Database   │           │
│  │              │  │  (llama.cpp) │  │  (Postgres)  │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
│         │                 │                  │                  │
│         │ /metrics        │ /metrics         │ /metrics         │
│         │                 │                  │                  │
│         └─────────────────┼──────────────────┘                  │
│                           │                                     │
│  ┌────────────────────────▼────────────────────────────┐        │
│  │              PROMETHEUS                             │        │
│  │         (Metrics Collection & Storage)              │        │
│  │                                                     │        │
│  │  • Scrapes /metrics endpoints every 15-30s          │        │
│  │  • Time-series database                             │        │
│  │  • 15-day retention                                 │        │
│  └────────────────────┬────────────────────────────────┘        │
│                       │                                         │
│                       │ PromQL Queries                          │
│                       │                                         │
│  ┌────────────────────▼────────────────────────────┐            │
│  │              GRAFANA                            │            │
│  │         (Visualization & Dashboards)            │            │
│  │                                                 │            │
│  │  Cluster Overview                               │            │
│  │  Hardware Metrics                               │            │
│  │  AI Performance                                 │            │
│  │  Application Health                             │            │
│  └─────────────────────────────────────────────────┘            │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         METRIC EXPORTERS                                 │   │
│  ├──────────────────────────────────────────────────────────┤   │
│  │  • node-exporter: Hardware metrics (CPU, RAM, disk)      │   │
│  │  • kube-state-metrics: Kubernetes object state           │   │
│  │  • Application metrics: Custom app instrumentation       │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Observability Flow

Application/Service
   ↓
/metrics endpoint (Prometheus format)
   ↓
Prometheus (scrapes & stores)
   ↓
Grafana (queries & visualizes)
   ↓
Dashboard (actionable insights)

Key Metrics Tracked

Infrastructure

  • CPU usage per core
  • Memory utilization
  • Disk I/O and space
  • Network throughput

Kubernetes

  • Pod health and restarts
  • Resource requests vs limits
  • Deployment status

AI Models

  • Tokens per second
  • Request latency (p50, p95, p99)
  • Queue depth
  • Inference errors

Applications

  • Request rates
  • Response times
  • Error rates
  • Database connections

Technology Stack

  • Prometheus: Metrics collection and time-series storage
  • Grafana: Visualization and alerting
  • node-exporter: Hardware-level metrics
  • kube-state-metrics: Kubernetes cluster state
  • llama.cpp: Built-in Prometheus metrics for LLM inference

llama.cpp monitoring

llama.cpp already exposes a metrics endpoint, so all I had to do was enable it in my deployment.

llama.cpp metrics endpoint enabled
llama.cpp metrics endpoint enabled in the deployment (Prometheus format).
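The llama.cpp server takes a `--metrics` flag that turns on a Prometheus-compatible /metrics endpoint. A minimal sketch of the relevant Deployment fragment — container name, image tag, model path, and port are placeholders, not my actual manifest:

```yaml
# production-ai Deployment fragment (illustrative names/paths)
containers:
  - name: llama-server
    image: ghcr.io/ggml-org/llama.cpp:server   # assumed image; use your own build
    args:
      - --model
      - /models/qwen3-30b.gguf                 # placeholder model path
      - --metrics                              # enable Prometheus /metrics
      - --port
      - "8080"
    ports:
      - containerPort: 8080
```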

Set up monitoring

I’ll start by setting up node-exporter and kube-state-metrics. node-exporter runs on every node and exposes detailed hardware-level metrics—CPU usage per core, memory statistics, disk I/O, network throughput, and even temperature sensors. kube-state-metrics provides insight into the cluster’s logical state by exposing metrics about Kubernetes objects themselves: pod phases, deployment replica counts, resource requests and limits, and restart counts.
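A condensed sketch of the node-exporter DaemonSet, assuming the standard upstream image; the real manifest also needs tolerations for control-plane nodes, and I've omitted resource limits:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels: { app: node-exporter }
  template:
    metadata:
      labels: { app: node-exporter }
    spec:
      hostNetwork: true   # serve metrics on the node's own IP (port 9100)
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.1
          args:
            - --path.rootfs=/host   # read host stats from the mounted root fs
          ports:
            - containerPort: 9100
          volumeMounts:
            - { name: rootfs, mountPath: /host, readOnly: true }
      volumes:
        - { name: rootfs, hostPath: { path: / } }
```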

Monitoring services running
Core monitoring components deployed: node-exporter, kube-state-metrics, and supporting services.

After that I set up the Prometheus persistent volume and Prometheus itself, configured to scrape metrics from all our exporters.

Prometheus configuration
Prometheus configuration manifest defining scrape targets and retention.
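The scrape config boils down to something like the fragment below; job names and the service DNS target are illustrative of my setup, not exact. Retention is set separately via the server flag `--storage.tsdb.retention.time=15d`:

```yaml
# prometheus.yml fragment (targets/job names illustrative)
global:
  scrape_interval: 15s          # matches the 15-30s window in the diagram
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
  - job_name: production-ai     # llama.cpp's /metrics endpoint
    static_configs:
      - targets: ["production-ai.monitoring.svc.cluster.local:8080"]
```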

At this point I realized I had forgotten to actually create the /mnt/monitoring folders and set the right permissions on them, so Prometheus was crash-looping.

Prometheus error CrashLoopBackOff
Prometheus CrashLoopBackOff due to missing /mnt/monitoring directory & permissions.
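The fix, run on the node, amounts to creating the hostPath directory and handing it to the container user. This assumes the upstream Prometheus image, which runs as nobody (UID/GID 65534); adjust if yours runs differently:

```shell
# On the node: create Prometheus's hostPath storage before deploying.
# Assumption: the Prometheus container runs as nobody (UID/GID 65534).
mkdir -p /mnt/monitoring/prometheus
chown -R 65534:65534 /mnt/monitoring/prometheus
chmod 750 /mnt/monitoring/prometheus
```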

Finally everything is running and our scrape targets are up—except for ai-experimental, which hasn’t been created yet. Next up: Grafana.

Prometheus targets up
Prometheus targets healthy after storage & permission fix (ai-experimental pending).

Setting up Grafana

After I added the persistent volume (creating folders, setting permissions) and got the service running—after dealing with a bunch of Let’s Encrypt + Cloudflare + Vodafone quirks—I still wasn’t seeing Grafana in the browser. The pod was healthy and the certificate had been issued, so I just needed to wait for Vodafone’s DNS cache to catch up.

Grafana pod running
Grafana running with Ingress + TLS; waiting on ISP DNS cache refresh.

Note to self: set up DNS records before starting, and keep traffic to services unproxied until the HTTPS certificate is issued.

While I wait for that, I’ll set up some dashboards.

Grafana dashboards

First, let’s see what metrics we’re getting from llama.cpp.

llama.cpp raw metrics
Exploring llama.cpp exporter metrics (tokens/sec, request latency, queue depth).
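For reference, generation throughput falls out of the counter the llama.cpp exporter ships; the metric name below is what my build exposes, so adjust if yours differs:

```promql
# Generated tokens per second, averaged over the last 5 minutes
rate(llamacpp:tokens_predicted_total[5m])
```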

Ok, based on the llama.cpp and node-exporter metrics, what we want are combined views of LLM inference performance and cluster utilization.

cluster-performance dashboard

Focus: overall node + namespace health & pressure signals.

  • Node CPU, load (normalized per core), memory %, network RX/TX, disk I/O (ops & throughput)
  • Namespace CPU & memory, pod throttling ratio, pod phase distribution, pod counts
  • PSI (Pressure Stall Information) panels for CPU / memory / IO
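The PSI panels come from node-exporter’s pressure collector. For example, the fraction of wall-clock time in which at least one task was stalled waiting on CPU (values near 1.0 mean the node is badly oversubscribed):

```promql
rate(node_pressure_cpu_waiting_seconds_total[5m])
```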

llama.cpp dashboard

Focus: model serving performance & utilization.

  • Token & prompt throughput
  • Active requests & queue depth
  • KV cache utilization / fragmentation signals

llm-tuning dashboard

Focus: correlating inference throughput to resource usage.

  • Correlates llama.cpp tokens/sec with node & pod CPU usage
  • Prompt vs generation time split; active vs deferred requests
  • KV cache utilization patterns & decode slot occupancy
  • Memory & CPU throttling for llama pods; efficiency (tokens per CPU second)
  • Prompt/gen time ratio trend

I also added some Prometheus recording rules, since some of the panel queries got a bit too complex.
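One such rule, sketched below, precomputes the tokens-per-CPU-second efficiency number the llm-tuning dashboard uses. The rule name, pod selector, and namespace label are mine and illustrative; the llama.cpp metric name assumes its built-in exporter:

```yaml
groups:
  - name: llm-efficiency
    rules:
      # tokens generated per CPU-second consumed by the llama pods
      - record: llm:tokens_per_cpu_second
        expr: |
          sum(rate(llamacpp:tokens_predicted_total[5m]))
          /
          sum(rate(container_cpu_usage_seconds_total{pod=~"production-ai.*"}[5m]))
```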

After creating the dashboards JSON files, I set them up as ConfigMaps (questionable—could switch to sidecar or S3/gitrepo provisioning later):

kubectl create configmap grafana-dashboards \
  --from-file=k8s-configs/monitoring/dashboards/ \
  -n monitoring \
  --dry-run=client -o yaml > k8s-configs/monitoring/grafana-dashboards.yaml

I also set up the dashboard provider and added the corresponding volumes to the Grafana deployment.
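The provider config that points Grafana at the mounted ConfigMap looks roughly like this, assuming the dashboards ConfigMap is mounted at /var/lib/grafana/dashboards (provider name and interval are arbitrary):

```yaml
# grafana dashboards.yaml provisioning file (mounted under /etc/grafana/provisioning/dashboards)
apiVersion: 1
providers:
  - name: default
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30   # re-scan for changed dashboard JSON
    options:
      path: /var/lib/grafana/dashboards
```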

Meanwhile, DNS has propagated, so we can actually see our dashboards.

Cluster performance dashboard
Cluster performance dashboard: CPU, memory, network, and PSI signals.
llama.cpp dashboard
llama.cpp inference dashboard: tokens/sec, latency, queue depth, KV cache.

That’s it for today. With the monitoring stack in place, next steps:

  1. Performance tuning (numactl, CPU pinning, memory bandwidth awareness).
  2. Experiment with more CPU-friendly model variants.
  3. Fix the panels that still have no data, either because I’m not exporting those metrics yet or because the queries are invalid.

I’ll cover tuning, scheduling, and model selection in the next post.