CGR Labs Day 3: Observability

Ciprian Ciprian · 6 min read

I’m building a system to experiment with local LLMs and host my apps. I’m using a dedicated Hetzner server running a Kubernetes cluster. I’ve always wanted to run Kubernetes on bare metal and escape the managed-K8s golden handcuffs. We’re also running AI inference on CPU (slow as it is), so squeezing out efficiency matters.

Yesterday I got the production-ai service running with llama.cpp and Qwen3-30B at ~12 tokens/second. Today’s goal: establish comprehensive observability so I can see exactly how well the models use the available resources and whether my performance tuning actually does anything.

What we’re building

┌─────────────────────────────────────────────────────────────────┐
│                      K3s CLUSTER                                │
│                                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐           │
│  │   Web Apps   │  │  AI Models   │  │   Database   │           │
│  │              │  │  (llama.cpp) │  │  (Postgres)  │           │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘           │
│         │                 │                  │                  │
│         │ /metrics        │ /metrics         │ /metrics         │
│         │                 │                  │                  │
│         └─────────────────┼──────────────────┘                  │
│                           │                                     │
│  ┌────────────────────────▼────────────────────────────┐        │
│  │              PROMETHEUS                             │        │
│  │         (Metrics Collection & Storage)              │        │
│  │                                                     │        │
│  │  • Scrapes /metrics endpoints every 15-30s          │        │
│  │  • Time-series database                             │        │
│  │  • 15-day retention                                 │        │
│  └────────────────────┬────────────────────────────────┘        │
│                       │                                         │
│                       │ PromQL Queries                          │
│                       │                                         │
│  ┌────────────────────▼────────────────────────────┐            │
│  │              GRAFANA                            │            │
│  │         (Visualization & Dashboards)            │            │
│  │                                                 │            │
│  │  Cluster Overview                               │            │
│  │  Hardware Metrics                               │            │
│  │  AI Performance                                 │            │
│  │  Application Health                             │            │
│  └─────────────────────────────────────────────────┘            │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │         METRIC EXPORTERS                                 │   │
│  ├──────────────────────────────────────────────────────────┤   │
│  │  • node-exporter: Hardware metrics (CPU, RAM, disk)      │   │
│  │  • kube-state-metrics: Kubernetes object state           │   │
│  │  • Application metrics: Custom app instrumentation       │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Observability Flow

Application/Service
   ↓
/metrics endpoint (Prometheus format)
   ↓
Prometheus (scrapes & stores)
   ↓
Grafana (queries & visualizes)
   ↓
Dashboard (actionable insights)

Key Metrics Tracked

Infrastructure

  • CPU usage per core
  • Memory utilization
  • Disk I/O and space
  • Network throughput

Kubernetes

  • Pod health and restarts
  • Resource requests vs limits
  • Deployment status

AI Models

  • Tokens per second
  • Request latency (p50, p95, p99)
  • Queue depth
  • Inference errors

Applications

  • Request rates
  • Response times
  • Error rates
  • Database connections

Technology Stack

  • Prometheus: Metrics collection and time-series storage
  • Grafana: Visualization and alerting
  • node-exporter: Hardware-level metrics
  • kube-state-metrics: Kubernetes cluster state
  • llama.cpp: Built-in Prometheus metrics for LLM inference

llama.cpp monitoring

llama.cpp already exposes a metrics endpoint, so all I had to do was enable it in my deployment.

llama.cpp metrics endpoint enabled
llama.cpp metrics endpoint enabled in the deployment (Prometheus format).
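The llama.cpp server takes a `--metrics` flag that turns on a Prometheus-compatible /metrics endpoint. A minimal sketch of the relevant Deployment fragment — container name, image tag, model path, and port are placeholders, not my actual manifest:

```yaml
# production-ai Deployment fragment (illustrative names/paths)
containers:
  - name: llama-server
    image: ghcr.io/ggml-org/llama.cpp:server   # assumed image; use your own build
    args:
      - --model
      - /models/qwen3-30b.gguf                 # placeholder model path
      - --metrics                              # enable Prometheus /metrics
      - --port
      - "8080"
    ports:
      - containerPort: 8080
```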

Set up monitoring

I’ll start by setting up node-exporter and kube-state-metrics. node-exporter runs on every node and exposes detailed hardware-level metrics—CPU usage per core, memory statistics, disk I/O, network throughput, and even temperature sensors. kube-state-metrics provides insight into the cluster’s logical state by exposing metrics about Kubernetes objects themselves: pod phases, deployment replica counts, resource requests and limits, and restart counts.
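A condensed sketch of the node-exporter DaemonSet, assuming the standard upstream image; the real manifest also needs tolerations for control-plane nodes, and I've omitted resource limits:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels: { app: node-exporter }
  template:
    metadata:
      labels: { app: node-exporter }
    spec:
      hostNetwork: true   # serve metrics on the node's own IP (port 9100)
      hostPID: true
      containers:
        - name: node-exporter
          image: quay.io/prometheus/node-exporter:v1.8.1
          args:
            - --path.rootfs=/host   # read host stats from the mounted root fs
          ports:
            - containerPort: 9100
          volumeMounts:
            - { name: rootfs, mountPath: /host, readOnly: true }
      volumes:
        - { name: rootfs, hostPath: { path: / } }
```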

Monitoring services running
Core monitoring components deployed: node-exporter, kube-state-metrics, and supporting services.

After that I set up the Prometheus persistent volume and Prometheus itself, configured to scrape metrics from all our exporters.

Prometheus configuration
Prometheus configuration manifest defining scrape targets and retention.
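The scrape config boils down to something like the fragment below; job names and the service DNS target are illustrative of my setup, not exact. Retention is set separately via the server flag `--storage.tsdb.retention.time=15d`:

```yaml
# prometheus.yml fragment (targets/job names illustrative)
global:
  scrape_interval: 15s          # matches the 15-30s window in the diagram
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
  - job_name: production-ai     # llama.cpp's /metrics endpoint
    static_configs:
      - targets: ["production-ai.monitoring.svc.cluster.local:8080"]
```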

At this point I realized I had forgotten to actually create the /mnt/monitoring folders and set the right permissions on them, so Prometheus was crash-looping.

Prometheus error CrashLoopBackOff
Prometheus CrashLoopBackOff due to missing /mnt/monitoring directory & permissions.
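The fix, run on the node, amounts to creating the hostPath directory and handing it to the container user. This assumes the upstream Prometheus image, which runs as nobody (UID/GID 65534); adjust if yours runs differently:

```shell
# On the node: create Prometheus's hostPath storage before deploying.
# Assumption: the Prometheus container runs as nobody (UID/GID 65534).
mkdir -p /mnt/monitoring/prometheus
chown -R 65534:65534 /mnt/monitoring/prometheus
chmod 750 /mnt/monitoring/prometheus
```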

Finally everything is running and our scrape targets are up—except for ai-experimental, which hasn’t been created yet. Next up: Grafana.

Prometheus targets up
Prometheus targets healthy after storage & permission fix (ai-experimental pending).

Setting up Grafana

After I added the persistent volume (creating folders, setting permissions) and got the service running—after dealing with a bunch of Let’s Encrypt + Cloudflare + Vodafone quirks—I still wasn’t seeing Grafana in the browser. The pod was healthy and the certificate had been issued, so I just needed to wait for Vodafone’s DNS cache to catch up.

Grafana pod running
Grafana running with Ingress + TLS; waiting on ISP DNS cache refresh.

Note to self: set up DNS records before starting, and keep traffic to services unproxied until the HTTPS certificate is issued.

While I wait for that, I’ll set up some dashboards.

Grafana dashboards

First, let’s see what metrics we’re getting from llama.cpp.

llama.cpp raw metrics
Exploring llama.cpp exporter metrics (tokens/sec, request latency, queue depth).
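For reference, generation throughput falls out of the counter the llama.cpp exporter ships; the metric name below is what my build exposes, so adjust if yours differs:

```promql
# Generated tokens per second, averaged over the last 5 minutes
rate(llamacpp:tokens_predicted_total[5m])
```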

Ok, based on the llama.cpp and node-exporter metrics, what we want are combined views of LLM inference performance and cluster utilization.

cluster-performance dashboard

Focus: overall node + namespace health & pressure signals.

  • Node CPU, load (normalized per core), memory %, network RX/TX, disk I/O (ops & throughput)
  • Namespace CPU & memory, pod throttling ratio, pod phase distribution, pod counts
  • PSI (Pressure Stall Information) panels for CPU / memory / IO
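The PSI panels come from node-exporter’s pressure collector. For example, the fraction of wall-clock time in which at least one task was stalled waiting on CPU (values near 1.0 mean the node is badly oversubscribed):

```promql
rate(node_pressure_cpu_waiting_seconds_total[5m])
```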

llama.cpp dashboard

Focus: model serving performance & utilization.

  • Token & prompt throughput
  • Active requests & queue depth
  • KV cache utilization / fragmentation signals

llm-tuning dashboard

Focus: correlating inference throughput to resource usage.

  • Correlates llama.cpp tokens/sec with node & pod CPU usage
  • Prompt vs generation time split; active vs deferred requests
  • KV cache utilization patterns & decode slot occupancy
  • Memory & CPU throttling for llama pods; efficiency (tokens per CPU second)
  • Prompt/gen time ratio trend

I also added some Prometheus recording rules, since some of the panel queries got a bit too complex.
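One such rule, sketched below, precomputes the tokens-per-CPU-second efficiency number the llm-tuning dashboard uses. The rule name, pod selector, and namespace label are mine and illustrative; the llama.cpp metric name assumes its built-in exporter:

```yaml
groups:
  - name: llm-efficiency
    rules:
      # tokens generated per CPU-second consumed by the llama pods
      - record: llm:tokens_per_cpu_second
        expr: |
          sum(rate(llamacpp:tokens_predicted_total[5m]))
          /
          sum(rate(container_cpu_usage_seconds_total{pod=~"production-ai.*"}[5m]))
```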

After creating the dashboards JSON files, I set them up as ConfigMaps (questionable—could switch to sidecar or S3/gitrepo provisioning later):

kubectl create configmap grafana-dashboards \
  --from-file=k8s-configs/monitoring/dashboards/ \
  -n monitoring \
  --dry-run=client -o yaml > k8s-configs/monitoring/grafana-dashboards.yaml

I also set up the dashboard provider and added the corresponding volumes to the Grafana deployment.
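The provider config that points Grafana at the mounted ConfigMap looks roughly like this, assuming the dashboards ConfigMap is mounted at /var/lib/grafana/dashboards (provider name and interval are arbitrary):

```yaml
# grafana dashboards.yaml provisioning file (mounted under /etc/grafana/provisioning/dashboards)
apiVersion: 1
providers:
  - name: default
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30   # re-scan for changed dashboard JSON
    options:
      path: /var/lib/grafana/dashboards
```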

Meanwhile, DNS has propagated, so we can actually see our dashboards.

Cluster performance dashboard
Cluster performance dashboard: CPU, memory, network, and PSI signals.
llama.cpp dashboard
llama.cpp inference dashboard: tokens/sec, latency, queue depth, KV cache.

That’s it for today. With the monitoring stack in place, next steps:

  1. Performance tuning (numactl, CPU pinning, memory bandwidth awareness).
  2. Experiment with more CPU-friendly model variants.
  3. Fix the panels that still have no data, either because I’m not exporting those metrics yet or because the queries are invalid.

I’ll cover tuning, scheduling, and model selection in the next post.