Hetzner, Let's Encrypt, k3s - How I spent 2 hours debugging
Ciprian · 7 min read

There are maybe two things you can learn from this post:
- The dangers of vibe coding and DevOps in a fairly complex environment. I’d say I know what I’m doing maybe 60-70% of the time, but this was getting frustrating and I fully gave in to the vibe-coding flow of copying output and pasting commands. It probably would have taken less time if I had paused to think about what was actually wrong.
- If you plan to build something like this on a Hetzner dedicated server, the notes below could come in very handy.
Anyway, here’s a fully AI-generated article that summarizes everything that happened very well.
AI part starts here
After setting up the basic infrastructure and waiting for DNS propagation, I thought I was done. The blog deployment was ready, cert-manager was installed, and I had a test nginx pod waiting to get its certificate. Should be straightforward, right?
It wasn’t.
The First Sign of Trouble
kubectl get certificate -n default
NAME READY SECRET AGE
test-nginx-tls False test-nginx-tls 3m48s
False. Always false. And when I checked the cert-manager logs:
E1005 18:14:06.912240 cert-manager/clusterissuers: failed to register an ACME account
err="Get \"https://acme-v02.api.letsencrypt.org/directory\": dial tcp 172.65.32.248:443: connect: connection refused"
Connection refused. But when I tested from the host:
curl https://acme-v02.api.letsencrypt.org/directory
It worked perfectly. So what was going on?
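Before diving into logs, it helps to know that cert-manager chains several resources together: Certificate → CertificateRequest → Order → Challenge. Describing them shows which link is stuck. A quick sketch, using the certificate name from the output above and the default namespace:

```shell
# Show issuance events and status conditions for the stuck certificate.
kubectl describe certificate test-nginx-tls -n default
# List the whole chain; whichever resource exists but never goes Ready
# is where issuance is stuck.
kubectl get certificaterequest,order,challenge -n default
```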
Problem 1: The search . Domain
I started digging into /etc/resolv.conf:
nameserver 127.0.0.53
options edns0 trust-ad
search .
That search . line looked suspicious. Turns out, Hetzner’s DHCP configuration adds this, and it completely breaks DNS resolution inside Kubernetes pods. The problem is that when CoreDNS tries to resolve external domains, it forwards to /etc/resolv.conf, which points to 127.0.0.53 (systemd-resolved on the host). But pods can’t reach the host’s systemd-resolved, so every DNS query fails.
I found this issue documented here: https://github.com/cert-manager/cert-manager/issues/3394
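A quick way to confirm the breakage from inside the cluster is a throwaway pod (a sketch; assumes the busybox image can be pulled):

```shell
# nslookup from inside a pod; with CoreDNS forwarding to the host's
# 127.0.0.53, this fails while the same lookup from the host works.
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup acme-v02.api.letsencrypt.org
```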
The fix required multiple steps:
Step 1: Fix the host’s resolv.conf
sudo nano /etc/netplan/01-netcfg.yaml
Added, nested under the interface entry:
      dhcp4-overrides:
        use-domains: false
This tells netplan to ignore the search domain from DHCP. Then:
sudo netplan apply
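For reference, the override has to sit inside the interface definition. A sketch of the full file, with the interface name from my server and dhcp4: true assumed:

```yaml
# /etc/netplan/01-netcfg.yaml (sketch; your interface name may differ)
network:
  version: 2
  ethernets:
    enp195s0:
      dhcp4: true
      dhcp4-overrides:
        use-domains: false
```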
But the search . line kept coming back. systemd-resolved was stubborn. So I took the nuclear option:
sudo rm /etc/resolv.conf
sudo bash -c 'cat > /etc/resolv.conf << EOF
nameserver 8.8.8.8
nameserver 8.8.4.4
EOF'
sudo chattr +i /etc/resolv.conf
Making it immutable with chattr +i prevents anything, including systemd-resolved and DHCP hooks, from changing it. (Run sudo chattr -i /etc/resolv.conf if you ever need to edit it again.)
Step 2: Fix CoreDNS
The pods were still using CoreDNS, which was forwarding to the broken /etc/resolv.conf. I had to tell CoreDNS to use public DNS directly:
kubectl edit configmap coredns -n kube-system
Changed:
forward . /etc/resolv.conf
To:
forward . 8.8.8.8 8.8.4.4
Then restarted CoreDNS:
kubectl rollout restart deployment coredns -n kube-system
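After the edit, the relevant part of the Corefile looks roughly like this (a sketch; the other default k3s plugins are left as they were):

```
.:53 {
    # ...other default plugins unchanged...
    forward . 8.8.8.8 8.8.4.4
    cache 30
}
```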
Now pods could resolve external domains. Progress!
Problem 2: Outbound HTTPS Was Blocked
But cert-manager still couldn’t reach Let’s Encrypt. DNS worked (I tested with nslookup), but HTTPS connections timed out.
I tested from a pod:
kubectl run test --image=busybox --rm -it -- telnet 8.8.8.8 443
Timeout. Port 443 outbound was completely blocked.
The culprit? My iptables rules. Earlier, I had added rules to forward ports 80 and 443 to K3s’ NodePorts:
sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 31994
This rule was too broad. It redirected ALL port 443 traffic, including outbound connections from pods trying to reach Let’s Encrypt. Every HTTPS request was being hijacked and redirected to Traefik.
The fix:
# Delete the bad rule
sudo iptables -t nat -D PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 31994
# Add a more specific rule
sudo iptables -t nat -I PREROUTING -i enp195s0 -p tcp --dport 443 -j REDIRECT --to-port 31994
The -i enp195s0 flag scopes the rule to traffic arriving on the external network interface, so packets originating from pods, which enter netfilter via the CNI bridge instead, are left alone.
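To double-check which rules are active, a verbose listing of the chain is handy (needs root):

```shell
# The verbose listing includes an 'in' interface column; after the fix,
# the 443 REDIRECT should show enp195s0 there instead of '*'.
sudo iptables -t nat -L PREROUTING -n -v --line-numbers
```

Note that -I inserts the rule at the top of the chain, so it matches before any leftover broader rules; -A would have appended it at the bottom.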
Now pods could make outbound HTTPS connections.
Problem 3: Nothing Listening on Port 80
cert-manager could finally reach Let’s Encrypt. The ClusterIssuer showed READY: True. I deleted the old certificate and triggered a fresh attempt.
The challenge was created. The solver pod started. And then:
Warning Failed Accepting challenge authorization failed:
acme: authorization error for computergeneratedreality.com:
400 urn:ietf:params:acme:error:connection:
Fetching http://computergeneratedreality.com/.well-known/acme-challenge/...:
Connection refused
Connection refused on port 80. But I could curl my domain from the server and get a response. What was happening?
I tested from my local machine:
curl -v http://ip-address
Connection refused.
From the server:
curl -v http://localhost:80
404 from nginx. It worked locally but not externally.
The issue: I could connect from the server because locally generated traffic goes through the OUTPUT chain rather than PREROUTING, so it bypassed the redirect rules. But external connections (like Let’s Encrypt’s validation) were hitting port 80 on my server and getting connection refused because nothing was actually LISTENING on port 80.
K3s uses ServiceLB (svclb pods) to handle this - they’re supposed to forward traffic from 80→30179 and 443→31994. But for some reason, it wasn’t working. The svclb pod was running, but not binding to the ports.
I checked:
sudo netstat -tlnp | grep ':80 '
Nothing. No process was listening on port 80.
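ss gives the same answer as netstat, and checking the svclb pods narrows down whether ServiceLB itself is the problem (a sketch):

```shell
# Modern equivalent of the netstat check above.
sudo ss -tlnp | grep -E ':(80|443) '
# ServiceLB runs an svclb-traefik pod per node; if it's Running here but
# there is no listener above, the port forwarding inside it is broken.
kubectl get pods -n kube-system -o wide | grep svclb
```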
The Nuclear Option That Actually Worked
I edited the Traefik HelmChart to use hostPort directly:
kubectl edit helmchart traefik -n kube-system
Added to the valuesContent section:
ports:
  web:
    hostPort: 80
  websecure:
    hostPort: 443
This makes Traefik bind directly to ports 80 and 443 on the host, bypassing all the ServiceLB complexity. K3s automatically redeployed Traefik.
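Put together, the edited object looks roughly like this (a sketch; most chart fields omitted):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: traefik
  namespace: kube-system
spec:
  # valuesContent is a YAML string handed to the Helm chart.
  valuesContent: |-
    ports:
      web:
        hostPort: 80
      websecure:
        hostPort: 443
```

On k3s, a separate HelmChartConfig object carrying the same valuesContent is the more durable route, since k3s can overwrite direct edits to its bundled HelmChart on upgrade.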
I tested again from my local machine:
curl -v http://ip-address
Success! 404 from Traefik, which meant it was reachable.
Finally: Success
I deleted all the failed certificates and triggered a fresh attempt:
kubectl delete certificate blog-tls -n blog
kubectl delete certificaterequest -n blog --all
kubectl delete order -n blog --all
kubectl delete challenge -n blog --all
Created a clean certificate:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: blog-tls
  namespace: blog
spec:
  secretName: blog-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - computergeneratedreality.com
Applied it:
kubectl apply -f ~/cgr-infrastructure/k8s-configs/blog/blog-certificate.yaml
Watched the magic happen:
kubectl get certificate,challenge -n blog -w
The solver pod appeared. The challenge went to “Processing”. And then:
NAME READY SECRET AGE
blog-tls True blog-tls 45s
True. Finally true.
What I Learned
- Hetzner’s DHCP adds a search . line that breaks Kubernetes DNS. Disable it in netplan with dhcp4-overrides: use-domains: false; if that doesn’t stick, make your resolv.conf immutable.
- Configure CoreDNS to forward to public DNS directly instead of the host’s resolv.conf, avoiding the systemd-resolved complexity. (This should have been enough on its own; in my case it wasn’t, and I still had to change resolv.conf manually.)
- iptables rules need interface specifications. Use -i enp195s0 so a rule only applies to traffic arriving on the external interface, not to traffic originating from pods.
- K3s’ ServiceLB doesn’t always work. When it fails, hostPort is a valid workaround for single-node clusters: just bind Traefik directly to ports 80 and 443.
- Browser DNS caching is real. After everything was working, my browser still couldn’t resolve the domain because it had cached the NXDOMAIN response. Incognito mode and clearing the browser’s DNS cache (chrome://net-internals/#dns) didn’t fix it either, but it works on my phone, and that’s good enough for now.
The blog is now live with automatic SSL certificates. Was it worth the debugging marathon? We’ll find out in 90 days when the first renewal happens.
Next up: AI inference pods and getting llama.cpp running in the cluster.