Hetzner, Let's Encrypt, k3s - How I spent 2 hours debugging
Ciprian · 7 min read

There are maybe two things you can learn from this post:
- The dangers of vibe coding and DevOps in a fairly complex environment. I’d say I know what I’m doing maybe 60-70% of the time, but this was getting frustrating and I fully gave in to the vibe-coding flow of copying output and pasting commands. It probably would have taken less time if I had paused to think about what was actually wrong.
- If you plan to build something like this on a Hetzner dedicated server, the notes below could come in very handy.
Anyway, here’s a fully AI-generated article that summarizes everything that happened very well.
AI part starts here
After setting up the basic infrastructure and waiting for DNS propagation, I thought I was done. The blog deployment was ready, cert-manager was installed, and I had a test nginx pod waiting to get its certificate. Should be straightforward, right?
It wasn’t.
The First Sign of Trouble
kubectl get certificate -n default
NAME READY SECRET AGE
test-nginx-tls False test-nginx-tls 3m48s
False. Always false. And when I checked the cert-manager logs:
E1005 18:14:06.912240 cert-manager/clusterissuers: failed to register an ACME account
err="Get \"https://acme-v02.api.letsencrypt.org/directory\": dial tcp 172.65.32.248:443: connect: connection refused"
Connection refused. But when I tested from the host:
curl https://acme-v02.api.letsencrypt.org/directory
It worked perfectly. So what was going on?
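Before diving into logs, it helps to know that cert-manager chains several resources together: Certificate → CertificateRequest → Order → Challenge. Describing them shows which link is stuck. A quick sketch, using the certificate name from the output above and the default namespace:

```shell
# Show issuance events and status conditions for the stuck certificate.
kubectl describe certificate test-nginx-tls -n default
# List the whole chain; whichever resource exists but never goes Ready
# is where issuance is stuck.
kubectl get certificaterequest,order,challenge -n default
```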
Problem 1: The search . Domain
I started digging into /etc/resolv.conf:
nameserver 127.0.0.53
options edns0 trust-ad
search .
That search . line looked suspicious. Turns out, Hetzner’s DHCP configuration adds this, and it completely breaks DNS resolution inside Kubernetes pods. The problem is that when CoreDNS tries to resolve external domains, it forwards to /etc/resolv.conf, which points to 127.0.0.53 (systemd-resolved on the host). But pods can’t reach the host’s systemd-resolved, so every DNS query fails.
I found this issue documented here: https://github.com/cert-manager/cert-manager/issues/3394
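A quick way to confirm the breakage from inside the cluster is a throwaway pod (a sketch; assumes the busybox image can be pulled):

```shell
# nslookup from inside a pod; with CoreDNS forwarding to the host's
# 127.0.0.53, this fails while the same lookup from the host works.
kubectl run dns-test --image=busybox:1.36 --rm -it --restart=Never -- \
  nslookup acme-v02.api.letsencrypt.org
```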
The fix required multiple steps:
Step 1: Fix the host’s resolv.conf
sudo nano /etc/netplan/01-netcfg.yaml
Added, nested under the interface entry:
      dhcp4-overrides:
        use-domains: false
This tells netplan to ignore the search domain from DHCP. Then:
sudo netplan apply
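For reference, the override has to sit inside the interface definition. A sketch of the full file, with the interface name from my server and dhcp4: true assumed:

```yaml
# /etc/netplan/01-netcfg.yaml (sketch; your interface name may differ)
network:
  version: 2
  ethernets:
    enp195s0:
      dhcp4: true
      dhcp4-overrides:
        use-domains: false
```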
But the search . line kept coming back. systemd-resolved was stubborn. So I took the nuclear option:
sudo rm /etc/resolv.conf
sudo bash -c 'cat > /etc/resolv.conf << EOF
nameserver 8.8.8.8
nameserver 8.8.4.4
EOF'
sudo chattr +i /etc/resolv.conf
Making it immutable with chattr +i prevents anything, including systemd-resolved and DHCP hooks, from changing it. (Run sudo chattr -i /etc/resolv.conf if you ever need to edit it again.)
Step 2: Fix CoreDNS
The pods were still using CoreDNS, which was forwarding to the broken /etc/resolv.conf. I had to tell CoreDNS to use public DNS directly:
kubectl edit configmap coredns -n kube-system
Changed:
forward . /etc/resolv.conf
To:
forward . 8.8.8.8 8.8.4.4
Then restarted CoreDNS:
kubectl rollout restart deployment coredns -n kube-system
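After the edit, the relevant part of the Corefile looks roughly like this (a sketch; the other default k3s plugins are left as they were):

```
.:53 {
    # ...other default plugins unchanged...
    forward . 8.8.8.8 8.8.4.4
    cache 30
}
```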
Now pods could resolve external domains. Progress!
Problem 2: Outbound HTTPS Was Blocked
But cert-manager still couldn’t reach Let’s Encrypt. DNS worked (I tested with nslookup), but HTTPS connections timed out.
I tested from a pod:
kubectl run test --image=busybox --rm -it -- telnet 8.8.8.8 443
Timeout. Port 443 outbound was completely blocked.
The culprit? My iptables rules. Earlier, I had added rules to forward ports 80 and 443 to K3s’ NodePorts:
sudo iptables -t nat -A PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 31994
This rule was too broad. It redirected ALL port 443 traffic, including outbound connections from pods trying to reach Let’s Encrypt. Every HTTPS request was being hijacked and redirected to Traefik.
The fix:
# Delete the bad rule
sudo iptables -t nat -D PREROUTING -p tcp --dport 443 -j REDIRECT --to-port 31994
# Add a more specific rule
sudo iptables -t nat -I PREROUTING -i enp195s0 -p tcp --dport 443 -j REDIRECT --to-port 31994
The -i enp195s0 flag scopes the rule to traffic arriving on the external network interface, so packets originating from pods, which enter netfilter via the CNI bridge instead, are left alone.
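To double-check which rules are active, a verbose listing of the chain is handy (needs root):

```shell
# The verbose listing includes an 'in' interface column; after the fix,
# the 443 REDIRECT should show enp195s0 there instead of '*'.
sudo iptables -t nat -L PREROUTING -n -v --line-numbers
```

Note that -I inserts the rule at the top of the chain, so it matches before any leftover broader rules; -A would have appended it at the bottom.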
Now pods could make outbound HTTPS connections.
Problem 3: Nothing Listening on Port 80
cert-manager could finally reach Let’s Encrypt. The ClusterIssuer showed READY: True. I deleted the old certificate and triggered a fresh attempt.
The challenge was created. The solver pod started. And then:
Warning Failed Accepting challenge authorization failed:
acme: authorization error for computergeneratedreality.com:
400 urn:ietf:params:acme:error:connection:
Fetching http://computergeneratedreality.com/.well-known/acme-challenge/...:
Connection refused
Connection refused on port 80. But I could curl my domain from the server and get a response. What was happening?
I tested from my local machine:
curl -v http://ip-address
Connection refused.
From the server:
curl -v http://localhost:80
404 from nginx. It worked locally but not externally.
The issue: I could connect from the server because locally generated traffic goes through the OUTPUT chain rather than PREROUTING, so it bypassed the redirect rules. But external connections (like Let’s Encrypt’s validation) were hitting port 80 on my server and getting connection refused because nothing was actually LISTENING on port 80.
K3s uses ServiceLB (svclb pods) to handle this - they’re supposed to forward traffic from 80→30179 and 443→31994. But for some reason, it wasn’t working. The svclb pod was running, but not binding to the ports.
I checked:
sudo netstat -tlnp | grep ':80 '
Nothing. No process was listening on port 80.
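ss gives the same answer as netstat, and checking the svclb pods narrows down whether ServiceLB itself is the problem (a sketch):

```shell
# Modern equivalent of the netstat check above.
sudo ss -tlnp | grep -E ':(80|443) '
# ServiceLB runs an svclb-traefik pod per node; if it's Running here but
# there is no listener above, the port forwarding inside it is broken.
kubectl get pods -n kube-system -o wide | grep svclb
```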
The Nuclear Option That Actually Worked
I edited the Traefik HelmChart to use hostPort directly:
kubectl edit helmchart traefik -n kube-system
Added to the valuesContent section:
ports:
  web:
    hostPort: 80
  websecure:
    hostPort: 443
This makes Traefik bind directly to ports 80 and 443 on the host, bypassing all the ServiceLB complexity. K3s automatically redeployed Traefik.
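Put together, the edited object looks roughly like this (a sketch; most chart fields omitted):

```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: traefik
  namespace: kube-system
spec:
  # valuesContent is a YAML string handed to the Helm chart.
  valuesContent: |-
    ports:
      web:
        hostPort: 80
      websecure:
        hostPort: 443
```

On k3s, a separate HelmChartConfig object carrying the same valuesContent is the more durable route, since k3s can overwrite direct edits to its bundled HelmChart on upgrade.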
I tested again from my local machine:
curl -v http://ip-address
Success! 404 from Traefik, which meant it was reachable.
Finally: Success
I deleted all the failed certificates and triggered a fresh attempt:
kubectl delete certificate blog-tls -n blog
kubectl delete certificaterequest -n blog --all
kubectl delete order -n blog --all
kubectl delete challenge -n blog --all
Created a clean certificate:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: blog-tls
  namespace: blog
spec:
  secretName: blog-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - computergeneratedreality.com
Applied it:
kubectl apply -f ~/cgr-infrastructure/k8s-configs/blog/blog-certificate.yaml
Watched the magic happen:
kubectl get certificate,challenge -n blog -w
The solver pod appeared. The challenge went to “Processing”. And then:
NAME READY SECRET AGE
blog-tls True blog-tls 45s
True. Finally true.
What I Learned
- Hetzner’s DHCP adds a search . line that breaks Kubernetes DNS. Disable it in netplan with dhcp4-overrides: use-domains: false; if that doesn’t stick, make your resolv.conf immutable.
- Configure CoreDNS to forward to public DNS directly instead of the host’s resolv.conf, avoiding the systemd-resolved complexity. (This should have been enough on its own; in my case it wasn’t, and I still had to change resolv.conf manually.)
- iptables rules need interface specifications. Use -i enp195s0 so a rule only applies to traffic arriving on the external interface, not to traffic originating from pods.
- K3s’ ServiceLB doesn’t always work. When it fails, hostPort is a valid workaround for single-node clusters: just bind Traefik directly to ports 80 and 443.
- Browser DNS caching is real. After everything was working, my browser still couldn’t resolve the domain because it had cached the NXDOMAIN response. Incognito mode and clearing the browser’s DNS cache (chrome://net-internals/#dns) didn’t fix it either, but it works on my phone, and that’s good enough for now.
The blog is now live with automatic SSL certificates. Was it worth the debugging marathon? We’ll find out in 90 days when the first renewal happens.
Next up: AI inference pods and getting llama.cpp running in the cluster.