I’ve been sort-of following the series that Lee Carpenter is doing over at carpie.net, but for a while I was hung up on getting cert-manager to work. The specific failure mode I had was this:
My external IP address (the IP assigned to my router by the cable company) for some reason isn’t routed correctly from inside my home network. The IP responds to pings, and DNS resolves it, but SSH, HTTP and HTTPS connections (and presumably any other TCP connections) hang indefinitely. This appears to be a router issue, since my router, a TP-Link Archer 20-based model, doesn’t use an alternate port for its web admin UI. The router presents the UI on port 443 with a self-signed certificate, and redirects port 80 traffic to 443. I suspect that the web server embedded in the router’s firmware is catching my web connections (the ones that originate inside the network) and doesn’t know what to do with them, so they just hang.
External connections are properly routed, as I’ve got port-forwarding configured to send the traffic to the kube cluster.
Here’s why this is a problem: cert-manager has a “sanity check” it runs before issuing a certificate request; if you are using the http01 verification strategy, cert-manager tries to reach the verification challenge response URL before it sends any cert requests to letsencrypt. This makes sense, since there’s no reason to send a request if letsencrypt can’t find the verification challenge response.
Except, in this case, the response actually is correctly configured, and if you hit that URL from outside of my home network, you would see it. The sanity check, however, runs inside the network, so it was failing, and thus no certificate for me!
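If you want to see what that sanity check sees, you can approximate it by hand. The following is only a rough sketch, not cert-manager's actual code: it assumes the standard ACME http01 path (`/.well-known/acme-challenge/<token>`), and the domain and token are hypothetical placeholders you'd replace with your own. Run it from inside the network and it should hang until the timeout, exactly like my connections did; run it from outside and it should come back with a response.

```python
import urllib.error
import urllib.request


def challenge_url(domain, token):
    """Build the http01 challenge URL for a domain and token (ACME standard path)."""
    return f"http://{domain}/.well-known/acme-challenge/{token}"


def self_check(url, timeout=5.0):
    """Try to fetch the URL and describe what happened, without raising."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return f"reachable: HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, just not with a success status.
        return f"reachable, but HTTP {exc.code}"
    except OSError as exc:
        # Covers connection failures, DNS failures, and timeouts.
        return f"unreachable: {exc}"


if __name__ == "__main__":
    # Hypothetical values: substitute your own domain and challenge token.
    url = challenge_url("kube.example.com", "some-token")
    print(url)
    print(self_check(url))
```

The short timeout matters here: without one, the request would hang indefinitely, which is the very symptom being diagnosed.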
The solution to this was simple: I run pi-hole on my home network, as both a DHCP server and a DNS server. So all I had to do was “spoof” my external DNS name on the internal network, so that it resolved to the internal address of the kube cluster, rather than the external address of the router.
At least, it sounds simple. In reality, it proved to be difficult, mainly because I made a decision when I started building my cluster to use Ubuntu server (which is a full 64-bit OS) rather than Raspbian (which runs userspace in 32-bit, even on Raspberry Pi 4). And I’m running Ubuntu 19.10, which means that (by default) I’m using systemd-resolved to handle DNS resolution.
I got over my distaste for systemd long ago, but man, systemd-resolved is pure evil. If you think you understand how Linux DNS resolution works, be prepared to feel dumb. I won’t go into all the reasons why I think what they’ve done with resolution in systemd is evil, but I will say this: no matter what I did, the cert-manager pods seemed not to use my internal DNS server until I fully disabled (and apt purged) systemd-resolved, and did a whole bunch of other stuff to get resolv.conf back to what anyone who’s used Unix for 30 years would expect.
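A quick way to tell whether a given machine (or pod) is actually picking up the spoofed record is to ask its resolver for the name and see whether the answer is an internal address. This is a minimal sketch under a few assumptions: the hostname is a hypothetical placeholder, and "internal" is taken to mean "an address Python's `ipaddress` module classifies as private", which fits a typical home network but isn't a universal rule. Keep in mind that running it on a node only tells you what the node sees; the pods may resolve names differently, which is what bit me.

```python
import ipaddress
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the system resolver gives."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def is_internal(address):
    """True if the address is in a private range such as 192.168.0.0/16."""
    return ipaddress.ip_address(address).is_private


def override_in_effect(addresses):
    """True only if there is at least one answer and every answer is internal."""
    return bool(addresses) and all(is_internal(a) for a in addresses)


if __name__ == "__main__":
    # Hypothetical name: substitute the external DNS name you are spoofing.
    hostname = "kube.example.com"
    addresses = resolve_ipv4(hostname)
    print(hostname, "->", addresses)
    if override_in_effect(addresses):
        print("internal DNS override is in effect")
    else:
        print("still resolving to a non-internal address")
```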
I actually walked away from this for a while, because it was so frustrating. And in the course of trying to figure out what was wrong, I rebuilt the kube cluster without traefik, and installed metallb and nginx using Grégoire Jeanmart’s helpful articles as a guide. Let me be clear: traefik was NOT the problem, and not even related. My issue was with DNS. But at this point, I’ve got the cluster working with cert-manager, so I think I’m just going to leave it the way it is for now.