This is going to be kind of a rambling post. Sorry in advance.
I’ve done some reworking of both the hardware and software. These are presented in no particular order.
My original stack had a 1TB external HD wired up via a SATA2-to-USB3 connector, which has proved problematic; I was connecting it to the master node, but I had persistent low-voltage warnings. I’ve got some plans to run the HD off a powered USB hub, but with COVID-19 messing with everyone’s work and shipping schedules, it might be a month before the required parts arrive. So in the meantime, I’ve replaced the drive with a USB flash drive. This is only a 16GB drive, but it’s enough to let me tinker with NFS and persistent volumes.
With that settled, I’ve done some Ansible work to get an NFS server configured on the master node, and have NFS clients running on each of the worker nodes. This gives me a common pool for persistent volume claims to work off. I haven’t actually started using PVCs yet, so I have no idea how well this will work, but it’s a start.
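For anyone wondering what the server side of that boils down to, here is a minimal sketch in Python (not the Ansible I actually used) of building the kind of line that ends up in /etc/exports. The mount point, the subnet and the `exports_line` helper are all made up for illustration, and the options shown are just a common starting set, not a recommendation.

```python
from ipaddress import ip_network


def exports_line(path, clients, options=("rw", "sync", "no_subtree_check")):
    """Build one /etc/exports entry sharing `path` with each client network."""
    if not path.startswith("/"):
        raise ValueError("export path must be absolute")
    opts = ",".join(options)
    # ip_network() validates each client spec and normalises it to CIDR form.
    specs = [f"{ip_network(c, strict=False)}({opts})" for c in clients]
    return path + " " + " ".join(specs)


if __name__ == "__main__":
    # Hypothetical mount point and LAN subnet; substitute your own.
    print(exports_line("/mnt/usb/k8s", ["192.168.1.0/24"]))
```

The worker nodes would then mount that export as ordinary NFS clients; nothing here touches Kubernetes yet.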
I’ve added a Rock Pi 4 from Radxa to the stack. Eventually, once the power issues are resolved, my plan is to convert this to a dedicated NAS (perhaps using OpenMediaVault) and take the NFS server burden off the master node. The Rock Pi might be a challenge, as support for it seems spotty and it seems to run only off images specifically created for it; we will see how this experiment works out. If all else fails, I’ve managed to pick up an Atomic Pi that might serve nicely.
I’ve replaced the BrambleCase with a Cloudlet Cluster case, also from C4Labs. I like the BrambleCase a lot, but the Cloudlet case offers easier access to the boards installed in it, which works better at this phase of the project. I’d still recommend, guardedly, the BrambleCase; it’s a fine piece of engineering, albeit a bit tough to assemble (especially with my near-60 eyes and fingers). I’ve kept the Bramble for some other RPi projects I’m planning.
After struggling with internal name resolution issues, I’ve made two sweeping changes. First, I’ve added a dedicated RPi 4 running Pi-hole. My main reason for doing this is that I’ve got a DNS spoofing requirement, which I’ll cover below. The second change was to systematically disable systemd-resolved on all the Ubuntu 19 systems I’m running (which includes Kepler, my day-to-day Linux desktop, which is built off an old Mac Mini, and which probably deserves a whole series of posts itself). I have had nothing but grief and misfortune with systemd-resolved, and it’s bad enough that I’ve decided to disable it anywhere I can. There are a lot of critiques and defenses of systemd and related projects out on the web; I won’t go into the controversies, because I’m not firmly in either the pro- or anti- camp as far as systemd goes. But systemd-resolved violates the principle of least surprise, and the way it works both obscures DNS resolution and intentionally breaks how the classic resolv.conf/glibc resolver works. Systemd-resolved expects a world where there is one contiguous DNS namespace, and all DNS servers agree on all hosts. That doesn’t work for internal networks, which covers basically every corporate network and a lot of home networks as well.
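A quick way to tell whether a box is still going through systemd-resolved is to look at which nameserver its resolv.conf points at: the stub listener sits on 127.0.0.53. Here is a rough Python sketch of that check. The function names and the sample text are mine, and the Pi-hole address in the sample is invented.

```python
STUB_ADDRESS = "127.0.0.53"  # systemd-resolved's local stub listener


def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        if not line or line[0] in "#;":  # skip blank lines and comments
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_resolved_stub(resolv_conf_text):
    """True if any configured nameserver is the systemd-resolved stub."""
    return STUB_ADDRESS in nameservers(resolv_conf_text)


if __name__ == "__main__":
    # Sample text only; 192.168.1.2 is a made-up Pi-hole address.
    sample = "# managed by hand\nnameserver 192.168.1.2\nsearch lan\n"
    print(nameservers(sample), uses_resolved_stub(sample))
```

To check a real machine, pass it the contents of /etc/resolv.conf instead of the sample string.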
I’ve been trying to follow the series that Lee Carpenter has been doing on his RPi/k3s cluster, but I am hung up on getting cert-manager to work. I’ll update more on those issues in another post once I land on a solution, but the gist of things is: the external interface of my router is not reachable from my internal net. As a result, cert-manager fails its self-check, because the self-check tries to make sure that the ACME challenge URL is reachable (from its container) before it actually forwards the request to Let’s Encrypt. This doesn’t work with “regular” DNS for me, because the internet DNS resolves project.kube.thejimnicholson.com to my external IP, and the container (inside my network) can’t reach that IP. To try to solve this, I use dnsmasq internally (via Pi-hole) to answer with an internal address instead. So far, this hasn’t helped, which I’ve narrowed down to one of two things: either CoreDNS is configured wrong in my cluster, or the cert-manager containers are hard-wired to use some external DNS rather than refer back to the node’s DNS configuration for resolving names. I’ll have more to say about this once I’ve solved it.
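While I chase this down, a small Python sketch like the one below is handy for seeing which answer a given machine actually gets for the hostname. The helper names are mine. Note the big caveat: run on a node, it tests the node’s resolver, not whatever a pod sees, so it can only rule things in or out on the Pi-hole side.

```python
import socket
from ipaddress import ip_address


def resolve(hostname):
    """Return the sorted set of addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None)
    return sorted({info[4][0] for info in infos})


def internal_override_active(addresses):
    """True only if every answer is a private address, i.e. the spoofed record won."""
    return bool(addresses) and all(ip_address(a).is_private for a in addresses)


if __name__ == "__main__":
    # No lookup here, just the classification; both addresses are examples.
    print(internal_override_active(["192.168.1.50"]))  # spoofed internal answer
    print(internal_override_active(["8.8.8.8"]))       # a public answer
```

Calling `internal_override_active(resolve("project.kube.thejimnicholson.com"))` from a machine that uses the Pi-hole should come back true if the override is working there; the same check from inside a container would be the more telling one.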