
Using letsencrypt with Zentyal

In this article, I will describe the steps I took to use acme.sh to obtain and manage TLS certificates for a Zentyal server running in my homelab.

Background

On my home network, I run a single virtualized instance of Zentyal Developers Edition to act as an internal DNS server. Let me make a few things clear about this server up front:

  • I am in no sense fully utilizing Zentyal. Zentyal can provide a multitude of services to a LAN, including acting as a Domain Controller, DHCP server, Mail server (with optional filtering and a Web Mail interface), NTP service, RADIUS, VPN services and more. I’m using it for DNS, which arguably isn’t even scratching the surface of Zentyal’s capabilities.
  • There is a dearth of open source DNS server offerings for small networks that fit my ideal requirements, which would include:
    • Providing an API service (hopefully compatible with the API for something like AWS Route 53) that would support automation.
    • Providing a web user interface for maintaining DNS.
    • Including the ability to manage zones across providers. For example, I want to manage and serve “home.thejimnicholson.com” within my home network, but “thejimnicholson.com” would reside in an external DNS server or provider.

One of the things I’ve been trying to do within my home networking projects is to use TLS certificates from an actual, recognized “official” issuer. This is possible because I own the domain I’m using for internal names, and free certificate authorities like LetsEncrypt and ZeroSSL support the use of DNS TXT records for issuer challenges. This means that rather than authenticating a certificate request by issuing an HTTP request to a web server on the target domain, the issuer

  • generates a text secret that it returns to the requester,
  • sleeps for a bit to allow the requester to store the secret as a TXT record in DNS and then
  • issues a DNS lookup for a specific TXT record and ensures that the value returned matches (you can check that record yourself with dig, as shown below).
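
As an aside, you can watch this happen yourself: ACME DNS-01 challenges always live at the _acme-challenge label under the name being validated. A minimal check, using the same HOST_FQDN variable that appears later in this post:

# verify the challenge record the issuer will look up (DNS-01)
dig +short TXT "_acme-challenge.${HOST_FQDN}"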

ACME is a protocol for requesting, obtaining and renewing TLS certificates. The protocol is used by many certificate authorities and services, and has been implemented by many certificate management vendors within their products. There are a number of client tools that can be used to interact with a service that supports the ACME protocol; the best-known of these is certbot from the Electronic Frontier Foundation. While certbot is a fine tool for the job, it has a potential drawback in that it is written in Python, and thus requires a Python interpreter to work. An alternative client, acme.sh, implemented entirely as a shell script, offers similar capabilities with a simpler deployment process.

Zentyal’s own documentation for using LetsEncrypt TLS for its web administration interface is (sadly, like most of their documentation) terse and a bit incomplete. I was able to find a snippet for using acme.sh with GoDaddy on a Zentyal server, and adapt it to use AWS Route 53, my provider.

Installing acme.sh

Log into the Zentyal server using ssh, with an account that has administrator privileges (i.e., one that can run sudo).

You can either run the quick install as outlined at get.acme.sh, or clone the project source and run the installer. It shouldn’t make a difference which you use; if you choose to clone the project, make sure that git and socat are installed on the server first:

sudo -i   # the remaining steps run in a root shell
apt install -y git socat
cd /tmp
git clone https://github.com/acmesh-official/acme.sh.git
cd acme.sh
./acme.sh --install --accountemail <YOUR EMAIL>

After the installation, you will need to restart your root shell session to pick up the changes that the install process makes to root’s .bashrc. The simplest way to do this is to exit the shell and start it with sudo again.

If everything went well, you will now have a ~/.acme.sh directory containing the installed script, and a shell alias acme.sh that runs it.
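
A quick way to confirm the install took: the alias comes from a line the installer appends to root’s .bashrc, and the installer also registers a daily cron job for renewals.

acme.sh --version          # should print the acme.sh version
crontab -l | grep acme.sh  # the renewal cron entry added by the installer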

AWS Credentials

There are various tutorials for setting up an IAM user with access keys and policy rights to interact with Route 53. I followed this one. The bottom line is that before you can get certificates, you need to set these environment variables:

export AWS_ACCESS_KEY_ID="<your access key ID>"
export AWS_SECRET_ACCESS_KEY="<your secret access key>"
export HOST_FQDN="<your fully qualified domain name>"
export ACCOUNT_EMAIL="<your email address>"
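
Since these are secrets, I don’t want them sitting in my shell history. One option (just a sketch; the file name is my own choice) is to keep them in a root-only file and source it before running acme.sh. Note that after the first successful issue, acme.sh saves the AWS keys into its account.conf so that unattended renewals keep working.

cat > /root/.acme-aws-env <<'EOF'
export AWS_ACCESS_KEY_ID="<your access key ID>"
export AWS_SECRET_ACCESS_KEY="<your secret access key>"
export HOST_FQDN="<your fully qualified domain name>"
export ACCOUNT_EMAIL="<your email address>"
EOF
chmod 600 /root/.acme-aws-env
. /root/.acme-aws-env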

Choosing a CA

acme.sh is configured to use ZeroSSL by default. If you want to use that issuer, you need to read this page, which covers the initial steps required to register your ZeroSSL credentials with the script.

I decided to use LetsEncrypt, and to make it the default issuer. To do this, run the command

acme.sh --set-default-ca --server letsencrypt

Obtaining a certificate

Assuming you set the environment variables above, you can obtain a certificate by running this command

acme.sh --issue -d ${HOST_FQDN} --dns dns_aws --ocsp-must-staple \
   --keylength 4096  --force

The script will run for a while, and the result should be a brand-new certificate.
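
Before going any further, it’s worth sanity-checking the result. You can list what acme.sh is now managing and inspect the certificate dates (the path follows acme.sh’s default layout under /root/.acme.sh, the same one used in the install step below):

acme.sh --list
openssl x509 -noout -subject -dates \
   -in "/root/.acme.sh/${HOST_FQDN}/${HOST_FQDN}.cer"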

Installing the certificate in Zentyal

This is the tricky part. acme.sh has automation that will keep your certificates up to date, but you need to use the script to do the installation. Fortunately, there’s a way to do this:

acme.sh --install-cert -d ${HOST_FQDN}  \
  --reloadcmd "cat /root/.acme.sh/${HOST_FQDN}/${HOST_FQDN}.cer /root/.acme.sh/${HOST_FQDN}/${HOST_FQDN}.key > /var/lib/zentyal/conf/ssl/ssl.pem && systemctl restart zentyal.webadmin-nginx.service"

What all this does is concatenate the certificate and key produced when the certificate was issued into a single .pem file, put that file where Zentyal expects it, and then restart the web admin interface to pick up the new cert.
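
acme.sh records both the certificate and the reloadcmd, so the daily cron job it installed will renew the certificate as it nears expiry and re-run that same command to refresh Zentyal. If you want to prove the whole chain works end to end, you can force a renewal by hand (this hits the CA, so don’t run it repeatedly):

acme.sh --renew -d ${HOST_FQDN} --force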

If everything goes well, you should now have an “official” certificate for your server.

Project Status update

This Raspberry Pi cluster with 3 control plane nodes and 7 workers has been running for 111 days now. Here are some observations:

  • /var/log just grows and grows. My master nodes especially seem to be writing to k3s-service.log a lot, and there’s no log rotation happening. I can ssh into the nodes and remove the files occasionally while I look for a more automated solution. Ideally, /var/log would be a tmpfs mount.
  • The system-upgrade process seems to be working (all my nodes currently show version v1.20.11+k3s1). However, occasionally a node will remain cordoned, and the result will affect pods that use Longhorn storage. When this happens, the “apply” job will get a “Job was active longer than specified deadline” event posted. Typically, I can resolve these by uncordoning the node, as sketched after this list.
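
For reference, clearing a stuck node is just a matter of removing the cordon (the node name below is illustrative):

kubectl get nodes        # cordoned nodes show SchedulingDisabled
kubectl uncordon <node-name>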

In my next post, I’ll describe how I created the k3os images for the cluster, how I brought the cluster up, and how things have been installed on it.

The state of the cluster – 2021 – part 1 (hardware)

It’s been a while since I’ve posted, and there have been a lot of changes in the lab. The cluster has grown, and in the process it has been reshaped several times. I’ll try to cover all of what’s happened in this post. This project has always been about learning, and I’ve learned a lot.

Pi-Kube consists of a cluster of 10 Raspberry Pi 4b single-board computers, housed in a Pico 10H cluster case. I went with the Picocluster case out of frustration with trying to find a good power supply solution for multiple Raspberry Pi boards. Without doing a full review, I’ll just say that the Pico 10H has been mostly satisfactory: I’ve had no power issues with a full cluster of 4b boards, the unit is extremely quiet, and it provides more than adequate cooling for the cluster.

The only technical downside I’ve encountered is with the included 8-port switches (the 10H includes two of these), which seem to drop enough packets to make PXE booting the cluster problematic; 2-3 boards would consistently fail to boot every time I cycled power. I ultimately switched to a 16-port TP-Link switch to resolve this, and then later abandoned network booting entirely when I switched to k3os.

The non-technical downside of the Pico 10H is the price, which puts it out of range for most single-board computer projects. The best I can say is that you’re paying for the engineering, and the engineering is mostly solid.

The other component of the cluster, aside from the aforementioned switch, is a NAS consisting of an Odroid XU-4 board in a Cloudshell 2 case, with two 4TB drives installed. This has been a reliable little storage appliance for me, but unfortunately it’s made up of largely discontinued or out-of-stock components. Based on my experience, I’d probably recommend Hardkernel components for a NAS project. Having said that, I’d suggest you look at a dedicated NAS from Synology as well; I use a Diskstation DS220J unit for other projects in my homelab, and I’m very happy with it.

As for disk storage, I’ve used both Seagate Ironwolf NAS-class and Western Digital Red drives for the past 2 years without any incidents.

In the next post, I’ll discuss the software I’m currently using to run the cluster, and talk about how I got there.

Using podman to run a postgresql server

I’m moving over to podman for containers where possible, because I like where the project is going. I have no specific objections to docker, and in fact use it for several of my own projects, but podman feels more kubernetes-ish to me somehow.

When I set up k3s, except for trivial instances, I generally use postgresql as the backend storage. Again, this is a personal preference, as I have a lot of experience with postgresql, and it’s the database I’m most comfortable using. But installing and maintaining postgresql can be a chore, and the project provides a really great container image that makes it easy to run. So I’ve decided to document in this post how I use podman to create a postgres instance.

#!/usr/bin/env bash

CONTAINER_NAME=k3spg
VOLUME_NAME=k3spg-data
DB_NAME=k3s
DB_USER=k3s
DB_PORT=5432

PASSWORD=$(date +%s | sha256sum | base64 | head -c 32 ; echo)
echo "${PASSWORD}" > ${CONTAINER_NAME}.pg-pw.txt

podman volume create ${VOLUME_NAME}
podman create \
        -v ${VOLUME_NAME}:/var/lib/postgresql/data \
        -e POSTGRES_PASSWORD=${PASSWORD} \
        -e POSTGRES_DB=${DB_NAME} \
        -e POSTGRES_USER=${DB_USER} \
        -p ${DB_PORT}:5432 \
        --name ${CONTAINER_NAME} \
        postgres:12.4
# Notes: podman create has no -d flag; the container is started by the
# systemd unit generated below, which defaults to Restart=on-failure,
# so --restart is not needed here. The -p mapping is host:container.

podman generate systemd \
        --new \
        --files \
        --name \
        ${CONTAINER_NAME}

echo "Your db user password is ${PASSWORD}"

What happened?


I’ve been taking a break. I’ve found that with the restrictions on activities due to (or rather, rationalized by) COVID-19, I’m having work-life balance issues all over the place.

I’m still working on the cluster, and I have some posts in the pipeline, but it’s going to be a while before I’m able to put in the kind of time required to maintain the pace I’ve had since March. That is, if I ever get back to that pace, which is questionable, because it’s manifestly unhealthy.

I hope everyone reading this is safe and healthy. These are trying times. Take breaks if you need them.

Progress on pvc storage


I’ve been quiet, but also busy. After building and configuring my Cloudshell 2, I played around with several potential uses for it. Meanwhile, I ran into several issues with the local-path provisioner in k3s. Some of these were due to my own sloppiness; I managed to create two postgresql pods that played in a lot of the same spaces, and all manner of mischief ensued.

Tonight, I got nfs-client-provisioner working. This involved setting up an nfs server on the Odroid xu4 that powers the Cloudshell2, which I’ll describe in another post.

The nfs provisioner for kubernetes requires that the nfs client be installed on all the nodes in the cluster. This turned out to be pretty easy. Ansible to the rescue:

ansible -m package -a 'name=nfs-common' master:workers

The next step was to set up the provisioner’s configuration. Most of what I did here was based on the NFS section of Isaac Johnson’s post at FreshBrewed.Science. I grabbed a copy of the sample config first:

wget https://raw.githubusercontent.com/jromers/k8s-ol-howto/master/nfs-client/values-nfs-client.yaml

These values need to be changed to match your NFS server configuration.

replicaCount: 1

nfs:
  server: 10.0.96.30
  path: /storage1/pvc
  mountOptions:

storageClass:
  archiveOnDelete: false

Once that was done, it was time to deploy the client. That can be accomplished via helm:

helm install nfs -f values-nfs-client.yaml \
    stable/nfs-client-provisioner \
    --set image.repository=quay.io/external_storage/nfs-client-provisioner-arm

Wait a few minutes for the deploy to finish, and there we have it:

$ kubectl get storageclasses
NAME                   PROVISIONER                                RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
local-path (default)   rancher.io/local-path                      Delete          WaitForFirstConsumer   false                  23d
nfs-client             cluster.local/nfs-nfs-client-provisioner   Delete          Immediate              true                   57m
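
With the storage class in place, any PVC that asks for nfs-client should end up as a directory under /storage1/pvc on the NAS. A quick way to test it (manifest applied from a shell here-doc; the claim name is arbitrary):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs-test
spec:
  storageClassName: nfs-client
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 1Gi
EOF
kubectl get pvc nfs-test   # should report STATUS Bound shortly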

Odroid XU-4 file server (CloudShell 2)

I’ve finished my NAS build. It’s built around an ODroid XU4 that I got from eBay. I replaced the fan/heatsink combination with a taller heatsink from AmeriDroid, which brought the CPU temperature down by 8°C. I picked up a pair of Seagate Ironwolf 1TB drives to use with it. All of this is housed in HardKernel’s CloudShell 2 for XU4 case. The CloudShell 2 can support RAID 0, RAID 1, spanned, or separate volumes; I went with RAID 1 out of paranoia. It’s running an ODroid-provided Ubuntu Bionic (18.04) minimal system with Samba and Gluster installed.

I didn’t much care for the display software provided by HardKernel, and there was some discussion about how those scripts are CPU-intensive. I tried a few things from scratch, but in the end I settled on Dave Burkhardt’s nmon wrapper scripts for the CloudShell 2 display, and I’m happy with the results.

For now, I’m using this mainly as a shared drive for the family.

The current state of pi-kube


A closer look at the cluster

Hardware and Operating System

Currently, the cluster consists of 5 Raspberry Pi 4B systems with 4GB of memory. Each of these has a 16GB micro SD card and a 16GB USB flash drive; the SD cards are used only for booting, and the root filesystem lives on the flash drives. I’m not overclocking anything in the cluster yet, mostly because I don’t trust the power supply arrangement enough.

The systems are running Ubuntu 20.04 minimal, the stock distribution code from Canonical.

A 6th Raspberry Pi 4B acts as a file server. It’s equipped with a 256GB USB flash drive. The file server runs Ubuntu 19.04 (it’s due for a rebuild soon). I use gluster to mount the flash drive on all of the cluster servers. Currently, I’m not using replication or sharding with gluster, although at some point I intend to pursue that. The file system is mounted as /data/shared on each member system in the cluster.
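
For the curious, mounting a gluster volume on each node is just an fstab entry along these lines (the server and volume names here are illustrative; the mount point matches the one above):

# /etc/fstab on each cluster member
fileserver:/shared   /data/shared   glusterfs   defaults,_netdev   0 0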

Kubernetes configuration

The system runs k3s from Rancher. One RPi 4B serves as the master (and is also a node); there are 4 RPi 4B nodes besides this.

The stock k3s configuration is deployed, using sqlite for storage and Traefik to manage ingresses. Cert-manager is used to manage letsencrypt requests. External (internet) DNS for ingress is provided by AWS Route 53, although this is not currently automated. My home network uses pihole for DNS and DHCP; static DHCP leases are used for all hardware nodes. pihole also provides internal DNS spoofing of the external domain names, since the external IP address of my router does not work on the internal network.

The entire system is deployed via a set of ansible playbooks. You can see those playbooks on GitHub.

Reboot


Last night, I tore down the entire stack and rebuilt it. This was to accommodate two things:

  • The release of Ubuntu 20.04, which required an upgrade, and
  • A hardware change involving using USB flash drives for the root file system on all nodes, rather than just a micro SD card.

As a result of this, I’ve come up with some changes to node provisioning – the steps required to go from bare hardware to an operating node. I’m planning a step-by-step guide for building out the cluster based on this. Stay tuned.

Getting Cert-manager to work


I’ve been sort-of following the series that Lee Carpenter is doing over at carpie.net, but for a while I was hung up on getting cert-manager to work. The specific failure mode I had was this:

My external IP address (the IP assigned to my router by the cable company) for some reason isn’t routed correctly from inside my home network. The IP responds to pings, and DNS resolves it, but any SSH, HTTP or HTTPS traffic (and presumably any other TCP connections) all hang indefinitely. This appears to be a router issue, since my router, a TP-Link Archer 20-based model, doesn’t use an alternate port for its web admin UI. The router presents the UI on port 443 with a self-signed certificate, and redirects port 80 traffic to 443. I suspect that the web server embedded in the router’s firmware is catching my web connections (the ones that originate inside the network) and doesn’t know what to do with them, so they just hang.

External connections are properly routed, as I’ve got port-forwarding configured to send the traffic to the kube cluster.

Here’s why this is a problem: cert-manager has a “sanity check” it runs before issuing a certificate request; if you are using the http01 verification strategy, cert-manager tries to reach the verification challenge response URL before it sends any cert requests to letsencrypt. This makes sense, since there’s no reason to send a request if letsencrypt can’t find the verification challenge response.

Except, in this case, the response actually is correctly configured, and if you hit that URL from outside of my home network, you would see it. The sanity check, however, runs from the inside, and thus it was failing, so no certificate for me!

The solution to this was simple: I run pi-hole on my home network, as both a DHCP server and a DNS server. So all I had to do was “spoof” my external DNS name on the internal network, so that it resolved to the internal address of the kube cluster, rather than the external address of the router.
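
The override itself is a one-liner in dnsmasq syntax, which Pi-hole picks up from /etc/dnsmasq.d/ (the file name, host name, and internal address below are illustrative):

# /etc/dnsmasq.d/99-internal-overrides.conf
address=/myapp.thejimnicholson.com/192.168.1.50
# restart the resolver so it rereads its config
pihole restartdns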

At least, it sounds simple. In reality, it proved to be difficult, mainly because I made a decision when I started building my cluster to use Ubuntu Server (which is a full 64-bit OS) rather than Raspbian (which runs userspace in 32-bit, even on Raspberry Pi 4). And I’m running Ubuntu 19.10, which means that (by default) I’m using systemd-resolved to handle DNS resolution.

I’ve long ago gotten over my distaste for systemd, but man, systemd-resolved is pure evil. If you think you understand how Linux DNS resolution works, be prepared to feel dumb. I won’t go into all the reasons why I think what they’ve done with resolution in systemd is evil, but I will say this: no matter what I did, the cert-manager pods seemed to not use my internal DNS server, until I fully disabled (and apt purged) systemd-resolved, and did a whole bunch of other stuff to get resolv.conf back to what anyone who’s used Unix for 30 years would expect.

I actually walked away from this for a while, because it was so frustrating. And in the course of trying to figure out what was wrong, I rebuilt the kube cluster without traefik, and installed metallb and nginx using Grégoire Jeanmart’s helpful articles as a guide. Let me be clear: traefik was NOT the problem, and not even related. My issue was with DNS. But at this point, I’ve got the cluster working with cert-manager, so I think I’m just going to leave it the way it is for now.