Resurrection and Confusion
January 14, 2025

Things just mostly started working again?
As I wrote about yesterday, the plan was to get home and start troubleshooting the issue with the one NUC. It was an absolutely horrible drive home in terms of weather, though certainly better than ice and cold. I was expecting to encounter some kind of issue that would be new to me and would therefore need an evening of research. At a bare minimum I was expecting some form of operating system corruption, and hopefully no hardware failures beyond, say, storage or memory. Once I got home it was a case of unplugging things so that I could plug them into the NUC, a further reminder that I need to invest in a KVM.
The moment arrived and the machine just… came back to life? I watched Ubuntu do its thing and all the services start to load on the monitor, and then it just sat there at the login prompt. I didn’t feel any relief; if anything, there was just pure annoyance. If there’s an issue, I’d rather experience it again so I can identify a root cause and hopefully fix it once and for all, not just have things magically working again!
Not that everything was perfect: my Tailscale containers were failing with some kind of panic and not starting up. I am eagerly awaiting the ability to run Tailscale in a high availability mode via their Operator, as all of the Tailscale Pods were scheduled to the one machine. Not ideal! I elected to just drain the Node, and once those Pods rescheduled, they were back to life. I’m not quite sure everything is fully solved with this machine, and I’ll probably need to figure out some way of doing a file system integrity check. Once my machine was back on I was able to look back at the data too, and I confirmed it was not a temperature issue like before, which only adds to the confusion about what originally went wrong. Looking at log files did not yield much, besides the realisation that I need to filter the container log files out of the host level monitoring, as it just fills up with permission denied errors.
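For reference, the drain itself was nothing fancy, and the integrity check will likely mean booting from a live USB so the root filesystem can be checked while unmounted. The node and device names below are placeholders, not my actual setup:

```bash
# Evict the Pods so the scheduler places them elsewhere ("nuc-01" is a placeholder)
kubectl drain nuc-01 --ignore-daemonsets --delete-emptydir-data

# Once the Pods are healthy again, allow workloads back onto the node
kubectl uncordon nuc-01

# Later, from a live USB with the disk unmounted: force a check of an ext4
# root partition (device name is a guess; confirm with lsblk first)
sudo fsck.ext4 -f /dev/nvme0n1p2
```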
In terms of what I am calling “Voyager Cluster V2”, the Raspberry Pi I needed for a five node etcd cluster arrived today (five members means the cluster can lose two and still keep quorum), and I have a short depth rack shelf that the SSDs will sit on, so I need to reorganise things in the rack in preparation. I checked UniFi too, and they do not support round robin DNS records, so my plan is to reintroduce Traefik into my setup and have it act as the L4 load balancer for the kube API server. The SSDs are due to arrive next Monday, and honestly, if the delivery date firms up and it is Monday, I’m strongly inclined to just take the day off work and spend a full day trying to get this working once and for all, so I can hopefully move on for a little while! Things will always break and that’s okay, but I’d prefer these breakages to be spread out a bit more, rather than coming one after the other!
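As a rough sketch of the Traefik piece, assuming the file provider and made-up addresses for the control plane nodes, the idea is a TCP entrypoint that passes TLS straight through to the kube API servers:

```yaml
# traefik.yml (static config): a dedicated entrypoint for kube API traffic
entryPoints:
  kube-api:
    address: ":6443"

providers:
  file:
    filename: /etc/traefik/dynamic.yml

---
# dynamic.yml: L4 (TCP) routing with TLS passthrough
tcp:
  routers:
    kube-api:
      entryPoints:
        - kube-api
      rule: "HostSNI(`*`)"  # match every TCP connection on the entrypoint
      tls:
        passthrough: true   # do not terminate TLS; the API servers handle it
      service: kube-api
  services:
    kube-api:
      loadBalancer:
        servers:
          - address: "192.168.1.21:6443"  # placeholder control plane IPs
          - address: "192.168.1.22:6443"
          - address: "192.168.1.23:6443"
```

Nothing clever going on here: Traefik just accepts connections on 6443 and spreads them across whichever API servers are reachable, which covers exactly the gap left by UniFi’s missing round robin support.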
Thank you!
You could have consumed content on any website, but you went ahead and consumed my content, so I'm very grateful! If you liked this, then you might like this other piece of content I worked on.
The previous post documenting these issues
Photographer
I've no real claim to fame when it comes to good photos, which is why the header photo for this post was shot by Jakub Żerdzicki. You can find more photos from them on Unsplash. Unsplash is a great place to source photos for your website, presentations and more! But it wouldn't be anything without the photographers who put in the work.
Find Them On Unsplash
Support what I do
I write for the love and passion I have for technology. Just reading and sharing my articles is more than enough. But if you want to offer more direct support, then you can support the running costs of my website by donating via Stripe. Only do so if you feel I have truly delivered value, but as I said, your readership is more than enough already. Thank you :)
Support My Work