Is this the end?

February 11, 2025

Probably not, but I’m just letting the dust settle

While I took a break from my current endeavours for a while, I jumped back in quite quickly. I’ve since taken a more deliberate break. This past weekend, besides some routine patching, I did literally nothing self-hosting related and it was honestly glorious. But it’s time to go over the steps I took after realising I had to lose an entire Kubernetes cluster, yet again, and why I feel / hope that the end of this story arc is approaching.

Wiping the RKE2 cluster, while a bit mentally painful, is at least straightforward. Depending on how you install RKE2, it ships with scripts to do the uninstall and cleanup for you. If you don’t have the scripts on the machine, they’re quite easy to find online. I got all the machines back to a clean slate and elected to change tactics a bit and use K3s instead of RKE2.

RKE2 actually uses parts of K3s under the hood, but I figured that since K3s is positioned directly for local / edge use of Kubernetes, it would be a better fit. The install process was quite straightforward. I did experience some confusion when trying to add a second machine, in that I think I grabbed the wrong password value. Seemingly there’s the node token and then a separate token for a server. That might have been a hangover in my mind from how you join RKE2 machines, and once I rectified it, things went smoothly. I rolled through the Raspberry Pis first, since I wanted the control plane to be stable before revisiting the NUCs. I kept a little to-do list in a doc of everything I thought of. It certainly helped me keep on the straight and narrow and ensure I configured some things properly out of the gate.

Once I had my Pis all configured, I worked on the NUCs. As is the trend, these things did “just work” initially. The main thing was that the CIDR issues I saw previously were no more and services were indeed working. I started to restore things as best I could, but I still stayed away from anything that needed to maintain state or data. I have plans to get Velero working on this cluster instead of relying on Longhorn, as I at least know that Velero works for me and I have experience with its restore process as well.

Again, by midday things were looking good and a natural stopping point was approaching. As I shifted towards other hobbies, I eventually noticed that one of the NUCs was encountering issues once more. My secret power this time, however, was that past me had left the HDMI cable plugged into the machine. So I changed monitor inputs to see whether there were any logs on the screen and, lo and behold, I discovered a paper trail!

What seems to be happening is that the OS encounters some kind of bad blocks on the system drive. This leads to something close to a panic and the filesystem gets remounted in read-only mode. This is why I was seeing the machine go completely offline, even though it was not that hot after I adjusted fan speeds in the BIOS. I saw the issue occur one or two more times, so I removed the node from the cluster, which I can do easily now with my more reliable cluster. I’ve tried to run some file system integrity checks, but based on what I’ve read online, this seems to have gotten harder to do. It’s important to note too that I’ve seen the machine go offline even with no workloads running. I still don’t truly know if I can rule out temperatures.
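For anyone wanting to catch the read-only remount without relying on a monitor being plugged in, here is a minimal sketch of how it could be detected. It is not something I actually ran on the NUC. It assumes a Linux host that exposes the usual mount table at /proc/mounts, and the function names (`parse_mounts`, `read_only_mounts`, `check_host`) are made up for illustration.

```python
def parse_mounts(text):
    """Parse /proc/mounts-style text into (device, mount_point, fstype, options) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        device, mount_point, fstype, options = fields[:4]
        entries.append((device, mount_point, fstype, options.split(",")))
    return entries


def read_only_mounts(text, watch=("/",)):
    """Return the watched mount points that are currently mounted read-only.

    Only the exact option "ro" counts; an option such as
    "errors=remount-ro" describes the error policy, not the current state.
    """
    return [
        mount_point
        for _, mount_point, _, options in parse_mounts(text)
        if mount_point in watch and "ro" in options
    ]


def check_host(path="/proc/mounts", watch=("/",)):
    """Read the live mount table (assumed path on Linux) and report read-only mounts."""
    with open(path) as handle:
        return read_only_mounts(handle.read(), watch)
```

Calling `check_host()` from a cron job or a small systemd timer and alerting on a non-empty result would be one way to use it. It only tells you that the remount happened, not why, so the kernel logs are still where the real answer lives.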

But for now my plan is to keep trying to force a file system integrity check, then install a brand new drive in the system, redo the operating system and see what happens. There is always a chance I got a bad drive, or that the operating system got corrupted after too many power cuts. I’ve also since discovered that I need a new UPS battery, yet more expense!

Hopefully I can summon up the effort to perform these changes, because there’s certainly a fear of what else will be around the corner. But I do think, now more than ever, that this could be the end of this particular journey and that I can focus back on building services on top of my infrastructure rather than debugging it.

Will just have to see what I report back with next time!

Thank you!

You could have consumed content on any website, but you went ahead and consumed my content, so I'm very grateful! If you liked this, then you might like this other piece of content I worked on.

The previous post in this mini series

Photographer

I've no real claim to fame when it comes to good photos, which is why the header photo for this post was shot by Frédérick Tubiermont. You can find some more photos from them on Unsplash. Unsplash is a great place to source photos for your website, presentation and more! But it wouldn't be anything without the photographers who put in the work.


Support what I do

I write for the love and passion I have for technology. Just reading and sharing my articles is more than enough. But if you want to offer more direct support, then you can support the running costs of my website by donating via Stripe. Only do so if you feel I have truly delivered value, but as I said, your readership is more than enough already. Thank you :)

Email

contact at evanday dot dev
