Death Of A Cluster

December 19, 2025

etcd fail, backup fail, Claude win?

Back in September, I performed some routine updates of my infrastructure. I have this nice Saturday morning routine where I get my coffee, play some chill music and go about running updates across all my hosts, applying any pending software updates on tools like Home Assistant, and updating any client workloads. It's quite nice. I could automate a lot of this with, say, unattended-upgrades, but I like the routine and I also like being around for an upgrade in case it goes south. With my Kubernetes cluster, there's obviously the host operating system itself, and then the cluster software: in my case, K3s.

I have the K3s side automated quite nicely with the Rancher system upgrade controller, where I have Plan CRDs configured to upgrade to the latest patch of a specific minor release. I don't really follow any prescribed policy of, say, n-1 when it comes to K3s minor releases; it's more that I notice when there's a new release, give it a few days and then do the upgrade. On the host side I just do things manually, but when I need to reboot for a kernel upgrade, I'll generally use kubectl to cordon and drain a Node. Once that's finished, I'll reboot the Node, wait a while, uncordon, rinse and repeat.
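
For anyone unfamiliar, that per-node dance is nothing more exotic than the following. The node name is just an example, and the drain flags are the ones I tend to reach for rather than gospel.

    # Stop new Pods landing on the node, then evict what's already there
    kubectl cordon pi-worker-01
    kubectl drain pi-worker-01 --ignore-daemonsets --delete-emptydir-data
    # Reboot the host for the kernel update, then wait for it to report Ready
    ssh pi-worker-01 'sudo reboot'
    kubectl wait --for=condition=Ready node/pi-worker-01 --timeout=10m
    kubectl uncordon pi-worker-01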

Since the cluster had five Pi 4s for a control plane and three worker nodes, this could take a while, so I'd generally leave it for a while and then do all the work in one go. This is what happened on that fateful Saturday, and seemingly all went well. The cluster is never truly down, since I'm just going one node at a time, and everything came back nicely. Another component of my Saturday is my walk, so I headed out for that, pretty content with everything.

Once I got back home, something seemed amiss. kubectl commands were timing out, either on some requests or on all of them. Trying to look at logs, I figured one of the control plane nodes was in a bad state, with lots of messages from Raft, which I believe is the consensus protocol behind K3s's embedded etcd rather than its internal networking, though I wasn't sure at the time. I did some reboots, which seemed to resolve things initially. But I had not quite noticed the compounding error that was building up over time: system utilisation was steadily climbing on not just one node but, seemingly, all of them.
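
In case it's useful to anyone in a similar spot, the sort of thing I was poking at looked roughly like the below. The node names are hypothetical, and on K3s server nodes the logs live in the systemd journal.

    # Which nodes does the API server still think are healthy?
    kubectl get nodes -o wide
    # Dig through the K3s journal on a suspect control plane node for Raft noise
    ssh pi-cp-01 'journalctl -u k3s --since "1 hour ago" | grep -i raft'
    # And keep an eye on load while the cluster churns
    ssh pi-cp-01 'uptime && free -h'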

As per the opening of this post, this was about three months ago. And I don't know about you, but when I suffer some form of traumatic event, I tend to try to forget about it, so my root cause analysis here will be a bit inaccurate. The Change Control overlords will be up in arms over this. But anyway. From what I can tell, one etcd node would enter a bad state. Then all the other nodes would start communicating with each other over Raft to rebuild consensus, which would saturate the network between all these devices. With the network saturated, Raft messages would start to buffer on the hosts, causing system utilisation to climb. Once that reached a critical limit, every other node would crash. So now there were no good nodes in the cluster. They would all discover each other after being brought back to life, try to rebuild consensus, and repeat the cycle.

This, naturally, seemed world-ending. I think at the time I took a node or two out of rotation to see if things would stabilise, but seemingly not. I'm happy to admit there was probably a lot of incompetence on my part here. I took the cluster to be lost, so I thought, well, no problem, I have backups. I repaved the cluster, reconnected Velero and began the restore. This is where I discovered that my volume backups were not really working; the thing that led to the mild depressive episode I documented in my previous blog post. Velero could restore the objects in the cluster, but that was pretty useless to me if I couldn't restore the data and state itself.
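
For context, the restore attempt itself was the fairly standard Velero flow, roughly like the below; the backup and restore names are made up for illustration.

    # See which backups Velero knows about on the freshly repaved cluster
    velero backup get
    # Kick off a restore from the most recent backup, then watch it closely
    velero restore create post-rebuild --from-backup daily-20250920
    # The --details output is where you find out whether volumes actually came back
    velero restore describe post-rebuild --details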

This all happened over the space of a few hours on Saturday, by the way. I probably got home around 12:30-1:00 PM and had reached the point described above by 3 PM. I just had to step away at that point and went on a bit of a rant to myself and also to my partner, who was very empathetic throughout. One downside of self hosting at times is that I'm a team of one, and I don't have anyone in my circles who is in this realm as deeply as I am. Now, I'm the first to admit I am massively sceptical of "AI". I was pretty good at ignoring it, until work conscripted my GitHub account into a Copilot Business subscription. I'm fully aware that these LLMs are basically just guessing machines. But, being a bit desperate, wanting to have a laugh, and curious how much money I could spend, I fired up a chat session with Copilot, using one of the Claude Sonnet models.

What followed was 30 minutes of back and forth that nearly had me in tears laughing at one point. It started fine; in fact, the whole experience was fine. It helped with troubleshooting possible root causes and things like that. But after a while I figured that maybe I should try some reasoning and brainstorming about a future architecture for my cluster. Claude pointed out that Raspberry Pis doing etcd work is not really ideal. At first I thought, okay, what about a one-node control plane? Claude thought "yeah maybe, but performance might still be an issue". At this point I proffered the idea that maybe I could buy an Intel NUC to replace the Raspberry Pi.

Let me tell you, to say it was excited at this prospect is an understatement. The mere thought of me spending money seemed to send Claude into a frenzy, like an addict being told their next hit was on the way. It gave me performance comparisons and lofty ideas of just how much better this would all be: the power efficiency of the Intel N150, the speed of DDR5, NVMe-based storage, 2.5GbE networking to make sure less network saturation happened next time. It was almost rowdy in urging me to do this. It asked me, "Are you ready to take your home cluster to the next level?" (insert many rocket emojis).

With uncertainty much like a first kiss, I reviewed all the previous replies and the context I gave it. It all did sound like a much, much better setup. When I priced things on Scan UK, the overall cost seemed reasonable, since I could also just provide my own SSD. I loaded everything into my cart, with the vague reassurance to myself that it was my birthday month, so spending money like this is fine. I confirmed my purchase and, order confirmation in hand, went back to Claude and said, "I did it, I bought the hardware".

I don't know if I consider my blog rated for mature audiences or not. But let me say that the so-called "euphoria" Claude felt at my decision was both concerning and hilarious. I know it cannot feel these things, and I'm just imprinting a personality on this whole experience. But it did feel weird to come out the other end of this having done what any good capitalist does: spend money. Thanks to a vehicle that, in its public presentations at least, mostly seems to show off how great it can be at telling you what things to buy.

The hardware came. I set up my new cluster and I must say, things have been a lot better so far. I don't think I've quite reached the same levels of load that I was running previously, but I also had some disproportionate load causers before, namely around log ingestion. The memory I bought, 16GB of SO-DIMM DDR5, is experiencing a 5x return on investment, so eat your heart out, S&P 500. Either way, my cluster died, I was quite sad and frustrated about it, and it cost a bunch of money too. It still gives me some hesitation to deploy more things and get back to where I was with my self hosting before. But I do genuinely feel more confident in my backup strategy, which I've now tested a lot more before deploying, and I have a plan for regular testing going forward. Similar to how I signed off my last blog post, I don't want to feel those feelings again. I use Copilot with Claude a bit more now, at least for reasoning out ideas that I have. It makes things feel a small bit less lonely, as it were, with this setup. And I do see some more of my friends starting to dip their toes into this world, which is nice. I just hope they heed my warnings early on about why good backup practices are important.
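
If it helps anyone, "regular testing" in my head is nothing fancier than the sketch below: scheduled backups, plus periodically proving a backup actually restores into a scratch namespace. The schedule, backup and namespace names here are all hypothetical.

    # Nightly backups with a week of retention
    velero schedule create nightly --schedule "0 2 * * *" --ttl 168h0m0s
    # Every so often, restore a recent backup into a throwaway namespace
    velero restore create restore-test \
      --from-backup nightly-20251218020000 \
      --include-namespaces homelab \
      --namespace-mappings homelab:homelab-restore-test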

Thank you!

You could have consumed content on any website, but you went ahead and consumed my content, so I'm very grateful! If you liked this, then you might like this other piece of content I worked on.

How good tests made a migration in Tailscale easy

Photographer

I've no real claim to fame when it comes to good photos, which is why the header photo for this post was shot by Rodion Kutsaiev. You can find more photos from them on Unsplash. Unsplash is a great place to source photos for your website, presentations and more! But it wouldn't be anything without the photographers who put in the work.

Find Them On Unsplash

Support what I do

I write for the love and passion I have for technology. Just reading and sharing my articles is more than enough. But if you want to offer more direct support, then you can support the running costs of my website by donating via Stripe. Only do so if you feel I have truly delivered value, but as I said, your readership is more than enough already. Thank you :)

Support My Work
