Going local with Kubernetes
November 8, 2024
Strap in because we're going for it
Earlier in the summer I decided I needed to repave my local infrastructure to move away from Docker Compose and towards Kubernetes, as part of revitalising my self hosting hobby. I worked on creating a remote Kubernetes cluster, initially trying to use my local compute for Nodes on this cluster, which failed. I then elected to simply pay for a managed Kubernetes cluster for a few months to at least experiment with how I would operate such a cluster.
Now came the moment of jumping into the deep end. Besides some experimenting on Raspberry Pis, I had never really run Kubernetes at home. For all my career experience with it, it was something I never elected to do outside of the cloud. Not a conscious decision, it just didn't cross my mind. But I felt that for my "pseudo production" environment I was getting sick of a bunch of Docker containers, and I also wanted to maintain my knowledge of Kubernetes as my career path has subtly shifted it away from being a day to day interaction. The day arrived to try and migrate my remote Kubernetes cluster into a local Kubernetes cluster, and "straightforward" is not how one would describe the task.
Hardware
For the metal powering it all, I've continued down the path of using Intel NUCs for their low power consumption and diminutive form factor. Being able to fit three of them in a 1U rack mount is also very handy. I'm using the NUC11TNKv5, which has an Intel Core i5-1145G7 with four cores and eight threads. The main selling feature for me is the onboard 2.5Gb interface, since I had ideas around having persistent volumes available from my NAS. There are also Thunderbolt ports, so in theory I could add 10Gb adapters, but that topic will be revisited once I have a 10Gb backbone in place for my networking.
Memory wise I've started them all with 16GB, with an upgrade path up to 64GB if I feel it is required. For storage I ended up upgrading to 1TB SSDs to have fast local storage available. One thing that I didn't think of was that I would also need to deploy a volume manager for my cluster. I was experimenting with Longhorn, and it was almost surprising that by default a volume gets replicated three times. It makes perfect sense obviously, but it's interesting to finally get an insight into how something like storage, usually obfuscated by cloud magic, works behind the scenes. The SSDs have no RAID whatsoever, not that RAID is a backup anyway. I think the NUCs can take a second drive, but budget wise I don't feel like investing in that level of redundancy. I plan to investigate the backup options available to me with Longhorn, and I imagine that will give me some peace of mind.
Software
The underlying operating system of choice is Ubuntu 24.04 LTS. There's no real thought here beyond the fact that I've been using Ubuntu throughout my career, and even before it really started. So this is just habit and familiarity more than anything. I did take advantage of the five free machines one gets for Ubuntu Pro with an Ubuntu account, so my cluster has nice-to-haves such as Livepatch for the kernel, among other things.
For the software connecting it all together, the mention of Longhorn above might allude to my choice of Kubernetes distribution. I elected to use Rancher Kubernetes Engine 2 (RKE2). Looking at it online, it didn't seem like K3s, also from Rancher, where features are stripped out or optimised for size. RKE2's angle of being a hardened distribution for government workloads also seemed like a good indicator of reliability, which came in useful for me later. Others in the self hosting world have also reported good success with RKE2 in their own endeavours, so I said I would give it a go.
For storage, as I mentioned above, I'm using Longhorn. While I don't think RKE2 and Longhorn both being Rancher products means they simply gel together and that's why one should use both, it was very straightforward to deploy. That helps a lot on the self hosting front; you don't want something you do at home to turn into something you have to configure at almost the same scale as your day job. Longhorn also felt like it shipped with many sensible defaults out of the box, the fact that volumes get replicated three times by default, like I described above, being a nice example. But I still need to dive deeper into Longhorn to see what else I can do with it.
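To give a flavour of those defaults, here's roughly what the StorageClass Longhorn sets up looks like. This is a sketch from memory and the docs rather than the exact object in my cluster, so treat everything other than the replica count as illustrative:

```yaml
# Roughly the default Longhorn StorageClass; the replica count is the headline
# setting, the rest is illustrative.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "3"        # each volume is kept on three nodes by default
  staleReplicaTimeout: "30"
reclaimPolicy: Delete
allowVolumeExpansion: true
```

Dropping that replica count for throwaway volumes is probably one of the first knobs I'll look at, given each node only has a single SSD.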
For networking, it should come as no surprise that I am using the Tailscale operator to get my cluster workloads connected externally. For the cluster itself I am just using the default network overlay that RKE2 ships with, which I believe is Canal. As I talk about in the linked blog post, I do have concerns around how I access my workloads in the event of losing the Internet at home. Note that I don't mean being able to access them from outside of home, more so inside. I need to investigate options, most likely a second Ingress controller that provisions Ingress objects with IP addresses routable within my own subnet. In an ideal world, I can just turn off the Internet and all my services are still accessible from inside my home. That's a problem far flung into the future though, so for now it sits on the backlog.
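For the curious, exposing a workload through the operator looks roughly like this. The app name and port are placeholders; the important bits are the tailscale ingress class and the hostname in the tls block:

```yaml
# A sketch of exposing a Service over the tailnet via the Tailscale operator.
# "some-app" and port 80 are placeholders for the real workload.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: some-app
spec:
  ingressClassName: tailscale
  defaultBackend:
    service:
      name: some-app
      port:
        number: 80
  tls:
    - hosts:
        - some-app        # becomes https://some-app.<tailnet-name>.ts.net
```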
Configuring It All
I was hoping to be amazing, write configuration as code for everything and experience glory. Unfortunately I did not quite get to that level. Using Ansible throughout, I certainly have the operating system configured in terms of users and system dependencies. I also managed to connect Bitwarden Secrets Manager to Ansible, which I use to automatically pull a Tailscale auth key onto the hosts, with a systemd unit that connects each host to the tailnet on boot. Unfortunately that does not quite work that well yet, so you know, learning experience! I can simply SSH to the machines locally and run the commands manually to get the hosts connected.
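The shape of it is something like the below. This is a simplified sketch rather than my actual playbook: the secret ID, unit name and paths are placeholders, and I'm assuming the Bitwarden Secrets Manager collection's lookup plugin is bitwarden.secrets.lookup, so check the plugin name against whichever collection you install:

```yaml
# Sketch: fetch a Tailscale auth key from Bitwarden Secrets Manager, then drop
# a oneshot systemd unit that joins the host to the tailnet on boot.
# Secret ID, unit name and paths are all placeholders.
- name: Fetch Tailscale auth key
  ansible.builtin.set_fact:
    tailscale_authkey: "{{ lookup('bitwarden.secrets.lookup', 'REPLACE-WITH-SECRET-ID') }}"
  no_log: true

- name: Install tailnet join unit
  ansible.builtin.copy:
    dest: /etc/systemd/system/tailnet-join.service
    mode: "0600"
    content: |
      [Unit]
      Description=Join this host to the tailnet
      Wants=network-online.target
      After=network-online.target tailscaled.service

      [Service]
      Type=oneshot
      ExecStart=/usr/bin/tailscale up --authkey={{ tailscale_authkey }}

      [Install]
      WantedBy=multi-user.target

- name: Enable the unit
  ansible.builtin.systemd:
    name: tailnet-join.service
    enabled: true
    daemon_reload: true
```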
I also have my monitoring tooling installed as part of the Ansible playbook for the hosts. I did not write Ansible for installing RKE2, as I was targeting high availability mode and, for a first time, there's benefit in doing it manually. I simply followed the documentation for an HA setup of RKE2. I don't yet have a separate IP address / load balancer for the API server, so right now if the first machine gets rebooted, the whole API server is unreachable! With a customer base of one though, who's very informed as to how everything works, it's tolerable!
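The config itself is tiny. A sketch of the /etc/rancher/rke2/config.yaml files from the HA docs, with placeholder hostname and token; in my case the registration address is just the first node, which is exactly the single point of failure I mentioned:

```yaml
# First server node (placeholder values)
token: some-shared-secret
tls-san:
  - rke2.home.arpa          # ideally a VIP or load balancer; in my case, node one
---
# Second and third server nodes
server: https://rke2.home.arpa:9345
token: some-shared-secret
```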
The nice thing about all of this was that it did not take very long. By default the nodes in the cluster are schedulable, which might not be perfect; it could be better to have dedicated etcd nodes alongside worker nodes. But this isn't a home data centre. I could investigate virtualising the NUCs and then relying on VMs to distribute things, but that seemed like too much overhead to bring in for now, and I really want to try KubeVirt for my virtualisation needs. Either way, as they say, the proof is in the pudding, et voila!
(if you’re reading on mobile I’m sorry but I’m not paid to write CSS so I don’t know how :D :D :D)
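Coming back to the point about dedicated roles: if I ever do split things, the RKE2 docs describe keeping ordinary workloads off the server nodes with a taint in config.yaml, roughly like this (a sketch, not something I've applied in my cluster):

```yaml
# Taint server nodes so only critical add-ons schedule on them; agent/worker
# nodes would then carry the actual workloads.
node-taint:
  - "CriticalAddonsOnly=true:NoExecute"
```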
No backup survives first contact with the enemy
If you have followed this journey from the beginning, firstly thank you :) But you would also know I had a Kubernetes cluster on Scaleway. I configured Velero to back up that cluster, with CSI-based backups enabled. The idea was that if I was using CSI and also taking file system level backups, I could just take a backup of the Scaleway cluster and restore it into my local RKE2 cluster.
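The backup definition itself was nothing exotic, something along the lines of the following Velero Schedule. I'm reconstructing it for illustration, so the name, cron expression and namespace selection are placeholders:

```yaml
# A sketch of a Velero Schedule taking CSI snapshots plus file system backups.
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: nightly
  namespace: velero
spec:
  schedule: "0 3 * * *"
  template:
    includedNamespaces:
      - "*"
    snapshotVolumes: true
    defaultVolumesToFsBackup: true   # file system level backups alongside CSI snapshots
    ttl: 168h
```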
At least, it was an idea! Reading the documentation, there was the following callout: "When restoring CSI VolumeSnapshots across clusters, the name of the CSI driver in the destination cluster is the same as that on the source cluster to ensure cross cluster portability of CSI VolumeSnapshots". Which, in effect, snookered me, since the Scaleway CSI driver does not have the same name as the Longhorn CSI driver. Knowing that there are liars on the internet, I elected to try it anyway, and it failed spectacularly. So much so that it knocked out Tailscale on the Scaleway cluster and left namespaces on the RKE2 cluster in a degraded, confused state, trying to schedule Pods with no Volumes available while clean up operations were not working.
For Tailscale, the issue was duplicate machine names, something I hadn't accounted for being a problem. For the backup failure, I knew going in that it might not work, but I didn't expect it to fail so badly that I would be considering just wiping the machines and rebuilding the cluster. Which was going to be annoying. However, a beacon of light emerged in the darkness: I discovered that out of the box, RKE2 performs etcd database snapshots. This was an amazing discovery, as it meant I could get back to the stage I was at before the backup disaster.
Again, the process was well documented. Effectively: take down all the nodes, delete the etcd database on all of them, spin up one machine with a reference to a snapshot, then spin up the other two nodes and the database rebuilds itself. It was so slick I was nearly stunned. I got my blank slate back. I ended up deleting the Scaleway cluster; while that was always the intention, I made no further effort to recover data from it. There was nothing critical on there that couldn't just be recreated from scratch, so that's the plan for the next few weeks, to get back to where I was.
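For reference, the snapshot behaviour is driven by a couple of keys in /etc/rancher/rke2/config.yaml, and the restore itself boils down to stopping rke2-server everywhere and starting the first node once with `rke2 server --cluster-reset --cluster-reset-restore-path=<snapshot>`. The values below are just illustrative; the defaults already snapshot periodically:

```yaml
# Illustrative snapshot settings; RKE2 snapshots etcd out of the box, these
# keys just tune how often and how many snapshots to keep.
etcd-snapshot-schedule-cron: "0 */12 * * *"
etcd-snapshot-retention: 5
```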
Conclusion
Have I learned a ton more about Kubernetes through this experience? Not quite, thus far. My knowledge still feels much the same; it would probably be different if I had gone down some kubeadm route. But I'm extremely chuffed to finally have a local, pseudo-prod Kubernetes cluster for my own services. I feel my passion and interest in self hosting has been reignited, even with a few gut punches along the way. I think it's more important than ever to keep the metal that holds one's personal data as close to hand as possible. And there are so many cool applications out there that can automate so many things in one's life. I look forward to the day that I can create a smart living experience that is local, private and additive to my life, and I hope to be able to document it all!