Self Hosting Woes at Christmas
January 13, 2025

Day Two Firefighting, Almost Literally
Christmas came and went, and I spent a good chunk of mine working on my self-hosting hobby. Of course I took time to recharge and enjoy the holiday period. Honestly, I think it was one of my most relaxing Christmases ever; having the ability to drive off to wherever I wanted probably helped, even though I didn't seize that particular opportunity. I'm never one to just sit down and be idle; even with the TV on I'd find myself wanting to be doing something else. As of my last blog post I had laid the foundations for my local Kubernetes cluster, and I was eager to start building on top of them.
Unfortunately, I was soon to encounter some fairly dire issues that would throw a serious spanner in the works, and which have led, as of this morning (13th January), to a drawn-up plan for a serious re-architecture and update of my local cluster. Let's get into the timeline.
Getting Monitoring Right
As part of my local cluster, I wanted to get monitoring of my infrastructure and services to a really good state, while also giving myself the opportunity to learn. I believe I talked previously about my chosen stack, but as a brief reminder, I went with Grafana's LGTM stack. The acronym stands for:
- Loki for logs
- Grafana for dashboards and visualisation
- Tempo for traces
- Mimir for metrics
- And while not mentioned in the acronym, Grafana Alloy is to be my choice of collector for metrics and logs
I had experience with Loki and Grafana independent of this effort. Indeed, prior to my pivot towards local Kubernetes, I ran both services with Docker Compose. For metrics I was instead using a combination of Telegraf and InfluxDB. They worked just fine, but I remember when Influx announced a series of changes for InfluxDB as a product. Things just seemed to be getting complicated, and I made a mental note back then to consider replacements. Yes, that post is from September 2023; I'm only getting around to it now :)
Using Mimir for metrics seemed like a logical step and an interesting one to learn. However, Mimir is certainly a good example of over-engineering a solution in the context of self-hosting. Back in July, when experimenting with it on Scaleway, I was already running into storage consumption issues. I even talked about how I upgraded my local SSDs to account for more storage usage. Even so, I needed to increase my volume sizes at home from 50GiB to 128GiB for Mimir's ingesters.
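For context, the ingester volume size is just a values tweak in the mimir-distributed Helm chart. A sketch of roughly what mine looked like; the key paths are from the chart's values file and may differ between chart versions, so double-check with `helm show values`:

```yaml
# values.yaml sketch for the grafana/mimir-distributed Helm chart.
# Key paths can shift between chart versions; verify before applying.
ingester:
  replicas: 3
  persistentVolume:
    enabled: true
    size: 128Gi            # bumped from 50Gi when the ingesters ran out of room
    storageClass: longhorn # assumes Longhorn is the CSI driver in use
```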
The TL;DR is that Mimir keeps a set number of TSDB blocks on local disk and periodically ships those blocks to object storage, which keeps queries feeling snappy. Unfortunately, what I discovered soon after scaling the volumes was that, due to the number of replicas I had in Longhorn, I was over-allocating storage on my nodes and therefore losing the ability to schedule new volumes. Not great!
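To make the arithmetic concrete: with Longhorn's default of three replicas, a single 128Gi volume reserves 384Gi of raw disk across the cluster. The replica count is set per StorageClass; a minimal sketch, where the class name is my own but `numberOfReplicas` is a standard Longhorn parameter:

```yaml
# Longhorn StorageClass with two replicas instead of the default three.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-two-replica   # hypothetical name
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"        # 2 x 128Gi reserved instead of 3 x 128Gi
  staleReplicaTimeout: "30"
```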
In the moment, I elected to simply wait and see. Mimir's documentation was very clear about not adjusting the settings of its ingester components, warning that you might do more harm than good. My logic was that, with the increased volume size, I could at least see whether Mimir settled into a reasonable storage usage pattern and then adjust. Unfortunately, I was soon to encounter some environmental issues that forced my hand.
It's getting hot in here, so shut down all your Nodes
When I was working on all of this Mimir stuff, I was downstairs in our living room, not in the room with my desktop and my rack. Once family had returned to their homes after spending time with us, I went back upstairs to using my desktop. Naturally, after a few days of consistent self-hosting, I switched over to playing some games, which meant my desktop was going full swing and exhausting hot air into the room.
Very quickly, I began to notice my cluster was completely falling down. One node was becoming entirely unresponsive, and the only way to get it back online was to pull the power and turn it back on. Initial attempts at revival only succeeded for a few minutes, but I was quick to notice that the machine was running extremely hot: greater than 90 degrees. At first I just left the machine off overnight to let it cool down, thinking it was a one-time thing, but I kept running into the temperature issue. Initially I suspected the Longhorn volume backups I had introduced were pushing things over the edge, so I turned those off for now, but the problem persisted. Trying to think of what could be putting the most load on the machine, I came to the conclusion that Mimir was the most likely contributor.
While Mimir was set to distribute its components, I can only assume this one machine was getting additional load because it was also the machine whose address I used for the API server. I began investigating whether I could run a smaller version of Mimir. It has two main architecture modes, monolithic and microservices; I was using the latter, as it was the only one supported by the Helm chart. There are several complaints online about the lack of monolithic support in the chart, and while I hate Helm, I do tend to agree the option should exist, at least so people like me can render out the YAML from Helm and call it a day. Loki supports what it calls simple scalable mode, which is basically monolithic but scales horizontally. The Mimir Helm chart should definitely add support for this; the documentation describes monolithic mode as capable of exactly that, but without chart support it's quite hard to deploy.
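For the curious, monolithic mode is the same binary run with a different target flag, so a hand-rolled deployment is possible if you're willing to skip the chart. A minimal sketch, where the resource names, image tag, and config path are my own assumptions rather than anything official:

```yaml
# Hand-rolled sketch of Mimir in monolithic mode, outside the Helm chart.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mimir-monolithic
spec:
  serviceName: mimir-monolithic
  replicas: 1
  selector:
    matchLabels:
      app: mimir-monolithic
  template:
    metadata:
      labels:
        app: mimir-monolithic
    spec:
      containers:
        - name: mimir
          image: grafana/mimir:2.14.0
          args:
            - -target=all                     # every component in one process
            - -config.file=/etc/mimir/mimir.yaml
          ports:
            - containerPort: 8080             # HTTP API / remote_write
          volumeMounts:
            - name: config
              mountPath: /etc/mimir
      volumes:
        - name: config
          configMap:
            name: mimir-config                # hypothetical ConfigMap holding mimir.yaml
```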
In the end I just gave up on Mimir. Really, I was only using it as a remote_write endpoint for Prometheus, and the hassle became too much. Dropping it was also the easiest way of solving my storage issues with Longhorn, though reducing the number of replicas to two, which Longhorn recommends, helped a lot as well. I gave Prometheus a 64GiB volume and set a retention size of roughly 80% of the volume. It's nowhere near using that yet, but I can rest easy knowing the tool itself will handle storage cleanup, versus Mimir, which seemed to just have its ingesters fall over when the disks got full. It was interesting to learn more about Mimir, but yeah, too much for home hosting.
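A sketch of that retention setup, assuming the Prometheus Operator is in play; if you run Prometheus directly, the equivalent is the `--storage.tsdb.retention.size` flag. The 51GiB figure is just my rough 80% of the 64Gi volume:

```yaml
# Size-based retention via the Prometheus Operator's Prometheus CRD.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
spec:
  retentionSize: 51GiB    # ~80% of the volume, leaving headroom for the WAL and compaction
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: longhorn   # assumes the Longhorn class from earlier
        resources:
          requests:
            storage: 64Gi
```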
As for the temperatures? Things were still a bit toasty on that one machine. I've had it longer than the other two, so dust is a real possibility. For now, the solution I decided to implement was quite simple: make the fans go faster all the time!
“Things finally stable? Guess I’ll die” - that same machine, probably
Last night I was relaxing after dinner, lining up something to watch on Netflix, when I noticed my dashboard was not loading. I jumped over to Tailscale, as I find it's the quickest indicator of life for my services, and noticed virtually all of them were down, including my toasty little NUC. I was exasperated, as I had felt the issue was solved, and annoyed too: I had planned to learn Alertmanager that day but elected to play games instead, so I could have been notified of the temperatures going up.
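For the record, the sort of rule I had in mind is only a few lines. A sketch, assuming node-exporter's hwmon collector is exposing `node_hwmon_temp_celsius` and the Prometheus Operator is picking up PrometheusRule resources; the threshold and labels are arbitrary choices of mine:

```yaml
# Sketch of the alert I wished I'd had before the node cooked itself.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-temperature
spec:
  groups:
    - name: thermals
      rules:
        - alert: NodeRunningHot
          expr: max by (instance) (node_hwmon_temp_celsius) > 85
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.instance }} has been above 85C for 5 minutes"
```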
So I went to pull the power cord and try a reboot. Only this time, nothing came back up. I tried a few more times and, at least on the network side, it was still completely dead. I elected to just watch Netflix and revisit when the show was over, about 90 minutes later. The time came and… still dead. This being different from the other times led to some concern. I can hear the fan spool up on boot, but seemingly the machine does not get any further into the boot process. I've yet to get a KVM for my home, so I need to get down into the weeds, plug some cables in, and see what I get on a screen. My hope is that something on the OS has simply died a death and I can reimage, restore a snapshot of the etcd DB, and be on my way. But I won't know until later this evening, so you can expect a follow-up post on how that goes.
In the meantime, with the help of a good friend, I have come up with a plan to really uplift the cluster. I think there are two key issues, even with this three-node cluster:
- All Kubernetes components are running on all nodes, which causes more load and generally just isn't good practice.
- Sending API requests to just one of the nodes, when all of them can serve API requests, is less than ideal and provides no redundancy when that one node dies or gets rebooted.
So the plan is to upgrade my Raspberry Pi 4s to SSDs and, with the addition of a fifth Pi, have a five-node etcd/API server group. I'm going to start with just etcd and maybe introduce the API server role later, but the hope is that having the Pis run etcd will reduce load on the NUCs, which should keep temperatures a bit lower. Provided things are stable, I'll have the Pis serve the API server requests too, and I'm going to look inside UniFi to see whether I can create some kind of round-robin DNS record to help keep the API server functioning in the event of node death; I sketch the idea below.
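The rough idea on the API server side is to point everything at a DNS name rather than a single node's address. A kubeadm-flavoured sketch, purely illustrative since I haven't settled on the exact setup; the hostname is hypothetical, and the round-robin A records would live in UniFi's DNS:

```yaml
# kubeadm ClusterConfiguration: clients and nodes talk to a DNS name, not one node.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
controlPlaneEndpoint: "k8s-api.home.arpa:6443"  # resolves to every control-plane node
```

Round-robin DNS is admittedly a blunt instrument, since clients can cache a dead node's address until they retry, but it's a starting point I can build on.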
The journey continues and I have no regrets about starting it! Certainly, low moments such as these can be demotivating, but I find I eventually bounce back as the interest in solving the challenge develops! Hopefully I can share some good news on this blog in the coming days about the state of my little NUC, and in the coming weeks once I have all the equipment I need to bring my Raspberry Pis into service with this cluster!