We've been using Hetzner's dedicated servers to provide Kubernetes clusters to our clients [1] for a few years now. The performance is certainly excellent; we typically see request times halve. And because the hardware is cheaper, we can provide dedicated DevOps engineering time to each client. There are some caveats though:
1) A staging cluster for testing updates is really a must. YOLO-ing prod updates on a Sunday is no one's idea of fun.
2) Application level replication is king, followed by block-level replication (we use OpenEBS/Mayastor). After going through all the Postgres operators we found StackGres to (currently) be the best.
3) The Ansible playbooks are your assets. Once you have them down and well-commented for a given service then re-deploying that service in other cases (or again in the future) becomes straightforward.
4) If you can I'd recommend a dedicated 10G network to connect your servers. 1G just isn't quite enough when it comes to the combined load of prod traffic, plus image pulls, plus inter-service traffic. This also gives a 10x latency improvement over AWS intra-az.
5) If you want network redundancy you can create a 1G vSwitch (VLAN) on the 1G ports for internal use. Give each server a loopback IP, then use BGP to distribute routes (bird; a config sketch follows after this list).
6) MinIO clusters (via the operator) are not that tricky to operate as long as you follow the well trodden path. This provides you with local high-bandwidth, low-latency object storage.
7) The initial investment to do this does take time. I'd put it at 2-4 months of undistracted skilled engineering time.
8) You can still push ancillary/annoying tasks off onto cloud providers (personally I'm a fan of Cloudflare for HTTP load balancing).

[1]: https://lithus.eu
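On the BGP point (5), here is a minimal sketch of one node's bird 2.x config. The addressing plan is illustrative only (loopback 10.0.255.1/32, a vSwitch VLAN interface carrying 10.0.0.1/24, one peer at 10.0.0.2); ASNs and names are made up, not Hetzner defaults:

    # /etc/bird/bird.conf -- one node's view
    # loopback set up beforehand with: ip addr add 10.0.255.1/32 dev lo
    router id 10.0.255.1;

    protocol device { }

    protocol direct {
        ipv4;
        interface "lo";    # pick up the loopback /32 as a route
    }

    protocol bgp peer2 {
        local 10.0.0.1 as 65001;
        neighbor 10.0.0.2 as 65002;   # next server on the vSwitch
        ipv4 {
            import all;                        # learn the other loopbacks
            export where source = RTS_DEVICE;  # announce only our own
        };
    }

With each node importing its peers' loopbacks, internal services can bind to the loopback addresses and survive the loss of either physical path.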
Do you have to ask Hetzner nicely for this? They have a publicly documented 10G uplink option, but that is for external networking and IMHO heavily limited (20TB cap). For internal cluster IO, 20TB could easily become a problem.
> 5) If you want network redundancy you can create a 1G vSwitch (VLAN) on the 1G ports for internal use. Give each server a loopback IP, then use BGP to distribute routes (bird).
Are you willing to share example config for that part?
> The initial investment to do this does take time. I'd put it at 2-4 months of undistracted skilled engineering time.
Perhaps you could take a look at https://syself.com (Disclaimer: I'm an employee there). We built a platform that gives you production-ready clusters in a few minutes.
I have experience running Kubernetes clusters on Hetzner dedicated servers, as well as working with a range of fully or highly managed services like Aurora, S3, and ECS Fargate.
From my experience, the cloud bill on Hetzner can sometimes be as low as 20% of an equivalent AWS bill. However, this cost advantage comes with significant trade-offs.
On Kubernetes with Hetzner, we managed a Ceph cluster using NVMe storage, MariaDB operators, Cilium for networking, and ArgoCD for deploying Helm charts. We had to handle Kubernetes cluster updates ourselves, which included facing a complete cluster failure at one point. We also encountered various bugs in both Kubernetes and Ceph, many of which were documented in GitHub issues and Ceph trackers. The list of tasks to manage and monitor was endless. Depending on the number of workloads and the overall complexity of the environment, maintaining such a setup can quickly become a full-time job for a DevOps team.
In contrast, using AWS or other major cloud providers allows for a more hands-off setup. With managed services, maintenance often requires significantly less effort, reducing the operational burden on your team.
In essence, with AWS, your DevOps workload is reduced by a significant factor, while on Hetzner, your cloud bill is significantly lower.
Determining which option is more cost-effective requires a thorough TCO (Total Cost of Ownership) analysis. While Hetzner may seem cheaper upfront, the additional hours required for DevOps work can offset those savings.
This is definitely some ChatGPT output being posted here, and your post history also has a lot of this "While X, Y also does Z. Y already overlaps with X" output.
I'd like to see your breakdowns as well, given that a 2 vCPU, 4GB configuration (as an example) is priced much higher on AWS than a similar configuration on Hetzner.
There's also https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne... to reduce the operational burden that you speak of.
I've never operated a kubernetes cluster except for a toy dev cluster for reproducing support issues.
One day it broke because of something to do with certificates (not that it was easy to determine the underlying problem). There was plenty of information online about which incantations were necessary to get it working again, but instead I nuked it from orbit and rebuilt the cluster. From then on I did this every few weeks.
A real kubernetes operator would have tooling in place to automatically upgrade certs and who knows what else. I imagine a company would have to pay such an operator.
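For what it's worth, if the cluster was built with kubeadm, the certificate case is now one of the better-tooled ones; assuming kubeadm 1.19 or newer, recovery is usually along these lines rather than a rebuild:

    # list which certificates are expired or close to it
    kubeadm certs check-expiration

    # reissue them from the cluster CA, then restart the
    # control-plane static pods so they pick up the new certs
    kubeadm certs renew all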
Ceph is a bastard to run. It's expensive, slow, and just not really ready. Yes, I know people use it, but compared to a fully grown-up system (i.e. Lustre [don't, it's RAID 0 in prod] or GPFS [great but expensive]) it's just a massive time sink.
You are much better off having a bunch of smaller file systems exported over NFS, making sure that you have block-level replication. Single-address-space filesystems are OK and convenient, but most of the time they are not worth the cost of admin to get reliable at scale. Like a DB, shard your filesystems, especially as you can easily add mapping logic to Kubernetes to make sure you get the right storage to the right image.
I mostly agree, but it surprises me that people don't often consider a solution right in the center, such as OpenShift. You get a much, much lighter devops burden while keeping all the power and flexibility of running on bare metal. It's a great hybrid between a fully managed and expensive service versus a complete build-your-own. It's expensive enough, though, that for startups it is not likely a good option, but if you have a cluster with at least 72 GB of RAM or 36 CPUs going (about 9 mid-size nodes), you should definitely consider something like OpenShift.
> Determining which option is more cost-effective requires a thorough TCO (Total Cost of Ownership) analysis. While Hetzner may seem cheaper upfront, the additional hours required for DevOps work can offset those savings.
Sure, but the TL;DR is going to be that if you employ n or more sysadmins, the cost savings will dominate, with 2 < n < 7. So for a given company size, Hetzner will start being cheaper at some point, and it will become more extreme the bigger you go.
Second, if you have a "big" cost of any kind (bandwidth, disk space, essentially anything but compute), the cost savings will dominate faster.
> Hetzner volumes are, in my experience, too slow for a production database. While you may in the past have had a good experience running customer-facing databases on AWS EBS, with Hetzner's volumes we were seeing >50ms of IOWAIT with very low IOPS.
There is a surprisingly easy way to address this issue: use (ridiculously cheap) Hetzner metal machines as nodes. The ones with NVMe storage offer excellent performance for DBs and often have generous amounts of RAM. I'd go as far as to say you'd be better off investing in two or more beefy bare-metal machines for a master-replica(s) setup rather than running the DB on k8s.
If you don't want to be bothered with the setup, you can use one of many modern packages such as Pigsty: https://pigsty.cc/ (not affiliated but a huge fan).
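The master-replica setup itself is pleasantly small even without an operator. A rough sketch of seeding a streaming-replication standby, assuming Postgres is installed on both boxes and a replication role already exists on the primary (hostnames, version and paths are illustrative):

    # run on the empty replica; -R writes standby.signal plus
    # primary_conninfo so the node comes up as a streaming standby
    pg_basebackup -h db1.internal -U replicator \
        -D /var/lib/postgresql/16/main \
        -R --wal-method=stream --checkpoint=fast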
Been a happy Hetzner customer for over a decade, previously using their dedicated servers in their German DCs before migrating to their Cloud US VMs for better latency with the US. Slightly disappointed with their recent cut of the generous 20TB free traffic down to 3TB (€1.19 per additional TB), but they still look to be a lot better value than all the other US cloud providers we've evaluated.
Whilst I wouldn't run Kubernetes by choice, we've had success moving our custom SSH / Docker Compose deployments over to GitHub Actions with kamal-deploy.org; it's easy to set up, with nice UX tools for monitoring remotely deployed apps [1]
[1] https://servicestack.net/posts/kamal-deployments
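For anyone who hasn't seen Kamal: the deployment description fits in one small file, and `kamal deploy` then builds, pushes and swaps containers over plain SSH. A minimal config/deploy.yml, with placeholder names and IPs:

    service: myapp
    image: myorg/myapp

    servers:
      - 192.168.0.1   # any SSH-reachable Docker host, e.g. a Hetzner box

    registry:
      username: myorg
      password:
        - KAMAL_REGISTRY_PASSWORD   # read from the environment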
Seems to be a US thing; maybe their peering partners are forcing them to raise prices. The German DCs still sell the 20TB bandwidth (https://www.hetzner.com/cloud/), but the US gets an order of magnitude less for the same price :/
I used to do my own car maintenance, because I wanted to save money, and it was fun. It turned out it was more complex than I thought, things slowly fell apart or broke. I spent a good deal of time "re-fixing" things. Spent probably thousands on tools over the years, partly replacing the cheap stuff that broke or rusted quickly. My cars were often up on blocks. But I learned a lot of great lessons. The biggest one? Some things are not worth DIYing; pay a mechanic or lease your car, especially if you depend on it for your livelihood.
Even something as simple as an oil change really isn't worth doing yourself. First you buy the tools (oil drip pan, filter wrench, funnel, creeper). Then you set aside the time to use them and find your dingy work clothes. You go to the store and buy new oil and a filter. You go home and change the oil. Then that day, or another day, you go to a store that will take your used oil. Versus 20 minutes at an auto mechanic, for about $15 more than the cost of the oil and filter.
Kubernetes is an entire car (and a complex one). It's really not worth doing the maintenance yourself, I promise you. Unless you're just doing it for fun.
I don't know. My shop wanted $1,800 to change my brakes. I bought the parts for $300 and got it done in a day (first time). Seems like a pretty good payback, and a good skill to have. My neighbour has a car lift, which certainly helped.
Depends on, one: how interested/motivated you are, now and down the line; and two: how likely your dependency on a third party is to screw you over in the long run.
A lot of it is finding balance between what to do yourself and what to outsource, and it's not as easy or clean as some people here like to claim.
My opinion, from the viewpoint of a consultant often involved in Kubernetes, is to get initial help and a persistent help line, but get somebody internally interested enough to ride along and learn.
Consultants and experts in general can save you from a lot of bad up-front decisions and banging your head against the wall for months. It's not trivial to learn your way around technologies or ecosystems, including common dark corners and pitfalls, in a reasonable amount of time while also having to focus on your core business. Accept help but learn to fish and to make a fire.
In my experience no one bothers unless they are using GPUs or they are already at $100k/mo.
I do think $100k/mo is the tipping point actually; that is $1.2M/yr.
It costs around $400k/yr in engineering salaries to reasonably support a sophisticated bare-metal deployment (though such people can generally do that AND provide a lot of value elsewhere in the business, so its actual cost is lower than this) and roughly $100k/yr in DC commitments, HW amortisation, and BW. So you save around $700k a year, which is great, but the benefit becomes much greater when your equivalent cloud spend is even bigger than that.
When I worked in web hosting (more than 10 years ago), we would constantly be blackholing Hetzner IPs due to bad behavior. Same with every other budget/cheap VM provider. For us, it had nothing to do with geo databases, just behavior.
You get what you pay for, and all that.
Yep, I had the same problem years ago when I tried to use Mailgun's free tier. Not picking on them, I loved the features of their product, but the free-tier IPs had a horrible reputation and mail just would not get accepted, especially by Hotmail or Yahoo.
Any free hosting service will be overwhelmed by spammers and fraudsters. Cheap services the same but less so, and the more expensive they are the less they will be used for scams and spams.
It's always evolving, but these days the most common platforms attacking the sites I host are the big cloud providers, especially Azure. But AWS, Google, DigitalOcean, Linode, Contabo, etc. all host a lot of attacks trying to brute-force logins and search for common exploits.
Depending on the prices, maybe a valid strategy would be to have servers at Hetzner and then tunnel ingress/egress somewhere more prominent. Maybe adding the network traffic to the calculation still makes financial sense?
I work for a consultancy that helps companies build and secure infrastructure. We have a lot of customers running Kubernetes at low-cost providers (like Hetzner), at more local middle-tier providers, and at the top three (AWS, GCP, Azure). We also have some governmental, financial and medical clients that cannot or will not run in public clouds, so they usually host on-prem.
If Hetzner has an issue or glitch once a month, the middle-tier providers have one every 2-3 months, and a place like AWS maybe every 5-6 months. However, prices also follow that observation, so you have to carefully consider on a case-by-case basis whether adding some extra machines and backup and failure scenarios is a better deal.
The major benefit of using basic hosting services is that their pricing is a lot more predictable: you pay for machines and scale as you go. Once you get hooked into all the extra services a provider like AWS offers, you might get some unexpectedly high bills, and moving away might be a lot harder. For smaller companies: don't make short-sighted decisions early on that threaten your ability to survive long-term by choosing the easy solution or the "free credits" scheme.
There is no right answer here, just trade-offs.
I haven't used it personally, but https://github.com/kube-hetzner/terraform-hcloud-kube-hetzne... looks amazing as a way to set up and manage Kubernetes on Hetzner. At the moment I'm on the Oracle free tier, but I keep thinking about switching to it to get off... well, Oracle.
I'm running two clusters on it, one for production and one for dev. It works pretty well, with a schedule to reboot machines every Sunday for automatic security updates (openSUSE MicroOS). I've also expanded machines for increased workloads. You have to make sure to inspect every change Terraform wants to make, but then you're pretty safe.
The only downside is that every node needs a public IP, even though they are behind a firewall. But that is being worked on.
I've used this to set up a cluster to host a dogfooded journalling site.
In one evening I had a cluster working.
It works pretty well. I had one small problem where the auto-update wouldn't run on ARM nodes, which stopped the single node I had running at that point (the control-plane taint blocked the update pod from running on it).
I recently read an article about running k8s on the Oracle free tier and was looking to try it. I'm curious: are there any specific pain points that are making you think of switching?
> While DigitalOcean, like other providers, offers a free managed control plane, there is typically a 100% markup on the nodes that belong to these managed clusters.
I don't think this is true. With DigitalOcean, the worker nodes are the same cost as regular droplets; there are no additional costs involved. This makes DigitalOcean's offering very attractive: a free control plane you don't have to worry about, free upgrades, and some extra integrations with things like the load balancer, storage, etc. I can't think of a reason not to go with that over self-managed.
I loved the article. Insightful, and packed with real world applications. What a gem.
I have a side-question pertaining to cost-cutting with Kubernetes. I've been musing over the idea of setting up Kubernetes clusters similar to these ones but mixing on-premises nodes with nodes from the cloud provider. The setup would be something like:
- vCPUs for bursty workloads,
- bare metal nodes for the performance-oriented workloads required as base-loads,
- on-premises nodes for spiky performance-oriented workloads, and dirt-cheap on-demand scaling.
What I believe will be the primary unknown is egress costs.
Has anyone ever toyed around with the idea?
>All root servers have a dedicated 1 GBit uplink by default and with it unlimited traffic.
>Inclusive monthly traffic for servers with 10G uplink is 20TB. There is no bandwidth limitation. We will charge € 1/TB for overusage.
So it sounds like it depends. I have used them for (I'm guessing) 20 years and have never had a network problem or a surprise charge. Of course, I mostly worked in the low double-digit terabytes, but I have had servers with them that handled millions of requests per day with zero problems.
We've toyed around with this idea for clients that do some data-heavy data-science work. Certainly I could see that running an on-premises MinIO cluster could be very useful for providing fast access to data within the office.
Of course you could always move the data-science compute workloads to the cluster, but my gut says that bringing the data closer to the people that need it would be the ideal.
Sidero Omni have done this: https://omni.siderolabs.com
They run a WireGuard network between the nodes so you can have a mix of on-premise and cloud within one cluster. It works really well but unfortunately is a commercial product with a pricing model that is a little inflexible.
But at least it shows it's technically possible so maybe open source options exist.
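An open-source approximation is possible with stock k3s if you bring your own WireGuard mesh: join on-prem agents to a cloud control plane over the tunnel. A sketch, assuming wg0 is already up on both ends (addresses and the token are placeholders):

    # on the on-prem box: join via the control plane's WireGuard
    # address; --flannel-iface keeps pod traffic inside the tunnel
    k3s agent \
      --server https://10.8.0.1:6443 \
      --token "$K3S_TOKEN" \
      --node-ip 10.8.0.23 \
      --flannel-iface wg0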
The key take-home point here is not how amazingly cheap Hetzner is (which it is), but how much of an extortion game Google, Amazon, MS, etc. are playing with their cloud services. These are trillion-dollar companies because they are raking in cash with extreme margins.
Yes, there is some added value in the level of convenience provided. But maybe with a bit more competition, pricing could be more competitive. A lot more competitive.
> Hetzner volumes are, in my experience, too slow for a production database. While you may in the past have had a good experience running customer-facing databases on AWS EBS, with Hetzner's volumes we were seeing >50ms of IOWAIT with very low IOPS. See https://github.com/rook/rook/issues/14999 for benchmarks.
I set up Rook Ceph on a Talos k8s cluster (with VM volumes) and experienced similarly low performance; however, I always thought that was because of the 1G vSwitch (i.e. a networking problem)?! The SSD volumes were quite fast.
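If you want to separate the network from the volume itself, the pattern from that rook issue is easy to reproduce with fio against the mount point; small random writes with O_DIRECT approximate what a database does (parameters illustrative):

    fio --name=dbsim --filename=/mnt/vol/testfile --size=2G \
        --rw=randwrite --bs=4k --iodepth=32 --ioengine=libaio \
        --direct=1 --runtime=60 --time_based --group_reporting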
SSD volumes are physically on the same node and, afaik, not redundant. The cloud VMs' volumes are Ceph clusters behind the scenes, and writes need to commit on 3+ machines. It's both network latency and inherent process latency.
Additionally, Hetzner has an IOPS limit of 5000, and a write-throughput limit that does not scale with the size of the volume.
50G has the same limits as 5TB.
For this reason, people sometimes use different tablespaces in Postgres, for example.
Ceph puts another burden on top of the already-Ceph-based cloud volumes, btw, so don't do that.
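To make the tablespace trick concrete: one tablespace per attached volume, with hot relations placed explicitly, spreads those per-volume caps across several volumes (paths illustrative; the directories must already exist and be owned by postgres):

    psql -c "CREATE TABLESPACE vol2 LOCATION '/mnt/vol2/pgdata'"
    psql -c "CREATE TABLE events (id bigint, payload jsonb) TABLESPACE vol2"
    psql -c "CREATE INDEX ON events (id) TABLESPACE pg_default"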
In my limited experience, rook-ceph is strictly a bare-metal technology. On virtualization it will basically replicate your data to VM disks which are usually already replicated themselves, so quite a bit of replication amplification will happen and tank your performance.
Be careful with Hetzner: they null-routed my game server on launch day due to false positives from their abuse system, and it then took 3 days for their support team to re-enable traffic.
By that point I had already moved to a different provider of course.
DigitalOcean did this to my previous company. They said we'd been the target of a DoS attack (no evidence we could see). They re-enabled the traffic, then did it again the next day, and then again. When we asked them to stop doing that, they said we should use Cloudflare to prevent DoS attacks... all the box did was store backups that we transferred over SSH. Nothing that could go behind Cloudflare, no web server running, literally only one port open.
Reading comments from the past few days makes it seem like dealing with Hetzner is a pain (and, as far as I can tell, they aren't really that much cheaper than the competitors).
Data centers used 460 TWh, or about 2% of total worldwide electricity use, according to IEA in 2022.
In comparison, 30% of total energy (energy! Not electricity) goes to transport!
As another point of comparison, transport in Sweden in 2022 used 137 TWh [1]. So the same order of magnitude as total datacenter energy use.
And datacenters are powered by electricity, which increases the chance that the energy comes from renewables. Conversely, the chance that diesel comes from a renewable source is zero.
So can we please stop talking about data center energy use? It's a narrative the media is currently pushing, but like so many such narratives it makes no sense. It's not the thing we should be focusing on if we want to decrease fossil fuel use.
[1]: https://www.energimyndigheten.se/en/energysystem/energy-cons...
Hi Bill, Wow! Thanks for the amazing write-up and for sharing it on your blog and here! I am so happy that we've helped you save so much money and that you're happy with our support team! It's a great way to start off the week! --Katie
My main complaint with OVH is that their checkout process is broken in various ways: missing translations so you get French bits, broken translations so placeholders like ACCEPT_BUTTON leak through, legally binding terms with typos and weird formatting because someone copied them from a PDF into a textarea, UIs from the 90s plastered in between modern ones, no option to renew a domain for longer than a year, a confusing automatic-renewal setup, and so on. The control panel in general is quite confusing. They also don't allow hosting an email server (port 25 is blocked); iirc the docs tell you to go away and use a competitor.
I didn't have any of these web UI issues with Hetzner, but iirc OVH is cheaper for domain names, and it has very reliable and fast DNS servers (I measured various query types across some 6 months), which is why I initially chose them. Then my home ISP gave me a burned IP address and I needed an externally hosted server for originating email (despite it coming from an old and trusted domain that permitlists the IP address), so now I'm with both OVH and Hetzner... Anyway, another thing I like about OVH is that you can edit the raw zone file data and that they support some of the more exotic record types. I don't know how Hetzner compares on domain hosting though.
This is probably out of left field, but what is the benefit of a naming scheme for nodes without any delimiters? Reading at a glance, and not knowing a given provider's (i.e. Hetzner's) region-name conventions, I'm at a loss to quickly map "<region><zone><environment><role><number>" onto "euc1pmgr1". I feel like I'm missing something, because having delimiters would make all sorts of automated parsing much easier.
That's a really good article. We actually migrated recently as well, using dedicated nodes in our setup.
In order to integrate a Hetzner-provided load balancer with our k8s on dedicated servers, we had to implement a super-thin operator that does it: https://github.com/Intreecom/robotlb
If anyone is inspired by this article and wants to do the same, feel free to use this project.
I'm planning on doing something similar but want to use Talos with bare-metal machines. I expect to see similar price reductions from our current EKS bill.
It took minutes to set up a cluster, and I love having a UI to see what is happening.
I wish there were more products like this as I suspect there will be a trend towards more self-managed Kubernetes clusters given how expensive the cloud is becoming.
I set up a Talos bare metal cluster about a year ago, and documented the whole process on my website. Feel free to reach out if you have any questions!
We're very happy to use Hetzner for our bare-metal staging environments to validate functionality, but I still feel reluctant to put our production there. Disks don't quite work as intended at all times and our vSwitch setup has gotten reset more than once.
All of this makes sense considering the extremely low price.
Very nicely written article. I'm also running a k8s cluster, but on bare metal with QEMU/KVM VMs for the base load. I wonder why you would choose VMs instead of bare metal if you're looking for cost optimisation (additional overhead maybe?). Could you share more about this, or did I miss it?
Thank you! The cloud servers are sufficiently cheap for us that we could afford the extra flexibility we get from them. Hetzner can move VMs around without us noticing; by contrast, they have been rebooting a number of metal machines for maintenance lately, which would have been disruptive, especially during the migration. I might have another look at metal next year, but I'm happy with the cloud VMs currently.
I feel like much of the work described in the article could be automated by kops, probably in a much better way, especially when it comes to day-2 operations.
I wonder what is the motivation behind manually spinning up a cluster instead of going with more established tooling?
We at Syself.com also have great experiences with Kubernetes on Hetzner. We built a platform on top of Cluster API and brought a managed Kubernetes experience to Hetzner. Now we have self-healing, automated updates and 100% reproducibility, with full bare-metal support.
> Hetzner volumes are, in my experience, too slow for a production database.
That's true, though. To solve that we developed a way to persist the local storage of bare metal servers across reprovisionings. This way it's both faster and cheaper. Now we are adding an automated database deployment layer on top of it.
Puppet's original design was agent-based: an agent ran on the things it was meant to configure. It was never very good at bringing up machines before the agent could connect.
The general flow was imager -> pre-configured puppet agent -> connect to controller -> apply changes to make it perform as X.
Originally it never really had the capacity to kick off the imaging/instantiation itself. This meant that it scaled better (shared state is better handled than in Ansible).
However, Ansible shone because, although it was a bastard to get running on more than a couple of hundred hosts at any speed, you could tell it to spin up 100x EC2 (or equivalent) machines and then transform them into whichever role was needed, as the sketch below shows. In Puppet that was impossible to do in one go.
I assume that's changed, but I don't miss Puppet.
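That flow, in today's terms, is a couple of plays; the module names are from the amazon.aws collection, while everything else (counts, AMI, role name) is illustrative:

    - hosts: localhost
      tasks:
        - name: Spin up the fleet
          amazon.aws.ec2_instance:
            name: worker
            count: 100
            instance_type: t3.medium
            image_id: ami-0abcdef1234567890
          register: fleet

        - name: Add the new hosts to an in-memory group
          ansible.builtin.add_host:
            name: "{{ item.public_ip_address }}"
            groups: workers
          loop: "{{ fleet.instances }}"

    - hosts: workers
      roles:
        - whichever_role_is_needed   # transformed in the same run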
Funnily enough, we made the exact same transition from Heroku to DigitalOcean's managed Kubernetes service and saved about 75%. Presumably this means that had we moved from Heroku to Hetzner, it would have been 93% savings!
The costs of cloud hosting are totally out of control, would love to see more efforts that lets developers move down the stack.
I've been humbly working on https://canine.sh, which basically provides a Heroku-like interface to any K8s cluster.
When I first started hosting servers/services for customers I was using EC2 and Rackspace, then I discovered Linode and was happy it was so much cheaper with apparently no downside. After the first couple of interactions with support I started to relax. Then I discovered OVH, same story. I haven't needed the support yet though.
// Taking another slant at the discussion: why Kubernetes?
Thank you for sharing your experience.
I also have my 3 personal servers with Hetzner, plus a couple of VM instances at Scaleway (a French outfit).
Disclaimer: I’m a Googler, was SRE for ~10 years for GMail, identity, social, apps (gsuites nowadays) and more, managed hundreds of jobs in Borg, one of the 3 founders of the current dev+devops internal platform (and I focused on the releases,prod,capacity side of the platform), dabbled in K8s on my personal time. My opinions, not Google’s.
So, my question is: given the significant complexity that K8s brings (I don’t think anyone disputes this) why are people using it outside medium-large environments?
There are simpler and yet flexible & effective job schedulers that are way easier to manage. Nomad is an example.
Unless you have a LOT of machines, with many jobs (I'd say 250+) to manage, K8s' complexity, brittleness and overhead are not justifiable, IMO.
The emergence of tools like Terraform and the many other management layers on top of K8s, which try to make it easier but just introduce more complexity and their own abstractions, is in itself a sign of that inherent complexity.
I would say that only a few companies in the world need that level of complexity. And then they will need it, for sure.
But for most, it's like buying a Formula 1 car to commute in a city.
One other aspect I noticed is that technical teams tend to carry over the mess they had in their previous "legacy" environment and just replicate it in K8s, instead of trying to do an architectural design of the whole system's needs. And the K8s model enables that kind of mess: a "bucket of things".
Those two things combined mean that nowadays every company has soaring cloud costs and is running things they know nothing about but are afraid to touch in case something breaks. And an outage is more career-harming than a high bill that Finance will deal with later, so why risk it, right?
A whole new IT area has been coined to deal with this: FinOps :facepalm:
I'm just puzzled by the whole situation, tbh.
I too used to run a large clustered environment (VFX) and now work at a FAANG which has a "borg-like" scheduler.
K8s has a whole kit of parts which sound really grand when you are starting out on a new platform, but quickly become a pain when you actually start to implement it. I think that's the biggest problem: by the time you've realised that actually you don't need k8s, you've invested so much time into learning the sodding thing that it's difficult to back out.
The other seductive thing is that Helm provides "AWS-like" features (i.e. fancy load-balancing rules) that are hard to figure out unless you've dabbled with the underlying tech before (varnish/nginx/etc. are daunting; so are storage and networking).
This tends to lead to utterly fucking stupid networking systems, because unless you know better, that looks normal.
Every time I try to use Nomad, or any of the other "simpler" solutions, I hit a wall: there turns out to be a critical feature that is not available, and which, if I retrofit it, will be a hacky one-off that is badly integrated into the API.
Additionally, I don't get US-style budgets or wages. This means that cloud prices which target such budgets are horrifyingly expensive to me, to the point that Kubernetes pays for itself at the scale of a single server.
Yes, single server. The more I make it fit the proper kubernetes mold, the cheaper it gets, even. If I need to extend something, the CustomResourceDefinition system makes it easy to use a sensible common API.
Was there a cost to learning it? Yes, but honestly not so bad. And with things like k3s deploying small clusters on bare metal became trivial.
And I can easily wrap the Kubernetes API into something simpler for developers to use: create paved paths that reduce what they have to know and provide, and that enforce certain deployment standards. At the lowest cost I have encountered in my life, funnily enough.
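To make the CRD point concrete: a paved path can be as small as one custom resource that hides Deployments, Services and Ingress behind a couple of fields. A minimal, entirely made-up example:

    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: pavedpathapps.platform.example.com
    spec:
      group: platform.example.com
      scope: Namespaced
      names:
        plural: pavedpathapps
        singular: pavedpathapp
        kind: PavedPathApp
      versions:
        - name: v1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    image: { type: string }
                    replicas: { type: integer, minimum: 1 }

A small controller then expands each PavedPathApp into the real objects, so developers only ever touch image and replicas.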
> Unless you have a LOT of machines to manage, with many jobs (I’d say +250) to manage, K8s complexity, brittleness and overhead are not justifiable, IMO.
Because it looks amazing on my CV and in my promo pack.
Same reason they'll make 10 different microservices for a single product that isn't even 5K LoC. People chase trends because they don't know any better. K8s is a really big trend.
I didn't touch on that in the article, but essentially it's a one-line change to add a worker node (or nodes) to the cluster, which is then automatically enrolled.
We don’t have such bursty requirements fortunately so I have not needed to automate this.
sureIy|1 year ago
How much is that worth to your company/customer vs a higher monthly bill for the next 5 years?
As a consultancy company, you want to sell that. As a customer, I don't see how that's worth it at all, unless I expect a $10k/month AWS bill.
xkcd comes to mind: https://xkcd.com/1319/
threeseed|1 year ago
They are just pinning the database pods to specific nodes and using a LocalPathProvisioner, or distributed solutions like JuiceFS, OpenEBS, etc.
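For reference, the pinning half of that is just a nodeSelector plus a local PVC; label and claim names here are illustrative, and local-path is the storage class the Rancher local-path-provisioner installs by default:

    apiVersion: v1
    kind: Pod
    metadata:
      name: postgres-0
    spec:
      nodeSelector:
        db: "true"          # a label you put on the NVMe node
      containers:
        - name: postgres
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: postgres-data   # PVC with storageClassName: local-path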
lucasrattz|1 year ago
This is the guide I wrote for our customers: https://syself.com/docs/hetzner/apalla/how-to-guides/storage...
jonas21|1 year ago
How many nodes are there, how much traffic does it receive, what are the uptime and latency requirements?
And what's the absolute cost savings? Saving 75% of $100K/mo is very different from saving 75% of $100/mo.
preisschild|1 year ago
https://github.com/syself/cluster-api-provider-hetzner
works rock solid
czhu12|1 year ago
8GB RAM, shared CPU on Hetzner is ~$10.
The equivalent on DigitalOcean is $48.
lucasrattz|1 year ago
If you want a managed experience on Hetzner, you could take a look at https://syself.com
Disclaimer: I'm an employee there
oblio|1 year ago
The comment was making fun of the wishful thinking and the realities of networking.
It was a funny comment :-(
esher|1 year ago
I believe that Hetzner data centers in Europe (Germany, Finland) are powered by green energy, but not the locations in US.
preisschild|1 year ago
https://app.electricitymaps.com/
postepowanieadm|1 year ago
Green lignite.
ArtTimeInvestor|1 year ago
There ain't many large European cloud companies, and I would like to understand how they differentiate.
Ionos is another European one. Currently, it looks like their cloud business is stagnating, though.
thenaturalist|1 year ago
Bonkers first experience in the last two weeks.
Graphical "Data center designer", no ability to open multiple tabs, instead always rerouting to the main landing page.
Attached 3 IGWs to a box, all public IPs, GUI shows "no active firewall rules".
IGW 1: 100% packet loss over 1 minute.
IGW 2: 85% packet loss over 1 minute.
IGW3: 95% packet loss over 1 minute.
Turns out "no active Firewall rules" just wasn't the case and explicit whitelisting is absolutely required.
But wait, there's more!
Created a hosted PostgreSQL instance, assigned a private subnet for creation.
SSH into my server, ping the URL of the created Postgres instance: The DB's IP is outside the CIDR range of the assigned subnet and unreachable.
What?
Deleted the instance, created another one, exact same settings. Worked this time around.
Support quality also varies extremely.
Out of 3 encounters, I had a competent person once.
The other two straight out said they had no idea what was going on.
j16sdiz|1 year ago
This is a very low usage toy server, can't speak for performance/cost.
BillFranklin|1 year ago
Parsing works the same but is based on a simple regex rather than splitting on a hyphen.
euc=eu central; 1=zone/dc; p=production; wkr=worker; 1=node id
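Presumably something along these lines, with fixed-width fields doing the work that delimiters would otherwise do (the field widths, and s for staging, are inferred from the example, so treat them as assumptions):

    echo "euc1pmgr1" | sed -E \
      's/^([a-z]{3})([0-9])([ps])([a-z]{3})([0-9]+)$/region=\1 zone=\2 env=\3 role=\4 node=\5/'
    # -> region=euc zone=1 env=p role=mgr node=1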
MuffinFlavored|1 year ago
What do the fine people of HN think about the size/scope/amount of technology of this repo?
It is referenced in the article here: https://github.com/puppetlabs/puppetlabs-kubernetes/compare/...
aravindputrevu|1 year ago
At the end of the day, they are a business!
Iwan-Zotow|1 year ago
Well, running on bare metal would be even better.