I also have redundant WAN at my house, slightly less sophisticated: Comcast (primary) and U-verse (backup) on separate modems (wired only, no WiFi). When an outage incident occurs and gets escalated, I receive a page (an iMessage from a family member: "Dad, the WiFi is down!"). If I'm away from the NOC/DC, I call the DC remote-hands support line (call an onsite family member) and have them perform a hard cutover ("go to the back of the device with the antenna thingies, disconnect the BLUE cable and plug in the YELLOW cable").
I do have a UPS on the modems and main access point... but after reading this post, I may invest in a diesel generator and a 5,000 gallon subterranean tank.
OP is using CenturyLink fiber. I don't know if things have improved in the two years since I moved from Tacoma, WA, where I had it, but it was dreadfully unreliable back in 2016. The unreliability wasn't caused by the fiber drop itself but rather by a super shitty oversubscription issue up in their Tukwila/Seattle exchange.
Their IPv6 situation was even worse. They used 6rd, and I swear the translation box was probably a single router or Linux box with a 100 Mbit NIC in a rack somewhere. If you bothered enabling 6rd, every v6 site would be awfully slow. Even the browsers' Happy Eyeballs-style logic for automatically choosing between v6 and v4 didn't help.
When I finally moved away and cancelled the service, I mailed my modem back as directed. A few months later, they sent my account to a collections agency over the cost of a modem, which their system claimed to have not received. I spent hours on endless phone calls but ended up just paying them the $250 or whatever to save my credit and stop the madness.
Seriously, they were the worst provider I ever had.
> ... I may invest in diesel generator and a 5,000 gallon subterranean tank.
I highly recommend propane-fueled gensets. Fuel storage is much less hassle, and propane doesn't go bad. You won't get the runtime of a huge diesel tank. But there's often little point in that, because an extended power outage will also take down the telecom infrastructure. As I recall, a ~7kW genset at ~70% capacity went through ~40kg propane per day. That was running well pump, sump pump, refrigerator, CFLs, microwave, fans, and a ~3kW UPS for several computers.
Edit: Make sure to get a UPS that accepts genset power. That usually means full online aka double conversion. And everything must be grounded properly. Plus at least a manual transfer switch, to avoid injuring utility workers.
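For anyone wanting to sanity-check those numbers, a back-of-the-envelope in Python; the ~25% genset efficiency and ~12.9 kWh/kg propane heating value are generic assumptions, not measurements from this setup:

```python
# Rough sanity check of the ~40 kg/day propane figure.
GENSET_KW = 7.0            # rated output
LOAD_FRACTION = 0.7        # running at ~70% capacity
EFFICIENCY = 0.25          # assumed fuel-energy -> electrical-energy
PROPANE_KWH_PER_KG = 12.9  # assumed lower heating value of propane

electrical_kwh_per_day = GENSET_KW * LOAD_FRACTION * 24   # ~117.6 kWh
fuel_kwh_per_day = electrical_kwh_per_day / EFFICIENCY    # ~470 kWh
propane_kg_per_day = fuel_kwh_per_day / PROPANE_KWH_PER_KG
print(round(propane_kg_per_day, 1))  # 36.5
```

Close enough to the ~40 kg/day recalled above that the story checks out.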
I have redundant WAN at my house too - Fios for primary and T-Mobile for backup. It's not a seamless handoff because I have to turn on the hotspot on my phone.
Hey, how can I learn more about all this? I'd love to understand what's going on on the GitHub page. I'm following a bit, but I still want a better understanding, the way you and the rest of the commenters have. What are some good starting points?

Any resources, books, links, and especially YouTube videos that you can point me to?

Also, in his setup, there's no router? He says he's using a VM? What does that mean?
Well, too late now, but this is why you either hand it to a human and get a receipt, or send it via certified or registered mail, because that'll hold up in court.
It's unfortunate anyone has to go through so much trouble to prove to CenturyLink what their inventory system is probably telling them anyway, but it's always best to protect yourself.
> 5,000 gallon subterranean tank.
...about 19,000 litres.
That seems like tremendous overkill. Are you also planning on using that for home heating? If not, it seems like a very large maintenance burden for anything other than some kind of survival scenario.
>I do have a UPS on the modems and main access point.. but after reading this post, I may invest in diesel generator and a 5,000 gallon subterranean tank.
I'm not sure if that's a joke or not, mainly because after reading it I'm thinking "It's a stupid idea, but ... no it's a really stupid idea, but ..."
Whenever I see solutions like this I think back to an org I worked at where a high-visibility day-long database outage gained upper level management attention. The response, after the managers talked to our vendor (IBM), was to re-architect everything to use HACMP clusters for all of our production databases company-wide.
That was followed by a couple years of 100+ hour/year cumulative outages due to HACMP stability issues, and an environment that everyone was deathly afraid to touch.
The hardcore network engineer in me appreciates the detail in these kinds of solutions, but these days the practical side of me is satisfied with the usability and maintainability of SPOF cable access, with manual failover to a mobile hotspot on the rare occasions it drops offline.
Former network engineer here, can confirm. Time and again I've seen redundant systems create their own problems where without all that extra complexity things would have been fine.
Even the ISPs and CDNs I worked with sometimes had surprisingly uncomplicated redundancy systems (sometimes just a handful of small routers they were very much ready to power down to cut over to backup paths or bring up new ones), and often they did not use the more complicated methods.

The catch with complicated redundancy is that there is always a very close relationship, a protocol or something, between redundant components, be it storage systems, network systems, anything. Inevitably a system goes down or loses its mind and takes its redundant peers with it. Every new system you introduce is one more piece that could reach out and take everyone else down with it. I saw it time and again, and again...
Reminds me of what my brother in law says: I don't want to be stuck doing tech support for my family.
With my luck, it would catastrophically fail while out of town, leaving the wife and kids without internet.
My dad set up a lot of complicated stuff like this. As people are prone to do, eventually he died, and it just made it difficult to troubleshoot technical problems for mom. So now the equipment sits in some corner, unused, because we replaced it all with something your average AT&T technician could troubleshoot.
I used to work for a company whose setup was super simple.
ADSL Modem > Firewall > Router > Web/DB servers
It was basic, but it worked. Our web servers were mission critical, but as a B2B business they, and the ADSL connection, didn't sustain a heavy load. The only issues we had over several years were with the ADSL modem. Everything else just worked.
When we moved office we moved our servers to a co-hosting centre with an upgraded network setup with all sorts of backup and redundancy. Every week something went wrong. Sometimes simple is best.
I worked at a place that hosted the servers in-house. They even built a special little air-conditioned room and put a generator on the roof. I never knew all the details, but there was dual everything, two lines coming in, stuff to switch between them, nothing could possibly go wrong... until the day it did. Turns out someone had plugged all the machines into a single extension cable, and the fuse popped.
My anecdata: I used to admin a SWIFT cluster. It was built strictly by the manuals on IBM hardware, which included HACMP with quorum determined by a shared disk.

Nobody understood exactly how the cluster worked, to the point that a correction my boss made to the physical connections lost us a couple million dollars in unprocessed transactions.

The funny part is, even when the cluster was working fine, a takeover took at least 20 minutes. During that time nothing was "available". No matter what, SWIFT Alliance took that long to properly close and reopen the DB.
I look at that and all I want to do is raise an eyebrow. That's like water-cooling Celerons or heavily tweaking Honda Civics - you're not doing all that for redundancy, you're doing it as a hobby, and redundancy (or speed) is the excuse.
I've set up ISP redundancy on my home network before, I should probably test to verify that it still works after my update some months back. It's a truly high-tech solution: A Netgear WNDR3700v2 router (5x Gigabit, dual-band, circa 2011) running LEDE (previously OpenWRT).
It's not automatic, but I can set it to act as a wifi client, so if my regular Internet goes down I can simply connect into the router, connect to a phone hotspot, and continue providing internal network access. I don't recall if it's able to act as both a client and an AP on the same frequency at the same time, but since my wife's Kindle and Chumby are the only 2.4-only devices in the house I'm not really that concerned about it either.
And yes, the Chumby does still work though it's just a clock these days.
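For the curious, on LEDE/OpenWrt that client mode is just an extra stanza or two in the UCI config. This is a hedged sketch with placeholder radio, SSID, and key names, not the actual config from the comment above:

```
# /etc/config/wireless -- add a client ("sta") interface alongside the AP
config wifi-iface
        option device 'radio0'
        option mode 'sta'
        option network 'wwan'
        option ssid 'PhoneHotspot'
        option encryption 'psk2'
        option key 'changeme'

# /etc/config/network -- the new wwan interface just takes DHCP from the phone
config interface 'wwan'
        option proto 'dhcp'
```

After `uci commit` and a `wifi` reload, the router uplinks through the hotspot while still serving the LAN.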
That's what I thought too. All this for 45 minutes of internet when the power goes out, and the twice-in-a-lifetime day saved when something crashes hard (like a hard drive) and you need to restore from backup. It has to be for tinkering.
It is hard to beat a stock telco-supplied router plus a generic Android phone for maximum uptime. If one connection is wired and the other is wifi, then the computer handles broadband difficulties with no problems.
If you are actually serious about 'single point of failure' then you just need to live with someone who is likely to not pay the bills for electricity or broadband. Being insufficiently creditworthy to have better than a pay-as-you-go burner phone helps too, as every byte costs $$$. Living in an area where any nice toys will get stolen/destroyed also 'helps', as a refurbished laptop running Linux is then the only practical option. Congested wifi 'helps' too; a basic wifi booster with ethernet out becomes truly useful for 'blazing speeds', particularly if you want your backup network to come from the local cafe or some neighbour with an easily Googlable password.

Having a local server for development and version control means that you are good to go for useful work even when there is no connectivity at all.

For entertainment a regular FM radio works fine. Two refurbished laptops and a USB stick for bulk transfer of current project stuff make it fully possible to pull an all-nighter even if there is no electricity for bills-not-being-paid reasons. A nice addition is a Chromebook: those things are designed for nine-year-olds, the battery lasts 10 hours with no difficulty, and they do the job with better wifi than any normal laptop, no fans and no thermal runaway.
Even better, the whole kit can be put in a modest backpack and a bit of couch-surfing later one can be back in business.
It is much more satisfying to do more with less, I would probably hate myself if I had a basement full of servers and only whiled away the hours on social media rather than do 'work'.
This budget ethos is an anti-pattern, but why should it be? The carbon footprint of operating on low-power refurbished hardware is penguin-friendly and cheap. If your apps are supposed to be compatible with regular consumer PCs, then it doesn't really help to have a beast of a machine with a 4K screen, 32 GB of RAM and some quad Xeon. Maybe a Linux toolchain with no virtualisation is better for making one's code performant on target devices. Obviously an SSD helps.
The kids and the grandparents can read books together if the devices are down. They can also listen to the FM radio. What's not to like?
Thank goodness I don't do company IT. Yes it would consist of two refurbished laptops hidden under the floorboards, servicing 50-100 office workers without any difficulty.
Good move on having not just two WANs, but two technologies. I've seen setups before where people had two WANs from two different ISPs, but both cables ran down the same duct in the road. A single digger took them both out. It would be a pretty severe problem if fibre and wireless went down at the same time!
I assume you're not running a full BGP handoff to each ISP, so any existing sessions will die should your WAN die (as your LAN gets NATed behind a different IP address). Presumably your NAT state will move over in the case of router failure, as it's a floating VM of some sort -- so what's the failover time for each component? How does it compare to using, say, VRRP?
How are you detecting ISP failures -- are you pinging beyond the next hop, or are you assuming if you can ping/arp the upstream router, it's working? I've had failure scenarios with ISPs where the next hop works, but nothing past that.
What benefits are there of tcpproxy over something like nginx (for http/s) or dst-nat (for other connections)?
It looks like all your traffic defaults to WAN1, and only uses WAN2 in certain cases. Do you have the ability to send traffic for a given client to WAN2 by default?
What type of queuing are you using -- can 1 client hog all the bandwidth?
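On the detection point: the trick is to probe something well past the ISP's first hop. A minimal sketch of the idea, where the host and port defaults are assumptions (a public DNS resolver over TCP) and a real multi-WAN router would also pin each probe's source address to the corresponding WAN interface:

```python
import socket

def wan_up(host="9.9.9.9", port=53, timeout=2.0):
    """Return True if a TCP connection to a target well beyond the ISP's
    first hop succeeds. Catches the 'next hop answers, nothing past it
    does' failure mode that pinging the gateway misses."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A cron job or small daemon would call this per WAN and flip the default route when the primary goes False a few checks in a row.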
If you want to feel more inferior about your home lab, https://www.reddit.com/r/homelab is a good source of safe-for-work porn and information on over-engineered setups.
~10 years ago, I had a completely full 42U cabinet in my house, along with another 8U or so of gear and several devices that aren't measured in RUs (access points, cable and DSL modems, VoIP phones, etc.).
Most of the gear was used for lab scenarios and such for various (Cisco, Juniper, et al) networking certs and was (mostly, but not completely) isolated from my "real" network. IIRC, I had ~35 VLANs at one point.
My extremely over-engineered home lab certainly served its purpose but I think I spent as much time maintaining it as I did actually using it, although it really came in handy for building out PoCs for projects I was handling at $work (my test/lab network at $work wasn't nearly as well-equipped as my home lab was!).
For the last several years, though, I've managed to get by with a single subnet that is shared by everything -- a few laptops, a couple desktops, a server hosting the handful of obligatory VMs, and, of course, the various phones, tablets, and streaming devices that are ubiquitous in all of our homes nowadays.
Just within the last few weeks, however, I've acquired a new server (2 x 10-core Xeons, 256 GB RAM, 4 "Enterprise" SSDs and 12 "Enterprise" HDDs (600 GB 15k SAS)), dug a couple switches out of storage in the garage, replaced my Internet router with a small industrial box running OpenBSD, and started building out a few more subnets for proper separation of various devices (I've twice been offered a 42U cabinet recently but, thus far, managed to say no!). Like probably most HN'ers, I've got a few VPSes spread out here and there as well. Finally, I've got a decent (but was over-built) 2U box in a rack at $work ($work == ISP) that I am planning to use to tie all of this together (using Wireguard, of course).
Yes, I'm fully aware that I'm in the beginning stages of a relapse. After these upcoming changes, however, I don't intend to "grow" this lab much larger (although this kinda stuff does just creep up on you sometimes).
Good stuff. However, there's only one Linux router (VM), which means you can't upgrade and reboot it without loss of service. The way around that is two VMs and VRRP or similar, plus a lot of very complicated NAT and firewall rules.
Out of the box, pfSense can do multi-WAN and CARP (similar to VRRP) clustering. At the office I have two older servers with lots of NICs and five WANs. Inbound redundancy is provided by dynamic DNS, SRV records, etc.

Note that to do CARP/VRRP you need at least a /29 IPv4 allocation: an address per box, plus the virtual one that is actually used by services.

PPPoA/E is harder to deal with than cable, leased line, etc., but it turns out the low-cost Billion 8800NL R2 can do external IPv4 pass-through as well as handle the PPPoA/E. It will need an address from your range too. You need something like it in this case because only one device can be the PPPoA/E dial-up endpoint at a time; unless you have some very fancy secret sauce, your clustered routers' pppd (or whatever) are going to get confused about who does what.
I notice you have a Cloud Key. UniFi on an Ubuntu VM is easy, and much easier to back up and snapshot before upgrades, so it's safer. You can also front it with HAProxy for simple URLs and perhaps Let's Encrypt. pfSense has an HAProxy package with a GUI and I believe it is CARP-friendly as well...
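The /29 arithmetic is easy to check with Python's ipaddress module; 203.0.113.0/29 is a documentation prefix standing in for a real allocation:

```python
import ipaddress

# A /29 leaves 6 usable addresses once network and broadcast are excluded.
net = ipaddress.ip_network("203.0.113.0/29")
hosts = list(net.hosts())
plan = dict(zip(["fw1", "fw2", "CARP VIP", "PPPoE modem"], hosts))
print(len(hosts), len(hosts) - len(plan))  # 6 2
```

Two real firewalls, the shared CARP VIP, and a pass-through modem consume four, leaving two spare -- hence a /30 or /31 can't do it.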
Nice setup, but we can all pretty much agree it's overkill for most. My ISP is fairly reliable and, outside of infant mortality, most network elements have a pretty long MTBF.
I run a similar set of WiFi gear. I've a couple PoE powered Unifi UAP-AC-Pro spread around the house, all connected to an 8-port Unifi PoE GigE switch. Routing is done with an EdgeRouter lite, which as it turns out is capable of line rate GigE.
I have a low power industrial computer with 4 cores and 8GB memory that runs various services mostly via docker or vagrant. It consumes about 12w.
It's all powered by a 750VA APC SmartUPS. I get almost an hour of runtime on the internal batteries. I may add some external batteries at some point, but most power outages in my area don't last longer than 20-30 minutes.
Hardwired all the desktops and a few access points via cheap 1 Gbit hardware (literally found some at the thrift store/eBay), usually running Tomato/Shibby.

Have a backup router.

Battery backup on the main routers/modem.

Large external battery wire-nutted to my desktop UPS.

NAS is an old laptop with its battery intact; doubles as a second display/machine.

Use my phone via USB on my desktop if all else fails.

Total cost: probably less than $100.
Oh, and I use a $5/month server for stuff that absolutely needs to be on full time. Otherwise the only external access is me occasionally remoting into my desktop and I am happy to stop and smell the flowers if that is interrupted briefly.
I have an even simpler setup: if my cable connection dies, I simply tether my phone to replace it. There are no UPSes because both the laptop (TP25 w/ 24 + 72 Wh batteries) and the phone (it's a Moto Z Play with a battery mod) have large enough batteries to last much longer than a domestic blackout in downtown Vancouver.
My laptop is enough for me to stay productive (it's a ThinkPad 25! very productive). Everything that needs to be online is on a Hetzner server I rent for all sorts of purposes so the 51 EUR monthly bill kind of spreads out.
Fun solution, but seems like overkill for just about every home user.
I used to use a dual-WAN setup with cable modem + DSL backup. It worked well with automatic failover. I use a pfSense APU based router and, with no moving parts, it's been very reliable, nearly 4 years without any unscheduled downtime.
Then I moved and only had a single ISP to choose from, so my backup is to manually turn on a Wifi hotspot. I thought about using a cellular router with ethernet or a wifi connection to the hotspot for auto-failover, but it just wasn't worth the time and/or money to set it up -- if I'm home when the internet goes down, I can just switch to the hotspot, if I'm not home, then all I really lose is the ability to control the lights and thermostat remotely, not exactly a critical function.
I think that's quite the understatement. The thing that really stands out to me is the claim that all of that is only drawing 220W at idle. I'm curious if he means truly idle, like literally just booted up and not doing anything at all, zero traffic, etc. Or if that's the draw with stuff actually being used. Because 220W just for your home network is hilarious. I mean I feel dumb often because my little pfsense box pulls about 15W.
This was all fairly straightforward to implement a decade ago on cheap hardware: OpenBSD on a pair of ALIX boards and a pair of semi-cheap Netgear switches. Full firewall and VPN failover using pfsync and sasync, IP failover with CARP.
You can do load balancing using PF as well, which is what we were mostly offering, cheap fault tolerant hosting for colocated customers.
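For reference, a hedged sketch of what that looks like on a modern OpenBSD box. Interface names, addresses, and the password are placeholders, and older pf releases used the `rdr ... ->` form rather than `rdr-to`:

```
# /etc/hostname.carp0 -- shared gateway IP; whichever box is CARP master answers
inet 192.0.2.1 255.255.255.0 192.0.2.255 vhid 1 carpdev em0 pass s3cret

# /etc/hostname.pfsync0 -- replicate firewall state to the peer over em2
up syncdev em2

# pf.conf -- round-robin two colo web servers behind the shared address
pass in on em0 proto tcp to port 80 \
    rdr-to { 192.0.2.10, 192.0.2.11 } round-robin
```

With pfsync carrying state, established connections survive a CARP failover between the two firewalls.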
* Cantenna/laser link to a house some blocks away to avoid local WAN link disruption
* For less performance-intense networks, remove the physical impediments: 2 routers, each with 1 APC, connected to 2 separate power circuits, connected to 2 WAN links, providing 2 radios each. No switch to go down or cables to trip over, redundancy of access point, redundancy of frequency/radio, redundancy of WAN link, redundancy of power. Hardware-wise this is pretty cheap and still highly available. If the routers are cheap, use a hardware watchdog.
I also thought having everything on UPS would allow me to keep an Internet connection during a power outage. Turns out that when the power goes out so does my ISP. Having a second ISP on LTE or Wifi like this setup may or may not be enough to fix that.
I attempted something similar to this in a 20U cabinet some time back. The biggest issue is the fan noise that 1U servers and network gear produce with their rather high-RPM fans. One can hear it from the other side of the house.
We've since switched to fanless network gear and ATX form factor servers with large diameter fans to keep the family happy. It definitely doesn't look as nice, though.
You can get pretty much the same result from a couple of fanless routers (mikrotik, something running ddwrt, etc) -- resilient against hardware failure, power failure, and wan failure.
Not as cool though, and clearly not running any servers, but that's what things like AWS or Linode are for -- or for low power stuff, something like a fitlet [0]
> ATX form factor servers with large diameter fans to keep the family happy.
It's insane how quiet you can go with this approach while remaining air-cooled. I know when my home server is running backup scripts because the noise increases at least tenfold when the hard drives spin up. Fortunately, I have arranged for that to happen only once a day -- the rest of the time the drives are in standby.
Have you also given yourself a mobile equivalent for those times when you are traveling, or when your primary environment is unsuitable and you must work at a place with public WiFi?
Neat! But to be honest, it's way more than I'd ever invest in a home setup. I manage an entire office of ~30 people with much less redundancy than this!
Awesome setup, Brad. I wish I had a tenth of that speed. I have Verizon DSL (1.5 Mbit down and 700 kbit up). They advertise it as 3 down and 1.5 up, but I've never seen that. That's the best I can get in rural Virginia. I do use SQM on a Ubiquiti EdgeRouter X to fix bufferbloat, so latency is very good.
And thanks for all the Go code. It's awesome! I'm building 1.10.3 on an old BeagleBone Black right now ;)
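SQM only helps if the shaper, not the modem, is the bottleneck, so you shape somewhat below the measured line rate. A sketch of the usual rule of thumb -- the 90% figure is a common convention, not Ubiquiti guidance:

```python
def sqm_rate_kbit(measured_kbit, fraction=0.90):
    """Shape to a fraction of the measured line rate so the queue builds
    in the router, where fq_codel/cake can manage it, rather than in the
    modem or DSLAM."""
    return round(measured_kbit * fraction)

# For the 1.5 Mbit down / 700 kbit up DSL line above:
print(sqm_rate_kbit(1500), sqm_rate_kbit(700))  # 1350 630
```

From there it's a matter of measuring real throughput and nudging the fraction up until latency under load starts to climb.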
[+] [-] mansilladev|7 years ago|reply
I do have a UPS on the modems and main access point.. but after reading this post, I may invest in diesel generator and a 5,000 gallon subterranean tank.
[+] [-] chrissnell|7 years ago|reply
Their IPv6 situation was even worse. They used 6rd and I swear, the translation box was probably a single router or Linux box with a 100 Mbit NIC in a rack somewhere. If you bothered enabling 6rd, every v6 site would be awfully slow. Even the browser projects to automate the selection of v6/v4 didn't help.
When I finally moved away and cancelled the service, I mailed my modem back as directed. A few months later, they sent my account to a collections agency over the cost of a modem, which their system claimed to have not received. I spent hours on endless phone calls but ended up just paying them the $250 or whatever to save my credit and stop the madness.
Seriously, they were the worst provider I ever had.
[+] [-] mirimir|7 years ago|reply
I highly recommend propane-fueled gensets. Fuel storage is much less hassle, and propane doesn't go bad. You won't get the runtime of a huge diesel tank. But there's often little point in that, because an extended power outage will also take down the telecom infrastructure. As I recall, a ~7kW genset at ~70% capacity went through ~40kg propane per day. That was running well pump, sump pump, refrigerator, CFLs, microwave, fans, and a ~3kW UPS for several computers.
Edit: Make sure to get a UPS that accepts genset power. That usually means full online aka double conversion. And everything must be grounded properly. Plus at least a manual transfer switch, to avoid injuring utility workers.
[+] [-] segmondy|7 years ago|reply
[+] [-] nvr219|7 years ago|reply
[+] [-] minhazm423|7 years ago|reply
Any resources, books, links, youtubes especially, that you can point me to?
Also, in his set up, theres no router? He says hes using a VM? What does that mean?
[+] [-] technofiend|7 years ago|reply
It's unfortunate anyone has to go through so much trouble to prove to Century Link what their inventory system is probably telling them anyway but it's always best to protect yourself.
[+] [-] Arbalest|7 years ago|reply
[+] [-] reacharavindh|7 years ago|reply
Any reason not to do that?
[+] [-] mlatu|7 years ago|reply
Can't you keep both connected to a router and have a script do the switching instead?
Anyhow, still impressive.
[+] [-] zeth___|7 years ago|reply
I'm not sure if that's a joke or not, mainly because after reading it I'm thinking "It's a stupid idea, but ... no it's a really stupid idea, but ..."
[+] [-] SpaethCo|7 years ago|reply
That was followed by a couple years of 100+ hour/year cumulative outages due to HACMP stability issues, and an environment that everyone was deathly afraid to touch.
The hardcore network engineer in me appreciates the detail in these kinds of solutions, but these days the practical side of me is satisfied with usability and maintainability of SPOF cable access with a manual failover to mobile hotspot on the rare occasions that drops offline.
[+] [-] duxup|7 years ago|reply
Even ISPs and CDNs I worked with sometimes have surprisingly uncomplicated redundancy systems (sometimes just a handful of small routers they are very much ready to power down to cut over to backup paths or bring up new paths) and often they do not use the more complicated methods.
The catch with complicated redundancy is there is always a very close relationship or protocol or something between redundant components, bet it storage systems, network systems, anything. Inevitably a system goes down or loses its mind and takes it's redundant peers with it.... every new system you introduce is one more piece that could reach out and take everyone else with it. I saw it time and again, and again...
[+] [-] stevbov|7 years ago|reply
With my luck, it would catastrophically fail while out of town, leaving the wife and kids without internet.
My dad set up a lot of complicated stuff like this. As people are prone to do, eventually he died, and it just made it difficult to troubleshoot technical problems for mom. So now the equipment sits in some corner, unused, because we replaced it all with something your average AT&T technician could troubleshoot.
[+] [-] larkeith|7 years ago|reply
[1] https://en.wikipedia.org/wiki/Superiority_(short_story)
[+] [-] DoubleGlazing|7 years ago|reply
ADSL Modem > Firewall > Router > Web/DB servers
It was basic, but it worked. Our web servers were mission critical, but as a B2B business they, and the ADSL connection, didn't sustain a heavy load. The only issues we had over several years were with the ADSL modem. Everything else just worked.
When we moved office we moved our servers to a co-hosting centre with an upgraded network setup with all sorts of backup and redundancy. Every week something went wrong. Sometimes simple is best.
[+] [-] aidos|7 years ago|reply
[+] [-] madmulita|7 years ago|reply
Nobody understood exactly how the cluster worked to the point that a correction my boss made on the physical connections, made us loose a couple of million of dollars in transactions not processed.
The funny part is, when the cluster was working fine, a takeover took at least 20 minutes. During that time nothing was "available". The thing is, no matter what, SWIFT Alliance took that time to properly close and open the DB.
[+] [-] fencepost|7 years ago|reply
I've set up ISP redundancy on my home network before, I should probably test to verify that it still works after my update some months back. It's a truly high-tech solution: A Netgear WNDR3700v2 router (5x Gigabit, dual-band, circa 2011) running LEDE (previously OpenWRT).
It's not automatic, but I can set it to act as a wifi client, so if my regular Internet goes down I can simply connect into the router, connect to a phone hotspot, and continue providing internal network access. I don't recall if it's able to act as both a client and an AP on the same frequency at the same time, but since my wife's Kindle and Chumby are the only 2.4-only devices in the house I'm not really that concerned about it either.
And yes, the Chumby does still work though it's just a clock these days.
[+] [-] scarejunba|7 years ago|reply
Like the guys who make videos of sharpening a grocery store knife to an atom width.
[+] [-] megous|7 years ago|reply
[+] [-] Theodores|7 years ago|reply
It is hard to beat a stock, as supplied by the telco, router with a generic Android phone for maximum uptime. If one connection is wired and the other is wifi then the computer handles broadband difficulties with no problems.
If you are actually serious about 'single point of failure' then you just need to live with someone that is likely to not pay the bills for electricity or broadband. Being insufficiently creditworthy to have better than a pay as you go burner phone helps too as every byte costs $$$. Living in an area where any nice toys will get stolen/destroyed also 'helps' as a refurbished laptop running linux is then only practical option. Congested wifi 'helps' too, a basic wifi booster with ethernet out becomes truly useful for 'blazing speeds', particularly if wanting your backup network to come from the local cafe or some neighbour with an easily Googlable password.
Having a local server for development and version control means that you are good to go when it comes to useful work even if there is no connectivity going.
For entertainment a regular FM radio works fine. Two refurbished laptops and a USB stick for bulk transfer of current project stuff makes it fully possible to pull an all-nighter even if there is no electricity due to bills-not-being paid reasons. A nice add is a Chromebook, those things designed for nine year olds with a battery that lasts 10 hours with no difficulty does the job with better wifi than any normal laptop, no fans and no thermal runaway.
Even better, the whole kit can be put in a modest backpack and a bit of couch-surfing later one can be back in business.
It is much more satisfying to do more with less, I would probably hate myself if I had a basement full of servers and only whiled away the hours on social media rather than do 'work'.
This budget ethos is anti-pattern but why should it be? The carbon footprint of operating on low-power refurbished hardware is penguin friendly and cheap. If your apps are supposed to be compatible with regular consumer PCs then it doesn't really help to have a beast of a machine with 4K screen, 32Gb or RAM and some quad Xeon. Maybe a linux toolchain with no virtualisation is better for making one's code performant on target devices. Obviously an SSD helps.
The kids and the grandparents can read books together if the devices are down. They can also listen to the FM radio. What's not to like?
Thank goodness I don't do company IT. Yes it would consist of two refurbished laptops hidden under the floorboards, servicing 50-100 office workers without any difficulty.
[+] [-] subroutine|7 years ago|reply
[+] [-] isostatic|7 years ago|reply
I assume you're not running a full BGP handoff to each ISP, so any existing sessions will die should your WAN die (as your lan get natted behind a different IP address). Presumably your nat state will move over in the case of router failure as it's a floating VM of some sort, so what's the failover time for each component? How does it compare to using say VRRP?
How are you detecting ISP failures -- are you pinging beyond the next hop, or are you assuming if you can ping/arp the upstream router, it's working? I've had failure scenarios with ISPs where the next hop works, but nothing past that.
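One way to capture that distinction in software is to probe a target beyond the ISP's next hop and require several consecutive failures before switching. A minimal sketch of the decision logic in Python; the window and threshold are arbitrary assumptions, and a real monitor would feed it results from pinging something well past the upstream router, such as a public anycast resolver:

```python
def should_fail_over(probe_history, window=5, threshold=4):
    """Decide whether to switch WANs based on recent probes of a host
    *beyond* the ISP's next hop.

    probe_history: list of booleans, True = probe succeeded, newest last.
    Fails over only when `threshold` of the last `window` probes failed,
    so a single dropped packet doesn't flap the link.
    """
    recent = probe_history[-window:]
    failures = sum(1 for ok in recent if not ok)
    return failures >= threshold

# The next hop answering ARP/ping tells you nothing here; only the far
# probes count. Four of the last five failing triggers failover:
print(should_fail_over([True, False, False, False, False]))  # → True
print(should_fail_over([True, True, False, True, True]))     # → False
```

Debouncing over a window like this avoids flapping the WAN on one lost probe while still reacting within a few probe intervals.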
What benefits are there of tcpproxy over something like nginx (for http/s) or dst-nat (for other connections)?
It looks like all your traffic defaults to WAN1, and only uses WAN2 in certain cases. Do you have the ability to send traffic for a given client to WAN2 by default?
What type of queuing are you using -- can 1 client hog all the bandwidth?
And finally, what keyboard layout is 6 above N?
jlgaddis|7 years ago
Most of the gear was used for lab scenarios and such for various (Cisco, Juniper, et al) networking certs and was (mostly, but not completely) isolated from my "real" network. IIRC, I had ~35 VLANs at one point.
My extremely over-engineered home lab certainly served its purpose but I think I spent as much time maintaining it as I did actually using it, although it really came in handy for building out PoCs for projects I was handling at $work (my test/lab network at $work wasn't nearly as well-equipped as my home lab was!).
For the last several years, though, I've managed to get by with a single subnet that is shared by everything -- a few laptops, a couple desktops, a server hosting the handful of obligatory VMs, and, of course, the various phones, tablets, and streaming devices that are ubiquitous in all of our homes nowadays.
Just within the last few weeks, however, I've acquired a new server (2 x 10-core Xeons, 256 GB RAM, 4 "Enterprise" SSDs and 12 "Enterprise" HDDs (600 GB 15k SAS)), dug a couple switches out of storage in the garage, replaced my Internet router with a small industrial box running OpenBSD, and started building out a few more subnets for proper separation of various devices (I've twice been offered a 42U cabinet recently but, thus far, managed to say no!). Like probably most HN'ers, I've got a few VPSes spread out here and there as well. Finally, I've got a decent (but was over-built) 2U box in a rack at $work ($work == ISP) that I am planning to use to tie all of this together (using Wireguard, of course).
Yes, I'm fully aware that I'm in the beginning stages of a relapse. After these upcoming changes, however, I don't intend to "grow" this lab much larger (although this kinda stuff does just creep up on you sometimes).
JohnJamesRambo|7 years ago
yikes
gerdesj|7 years ago
Out of the box, pfSense can do multi-WAN and CARP (similar to VRRP) clustering. At the office I have two older servers with lots of NICs and five WANs. Inbound redundancy is provided by dynamic DNS, SRV records, etc. Note that to do CARP/VRRP you need at least a /29 IPv4 allocation: one address per box, plus the virtual address that services actually use. PPPoA/E is harder to deal with than cable/leased lines, but it turns out the low-cost Billion 8800NLR2 can do external IPv4 pass-through as well as handle the PPPoA/E; it will need an address from your range too. You need something like it in this case because only one device can be the PPPoA/E dial-up system at a time. Unless you have some very fancy secret sauce, your clustered routers' pppd (or whatever) will get confused about who does what.
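To illustrate the /29 arithmetic (two physical addresses plus one shared virtual address), here is a hedged sketch of an OpenBSD-style CARP interface config. All addresses, the vhid, and the password are made-up placeholders, and pfSense exposes the same knobs through its GUI:

```conf
# Hypothetical /29 (192.0.2.0/29):
#   192.0.2.1  router A (physical)
#   192.0.2.2  router B (physical)
#   192.0.2.3  shared CARP virtual IP -- what services actually use
#
# /etc/hostname.carp0 on router A (the backup node uses a higher advskew):
inet 192.0.2.3 255.255.255.248 NONE vhid 1 carpdev em0 pass s3cret advskew 0
```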
I notice you have a Cloud Key. UniFi on an Ubuntu VM is easy, and much easier to back up and snapshot before upgrades, so it's safer. You can also front it with HAProxy for simple URLs and perhaps Let's Encrypt. pfSense has an HAProxy package with a GUI, and I believe it is CARP-friendly as well ...
halbritt|7 years ago
I run a similar set of WiFi gear: a couple of PoE-powered UniFi UAP-AC-Pros spread around the house, all connected to an 8-port UniFi PoE GigE switch. Routing is done by an EdgeRouter Lite which, as it turns out, is capable of line-rate GigE.
I have a low-power industrial computer with 4 cores and 8 GB of memory that runs various services, mostly via Docker or Vagrant. It consumes about 12 W.
It's all powered by a 750VA APC SmartUPS. I get almost an hour of runtime on the internal batteries. I may add some external batteries at some point, but most power outages in my area don't last longer than 20-30 minutes.
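As a rough sanity check on runtime claims like that: stored battery energy, derated for inverter losses, divided by the sustained load gives a ballpark figure. A sketch where the 100 Wh capacity and 80% efficiency are illustrative assumptions, not SmartUPS specs:

```python
def runtime_hours(battery_wh, load_w, inverter_eff=0.8):
    """Rough UPS runtime estimate: usable stored energy over load."""
    return battery_wh * inverter_eff / load_w

# Assumed numbers: ~100 Wh of battery feeding the ~12 W box plus switch
# and access points (~20 W total):
print(round(runtime_hours(100, 20), 1))  # → 4.0 hours on paper
```

Aged batteries and inverter overhead cut real runtime well below the paper number, which is consistent with seeing closer to an hour at higher loads.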
hackerpacker|7 years ago
My home setup:
hardwired all the desktops and a few access points via cheap 1 Gbit hardware (literally found some at the thrift store/eBay), usually running Tomato/Shibby.
have a backup router.
battery backup on the main routers/modem.
large external battery wire-nutted to my desktop UPS.
NAS is an old laptop with its battery intact; doubles as a second display/machine.
use my phone via USB on my desktop if all else fails.
total cost: probably less than $100.
Oh, and I use a $5/month server for stuff that absolutely needs to be on full time. Otherwise the only external access is me occasionally remoting into my desktop and I am happy to stop and smell the flowers if that is interrupted briefly.
chx|7 years ago
My laptop is enough for me to stay productive (it's a ThinkPad 25! very productive). Everything that needs to be online is on a Hetzner server I rent for all sorts of purposes so the 51 EUR monthly bill kind of spreads out.
Johnny555|7 years ago
I used to use a dual-WAN setup with cable modem + DSL backup. It worked well with automatic failover. I use a pfSense APU based router and, with no moving parts, it's been very reliable, nearly 4 years without any unscheduled downtime.
Then I moved and had only a single ISP to choose from, so my backup is manually turning on a WiFi hotspot. I thought about using a cellular router with Ethernet, or a WiFi connection to the hotspot, for auto-failover, but it just wasn't worth the time and money to set up. If I'm home when the internet goes down, I can just switch to the hotspot; if I'm not home, all I really lose is the ability to control the lights and thermostat remotely, not exactly a critical function.
zf00002|7 years ago
I think that's quite the understatement. The thing that really stands out to me is the claim that all of that is only drawing 220W at idle. I'm curious if he means truly idle, like literally just booted up and not doing anything at all, zero traffic, etc. Or if that's the draw with stuff actually being used. Because 220W just for your home network is hilarious. I mean I feel dumb often because my little pfsense box pulls about 15W.
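The difference compounds over a year. A quick back-of-the-envelope in Python (the electricity price is an assumed example rate):

```python
def annual_kwh(watts):
    """kWh consumed per year by a constant draw of `watts`."""
    return watts * 24 * 365 / 1000

# 220 W around the clock versus a 15 W pfSense box:
print(round(annual_kwh(220)))  # → 1927 kWh/year
print(round(annual_kwh(15)))   # → 131 kWh/year
# At an assumed $0.12/kWh, that is roughly $231 vs $16 per year.
```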
kbenson|7 years ago
You can do load balancing using PF as well, which is what we were mostly offering, cheap fault tolerant hosting for colocated customers.
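For reference, inbound load balancing in pf looks roughly like this. A hedged sketch using modern OpenBSD `rdr-to` syntax with made-up interface and address names (older pf releases expressed the same thing with a standalone `rdr ... round-robin` rule):

```conf
ext_if = "em0"
web_servers = "{ 10.0.0.10, 10.0.0.11, 10.0.0.12 }"

# Spread inbound HTTP across the colo boxes; sticky-address keeps a
# given client pinned to one backend while its states exist.
pass in on $ext_if proto tcp to port 80 rdr-to $web_servers round-robin sticky-address
```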
bradfitz|7 years ago
Having VMs float around with shared storage makes complexity elsewhere go away. i.e. I don't need to deal with CARP, VRRP, etc.
peterwwillis|7 years ago
* Cantenna/laser link to a house some blocks away to avoid local WAN link disruption
* For less performance-intense networks, remove the physical impediments: 2 routers, each with 1 APC, connected to 2 separate power circuits, connected to 2 WAN links, providing 2 radios each. No switch to go down or cables to trip over, redundancy of access point, redundancy of frequency/radio, redundancy of WAN link, redundancy of power. Hardware-wise this is pretty cheap and still highly available. If the routers are cheap, use a hardware watchdog.
daxorid|7 years ago
I attempted something similar in a 20U cabinet some time back. The biggest issue is the fan noise that 1U servers and network gear produce with their rather high-RPM fans; you can hear it from the other side of the house.
We've since switched to fanless network gear and ATX form factor servers with large diameter fans to keep the family happy. It definitely doesn't look as nice, though.
isostatic|7 years ago
Not as cool though, and clearly not running any servers, but that's what things like AWS or Linode are for -- or for low power stuff, something like a fitlet [0]
[0] http://www.fit-pc.com/web/products/fitlet/
kqr|7 years ago
It's insane how quiet you can go with this approach while remaining air-cooled. I know when my home server is running backup scripts because the noise increases at least tenfold when the hard drives spin up. Fortunately, I have scheduled that for only once a day; the rest of the time the drives are in standby.
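For anyone wanting the same behaviour, a drive's idle spin-down timer can be set with hdparm; a sketch where the device name is a placeholder (values 241-251 encode (n-240) x 30 minutes, so -S 242 requests one hour):

```shell
# Spin the drive down after 60 minutes idle (run as root):
hdparm -S 242 /dev/sdX
# Persist across reboots via /etc/hdparm.conf or a udev rule.
```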
halbritt|7 years ago
Clearly hasn't been bitten by it, yet.
I mean... I love Ceph, too, but I don't ever want to run it again.
llama052|7 years ago
This is more of a homelab tinkering setup to learn.
w8rbt|7 years ago
And thanks for all the Go code. It's awesome! I'm building 1.10.3 on an old Beagle Bone Black right now ;)