tstack | 3 months ago
Oh, oh, I get to talk about my favorite bug!
I was working on network-booting servers with iPXE and we got a bug report saying that things worked fine until the cluster grew beyond four or five machines. In a larger cluster, machines would not come back up after a reboot. I thought QA was just being silly: why would the size of the cluster matter? I took a closer look and, sure enough, was able to reproduce the bug. Basically, the machine would sit there, stuck trying to download the boot image over TCP from the server.
After some investigation, it turned out to be related to the heartbeats sent between machines (they were ICMP pings). Since iPXE is a very nice and fancy bootloader, it will happily respond to ICMP pings. Note that, in order to do this, it does an ARP to find the address to send the response to. Unfortunately, the ARP cache was pretty small since this was "embedded" software (take a guess how big the cache was...). Essentially, while iPXE was downloading the image, the image server's entry would get pushed out of the ARP cache by all these heartbeats. Thus, the download would suffer since it had to constantly pause to redo the ARP request. Things worked with a smaller cluster because the ARP cache was then big enough to hold both the download server and the peers in the cluster.
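To make the eviction effect concrete, here is a toy simulation, not iPXE code. It assumes the cache evicts the least recently used entry and that every peer sends one heartbeat between consecutive packets to the image server; both are my assumptions, and `cache_size` is a made-up parameter (I'm not giving away the real number).

```python
from collections import OrderedDict


class ArpCache:
    """Tiny ARP cache with least-recently-used eviction (assumed policy)."""

    def __init__(self, size):
        self.size = size
        self.entries = OrderedDict()
        self.misses = 0  # each miss stands for one ARP request on the wire

    def lookup(self, ip):
        if ip in self.entries:
            self.entries.move_to_end(ip)  # mark as most recently used
            return
        self.misses += 1
        if len(self.entries) >= self.size:
            self.entries.popitem(last=False)  # evict the oldest entry
        self.entries[ip] = "mac-of-" + ip


def server_arp_requests(cache_size, num_peers, rounds=100):
    """Count how often the image server has to be re-resolved."""
    cache = ArpCache(cache_size)
    server_misses = 0
    for _ in range(rounds):
        before = cache.misses
        cache.lookup("server")  # send one packet of the download
        server_misses += cache.misses - before
        for p in range(num_peers):  # answer one heartbeat per peer
            cache.lookup(f"peer{p}")
    return server_misses


if __name__ == "__main__":
    for peers in range(1, 7):
        print(peers, "peers ->", server_arp_requests(4, peers), "server ARPs")
```

With a hypothetical `cache_size` of 4, three peers cost a single ARP request for the server, while four or more peers force a fresh ARP request on every round. That is the same cliff QA reported, just with invented numbers.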
I think I "fixed" it by replying to the ICMP ping using the source MAC address of the incoming packet (after making sure it wasn't a broadcast address) rather than doing an ARP.
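Roughly, the idea looks like the sketch below. This is from memory and is not the actual iPXE patch: `reply_dest_mac` and `arp_lookup` are hypothetical names, and the extra check for the multicast bit is my own addition on top of the broadcast check.

```python
BROADCAST_MAC = bytes([0xFF] * 6)


def reply_dest_mac(src_mac, src_ip, arp_lookup):
    """Pick the destination MAC for an ICMP echo reply.

    src_mac    -- source MAC of the frame that carried the echo request
    src_ip     -- source IP of the echo request
    arp_lookup -- hypothetical fallback that resolves an IP via the ARP cache
    """
    is_unicast = (
        len(src_mac) == 6
        and src_mac != BROADCAST_MAC
        and not (src_mac[0] & 0x01)  # group bit set means multicast/broadcast
    )
    if is_unicast:
        # Reuse the sender's MAC: no ARP lookup, so no cache entry is touched.
        return src_mac
    # Fall back to the old behaviour for anything suspicious.
    return arp_lookup(src_ip)
```

Because the reply never goes through the ARP cache, heartbeats from peers can no longer push out the image server's entry.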
happyPersonR | 3 months ago