top | item 13548795

Ask HN: We've all been there, what was your big stuff up?

76 points | shermanyo | 9 years ago | reply

I applaud the transparency of the GitLab team in their recent outage, but felt bad for the engineer whose typo was called out. Anyone who's done something similar will know the feeling immediately after realising their mistake...

To show that this sort of thing happens to the best of us, let's share some of our horror stories :)

A few months ago, I joined a new team and was still finding my way around the environments. I was tasked with performing manual deployments to a Dev, QA and Staging environment that weren't wired up to our automation system yet. We'd scheduled maintenance windows a week apart for the QA and Staging envs as we allow customers to test against these.

So the day of my QA deployment, I start by applying the database changes which all complete successfully. Next, I upload the new .ear files and deploy the new build of our web app. Again, all looks good, so I tell the QA team they can start testing.

Then the alerts started...

I deployed the app to the Staging env by mistake (and unexpectedly restarted the app server). I didn't realise the naming scheme of the hostnames indicated the environment in this case :/

Our UI broke immediately due to the schema changes, so my mistake was _very_ visible. I was lucky I could roll back the change easily, but I don't think I'll forget that day any time soon.

79 comments

[+] herghost|9 years ago|reply
Working on a software deployment across the whole company but without a reliable means of distributing software. Using a combination of AD login scripts where available, but mostly relying on the antivirus product which was installed to locally run scripts on each endpoint.

Cut to 1:30am after a full day of eking out 1 or 2 endpoints here or there, and I've figured out a new method to try. But first I need to test it and make sure it's not going to break anything else, so I create a separate asset group in the AV software and add only my machine to it. I add a simple "hello, world!" type script just to show that the script is executing and wait.

And wait.

No "hello, world!". It's 2am, I'm back in the office at 7am, my new insights will hold until tomorrow. I'm going to bed.

About 6:45 I'm in the queue at the shop to get coffee and bacon and my boss walks in for the same. We small talk and then he gets an incident call.

There's a virus affecting all of <locality's> machines. Uh-oh. He's getting ready to abandon his coffee and bacon aspirations (he's the Head of Security), when I ask what's actually happened.

"As everyone's logging in in <locality> this morning they're getting a command prompt pop up that just says 'hello, world!'"

Oh. Fuck.

I abandoned my coffee and bacon aspirations and assured him that this wasn't a virus, it was a misconfiguration that I'd made only hours before.

It was sorted within minutes and was broadly taken with good humour. But I was referred to as "World" for a while afterwards when people greeted me.

[+] passiveincomelg|9 years ago|reply
from the canonical list of Reasons Against Overtime

I surely contributed to that list, but was too tired to remember any details. :)

[+] oAlbe|9 years ago|reply
> I was referred to as "World" for a while afterwards when people greeted me.

That feels meaner than it sounds...

[+] petecooper|9 years ago|reply
I worked at Tesco, a UK grocery chain, in my teens. I was involved with stock control, among other things, and was partially responsible for populating shelves at a new store.

All products at Tesco have an 8-digit product number (SKU) in addition to the EAN/UPC. There's also a three digit case size number. Like this:

05123456-024

Each product has an estimated weekly sale and a capacity (shelf plus warehouse) to aid efficient warehousing. Each product has a case size of less than 1,000. Well, all but one -- white sugar. That has a case size of 1,024. It's annotated on the shelf ticket as '024', dropping the leading '1'.

I didn't know about this until ~43 tonnes of sugar arrived on 6 trucks the following day. For a new store. In a small town.

It turns out that my misreading '024' as a case size and over-ordering sugar by a factor of 43x was enough to get the internal ordering software updated.

[+] jacobush|9 years ago|reply
See, that is what's wrong with you quiche eating europeans. Fixing the error... What a twist ending! (Twisted even!) A heartfelt "you're fired!" and a lawsuit, now that's the American way.
[+] masklinn|9 years ago|reply
I've got to say a bunch of trucks loaded with pallets of sugar arriving at a small-town Tesco is a hilarious image.

Were you at the new store when the trucks arrived or did you get told afterwards?

Were there negative consequences for you personally or was it considered a process/tooling error only?

[+] gumby|9 years ago|reply
> until ~43 tonnes of sugar arrived on 6 trucks the following day.

I read about this happening to a novice commodities trader in London: several barges full of coal supposedly showed up at their building at Canary Wharf.

(Oh drat, I looked it up and the only reference I could find was on DailyWTF so...maybe not so true: http://thedailywtf.com/articles/Special-Delivery )

[+] confluence|9 years ago|reply
This is brilliant. My day just got so much better.
[+] sokoloff|9 years ago|reply
We were moving buildings in 2006. The datacenter was not getting re-IP'd and did not have cross-connectivity, some infrastructure was moving ahead of the final move, including the backup targets.

So, I'd turned off the svn backups (dumps and post-commit incrementals) when the targets moved about a week before the final people move. We got into the new building and in the rush of getting everything set up, I'd forgotten to re-enable backups (had not made a checklist). Sure enough, the svn server crashed, BDB corrupted, last backups about 8 days old.

Fortunately, we had nightly build snapshots, code on dev workstations, etc, so it was mostly a rock-fetch project to put things back together starting from a fresh repo. We had other automation that used the repo path and revision, so I created a "devtemp" repo, restored the backup, re-imported all the code there, and then laid on incrementals from nightly builds and dev workstations. In the process, I checked in the vast majority of our code as the author of "revision 6".

10 years later, I was still getting svn blame based questions "about this code you wrote (in -r 6)" "Man, that sokoloff dude wrote a whole lot of crappy code..."

Now that we've been mostly on git for 2 years and only have those repos for historical archeology, the questions are finally dying off.

[+] shermanyo|9 years ago|reply
> 10 years later, I was still getting svn blame based questions "about this code you wrote (in -r 6)"

That's fantastic haha

[+] phaemon|9 years ago|reply
A hastily and poorly written bash script. These days I start every bash script with what I now think of as the "Brexit Options":

  set -eu; set -o pipefail
The key missing one in this case was `-u`. That stops the script if you have an unset variable.

This script would do some stuff, and put a new website in place, and then remove the old one. So, my bash script had the line:

  rm -rf /var/www/$olddir
You can see it already. I ran it with $olddir unset. I think I had it in my head that the directory would simply not be found so that was fine. For those of you unfamiliar with bash, since olddir="", what actually ran was:

  rm -rf /var/www/
Gigabytes lost (back then, a GB was a lot!). We had backups but they took hours to restore. Horrible, horrible day.
[+] proaralyst|9 years ago|reply
For the record, this would also have saved you:

    rm -rf /var/www/${olddir:?}
Which causes a specific error if olddir is unset. A good thing to do regardless of whether the Brexit options are set!
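For anyone who wants to see both guards in action without risking a real docroot, here's a minimal sketch against a throwaway directory (all the paths are made up):

```shell
#!/bin/bash
set -u                                   # abort on genuinely unset variables
docroot=/tmp/demo_www                    # throwaway stand-in for /var/www
mkdir -p "$docroot/site"
olddir=''                                # simulate the bug: set but empty
# ${olddir:?msg} additionally rejects set-but-empty values, which plain
# `set -u` would let through:
if ! (rm -rf "$docroot/${olddir:?olddir is empty, refusing to rm}") 2>/dev/null; then
    echo "refused to delete $docroot"
fi
ls "$docroot"                            # site/ is still there
```

The subshell around the rm is just so the failed expansion doesn't kill the demo script itself.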
[+] tech2|9 years ago|reply
Likewise, but mine was when I was first learning Linux in the mid-90s.

I'd written a script to clean out /tmp (since it wasn't a virtual fs back then) at boot. Problem is that it hadn't successfully changed to /tmp, and was running instead in /etc.

Goodbye /etc, it was nice knowing you... first I knew about it was when my box spectacularly failed to boot.

However, this was _the_ best learning experience of my life. No internet (since that was my only computer at the time) gradually rebuilding /etc by hand from a root prompt.

[+] nl|9 years ago|reply
I did this same thing, except mine was in effect:

  sudo rm -rf /home/username/something/$SOME_VAR/*
Through some circumstance, SOME_VAR ended up being set to a space. Turns out that rm takes a list of directories to delete, so that deleted everything from the entire server.

Fortunately I had backups. But yeah.. don't do this.
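You can re-enact that expansion safely with `set --` and `printf` standing in for `rm` (the path here is made up):

```shell
SOME_VAR=' '                             # the variable that ended up as a single space
# Unquoted expansion: word splitting turns the one argument into two,
# and the second one is the glob /* ...
set -- /tmp/something/$SOME_VAR/*
echo "rm would have received $# arguments, starting with:"
printf '  %s\n' "$1" "$2"
```

The first word is the harmless `/tmp/something/`; every word after it is an entry from the root directory.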

[+] beaconstudios|9 years ago|reply
Thanks for providing the line you use to avoid these errors - I'll definitely be including this in future bash scripts!

It makes you think - if we had these threads more often, perhaps we'd all get to learn more about these little process changes that could avert a disaster.

[+] shermanyo|9 years ago|reply
oh wow, I've done this before too, multiple times. Deleting _all_ the backups instead of a specific one is a heart stopping moment...
[+] gargravarr|9 years ago|reply
On my internship while at uni back in 2010, I was tinkering with the company SVN server. It was the only machine running Linux in the whole company, and I'd only learned Linux the year before. If I recall correctly, I was trying to set up Trac. Back then, it wasn't in the repos so I was having to set it up from a tarball.

So, what do you know, I broke something in the source folder and the whole Trac install was unusable. I decided to nuke it and start again.

I'm sure you all know where this is going by now.

Back then, I had a habit of typing ./* for anything in the current directory, rather than just * .

I forgot the .

Me being a total n00b and naive, I thought the permissions warnings I got were genuine (I didn't initially run the command as root) and that because I was chown'ing stuff to www-data... yep. sudo !!

And of course, even though --preserve-root was a thing even back then, it only protects you if the argument given to rm is literally '/'. Otherwise, bash resolves the wildcard and passes each entry in.

It took about 5 seconds to kill my SSH session, just long enough for me to notice the missing . and go OHSHI-

Worse, the machine wasn't backed up. It had a reasonably concise wiki on the company in-house software. On the flipside, that meant the boss shared the blame with me because there was no backup. We were able to rescue the SVN repos, but the MySQL data tables were gone.

So I can totally relate to the poor Gitlab sysadmin who's probably suffering PTSD right now. For want of a single . I managed to trash a production machine too.

As one of my friends would later tell me, 'root is a state of mind'.

[+] shakna|9 years ago|reply
We were doing a cleanup of VMs.

The network had been rebuilt four times by three different people, and only half documented each time.

One time, each VM had been named after planetary bodies. Sol was the AD, Jupiter the print server, etc.

We found one called Mars. Completely undocumented. Doesn't exist so far as the docs knew. The previous admin didn't remember it.

I ran Wireshark, and got nothing.

So... I didn't just shut it down, but I deleted it.

Took 10 minutes for mass panic to hit the office.

Mars was the gateway for our publicly exposed servers. No website, no email, no VPN.

Our daily backup only copied data, not actual images.

So, just hoping, I threw a reverse pass-through proxy up on the same IP, with routes for our servers.

Quiet returned, as I went about recovering the Mars image I had deleted.

Lesson learned: if you are working in unknown territory, let it break before deleting. Also, add VM images to the backup routine.

[+] shermanyo|9 years ago|reply
There should be a name specifically for this, like "VM Host Archaeology"
[+] danieltillett|9 years ago|reply
The name should have been a hint as well - I am not sure I would want to kill anything named after the god of war.
[+] dawnerd|9 years ago|reply
Also name things so they're a little more obvious.
[+] partisan|9 years ago|reply
A few weeks into a new job, I was tasked with fixing a bug in an ASP.NET application. I stepped through the application and was able to reproduce the issue, which unfortunately for me only happened when a payment was submitted. So I went through the code, once, twice, thrice, testing various scenarios to understand why the issue would happen. At some point, I started reading the code at a line level and realized that embedded right in the middle of that payment logic was a line of code that sent an email to the customer indicating that a payment had been made. Then I realized I had sent tens of emails out during each debugging pass. I immediately ran over to my new boss and explained what happened. He smiled when I told him how many emails were sent and then asked me to get customer service in contact with the customer. He was obviously unhappy, but he said he was glad that I had realized it and that I had raised the issue to him immediately.

Lesson: Read the code before jumping headlong into a debugging session. If you make a mistake, inform someone immediately. They would probably rather hear it from you than discover it on their own. I stuck to this principle at that job and it served me really well.

[+] clooless|9 years ago|reply
Happened to me. Now, my Debug web.config file has a setting that directs all outgoing email to a local "SpecifiedPickupDirectory".
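For anyone wanting the same safety net: in .NET, the SMTP delivery method can be switched to a local pickup folder in config. Roughly like this (the pickup path is whatever local folder you choose):

```xml
<system.net>
  <mailSettings>
    <!-- Debug builds: write outgoing mail to disk instead of sending it -->
    <smtp deliveryMethod="SpecifiedPickupDirectory">
      <specifiedPickupDirectory pickupDirectoryLocation="C:\dev\mail-pickup" />
    </smtp>
  </mailSettings>
</system.net>
```

Each "sent" message lands as a .eml file you can open and inspect.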
[+] chewyfruitloop|9 years ago|reply
Not sure ... was it the time I deleted the tablespace containing the unbanked transactions for a local council, which was about £1 million (I did a very hasty recovery), or when I accidentally deleted the tablespace of the last 3 years' data for another council ... which took 3 weeks to recover ... or when I set up an ISDN modem to dial the wrong number every 30 seconds for 6 months, costing £10k after discount (the bill snapped the table legs when it was dropped) .....
[+] shermanyo|9 years ago|reply
any of those will do nicely haha. thanks for sharing :)
[+] tangus|9 years ago|reply
A long time ago, working for a BBS, I wrote a nice interactive utility to review and change user configurations. I named it uc ("user configure"). After some testing, I installed it in /usr/local/bin.

Time to use it!

    # uc /bbs/users/*
Nothing happens. It needs some time to read all users' configuration files before displaying the user interface, but it's taking too long. What's happening? I decide to interrupt it. Shortly after, we find out all user accounts starting with A, B, and C are wiped out.

Apparently, unbeknown to me, somebody had previously written a utility to delete user accounts. It was named uc (user clear), and was installed in /usr/bin. Fortunately we had fairly recent backups.

That day I learned about hash -r.
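For anyone who hasn't hit this: bash caches the resolved path of each command it runs, so a newer binary earlier in $PATH can lose to a stale cache entry. A safe re-creation with throwaway directories (all paths made up):

```shell
mkdir -p /tmp/hashdemo/early /tmp/hashdemo/late
printf '#!/bin/sh\necho old\n' > /tmp/hashdemo/late/uc
chmod +x /tmp/hashdemo/late/uc
PATH=/tmp/hashdemo/early:/tmp/hashdemo/late:$PATH
uc                                        # prints "old"; bash caches that path
printf '#!/bin/sh\necho new\n' > /tmp/hashdemo/early/uc
chmod +x /tmp/hashdemo/early/uc
uc                                        # still prints "old" -- stale hash entry
hash -r                                   # flush the cache
uc                                        # now prints "new"
```

`hash` with no arguments shows you what's currently cached.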

[+] synicalx|9 years ago|reply
First week as a network engineer, still on probation. I was given the simple task of provisioning a new VLAN onto a few switches, one of which was a fairly large and important aggregation switch.

Everything was going OK, then I get to the big switch and move over to one of its port channels and start adding the VLAN there:

  switchport trunk allowed vlan 123
Shortly after my telnet session dropped. How weird, I thought. I tried re-connecting, no dice. Tried pinging it, nothing.

Then I hear a loud "What the fuck?" from the other side of room, and I look up to see about 30 bright red alerts on our board and a huge flood of red from our GLTail monitoring board showing a very large number of PPPoE session ending suddenly.

I'd missed the word "add" in my command and had wiped the other 200+ VLANs from that interface, which in turn killed the Internet, Phone, and IPTV of about 30,000 customers.

After restarting the switch to get it back to its startup config, I returned to my desk to find the golden pineapple already displayed prominently on my desk. I also had to wear a cowboy hat for the next 10 changes I made in production.

I'd say lesson learned, but then my co-worker did the exact same thing 2 months later while drunk and on call so I don't think we really learned anything there.

[+] amingilani|9 years ago|reply
What's the "golden pineapple"?
[+] Intermernet|9 years ago|reply
Many years ago I was asked to image a new Samba server as the old one was throwing random errors due to age.

I waited until everyone else had left the building, grabbed the disk out of the old server, stuck it into the shiny new server and proceeded to dd the old disk to the new disk.

Except I got the devices around the wrong way (/dev/sda, /dev/sdb) and proceeded to copy the contents of a blank hard drive over the top of the old server's drive. I didn't notice until the process had finished...

I then discovered the benefit of DR plans the hard way (backups are useless unless you test a restore).

Long story short, I managed to recover most of the files using a variety of disk recovery tools, but I was still in the office the next morning when other people started arriving and began to ask me why, for example, the payroll application couldn't find its database. I spent the next few days in panicked forensics mode until the company was operating to everyone's satisfaction.

When I left that company years later I had implemented many redundant layers of backups, proper DR plans that I religiously followed, and developed a meticulous habit of testing any commands that needed to be run on any production server.

[+] shakna|9 years ago|reply
> dd ... got the devices around the wrong way

dd always makes me nervous as hell when I need to use it. I usually end up checking four or more times. Still got it wrong a few times.

Nothing like having to recover data with forensics to make you build a fantastic backup system with great redundancy.
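One habit that helps with the nerves: wrap dd in a tiny confirmation helper so the target has to be retyped before anything is written. A sketch (the function name and prompt are mine, not a standard tool):

```shell
# confirm_dd SRC DST -- refuse to copy until the target is retyped.
confirm_dd() {
    src=$1 dst=$2
    lsblk "$src" "$dst" 2>/dev/null || true  # sizes/models, when these are real devices
    printf 'Retype the TARGET (%s) to confirm: ' "$dst"
    read -r answer
    [ "$answer" = "$dst" ] || { echo "aborted"; return 1; }
    dd if="$src" of="$dst" bs=4M conv=fsync 2>/dev/null
}
```

Only point it at /dev/sdX devices once the lsblk output matches what you expect to overwrite.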

[+] flurdy|9 years ago|reply
Been there, done that. I managed to lose 30 days of billing data without a working backup.

At a 7 man music streaming startup 10+ years ago, there was an issue with our production server. The application was working, but the reporting tool on another server was no longer getting the daily copy of the Firebird database from the live server.

The database server had run out of disk space so it was no longer able to make backups to transfer. So I stopped the apps, stopped the database, cleared out a lot of old logs and backups that were no longer needed, and brought everything back up again.

And then I swore, as I am sure YP did at Gitlab when he realised what had just happened.

The database had started up using, as you would expect, its last persisted state. In this case, that was its state from 30 days earlier, when it ran out of disk space. :( Firebird had been happily running in memory since then, though not able to persist any changes. And since the backup procedure was to export the database to a local disk file then scp it to other nodes, it had been happily transferring 0-byte files for weeks. :/

Had I cleared up some disk space then exported the database before I shut it down then there would have been no problem.
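The moral generalises: a backup job should refuse to ship anything it hasn't sanity-checked. A minimal sketch (the function and paths are hypothetical, not our actual script):

```shell
# ship_backup DUMP REMOTE -- only transfer dumps that are actually non-empty.
ship_backup() {
    dump=$1 remote=$2
    if [ ! -s "$dump" ]; then            # -s: file exists AND has size > 0
        echo "backup $dump is empty or missing, NOT shipping" >&2
        return 1
    fi
    echo "shipping $dump to $remote"     # stand-in for: scp "$dump" "$remote"
}
```

A fancier version would also alert when the dump shrinks dramatically against yesterday's, which would have caught our 0-byte files on day one.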

As I realised the severity of this I quickly got hold of our CEO to say I totally fucked up. We then worked together to piece together the missing data from access logs, 3rd party purchase records, and other reports and sources that he had available. We managed to rebuild most of the missing data though there were some gaps in the last 7 days. Not recovering 30 days at all may have killed our tiny company.

We learnt from the mistake and I worked there happily for another year before the company got bought up. Naturally the next project I did was to write a decent alerting system (I won't go into the 300 duplicate text messages I received from it during one night whilst on holiday in southern France).

I have made many mistakes since, just never the same mistake. And with the years I take better and better precautions, scale horizontally, test backup restores etc. But mistakes still happen, just don't panic, and don't try to be the midnight oil hero :)

[+] shermanyo|9 years ago|reply
> I have made many mistakes since, just never the same mistake. And with the years I take better and better precautions, scale horizontally, test backup restores etc. But mistakes still happen, just don't panic, and don't try to be the midnight oil hero :)

Perfectly put. Thanks for sharing.

[+] nicostouch|9 years ago|reply
I was browsing through some web services code I had written a few months prior, doing a bug fix, when I noticed an if statement with a boolean condition that would be easier to read if it were the other way around. I modified the condition to improve readability, but in doing so actually flipped its logic. Luckily QA caught it; otherwise it would have broken customer sign-up through the web portal for a number of clients. Not a great mistake to make, but it taught me a great lesson: never ever refactor something without the proper tests in place first. It's not worth the risk.
[+] davidgerard|9 years ago|reply
My stories to put interviewees at ease:

* (2010) When you're asked to restore last Sunday's backup to the dev CMS, make sure you're actually on the dev instance, and not, say, on the live instance. That literally every editorial person in the company uses. The day before deadline. (I got to restore 36 two-hourly incremental backups in sequence by hand. We lost only a couple of hours' work. But we verified our backups work!!)

* (2005) Never trust a UPS manual. Ever. Particularly, when it says that the "bypass" switch works smoothly, rather than, e.g., glitching the power and taking down all 75+ Windows PCs in the computer room. (The Sun boxes were of course unaffected.) Recovering the Windows network took most of the morning; the NT admins were less than impressed. And I was a contractor too. Fortunately working under the direction of the in-house admin.

The important thing being, of course, to recover and learn from the experience :-)

[+] shermanyo|9 years ago|reply
thanks for sharing :) it's great when an 'unscheduled verification of backups' goes well ;)
[+] oompahloompah|9 years ago|reply
I was working support at a VPS provider for my first real-world tech job fresh out of college and a customer was having issues with their system not booting correctly. They were smart enough to use our integrated backups service so I told them that they could delete their current disks and restore from backup. So they did...

Or at least they tried to.

The backups system was incredibly wobbly at the time and would corrupt its archives pretty frequently which is exactly what happened. They lost everything on that server.

Did I mention that was their sole server and they had no other backups?

It turned out that they were a company providing services to a government entity and had some pretty strict record-keeping requirements which they relied on our service to fulfill.

I was freaking out thinking I was going to be fired after being there for less than a year but everything was resolved fairly well (somehow).

I learned to never trust backups and the rule of thumb "two is one, one is none" as it applies to them.

[+] sofaofthedamned|9 years ago|reply
I previously put this up at /r/sysadmin but here goes again:

I was a programmer in my first IT job in 1992 for a large retailer in the UK. I was working on some stock related code for the branches, of which they had thousands. They sold a lot of local goods like books which were only sold in a couple of stores each - think autobiographies of local politicians, local charity calendars, that sort of thing.

Problem with a lot of these items was that they were not on the central database. This caused a problem with books especially as you don't pay VAT on books, but if you can't identify the book then the company had to pay it. This makes sense because some books or magazines you DID pay VAT on, because they came with other stuff - think computer magazines with a CD on the front. So my code looked at different databases and historical info to work out the actual VAT portion payable, which was usually nil.

I wrote the code (COBOL, kill me now), the testers tested it, and all went OK until they deployed, on a Friday night. The first I knew was coming in Monday morning. All the ops had been working throughout the weekend as the entire stock status for each branch had been wiped. They had to pull a previous week's backup from storage; this didn't work as they didn't have the space for both copies to merge, so IBM had to motorcycle-courier some hardware from Amsterdam, etc etc. As this was an IBM mainframe with batch jobs, we also had to stop subsequent jobs in case it made the fuckup worse, so none of the stock/finance stuff could run at all.

The branches were royally fucked on Monday as, without any stock status to know what to order, they got nothing - no newspapers, books, anything. We even made it to the Daily Mail, I think it took at least 3 weeks before ordering was automatic again. Cost the company literally millions in overtime, not being able to sell stuff, consultants and reputational damage - it was big news in the national newspapers.

The root cause? I processed data on a run per-branch. I'd copy the branch data to a separate area, delete the main data, then stream it back. My SQL however deleted the main data for ALL branches. It didn't get picked up in QA as, like me, they only tested with a single branch dataset at a time.

I literally spent the week in a daze drinking hard, thinking my career was over. My boss saved my career and me by being absolutely stellar about it. Wherever you are Mike Addis, I thank you!

[+] sokoloff|9 years ago|reply
I've lost my personal home dir twice in my life:

1993 I tar.gz'd it as I was leaving college and ftp'd that file NOT in binary mode; didn't discover it until too late.

1995, I blew away the mount point for my NFS server with all my home dir and data but had left the server mounted (and was running as root, no root squash, etc)

At work, training a new operator, I had them run the script that shut down all web servers rather than regenerating the CMS caches on them. As the alerts rolled in, I reassured him that we'd done the right thing. Many minutes later, we looked at the logs and saw "webservers-shutdown-all" instead of "webservers-regen-all"

[+] steventhedev|9 years ago|reply
I managed to wipe mine by creating a directory called "~" in a REPL and then trying to clean up a few days later by running rm -rf ~. Hit Ctrl-C, but it still managed to chew through most of the dotfiles, and was halfway through a few checkouts of AOSP before I stopped it.
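The trap is that an unquoted ~ expands to $HOME before rm ever sees it; quoting makes it a literal directory name again. A throwaway demonstration:

```shell
mkdir -p '/tmp/demo_repl/~'              # the accidentally created literal ~ directory
touch '/tmp/demo_repl/~/junk'
cd /tmp/demo_repl
rm -rf './~'                             # quoted: deletes only the directory named ~
# rm -rf ~   (unquoted) would have expanded to your real $HOME
ls -A /tmp/demo_repl                     # nothing left
```

The ./ prefix is belt-and-braces: even an unquoted ./~ wouldn't tilde-expand, since expansion only happens at the start of a word.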
[+] davman|9 years ago|reply
I fdisk'd the LVM partition that was used as iSCSI storage for 200+ virtual machines.
[+] rosser|9 years ago|reply
Mine is fairly simple: an UPDATE without a WHERE clause on a table tracking Other People's Money. Except that was how we discovered that backups weren't working...

Really posting to share one that happened to me, though:

At a previous job, my PostgreSQL clusters were on metal (blades; not my choice, but I wasn't given one). We were in the process of replacing the SSDs in the blades, both for capacity and performance. We'd, if I remember correctly, replaced one set (the replica, I think, so we could fail over to it and then upgrade the master).

The lead sysadmin decided this represented a golden opportunity to get a side-by-side performance analysis of the new and old drives. (I'll leave aside for the moment the fact that he never said anything about doing this to me, which, you know...)

So he ran fio. In read-write mode. Against the block device, not the filesystem.

And he did that on the primary, replica, and performance test (so, identical to prod) machines at the same time.

About 20 minutes later, one of the developers, who was investigating some other issue, reported seeing "strange errors" in his psql session. I looked at the logs, and everything got very, very still for a moment...

We ended up having to rebuild the machines and restore from the previous night's backups (taken 14h prior), troll the Rails logs to find the affected orders, and refund them. I did finally get the large box for WAL archiving I'd long been lobbying for out of the deal, too, so that was nice.

[+] tluyben2|9 years ago|reply
In 1999 I ran one of those rm -fR thingies with an unset var on a client production system. Systems were dog slow those days, but I only noticed when the client called that his ERP was down. This was one of my country's most successful car rental companies and everything went through there. Of course (...) the backups were broken and we did not use CVS yet. The client, a very nice man, said 'well, that is unfortunate' and that was all. We restored a very old backup and copied the source files from my dev system to it. After that we ran a mirror in our office (nightly db copies via ssh), used CVS and did weekly backup tests. Yikes.

Another one, which was less my fault but I did blame myself for, was dropping a server with 200,000 websites on it because we had to move datacenters, and it was Christmas Eve and very, very slippery with ice. We slid and the server fell, which wrecked the (hardware) RAID disks. This one had working tape backups, so there were only a few hours of downtime, which was going to happen anyway as we were moving.

Now that I am writing this anyway; the most traumatic was in the mid-80s with my second computer, when I was 10. I had one disk(!); they were expensive and I did not get a lot of pocket money. I was learning assembly after Basic became too slow for what I wanted. I was building a game (a Chuckie Egg rip-off) and, after a long time not saving, I ran the game and it worked. I was happy and saved the game on the one disk with all the software I wrote, using the save command. When I pressed enter I remembered, and I remember this very vividly, that I had used the disk Basic RAM space because I was running out of memory. The disk started spinning, the computer rebooted and ... the files (dir) command afterwards gave a disk I/O error... The misery.

Edit: ugh. Just remembered a 1984 one; my father brought home a modem for the MSX-1, and those things were, for my notion of money, pretty much pure gold, price-wise. It was 100s of guilders. But it could only do Viditel. Which sucked. I wanted BBS access and that required shoving the thing into the MSX after Basic had already booted. The MSX cartridge ports are connected intimately to vital computer parts. Shoving it in crooked left my best friend looking at a purple screen, after which I decided it would be better to solder in a switch. I had done that before with stuff I found by the road. I had to cut an IC pin to do it, which I had done quite often; this time I cut it and it flew off... Eventually I was forgiven.