Ask HN: We've all been there, what was your big stuff up?
To show that this sort of thing happens to the best of us, let's share some of our horror stories :)
A few months ago, I joined a new team and was still finding my way around the environments. I was tasked with performing manual deployments to a Dev, QA and Staging environment that weren't wired up to our automation system yet. We'd scheduled maintenance windows a week apart for the QA and Staging envs as we allow customers to test against these.
So the day of my QA deployment, I start by applying the database changes which all complete successfully. Next, I upload the new .ear files and deploy the new build of our web app. Again, all looks good, so I tell the QA team they can start testing.
Then the alerts started...
I deployed the app to the Staging env by mistake (and unexpectedly restarted the app server). I didn't realise the naming scheme of the hostnames indicated the environment in this case :/
Our UI broke immediately due to the schema changes, so my mistake was _very_ visible. I was lucky I could roll back the change easily, but I don't think I'll forget that day any time soon.
[+] [-] herghost|9 years ago|reply
Cut to 1:30am after a full day of eking out 1 or 2 endpoints here or there, and I've figured out a new method to try. But first I need to test it and make sure it's not going to break anything else, so I create a separate asset group in the AV software and add only my machine to it. I add a simple "hello, world!" type script just to show that the script is executing and wait.
And wait.
No "hello, world!". It's 2am, I'm back in the office at 7am, my new insights will hold until tomorrow. I'm going to bed.
About 6:45 I'm in the queue at the shop to get coffee and bacon and my boss walks in for the same. We small talk and then he gets an incident call.
There's a virus affecting all of <locality's> machines. Uh-oh. He's getting ready to abandon his coffee and bacon aspirations (he's the Head of Security), when I ask what's actually happened.
"As everyone's logging in in <locality> this morning they're getting a command prompt pop up that just says 'hello, world!'"
Oh. Fuck.
I abandoned my coffee and bacon aspirations and assured him that this wasn't a virus, it was a misconfiguration that I'd made only hours before.
It was sorted within minutes and was broadly taken with good humour. But I was referred to as "World" for a while afterwards when people greeted me.
[+] [-] passiveincomelg|9 years ago|reply
I surely contributed to that list, but was too tired to remember any details. :)
[+] [-] oAlbe|9 years ago|reply
That feels meaner than it sounds...
[+] [-] petecooper|9 years ago|reply
All products at Tesco have an 8-digit product number (SKU) in addition to the EAN/UPC. There's also a three digit case size number. Like this:
05123456-024
Each product has an estimated weekly sale and a capacity (shelf plus warehouse) to aid efficient warehousing. Each product has a case size of less than 1,000. Well, all but one -- white sugar. That has a case size of 1,024. It's annotated on the shelf ticket as '024', dropping the leading '1'.
I didn't know about this until ~43 tonnes of sugar arrived on 6 trucks the following day. For a new store. In a small town.
It turns out that me misreading '024' as a case size and over-ordering sugar by a factor of 43x was enough to have the internal ordering software updated.
[+] [-] masklinn|9 years ago|reply
Were you at the new store when the trucks arrived or did you get told afterwards?
Were there negative consequences for you personally or was it considered a process/tooling error only?
[+] [-] gumby|9 years ago|reply
I read about this happening to a novice commodities trader in London: several barges full of coal supposedly showed up at their building at Canary Wharf.
(Oh drat, I looked it up and the only reference I could find was on DailyWTF so...maybe not so true: http://thedailywtf.com/articles/Special-Delivery )
[+] [-] sokoloff|9 years ago|reply
So, I'd turned off the svn backups (dumps and post-commit incrementals) when the targets moved about a week before the final people move. We got into the new building and in the rush of getting everything setup, I'd forgotten to re-enable backups (had not made a checklist). Sure enough, svn server crashes, BDB corrupted, last backups about 8 days old.
Fortunately, we had nightly build snapshots, code on dev workstations, etc, so it was mostly a rock-fetch project to put things back together starting from a fresh repo. We had other automation that used the repo path and revision, so I created a "devtemp" repo, restored the backup and re-imported all the code there, and then laid on incrementals from nightly builds and dev workstations. In the process, I checked in the vast majority of our code as the author of "revision 6".
10 years later, I was still getting svn blame based questions "about this code you wrote (in -r 6)" "Man, that sokoloff dude wrote a whole lot of crappy code..."
Now that we've been mostly on git for 2 years and only have those repos for historical archeology, the questions are finally dying off.
[+] [-] shermanyo|9 years ago|reply
That's fantastic haha
[+] [-] phaemon|9 years ago|reply
This script would do some stuff, put a new website in place, and then remove the old one. So, my bash script had a line to the effect of: rm -rf $olddir/*
You can see it already. I ran it with $olddir unset. I think I had it in my head that the directory would simply not be found, so that was fine. For those of you unfamiliar with bash: since olddir="", what actually ran was rm -rf /*. Gigabytes lost (back then, a GB was a lot!). We had backups, but they took hours to restore. Horrible, horrible day.
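For anyone who wants to see the expansion without risking anything, here's a harmless sketch of the same failure mode (echo stands in for rm, and the ${var:?} guard is what would have aborted the script):

```shell
#!/bin/bash
# Reproduce the failure harmlessly: with olddir empty, the cleanup
# command silently becomes "rm -rf /*".
olddir=""
cmd="rm -rf $olddir/*"
echo "$cmd"                          # prints: rm -rf /*

# ${var:?} aborts with an error instead of expanding to nothing,
# which would have stopped the script before any damage.
( : "${olddir:?olddir is not set}" ) 2>/dev/null || echo "guard tripped"
```

set -u (nounset) gives the same protection script-wide.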
[+] [-] tech2|9 years ago|reply
I'd written a script to clean out /tmp (since it wasn't a virtual fs back then) at boot. Problem is, it hadn't successfully changed to /tmp and was running instead in /etc.
Goodbye /etc, it was nice knowing you... first I knew about it was when my box spectacularly failed to boot.
However, this was _the_ best learning experience of my life. No internet (that was my only computer at the time), so I gradually rebuilt /etc by hand from a root prompt.
[+] [-] nl|9 years ago|reply
Fortunately I had backups. But yeah.. don't do this.
[+] [-] beaconstudios|9 years ago|reply
It makes you think - if we had these threads more often, perhaps we'd all get to learn more about these little process changes that could avert a disaster.
[+] [-] gargravarr|9 years ago|reply
So, what do you know, I broke something in the source folder and the whole Trac install was unusable. I decided to nuke it and start again.
I'm sure you all know where this is going by now.
Back then, I had a habit of typing ./* for anything in the current directory, rather than just * .
I forgot the .
Me being a total n00b and naive, I thought the permissions warnings I got were genuine (I didn't initially run the command as root) and that because I was chown'ing stuff to www-data... yep. sudo !!
And of course, even though --no-preserve-root was a thing even back then, that only works if the argument given to rm is '/'. Otherwise, bash resolves the wildcard and passes each one in.
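A harmless way to see this for yourself (echo stands in for rm; nothing is deleted):

```shell
#!/bin/bash
# The shell expands the glob before rm ever runs, so rm receives
# /bin /boot /etc ... as individual arguments, none of which is the
# literal "/" that --preserve-root checks for.
echo rm -rf /*
```

Compare with echo rm -rf / -- only that exact, single argument is what --preserve-root protects against.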
It took about 5 seconds to kill my SSH session, just long enough for me to notice the missing . and go OHSHI-
Worse, the machine wasn't backed up. It had a reasonably concise wiki on the company in-house software. On the flipside, that meant the boss shared the blame with me because there was no backup. We were able to rescue the SVN repos, but the MySQL data tables were gone.
So I can totally relate to the poor Gitlab sysadmin who's probably suffering PTSD right now. For want of a single . I managed to trash a production machine too.
As one of my friends would later tell me, 'root is a state of mind'.
[+] [-] shakna|9 years ago|reply
The network had been rebuilt four times by three different people, and only half documented each time.
One time, each VM had been named after planetary bodies. Sol was the AD, Jupiter the print server, etc.
We found one called Mars. Completely undocumented. Doesn't exist so far as the docs knew. The previous admin didn't remember it.
I ran Wireshark, and got nothing.
So... I didn't just shut it down, but I deleted it.
Took 10 minutes for mass panic to hit the office.
Mars was the gateway for our publicly exposed servers. No website, no email, no VPN.
Our daily backup only copied data, not the actual VM images.
So, just hoping, I threw a reverse pass-through proxy up on the same IP, with routes for our servers.
Quiet returned, as I went about recovering the Mars image I had deleted.
Lesson learned: if you are working in unknown territory, let it break before deleting. Also, add VM images to the backup routine.
[+] [-] partisan|9 years ago|reply
Lesson: Read the code before jumping headlong into a debugging session. If you make a mistake, inform someone immediately. They would probably rather hear it from you than discover it on their own. I stuck to this principle at that job and it served me really well.
[+] [-] tangus|9 years ago|reply
Time to use it!
Nothing happens. It needs some time to read all users' configuration files before displaying the user interface, but it's taking too long. What's happening? I decide to interrupt it. Shortly after, we find out all user accounts starting with A, B, and C are wiped out.
Apparently, unbeknownst to me, somebody had previously written a utility to delete user accounts. It was named uc (user clear), and was installed in /usr/bin. Fortunately we had fairly recent backups.
That day I learned about hash -r.
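For anyone who hasn't hit this: bash remembers the full path of each command it has run, so a freshly written tool can stay shadowed by a same-named binary until the cache is cleared. A quick demo of the cache itself:

```shell
#!/bin/bash
# bash caches the resolved path of every command it executes.
ls /tmp >/dev/null    # first use: bash searches PATH and caches the result
hash -t ls            # prints the cached path, e.g. /usr/bin/ls
hash -r               # flush the cache (do this after installing a new tool)
hash -t ls 2>/dev/null && echo "still cached" || echo "cache cleared"
```

Running type -a <name> first, to list every match on PATH, would also have revealed the collision with /usr/bin/uc.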
[+] [-] synicalx|9 years ago|reply
Everything was going ok, then I get to the big switch, move over to one of its port channels, and start adding the vlan there.
Shortly after, my telnet session dropped. How weird, I thought. I tried re-connecting, no dice. Tried pinging it, nothing. Then I hear a loud "What the fuck?" from the other side of the room, and I look up to see about 30 bright red alerts on our board and a huge flood of red from our GLTail monitoring board showing a very large number of PPPoE sessions ending suddenly.
I'd missed the word "add" in my command and had wiped the other 200+ vlans from that interface, which in turn killed the Internet, Phone, and IPTV of about 30,000 customers.
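For non-network folk, the difference really is one word. The exact platform wasn't stated, so this is Cisco-style syntax as an illustration (the VLAN number is made up):

```
! Intended: append one VLAN to the trunk's allowed list
switchport trunk allowed vlan add 250

! What actually ran: replace the entire allowed list with that one VLAN
switchport trunk allowed vlan 250
```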
After restarting the switch to get it back to its startup config, I returned to my desk to find the golden pineapple already displayed prominently on my desk. I also had to wear a cowboy hat for the next 10 changes I made in production.
I'd say lesson learned, but then my co-worker did the exact same thing 2 months later while drunk and on call so I don't think we really learned anything there.
[+] [-] Intermernet|9 years ago|reply
I waited until everyone else had left the building, grabbed the disk out of the old server, stuck it into the shiny new server and proceeded to dd the old disk to the new disk.
Except I got the devices around the wrong way (/dev/sda , /dev/sdb) and proceeded to copy the contents of a blank hard drive over the top of the old server's drive. Didn't notice until the process had finished...
I then discovered the benefit of DR plans the hard way (backups are useless unless you test a restore).
Long story short, I managed to recover most of the files using a variety of disk recovery tools, but I was still in the office the next morning when other people started arriving and began to ask me why, for example, the payroll application couldn't find its database. I spent the next few days in panicked forensics mode until the company was operating to everyone's satisfaction.
When I left that company years later I had implemented many redundant layers of backups, proper DR plans that I religiously followed, and developed a meticulous habit of testing any commands that needed to be run on any production server.
[+] [-] shakna|9 years ago|reply
dd always makes me nervous as hell when I need to use it. I usually end up checking four or more times. Still got it wrong a few times.
Nothing like having to recover data with forensics to make you build a fantastic backup system with great redundancy.
[+] [-] flurdy|9 years ago|reply
At a 7 man music streaming startup 10+ years ago, there was an issue with our production server. The application was working, but the reporting tool on another server was no longer getting the daily copy of the Firebird database from the live server.
The database server had run out of disk space so was no longer able to make backups to transfer. So I stopped the apps, stopped the database, cleared out a lot of old logs and backups that were no longer needed, and brought everything back up again.
And then I swore, as I am sure YP did at Gitlab when he realised what had just happened.
The database had started up using, as you would expect, its last persisted state. In this case, its state as it was 30 days earlier, when it ran out of disk space. :( Firebird had been happily running in memory since then, though not able to persist any changes. And since the backup procedure was to export the database to a local disk file and then scp it to the other nodes, it had been happily transferring 0-byte files for weeks. :/
Had I cleared up some disk space then exported the database before I shut it down then there would have been no problem.
As I realised the severity of this I quickly got hold of our CEO to say I totally fucked up. We then worked together to piece together the missing data from access logs, 3rd party purchase records, and other reports and sources that he had available. We managed to rebuild most of the missing data though there were some gaps in the last 7 days. Not recovering 30 days at all may have killed our tiny company.
We learnt from the mistake and I worked there happily for another year before the company got bought up. Naturally the next project I did was to write a decent alerting system (I won't go into the 300 duplicate text messages I received from it during one night whilst on holiday in southern France).
I have made many mistakes since, just never the same mistake. And with the years I take better and better precautions, scale horizontally, test backup restores etc. But mistakes still happen, just don't panic, and don't try to be the midnight oil hero :)
[+] [-] shermanyo|9 years ago|reply
Perfectly put. Thanks for sharing.
[+] [-] davidgerard|9 years ago|reply
* (2010) When you're asked to restore last Sunday's backup to the dev CMS, make sure you're actually on the dev instance, and not, say, on the live instance. That literally every editorial person in the company uses. The day before deadline. (I got to restore 36 two-hourly incremental backups in sequence by hand. We lost only a couple of hours' work. But we verified our backups work!!)
* (2005) Never trust a UPS manual. Ever. Particularly, when it says that the "bypass" switch works smoothly, rather than, e.g., glitching the power and taking down all 75+ Windows PCs in the computer room. (The Sun boxes were of course unaffected.) Recovering the Windows network took most of the morning; the NT admins were less than impressed. And I was a contractor too. Fortunately working under the direction of the in-house admin.
The important thing being, of course, to recover and learn from the experience :-)
[+] [-] oompahloompah|9 years ago|reply
Or at least they tried to.
The backup system was incredibly wobbly at the time and would corrupt its archives pretty frequently, which is exactly what happened. They lost everything on that server.
Did I mention that was their sole server and they had no other backups?
It turned out that they were a company providing services to a government entity and had some pretty strict record-keeping requirements which they relied on our service to fulfill.
I was freaking out thinking I was going to be fired after being there for less than a year but everything was resolved fairly well (somehow).
I learned to never trust backups and the rule of thumb "two is one, one is none" as it applies to them.
[+] [-] sofaofthedamned|9 years ago|reply
I was a programmer in my first IT job in 1992 for a large retailer in the UK. I was working on some stock related code for the branches, of which they had thousands. They sold a lot of local goods like books which were only sold in a couple of stores each - think autobiographies of local politicians, local charity calendars, that sort of thing.
Problem with a lot of these items was that they were not on the central database. This caused a problem with books especially as you don't pay VAT on books, but if you can't identify the book then the company had to pay it. This makes sense because some books or magazines you DID pay VAT on, because they came with other stuff - think computer magazines with a CD on the front. So my code looked at different databases and historical info to work out the actual VAT portion payable, which was usually nil.
I wrote the code (COBOL, kill me now), the testers tested it, and all went OK until they deployed, on a Friday night. The first I knew of it was coming in Monday morning. All the ops had been working throughout the weekend, as the entire stock status for each branch had been wiped. They had to pull a previous week's backup from storage; this didn't work as they didn't have the space for both copies to merge, so IBM had to motorcycle-courier some hardware from Amsterdam, etc etc. As this was an IBM mainframe with batch jobs, we also had to stop subsequent jobs in case it made the fuckup worse, so none of the stock/finance stuff could run at all.
The branches were royally fucked on Monday as, without any stock status to know what to order, they got nothing - no newspapers, books, anything. We even made it to the Daily Mail, I think it took at least 3 weeks before ordering was automatic again. Cost the company literally millions in overtime, not being able to sell stuff, consultants and reputational damage - it was big news in the national newspapers.
The root cause? I processed data on a run per-branch. I'd copy the branch data to a separate area, delete the main data, then stream it back. My SQL however deleted the main data for ALL branches. It didn't get picked up in QA as, like me, they only tested with a single branch dataset at a time.
I literally spent the week in a daze drinking hard, thinking my career was over. My boss saved my career and me by being absolutely stellar about it. Wherever you are Mike Addis, I thank you!
[+] [-] sokoloff|9 years ago|reply
1993: I tar.gz'd it as I was leaving college and ftp'd that file NOT in binary mode; didn't discover it until too late.
1995: I blew away the mount point for my NFS server with all my home dir and data, but had left the server mounted (and was running as root, no root squash, etc).
At work, training a new operator, I had them run the script that shutdown all web servers rather than regenerating the CMS caches on them. As the alerts rolled in, I reassured him that we'd done the right thing. Many minutes later, we looked at the logs and saw "webservers-shutdown-all" instead of "webservers-regen-all"
[+] [-] rosser|9 years ago|reply
Really posting to share one that happened to me, though:
At a previous job, my PostgreSQL clusters were on metal (blades; not my choice, but I wasn't given one). We were in the process of replacing the SSDs in the blades, both for capacity and performance. We'd, if I remember correctly, replaced one set (the replica, I think, so we could fail over to it and then upgrade the master).
The lead sysadmin decided this represented a golden opportunity to get a side-by-side performance analysis of the new and old drives. (I'll leave aside for the moment the fact that he never said anything about doing this to me, which, you know...)
So he ran fio. In read-write mode. Against the block device, not the filesystem.
And he did that on the primary, replica, and performance test (so, identical to prod) machines at the same time.
About 20 minutes later, one of the developers, who was investigating some other issue, reported seeing "strange errors" in his psql session. I looked at the logs, and everything got very, very still for a moment...
We ended up having to rebuild the machines and restore from the previous night's backups (taken 14h prior), troll the Rails logs to find the affected orders, and refund them. I did finally get the large box for WAL archiving I'd long been lobbying for out of the deal, too, so that was nice.
[+] [-] tluyben2|9 years ago|reply
Another one, which was less my fault but I did blame myself for, was dropping a server with 200,000 web sites on it because we had to move datacenters and it was xmas eve and very, very slippery with ice. We slid and the server fell, which wrecked the (hardware) RAID disks. This one had working tape backups, so there was only a few hours' downtime, which was going to happen anyway as we were moving.
Now that I am writing this anyway: the most traumatic was in the mid-80s, with my second computer, when I was 10. I had one disk(!); they were expensive and I did not get a lot of pocket money. I was learning assembly after Basic became too slow for what I wanted. I was building a game (Chuckie Egg rip-off) and, after a long time not saving, I ran the game and it worked. I was happy and saved the game, with the save command, on the one disk holding all the software I'd written. When I pressed enter I remembered, and I remember this very vividly, that I had used the disk BASIC RAM space because I was running out of memory. The disk started spinning, the computer rebooted and... the files (dir) command afterwards gave a disk I/O error... The misery.
Edit: ugh. Just remembered a 1984 one: my father brought home a modem for the MSX-1, and those things were, for my notion of money, pure gold, price-wise. It was 100s of guilders. But it could only do Viditel. Which sucked. I wanted BBS access, and that required shoving the thing into the MSX after Basic had already booted. The MSX cartridge ports are connected intimately to vital computer parts. So shoving it in crooked had my best friend looking at a purple screen, after which I decided it would be better to solder in a switch. I had done that before with stuff I found by the road. I had to cut an IC pin to do it, which I had done quite often; this time I cut it and it flew off... Eventually I was forgiven.