top | item 38452959

My $500M Mars rover mistake

1028 points | bryanrasmussen | 2 years ago | chrislewicki.com

343 comments

[+] jay-barronville|2 years ago|reply
Really well written story.

As a software engineer, I have a couple stories like this from earlier in my career that still haunt me to this very day.

Here’s a short version of one of them: Like 10 years ago, I was doing consulting work for a client. We worked together for months to build a new version of their web service. On launch day, I was asked to do the deployment. The development and deployment process they had in place was awful and nothing like what we have today—just about every aspect of the process was manual. Anyway, everything was going well. I wrote a few scripts and SQL queries to automate the parts I could. They gave me the production credentials for when I was ready to deploy. I decided to run what you could call my migration script one last time just to be sure I was ready. The very moment after I hit the Enter key, I realized my mistake: I had updated the script with the production credentials just before deciding to do one more test run. The errors started piling up and their service became unresponsive. I was 100% sure I had just wiped their database and I was losing it internally. What saved me was that one of their guys had completed a backup of their database just a couple hours earlier in anticipation of the launch; in the end, they lost a tiny bit of data but most of it was recovered via the backup. Ever since then, “careful” is an extreme understatement when it comes to how I interact with database systems—and production systems in general. Never again.

[+] cdogl|2 years ago|reply
Your excellent story compelled me to share another:

We rarely interact directly with production databases as we have an event sourced architecture. When we do, we run a shell script which tunnels through a bastion host to give us direct access to the database in our production environment, and exposes the standard environment variables to configure a Postgres client.

Our test suites drop and recreate our tables, or truncate them, as part of the test run.

One day, a lead developer ran “make test” after he’d been doing some exploratory work in the prod database as part of a bug fix. The test code respected the environment variables and connected to prod instead of docker. Immediately, our tests dropped and recreated the production tables for that database a few dozen times.
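One way to blunt this failure mode is to make destructive test fixtures refuse to run against anything that doesn't look like a local database. This is only a sketch, not their actual setup: the safe-host list and the reliance on `PGHOST` are illustrative assumptions.

```python
import os
import sys

# Hosts we consider safe for destructive test setup (drop/truncate).
# "localhost" and the Docker service name "db" are assumptions for
# illustration; adjust to your own environment.
SAFE_HOSTS = {"localhost", "127.0.0.1", "db"}

def assert_safe_to_wipe(env=os.environ):
    """Abort before dropping tables if the configured host isn't local."""
    host = env.get("PGHOST", "localhost")
    if host not in SAFE_HOSTS:
        sys.exit(f"Refusing to drop tables: PGHOST={host!r} is not a known "
                 "local database. Unset your prod tunnel variables first.")
```

Called at the top of the test suite's setup, this turns "the tests silently respected the prod environment variables" into a loud, immediate failure.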

[+] markmark|2 years ago|reply
At a place where I was consulting about 10 years ago, one of the internal guys on another product dropped the prod database: he was logged into his dev db and the prod db at the same time in different windows, and he dropped the wrong one. Then, when they went to restore, it turned out the backups hadn't succeeded in months (they had hired consultants to help them with the new product for good reason).

Luckily the customer sites each had a local db that synced to the central db (so the product could run with patchy connectivity), but the guy spent 3 or 4 days working looooong days rebuilding the master db from a combination of old backups and the client-site data.

[+] schemescape|2 years ago|reply
Anecdote: I ran a migration on a production database from inside Visual Studio. In retrospect, it was recoverable, but I nearly had a heart attack when all the tables started disappearing from the tree view in VS…

…only to reappear a second later. It was just the view refreshing! Talk about awful UI!

[+] donatj|2 years ago|reply
Around 15 years ago, I was packing up, getting ready to leave for a long weekend, when one of our marketing people I was friends with came over with a quick change to a customer's site.

I had access to the production database, something I absolutely should not have had but we were a tiny ~15 person company with way more clients than we reasonably should have. Corners were cut.

I wrote a quick little UPDATE query to update some marketing text on a product, and when the query took more than an instant I knew I had screwed up. Reading my query, I quickly realized I had run the UPDATE entirely unbounded and changed the descriptions of thousands and thousands of products.

Our database admin with access to the database backups had gone home hours earlier as he worked in a different timezone. It took me many phone calls and well over an hour to get ahold of him and get the descriptions restored.

The quick change on my way out the door ended up taking me multiple hours to resolve. My friend in marketing apologized profusely but it was my mistake, not theirs.

As far as I remember we never heard anything from the client about it, I put that entirely down to it being 5pm on Friday of a holiday weekend.

[+] askvictor|2 years ago|reply
One place I worked (some 20 years ago) had a policy that any time you run a sudo command, another person has to check the command before you hit enter. Could apply the same kind of policy/convention for anything in production.
[+] stickfigure|2 years ago|reply
I have a rule when working on production databases: always `start transaction` before doing any kind of update, and pay close attention to the # of rows affected.
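That habit can even be automated. Here's a sketch of the workflow, with Python's `sqlite3` standing in for a real database client; the table and values are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, descr TEXT)")
conn.executemany("INSERT INTO products (descr) VALUES (?)",
                 [("old",)] * 1000)
conn.commit()

cur = conn.cursor()
cur.execute("BEGIN")  # start a transaction before any update
cur.execute("UPDATE products SET descr = 'new' WHERE id = 42")

# Pay close attention to the number of rows affected before committing.
if cur.rowcount == 1:
    conn.commit()
else:
    conn.rollback()  # an accidentally unbounded UPDATE is caught here
```

If the WHERE clause had been forgotten, `rowcount` would come back as 1000 instead of 1 and the rollback branch would undo the damage before it ever reached the data.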
[+] hannofcart|2 years ago|reply
What strikes me as remarkable in all such stories is how, almost always, the person committing the mistake is a junior who doesn't deserve the blame. And how cavalier the handoff/onboarding by the 'seniors' working on the projects is.

Having worked in enough of these, though, I am aware that even they (the "seniors") are seldom entirely responsible for all the issues. It's mostly business constraints that force the cutting of corners, and that ends up jeopardizing the business in the long run.

[+] jeffreygoesto|2 years ago|reply
A friend once had to remotely do an OS update of a banking system. Being cautious, he thought he'd back up some central files, just in case and went "mv libc.so old_libc.so". Had to call some guy in that town to throw in the Solaris CD on prem at 2:30 in the morning...
[+] Tangurena2|2 years ago|reply
One way I mistake-proof things in SQL Management Studio is to have different colors for production vs test databases.

To do that, on the "connect to server" dialog, click "options". On the tab "connection properties" in the "connection" option group, check "use custom color". And I pick the reddest red there is. The bottom of the results window will have that color.

edit: my horrible foul-up was restoring a database to production. The "there is trouble" pagers were all Iridium pagers since they loved climbing mountains (where there was no cell service back then). But then that place didn't use source control, so it was disasters all the way down.

[+] justo-rivera|2 years ago|reply
Seems this is very typical; first-time launches usually lose some data.

We never hear about first-time launch deploys that wipe ALL the data, because whoever is that unlucky probably never got to browse Hacker News again.

[+] totallywrong|2 years ago|reply
As a young consultant, I was once one Enter away from causing a disaster, but something stopped me. I still shudder even though it didn't actually happen. Nothing of the sort in many years since, so a great lesson in retrospect I guess.
[+] cloths|2 years ago|reply
We used to have another engineer watch over your shoulder when you do Prod stuff; it can be very helpful.
[+] RHSman2|2 years ago|reply
I’d love to know the long-term physiological effect on the body of these events. Have had a few. Still feel shaky :)
[+] alberth|2 years ago|reply
Hope you bought that guy a beer.

Great story, thanks for sharing.

[+] PH95VuimJjqBqy|2 years ago|reply
to be fair, it's a rite of passage to do something like this.

But you should definitely have bought that man a beer :)

[+] NL807|2 years ago|reply
Hats off to the backup guys.
[+] irjustin|2 years ago|reply
I'm reminded of the phrase - if your intern deleted the production database you don't have a bad intern; you have a bad process.

Whether this was a process problem or a human one, we don't really get to judge, since we do expect more from an FTE.

I'll just say that putting myself into his shoes made me tear up as I read the dread and pangs of pain upon realizing what happened, and then the return to life when the ray of hope came through after the failure. That weight... I've never had a project that so many people depended on.

All heroes in my book.

[+] drra|2 years ago|reply
This really resonates with my experience. Working at a major airline, I was the one who would pick the most difficult and risky projects. One was a quick implementation of a new payment provider for their website. That website sold millions of euros worth of tickets every day. Seconds after deployment, it turned out that I had failed to recognize the differences between the test and live environments, as one of the crucial variables was blank in production. I could have anticipated this if I had spent more time preparing and reading documentation. Sales died completely, and my heart sank.

After a lengthy rollback procedure that resulted in a few hours without sales, a massive surge of angry customers, and a loss of several million euros, I approached the CEO of the company. I still remember catching him in an elevator. I explained that the incident was all my fault and that I had failed to properly analyse the environment. I assured him that I was ready to bear the full consequences, including being fired. He burst into laughter and said something like this: "Why would I want to get rid of you? You made a mistake that you'll never make again. You are better at your job than you were yesterday!"

This experience was formative for me on many levels, including what true leadership looks like. I have successfully completed many high-risk projects since then.
[+] supriyo-biswas|2 years ago|reply
The language is just so anodyne, and there's just that bit of implausible detail in the story (approaching the CEO yourself when you're the one who fucked up; also how the parent claims to be a "top performer" and "I made my company lose millions" at the same time), that makes me think this comment was written by an LLM, or is at least a fabrication.
[+] bboygravity|2 years ago|reply
You worried about that?

I'm a frequent flyer and I get the feeling that most airline ticket booking pages are broken in some way more than half the time. Maybe not often broken to the point that they're blank, but definitely broken to the point that booking a ticket isn't possible (I prefer blank, so that I don't waste like 30 minutes on not being able to book a ticket).

Also most of the internet seems often broken. Oh hello Nike webshop errors upon payment (on Black Friday) for which helpdesk's solution is: just use the App.

[+] Cthulhu_|2 years ago|reply
I think the loss may not have been as much as you think; sure, nobody could buy tickets for a few hours, so theoretically the company lost millions of revenue during that time. But that assumes people wouldn't just try again later. Downtime does not, in practice, translate directly into losses, I think.

I mean, look at Twitter, which was famously down all the time back when it first launched due to its popularity and architecture. Did it mean people just stopped using Twitter? Some might have, but the vast majority (and then some) didn't.

Downtime isn't catastrophic or company-ending for most online services. It may be for things in space, or for high-frequency trading software that can bankrupt the company, but that's why those fields have stricter checks and balances - in theory. In practice, they're often worse than most people's shitty CRUD webservices that were built with best practices learned from the space/HFT industries.

[+] quickthrower2|2 years ago|reply
As an airline boss, I would really have hoped the response would be more in line with the ethos of a plane-crash postmortem, i.e. find the systemic causes and fix those. Maybe you need a copilot when doing live deployments, one with the authority to stop the rollout. Along with the usual devops guards.
[+] rahimnathwani|2 years ago|reply
This reminds me of that old joke that ends "Why would I fire you? We just spent millions training you!".

People who take on high-risk projects are underappreciated. But many managers prefer employees who can reliably deliver zero value to those with positive expected value but non-zero variance.

[+] holografix|2 years ago|reply
Wow chatgpt is actually getting worse.
[+] chriscjcj|2 years ago|reply
I work in TV. During my first job at a small market station 30 years ago, I was training to be a tape operator for a newscast. All the tapes for the show were in a giant stack. There were four playback VTRs. My job was to load each tape and cue it to a 1-second preroll. When a tape played and it was time to eject that tape, it was _very_ easy to lose your place and hit the eject button on the VTR that was currently being played on the air instead of the one that they just finished with. The fella who was training me did something very annoying, but it was effective: every time I went to hit the eject button, he would make a loud cautionary sound by sucking air through his closed teeth as if to tell me I was about to make a terrible mistake. I would hesitate, double check and triple check to make sure it was the right VTR, and then I would eject the tape. He made that sound every single time my finger went for the eject button. It really got on my nerves, but it was a very good way to condition me to be cautious. Our station had a policy: the first time you eject a tape on the air got you a day off without pay; the second time put you on probation; the third time was dismissal. I had several co-workers lose their jobs and wreck the newscast due to their chronic carelessness. Thanks to my annoying trainer, I learned to check, check again, and check again. I never ejected a tape on the air. It certainly would not have been a half-billion dollar mistake if I had, but at that point in my career it would have felt like it to me.
[+] i-use-nixos-btw|2 years ago|reply
I agree that the person who made such a mistake will be the person who never makes that mistake again. That's why firing someone who has slipped up (in a technical way) and is clearly mortified is typically a bad move.

However, I don't agree that this is the "real" lesson.

Given the costs at play and the risk presented, the lesson is that if you have components that are tested with a big surge of power, give them custom test connectors that are incompatible with components that are liable to go up in smoke. That's the lesson. This isn't a little breadboard project they're dealing with, it's a vast project built by countless people in a government agency that has a reputation for formal procedures that are the source of great time, expense, and in some cases ridicule.

The "trust the 28 year old with the $500m robot that can go boom if they slip up" logic seems very peculiar.

[+] throwawayosiu1|2 years ago|reply
I'll add my story here for posterity:

My first job out of university, I was working for a content marketing startup whose tech stack involved PHP and PerconaDB (MySQL). I was relatively inexperienced with PHP but had the false confidence of a new grad (I didn't get a job for 6 months after graduating, so I was desperate to impress).

I was tasked with updating some feature flags that would turn on a new feature for all clients, except for those that explicitly wanted it off. These flags were stored in the database as integers (specifically values 4 and 5) in an array (as a string).

I decided to use the PHP function [array_reverse](https://www.php.net/manual/en/function.array-reverse.php) to achieve the necessary goal. However, what I didn't know (and didn't read up on in the documentation) is that, without the 2nd argument, it reverses the values but does not preserve the keys. This corrupted the database with the exact opposite of what was needed (somehow this went through QA just fine).

I found out about this hours later (I used to commute ~3 hrs each way), and by that time the senior leadership was involved (small startup). It was an easy fix (just a reverse script), but it highlighted many issues (QA, DB backups not working, etc.).

I distinctly remember (and appreciate) that the lead architect came up to me the next day and told me that it was a rite of passage of working with PHP: a mistake that he, too, had made early in his career.

I ended up being fired (I grew as an engineer and was better off for it), but in that moment, and for weeks after it, I was definitely demoralized.

[+] curiousObject|2 years ago|reply
The story is compellingly written, but I thought it was also confusing.

It sounds as if this team made several mistakes, not just one mistake. It's also not clear if the result of these mistakes was that there might be real damage to the spacecraft, or if the result was just wasted time and hours of confusion about why the spacecraft wouldn't start up.

The first mistake is they didn't realize that the multimeter was not only measuring, but it was also completing the circuit.

That sounds like a real bad idea. But if it was totally necessary to arrange it like that, then that multimeter should never have been touched.

That's not just one guy's error. It's at least two guys at fault, along with whoever is managing them, and whoever is in charge of the system that allows it.

The second mistake is with the break-out box. They think he misdirected power into the spacecraft. Then they jump to the conclusion that this generated a power surge which damaged the spacecraft, because it won't start up.

But they're not sure where the power surge went and what might be damaged. Anyhow they're wrong.

The reason the spacecraft won't start up is just because he took the multimeter out of the circuit before the accident.

I'm still sort of confused about what happened or if they ever really figured out what happened.

He said "Weeks of analyses followed on the RAT-Revolve motor H-bridge channel leading to detailed discussions of possible thin-film demetallization".

Does this mean that they decided that the misdirected power surge might have flowed into the RAT-Revolve motor H-bridge channel and damaged that?

[+] mk_stjames|2 years ago|reply
I am a Mechanical/Aerospace engineer.... I wish my scariest stories 'only' involved a potential bricking of a main computer on an unmanned $500M rover.

No... I was the senior safety-crit signoff on things carrying human lives. I had to look over pictures of parts broken from a crash and have the potential feeling of 'what-if that's my calculation gone wrong'. My joint that slipped. My inappropriate test procedure involving accelerated fatigue life prediction, or stress corrosion cracking. My rushing of putting parts into production processes that didn't catch something before it went out the door.

It's interesting to read people's failure stories from similar fields but, to me, the ones that people so openly write about and get shared here on HN always come across as... well, workplace induced PTSD is not a competition. It's just therapy bills for some of us more than for others.

[+] atonalfreerider|2 years ago|reply
Reminds me of the NOAA-N Prime satellite that fell over, because there weren't enough bolts holding it to the test stand.

The root cause, and someone correct me if this is not accurate, was that the x-ray-tested bolts holding it down were so expensive that they had been "borrowed" for use on another project and not returned, so that when the time came to flip the satellite into a horizontal position, it fell to the floor. Repairs cost $135M.

https://en.m.wikipedia.org/wiki/NOAA-19

[+] euroderf|2 years ago|reply
And so when "Check for bolts" is added to the flip-procedure errata/addendum, is light sarcasm called for?
[+] initplus|2 years ago|reply
Interesting to see that the worry could have been avoided if they had lined up their timelines better in the first place. If they'd compared the timestamp on the test readout to the last timestamp from the telemetry system, they'd have seen that the telemetry failed BEFORE the test was executed. Partially caused by using imprecise language "we seem to have lost all spacecraft telemetry just a bit ago" rather than an accurate timestamp.

A cautionary lesson in properly checking how exactly events are connected during an incident. Easy to look at two separate signals and assume they must be causal in a particular direction, when in reality it is the other way around.

[+] HeavyStorm|2 years ago|reply
It's an interesting story, but the author may be overselling it. It's not a _failure_ story, nor was it a $500M mistake. I get that it was really stressful and the mistake could have cost him his job, but it didn't; it also didn't cost NASA anything other than a few hours of work (which, during testing, I would guess is expected).

When I'm asked to share failures, I'm usually not thinking about "that one time when I almost screwed up but everything was fine", instead, I'm thinking of when I actually did damage to the business and had to fix it somehow.

[+] dgunay|2 years ago|reply
One thing about long aerospace missions like this with huge lead times that always gets me - you can spend years of your life working on a mission, only for it all to fail with potentially years until you can try again.

This is a refreshingly humanizing article, but is also one written from the perspective of a survivor. Imagine if the rover were actually lost. I asked the question "what would you do if the mission failed after all of this work? How could you cope?" to the folks at (now bankrupt) Masten Aerospace during a job interview, and maybe it was a bad time to ask such a question, but I didn't get the sense they knew either. "The best thing we can do is learn from failure," one of them told me. An excellent thing to do, but not exactly what I asked. This to me stands out as the defining personal risk of caring about your job and working in aerospace. Get too invested, and you may literally see your life's work go up in flames.

[+] cjbprime|2 years ago|reply
> And I still remember the shock when Project Manager Pete delivered the decision and the follow-on news: ‘These tests will continue. And Chris will continue to lead them as we have paid for his education. He’s the last person on Earth who would make this mistake again.’

I wonder whether Pete had followed this 1989 general aviation/accident analysis story:

> When he returned to the airfield Bob Hoover walked over to the man who had nearly caused his death and, according to the California Fullerton News-Tribune, said: "There isn’t a man alive who hasn’t made a mistake. But I’m positive you’ll never make this mistake again. That’s why I want to make sure that you’re the only one to refuel my plane tomorrow. I won’t let anyone else on the field touch it."

-- https://www.squawkpoint.com/2014/01/criticism/

(The incident above led to the creation and eventual mandated use of a new safety nozzle for refueling, which seems like a better long-term solution than having the people who've nearly killed you nearby to fuel your plane indefinitely: https://en.wikipedia.org/wiki/Bob_Hoover#Hoover_nozzle_and_H...)

[+] goku12|2 years ago|reply
If there is the possibility of making a mistake, somebody will certainly make it. You expect all the humans involved to be competent. But relying on that competence is a mistake. The emotional stress of dealing with such enormous responsibilities, the often long work hours, and the long list of procedures will make any competent professional inadvertently slip up at some point.

In the case of electrical connectors, they are often grouped together in such a way as to avoid making wrong connections. Connectors with different sizes, keying, gender, etc. are chosen to make this happen. This precaution is taken at design time. JPL is extremely experienced in these matters. There is probably something else left unsaid that made this mistake possible.

Meanwhile, motor controllers using H-bridges are never boring. I once saw a motor control fail so spectacularly that we were scratching our heads for days afterwards. As always, a failure is never due to a single cause (thanks to careful design and redundancies). It's a chain of seemingly innocuous events with a disastrous final outcome. But the chain was so mind-bending that we had to write it down just to remember how it happened. Recently, I was watching a dramatization of the Chernobyl nuclear disaster and was reminded of this failure. Ours was nowhere near as disastrous - but the initial mistakes, the control system instability, the human intervention and the ultimate failure propagation were very similar in nature. Needless to say, it sent us back to the drawing board for a complete redesign. The robustness of the final design taught me the same lesson - failures are something you take advantage of.

[+] avar|2 years ago|reply
Are the electronics in these rovers really so bespoke that they don't have multiple copies of each electronic component warehoused on-site?

I'd expect that the rover body itself would be bespoke this late in the process (although a parallel test vehicle would be useful; do they have one?).

But in case someone fried the rover's electronics, I'd think tearing it apart and replacing them while keeping the chassis should be doable in 2 weeks, but what do I know?

[+] mywacaday|2 years ago|reply
Any test that could cause a fatal, destructive error should be risk-assessed, with a suitable protocol approved and a four-eyes sign-off on a final checklist before going hot on the electrics. The issue here is poor project governance, not human error.
[+] nomel|2 years ago|reply
Yeah, the borrowed multimeter really hammers home how silly this all was. You don't touch other people's lab equipment unless the other ends of the wires are hanging free. A finger pointing to a meter needs to be followed up with clear confirmation that the wires can be disconnected and that special care isn't needed in the process. If I need something that's connected, I always ask the person to disconnect it for me. Definitely a process/culture problem.
[+] cheese_van|2 years ago|reply
I cost my company about 5 times my yearly salary once, long ago. I sampled an enormous amount of seismic data at 2 ms instead of the proper 4 ms. This was back when we rented our mainframes from IBM for a pretty penny. The job ran for the entire weekend, and on Monday morning I was summoned by management, informed of my error, asked, "You won't ever do that again, will you?", and sent back to work.

Knowing that you are allowed to fail, but are very much expected/required to learn from your failure, makes for rather a good employee, in my experience.

[+] fransje26|2 years ago|reply
> I was into my unofficial second shift having already logged 12 hours that Wednesday. Long workdays are a nominal scenario for the assembly and test phase.

Although the time pressure coming with the upcoming deadline is understandable, perhaps the bigger lesson here is that when you are possibly sleep-deprived and have already pulled too long a shift, you are bound to make avoidable mistakes. And that is the last thing you want on a $500M mission with a limited flight window.

[+] SoftTalker|2 years ago|reply
Reminds me of a quote attributed to Thomas J Watson:

> Recently, I was asked if I was going to fire an employee who made a mistake that cost the company $600,000. No, I replied, I just spent $600,000 training him. Why would I want somebody to hire his experience?

[+] huytersd|2 years ago|reply
Well be prepared to spend $600k more next month for the next “training” session.