top | item 14476538

(no title)

Rezo | 8 years ago

Sorry, but if a junior dev can blow away your prod database by running a script on his _local_ dev environment while following your documentation, you have no one to blame but yourself. Why is your prod database even reachable from his local env? What does the rest of your security look like? Swiss cheese I bet.

The CTO further demonstrates his ineptitude by firing the junior dev. Apparently he never heard the famous IBM story, and will surely live to repeat his mistakes:

After an employee made a mistake that cost the company $10 million, he walked into the office of Tom Watson, the C.E.O., expecting to get fired. “Fire you?” Mr. Watson asked. “I just spent $10 million educating you.”

discuss

_jal|8 years ago

Seriously. The CTO in question is the incompetent one. S/he failed:

- Access control 101. Seriously, this is pure incompetence. It is the equivalent of having the power cord to the Big Important Money Making Machine snaking across the office and under desks. If you can't be arsed to ensure that even basic measures are taken to avoid accidents, acting surprised when they happen is even more stupid.

- Sensible onboarding documentation. Why would prod access information be stuck in the "read this first" doc?

- Management 101. You just hired a green dev just out of college who has no idea how things are supposed to work. You just fired him in an incredibly nasty way for making an entirely predictable mistake that came about because of your lack of diligence at your job (see above).

Also, I have no idea what your culture looks like, but you just told all your reports that honest mistakes can be fatal and their manager's judgement resembles that of a petulant 14 year-old.

- Corporate Communications 101. Hindsight and all that, but it seems inevitable that this would lead to a social media trash fire. Congrats on embarrassing yourself and your company in an impressive way. On the bright side, this will last for about 15 minutes and then maybe three people will remember. Hopefully the folks at your next gig won't be among them.

My take away is that anyone involved in this might want to start polishing their resumes. The poor kid and the CTO for obvious reasons, and the rest of the devs, because good lord, that company sounds doomed.

dkrich|8 years ago

Yeah when I read that my first thought was that the CTO reacted that way because he was in fear of being fired himself. I wouldn't be at all surprised if he wrote that document or approved it himself.

Rezo|8 years ago

Here's some simple practical tips you can use to prevent this and other Oh Shit Moments(tm):

- Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've setup the backups correctly. Saving a few bucks here is incredibly shortsighted, your database is probably the most valuable asset you have. RDS allows point-in-time restore of your DB instance to any second during your retention period, up to the last five minutes. That will make you sleep better at night.

- Separate your prod and dev AWS accounts entirely. It doesn't cost you anything (in fact, you get 2x the AWS free tier benefit, score!), and it's also a big help in monitoring your cloud spend later on. Everyone, including the junior dev, should have full access to the dev environment. Fewer people should have prod access (everything devs may need for day-to-day work like logs should be streamed to some other accessible system, like Splunk or Loggly). Assuming a prod context should always require an additional step for those with access, and the separate AWS account provides that bit of friction.

- The prod RDS security group should only allow traffic from white listed security groups also in the prod environment. For those really requiring a connection to the prod DB, it is therefore always a two-step process: local -> prod host -> prod db. But carefully consider why are you even doing this in the first place? If you find yourself doing this often, perhaps you need more internal tooling (like an admin interface, again behind a whitelisting SG).

- Use a discovery service for the prod resources. One of the simplest methods is just to setup a Route 53 Private Hosted Zone in the prod account, which takes about a minute. Create an alias entry like "db.prod.private" pointing to the RDS and use that in all configurations. Except for the Route 53 record, the actual address for your DB should not appear anywhere. Even if everything else goes sideways, you've assumed a prod context locally by mistake and you run some tool that is pointed to the prod config, the address doesn't resolve in a local context.

unoti|8 years ago

You made a lot of insightful point here, but I'd like to chime in on one important point:

> - Unless you have full time DBAs, do use a managed db like RDS, so you don't have to worry about whether you've setup the backups correctly.

The real way to not worry about whether you've set up backups correctly is to set up the backups, and actually try and document the recovery procedure. Over the last 30 years I've seen situations beyond counting of nasty surprises when people actually try to restore their backups during emergencies. Hopefully checking the "yes back this up" checkbox on RDS covers you, but actually following the recovery procedure and checking the results is the only way to not have some lingering worry.

In this particular example, there might be lingering surprises like part of the data might be in other databases, storage facilities like S3 that don't have backups in sync with the primary backup, or caches and queues that need to be reset as part of the recovery procedure.

dsr_|8 years ago

And put a firewall between your dev machines and your production database. All production database tasks need to be done by someone who has permission to cross in to the production side -- a dev machine shouldn't be allowed to talk to it.

daxfohl|8 years ago

Would you recommend all these steps even for a single-person freelance job? Or is it overkill?

champagnepapi|8 years ago

I agree, it's the fault of the CTO. To me, the CTO sounds pretty incompetent. The junior engineer did them a favor. This company seems like it is an amateur hour operation, since data was deleted so easily by an junior engineer.

vvanders|8 years ago

Yup, I've heard stories of junior engineers causing millions of dollars worth out outages. In those case the process was drilled into, the control that caused it fixed and the engineer was not given a reprimand.

If you have an engineer that goes though that and shows real remorse your going to have someone who's never going to make that mistake(or similar ones) again.

karmajunkie|8 years ago

Yep. I had a junior working for me once a few years ago that made a rather unfortunate error in production which deleted all of several customers' data. I could tell he was on pins and needles when he brought it to me, so I let him off the hook right away and showed him the procedures to fix the issue. He said something about being thankful there was a way to fix the problem, and I just smiled and told him A) it would have been my fault if there hadn't been; and B) he wouldn't have had the access he did without safeguards in place. Then I told him a story about the time I managed to accidentally delete an entire database of quarantined email from a spam appliance I was working on several years earlier. Sadly, my CTO at the time did NOT prepare for that.

I lost a whole weekend of sleep in recovering that one from logs, and that was when I learned some good tricks for ensuring recoverability....

dheera|8 years ago

Agreed. Also, why didn't they have a backup of some sort? The hard drive on the server could have failed and it would have been just as bad.

Sounds like an incompetent set of people running the production server.

justbaker|8 years ago

"It's your first day, we don't understand security so here's the combination to the safe. Have fun!!"

cwilkes|8 years ago

"we have a bunch of guns, we aren't sure which ones are loaded, all the safeties are off and we modified them to go off randomly"

jdietrich|8 years ago

If someone on their first day of work can do this much damage, what could a disgruntled veteran do? If Snowden has taught us anything, it's that internal threats are just as dangerous as external threats.

This shop sounds like a raging tire fire of negligence.

hashkb|8 years ago

He didn't follow the docs exactly. That doesn't matter, though, your first day should be bulletproof and if it's not, it's on the CTO. The buck does not stop with junior engineers on their first day.

austenallred|8 years ago

> He didn't follow the docs exactly

Sure, but having the plaintext credentials for a readily-deletable prod db as an example before you instruct someone to wipe the db doesn't salvage competence very much.

FLUX-YOU|8 years ago

Don't tell Etsy that

ajeet_dhaliwal|8 years ago

Thanks for Tom Watson quote, I'd never heard it before, it's a good one. Also agree with everything else you just said, this is not the junior devs fault at all.

mschwaig|8 years ago

He might be inept, but in this instance the CTO is mainly just covering his own ass.

mikeryan|8 years ago

"Yeah the whole site is buggered, and the backups aren't working - but I fired the Junior developer who did it" Is not how you Cover Your Ass ™.