
parksy | 3 years ago

Many of the projects I worked on had a near-identical replica of the production environment, from the network stack down to the application and databases. Flipping from staging to production was sometimes as simple as a DNS update. It's always an eye-opener to see businesses at this scale operating without a full replica of production in staging. Testing a destructive change on dummy data is harrowing when you know there are a million ways a live deployment could go wrong, and since the impact scales with the size of the business, that kind of redundancy only becomes more important.
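
To illustrate the kind of flip I mean, here's a minimal sketch of repointing a CNAME so the public hostname resolves to whichever replica should take live traffic. Everything in it is hypothetical (the zone ID, the hostnames, and the choice of AWS Route 53 via boto3); it shows the pattern, not anything Atlassian actually runs.

    # Hypothetical blue/green-style cutover: repoint a CNAME so the
    # public hostname resolves to the environment that should be live.
    import boto3

    route53 = boto3.client("route53")

    def point_at(target_host: str) -> None:
        """Repoint app.example.com (hypothetical) at the given environment."""
        route53.change_resource_record_sets(
            HostedZoneId="Z0000000EXAMPLE",  # hypothetical hosted zone
            ChangeBatch={
                "Comment": f"Cut over to {target_host}",
                "Changes": [{
                    "Action": "UPSERT",
                    "ResourceRecordSet": {
                        "Name": "app.example.com",
                        "Type": "CNAME",
                        "TTL": 60,  # short TTL so the flip takes effect quickly
                        "ResourceRecords": [{"Value": target_host}],
                    },
                }],
            },
        )

    # Promote the staging replica to serve production traffic;
    # rolling back is the same call pointed the other way.
    point_at("staging.example.com")

The point being: when staging is a true replica of production, promotion and rollback collapse into the same cheap, reversible operation, which is exactly the redundancy I'm talking about.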

It could be argued that at the scale of a company like Atlassian this level of redundancy is prohibitively expensive: that's a lot of databases and files sitting around doing nothing 99.99% of the time, and it's hard to argue for preventing something that's never happened and would be costly to tool up for. But you can definitely factor scaling your redundant capacity into your model, both pricing-wise and engineering-wise. Atlassian products aren't cheap to begin with; I'm sure they can sustain some hit to velocity or the bottom line for the sake of something as basic as fully replicated staging environments. I definitely don't think this is on the engineers. It's a strategic oversight, and it shows where ultimate priorities lie within the company.

Putting your trust in a cloud service to take care of things you'd otherwise have to worry about yourself is a major decision. Safety is one of the top priorities of basically every user, and the lack of process and the glib approach to staging are a major red flag.

Anyway, that aside, I do appreciate their detailed write-up, and it does feel like a bluntly honest disclosure. That goes a long way towards restoring trust, but it also exposes some of how the sausage is made, and it's clear some of the ingredients are questionable. It bears the hallmarks of a small, successful software startup hitting the big time and scaling through acquisitions faster than its supporting processes can safely scale. They have a team of engineers and it's up to them where to deploy that effort, and it seems being able to do proper dry runs of destructive changes wasn't seen as more valuable than getting more services onto the products page.

Hopefully they'll act on the recommendations of the report and implement the improvements they said they would, rather than refocusing their efforts elsewhere once the spotlight moves on. As a long-term Atlassian user I'd like to see regular updates on this; it would factor greatly into whether I recommend Atlassian products over other stacks in the future. They could easily set up a public Jira or Trello board so we could track progress on these promises.

Obviously this isn't unique; these mistakes have happened before, so this isn't just the kind of thing that seems obvious in hindsight. I'm sure there were engineers highlighting these issues internally, but scaling redundancy is never as sexy as onboarding a new product and adding its customers (and revenue) to your quarterly reports. Hopefully the reputation hit is a stark reminder to the C-suite that yes, they are running a technology company, and that means technology and engineering should be just as important as growth and market penetration.

Anyway, good on them for being open. Well done to the engineers who worked to untangle the mess, and good on management for allowing this level of transparency and taking ownership. Things could have been a lot worse, by the sounds of it.
