
Atlassian: We estimate the rebuilding effort to last for up to 2 more weeks

502 points | tosh | 3 years ago | twitter.com

247 comments

[+] Rantenki|3 years ago|reply
As someone who is impacted, this is obviously immensely frustrating.

Worse, outside of "we have rebuilt functionality for over 35% of the users", I haven't seen any reports from the people who have ostensibly been recovered.

Next, their published RTO is 6 hours, so obviously they must have done something that completely demolished their ability to use their standard recovery methods: https://www.atlassian.com/trust/security/data-management

Finally, there have been some hints that this is related to the decommissioning of a plugin product Atlassian recently acquired (Insight asset management) which is only really useful to large organizations. I suspect that the "0.18% impacted" number is relative to ALL users of Atlassian, including free/limited accounts, and that the percentage of large/serious organizations who are impacted (and who would have a use for an asset management product), is much higher.

[+] Nextgrid|3 years ago|reply
I bet they screwed up royally, deleted some data and are down to either rebuilding it from logs, caches or other side-effects, or using data recovery software on the storage drives (which might involve third-party companies). I can't see many other reasons why this should take 2 weeks.
[+] planb|3 years ago|reply
As the person responsible for running Jira and Confluence on premises at my employer, I'm looking forward to the next time one of their sales droids contacts me to make us move to their cloud services (despite me stating multiple times that we are not interested)…
[+] perlgeek|3 years ago|reply
So, let me get this straight:

* It's been deleted for a week already, they estimate they might need two more weeks. Three in total.

* They claim to have "extensive backups", and hundreds of engineers working on it.

What? How? This simply doesn't go together. Why would restoring from backup take three weeks?

Either their backups aren't complete, or they need new software written for the restore, or something else doesn't add up.

I haven't administered their software myself, but from what I've learned on the sidelines, Jira at least doesn't seem to be rocket science. A database, an application server (maybe a few instances for larger sites), a bit of config, some caches. This really shouldn't take three weeks to restore.

[+] jpalomaki|3 years ago|reply
If you permanently deleted data for selected customers from a large multitenant system, it could actually take some time to restore it - even with proper backups.

You can't just do a full recovery, as that would mess up the customers who were not affected (it likely takes time to notice the mistake - others have continued to use the system in the meantime). You might need to write some tools to migrate the data from the backups. Also, you really need to test everything very carefully - otherwise you might be in even deeper trouble (looking at corrupted instead of lost data).

In a large organization this kind of "manual" recovery might require people from multiple teams, as no single person knows all the areas. This adds overhead. Throwing too many people at it does not help either. When you start thinking about it, a few weeks is not that long.

And JIRA is definitely not simple. It's a complicated beast, and the SaaS features combined with all the legacy likely make it even more complicated.
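The per-tenant restore problem described above can be sketched in a few lines (a toy model; the field names and data shapes are invented for illustration):

```python
# Why a selective restore is harder than a full one: rows for affected tenants
# must be pulled out of a point-in-time backup and merged back without touching
# tenants who kept writing after the snapshot was taken.

def selective_restore(backup_rows, live_rows, affected_tenants):
    """Merge backup data for affected tenants into the live dataset."""
    # Keep everything unaffected tenants wrote after the snapshot...
    merged = [r for r in live_rows if r["tenant"] not in affected_tenants]
    # ...and restore only the affected tenants' rows from the backup.
    merged += [r for r in backup_rows if r["tenant"] in affected_tenants]
    return merged

backup = [{"tenant": "a", "id": 1}, {"tenant": "b", "id": 2}]
live = [{"tenant": "b", "id": 2}, {"tenant": "b", "id": 3}]  # tenant "a" deleted
print(selective_restore(backup, live, {"a"}))
```

A full point-in-time restore would instead wipe out tenant "b"'s row with id 3, which is exactly the problem the comment describes.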

[+] jlbooker|3 years ago|reply
Restore from off-site tape backup. The kind of service where you ship them a ~dozen new tapes in a lockbox each week and they ship you the oldest dozen back. It's supposed to be the "if all of our data centers happen to burn to ashes simultaneously" option. If you say "give us all of our tapes, asap" and then have some poor souls swapping them out as fast as the data can be read... it would probably take a few weeks.
[+] Nextgrid|3 years ago|reply
My theory in my other comment is that they've deleted some data and are waiting on third-party data recovery specialists. That would explain the timescale.
[+] mdoms|3 years ago|reply
Jira cloud became much more complex when they went all in on AWS. The reason for the cloud/server fork 4 or 5 years ago was so that cloud engineers could couple to a zillion AWS services without having to maintain backwards compatibility with the server edition. So the data stores are far more disparate now than just a PgSQL DB and Redis (which is how it used to be).
[+] cryptonector|3 years ago|reply
Restoring everything from backups is very hard.

Something that can make restore-from-backups harder, and that I've seen happen, is when the backup/restore systems themselves get destroyed by the same black swan event. Then you have to first recover those by doing fresh installs, and you have to have all the people on hand who know what the configurations would have been to be able to then use the backup library. Then you have to begin restoring a few target systems to check that everything is OK with the restore process, then you have to restore everything though you'll be limited by the restore system's bandwidth.

How could this happen? Well, a disgruntled employee could make it happen. It happened at Paine Webber in 2002 [0]. In that case the attacker left a time bomb in the boot process on all systems they could reach, and that included the backup/restore servers. Worse, the time bomb was in the backups themselves, so restored systems ate themselves as soon as they were booted, which slowed down the recovery process.

  [0] https://www.independent.co.uk/news/business/news/disgruntled-worker-tried-to-cripple-ubs-in-protest-over-32-000-bonus-481515.html
  
      https://www.justice.gov/archive/criminal/cybercrime/press-releases/2002/duronioIndict.htm
[+] m0llusk|3 years ago|reply
If you are not regularly testing restores then you don't have backups.
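A minimal restore drill along those lines might look like this (the paths and the tar-based backup are illustrative assumptions, not Atlassian's actual process): back up, restore into a scratch location, and fail loudly if the round trip lost anything.

```python
import filecmp
import os
import shutil
import tempfile

def drill(source_dir):
    """Back up source_dir, restore it elsewhere, and verify the round trip."""
    with tempfile.TemporaryDirectory() as scratch:
        backup = os.path.join(scratch, "backup.tar.gz")
        restored = os.path.join(scratch, "restored")
        # "Backup": archive the directory contents.
        shutil.make_archive(backup[:-7], "gztar", source_dir)
        # "Restore": unpack into a fresh location.
        shutil.unpack_archive(backup, restored)
        # The drill: compare the restored tree against the original.
        diff = filecmp.dircmp(source_dir, restored)
        return not (diff.left_only or diff.right_only or diff.diff_files)
```

Running something like this on a schedule, against real production backups, is what turns "we have backups" into "we can restore".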
[+] tpmx|3 years ago|reply
I have no specific information on this, but in general and in theory:

Isn't this how an incompetent, insincere and desperate company being subjected to a ransom attack would communicate publicly?

[+] Too|3 years ago|reply
My guess is they failed halfway through a major schema or api migration. If some of the services have already progressed too far, then rolling back another service to previous backup snapshot will make the two incompatible. Especially if one of the services is global and the other is per customer.

The only way out is to figure out the bugs and continue migrating forward, fixing issues as they appear one by one.
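The incompatibility described above can be reduced to a toy example (the version numbers and service split are hypothetical): once a global service has migrated past the snapshot a per-customer service is restored to, the restored side speaks a schema version the global side no longer accepts, so rolling back is not an option.

```python
# The global service has completed its migration and only speaks schema v3.
SUPPORTED_SCHEMAS = {3}

def can_serve(customer_schema_version):
    """Can the global service talk to a customer shard at this schema version?"""
    return customer_schema_version in SUPPORTED_SCHEMAS

print(can_serve(3))  # migrated customer: compatible
print(can_serve(2))  # customer restored from a pre-migration backup: rejected
```

The only fix that works for everyone is migrating the restored shards forward to v3, which is the "figure out the bugs and continue migrating forward" path.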

[+] coffeeling|3 years ago|reply
What if they have backups of the data, but not the specifics of users' sites?
[+] lapser|3 years ago|reply
For those of us not up to date, what exactly has happened? Their status page hasn't actually shown why they're having to rebuild.
[+] yabones|3 years ago|reply
We're mere weeks away from migrating to their cloud platform after the self-hosted rugpull. This really doesn't give me confidence in their ability to not break my stuff.
[+] justin_oaks|3 years ago|reply
There are two ways this can go:

1) This outage will get their organization to prioritize work such that it never happens again.

2) This outage is representative of a dysfunctional organization that can't prioritize work correctly.

If you've been using Atlassian software for a while and are used to how they prioritize tickets then one of those options seems far more likely than the other.

[+] uuyi|3 years ago|reply
Same situation.

Absolutely no one even knew this was happening, and no one gives a shit now, because it's a project death march.

JIRA as a whole has been a fucking shit show of a product over the last decade even on-prem.

[+] beachy|3 years ago|reply
Why would you continue with that plan? You couldn't get a clearer warning signal.
[+] alligatorplum|3 years ago|reply
My company is in the middle of a multi year transition from selfhosting atlassian products to using their cloud offerings, and I am sure the infrastructure team/management is very thrilled to see this news.
[+] croutonwagon|3 years ago|reply
While our tenant was unaffected, I told my management about this issue. They just shrugged and said we could watch and eat popcorn. I was halfway expecting them to raise eyebrows.

I was kinda like....."not really the point of bringing it up."

It's worth noting we have had them just delete things within our account before. In fact, one of our Senior VPs had their account just....disappear one day. We couldn't @ them in chats, tickets, etc. Atlassian just shrugged, "restored" the account, and said it was some issue with a stored proc on their backend or something.

I have always felt uneasy about how flippant they are in their processes. But it seems that is not shared.

[+] GeorgeTirebiter|3 years ago|reply
I'm sure they will also be thrilled waiting sometimes seconds for a character to echo. Good Luck, I suffer with cloud Atlassian every day.
[+] ejb999|3 years ago|reply
I use JIRA and confluence every single day, I have to, it is everywhere - but imo it is such a horrific toolset in every way (even before this outage), I can't for the life of me figure out how it got so much market-share.
[+] rcurry|3 years ago|reply
Reminds me of the time a group I worked for at a National Lab decided to call the root folder for their project “core”. I bet you can’t guess what filename the backup scripts were configured to ignore…
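A plausible way that happens (purely illustrative; the pattern list is invented): backup tools often ship a default exclude list for crash dumps, and a bare "core" pattern matches a directory named core just as readily as a core-dump file.

```python
import fnmatch

# Hypothetical default exclude list, as shipped by many backup tools.
DEFAULT_EXCLUDES = ["core", "*.tmp", "lost+found"]

def is_backed_up(path):
    """Skip the path if any component matches an exclude pattern."""
    return not any(
        fnmatch.fnmatch(part, pattern)
        for part in path.split("/")
        for pattern in DEFAULT_EXCLUDES
    )

print(is_backed_up("project/core/main.c"))  # → False: the whole tree is skipped
print(is_backed_up("project/src/main.c"))   # → True
```

Everything under the project's `core` folder silently drops out of the backups, which is exactly the failure mode the anecdote describes.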
[+] clhodapp|3 years ago|reply
I wish companies would stop with this "small number of customers" messaging. It always seems disingenuous and, besides, that matters for your internal estimation of business impact but means absolutely nothing to the customers affected.
[+] debarshri|3 years ago|reply
I guess they are using JIRA for planning their sprints.
[+] jds375|3 years ago|reply
Wow! I didn’t realize the scope and duration of this outage. This must be doing some serious damage to some of their clients (catastrophic if this does impact JIRA, Confluence, and OpsGenie broadly on a company level). Is there any report of approximately how many (or specific) companies have been affected as a result of this?
[+] floatinglotus|3 years ago|reply
Atlassian does not care about individual customers. They are purely driven by numbers. I listened to a presentation by one of their founders a long time ago where he admitted that statistics and number management were part of their DNA. They don't think about the customer's name; maybe this is right, maybe not.
[+] hogrider|3 years ago|reply
That's every single company under capitalism tho. Anything else is only empty platitudes.
[+] barnabee|3 years ago|reply
If I was made to use Jira I’d be pretty happy about this
[+] ostenning|3 years ago|reply
What are the legal implications of such downtime for Atlassian? I could imagine thousands of companies unable to manage employees, product releases, bug fixes, rollbacks, and more because of this.

Could Atlassian be liable for damages?

[+] aaronbrethorst|3 years ago|reply
The wildest part about this service outage to me is that Atlassian's stock is up 10% over the past month.
[+] hogrider|3 years ago|reply
The stock market parted with reality sometime during 2020.
[+] ergocoder|3 years ago|reply
Can you imagine screwing up this badly and customers still not leaving?

I probably should buy their stock.

[+] threeseed|3 years ago|reply
Might have more to do with them moving their HQ from the UK to the US.

This would allow them access to more investors.

[+] Kwpolska|3 years ago|reply
Breaking news: stocks are meaningless numbers.
[+] radicalriddler|3 years ago|reply
The outage has been in the past week, and it's down over 10% in that time... talking about the last month's price changes is irrelevant compared to the last week.
[+] api|3 years ago|reply
It's funny that they recently killed self-hosted Jira. If you'd self-hosted you'd be fine.
[+] nolok|3 years ago|reply
To be fair, Jira screams SaaS.

I don't want any of my company trapped on it, but if they were I'm sure as well not going to self host that spawn of hell.

[+] thematt|3 years ago|reply
They've only killed off one version of self-hosted Jira. Their datacenter edition is still alive.
[+] throwawayboise|3 years ago|reply
> If you'd self-hosted you'd be fine.

Maybe. But you're counting on your sysadmin(s), who are also managing dozens of other things, to keep up to speed on Jira and its quirks, and apply patches and new versions as they become available without missing any steps or screwing something up.

On average, you're still probably better off having a company that knows the product also host it for you, but obviously they can make mistakes too, and the downside is that when they do it might affect all clients, not just one.

[+] Overtonwindow|3 years ago|reply
I worked at a company that self hosted Jira, and it was miserable then, I can’t imagine depending on the cloud. I’ll never approve Atlassian products after that experience.
[+] media-trivial|3 years ago|reply
We're using the cloud version and we're fine too (no outage). What's your point? Are you claiming that self-hosted is never down? Or that self-hosted is more reliable? Because I doubt that. Difference is just that when self-hosted goes down, it doesn't end up in the news.