As someone who is impacted, this is obviously immensely frustrating.
Worse, outside of "we have rebuilt functionality for over 35% of the users", I haven't seen any reports from the people who have ostensibly been recovered.
Next, their published RTO is 6 hours, so obviously they must have done something that completely demolished their ability to use their standard recovery methods: https://www.atlassian.com/trust/security/data-management
Finally, there have been some hints that this is related to the decommissioning of a plugin product Atlassian recently acquired (Insight asset management), which is only really useful to large organizations. I suspect that the "0.18% impacted" number is relative to ALL users of Atlassian, including free/limited accounts, and that the percentage of large/serious organizations who are impacted (and who would have a use for an asset management product) is much higher.
I bet they screwed up royally, deleted some data and are down to either rebuilding it from logs, caches or other side-effects, or using data recovery software on the storage drives (which might involve third-party companies). I can't see many other reasons why this should take 2 weeks.
As the person responsible for running Jira and Confluence on premises at my employer, I'm looking forward to the next time one of their sales droids contacts me to make us move to their cloud services (despite me stating multiple times that we are not interested)…
* The data has been deleted for a week already, and they estimate they might need two more weeks: three in total.
* They claim to have "extensive backups", and hundreds of engineers working on it.
What? How? This simply doesn't go together. Why would restoring from backup take three weeks?
Either their backups aren't complete, or they need new software written for the restore, or something else doesn't add up.
I haven't administered their software yet, but from what I've learned from the sidelines, at least Jira doesn't seem to be rocket science: a database, an application server (maybe a few instances for larger sites), a bit of config, some caches. This really shouldn't take three weeks to restore.
If you permanently deleted data for selected customers from a large multitenant system, it could actually take some time to restore it, even with proper backups.
You can't just do a full recovery, as that would mess things up for the customers who were not affected (it likely takes time to notice the mistake, and others have continued to use the system in the meantime). You might need to write some tools to migrate the data from backups. You also really need to test everything very carefully; otherwise you might be in even deeper trouble (looking at corrupted data instead of lost data).
In a large organization this kind of "manual" recovery might require people from multiple teams, as no single person knows all the areas. This adds overhead, and throwing too many people at it does not help either. When you start thinking about it, a few weeks is not that long.
And JIRA is definitely not simple. It's a complicated beast, and the SaaS features combined with all the legacy likely make it even more complicated.
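For what it's worth, the "you can't just do a full recovery" point is the crux: in a multitenant store you have to carve one tenant's rows out of a restored backup copy and merge them into production without touching anyone else's data. A minimal sketch of what that tooling might look like, assuming a hypothetical per-table tenant_id column and plain Postgres on both sides (none of this is Atlassian's actual schema):

    import psycopg2  # assumes PostgreSQL on both sides; other stores need their own tooling

    # Table names and the tenant to recover are illustrative assumptions.
    TABLES = ["issues", "comments", "attachments_meta"]
    TENANT_ID = "customer-1234"

    backup = psycopg2.connect("dbname=jira_backup_restore host=backup-copy")
    prod = psycopg2.connect("dbname=jira host=prod-db")

    with backup.cursor() as src, prod.cursor() as dst:
        for table in TABLES:
            # Pull only the affected tenant's rows from the restored backup copy...
            src.execute(f"SELECT * FROM {table} WHERE tenant_id = %s", (TENANT_ID,))
            rows = src.fetchall()
            cols = [c[0] for c in src.description]
            placeholders = ", ".join(["%s"] * len(cols))
            # ...and merge them into production without clobbering rows that
            # unaffected tenants have kept writing (ON CONFLICT keeps existing data).
            for row in rows:
                dst.execute(
                    f"INSERT INTO {table} ({', '.join(cols)}) "
                    f"VALUES ({placeholders}) ON CONFLICT DO NOTHING",
                    row,
                )
    prod.commit()

Even this toy version needs per-table knowledge of keys, foreign-key ordering, and what "conflict" means for each table, which is exactly the kind of tooling that has to be written and carefully tested mid-incident.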
Restore from off-site tape backup. The kind of service where you ship them a ~dozen new tapes in a lockbox each week and they ship you the oldest dozen back. It's supposed to be the "if all of the data centers happen to burn to ashes simultaneously" option. If you say "give us all of our tapes, asap" and then have some poor souls swapping them out as fast as the data can be read... it would probably take a few weeks.
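Some rough numbers on why that takes as long as it does. Assuming LTO-8-class tapes (~12 TB native, ~300 MB/s sustained reads; figures from memory, not Atlassian's actual setup) and a single drive:

    # Rough, assumed figures -- not Atlassian's actual setup.
    TAPE_CAPACITY_TB = 12    # roughly LTO-8 native capacity
    READ_SPEED_MB_S = 300    # roughly, sustained sequential read
    TAPES = 12               # "a dozen tapes"
    DRIVES = 1               # parallel restore streams

    seconds_per_tape = (TAPE_CAPACITY_TB * 1_000_000) / READ_SPEED_MB_S
    total_hours = seconds_per_tape * TAPES / DRIVES / 3600
    print(f"~{seconds_per_tape / 3600:.0f} hours per tape, ~{total_hours / 24:.0f} days total")
    # ~11 hours per tape, ~6 days total -- before shipping, cataloguing,
    # verification, and the inevitable re-reads of marginal tapes.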
My theory in my other comment is that they've deleted some data and are waiting on third-party data recovery specialists. That would explain the timescale.
Jira Cloud became much more complex when they went all in on AWS. The reason for the cloud/server fork 4 or 5 years ago was so that the cloud engineers could couple to a zillion AWS services without having to build back any backwards compatibility. So the data stores are now much more disparate than just a PostgreSQL DB and Redis (which is how it used to be).
Something that can make restore-from-backups harder, and that I've seen happen, is when the backup/restore systems themselves get destroyed by the same black swan event. Then you have to first recover those by doing fresh installs, and you have to have all the people on hand who know what the configurations would have been to be able to then use the backup library. Then you have to begin restoring a few target systems to check that everything is OK with the restore process, then you have to restore everything though you'll be limited by the restore system's bandwidth.
How could this happen? Well, a disgruntled employee could make it happen. It happened at Paine Webber in 2002 [0]. In that case the attacker left a time bomb in the boot process on all systems they could reach, and that included the backup/restore servers. Worse, the time bomb was in the backups themselves, so restored systems ate themselves as soon as they were booted, which slowed down the recovery process.
My guess is they failed halfway through a major schema or API migration. If some of the services have already progressed too far, then rolling back another service to a previous backup snapshot will make the two incompatible. Especially if one of the services is global and the other is per-customer.
The only way out is to figure out the bugs and continue migrating forward, fixing issues as they appear one by one.
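One way to picture that incompatibility: services typically refuse to start if the shared schema version doesn't match what their code expects, so restoring one side from an older snapshot strands the other. A toy sketch (the version numbers and metadata table are made up, and `db` is assumed to be a DB-API-style connection):

    EXPECTED_SCHEMA_VERSION = 42   # what this service's code was built against (illustrative)

    def current_schema_version(db) -> int:
        # e.g. a single-row metadata table maintained by the migration tool
        return db.execute("SELECT version FROM schema_migrations").fetchone()[0]

    def check_compatibility(db) -> None:
        actual = current_schema_version(db)
        if actual < EXPECTED_SCHEMA_VERSION:
            # The data store was rolled back to an older snapshot than the code expects.
            # Refusing to start is the only safe option, which is why you can't simply
            # restore one service's backup and leave its neighbours on the new version.
            raise RuntimeError(
                f"DB schema is at v{actual}, code expects v{EXPECTED_SCHEMA_VERSION}; "
                "roll the code back too, or migrate the data forward."
            )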
We're mere weeks away from migrating to their cloud platform after the self-hosted rug pull. This really doesn't give me confidence in their ability to not break my stuff.
1) This outage will get their organization to prioritize work such that it never happens again.
2) This outage is representative of a dysfunctional organization that can't prioritize work correctly.
If you've been using Atlassian software for a while and are used to how they prioritize tickets then one of those options seems far more likely than the other.
My company is in the middle of a multi-year transition from self-hosting Atlassian products to using their cloud offerings, and I am sure the infrastructure team/management is very thrilled to see this news.
While our tenant was unaffected, I told my management about this issue. They just shrugged and said we could watch and eat popcorn. I was halfway expecting them to raise eyebrows.
I was kinda like....."not really the point of bringing it up."
It's worth noting we have had them just delete things within our account before. In fact, one of our Senior VPs had their account just... disappear one day. We couldn't @ them in chats, tickets, etc. Atlassian just shrugged, "restored" the account, and said it was some issue with a stored proc on their backend or something.
I have always felt uneasy about how flippant they are in their processes. But it seems that is not shared.
I use JIRA and Confluence every single day; I have to, it is everywhere. But IMO it is such a horrific toolset in every way (even before this outage) that I can't for the life of me figure out how it got so much market share.
Reminds me of the time a group I worked for at a National Lab decided to call the root folder for their project “core”. I bet you can’t guess what filename the backup scripts were configured to ignore…
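For anyone who hasn't been bitten by this: backup tools are commonly told to skip files named "core" so they don't archive crash dumps, and a naive exclude happily prunes a directory of the same name. A tiny sketch of the failure mode (paths and the exclude list invented for illustration):

    import os

    # A typical "don't back up crash dumps" exclude list.
    EXCLUDED_NAMES = {"core"}

    def files_to_back_up(root: str):
        for dirpath, dirnames, filenames in os.walk(root):
            # Pruning by bare name also drops a *directory* called "core",
            # silently skipping the whole project underneath it.
            dirnames[:] = [d for d in dirnames if d not in EXCLUDED_NAMES]
            for name in filenames:
                if name not in EXCLUDED_NAMES:
                    yield os.path.join(dirpath, name)

    # Anything under /project/core/ never shows up in the backup set.
    print(list(files_to_back_up("/project")))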
I wish companies would stop with this "small number of customers" messaging. It always seems disingenuous and, besides, that matters for your internal estimation of business impact but means absolutely nothing to the customers affected.
Wow! I didn’t realize the scope and duration of this outage. This must be doing some serious damage to some of their clients (catastrophic if this does impact JIRA, Confluence, and OpsGenie broadly on a company level). Is there any report of approximately how many (or specific) companies have been affected as a result of this?
Atlassian does not care about individual customers. They are purely driven by numbers. I listened to a presentation by one of their founders a long time ago where he admitted that statistics and number management were part of their DNA. They don't think about customers by name; maybe this is right, maybe not.
Meanwhile, people have been waiting since at least 2013 for Atlassian to deliver a way to automate backups for their Cloud offerings: https://jira.atlassian.com/browse/CLOUD-6498
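Until there's an official automation API, people script it themselves against whatever the admin UI's backup page calls. The sketch below is purely illustrative: the endpoint, payload, and credentials are assumptions about how such a workaround might look, not a documented Atlassian API:

    import requests  # third-party: pip install requests

    SITE = "https://your-site.atlassian.net"        # placeholder site
    AUTH = ("admin@example.com", "api-token-here")  # placeholder credentials

    # Hypothetical endpoint/payload: in practice people reverse-engineer whatever
    # the backup page in the admin UI calls, since there is no supported
    # automation API for Cloud backups (hence the ticket above).
    resp = requests.post(
        f"{SITE}/rest/backup/1/export/runbackup",
        json={"cbAttachments": "true"},
        auth=AUTH,
        timeout=30,
    )
    resp.raise_for_status()
    print("backup triggered:", resp.status_code)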
What are the legal implications of such downtime for Atlassian? I could imagine thousands of companies unable to manage employees, product releases, bug fixes, rollbacks and more because of this.
Maybe. But you're counting on your sysadmin(s), who are also managing dozens of other things, to keep up to speed on Jira and its quirks, and apply patches and new versions as they become available without missing any steps or screwing something up.
On average, you're still probably better off having a company that knows the product also host it for you, but obviously they can make mistakes too, and the downside is that when they do it might affect all clients, not just one.
I worked at a company that self-hosted Jira, and it was miserable then; I can't imagine depending on the cloud. I'll never approve Atlassian products after that experience.
We're using the cloud version and we're fine too (no outage). What's your point? Are you claiming that self-hosted is never down? Or that self-hosted is more reliable? Because I doubt that. Difference is just that when self-hosted goes down, it doesn't end up in the news.
Isn't this how an incompetent, insincere and desperate company being subjected to a ransom attack would communicate publicly?
https://twitter.com/Atlassian/status/1511870509973090304
Most likely they wiped the data.
Absolutely no one even knew this was happening, and no one gives a shit now, because it's a project death march.
JIRA as a whole has been a fucking shit show of a product over the last decade, even on-prem.
I don't envy the engineers there right now, in some departments especially. Stay strong, don't burn out!
https://architectureau.com/articles/worlds-tallest-hybrid-ti...
And there's still very little movement: https://community.atlassian.com/t5/Backup-Restore-articles/E...
But don't worry! It's in the Cloud! It's all fine!
Could Atlassian be liable for damages?
I probably should buy their stock.
This would allow them access to more investors.
I don't want any part of my company trapped on it, but if it were, I'm sure as hell not going to self-host that spawn of hell.