
InfluxDB Cloud shuts down in Belgium; some weren't notified before data deletion

443 points | PaulAA | 2 years ago | community.influxdata.com | reply

266 comments

[+] StopHammoTime|2 years ago|reply
It seems this probably happened due to some regulation or other. The sunset date for the service should have been a month earlier, so that Influx could have legally kept the data until the 30th in case something like this happened.

They wanted to have the euros flowing in right until the last minute.

What should have happened:

1. Flash messages on all user-facing consoles.
2. No new resources able to be created for 6 months.
3. Emails.
4. A service end date at least a month before the mandatory shutdown.
5. Aggressive contact attempts for anyone still running workloads in May, to ensure they were aware.
6. The console in the region switching to a final backup that the user could export or move to another region. This should have been available for 30 days.

You don’t do this because it’s fun, you do this because you need to save reputation. If I can’t trust you with business critical data then why would I use you for my critical business?

Also, as someone who works for a large enterprise, if you really believe email is the way to inform them of these changes, well I’d reconsider your beliefs.

[+] XCabbage|2 years ago|reply
There's no regulatory consideration involved as far as I can tell. On Slack at https://influxcommunity.slack.com/archives/CH8TV3LJG/p168894... they explain the shutdown thus:

> "The region did not get enough usage or growth to make it economically viable to operate, so it became necessary for InfluxData to discontinue service in those regions."

So it's worse than you believe. Yes, the handling is a scandal for all the reasons you say. But they weren't even pushed into this by some regulatory issue; it's pure cost-cutting.

[+] EdwardDiego|2 years ago|reply
Given they shut down two DCs half a world apart, it's not regulations. It's cost.
[+] shin_lao|2 years ago|reply
It reeks of a company trying to cut costs as fast as possible.
[+] raverbashing|2 years ago|reply
And make no mistake, some people will still miss the notification even after all these warnings.
[+] anileated|2 years ago|reply
Perhaps service shutdown is also the only valid case where it can be okay to intermittently fail API requests?
[+] ikiris|2 years ago|reply
Companies generally want to be paid for the cost of holding your data liabilities, yes.
[+] jjgreen|2 years ago|reply
“But look, you found the notice, didn’t you?”

“Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard.’”

[+] dangoodmanUT|2 years ago|reply
It's hard to overstate how bad this is. Your #1 expectation as a cloud database provider is to keep data safe and recoverable.

I hope, at least for their sake, that they took a backup of everyone's DB that could be restored in another region, but given that they didn't do a scream test, I doubt they thought about this either.

This must have been forced on them by upper management, because there is no way that nobody along the chain of people actually deleting the data suggested a scream test. No way someone didn't say "this is a terrible idea, email is not reliable".

Adding Influx right next to GCP on the list of providers I'm never using. Self-hosting is the way; use ClickHouse.

[+] d33|2 years ago|reply
In case anyone else is wondering:

> The Scream Test is simple – remove it and wait for the screams. If someone screams, put it back. The Scream Test can be applied to any product, service or capability – particularly when there is poor ownership or understanding of its importance.

https://www.v-wiki.net/scream-test-meaning/
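For an API-fronted service, the mechanics can be trivial. A minimal sketch of the idea, assuming a hypothetical request object with `region` and `account_id` fields (none of this is Influx's actual stack):

```python
# Hypothetical scream-test gate for a region being sunset: reject traffic
# loudly, leave the data intact, and record every account that "screams"
# so it can be contacted individually.
import logging

SUNSET_REGION = "eu-central-1"   # illustrative region name
screamers = set()                # accounts still sending traffic

def scream_test_gate(request):
    """Return an error payload for sunset-region traffic, None otherwise."""
    if request.region != SUNSET_REGION:
        return None  # unaffected region, handle normally
    screamers.add(request.account_id)
    logging.warning("post-sunset traffic from account %s", request.account_id)
    return {
        "status": 410,
        "error": "This region is being retired. Your data is still intact; "
                 "contact support to migrate it to another region.",
    }
```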

[+] influxisdead|2 years ago|reply
I agree. This should be an indication to all current users that they should no longer trust InfluxData with their business.

The CTO seems to have been checked out for a long time (just look at how little developer engagement there is on here) and the CEO seems to have no idea how to run a DBaaS. The fact that nobody else from the company has stepped in to try and defuse this should terrify anyone who has data on InfluxData's cloud.

This is the beginning of the end. It seems like all of the good people have left the company, and being willing to destroy credibility to cut costs is a clear sign that the company is running on fumes.

So, now is the time - find your alternative, whether it's Timescale, QuestDB, VictoriaMetrics, ClickHouse, or just self-hosting.

[+] jacquesm|2 years ago|reply
This is pretty much corporate suicide. I really don't understand what they are trying to achieve with this and their attitude in this thread is baffling.
[+] jkaplowitz|2 years ago|reply
I agree with your comments about how Influx handled this shutdown.

The several things you might mean by self-hosting have their own pros and cons. The right choice is very context-specific, and assuming that it’s always the right choice is wrong. It certainly can be, though.

As for ClickHouse, that mention seems like a throwaway comment, unless you are advocating a boycott of even the open source InfluxDB due to its corporate author’s behavior and view ClickHouse as the closest alternative.

This incident has nothing to do with the comparison of the open source InfluxDB vs the open source ClickHouse, nor would it impugn the viability of InfluxDB hosted by a more responsible data custodian than Influx the company.

And GCP hasn’t done any similar inadequately notified shutdown of service with immediate and irreversible data loss, as far as I know.

(Disclosure: I have worked for Google in the past, including GCP, but not in over 8 years. I’m speaking only for myself here. I’ve never worked for Influx or ClickHouse.)

[+] simonw|2 years ago|reply
This kind of thing really does need a cooling off period.

Assume that your users won't see your emails. How do you help them avoid data loss when you shut down a service like this?

One option that I like is to take the service down (hence loudly breaking things that were depending on it) but keep backed up copies of the data for a while longer - ideally a month or more, but maybe just two weeks.

That way users who didn't see your messaging have a chance to get in touch and recover any data they would otherwise lose.

I'm not sure how best to handle the liability issues involved with storing backups of data for a period of time. Presumably the terms and conditions for a service can be designed to support this kind of backup storage "grace period" for these situations.
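A sketch of what that grace period could look like in a provider's shutdown tooling; every storage call here is an assumed, illustrative API, not anything Influx actually exposes:

```python
# Hypothetical shutdown flow: the service goes dark immediately (loud
# failure), but a final snapshot of each database is retained for a grace
# window and only hard-deleted once the window expires with no recovery
# request.
from datetime import datetime, timedelta, timezone

GRACE_PERIOD = timedelta(days=30)

def shut_down_region(region, storage):
    """Disable endpoints but snapshot every database before any deletion."""
    for db in storage.list_databases(region):        # assumed API
        storage.snapshot(db, label=f"final-{region}")
        storage.disable_endpoints(db)                # breaks loudly, loses nothing

def reap_expired_snapshots(storage):
    """Hard-delete only snapshots whose grace period has fully elapsed."""
    now = datetime.now(timezone.utc)
    for snap in storage.list_snapshots(label_prefix="final-"):
        if snap.restore_requested:                   # a user got in touch: keep it
            continue
        if now - snap.created_at > GRACE_PERIOD:
            storage.delete_snapshot(snap)
```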

[+] SSLy|2 years ago|reply
You start with reliability brownouts: first fail 0.1% of requests, then after a week 1%, then after a month 5%.
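As a sketch, that schedule can live in a few lines of the request path; the dates and wording here are purely illustrative:

```python
# Hypothetical brownout: fail an escalating fraction of requests in the
# sunset region so that monitoring and on-call humans notice long before
# any data is actually deleted.
import random
from datetime import date

BROWNOUT_SCHEDULE = [          # (start date, fraction to fail) -- illustrative
    (date(2023, 4, 1), 0.001), # 0.1%
    (date(2023, 4, 8), 0.01),  # 1%
    (date(2023, 5, 8), 0.05),  # 5%
]

def failure_rate(today):
    """Latest schedule entry whose start date has passed, else 0.0."""
    rate = 0.0
    for start, fraction in BROWNOUT_SCHEDULE:
        if today >= start:
            rate = fraction
    return rate

def maybe_brownout(today=None):
    """Return an error payload for a random slice of requests, else None."""
    today = today or date.today()
    if random.random() < failure_rate(today):
        return {"status": 503,
                "error": "This region is being retired; see the shutdown "
                         "notice before continuing."}
    return None
```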
[+] pauldix|2 years ago|reply
Hi, cofounder and CTO here. We notified everyone via email on February 23, April 6 and May 15th. We also offered to help migrate all users. I realize that it's not ideal that we've shut down this system, but we made our best efforts to notify affected users and give them options to move over to other regions. If you've been impacted by this, please email me personally and I will do my best to help out: paul at influxdata.com.
[+] js2|2 years ago|reply
Paul: I'm surprised you didn't do a scream test. Not everyone is going to see those emails and even those that do may not understand what they are reading.

Internally at my company we always do scream tests as part of our EOL process because we know we can't reach everyone, even our own employees.

https://www.microsoft.com/insidetrack/blog/microsoft-uses-a-...

Fun story: my mortgage got sold last year. Not the first time. I got emails from the old mortgage company and the new mortgage company about the sale, but I skimmed them. I got letters via USPS from the old and new mortgage companies, but I mostly ignored those because 95% of what mortgage companies send me via USPS is junk. So I missed the fact that my automatic payments didn't transfer over. The new mortgage company let me get four months in arrears before they finally FedEx'd me something overnight. That got my attention. I was like: you guys should've FedEx'd me this in the first place. For all they knew, I wasn't getting their emails or letters in the first place because nothing had been sent signature required.

[+] SkyPuncher|2 years ago|reply
Wow, that's pretty pathetic, and your attitude of "we can't help our customers" is even more damning. Email is not reliable enough for a few email blasts to be the sole notification for something like this.

I would expect:

* Those 3 "email blast" notifications. I'm guessing one of two things happened here:

  * You sent them as an "email blast" from a marketing-type email service. These hit email filters because they came from a known spam IP.

  * You sent them as a transactional email, but blasted them too quickly and got pegged for spam. Never hit the inbox.
* Increasingly frequent "you haven't migrated" emails if you still detect traffic on these instances. This is pretty critical since some companies might not realize they're affected. They should, but things get complex.

* Ideally, an automated transfer to another region with automated forwarding (see the sketch below). It's okay to have poor performance, but it's not okay to go "poof" entirely.

* A soft-delete at the deadline, with 90 to 180 days to finalize migration. If this is costing you dearly, then drive prices up, but don't hard delete data.

Frankly, the last one is the real issue. It's literally unbelievable that a database provider didn't soft-delete. Further, I would expect that you'd be able to migrate these to another region to get customers back up and running.
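A sketch of the forwarding idea from the list above, assuming a hypothetical new-region endpoint; a real deployment would proxy with auth intact, but even a bare redirect keeps migrated clients alive instead of silently dropping them:

```python
# Hypothetical forwarding shim: after migration, the retired regional
# endpoint answers with a permanent redirect to the new region instead of
# disappearing, turning data-losing silence into a recoverable client error.
from http.server import BaseHTTPRequestHandler, HTTPServer

NEW_REGION_BASE = "https://eu-west-1.example-cloud.com"  # assumed endpoint

class RegionForwarder(BaseHTTPRequestHandler):
    def _forward(self):
        # 308 preserves method and body, so writes survive for clients that
        # follow redirects; others fail loudly with a Location hint.
        self.send_response(308)
        self.send_header("Location", NEW_REGION_BASE + self.path)
        self.end_headers()

    do_GET = _forward
    do_POST = _forward
    do_PUT = _forward
    do_DELETE = _forward

if __name__ == "__main__":
    HTTPServer(("", 8086), RegionForwarder).serve_forever()
```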

[+] SentinelRosko|2 years ago|reply
This is insane.

> We notified everyone via email on February 23, April 6 and May 15th. We also offered to help migrate all users. I realize that it's not ideal that we've shut down this system, but we made our best efforts to notify affected users and give them options to move over to other regions.

What other communication methods were attempted beyond just emails? Big, red obnoxious banners and warnings in various UIs? Phone calls?

Given that quite a few customers apparently didn't get your emails, what was the thought process when looking at the workloads that were clearly still active before nuking them from orbit? Or was there no check, and it was just assumed that people got the email and migrated?

Of the customers who were in that region, how many actually migrated? Was someone tracking these statistics and regularly reporting them to leadership to adjust tactics if there weren't enough migrations or shutdowns happening?

This screams either gross incompetence or straight up negligence. This is such a solvable problem (as many here have already mentioned various solutions), but I'm honestly just flabbergasted that this is a problem that is even being discussed here right now.

As a DBaaS, the data of your customers should be your number one priority. If it's not, y'all need to take a hard look at what the heck your value proposition is.

We weren't impacted by this directly, but you can be sure that this is going to be one of the topics for discussion among my teams this week: mostly how we can either move off InfluxDB Cloud or ensure that our DR plans are up to date for you pulling the rug out from under us in the future.

[+] chillfox|2 years ago|reply
Email only is not even close to best effort. I know it’s standard to only do email for tech companies, but all other types of companies usually do physical mail and phone calls on top of emails for important notifications.

I am not a customer, but it’s really annoying me how tech companies repeatedly think sending emails is somehow anything but the absolute minimum, most lazy option.

[+] arp242|2 years ago|reply
> we made our best efforts to notify affected users

You call three emails (the last of which was almost 2 months ago) "best efforts"?

I had to read your message three times because it is so reality-defyingly preposterous that I couldn't believe I wasn't missing something. How about warnings on the dashboard? How about an intentional error (or limited service interruption) so that people would log in to their dashboard?

[+] jacquesm|2 years ago|reply
Hi Paul, email is one-way communication and not guaranteed to be delivered. At a minimum you should have monitored who did and did not respond to the email with some kind of action, and expended more effort to reach those who did not. Finally, you should have kept the data for a reasonable amount of time (say 90 days) post-shutdown so users who did not get the notification could download it. What you've done is super rude, and if I were still a customer in an unaffected region it would definitely be reason enough to leave, because it's pointless to sit and wait to see how you'll deal with my data when the time comes. Better to preempt that and leave while I still have control.
[+] axman6|2 years ago|reply
Paul, are you actually for real right now? Did you really just say "We deleted all your data, and it's your fault. We did whisper into the wind three times; you should have heard it. No, there is no chance of recovery"?

You might have literally deleted people's whole businesses: companies that employ real people, people with families, who now need to figure out how to continue. Not least of which, your own. If the company survives until Christmas I will be shocked; no one can trust your company ever again. Your core business is storing other people's data, and you deleted it, for many completely without warning.

I guess people still use Mongo even after finding it doesn't achieve any property of the CAP theorem, so maybe some people will keep using a database provider with a track record of intentionally deleting their paying customers' data.

There just aren't enough adjectives for astonishment to adequately describe this situation.

I hope you offer Jay Clifford some support; he's clearly been put in the awful situation of having to explain the decisions of others and deliver the awful news. If I were him, I would be in need of serious mental health support. This is an absolutely awful thing to have responsibility for without any ability to rectify it.

[+] mlhpdx|2 years ago|reply
Contrary to the majority of the thread here, I find this to be an architectural issue. For whatever reason the system was designed without a way to communicate important service and maintenance issues to the customer. That’s part of the good architectural design of a system – it must include human factors, communication among them.
[+] olliej|2 years ago|reply
Multiple comments in the linked issue report not receiving an email.

Did you use the same email you use for spam/"marketing" for this notification?

The correct course of action is to shut down the service and give people time to fetch their data, not to erase the data as the first indication of shutdown.

A few emails are not sufficient if the end result is data loss, and a mention in documentation or release notes is not sufficient either (the only reference at least one person in the linked issue found).

Truly mind-blowing behavior.

[+] asgeirn|2 years ago|reply
Former Belgium user here. Checked my inbox, no emails from Influx since June 2022.

Then again, I was only using the free tier, so I guess I got what I paid for.

[+] DavidKarlas|2 years ago|reply
Why did you feel the need to send 3 emails, and not just 1? Is it because you find email not reliable enough?
[+] dangoodmanUT|2 years ago|reply
> I realize that it's not ideal that we've shut down this system

Not ideal???

You backed up everyone's DB and moved that to another region so they can just restore and change DB endpoints, right?

I don't believe that someone along the chain didn't suggest a scream test or similar. If they did, they must have been ignored.

[+] ratg13|2 years ago|reply
If you are responsible for this, the very least you can do is own up to it and apologize.

Trying to assert that you were doing what you thought was right only presents the image that your company is run poorly.

The correct thing to do is to admit that your best efforts were not aligned with best practices, and look into remediation.

Not “well, we tried”

[+] santafen|2 years ago|reply
You could have just responded with ¯\_(ツ)_/¯ and saved a lot of typing.
[+] ksajadi|2 years ago|reply
Unfortunately we’ve been bitten by Influx operational issues a few times before. We adopted InfluxDB a long time ago and always had to deal with breaking changes on each upgrade, and every time we had an issue their answer would be: upgrade to the latest version and see if it persists.

Then recently they made a change to Telegraf that broke all our data collection because they changed the environment variable replacer and their Jsonnet parser broke.

Now this. Shutting down a region with only emails and no brownouts is not operationally acceptable.

We moved off InfluxDB a while ago and only rely on Telegraf now.

[+] galleywest200|2 years ago|reply
We self-host InfluxDB and have never had this problem.
[+] asymptotic|2 years ago|reply
At AWS, the hierarchy of service priorities is crystal clear: Security, Durability, and Availability. In that order. Durability, the assurance that data will not be lost, is a cornerstone of trust, only surpassed by security. Availability, while important, can vary. Different customers have different needs. But security and durability? They're about trust. Lose that, and it's game over. In this regard, InfluxDB has unfortunately dropped the ball.

Deprecation of services is a common occurrence at AWS and many other tech companies. But it's never taken lightly. A mandatory step in this process is analyzing usage logs. We need to ensure customers have transitioned to the alternative. If they haven't, we reach out. We understand why. The idea of simply "nuking" customer data without a viable alternative is unthinkable.

The InfluxDB incident brings to light the ongoing debate around soft vs. hard deletion. It's unacceptable for a hard delete to be the first step in any deprecation process. A clear escalation process is necessary: notify the customer, wait for explicit acknowledgement, disable their APIs for a short period, extend this period if necessary, soft delete for a certain period, notify again, and only then consider a hard delete.
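That escalation ladder maps naturally onto a per-customer state machine in which no step can skip ahead to hard deletion. A rough sketch, with stages and rules that are illustrative rather than AWS's or Influx's actual process:

```python
# Hypothetical deprecation state machine: every customer walks the ladder
# one rung at a time, and nothing jumps straight to irreversible deletion.
from enum import Enum, auto

class Stage(Enum):
    NOTIFIED = auto()       # emails plus in-console banners sent
    ACKNOWLEDGED = auto()   # customer explicitly confirmed the notice
    API_DISABLED = auto()   # loud failures, data fully intact
    SOFT_DELETED = auto()   # data hidden but recoverable on request
    HARD_DELETED = auto()   # irreversible, the very last resort

ALLOWED = {
    Stage.NOTIFIED:     {Stage.ACKNOWLEDGED, Stage.API_DISABLED},
    Stage.ACKNOWLEDGED: {Stage.API_DISABLED},
    Stage.API_DISABLED: {Stage.SOFT_DELETED},
    Stage.SOFT_DELETED: {Stage.HARD_DELETED},
    Stage.HARD_DELETED: set(),
}

def advance(current: Stage, target: Stage) -> Stage:
    """Refuse any transition that skips a rung of the escalation ladder."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal deprecation step: {current} -> {target}")
    return target
```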

The so-called "scream test" (https://www.v-wiki.net/scream-test-meaning/) is not a viable strategy for a cloud service provider. Proactive communication and customer engagement are key.

This incident is a wake-up call. It underscores the importance of data durability and effective, respectful customer communication in cloud services and platform teams. Communication is more than three cover-your-ass emails; it's caring about your customers.

[+] lopkeny12ko|2 years ago|reply
Wow, the incredibly callous 3-word explanation of the issue by pointing to a docs link with no other context. Really gives off "it's your fault for not reading the wiki." Is this how InfluxDB treats their customers?

Incidentally at work we've been evaluating a new hosted observability provider, looks like we can rule out Influx as an option.

[+] john_max_1|2 years ago|reply
5% interest rates are breaking tech companies. If you are dependent on a SaaS service for your infra, ensure that it is either

  - self-hosted
  - provided by a big, deep-pocketed cloud provider
Otherwise, the service might shut down with 30 days' notice or so.
[+] plasma|2 years ago|reply
Hard-to-reverse actions need multiple safety switches: for example, turning off the machines in that region for 2 weeks before deleting them, which would surface support issues ahead of the no-going-back step of deleting data.
[+] gtirloni|2 years ago|reply
So many easy ways that this could have been avoided. sigh

- Phone calls

- Scream tests

- Monitor services still in use. Contact these customers individually

- ...

Not a single individual said "Gee, people are still using that DC, should we really destroy it?"

Either this shows Influx is really naive and inexperienced or... they are in deep trouble cash-wise and were working in panic mode to cut costs.

[+] tredre3|2 years ago|reply
> According to the support, the notification emails to the users were sent on Feb 23, Apr 6 and May 15th. However, we did not receive those at all.

If true, this is concerning. One message getting lost in spam is understandable, but three over 6 months would imply they're being blacklisted and/or their mail sender is simply broken.

Do serious companies not have canaries or other checks in place to ensure their notifications are correctly delivered to customers?
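They can, and it's cheap. A sketch of a delivery canary, with placeholder hosts and credentials: seed mailboxes at the big providers receive every critical blast, and a checker confirms each one landed before any destructive deadline is allowed to pass.

```python
# Hypothetical delivery canary: seed addresses at major providers are on
# every critical notification list; after a blast, check over IMAP that
# the message actually arrived (a spam-folder check would be step two).
import imaplib

SEED_MAILBOXES = [
    # (imap host, user, password) -- placeholders, not real credentials
    ("imap.gmail.com", "canary-a@example.com", "app-password"),
    ("outlook.office365.com", "canary-b@example.com", "app-password"),
]

def campaign_delivered(subject: str) -> bool:
    """Return True only if every seed mailbox received the notification."""
    for host, user, password in SEED_MAILBOXES:
        with imaplib.IMAP4_SSL(host) as imap:
            imap.login(user, password)
            imap.select("INBOX")
            status, hits = imap.search(None, "SUBJECT", f'"{subject}"')
            if status != "OK" or not hits[0]:
                return False  # missing at this provider: page a human
    return True
```

If the check fails, the send pipeline halts and a person investigates, rather than the countdown to deletion ticking on regardless.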

[+] shrubble|2 years ago|reply
It's tough to believe that turning off the service couldn't have involved at least a week of 'soak' time, where if you contacted them they would help you move to another location. After all, the cost/benefit of keeping the VMs around (consuming no CPU or bandwidth) versus retaining a few customers indicates it's the right thing to do for both the customers and the business.
[+] thelittlenag|2 years ago|reply
I've seen better communications around company-internal services that have been deprecated and for which a replacement exists that we need to migrate to. Heck, I've seen this a couple of times.

We had even tried out Influx a few different times. It was always OK, but never quite good enough. Now, with this, I think that seals the deal on me ever considering Influx again, either as a product or as a service.

[+] kgeist|2 years ago|reply
>I've seen better communications around company-internal services

Our team maintains an internal CRM. When we plan to delete data or deprecate features, what we usually do (beyond sending emails):

- hide features/data from the UI without actually deleting them; if no one complains after a few weeks, proceed with the removal

- for critical data, make sure there are backups and store them for about a month; if no one requests them, delete them

[+] RedShift1|2 years ago|reply
What did you choose over Influx?
[+] linsomniac|2 years ago|reply
I like Influx and have been using it for all our TSDB needs for ~7 years. It sounds like there have been a lot of shake-ups internally, and all their hopes for Flux (their newer query language, meant to replace InfluxQL) have been abandoned. I'm wondering how they're doing.

This situation doesn't seem like it's specifically them doing anything wrong, if indeed they did send out multiple notifications over 6 months. It sounds like it caught many people by surprise though, which makes one wonder if there was a problem with their announcements of this change. It definitely seems like it could have been handled better than "deletion with no recovery", though; some sort of "shut down, wait a week or a month, delete" would have been better.

[+] The_CK|2 years ago|reply
I received a response from their support and it’s hilarious.

TLDR: “We don’t plan to delete your data again in the foreseeable future”

Full quote: “You can sign up for a new account here InfluxDB Cloud using a different region. We want to assure you that there are no more scheduled shutdowns planned. Therefore, once you have created the new account and begin writing to it, we do not foresee any data loss going forward”

No more information was provided in the email.

[+] sergiotapia|2 years ago|reply
This could have been easily mitigated with a giant red ugly banner "YOUR DATA WILL BE LOST IN X DAYS. MIGRATE NOW".

Three emails clearly weren't enough, right? Now their name is in the gutter, customers are pissed, and my only exposure to InfluxDB is a negative one.

I hope other SaaS companies learn from this very expensive lesson.

[+] jmaker|2 years ago|reply
What a terrible start for InfluxDB 3. And that incomprehensible justification on their side… what a disappointment… I’ve been anticipating InfluxDB 3 going GA later this year and was just about to subscribe to their cloud offering since they’re making it available only over there presently. And I was going to migrate more workflows to the TICK stack. But they’ve just nuked their credibility in my eyes. Hope they can still recoup the dev costs for InfluxDB 3, but I’m now going to be very cautious about that company going forward. Hope Influx OSS remains viable. Inconceivable…
[+] testemailfordg2|2 years ago|reply
A lesson for us and for the senior management of companies considering cloud hosting: if you are saving money by moving away from on-premise, or by using a managed service to reduce employee costs, then you need to factor in these disaster scenarios and have procedures in place. If the senior management folks at the time only thought about cost savings and not business continuity, then the blame for this fiasco should also fall on their heads; they took credit for saving money, and now it's time to take credit for the data loss.
[+] xyst|2 years ago|reply
A total of 3 emails sent, lol.

Hope they didn’t have any big corporate customers impacted. Some big corps would very easily use that to cancel contracts and void payments, then let the lawyers deal with it.