Fly.io Postgres cluster down for 3 days, no word from them about it

[+] spiderice|2 years ago|reply

There is now a response to the support thread from Fly[1]:

> Hi Folks,

> Just wanted to provide some more details on what happened here, both with the thread and the host issue.

> The radio silence in this thread wasn’t intentional, and I’m sorry if it seemed that way. While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally. If we’d seen it earlier, we’d have offered more details then.

> More on what happened: We had a single host in the syd region go down, hard, with multiple issues. In short, the host required a restart, then refused to come back online cleanly. Once back online, it refused to connect with our service discovery system. Ultimately it required a significant amount of manual work to recover.

> Apps running multiple instances would have seen the instance on this host go unreachable, but other instances would have remained up and new instances could be added. Single instance apps on this host were unreachable for the duration of the outage. We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

> The main status page (status.fly.io) is used for global and regional outages. For single host issues like this one we post alerts on the status tab in the dashboard (the emergency maintenance message @south-paw posted). This was an abnormally long single-host failure and we’re reassessing how these longer-lasting single-host outages are communicated.

> It sucks to feel ignored when you’re having issues, even when it’s not intentional. Sorry we didn’t catch this thread sooner.

[1] https://community.fly.io/t/service-interruption-cant-destroy...

[+] mrcwinn|2 years ago|reply

For what it’s worth, I left Fly because of this crap. At first my Fly machine web app had intermittent connection issues to a new production PG machine. Then my PG machine died. Hard. I lost all data. A restart didn’t work - it could not recover. I restored an older backup over at RDS and couldn’t be happier I left.

[+] gowthamgts12|2 years ago|reply

> While we check the forum regularly, sometimes topics get missed. Unfortunately this thread slipped by us until today, when someone saw it and flagged it internally.

If it really got missed, then I don't understand how the thread was made private to only logged-in users?

[+] benjaminwootton|2 years ago|reply

Should losing a single host machine be a big deal nowadays? Instance failure is a fact of life.

Even if customers are only running one instance, I would expect the whole thing to rebalance in an automated way especially with fly.io being so container centric.

It also sounds like this is some managed Postgres service rather than users running only one instance of their container, so it’s even more reasonable to expect resilience to host failure?

[+] oefrha|2 years ago|reply

I was confused why support for platform failure relies on a forum that employees may or may not check. After checking the docs[1], apparently you have to be on a paid plan (at least $29/mo) to access email support, so you may not have it even if you’re paying for resources.

I won’t be using it for side projects where I’m okay with paying $5-10/mo but don’t want to have three day outages.

[1] https://fly.io/docs/about/support/

[+] emmelaich|2 years ago|reply

The irony or perhaps the tragedy of building a low friction service is that you have to have experts on the lower level high friction stuff.

I would hope that after a couple of hours of downtime, they'd bring up a fresh machine with Ansible or whatever. Hardware or an AWS/GCP VM.

[+] bongobingo1|2 years ago|reply

Seems like the OP should have made a HN thread in the first place instead of posting to community.stri^H^H^H^Hfly.io

[+] plagiarist|2 years ago|reply

Why is it my responsibility to move instances from machine to machine to mitigate a cloud host's outages? What is their utility if not performing the bare minimum of cloud host responsibilities keeping my container up?

[+] quickthrower2|2 years ago|reply

> We strongly recommend running multiple instances to mitigate the impact of single-host failures like this.

Make it impossible not to do so, and make it frictionless then.

[+] burnerbob|2 years ago|reply

Fly have tried to hush this by making the thread [1] private to anyone not logged in.

One quote from thread:

> This is the second time I’ve had this kind of issue with Fly, where my service just goes down, Fly reports everything healthy, and there’s literally no information and nothing I can really do other than wait and hope it comes back up sometime

Another user:

> We had four machines (app + Postgres for staging and production) running yesterday, and three of the four (including both databases) are still down and can’t be accessed. I can replicate the issues others have mentioned here.

> This is our company’s external API app and so the issue broke all of our integrations.

> Our team ended up setting up a new project in fly to spin up an instance to keep us going which took a couple of hours (backfilling environment variables and configuration etc, not a bad test of our DR ability).

> There is no way I can find to get the data from the db machines. Thank goodness this isn’t our main production db and we were able to reverse engineer what we needed into there.

> Very keen to hear what’s happening with this and why after so many hours there’s no more info or updates.

Another user:

> As an aside, it’s kind of a kick in the teeth to see the status page for our organization reporting no incidents - the same page that lists our apps as under maintenance and inaccessible!

Another user:

> I’m feeling very lucky that none of our paid production apps or databases are affected currently (only our development environment is), but also really surprised that the issue has been ongoing for 17 hours now with no status page update, no notifications (beyond betterstack letting us know it was down) and one note on the app with not much info as to whats going on.

> It really worries me what would happen if it was one of our paid production instances that was affected - the data we’re working with can’t simply be ‘recovered’ later, it’d just get dropped until service resumed or we migrated to another region to get things running again

> Keen to know whats wrong and whats being done about it

Full thread (as at time of HN post; more has been added since): https://pastebin.com/ebmCSZkC

Someone tweeted Fly CEO: https://twitter.com/SouthPawNZ/status/1682181533673857024

[1] https://community.fly.io/t/service-interruption-cant-destroy...

[+] throwaway220033|2 years ago|reply

The worst thing about Fly is, when something goes wrong, it's not just one thing, there's a bunch of things broken at the same time and their status page will show everything green.

Their typical response is either silence or so casual ("oh this is what happens when we deploy on friday"). The product looks amazing but it's just a nice package around the most unreliable hosting service I've ever used.

You can't just keep breaking people's work once a week, make them spend their weekend nights trying to bring back their stuff, and give these "we could have done better" answers. This is an excuse for exceptions, not patterns.

[+] throwawaaarrgh|2 years ago|reply

There's a lot of bullshit in this HN thread, but here's the important takeaway:

- it seems their staff were working on the issue before customers noticed it.

- once paid support was emailed, it took many hours for them to respond.

- it took about 20 hours for an update from them on the downed host.

- they weren't updating their users that were affected about the downed host or ways to recover.

- the status page was bullshit - just said everything was green even though they told customers in their own dashboard they had emergency maintenance going on.

I get that due to the nature of their plans and architecture, downtime like this is guaranteed and normal. But communication this poor is going to lose you customers. Be like other providers, who spam me with emails whenever a host I'm on even feels ticklish. Then at least I can go do something for my own apps immediately.

[+] pech0rin|2 years ago|reply

I really want to love Fly.io. It's super easy to get setup and use, but to be honest I don't think anyone should be building mission critical applications on their service. I ended up migrating everything over to AWS (which I reallllly didn't want to do) because:

* Frequent machines not working, random outages, builds not working

* Support wasn't responsive, didn't read my questions (kept asking same questions over and over again) -- I paid for a higher tier specifically for support.

* General lack of features (can't add sidecars, hard to integrate with external monitoring solutions)

* Lack of documentation -- For the happy path it's good, but for any edge cases the documentation is really lacking.

Anyway, for hobby projects it's fine and nice. I still host a lot of personal projects there. But I had to move my company's infrastructure off of it because it ended up costing us too much time/frustration, etc. I really had high hopes going into it as I had read it was a spiritual successor of sorts to Heroku, which was an amazing service in its day, but I don't think it's there yet.

[+] tptacek|2 years ago|reply

Y'all, this is going to be deeply unsatisfying, but it's what I can report personally:

I have no earthly clue why this thread on our community site is unlisted.

We're looking at the admin UI for it right now, and there's like, a little lock next to the story, but the "unlist story" option is still there for us to click. The best I can say is: I'm reasonably sure there wasn't some top-down edict to hide this thread (the site is public, anybody can sign up for an account and see the thread).

Say what you want about us, but hiding out from stuff like this isn't one of our flaws. When I find out more about what happened with this thread, I'll let you know (or Kurt will reply here and tell me I'm wrong).

I don't know enough about what happened with this Sydney server to be helpful to people who had instances running on it. When I know more about it, I'll be helpful, but I'm just learning about this stuff right now, after getting back in from a night out.

Almost immediately afterwards

It looks like... all the posts in the app-not-working category are "private"? Like it's some setting on the category itself? "Private" here means you need to have signed up for a Discourse account to see them?

[+] marcinzm|2 years ago|reply

Honest advice, probably to Kurt rather than you, is you need better processes, accountability and (probably) communication in your company. The tone of your reply (and other communications from fly.io) is reflective of the lack of those things given the public sentiment regarding fly.io. At 60+ employees and so many issues that tone goes from humanly endearing to indicative of a non-scaling business. Other replies indicate you don't want the things (process, oversight, etc.) that a growing B2B business needs to really succeed which is not a good sign. Sure there's a cost to that corporate-ness and you want to minimize that cost but it's also a necessary evil for the business you're in at the scale you're at.

If something breaks once it's an accident, if it breaks twice it's bad luck, but if it breaks down three times it's broken processes. Based on the comments here, things break at fly.io a lot more often than three times.

[+] xupybd|2 years ago|reply

From this my take away is that I could get fired for picking Fly.io for work. Not because there was an outage but because days could pass before getting support.

What assurances could you give the community here that the support would be better next time?

[+] sho|2 years ago|reply

> I have no earthly clue why this thread on our community site is unlisted.

Maybe it's hosted in the SYD region

[+] michaeldwan|2 years ago|reply

I don't know why the app-not-working category effectively delists threads, but until we find out, I just removed it so this thread is public again.

[+] gerhardlazu|2 years ago|reply

I really like the work that you're doing Thomas, this is the right approach. FWIW, https://fly.io/blog/carving-the-scheduler-out-of-our-orchest... is one of my favourite posts on your blog.

For everyone else reading this, we have been running https://changelog.com on Fly.io since April 2022. This is what our architecture currently looks like: https://github.com/thechangelog/changelog.com/blob/master/IN...

After 15 months & more than 100 million requests served by our Phoenix + PostgreSQL app running on Fly.io, I would be hard pressed to find a reason to complain.

- Some deploys failed, and re-running the pipeline fixed it.

- Early July 2023, 9k requests from Frankfurt returned 503s. Issue lasted 10 seconds.

- While experimenting with machines, after many creations & deletions, one volume could not be deleted. Next day, the volume was gone.

That's about it after 15 months of running production workloads on Fly.io.

We mention our Fly.io experience often in our Kaizen pod episodes, which we publish every ~2 months: https://changelog.com/topic/kaizen. For anyone curious, this is the episode in which we announced the migration: https://changelog.com/shipit/50. There is a detailed PR which goes with it: https://github.com/thechangelog/changelog.com/pull/407. We've been talking about our migration plan from apps v1 (Nomad) to apps v2 (flyd) recently: https://changelog.com/friends/2#transcript-138

I'm sorry to hear that many of you didn't have the best experience. I know that things will continue improving at Fly.io. My hope is that one day, all these hard times will make for great stories. This gives me hope: https://community.fly.io/t/reliability-its-not-great/11253

Keep improving.

[+] subarctic|2 years ago|reply

Glad to see you commenting here about this, I literally just posted a comment about how it's really messed up that you guys would do that

[+] teraflop|2 years ago|reply

There's also a lock icon next to the "App not working" category in the header, which I took to mean that that entire category is hidden from logged-out users (which experimentally seems to be the case).

[+] solarkraft|2 years ago|reply

Thanks for publicly responding to the criticism, that can't be taken for granted. I hope you'll manage to actually address it.

[+] tacker2000|2 years ago|reply

You might be right, but in light of this whole disaster it doesn’t sound too convincing and doesn’t make your company look good.

[+] unknown|2 years ago|reply

[deleted]

[+] throwaway220033|2 years ago|reply

It looks like being authentic is valued over anything else at Fly. I can’t explain how a company responds this immaturely to incidents like these.

[+] yard2010|2 years ago|reply

[deleted]

[+] dcchambers|2 years ago|reply

I like fly.io a lot and I want them to succeed. They're doing challenging work...things break.

Have to admit it's disappointing to hear about the lack of communication from them, especially when it's something the CEO specifically called out that they wanted to fix in his big reliability post to the community back in March.

https://community.fly.io/t/reliability-its-not-great/11253#s...

[+] pritambarhate|2 years ago|reply

Here the even bigger red flag is that Fly doesn't have an (automated?) way to quickly move workloads from a faulty server to a good server. Especially when containers (and orchestrators) have abstracted away the concept of data volumes which can be attached and detached. (Yes, it needs a lot of serious technical investment to provide this and I think it's one of the reasons storage is expensive on the big 3 clouds.) If you are offering data persistence services then you absolutely need this capability.
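To make the capability being asked for concrete, here is a minimal sketch of the scheduling decision alone. Everything in it is hypothetical: the `Host` model, the `evacuate` function, and the host and workload names are illustrations, not Fly's API or architecture. It also assumes the hard part is already solved, namely that volumes are network-attached and can follow a workload to a new host.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical host record; not modeled on any real provider's API."""
    name: str
    healthy: bool = True
    capacity: int = 2
    workloads: list[str] = field(default_factory=list)


def evacuate(failed: Host, hosts: list[Host]) -> dict[str, str]:
    """Move workloads off `failed` onto healthy hosts with spare capacity.

    Returns a mapping of workload name to new host name. Workloads that
    fit nowhere stay on the failed host. Assumes (loudly) that storage is
    detachable and can be re-attached on the target host.
    """
    moves: dict[str, str] = {}
    for workload in list(failed.workloads):
        target = next(
            (
                h
                for h in hosts
                if h is not failed and h.healthy and len(h.workloads) < h.capacity
            ),
            None,
        )
        if target is None:
            continue  # nowhere to place it; leave it for manual recovery
        failed.workloads.remove(workload)
        target.workloads.append(workload)
        moves[workload] = target.name
    return moves
```

With locally attached disks the loop above is not enough, because the data stays on the dead host; that is the investment the comment refers to.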

I think there is an expectation mismatch between what Fly wants to offer and what the market wants from it. Fly wanted to innovate by offering devs the ability to run their apps from multiple data centers. But without a proper data persistence service, the ability to run apps from multiple data centers is not useful to the vast majority of people.

I think Fly is trying to solve the persistence issue with their SQLite replication, but that means the vast majority of the devs will have to change the way they develop applications to suit Fly platform.

I think Fly needs to choose what it wants to become: a reliable and affordable Heroku replacement, which is a decent-sized market, or an opinionated way of developing apps that offers the best performance to users all around the world.

But opinionated ways of doing things are a double-edged sword. (Rails and Spring Boot are highly successful because of their opinionated defaults.) App Engine is an interesting case study in the app hosting domain. It was way ahead of its time and prescribed a way of developing apps that allowed them to scale to very high traffic. But people didn't want to change the way they develop to adapt to it.

[+] neya|2 years ago|reply

I actually have been advocating against them for a while here on HN (https://news.ycombinator.com/item?id=31394179) for the same reason.

They had my account on some sort of shadow ban with no communication whatsoever after asking them to delete my account from their systems. I emailed them and to date never even got a response. I have moved everything over to Railway app and back to Google Cloud Run ever since.

[+] thyrox|2 years ago|reply

You know what's interesting? It feels like history is repeating itself with Fly.io, just like it did back when I first encountered Heroku. Back in the day, I was super excited about Fly.io – it had that same fresh, exciting vibe that Heroku had when it burst onto the scene.

I remember being blown away by Fly.io's simplicity and how easy it was to use. It was like hosting made simple, and I couldn't help but think, "This is it, this is the one!"

But, as time went on, I noticed little signs of trouble. Downtimes became more frequent, and my deployments, which were once snappy and seamless, turned into agonizingly slow affairs. It was like déjà vu from the time when Heroku's greatness started to wane.

It's disheartening to see Fly.io go down a similar path. As more people flocked to the platform, it seems like its performance began to suffer – just like what happened with Heroku. The more popular it got, the less reliable it seemed to become.

Scrolling through Hacker News, I can't help but feel a sense of disappointment. Others are expressing their frustration too, and it's like we're all reliving that moment when Heroku lost its charm and became a hassle.

I have to admit; it worries me. It's like a cautionary tale of how even the most promising platforms can fall from grace. It's the reality of the fast-paced tech world, but it's tough to accept.

So yeah, here I am, hoping against hope that Fly.io can somehow break free from this cycle and find its footing before it becomes as useless as Heroku was at its lowest point.

[+] SadTrombone|2 years ago|reply

Incredibly unimpressed at fly.io staff for hiding/making private the downtime forum support thread.

[+] siquick|2 years ago|reply

We tried to migrate all our staging environments to Fly last year but it was the flakiest experience I’ve had on any PaaS. Pushing simple containers up would fail 70-80% of the time with no useful error messages and non-existent support. It’s a weird company that seems great until you actually use them.

[+] xyzzy_plugh|2 years ago|reply

I think fly.io is pretty incredible but I can't help but feel they're doomed to follow in heroku's footsteps (unclear if good or bad). They've built some pretty wild stuff and I can't help but wonder if they're overcooking the ocean instead of just solving problems for their users.

Durable and available storage are all they really need to draw me away from big cloud providers but this combined with their answer to S3 being "use S3 or run minio" means I'll never take them seriously.

This is a bad look folks, not sure how you can walk back days of silence and hiding threads. Just open an issue and talk to your users.

[+] yowlingcat|2 years ago|reply

At least I could rely on Heroku in production. I've wanted to give Fly.io a try but this gives me pause. I really do miss the Heroku DX whenever I'm putzing around with the increasing complexity of AWS.

[+] unmole|2 years ago|reply

> use S3 or run minio

Is using Cloudflare R2 not an option?

[+] ThePhysicist|2 years ago|reply

Instances going down happens sporadically on Hetzner Cloud as well, but often by the time I see the e-mail alert that some instance is unreachable and log into the dashboard, I find that it has already been restarted or migrated to another host. I've been running a production system there for more than 4 years now and had zero provider-related downtime (as I have some redundancy for most instances). In terms of features they move way slower than Fly.io and it took them years to add stuff like virtual networking, but everything they add works rock-solid. I guess there are just very different engineering cultures when it comes to building a cloud infrastructure provider, and I have to say I prefer the "take your time and do it right" approach.

[+] jinzo|2 years ago|reply

I'm running some instances on Hetzner Cloud, the oldest is ~5 years old, only recently had 2hr or so downtime, other than that - without any problems. And we are talking the cheap ones.

I did have a problem with their dedicated server almost immediately after spinning it up. Noticed that NVMe is broken, and support went like:

- 16:28 -> I contacted them

- 16:36 -> Their first response

- 16:44 -> I sent them SMART data

- 16:48 -> They acknowledged that the NVMe needs replacing and asked me if I consent to that (and to losing the data that was not already lost -> but running RAID so no problems there)

- 16:52 -> I agreed

- 17:30 -> NVMe was replaced and server booted
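Tallying the timestamps above gives the response times directly. This is just a small sketch of the arithmetic; the event labels are paraphrased and it assumes all times fall on the same day.

```python
from datetime import datetime

# Timestamps copied from the list above; labels are paraphrased.
events = [
    ("16:28", "contacted support"),
    ("16:36", "first response"),
    ("16:44", "sent SMART data"),
    ("16:48", "replacement acknowledged"),
    ("16:52", "consent given"),
    ("17:30", "NVMe replaced, server booted"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


first_response = minutes_between(events[0][0], events[1][0])  # 8 minutes
total = minutes_between(events[0][0], events[-1][0])          # 62 minutes
print(f"first response after {first_response} min, resolved after {total} min")
```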

I don't have too much experience with hosting providers on that level, but that was freaking impressive response time from them. So a happy camper as well :D

EDIT: Formatting

[+] nik736|2 years ago|reply

Hetzner has a great price/performance ratio, but they are not rock-solid. Speaking of the private network... look at their forum where people complain about downtimes for their "vSwitch" every other week, sometimes it doesn't show up on the status page because it happens on the weekend (lol).

[+] constantly|2 years ago|reply

They’ve been working on Fly for years now and seems like they haven’t been able to turn it into a reliable service or profitable business (making assumptions about the second part here), and the overall general sentiment seems to be to avoid it for anything but the most toy applications. I note that the team was also unable to get their recruiting business off the ground either and shuttered it.

My assumption based on the creator’s very online hacker news commentary is that they seem to be at least smart in tech. So what’s the lesson here for the rest of us who may want to start a business? Is this a “shots on goal” thing and we’re just seeing these failures more publicly than most so it biases the perception, or is there some je ne sais quoi missing that we could learn from? No offense intended by my post, but I would be very keen to learn whether there’s some X Factor missing from an otherwise ostensibly smart team’s repeated failure that we could learn from.

[+] subarctic|2 years ago|reply

It's really disappointing that they made this forum thread private, apparently in response to this HN thread blowing up. This is the first negative HN thread I've seen about them, it's not even really that bad because this kind of downtime is expected, and they can't get to every forum post, and their response that someone posted here is totally reasonable in my opinion.

So why is the link to the thread 404ing and why does this post have to link to google webcache of it? I've grown to like fly.io and use them for my side projects now, and this just isn't something they would do. Going through some minor cognitive dissonance right now :/

[+] aledalgrande|2 years ago|reply

Wondering if for small/bootstrapped projects there's any alternative people suggest? Fly has a nice UX and accessible prices, but it's unstable at best. I use the big clouds at work, but for personal they are $$$. Also I want to keep devops tending asymptotically to zero.

[+] reustle|2 years ago|reply

I’m quite happy with https://render.com after leaving Heroku

[+] gowthamgts12|2 years ago|reply

Although I have never used them, you can explore railway.app. It is the closest to fly.io and I have never heard any bad things about it.

I personally at the moment use digitalocean without any issues, but there's always the maintenance overhead of managing a server yourself.

[+] sho|2 years ago|reply

Honestly these days I am leaning towards this approach: https://github.com/mrsked/mrsk/

It's all just docker.

[+] danjac|2 years ago|reply

I use Dokku on top of Hetzner for my hobby projects - hosting is super cheap, for a little extra I can add a mounted volume for storage, and if the project outgrows a single server I can always just break out of Dokku and use some Docker containers behind a load balancer.

If you are outside of Europe, Digital Ocean or Linode may work better for you.

[+] q7xvh97o2pDhNrh|2 years ago|reply

Maybe just pick up 3 chonky EC2 boxes, set up iptables on each of them, have each one run a containerized version of your code that gets built and deployed from CI every time you push to Github, slap an ALB in front of it all, and call it a day?

And if you need state, then spin up a little RDS with your favorite SQL flavor of choice?

The CI deploy script could even bake in little health-checks so you can do rolling deploys with zero downtime. Depending on how fancy you wanted to get with your shell scripting, you could probably even make 1 of your 3 boxes a canary without too much trouble.
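A rough sketch of that health-checked rolling deploy, in Python rather than shell. The function name and the `deploy` and `healthy` callbacks are hypothetical placeholders: in practice `deploy` might SSH in and restart a container, and `healthy` might poll an HTTP health endpoint, but neither is specified here. The first host in the list acts as the canary simply because it is updated first.

```python
from typing import Callable, Sequence


def rolling_deploy(
    hosts: Sequence[str],
    deploy: Callable[[str, str], None],
    healthy: Callable[[str], bool],
    version: str,
) -> list[str]:
    """Deploy `version` to one host at a time, stopping at the first failure.

    The first host is effectively the canary: if its health check fails,
    no other host is touched and the old version keeps serving traffic.
    `deploy` and `healthy` are caller-supplied and purely illustrative.
    """
    updated: list[str] = []
    for host in hosts:
        deploy(host, version)
        if not healthy(host):
            raise RuntimeError(
                f"{host} failed its health check after deploying {version}; "
                f"hosts already updated: {updated}"
            )
        updated.append(host)
    return updated
```

For zero downtime, the load balancer would also need to drain each box before `deploy` runs and re-admit it after `healthy` passes; that step is left out of the sketch.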

I'm realizing I haven't thought about this in a long time, since nowadays I just get to use the fancy stuff at work. Kind of a fun thought experiment!

[+] js4ever|2 years ago|reply

Try https://elest.io (Check the CI/CD part)

[+] mrcwinn|2 years ago|reply

Don't get me started with Fly — especially postgres machines. In my experience, a really nice idea with poor support and unreliable infrastructure.

[+] TekMol|2 years ago|reply

What do people get out of using special services like Fly.io instead of standard VMs like the ones you can get from $5/month these days?

Can anybody who uses Fly.io explain their rationale? Why do the additional integration with Fly.io, trust and install their special software on your machines and tie your project into their ecosystem?

What type of application are you running? How many users are using it?

[+] manish_gill|2 years ago|reply

Why is this company always on HN frontpage - ironically for their bad services? Normally, poor service from a provider isn't grounds for such attention - but seems like Fly.io has not done anything great.

They still continue to get love from the developer community who "wants them to succeed". I'm puzzled as to why? Because of some blog posts?

477 comments