justjake | 18 days ago

Hello! Railway founder here

We'll have a post mortem for this one, as we always write post mortems for anything that affects users.

Our initial investigation reveals this affects <3% of instances

Apologies from myself + the Team. Any amount of downtime is completely unacceptable

You may monitor this incident here: https://status.railway.com/cmli5y9xt056zsdts5ngslbmp

vintagedave|18 days ago

Hi Jake. Appreciate your presence here on HN.

This affected a seemingly random set of services across three of my accounts (pro and hobby, depending on whether it is for work or just myself). That ranges from WordPress to static site hosting to a custom Python server. All of the deployments showed as Online, even after receiving a SIGTERM.

While 3% is 'good', that's an awfully wide range of things across multiple accounts for me, so it doesn't feel like 3% ;) Please publish the post mortem. I am a big fan of Railway but have really struggled with the number of issues recently. You don't want to get GitHub's growing rep. Some people are already requesting I move one key service away, since this is not the first issue.

Finally, can I make a request re communication:

> If you are experiencing issues with your deployment, please attempt a re-deploy.

Why can't Railway restart or redeploy any affected service? This _sounds_ like you're requiring 3% of your users to manually fix the issue. I don't know if that's a communication problem or the actual solution, but I certainly had to do it manually, server by server.

justjake|18 days ago

Totally! People who are affected will likely see more than, say, 3% of their services impacted. Not all disruption is created equal.

We rolled out a change to update our fraud model, and that uses workload fingerprinting.

Since, in all likelihood, your projects are similarly structured, there will be more impacted workloads if the shape of your workloads was in the "false positive" set.

Will have more information soon but very valid (and astute) feelings!

iJohnDoe|18 days ago

Many questions on their forum are similar to our situation. People are wondering if they should restart their containers to get things working again. They are worried about whether they should do anything, whether they risk losing data if they do, or whether to just give everything more time.

iJohnDoe|18 days ago

Lots of concerns about doing a Restart or Redeploy since a lot of people are still offline after 4+ hours.

Since there haven't been any responses on the official support forum, maybe this will help someone.

I did a backup of our deployment first and did a Restart (not a Redeploy). Our service came back up thankfully.

Obviously do your own safety check about persistent volumes and databases first.

port3000|18 days ago

Second complete outage on Railway in 2 months for us (there was also a total outage on December 16th), and many issues with stuck builds and other minor issues in the months before that.

Looking to move. It's a bit of a hassle to set up Coolify and Hetzner but I have lost all trust.