top | item 30711269

Incident with GitHub Actions, API requests, Codespaces, Git operations, Issues

267 points | naglis | 4 years ago | githubstatus.com | reply

118 comments

[+] Wavelets|4 years ago|reply
Whew, glad I decided to scroll HN right now. I've been puzzling over why I'm getting "! [remote rejected] master -> master (Internal Server Error)" as well while trying to push and decided to take a break.
[+] adelarsq|4 years ago|reply
Time to take some coffee and configure Vim
[+] forgingahead|4 years ago|reply
It's been like that for at least 6 hours, randomly appearing. I would take a pause and try again and then it would work, but now it's definitely much more persistent.

Guess it's time to go play some video games....

https://xkcd.com/303/

[+] dgellow|4 years ago|reply
Yep, same here! Good time to make a new coffee :)
[+] ahmadrosid|4 years ago|reply
Same here, got rejected when pushing: ! [remote rejected] HEAD -> main (Internal Server Error)
[+] distartin|4 years ago|reply
Never really realized that GitHub had this many technical incidents lol
[+] lukeinator42|4 years ago|reply
same here, I was having internet issues yesterday, and now that my internet is working github isn't, haha.
[+] avar|4 years ago|reply
I'm finding that pushes do go through eventually. This is probably grossly irresponsible, so I don't recommend its use, but I remembered I had this old alias to "push harder" in my ~/.gitconfig:

    [alias]
    thrust = "!f() { until git push $@; do sleep 0.5; done; }; f"
I've done a few pushes so far, and found that it's going through in <10 tries or so.
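For anyone curious how the alias behaves, here's a self-contained simulation of the same retry loop (the `flaky` function and marker file are made up to stand in for a failing `git push`):

```shell
# Stand-in for a push that fails twice, then succeeds: it appends a
# line per attempt and exits 0 once there are 3 lines in the marker.
marker=$(mktemp)
flaky() { echo x >> "$marker"; [ "$(wc -l < "$marker")" -ge 3 ]; }

# Same shape as the alias body: retry until the command exits 0.
until flaky; do sleep 0.1; done
attempts=$(wc -l < "$marker")
echo "succeeded after $attempts attempts"
```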
[+] gfunk911|4 years ago|reply

  # Retries a command with backoff.
  #
  # The retry count is given by ATTEMPTS (default 100), the
  # initial backoff timeout is given by TIMEOUT in seconds
  # (default 5.)
  #
  # Successive backoffs increase the timeout by ~33%.
  #
  # Beware of set -e killing your whole script!
  function try_till_success {
    local max_attempts=${ATTEMPTS-100}
    local timeout=${TIMEOUT-5}
    local attempt=0
    local exitCode=0

    # -lt: numeric comparison ("<" inside [[ ]] compares strings)
    while [[ $attempt -lt $max_attempts ]]
    do
      "$@"
      exitCode=$?

      if [[ $exitCode == 0 ]]
      then
        break
      fi

      echo "Failure! Retrying in ${timeout}s..." 1>&2
      sleep $timeout
      attempt=$(( attempt + 1 ))
      timeout=$(( timeout * 40 / 30 ))
    done

    if [[ $exitCode != 0 ]]
    then
      echo "You've failed me for the last time! ($@)" 1>&2
    fi

    return $exitCode
  }
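For reference, here's roughly how it's invoked, using a trimmed copy of the function above (numeric comparisons, shortened timeouts) and a hypothetical `flaky` command that fails twice and then succeeds:

```shell
# Trimmed copy of try_till_success from the comment above, kept
# self-contained so the demo runs offline and quickly.
try_till_success() {
  local max_attempts=${ATTEMPTS-100}
  local timeout=${TIMEOUT-5}
  local attempt=0 exitCode=0
  while [[ $attempt -lt $max_attempts ]]; do
    "$@"; exitCode=$?
    [[ $exitCode -eq 0 ]] && break
    sleep "$timeout"
    attempt=$((attempt + 1))
    timeout=$((timeout * 40 / 30))
  done
  return "$exitCode"
}

# Demo command: fails on the first two calls, succeeds on the third.
n=0
flaky() { n=$((n + 1)); [[ $n -ge 3 ]]; }

ATTEMPTS=5 TIMEOUT=0 try_till_success flaky
echo "done after $n calls"
```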
[+] hackandtrip|4 years ago|reply
Add some kind of exponential backoff to be a good citizen!
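A minimal sketch of what that schedule looks like (doubling with a 60-second cap; the numbers are arbitrary, and real clients usually add random jitter so retries from many machines don't synchronize):

```shell
# Print the delay used after each successive failure:
# double each time, capped at 60 seconds.
delay=1
schedule=""
for attempt in 1 2 3 4 5 6 7 8; do
  schedule="$schedule$delay "
  delay=$((delay * 2))
  if [ "$delay" -gt 60 ]; then delay=60; fi
done
schedule="${schedule% }"   # trim trailing space
echo "$schedule"
```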
[+] totony|4 years ago|reply
>Service degradation

>Time for some manual DoS

[+] doersino|4 years ago|reply
TIL about "until" loops! How neat.
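For anyone else discovering it: `until CMD` is just `while ! CMD`, looping as long as the command exits nonzero. A tiny illustration:

```shell
# "until" runs the body while the test command fails;
# the loop stops the first time the condition exits 0.
i=0
until [ "$i" -ge 3 ]; do
  i=$((i + 1))
done
echo "$i"
```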
[+] svnpenn|4 years ago|reply
half a second? Jesus dude calm down.
[+] mkoubaa|4 years ago|reply
The delay makes me think you should use the German word for thrust
[+] 5e92cb50239222b|4 years ago|reply
It's fine. Maybe it will force them to finally start paying attention to the quality of their work. If crap I'm writing for a living was misbehaving that frequently, I'd be sweeping the streets by now (or doing some other work that's actually useful to society).
[+] everfrustrated|4 years ago|reply
Does anybody else remember when GitHub's outage page used to have little graphs showing downtime?

Eventually they took it down as their outages were just too often.

GitHub has _always_ had terrible uptime. It's a great product - wish something would change but it seems cultural at this point.

[+] 15characterslon|4 years ago|reply
They had massive problems with their main database cluster (MySQL). If you read through their engineering blog, most of the outages were related to their growth and the main database cluster. They moved workloads for some features to different clusters, but that only buys more time. Eventually they'll do proper sharding (by user or org, I guess, not by feature), but that takes time.

Their engineering blog is full of articles about MySQL and the main "mysql1" database cluster, e.g. https://github.blog/2021-09-27-partitioning-githubs-relation...

[+] pythux|4 years ago|reply
I have no idea if this is remotely close to reality, but what if their culture of breaking things and bad uptime is what allowed them to move fast and build a great product in the first place?
[+] intsunny|4 years ago|reply
Whew, outage timestamps in UTC.

Now I won't have to know what time it is in California, and whether California currently has PST, PDT, PTSD, etc.

[+] pdenton|4 years ago|reply
As someone with diagnosed PTSD, I never thought I'd psychologically level with an entire state ;)
[+] omegalulw|4 years ago|reply
To anyone who is reading this and genuinely wants to know: it's PDT, UTC-7.
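Since PDT is a fixed UTC-7 offset, converting a UTC timestamp by hand is just modular arithmetic (the sample hour below is made up for illustration):

```shell
# 18:00 UTC -> Pacific Daylight Time (UTC-7);
# the +24 keeps the result non-negative before the modulo.
utc_hour=18
pdt_hour=$(( (utc_hour - 7 + 24) % 24 ))
echo "${pdt_hour}:00 PDT"
```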
[+] candiddevmike|4 years ago|reply
This is causing actions jobs to hang after completing, consuming precious minutes. I don't think I've ever seen a refund when this happens, so I recommend everyone check their jobs and cancel them for now.
[+] deckard1|4 years ago|reply
Two days they have been down now. GitHub has, by far, the worst uptime of any critical service I've seen, going on multiple years now.
[+] jetpackjoe|4 years ago|reply
The github.com homepage, as well as the API (via `gh`), isn't working for me either.
[+] jetpackjoe|4 years ago|reply
Their status page is reflecting the new outages. Good on GitHub for actually updating that quickly.
[+] niel|4 years ago|reply
> The github.com homepage

Only while logged in, it seems.

[+] arpinum|4 years ago|reply
These incidents have to hurt Azure's brand value. It's a monster task to run something as big as GitHub, if they ever get it stable it will lend a lot of credibility to Microsoft's cloud skills.
[+] ryanbrunner|4 years ago|reply
There's not really all that much pointing to an infrastructure level failure - it's possible, but it's just as likely it's an application-level failure somewhere in Github's code. The API is returning 500s and not 503s and the failure is relatively quick, so it's not obviously a server outage.
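The distinction the parent is drawing, roughly: a 503 typically comes from the load balancer when no healthy backend is available, while a 500 is usually an unhandled error inside the application itself. A sketch (the `classify` helper and the curl line are illustrative, not anything GitHub publishes):

```shell
# Illustrative: grab only the HTTP status code of an endpoint
# (a network call, shown as a comment so the demo stays offline):
#   curl -s -o /dev/null -w '%{http_code}' https://api.github.com
classify() {
  case "$1" in
    503) echo "service unavailable (infra-level)" ;;
    5*)  echo "server error (often application-level)" ;;
    *)   echo "not a 5xx" ;;
  esac
}
classify 500
classify 503
```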
[+] zinekeller|4 years ago|reply
Serious questions:

1) Is GitHub running under Azure's technology stack?

2) Is GitHub under Azure's management (in contrast to Visual Studio's team)?

I'm not sure about (2), but I'm pretty sure that GitHub doesn't run on Azure at all, considering that GitHub's networking is fully separate from MSN's/Azure's (and GitHub's machines do respond to ping, unlike most of Microsoft's machines, which don't).

[+] gtirloni|4 years ago|reply
GitHub is pretty stable. What are you talking about? I doubt most GitHub users know it's on Azure.
[+] jaywalk|4 years ago|reply
I don't consider this a reflection on Azure at all. It's really just a reflection on GitHub under Microsoft's leadership.
[+] jakub_g|4 years ago|reply
At least one good thing about GH is that while things break, the status page is updated relatively fast, compared to other companies where all of HN knows about an outage for an hour or more before it's acknowledged.
[+] bloopernova|4 years ago|reply
And of course my developer teammates are still trying to merge PRs.

I don't care that it works "some of the time"! Don't mess with the repos when the repo host is having seemingly random issues.

[+] fritzo|4 years ago|reply
For example: while actions are down, branches can be merged without ci tests passing, even for protected branches. This just happened on one of my repos.
[+] PeterBarrett|4 years ago|reply
One of our systems runs AWS's code repository (CodeCommit) in parallel to GitHub, and builds are triggered from there (but not in us-east-1). Time to migrate the rest of our systems to having that fallback.
[+] lebski88|4 years ago|reply
It's almost the same time as their incident yesterday too. Although today the scope is wider - yesterday it was Webhooks and Actions. Today core git is broken as well as the APIs.
[+] pm90|4 years ago|reply
Yep. I hope they post an AWS-style postmortem… this is kinda ridiculous (although I do empathize as an ops person). Webhooks breaking broke all of our PR bots, bringing development to a standstill yesterday; today everything seems f'd.
[+] WFHRenaissance|4 years ago|reply
Looks like the drinking started early at GitHub... good on them!
[+] timeimp|4 years ago|reply
It’s not DNS

There’s no way it’s DNS

It was DNS

[+] rvz|4 years ago|reply
Here we go again. GitHub going completely down at least once a month as I said. [0] So nothing has changed. That is excluding the smaller intermittent issues. Let's see if anyone implemented a self-hosted backup or failsafe just in case.

Oh dear.

[0] https://news.ycombinator.com/item?id=30149071

[+] bastardoperator|4 years ago|reply
The entire point of git is that it's decentralized, lol. If I've cloned locally, like millions of people do daily, I have a backup.
[+] can16358p|4 years ago|reply
At some point the GitHub main page 500'ed for me. The problem is probably deep in the core, not something isolated.
[+] lambda_dn|4 years ago|reply
This is why you should have your code on multiple remotes, e.g. Azure DevOps, GitLab, or a self-hosted Git server.
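One way to do that without changing your workflow: give `origin` several push URLs, so a single `git push` mirrors to every host. A sketch in a throwaway repo (the URLs are placeholders, not real remotes):

```shell
# Set up a scratch repo; the remote URLs below are placeholders.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git remote add origin git@github.com:me/repo.git

# --add --push appends push URLs; the fetch URL is left untouched.
# Re-add the original URL first, since the first pushurl replaces it
# for push purposes.
git remote set-url --add --push origin git@github.com:me/repo.git
git remote set-url --add --push origin git@gitlab.com:me/repo.git

git remote get-url --push --all origin
```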