We received an email from GitHub yesterday informing us that one of our repositories had been accessed by a third party due to this issue. While it's not a fun notification to receive, it definitely made our general security paranoia feel justified; we're lucky that from the get-go we've followed best practices around keeping secrets out of the codebase. Obviously we still dedicated time as a team to go through our repository history with a fine-toothed comb for anything that could potentially be a vulnerability, as we take this very seriously.
One of our engineers came up with a useful script to grab every unique line from the history of the repository and sort them by entropy. This helps lift to the top any access keys or passwords which may have been committed at any point.
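Roughly, the idea is something like this (a minimal sketch, not our exact script): shell out to git, deduplicate every line that has ever appeared in any blob, and print the highest-entropy lines first so random-looking strings like keys float to the top.

```python
import math
import subprocess
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character of s."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def all_history_lines():
    """Yield every unique line from every text blob reachable from any ref."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    seen = set()
    for entry in objects:
        sha = entry.split(" ", 1)[0]
        # Slow but simple: one cat-file per object. Skip commits and trees.
        obj_type = subprocess.run(
            ["git", "cat-file", "-t", sha],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if obj_type != "blob":
            continue
        blob = subprocess.run(
            ["git", "cat-file", "-p", sha],
            capture_output=True, check=True,
        ).stdout
        try:
            text = blob.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip binary blobs
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                yield line

if __name__ == "__main__":
    lines = sorted(all_history_lines(), key=shannon_entropy, reverse=True)
    for line in lines[:200]:  # highest-entropy lines first
        print(f"{shannon_entropy(line):6.2f}  {line}")
```

Most of the high-entropy hits are hashes and minified code, but leaked credentials tend to land in the same band, so it's a quick way to triage a long history.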
I think this is a great example to illustrate the tough edges of security to less experienced engineers. GitHub will most likely never let something like this happen to you, but on the off-chance that they do, it's great to be prepared. Additionally, the response from GitHub was very well received: no excuses, just a thorough explanation of what happened.
I also can't help but mention that we're hiring, if you'd like to work at an organization that values security and data privacy very highly. :) usebutton.com/join-us
I got the other end of that email today, saying that an account in my organization had inadvertently downloaded private repos from another customer when fetching from one of our own. Fortunately for GitHub and that user, it was almost certainly our automated provisioning system, so we never had any idea, and whatever it was never made it anywhere interesting.
The email was kind of funny though, part of it was effectively "if you have this data pretty please delete it without looking at it". I'm sure that's the best they can do, but it still made me chuckle.
Of the very small number of repos affected, there are now two of us reporting that it affected us :). And I had the same approximate response: well, I don't keep secrets in the repository, so it's not that big of a deal. I'd rather the source not get shared with the world, but shit happens and they owned up to it right away. If that source were valuable enough, I'd be hosting it on-prem with encrypted off-site backups.
One of the most striking things about this report is the scale that GitHub has now reached: the whole incident apparently lasted only 10 minutes, but during that time 17 million requests were sent to their git proxy (roughly 28,000 requests per second).
It's obviously unfortunate in this case, since even a relatively small and quickly fixed bug affecting a tiny proportion of requests still had serious consequences.
However, it's a remarkable achievement (if also a little terrifying for the software development industry from a single-point-of-failure perspective).
Does it? Except for very sophisticated organizations, I doubt it.
You don't hear about intrusions into self-hosted source repositories. Not because there are fewer, but because they likely don't have the security infrastructure in place to know that they ever happened.
If you have private repositories that should never be public, no matter what, which isn't true for most users of private repositories, then:
Given that git provides many transports and ways to push commits around, I agree. If you have to be safe, there's no reason to use GitHub or a self-hosted git collaboration service (GitLab, etc.) on a publicly accessible server, regardless of access control measures. If you really need to have sources on a remote machine, you can limit the potential damage by only sharing archives of a specific revision, without history.
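As a rough sketch of what I mean by an archive of a single revision (the tag name and output path here are just placeholders), `git archive` exports a snapshot with no history attached:

```python
import subprocess

def export_revision(revision: str, output: str) -> None:
    """Write a tar.gz snapshot of `revision`, with no git history included."""
    with open(output, "wb") as f:
        subprocess.run(
            ["git", "archive", "--format=tar.gz", revision],
            stdout=f,
            check=True,
        )

if __name__ == "__main__":
    export_revision("v1.4.2", "release-v1.4.2.tar.gz")  # placeholder tag/path
```

The remote machine then holds exactly one tree, so even a full leak exposes a single snapshot instead of the whole history.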
I know many will dismiss it, but if you're serious about the repositories being private, then the most accessible you make them is via an on-premises GitLab instance that is local to the company network, not reachable from the public internet, and, if you want, only reachable after dialing into a VPN first. Then, to be safe, you null-route anything but the VPN traffic on the connected off-premises developer machines.
Access keys get stolen, just like SSH keys do, so you need to use a VPN service that requires additional security, like OTP key generators or similar measures.
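For the unfamiliar, an "OTP key generator" here is something like a standard TOTP token (RFC 6238), the scheme authenticator apps use and that many VPN gateways can check as a second factor. A minimal sketch, with a placeholder secret:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, step: int = 30, digits: int = 6) -> str:
    """Current time-based one-time password for a base32-encoded secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // step          # 30-second time window
    msg = struct.pack(">Q", counter)            # counter as 8-byte big-endian
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

if __name__ == "__main__":
    print(totp("JBSWY3DPEHPK3PXP"))  # placeholder secret, not a real credential
```

Even if an attacker copies the VPN credentials off a laptop, they still need the current six-digit code.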
This probably sounds like a hassle in the day and age of people just going for the comfort of private GitHub or GitLab repositories, but it's what companies have been doing for almost 20 years as standard practice.
You cannot consider any git repository, even on your own root server, safe to keep private code on. CIOs would argue against that practice for good reason. The same CIOs require work laptops to encrypt all data.
If you don't need to be that serious, then an incident like this should be planned and accounted for as part of using such hosting, and shouldn't be a big deal.
We've had a couple of VPS provider admin-panel compromises, Shellshock and other showstopper remote vulnerabilities for privately administered servers that don't get constant professional attention, and so on. It's possible, but not obvious.
Leaking private repositories is one thing, but if you have a private build server that pulls repositories and runs their scripts, you could be in for a bad time even if it only ended up pulling a random public repository, if that repository's build script happens to be malicious... hmmm...
In retrospect it's always easy to criticise, of course, but still, the diff is really cringeworthy.
The deleted code is very specific-looking. Nobody writes that just casually or out of ignorance. Also, it is what was in use in production.
It's very naive to just go and replace that with nice-looking, shorter code.
Key lessons:
- Understand what you are deleting
- Treat production code as sacred
- Add reasonably extensive comments for delicate code (as the original had). Git commit messages aren't enough.
- Try out infrastructure changes on production-like staging servers. I really doubt they properly did, as they say the "majority" of the 17M requests failed.
Interesting that they don't mention expanding the information being logged to make the multiple joins they had to do unnecessary or more deterministic.
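To make that concrete, a hypothetical sketch (the field names are made up, not GitHub's actual schema): if every tier logs the same request ID along with the repository and user it resolved, correlating an edge request with a backend fetch becomes an equality lookup rather than a multi-way join on timestamps and addresses.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("git-proxy")

def handle_request(user: str, repo: str) -> str:
    """Log a structured record that downstream services can echo verbatim."""
    request_id = str(uuid.uuid4())  # minted once at the edge, passed along
    log.info(json.dumps({
        "event": "proxy_request",
        "request_id": request_id,
        "user": user,
        "repo": repo,
    }))
    # Downstream fileservers would log the same request_id with the repo they
    # actually served, so matching records is a simple equality check.
    return request_id

if __name__ == "__main__":
    handle_request("example-user", "example-org/example-repo")
```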
They state in the post that of the 17 million requests to their git-proxy server, only 230 could be identified as responses that successfully returned incorrect data/repos, or roughly 0.0013%.
I don't know of anyone who would recommend creating tests, even integration tests, that hammer a service to check whether something on the order of a thousandth of one percent of requests returns invalid data. If anything, a script hammering a service that (in a dev or QA environment) probably has much less data in its database and file stores, and much less protection (like load balancing and caching) than it would in production, would generate more false positives than genuine data-disclosure regression defects.
> The impact of this bug for most queries was a malformed response, which errored and caused a near immediate rollback.
Surely they do some end-to-end testing?
I don't know that this would necessarily make you want to make both the change to self-hosting and the change of platform.