We received an email from GitHub yesterday informing us that one of our repositories had been accessed by a third party due to this issue. While it's not a fun notification to receive, it definitely made our general security paranoia feel justified; we're lucky that from the get-go we've followed best practices around keeping secrets out of the codebase. Obviously we still dedicated time as a team to go through our repository history with a fine-toothed comb for anything that could potentially be a vulnerability, as we take this very seriously.
One of our engineers came up with a useful script to grab every unique line from the history of the repository and sort them by entropy. This helps lift to the top any access keys or passwords which may have been committed at any point.
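Roughly, the idea is something like this (a minimal sketch, not our exact script): shell out to git, deduplicate every line that has ever appeared in any blob, and print the highest-entropy lines first so random-looking strings like keys float to the top.

```python
import math
import subprocess
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Bits of entropy per character of s."""
    if not s:
        return 0.0
    counts = Counter(s)
    total = len(s)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def all_history_lines():
    """Yield every unique line from every text blob reachable from any ref."""
    objects = subprocess.run(
        ["git", "rev-list", "--objects", "--all"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    seen = set()
    for entry in objects:
        sha = entry.split(" ", 1)[0]
        # Slow but simple: one cat-file per object. Skip commits and trees.
        obj_type = subprocess.run(
            ["git", "cat-file", "-t", sha],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        if obj_type != "blob":
            continue
        blob = subprocess.run(
            ["git", "cat-file", "-p", sha],
            capture_output=True, check=True,
        ).stdout
        try:
            text = blob.decode("utf-8")
        except UnicodeDecodeError:
            continue  # skip binary blobs
        for line in text.splitlines():
            line = line.strip()
            if line and line not in seen:
                seen.add(line)
                yield line

if __name__ == "__main__":
    lines = sorted(all_history_lines(), key=shannon_entropy, reverse=True)
    for line in lines[:200]:  # highest-entropy lines first
        print(f"{shannon_entropy(line):6.2f}  {line}")
```

Most of the high-entropy hits are hashes and minified code, but leaked credentials tend to land in the same band, so it's a quick way to triage a long history.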
I think this is a great example to illustrate the tough edges of security to less experienced engineers. GitHub will most likely never let something like this happen to you, but on the off-chance that they do, it's great to be prepared. Additionally, the response from GitHub was very well received: no excuses, just a thorough explanation of what happened.
I also can't help but mention that we're hiring, if you'd like to work at an organization that values security and data privacy very highly. :) usebutton.com/join-us
I got the other end of that email today, saying that an account in my organization had inadvertently downloaded private repos from another customer when fetching from one of our own. Fortunately for GitHub and that user, it was almost certainly our automated provisioning system, so we never had any idea, and whatever it was never made it anywhere interesting.
The email was kind of funny though, part of it was effectively "if you have this data pretty please delete it without looking at it". I'm sure that's the best they can do, but it still made me chuckle.
Of the very small number of repos affected, there are now two of us reporting that it affected us :). And I had the same approximate response: well, I don't keep secrets in the repository, so it's not that big of a deal. I'd rather the source not get shared with the world, but shit happens and they owned up to it right away. If that source were valuable enough, I'd be hosting it on-prem with encrypted off-site backups.
One of the most striking things about this report is the scale that GitHub has now reached: the whole incident apparently lasted only 10 minutes, but during that time 17 million requests were sent to their git proxy (roughly 28,000 requests per second).
It's obviously unfortunate in this case, since even a relatively small and quickly fixed bug affecting a tiny proportion of requests still had serious consequences.
However, it's a remarkable achievement (if also a little terrifying for the software development industry from a single-point-of-failure perspective).
Does it? Except for very sophisticated organizations, I doubt it.
You don't hear about intrusions into self-hosted source repositories. Not because there are fewer, but because they likely don't have the security infrastructure in place to know that they ever happened.
If you have private repositories that should never be public, no matter what, which isn't true for most users of private repositories, then:
Given that git provides many transports and ways to push commits around, I agree. If you have to be safe, there's no reason to use GitHub or a self-hosted git collaboration service (GitLab, etc.) on a publicly accessible server, regardless of access control measures. If you really need to have sources on a remote machine, you can limit the potential damage by only sharing archives of a specific revision, without history.
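As a rough sketch of what I mean by an archive of a single revision (the tag name and output path here are just placeholders), `git archive` exports a snapshot with no history attached:

```python
import subprocess

def export_revision(revision: str, output: str) -> None:
    """Write a tar.gz snapshot of `revision`, with no git history included."""
    with open(output, "wb") as f:
        subprocess.run(
            ["git", "archive", "--format=tar.gz", revision],
            stdout=f,
            check=True,
        )

if __name__ == "__main__":
    export_revision("v1.4.2", "release-v1.4.2.tar.gz")  # placeholder tag/path
```

The remote machine then holds exactly one tree, so even a full leak exposes a single snapshot instead of the whole history.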
I know many will dismiss it, but if you're serious about the repositories being private, then the most accessible you make them is via an on-premises GitLab instance that is local to the company network, not reachable from the public internet, and, if you want, only reachable after dialing into a VPN first. Then, to be safe, you null-route anything but the VPN traffic on the connected off-premises developer machines.
Access keys get stolen, just like SSH keys do, so you need to use a VPN service that requires additional security, like OTP key generators or similar measures.
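For the unfamiliar, an "OTP key generator" here is something like a standard TOTP token (RFC 6238), the scheme authenticator apps use and that many VPN gateways can check as a second factor. A minimal sketch, with a placeholder secret:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, step: int = 30, digits: int = 6) -> str:
    """Current time-based one-time password for a base32-encoded secret."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int(time.time()) // step          # 30-second time window
    msg = struct.pack(">Q", counter)            # counter as 8-byte big-endian
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                  # dynamic truncation (RFC 4226)
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

if __name__ == "__main__":
    print(totp("JBSWY3DPEHPK3PXP"))  # placeholder secret, not a real credential
```

Even if an attacker copies the VPN credentials off a laptop, they still need the current six-digit code.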
This probably sounds like a hassle in the day and age of people just going for the comfort of private GitHub or GitLab repositories, but it's what companies have been doing for almost 20 years as standard practice.
You cannot consider any git repository, even on your own root server, safe to keep private code on. CIOs would argue against that practice for good reason. The same CIOs require work laptops to encrypt all data.
If you don't need to be that serious, then an incident like this should be planned and accounted for as part of using such hosting, and shouldn't be a big deal.
We've had a couple of VPS provider admin-panel compromises, Shellshock and other showstopper remote vulnerabilities for privately administered servers that don't get constant professional attention, and so on. It's possible, but not obvious.
Leaking private repositories is one thing, but if you have a private build server that pulls repositories and runs their scripts, you could be in for a bad time even if it only ended up pulling a random public repository, if that repository's build script happens to be malicious... hmmm...
In retrospect it's always easy to criticise, of course, but still, the diff is really cringeworthy.
The deleted code is very specific-looking. Nobody writes that just casually or out of ignorance. Also, it is what was in use in production.
It's very naive to just go and replace that with nice-looking, shorter code.
Key lessons:
- Understand what you are deleting
- Treat production code as sacred
- Add reasonably extensive comments for delicate code (as the original had). Git commit messages aren't enough.
- Try out infrastructure changes on production-like staging servers. I really doubt they properly did, as they say the "majority" of the 17M requests failed.
Interesting that they don't mention expanding the information being logged to make the multiple joins they had to do unnecessary or more deterministic.
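To make that concrete, a hypothetical sketch (the field names are made up, not GitHub's actual schema): if every tier logs the same request ID along with the repository and user it resolved, correlating an edge request with a backend fetch becomes an equality lookup rather than a multi-way join on timestamps and addresses.

```python
import json
import logging
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("git-proxy")

def handle_request(user: str, repo: str) -> str:
    """Log a structured record that downstream services can echo verbatim."""
    request_id = str(uuid.uuid4())  # minted once at the edge, passed along
    log.info(json.dumps({
        "event": "proxy_request",
        "request_id": request_id,
        "user": user,
        "repo": repo,
    }))
    # Downstream fileservers would log the same request_id with the repo they
    # actually served, so matching records is a simple equality check.
    return request_id

if __name__ == "__main__":
    handle_request("example-user", "example-org/example-repo")
```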
They state in the post that of the 17 million requests to their git-proxy server, only 230 could be identified as responses that successfully returned incorrect data/repos, or roughly 0.0013%.
I don't know of anyone who would recommend creating tests, even integration tests, that hammer a service to check whether something on the order of a thousandth of one percent of requests returns invalid data. If anything, a script hammering a service that (in a dev or QA environment) probably has much less data in its database and file stores, and much less protection (like load balancing and caching) than it would in production, would generate more false positives than genuine data-disclosure regression defects.
> The impact of this bug for most queries was a malformed response, which errored and caused a near immediate rollback.
Surely they do some end-to-end testing?
I don't know that this would necessarily make you want to make both the change to self-hosting and the change of platform.