top | item 15256121

Malicious software libraries found in PyPI posing as well known libraries

475 points| nariinano | 8 years ago |nbu.gov.sk | reply

245 comments

[+] hannob|8 years ago|reply
Ok, here's some ugly backstory on this: This problem has been known for a while, yet both the pypi devs and the python security team decided to ignore it.

Last year someone wrote his thesis describing python typosquatting and standard library name squatting: http://incolumitas.com/2016/06/08/typosquatting-package-mana...

However after that the packages used in this thesis - the most successful one being urllib2 - weren't blocked, they were deleted. Benjamin Bach was able to register urllib2 afterwards. Benjamin and I decided that we'd now try to register as many stdlib names as possible.

See also: https://www.pytosquatting.org/
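
As a rough illustration of standard library name squatting, the sketch below checks whether a requested package name collides with a standard library module. It assumes Python 3.10+ for `sys.stdlib_module_names`; the fallback set and the helper name `shadows_stdlib` are made up for this example and are not part of pip or PyPI.

```python
import sys

# sys.stdlib_module_names exists on Python 3.10+. The fallback set is a tiny
# illustrative sample for older interpreters, not a complete list.
STDLIB_NAMES = getattr(
    sys, "stdlib_module_names", frozenset({"urllib", "pwd", "json", "os"})
)


def shadows_stdlib(package_name):
    """Return True if the requested name matches a standard library module."""
    normalized = package_name.strip().lower().replace("-", "_")
    return normalized in STDLIB_NAMES
```

A registry or an install wrapper could call something like `shadows_stdlib("urllib")` before accepting or installing a name. That only covers exact stdlib names, not typos of them.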

[+] chatmasta|8 years ago|reply
Package managers seem to be an increasingly popular attack vector. It's only luck that none of the attacks have been particularly malicious yet. Considering how many package manager downloads go to a server in a datacenter, a widely distributed malicious package could control a botnet with extremely high throughput, or wreak havoc on any databases it comes into contact with.

It's only a matter of time before something like this happens. A big part of the problem is that application package managers, like pip or npm, are far less sophisticated than those of operating systems, like aptitude or yum. It needs to be easy for developers to open source their code, and to mark dependencies with precise commit hashes, but the download also needs to be secure and verifiable. There are many difficult tradeoffs to consider in terms of usability, centralization, security and trust.
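
The "secure and verifiable" part can be approximated by pinning a digest for each artifact and refusing anything that does not match. The sketch below shows the idea with SHA-256; the function names are hypothetical, and this is not how any particular package manager implements it (pip has its own hash-checking mode via `--require-hashes`).

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the downloaded bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, pinned_sha256: str) -> bool:
    """Compare a downloaded artifact against a digest pinned ahead of time."""
    # compare_digest avoids leaking where the two strings first differ.
    return hmac.compare_digest(sha256_of(data), pinned_sha256.lower())
```

Pinning only proves that you got the same bytes you pinned. It says nothing about whether those bytes were trustworthy in the first place.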

[+] avsm|8 years ago|reply
This is why we are working on integrating TUF (The Update Framework) signing into the OCaml OPAM package manager. See https://github.com/hannesm/conex-paper/blob/master/paper.pdf for the talk from last year. There's one more iteration required on the implementation before we're happy with it, but we are aiming to get this live on the OPAMv2 package repository some time in 2018 for all the publicly available OCaml packages.

OPAMv2 also exposes sufficient hooks during the build process for using OS sandboxing during builds, and disconnecting network access/etc. It would be nice to factor this out to be more OS independent (e.g. for all the `unshare` tricks on Linux, or the sexp-format for sandboxing on OSX) in the future.

[+] raesene6|8 years ago|reply
Another fun fact to consider is that with many package formats, you can execute arbitrary code at install time so if a malicious package can get into a repository, it's very likely to start compromising systems quickly.

Whilst a package manager repository compromise would be the biggest bang in terms of attack, compromising the credentials of the developers of popular libraries would be an easier attack (and indeed is already happening: https://twitter.com/chrispederick/status/892768218162487300).

[+] edanm|8 years ago|reply
It's also very possible that a far more malicious attack has happened, but not been discovered.
[+] jeffson256|8 years ago|reply
I'm always fascinated by the amount of trust being exhibited by the developers of some node projects I've seen. Their projects have an order of magnitude more dependencies than I'm used to - and at the other end of each one is someone publishing some small module to npm with an unknown amount of review. I feel safe(r) installing dependencies from apt because I know the processes the Debian community follows before packages are included in the official repos.
[+] mtkd|8 years ago|reply
More needs to be done by package managers to warn end users.

One scenario that worries me is where apps age and use popular trusted dependencies (e.g. gems on Github).

When those gems stop being maintained but need to be updated to work (say with latest OSX) - it's common to quickly look at the latest forks available and select the one that now works correctly - but without a detailed inspection of the new code it's potentially kryptonite for a production datacenter.

[+] kasabali|8 years ago|reply
Yet another attack vector that doesn't exist at all in Linux distributions but was invented by language package managers, sadly.

They solved the issue 2 decades ago by heavily vetting packages before accepting them into repositories. Users are allowed to add and use packages from 3rd party repositories.

Maybe the solution to this is creating curated repositories based on the publicly open ones and using them by default (and requiring opt-in for using other repositories). Conda for Python and Stackage for Haskell seem like relevant solutions.

[+] mikehollinger|8 years ago|reply
There's a certain amount of work (and therefore money) required to do this. That incremental difference is small for a well designed application, but - someone must actually vet and curate the contents of the repo. That tends to slow down execution, leading to scenarios where things like docker a year or two ago from the canonical "trusty" repo were hopelessly behind the "real" docker since docker was evolving so quickly and trusty was by design slowing down.

Each commit that went into trusty required a team to submit and a team to approve. That costs money. ;-)

[+] IshKebab|8 years ago|reply
The Linux distribution approach to package management ("we'll package everything ourselves!") simply doesn't scale.
[+] thearn4|8 years ago|reply
It looks like the code phones home to a server in China:

IP: 121.42.217.44 Decimal: 2032851244 Hostname: 121.42.217.44 ASN: 37963 ISP: Hangzhou Alibaba Advertising Co.,Ltd. Organization: Hangzhou Alibaba Advertising Co.,Ltd. Services: None detected Type: Broadband Assignment: Static IP Blacklist: Click to Check Blacklist Status Continent: Asia Country: China cn flag State/Region: Zhejiang City: Hangzhou Latitude: 30.2936 (30° 17′ 36.96″ N) Longitude: 120.1614 (120° 9′ 41.04″ E)
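
The "Decimal" field is just the four octets packed into one 32-bit integer. A quick check with the standard library, using only the values quoted above:

```python
import ipaddress

addr = ipaddress.IPv4Address("121.42.217.44")

# Integer form as computed by the standard library.
as_int = int(addr)

# Same value by hand: each octet is shifted into its byte position.
manual = (121 << 24) + (42 << 16) + (217 << 8) + 44

# Round trip back to dotted-quad notation.
round_trip = str(ipaddress.IPv4Address(as_int))
```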

[+] raverbashing|8 years ago|reply
I wonder what would happen if the return payload had some data that would trigger the GFoC
[+] IgorPartola|8 years ago|reply
This to me is the nightmare scenario. Well one of the two, the other one being that a developer of an obscure library I use has their password to PyPI compromised and a bad actor uploads a backdoored version of the library.

Fundamentally, the reason this is different from how things like Linux distros work is because Linux distros have maintainers who are in charge of making sure every new update to one of their packages is legit. I am sure you can try to sneak malicious code in, but it isn't going to be easy.

I am not advocating that PyPI (and npm) adopt the same model. That would be too restrictive. But maybe just showing the number of downloads isn't the best way to assure whether the package is legit. Perhaps some kind of built in review system would be nice.

[+] raesene6|8 years ago|reply
A review system unfortunately isn't likely to be practicable with current development models. npm alone has over 500,000 packages (http://www.modulecounts.com/) so even a one time review isn't going to happen.

If people want a more trusted solution the likely outcome is that they'll need to use a smaller more static set of libraries and then either do the audits themselves, or outsource that to a 3rd party.

Ofc with current speeds of change and deployments, it doesn't seem likely that many companies will adopt that model.

[+] IshKebab|8 years ago|reply
> Fundamentally, the reason this is different from how things like Linux distros work is because Linux distros have maintainers who are in charge of making sure every new update to one of their packages is legit.

How is that different?

[+] raesene6|8 years ago|reply
This isn't, in any way, a new problem. I did a presentation on this topic for OWASP AppSecEU 2015 (https://www.youtube.com/watch?v=Wn190b4EJWk&list=PLpr-xdpM8w...) and when doing the research for that I encountered cases of repo. attacks and compromise.

IME the problem will continue unless the customers (e.g. companies making use of the libraries hosted) are willing to pay more for a service with higher levels of assurance.

The budget required to implement additional security at scale is quite high, and probably not a good match with a free (at point of use) service.

[+] fovc|8 years ago|reply
If someone here wants to build a business around this, count me in for NPM (high willingness to pay) or PyPi (lower WTP).

Here's an idea: make it similar to Kickstarter, where customers can commit a certain amount of funds towards a specific package. If the package doesn't "tilt" in a certain amount of time, the money goes back. Otherwise you vet a point release and add it to your repo. You could offer subscriptions to keep packages updated or handle each update as its own project (with presumably lower costs if a recent release has been audited). Handling dependencies is key, and is left as an exercise for the reader.

[+] cdnsteve|8 years ago|reply
I'm sure companies would pay for it. The service needs to be part of the main package service, not some third party.
[+] Sir_Cmpwn|8 years ago|reply
I think a more Linux-like approach to package repos is better - a curated package repository run by volunteers in maintainership roles. Then you have a human being verifying the upstream and keeping malware out, and get more consistency across packages as a bonus. If you want your package added it's as simple as sending an email and provides a new avenue for people to contribute to the success of the ecosystem as package maintainers.

When you make the next big thing, consider this approach.

[+] drdaeman|8 years ago|reply
Maybe you're right, but I see one possible downside that is quite important.

I have encountered the case "the package has an important bugfix but is not yet published on PyPI" way more than once or twice.

With the intermediate maintainers, that's going to get worse.

I believe namespaces and signatures are the way to go. With a special privileged namespace for the curated widely known packages (e.g. SciPy or Django) - a little like it's on the Docker Hub, where curated mainstream images are just "debian" or "python" but anyone can upload e.g. "jdoe/debian" if they need some customization.

[+] zzzcpan|8 years ago|reply
I don't think there are any maintainers that verify upstream code, they only manage packages and updates. Which is actually safer to do without maintainers, completely automatically, as it will eliminate a huge attack surface introduced by a maintainer.
[+] geekamongus|8 years ago|reply
[+] haypo|8 years ago|reply
While there is no public announcement from the PSF yet, I sent an email to the python-dev mailing list at least to announce the issue but also try to discuss how to mitigate/prevent it.

https://mail.python.org/pipermail/python-dev/2017-September/...

Honestly, I am impressed that the information got out so quickly! The National Security Authority of Slovakia contacted the PSRT 10 days ago. All packages were removed 1h10 after we got their email. We were discussing how to communicate about this issue when they published an advisory. A few hours after the advisory was published, I saw the information on IRC, Twitter, LWN, etc. I didn't expect that the advisory would be published so quickly. FYI last week there was also a CPython sprint attended by more than 20 Python core developers. We were busy discussing Python enhancements.

[+] takluyver|8 years ago|reply
I guess that's because it's not a surprise. This has come up before, and it's basically unavoidable with the way PyPI is designed to work: if you see an unclaimed name, you can put whatever you want there.
[+] rantanplan|8 years ago|reply
The regex they have for identifying fake/harmful packages is wrong.

`pip list --format=legacy | egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib) '`

This incorrectly lists `urllib3` or the `cryptography` package for example, which are perfectly valid packages.

[UPDATE]

Read "tobltobs" comment below. I incorrectly removed a trailing space from the regex.

[+] tobltobs|8 years ago|reply
Not for me. There is a space at the end between the closing bracket and the apostrophe. Maybe you removed this space when you corrected the smart apostrophes.
[+] nariinano|8 years ago|reply
I believe urllib3 is built-in. So if you have installed it from PyPI you've gotten a malicious version.
[+] jastr|8 years ago|reply
pip list --format=legacy | cut -d' ' -f1 | egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib)$'
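
The same exact-match check can be sketched in Python, which sidesteps the trailing-space and smart-quote problems with the shell regex. The flagged names are the ones from the command above; the helper names are invented for this example, and reading installed names through `importlib.metadata` (Python 3.8+) is my assumption, not part of the advisory.

```python
import re
from importlib import metadata

# Package names taken from the regex in the advisory's command.
FLAGGED = frozenset({
    "acqusition", "apidev-coop", "bzip", "crypt", "django-server",
    "pwd", "setup-tools", "telnet", "urlib3", "urllib",
})


def normalize(name):
    """Lowercase and collapse runs of '-', '_' and '.' into a single '-'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def flagged_packages(installed_names):
    """Return the installed names that exactly match a flagged name, sorted."""
    return sorted({normalize(n) for n in installed_names} & FLAGGED)


def installed_distribution_names():
    """Names of the distributions visible to the current interpreter."""
    names = []
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with broken metadata
            names.append(name)
    return names
```

Running `flagged_packages(installed_distribution_names())` in the affected environment should return an empty list if none of the flagged names are installed. Because the comparison is on whole names, `urllib3` and `cryptography` are not reported.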
[+] singularity2001|8 years ago|reply
"Success of the attack relies on negligence of the developer"

How about package manager managers accept their enormous responsibility? urllib vs urllib2, one is a virus? Sorry, but that is not "negligence of the developer".

[+] singularity2001|8 years ago|reply
The least they can do is create an alias system for common libs or disallow some lib names.

Another easy thing to implement would be a popularity check: "This package was only installed nnn times. Did you mean xxx, or do you want to proceed with the installation of yyy by author [email protected]?"

Email verification is a must.
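
A minimal sketch of the popularity check, assuming the installer has a table of download counts to consult. The function name, the counts in the usage note, and the two thresholds are all hypothetical; the similarity test uses `difflib` from the standard library.

```python
import difflib


def suggest_popular_alternative(requested, download_counts,
                                min_ratio=0.8, popularity_factor=100):
    """Return a similar, far more popular name to suggest, or None.

    download_counts maps package name -> number of downloads (assumed data).
    """
    requested_count = download_counts.get(requested, 0)
    candidates = difflib.get_close_matches(
        requested, list(download_counts), n=5, cutoff=min_ratio
    )
    best = None
    for name in candidates:
        if name == requested:
            continue
        count = download_counts[name]
        # Only suggest names that are much more popular than the request.
        if count >= popularity_factor * max(requested_count, 1):
            if best is None or count > download_counts[best]:
                best = name
    return best
```

With made-up counts such as `{"urllib3": 5_000_000, "urlib3": 120}`, asking for `urlib3` would yield the suggestion `urllib3`, which the installer could turn into the "Did you mean ...?" prompt.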

[+] zokier|8 years ago|reply
Managing the supply chain is one of the basic principles of good engineering. Not properly vetting your sources is negligence. The problem, of course, is that computers are really good at amplifying work, including mistakes. So a small mistake, like a typo, could have a catastrophic impact, like injecting malware that can take over the whole system.
[+] bughunter3|8 years ago|reply
There are over 100,000 packages and PyPI is run by volunteers. This is not practical.

PyPI is not a curated distribution.

[+] mwerty|8 years ago|reply
How about a Levenshtein distance threshold for new package names to be accepted? I.e only allow names that are different enough from the existing set to avoid typos (or whatever errors we are trying to guard against)?
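
A small sketch of what such a threshold could look like. The edit distance is the textbook dynamic-programming version; `names_too_close` and the default `max_distance` are invented here for illustration and do not describe anything PyPI does.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb
            ))
        previous = current
    return previous[-1]


def names_too_close(new_name, existing_names, max_distance=1):
    """Existing names within max_distance edits of new_name (exact matches excluded)."""
    return sorted(
        name for name in existing_names
        if name != new_name and levenshtein(new_name, name) <= max_distance
    )
```

One trade-off shows up immediately: `urllib2` and `urllib3` are also only one edit apart, so a strict threshold would reject legitimate names unless there is some exception process.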
[+] EstDelenda|8 years ago|reply
Any method of software distribution which is not rooted in cryptographic author verification against a fine-grained, user-manageable trust store should have been put below the sanity waterline 20 years ago.
[+] defined|8 years ago|reply
Here's something that contributes to typosquatting: the lack of responsiveness by package management organizations to claims on orphaned or unmaintainable packages.

People who upload packages often leave organizations, who are then stuck with a package they can't update because the password went with the person, and the email reset link points to a now-defunct email address.

Petitioning the package management team is sometimes fruitless, forcing a needless new instance of typosquatting.

[+] ehnto|8 years ago|reply
Part of my dislike for the Node ecosystem in particular, and I am sure others have a similar problem, is that the dependency trees are super complex.

Because packages tend to be small and many, and each of those has their own dependencies, you can end up with hundreds of packages installed which is simply impractical to manually review.

It is not node, but we do in fact manually review each package we utilize for our given language because it's feasible and worthwhile as the dependency tree is small in this ecosystem. Each and every package is a possible attack vector whether that be intentionally or just because it's poorly written and we can't simply ignore that because it's the done thing and "the community reviews them".

[+] bhouston|8 years ago|reply
I bet there are quite a few malicious NPM packages that we do not know about.

Is Node used in government and military solutions? If so, then the NPM ecosystem is likely targeted by state actors, and it is a sitting duck.

[+] thehardsphere|8 years ago|reply
State actors do not limit themselves to government and military targets; many of them target civilians for all sorts of purposes.
[+] asperous|8 years ago|reply
I once tried to upload a package called "requirements.txt" (since people do pip install requirements.txt all the time forgetting the -r).

Pypi actually blocks that name from being a package!

[+] EGreg|8 years ago|reply
Here is the general problem with dependencies:

When a dependency changes, all the projects that directly depend on it should get notified immediately and their maintainers should rush to test the new changes, to see if they break anything.

There is no shortcut around this, because if B1, B2, ... Bn depend on A1, the consequences may be different for each Bk.

The only real secure optimization that can be done is realizing that some of the Bk use A1 the exact same limited way and thus make an intermediate A1b that depends on A1 which those Bk's depend on. These "projection" builds may be automated by eg the set of methods called by the B's.

Anyway, this is the way that iOS does it before iOS 11 comes out to users. They release a beta to all developers. And they even fix bugs in the beta before releasing to the public.

Without beta testing periods, you can get laziness and just auto-accepting of whatever came out.

There should be an "alpha release" feature in git where maintainers might put out the next version to be tested by all who depend on it. THIS FEATURE SHOULD NOTIFY THE MAINTAINERS SUBSCRIBED TO THE REPO. THE BUILD ITSELF SHOULD GET ISSUES AND RATINGS FROM MAINTAINERS AS THEY TEST THE NEW BUILD. And releases should not be too frequent.

This is the way to prevent bad things from happening. But that also means that the deeper the dependency is, the more levels this process could take to propagate to end-users.

[+] teilo|8 years ago|reply
I think we need a system to prevent this instead of the wild-west that PyPi has become. For example: Developer signatures that are checked against a community rating. If someone does `pip install` pip would look up the developer signature of the package and check a community rating that would verify this is a developer who has offered legit packages in the past. It's not foolproof, but it would go a long way towards solving this.
[+] wongarsu|8 years ago|reply
That sounds easy to defeat. Make some mundane but legit packages (maybe one of those "$X but without the pointless complexity" packages), gain trust, and once trust is reached, start uploading typo-squatting packages.

Knowing today's internet, programmers from cheap-labour nations (India & Co.) would soon start offering "trusted PyPi accounts" for sale on hacker forums.

[+] zokier|8 years ago|reply
You could add the ability for well known members to vet newbie developers, maybe by signing their key. And now you have re-invented web of trust.