> We have waited for the string of subpoenas to subside, though we were committed from the beginning to write and publish this post as a matter of transparency, and as allowed by the lack of a non-disclosure order associated with the subpoenas received in March and April 2023.
That's suspiciously specific. Sounds to me like they also received some other subpoenas they aren't allowed to talk about.
Being reminded that PyPI is a target for law enforcement makes me even more irked that they've removed end-to-end package signing without providing a replacement[0].
PGP signatures—even though rarely used—would allow someone to verify that a signed package was not modified by PyPI after being uploaded by its original author.
Without any sort of signing mechanism, we have to trust the U.S. Government to never demand that PyPI insert a backdoor, via a National Security Letter, FISA court order, or other kangaroo court process. Good luck with that.
The existing PGP signing mechanism had usability issues and security footguns[1], but was better than nothing. It's a shame they didn't roll out a more usable and secure alternative before removing the existing functionality.
If you want to start with tinfoil hat theories, think about this:
The PGP signatures were removed, nominally because few people used them. ...but the timing of the removals is coincidental, no?
"You need to have a backdoor that lets us see who's downloading what packages and let us inject custom code to particular targets"
"That's technically impossible because of..."
"Here is a court order. Implementation is your problem. You're not allowed to tell anyone you even received a court order."
"...well, I guess signed packages have to go then..."
(:
I don't actually believe that, since PGP signing was, frankly, barely used, and there's hardly any meaningful difference between a PGP signature you can't verify (which was most of them) and not having one; in fact, the illusion of security is probably worse than having nothing at all.
...but still. As you say. It sucks there's no meaningful replacement for it.
I'm the author of that post. There is absolutely no meaningful sense in which PyPI's previous PGP support ever provided end-to-end package signing. At the absolute most, when used correctly (which, overwhelmingly, it was not), it provided one half of package signing.
The other half (key retrieval and identity binding) was never provided, because PGP as an ecosystem made doing so intractable. It was not better than nothing, because it was nothing; anything you could have done with it can be done with your own sidecar signatures.
Perhaps look at Gentoo's model of a single monolithic Git repository. It is possibly the largest and most distributed Merkle tree of software distribution signatures in existence. It is updated a few times every hour by a diverse community and each commit has to be GPG signed so you have the opportunity to verify signatures by looking up developer websites, slides from FOSS conferences, etc to confirm whether the keys have been widely published.
There are some caveats:
* Avoid -9999 packages as you won't get any guarantee of authenticity of whatever will be obtained from the upstream repository, other than whatever trust you place in an X.509 certificate that in all likelihood is controlled by either Microsoft (GitHub) or otherwise accessible to Amazon, Google, etc. by nature of common open source project hosting arrangements.
* When syncing your local repository, verify all changes since your last sync. This could be as simple as syncing to a point n-days ago, after which numerous developers you know have signed more recent commits on top (you at least know those developers have been impacted too if the whole repository was compromised and the compromise is now on the public record).
* You don't really know how many people are using the packages you care about, and thus how many other people across the world are also exposed to (and possibly reporting problems with) signatures that Gentoo developers have committed.
In addition to relying on existing sources such as the Gentoo Git repository, another way to build trust is to set up software "looking glass" tools in different jurisdictions to check that software downloaded via different carriers in different jurisdictions is all the same.
At least with these measures the attacker has to compromise everyone and make this compromise a public record, rather than just silently compromise one target.
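The "looking glass" comparison can be sketched roughly as follows. This is only an illustration under assumptions: the vantage-point labels and payloads are made up, and a real tool would fetch the same artifact over the network from each location before hashing it.

```python
import hashlib
from collections import defaultdict


def group_by_digest(downloads: dict[str, bytes]) -> dict[str, list[str]]:
    """Group vantage points by the SHA-256 digest of the bytes each one received."""
    groups: dict[str, list[str]] = defaultdict(list)
    for vantage, payload in downloads.items():
        groups[hashlib.sha256(payload).hexdigest()].append(vantage)
    return dict(groups)


def is_consistent(downloads: dict[str, bytes]) -> bool:
    """True when every vantage point received byte-identical content."""
    return len(group_by_digest(downloads)) <= 1


# Hypothetical vantage-point labels and payloads, for illustration only.
samples = {
    "vantage-a": b"package contents",
    "vantage-b": b"package contents",
    "vantage-c": b"package contents with something extra",
}
groups = group_by_digest(samples)
consistent = is_consistent(samples)
```

A mismatch only says that someone saw different bytes; it does not say which copy is the tampered one, so the minority group is a starting point for investigation rather than a verdict.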
> "Records of all Python Package Index (PyPI) packages uploaded by..." given usernames
> "IP download logs of any Python Package Index (PyPI) packages uploaded by..." given usernames
I don't think they'd want a list of packages uploaded by a given user if they were after yt-dlp devs. They'd be asking for a list of maintainers of a given package.
It seems much more likely that some typosquatter managed to compromise the security of government sites by uploading malware, and Uncle Sam wants to catch the culprit.
So they should promptly update their policies to a) stop logging so much, b) delete all past logs, and c) sharply limit the span of time until deletion of whatever logs they decide they really need to track for internal needs.
They should avoid logging, and rapidly rotate logs, to thwart future subpoenas from the total surveillance state.
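A retention limit like point (c) could look something like this minimal sketch. The 30-day window, the record layout, and the dates are all assumptions made for illustration; nothing here reflects how PyPI actually stores or rotates its logs.

```python
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=30)  # assumed window, not a PyPI policy


def prune_logs(records, now, retention=RETENTION):
    """Keep only records whose timestamp falls inside the retention window."""
    cutoff = now - retention
    return [r for r in records if r["timestamp"] >= cutoff]


# Hypothetical log records, for illustration only.
now = datetime(2023, 5, 25, tzinfo=timezone.utc)
records = [
    {"timestamp": now - timedelta(days=90), "event": "download"},
    {"timestamp": now - timedelta(days=1), "event": "upload"},
]
kept = prune_logs(records, now)
```

Data that has already been deleted cannot be produced later, which is the whole point of keeping the window short.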
PyPI is in a tough spot because they're also getting hit with an onslaught of malicious packages, which got to such a bad point they had to disable signups. How do they mitigate that kind of activity without logging basic metadata like the IP address that published a package? Also, as a user of PyPI, wouldn't you prefer that a malicious package is at least _somewhat_ traceable to an attacker? Of course most would be behind a VPN but it's better than nothing (or maybe it's not, depending on the tradeoff).
Note that the blog post doesn't say they handed the entire database over to the feds. They received three subpoenas scoped to specific usernames and returned only the data they had available that was associated with those accounts.
For the kind of service they are providing I think the logging is appropriate.
I mean, if DOJ is interested in PyPI logs, the only reason I can think of is that it was used as a supply chain vector for breaking into other organizations.
This is for package management. I want the supply chain to be secure and would rather know when something unusual happens. Not logging that data would be irresponsible on PyPI's part.
Suggestion: Start slipping unique URLs into the "hidden" backend fields of systems where you'd like to know if your data was breached, improperly used, or handed over to a three letter agency.
Suddenly getting hits at mydomain.com/[uuid]? At least you know somebody has looked at the data, or at the very least fed it through some processing tool that is extracting and visiting the URLs.
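A rough sketch of the canary-URL idea is below. The base URL is the placeholder domain from the comment, and the registry, function names, and field description are hypothetical; a real setup would also need a web server that logs requests to these paths.

```python
import uuid
from typing import Optional

BASE_URL = "https://mydomain.com"  # placeholder domain from the comment above


def make_canary(registry: dict[str, str], planted_in: str) -> str:
    """Create a unique URL and remember which system or field it was planted in."""
    token = str(uuid.uuid4())
    registry[token] = planted_in
    return f"{BASE_URL}/{token}"


def identify_hit(registry: dict[str, str], request_path: str) -> Optional[str]:
    """Map a requested path back to where the canary was planted, if known."""
    return registry.get(request_path.strip("/"))


# Hypothetical usage: plant one URL per system you want to monitor.
registry: dict[str, str] = {}
url = make_canary(registry, "vendor-x profile notes field")
```

Using one token per system is what makes a hit attributable: the path alone tells you which copy of your data was read.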
While they are transparent that the events happened, they are not transparent about which packages and which authors are being flagged, which is unfortunate.
The subpoena is a command to the possessor of the data, which tells the possessor of the data to produce it, with a particular deadline. Since this deadline is in the future, the subpoena can be challenged legally (normally by requesting a court to "quash" it; more riskily, sometimes by complying imperfectly or not at all, and then arguing in response to an attempt to punish the noncompliance that this was reasonable). A subpoena can be issued by many entities, for example including some law enforcement entities themselves, or a lawyer actively involved in litigation. (Yes, lawyers can personally write and issue subpoenas.) The subpoena is, however, enforced by a court, in the sense that the court is asked to punish people who fail to obey it.
The warrant is a command to a law enforcement officer, which allows the law enforcement officer to personally go and search and seize things (or people), while overriding some rights that would normally prevent this. Normally it is issued by a court. Generally there is no way to challenge a warrant to prevent its execution, because it is not disclosed to the target before it's executed (i.e., a law enforcement officer shows up with the warrant and begins executing it immediately, by force if necessary).
(Edit: I wrote above that it's risky to comply imperfectly with a subpoena and then argue in court that this was reasonable, but usually if a lawyer gives a professional opinion that the subpoena is invalid or overbroad for some reason, then the recipient of the subpoena won't be punished for following that advice. The lawyer may also attempt to negotiate directly with the issuer of the subpoena, for example by sending a letter explaining why the subpoena appears to be invalid. The legal standards for issuance of subpoenas are also pretty broad. For civil litigation, which is not what DoJ is doing here, they are set out in https://www.law.cornell.edu/rules/frcp/rule_26; notably, they can be issued to third parties.)
Not a dumb question: a subpoena is an order to provide information or access, while a warrant is a court-issued document authorizing the government (or an agent of the government) to perform an act (e.g., an arrest, or seizure of an item).
Subpoenas can be issued by attorneys (including prosecuting attorneys) as part of the investigative and discovery processes.
"9. IP download logs of any Python Package Index (PyPI) packages uploaded by the given usernames"
This was the point where I was wondering if this is really about some malicious packages or something more along the lines of copyright infringement software.
This definitely seems like a significant element of the ask, but for any popular package a list of all the downloaders would be pretty overwhelming in size (and I think of very limited utility). I'm guessing that some versions of some more obscure package(s) were identified as being used in an attack and they're either trying to identify potential attackers or other victims (or both) of that attack.
From a 2021 article[1] about packages used to deliver malware:
"we have alerted PyPI about the existence of the malicious packages which promptly removed them. Based on data from pepy.tech, we estimate the malicious packages were downloaded about 30,000 times."
For comparison yt-dlp has tens of millions of total downloads and gets downloaded over 70,000 times every day [2]
One theory that I don't see mentioned yet is that someone used an upload to PyPI to exfiltrate data or simply as a way to upload arbitrary data somewhere. In a sense PyPI is just a file hosting service, so it could have nothing to do with any actual Python projects at all.
Interesting approach to data exfil. Though it seems predictable that exactly this kind of subpoena would be issued. If you can predict it, you can probably mitigate it.
Which means the subpoena would only be useful if the criminals made an opsec mistake. That is generally how most sophisticated criminals get caught, but here it feels like anyone inventive enough to try will probably also be prudent enough not to leave a trail.
Normal police work doesn't go fishing for the IP addresses (potentially millions of users) of everyone who downloaded a package.
> "IP download logs of any Python Package Index (PyPI) packages uploaded by..." given usernames
Do you feel the same way if the cops are receiving the IPs of everyone who downloaded yt-dlp? IP addresses and timestamps resolve to physical locations and oftentimes street addresses.
Agreed - how else was the DOJ supposed to do their job? They clearly need the data for an investigation. No need for PyPI to give information about how current users can alter their accounts to thwart future requests.
Most likely caused by phishing, ransomware, or (unlikely) crypto mining. I'd bet someone from some agency had credentials leaked due to a malicious package. Honestly, PyPI is stuck between a rock and a hard place, but having something like a "verified" badge (where someone's real identity is tied to it) for certain packages would go a long way to ensure some level of security.
The problem gets a bit hairier when dealing with dependency chains, however.
It’s nice that they’re committed to user privacy, and this post really gives me confidence that my privacy will be reasonably protected.
…but why is that a goal for PyPI? As a publisher of packages, it’s a nice-to-have, but as an end user it’s kind of scary. I don’t want to use software packages published by anonymous and potentially unaccountable people. That’s probably why they have so many malicious packages.
Maybe you live under an oppressive regime that will imprison/murder you for publishing some code; ok, but that’s an outlier, and there are a lot of ways to get around that situation.
I just don’t see the benefit of privacy in this situation? Is it just to reduce the administrative overhead of collecting/verifying identity info? I’m genuinely curious to learn about a realistic use case that justifies the risks to all users.
I know you can self host your own package index, but very few users have the resources to do that.
I think largely because the prerogative is on the code author to reveal as little or as much about themselves, and the prerogative of library users is to sufficiently vet a package. If folks want to publish code pseudonymously, and folks want to use that code, as long as it's not abusive, what's to stop them? You can achieve basically the same effect with github, gitlab, or even plain self-hosted HTTP packages (pip just uses a convention for listing packages in a dir, any HTTP file host can be a package server), without PyPI.
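To make the "any HTTP file host can be a package server" point concrete, here is a small sketch that renders the kind of per-project HTML page pip expects from a simple index. The name normalization follows PEP 503; the project name, filename, and digest are made up, and the function names are my own.

```python
import html
import re


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 specifies."""
    return re.sub(r"[-_.]+", "-", name).lower()


def render_project_page(project: str, files: dict[str, str]) -> str:
    """Render a minimal project page; `files` maps filename -> SHA-256 hex digest."""
    links = "\n".join(
        f'    <a href="{html.escape(fname)}#sha256={digest}">{html.escape(fname)}</a><br/>'
        for fname, digest in sorted(files.items())
    )
    return (
        "<!DOCTYPE html>\n<html>\n  <body>\n"
        f"    <h1>Links for {html.escape(normalize(project))}</h1>\n"
        f"{links}\n"
        "  </body>\n</html>\n"
    )


# Hypothetical project and file, for illustration only.
page = render_project_page("My_Package.Name", {"my_package_name-1.0.tar.gz": "abc123"})
```

Serving a page like this as the index document of a directory named after the normalized project, next to the files it links to, is roughly what pointing pip's `--index-url` at a plain static host relies on.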
I actually think the larger problem is Python's reliance on imperative code that executes at install time. Yeah, you can use `pip download` and extract the archive yourself, but folks rarely do that.
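As a sketch of what "extract it yourself" can mean, the snippet below lists the `setup.py` files inside a source archive without running any of its code. The demo archive is built in memory purely for illustration; in practice the bytes would come from a file you downloaded, and the function names are hypothetical.

```python
import io
import posixpath
import tarfile


def list_install_scripts(sdist_bytes: bytes) -> list[str]:
    """Return archive members named setup.py; nothing in the archive is executed."""
    with tarfile.open(fileobj=io.BytesIO(sdist_bytes), mode="r:gz") as archive:
        return [
            member.name
            for member in archive.getmembers()
            if posixpath.basename(member.name) == "setup.py"
        ]


def _demo_sdist() -> bytes:
    """Build a tiny hypothetical sdist in memory for the example."""
    contents = {
        "demo-1.0/setup.py": b"print('this would run at install time')\n",
        "demo-1.0/demo.py": b"",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        for name, body in contents.items():
            info = tarfile.TarInfo(name)
            info.size = len(body)
            archive.addfile(info, io.BytesIO(body))
    return buf.getvalue()


scripts = list_install_scripts(_demo_sdist())
```

Listing members is only the first step; the point is that you get to read the install-time code before anything has a chance to run it.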
According to US news over the past 3-4 years, you can just ignore subpoenas, then get a contributor job on a cable news network. Bonus points, the more you flout the law as arrogantly as possible ;p
I don’t understand how the information requested is relevant at all for any purpose. Most users of PyPI merely download through pip; they aren’t registering anything. Furthermore, I would think a bad actor who would register would spoof their ip and use burner accounts anyhow.
Correlating IP address use to something else happening at the same time? Like a malware author being incredibly dumb and using their home IP to upload PyPI packages, while, IDK, using that same IP as a C&C server endpoint.
Presumably the 5 users in question were interesting in some way, not just random.
> I would think a bad actor who would register would spoof their ip and use burner accounts anyhow
Maybe, but they could find that out with the information. If there's a 10% chance each was sloppy or un-paranoid, there's a 40% chance they get at least one piece of real info.
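The 40% figure holds up as a quick check, assuming the five accounts are independent and each has the same 10% chance of having slipped (both are simplifying assumptions): the chance of at least one slip is 1 - 0.9^5.

```python
def p_at_least_one(p_each: float, n: int) -> float:
    """Probability that at least one of n independent trials succeeds."""
    return 1.0 - (1.0 - p_each) ** n


# 5 users, each with an assumed 10% chance of having left real information.
p = p_at_least_one(0.10, 5)  # 1 - 0.9**5, about 0.41
```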
The person might not have thought they were doing anything wrong. Some judge might have greenlit this for a piracy case against the five maintainers of youtube_dl{c} or something silly.
wongarsu | 2 years ago
dpifke | 2 years ago
[0]: https://news.ycombinator.com/item?id=36044543
[1]: https://blog.yossarian.net/2023/05/21/PGP-signatures-on-PyPI...
wokwokwok | 2 years ago
woodruffw | 2 years ago
dhx | 2 years ago
Hackbraten | 2 years ago
LordShredda | 2 years ago
NelsonMinar | 2 years ago
vore | 2 years ago
WhyNotHugo | 2 years ago
ed25519FUUU | 2 years ago
not2b | 2 years ago
slenk | 2 years ago
phkahler | 2 years ago
zerealshadowban | 2 years ago
chatmasta | 2 years ago
takeda | 2 years ago
pluto_modadic | 2 years ago
hayleox | 2 years ago
jehb | 2 years ago
jacquesm | 2 years ago
Why not to the users themselves? Have they been prohibited from doing so? (TFA does not say afaict)
junon | 2 years ago
aa_is_op | 2 years ago
I've lost track of the number of "white hats" that contact us with extortion requests after they used some dependency confusion attack.
misterpigs | 2 years ago
voynich | 2 years ago
SV_BubbleTime | 2 years ago
>As a result we are currently developing new data retention and disclosure policies.
“I guess we don’t actually need that” should have been the idea from the start.
itake | 2 years ago
tomjen3 | 2 years ago
Which is the most important part.
throwaway_13140 | 2 years ago
Zetice | 2 years ago
schoen | 2 years ago
woodruffw | 2 years ago
paxys | 2 years ago
Subpoena = the court compels you to hand over the evidence we need.
jupp0r | 2 years ago
femto113 | 2 years ago
[1] https://jfrog.com/blog/malicious-pypi-packages-stealing-cred...
[2] https://pepy.tech/project/yt-dlp
tgbugs | 2 years ago
rocqua | 2 years ago
etaioinshrdlu | 2 years ago
sneak | 2 years ago
Vervious | 2 years ago
With PyPI hosting a ton of malicious packages and malware, certainly I am not morally opposed.
throwaway_13140 | 2 years ago
dvt | 2 years ago
bogwog | 2 years ago
kortex | 2 years ago
whimsicalism | 2 years ago
Yeah no way they haven't had other subpoenas then.
b33j0r | 2 years ago
kjkjadksj | 2 years ago
buildbot | 2 years ago
caturopath | 2 years ago
cubefox | 2 years ago