
Sci Hub repository torrents of scientific papers

502 points| jacquesm | 7 years ago |gen.lib.rus.ec | reply

255 comments

[+] michaelbuckbee|7 years ago|reply
Not being in a scientific field, one of the previous times this came up I asked if any of the working scientists felt like SciHub had positively impacted their work: exposing them to more papers than they would have read, guiding them in different ways, etc. and the answer was a pretty overwhelming "ohmygodyes".

From the outside, it's very difficult to see this as anything but a public good. It certainly seems like literally the entire world is benefitting from this (including the scientists involved in the publishing) and it's being held back by a handful of publishers.

[+] nneonneo|7 years ago|reply
The total collection is 54.54 TiB, with 690 torrents as of writing. For preservation purposes, I put magnet links to all the torrents here: https://pastebin.com/zTAqS7wz

So, even if the torrents go away, the magnet links should still be usable.

(Edited: fixed link)

[+] userbinator|7 years ago|reply
What I find more amazing than the fact that some people have been so neighbourly as to share such a volume of information is that this is still just a tiny fraction of all human knowledge. Scientific journals are certainly an important collection, but still minuscule in comparison to all the other books out there. I have collected ~20GB of automotive service manuals and related data, and that's a tiny amount too. To say nothing of the many terabytes of entertainment others have...
[+] phyzome|7 years ago|reply
Downloading all the torrent files:

  mkdir torrents
  for i in {000..689}; do
    curl -sS -m 30 \
      "http://gen.lib.rus.ec/scimag/repository_torrent/sm_${i}00000-${i}99999.torrent" \
      -o "torrents/sm_${i}00000-${i}99999.torrent"
  done
...which themselves take up about 75 MB, so here is the full collection of torrent files for your convenience:

https://lab.brainonfire.net/drop/delete-after/20180630/torre... (please use the next link if possible, though!)

And a torrent that is basically equivalent to that file, thanks to the below commenter:

https://lab.brainonfire.net/drop/delete-after/20180630/scihu...

Each torrent appears to index 100 zip files (about 800 MB each), each of which presumably contains 1000... journal articles? I don't know. The seed isn't blazing fast, so it will be a bit before I can pull out a random zip file from this randomly selected torrent and inspect it.
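Once one of the zips finishes, a quick way to peek inside without extracting anything is a few lines of Python. This is a minimal sketch demonstrated on a tiny in-memory zip standing in for one of the ~800 MB archives; the DOI-style member names are invented for illustration:

```python
import io
import zipfile

def summarize_zip(data: bytes, limit: int = 5):
    """Count the members of a zip, sum their sizes, and list the first few names."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()
        names = [i.filename for i in infos[:limit]]
        return len(infos), sum(i.file_size for i in infos), names

# Demo on a tiny in-memory zip (a stand-in for one of the real archives).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("10.1000/example-a.pdf", b"%PDF-1.4 stub")
    zf.writestr("10.1000/example-b.pdf", b"%PDF-1.4 stub")

count, total_bytes, names = summarize_zip(buf.getvalue())
print(count, total_bytes, names)
```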

[+] Iv|7 years ago|reply
Is there a way to see which are in the direst need of seeders?

I am currently participating in some of them at random, as I have nowhere near 50 TiB of free space, but I would love to do it more efficiently.

[+] syedkarim|7 years ago|reply
Even if an entire satellite transponder (36 MHz) was used to broadcast these files (assuming 100Mbps broadcast/downlink), it would still take two months to download the whole thing.
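For what it's worth, the arithmetic checks out. A quick back-of-the-envelope sketch, assuming a sustained 100 Mbps downlink and the 54.54 TiB total quoted elsewhere in the thread:

```python
# Sanity-checking the two-month figure: 54.54 TiB over a 100 Mbps downlink.
TIB = 2 ** 40                 # bytes per tebibyte
total_bits = 54.54 * TIB * 8  # archive size in bits
seconds = total_bits / 100e6  # 100 Mbps sustained
days = seconds / 86400
print(round(days, 1))         # roughly 55.5 days, i.e. about two months
```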
[+] logicallee|7 years ago|reply
???? This is roughly equal to 54,000 datasets of 1 gigabyte each, which I would say is a pretty large dataset to publish along with a paper, so it's hard for me to even imagine 54 thousand such 1-gigabyte datasets. A one-gigabyte PDF is astronomical in size; PDFs are usually far, far smaller.

So why is this so large? Aren't they just PDFs, basically? And usually just a few pages? Excuse my ignorance. I'm very surprised at the size that you quote (54 TiB).

[+] Gatsky|7 years ago|reply
This is a magnificent cultural artifact, a modern day library of Alexandria. That it had to be 'stolen' is disappointing.

Wonder how it must feel for Elsevier to have their entire business up on a torrent. Zero sympathy here.

[+] sktrdie|7 years ago|reply
This is awesome. If only we could share this huge dump of PDF files as a more structured format, perhaps using SQLite [1, 2], we could search through the torrents without having to wait for all of them to download beforehand.

Although I guess the fact of having to "download them all beforehand" forces the data to be spread across various computers, hence increasing the availability of the data.

One idea I had regarding this is perhaps structuring the contents of torrents as "append-only binary trees". So as new dumps are released every month, one can simply start downloading the torrent and has "search capabilities" for new data as well.

1. https://github.com/lmatteis/torrent-net

2. https://www.youtube.com/watch?v=EKttt8PYu5M&feature=youtu.be
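A minimal sketch of the SQLite idea: ship a small metadata index (paper id, title) alongside the dumps so you can search before downloading any torrent. The schema and rows here are invented for illustration; a real index could use SQLite's FTS5 full-text module where available instead of LIKE:

```python
import sqlite3

# Build an in-memory index of (paper_id, title) pairs; in practice this
# database would be distributed alongside the dumps.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE papers (paper_id TEXT PRIMARY KEY, title TEXT)")
con.executemany(
    "INSERT INTO papers VALUES (?, ?)",
    [
        ("12345001", "Deep learning for protein structure prediction"),
        ("12345002", "A survey of peer-to-peer content distribution"),
    ],
)

# Look up which paper ids match a query, then fetch only the torrent/zip
# covering those id ranges instead of the whole 54 TiB collection.
rows = con.execute(
    "SELECT paper_id FROM papers WHERE title LIKE ?", ("%distribution%",)
).fetchall()
print(rows)
```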

[+] rocqua|7 years ago|reply
Maybe someone could compute and host an index of this stuff?
[+] jacquesm|7 years ago|reply
Google Scholar is something like that, though there is no easy API that I'm aware of that would allow you to build applications on top of the index.
[+] jancsika|7 years ago|reply
So now I'm curious--

What happens if you take the intersection of Wikipedia references with Sci-Hub content? Is it substantially less than the total 54 TiB of content on Sci-Hub?

Also, has anyone made a browser extension that hyperlinks Wikipedia references with articles available over Sci-Hub?
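The intersection question is easy to sketch once you have both sides as DOI sets. The DOIs below are made up for illustration; real inputs would be a Wikipedia citation dump on one side and Sci-Hub's catalogue listing on the other:

```python
# Hypothetical DOI sets: one scraped from Wikipedia's citations, one from
# Sci-Hub's catalogue.
wikipedia_dois = {
    "10.1038/nature12373",
    "10.1126/science.1259855",
    "10.9999/not-in-scihub",
}
scihub_dois = {
    "10.1038/nature12373",
    "10.1126/science.1259855",
    "10.1016/j.cell.2016.01.001",
}

covered = wikipedia_dois & scihub_dois
coverage = len(covered) / len(wikipedia_dois)
print(f"{len(covered)}/{len(wikipedia_dois)} cited DOIs covered ({coverage:.0%})")
```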

[+] SiempreViernes|7 years ago|reply
Probably. In general, most published papers are new and not too interesting, so it would be surprising, and worrying, if there were Wikipedia entries for a majority of the papers.
[+] Iv|7 years ago|reply
Tonight Aaron Swartz is finally at peace.
[+] 3131s|7 years ago|reply
This torrent page has been around for a long time.
[+] kriro|7 years ago|reply
What I don't get is... why don't governments simply declare that all research at publicly funded universities must be made available to the public? It seems so trivial. You pay the researchers, you get the research results, you make them available to all citizens (or the world).

Companies do this: they keep the research results of their employees.

[+] Vinnl|7 years ago|reply
A major reason, unfortunately, is that they don't want to interfere with the distribution of scientific articles. For example, if there's a very well-read journal in a specific discipline whose contents are only visible with a subscription, and one country prevents its researchers from publishing in it, then research by that country's researchers will be less read. And the reason they do not want that power is that they very much want to avoid being able to suppress the reach of research whose conclusions they might not like, as the Catholic Church once could.

(That is, if they are actually actively aware of and see it as a problem. Lots of governments/funders also don't have an active Open Access policy, although this is starting to change.)

[+] jhanschoo|7 years ago|reply
If you google "open access eu", you'll find a lot of articles reporting in 2016 on an EU initiative to have all its funded research open access by 2020. I wonder how that has come along since...
[+] Aelius|7 years ago|reply
SciHub would strongly benefit from moving to IPFS; I'm shocked that hasn't happened yet.
[+] anon400232|7 years ago|reply
How are the torrents grouped or organized? What are some good ways to locally recreate the SciHub front-end search functionality?
[+] ddtaylor|7 years ago|reply
Wow! Any idea how much data is there all together? I'm on a work network right now and can't fiddle with it.
[+] pen2l|7 years ago|reply
I too would love the answer to that question.

The JSTOR archive that Aaron Swartz wanted released was 35GB. I would imagine SciHub's archive is probably much bigger, maybe a TB+?

Also, I wonder if supplementary figures/videos/program code are included in these archives. Probably not, certainly we'd be looking at many TB if such was the case.

[+] DrBazza|7 years ago|reply
As a former scientist who was doing research during the 90s, I find it very unfortunate that a publishing system has arisen in which a company can monopolise publication and thereby prevent access.

No scientist wants their work to be unseen and hidden behind a paywall, but that is what has happened.

Worse still, in some fields amateurs can make a reasonable contribution to the field (my experience is observation astronomy), and the current system hinders that.

So many comments along the lines of "I can't find the paper I want without going to SciHub" tell you just how broken the current publishing system is.

[+] raister|7 years ago|reply
Is this legal?
[+] mapt|7 years ago|reply
Of course not. Copyright law is extraordinarily restrictive, and you, I, and everybody you know are guilty of violation after violation, easily enough to bankrupt hundreds of millions of people if strict enforcement were applied.

SciHub is all normal copyrighted published work; the publishers claim the same commercial status as an album or novel. The fact that there are currently 69 million works bundled means that willful copyright infringement of them carries statutory damages in excess of 10 trillion dollars, perhaps multiplied by the torrent seed ratio if the judge is feeling generous.

Plainly, our present copyright system is ridiculous, and ridiculously disproportionate. It's also (separately) morally outrageous to restrict scientific inquiry to institutional subscriptions, for work that was submitted for review for free. This is commonly acknowledged in academia however, where every other person is willing to help you get access to that paper or this preprint to help out.

The law of the land and the feelings of its population have an enormous disconnect here. The only things preventing the two from colliding head-on, and something reasonable coming out of that contact, are the fact that copyright infringement is litigated less than one time in a billion, and the vaguely defined, legally vulnerable principle of fair use.
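The 10-trillion figure is easy to reproduce, assuming the US statutory maximum of $150,000 per willfully infringed work (17 U.S.C. § 504(c)(2)):

```python
works = 69_000_000       # publications in the collection
max_per_work = 150_000   # US statutory maximum per willful infringement, in dollars
total = works * max_per_work
print(f"${total:,}")     # $10,350,000,000,000 -- a bit over 10 trillion dollars
```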

[+] dbasedweeb|7 years ago|reply
Sometimes what’s legal is incredibly immoral, and what’s moral isn’t legal. “I didn’t break any laws” can be a very thin shield to hide behind at times. By the same token, breaking some laws can be the right thing to do, although that won’t necessarily shield you from legal consequences. For concrete examples of legal yet immoral, see Jim Crow, segregation, slavery, and present-day prison labor. For illegal yet moral: selling pot to a cancer patient, spreading the sum of human knowledge far and wide, etc.
[+] dzdt|7 years ago|reply
Legality is not always the only question. Is it a net benefit to humankind? Is it a direct harm to anyone? Is it motivated by greed, or by malice, or by fear, or by altruism?
[+] MarkMMullin|7 years ago|reply
What is interesting (thus far) about the responses is that no author of a paper has shown up saying 'you're stealing my work'; nor is any money gathered by most of these organizations returned to research. Myself, I've always envisioned Elsevier as the Dutch dude in Austin Powers. One possible cure I heard from a librarian sobbing over the bill: if there were companies that could hold copyright and companies that could distribute, but no one company could do both, there might be some market tension.
[+] jacquesm|7 years ago|reply
Definitely not. But you could have some interesting discussions about morality and the legality of hoarding publicly funded scientific results behind paywalls.
[+] teaspoons|7 years ago|reply
Your tax dollars paid for the research, which was then privately hoarded.
[+] xamuel|7 years ago|reply
Anyone know why this is done as so many tiny torrents? Torrent users can already choose which files to download from a torrent, so why not do this as a smaller number of larger torrents?
[+] nneonneo|7 years ago|reply
The torrents each weigh in at 80.9 GiB on average, so they are not small. Each torrent contains 100 .zip files, which each in turn contain 1000 publications (averaging just under 1 MiB per publication). There's a total of 69 million publications here - it's a very hefty collection.
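Those numbers are mutually consistent, as a quick check shows (using the 54.54 TiB total, 690 torrents, and 69 million papers figures from the thread):

```python
TIB, GIB, MIB = 2 ** 40, 2 ** 30, 2 ** 20

total_bytes = 54.54 * TIB
per_torrent_gib = total_bytes / 690 / GIB
per_paper_mib = total_bytes / 69_000_000 / MIB

print(round(per_torrent_gib, 1))  # about 80.9 GiB per torrent
print(round(per_paper_mib, 2))    # about 0.83 MiB per paper, "just under 1 MiB"
```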
[+] jacquesm|7 years ago|reply
They're not that tiny. That's many thousands of papers / torrent, the file names are the paper id ranges in the system.
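Given that layout (100,000 consecutive paper ids per torrent, split into 100 zips of 1,000) locating a paper by id is simple arithmetic. A sketch, with the caveat that the exact zip naming inside the archives is my assumption:

```python
def locate(paper_id: int) -> tuple[str, int]:
    """Map a paper id to the torrent covering it and the start of its zip's id range."""
    lo = (paper_id // 100_000) * 100_000     # each torrent spans 100,000 ids
    torrent = f"sm_{lo:08d}-{lo + 99_999:08d}.torrent"
    zip_start = (paper_id // 1_000) * 1_000  # each zip holds 1,000 papers
    return torrent, zip_start

print(locate(12_345_678))
```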
[+] kanzure|7 years ago|reply
Does anyone have a complete copy? The seeds are very slow. I know someone willing to pay for this data. Shoot me an email.
[+] nneonneo|7 years ago|reply
The _whole_ thing? It's 54 TB, how do you plan on receiving it?
[+] throwaway47861|7 years ago|reply
Since this is technically very illegal, I'd be inclined to view your message as bait to capture and jail one of the very first people who managed to mirror the whole thing.

No offense, just seems fishy.

[+] themodelplumber|7 years ago|reply
Were someone to help you, would that person not be opening themselves up to above-average legal harm?
[+] mr_overalls|7 years ago|reply
For those of you who use SciHub: do you take any special precautions against malware?

PDFs are a convenient vector for viruses, trojans, etc. And the users downloading these papers tend to work for academic/research institutions that could be ripe targets for hacking & IP theft.

[+] aloisamae|7 years ago|reply
Because I'm mostly using this from school, I'm usually in a disposable VM behind a proxy. Those are the precautions I personally take.
[+] letters90|7 years ago|reply
You are referring to PDF viewer exploits; you can circumvent most of them by either using Linux or using a less-targeted PDF viewer like Sumatra.
[+] jancsika|7 years ago|reply
Has anyone with uni lib credentials done random hash checks on some of the PDFs from sci-hub?
[+] rasmusei|7 years ago|reply
Not sure that would work, since PDFs are usually watermarked or prepended with some metadata about the download time and the library upon download. At least that's the case with my uni library.
[+] ReedJessen|7 years ago|reply
Is mere possession of these torrent files a copyright violation, or does one have to actively download the content to create the copyright violation?