It's worth noting that many journals don't control the platform that hosts their scholarly content. It looks like ACS uses [Atypon](http://www.atypon.com). That's the likely source of this spider trap, not ACS.
Scientific publishing is not just weird, it's sinister.
That said, ACS is pretty sinister, too. They opposed PubChem (http://en.wikipedia.org/wiki/PubChem#ACS.27s_concerns) and in general don't behave like the nonprofit scientist trade organization that they present themselves as.
In Highwire's case, they typically have a robots.txt blocking everyone but Google ... and the reason is not malice, it's inefficient software. Fetching a page once every few seconds is enough to overload their system.
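A robots.txt implementing the policy described would look something like this (a generic sketch of "everyone but Google", not HighWire's actual file); an empty `Disallow` permits all paths:

```
# Allow Googlebot everywhere (empty Disallow = no restriction)
User-agent: Googlebot
Disallow:

# Block every other crawler from the whole site
User-agent: *
Disallow: /
```

Well-behaved crawlers check this file before fetching anything; the spider trap exists precisely to catch the ones that don't.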
Those client lists are a bit unfair to compare: Atypon lists large publishers (Elsevier, IEEE, Oxford University Press, Taylor & Francis, ACS) while HighWire's list has a lot of individual journals (Journal of Early Childhood Research, Monthly Notices of the Royal Astronomical Society: Letters, etc.)
Is it bad that I'm just as insulted by the so-called "spider trap"? It's so technologically simple as to be useless against anyone who could deploy a web scraper in the first place.
I mean, it's marked by comment tags that say "spider trap" right on them! It's the worst type of detection system: likely to generate false positives, unlikely to catch real violators.
Yet the off-the-shelf bots that are just let loose on the web in general will likely fall for it, as long as the "spider trap" is not off-the-shelf itself; and the ones actually targeted at you specifically you likely can't defeat anyway.
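To see why off-the-shelf bots fall for it: a naive crawler extracts every `<a href>` from the raw HTML and never computes styles, so a CSS-hidden trap link looks exactly like a real one. A minimal sketch (the page fragment is the one quoted in the article; the extractor itself is illustrative, not any real crawler's code):

```python
# Sketch: why a naive crawler follows a CSS-hidden spider trap link.
# It collects every <a href> from raw HTML, never noticing that the
# surrounding <span id="hide"> is styled to be invisible to humans.
from html.parser import HTMLParser

class NaiveLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '''<a href="/doi/abs/10.1021/bi300674e">Abstract</a>
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>'''

extractor = NaiveLinkExtractor()
extractor.feed(page)
# The trap URL is queued for crawling right alongside the real one.
print(extractor.links)
```

Anything smarter than this (rendering CSS, skipping links with no anchor text) walks right past the trap, which is the commenter's point.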
Note how this means that anyone who is tricked into clicking that link has just blacked out their entire institution. This has massive potential for abuse.
arXiv.org, back when it was still xxx.lanl.gov, had a similar trap. Yes, I clicked on it. It gave a warning of the sort "don't do this again, here's what's happening, if we see many more requests from your site then we'll shut off access."
I still remember that page. As a middle schooler who didn't know anything from anything, it was a perplexing thing. The site's got an xxx at the front, but looks like a legit government site from wait, Los Alamos? Like from "Surely You're Joking Mr. Feynman"? Oh jeez, I'm gonna get in trouble with the school...
Funny, we used to do this when I was working at arXiv.org. We had incessant problems with robots that didn't obey robots.txt so we needed spider traps to keep the site from going down.
That's some level of incompetence - the trap builders, I mean. A half-arsed solution because they couldn't think of a better one. A registration system with abstracts and unlock-this-article links would be a better one, off the top of my head.
I'm willing to bet that they provide site licenses, where everyone in an entire university's subnet range might have access. In an open access journal, it shouldn't matter, but many journals are hosted on the same few platforms, and the spider trap is a feature of the platform.
Tl;dr: a researcher is browsing the source code of a research paper's web page and finds a strange link (on the same domain). She clicks it and is informed that her IP is banned for automated spidering.
Apparently, this research site is meant to be open-access...
-------
Pandora is a researcher (won’t say where, won’t say when). I don’t know her field – she may be a scientist or a librarian. She has been scanning the spreadsheet of the Open Access publications paid for by Wellcome Trust. It’s got 2200 papers that Wellcome has paid 3 million GBP for, for the sole purpose of making them available to everyone in the world.
She found a paper in the journal Biochemistry (that’s an American Chemical Society publication) and looked at http://pubs.acs.org/doi/abs/10.1021/bi300674e . She got that OK – looked to see if she could get the PDF - http://pubs.acs.org/doi/pdf/10.1021/bi300674e - yes, that worked OK.
What else can we download? After all, this is Open Access, isn’t it? And Wellcome have paid 666 GBP for this “hybrid” version (i.e. the publisher gets subscription income as well). So we aren’t going to break any laws…
The text contains various other links and our researcher follows some of them. Remember she’s a scientist and scientists are curious. It’s their job. She finds:
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
Since it's a bioscience paper she assumes it's about spiders and how to trap them.
She clicks it. Pandora opens the box...
Wham!
The whole university got cut off immediately from the whole of ACS publications. "Thank you", ACS.
The ACS is stopping people spidering their site. EVEN FOR OPEN ACCESS. It wasn't a biological spider.
It was a web trap based on the assumption that readers are, in some way, basically evil.
Now I have seen this message before. About 7 years ago one of my graduate students was browsing 20 publications from ACS to create a vocabulary. Suddenly we were cut off with this awful message. Dead. The whole of Cambridge University. I felt really awful. I had committed a crime.
And we hadn't done anything wrong. Nor has my correspondent.
If you create Open Access publications you expect - even hope - that people will dig into them.
So, ACS, remove your spider traps. We really are in Orwellian territory where the point of Publishers is to stop people reading science. I think we are close to the tipping point where publishers have no value except to their shareholders and a sick, broken vision of what academia is about.
UPDATE:
See comment from Ross Mounce:
The society (closed access) journal ‘Copeia’ also has these spider trap links in its HTML, e.g. on this contents page: http://www.asihcopeiaonline.org/toc/cope/2013/4
you can find
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
I may have accidentally cut off access for all at the Natural History Museum, London once when I innocently tried this link, out of curiosity. Why do publishers ‘booby-trap’ their websites? Don’t they know us researchers are an inquisitive bunch? I’d be very interested to read a PDF that has a 9999-9999.99999 DOI string, if only to see what it contained – they can’t rationally justify cutting off access to everyone just because ONE person clicked an interesting link?
PMR: Note - it's the SAME link as the ACS uses. So I surmise that both societies outsource their web pages to some third-party hackshop. Maybe 10.1046 is a universal anti-publisher.
PMR: It's incredibly irresponsible to leave spider traps in HTML. It's a human reaction to explore.
Seems like an easy way for a university-based "conscientious objector" to get this issue addressed would be to intentionally click on the spider trap link once a day.
I work for a (non-profit) journal publisher and we do indeed cut off robot downloading, but not after one click of a link. We analyze traffic to determine robot downloads. I suspect, though, that the entire university did not get cut off in this incident. Usually it is on a per-IP basis, and unless the university proxies all of its journal traffic through a single IP (which is not common), saying the whole university was blocked may be an exaggeration. I personally wish we had no robot monitor, but then again we would get heavy spidering of large files.
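Traffic analysis of the kind described can be as simple as a sliding-window request counter per IP. A minimal sketch (the thresholds are invented for illustration, not this publisher's real heuristics):

```python
# Sketch: flag an IP as a likely robot if it makes more than
# MAX_REQUESTS download requests within WINDOW_SECONDS.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 30  # invented threshold for illustration

history = defaultdict(deque)  # ip -> deque of recent request timestamps

def is_robot(ip, now):
    q = history[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS

# A human clicking one link is fine...
assert not is_robot("10.0.0.7", now=0.0)
# ...but 40 requests in under a second trips the detector.
hits = [is_robot("10.0.0.8", now=0.01 * i) for i in range(40)]
print(hits[-1])
```

Unlike a one-click trap link, a rate-based check like this tolerates a curious human, which is exactly the distinction the commenter is drawing.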
Oddly, the google cache version won't load for me either. The google cache header is there, but the content area is blank, with the chrome status bar saying "Waiting for blogs.ch.cam.ac.uk".
Looking at the source... there are some weird things going on. I think maybe the _original_ page loaded its content with Javascript, and the google cached version is just the JS skeleton, waiting on trying to load JS from the original (overloaded) site, which would actually load the content?
Ugh. The trend for JS-dependent sites for simple content breaks the web, people.
The warning message returned by the spider-trap says that it banned a particular IP address. How does this cut off the entire university? Is everyone behind a NAT?
For licensing purposes, they'd need to be able to associate ranges of IP addresses with a specific institution. So if they want to, it's easy to block that whole license for one violation.
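Mapping an offending IP back to a licensed institution is straightforward once you hold CIDR ranges for each license. A sketch using Python's stdlib `ipaddress` module (the registry, ranges, and institution name are made up; the ranges are the RFC 5737 documentation blocks):

```python
# Sketch: find which licensed institution an IP belongs to, so one
# violation can (for better or worse) block the whole range.
import ipaddress

# Hypothetical license registry: institution -> CIDR blocks
LICENSES = {
    "Example University": ["192.0.2.0/24", "198.51.100.0/22"],
}

def institution_for(ip):
    addr = ipaddress.ip_address(ip)
    for name, blocks in LICENSES.items():
        if any(addr in ipaddress.ip_network(b) for b in blocks):
            return name
    return None

print(institution_for("192.0.2.41"))   # inside the university's range
print(institution_for("203.0.113.5"))  # unlicensed address
```

This is the same range data that grants subnet-wide access in the first place, which is why revoking it is equally subnet-wide.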
This is an important topic, but that blog entry was not very well written. If I hadn't heard about this before already, I would have been very confused about what they actually wanted to say with this convoluted story.
1. get university with good ties to ACLU and other such movements.
2. subscribe
3. click link
4. sue them for breach of contract and damages. (they didn't deliver the content you paid for, it damaged your main source of income: providing knowledge to paying students)
5. repeat.
Sigh, did no one notice that the link is in a <span id="hide">? Look at the style sheet and note that 'hide' sets the link to be the same color as the background (it makes it invisible to humans), and yet it got clicked on anyway.
There are bad actors out there, they exploit services, and one of the ways the services detect them is to create situations that a script would follow but that a human would not. When they do something bad you've got a couple of choices: cut them off or lie to them (some of the Bing markov-generated search pages for robots are pretty fun).
So she sends an email to the address provided, they talk to her, she gets educated, and they re-enable access. If it happens again the issue gets escalated. It's the circle of fraud.
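Server-side, the whole trap mechanism described above can be a few lines. A hypothetical sketch (the responses and ban logic are invented; only the trap path is the one quoted in the article):

```python
# Sketch: the server side of a spider trap. Any client that requests
# the trap path (linked invisibly, excluded in robots.txt) is banned.
TRAP_PATH = "/doi/pdf/10.1046/9999-9999.99999"  # the decoy from the article
banned_ips = set()

def handle_request(ip, path):
    if ip in banned_ips:
        return 403, "Your IP has been banned for automated spidering."
    if path == TRAP_PATH:
        banned_ips.add(ip)
        return 403, "Your IP has been banned for automated spidering."
    return 200, "article content"

# One curious human click is indistinguishable from a robot:
print(handle_request("10.1.2.3", TRAP_PATH))
print(handle_request("10.1.2.3", "/doi/abs/10.1021/bi300674e"))
```

Note there is no appeal step in the code itself; the "email us and we'll re-enable you" loop happens entirely out of band, as the comment describes.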
Some people also override site-based CSS with their own, which could likely make a link that was intended to be hidden become unhidden. Most browsers I've used have that option.
A link the same colour as the background can still be seen (e.g. if it's selected by accident, or by Select All), can still be clicked on whether it's seen or not, etc.
I sometimes get a similar message from Google (maybe it's due to the search queries I use...), but they provide a CAPTCHA so you can (reasonably) show that you're a human.
freshyill | 12 years ago
Atypon has [a relatively small client list](http://www.atypon.com/our-clients/featured-clients.php). Compare it to [Highwire](http://highwire.stanford.edu/lists/allsites.dtl). I'd be willing to bet that all journals hosted with Atypon share this spider trap—even journals that are supposed to be open access where spidering should be OK.
Scientific publishing is weird. Source: I work in scientific publishing.
ak217 | 12 years ago
greglindahl | 12 years ago
Blahah | 12 years ago
lazyjeff | 12 years ago
s_q_b | 12 years ago
PavlovsCat | 12 years ago
Kliment | 12 years ago
danieltillett | 12 years ago
dsl | 12 years ago
naich | 12 years ago
dalke | 12 years ago
This was in the late 1990s.
HCIdivision17 | 12 years ago
PaulHoule | 12 years ago
SixSigma | 12 years ago
freshyill | 12 years ago
danso | 12 years ago
gmisra | 12 years ago
logfromblammo | 12 years ago
specialp | 12 years ago
dllthomas | 12 years ago
raverbashing | 12 years ago
Not sure it checks for styling before prefetching them.
acdha | 12 years ago
https://developers.google.com/chrome/whitepapers/prerender
nraynaud | 12 years ago
edit: and I was messing up the webstats for advertisement.
owenversteeg | 12 years ago
sp332 | 12 years ago
jrochkind1 | 12 years ago
a3n | 12 years ago
k2enemy | 12 years ago
TillE | 12 years ago
DangerousPie | 12 years ago
DangerousPie | 12 years ago
dang | 12 years ago
gcb0 | 12 years ago
ChuckMcM | 12 years ago
abruzzi | 12 years ago
danudey | 12 years ago
unknown | 12 years ago
[deleted]
spb | 12 years ago
joshdance | 12 years ago
fit2rule | 12 years ago
Let's wait and find out how long it takes them to respond to the inevitable interest that 999999.99999 people will have sent their way...
userbinator | 12 years ago
patcon | 12 years ago