It's worth noting that many journals don't control the platform that hosts their scholarly content. It looks like ACS uses [Atypon](http://www.atypon.com). That's the likely source of this spider trap, not ACS.
Scientific publishing is not just weird, it's sinister.
That said, ACS is pretty sinister, too. They opposed PubChem (http://en.wikipedia.org/wiki/PubChem#ACS.27s_concerns) and in general don't behave like the nonprofit scientist trade organization that they present themselves as.
In Highwire's case, they typically have a robots.txt blocking everyone but Google ... and the reason is not malice, it's inefficient software. Fetching a page once every few seconds is enough to overload their system.
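A robots.txt implementing the policy described would look something like this (a generic sketch of "everyone but Google", not HighWire's actual file); an empty `Disallow` permits all paths:

```
# Allow Googlebot everywhere (empty Disallow = no restriction)
User-agent: Googlebot
Disallow:

# Block every other crawler from the whole site
User-agent: *
Disallow: /
```

Well-behaved crawlers check this file before fetching anything; the spider trap exists precisely to catch the ones that don't.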
Those client lists are a bit unfair to compare: Atypon lists large publishers (Elsevier, IEEE, Oxford University Press, Taylor & Francis, ACS) while HighWire's list has a lot of individual journals (Journal of Early Childhood Research, Monthly Notices of the Royal Astronomical Society: Letters, etc.)
Is it bad that I'm just as insulted by the so-called "spider trap"? It's so technologically simple as to be useless against anyone who could deploy a web scraper in the first place.
I mean, it's marked by comment tags that say "spider trap" right on them! It's the worst type of detection system: likely to generate false positives, unlikely to catch real violators.
Yet the off-the-shelf bots that are just let loose on the web in general will likely fall for it, as long as the "spider trap" is not off-the-shelf itself; and the ones actually targeted at you specifically you likely can't defeat anyway.
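To see why off-the-shelf bots fall for it: a naive crawler extracts every `<a href>` from the raw HTML and never computes styles, so a CSS-hidden trap link looks exactly like a real one. A minimal sketch (the page fragment is the one quoted in the article; the extractor itself is illustrative, not any real crawler's code):

```python
# Sketch: why a naive crawler follows a CSS-hidden spider trap link.
# It collects every <a href> from raw HTML, never noticing that the
# surrounding <span id="hide"> is styled to be invisible to humans.
from html.parser import HTMLParser

class NaiveLinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '''<a href="/doi/abs/10.1021/bi300674e">Abstract</a>
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>'''

extractor = NaiveLinkExtractor()
extractor.feed(page)
# The trap URL is queued for crawling right alongside the real one.
print(extractor.links)
```

Anything smarter than this (rendering CSS, skipping links with no anchor text) walks right past the trap, which is the commenter's point.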
Note how this means that anyone who is tricked into clicking that link has just blacked out their entire institution. This has massive potential for abuse.
arXiv.org, back when it was still xxx.lanl.gov, had a similar trap. Yes, I clicked on it. It gave a warning of the sort "don't do this again, here's what's happening, if we see many more requests from your site then we'll shut off access."
I still remember that page. As a middle schooler who didn't know anything from anything, it was a perplexing thing. The site's got an xxx at the front, but looks like a legit government site from wait, Los Alamos? Like from "Surely You're Joking Mr. Feynman"? Oh jeez, I'm gonna get in trouble with the school...
Funny, we used to do this when I was working at arXiv.org. We had incessant problems with robots that didn't obey robots.txt so we needed spider traps to keep the site from going down.
That's some level of incompetence - the trap builders, I mean. A half-arsed solution because they couldn't think of a better one. A registration system with abstracts and unlock-this-article links would be a better one, off the top of my head.
I'm willing to bet that they provide site licenses, where everyone in an entire university's subnet range might have access. In an open access journal, it shouldn't matter, but many journals are hosted on the same few platforms, and the spider trap is a feature of the platform.
Tl;dr: a researcher is browsing the source code of a research paper's web page and finds a strange link (on the same domain). She clicks it and is informed that her IP is banned for automated spidering.
Apparently, this research site is meant to be open-access...
-------
Pandora is a researcher (won’t say where, won’t say when). I don’t know her field – she may be a scientist or a librarian. She has been scanning the spreadsheet of the Open Access publications paid for by Wellcome Trust. It’s got 2200 papers that Wellcome has paid 3 million GBP for, for the sole purpose of making them available to everyone in the world.
She found a paper in the journal Biochemistry (that’s an American Chemical Society publication) and looked at http://pubs.acs.org/doi/abs/10.1021/bi300674e . She got that OK – looked to see if she could get the PDF - http://pubs.acs.org/doi/pdf/10.1021/bi300674e - yes, that worked OK.
What else can we download? After all, this is Open Access, isn’t it? And Wellcome have paid 666 GBP for this “hybrid” version (i.e. the publisher gets subscription income as well). So we aren’t going to break any laws…
The text contains various other links and our researcher follows some of them. Remember she’s a scientist and scientists are curious. It’s their job. She finds:
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
Since it's a bioscience paper she assumes it's about spiders and how to trap them.
She clicks it. Pandora opens the box...
Wham!
The whole university got cut off immediately from the whole of ACS publications. "Thank you", ACS.
The ACS is stopping people spidering their site. EVEN FOR OPEN ACCESS. It wasn't a biological spider.
It was a web trap based on the assumption that readers are, in some way, basically evil.
Now I have seen this message before. About 7 years ago one of my graduate students was browsing 20 publications from ACS to create a vocabulary. Suddenly we were cut off with this awful message. Dead. The whole of Cambridge University. I felt really awful. I had committed a crime.
And we hadn't done anything wrong. Nor has my correspondent.
If you create Open Access publications you expect - even hope - that people will dig into them.
So, ACS, remove your spider traps. We really are in Orwellian territory where the point of Publishers is to stop people reading science. I think we are close to the tipping point where publishers have no value except to their shareholders and a sick, broken vision of what academia is about.
UPDATE:
See comment from Ross Mounce:
The society (closed access) journal ‘Copeia’ also has these spider trap links in its HTML, e.g. on this contents page: http://www.asihcopeiaonline.org/toc/cope/2013/4
you can find
<span id="hide"><a href="/doi/pdf/10.1046/9999-9999.99999">
<!-- Spider trap link --></a></span>
I may have accidentally cut off access for all at the Natural History Museum, London once when I innocently tried this link, out of curiosity. Why do publishers ‘booby-trap’ their websites? Don’t they know us researchers are an inquisitive bunch? I’d be very interested to read a PDF that has a 9999-9999.99999 DOI string, if only to see what it contained – they can’t rationally justify cutting off access to everyone just because ONE person clicked an interesting link?
PMR: Note - it's the SAME link as the ACS uses. So I surmise that both societies outsource their web pages to some third-party hackshop. Maybe 10.1046 is a universal anti-publisher.
PMR: It's incredibly irresponsible to leave spider traps in HTML. It's a human reaction to explore.
Seems like an easy way for a university-based "conscientious objector" to get this issue addressed would be to intentionally click on the spider trap link once a day.
I work for a (non-profit) journal publisher and we do indeed cut off robot downloading, but not after one click of a link. We analyze traffic to determine robot downloads. I suspect, though, that the entire university did not get cut off in this incident. Usually it is on a per-IP basis, and unless the university proxies all of its journal traffic through a single IP (which is not common), saying the whole university was blocked may be an exaggeration. I personally wish we had no robot monitor, but then again we would get heavy spidering of large files.
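Traffic analysis of the kind described can be as simple as a sliding-window request counter per IP. A minimal sketch (the thresholds are invented for illustration, not this publisher's real heuristics):

```python
# Sketch: flag an IP as a likely robot if it makes more than
# MAX_REQUESTS download requests within WINDOW_SECONDS.
from collections import defaultdict, deque

WINDOW_SECONDS = 60
MAX_REQUESTS = 30  # invented threshold for illustration

history = defaultdict(deque)  # ip -> deque of recent request timestamps

def is_robot(ip, now):
    q = history[ip]
    q.append(now)
    # Drop timestamps that have fallen out of the window.
    while q and now - q[0] > WINDOW_SECONDS:
        q.popleft()
    return len(q) > MAX_REQUESTS

# A human clicking one link is fine...
assert not is_robot("10.0.0.7", now=0.0)
# ...but 40 requests in under a second trips the detector.
hits = [is_robot("10.0.0.8", now=0.01 * i) for i in range(40)]
print(hits[-1])
```

Unlike a one-click trap link, a rate-based check like this tolerates a curious human, which is exactly the distinction the commenter is drawing.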
Oddly, the google cache version won't load for me either. The google cache header is there, but the content area is blank, with the chrome status bar saying "Waiting for blogs.ch.cam.ac.uk".
Looking at the source... there are some weird things going on. I think maybe the _original_ page loaded its content with Javascript, and the google cached version is just the JS skeleton, waiting on trying to load JS from the original (overloaded) site, which would actually load the content?
Ugh. The trend for JS-dependent sites for simple content breaks the web, people.
The warning message returned by the spider-trap says that it banned a particular IP address. How does this cut off the entire university? Is everyone behind a NAT?
For licensing purposes, they'd need to be able to associate ranges of IP addresses with a specific institution. So if they want to, it's easy to block that whole license for one violation.
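Mapping an offending IP back to a licensed institution is straightforward once you hold CIDR ranges for each license. A sketch using Python's stdlib `ipaddress` module (the registry, ranges, and institution name are made up; the ranges are the RFC 5737 documentation blocks):

```python
# Sketch: find which licensed institution an IP belongs to, so one
# violation can (for better or worse) block the whole range.
import ipaddress

# Hypothetical license registry: institution -> CIDR blocks
LICENSES = {
    "Example University": ["192.0.2.0/24", "198.51.100.0/22"],
}

def institution_for(ip):
    addr = ipaddress.ip_address(ip)
    for name, blocks in LICENSES.items():
        if any(addr in ipaddress.ip_network(b) for b in blocks):
            return name
    return None

print(institution_for("192.0.2.41"))   # inside the university's range
print(institution_for("203.0.113.5"))  # unlicensed address
```

This is the same range data that grants subnet-wide access in the first place, which is why revoking it is equally subnet-wide.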
This is an important topic, but that blog entry was not very well written. If I hadn't heard about this before already, I would have been very confused about what they actually wanted to say with this convoluted story.
1. get university with good ties to ACLU and other such movements.
2. subscribe
3. click link
4. sue them for breach of contract and damages. (they didn't deliver the content you paid for, it damaged your main source of income: providing knowledge to paying students)
5. repeat.
Sigh, did no one notice that the link is in a <span id="hide">? Look at the style sheet and note that 'hide' sets the link to be the same color as the background (it makes it invisible to humans), and yet it got clicked on anyway.
There are bad actors out there, they exploit services, and one of the ways the services detect them is to create situations that a script would follow but that a human would not. When they do something bad you've got a couple of choices: cut them off or lie to them (some of the Bing markov-generated search pages for robots are pretty fun).
So she sends an email to the address provided, they talk to her, she gets educated, and they re-enable access. If it happens again the issue gets escalated. It's the circle of fraud.
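Server-side, the whole trap mechanism described above can be a few lines. A hypothetical sketch (the responses and ban logic are invented; only the trap path is the one quoted in the article):

```python
# Sketch: the server side of a spider trap. Any client that requests
# the trap path (linked invisibly, excluded in robots.txt) is banned.
TRAP_PATH = "/doi/pdf/10.1046/9999-9999.99999"  # the decoy from the article
banned_ips = set()

def handle_request(ip, path):
    if ip in banned_ips:
        return 403, "Your IP has been banned for automated spidering."
    if path == TRAP_PATH:
        banned_ips.add(ip)
        return 403, "Your IP has been banned for automated spidering."
    return 200, "article content"

# One curious human click is indistinguishable from a robot:
print(handle_request("10.1.2.3", TRAP_PATH))
print(handle_request("10.1.2.3", "/doi/abs/10.1021/bi300674e"))
```

Note there is no appeal step in the code itself; the "email us and we'll re-enable you" loop happens entirely out of band, as the comment describes.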
Some people also override site-based CSS with their own, which could likely make a link that was intended to be hidden become unhidden. Most browsers I've used have that option.
A link the same colour as the background can still be seen (e.g. if it's selected by accident, or by Select All), can still be clicked on whether it's seen or not, etc.
I sometimes get a similar message from Google (maybe it's due to the search queries I use...), but they provide a CAPTCHA so you can (reasonably) show that you're a human.
freshyill | 12 years ago
Atypon has [a relatively small client list](http://www.atypon.com/our-clients/featured-clients.php). Compare it to [Highwire](http://highwire.stanford.edu/lists/allsites.dtl). I'd be willing to bet that all journals hosted with Atypon share this spider trap—even journals that are supposed to be open access where spidering should be OK.
Scientific publishing is weird. Source: I work in scientific publishing.
ak217 | 12 years ago
greglindahl | 12 years ago
Blahah | 12 years ago
lazyjeff | 12 years ago
s_q_b | 12 years ago
PavlovsCat | 12 years ago
Kliment | 12 years ago
danieltillett | 12 years ago
dsl | 12 years ago
naich | 12 years ago
dalke | 12 years ago
This was in the late 1990s.
HCIdivision17 | 12 years ago
PaulHoule | 12 years ago
SixSigma | 12 years ago
freshyill | 12 years ago
danso | 12 years ago
gmisra | 12 years ago
logfromblammo | 12 years ago
specialp | 12 years ago
dllthomas | 12 years ago
raverbashing | 12 years ago
Not sure it checks for styling before prefetching them.
acdha | 12 years ago
https://developers.google.com/chrome/whitepapers/prerender
nraynaud | 12 years ago
edit: and I was messing up the webstats for advertisement.
owenversteeg | 12 years ago
sp332 | 12 years ago
jrochkind1 | 12 years ago
a3n | 12 years ago
k2enemy | 12 years ago
TillE | 12 years ago
DangerousPie | 12 years ago
DangerousPie | 12 years ago
dang | 12 years ago
gcb0 | 12 years ago
ChuckMcM | 12 years ago
abruzzi | 12 years ago
danudey | 12 years ago
unknown | 12 years ago
[deleted]
spb | 12 years ago
joshdance | 12 years ago
fit2rule | 12 years ago
Let's wait and find out how long it takes them to respond to the inevitable interest that 999999.99999 people will have sent their way...
userbinator | 12 years ago
patcon | 12 years ago