
UK chemist on Elsevier's ban on textmining

222 points | czam | 14 years ago | blogs.ch.cam.ac.uk

29 comments

[+] ChristianMarks|14 years ago|reply
An outrage. I know an eminent chemist who will not referee for Elsevier--he was ahead of his time. The monopoly that the scientific publishers have consists of controlled access to 60 years of copyrighted work. An embargo of a few years after publication would be considerably better than what we have, but the intellectual monopolists want those billions. The cost of publishing hasn't gone up--the money is going into their executive suites. And most likely to their executives' sweetie pies.
[+] jrockway|14 years ago|reply
I'm not sure how a tool to read some text and display a diagram of the chemical reaction described falls under the "law" of the quoted passage. Copyright law certainly allows for this, so all the journals can do is say, "you don't get to buy our feed anymore if you run your program on our articles," but surely this is a game of chicken because no company wants to lose tens of thousands of dollars a year for no reason.

I would implement this as a browser plugin that uploads the content to a server (like Google Translate), and let the journals deal with each rogue user individually.

Telling people what software they can use to read text doesn't scale.

[+] michael_dorfman|14 years ago|reply
What would happen if you actually did try to textmine it?

> surely this is a game of chicken because no company wants to lose tens of thousands of dollars a year for no reason.

Elsevier is willing to play this game of chicken; they are convinced that their customers (i.e., research universities) cannot do without their product. Hence the OP's report that, twice, they instantly and without warning cut off the entire university's access to their product because of detected scraping.

> Telling people what software they can use to read text doesn't scale.

Unfortunately, I think it does. If individual users, even a large number of them, use a plug-in, they might get away with it-- but this would result in an incomplete data set. To systematically textmine the corpus (which is the task at hand) requires some kind of systematic access to the data, and this is where Elsevier steps in and shuts it down.

[+] dredmorbius|14 years ago|reply
First: it's very difficult to grok copyright issues from a one-sided, non-legal presentation. Settled case law means going to trial and often going through multiple appeals.

Second: as it stands, this looks more to be a contractual than copyright matter (Elsevier is contractually banning use of automated text processing by subscribers), though it might possibly attempt to put teeth into this by asserting copyright. The remedy sought by Elsevier would be not to allege infringement, but to cancel future subscriptions. It might be nice for, say, a large collection of institutions to call Elsevier's bluff.

Third: Copyright law as it exists (and particularly in the US where I understand it reasonably well) is very mechanical: it governs the making of copies of an expressive work. Copyright does NOT govern facts, it does not apply to works which are not expressive, it does not apply to works which are functional in nature.

The real test here would be to put this (and possibly other) contract claims to test in a court. Unfortunately, contracts are governed (in the US) under state, not federal law, and while there's some uniformity of language, it would probably take several such cases (and appeals to at least the Federal Circuit) to establish reasonable case law.

Otherwise, what's significant about this to me is that, once again, it's a case far less about the availability and copying of information (journal articles are routinely copied), than it is about power and control within an information market. This is an area in which conventional economics is far too often lacking (though it's also an area in which much interesting work is starting to happen).

[+] chwahoo|14 years ago|reply
I think prevention of semantic extraction to diagrams would be better pitched as "another unforeseen negative consequence of agreeing to Elsevier's terms", rather than as a significant part of the problem. If this story became a major part of the larger narrative, Elsevier could respond to this particular concern in a limited way (special access for researchers who apply for it), rather than dealing with the more central concerns.
[+] PaulHoule|14 years ago|reply
People don't realize how quickly text mining is coming along these days.

The chemistry papers discussed in this article are an easy domain, but similar shredding of more general documents will be possible by 2018.
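The kind of shredding described above can be sketched in miniature. Everything here (the verb list, the sentence pattern) is an illustrative assumption, nowhere near a real chemistry NLP pipeline, which would use trained chemical named-entity recognition rather than a regex:

```python
import re

# Toy "shredder": pull reactant -> product pairs out of prose.
# The verb vocabulary and pattern are illustrative assumptions only.
REACTION = re.compile(
    r"(?P<reactants>[\w\s,-]+?)\s+(?:was|were)\s+"
    r"(?:oxidized|reduced|converted|hydrolyzed)\s+to\s+"
    r"(?P<product>[\w\s-]+?)(?:[.;]|$)",
    re.IGNORECASE,
)

def extract_reactions(sentence: str) -> list[tuple[str, str]]:
    """Return (reactants, product) pairs found in one sentence."""
    return [
        (m.group("reactants").strip(), m.group("product").strip())
        for m in REACTION.finditer(sentence)
    ]

pairs = extract_reactions("Benzyl alcohol was oxidized to benzaldehyde.")
# pairs == [('Benzyl alcohol', 'benzaldehyde')]
```

A real system has to cope with nested names, abbreviations, and tables, which is why the comment's point about funding the development effort matters.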

Developing all this takes people's time and other resources, for which money is a proxy. The best way to beat Elsevier is to build something better -- but that something better is going to take substantial funding.

Scientific papers can be published online for well under $35 apiece. Add bureaucracy, however, and the cost can increase a hundredfold. Conventional funding agencies have limited interest in long-term programs (as opposed to projects), so organizations find it difficult to afford to maintain digital libraries after they are built. To bring the evil empire down, somebody needs to figure out how to get that $35.

[+] alexi_dst|14 years ago|reply
I feel really sad about this as well. I happen to have published a few articles in Elsevier journals, and I wish what I published were available to everyone's eyes. The point of publishing was to share my findings with the world, not only with those who pay.

To make everything more available and searchable, I uploaded it all to academia.edu. I hope they don't get sued by Wiley and/or Elsevier for the service they offer. If anyone wants to check out some chemistry, you can see what I have here: http://unlv.academia.edu/AlexiNedeltchev

A few side thoughts: I think there are many things that could be improved in science publishing:

1. Articles could be more interactive, with a discussion/commenting section.

2. Currently, if you want to know whether the article you are looking at is worth reading, you have to refer to the overall rating of the journal (known as the impact factor). I think every article should have its own rating, so you can tell which are high-impact articles and which are not. The impact factor is calculated from how many articles cite the journal's articles. Does that sound familiar? It's the same concept as PageRank in web SEO: the more links pointing in, the higher the rank.

3. The publishing process is very inefficient: it takes months to get something published because it has to be peer-reviewed. This is an obsolete approach that begs to be improved. Any ideas?
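The PageRank analogy for per-article ratings can be made concrete with a toy scorer over a citation graph. The graph, damping factor, and iteration count below are made-up illustration, not any journal's actual metric:

```python
# Toy per-article rating via PageRank over a citation graph.
def pagerank(links, damping=0.85, iters=50):
    """links: {article: [articles it cites]} -> {article: score}."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling article (cites nothing): spread its rank evenly.
                for n in nodes:
                    new[n] += damping * rank[src] / len(nodes)
        rank = new
    return rank

# Articles A and B both cite C, so C should score highest.
scores = pagerank({"A": ["C"], "B": ["C"], "C": []})
```

Unlike the journal-level impact factor, this rewards the individual article that attracts the citations.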

[+] Loic|14 years ago|reply
Repost of my comment on the blog (it is waiting in the moderation queue and not yet approved).

You should take the time to discuss this a bit with your librarian. While doing my PhD in Denmark (DTU), I naively wrote a robot to download the issues of a well-known chemical data journal. After about a week of steady usage, I went to talk with our librarian. He had seen my activity and was nice enough not to make a fuss, but he told me this: I had downloaded more than the entire university does in a year… and it was not a lot. It means that at the time, they paid a bit less than the $35 per-article price.

What is really important to notice is that for most scientific communities, Elsevier is not selling knowledge but influence. That is: you are published, you are cited, you get ranking, and your university rewards you. This is what we need to address if we want truly open access. We need a better way to “sell” influence to university researchers and deans.

As I am building Cheméo (http://chemeo.com), a chemical data search engine, I suffer too. It is maybe time to unite and propose a legal, efficient and rewarding way for researchers to publish their papers. We can do that on the side and let our influence grow.

Additional notes for HN readers, as yeah, we are a bit more on the programming side: what we basically need is a parallel DOI system that is easy to use, able to load all the open repositories, and able to accept "direct" submissions.
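As a rough sketch of what such a parallel identifier registry might look like: the "xdoi" scheme, the hashing choice, and the record shape below are all invented placeholders, not any real standard:

```python
import hashlib

class OpenRegistry:
    """Minimal sketch of a parallel DOI-like registry that mints stable
    identifiers for harvested open-repository records and for direct
    submissions alike. Everything here is an illustrative placeholder."""

    def __init__(self, prefix="xdoi"):
        self.prefix = prefix
        self.records = {}

    def submit(self, title, url, source="direct"):
        # Content-derived key: re-submitting the same record yields
        # the same identifier (idempotent minting).
        key = hashlib.sha256(f"{title}|{url}".encode()).hexdigest()[:12]
        ident = f"{self.prefix}:{key}"
        self.records.setdefault(ident, {"title": title, "url": url, "source": source})
        return ident

    def resolve(self, ident):
        return self.records.get(ident)

reg = OpenRegistry()
ident = reg.submit("My paper", "http://example.org/paper.pdf")
record = reg.resolve(ident)
```

A real system would also need harvesting (e.g. from OAI-PMH endpoints) and a resolver service, but the idempotent-minting property above is the core of "easy to use".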

We are not going to solve the problem in a year; this is an influence issue, and it will take time, years, to really address it, be it by our own work or by "law".

[+] dalke|14 years ago|reply
How come your search system doesn't appear to know chemistry? That is, I searched for "CCO" and found ethyl alcohol, but I searched for "OCC" and found nothing. Are you only doing a text search on the SMILES rather than a canonicalization first?
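For illustration, here is the difference canonicalization makes to search. This is a toy only: the string-reversal trick handles nothing beyond unbranched, single-letter-atom SMILES, and real engines canonicalize the molecular graph first (e.g. with RDKit's canonical SMILES):

```python
# Toy illustration of the point above: index a canonical form,
# not the raw SMILES string.
def toy_canonical(smiles: str) -> str:
    """For a simple linear SMILES, pick the lexicographically smaller
    of the string and its reverse, so CCO and OCC index identically.
    Purely illustrative; breaks on branches, rings, and two-letter
    atoms like Cl."""
    return min(smiles, smiles[::-1])

index = {toy_canonical("CCO"): "ethyl alcohol"}
hit = index.get(toy_canonical("OCC"))  # found, unlike a raw text lookup
```

With a raw text index, `index.get("OCC")` would miss; canonicalizing both the stored and the queried SMILES makes the two spellings of ethanol collide as intended.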
[+] estevez|14 years ago|reply
This is precisely where the Copyright Office (and its UK equivalent) can step in and make explicit that textmining in this fashion is unambiguously fair use.
[+] aswanson|14 years ago|reply
Google "Swanson linking". This ban is probably impeding discovery along that vector as well.
[+] aba_sababa|14 years ago|reply
It's not the semantic interpretation of text that they're banning, but the scraping of text, which deals with copying what they'd prefer to sell you. Still an abhorrence, but let's get our facts straight.
[+] politician|14 years ago|reply
Actually, it sounds like they are trying to sell their interpretation of a recipe because they recently acquired a company which extracts recipes. They've banned their customers from running the same tool which they probably use internally in order to protect their slice of the market.
[+] fleitz|14 years ago|reply
Easy solution: create a human intelligence task distribution system, have every university that has access participate, and assign lab students the HIT of downloading the documents. Voila, problem solved.

After it's all been mined, stop submitting to Elsevier.
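A minimal sketch of the distribution step in the comment above (participant names and the document-id shape are invented for illustration):

```python
from itertools import cycle

# Fan document-download tasks out across participating universities
# round-robin; each participant fetches only its own slice.
def assign_tasks(doc_ids, participants):
    """Return {participant: [doc_ids it should fetch]}."""
    assignments = {p: [] for p in participants}
    for doc, participant in zip(doc_ids, cycle(participants)):
        assignments[participant].append(doc)
    return assignments

tasks = assign_tasks(["doi:1", "doi:2", "doi:3"], ["uni_a", "uni_b"])
# tasks == {'uni_a': ['doi:1', 'doi:3'], 'uni_b': ['doi:2']}
```

Spreading the load this way also keeps any single institution's download volume low, which matters given the instant cut-offs described upthread.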

[+] guard-of-terra|14 years ago|reply
Why can they do that and be listened to? Why not do the textmining anyway and say "so sue me"?
[+] 7952|14 years ago|reply
So I can't use Ctrl-F?
[+] arguesalot|14 years ago|reply
Just throwing it into the mix, what would it take to convince Google to stop indexing the content of journal articles from closed-access journals? Surely without search, the articles are siloed; both journals and authors get a taste of the importance of opening access.
[+] flashingleds|14 years ago|reply
Unlikely to happen, but in any case also unlikely to be all that effective. Google Scholar is getting good, but when looking for papers I will usually use Scopus/ScienceDirect (Elsevier) or ISI (Thomson Reuters). Sowing the seeds of my own destruction and so forth.