
The Future of Science

60 points| RichardPrice | 14 years ago |techcrunch.com | reply

39 comments

[+] nkoren|14 years ago|reply
On a related note, some years ago I read an academic paper -- alas, it was printed on a dead tree, and I can't find a link to a digital version of it -- which pointed out that the rate at which papers were cited was driven primarily by the rate at which papers were cited. This is not as much of a tautology as it sounds like.

Think about the process of writing a paper: you do some keyword searches for recent articles on related subjects. You then look at the bibliographies of those articles, pick out whatever looks relevant to your topic, look at the bibliographies of those articles, and so on. What this means is that apart from your initial keyword search, the primary criterion for including an article in your research is: "has it been cited already?" Relevance is merely a secondary filter.

This paper pointed out the effects of this phenomenon: the vast majority of published scientific papers are never cited again; a moderate number are cited only a few times; and the remaining few -- having reached a bibliographic critical mass -- are cited thousands of times. The authors of the paper made a strong case that this was not a good reflection of the quality of the research. In many cases, the process of reaching bibliographic critical mass was based simply on the almost random chance of acquiring those first few citations. The authors provided several examples of important scientific ideas which had been lost for decades, arguably because they had not attracted a critical mass of citations in the years immediately after publication.

In other words, humans suck at pagerank.

Anyhow, it occurred to me that this is a problem which could be solved with technology. Imagine an online word processor which -- in a sidebar -- suggests potentially related articles from ArXiv and Google Scholar. This would be based not on crawling bibliographies, but rather on semantic analysis of the adjacent paragraphs.

I think this would create some real benefits. It would remove much of the problems with citation bias, ensuring that important ideas aren't lost, and also that prior research isn't unwittingly duplicated. Wish I had time to implement something like this!
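A minimal sketch of the sidebar idea above, assuming a plain TF-IDF bag-of-words similarity over article abstracts (a real system would use richer semantic models and query the ArXiv / Google Scholar APIs, which are not shown here):

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def tf_idf_vectors(docs):
    # Term counts per document, plus document frequency per term.
    tokenized = [Counter(tokenize(d)) for d in docs]
    df = Counter()
    for counts in tokenized:
        df.update(counts.keys())
    n = len(docs)
    # Weight each term by count * log(N / df): common terms fade out.
    return [{t: c * math.log(n / df[t]) for t, c in counts.items()}
            for counts in tokenized]

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def suggest(paragraph, corpus, top_k=3):
    """Rank corpus abstracts by similarity to the paragraph being written."""
    vectors = tf_idf_vectors(corpus + [paragraph])
    query, docs = vectors[-1], vectors[:-1]
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:top_k]
```

In an editor, `suggest` would run against the paragraph under the cursor and surface the top hits in the sidebar, independent of whether those papers were ever cited before.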

[+] sb|14 years ago|reply
I think your impression of how related work is done is overly simplified. Since you are submitting to a conference/journal in your field, chances are high that the reviewers are knowledgeable in the subject area and will point out errors in attributing due credit to related work.

While I agree that there are systemic problems with peer review and with how the science "enterprise" works, there is a fitting analogy from politics, paraphrasing Winston Churchill: it is "the worst form of government except for all the others."

[+] gliese1337|14 years ago|reply
I am always looking for ideas for software that I could work on that seem more useful than cat pictures, and this gets pretty high up in my estimation. Most of my minimal NLP expertise will be taken up developing software for foreign language instruction in the near future, but I'm definitely bookmarking this in case I can ever take it up later.
[+] lhnz|14 years ago|reply
Somebody needs to make a relevance algorithm which takes into consideration, (1) noise at low levels of citation activity, and (2) social proof being misleading at high levels of citation.
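One hedged sketch of such an algorithm (the prior weight and field rate below are made-up illustrative parameters, not calibrated values): shrink a paper's citation rate toward a field-wide prior to absorb the noise at low counts, then log-damp the result so social proof stops compounding at high counts.

```python
import math

def relevance_score(citations, age_years, field_rate=2.0, prior_weight=5.0):
    # (1) Empirical-Bayes-style shrinkage of citations/year toward a
    # field-wide rate: the prior dominates while the paper is young
    # (little evidence), and fades as evidence accumulates.
    years = max(age_years, 0.25)
    shrunk_rate = (citations + prior_weight * field_rate) / (years + prior_weight)
    # (2) Logarithmic damping: each extra order of magnitude of
    # citation rate adds only a constant amount of score, so a
    # runaway hit can't bury everything else.
    return math.log1p(shrunk_rate)
```

With these toy parameters, a paper with 100x the citations of another ends up only a few score points ahead, not 100x ahead.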
[+] simonster|14 years ago|reply
Yes, there is a time lag problem. However, instant distribution has been around for a long time (in the case of arXiv.org, since 1991). It's widely accepted in the physics community, but it hasn't gained much traction in most other scientific disciplines. I think there are two reasons for this: the chicken and egg problem, and the peer review problem.

The chicken and egg problem is that no one in these disciplines publishes unreviewed manuscripts because no one reads them. The corollary here is that if you do something interesting and someone happens to read it, take your idea, and publish first, as far as credit goes, you're fucked. This happens with any form of public presentation of ideas, not all that often but often enough that every scientist knows someone who it has happened to. If you just sank a year of your life into a project, you want to make damn sure you're going to get credit for it. At present, instant distribution is too risky. If the profile of instant distribution can rise to the point where a manuscript will be sufficiently widely read to be acknowledged as the source of an idea, scientists in less competitive areas may be more open to it.

The bigger issue is, I think, that scientists actually appreciate peer review. Peer review ensures both quality and fairness in research. If I read a paper in a high-impact journal, I generally believe I can trust the results regardless of who wrote it. By contrast, any reputation-based metrics will be strongly colored by the reputation of the lab from which the paper originates. (I have a hunch that this is already true for citation metrics.) Replacing peer review with reputation-based metrics may mean research gets out there faster, but it may also mean that a lot of valuable research gets ignored. This still sucks, and it may suck more. Turning a paper into a startup that may succeed or fail depending on how well a scientist can market his or her findings would absolutely suck ass. IMHO, scientific funding is already too concentrated in the hands of established labs, and these labs are often too large to make effective use of their personnel. Reputation-based metrics would only contribute to this problem. They would also lead to confusion in the popular press, which is already somewhat incapable of triaging important from unimportant scientific results. This is a much bigger deal in biomedical science than in theoretical physics, because the former has direct bearing on individuals' lives.

On top of this, citation metrics are simply not peer review. In his previous article, Richard Price pointed out that researchers need to spend a lot of time performing peer review. This is absolutely the way it should be. Researchers should spend hours poring over new papers, suggesting ways of improving them to the authors, and ultimately ensuring that whatever makes it into press is of as high quality as possible. IMHO, the easiest way to get quality research out faster is to encourage journals to set shorter peer review deadlines and encourage researchers to meet them, not to throw away the entire system.

OTOH, I think open sharing of data sets among researchers will massively enhance scientific progress, and has a reasonable chance of happening because the push is coming from funding agencies, not startups. As a scientist, the idea of being able to ask my own questions with other people's data gets me far more excited than being able to read their papers before release.

[+] RichardPrice|14 years ago|reply
I totally agree with you about data-sharing. I wanted to spend more time on that in the article, but didn't want to make it longer. I think the ability to share data, and ask questions about it, has enormous potential to drive science forward. The fact that enormous amounts of scientific data remain private to the lab, unshared, is a big loss to science. It's going to be very exciting as that data starts getting shared more.

The key to making that happen is disrupting the credit system. Right now scientists aren't incentivized to curate and share their data, so they don't put in the work to do it. You can't put data-sets on your resume, much like you can't put blog posts, or anything that is not a paper. As soon as scientists start getting credit for sharing data-sets, I think we'll start to see it happen.

Similar points apply, as you mention, to instant distribution. Instant distribution will happen more as scientists start getting credit for scientific ideas that they distribute instantly. You are already seeing some disruption to the credit system. In the last 5-10 years, since citation counts have been made publicly available by Google Scholar, citation counts have started to play a much larger role in resource allocation decisions, e.g. decisions by hiring committees and grant committees. I did my PhD at Oxford in philosophy from 2001-2007, and remained involved with some of the hiring decisions at the Oxford philosophy department until 2011, and it's been very interesting to watch the increased influence, over those years, of citation counts in hiring decisions.

Citation counts aren't perfect, but they are another signal. Hiring committees, in my experience, are desperate for more signals that they can take into account when comparing candidates. Comparing candidates is a tough job. As with any signal, to wield it properly, you need to know its pros and cons. Fundamentally, what the community is looking for here is a variety of signals that show how much a highly respected chunk of the scientific community has interacted with a piece of your content, and found it useful.

To get data-sets, and other media, to attract scientific credit, we need to develop metrics that demonstrate the traction that those pieces of media are getting in highly respected parts of the scientific community. I think those metrics will get developed, and that new metrics will play an enormous role in allowing different kinds of media to be shared, and everything to be shared faster.

[+] 3am|14 years ago|reply
Admirable cause, but the author doesn't do themselves any favors by dramatically overstating the role of publication in knowledge sharing (informal channels & conferences exist, publication serves more of a recognition purpose), and with somewhat offensive, unsupported claims like,

"The stakes are high. If these inefficiencies can be removed, science would accelerate tremendously. A faster science would lead to faster innovation in medicine and technology. Cancer could be cured 2-3 years sooner than it otherwise would be, which would save millions of lives"

[+] Thrymr|14 years ago|reply
I agree strongly; conferences in particular are very important in this regard. Perhaps in some fields people are cautious about sharing full results at conferences, but in most it is very much encouraged and beneficial to get your ideas out to the community before a paper can come out. The "12 month time-lag" cited in the article usually includes at least one conference presentation, in which the main results can be presented, often receiving useful feedback that results in a stronger finished paper as well.

Could scientists make better use of modern communication media? Sure. I particularly wish more fields would adopt the arXiv model. But the peer-reviewed journal is not going to be displaced anytime soon; the best we can hope for is to make the process more transparent and open, more reflective of the interests of science and scientists, and less of those of the for-profit journal industry. I highly doubt that the problem will be solved by "science startups." There are large non-profit interests as well; I expect the progress to be made by individual scientists, universities, and professional organizations who have an interest in destroying the status quo.

[+] RichardPrice|14 years ago|reply
I don't think informal channels, and conferences, which are infrequent, and really expensive to travel to, are enough. Before the 1600s, science was largely done by wealthy people who had large enough houses to have a laboratory in. Scientific results weren't publicly shared; at best they were shared between the experimenter and a few of his/her friends, who communicated by private letters.

In the late 1600s, the first journal was founded, and it became the norm for scientific results to be shared publicly. This era coincided with the Scientific Revolution, an incredible flourishing of scientific thinking that formed the basis of modern science.

I think that with a much more connected scientific community, one that operated more as a global brain rather than as relatively disconnected nodes, scientific progress could double. So if cancer would normally be cured in x years, I can see that coming down to x/2 years with an accelerated science, and, given the length of x, that shortening is likely to be a matter of years.

[+] mturmon|14 years ago|reply
Indeed, the post completely overlooks conferences, which in many fast-moving disciplines (most of CS, e.g.) are really the main venue for presenting new work, superseding journal publications. Think ICML or NIPS in machine learning -- 6 months apart.

I question the need for a for-profit enterprise like the one the author promotes inserting itself into the research enterprise. Publishers are bad enough.

[+] reasonattlm|14 years ago|reply
This is a time for revolution in the methods and funding of science, long overdue and enabled by the internet. It will be a mix of removing barriers to entry, blurring the priesthood at the edges, publishing data openly and iteratively, and drawing crowdfunding directly from interested groups of the public rather than just from the traditional funding bodies.

Astronomy has long been heading in this direction, actually - it's a leading indicator for where fields like medicine and biotechnology are going. People can today do useful and novel life science work for a few tens of thousands of dollars, and open biotechnology groups are starting to formalize (such as biocurious in the Bay Area).

There is a lot of good science and good application of science that can be parallelized, broken up into small fragments, distributed amongst collaborative communities. The SENS Foundation's discovery process for finding bacterial species that might help in attacking age-related buildup of lipofuscin, for example: cheap, could be very parallel. In this, these forms of work are much like software development - consider how that has shifted in the past few decades from the varied enclosed towers to the open market squares below.

This greater process is very important to all of us, as it is necessary to speed up progress in fields that have great potential, such as biotechnology. Only a fraction of what could be done will be done within our lifetimes without a great opening of funding and data and the methodologies of getting the work done.

[+] fl3tch|14 years ago|reply
I agree with a lot of what you said, but this:

> People can today do useful and novel life science work for a few tens of thousands of dollars

makes me wonder if you've ever done bench work or furnished a lab. Sure, you can do a few weeks or months of work for tens of thousands of dollars (which doesn't produce a lot of useful results in that time frame, but can produce some), but that's assuming you have a functioning lab. It often takes $100K just to stock one, which is why most new investigators get special startup money just for that. Producing useful results often takes years at a rate of at least $100K a year. Equipment and reagents (especially enzymes) can be expensive. $300 for a gel box here, $1000 for a pipette there, $5000 for a thermocycler, $2000 for an enzyme: it adds up. And that's not even getting into salaries. Most people don't work alone.
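The rough arithmetic above, tallied (all figures are the commenter's ballpark examples, not real vendor prices):

```python
# Itemized equipment costs named in the comment: illustrative only.
lab_items = {
    "gel box": 300,
    "pipette": 1_000,
    "thermocycler": 5_000,
    "enzyme": 2_000,
}
equipment_sample = sum(lab_items.values())   # 8,300 for just four items
startup_stock = 100_000                      # typical lab-stocking startup money
annual_burn = 100_000                        # per-year rate to get useful results
three_year_project = startup_stock + 3 * annual_burn
print(equipment_sample, three_year_project)  # 8300 400000
```

Even before salaries, the "few tens of thousands" figure covers only a small slice of a multi-year project.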

I think it would be incredibly difficult to crowdfund science a la something like Kickstarter, especially given the amount of money currently spent on science (about $50 billion annually in the US alone). But maybe someone on HN will be the person who proves me wrong.

[+] RichardPrice|14 years ago|reply
That's a cool point about parallelization. There was a fascinating experiment done a few years ago by the mathematician Tim Gowers, called the 'Polymath Project', where he took a problem in math and asked the mathematicians who read his blog to solve parts of it. 40 people took part, and 7 weeks later Gowers announced on his blog that the problem was 'probably solved'. A couple of papers came out of it, published under the name 'D.H.J. Polymath'.

More info on this is on the Wikipedia page http://en.wikipedia.org/wiki/Polymath_Project

You're right, it would be cool if we could see more of this kind of thing happening. There is now a whole site dedicated to applying parallelization to other problems in math http://polymathprojects.org/

[+] timdellinger|14 years ago|reply
The problem with distributed, grass roots peer review is that you get poor quality reviewers. The current structure is slow and very "old media", but it is this way because it's the only way to guarantee quality peer reviews.

If journals cease to exist, and a new publish-it-anywhere-then-publicize-it paradigm emerges, along with some associated metrics (kinda sorta like Reddit), then I predict that conference presentations will become the new metric of success. They have gatekeepers, and scarcity due to limited bandwidth (i.e. there are a limited number of time slots available). The whole journal publishing infrastructure will just be shifted over to conferences... along with the ecosystem of for-profit vs. trade group, etc., and the Slowness and Single Mode of Publication problems that the OP describes.
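For concreteness, the "kinda sorta like Reddit" metrics could resemble Reddit's well-known 'hot' ranking, transplanted hypothetically onto preprints: a log-damped vote score plus a recency bonus (the epoch and 45,000-second divisor are Reddit's own constants, reused here purely for illustration).

```python
import math
from datetime import datetime, timezone

EPOCH = datetime(2005, 12, 8, 7, 46, 43, tzinfo=timezone.utc)

def hot(upvotes, downvotes, posted_at):
    # Net votes enter logarithmically: the first 10 votes count as
    # much as the next 90, damping pile-on effects.
    net = upvotes - downvotes
    order = math.log10(max(abs(net), 1))
    sign = 1 if net > 0 else -1 if net < 0 else 0
    # Linear time bonus: a newer submission outranks an older one
    # unless the older one has roughly 10x the votes.
    seconds = (posted_at - EPOCH).total_seconds()
    return round(sign * order + seconds / 45000, 7)
```

Note how this bakes in exactly the scarcity-free, gatekeeper-free dynamics the comment is skeptical of: whatever biases drive the votes drive the ranking.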

[+] thisisnotmyname|14 years ago|reply
"The norms don’t encourage the sharing of an interactive, full-color, 3 dimensional model of the protein, even if that would be a more suitable media format for the kind of knowledge that is being shared."

This is simply wrong - when you solve a protein structure it is mandatory that you submit it to the PDB (e.g. http://www.pdb.org/pdb/101/motm.do?momID=148), and nearly every journal I read has both color figures and extensive online supplementary materials.

[+] RichardPrice|14 years ago|reply
You're right, there are some trends in the right direction, which is terrific. But these are early. By and large scientists only get credit for publishing papers, which means they aren't taking advantage of the full interactive power of the web.

It's rare for scientists to share things like data sets, or a video of a physical process as it unfolds. Most graphs and tables in scientific papers are non-interactive: you can't change the x and y axes, or other properties of the graph, as you can with graphs in Google Analytics, or with data displayed for native web consumption generally. Similarly, the code that scientists run on their data sets, which generates the conclusions that end up in their papers, doesn't get shared.

I think the key to opening up richer sharing is to provide credit metrics that incentivize this kind of activity. When scientists can get credit for sharing data-sets, code, videos, and a wider array of rich media, they will start sharing more, and taking greater advantage of the rich media power of the web.

[+] archgoon|14 years ago|reply
No 3d models for new proteins?

The protein databank exists precisely for that reason.

http://www.rcsb.org/pdb/home/home.do

[+] rflrob|14 years ago|reply
More broadly, lots of data that doesn't lend itself to a single figure has either repositories (like the Gene Expression Omnibus) or supplemental attachments to the paper that can be included.
[+] wiggins37|14 years ago|reply
I'm glad that the author is thinking about ways to increase communication between scientific authors, but some of the statements he made, specifically regarding "curing cancer 2-3 years sooner," make him sound ignorant of some of the challenges facing researchers. Not all scientific knowledge is presented only through journal articles. As others have already mentioned, conferences with "poster presentations" are pretty common in medicine for discussing ideas before the paper comes out. In addition, labs across the country working on similar problems often exchange ideas and substrates by email and mail respectively. I agree that it would be great if there were a more centralized online repository of information. If anyone has experience with blogs, forums, or websites specifically addressing oncology (that are not just press releases), I would appreciate learning about them.
[+] tel|14 years ago|reply
> Imagine if all the stories in your Facebook News Feed were 12 months old. People would be storming the steps of Congress, demanding change.

To play devil's advocate: the time lag forces your conversations to strive for a higher standard of quality, comprehensiveness, correctness, and context than Facebook updates could ever meet.

Then again, striving for that higher standard also breeds pseudoscience, bad statistics, and outright fraud.

In short, I don't think the solution is to replace the paper with something instantaneous. I agree that instantaneous (public) communication could be better used in the academic community, but there's a trend that way already as blog posts begin to signal a certain kind of good advisor.

I especially don't agree that search engines have any business replacing peer review.

[+] stephenhandley|14 years ago|reply
Many of the new approaches to science publishing I’ve seen haven’t done enough to directly address the silo problem or provide significantly improved alternatives. I suspect this is primarily because they’re trying to create viable scientific publishing businesses of their own. I believe taking a different approach around free, distributed, open source publishing and aggregation software would be better suited to transforming scientific communication into a more open, continuous, efficient, and data-driven process.

more here: http://tldr.person.sh/on-the-future-of-science

[+] kirk21|14 years ago|reply
It always strikes me how much time it takes to finish a paper (e.g. fitting the conference template, correcting spelling mistakes, formatting figures, etc.) when you could outsource this. Student assistants are an option.

Furthermore it would be nice to discuss your ideas without having to spend months writing a paper. Guess there is a difference between alpha and beta sciences.

Finding relevant conferences is a challenge as well (since I'm still a junior researcher).

[+] mukaiji|14 years ago|reply
I recently had a chit-chat with a 5th-year PhD friend of mine in front of the Stanford bookstore. We both did academic research, and both know all too well the incredible frustration of tech not having fully penetrated academic research. If you'd ever like to work toward making research faster, reply to this. We could think about a couple of things and start cranking out some solutions.
[+] janardanyri|14 years ago|reply
I did just enough academic research to discover that real progress could be more effectively pursued elsewhere. I would love to explore this; janardan.yri at me.com.
[+] RichardPrice|14 years ago|reply
mukaiji - I would love to chat to you. Drop me a line at richard [at] academia.edu.
[+] eli_gottlieb|14 years ago|reply
Yo. My email address is in my HN profile, and above I was thinking about the idea of adding an "automated literature reviewer" to paper submissions, which would "advertise" relevant papers the way Google tries to advertise results relevant to search queries.