
The Scientific Paper is Obsolete (2018)

222 points | PhilipVinc | 4 years ago | theatlantic.com

162 comments

[+] sam-2727|4 years ago|reply
I think this is a great idea theoretically, but in reality for most papers I don't want to see the data/underlying code. While it would be great to publish data/code with the paper (in the field I've worked on the most, astronomy, most data is already published with the paper anyways), I don't want/need to look through a notebook with the underlying code of the paper in order to just read the intro/conclusions (and maybe one key methods section). Interactive figures are a great idea, but again, oftentimes I don't really care to interact with the figure, or fiddle stuff around, I just want to know why the paper is important and how I should use its conclusions. The two-column format of most papers is very useful for skimming. So instead I would argue notebooks shouldn't replace papers, but supplement them (as they sometimes do already, in fact, but perhaps journals could make it an actual requirement to create a supplementary notebook).

As the article mentions, scientific fields are gigantic nowadays, and skimming papers is critical when you're citing 100+ references in your paper.

[+] nextos|4 years ago|reply
EMBL-EBI and others had some RDF-related effort to provide machine readable abstracts, which I thought was a really cool idea.

IMHO, the biggest problem with papers is politics and reviews. In many top journals like Nature there's no double-blind review (actually in Nature it's now optional but big groups never use it). And even if there was double-blind review, referees have no skin in the game. So the usual outcome is to get reviewed by a big name in your field, who is actually interested in controlling research trends and killing "competitors".

This is hindering progress and hurting new ideas. For example, proponents of Alzheimer's disease being caused by an infection or dysbiosis have had a hard time doing research, getting grants and publishing articles during the last 2 decades, even though their theory explains the etiology quite well, unlike competing alternatives.

Another problem is that to publish in good journals you need cool results. Cool results are rare, but Nature, Science, Cell et al. are full of articles every month. So, most groups are overselling and misreporting things. Research fraud, p-value hacking and data manipulation are really common.

[+] bloaf|4 years ago|reply
You should always want to have the underlying code available. Without the exact procedures they used to process their data, the only kind of "using their conclusions" you can do is the superficial "take it at face value" kind. So many important details get hand-waved away in papers that say things like "we used the well known blahblahblah method to analyze the data."

If you do it right, the code should in no way interfere with your ability to read abstracts.

[+] TuringTest|4 years ago|reply
The point of interactive notebooks is not seeing and having access to all the data - it's seeing the abstractions at work, having a direct grasp of how they act on particular examples as an aid to understand their formal definition.

Nothing prevents you from having two-column notebooks, if you find that advantageous, as well as abstract and conclusions sections. The part that you don't get with static paper is that of navigating the abstraction ladder[1] up and down with direct manipulation aids, instead of having to work it all in your head or by following dense detailed paragraphs.

[1] As also explained by Bret Victor in http://worrydream.com/LadderOfAbstraction/

[+] Helmut10001|4 years ago|reply
I made an experiment with my last paper: write everything from scratch in Jupyter Notebooks, including data preprocessing and generation of all figures (10 notebooks in total). The conceptualization started in 2017, and we just submitted it 2 weeks ago (it got desk rejected for not fitting the journal's topic).

I learned a lot and it was definitely worth it. The next paper will be easier with this knowledge. Nonetheless, there is an overhead, and I feel that this overhead is not valued with the current makeup of journals, where you really need to dig deep to find any supplementary materials.

[+] GracefullyBlind|4 years ago|reply
I think the ability to focus on the parts of a paper you care about most is what would be more beneficial for all readers. You care more about an overview? You can easily find it (perhaps with graphics and walkthroughs). You care more about proofs? Then you can get them. What about code and experiments? And so on and so forth.

Readability and scalability are about making all this data available in the publication record, but easy to navigate for whoever is looking for whatever.

[+] PhilipVinc|4 years ago|reply
Papers today are longer than ever and full of jargon and symbols. They depend on chains of computer programs that generate data, and clean up data, and plot data, and run statistical models on data. These programs tend to be both so sloppily written and so central to the results that it’s contributed to a replication crisis, or put another way, a failure of the paper to perform its most basic task: to report what you’ve actually discovered, clearly enough that someone else can discover it for themselves.
[+] rtkaratekid|4 years ago|reply
I think this almost every time I read the paper. It’s like Linus’ “show me the code.” I just want papers now to “show me the data and the code.” And include a discussion about why these results are important. I think it’s a great time for the scientific community to improve transparency on these fronts.

Sincerely, someone who reads a lot of research but contributes none because I’m an amateur.

Edit: when I say data, I mean the raw data.

[+] 14|4 years ago|reply
I remember in grade school reports seemed so logical. Propose something, create a hypothesis for what you think you will see, record your data and what you observe during the experiment, summarize the results as to what actually happened vs. what you initially expected. Most papers now seem like a foreign language, and I can only glimpse what is happening, relying on some math genius to explain the significance.
[+] haihaibye|4 years ago|reply
One of the first suggestions I have is to use source control and store the Git hash of code used to generate data. A few times I've heard back "we don't have time for that" - pretty easy to see how the replication crisis flows from processes like that.
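
The suggestion above, recording the exact code version alongside generated data, takes only a few lines. A minimal sketch (the `run_metadata` helper name and the field names are invented for illustration):

```python
import datetime
import json
import subprocess

def run_metadata():
    """Collect provenance for a data-generating run: code version + timestamp.
    (`run_metadata` is an invented helper name, not from the comment above.)"""
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], stderr=subprocess.DEVNULL
        ).decode().strip()
    except (subprocess.CalledProcessError, OSError):
        commit = "unknown"  # not a git checkout, or git not installed
    return {
        "git_commit": commit,
        "generated_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

# Write this next to every generated data file, so any result can be
# traced back to the exact code that produced it.
print(json.dumps(run_metadata(), indent=2))
```

Dropping a JSON file like this beside each output costs nothing at run time, which makes the "we don't have time for that" objection hard to sustain.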
[+] ketozhang|4 years ago|reply
Certain scientific software packages (e.g., Tensorflow, pymc3, etc.) do have frameworks that you follow to return pipeline and result objects conforming to some data model that others can learn quickly (e.g., an arviz::InferenceData result object). I wish there were a more extensive framework where this is applied end-to-end: from data input, through the library components in a pipeline, to the result, and then to the plot.
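
A toy sketch of that end-to-end idea: a result object that carries its inputs, processing history, and outputs together, loosely in the spirit of arviz's InferenceData (all class and function names here are invented):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineResult:
    """Self-describing result object: inputs, processing history, and
    outputs travel together (names invented for illustration)."""
    inputs: dict
    steps: list = field(default_factory=list)
    outputs: dict = field(default_factory=dict)

def apply_step(result, name, fn, out_key):
    """Run one named transformation and record it, so the provenance of
    every output is readable from the result object itself."""
    result.outputs[out_key] = fn(result.inputs)
    result.steps.append(name)
    return result

res = PipelineResult(inputs={"xs": [1.0, 2.0, 3.0, 4.0]})
res = apply_step(res, "mean", lambda d: sum(d["xs"]) / len(d["xs"]), "mean")
print(res.steps, res.outputs)  # ['mean'] {'mean': 2.5}
```

Because every output is tied to a named step, a reader (or a plotting layer) can inspect how each number was produced rather than trusting a bare figure.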
[+] _Microft|4 years ago|reply
Papers aren’t pop-sci articles; they do not target an audience that does not know anything about the field yet. They are from experts, for experts. If someone wants to familiarize themselves with the language, symbols and methods of a field, a textbook is a better place to start. Over time they will also learn the shared knowledge of the field that isn’t even mentioned in these articles.
[+] fho|4 years ago|reply
Worse, a lot of researchers actually have only the slightest grasp of statistics. To the point that I would assume that a lot of (1 in 20? :-)) papers contain an error in the statistical analysis of their results.
[+] johnsutor|4 years ago|reply
I feel like the website paperswithcode.com addresses this very well, especially with their feature "quick start in Colab". For example, here's the top paper on the website as of now: https://paperswithcode.com/paper/towards-real-world-blind-fa.... Instead of going through the process of cloning a repo, initializing a fresh Anaconda environment from scratch, reading through nebulous, haphazard documentation about how to download the necessary training data, and then converting that training data into a format that's compatible with the code, I just click a link and run a couple of lines. Bam. I have an intuition about the code that's 100x better than reading the paper alone. Even though Colab isn't applicable to all fields and is largely used by the Data Science and Computer Science community, it is a promising step at modernizing science, especially the replicability of discoveries.
[+] musicale|4 years ago|reply
It's really disappointing that technical societies like the ACM and IEEE haven't done this already.

For many journals and conferences there isn't even a way to submit the code or other digital artifacts with the PDF. A few have badging for whether digital artifacts are provided and whether the results have been reproduced or repeated by others - steps in the right direction at least.

As much as I intensely dislike their practices of overcharging for journals and milking digital library subscriptions to fund administrative overhead, the technical societies are technically non-profits and exist to serve their members and the research and professional community. This is really something they should be doing.

[+] williamkuszmaul|4 years ago|reply
In my field, at least, I think the problem is less about the medium, and more about the incentives. Researchers are incentivized to write papers that seem impressive (and intimidating) rather than clear and intuitive.

To make matters worse, this is an evolved trait: researchers whose papers are intimidating are more likely to succeed, which means they're more likely to have future PhD students, which means that the style of writing is more likely to get passed on.

I think the main way to address this is to change the incentives. In particular, by creating publication venues that value simplicity and clarity (one such conference is SOSA, which has had a lot of impact on theoretical computer science in the last few years).

[+] diognesofsinope|4 years ago|reply
> Researchers are incentivized to write papers that seem impressive (and intimidating) rather than clear and intuitive.

Ah, a fellow economist lol. Lack of clarity is a strategic advantage because (1) (as you said) it looks impressive and (2) it's hard to validate that it's correct.

So many papers contain elementary statistics mistakes such as survivorship bias, e.g. 'returns to education' is almost exclusively measured by asking individuals who graduated (on average 50% of enrolled students don't) and who respond to surveys (a good chance of bias).

Pubs are how you get jobs. It's not about science anymore, it's about navigating bureaucracy for an elite job.

[+] derekpankaew|4 years ago|reply
I hope the "trend" of researchers writing for the public, via books & blog posts, can contribute some incentive towards being clear and intuitive.

I'm thinking about the success of Freakonomics, Thinking Fast and Slow, The Elegant Universe, etc. These are all academics, who've "translated" their research for the masses. That translation ended up being much more impactful - and prestigious - than an intimidating paper.

I hope this becomes more of a trend, and an incentive structure, in the future.

[+] queuebert|4 years ago|reply
As a practicing scientist, I firmly believe the world would be much better off if we simply published version-controlled Jupyter notebooks on a free site, such as GitHub or ArXiv.
[+] imranq|4 years ago|reply
This is a little exaggerated. Most papers have to be somewhat readable to be accepted into journals and notable conferences. The fact that a layman cannot understand an advanced biology paper is nothing new. I'd wager the papers of the scientific "golden age" the author cites as being so readable were not very readable for the general public of their time. It's just that we are taught those things in elementary school, and so can see the concepts in those older papers much better than the people of the time could.
[+] csours|4 years ago|reply
Another way to say this is "The UX of Scientific Papers is Poor"

It's not hard to imagine a better UX - for your field, what are the top 5 questions you want answered before you start reading?

Eg: Sample size, Funding, etc. Put those at the top of the paper with symbols.

[+] serverlessmom|4 years ago|reply
I definitely agree with the author that a major gap has continually failed to be bridged between the scientific community and the general public: a true understanding of the field of science as a whole. Even the way most people casually throw around the term "research" illustrates this problem: a genuine lack of understanding not just of complex scientific and mathematical models, but of science as an industry and as a tool for understanding life on this planet.

Nonetheless, I wonder whether the goal should be to make it so any person could understand a complex paper. Should all people strain to understand every study? There are experts in certain fields for that very reason. It is not always possible to accurately explain higher-level concepts to people who lack foundational knowledge that can take years to accrue. I am not certain that changing papers to be more interactive is going to bridge the gap as this author hopes, or that it is even the goal that should be pursued.

[+] xipho|4 years ago|reply
It's easy to say things are obsolete when you are your own publisher (Victor, Wolfram, Perez) or when you can suggest your favorite (even if very cool) approach (Jupyter Notebooks) as a potential key solution. If you can't be your own publisher, it's a much more difficult proposition.

We're trying to figure out how to facilitate taxonomists publishing their own taxon pages, i.e. species descriptions, from a science 250+ years old. Our MVP use case is ~20k pages, one per species, for one project. There will be many of these projects, though maybe not many with 20k pages, and some with much more than 20k pages. Updates are needed with as little latency as possible, with data from multiple sources. There has to be basically zero cost to serve these (I know, nothing is free). Sites must be trivially configurable (e.g. clone a GH template repo and edit a YAML file and some markdown). Even if we can get this infrastructure in place, we then have to figure out how to get the social structure in place to have this type of product recognized as equivalent to traditional on-paper publishing, i.e. advance people's careers because they "published".

In my field, until we give people the power to publish on their own, I don't see traditional publishing going away. Many in the past have indeed published (traditionally) their own species descriptions on their own dime, meeting the rules within the various international codes of nomenclature. I also don't have a problem with dead wood: if we go digital too fast we will lose so much, for any number of reasons associated with the ephemeral nature of electron-based infrastructures.

[+] otrahuevada|4 years ago|reply
I myself enjoy reading papers.

At least after I transfer them to a single column, 14/16pt tall font with real headers, that is.

The graphic format itself is dated and annoying, yes, but I find the expositional tone and immediately searchable references pretty cool.

[+] shadowfox|4 years ago|reply
How do you go about converting the papers?
[+] GracefullyBlind|4 years ago|reply
Great read. Drawing on my field (CS), I think a lot of papers suffer from the idea that you are only supposed to show "the interface": what the result is, what you achieved. The "how" or the "why" are sometimes neglected, regarded merely as a "technicality" to account for the rigorous mathematical framework that "must be there".

There is little effort put into making results understandable and easy to replicate. Academia values paper production, which requires convincing peer reviewers that your results are not trivial and are worth publishing. Contrary to what the essay states, I don't think many scientists today consider their research "incremental". In fact, this word is used in many places as a derogatory term indicating that a certain result doesn't contain enough novelty to deserve publication. Researchers are more incentivized to make their constructions and results as complicated and inaccessible as possible.

This is not just a theory, this is something I've seen over and over throughout the years.

[+] multilogit39|4 years ago|reply
As an academically employed scientist of 20 years, the notion that scientific communication suddenly needs better standards puzzles me. The core research curriculum of nearly every scientific field I’ve seen, STEM or otherwise, is that the data needed for replication are non-negotiable. A paper that doesn’t include them would be desk-rejected by any editor. Or one would hope. This is taught at the UNDERgraduate level, for heaven’s sake.

The thought clusters emerging from the recent “replication crisis” are a fascinating rabbit hole to crawl into. If you stay near the surface, you will find mostly young scholars cheerleading open science as the obvious solution to replication difficulties. The concepts of pre-registering your study, committing to sharing data, and publishing online are all various components of this idea, varying in their necessity by the author’s devotion to their cause.

But there are several downsides to such a system that aren’t immediately obvious. For example, does the skill set of the successful scientist broaden to include how skilled they are at poaching ideas from public data that its own authors hadn’t yet seen?

Some of the more recent criticisms invoked the spectre of “platform capitalism”, and suggested the Facebook and Linkedin-ification of science by dumping all its data on a centralized platform would likely have a net negative effect.

This article was written in 2018, and most of the discussions I’ve read since then have suggested that the open science initiative has failed despite the rapid penetration of Jupyter and visualization tools in the scientific process. Perhaps, like most things, the unseen market will pick and choose the good out of the dubious.

[+] BeetleB|4 years ago|reply
> The core research curriculum of nearly every scientific field I’ve seen, STEM or otherwise, is that the data needed for replication are non-negotiable. A paper that doesn’t include them would be desk-rejected by any editor. Or one would hope. This is taught at the UNDERgraduate level, for heaven’s sake.

This may vary based on discipline, but in both the subdisciplines of experimental and theoretical physics I was involved in: No - very few will provide the data/derivation. My professors were very open about this: They don't want to lose their competitive edge. Almost no experimentalist I knew could take papers from his/her field and reproduce the results, because the papers lacked enough detail to do so. They would mention a technique, but there are lots and lots of nuances involved when building equipment to carry out the technique[1], and these are intentionally excluded. It's unlikely you'll be able to build the equipment the same way the original authors would.

[1] Most experimental physics involves building your own equipment, or at the least modifying existing equipment.

[+] jiggunjer|4 years ago|reply
Poaching results from public data is an oxymoron. Poaching ideas from data isn't even possible.
[+] Ice_cream_suit|4 years ago|reply
So I am speed reading this: DNA Methylation and Protein Markers of Chronic Inflammation and Their Associations With Brain and Cognitive Aging https://n.neurology.org/content/97/23/e2340.abstract

I read the abstract, in reverse order. Discussion and Results first.

If it seems plausible ( much research is trash, churned out to pad a CV ) and interesting, then I scan the Methods section.

This heuristic helps classify 95% as utter trash or outside my current area of interest in less than 10 seconds.

If it seems seriously interesting, I then read the full text the same way ie: backwards.

At no point do I want pretty visualisations made by wannabe PhD candidates, full of misinterpretations, wishful thinking or outright fraud.

[+] ketozhang|4 years ago|reply
> At no point do I want pretty visualisations made by wannabe PhD candidates

See Figure (5). Your argument doesn't really counter any part of the scientific notebook. A notebook will still have the abstract and conclusion (results & discussion). The tools mentioned in the article describe how to restructure the methods, data, and figures. You're not going to look at those anyway until the abstract and conclusion intrigue you.

[+] whatshisface|4 years ago|reply
>The earliest papers were in some ways more readable than papers are today. They were less specialized, more direct, shorter, and far less formal. Calculus had only just been invented. Entire data sets could fit in a table on a single page. What little “computation” contributed to the results was done by hand and could be verified in the same way.

This is not even close to true. Look up Tycho Brahe's observations, and he's only the earliest I can think of.

[+] popcube|4 years ago|reply
Put another way: if you read early articles in biology, they are very easy to read, and funny! E.g., Wallace's articles read like travel notes. But using them in research is just a nightmare...
[+] agnosticmantis|4 years ago|reply
Computational notebooks are great, but only seeing what the author (or coder) wanted you to see is not enough to evaluate their work. In addition to data and code, we need to see the path they took, the exploration and experimentation that led to the final presentation of ideas in the notebook.

For this we can use cloud based environments controlled by funding agencies/universities that ensure every interaction with data is recorded from the very beginning.

Something like this would at least reduce the risk of p-hacking practices that would otherwise be there even if everyone used notebooks instead of papers.
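
As a toy illustration of recording every interaction with the data: a dataset wrapper that logs each read might look like this (the `AuditedData` name is invented; in a real managed environment the log would live server-side, out of the researcher's reach):

```python
import datetime

class AuditedData:
    """Dataset wrapper that logs every read: a toy version of the
    'record every interaction with the data' idea (name invented)."""
    def __init__(self, values):
        self._values = list(values)
        self.log = []  # in a real system this would be an append-only server log

    def get(self, i):
        # Record who touched what, and when, before returning the value.
        self.log.append({
            "op": "get",
            "index": i,
            "at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        })
        return self._values[i]

data = AuditedData([0.3, 0.7, 0.9])
value = data.get(1)
print(value, len(data.log))  # 0.7 1
```

A reviewer could then count how many analyses were tried against the data before the published one, which is exactly the information p-hacking hides.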

[+] periheli0n|4 years ago|reply
As a publishing scientist myself I would have hoped to read more about how I can actually publish Jupyter Notebooks in a way that is recognised academically. That's at least what the title implied for me.

But the article is actually about Mathematica vs. Jupyter notebooks. Still, it's well researched and very interesting.

Nevertheless, the question of how to publish better remains open. I for one think that some progress could already be made if ArXiv published HTML articles by default, rather than those unwieldy PDFs that really only work well when printed on paper.

[+] tpoacher|4 years ago|reply
I don't disagree with the main point of the article, but I think it underplays the extent to which most publications being information-poor is the fault of the low standard of writing that we've become accustomed to and complacent about over the years, rather than of the medium.

This is not too unlike maintainable code. The code platform itself matters to some extent, but far less than the extent to which the author wrote with maintainability in mind.

[+] bborud|4 years ago|reply
Years ago I did a lot of original research in a field that wasn't very well developed at the time. However, that research took place in the context of a startup, not academia. I was rewarded for producing solutions that worked - not for publishing papers.

I was approached by a few academics about publishing what I'd worked on, but I never did, because I consumed large stacks of papers every month and absolutely hated the pompous, obfuscated portioning out of ideas fragment by fragment. It was an unnecessarily time-consuming, and often quite useless, way of sharing information. Especially since source code often wasn't part of what was published, so a lot of important information got lost (which I guess was the entire point of not publishing code).

I particularly remember a 4-5 page paper that was so poorly written it took me a couple of readings (weeks apart) to realize that it described something I too had worked on. How bad is a paper when it is so obfuscated that it takes effort to recognize something you have worked on too?

I wasn't interested in wasting time dressing up my notes in drag. And if my notes as they were were not good enough, well, then someone else would surely do the same work independently and publish something at some point. Lots of the things I worked on inevitably were described by other people.

I have a love-hate relationship to scientific papers for the simple reason that they sometimes aren't really about science, but about scoring points in academia and certain types of research organizations. Yes, a lot of interesting goodies are published, but my god there is a lot of garbage that gets published. Not least because people in academia are incentivized to get as many papers as possible out of what ought to be a single publishable unit.

If we incentivize authors to spam us, they will spam us.