Why scientists need to be better at data visualization

[+] stared|6 years ago|reply

When I was a Ph.D. student I was surprised that in academia (my field was: quantum information theory) the majority of researchers put zero effort, and thought, in presenting data in a clean and clear way. (As a reaction I devoted a section in my thesis to say what is data vis.) One my day "but it is for other specialists". Nope, it is not the problem. Of course, every data vis should be tailored for a specific audience. But there is typically zero though put their either. See section 3.1.2 from https://arxiv.org/abs/1412.6796.

From a more recent one ("Simple diagrams of convoluted neural networks", https://medium.com/inbrowserai/simple-diagrams-of-convoluted...):

"[In academic] research, visualization is a mere afterthought (with a few notable exceptions, including the Distill journal, https://distill.pub/). One may argue that developing new algorithms and tuning hyperparameters are Real Science/Engineering™, while the visual presentation is the domain of art and has no value. I couldn’t disagree more! Sure, for computers running a program it does not matter if your code is without indentations and has obscurely named variables. But for people — it does. Academic papers are not a means of discovery — they are a means of communication."

[+] amitport|6 years ago|reply

In my experience, it's just about cost-effectiveness. Proper visualization is hard, researchers have a lot of ground to cover and theory and proper experiments are just more important (not only for journal reviewers). Sure, if have a big research group with man hours to spare someone should focus on that, but it's rare to have that capacity.

[+] jacobolus|6 years ago|reply

Friendly note: That arXiv paper would benefit from better color choices in its diagrams. HSL/HSV models should generally be avoided in all contexts.

Beyond that, the paper colors reference links #0000ff (neon green). This makes them illegible and unpleasantly distracting. Try a darker (and less intense) color to improve legibility. The bright red links are also a bit distracting, but that’s more a matter of personal preference.

There are some good resources in https://courses.cs.washington.edu/courses/cse512/19sp/ which I just ran across recently. e.g. http://www.personal.psu.edu/cab38/ColorSch/ASApaper.html

[+] jofer|6 years ago|reply

A key aspect that this misses is the target audience. The general public is not the target of most scientific visualizations. Instead, it's other scientists in the field. There are invariably lots of conventions and general historical trends that it's important to follow to communicate clearly.

For example, in seismology high sonic velocities are always shown in blue/cool colors and low sonic velocities are always shown in red/warm colors. This is non-ideal for colorblind users and confusing for other audiences (red != high value), but it's a near-universal convention. We can use a more perceptually uniform red to blue colormap, but keeping the warm/cool convention is very important.

If your audience is used to seeing a certain type data in a particular way, you'll confuse people if you don't follow that convention. Clear labels and legends are great, but people follow convention first, labels second.

When the goal is clear communication, following established convention is much more important than making a "better" visualization. That's not to say that these guidelines aren't very important, just that it's vital to keep your audience's expectations in mind.

> "Scientists also tend to follow convention when it comes to how they display data, which perpetuates bad practices."

In my experience, scientists follow convention because it enables clear, consistent communication with other scientists in the field. Violating convention is okay, but should be done with utmost care and the knowledge that you'll need to spend more time explaining things.

[+] mnky9800n|6 years ago|reply

the article even talks about cultural expectations making visualizations harder to read if the cultural expectations are violated. Except that the article completely ignores your point. Science has a culture and violating that culture to make a "better" visualization must be done with care.

[+] b0rsuk|6 years ago|reply

If scientists can't be bothered to read the legend first, who can?

[+] rflrob|6 years ago|reply

While I absolutely agree that scientists need more formal training in data visualization, the claim that "few scientists take the same amount of care with visuals as they do with generating data or writing about it" doesn't ring true to me about the scientists in my field (genetics/genomics). There is a widespread recognition that effective figures are the strongest way to communicate your message, the question is what is the the best way to create those effective figures. Part of the problem is that as datasets get bigger, it's rarely sustainable to put a lot of care into each and every version of a plot, but automating creation of figures is really hard.

If I were going to start designing a course in creating scientific figures, I think I'd have a roughly even split between the psychophysics of visual perception (e.g. distinguishing between similar quantities of lengths/angles/colors/etc; designing for color-blind readers; ) and hands-on work in a real programming environment turning data to figures.

[+] Thriptic|6 years ago|reply

I agree with everything you said and can confirm that in our (translational vascular bio) lab the bulk of effort spent on paper drafting was concerned with creation of high quality figures. Many scientists (myself included), will read a paper abstract and then head straight for the figures as they usually contain the highest density of data for the reader.

[+] jfim|6 years ago|reply

> Part of the problem is that as datasets get bigger, it's rarely sustainable to put a lot of care into each and every version of a plot, but automating creation of figures is really hard.

That's actually a good reason to learn R and the ggplot2 package. Whenever I write a paper, what I do is that I make a quick shell script that invokes Rscript, with a simple R program that takes a CSV file and outputs a PDF file of the plot, which can be automatically loaded in LaTeX.

Whenever the data changes, it's just a matter of updating the CSV file and running the script that rebuilds the figures and the LaTeX document. As an added bonus, it makes keeping the data with the paper easy, since they're part of the same source control repository.

[+] puttermesser|6 years ago|reply

Relatedly, here’s some cool visualization work that comes from genetics data viz.

http://scalable-insets.lekschas.de/

[+] noobermin|6 years ago|reply

I'd say there is a range, as there always is. I've seen fantastic and clear visuals of FDTD simulations of laser interactions in my field in 3d, then, I've seen jet used to pcolor data that ranges from positive to negative. I would say though that a good enough fraction (not sure if it's greater than 50% but I wouldn't be surprised if it was) do care about making good figures.

[+] psalminen|6 years ago|reply

The course you are describing is one which I took. It was called "Scientific Visualization", and was a CS course mainly taken by science majors. The meta-information describing color schemes and scales was by far the most interesting part of it.

[+] hinkley|6 years ago|reply

Everyone needs to be better at data visualization. If for no other reason than to be an informed consumer. If Mark Twain had lived through informatics I'm sure it would be four ways to lie: "Lies, Damned Lies, Statistics, and Charts"

There are a lot of bullshit techniques used in data viz that are tantamount to lying and people are often sincerely shocked when you call them on it.

"I didn't do that on purpose," as if they didn't learn when they're 4 that it doesn't matter if you meant it if you did it. You still have to apologize and try to make amends.

Things people do from ignorance or malice:

Remove the origin on the graph and the relative height of the lines is skewed. 3d pie charts are 'larger' on the bottom half. Circle charts conflate diameter with area (humans are bad at judging area). On a log scale plot, a fat enough line can make anything look like a trend, because the end of the line is literally orders of magnitude wider than at the origin.

Friends don't let friends use any of these techniques, and the jaded instantly distrust anyone who is using them.

The used bookstore on the closest university campus has an entire shelf full of old editions of Edward Tufte books. I don't know if you'll be so lucky, but it's well worth a shot.

[+] stefco_|6 years ago|reply

I agree that these techniques are often used in popular media in deceptive ways, but a couple of those plotting techniques have valid use cases.

Specifically, if you're really concerned about a delta, removing the origin is good for visualization; it lets you just see the difference between two trends (in particle physics, you can show an energy excess this way). Likewise, for exponential phenomena, log plots are the correct choice, since uncertainty will be magnified for larger values. In both cases, of course, you need to include error bars, but that is always true. But I don't think you can "instantly distrust" someone who is using these techniques when they are valid.

[+] knzhou|6 years ago|reply

You should be specific about who you’re calling out. I’m sure this is good advice for journalists making infographics for the public, but it isn’t for scientists talking to other scientists.

I design my plots for readability and in the process regularly break every rule you listed here, because I trust my audience (other scientists) to understand how to read axes. Yes, it might be confusing to somebody who doesn’t read graphs with axes that often, but if I optimized my papers for their convenience, they would all have to be 100 pages long.

[+] mattkrause|6 years ago|reply

Dogmatically insisting on origins-at-zero is silly: it depends too much on the data, the message, and the audience.

For example, my lab studies the electrical activity of brain cells. A neuron normally sits at -70 mV relative to the extracellular space. When it fires, it can briefly cross 0 mV (though it probably won’t reach +70 mV) and sometimes you might want to show that complete spike. Other times, subthreshold changes in the cells’ activity are more important to a hypothesis, and then you’ll zoom in on a smaller range (maybe -90 to -55 mV). Neither of these situations would benefit from a plot centered at zero; in fact, it would look decidedly odd.

(Your other comment about log-log plots also seems odd to me, because the width of the line is almost never meaningful, unless there’s something like a confidence/credible interval band, which case that’s the whole point.

[+] dredmorbius|6 years ago|reply

Sorry, can you explain / show an example of the log plot issue?

[+] leeoniya|6 years ago|reply

agreed. in working on uPlot, i've explicitly tried to stay away from misleading, hard to interpret, but pretty shit like line smoothing, stacked areas, etc.

https://github.com/leeoniya/uPlot#non-features

[+] nycticorax|6 years ago|reply

This is like a non-programmer telling programmers they need to comment their code better. In a sense, it's probably true most of the time. But it's also sort of hard to take very seriously, because the non-programmer doesn't have any real understanding of the day-to-day challenges and trade-offs faced by people who write code for a living.

[+] SubiculumCode|6 years ago|reply

Not to mention that most of us scientists spend quite a bit of time on our figures.

[+] setr|6 years ago|reply

One major difference is that code comments aren't at all meant for public consumption (even in open sourced code, comments have a very constrained expected audience).

If anything, public documentation is probably a better analogy -- expected audience of developers, but of potentially wildly different experience levels.

And in that case, it's generally fair for say a novice to complain about the lack of examples, clarity,etc.

And these charts are the same -- this is the primary entry point for a novice in the subject, and a helpful tool for experts. A poor chart would be harmful for all expected audiences.

[+] noobermin|6 years ago|reply

I sort of agreed until I googled the author's name. They were a "Knight Science Journalism Fellow at MIT" which I'll be honest, not sure I'm super aware of what that is, but I imagine they are somewhat knowledgeable.

[+] malshe|6 years ago|reply

Alberto Cairo recently released his "How Charts Lie". It's a delightful read.

https://www.amazon.com/How-Charts-Lie-Getting-Information/dp...

[+] archgoon|6 years ago|reply

I remember a decade or so ago, the only way for me to figure out certain material values (this was material science / electrical engineering / chemistry) was to open up MSPaint and manually draw lines on chart data to find intercepts. Has the field gotten better about releasing raw data for papers?

Granted, this was sometimes because the only team that had measured a particular substance did it back in the 70s.

[+] btrettel|6 years ago|reply

There are many programs to automate that, e.g., I use g3data. Unfortunately raw data is still rarely released today. One tip if the raw data wasn't released as a computer file: Look for a dissertation associated with the journal article you want data from. Dissertations often have tabulated data in an appendix.

[+] mkl|6 years ago|reply

If you have the plot as a PDF or PostScript file you can sometimes write a script to extract the data directly from the coordinates of the plot elements (for PDF, uncompress the file with PDFTK, and then the commands are ~readable text). Sometimes the values extracted from plots contain more precision than values reported in tables.

It can be quite fiddly, though, so it might not be worth the effort unless you really need it, or you need to do it a lot with many similar figures.

[+] tgb|6 years ago|reply

It varies by field. Some have standard repositories for data to go in - all RNA seq experiments get their raw data uploaded to GEO or similar, for example. But they usually don't specify their pipeline sufficiently precisely to be able to recreate the exact analysis done.

[+] noobermin|6 years ago|reply

Not always, but for more reason papers, you could always shoot someone an email and ask for data.

[+] exergy|6 years ago|reply

I do simulations for a living aka numerical modelling. The results that I obtain need to be explained using plots. How behaviour changes with time, how component sizing will affect xyz parameter etc.

What I would dearly love is some way to animate the systems I simulate. A way to show HOW MUCH flow is going through a pipe, or how much heat transfer is happening in a heat exchanger. Something that's easier to grasp than dull plots.

Sadly, a) I do not even know what to google to find solutions for this, and b) all my primitive searches seem to lead to Blender, which has a large learning curve and way too much time investment requirements.

[+] astrophysician|6 years ago|reply

https://yt-project.org/ -- yt for python was used by a few of my old colleagues for visualizing astrophysical simulations. There is yet another program that I've since forgotten but will try to come back here and comment if I remember.

If you can generate a single static figure, you can manually generate animations by generating each frame separately using your favorite plotting software. You can then stitch them together using many different tools (I use matplotlib) to generate an animation (video, gif, etc.).

What language do you like to use? What's your background -- are you a Matlab person?

[+] noobermin|6 years ago|reply

Someone mentioned yt, I'll go ahead and mention the imo clunky but standard in a lot of plasma/fluid physics, which is visit[0]. Not really a fan personally but a lot of people I know like it.

Tbh, for 2D data, I've written scripts that sort of just make a bunch of pngs from matplotlib and string them together using ffmpeg into a movie. For 3D data, I used to use mayavi and ffmpeg to make movies, but that too is pretty clunky. Mayavi feels closest to matplotlib which is why I like it, yt feels like there is way too much boilerplate to just get a plot (for example, I have to specify units for my data before a plot a flat array, I mean wtf).

[0] https://wci.llnl.gov/simulation/computer-codes/visit/

[+] bigger_cheese|6 years ago|reply

Some of my coworkers have used the open source VTK library for this sort of stuff (Heat flow in a reactor in this case).

From what I saw it was C++ code so may not be the easiest thing to write.

[+] xorand|6 years ago|reply

I use javascript, especially with d3.js

[+] martopix|6 years ago|reply

This is very true, don't get me wrong. But we have a problem, which is that scientists already need to be good (and efficient) at:

- Doing research - Supervising students - Teaching - Giving talks in a clear way - Writing papers in a clear way - Writing grants in an enticing way - Performing administrative tasks

This unfortunately is one of the causes why, in certain fields, industry research is much better than academic research. Google is draining brains from compsci departments all over the world, because a scientist at google is mostly a scientist. And there is much, much more teamwork.

[+] lancebeet|6 years ago|reply

The "challenges" included in that article are a bit confusing to me, in particular the pie chart. It asks for the third largest segment and claims the answer is B. To me, it seems like B is the second largest and H is the third largest. I did some thresholding of the image and it also shows the order being C, B, H, D. Maybe I'm missing something?

[+] knowablemag|6 years ago|reply

We've updated the story and added an editor's note at the end with the correction. Thank you again for spotting! -Katie, Knowable Magazine

[+] knowablemag|6 years ago|reply

Hi, this was flagged to the editors, and a correction is coming. Thank you for spotting this problem! -Katie, Knowable Magazine

[+] unknown|6 years ago|reply

[deleted]

[+] datashow|6 years ago|reply

Most charting packages have a terrible default: set y axis original point around y min, which is rarely appropriate in data visualization. In my opinion, the default should always be zero.

Another one is bars in bar/column charts are horribly wide.

[+] geoalchimista|6 years ago|reply

> In my opinion, the default should always be zero.

This won't work well for a lot of cases. Imagine plotting the atmospheric CO2 concentration time series in the past 200 years. Setting the original point at zero ppm would not make sense because it can never happen. I'd say setting y axis original point at y min is a good trade-off since the plotting package is agnostic about the underlying nature of the plot.

[+] djrobstep|6 years ago|reply

This a common claim, but really non-zero y-mins are absolutely fine.

Do you think that every weather chart is wrong because it doesn't begin from absolute zero?

Usually when charts have misleading y-ranges, the real problem is lack of x-axis context.

For instance, a chart with unemployment over the last three years can be made clearer by showing unemployment over the last 50 years.

Much better to give a wider picture than pointlessly add whitespace.

[+] omginternets|6 years ago|reply

Also, the jet colormap.

[+] frumiousirc|6 years ago|reply

A web site talking about information visualization and includes an obscuring top bar and throwing a huge pop up after loading. Yeah, that's a tab closing.

[+] mongol|6 years ago|reply

Not only scientists. Politicians too. That would hopefully change the discourse to be more material and fact based and less retorical.

[+] taurath|6 years ago|reply

Maybe if they used something prettier than R it would help?

[+] lallysingh|6 years ago|reply

ggplot2 is pretty damn nice

[+] jejeiloo9|6 years ago|reply

[deleted]

[+] pintxo|6 years ago|reply

Didn't even start reading, with the prominent pie-chart directly above the article. Not sure the author has any reasonable knowledge about visualization at all.

[+] maxnoe|6 years ago|reply

Disadvantages and advantages of pie charts are discussed sensibly further down in the article

76 comments