top | item 19132272

Datashader: turns even the largest data into images, accurately

237 points | luu | 7 years ago | datashader.org | reply

69 comments

[+] mhalle|7 years ago|reply
Looks like a great project. Contrary to other comments, rendering != visualization. This project seems to have paid attention to lots of the seemingly little but critical details of this type of visualization that are a pain to handle yourself (anti-aliasing of multi-scale data, terrain shading, large- and out-of-core visualization).

Any one of these topics can bring a visualization project to a screeching halt, or make the results look misleading or bad.

Even better that they built a tool that works with existing libraries, rather than replacing them. Good work!

[+] BubRoss|7 years ago|reply
> anti-aliasing of multi-scale data, terrain shading, large- and out-of-core visualization

WebGL will basically do all of that for you, including the out-of-core part, if you can stream the data in.

[+] IanCal|7 years ago|reply
Datashader is a great project. Very fast, very easy to use. You can throw a lot of data at it in a notebook and get back a zoomable interactive pane.

Here's a 2016 talk on it: https://www.youtube.com/watch?v=fB3cUrwxMVY

There's likely a lot of improvements since then, but that should help show some of the core parts and explain why it's a useful tool.
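[The workflow being described is aggregate-then-shade: bin the raw points into a pixel-sized grid of counts, then color the grid. A rough pure-Python sketch of just the aggregation step, with all names hypothetical and no resemblance to Datashader's actual (NumPy/Numba-based) implementation:]

```python
def aggregate(points, width, height, x_range, y_range):
    """Bin (x, y) points into a width x height grid of counts."""
    (x0, x1), (y0, y1) = x_range, y_range
    grid = [[0] * width for _ in range(height)]
    for x, y in points:
        if x0 <= x < x1 and y0 <= y < y1:
            col = int((x - x0) / (x1 - x0) * width)
            row = int((y - y0) / (y1 - y0) * height)
            grid[row][col] += 1
    return grid

counts = aggregate([(0.1, 0.1), (0.1, 0.15), (0.9, 0.9)],
                   width=4, height=4, x_range=(0, 1), y_range=(0, 1))
```

[Because the expensive step produces a small fixed-size grid rather than an image of plotted glyphs, re-shading on zoom or pan stays cheap, which is what makes the interactive pane responsive.]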

[+] itodd|7 years ago|reply
I've used datashader for plotting NGS (Next Generation Sequencing) enrichments. At the time I had to hack together the ability to use the polygon select tool on the data, but it worked and blew my mind.

Very elegant solution to a difficult problem (overplotting).

[+] abcc8|7 years ago|reply
Do you have any examples of this you could point to online? I am looking at different visualization tools for various NGS-based analyses currently.
[+] tokyodude|7 years ago|reply
> Turns even the largest data into images, accurately

The first image, the image of the USA, seems really misrepresentative to me. LA and NYC should be way, way brighter in relation to everything else than the entire area east of the Mississippi.

At least to my eyes that map makes it look like parts of Denver, Kansas City, Salt Lake City, Atlanta, and the San Joaquin Valley are just as dense as Manhattan.

Atlanta's population density: 630 per square mile

Manhattan's population density: 70,826 per square mile

It seems like an accurate data image would have Atlanta's brightness at 1/100th of Manhattan's. Basically it looks like they saturated out at around 250 people, so anything over 250 people is the same brightness.

[+] jbednar|7 years ago|reply
By default, Datashader accurately conveys the shape of the distribution in a way that the human visual system can process. If you want a linear representation, you can do that easily; see the first plot in http://datashader.org/topics/census.html , but you'll quickly see that the resulting plot completely fails to show that there are any patterns anywhere besides the top few population hotspots, which is highly unrepresentative of the actual patterns in this data.

There is no saturation here; what it's doing in the homepage image is basically a rank-order encoding, where the top brightness value is indeed shared by several high-population pixels, the next brightness value is shared by the next batch of populations, and so on. Given only 256 possible values, there has to be some grouping, but it's not saturating.
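[The rank-order encoding described here is in the spirit of histogram equalization over the aggregated counts. A hedged pure-Python sketch of the idea (function name hypothetical; Datashader's actual `eq_hist` shading is more sophisticated than this):]

```python
def eq_hist_shade(counts, levels=256):
    """Map counts to brightness by rank among distinct nonzero counts,
    so spacing between brightness values reflects order, not magnitude."""
    distinct = sorted({c for c in counts if c > 0})
    n = len(distinct)
    # Spread the distinct counts evenly across the available levels.
    rank = {c: int((i + 1) / n * (levels - 1)) for i, c in enumerate(distinct)}
    return [rank.get(c, 0) for c in counts]

# Atlanta-like (630) vs Manhattan-like (70826) pixel counts:
shaded = eq_hist_shade([0, 630, 70826])
```

[Under a linear mapping the 630 pixel would land near brightness 2 out of 255 and effectively vanish; rank-ordering keeps it visible while preserving which pixel is denser.]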
[+] pwang|7 years ago|reply
Yes, datashader actually gives you the ability to dial-in as much gamma compensation as you want, to account for the human visual system's nonlinear response to luminance.
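[Gamma compensation in this context can be sketched as a simple power-law mapping of normalized counts to brightness levels. A minimal illustration (helper name hypothetical, not Datashader's API):]

```python
def gamma_shade(counts, gamma=2.2, levels=256):
    """Brightness = (count / max) ** (1 / gamma), compensating for the
    eye's nonlinear response to luminance; gamma=1 is a linear mapping."""
    peak = max(counts) or 1
    return [int((c / peak) ** (1 / gamma) * (levels - 1)) for c in counts]

linear = gamma_shade([0, 630, 70826], gamma=1)    # low counts nearly invisible
curved = gamma_shade([0, 630, 70826], gamma=2.2)  # low counts compressed upward
```

[Raising gamma trades dynamic range at the top for visibility at the bottom; the right setting depends on the display and the distribution being shown.]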
[+] johnmarinelli|7 years ago|reply
Looks like a really cool project. One thing that I would be interested in seeing is using Datashader as a dynamic visualisation library - for example, for generative art projects. Probably not the main interest of data visualisation practitioners but hey, if you've got a sweet pipeline to render all those points, why not?
[+] whoisjuan|7 years ago|reply
What license is this project using? The repo has a license but besides the provisions listed there I don't see any standard license.
[+] tnvaught|7 years ago|reply
The repo has a standard 3-clause BSD license.
[+] burtonator|7 years ago|reply
This actually gave me an interesting idea regarding bitcoin passphrase mnemonics.

Instead of text we could use the same algorithm to generate images.

So you could have an index of images and generate them. I'm actually wondering if you could use nouns and verbs to make stories, provided you could mutate the nouns reliably.

Like 'bird flying' vs 'bird sleeping' ...

This could help to remember long passphrases visually which people seem to be better at.
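[The noun/verb mutation idea can be sketched as deterministically indexing into word lists with bits of the key material. A toy illustration with tiny hypothetical word lists (real mnemonic schemes like BIP39 use 2048-word lists and checksums):]

```python
NOUNS = ["bird", "fox", "ship", "tree"]
VERBS = ["flying", "sleeping", "running", "burning"]

def phrase_from_bits(bits):
    """Consume bits four at a time: two bits pick a noun, two pick a verb,
    yielding one memorable 'story' fragment per 4 bits of entropy."""
    words = []
    for i in range(0, len(bits), 4):
        n = int(bits[i:i + 2], 2)
        v = int(bits[i + 2:i + 4], 2)
        words.append(f"{NOUNS[n]} {VERBS[v]}")
    return ", ".join(words)

story = phrase_from_bits("00011110")
```

[Here "0001" encodes 'bird sleeping' and "1110" encodes 'tree running'; each distinct bit pattern mutates the story in a recoverable way.]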

[+] simplyinfinity|7 years ago|reply
Is there anything similar for network graphs?
[+] lmeyerov|7 years ago|reply
Yep -- at https://github.com/graphistry/pygraphistry, we started by making millions of nodes/edges interactive. If you use notebooks, can signup on our site and get going. The trick is we connect GPUs in the browser to GPUs in the cloud, and encapsulate it enough that you can stick to writing standard SQL/pandas/etc.

We've been curious about server-side static tile rendering for larger graphs, but it has been on the back burner. (We already connect to GPUs on the server, so it's not rocket science.) Currently, we're actively increasing how much can be ingested + computed on, such as for finding influencers, communities & rings, etc. However, visualizing that hasn't been an operational priority for our users. It's more useful to generate the communities, and then either inspect individual ones, or see how communities stitch together: you quickly run out of pixels otherwise due to too many edges. Likewise, we're building connectors to gigascale-petascale graph DBs: Titan, Janus, AWS Neptune, TigerGraph, Spark GraphX, etc.

We still are interested, but more for when we start supporting geographic maps: you can see that is the primary use for datashader. Also, because data art is fun :)

[+] maliker|7 years ago|reply
My team has had luck rendering an SVG from the graph and sending it to a browser. It works well for about 10k vertices and edges. Above that scale we use datashader, and we're investigating a potential move to QGIS. We tried Gephi a few years ago and it had trouble at these scales.
[+] IanCal|7 years ago|reply
You can render networks in datashader, there's a line primitive.

I added edge bundling (probably the slowest thing in datashader!) but I know there's examples of flight path rendering in the video I linked in another comment.
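[A line primitive fits the same aggregate-then-shade model: each edge is rasterized into the count grid, so overlapping edges accumulate rather than overplot. A minimal Bresenham-style sketch (hypothetical, not Datashader's actual implementation):]

```python
def draw_line(grid, x0, y0, x1, y1):
    """Increment every grid cell along the segment (integer Bresenham)."""
    dx, dy = abs(x1 - x0), -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        grid[y0][x0] += 1
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy
    return grid

grid = [[0] * 5 for _ in range(5)]
draw_line(grid, 0, 0, 4, 4)  # one diagonal edge
```

[Because edges sum into counts, a pixel crossed by a thousand flight paths ends up a thousand times "hotter" than one crossed once, and the shading stage can then encode that density honestly.]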

[+] BubRoss|7 years ago|reply

[deleted]

[+] jameskilton|7 years ago|reply
And this is how you push people out of our field.

How about, instead of starting with an insult ("I can't believe you didn't already know this"), you congratulate them on putting together a full working library with pretty, easy-to-grasp examples, then offer up some research links that they could use to further refine and improve their system? It's our job to teach people; you can't expect everyone to suddenly know everything.

To the Datashader team: I apologize for the above comment. Good job in building and launching a tool for others to use, and great choices for examples!

[+] pwang|7 years ago|reply
You're missing the point of this project. It's not about the feasibility of throwing a billion points at a pile of software, to get an image. I can do that with a simple Python script. It's about doing so to create a meaningful and accurate data visualization, and not just a picture of, say, shiny spheres or a scene from Avatar.

I actually have a background in 3D computer graphics, and it's precisely because of my detailed knowledge of raytracing, rasterization, OpenGL, BMRT, photon maps, computational radiometry, BRDFs, computational geometry, statistical sampling, etc. that when I came to the field of data science, and specifically the problem of visualizing large datasets, I realized the total lack of tooling in this space.

The field of information visualization lags behind general "computer-generated imagery" by decades. When I first presented my ideas around Abstract Rendering (which became Datashader) to my DARPA collaborators, even to famous visualization people like Bill Cleveland or Jeff Heer, it was clear that I was thinking about the problem in an entirely different way. I recall our DARPA PM asking Hanspeter Pfister how he would visualize a million points, and he said, "I wouldn't. I'd subsample, or aggregate the data."

Datashader eats a million points for breakfast.

Since you're clearly a computer graphics guy, the way to think about this problem is not one of naive rendering, but rather one of dynamically generating correct primitives & aesthetics at every image scale, so that the viewer has the most accurate understanding of what's actually in the dataset. So it's not just a particle cloud, nor is it NURBS with a normal & texture map; rather, it's a bunch of abstract values from which a data scientist may want to synthesize any combination of geometry and textures.

I chose the name "datashader" for a very specific and intentional reason: we are dynamically invoking a shader - usually a bunch of Python code for mathematical transformation - at every point, within a sampling volume (typically a square, but it doesn't have to be). One can imagine drawing a map of the rivers of the US, with the shading based on some function of all industrial plants in its watershed. Both the domain of integration and the function to evaluate are dynamic for each point in the view frustum.
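[This "shader per pixel" framing can be sketched as a user-supplied function evaluated over whatever raw values fall inside each pixel's sampling region, rather than over pre-rendered colors. A hedged toy illustration (all names hypothetical):]

```python
def shade(binned_values, shader):
    """Apply a user-defined shader function to the raw data values
    collected in each pixel's bin; the shader decides the aesthetic."""
    return [[shader(cell) for cell in row] for row in binned_values]

# A 2x2 grid of bins, each holding the raw values sampled at that pixel.
bins = [[[1, 2, 3], []],
        [[10], [4, 4]]]

mean_shader = lambda vals: sum(vals) / len(vals) if vals else 0.0
mean_img = shade(bins, mean_shader)
```

[Swapping `mean_shader` for a count, max, or any other reduction re-renders the same aggregate differently, which is the flexibility the "shader" naming is pointing at.]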

[+] IanCal|7 years ago|reply
> thinking up fancy names for reinventing the wheel

They're not claiming to have reinvented the wheel, they're just explaining what it is.

> 'Turning data into images' isn't exactly a new concept.

No, but doing so on large data accurately (the last word is important, and you cut it off) is not something I know how to achieve as easily or as fast in a different Python library. I'd like to know if I could.

[+] candu|7 years ago|reply
Ah, the good old "MapReduce is basically functional programming 101" trope, usually resulting from a fundamental misunderstanding of the problem the framework / tool in question solves.
[+] douglaswlance|7 years ago|reply
It doesn't matter if something has been done before if the new way hooks into new systems. Tools do not exist in a vacuum. All tools are part of a system of tools that should always be considered when evaluating any component piece.
[+] sevensor|7 years ago|reply
I also was expecting something new, but in their defense they've made a very appealing version of something old. I'm sure there are a lot of people out there who haven't thought about saturation with "large" data sets before.
[+] Eli_P|7 years ago|reply
What are those fractals on the 3rd picture? They remind me of Lissajous curves used in oscilloscopes.