top | item 25952669


long | 5 years ago

In general, the code philosophy behind ggplot2 and related tools (the so-called "tidyverse" in R) embraces functional programming, in particular doing computation by pure composition of smaller computations.

Using the "+" operator to denote composing parts of visualizations is not the greatest syntax, but I think we're basically stuck with it for a bit due to historical baggage. See this note from the creator of ggplot2, Hadley Wickham: https://community.rstudio.com/t/why-cant-ggplot2-use/4372/7


melling|5 years ago

Maybe the new R native pipe operator, `|>`, will fix the issues?

lmeyerov|5 years ago

It may make more sense when you see analysts writing & sharing a lot of code sessions, especially via notebooks. Functional plotting ends up helping a lot! For big graph-y graphs, we made pygraphistry that way, which enables multi-cell flows like:

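```
import cudf
import graphistry

# Load the clicks on the GPU and drop repeated (user_ip, click) pairs
df = cudf.read_csv('1GB.csv').drop_duplicates(['user_ip', 'click'])

# Cell 1: bind the edge table and plot the bare graph
g1 = graphistry.edges(df, 'user_ip', 'click')
g1.plot()

# Cell 2: derive a colored variant; g1 itself is unchanged
g2 = g1.encode_point_color('risk', ['blue', 'yellow', 'red'])
g2.plot()

g2.edges(cudf.read_csv('file2.csv')).plot()  # reuse g2's color settings
g1.edges(cudf.read_csv('file2.csv')).plot()  # ... or just g1's graph shape
```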

Being able to 'fork' plots and interactively swap in different data / encodings is super great over the course of a session. You can always go back to an earlier one as you make progress. Likewise, you can rerun notebook cells and read them top-to-bottom without worrying too much.
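To make the 'fork' idea concrete, here is a minimal sketch in plain Python. The `Plot` class and its `with_edges`, `encode_point_color`, and `describe` methods are hypothetical stand-ins made up for illustration, not pygraphistry's actual implementation. The only point is that every method returns a new value, so earlier plots stay valid:

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Plot:
    """Hypothetical immutable plot spec (illustration only)."""
    edges: Tuple[Tuple[str, str], ...] = ()
    point_color: Optional[str] = None

    def with_edges(self, edges):
        # Return a copy with new edges; every other setting carries over
        return replace(self, edges=tuple(edges))

    def encode_point_color(self, column):
        # Return a copy with a color encoding; self is left untouched
        return replace(self, point_color=column)

    def describe(self):
        return f"{len(self.edges)} edges, color by {self.point_color}"


g1 = Plot().with_edges([("a", "b"), ("b", "c")])
g2 = g1.encode_point_color("risk")   # fork: g1 is still uncolored
g3 = g2.with_edges([("x", "y")])     # new data, reuses g2's color
g4 = g1.with_edges([("x", "y")])     # new data, g1's settings only

for g in (g1, g2, g3, g4):
    print(g.describe())
```

Because nothing is mutated, rerunning the cell that builds `g2` cannot change what `g1` plots, which is what makes reading a notebook top-to-bottom safe.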

So while we're looking at some V2 additions, maybe supporting R, and updating some of the core (more automatic GPU goodness!)... we're definitely keeping the compositional style.

Interesting nit: Libraries copying the original grammar of graphics (GoG) can likely benefit from friendlier functional DSL presentation styles. As is, I think they make it much harder to read + write, undercutting much of the productivity potential. I love the academic concept of making everything a composable value, but doing naked composition over a massive namespace of diverse types is super confusing to read + write.

Learning from pandas & jQuery, we ended up instead steering users to chaining for the typical case: `g.bind(...).edges(...).nodes(...).encode(...).plot()`. It's functional, so you can always do `g_intermediate = g...` and likewise still do first-class GoG-syntax-style things with them if you really want: `f(g._bindings)`. However, those are the minor case, and people doing them make code harder to read + write:

-- Reading GoG code is confusing: In `x + f(y)`, often unclear what x, y, and f(y) are, and more so in dynamic languages like Python + R that they're used in. In `g.bind(..).encode(...).plot(...)`, each composition is pretty obvious in the typical case, and you can always read back or do first-class in the atypical case.

-- GoG plot authoring is jarring: When doing `x + ...`, tab complete doesn't get you far. If tab complete does somehow kick in, you are dealing with a big namespace dump. Instead, I see people turn to Google for almost every step! In contrast, tab complete on `g.nodes(df)...` will pull up the most likely next settings to add, and then again for the arguments to fill into whatever command you pick.
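The two points above can be sketched side by side. This is a toy illustration under made-up names (`Spec`, `Layer`, `color_layer`, `size_layer` are all hypothetical, and this is neither ggplot2's nor pygraphistry's real API): the same immutable value supports both `+` composition and method chaining, and the chained form is just sugar over the first-class one.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Layer:
    """Hypothetical first-class plot component."""
    kind: str
    column: str


# GoG style: free functions living in one big shared namespace
def color_layer(column):
    return Layer("color", column)


def size_layer(column):
    return Layer("size", column)


@dataclass(frozen=True)
class Spec:
    """Hypothetical immutable plot spec supporting both styles."""
    layers: Tuple[Layer, ...] = ()

    def __add__(self, layer):
        # `spec + layer` returns a new spec with the layer appended
        return replace(self, layers=self.layers + (layer,))

    # Chaining style: methods are discoverable via tab complete on a Spec
    def color(self, column):
        return self + color_layer(column)

    def size(self, column):
        return self + size_layer(column)


gog_style = Spec() + color_layer("risk") + size_layer("degree")
chained = Spec().color("risk").size("degree")

print(gog_style == chained)
```

Both expressions build equal values, so nothing functional is lost by chaining. What changes is discoverability: the methods hang off the object you already have, instead of sitting in a module-level namespace you have to know about in advance.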

GoG makes that first-class composition the default for the typical case rather than reserving it for the atypical one, so even a 2nd-class imperative API may be easier to use. But with chaining, we get functional composition without losing straight-line reading and tab complete. Best of both worlds!