top | item 16981925

carljv | 7 years ago

The Jupyter team deserves every accolade they get and more. The console, notebook, and now JupyterLab are some of the key reasons why Python's data ecosystem thrives.

I think Jupyter notebooks are quite useful as "rich display" shells. I often use them to set up simple interactive demos or tutorials to show folks, or to keep notes and scratch work for myself.

That being said, I do think the "reproducibility" aspect of the notebook is overblown for the reasons other comments cite. Notebooks are hard to version control and diff, and are easy to "corrupt." I often see Jupyter notebooks described as "literate programs," and I really don't think that's an apt description. The notebook is basically the IPython shell exposed to the browser where you can display rich output.

This is where I think the R ecosystem's approach to the problem is better (a bit like org-mode & org-babel). For them, there is a literate program in plain text. Code blocks can be executed interactively and results displayed inline by a "viewer" on the document (like that provided by RStudio), but executing code doesn't change the source code of the program, and diffs/versions are only created by editing the source. At any point, the file can be "compiled" or processed into a static output document like HTML or PDF.

This is essentially literate programming, but with an intermediate "interactive" feature facilitated by an external program. RMarkdown source doesn't know it's being interacted with or executed, and you can edit it like any other literate program.
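To make the contrast concrete, here is a minimal sketch of what an RMarkdown source file looks like (the title, filenames, and chunk contents are illustrative, not from the original discussion). The whole document is plain text: prose is Markdown, and code lives in fenced `{r}` chunks that a viewer like RStudio can execute inline without modifying the source.

````markdown
---
title: "Analysis"
output: html_document
---

Some narrative text explaining the analysis.

```{r load-data}
data <- read.csv("results.csv")
```

```{r plot}
plot(data$x, data$y)
```
````

Running the chunks interactively or knitting the file to HTML/PDF never rewrites this source; diffs only appear when you edit the text itself.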

Interaction, reproducibility, and publication have fundamental tensions with each other. Jupyter notebooks are trying to do all three in the same software/format, and my sense is that they're starting to strain against those tensions.

goerz | 7 years ago

Notebooks can be reproducible; they just aren't automatically so. It requires a little bit of effort and discipline, if reproducibility is a goal. https://www.svds.com/jupyter-notebook-best-practices-for-dat... is an excellent starting point. Personally, I use notebooks to keep a record of large computational pipelines. The key is to cache all results to disk. This allows for an iterative process where I modify the notebook, kill the kernel, and rerun everything. Only new calculations will be executed; everything previously calculated will simply be loaded from disk. In the end, I have a reproducible record of the entire project (and rerunning the notebook is fast). This kind of make-like functionality is implemented through the doit Python package (http://pydoit.org). An example workflow for this is http://clusterjob.readthedocs.io/en/latest/pydoit_pipeline.h...
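The core of this workflow can be sketched with the standard library alone. This is a minimal stand-in for the caching pattern described above, not the doit API (doit additionally tracks file dependencies and decides what is stale); the `cached` helper and the `cache/` directory are hypothetical names for illustration.

```python
import json
from pathlib import Path

CACHE_DIR = Path("cache")  # hypothetical cache location
CACHE_DIR.mkdir(exist_ok=True)

def cached(name, compute):
    """Return the result stored under `name` if it exists on disk;
    otherwise run `compute`, save its result, and return it."""
    path = CACHE_DIR / f"{name}.json"
    if path.exists():
        return json.loads(path.read_text())
    result = compute()
    path.write_text(json.dumps(result))
    return result

# Each expensive notebook cell wraps its work like this; after killing
# the kernel and rerunning the whole notebook, previously computed
# steps load instantly from disk instead of being recomputed.
squares = cached("squares", lambda: [n * n for n in range(10)])
```

Rerunning the same cell (or the whole notebook) hits the cache, which is what makes "kill the kernel and rerun everything" cheap.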

PurpleRamen | 7 years ago

So basically you write a script instead of a notebook? If you save the data to disk, is it still displayed with the rich formatting of Jupyter?

your-nanny | 7 years ago

I agree, 120%.

I like the R approach so much more.

carljv | 7 years ago

I mean, as a medium for interactive exploration where you might want graphs, widgets, or other rich/dynamic output, I still think the notebook is superior. But as a medium for developing complete, shareable, reproducible data analyses, I do think R has the upper hand.

your-nanny | 7 years ago

I just like python pandas better than R.