top | item 41505224

(no title)

spiralk | 1 year ago

I dislike how Jupyter notebooks have become normalized. Yes, the interactive execution and visuals are nice for more academic workflows where the priority is quick results over code organization. However, when it comes to sharing code with others for the sake of doing reproducible science, jupyter notebooks cause more trouble than they are worth. Using cell based execution with python is so elegant with '# %%' lines in regular .py files (though it requires using VSCode or fiddling with vim plugins which not all scientists want to do I suppose). No .ipynb is necessary, .py files can be version controlled and shared like normal code while sill retaining the ability to use interactively, cell by cell.

Its much easier to organize .py files into a proper python module, and then share and collaborate with others. Instead, groups will collect jumbles of slightly different versions of the same jupyter notebooks that progressively become more complex and less manageable over time. It's not a hypothetical unfortunately, I've seen this happen at major university labs. I'm not blaming anyone because I understand -- the funding is there to do science and not rewrite code to build convenient software libraries. Yet, I can't help but wish jupyter notebooks could be removed from academic workflows.

discuss

order

epistasis|1 year ago

I think there's a fundamental mistunderstanding and mismatch between what you want to do, and what Jupyter notebooks are for. The distinction is between code versus the results.

If the code is the end product, sure, use a python package.

But does your .py with `# %%` in it also store the outputs? If not, why even bring this up? A .py output without the plots tied to the code doesn't meet the basic use case.

If the end product is the plot, I want to see how that plot was generated. And a Jupyter notebook is a much much better artifact than a Python package, unless that Python package hard codes the inputs and execution path like a notebook would.

Over the past 20 years of my career I have run into this divergence of use cases a lot. Software engineers seem to not understand the end goals, how it should be performed, and the learnings of the practitioners that have been generating results for a long time. It's hard to protect data scientists from these inflexible software engineers that see "aha that's code, I know this!" without bothering to understand the actual use case at hand.

spiralk|1 year ago

Not having the outputs tied into the code is actually preferable if the ultimate goal is reproducible science. Code should be code, documentation should be documentation, and outputs should be outputs. Having multiple copies of important code in non-version controlled files is not a good practice. Having documentation dispersed with questionable organization in unsearchable files is not good a practice. Having outputs without run information and timestamps is not a good practice. Its easy to fall in to those traps with Jupyter notebooks. It might speed up initial set up and experimentation, but I've been working academic labs long enough to see the downstream effects.

Twirrim|1 year ago

I use jupyter notebooks at work, not so much for academic stuff, but often to help build and show a narrative to folks, including executives (where I have any even remotely technical leadership). It's great for narrative stuff, especially being able to emit PDFs and what not. I've been in a number of meetings where I've got the code up in Jupyter, sharing the screen, and leadership want us to tweak numbers and see the consequences.

It's great for exploring code and data too, especially situations where I'm really trying to feel my way towards a solution. I get to merrily intermingle rich text narrative and code so I explain how I got to where I got to and can walk people through it (I did that with some experimenting with an SMT solver several months ago, meant that people that had no experience with an SMT solver could understand the model I built).

I'd never use it to share code though. If we get to that stage, it's time to export from jupyter (which it natively supports), and then tidy up the code and productionise it. There's no way jupyter should be the deployed thing.

spiralk|1 year ago

That seems like a reasonable way to use jupyter notebooks since you have an actual plan to move beyond it when necessary. My issue is mostly with the way its misused, often by people who are arguably at the top of the field.

ants_everywhere|1 year ago

We've seen how this ends because mathematicians have been sharing Mathematica notebooks forever. It's not pretty.

Like you I see the appeal, but they're a usability nightmare beyond a few lines. Part of the problem, I think, is that you can't really incrementally improve them. Who wants to refactor a notebook and deal with all the cell dependency breakage?

So they start off okay and then slowly become terrible until they're either irreplaceable or too terrible to work with and a new one is started.

abdullahkhalids|1 year ago

The same problem exists with spreadsheets. Should we get rid of excel (the single tool that literally runs half the world), and start manually writing markdown tables in text files?

The tool and the tool maker are supposed to serve the user. The user is not supposed to conform to the whims of the tool maker.

ants_everywhere|1 year ago

Since 94% of business spreadsheets contain errors [0], then probably yes we should get rid of or significantly improve spreadsheets.

Probably the solution is that things like Jupyter notebooks and spreadsheets should be views into some better source of truth rather than the source of truth themselves.

[0] https://phys.org/news/2024-08-business-spreadsheets-critical.... I remember a similar figure from studies a decade or so ago.

paddy_m|1 year ago

Another issue is that jupyter, pandas, and polars don't take displaying tabular data seriously. Just have a better default table display widget. Look at ipydatagrid, perspective, or buckaroo (my project) for examples of how it could be done better.

KolenCh|1 year ago

I don’t disagree anything you said. Jupytext can be a good tool to bridge some gap, where you pair ipynb to a py script and can then commit the py only (git-ignore all ipynb for your collaborators.)

Also, while many practices out there is questionable, in alternative scenarios where ipynb doesn’t exist, they might have been using something like matlab for example. Eg, in my field (physics), often time there are experimentalists doing some coding. Ipynb can be very enabling for them.

I think a piece of research should be broken down and worked by multiple people to improve the state of the project. Some scientists might be passing you the initial prototype in the form of a notebook, and some others should be refactoring to something more suitable for deployment and archival purpose. Properly funding these roles is important, and is lacking but improving (eg hiring RSE.)

In my field, the most prominent way when ipynb is shared a lot is for training. It’s a great application as that becomes literate programming. In this sense notebook is highly underused as literate programming still hasn’t got mainstream.

spiralk|1 year ago

I've looked into Jupytext, but ultimately decided to go with pure python. Most of the practical functionality can be replicated, but I do admit there isn't a easy single install tool or guide to replace notebooks at the moment.

I think the notebooks are a fine learning tool to introduce people to programming initially, but I'm afraid it doesn't allow for growth beyond a certain level. You have a good point about funding for those software roles. Perhaps this may not be as big of a concern if there were more software talent in these labs to handle the issues that arise.

luplex|1 year ago

In the end, usability wins. In a Jupyter notebook, you have a much better idea of state between cells, you can iterate much faster, you can write documentation in readable markdown. Often, Jupiter notebooks are more like interactive markdown than they are like python scripts.

dxbydt|1 year ago

> In a Jupyter notebook... > Often, Jupiter notebooks...

Everytime I search my Slack, I have to run two searches because DS can't agree on how to spell the damn thing.

ambicapter|1 year ago

The form factor of Jupyter notebooks seems to fit well with peoples workflows though. Looks like you just wish the internals of Jupyter were better architected.

spiralk|1 year ago

Imo, the better architected .ipynb is simply .py with '# %%' blocks. It does almost everything a .ipynb can do with the right VSCode extensions. Even interactive visualizations can be sent to a browser window or saved to disk with plotly. Though I do wish '# %%' cell based execution was accessible to more people.

There isn't a single install tool that "just works" for this at the moment. If editors came with more robust support for it by default, I think the notebook format wouldn't be needed at that point and people could use regular python and interactive cell based python more interchangeably. I've seen important code get buried under collections of jupyter notebooks across different users so I have a good reason for this. Notebooks simply dont scale beyond a certain complexity.