Show HN: Pyper – Concurrent Python Made Simple
156 points | pyper-dev | 1 year ago | github.com
We're excited to introduce the Pyper package for concurrency & parallelism in Python. Pyper is a flexible framework for concurrent / parallel data processing, following the functional paradigm.
Source code can be found on [GitHub](https://github.com/pyper-dev/pyper).
Key features:
Intuitive API: Easy to learn, easy to think about. Implements clean abstractions to seamlessly unify threaded, multiprocessed, and asynchronous work.
Functional Paradigm: Python functions are the building blocks of data pipelines. Lets you write clean, reusable code naturally.
Safety: Hides the heavy lifting of underlying task execution and resource clean-up. No more worrying about race conditions, memory leaks, or thread-level error handling.
Efficiency: Designed from the ground up for lazy execution, using queues, workers, and generators.
Pure Python: Lightweight, with zero sub-dependencies.
We'd love to hear any feedback on this project!
solidasparagus|1 year ago
But I'm not sure I can use this even though I have a specific use-case that feels like it would work well (high-performance pure Python downloading from cloud object storage). The examples are a bit too simple and I don't understand how I can do more complicated things.
I chunk up my work, run it in parallel and then I need to do a fan-in step to reduce my chunks - how do you do that in Pyper?
Can the processes have state? Pure functions are nice, but if I'm reaching for multiprocess, I need performance and if I need performance, I'll often want a cache of some sort (I don't want to pickle and re-instantiate a cloud client every time I download some bytes for instance).
How do exceptions work? Observability? Logs/prints?
Then there's stuff that is probably asking too much from this project, but I get it if I write my own python pipeline so it matters to me - rate limiting WIP, cancellation, progress bars.
But if some of these problems are/were solved and it offers an easy way to use multiprocessing in python, I would probably use it!
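On the question of per-process state, one general pattern in plain Python (not a Pyper feature) is to keep the expensive object in a module-level global that each worker process builds lazily on first use, so it is never pickled per task. A minimal sketch, where `FakeClient`, `get_client` and `download` are hypothetical stand-ins for a real cloud client and worker function:

```python
class FakeClient:
    """Hypothetical stand-in for an expensive-to-build cloud storage client."""

    instances = 0  # counts constructions within this process

    def __init__(self):
        FakeClient.instances += 1

    def get(self, key):
        return f"bytes-for-{key}".encode()


_client = None  # one per process; never pickled or sent between processes


def get_client():
    """Build the client on first use in this process, then reuse it."""
    global _client
    if _client is None:
        _client = FakeClient()
    return _client


def download(key):
    # Worker function: it only reaches the client through get_client(),
    # so the function itself stays cheap to pickle.
    return get_client().get(key)
```

Handed to a `concurrent.futures.ProcessPoolExecutor`, each worker process would build its own client once. Whether the same trick carries over unchanged to Pyper's multiprocess workers is an assumption worth testing.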
pyper-dev|1 year ago
One thing I'd mention is that we don't really imagine Pyper as a whole observability and orchestration platform. It's really a package for writing Python functions and executing them concurrently, in a flexible pattern that can be integrated with other tools.
For example, I'm personally a fan of Prefect as an observability platform: you could define pipelines in Pyper, then wrap them in a Prefect flow for orchestration logic.
Exception handling and logging can also be handled by orchestration tools (or in the business logic if appropriate, literally using try... except...)
For a simple progress bar, tqdm is probably the first thing to try. As it wraps anything iterable, applying it to a pipeline might look like:
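A minimal sketch of that idea, with a plain generator standing in for a pipeline call. `fake_pipeline` is hypothetical and is not Pyper's API; the only assumption is that the pipeline's output can be iterated lazily.

```python
import time

try:
    from tqdm import tqdm  # third-party: pip install tqdm
except ImportError:
    # Fallback so the sketch still runs without tqdm (no bar is drawn).
    def tqdm(iterable, **kwargs):
        return iterable


def fake_pipeline(limit):
    """Hypothetical stand-in for a pipeline call: lazily yields results."""
    for i in range(limit):
        time.sleep(0.01)  # simulate work
        yield i * i


def run(limit=20):
    results = []
    # total= is passed explicitly because a generator has no len()
    for item in tqdm(fake_pipeline(limit), total=limit):
        results.append(item)
    return results
```

Without `total`, tqdm would still count items but could not show a percentage or an ETA.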
halfcat|1 year ago
Have you tried multiprocessing.shared_memory to address this?
globular-toast|1 year ago
Concurrency in general isn't about parallelism. It's just about doing multiple things at the same time.
rtpg|1 year ago
I don't really need pipelining that much, but pipelining along with a certain level of durability and easy multiprocessing support? Now we're talking
t43562|1 year ago
I suppose one excellent thing about this would be if you could just change 1 parameter and switch from multiprocessing to threaded.
giancarlostoro|1 year ago
> pipeline = task(get_data, branch=True) \
> | task(step1, workers=20) \
> | task(step2, workers=20) \
> | task(step3, workers=20, multiprocess=True)
dec0dedab0de|1 year ago
you could reassign every line, but it would look nicer with chained functions.
edit: I would be tempted to do something like this:
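As a toy illustration of the two spellings being compared (a hypothetical `Stage` class, not Pyper's API and not the snippet this comment refers to), `|` composition can be implemented with `__or__`, and a chained-method form can sit on top of it:

```python
class Stage:
    """Toy pipeline stage (hypothetical): wraps an ordered list of functions."""

    def __init__(self, *funcs):
        self.funcs = list(funcs)

    def __or__(self, other):
        # stage_a | stage_b -> new Stage running a's functions, then b's
        return Stage(*self.funcs, *other.funcs)

    def then(self, func):
        # Method-chaining spelling of the same composition
        return self | Stage(func)

    def __call__(self, value):
        for func in self.funcs:
            value = func(value)
        return value


def add_one(x):
    return x + 1


def double(x):
    return x * 2


# Parentheses allow line breaks without trailing backslashes.
piped = (
    Stage(add_one)
    | Stage(double)
    | Stage(add_one)
)

# The same pipeline built with chained method calls.
chained = Stage(add_one).then(double).then(add_one)
```

Both objects apply `add_one`, `double`, `add_one` in order; the difference is purely syntactic.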
morkalork|1 year ago
yablak|1 year ago
minig33|1 year ago
JackC|1 year ago
It's surprisingly annoying in built-in python to do something like this. The most recent thing I was trying to do was:
- load URLs from a file
- hand them out to one subprocess per cpu
- download them concurrently in threads or async within each subprocess
- pull the results back into a single process for formatting and storing
Getting this to work and handle queues, ctrl-c, exceptions etc. is just a whole mess involving python builtins created at different times with different interfaces; I hacked until I kind of got it working, but didn't love it. Bundling it all in a single tested package would be great.
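For reference, the happy path of that workflow can be sketched with the standard library alone. This is a simplified illustration, not Pyper code: `fetch` is a hypothetical stand-in for a real download, the URLs are passed in as a list rather than read from a file, and ctrl-c handling, retries and error propagation (the genuinely messy parts) are left out.

```python
import os
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(url):
    """Hypothetical stand-in for a real download; returns (url, payload size)."""
    return url, len(url)


def fetch_chunk(urls, threads=8):
    """Runs inside one worker: fetch a chunk of URLs concurrently in threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(fetch, urls))


def run(urls, workers=None, executor_cls=ProcessPoolExecutor):
    """Split URLs into one chunk per worker and gather results in the parent."""
    workers = workers or os.cpu_count() or 1
    chunks = [urls[i::workers] for i in range(workers)]
    chunks = [chunk for chunk in chunks if chunk]  # drop empty chunks
    results = []
    with executor_cls(max_workers=workers) as pool:
        # pool.map returns chunk results in submission order
        for chunk_result in pool.map(fetch_chunk, chunks):
            results.extend(chunk_result)
    return results
```

Calling `run(urls)` uses one process per CPU by default; on platforms that spawn rather than fork worker processes, the call should sit under an `if __name__ == "__main__":` guard. Passing `executor_cls=ThreadPoolExecutor` runs the same code without subprocesses, which is handy for debugging.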
zenapollo|1 year ago
grandma_tea|1 year ago
pyper-dev|1 year ago
The important design point we're differing on is that Pyper implements 'pipelines' as functions, whereas pypeln seems to implement 'pipelines' as iterable objects.
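A toy illustration of that distinction as described here (neither library's actual implementation; `make_pipeline` and `IterablePipeline` are hypothetical names):

```python
def make_pipeline(*funcs):
    """Pipeline as a function: no data is bound until it is called."""

    def pipeline(items):
        for item in items:
            for func in funcs:
                item = func(item)
            yield item

    return pipeline


class IterablePipeline:
    """Pipeline as an iterable object: the input is bound at construction."""

    def __init__(self, items, *funcs):
        self.items = items
        self.funcs = funcs

    def __iter__(self):
        for item in self.items:
            for func in self.funcs:
                item = func(item)
            yield item


def square(x):
    return x * x


# The function form can be reused with different inputs.
squares = make_pipeline(square)
first = list(squares([1, 2, 3]))
second = list(squares(range(2)))

# The object form carries its input with it.
bound = list(IterablePipeline([1, 2, 3], square))
```

In the function form, the pipeline is defined once and invoked with arguments like any other function; in the object form, the data source is part of the pipeline itself.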
gpderetta|1 year ago
- my biggest issue with concurrency in python (especially with asyncio) is leaking tasks. Pyper should provide structured concurrency support à la trio.
- I don't see the opposite of branch to collect the output of multiple sub pipelines into a single stage. I need this pretty much always and it is a chore to implement.
- Async need not force the full pipeline to be async. There should be an option to run async functions in background event loops. Especially as you already support threaded execution.
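To make the fan-in request in the second point concrete, here is a minimal standard-library sketch of merging several sub-pipeline outputs into a single stream. It is not a Pyper feature; `fan_in` is a hypothetical helper, and it assumes each source is an ordinary iterable that can be drained from its own thread.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of one source


def fan_in(*sources):
    """Merge several iterables into one stream, yielding items as they arrive."""
    out = queue.Queue()

    def drain(source):
        try:
            for item in source:
                out.put(item)
        finally:
            # Always signal completion, even if the source raises, so the
            # consumer never blocks forever. The exception itself is not
            # forwarded to the consumer in this sketch.
            out.put(_DONE)

    for source in sources:
        threading.Thread(target=drain, args=(source,), daemon=True).start()

    remaining = len(sources)
    while remaining:
        item = out.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item
```

Order is preserved within each source but not across sources, so a consumer that needs a deterministic order has to sort or tag the items itself.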
pyper-dev|1 year ago
Even though there's currently no built-in support for this, a workaround could be to just define synchronous helper functions to handle running your async logic in an event loop.
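A minimal sketch of such a helper, using only the standard library. `fetch` is a hypothetical async function, and the thread pool stands in for whatever threaded workers would call the synchronous wrapper.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def fetch(x):
    """Hypothetical async function standing in for real async I/O."""
    await asyncio.sleep(0.01)
    return x * 2


def fetch_sync(x):
    """Synchronous wrapper: runs the coroutine in a fresh event loop."""
    # asyncio.run raises RuntimeError if this thread already has a running
    # event loop, so this helper must only be called from synchronous code.
    return asyncio.run(fetch(x))


def run_all(items, workers=4):
    # Each call to fetch_sync creates and tears down its own event loop
    # inside whichever worker thread picks it up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_sync, items))
```

Creating a loop per call is simple but has overhead; a long-lived background loop fed through `asyncio.run_coroutine_threadsafe` is the usual alternative when calls are frequent.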
d0mine|1 year ago
urduntupu|1 year ago
unknown|1 year ago
[deleted]
ge96|1 year ago
gpderetta|1 year ago
kissgyorgy|1 year ago
jeremieca2|1 year ago
[deleted]