`riko` is a pure Python library for analyzing and processing streams of structured data. It's modeled after Yahoo! Pipes [1] and was originally a fork of pipe2py [2]. It has both synchronous and asynchronous (via Twisted) APIs, and supports parallel execution (via multiprocessing).
Out of the box, `riko` can read CSV/XML/JSON/HTML files; create text- and data-based flows via modular pipes; parse and extract RSS/Atom feeds; and do a bunch of other neat things. You can think of `riko` as a poor man's Spark/Storm... stream processing made easy!
Feedback welcome, so let me know what you think!
Resources: FAQ [3], cookbook [4], and IPython notebook [5]
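Conceptually, the modular-pipes idea is just plain functions that consume and yield a stream of items. Here's a minimal sketch of that concept (an illustration only, not riko's actual API; the `pipe_*` names are made up):

```python
# Conceptual sketch of "modular pipes" (NOT riko's actual API):
# each pipe is a plain function that consumes and yields a stream of dicts.
import csv
import io

def pipe_csv(text):
    """Source pipe: parse CSV text into a stream of dicts."""
    for row in csv.DictReader(io.StringIO(text)):
        yield row

def pipe_filter(stream, key, value):
    """Operator pipe: keep only items whose `key` equals `value`."""
    return (item for item in stream if item.get(key) == value)

data = "title,lang\nriko,python\npipe2py,python\nturtle,js\n"
stream = pipe_filter(pipe_csv(data), "lang", "python")
print([item["title"] for item in stream])  # ['riko', 'pipe2py']
```

Because each stage is lazy, nothing is read or filtered until the stream is actually consumed.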
Nice project. I wrote something similar in C# a long time ago [1], mostly to monitor job feeds and Craigslist [2] :-) It supports RSS and Atom, async, various filters, deduplication, etc.
Yahoo Pipes was a nice project, but as its popularity grew, it started getting blocked more and more. It was also hard to build and maintain pipelines with more than a few steps.
This is an implementation of "Flow-Based Programming", right? It's a programming paradigm invented before its time, IMHO; perfect for a world of streaming data.
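For readers unfamiliar with FBP: the core idea is that components are black boxes that communicate only through message queues, and the "program" is just how you wire them together. A minimal single-threaded sketch (component names are made up; a real FBP runtime would run components concurrently):

```python
# Minimal flow-based-programming sketch: independent components connected
# only by queues, wired into a network.
from queue import Queue

def component(func):
    """Wrap a plain function as a component that moves items between queues."""
    def run(inq, outq):
        while True:
            item = inq.get()
            if item is None:          # sentinel: end of stream
                outq.put(None)
                break
            outq.put(func(item))
    return run

upper = component(str.upper)
exclaim = component(lambda s: s + "!")

q1, q2, q3 = Queue(), Queue(), Queue()
for word in ["hello", "world", None]:
    q1.put(word)
upper(q1, q2)      # run each component to completion (no threads, for brevity)
exclaim(q2, q3)

out = []
while (item := q3.get()) is not None:
    out.append(item)
print(out)  # ['HELLO!', 'WORLD!']
```

Since components only see their queues, you can rewire or swap them without touching their internals, which is what makes the paradigm a good fit for streaming data.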
I was a heavy user of pipes and I'm now a heavy user of python. I have built my own dodgy simple replacement for some of the things I used to rely on pipes for. I'm very eager to see what you've got here, at first glance it seems like an excellent fit for my needs.
I just read about dask earlier today, very neat project! riko already handles parallel processing [1] but adding distributed processing sounds tempting. TBH though, distribution isn't high on the priority list. But I'll be happy to accept a PR if you are so inclined :)
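For reference, parallel stream mapping with just the stdlib looks roughly like this (the thread-backed `multiprocessing.dummy` pool is used so the snippet runs anywhere; swap in `multiprocessing.Pool` with a module-level function for real process parallelism; `fetch_and_count` is a made-up stand-in):

```python
# Map a transform over a stream of items with a worker pool.
from multiprocessing.dummy import Pool  # thread-backed, same API as Pool

def fetch_and_count(title):
    # stand-in for an I/O-bound step such as fetching and parsing a feed
    return title, len(title)

titles = ["riko", "pipe2py", "huginn"]
with Pool(3) as pool:
    results = pool.map(fetch_and_count, titles)
print(results)  # [('riko', 4), ('pipe2py', 7), ('huginn', 6)]
```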
Interesting project. I hadn't come across this one yet. One difference is that riko is based around functions, whereas this library (and practically every stream-processing lib I've come across) is based around classes.
I personally prefer the functional approach. And if you compare the word count examples on the respective readmes [1, 2], you'll see riko is much more succinct. But I suppose the verbosity of the other libraries comes with benefits, like scaling across a cluster of servers.
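For comparison, a functional-style word count in plain Python is only a few lines (sample input made up):

```python
# Word count as a chain of plain functions over an iterable, no classes needed.
from collections import Counter
from itertools import chain

lines = ["the quick brown fox", "jumps over the lazy dog"]
words = chain.from_iterable(line.split() for line in lines)
counts = Counter(words)
print(counts["the"])  # 2
```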
Hadn't heard of this one before. There isn't a readme, and the site seems to have been taken over. But from what I can tell from the examples, it has a YAML-based scheduler and does some pretty nifty things like IRC notifications.
riko doesn't have a scheduler (although the original pipe2py has a JSON-based one). However, I do plan to integrate with Airflow/Oozie/Luigi [1-3] in the future to make it easier to design workflows.
The notification system reminds me of Huginn [4]. Since riko is Twisted-based, it should be fairly straightforward to implement something similar for IRC/IMAP/FTP/etc.
This is really interesting. Have you looked at Apache Beam? What I think is interesting about Beam, in this specific context, is that it has a standalone runner (Java) that, similarly to riko, lets you write pipelines without worrying about a complex setup. But then, if you need to scale your computation, Beam is runner-independent: you can take the same code and run it at scale on a cluster, whether it's Spark, Flink, or Google Cloud. You can read more here [1].
As for riko more specifically, Beam will soon have a Python SDK, but I'm unsure if there will be a Python standalone runner. Maybe this is something to look into...
> This is really interesting. Have you looked at Apache Beam?
Just gave it a look. Took a while to find some examples with code, but once I did it made a bit more sense.
> Beam is runner-independent: you can take the same code and run it at scale on a cluster, whether it's Spark, Flink, or Google Cloud.
I thought that was pretty cool.
> As for riko more specifically, Beam will soon have a Python SDK, but I'm unsure if there will be a Python standalone runner. Maybe this is something to look into...
A Python standalone runner would be very useful. Otherwise I'm hesitant to go much further, since my goal is to have a pure Python solution for working with streaming data. Most libraries require installing Java, and that is what I'd like to avoid.
If someone could spin up a usable GUI, charge enough to make a living without compromising on performance, and promise some longevity plus a way to export my stuff, I would probably pay for that. I loved Pipes; the GUI was a big deal for me.
Have you investigated any of the existing GUIs? [1-3] I'd love to hear your thoughts on their pros/cons. I do plan to integrate a nice GUI framework if I can find one.
Sweet. I put together something similar for Node.js, which is now called 'turtle' (because it's turtles all the way down...). There's a bit of a focus on AWS Lambda and other FaaS solutions as a means of building Lambda architectures, but it can be used by itself.
Mashups [1] and Extract Transform Load (ETL) [2] are two big use cases. I developed a freelance project aggregator using an earlier version of riko [3].
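A toy sketch of the ETL pattern in plain Python (the data and field names are made up; a dict stands in for the target database):

```python
# Extract rows from CSV, transform them, and load them into a target store.
import csv
import io

raw = "name,rate\nalice,50\nbob,65\n"

extracted = csv.DictReader(io.StringIO(raw))                  # extract
transformed = ({"name": r["name"].title(), "rate": int(r["rate"])}
               for r in extracted)                            # transform
store = {row["name"]: row["rate"] for row in transformed}     # load
print(store)  # {'Alice': 50, 'Bob': 65}
```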
While I didn't use Yahoo Pipes too often, I loved it. Having this as a Python library (I'm trying to get deeper into Python) is great! Kudos and good luck!
I don't know if it's because of the language (Java) or something else, but I've never been able to grok Apache data projects. I theoretically know what they do, but there's no way I can understand the code, e.g. [1].
Eventually. It would essentially be a "source" pipe. But ultimately, I want to build a plugin system so that end users can create/share their own pipes. I also plan to add pipes that let you add streams to a database.
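Such a plugin system could look something like this (purely hypothetical, not riko's design; the registry and pipe names are made up):

```python
# A registry that user packages can add their own pipes to.
PIPES = {}

def register(name):
    """Decorator: expose a function as a named pipe."""
    def wrap(func):
        PIPES[name] = func
        return func
    return wrap

@register("reverse")
def pipe_reverse(stream):
    for item in stream:
        yield item[::-1]

stream = PIPES["reverse"](["abc", "xyz"])
print(list(stream))  # ['cba', 'zyx']
```

Third-party packages would just call `register` at import time, and end users could then refer to pipes by name when wiring up a flow.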
reubano | 9 years ago
Quickie Demo:
[1] https://web.archive.org/web/20150930021241/http://pipes.yaho...
[2] https://github.com/ggaughan/pipe2py/
[3] https://github.com/nerevu/riko/blob/master/docs/FAQ.rst
[4] https://github.com/nerevu/riko/blob/master/docs/COOKBOOK.rst
[5] http://nbviewer.jupyter.org/github/nerevu/riko/blob/master/e...
olviko | 9 years ago
[1] https://github.com/olviko/RssPercolator
[2] https://github.com/olviko/RssPercolator/blob/master/RssPerco...
alexchamberlain | 9 years ago
Fuzzwah | 9 years ago
Thanks!
reubano | 9 years ago
tanlermin | 9 years ago
It can handle parallel and distributed parts for you.
https://github.com/dask/dask
reubano | 9 years ago
[1] https://github.com/nerevu/riko#parallel-processing
oellegaard | 9 years ago
reubano | 9 years ago
[1] https://github.com/plecto/motorway#word-count-example
[2] https://github.com/nerevu/riko#word-count
raimue | 9 years ago
[1] https://github.com/miyagawa/plagger
reubano | 9 years ago
[1] https://github.com/apache/incubator-airflow
[2] http://oozie.apache.org/
[3] https://github.com/spotify/luigi
[4] https://github.com/cantino/huginn
ecesena | 9 years ago
[1] https://www.oreilly.com/ideas/future-proof-and-scale-proof-y...
reubano | 9 years ago
tudorw | 9 years ago
reubano | 9 years ago
[1] https://azkaban.github.io/
[2] https://developers.google.com/blockly/
[3] http://nodered.org/
qw | 9 years ago
drdoom | 9 years ago
What kind of demand is there for a Pipes-like product, or even a customizable/searchable RSS/feed integrator?
How much would a typical user be willing to pay for it?
ewindisch | 9 years ago
https://github.com/iopipe/turtle
reubano | 9 years ago
et2o | 9 years ago
reubano | 9 years ago
[1] http://mashable.com/2009/10/08/top-mashups/#0XwtqVCCXPq2
[2] https://www.quora.com/How-do-ETL-tools-work
[3] http://app.kazeeki.com/
mxuribe | 9 years ago
svieira | 9 years ago
[1] http://camel.apache.org/
reubano | 9 years ago
[1] http://camel.apache.org/etl-example.html
aioprisan | 9 years ago
xnxn | 9 years ago
reubano | 9 years ago
pastaking | 9 years ago
reubano | 9 years ago
DyslexicAtheist | 9 years ago
reubano | 9 years ago
satai | 9 years ago
reubano | 9 years ago