item 12136011

Show HN: Riko – A Python stream processing engine modeled after Yahoo! Pipes

283 points | reubano | 9 years ago | github.com

67 comments

[+] reubano | 9 years ago
`riko` is a pure Python stream processing library for analyzing and processing streams of structured data. It's modeled after Yahoo! Pipes [1] and was originally a fork of pipe2py [2]. It has both synchronous and asynchronous (via Twisted) APIs, and supports parallel execution (via multiprocessing).

Out of the box, `riko` can read csv/xml/json/html files; create text- and data-based flows via modular pipes; parse and extract RSS/Atom feeds; and do a bunch of other neat things. You can think of `riko` as a poor man's Spark/Storm... stream processing made easy!

Feedback welcome so let me know what you think!

Resources: FAQ [3], cookbook [4], and ipython notebook [5]

Quickie Demo:

    >>> from riko.modules import fetch
    >>> 
    >>> stream = fetch.pipe(conf={'url': 'https://news.ycombinator.com/rss'})
    >>> item = next(stream)
    >>> item['title'], item['link']
    ('Master Plan, Part Deux', 'https://www.tesla.com/blog/master-plan-part-deux')
[1] https://web.archive.org/web/20150930021241/http://pipes.yaho...

[2] https://github.com/ggaughan/pipe2py/

[3] https://github.com/nerevu/riko/blob/master/docs/FAQ.rst

[4] https://github.com/nerevu/riko/blob/master/docs/COOKBOOK.rst

[5] http://nbviewer.jupyter.org/github/nerevu/riko/blob/master/e...
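For readers without a network connection handy, the chaining idea behind the demo can be sketched in plain Python: each stage is a generator over dicts, consumed lazily like the `fetch.pipe` stream above. This is an illustration of the idea only, not riko's actual API; all names here are made up.

```python
# A minimal pure-Python sketch of the pipe idea riko is built around:
# each "pipe" takes an iterable of items (dicts) and yields transformed
# items, so stages chain lazily like Unix pipes. Not riko's real API.

def source(items):
    """A 'source' pipe: yields raw items into the stream."""
    for item in items:
        yield dict(item)

def truncate(stream, limit):
    """A 'transform' pipe: pass through at most `limit` items."""
    for i, item in enumerate(stream):
        if i >= limit:
            break
        yield item

def extract(stream, *keys):
    """An 'output' pipe: keep only the requested keys."""
    for item in stream:
        yield {k: item[k] for k in keys}

feed = [
    {'title': 'Master Plan, Part Deux',
     'link': 'https://www.tesla.com/blog/master-plan-part-deux',
     'score': 283},
    {'title': 'Show HN: riko',
     'link': 'https://github.com/nerevu/riko',
     'score': 283},
]

stream = extract(truncate(source(feed), 1), 'title', 'link')
print(next(stream))
# {'title': 'Master Plan, Part Deux', 'link': 'https://www.tesla.com/blog/master-plan-part-deux'}
```

Because every stage is a generator, nothing runs until the stream is consumed, which is what makes this shape cheap for long or infinite feeds.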

[+] olviko | 9 years ago
Nice project. I wrote something similar in C# a long time ago [1], mostly to monitor job feeds and craigslist [2] :-) It supports RSS and Atom, async, various filters, deduplication, etc.

Yahoo Pipes was a nice project, but as its popularity grew, it started getting blocked more and more. It was also hard to build and maintain pipelines with more than a few steps.

[1] https://github.com/olviko/RssPercolator

[2] https://github.com/olviko/RssPercolator/blob/master/RssPerco...

[+] alexchamberlain | 9 years ago
This is an implementation of "Flow Based Programming", right? It's a programming paradigm invented before its time, IMHO; perfect for a world of streaming data.
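The core FBP idea can be shown in a few lines: independent components that only know their input/output "ports", wired together by the network definition rather than by calling each other directly. This is a toy illustration with made-up names, not any particular FBP framework.

```python
# Toy flow-based-programming sketch: components communicate only through
# "ports" (here, plain deques); the network wiring at the bottom is the
# only place that knows how they connect. Hypothetical names throughout.
from collections import deque

def reader(out_port, lines):
    """Source component: pushes raw lines onto its output port."""
    for line in lines:
        out_port.append(line)

def upper(in_port, out_port):
    """Transform component: uppercases whatever arrives on its input port."""
    while in_port:
        out_port.append(in_port.popleft().upper())

def collect(in_port, sink):
    """Sink component: drains its input port into a result list."""
    while in_port:
        sink.append(in_port.popleft())

# The "network": components share nothing but the connections between ports.
a, b, result = deque(), deque(), []
reader(a, ['hello', 'pipes'])
upper(a, b)
collect(b, result)
print(result)  # ['HELLO', 'PIPES']
```

A real FBP runtime schedules the components concurrently and bounds the ports, but the wiring-over-calling structure is the part that matters.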
[+] Fuzzwah | 9 years ago
I was a heavy user of pipes and I'm now a heavy user of python. I have built my own dodgy simple replacement for some of the things I used to rely on pipes for. I'm very eager to see what you've got here, at first glance it seems like an excellent fit for my needs.

Thanks!

[+] reubano | 9 years ago
Please let me know what you think. I worked pretty hard on the README, so let me know if anything is confusing and/or doesn't make sense.
[+] tanlermin | 9 years ago
Can you consider Dask integration? It can handle the parallel and distributed parts for you.

http://distributed.readthedocs.io/en/latest/queues.html

https://github.com/dask/dask

[+] reubano | 9 years ago
I just read about Dask earlier today, very neat project! riko already handles parallel processing [1], but adding distributed processing sounds tempting. TBH though, distribution isn't high on the priority list. But I'll be happy to accept a PR if you're so inclined :)

[1] https://github.com/nerevu/riko#parallel-processing
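riko's parallel mode uses multiprocessing, but the fan-out/fan-in shape is the same one you'd use for I/O-bound work like fetching many feeds at once. Here's a stdlib-only sketch with a thread pool; `fetch_title` is a stand-in, not a real network call or riko function.

```python
# Fan-out/fan-in over a pool of workers, the shape riko's parallel mode
# uses (riko itself uses multiprocessing; threads shown here because
# feed-fetching is I/O-bound). fetch_title is a stand-in, not riko's API.
from concurrent.futures import ThreadPoolExecutor

FEEDS = {
    'https://example.com/a.rss': 'Feed A',
    'https://example.com/b.rss': 'Feed B',
    'https://example.com/c.rss': 'Feed C',
}

def fetch_title(url):
    # Stand-in for a real network fetch + parse step.
    return FEEDS[url]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map fans the URLs out to workers and gathers results in order.
    titles = list(pool.map(fetch_title, sorted(FEEDS)))

print(titles)  # ['Feed A', 'Feed B', 'Feed C']
```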

[+] oellegaard | 9 years ago
If you're looking for a stream processing engine closer to Storm, etc., but still simple, check out Motorway: https://github.com/plecto/motorway :-)
[+] reubano | 9 years ago
Interesting project. I hadn't come across this one yet. One difference is that riko is based around functions, whereas this library (and practically every stream processing lib I've come across) is based around classes.

I personally much prefer the functional approach. And if you compare the word count examples in the respective READMEs [1, 2], you will see riko is much more succinct. But I suppose the verbosity of the other libraries comes with benefits like scaling across a cluster of servers.

[1] https://github.com/plecto/motorway#word-count-example

[2] https://github.com/nerevu/riko#word-count
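To make the functions-vs-classes point concrete, a functional word count needs only a couple of composable functions over a stream. This is in the spirit of riko's README example but written against the stdlib; riko's actual pipes take `conf` dicts rather than bare arguments.

```python
# A functional word count: plain functions composed over a lazy stream,
# no classes. Illustrative of the style, not riko's actual pipe API.
from collections import Counter
from itertools import chain

def tokenize(stream):
    """Lazily split each line into lowercase words."""
    return chain.from_iterable(line.lower().split() for line in stream)

def count(words):
    """Reduce a word stream to {word: frequency}."""
    return Counter(words)

lines = ['to be or not to be', 'that is the question']
counts = count(tokenize(lines))
print(counts['to'], counts['be'])  # 2 2
```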

[+] raimue | 9 years ago
I am still a user of Plagger [1], but development halted quite some time ago. Maybe this could be a good replacement.

[1] https://github.com/miyagawa/plagger

[+] reubano | 9 years ago
Hadn't heard of this one before. There isn't a README, and the site seems to have been taken over. But from what I can tell from the examples, it has a YAML-based scheduler and does some pretty nifty things like IRC notifications.

riko doesn't have a scheduler (although the original pipe2py has a JSON-based one). However, I do plan to integrate with Airflow/Oozie/Luigi [1-3] in the future to make it easier to design workflows.

The notification system reminds me of Huginn [4]. Since riko is Twisted-based, it should be fairly straightforward to implement something similar for IRC/IMAP/FTP/etc.

[1] https://github.com/apache/incubator-airflow

[2] http://oozie.apache.org/

[3] https://github.com/spotify/luigi

[4] https://github.com/cantino/huginn

[+] ecesena | 9 years ago
This is really interesting. Have you looked at Apache Beam? What I think is interesting about Beam, in this specific context, is that it has a standalone runner (Java) that, similarly to riko, lets you write pipelines without worrying about a complex setup. But then, if you need to scale your computation, Beam is runner-independent: you can take the same code and run it at scale on a cluster, whether it's Spark, Flink, or Google Cloud. You can read more here [1].

As for riko more specifically, Beam will soon have a Python SDK, but I'm unsure if there will be a Python standalone runner. Maybe this is something to look into...

[1] https://www.oreilly.com/ideas/future-proof-and-scale-proof-y...

[+] reubano | 9 years ago
> This is really interesting. Have you looked at Apache Beam?

Just gave it a look. Took a while to find some examples with code, but once I did it made a bit more sense.

> Beam is runner-independent: you can take the same code and run it at scale on a cluster, whether it's Spark, Flink, or Google Cloud.

I thought that was pretty cool.

> As for riko more specifically, Beam will soon have a Python SDK, but I'm unsure if there will be a Python standalone runner. Maybe this is something to look into...

A Python standalone runner would be very useful. Otherwise I'm hesitant to go much further, since my goal is to have a pure Python solution for working with streaming data. Most libraries require installing Java, and that's what I'd like to avoid.

[+] tudorw | 9 years ago
If someone can spin up a usable GUI, charge enough to make a living without compromising on performance, promise some longevity, and offer a way to export my stuff, I would probably pay for that. I loved Pipes; the GUI was a big deal for me.
[+] drdoom | 9 years ago
Just curious:

What kind of demand is there for a Pipes-like product, or even a customizable/searchable RSS/feed integrator?

How much would a typical user be willing to pay for it?

[+] mxuribe | 9 years ago
While I didn't use Yahoo Pipes too often, I loved it. Having this as a Python library (I'm trying to get deeper into Python) is great! Kudos and good luck!
[+] svieira | 9 years ago
Also in this space (and worth looking at for inspiration, especially for other potential sources and sinks of data) - Apache Camel [1].

[1]: http://camel.apache.org/

[+] reubano | 9 years ago
I don't know if it's because of the language (Java) or something else, but I've never been able to grok Apache data projects. I theoretically know what they do, but there's no way I can understand the code, e.g. [1].

[1] http://camel.apache.org/etl-example.html

[+] pastaking | 9 years ago
You might also want to check out http://concord.io. It's a bit more work to set up, but it's much faster than most stream processing systems.
[+] reubano | 9 years ago
How does Concord differ from the others (Spark/Storm/Flink/etc.)? Aside from being written in C, that is.
[+] DyslexicAtheist | 9 years ago
This is absolutely beautiful. Love the fact that it's using RSS for this.
[+] reubano | 9 years ago
Thank you. Apparently RSS never got the memo that it "died" ;).
[+] satai | 9 years ago
Looks nice. Are there any plans for Twitter support?
[+] reubano | 9 years ago
Eventually. It would essentially be a "source" pipe. But ultimately, I want to build a plugin system so that end users can create/share their own pipes. I also plan to add pipes that let you write streams to a database.
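A plugin system like the one described could be as small as a named registry that flows look sources up from. Everything below is hypothetical; riko has no such plugin API, and `fetch_tweets` is a stand-in, not a real Twitter client.

```python
# Hypothetical sketch of a plugin-style "source" pipe registry: pipes
# register under a name, and flows look them up from the registry.
# None of this is riko's actual API; fetch_tweets fakes its data.
PIPES = {}

def register(name):
    """Decorator that records a pipe function under a public name."""
    def wrap(fn):
        PIPES[name] = fn
        return fn
    return wrap

@register('fetch_tweets')
def fetch_tweets(conf):
    # Stand-in source: a real plugin would call the Twitter API here.
    for i in range(conf.get('count', 2)):
        yield {'id': i, 'text': f'tweet {i}'}

stream = PIPES['fetch_tweets']({'count': 2})
print([item['text'] for item in stream])  # ['tweet 0', 'tweet 1']
```

Because the source is just a generator yielding dicts, it slots into the same lazy stream shape as every other pipe, which is what would let third-party plugins compose with built-in ones.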