S4

350 points| m0th87 | 15 years ago |s4.io | reply

59 comments

[+] chr15|15 years ago|reply
The Github repo has an example application: https://github.com/s4/examples/tree/master/twittertopiccount...

It's a twitter topic counter: "This application detects popular hashtags on Twitter by listening to the Twitter gardenhose."

From http://labs.yahoo.com/event/99 :

"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers."

Actor model: http://en.wikipedia.org/wiki/Actor_model
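
To make the quoted description a bit more concrete, here is a minimal single-process sketch of "keyed events routed with affinity to Processing Elements". It is not S4's actual API: `CountPE`, `Router` and `topic_counts` are hypothetical names invented for illustration, and a real deployment would spread the PE instances across nodes rather than keep them in one dictionary.

```python
class CountPE:
    """Hypothetical processing element: one instance per key, counts its events."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += 1
        # Emit a new keyed event that a downstream PE could consume.
        return ("count", self.key, self.count)


class Router:
    """Routes each event to the PE instance that owns the event's key."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.instances = {}

    def dispatch(self, key, event):
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: create the PE instance lazily.
            pe = self.instances[key] = self.pe_class(key)
        return pe.process(event)


def topic_counts(hashtags):
    """Feed a sequence of hashtags through the router and keep the latest counts."""
    router = Router(CountPE)
    latest = {}
    for tag in hashtags:
        _, key, count = router.dispatch(tag, {"topic": tag})
        latest[key] = count  # stands in for the "publish results" step
    return latest


if __name__ == "__main__":
    print(topic_counts(["#s4", "#hadoop", "#s4"]))
```

Because every event with the same key reaches the same PE instance, each instance can keep its state locally without locks, which is where the comparison with the Actor model comes from.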

[+] jamii|15 years ago|reply
The model is very similar to the one argued for in this paper:

http://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf

That paper was a big inspiration when we were redesigning the betting exchange at smarkets. It's a very well reasoned exposition of why this is the only sensible architecture for large scale distributed systems.

[+] protomyth|15 years ago|reply
I always thought it would be very interesting to try and build a language that had an Actor focus but used simple objects for basic stuff (a pure model seems a little odd - I guess C++ for actors). It just seems like a natural way to organize a big system. We talk about certain objects as doing things to other objects, and it would provide a simpler concurrency.

Actors, linda, tuples..... hum....

[+] SteveArmstrong|15 years ago|reply
I was hoping to find a "for example, S4 can be used to" line in there, but I didn't see it initially. I assume filtering the Twitter fire-hose of data could be a common use?
[+] ajessup|15 years ago|reply
Yahoo just released a paper explaining what S4 is and the rationale for its development, and providing a detailed comparison with Hadoop (and map/reduce frameworks in general).

http://labs.yahoo.com/node/476

[+] barrkel|15 years ago|reply
It sounds like a slightly more structured and distributed Unix shell pipeline; but from looking at the twitter example, a lot more awkward to use, owing to being structured around Java.

I imagine a composition language (DSL) wrapped around it could greatly improve its usability, especially for ad-hoc experimentation; at least it would be one better than Spring IoC XML.

[+] gfodor|15 years ago|reply
This looks really great. If it delivers as advertised, it will be a very nice replacement for certain classes of MapReduce jobs.
[+] dnewcome|15 years ago|reply
I wish I had heard about this a few months ago. I wanted to implement a way to create and connect streaming web services. I hacked up something then called webpipes (https://github.com/dnewcome/webpipes) using node.js. Unfortunately I haven't looked at it again since I first put it up on github. S4 looks a ton more advanced than anything I was envisioning, but I still think that something simple done in one of the evented servers like node.js (the S4 implementation looks to be Java) would be useful.
[+] requinot59|15 years ago|reply
A good, simpler than S4, solution may be to use zeromq (PUB/SUB).
[+] skullsplitter|15 years ago|reply
In the twitter demo, I noticed this pathological-looking string concat statement:

https://github.com/s4/examples/blob/master/twittertopiccount...

I'm too dim to figure out why it would be done this way (besides the fact that it's an early proof-of-concept demo). Any idea?

[+] mzl|15 years ago|reply
I'm not sure why you think it looks pathological? If you want to construct the data they want to construct, what would you do?
[+] wrath|15 years ago|reply
Does anyone have any real life examples of what this could be used for? I get what it does, just not quite sure where it fits in.

For example, do I push data into S4, or does S4 poll for data? Is this like a distributed task system, where I distribute my tasks evenly across multiple servers seamlessly?

[+] bryansum|15 years ago|reply
This seems conceptually similar to http://www.cascading.org/ (at least looking at the code examples: http://www.cascading.org/1.1/userguide/html/ch02.html).
[+] anandkesari|15 years ago|reply
S4 processes streams of data, one element at a time, as they arrive; outputs are produced incrementally. MapReduce and its derivatives (Cascading, Pig, etc) are batch-oriented.

Stream processing jobs can be massaged to fit into the MapReduce paradigm, but S4 provides a more natural solution.
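
A rough illustration of that difference, as a sketch only: `batch_counts` and `streaming_counts` are hypothetical helper names, and neither reflects how S4 or MapReduce is actually implemented. The batch function needs the whole (finite) input before it can answer; the streaming one produces an updated result after every element and never needs the input to end.

```python
from collections import Counter


def batch_counts(events):
    """Batch style: consume the complete input, then return one final result."""
    return Counter(events)


def streaming_counts(events):
    """Stream style: yield an updated (event, count) pair as each event arrives."""
    counts = Counter()
    for event in events:
        counts[event] += 1
        yield event, counts[event]


if __name__ == "__main__":
    stream = ["#s4", "#hadoop", "#s4"]
    print(dict(batch_counts(stream)))
    for update in streaming_counts(stream):
        print(update)
```

The streaming version also works on an iterator that never terminates, which is the case the batch version cannot handle without first cutting the stream into windows.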

[+] LiveTheDream|15 years ago|reply
Nitpick - the highlighted/emphasized text is essentially indistinguishable from hyperlinks.
[+] sriramk|15 years ago|reply
My first reaction is that this sounds similar to SQL Server Stream Insight (in terms of processing continuous streams of data)
[+] stupidsignup|15 years ago|reply
Well, I would need a little bit more info than the "detailed information" presented there.
[+] ithkuil|15 years ago|reply
Great hype, but it is still not clear what this is good for. Does anybody know of any use case other than the twitter topic count example?
[+] sabat|15 years ago|reply
S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data.

I'm sure this is cool and useful technology. At this moment, from the marketing-speak, I have no idea what it does except that it has something to do with volumes of streaming data. Whose data? Is it a service? (Maybe not, since you can download it?) What could it do for me (in simple terms)? What's a basic use case? Why do people assume that we can mind-read?

[+] scott_s|15 years ago|reply
It's not marketing speak, it's research speak. I have worked on a similar project (and will work on it again in the future), and I know exactly what they mean by those things.

General purpose: in the same way C is a "general purpose" language. It can handle arbitrary problems.

Distributed: designed to be used across multiple compute nodes.

Scalable: they've made the effort to ensure that performance increases as they increase the number of compute nodes.

Partially fault-tolerant: node failure does not mean the results of the computation are lost. "Partially," I assume, implies they can't guarantee this completely.

Continuous unbounded streams of data: think sensors that are constantly sending more data. Or a stock market ticker. Or radio telescopes constantly monitoring the sky. Or a medical patient's various monitors.

The reason these terms don't resonate with you is that this type of application - this type of programming - is something you're not familiar with.

[+] olalonde|15 years ago|reply
My attempt at a more friendly description of S4 (based on my very limited understanding of it - please correct me if I'm wrong):

When processing large amounts of streaming data, you have to process the data as fast as it comes in or else you can't keep up. You probably wonder: "Why not simply delay the processing of new data while the old data is being processed?". The problem lies in the fact we're dealing with a stream which always brings in new data. Eventually, the virtual line up of delayed data will occupy all available memory.

A solution to this is to dispatch the data between multiple computers which can each independently process the data they receive and then send back the result of their processing to a central computer whose job is to put the results back together. Can't keep up with the stream? Simply add a new computer! That's more or less what S4 does.
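
The arithmetic behind this can be sketched in a few lines, under the simplifying assumption of constant arrival and processing rates. All function names here are hypothetical, and the key-based assignment is borrowed from the "routed with affinity" wording in the S4 description rather than from the central-computer picture above.

```python
import math
import zlib


def backlog(arrival_rate, service_rate, workers, seconds):
    """Events still waiting after `seconds`, assuming constant rates (events/s)."""
    capacity = service_rate * workers
    return max(0, (arrival_rate - capacity) * seconds)


def workers_needed(arrival_rate, service_rate):
    """Smallest number of workers whose combined rate keeps up with the stream."""
    return math.ceil(arrival_rate / service_rate)


def assign_worker(key, workers):
    """Stable key-to-worker assignment: the same key always maps to the same worker."""
    return zlib.crc32(key.encode("utf-8")) % workers


if __name__ == "__main__":
    # One worker handling 300 events/s against a 1000 events/s stream falls behind.
    print(backlog(1000, 300, workers=1, seconds=60))
    print(workers_needed(1000, 300))
    print(backlog(1000, 300, workers=4, seconds=60))
```

With a single worker the backlog grows by 700 events every second and never shrinks, which is the "eventually occupies all available memory" problem; once the combined capacity exceeds the arrival rate, the backlog stays at zero.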

[+] samratjp|15 years ago|reply
"At Yahoo! Labs we design algorithms that are primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware."

Via http://labs.yahoo.com/event/99

I could've sworn there was a blurb there about where they are using it, and I recall them using "real time map-reduce jobs" for things like live bidding on ads and such; another use case would be stock market data.

You're right that it's marketing speak; this project has gotten too much attention on HN in the last few days, even when all the git repo had was an initial commit. It's too bad they don't have a proper explanation; maybe it's because they probably weren't expecting all this attention yet.

[+] LiveTheDream|15 years ago|reply
> Whose data? Anyone's data that you can stuff into the system: "The drivers to read from and write to the platform can be implemented in any language making it possible to integrate with legacy data sources and systems."

> Is it a service? No, it's a platform. You could turn it into a Platform-as-a-Service, like Amazon does with various technologies. "S4 is a ... platform"

> What could it do for me (in simple terms)? What's a basic use case? My first thought would be real-time trending calculations. You have a massive, never-ending stream of data...how do you extract real-time insights from that?

> Why do people assume that we can mind-read? Perhaps because after being immersed in a project for a long time, it's easy to forget what is obvious to you, but non-obvious to others.

[+] noodle|15 years ago|reply
looks like it lets you take a large, continuously updating stream of data and distribute the processing of it across multiple nodes. kinda like mapreduce.
[+] jamii|15 years ago|reply
As I understand it the processing model is similar to Yahoo pipes or good old unix pipes, except with easy support for parallel processing, distribution and fault tolerance.
[+] igrekel|15 years ago|reply
Just looked a little at the documentation. It seems like it's an engine for stream processing. Think of it as data mining up front. You figure out what information you want and collect it as the data comes in instead of storing all the data and mining for what you want from the accumulated pile.
[+] olalonde|15 years ago|reply
Upvoted the story just so more people read your comment. This kind of "marketing-speak" focused on technical details is way too widespread. The project may be technical by nature but there's got to be a higher level way of describing it.
[+] earl|15 years ago|reply
If this were relevant to you, you'd know what most of that means, particularly stream processing. Why exactly is yahoo faulted for your lack of understanding of the technical vocabulary? I'm mostly annoyed that you think that using the usual vocabulary to discuss a problem means that people have to mind-read.