"S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers."
I always thought it would be very interesting to try to build a language that had an Actor focus but used simple objects for basic stuff (a pure model seems a little odd - I guess a "C++ for actors"). It just seems like a natural way to organize a big system. We talk about certain objects as doing things to other objects, and it would provide a simpler concurrency model.
I was hoping to find a "for example, S4 can be used to" line in there, but I didn't see one initially. I assume filtering the Twitter fire-hose of data could be a common use?
Yahoo just released a paper explaining what S4 is and the rationale for its development, and providing a detailed comparison with Hadoop (and map/reduce frameworks in general).
It sounds like a slightly more structured and distributed Unix shell pipeline; but from looking at the twitter example, a lot more awkward to use, owing to being structured around Java.
I imagine a composition language (DSL) wrapped around it could improve its usability - especially ad-hoc experimentation - greatly; at least one better than Spring IoC xml.
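As a sketch of what such a wrapper might look like (purely hypothetical - no such DSL exists for S4), even a small fluent builder in plain Java reads closer to a shell pipeline than Spring IoC XML:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Purely hypothetical: S4 ships no such DSL. This only illustrates the
// kind of fluent composition a thin wrapper could offer instead of XML.
final class Flow {
    private final List<Function<List<String>, List<String>>> stages = new ArrayList<>();

    Flow filter(Predicate<String> p) {
        stages.add(in -> {
            List<String> out = new ArrayList<>();
            for (String s : in) if (p.test(s)) out.add(s);
            return out;
        });
        return this;
    }

    Flow map(Function<String, String> f) {
        stages.add(in -> {
            List<String> out = new ArrayList<>();
            for (String s : in) out.add(f.apply(s));
            return out;
        });
        return this;
    }

    // Run every stage in order over a (finite, in-memory) input.
    List<String> run(List<String> input) {
        List<String> cur = input;
        for (Function<List<String>, List<String>> stage : stages) cur = stage.apply(cur);
        return cur;
    }
}
```

Something like `new Flow().filter(s -> s.startsWith("#")).map(String::toLowerCase)` is the sort of ad-hoc experimentation that XML wiring makes painful.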
I wish I had heard about this a few months ago. I wanted to implement a way to create and connect streaming web services. I hacked up something then called webpipes (https://github.com/dnewcome/webpipes) using node.js. Unfortunately I haven't looked at it again since I first put it up on github. S4 looks a ton more advanced than anything I was envisioning, but I still think that something simple done in one of the evented servers like node.js (the S4 implementation looks to be Java) would be useful.
Does anyone have any real life examples of what this could be used for? I get what it does, just not quite sure where it fits in.
For example, do I push data into S4, or does S4 poll for data?
Is this like a distributed task system, where I distribute my tasks evenly across multiple servers seamlessly?
S4 processes streams of data, one element at a time, as they arrive; outputs are produced incrementally. MapReduce and its derivatives (Cascading, Pig, etc.) are batch-oriented.
Stream processing jobs can be massaged to fit into the MapReduce paradigm, but S4 provides a more natural solution.
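The contrast can be made concrete with a toy word counter (illustrative only, not S4 code): the streaming version has an answer after every element, while the batch version only answers once a finite input is exhausted.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative toy, not S4 code: streaming vs. batch word counting.
final class WordCount {
    private final Map<String, Integer> counts = new HashMap<>();

    // Streaming style: update state per element and answer immediately;
    // output is produced incrementally, the input never has to "end".
    int observe(String word) {
        return counts.merge(word, 1, Integer::sum);
    }

    // Batch style (MapReduce-like): only meaningful once the finite
    // input is complete.
    static Map<String, Integer> batch(List<String> words) {
        Map<String, Integer> m = new HashMap<>();
        for (String w : words) m.merge(w, 1, Integer::sum);
        return m;
    }
}
```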
> "S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data."
I'm sure this is cool and useful technology. At this moment, from the marketing-speak, I have no idea what it does except that it has something to do with volumes of streaming data. Whose data? Is it a service? (Maybe not, since you can download it?) What could it do for me (in simple terms)? What's a basic use case? Why do people assume that we can mind-read?
It's not marketing speak; it's research speak. I have worked on a similar project (and will work on it again in the future), and I know exactly what they mean by those things.
General purpose: in the same way C is a "general purpose" language. It can handle arbitrary problems.
Distributed: designed to be used across multiple compute nodes.
Scalable: they've made the effort to ensure that performance increases as they increase the number of compute nodes.
Partially fault-tolerant: node failure does not mean the results of the computation are lost. "Partially," I assume, implies they can't guarantee this completely.
Continuous unbounded streams of data: think sensors that are constantly sending more data. Or a stock market ticker. Or radio telescopes constantly monitoring the sky. Or a medical patient's various monitors.
The reason these terms don't resonate with you is that this type of application - this type of programming - is something you're not familiar with.
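The defining constraint of an unbounded stream is that you can never hold all the data, so state must stay small. A toy illustration (unrelated to S4's internals): a running mean over a ticker, kept in O(1) memory with Welford's update.

```java
// Toy illustration (unrelated to S4's internals): a bounded-memory
// summary of an unbounded stream. A ticker consumer cannot store every
// tick, so it keeps O(1) state -- here a running mean, updated per
// element with Welford's method.
final class RunningMean {
    private long n = 0;
    private double mean = 0.0;

    void add(double x) {
        n++;
        mean += (x - mean) / n;  // state size is constant however long the stream runs
    }

    double mean()  { return mean; }
    long   count() { return n; }
}
```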
My attempt at a more friendly description of S4 (based on my very limited understanding of it - please correct me if I'm wrong):
When processing large amounts of streaming data, you have to process the data as fast as it comes in or else you can't keep up. You probably wonder: "Why not simply delay the processing of new data while the old data is being processed?" The problem lies in the fact that we're dealing with a stream which always brings in new data. Eventually, the virtual line-up of delayed data will occupy all available memory.
A solution to this is to dispatch the data among multiple computers, which can each independently process the data they receive and then send the result of their processing back to a central computer whose job is to put the results back together. Can't keep up with the stream? Simply add a new computer! That's more or less what S4 does.
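That dispatch-and-merge scheme can be sketched in a few lines (an in-process toy, with maps standing in for the worker machines; the worker count and hash routing are illustrative assumptions, not how S4 actually partitions):

```java
import java.util.HashMap;
import java.util.Map;

// In-process toy of the dispatch-and-merge idea (maps stand in for the
// worker machines; the worker count and hash routing are illustrative).
final class PartitionedCounter {
    private final Map<String, Long>[] workers;

    @SuppressWarnings("unchecked")
    PartitionedCounter(int n) {
        workers = new Map[n];
        for (int i = 0; i < n; i++) workers[i] = new HashMap<>();
    }

    // Dispatch: the same key always lands on the same worker, so each
    // worker can count independently. Can't keep up? Raise n.
    void send(String key) {
        int w = Math.floorMod(key.hashCode(), workers.length);
        workers[w].merge(key, 1L, Long::sum);
    }

    // The central step merges the workers' partial results.
    Map<String, Long> collect() {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> part : workers)
            part.forEach((k, v) -> total.merge(k, v, Long::sum));
        return total;
    }
}
```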
"At Yahoo! Labs we design algorithms that are primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware."
I could've sworn there was a blurb there about where they are using it; I recall "real-time map-reduce jobs" for things like live bidding on ads and such. Another use case would be stock market data.
You're right that it's marketing speak; this project has gotten too much attention on HN in the last few days, even when all the git repo had was an initial commit. It's too bad they don't have a proper explanation; maybe it's because they weren't expecting all this attention yet.
> Whose data?
Anyone's data that you can stuff into the system:
"The drivers to read from and write to the platform can be implemented in any language making it possible to integrate with legacy data sources and systems."
> Is it a service?
No, it's a platform. You could turn it into a Platform-as-a-Service, like Amazon does with various technologies.
"S4 is a ... platform"
> What could it do for me (in simple terms)? What's a basic use case?
My first thought would be real-time trending calculations. You have a massive, never-ending stream of data...how do you extract real-time insights from that?
> Why do people assume that we can mind-read?
Perhaps because after being immersed in a project for a long time, it's easy to forget what is obvious to you, but non-obvious to others.
Looks like it lets you take a large, continuously updating stream of data and distribute the processing of it across multiple nodes. Kinda like MapReduce.
As I understand it the processing model is similar to Yahoo pipes or good old unix pipes, except with easy support for parallel processing, distribution and fault tolerance.
Just looked a little at the documentation. It seems like it's an engine for stream processing. Think of it as data mining up front: you figure out what information you want and collect it as the data comes in, instead of storing all the data and mining the accumulated pile for what you want.
Upvoted the story just so more people read your comment. This kind of "marketing-speak" focused on technical details is all too widespread. The project may be technical by nature, but there's got to be a higher-level way of describing it.
If this were relevant to you, you'd know what most of that means, particularly stream processing. Why exactly is Yahoo faulted for your lack of understanding of the technical vocabulary? I'm mostly annoyed that you think that using the usual vocabulary to discuss a problem means that people have to mind-read.
chr15 | 15 years ago
It's a Twitter topic counter: "This application detects popular hashtags on Twitter by listening to the Twitter gardenhose."
From http://labs.yahoo.com/event/99 (the announcement quoted at the top).
Actor model: http://en.wikipedia.org/wiki/Actor_model
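Conceptually, the Twitter example boils down to something like this (a rough sketch, not the actual twittertopiccount code): pull hashtags out of each status as it streams past and keep running counts, so "popular" tags can be read off at any moment.

```java
import java.util.HashMap;
import java.util.Map;

// Rough conceptual sketch, not the actual twittertopiccount code:
// extract hashtags from each status as it streams past and keep
// running counts that can be queried at any time.
final class HashtagCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    void onStatus(String text) {
        for (String token : text.split("\\s+")) {
            if (token.length() > 1 && token.startsWith("#")) {
                counts.merge(token.toLowerCase(), 1, Integer::sum);
            }
        }
    }

    int count(String tag) {
        return counts.getOrDefault(tag.toLowerCase(), 0);
    }
}
```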
jamii | 15 years ago
http://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf
That paper was a big inspiration when we were redesigning the betting exchange at Smarkets. It's a very well-reasoned exposition of why this is the only sensible architecture for large-scale distributed systems.
protomyth | 15 years ago
Actors, Linda, tuples... hmm...
SteveArmstrong | 15 years ago
LiveTheDream | 15 years ago
Sure enough, there is a Twitter example.
rchowe | 15 years ago
http://www.supersimplestorageservice.com/
maximilian | 15 years ago
ajessup | 15 years ago
http://labs.yahoo.com/node/476
barrkel | 15 years ago
strlen | 15 years ago
If you want to stick with the JVM, have you considered using Scala and the cake pattern?
http://jonasboner.com/2008/10/06/real-world-scala-dependency...
If you want to stick with Java and want to use IoC without the hell that is Spring, I suggest Guice (which consists of a smaller, cleaner core and uses annotations and DSLs in place of XML):
http://code.google.com/p/google-guice/
gfodor | 15 years ago
vijayr | 15 years ago
dnewcome | 15 years ago
requinot59 | 15 years ago
skullsplitter | 15 years ago
https://github.com/s4/examples/blob/master/twittertopiccount...
I'm too dim to figure out why it would be done this way (besides the fact that it's an early proof-of-concept demo). Any ideas?
mzl | 15 years ago
wrath | 15 years ago
bryansum | 15 years ago
anandkesari | 15 years ago
warstory | 15 years ago
Inviz | 15 years ago
LiveTheDream | 15 years ago
nspiegelberg | 15 years ago
lrm242 | 15 years ago
sriramk | 15 years ago
stupidsignup | 15 years ago
ithkuil | 15 years ago
sabat | 15 years ago
scott_s | 15 years ago
olalonde | 15 years ago
allertonm | 15 years ago
In the enterprise software world, this is what's called CEP - Complex Event Processing: http://en.wikipedia.org/wiki/Complex_event_processing
samratjp | 15 years ago
Via http://labs.yahoo.com/event/99
LiveTheDream | 15 years ago
noodle | 15 years ago
jamii | 15 years ago
igrekel | 15 years ago
olalonde | 15 years ago
earl | 15 years ago