top | item 3752340

Why Node.js streams are awesome

129 points | dmmalam | 14 years ago | blog.dump.ly

65 comments

[+] masklinn|14 years ago|reply
> The only downside is that it’s conceptually more complicated, and requires some understanding of underlying components (zip files, http responses, streams).

There's at least one more downside: the user loses all indication of progress, as the Content-Length is unknown when the headers are sent.

[+] dmmalam|14 years ago|reply
Nice catch, we are working on it!

dumply knows the exact size of each image, as it is saved in the DB on upload, and all the zip byte headers are fixed, so the zip file size should be deterministic and calculable even before the first byte is sent. Remember, we don't compress the already-compressed images.

If you didn't know the file sizes, for example if you had raw input streams of unknown length, or compressible data, you could still guesstimate the Content-Length so the user gets some progress bar, even if it isn't 100% accurate.
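A sketch of the arithmetic dmmalam describes, assuming a plain "stored" zip (no compression, no zip64, no data descriptors) and ASCII filenames; the fixed sizes come from the zip spec (30-byte local file header, 46-byte central directory entry, 22-byte end-of-central-directory record). The helper name is hypothetical, not dumply's code:

```javascript
// Predict the exact byte size of a "stored" (uncompressed) zip before
// sending the first byte, so Content-Length can be set up front.
// Per the zip spec: local file header = 30 bytes + filename,
// central directory entry = 46 bytes + filename, EOCD record = 22 bytes.
function zipContentLength(entries) {
  var total = 22; // end-of-central-directory record
  entries.forEach(function (e) {
    var nameLen = Buffer.byteLength(e.name);
    total += 30 + nameLen + e.size; // local header + raw file data
    total += 46 + nameLen;          // central directory entry
  });
  return total;
}

// e.g. two images whose sizes are already known from the DB:
var length = zipContentLength([
  { name: 'a.jpg', size: 1024 },
  { name: 'b.jpg', size: 2048 }
]);
// response.setHeader('Content-Length', length);
```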

[+] nevinera|14 years ago|reply
You can do a solid best-guess estimate, especially if those images are already in compressed formats.
[+] colinmarc|14 years ago|reply
I was playing around and did something similar with video encoding. The server code starts a running ffmpeg process, and then the handler code just looks like this:

    // ffmpeg.process is a child process (e.g. from child_process.spawn)
    // started before the server, reading stdin and writing stdout
    server = http.createServer(function(request, response) {
        request.pipe(ffmpeg.process.stdin);   // uploaded bytes in
        ffmpeg.process.stdout.pipe(response); // transcoded bytes out
    });
What a nice interface! The end result is that you can do weird stuff like:

    $ curl -T my_video.mp4 http://localhost:9599 | mplayer
[+] timc3|14 years ago|reply
Think you will find VLC, PS3/Xbox streaming servers, and a whole load of others do the same thing.
[+] EvanMiller|14 years ago|reply
Not to poop in your cocoa puffs, but I wrote an Nginx module to do the same thing in 2007.

https://github.com/evanmiller/mod_zip

The module is quite mature at this point, and is used in production on many websites (including Box.net, which commissioned the initial work). The module supports the Content-Length header, Range and If-Range requests, ZIP-64 for large archives, and filename transcoding with iconv. Being written in C, it will probably use much less RAM than an equivalent Node.js module.

I have found that the hardest part of generating ZIP files on the fly has nothing to do with network programming; it's producing files that open correctly on all platforms, including Mac OS X's busted-ass BOMArchiveHelper.app.

[+] dmmalam|14 years ago|reply
The point wasn't that creating zips on the fly is new; it's that pipeable stream abstractions are a composable way to build network servers, and Node.js is just what we found easiest to express this with.

Having a large number of stream primitives means you can easily wire up endpoints: say you wanted to output a large DB query as XML, consume and edit gigabytes of JSON, or consume, transcode, and output a video.

You can by all means write an nginx module in C for each use case, and that is probably the right solution for very heavy, specific loads.

But writing a C module is probably too high a barrier for many, whereas implementing a Node.js stream isn't. Respond to a few events, emit a few events, and you have a module that can work with the hundreds of other stream abstractions available (npm search stream).

You still need the specific domain knowledge (e.g. how zip headers work), and this is usually the complicated bit. mod_zip looks excellent, and I wonder if some of its zip-handling domain knowledge can be reused in zipstream.

[+] eridius|14 years ago|reply
What's busted about BOMArchiveHelper? I don't think I've ever run across a zip that doesn't open in BOMArchiveHelper and yet opens in other software.
[+] chrisacky|14 years ago|reply
Nice approach.

This is how we handle it currently.

> User adds images to a virtual lightbox.

> User decides that he wants to download all the images in this lightbox, so presses "Download Folder". The user is then presented with a list of possible dimensions that they can request.

> The user selects "Large" and "Small" and hits "Download"

> This request gets added to our Gearman job queue.

> The job gets handled and all the files are downloaded from Amazon S3 to a temporary location on the local file server.

> A Zip object is then created and each file is added to the Zip file.

> Once complete, the file is then uploaded back to Amazon S3 in a custom "archives" bucket.

> Before this batch job finishes, I fire off a message to Socket.io / Pusher which sends the URL back to the client who has been waiting patiently for X minutes while his job has been processing.

This works okay for us because when users create "Archives" of their lightboxes, generally they do this because they want to share the files with other people. This means that they attach the URL to emails to provide to other people.

So for us, it's actually necessary to save the file back to S3... however, I'm sure that not everyone needs to share the file... it would definitely be worth investigating whether the user plans to return to the archive, in which case implementing streams could potentially save us on storage and complexity.

[+] dmmalam|14 years ago|reply
I think you have pretty much described our original ('ghetto') solution with caching ('lipstick').

With streams, there is no need to cache, as recreating the download is dirt cheap: essentially just a few extra header bytes to pad the zip container, on top of the image content bytes that you always have to send.

The use case you mentioned, of sharing the download link, works exactly the same. You send the link, and whoever clicks on it gets an instant download.

True, you are buffering data through your app instead of letting S3 take care of it. But if you're on AWS, S3 to EC2 is free and fast (200 Mb/s+), and bandwidth out of EC2 costs the same as out of S3; if it goes over an Elastic IP, a cent more per GB. Your app servers also handle some load, but Node.js (or any other evented framework) lives to multiplex I/O, with only a few objects' worth of overhead per connection.

In return, you can delete a whole load of cache and job control code. Less code to write, test and maintain.

[+] timc3|14 years ago|reply
Alternatively, you could hand it off to the web server, which is probably a better, more elegant solution.

http://wiki.nginx.org/X-accel and mod_zip for instance.

Why do people keep reinventing the wheel, thinking node is the be all and end all when this is nothing new at all?

[+] nevinera|14 years ago|reply
I'm pretty sure any evented framework in any language can do the same thing.
[+] masklinn|14 years ago|reply
Even non-evented ones; the interesting part really is not node itself (despite what the blog says) but the ability to pipeline streams without having to touch every byte yourself.

It should be possible to do something similar using e.g. generators (in Python) or lazy enumerators (in Ruby).

In fact, in Python, WSGI handlers return an arbitrary iterable which will be consumed, so the pattern is natively supported (string together iterators and generators, then return the complete pipe, which performs the actual processing as WSGI serializes and sends the response). Ruby would require an adapter to a Rack response of some sort, as I don't think you can reply with an enumerable OOTB.

[+] dmmalam|14 years ago|reply
Yeah, 100% true. I've spent many man-years writing entire systems like this in C and Java.

It's just that in node, doing it the evented way was actually simpler and quicker to implement than the 'ghetto' way. This isn't usually the case, and I always recommend doing the simplest thing that works first. It's just nice here that the simplest thing is also a tight solution.

[+] robfig|14 years ago|reply
What is the connection between evented and streaming? It seems like a thread-per-request server would have to do exactly the same thing (except, they would not have to worry about giving back their event loop thread).
[+] latchkey|14 years ago|reply
I used to have this same exact issue while working for a large porn company. We needed to make zips of hundreds of megs of images. We were creating them on the fly to start with, which sucked for all the same reasons mentioned in the blog post. After doing a ton of analysis and not finding a good streaming library that didn't require either C or Java (this is long before Node came along), we realized that as part of the publishing process, we could just create the zip and upload it to the CDN. Problem solved with the minimal amount of complexity.
[+] chubot|14 years ago|reply
This is really cool. How are errors handled though? What if you have a transient error to 1 of 50 images -- does that bork the whole download? The user could get a corrupted file.
[+] georgefox|14 years ago|reply
I'm curious about this as well. While it's all very neat and improves the user experience when everything is working, what happens if things break? If you can't connect to S3 or something, but you've already sent HTTP headers for the ZIP download, what do you do? Throw an error message in a text file inside the ZIP? Send the user an empty ZIP? A corrupted ZIP, as chubot mentions, seems like it would be the worst-case scenario in terms of UX.
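One defensive pattern for the failure mode raised here, as a hypothetical sketch rather than anything from the article: once headers are sent you can't change the status code, but if Content-Length was set up front, aborting the connection leaves the client with a visibly truncated download instead of a short file that still looks complete:

```javascript
// Fail loudly when a piped source errors mid-stream: unpipe and destroy
// the destination. For an HTTP response this drops the connection, so
// the client sees an incomplete transfer rather than a silently
// corrupted "successful" download.
function pipeOrAbort(source, dest) {
  source.pipe(dest);
  source.on('error', function (err) {
    console.error('upstream failed mid-stream:', err.message);
    source.unpipe(dest);
    dest.destroy(); // abort the connection
  });
}

// usage (names illustrative): pipeOrAbort(s3ReadStream, httpResponse);
```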
[+] Benvie|14 years ago|reply
The important bit is that this is THE core abstraction used in node and the node community. If for no other reason, you should do it (if you're using node) because it's how you hook into the existing libraries.

The main benefit here isn't that it's possible to do this thing; as many people pointed out, this is accomplished in myriad ways elsewhere. The key point is that everything that manipulates data, node core as well as the userland libraries, implements the same interface.

[+] sirclueless|14 years ago|reply
That's not really true. Most libraries expose a callback mechanism, where the result of some IO is passed as a Javascript primitive to a callback function that you provide. The Dumply guys used to use an API like that.

The notion of piping the output from some I/O (say, a request to S3) into the input of some other I/O (say, an HTTP response currently being written) without ever referencing it is blessed by node, which has a stream type as part of its standard library. But it's far from the most common abstraction of asynchronous work.

[+] robfig|14 years ago|reply
Using the Play! Framework (Java):

  public static void myEndpoint() {
    HttpResponse resp = WS.url("url-of-file").get();
    InputStream is = resp.getStream();
    renderBinary(is);
  }
Or am I missing something?

(EDIT: This doesn't do the zipping or multiple files -- I guess I need a ZipOutputStream to take it the rest of the way)

[+] arthurschreiber|14 years ago|reply
I don't know the Play! framework, but the main difference probably is the use of non-blocking IO in nodejs, in contrast to the blocking IO in the example you just gave. (I'm not saying either is better.)
[+] simonw|14 years ago|reply
Will that definitely stream from one to the other without buffering the full file in memory? That's the main benefit of the Node.js streaming approach - it doesn't need to hold the whole thing in memory at any time, it just has to use a few KB of RAM as a buffer.
[+] WiseWeasel|14 years ago|reply
I like how this is done, but I do see one problem with this approach for users connected through certain wireless ISPs, such as Verizon, who have all their http image requests automatically degraded to a lower bitrate to save bandwidth. They might think they're getting a usable local copy of their project, when they've actually got ugly, butchered versions of all the assets. That would not have been an issue with the server-side implementation.
[+] icebraining|14 years ago|reply
This is still server-side; it just streams instead of downloading and then pushing.
[+] aioprisan|14 years ago|reply
that's all dandy until you run out of RAM, as everything is done in RAM and nothing goes to disk. you honestly don't see a scalability issue here? it may be ok for a few thousand concurrent downloads but anything above that will kill it. heck, you might not even get to 1k concurrents, depending on the file size..
[+] atesti|14 years ago|reply
He's streaming: Node will only buffer a few KB per connection and push it right out to the downloader. There is absolutely no need to download complete files. That's the beauty of streams and pipes!
[+] moonboots|14 years ago|reply
Have you considered jszip and/or webworkers to perform the zipping on the client?
[+] dmmalam|14 years ago|reply
The full-size originals are only stored on the server; the client just uses thumbnails, so there's not much point in zipping on the client.
[+] lucaspiller|14 years ago|reply
...or you could use Erlang. :)
[+] marcocampana|14 years ago|reply
Sure you could use Erlang for that, but what Dharmesh is saying is that building this kind of solution in node.js would definitely be easier and possibly faster to code than writing it with other languages/frameworks.
[+] tomgruner|14 years ago|reply
I have to be honest, the writing quality of that article is so low and aggressive that I could not even finish it.
[+] mcantelon|14 years ago|reply
Node is possibly overhyped, and certainly not a panacea, but that article is silly in its absolute dismissal of the framework. Node is very well-suited for I/O bound TCP/IP applications.