Next generation Unix pipe by Alex Larsson

[+] rbanffy|13 years ago|reply

I find the tendency to repeat Microsoft's mistakes deeply disturbing. Even if, in this case, the author acknowledges PoweeShell goes too far, his own idea goes too far.

I'd be all in with flags that make ps or ls spit JSON or XML, but this typed nonsense? What when I want to output a color? Will I need a new type?

Oh... and the sort thing... its not hard to sort numerically.

[+] zokier|13 years ago|reply

>I find the tendency to repeat Microsoft's mistakes deeply disturbing.

What, in this case, you consider "Microsoft's mistake"? I thought that PowerShell was commonly considered conceptually sound, but flawed in the implementation, mostly for it's verbosity making it unwieldy for interactive use. If this project can solve that, then I don't see it "repeating Microsoft's mistakes". Instead it would be correcting them.

[+] chris_wot|13 years ago|reply

What do you mean by "what when I want to output a color"?

[+] yason|13 years ago|reply

I wonder where this comes from. There's no need for next generation.

People have, for thirty years or so, successfully printed their data into a suitable textual streams for processing with programs glued together with pipes, and optionally parsed the results back into some native format if so required.

Meanwhile, none of the "next generation" pipes have gained any momentum. Obviously they solve something which is either not a problem or they do solve some problems but create new ones in greater numbers than what they solved, tipping the balance into negative.

Any object or intermediary format you can think of can be represented in text, and you're back to square one. For example, even if there's XSLT and XQuery, you can just serialize trees of XML elements into a row-based format of expression and use grep on the resulting stream to effectively make hierarchical searches inside the tree.

[+] ufo|13 years ago|reply

I have the opposite experience, actually. Its so damn annoying to have programs communicating structured data with each other over pipes that more complicated things inevitably diverge into a) monocultural programs made with a single programming language and that don't communicate with the outside world or b) Some form of exchange format (JSON, XMl, etc) that needs to be explicitly supported by every participant.

And unix utilities suck at handling structured data. If your file format is line based you might have a chance of it being easy to work with but don't even ask what happens if you insert a newline inside a textfield then.

[+] veyron|13 years ago|reply

That is a terrible idea: sometimes the app can take advantage of a constraint to minimize work done.

In your example, if we just wanted to filter for a particular user, dps would have to print out ALL of the information and then you could pick at it. This doesn't seem bad for ps (because there's a hard limit) but in many other examples the output could be much larger than what is needed. That's why having filtering and output flags in many cases is more efficient in generating everything.

As a side note: To demonstrate a dramatic example, I tried timing two things:

    - dumping NASDAQ feed data for an entire day, pretty-printing, and then using fgrep
    - having the dumper do the search explicitly (new flags added to program)

Both outputs were sent to /dev/null. The first ran in 35 minutes, the second in less than 1 minute

[+] saulrh|13 years ago|reply

Streams clamp everything in them to O(n). That's a problem in some cases; for example, your NASDAQ feed dumper probably has some kind of database inside itself that lets it run filters in massively sublinear time, and making it linear would be a significant performance hit.

However, there are an equal number of tasks that are not sublinear. Some of them are also very common and important sysadmin-y things. Iterate through a directory applying some operation to every file. Slurp a file and look for a particular chunk of bits. And so on. For those sysadmins, a little structure in their stream can make their job a lot easier. It'd be like the difference between assembly and C: all of a sudden things have names.

[+] alexlarsson|13 years ago|reply

Obviously for many cases avoiding output is better than post-output filtering. For these cases the originating process should do the filtering. However, in many practical situations the data sets are small enough to not matter, or the operation wanted will not filter out most data anyway.

Basically, you're arguing that grep is a bad tool (it has the same issues) yet its a very commonly used tool.

[+] Someone|13 years ago|reply

If "sometimes the app can take advantage of a constraint" is an argument here, you should be against all usage of pipes.

[+] Daniel_Newby|13 years ago|reply

So the NASDAQ dumper should accept a structured query as its input. This is an architecture issue, not a data format issue.

[+] mongol|13 years ago|reply

For certain types of unix pipeing, I have found it useful to pipe from tool to CSV, and then let sqlite process the data, using SQL statements. SQL solves many of the sorting, filtering, joining things that you can do with unix pipes too, but with a syntax that is broadly known. Especially the joining I have found hard to do well with shell/piping.

I think a sqlite-aware shell would be awesone, especially if common tools had a common output format (like csv with header) where that also included the schema / data format.

[+] agumonkey|13 years ago|reply

very clever

[+] 0xbadcafebee|13 years ago|reply

My preference for a "next generation pipe": Shared file descriptors. (Sort of)

It would work virtually the same as a standard pipe; the difference being you could control whether it was read, write, or both, and every application you 'piped' to would have access to the same file descriptors as the parent, unless a process in the path of the pipe closes one.

The end result will be the equivalent of passing unlimited individual arbitrary bitstreams combined with the ability to chain arbitrary programs. In fact, you could simplify things by simply passing the previous piped command's output as a new file descriptor to the next program, so you could easily reference the last piped program's output, or any of the ones before it.

For example:

cat arbitrary_data.docx | docx --count-words <$S[0] | docx --head 4 <$S[0] | docx --tail 4 <$S[0] | docx --count-words <$S[2] | echo -en "Number of words: $STREAM[1]\n\nFirst paragraph: $STREAM[2]\n\nLast paragraph: $STREAM[3]\n\nNumber of words in first paragraph: $STREAM[4]\n"

STREAM[0] is the output of 'cat'. STREAM[1] is the counted words of STREAM[0] ($S[0] is an alias). STREAM[2] is the first 4 lines of the doc. STREAM[3] is the last 4 lines of the doc. STREAM[4] is the counted words from STREAM[2] (note the "<$S[2]"). And STREAM[5] is the output of 'echo', though since it's the last command, it becomes STDOUT.

There may be a more slick way of doing this, but you can see the idea. Pass arbitrary streams as you pipe, and reference any of them at any point in the pipe to continue processing data arbitrarily in a one-liner.

...

Actually, it looks like this is already built into bash (sort of), as the Coprocesses functionality. I don't know if you can use it with pipes, but it's very interesting.

[+] DHowett|13 years ago|reply

I like the idea of processes dumping structured objects: pipes are rather often used for the processing of structured data, and while tabulated output certainly makes it easier, we still end up effectively using constants: cut to the third column, sort the first 10 characters, and print the first four lines.

This method is fragile when given diverse input: what if the columns could themselves contain tabs, newlines, or even nul bytes?

Passing objects as binary blobs, on the other hand, doesn't allow for ease of display or interoperability with other tools that don't support whatever format they happen to be. This, of course, can be rectified with a smart shell with pretty-print for columnar data (insofar as a shell could be charged with data parsing; you may imagine an implicit |dprint at the end of each command line that outputs blobs).

I'd also be interested in seeing a utility that took "old-format" columnar data and generated structured objects from it, of course, with the above format caveats.

[+] chris_wot|13 years ago|reply

Something like a cut, only we call it dcut? Actually sounds like a pretty good idea - that way those who don't want to switch to the new format don't have to, and you can pipe it through this program to create the new style structured output...

[+] dfc|13 years ago|reply

How would a column contain a newline?

[+] rogerbinns|13 years ago|reply

What would be ideal to solve first is some sort of initial format negotiation on pipes. Otherwise you will end up with the wrong thing happening (eg having to reimplement every tool, spewing "rich" format to tools that don't know it, or regular text to tools that could do better).

We've already seen something like this - for example ls does column output if going directly to a screen, otherwise one per line, and many tools will output in colour if applicable. However this is enabled by isatty() which uses system calls, and inspecting the terminal environment for colour support.

Another example is telnet which does feature negotiations if the other end is a telnet daemon, otherwise just acts as a "dumb" network connection. (By default the server end initiates the negotiations.)

However the only way I can see this being possible with pipes is with kernel/syscall support. It would provide a way for either side to indicate support for richer formats, and let them know if that is mutually agreeable, otherwise default to compatible plain old text. For example an ioctl could list formats supported. A recipient would supply a list before the first read() call. The sender would then get that list and make a choice before the first write() call. (This is somewhat similar to how clipboards work.)

So the question becomes would we be happy with a new kernel call in order to support rich pipes, which automatically use current standard behaviour in its absence or when talking to non-rich enabled tools?

I would love it if grep/find/xargs automatically knew about null terminating.

[+] dfc|13 years ago|reply

man grep:

   -Z, --null
      Output a zero byte (the ASCII NUL character) instead  of  the  character  that  normally
      follows  a  file  name.   For example, grep -lZ outputs a zero byte after each file name
      instead of the usual newline.  This option makes the output  unambiguous,  even  in  the
      presence  of file names containing unusual characters like newlines.  This option can be
      used with commands like find -print0,  perl  -0,  sort  -z,  and  xargs  -0  to  process
      arbitrary file names, even those that contain newline characters.

   -z, --null-data
      Treat the input as a set of lines, each  terminated  by  a  zero  byte  (the  ASCII  NUL
      character)  instead of a newline.  Like the -Z or --null option, this option can be used
      with commands like sort -z to process arbitrary file names.

man xargs:

   --null
   -0     Input  items are terminated by a null character instead of by whitespace, and the quotes
      and backslash are not special (every character is taken literally).  Disables the end of
      file  string,  which  is treated like any other argument.  Useful when input items might
      contain white space, quote marks, or backslashes.  The GNU find -print0 option  produces
      input suitable for this mode.

man find:

   -print0
      True;  print  the  full  file  name on the standard output, followed by a null character
      (instead of the newline character that -print uses).  This allows file names  that  con‐
      tain newlines or other types of white space to be correctly interpreted by programs that
      process the find output.  This option corresponds to the -0 option of xargs.

[+] alexlarsson|13 years ago|reply

My code does format negotiation on the pipe to determine whether to send the data in textual form or binary form.

It uses file locks (F_SETLK) on the pipe with a magic offset value offset to do the negotiation.

[+] huhtenberg|13 years ago|reply

At the risk of stating the obvious - this won't take off for a simple reason of being too complex by Unix standards.

[+] IsTom|13 years ago|reply

I'm not sure, it's probably not too complex by GNU standards.

[+] Daniel_Newby|13 years ago|reply

Have you seen the command line options for ps? They're going to have to start using Unicode accent marks if they extend it much further.

[+] dfc|13 years ago|reply

"Even something as basic as numerical sorting on a column gets quite complicated."

    sort -g -k field_num

[+] comex|13 years ago|reply

Two problems: the header is sorted along with the fields, and you have to look up the field number. Insurmountable? No; but somewhat complicated.

[+] DeepDuh|13 years ago|reply

Sometimes I feel like POSIX idioms are like the bible to some here: untouchable.

[+] snprbob86|13 years ago|reply

Neat! I've commented about this very problem before on several of the many threads regarding "object pipes", ie. REPLs.

http://news.ycombinator.com/item?id=1033623

http://news.ycombinator.com/item?id=1566325

http://news.ycombinator.com/item?id=2527217

Since that last comment, I've been working a bunch with Clojure, which has a far more expressive variant of JSON, as well as some heavy duty work with Google's Protocol Buffers.

A few points:

1) Piping non-serializable objects is a BAD IDEA. That's not a shell, that's a REPL. And even in a REPL, you should prefer inert data, a la Clojure's immutable data structures.

2) Arbitrary bit streams is, fundamentally, unbeatable. It's completely universal. Some use cases really don't want structured data. Consider gzip: you just want to take bytes in and send bytes out. You don't necessarily want typed data in the pipes, you want typed pipes, which may or may not contain typed data. This is the "format negotiation" piece that is mentioned in the original post. I'd like to see more details about that.

3) There seems to be some nebulous lowest common denominator of serializable data. So many things out there: GVariant, Clojure forms, JSON, XML, ProtoBuf, Thirft, Avro, ad infinitum. If everything talks its own serialization protocol, then none of the "do one thing well" benefits work. Every component needs to know every protocol. One has to "win" in a collaborative shell environment. I need to study GVariant more closely.

4) Whichever format "wins", it needs to be self-describing. A table format command can't work on field names, unless it has the field names! ProtoBufs and Thrift are out, because you need to have field names pre-compiled on either side of the pipe. Unless, of course, you start with a MessageDescriptor object up front, which ProtoBufs support and Avro has natively, but I digress: Reflection is necessary. It's not clear if you need header descriptors a la MessageDescriptor/Avro, or inline field descriptions a la JSON/XML/Clojure. Or a mix of both?

5) Order is critical. There's a reason these formats are called "serializable". Clojure, for example, provides sets using the #{} notation. And, like JSON, supports {} map notation. Thrift has Maps and Sets too. ProtoBufs, however, don't. On purpose. And it's a good thing! The data is going to come across the pipe in series, so a map or set doesn't make sense. Use a sequence of key-value-pairs. It might even be an infinite sequence! It's one thing to support un-ordered data when printing and reading data. It's another thing entirely to design a streaming protocol around un-ordered data. Shells need a streaming protocol.

6) Going back to content negotiation, this streaming protocol might be able to multiplex types over a single stream. Maybe gzip sends a little structured metadata up front, then a binary stream. ProtoBufs label all "bytes" fields with a size, but you might not know the size in advanced. Maybe you need two synchronized streams on which you can multiplex a control channel? That is, each pipe is two pipes. One request/response pair and the other a modal byte stream vs typed message stream.

Overall. This is the nicest attempt at this idea I've seen yet. I've been meaning to take a crack at it myself, but refused to do it without enough time to re-create the entire standard Unix toolkit plus my own shell ;-)

[+] alexlarsson|13 years ago|reply

Regarding order. The dtools approach uses a stream (i.e. potentially infinite) of variants. Each variant is a self contained typed data chunk which is by itself not "streamable" (i.e. you have to read all of it). The data chunk is strongly typed and the type is self-described.

The supported primitive types are: bool, byte, int16, uint16, int32, uint32, int64, uint64, double, utf8 string (+ some dbus specific things).

These can be recursively combined with: arrays (of same type), tuples, dicts (primitive type -> any type map), maybe type, and variant type

In my dps example I generate a stream of dictionaries mapping from string to variant (i.e. any type). The type of each item in the map differs. For instance cmdvec is an array of strings, whereas euid is an uint32.

[+] forgotusername|13 years ago|reply

I've quickly jotted some thoughts here: http://damnkids.posterous.com/rich-format-unix-pipes

Regarding this version, standardizing on a particular transfer format is a bad idea. If history has shown anything, it's that we like to reinvent this stuff and make it more complicated than necessary (see also XDR, ASN.1, XML, etc. :) pretty much on a 5 year cycle or thereabouts.

Do the bare minimum design necessary and let social convention evolve the rest.

[+] alexlarsson|13 years ago|reply

Having to many different formats is also a problem though, as incompatible formats means you can't combine two apps in a pipepine.

The negotiations in dtools is made using a F_GETLK hack with a magic value offset. That approach could easily be extended to support multiple formats.

[+] rhizome|13 years ago|reply

If I can do a slight PG impression, "what problem does this solve?"

[+] comex|13 years ago|reply

Among others, this problem:

http://www.dwheeler.com/essays/fixing-unix-linux-filenames.h...

find -print0 is a lame hack, and even filenames with spaces (not newlines) are somewhat messy to work with on the Unix shell.

Or a little recurring problem I have: How do I grep the output of grep -C (matches showing multiple lines delimited with a "--" line)? I wrote a custom tool to do it, which does the job, but really it would be nice if I could use all the normal line-based Linux tools (sort, uniq, awk, wc, sed) with a match as a "line".

[+] chris_wot|13 years ago|reply

Don't need to always use sed and awk. You can more easily sort, without recoursing to cut. You can let your pipe work on ranges.

Basically, use your imagination :-)

[+] luriel|13 years ago|reply

If I can do a slight Henry Spencer impression, "the problem of not understanding Unix and being condemned to reinvent it, poorly."

[+] m_eiman|13 years ago|reply

Also see https://github.com/unconed/TermKit/

[+] sprobertson|13 years ago|reply

I've been playing around with a similar idea - using plain JSON as the message format, you can make a set of pipeable command line utilities for manipulating data from many web APIs.

[+] zootm|13 years ago|reply

Have you seen RecordStream? It's some good cli tools based on streams of json which might be handy.

https://github.com/benbernard/RecordStream

I've used it a lot and it's a godsend for a most "record-y" manipulation.

[+] alexlarsson|13 years ago|reply

A textual format needs parsing though at every stage in the pipeline. Thats why I think using an (optional, when supported) binary format is important.

[+] lubutu|13 years ago|reply

I've noticed that the NUL-termination problem [1] has come up a number of times in these comments. If you want a solution to this that isn't so drastic as an object system, perhaps take a look at Usul [2], non-POSIX 'tabular' Unix utilities which use an AWK-style $RS.

[1]: http://news.ycombinator.com/item?id=4369699

[2]: http://lubutu.com/soso/usul

[+] enthalpyx|13 years ago|reply

[+] chris_wot|13 years ago|reply

What about providing a filter that converts to whatever format you can think of? e.g. outputs in JSON or XML

[+] Groxx|13 years ago|reply

Because what are you converting from? It can't be turtles all the way down, at some point there must be a defined system that everything speaks. Adding output formats after that is relatively simple.

[+] pjmlp|13 years ago|reply

Actually I prefer Powershell's approach to transfer objects, as it is more flexible than standardize in a specific transfer format.

But I do concede that it has the downside that if the object lacks the properties you want to access, then it might be painful in some cases.

[+] manojlds|13 years ago|reply

But it is not like you cannot do normal string processing using cmdlets like "Select-String". And an object missing a property is almost same a column missing in the returned text output right?

[+] pka|13 years ago|reply

I can't believe this. Just 2 or so weeks ago I set about writing exactly something like this in Haskell [1]. It's by no means complete or even working at this point, but basically what I had in mind was something like:

    yls | yfilter 'mdate = yesterday && permissions.oread = true' | yformat -ls

Every tool emits or consumes "typed" JSON (i.e. JSON data with an additional JSON schema). Why typed? Because then the meaning of things like mdate = yesterday can be inferred from the type of mdate and mean different things depending on whether mdate is a string or a date. In the case of a date, the expression mdate = yesterday can automatically be rewritten to mdate >= 201208110000 && mdate < 201208120000 etc. In the case of a string we do string comparison. In the case of a bool we emit an error if the compared-to value isn't either true or false, etc.

Basically, I wanted to build a couple of standard tools inspired by the FP world, like filter, sort, map, fold (reduce) and have an universal tool for outputting formatted data in whatever form is desired - be it JSON, csv files, text files or custom formats. Every tool would support an -f parameter, which means that its output is automatically piped through the format tool, so that something like

    yls -fls

is functionally equivalent to

    yls | yformat -ls

which would output the JSON data from yls in the traditional ls way on a unix system.

    yls | yformat -csv

would output csv data. Some more examples:

    yls | yfold '+ size' 0

prints out the combined size of all files in the current directory.

    yls | ymap 'name = name + .jpg' | ymv

would append .jpg to all files in the current directory.

    ycontacts | yfilter -fcsv 'name = *John*'

would print out all Google contacts containing John in their name as a csv file.

    yps | yfilter 'name = java*' | yeval 'kill name'

would kill all processes whose names start with 'java'.

The cool thing about this is that this approach conserves one of the main selling points of FP: composability. I.e. you can throw something like yfold '+ size' 0 in a shell script and then write:

    yls | size.sh

This way people would be able to build an ever growing toolbelt of abstracted functionality specifically tailored to their way of doing things, without losing composability.

[1] https://github.com/pkamenarsky/ytools

[+] flogic|13 years ago|reply

Personally, I'm not feeling the quotes and would prefer parens since they're nestable.

[+] Rovanion|13 years ago|reply

So JSON?

82 comments