
Dgsh – Directed graph shell

178 points | nerdlogic | 9 years ago | dmst.aueb.gr

51 comments

[+] chubot | 9 years ago
This looks pretty interesting, although I'll have to dig more into the examples to see why they chose this set of primitives (multipipes, multipipe blocks, and stored values).
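
For concreteness, a fork/join multipipe can be approximated in plain bash with named pipes (this is a sketch of the concept using made-up sample data, not dgsh's actual syntax):

```shell
#!/usr/bin/env bash
# Fork one stream into two consumers and join their results --
# the shape dgsh's multipipe blocks express declaratively.
set -euo pipefail
dir=$(mktemp -d)
mkfifo "$dir/a" "$dir/b"
wc -l < "$dir/a" > "$dir/lines" &   # branch 1: count lines
wc -c < "$dir/b" > "$dir/chars" &   # branch 2: count bytes
printf 'alpha\nbeta\ngamma\n' | tee "$dir/a" > "$dir/b"   # the fork
wait                                                      # both branches done
summary=$(paste "$dir/lines" "$dir/chars")                # the join
echo "$summary"
rm -r "$dir"
```

The bookkeeping (fifos, backgrounding, wait) is exactly what a declarative multipipe syntax would hide.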

Here is a 2009 paper, "Composing and executing parallel data-flow graphs with shell pipes", which is also a bash extension. (I'm impressed with anyone who successfully enhances bash's source code.)

It has a completely different model, though, and I think it's more suitable for "big data".

https://scholar.google.com/scholar?cluster=98697598478714306...

http://dl.acm.org/citation.cfm?id=1645175

In this paper we extend the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation.

I have a printout of this paper, but unfortunately it doesn't appear to be online :-(

[+] xiaq | 9 years ago
I've always thought about integrating this functionality into elvish https://github.com/elves/elvish but cannot come up with a good syntax. dgsh has a good one, but unfortunately its use of & breaks that operator's traditional semantics. Does anyone have an idea for a tradition-compatible grammar?

Also, to nitpick, this is more accurately called a directed acyclic graph shell, or simply a DAG shell. The language doesn't seem to allow cycles. dagsh reads nicer than dgsh too.

[+] mtrn | 9 years ago
I've worked with and looked at a lot of data processing helpers: tools that try to help you build data pipelines, for the sake of performance, reproducibility, or simply code uniformity.

What I found so far: most tools that invent a new language, or try to cram complex processes into ill-suited syntactic environments, are not much loved.

A few people like XSLT; most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time when the answer to everything was XML. There are big-data orchestration frameworks that use XML as a configuration language, which can be OK if you have clear processing steps.

Every time a tool invents a DSL for data processing, I grab my list of ugly real-world use cases, and most of the tools fail quickly, if not immediately. That's a pity.

Programming languages can be effective as they are, and given the exceptions that unclean data brings, you want a full programming language at your disposal anyway.

I'll give dgsh a try. The tool-reuse approach and the UNIX spirit seem nice. But my initial impression of the "C code metrics" example from the site is mixed: it reminds me of awk, about which one of the authors said that it's a beautiful language, but if your programs get longer than a hundred lines, you might want to switch to something else.

Two libraries with a great grip on the plumbing aspect of data processing systems are Airflow and Luigi. They are Python libraries, so you get a concise syntax and essentially all Python libraries, plus any non-Python tool with a command-line interface, at your fingertips.

I am curious what kind of process orchestration tools people use and can recommend.

[+] samuell | 9 years ago
Exactly our experience too, from complex machine learning workflows in various aspects of drug discovery.

We basically did not find any of the popular DSL-based bioinformatics pipeline tools (Snakemake, Bpipe, etc.) to fit the bill. Nextflow came close, but in fact allows quite a bit of custom code too.

What worked for us was Spotify's Luigi, which is a Python library rather than a DSL.

The only thing was that we had to develop a flow-based-inspired API on top of Luigi's more functional-programming-based one, in order to make defining dependencies fluent and easy enough for our complex workflows.

Our flow-based-inspired Luigi API (SciLuigi) for complex workflows is available at:

https://github.com/pharmbio/sciluigi

We wrote up a paper on it as well, detailing a lot of the design decisions behind it:

http://dx.doi.org/10.1186/s13321-016-0179-6

Lately we have been working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm we could rely on Go channels and goroutines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has a third the LOC of a recent bioinformatics pipeline tool written in Python and put into production). Code available at:

https://github.com/scipipe/scipipe

It is also very much a programming library rather than a DSL.

In fact, it even implements streaming via named pipes, seemingly allowing operations somewhat similar to dgsh's, probably with a bit more code, but with the (apparent) benefit of somewhat easier handling of multiple inputs and outputs (via the flow-based programming ports concept).

dgsh looks really interesting for simpler operations where there is one main input and output, though - which occur a lot in ad-hoc shell work, in our experience. Will have to test it out for sure!

[+] dwhitena | 9 years ago
Thanks for sharing your experience. I work with Pachyderm, an open-source data pipelining and data versioning framework. Some things that might be relevant to this conversation: Pachyderm is language agnostic, and it keeps analyses in sync with data (because it triggers off commits to data versioning). This makes it distinct from Airflow or Luigi, for example.
[+] nerdponx | 9 years ago
This post is making me think it would be a great educational exercise to construct equivalent data processing flows in some popular tools: Make, Airflow, Luigi, Snakemake, Rake, others?
[+] rtpg | 9 years ago
I've been thinking about this space a lot too, would you mind listing out some of the messier use cases that you have?
[+] cturner | 9 years ago
Something I have found fun in the past: using XSLT where the underlying document is not XML. For XSLT to work (in a Java setting, with the Apache libs) you do not need an underlying XML document, just something that satisfies the appropriate Java interface. For example, you could wrap a filesystem directory structure.
[+] timthelion | 9 years ago
I downvoted your comment, because it doesn't seem to me that you read the article and are responding to its contents. You are simply responding with a pre-formed opinion. Conversations only work when you read first, then think, and finally respond. But I guess conversations cannot happen on HN, because everything has to be so FAST in Silicon Valley.
[+] karlmdavis | 9 years ago
This is perhaps a bit off-topic, but what I really wish more data processing/ETL tools supported is the concept of transactional units. Too many of them seem to start with the worldview that "we need to shove in as many of the separate bits as we possibly can."

What's often needed for robust systems, instead, is solid support for error handling such that "if this bit doesn't make it in, then neither does that bit." Data is always messy and dirty, and too many ETL systems don't seem architected to cope with that reality.

Of course, maybe I just haven't found the right tools. Anyone know of tools that handle this particularly well?

[+] visarga | 9 years ago
I write complex shell commands every day, but when one gets longer than two or three lines I switch to a text editor and write it in Perl instead. I see no need to push bash to that complexity; it doesn't look good in the terminal.

The poor man's version of multiple pipes is to write intermediate results to files, then "cat" the files as many times as needed by the following processes. I use short file names like "o1", "o2" (standing for output-1, output-2) and treat them as temp variables.
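
A sketch of that pattern, with the fixed names swapped for mktemp so parallel runs don't collide (sample data invented for illustration):

```shell
#!/usr/bin/env bash
# Write an intermediate result once, then read it as many
# times as needed -- a stored-value stand-in for a multipipe.
set -euo pipefail
o1=$(mktemp)                        # instead of a fixed name like "o1"
printf '3\n1\n2\n' | sort > "$o1"   # intermediate result, written once
first=$(head -n1 "$o1")             # first reader
last=$(tail -n1 "$o1")              # second reader
echo "$first $last"
rm -f "$o1"
```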

[+] vinceguidry | 9 years ago
This is what it comes down to for me too. Using the shell to do programming seems to me like putting your job on hard mode.

When I had to do a lot of data processing at my last job, I started building up tools in Ruby. If I had time, I'd hack the workflow so that the next time I needed it, I could just run the tool from the command line.

Eventually I had a pluggable architecture that I could use to pull data from any number of sources and mix it with any other data. Do that with a shell? Why?

[+] db48x | 9 years ago
Funny, just two or three weeks ago I was saying that I really needed a DAG of pipes in a shell script I was writing...
[+] tingletech | 9 years ago
Interesting, this seems to be from a couple of people at Information Systems Technology Laboratory (ISTLab) at the Athens University of Economics and Business. I wonder what the motivation is. Security, or does it utilize multiple processor cores better than traditional pipes?
[+] ufo | 9 years ago
The impression I got is that it is still using traditional unix tools and pipes under the hood so I would expect the same efficiency as now. I think the big difference here is the syntax. Traditional shells are great if you have a linear dataflow where each program has one standard input and one standard output. However, if you want to have programs receiving multiple inputs from pipes or writing to multiple pipes then the `|` syntax is not enough.
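
Bash's process substitution already covers a corner of this: a command that needs two inputs can get each on its own descriptor, which a linear `|` cannot express (sketch with made-up data):

```shell
#!/usr/bin/env bash
# comm compares two sorted streams; each <( ) becomes a
# separate input file descriptor for the single command.
# (Process substitution requires bash, not POSIX sh.)
common=$(comm -12 <(printf 'a\nb\nc\n') <(printf 'b\nc\nd\n'))
echo "$common"
```

dgsh generalizes this idea by wiring up whole graphs of such connections automatically instead of one command at a time.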
[+] mtdewcmu | 9 years ago
This looks like potentially a great tool. It might be helpful if the author showed the code examples alongside the equivalent code in bash, so it's easy to see both what the example code is doing and how much effort is saved by doing it in dgsh.
[+] nerdponx | 9 years ago
It doesn't look all that different to me. It seems like it's just saving you from messing around with assigning function inputs and outputs to shell variables. Otherwise it just looks like piping stuff around between functions.
[+] be21 | 9 years ago
I am not familiar with the project. What are the advantages of dgsh in comparison to pipexec: https://github.com/flonatel/pipexec
[+] DSpinellis | 9 years ago
Pipexec offers a versatile pipeline construction syntax, where you specify the topology of arbitrary graphs through the numbering of pipe descriptors. Dgsh offers a declarative directed graph construction syntax and automatically connects the parts for you. Also dgsh comes with familiar tools (tee, cat, paste, grep, sort) written to support the creation of such graphs.
[+] haddr | 9 years ago
I wonder if there is any performance benchmark of this graph shell? Especially on some complex pipelines running over huge datasets?
[+] DSpinellis | 9 years ago
We have measured many of the examples against the use of temporary files and the web report one against (single-threaded) implementations in Perl and Java. In almost all cases dgsh takes less wall clock time, but often consumes more CPU resources.