This looks pretty interesting, although I'll have to dig more into the examples to see why they chose this set of primitives (multipipes, multipipe blocks, and stored values).
Here is a 2009 paper, "Composing and executing parallel data-flow graphs with shell pipes", which is also a bash extension. (I'm impressed with anyone who successfully enhances bash's source code.) It has a completely different model, though, and I think it's more suitable for "big data".
https://scholar.google.com/scholar?cluster=98697598478714306...
http://dl.acm.org/citation.cfm?id=1645175
"In this paper we extend the concept of shell pipes to incorporate forks, joins, cycles, and key-value aggregation."
I have a printout of this paper, but unfortunately it doesn't appear to be online :-(
I've always thought about integrating this functionality into elvish https://github.com/elves/elvish but cannot come up with a good syntax. dgsh has a good one, but unfortunately its use of & breaks that operator's traditional semantics. Does anyone have ideas for a tradition-compatible grammar?
Also, to nitpick, this is more accurately called a directed acyclic graph shell, or simply a DAG shell. The language doesn't seem to allow cycles. dagsh reads nicer than dgsh too.
I've worked with and looked at a lot of data-processing helpers: tools that try to help you build data pipelines for the sake of performance, reproducibility, or simply code uniformity.
What I've found so far: most tools that invent a new language, or that try to cram complex processes into ill-suited syntactic environments, are not much loved.
A few people like XSLT; most seem to dislike it, although it has a nice functional core hidden under a syntax that seems to come from a time when the answer to everything was XML. There are big-data orchestration frameworks that use XML as a configuration language, which can be fine if you have clear processing steps.
Every time a tool invents a DSL for data processing, I grab my list of ugly real-world use cases, and most of the tools fail soon, if not immediately. That's a pity.
Programming languages can be effective as they are, and given the exceptions that unclean data brings, you want a full programming language at your disposal anyway.
I'll give dgsh a try. The tool-reuse approach and the UNIX spirit seem nice. But my initial impression of the "C code metrics" example from the site is mixed: it reminds me of awk, about which one of the authors said that it's a beautiful language, but if your programs get longer than a hundred lines, you might want to switch to something else.
Two libraries with a great grip on the plumbing aspects of data-processing systems are Airflow and Luigi. They are Python libraries, so you have a concise syntax, and basically all Python libraries, plus any non-Python tool with a command-line interface, at your fingertips.
I'm curious: what kind of process orchestration tools do people use and recommend?
Exactly our experience too, from complex machine learning workflows in various aspects of drug discovery.
We basically did not find any of the popular DSL-based bioinformatics pipeline tools (Snakemake, Bpipe, etc.) to fit the bill. Nextflow came close, but in fact allows quite a lot of custom code too.
What worked for us was Spotify's Luigi, which is a Python library rather than a DSL.
The only thing was that we had to develop a flow-based-inspired API on top of Luigi's more functional-programming-based one, in order to make defining dependencies fluent and easy enough to specify for our complex workflows.
Our flow-based-inspired Luigi API (SciLuigi) for complex workflows is available at:
https://github.com/pharmbio/sciluigi
We wrote up a paper on it as well, detailing a lot of the design decisions behind it:
http://dx.doi.org/10.1186/s13321-016-0179-6
Then, lately, we have been working on a pure Go alternative to Luigi/SciLuigi, since we realized that with the flow-based paradigm we could just as well rely on Go channels and goroutines to create an "implicit scheduler" very simply and robustly. This is work in progress, but a lot of example workflows already work well (it has a third of the LOC of a recent bioinformatics pipeline tool written in Python and put into production). Code available at:
https://github.com/scipipe/scipipe
It is also very much a programming library rather than a DSL.
It in fact even implements streaming via named pipes, seemingly allowing operations somewhat similar to dgsh's, probably with a bit more code, but with the (seeming) benefit of somewhat easier handling of multiple inputs and outputs (via the flow-based programming ports concept).
dgsh looks really interesting for simpler operations where there is one main input and one main output, though - which occur a lot in ad-hoc shell work, in our experience. Will have to test it out for sure!
Thanks for sharing your experience. I work with Pachyderm, which is an open source data pipelining and data versioning framework. Some things that might be relevant to this conversation are the fact that Pachyderm is language agnostic and that it keeps analyses in sync with data (because it triggers off of commits to data versioning). This makes it distinct from Airflow or Luigi, for example.
In this case the task resource http://converge.aster.is/0.5.0/resources/task/ might help, as it allows you to create a directed graph using any kind of interpreter (for example, Python or Ruby) instead of having to use the DSL.
This post is making me think it would be a great educational exercise to construct equivalent data processing flows in some popular tools: Make, Airflow, Luigi, Snakemake, Rake, others?
Something I have found fun in the past: using XSLT where the underlying document is not XML. For XSLT to work (in a Java setting, with the Apache libs), you do not need an underlying XML document, just something that satisfies the appropriate Java interface. For example, you could wrap a filesystem directory structure.
I downvoted your comment, because it doesn't seem to me that you read the article and are responding to its contents. You are simply responding with a pre-formed opinion. Conversations only work when you read first, then think, and finally respond. But I guess conversations cannot happen on HN, because everything has to be so FAST in Silicon Valley.
This is perhaps a bit off-topic, but what I really wish more data processing/ETL tools supported is the concept of transactional units. Too many of them seem to start with the worldview that "we need to shove in as many of the separate bits as we possibly can."
What's often needed for robust systems, instead, is solid support for error handling such that "if this bit doesn't make it in, then neither does that bit." Data is always messy and dirty, and too many ETL systems don't seem architected to cope with that reality.
Of course, maybe I just haven't found the right tools. Anyone know of tools that handle this particularly well?
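For what it's worth, one way to approximate that "if this bit doesn't make it in, then neither does that bit" behavior with plain files is to stage everything and publish atomically. A minimal sketch - the file names and load steps here are made up for illustration:

```shell
#!/bin/sh
# All-or-nothing load: build every piece in a staging directory,
# and publish only if every step succeeded.
set -e                                  # abort on the first failing step
staging=$(mktemp -d)
trap 'rm -rf "$staging"' EXIT           # a failed run discards the staging dir

# Hypothetical extract/transform steps; real ones would go here.
echo "customer data" > "$staging/customers.csv"
echo "order data"    > "$staging/orders.csv"

# Publish atomically: either both bits make it in, or neither does.
rm -rf published
mv "$staging" published
trap - EXIT
```

This only gives transactional behavior at the file-system level, of course; loading into a database would want a real transaction instead.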
I write complex shell commands every day, but when one gets longer than 2-3 lines I switch to a text editor and write it in Perl instead. I see no need to push bash to that complexity; it doesn't look good in the terminal.
A poor man's version of multiple pipes is to write intermediate results to files, then "cat" the files as many times as needed for the following processes. I use short file names like "o1", "o2" (standing for output-1, output-2) and treat them as temp variables.
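As a sketch of that style (the o1/o2/o3 names follow the hypothetical convention above, and the commands are just stand-ins for real analyses):

```shell
# Fan one intermediate result out to several consumers by parking it
# in a file and reading it back as many times as needed.
printf '%s\n' apple banana cherry > o1   # output-1: the shared intermediate

wc -l < o1 | tr -d ' ' > o2              # output-2: one downstream analysis
tr 'a-z' 'A-Z' < o1 > o3                 # output-3: another consumer of o1

paste o2 o3                              # join the two derived results
rm -f o1 o2 o3                           # clean up the "temp variables"
```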
This is what it comes down to for me too. Using the shell to do programming seems like putting your job on hard mode.
When I had to do a lot of data processing at my last job, I started building up tools in Ruby. If I had time, I'd hack the workflow so that the next time I needed it, I could just run the tool from the command line.
Eventually I had a pluggable architecture that I could use to pull data from any number of sources and mix it with any other data. Do that with a shell? Why?
Interesting, this seems to be from a couple of people at the Information Systems Technology Laboratory (ISTLab) at the Athens University of Economics and Business. I wonder what the motivation is: security, or does it utilize multiple processor cores better than traditional pipes?
The impression I got is that it is still using traditional unix tools and pipes under the hood so I would expect the same efficiency as now. I think the big difference here is the syntax. Traditional shells are great if you have a linear dataflow where each program has one standard input and one standard output. However, if you want to have programs receiving multiple inputs from pipes or writing to multiple pipes then the `|` syntax is not enough.
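To make that concrete, here is what even a simple two-input flow costs you in a stock shell, using named pipes; this is a toy example I made up, not something from the dgsh site:

```shell
# A two-input join that a single linear `a | b | c` pipeline cannot
# express: `paste` reads from two pipes at once.
mkfifo left right

seq 1 3 > left &      # first producer feeds one input...
seq 4 6 > right &     # ...second producer feeds the other, in parallel

paste left right      # one consumer with, in effect, two standard inputs
wait                  # reap the background producers
rm -f left right      # named pipes must be cleaned up by hand
```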
This looks like potentially a great tool. It might be helpful if the author showed the code examples alongside the equivalent code in bash, so it's easy to see both what the example code is doing and how much effort is saved by doing it in dgsh.
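To sketch what such a side-by-side comparison might look like: the dgsh half below is my approximation of the multipipe-block syntax shown in the project's examples, so treat it as a rough illustration rather than verified dgsh code.

```shell
# dgsh (approximate syntax): a multipipe block fans the stream out to
# two commands and gathers their outputs in order:
#
#   find src -name '*.c' -exec cat {} + |
#   {{
#     wc -l &
#     grep -c TODO &
#   }} |
#   paste
#
# Plain bash: capture the stream once, run each branch by hand, and
# stitch the results back together yourself.
mkdir -p src
printf 'int main(void) { return 0; }\n/* TODO: error handling */\n' > src/a.c

code=$(find src -name '*.c' -exec cat {} +)
lines=$(printf '%s\n' "$code" | wc -l | tr -d ' ')
todos=$(printf '%s\n' "$code" | grep -c TODO)
printf '%s\t%s\n' "$lines" "$todos"
rm -rf src
```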
It doesn't look all that different to me. It seems like it just saves you the mess of assigning function inputs and outputs to shell variables; otherwise it looks like piping stuff around between functions.
Pipexec offers a versatile pipeline construction syntax, where you specify the topology of arbitrary graphs through the numbering of pipe descriptors. Dgsh offers a declarative directed graph construction syntax and automatically connects the parts for you. Also dgsh comes with familiar tools (tee, cat, paste, grep, sort) written to support the creation of such graphs.
We have measured many of the examples against the use of temporary files and the web report one against (single-threaded) implementations in Perl and Java. In almost all cases dgsh takes less wall clock time, but often consumes more CPU resources.