I'm a huge fan of awk but the "Python vs awk" page this links to [1] shows python code that's almost deliberately atrocious.
Take this function the author wrote for converting a list of integers (or strings) into floats:
    def ints2float(integerlist):
        for n in range(0,len(integerlist)):
            integerlist[n]=float(integerlist[n])
        return integerlist
Using `range(0,len(integerlist))` immediately betrays that the author doesn't understand Python; the first argument to `range` is entirely redundant. Mutating the input list like this is also just bad design. Anyone who has used Python for longer than a month would write this as just `[float(i) for i in integerlist]`.
Further down in the function `format_captured` you see this attempt at obfuscation:
> python code that's almost deliberately atrocious
That code was so bad I felt I had to step in too. I used ChatGPT to simplify it a bit, but it also introduced some errors, so I found what appears to be an input file to test it on [1]. The only difference from the awk program is that the output uses spaces where the original Python program used tabs.
    #!/usr/bin/env python3
    import sys
    freq, fc, ir = [], [], []
    with open(sys.argv[1]) as f:
        for line in f.readlines():
            words = line.split()
            if "Frequencies" in line:
                freq.extend(words[2:])
            elif "Frc consts" in line:
                fc.extend(words[3:])
            elif "IR Inten" in line:
                ir.extend(words[3:])
    for i in range(len(freq)):
        print(f"{freq[i]}\t{fc[i]}\t{ir[i]}")
Agreed - this is pretty much the perfect use case for list comprehensions, which are one of the best features of Python. Normally "oh but there's a better way to do it in that language" isn't a particularly interesting observation, but here it completely turns the author's point on its head. I can't think of many more elegant ways to convert a list of ints to floats, in any language, than `[float(i) for i in integerlist]`.
I wanted to point out that the Python code was written to be 2.7-compatible, and maybe the atrociousness was due to that, but then I looked up when list comprehensions were introduced: Python 2.0, with PEP 202.
Looks like he copy-pasted the Python version from another forum post and didn't look at it carefully. I suspect it could be made to look a lot cleaner (edit: yes, e.g. by just translating each of the main lines in the awk script into an if statement). I agree with the strawman comment.
I don’t think you’re right to put Awk into the one-liner category. It actually scales up remarkably well to a couple hundred lines or so, as long as the problem does not strain its anaemic data-structure capabilities.
Compared to straight Python (i.e. not Numpy/pandas), it can also be surprisingly fast[1]. I experienced this personally on a trivial problem: take a TSV of (I,J) pairs, histogram I/J into a given number of bins. I can’t remember the exact figures now, but it went like this: on a five-gig file, pure Python is annoyingly slow, GNU awk is okayish but I still have to wait, mawk is fast enough that I don’t wait for it, C is of course still at least an order of magnitude faster but at that point it doesn’t matter.
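A rough sketch of that kind of binning in awk (the data, bin count, and bin width here are all invented for illustration, not the original code):

```shell
# Histogram I/J from a TSV of (I,J) pairs into 4 equal bins over [0,1);
# values at or above the top edge are clamped into the last bin.
printf '1\t2\n1\t4\n3\t4\n9\t10\n' |
awk -F'\t' '
  { r = $1 / $2          # the I/J ratio
    b = int(r / 0.25)    # bin index for width 0.25
    if (b > 3) b = 3     # clamp into the last bin
    h[b]++ }
  END { for (i = 0; i <= 3; i++) print i, h[i] + 0 }'
```

This tight per-line arithmetic loop is exactly the kind of workload where, as above, mawk tends to pull ahead of gawk.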
If I have a text file or CSV that I need to do something with, I'll usually start with a shell script and coreutils + sed and awk.
If I need a script to generate some output (a regular use case that seems to come up for me is random numbers with some property), I tend to use python.
I also use python if I need to do more complicated aggregations or something on tabular data that pandas is better at. Though it's fun to try with `join` and awk sometimes (parsing csv can get tricky).
If I need to plot something I tend to use jupyter notebook but it's way more satisfying to use gnuplot, which I mention because it fits naturally into workflows that use shell tools like awk.
Awk, sed, bash, and perl are extremely underrated and nearly always beat Python for elegance and succinctness on the repetitive problems in the daily chores of sysadmin work.
A long time ago, I built a relatively complex program that managed some other systems in awk. It was really a great fit for the problem and I was, at the time, working in an environment with poor developer tooling. The target systems were heterogeneous and I could not depend on Perl even being available. But awk was guaranteed to be there.
The problem was that every time someone else was asked to add any feature to it, they freaked out at the language choice and I had to get on a plane.
If you find yourself piping any combination of cut, grep, sed, and uniq (and likely others I'm missing) together, you can probably do it all in awk. If you can guarantee gawk, you can add sort to that list (tbf you can implement any algorithm you want in awk, but arguably at that point you're wasting time). It's also worth noting that you can dedupe in awk _without_ sorted input, albeit at the cost of storing all unique lines in memory.
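Two toy illustrations of that, with made-up input:

```shell
# One awk call instead of  cut -f2 | grep foo :
printf 'a\tfoo1\nb\tbar\nc\tfoo2\n' | awk -F'\t' '$2 ~ /foo/ { print $2 }'

# Dedupe without sorted input: print each line only the first time it
# appears; the seen[] array is what costs memory on huge inputs.
printf 'b\na\nb\nc\na\n' | awk '!seen[$0]++'
```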
Pipes are great because they enable you to trivially send data between programs, and they're terrible for the same reason. While the execution time on modern computers for the average data size isn't noticeable, on larger datasets or repeated execution, it absolutely is. If you don't have to pipe, don't.
Awk is actually amazing as long as you operate within its limitations and stick to the core problems it solves well—it is definitely not limited to one-liners, but it does have other limits.
It is just so damn good at the things it is good at, but nobody learns to use it, because learning Awk is inefficient in the grand scheme of things. There are better things you can learn.
There is one place where Awk has undeniable superiority—and that is its use in environments where bureaucratic rules prohibit the distribution of programs / code (Perl, Python), but where Awk is permitted.
In 1996 I was paid to teach Unix to a group of customers. As the graduation project for my course they had to write an awk program for budgeting without the use of any databases. I'm sure they had a lot of fun and cursed me for this sadistic approach to teaching at the time. However, decades later some of them were still thankful they had learned regex and awk back then.
PDF / CSV / Excel exports from my three web banks, a bit of pdftotext or soffice conversion, just to pipe to awk to augment it and render a properly formatted spreadsheet.
I found the awk syntax to be surprisingly discoverable, once I got the rough structure of scripts.
I think the confusing factor with awk is that it allows you to leave out various levels of structure in really simple scripts, meaning that the same scripts you see around can look quite different.
E.g. all the following would be the same (looking for the string "something" in column 1, and printing only those lines):
Awk syntax is basically what became core JavaScript, according to its creator [1]. Bourne shell syntax is very different, so I take your comment as a frustrated reaction to the "Python obsolete" comment, which must be seen in the context of Python introducing itself as an awk replacement, among other things (though not nearly as aggressively as Perl, which used to have the a2p script to bogusly convert awk to Perl code).
I can agree on bash syntax being crazy, but certainly not on awk. Awk is very simple; the man page is all you need if you have to look something up. Otherwise, what's so complex about awk?
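For what it's worth, the whole mental model fits in one sentence: an awk program is a list of pattern { action } pairs run against each input line, plus optional BEGIN/END blocks. A minimal made-up example:

```shell
# Sum the second column of lines whose first field is "ok".
printf 'ok 1\nbad 5\nok 2\n' |
awk '
  BEGIN      { total = 0 }     # runs once, before any input
  $1 == "ok" { total += $2 }   # runs for every line matching the pattern
  END        { print total }'  # runs once, after the last line
```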
For larger programs, I wrote and use ngetopt.awk: https://github.com/joepvd/ngetopt.awk. This is a loadable library for gawk that lets you add option parsing for programs.
This can be a very powerful idiom (basically, code generation at the shell prompt).
It’s well suited to iterative composition of the commands: I’ll write the query/find part, and (with ctrl P) add the awk manipulations, and then pipe to sh.
If it doesn’t have side effects you can pass through “head” before “sh” to check syntax on a subset.
It is a guilty pleasure but I like writing awk scripts that write shell scripts that get piped into sh, for example
ls | awk '{if (length($1)==7) {print "cat " $1 }}' | sh
It is something you really aren't supposed to do, because bad inputs could be executed by the shell. Personally, the control structures for bash never stick in my mind because they are so unlike conventional programming languages (and I only write shell scripts sporadically), so I have to look them up in the info pages each time. I could do something like the above with xargs, but same thing: I find it painful to look at the man page for xargs.
When I show this trick to younger people it seems most of them aren't familiar with awk at all.
For me the shell is mostly displaced by "single file Python", where I stick to the standard library and don't pip-install anything. For simple scripting it can be a little more code than bash, but there is no cliff where things get more difficult, and I code Python almost every day, so I know where to find everything in the manual that isn't at my fingertips.
That’s not so terrible if you at least verify the output before the final “| sh”.
Though you’d have to be confident that running it twice is going to give the same results. If it’s remote data that could change then weird/bad/nasty things could happen.
For anything non trivial, best to separate those steps and generate a temp script to execute.
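That separation can look something like this (the generated commands are just harmless echoes for the sketch):

```shell
# Generate the commands into a temp file, review them, then execute.
tmp=$(mktemp)
printf 'one\ntwo\n' | awk '{ print "echo processed " $1 }' > "$tmp"
cat "$tmp"      # eyeball the generated commands first
sh "$tmp"       # run them only once they look right
rm -f "$tmp"
```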
Once I wrote a 300-line Awk script to install a kernel driver. It would scan a hardware bus, and ask the user questions before loading the driver onto the system. Lots of fun!
The existence of things like "git implemented in awk"* serves as a great reminder that there are a lot of developers out there who are far, far, far, far more talented than I will ever be. I salute them.
*Especially keeping in mind that these people wrote things like this for fun.
cpdean | 3 years ago
Why bother with a `filter`? Who hurt you? That said, the author's implementation in awk does look pretty clean. I'm just peeved that they straw-manned the other language.
[1] https://pmitev.github.io/to-awk-or-not/Python_vs_awk/
elesiuta | 3 years ago
[1] https://dornshuld.chemistry.msstate.edu/comp-chem/first-gaus...
asicsp | 3 years ago
* Can I solve the problem using one-liners (grep, sed, awk, perl, sort, etc)? Or perhaps from within Vim?
* Can I glue together one-liners with minimal control flow as a Bash script?
* If not, go for Python
---
Discussion for https://blog.jpalardy.com/posts/why-learn-awk/ mentioned at the end of the article: https://news.ycombinator.com/item?id=22108680 (420 points | Jan 21, 2020 | 235 comments)
mananaysiempre | 3 years ago
[1] https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-... (note that the original author of mawk has made a release since then, at https://github.com/mikebrennan000/mawk-2, that doesn't have the crashes I encountered with 64-bit builds of Dickey's fork)
BiteCode_dev | 3 years ago
- can I use ripgrep, fdfind, fzf and choose to do this?
- can I ask ChatGPT to do this?
Rediscover | 3 years ago
I most recently did that an hour ago, and again a few hours before that; pretty much every day.
awk '! x[$0]++' foo
pphysch | 3 years ago
It's absolutely efficient to learn, because there isn't much to learn.
samuell | 3 years ago
'$1 == "something"'
'$1 == "something" { print }'
'($1 == "something")'
'($1 == "something") { print }'
... to give a small example.
At least this confused me a lot in the beginning.
tannhaeuser | 3 years ago
[1]: https://brendaneich.com/2010/07/a-brief-history-of-javascrip...
stn_za | 3 years ago
Awk + bash could easily recreate most existing code in a couple of lines
joepvd | 3 years ago
$ query_something | awk 'generate commands' | sh
Rediscover | 3 years ago
I run stuff like you mentioned (piping to a shell) and also system() frequently. Which one I choose depends on many factors.
(FWIW, I'm also quite decent in many shell flavors on many Unix/Linux variants, so that is another determinant)
E.g., the Busybox ash(1) that I frequently work with does not support arrays, but its awk(1) does...
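For example, tallying occurrences with an awk associative array where the shell itself has none (the data is invented; the final sort only pins down awk's unspecified for-in order):

```shell
# Count how often each line appears, no shell arrays required.
printf 'red\nblue\nred\n' |
awk '{ count[$0]++ } END { for (w in count) print w, count[w] }' |
sort
```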
cesaref | 3 years ago
https://github.com/crossbowerbt/awk-webserver
I know, just because you can doesn't mean you should.
tyingq | 3 years ago
https://news.ycombinator.com/item?id=22085459