
Fascination of Awk

194 points | todsacerdoti | 3 years ago | maximullaris.com

84 comments

[+] cpdean|3 years ago|reply
I'm a huge fan of awk but the "Python vs awk" page this links to [1] shows python code that's almost deliberately atrocious.

Take this function the author wrote for converting a list of integers (or strings) into floats

    def ints2float(integerlist):
        for n in range(0,len(integerlist)):
            integerlist[n]=float(integerlist[n])
        return integerlist
Using `range(0,len(integerlist))` immediately betrays that the author doesn't understand Python. The first arg in `range` is entirely redundant. Mutating the input list like this is also just bad design. Anyone who has used Python for longer than a month would write this as just `[float(i) for i in integerlist]`.
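
For comparison, a non-mutating version (a sketch keeping the author's function name; sample values are made up):

```python
def ints2float(integerlist):
    # build a new list instead of mutating the argument in place
    return [float(i) for i in integerlist]

print(ints2float(["3349.9", "3360.2"]))  # → [3349.9, 3360.2]
```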

Further down in the function `format_captured` you see this attempt at obfuscation:

    freqs=ints2float(filter(None,captured[n].split(' '))[2:5])
Why bother with a `filter`? Who hurt you?

    freqs = ints2float(captured[n].split(' ')[2:5])
That said, the author's implementation in awk does look pretty clean. I'm just peeved that they straw-manned the other language.
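
One caveat (the sample line below is hypothetical, modeled on the Gaussian output the script parses): `split(' ')` keeps empty fields wherever spaces repeat, which is what the `filter(None, ...)` was compensating for; argument-less `split()` handles runs of whitespace with no filter needed, and in Python 3 a `filter` object can't be sliced at all.

```python
line = " Frequencies --   111.1   222.2   333.3"  # hypothetical sample line
# split(' ') produces empty strings wherever spaces repeat...
print(line.split(' ')[2:5])   # not the three numbers
# ...while split() collapses any whitespace run
print(line.split()[2:5])      # → ['111.1', '222.2', '333.3']
```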

[1] https://pmitev.github.io/to-awk-or-not/Python_vs_awk/

[+] elesiuta|3 years ago|reply
> python code that's almost deliberately atrocious

That code was so bad I felt I had to step in too. I used ChatGPT to simplify it a bit, but it also introduced some errors, so I found what appears to be an input file to test it on [1]. The only difference from the awk program is that it uses spaces while the original Python program used tabs.

  #!/usr/bin/env python3
  import sys
  
  freq, fc, ir = [], [], []
  with open(sys.argv[1]) as f:
      for line in f:
          words = line.split()
          if "Frequencies" in line:
              freq.extend(words[2:])
          elif "Frc consts" in line:
              fc.extend(words[3:])
          elif "IR Inten" in line:
              ir.extend(words[3:])
  
  for i in range(len(freq)):
      print(f"{freq[i]}\t{fc[i]}\t{ir[i]}")
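
The final loop could also zip the three lists instead of indexing, a minor stylistic variant (sample values are made up):

```python
freq, fc, ir = ["111.1"], ["0.5"], ["10.0"]  # hypothetical sample data
for f, c, i in zip(freq, fc, ir):
    print(f"{f}\t{c}\t{i}")
```
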
[1] https://dornshuld.chemistry.msstate.edu/comp-chem/first-gaus...
[+] NoboruWataya|3 years ago|reply
Agreed - this is pretty much the perfect use case for list comprehensions, which are one of the best features of Python. Normally "oh but there's a better way to do it in that language" isn't a particularly interesting observation, but here it completely turns the author's point on its head. I can't think of many more elegant ways to convert a list of ints to floats, in any language, than `[float(i) for i in integerlist]`.
[+] sgarland|3 years ago|reply
I wanted to point out that the Python code was written to be 2.7 compatible, and maybe the atrociousness was due to that, but then I looked up when list comprehensions were introduced - 2.0, with PEP202.
[+] version_five|3 years ago|reply
Looks like he copy-pasted the Python version from another forum post and didn't look at it carefully. I suspect it can be made to look a lot cleaner (edit: yes, e.g. by just translating each of the main lines in the awk script to an if statement). I agree with the strawman comment.
[+] asicsp|3 years ago|reply
My rough workflow is:

* Can I solve the problem using one-liners (grep, sed, awk, perl, sort, etc)? Or perhaps from within Vim?

* Can I glue together one-liners with minimal control flow as a Bash script?

* If not, go for Python

---

Discussion for https://blog.jpalardy.com/posts/why-learn-awk/ mentioned at the end of the article: https://news.ycombinator.com/item?id=22108680 (420 points | Jan 21, 2020 | 235 comments)

[+] mananaysiempre|3 years ago|reply
I don’t think you’re right to put Awk into the one-liner category. It actually scales up remarkably well to a couple hundred lines or so, as long as the problem does not strain its anaemic data-structure capabilities.

Compared to straight Python (i.e. not Numpy/pandas), it can also be surprisingly fast[1]. I experienced this personally on a trivial problem: take a TSV of (I,J) pairs, histogram I/J into a given number of bins. I can’t remember the exact figures now, but it went like this: on a five-gig file, pure Python is annoyingly slow, GNU awk is okayish but I still have to wait, mawk is fast enough that I don’t wait for it, C is of course still at least an order of magnitude faster but at that point it doesn’t matter.
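
Something like that histogram task fits in a few lines of awk (a hypothetical sketch: assumes tab-separated pairs in a file named pairs.tsv and ratios I/J in [0,1]):

```shell
# bin the ratio I/J of each pair into 10 equal-width bins
awk -v bins=10 '{
    b = int(($1 / $2) * bins)
    if (b >= bins) b = bins - 1   # clamp ratio == 1.0 into the last bin
    h[b]++
}
END { for (i = 0; i < bins; i++) print i, h[i] + 0 }' pairs.tsv
```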

[1] https://brenocon.com/blog/2009/09/dont-mawk-awk-the-fastest-... note that the original author of mawk has made a release since then, at https://github.com/mikebrennan000/mawk-2, that doesn’t have the crashes I encountered with 64-bit builds of Dickey’s fork

[+] version_five|3 years ago|reply
If I have a text file of csv that I need to do something with, I'll usually start with a shell script and coreutils + sed and awk.

If I need a script to generate some output (a regular use case that seems to come up for me is random numbers with some property), I tend to use python.

I also use python if I need to do more complicated aggregations or something on tabular data that pandas is better at. Though it's fun to try with `join` and awk sometimes (parsing csv can get tricky).

If I need to plot something I tend to use jupyter notebook but it's way more satisfying to use gnuplot, which I mention because it fits naturally into workflows that use shell tools like awk.

[+] gpvos|3 years ago|reply
This, but with Perl instead of Python. Perl really has a perfect flow from one-liners to small scripts.
[+] BiteCode_dev|3 years ago|reply
Same, but with two recent additions:

- can I use ripgrep, fdfind, fzf and choose to do this?

- can I ask chatgpt to do this ?

[+] noloblo|3 years ago|reply
Awk, sed, bash and perl are extremely underrated and nearly always beat Python for elegance and succinctness on the repeating problems in the daily chores of sysadmin work.
[+] anthk|3 years ago|reply
Perl itself replaces awk, sed and bash for scripts. Think about it :D.
[+] pcw888|3 years ago|reply
I remember the trend away from Perl, but it's a great language. It just became unfashionable.
[+] mcculley|3 years ago|reply
A long time ago, I built a relatively complex program that managed some other systems in awk. It was really a great fit for the problem and I was, at the time, working in an environment with poor developer tooling. The target systems were heterogeneous and I could not depend on Perl even being available. But awk was guaranteed to be there.

The problem was that every time someone else was asked to add any feature to it, they freaked out at the language choice and I had to get on a plane.

[+] sgarland|3 years ago|reply
If you find yourself piping any combination of cut, grep, sed, uniq (and likely others I'm missing) together, you can probably do it all in awk. If you can guarantee usage of gawk, you can add sort to that list (tbf you can also implement any algorithm you want in awk, but arguably at that point you're wasting time) - and it's also worth noting that you can dedupe in awk _without_ having sorted input, albeit at the cost of storing all unique lines in memory.

Pipes are great because they enable you to trivially send data between programs, and they're terrible for the same reason. While the execution time on modern computers for the average data size isn't noticeable, on larger datasets or repeated execution, it absolutely is. If you don't have to pipe, don't.
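
For instance, a pipeline like the following can collapse into one awk process (hypothetical CSV; the `sort | uniq -c` stage exists only to count, which awk does without sorting):

```shell
# four processes...
cut -d, -f2 data.csv | grep -v '^$' | sort | uniq -c
# ...or one: count non-empty values of column 2
awk -F, '$2 != "" { n[$2]++ } END { for (v in n) print n[v], v }' data.csv
```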

[+] Rediscover|3 years ago|reply
> ...dedup in awk...

I most recently did that an hour ago, and again a few hours before that; pretty much every day.

awk '! x[$0]++' foo
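
The trick: `x[$0]++` evaluates to 0 (falsy) the first time a line is seen and to a positive count afterwards, so the negation is true only for first occurrences, and the default action `{ print }` does the rest. For example:

```shell
# only the first occurrence of each line survives
printf 'a\nb\na\nc\nb\n' | awk '! x[$0]++'
# → a
#   b
#   c
```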

[+] klodolph|3 years ago|reply
Awk is actually amazing as long as you operate within its limitations and core problems that it solves well—it is definitely not limited to one-liners, but it has other limitations.

It is just so damn good at the things it is good at, but nobody learns to use it, because learning Awk is inefficient in the grand scheme of things. There are better things you can learn.

There is one place where Awk has undeniable superiority—and that is its use in environments where bureaucratic rules prohibit the distribution of programs / code (Perl, Python), but where Awk is permitted.

[+] pphysch|3 years ago|reply
A naive AWK solution (extract+transform text) will vastly outperform a naive Python solution in throughput and memory footprint.

It's absolutely efficient to learn, because there isn't much to learn.

[+] pk-protect-ai|3 years ago|reply
In 1996 I was paid to teach Unix to a group of customers. As the graduation project for my course they had to write an awk program for budgeting, without the use of any databases. I'm sure they had a lot of fun and cursed me for this sadistic approach to teaching at the time. However, decades later some of them were still thankful they had learned regex and awk back then.
[+] kmarc|3 years ago|reply
I do this for my own budgeting in 2023.

PDF / CSV / Excel exports from my three webbanks, a bit of pdftotext or soffice conversion, just to pipe to awk, augment it, and render a properly formatted spreadsheet.

[+] lofaszvanitt|3 years ago|reply
Bash, awk et al = the syntax that can't be remembered, not even when smashed in the head with a tire iron or at the point of a submachine gun.
[+] samuell|3 years ago|reply
I found the awk syntax to be surprisingly discoverable, once I got the rough structure of its scripts.

I think the confusing factor with awk is that it allows you to leave out various levels of structure in really simple scripts, meaning that the scripts you see around will look quite different.

E.g. all the following would be the same (looking for the string "something" in column 1, and printing only those lines):

'$1 == "something"'

'$1 == "something" { print }'

'($1 == "something")'

'($1 == "something") { print }'

... to give a small example.

At least this confused me a lot in the beginning.
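
All four forms behave identically because awk's pattern-action structure defaults a missing action to `{ print }` and the parentheses around the pattern are optional. A quick check with made-up input:

```shell
# only lines whose first field is "something" are printed
printf 'something 1\nother 2\nsomething 3\n' | awk '$1 == "something"'
# → something 1
#   something 3
```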

[+] tannhaeuser|3 years ago|reply
Awk syntax is basically what became core JavaScript, according to its creator [1]. Bourne shell syntax is very different, so I take your comment as a frustrated reaction to the "Python obsolete" comment, which must be seen in the context of Python introducing itself as an awk replacement among other things (though not nearly as aggressively as Perl, which used to have the a2p script to bogusly convert awk to Perl code).

[1]: https://brendaneich.com/2010/07/a-brief-history-of-javascrip...

[+] hawski|3 years ago|reply
I can agree on bash syntax being crazy, but certainly not on awk. Awk is very simple, a man page is all you need if you need to find something. Otherwise what's so complex with awk?
[+] stn_za|3 years ago|reply
It's easy. You just need to stop thinking of these things as toys...

Awk + bash could easily recreate most existing code in a couple of lines

[+] asdff|3 years ago|reply
Bash is so comfortable to write in, it's like you are just writing pseudocode. Pipes >>>>>>> a dozen parentheses I forget to close half the time.
[+] joepvd|3 years ago|reply
awk is amazing. One pattern I often use is:

$ query_something | awk 'generate commands' | sh

For larger programs, I wrote and use ngetopt.awk: https://github.com/joepvd/ngetopt.awk. This is a loadable library for gawk that lets you add option parsing for programs.
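
A concrete (hypothetical) instance of that pattern, archiving log files; running the first command alone lets you review the generated script before appending `| sh`:

```shell
# naive sketch: assumes filenames without spaces or shell metacharacters
mkdir -p archive
ls *.log | awk '{ printf "mv %s archive/%s\n", $0, $0 }'        # inspect the commands
ls *.log | awk '{ printf "mv %s archive/%s\n", $0, $0 }' | sh   # then run them
```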

[+] mturmon|3 years ago|reply
This can be a very powerful idiom (basically, code generation at the shell prompt).

It’s well suited to iterative composition of the commands: I’ll write the query/find part, and (with ctrl P) add the awk manipulations, and then pipe to sh.

If it doesn’t have side effects you can pass through “head” before “sh” to check syntax on a subset.

[+] PaulHoule|3 years ago|reply
It is a guilty pleasure but I like writing awk scripts that write shell scripts that get piped into sh, for example

   ls | awk '{if (length($1)==7) {print "cat " $1 }}' | sh
it is something you really aren't supposed to do, because bad inputs could be executed by the shell. Personally, the control structures for bash never stick in my mind because they are so unlike conventional programming languages (and I only write shell scripts sporadically), so I have to look them up in the info pages each time. I could do something like the above with xargs, but same thing: I find it painful to look at the man page for xargs.

When I show this trick to younger people it seems most of them aren't familiar with awk at all.

For me the shell is mostly displaced by "single file Python", where I stick to the standard library and don't pip install anything. For simple scripting it can be a little more code than bash, but there is no cliff where things get more difficult, and I code Python almost every day, so I know where to find everything in the manual that isn't at my fingertips.

[+] Rediscover|3 years ago|reply
Do You ever use awk's "system" command?

I run stuff like You mentioned (piping to a shell) and also system() frequently. It depends on many factors which one I'll choose.

(FWIW, I'm also quite decent in many shell flavors on many Unix/Linux variants, so that is another determinant)

Eg, the Busybox ash(1) that I frequently work with does not support arrays, but its awk(1) does...
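
For comparison, the `system()` variant of the generate-and-pipe trick (hypothetical input; each match forks a shell, so it's slower in bulk, but it needs no outer pipe):

```shell
# run a command per matching record from inside awk itself
printf 'alpha\nbeta12\nfoo.txt\n' | awk 'length($0) == 7 { system("echo got " $0) }'
# → got foo.txt
```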

[+] koolba|3 years ago|reply
That’s not so terrible if you at least verify the output before the final “| sh”.

Though you’d have to be confident that running it twice is going to give the same results. If it’s remote data that could change then weird/bad/nasty things could happen.

For anything non trivial, best to separate those steps and generate a temp script to execute.

[+] john-tells-all|3 years ago|reply
I love awk! It's incredibly clean and direct.

Once I wrote a 300-line Awk script to install a kernel driver. It would scan a hardware bus, and ask the user questions before loading the driver onto the system. Lots of fun!

[+] xp84|3 years ago|reply
The existence of things like "git implemented in awk"* serves as a great reminder that there are a lot of developers out there who are far, far, far, far more talented than I will ever be. I salute them.

*Especially keeping in mind that these people wrote things like this for fun.

[+] evadk8|3 years ago|reply
The problem with awk is that it is slow: counting the lines of a JSON file (some 18 million lines) takes a few seconds, whereas sed is much faster.
[+] jjuliano|3 years ago|reply
Not until you've used jq inside awk.