
Performance comparison: counting words in Python, C/C++, Awk, Rust, and more

368 points | goranmoomin | 3 years ago | benhoyt.com

218 comments

[+] mustache_kimono|3 years ago|reply
Re: the Rust optimized implementation, I was able to get ~20-25% better performance by rewriting the for loops as iterators, using Rust's byte range pattern matching, and a buffered writer, which seems crazy, but it's true. I chalked it up to some crazy ILP/SIMD tricks the compiler is doing.
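Not the PR itself, but a rough sketch (my own, simplified to operate on an in-memory buffer rather than streaming) of what those three changes look like -- iterator chains instead of index-based loops, a byte range pattern for ASCII lowercasing, and a BufWriter for output:

```rust
use std::collections::HashMap;
use std::io::{self, BufWriter, Write};

// Count ASCII-lowercased words using iterator adapters and a byte range
// pattern instead of index-based for loops.
fn count_words(input: &[u8]) -> HashMap<Vec<u8>, u64> {
    let mut counts: HashMap<Vec<u8>, u64> = HashMap::new();
    for word in input.split(|b| b.is_ascii_whitespace()).filter(|w| !w.is_empty()) {
        let lower: Vec<u8> = word
            .iter()
            .map(|&b| match b {
                b'A'..=b'Z' => b + 32, // byte range pattern: ASCII lowercase
                _ => b,
            })
            .collect();
        *counts.entry(lower).or_insert(0) += 1;
    }
    counts
}

// Sort by descending count and write through a BufWriter, so output is
// batched instead of paying a write call per line.
fn print_counts(counts: HashMap<Vec<u8>, u64>, out: impl Write) -> io::Result<()> {
    let mut pairs: Vec<_> = counts.into_iter().collect();
    pairs.sort_unstable_by(|a, b| b.1.cmp(&a.1));
    let mut out = BufWriter::new(out);
    for (word, n) in pairs {
        out.write_all(&word)?;
        writeln!(out, " {}", n)?;
    }
    Ok(())
}
```

The interesting part is that this "higher level" style gives the compiler more freedom, not less: fixed-trip-count iterator loops are exactly what auto-vectorization and ILP-friendly codegen want to see.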

I even submitted a PR[0], but Ben was tired of maintaining the project and decided to archive it (which, fair enough!).

[0]: https://github.com/benhoyt/countwords/pull/115

[+] mustache_kimono|3 years ago|reply
In case anyone is interested, I did an optimized, but much simpler, Rust implementation just today[0], which is faster than the optimized implementation on my machine. No indexing into arrays of bytes, no unsafe, no "code golf" measures. Of course, credit where it's due -- these are simply tweaks to Andrew Gallant's code.

Looks like idiomatic Rust, which I think is interesting. Shows there is more than one way to skin a cat.

[0]: https://github.com/kimono-koans/countwords/blob/master/rust/...

[+] tobz1000|3 years ago|reply
These changes would prompt some additional remarks about the Rust implementation: most optimisations make code more verbose and less idiomatic, but in Rust it can sometimes be the opposite.
[+] codeflo|3 years ago|reply
Very nice. That’s in fact really clean, and (at least to me) indeed surprising.

As rejection reasons go, “I wanted to name-drop person X” is interesting. But as you said, that’s the maintainer’s decision to make.

[+] grumpyprole|3 years ago|reply
Iterators are potentially a much better design too as they allow for more modular code.
[+] burntsushi|3 years ago|reply
From my measurements, your optimized version runs at about the same speed as mine:

    $ cd /tmp/
    $ git clone -b ag/test-kimono https://github.com/BurntSushi/countwords
    $ cd countwords/rust/
    $ ./bench
    <snip>
    Summary
      './optimized-trie/target/release/countwords < kjvbible_x10.txt' ran
        1.22 ± 0.02 times faster than './kimono-simple/target/release/countwords < kjvbible_x10.txt'
        1.26 ± 0.02 times faster than './optimized-customhashmap/target/release/countwords < kjvbible_x10.txt'
        1.56 ± 0.03 times faster than './optimized-unsafe/target/release/countwords < kjvbible_x10.txt'
        1.58 ± 0.03 times faster than './kimono-optimized/target/release/countwords < kjvbible_x10.txt'
        1.60 ± 0.03 times faster than './optimized/target/release/countwords < kjvbible_x10.txt'
        3.97 ± 0.06 times faster than './simple/target/release/countwords < kjvbible_x10.txt'
        7.58 ± 0.21 times faster than './bonus/target/release/countwords < kjvbible_x10.txt'
You mentioned in another comment that you were benchmarking on an M1. Maybe there's some interesting differences there in the codegen, how the CPU executes or both.

Your 'fast-simple' version is a different story though. Personally, I would not put that in the "simple" classification as outlined by the OP. It is IMO very much not the first program someone would write for this. Bringing in 'hashbrown' and futzing with the hash table lookup/insert is definitely a perf tweak that you probably wouldn't want to do unless you had to, because it makes the code more complex. The 'Box<[u8]>' cements it as an 'optimized' variant.

Now it does look like a simpler variant of the "optimized" Rust program I wrote. I'm pretty impressed. I don't think I ever would have broken out of my local optimum to discover that program. In particular, your approach appears to do 4 (well, 3 and some change) passes over each buffer: 1) read up to '\n' (just the first line), 2) UTF-8 validation, 3) make lowercase and 4) split on whitespace. I realize chunking all of that up confers a lot of benefits, but still, 4 passes being faster than 1 pass is very interesting.

I've glanced at the profiles of each program but haven't been able to solidly conclude anything more precisely about where exactly the benefit is. At this point, my next step would be to slowly translate my version into yours, and benchmark each step of the way until I could isolate the key change (assuming it is one change). But alas, I have run out of time this evening. :-)

Kudos!

(The other interesting thing to note here is that my 'trie' variant is now the fastest Rust submission. Previously, it was slower than the optimized variants on my older CPU. That's pretty neat.)

[+] codeflo|3 years ago|reply
For those curious about Rust optimization, the “optimized” version makes three changes compared to the “idiomatic” version:

* It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

* It uses a faster hash algorithm. It’s not the first time this came up in a benchmark article. Rust’s decision to use a DOS-safe hash by default (and not provide a fast algorithm in the std, like other languages do) really seems to hurt it in that kind of microbenchmark.

* It uses get_mut+insert instead of the more convenient HashMap::entry method, because the latter would require redundantly allocating the key even in the repeat case. I’ve hit this problem in the past as well. Maybe the upcoming HashMap::raw_entry_mut will make this kind of optimization cleaner.
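A sketch of that third point, since the difference is easy to miss (function names here are mine, not from the benchmark code): `entry` must take an owned key up front, while a `get_mut` probe defers the allocation to the miss path, i.e. once per unique word.

```rust
use std::collections::HashMap;

// The convenient route: `entry` takes the key by value, so with borrowed
// input you must allocate a String on *every* lookup, hit or miss.
fn count_entry(counts: &mut HashMap<String, u64>, word: &str) {
    *counts.entry(word.to_owned()).or_insert(0) += 1;
}

// The optimized route described above: probe by &str first and only
// allocate the owned key on the miss path (once per unique word).
fn count_get_mut(counts: &mut HashMap<String, u64>, word: &str) {
    match counts.get_mut(word) {
        Some(n) => *n += 1,
        None => {
            counts.insert(word.to_owned(), 1);
        }
    }
}
```

The trade-off is two hash lookups on the miss path in exchange for zero allocations on the (far more common) hit path.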

[+] tialaramex|3 years ago|reply
The problem specified declares the words we're counting are ASCII:

> ASCII: it’s okay to only support ASCII for the whitespace handling and lowercase operation

UTF-8 (quite deliberately) is a superset of ASCII. So a UTF-8 solution is correct for ASCII, but a bytes-as-ASCII solution works fine in Rust if you only need ASCII.

This is why Rust provides ASCII variants of a lot of functions on strings, and the same functions are available on byte slices [u8] where ASCII could be what you have (whereas their Unicode cousins are not available on byte slices).
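A quick illustration of that API split (these assertions are mine, not from the article):

```rust
// The ASCII-only case functions exist on both &str and byte slices, while
// the Unicode-aware cousins exist only on str.
fn ascii_demo() {
    assert!("Hello".eq_ignore_ascii_case("HELLO"));
    assert!(b"Hello".eq_ignore_ascii_case(b"HELLO")); // same API on &[u8]

    let mut bytes = *b"The QUICK Brown fox";
    bytes.make_ascii_lowercase(); // in place, no UTF-8 validation needed
    assert_eq!(&bytes[..], &b"the quick brown fox"[..]);

    // Unicode-aware lowercasing needs a str; the ASCII variant simply
    // leaves non-ASCII characters alone:
    assert_eq!("ÄPFEL".to_lowercase(), "äpfel");
    assert_eq!("ÄPFEL".to_ascii_lowercase(), "Äpfel");
}
```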

[+] saghm|3 years ago|reply
> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

Sure, but is it changing the problem to something easier than what the other languages are already doing, or to something more similar? I'd imagine the C code is basically just using byte arrays as well, for instance.

[+] olalonde|3 years ago|reply
> * It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

To be fair, that's what the C version does as well.

[+] burntsushi|3 years ago|reply
> It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

No, that's very wrong. ripgrep has rich Unicode support for example, but represents file contents as byte strings. UTF-8 strings vs byte strings is an implementation detail.

I think you might benefit from reading the "bonus" Rust submission: https://github.com/benhoyt/countwords/blob/8553c8f600c40a462...

IMO, Ben kind of glossed over the bonus submission. But I personally think it was the biggest point in favor of Rust for a real world version of this task.

[+] coldtea|3 years ago|reply
>It uses byte strings instead of UTF-8 strings. In my opinion, that’s not an optimization, that’s changing the problem. Depending on the question you’re asking, only one of the two can be correct.

For tons of questions, both can be correct.

[+] stewbrew|3 years ago|reply
Ad UTF-8: it would change the problem only if the definition of word breaks changed. E.g., if a word break is defined by whitespace, then it probably doesn't.
[+] raffraffraff|3 years ago|reply
I love using the UNIX shell (pipes and programs that do "one thing well") to the point where it hurts my ability to improve at other languages. But sometimes all you need is a throwaway command to do something once, not a beautifully optimized function that will be written, improved, compiled, profiled, reoptimized and then executed a billion times. The utter terseness of a bash one-liner just tickles me.
[+] compumike|3 years ago|reply
How about extending this to comparing programming language efficiency from a developer's standpoint, looking at bytes of code, versus the execution time?

I took the source code file size from the repository, and the runtime from the blog post.

Then I made an arbitrary overall "PAIN SCORE" (lower is better) by multiplying code size * runtime. I suggest this is a worthwhile metric simply because lower is better on both axes, but of course, in the "real world" there will be different economic costs to CPU time and developer time depending on the use case. Here are the sorted results, from least "pain" to most:

    LANGUAGE   FILENAME        CODE SIZE      RUNTIME     PAIN SCORE (LOWER IS BETTER)
    Shell      optimized.sh      75 bytes      1.83 s      137.25
    Crystal    simple.cr        240 bytes      1.29 s      309.6
    Nim        simple.nim       424 bytes      0.77 s      326.48
    Python     simple.py        208 bytes      2.21 s      459.68
    Ruby       simple.rb        175 bytes      3.17 s      554.75
    Go         optimized.go    1514 bytes      0.40 s      605.6
    Python     optimized.py     464 bytes      1.33 s      617.12
    Zig        optimized.zig   2688 bytes      0.24 s      645.12
    Zig        simple.zig      1394 bytes      0.55 s      766.7
    Go         simple.go        688 bytes      1.12 s      770.56
    Nim        optimized.nim   1683 bytes      0.49 s      824.67
    Shell      simple.sh         60 bytes     14.81 s      888.6
    Ruby       optimized.rb     401 bytes      2.47 s      990.47
    JavaScript simple.js        532 bytes      1.88 s     1000.16
    C          optimized.c     4360 bytes      0.23 s     1002.80
    Rust       optimized.rs    3065 bytes      0.43 s     1317.95
    Swift      simple.swift     317 bytes      4.23 s     1340.91
    JavaScript optimized.js    1501 bytes      1.10 s     1651.1
    C          simple.c        2735 bytes      0.96 s     2625.6
    Rust       simple.rs       2239 bytes      1.38 s     3089.82
Sorting only by code size, the most concise implementations are: Shell, Ruby, Python, Crystal. Nobody was aiming to play code golf (i.e. minimize source code size), so these are fairly straightforward, idiomatic, readable implementations.

I am definitely a Crystal fan, in fact this afternoon I'm continuing to implement suggestions from my recent Show HN https://news.ycombinator.com/item?id=32081943 comments. :)

[+] HarHarVeryFunny|3 years ago|reply
This is a rather meaningless comparison, since the differences are going to be dominated by:

1) The choice of libraries/datatypes used for strings and word->count map

2) How the source file is split into words - probably library function again, although in C/C++ one could choose to implement a super-optimized low level version that would blow the others away

IMO a performance comparison between languages is only meaningful if the runtime is not dominated by library functions, or if one admits it's really a standard library performance comparison, not a language comparison.

[+] kitd|3 years ago|reply
I understand what you're saying. However, if you look at it not from the point of view of the performance figures per se, but rather of which language, when written idiomatically, can solve a realistic problem performantly, then I think it makes more sense.
[+] ashvardanian|3 years ago|reply
Agreed. It’s a common desire to boil benchmarks down to a single number.

Some of those languages, however, are far more complex than the others. And like any complex tool, they require a certain methodology. If you are to use heavy machinery like C++, forget iostreams and character-level processing on your hot paths.

[+] oconnor663|3 years ago|reply
Do you have a particular example problem in mind that might demonstrate language performance better? One of the Benchmark Game programs maybe?
[+] kovac|3 years ago|reply
With C I rarely use libs outside of stdlib and when I do I pay attention to performance more than convenience.

Even if it's dominated by libs and data types, I think it has value, as it might reflect what actually happens out there in the wild rather than being a purely academic exercise.

[+] Scarbutt|3 years ago|reply
Exactly, it's mostly comparing different C library implementations.
[+] gcanyon|3 years ago|reply
My language of choice, LiveCode, is a superset of HyperTalk. The concept of "words" has been built-in (not a library) since the '80s.
[+] anonymoushn|3 years ago|reply
Recently there was a contest to optimize the performance of a similar program[0] and a Zoom discussion of the optimizations[1].

The programs are not comparable in the following ways:

- Case: TFA requires (at least) ASCII lowercasing but the contest problem required no lowercasing.

- Ordering: TFA does not require sorting, but the contest problem required sorting.

- Memory: TFA imposes a requirement phrased as "don't read the whole file into memory" and this sounds like it's a resource-saving constraint, but it's usually a constraint that requires the program to spend additional resources. You could just mmap the file and store pointers into the mapped region. It costs extra to do copies instead of no copies.

- Text: TFA is unclear on what assumptions may be made about the lengths of words. For the contest problem, the Hungarian wikipedia input's longest word is around 80k.

- Safe, Hashing, Stdlib: TFA imposes some restrictions on what constructs may be used that are not imposed in the contest problem.

For the contest version of this problem, it seems like you can tokenize, hash, and count strings at around 1GB/s. Adapting a solution to solve TFA's problem (but not to conform to its Safe/Hashing/Stdlib requirements) would probably not carry too large of a penalty, since it's like 3 instructions to ASCII-lowercase 32 bytes and 1 string copy per unique string should take negligible time compared to the hash table lookups. So there is some room for the optimized solutions to go a little faster, if more optimizations are permitted.
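To illustrate the claim about cheap ASCII lowercasing: a scalar, branch-free sketch (my own, not the contest code) of the kind of loop compilers can typically auto-vectorize into a handful of SIMD instructions per 32-byte chunk:

```rust
// Branch-free ASCII lowercasing: for bytes in b'A'..=b'Z', set bit 0x20.
// Each iteration is pure data flow (compare, shift, or), which is exactly
// the shape auto-vectorizers handle well.
fn ascii_lowercase_in_place(buf: &mut [u8]) {
    for b in buf.iter_mut() {
        // is_upper is 1 for 'A'..='Z', else 0, computed without branches.
        let is_upper = (b.wrapping_sub(b'A') < 26) as u8;
        *b |= is_upper << 5; // OR-ing 0x20 maps 'A'..'Z' to 'a'..'z'
    }
}
```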

[0]: https://easyperf.net/blog/2022/05/28/Performance-analysis-an...

[1]: https://www.youtube.com/watch?v=R_yX0XjdSBY

[+] jeffbee|3 years ago|reply
Funny how you can take the "optimized" C++ and make it significantly faster (vs. gnu/linux standard libraries) with little effort. A lookup table for to_lower, instead of bit math, helps slightly. Building with the LLVM-libc string functions helps a lot, because the GNU std::string_view::operator== is calling an AVX-optimized memcmp via the PLT, which is a terrible strategy for short strings. LLVM-libc has an inlined bcmp (note: bcmp can be significantly faster than memcmp, but on GNU they are aliases) that is resolved for the target platform at build time instead of run time.

Edit:

Even if you ignore LLVM-libc, just slapping this into the optimized.cpp and replacing the critical `==` with `our_bcmp` makes it 10% faster. IFUNC calls to micro-optimized SIMD functions are counterproductive. It is far, far better that the compiler can see all the code at build time.

  int our_bcmp (const char* a, const char* b, size_t sz) {
    for (size_t i = 0; i < sz; i++) {
      if (a[i] != b[i]) return 1;
    }
    return 0;
  }
[+] djhworld|3 years ago|reply
EDIT: I think the use of the term "allocation" might be overloaded here from the original article, after thinking about this a bit more I think the author is referring to a different way of accessing the map. If you look in godbolt https://godbolt.org/z/Yf64rodW6 you can see that the pointer version uses

  CALL    runtime.mapaccess2_fast64(SB)
whereas the 'raw' version uses

  CALL    runtime.mapassign_fast64(SB)


When reading the Go solution this bit stood out to me

> To reduce the allocations, we’ll use a map[string]*int instead of map[string]int so we only have to allocate once per unique word, instead of for every increment

I just tried benchmarking this with a simple setup and I get zero allocations for both approaches, although the "pointer" approach is slightly faster

  package main
  
  import (
   "testing"
  )
  
  func BenchmarkIncrementMapRawInt(b *testing.B) {
   var data = make(map[int]int)
  
   b.ResetTimer()
   for i := 0; i < b.N; i++ {
    key := i % 1000
    data[key]++
   }
  }
  
  func BenchmarkIncrementMapPointerInt(b *testing.B) {
   var data = make(map[int]*int)
  
   b.ResetTimer()
   for i := 0; i < b.N; i++ {
    key := i % 1000
    increment(data, key)
   }
  }
  
  func increment(counts map[int]*int, value int) {
   if p, ok := counts[value]; ok {
    *p++
    return
   }

   n := 1
   counts[value] = &n
  }

  $ go test -bench=. -benchmem .
  goos: darwin
  goarch: arm64
  BenchmarkIncrementMapRawInt-8        112288002         10.57 ns/op        0 B/op        0 allocs/op
  BenchmarkIncrementMapPointerInt-8    139728302          8.586 ns/op        0 B/op        0 allocs/op
[+] dan-robertson|3 years ago|reply
Presumably this would allocate if the values weren't "primitive", e.g. if you had 128-bit ints implemented as a tuple of two ints.
[+] yakubin|3 years ago|reply
I like the OCaml version. It's both very easy to read and comes out really well in the benchmark.

I also looked at the "simple" Zig version, which came out really well in the benchmark, and to me it didn't look simple at all. It seems you need to make a lot of low level details explicit.

But IMHO AWK takes the crown here. :)

[+] OskarS|3 years ago|reply
> But IMHO AWK takes the crown here. :)

Agreed! AWK is still the king of this stuff. For tasks like this, I kinda think AWK is nigh-unbeatable: so simple to write, so obvious what's going on (even if you've never seen any AWK program before, you're probably going to be able to figure out what's going on there), and decently performant.

AWK is the bee's knees.

[+] tannhaeuser|3 years ago|reply
TFA only talks about gawk, though, making this comparison meaningless, i.e. you can't benchmark languages, only language implementations. Same goes for other programming languages ofc, unless those have only a single implementation.

mawk is an order of magnitude faster than gawk, and gawk isn't even the default on many Linuxen.

[+] gary17the|3 years ago|reply
Wow. Swift, touted as "safe by design and (...) runs lightning-fast"[1] is more of a screw-up than I thought. Almost twice as slow as Lua and behind even Pascal and Forth.

[1] https://developer.apple.com/swift/

[+] zX41ZdbW|3 years ago|reply
I've checked with ClickHouse and the result is better than I expected... it runs in 0.043 sec. on my machine, which is faster than any other result.

The code:

SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC FORMAT Null

or:

clickhouse-local --query "SELECT arrayJoin(splitByChar(' ', lower(line))) AS word, count() AS c FROM file('kjvbible.txt', LineAsString) WHERE notEmpty(word) GROUP BY word ORDER BY c DESC" > /dev/null

It is using only a single thread.

[+] fauigerzigerk|3 years ago|reply
>it runs in 0.043 sec. on my machine, which is faster than any other result.

Did you run the other benchmarks on your machine as well?

[+] zX41ZdbW|3 years ago|reply
I forgot to multiply the file 10 times. When I do, the result is 0.209 sec. which is still better than every other result.
[+] BiteCode_dev|3 years ago|reply
In this type of blog post with comparisons across many languages, you usually see some unnatural approach for your favorite language.

However, in this case, the Python example is idiomatic.

Since the article calls for better approaches, I would suggest taking the opportunity to use the walrus operator and rsplit() for the optimized version. Something like this:

    import sys
    from collections import Counter

    remaining = ""
    c = Counter()
    while (chunk := sys.stdin.read(64 * 1024)):
        pre, post = chunk.lower().rsplit('\n', 1)
        c.update((remaining + pre).split())
        remaining = post
    c.update(remaining.split())  # don't drop the final partial line
It should not affect performance too much, and the code gets more expressive.

However, performance will be quite different depending on the Python version you use. Interestingly, the Python 3.11 beta, which comes with a lot of performance tweaks, is slower for this exercise, while being reported to be faster on real-life tasks.

[+] chrismorgan|3 years ago|reply
> Incidentally, this problem set the scene for a wizard duel between two computer scientists several decades ago. In 1986, Jon Bentley asked Donald Knuth to show off “literate programming” with a solution to this problem, and he came up with an exquisite, ten-page Knuthian masterpiece. Then Doug McIlroy (the inventor of Unix pipelines) replied with a one-liner Unix shell version using tr, sort, and uniq.

Since this writing and other linked resources present only one side of the affair, I will mention: the presentation in Pearls was quite unfair to Knuth, and the conclusions commonly made of it somewhere between moderately and entirely unsound. (That is, as conclusions from the paper. I speak of unsoundness of logic only: these conclusions may be supportable from other sources.)

Knuth was challenged to demonstrate literate programming, and so that was what he demonstrated. He showed building something from the ground up assuming a very minimal environment, with problem analysis and all: of course that ended up more verbose than a Unix pipeline that glosses over problem analysis and starts with most of the tools you need to make a probably-acceptable solution!

McIlroy himself admitted a few years later that he had been “a little unfair” to criticise on engineering grounds what had only been an illustration of technique. Even in the paper, Bentley did admit a degree of culpability for this mismatch in his “criticism of programs” paragraph. (“He admires the execution of the solution, but faults the problem on engineering grounds. (That is, of course, my responsibility as problem assigner; Knuth solved the problem he was given on grounds that are important to most engineers -- the paychecks provided by their problem assigners.)”)

But I do wish that Knuth had been given the opportunity to write a review of McIlroy’s review, for simultaneous publication.

[+] svat|3 years ago|reply
See also a different comparison (on ~the same problem) at StackExchange: https://codegolf.stackexchange.com/questions/188133/bentleys...

(Copying from my comment the last time this was posted: https://news.ycombinator.com/item?id=26467684)

There's also a nice book "Exercises in Programming Style" about just this problem.

[+] ppeetteerr|3 years ago|reply
Curious if anyone knows why the Swift version is slow? Is it the casting from Substring to String?
[+] bhauer|3 years ago|reply
As I wrote elsewhere in this thread (though I was downvoted), I suspect the reason is that this benchmark is measuring the total execution time of a single run of an executable. This would include any time spent by the executable to bootstrap its runtime environment, initiate a virtual machine, and whatever else it needs to do in addition to the relevant code at hand to process the string.

Such a test will favor implementations that have extremely svelte bootstrapping, allowing them to immediately begin executing the relevant code and return a result.

I feel a more useful test would be for the relevant string processing code to be run tens to thousands of times within the process itself, so as to diminish the relative importance of bootstrap/warmup code and/or runtime optimization performed by VMs. Unless, of course, part of the intent was specifically to measure the impact of the bootstrapping.

[+] saagarjha|3 years ago|reply
Here's a very rough breakdown of the operations it does:

  1.33 s    56.7% specialized Collection<>.split(separator:maxSplits:omittingEmptySubsequences:)
  263.00 ms 11.2% Substring.lowercased()
  226.00 ms  9.6% specialized Dictionary.subscript.modify
  189.00 ms  8.0% readLine(strippingNewline:)
As you can see, calling split on the string is really slow. The reason it is slow is that it's using the generic implementation from Collection, rather than String, which doesn't really know anything about how String works. So to do the split it's doing a linear march down the string calling formIndex(after:), then subscripting to see if there's a space character there using Unicode-aware string comparison on that one Character.

Swift is really nice that it gives you "default" implementations of things for free if you conform to the right things in the protocol hierarchy. But, and this is kind of unfortunately a common bottleneck, if you "know" more you should really specialize the implementation to use a more optimized path that can use all the context that is available. For example, in this case a "split" implementation should really do its own substring matching and splitting rather than having Collection call it through its slow sequential indexing API. Probably something for the stdlib to look at, I guess.

The rest of the things are also not too unfamiliar but I'll go over them one by one; lowercased() is slow because it creates a new Substring and then a new String, plus Unicode stuff. Dictionary accesses are slow because Swift uses a secure but not very performant hash function. readLine is slow because the lines are short and the call to getline is not particularly optimized on macOS, reallocs and locks internally, then the result is used to create a new String with the newline sliced off.

[+] Twirrim|3 years ago|reply
As they call out in the article, accurately counting words is a bit more complicated than splitting by spaces. You need to account for all sorts of punctuation, etc.

When I've been asked this in programming interviews, I've almost always been expected to produce something like the code in the article (and do), but usually I'd point to something like the NLTK library as a better approach. It's polished and highly capable, and handles probably just about every edge case there is.

[+] staticassertion|3 years ago|reply
It's a bit of a shame that Rust's stdlib doesn't have a "fast but less safe" hasher. It's one of the easiest performance optimizations in my experience because it's very rare that an attacker controls input to your hashmap in a way that can actually lead to a DOS.

Not that it doesn't happen, it's just not something I've run into.

Just today I was testing out the performance difference hashing some Scylla queries and got nearly 2x faster hashing moving to fxhash.
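The swap itself is just a type change on std's HashMap. Here is a sketch with a toy FNV-1a hasher standing in for the fxhash/rustc-hash crates (the hasher below is illustrative, not the fxhash algorithm):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// Toy FNV-1a hasher: fast, deterministic, *not* DOS-resistant. The point is
// that std's HashMap accepts any Hasher via BuildHasherDefault.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }
}

impl Hasher for Fnv1a {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }
}

// Drop-in replacement for the default SipHash-backed map.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

fn word_counts(words: &[&str]) -> FastMap<String, u64> {
    let mut counts = FastMap::default();
    for w in words {
        *counts.entry((*w).to_string()).or_insert(0) += 1;
    }
    counts
}
```

Every call site keeps the normal HashMap API; only the type alias changes.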

[+] ribit|3 years ago|reply
These days I tend to use R for this type of problem. It’s slow, but vectors as first-class data types simplify things a lot

readLines(file) |> strsplit(" +") |> unlist() |> table() |> sort(decreasing = TRUE)

[+] jcelerier|3 years ago|reply
Got a bit better C++ version here which uses a couple libraries instead of std:: stuff - https://gist.github.com/jcelerier/74dfd473bccec8f1bd5d78be5a... ; boost, fmt and https://github.com/martinus/robin-hood-hashing

    $ g++ -I robin-hood-hashing/src/include -O2 -flto -std=c++20 -fno-exceptions -fno-unwind-tables -fno-asynchronous-unwind-tables -lfmt

    $ time ./a.out < kjvbible_x10.txt > /dev/null
    0,19s user 0,01s system 99% cpu 0,197 total
with the same build flags, optimize.cpp gives me

    0,22s user 0,01s system 99% cpu 0,233 total