I believe that's what ack-grep[1] and the silver searcher (a.k.a. ag)[2] do underneath.
Actually, I would recommend giving those alternatives a try; I haven't had to look back at grep since I started using ack-grep (and now ag).
I am curious what would happen if we ran the commands in the reverse order, with the LANG=C variation first. I suspect some of the speedup is because you just brought the file into memory.
stuff$ du -sh big.log
2.8G big.log
stuff$ time grep -i e big.log > /dev/null
real 0m30.228s
user 0m12.213s
sys 0m3.228s
stuff$ time LANG=C grep -i e big.log > /dev/null
real 0m30.130s
user 0m12.105s
sys 0m3.308s
Sets the process's locale to C rather than whatever the system's default is (if the process is locale-aware; "C" is the default locale for C programs). The main change[0] in this case is that it also disables encoding (and thus decoding): all text is treated as ASCII rather than whatever the locale specifies (usually UTF-8 these days).
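A quick way to see the byte-vs-character distinction is a small sketch (illustrative only, not from the thread; the example string is made up):

```python
# The same file contents seen as raw bytes (what a C-locale grep scans)
# versus decoded text (what a UTF-8 locale implies).
data = "café\n".encode("utf-8")   # b'caf\xc3\xa9\n'

print(len(data))                  # 6: "é" occupies two bytes on disk
print(len(data.decode("utf-8")))  # 5: one code point after decoding
```

A C-locale grep can scan the six bytes directly; a UTF-8 locale obliges it to decode (or at least validate) multibyte sequences as it goes, which is where much of the overhead comes from.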
It's historical and sets your local character set. The default value on FreeBSD is POSIX, which is an alias for the historical value, C. Desktop Unixes like OS X or Ubuntu set it to UTF-8.
Another note is that this triggers the fgrep path, which is already fast due to its fixed-string expressions (i.e. no recursion is involved).
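For what it's worth, the fixed-string fast path amounts to a plain substring scan with no regex machinery involved; a rough illustration (hypothetical example data, nothing from grep itself):

```python
import re

# A literal pattern only needs a simple substring scan...
line = "time grep -i e big.log"
print("grep" in line)                         # True: plain byte-wise scan

# ...whereas a general pattern goes through the regex engine.
print(re.search(r"gre+p", line) is not None)  # True, but via the engine
```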
There was a really interesting post on here a while back about GNU grep vs BSD grep (2010)[1]
The improvement mentioned here also has to do with the Boyer-Moore algorithm. When switching the locale from LANG=whatever to LANG=C, we're reducing the size of the lookup table to a fraction of what it previously was. In this case, the fraction is 1/50th, but, as the author said, this will vary between patterns and platforms.
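To make the lookup-table point concrete, here is a sketch of a Boyer-Moore-style bad-character table (illustrative only; GNU grep's actual implementation differs, and the pattern is made up):

```python
def bad_char_table(pattern: bytes) -> list:
    """Shift table for a Boyer-Moore-style search over raw bytes."""
    m = len(pattern)
    table = [m] * 256            # one slot per byte value in the C locale
    for i, b in enumerate(pattern[:-1]):
        table[b] = m - 1 - i     # shift distance when byte b mismatches
    return table

table = bad_char_table(b"example")
print(len(table))                # 256 entries: small and cache-friendly
```

With single bytes the whole table fits in a few cache lines; a case-insensitive, locale-aware search cannot index a flat 256-entry array like this, which is the cost being described here.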
Note that, at least as of GNU grep 2.14, if you don't use -i, the discrepancy doesn't show up, so it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search. I suspect the case-insensitive version could also be done correctly much faster, though.
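The byte-search claim can be sanity-checked with a sketch: because UTF-8 continuation bytes always have the high bit set, a byte-level search for a complete pattern can't fire in the middle of another character (the example strings below are hypothetical):

```python
text = "naïve café naïf".encode("utf-8")
pattern = "café".encode("utf-8")

# Byte-for-byte search finds the match without any decoding pass.
print(text.find(pattern))        # 7: byte offset of "café"
```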
> it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search
It shouldn't be that simple: it'd also need to confirm that the pattern couldn't match any combining characters, or normalization would still be necessary.
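The combining-character caveat is easy to demonstrate (a sketch using Python's unicodedata module, not anything grep actually does):

```python
import unicodedata

composed = "café"                                     # "é" as U+00E9
decomposed = unicodedata.normalize("NFD", composed)   # "e" + combining U+0301

print(composed == decomposed)                    # False: different code points
print(composed.encode() in decomposed.encode())  # False: byte search misses it
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The two strings are canonically equivalent but byte-for-byte different, so a pure byte search would miss the decomposed form unless the input were normalized first.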
"Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was fixed with the release of GNU grep 2.7. The rest of the article can now be considered obsolete."
I did not see any version numbers, or whether we are discussing BSD grep or GNU grep. The grep in OS X is ridiculously slow. Whenever anyone says grep is slow, the first thing I ask is whether they are using OS X; the answer is almost always yes. GNU grep is a lot faster.
That being said, there was a bug with grep and UTF-8 a little while back. Debian lists the bug as present in 2.6 and fixed in 2.8:
agf | 12 years ago
His estimate accounting for that was 7x, but this is clearly not a benchmark that was carefully thought through.
pmelendez | 12 years ago
[1] http://beyondgrep.com/
[2] http://geoff.greer.fm/2011/12/27/the-silver-searcher-better-...
cs02rm0 | 12 years ago
anilshanbhag | 12 years ago
ye | 12 years ago
iagooar | 12 years ago
simias | 12 years ago
I'm still surprised that TFA can claim such a speedup; I would have thought I/O was the bottleneck when you grep through that much data.
As another poster mentioned, I wonder if the speedup isn't mainly disk caching in RAM during the 2nd run.
masklinn | 12 years ago
[0] the locale should have an impact on what character ranges match. See http://stackoverflow.com/questions/6799872/how-to-make-grep-... for an example
UNIXgod | 12 years ago
pmelendez | 12 years ago
It changes the charset so that it does not use UTF-8.
blassium | 12 years ago
[1] http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...
comex | 12 years ago
acdha | 12 years ago
tszming | 12 years ago
[1] https://news.ycombinator.com/item?id=3337411
[2] http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...
acqq | 12 years ago
http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-utf...
dfc | 12 years ago
"grep ." pathologically slow in UTF-8 locales -- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604408
nullanvoid | 12 years ago