I believe that's what ack-grep[1] and the silver searcher (a.k.a. ag)[2] do underneath.
Actually, I would recommend giving those alternatives a try; I haven't had to look back at grep since I started using ack-grep (and now ag).
I am curious what would happen if we ran the commands in the reverse order, with the LANG=C variation first. I suspect some of the speedup is because you just brought the file into memory.
stuff$ du -sh big.log
2.8G big.log
stuff$ time grep -i e big.log > /dev/null
real 0m30.228s
user 0m12.213s
sys 0m3.228s
stuff$ time LANG=C grep -i e big.log > /dev/null
real 0m30.130s
user 0m12.105s
sys 0m3.308s
Sets the process's locale to C rather than whatever the system's default is (if the process is locale-aware; "C" is the default locale for C programs). The main change[0] in this case is that it also disables encoding (and thus decoding): all text is treated as ASCII rather than whatever the locale specifies (usually UTF-8 these days).
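A quick way to see the byte-vs-character distinction is a small sketch (illustrative only, not from the thread; the example string is made up):

```python
# The same file contents seen as raw bytes (what a C-locale grep scans)
# versus decoded text (what a UTF-8 locale implies).
data = "café\n".encode("utf-8")   # b'caf\xc3\xa9\n'

print(len(data))                  # 6: "é" occupies two bytes on disk
print(len(data.decode("utf-8")))  # 5: one code point after decoding
```

A C-locale grep can scan the six bytes directly; a UTF-8 locale obliges it to decode (or at least validate) multibyte sequences as it goes, which is where much of the overhead comes from.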
It's historical and sets your local character set. The default value on FreeBSD is POSIX, which is an alias for the historical value, C. Desktop Unixes like OS X or Ubuntu set it to UTF-8.
Another note is that this triggers the fgrep path, which is already fast due to its fixed-string expressions (i.e. no recursion is involved).
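For what it's worth, the fixed-string fast path amounts to a plain substring scan with no regex machinery involved; a rough illustration (hypothetical example data, nothing from grep itself):

```python
import re

# A literal pattern only needs a simple substring scan...
line = "time grep -i e big.log"
print("grep" in line)                         # True: plain byte-wise scan

# ...whereas a general pattern goes through the regex engine.
print(re.search(r"gre+p", line) is not None)  # True, but via the engine
```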
There was a really interesting post on here a while back about GNU grep vs BSD grep (2010)[1]
The improvement mentioned here also has to do with the Boyer-Moore algorithm. When switching the locale from LANG=whatever to LANG=C, we're reducing the size of the lookup table to a fraction of what it previously was. In this case, the fraction is 1/50th, but, as the author said, this will vary between patterns and platforms.
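To make the lookup-table point concrete, here is a sketch of a Boyer-Moore-style bad-character table (illustrative only; GNU grep's actual implementation differs, and the pattern is made up):

```python
def bad_char_table(pattern: bytes) -> list:
    """Shift table for a Boyer-Moore-style search over raw bytes."""
    m = len(pattern)
    table = [m] * 256            # one slot per byte value in the C locale
    for i, b in enumerate(pattern[:-1]):
        table[b] = m - 1 - i     # shift distance when byte b mismatches
    return table

table = bad_char_table(b"example")
print(len(table))                # 256 entries: small and cache-friendly
```

With single bytes the whole table fits in a few cache lines; a case-insensitive, locale-aware search cannot index a flat 256-entry array like this, which is the cost being described here.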
Note that, at least as of GNU grep 2.14, if you don't use -i, the discrepancy doesn't show up, so it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search. I suspect the case-insensitive version could also be done correctly much faster, though.
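The byte-search claim can be sanity-checked with a sketch: because UTF-8 continuation bytes always have the high bit set, a byte-level search for a complete pattern can't fire in the middle of another character (the example strings below are hypothetical):

```python
text = "naïve café naïf".encode("utf-8")
pattern = "café".encode("utf-8")

# Byte-for-byte search finds the match without any decoding pass.
print(text.find(pattern))        # 7: byte offset of "café"
```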
> it's smart enough to recognize that the UTF-8 search can be correctly performed as a byte search
It shouldn't be that simple: it'd also need to confirm that the pattern couldn't match any combining characters, or normalization would still be necessary.
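The combining-character caveat is easy to demonstrate (a sketch using Python's unicodedata module, not anything grep actually does):

```python
import unicodedata

composed = "café"                                     # "é" as U+00E9
decomposed = unicodedata.normalize("NFD", composed)   # "e" + combining U+0301

print(composed == decomposed)                    # False: different code points
print(composed.encode() in decomposed.encode())  # False: byte search misses it
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```

The two strings are canonically equivalent but byte-for-byte different, so a pure byte search would miss the decomposed form unless the input were normalized first.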
"Update on 2010/10/28: GNU grep is no longer slow on UTF-8. The problem was fixed with the release of GNU grep 2.7. The rest of the article can now be considered obsolete."
I did not see any version numbers, or whether we are discussing BSD grep or GNU grep. The grep in OS X is ridiculously slow. Whenever anyone says grep is slow, the first thing I ask is whether they are using OS X; the answer is almost always yes. GNU grep is a lot faster.
That being said, there was a bug with grep and UTF-8 a little while back. Debian lists the bug as present in 2.6 and fixed in 2.8:
agf | 12 years ago
His estimate accounting for that was 7x, but this is clearly not a benchmark that was carefully thought through.
pmelendez | 12 years ago
[1] http://beyondgrep.com/
[2] http://geoff.greer.fm/2011/12/27/the-silver-searcher-better-...
cs02rm0 | 12 years ago
anilshanbhag | 12 years ago
ye | 12 years ago
iagooar | 12 years ago
simias | 12 years ago
I'm still surprised that TFA can claim such a speedup; I would have thought I/O was the bottleneck when you grep through that much data.
As another poster mentioned, I wonder if the speedup isn't mainly disk caching in RAM during the 2nd run.
masklinn | 12 years ago
[0] the locale should have an impact on what character ranges match. See http://stackoverflow.com/questions/6799872/how-to-make-grep-... for an example
UNIXgod | 12 years ago
pmelendez | 12 years ago
It changes the charset so that it does not use UTF-8.
blassium | 12 years ago
[1] http://lists.freebsd.org/pipermail/freebsd-current/2010-Augu...
comex | 12 years ago
acdha | 12 years ago
tszming | 12 years ago
[1] https://news.ycombinator.com/item?id=3337411
[2] http://dtrace.org/blogs/brendan/2011/12/08/2000x-performance...
acqq | 12 years ago
http://rg03.wordpress.com/2009/09/09/gnu-grep-is-slow-on-utf...
dfc | 12 years ago
"grep ." pathologically slow in UTF-8 locales -- http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=604408
nullanvoid | 12 years ago