nirvanis | 2 years ago

Somewhat related tip: prepend LANG=C to console commands such as grep to speed them up when processing large files, as they will then assume ASCII input (which is probably what you have in most cases).
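A minimal sketch of what prepending the variable looks like in practice. The sample file and the pattern are made up for illustration. It uses `LC_ALL=C` rather than `LANG=C` because `LC_ALL` takes precedence over `LANG` and the individual `LC_*` variables, so it still works when the environment already sets one of those.

```shell
# Hypothetical sample file, created only for this sketch.
sample=$(mktemp)
printf 'alpha\nbeta\ngamma\n' > "$sample"

# The assignment applies to this single grep invocation only;
# the locale of the surrounding shell is left unchanged.
count=$(LC_ALL=C grep -c 'a$' "$sample")
echo "$count"

rm -f "$sample"
```

Whether this buys any speed depends on the tool, the pattern, and the input, so it is worth timing on your own data before relying on it.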

seanhunter|2 years ago

If you care about speed you would probably be using ripgrep rather than grep anyway, but doesn’t `LANG=en_US.UTF-8` give similar speed on modern systems, without any compromise on consistency of sort ordering etc. or on support for extended characters?

burntsushi|2 years ago

For GNU grep in particular, no, using a UTF-8 locale can significantly slow it down:

    $ time LC_ALL=C grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    3
    
    real    0.808
    user    0.744
    sys     0.063
    maxmem  10 MB
    faults  0
    
    $ time LC_ALL=en_US.UTF-8 grep -E '^\w{30}$' OpenSubtitles2018.raw.sample.en -c
    4
    
    real    20.064
    user    19.982
    sys     0.077
    maxmem  10 MB
    faults  0
Whereas ripgrep is Unicode-aware by default, and still about as fast as the ASCII-only variant of GNU grep above:

    $ time rg '^\w{30}$' OpenSubtitles2018.raw.sample.en -c 
    4
    
    real    1.163
    user    1.132
    sys     0.030
    maxmem  916 MB
    faults  0

emmelaich|2 years ago

And set it for consistent ordering (collation) between sort, join, tsort, look, etc.
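A rough sketch of the collation point, assuming POSIX `sort` and `join` and two made-up input files: `join` expects its inputs to be ordered the same way it compares keys itself, so both steps are given the same locale.

```shell
# Hypothetical scratch files, created only for this sketch.
dir=$(mktemp -d)
printf 'b 1\nA 2\na 3\n' > "$dir/left"
printf 'a x\nB y\nA z\n' > "$dir/right"

# Sort both inputs by byte value (C locale): uppercase sorts before lowercase.
LC_ALL=C sort "$dir/left"  > "$dir/left.sorted"
LC_ALL=C sort "$dir/right" > "$dir/right.sorted"

# Run join under the same locale so its key comparisons agree
# with the order produced by the sort calls above.
joined=$(LC_ALL=C join "$dir/left.sorted" "$dir/right.sorted")
echo "$joined"

rm -rf "$dir"
```

Only the keys `A` and `a` appear in both files, so those are the two lines that get joined. Mixing locales between the sort and the join is where ordering mismatches tend to show up.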