I don't think they use anything in common. Try setting your locale to "C"; otherwise string comparisons will do extra work handling your locale's notions of equivalent characters.
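To make the locale point concrete: for the command-line tools this usually means prefixing the command with `LC_ALL=C`. The same effect can be sketched in Python, where the "C" locale reduces collation to plain code-point order. The word list is made up for illustration, and this only shows the ordering behaviour, not how much time any given tool saves.

```python
import locale

# Switch this process to the "C" locale, where collation is plain byte order
# and no locale-specific equivalence rules are applied.
locale.setlocale(locale.LC_ALL, "C")

words = ["apple", "Banana", "cherry"]  # illustrative data only

# locale.strxfrm follows the active locale. Under "C" it leaves ASCII text
# unchanged, so uppercase letters sort before lowercase ones.
sorted_c = sorted(words, key=locale.strxfrm)
print(sorted_c)
```

Under a locale such as `en_US.UTF-8` the same sort would typically interleave upper and lower case, which is the extra work being avoided.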
While it sure is possible to do text manipulation in C, I don't think it should ever be the first choice, even if 'fastest' is a goal. A 0 byte is perfectly acceptable in a UTF-8 string (or any Unicode string, really). But C has those annoying zero-terminated strings, so if you want to manipulate arbitrary Unicode strings, the first thing you can do is kiss the string functions in the C standard library goodbye. Which you probably want to do anyway, because Pascal strings are simply better.
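A small sketch of the zero-byte problem, using Python's `ctypes` to get a C-string view of a buffer. The 4-byte little-endian length prefix is a hypothetical layout chosen for illustration, not any particular Pascal-string format.

```python
import ctypes
import struct

text = "a\x00b"              # U+0000 in the middle of a string
data = text.encode("utf-8")  # b"a\x00b": U+0000 encodes as a single 0 byte

# Strict decoding succeeds, so the embedded 0 byte is valid UTF-8.
assert data.decode("utf-8") == text

# C-string view: .value stops at the first 0 byte, much as strlen() would.
buf = ctypes.create_string_buffer(data)
c_view = buf.value           # only the part before the 0 byte survives

# Length-prefixed view (hypothetical layout: 4-byte length, then payload).
packed = struct.pack("<I", len(data)) + data
(length,) = struct.unpack_from("<I", packed)
payload = packed[4:4 + length]  # the full three bytes, 0 byte included
```

The length-prefixed form keeps the whole payload because the length is stored rather than inferred from a terminator.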
Note that this and that are not necessarily related: you're talking about performing Unicode-aware text matching and manipulation, while TFA is solely about validating a buffer's content as UTF-8.
They are still mostly not multi-byte string (i.e. Unicode) aware after decades of work. That is, you cannot really search for strings with case folding or normalized variants.
This tool only does the minor task of validation of the UTF-8 encoding, nothing else. There are still the major tasks of decoding, folding and normalization to do.
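The gap between validation and matching can be sketched with Python's standard library. Here `is_valid_utf8` and `match_key` are hypothetical helper names, and the NFD, case-fold, NFD sequence is one common recipe for caseless matching, not what any of the tools discussed here actually do.

```python
import unicodedata

def is_valid_utf8(buf: bytes) -> bool:
    """Validation only: does the buffer decode as strict UTF-8?"""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def match_key(s: str) -> str:
    """Hypothetical comparison key: normalize, case-fold, normalize again."""
    folded = unicodedata.normalize("NFD", s).casefold()
    return unicodedata.normalize("NFD", folded)

composed = "Caf\u00e9"      # 'é' as a single code point
decomposed = "CAFE\u0301"   # 'E' followed by a combining acute accent

# Both buffers pass validation, yet the raw strings do not compare equal.
both_valid = is_valid_utf8(composed.encode("utf-8")) and is_valid_utf8(
    decomposed.encode("utf-8")
)
raw_equal = composed == decomposed
key_equal = match_key(composed) == match_key(decomposed)

# A truncated two-byte sequence and an overlong encoding are both rejected.
bad_inputs_rejected = not is_valid_utf8(b"\xc3\x28") and not is_valid_utf8(b"\xc0\xaf")
```

Validation tells you the bytes are well formed; it is only after decoding, folding and normalizing that the two spellings are recognized as the same word.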
How slow? On my 2013 MBP, `gsed` (GNU sed) can do a replacement like that at about 350 MiB/s (most of that time seems to be spent writing to disk, since writing to /dev/null instead hikes it up to 800 MiB/s).
It was a sed substitute command on a ~800 MB file on a ThinkPad T470 with an SSD. It was taking around 40-50 seconds for each substitution. Though as others have pointed out, it may not be directly related to the article under discussion.
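For a rough sense of scale, the two sets of figures can be put in the same unit. This assumes "~800Mb" means 800 MB (800 × 10^6 bytes), and the machines, files and sed expressions differ, so it is a ballpark comparison only.

```python
MB = 1000 ** 2   # megabyte, assumed meaning of the file size quoted above
MiB = 1024 ** 2  # mebibyte, the unit used for the 350 MiB/s figure

def throughput_mib_s(size_bytes: float, seconds: float) -> float:
    """Average throughput in MiB/s for a job of the given size and duration."""
    return size_bytes / seconds / MiB

file_size = 800 * MB
at_40_s = throughput_mib_s(file_size, 40)  # roughly 19 MiB/s
at_50_s = throughput_mib_s(file_size, 50)  # roughly 15 MiB/s

# Ratio against the 350 MiB/s quoted for gsed on the 2013 MBP.
slowdown_low = 350 / at_40_s
slowdown_high = 350 / at_50_s
print(f"{at_50_s:.1f}-{at_40_s:.1f} MiB/s, "
      f"about {slowdown_low:.0f}x to {slowdown_high:.0f}x slower")
```

So the reported run sits somewhere around 15-19 MiB/s, more than an order of magnitude below the other measurement.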
zorked|7 years ago
coldtea|7 years ago
A "few table names" doesn't mean much if the SQL file is 20GB.
In any case, sed and awk are plenty fast, but not the fastest methods of text manipulation. You could write a custom C program for that.
Thiez|7 years ago
I would use Rust or C++ for this task.
masklinn|7 years ago
rurban|7 years ago
See http://crashcourse.housegordon.org/coreutils-multibyte-suppo... and http://perl11.org/blog/foldcase.html for an overview of the performance problems.
akx|7 years ago
akarambir|7 years ago