top | item 18262818

(no title)

akarambir | 7 years ago

What does linux utilities like sed, awk use for text manipulation because they were very slow when I was changing a few table names in a sql file.

discuss

zorked|7 years ago

I don't think they use anything in common. Try to set your locale to "C" as otherwise string comparisons will do extra work handling your locale's notions of equivalent characters.

coldtea|7 years ago

What was the size of the SQL file?

A "few table names" doesn't mean much if the SQL file is 20GB.

In any case, sed and awk are plenty fast, but not the fastest methods of text manipulation. You could write a custom C program for that.

Thiez|7 years ago

While it sure is possible to do text manipulation in C, I don't think it should ever be the first choice, even if 'fastest' is a goal. A 0 byte is perfectly acceptable in a utf8 string (or any unicode string, really). But C has those annoying zero-terminated strings, so if you want to manipulate arbitrary unicode strings the first thing you can do is kiss the string functions in the C standard library goodbye. Which you probably want to do anyway because pascal-strings are simply better.

I would use Rust or C++ for this task.

masklinn|7 years ago

Note that this and that are not necessarily related: you're talking about performing unicode-aware text matching and manipulation, TFA is solely about validating a buffer's content as UTF-8.

rurban|7 years ago

They are still mostly not multi-byte string (i.e. unicode) aware after decades of work. I.e. you cannot really search for strings, with case-folding or normalized variants.

See http://crashcourse.housegordon.org/coreutils-multibyte-suppo... and http://perl11.org/blog/foldcase.html for an overview of the performance problems.

This tool only does the minor task of validation of the UTF-8 encoding, nothing else. There are still the major tasks of decoding, folding and normalization to do.

akx|7 years ago

How slow? On my 2013 MBP, `gsed` (sed from coreutils) can do a replacement like that at about 350 MiB/s (of which most seems to be spent writing to disk, since writing to /dev/null hikes it up to 800 MiB/s).

akarambir|7 years ago

It was sed substitute command on a ~800Mb file on Thinkpad T470 with SSD. It was taking around 40-50 sec for each substitution. Though as others have pointed, it may not be directly related to article in discussion.