There are so many string searching algorithms now, with different best and worst cases. Some work better on low alphabets (eg DNA), some are better for text or high entropy data, some take advantage of CPU instructions, some are generic. The real challenge is picking the right algorithm.
I've implemented a few of them in java here, and extended them to support multi byte matching at any position:
For Part I, though I get the point to demonstrate proofs with CBMC, this is a pretty contrived example. E.g. using state machines for the cocos search makes an on-line very lazily written implementation human-verifiable at a read. I would like to see an example that doesn't have such a contrived nature.
Edit: Now I see in part II/III where the title actually comes to fruition and there is a state machine oriented generator is made and used. That still doesn't satisfy me about part I, honestly I would have just skipped it in retrospect.
"You see, there are two memory accesses per one character of the input string."
No, there's one memory access on average even if your compiler is maximally dumb (short circuit AND is something C compilers need to know how to do to be conformant). And in a smarter compiler this will be one memory access per byte anyways. Not to mention on any modern architecture, even if your compiler doesn't squash the second memory access, the microarchitecture will.
Also all the semicolons after his brackets bug me. Not to mention that the thing looks past the end of the string. (We have a length specified, so I assume it's not null terminated... indeed, we can get a match past the end).
"worst" versus "average". He probably could've phrased the sentence better, but his conclusion isn't wrong. If the string is all 'o' characters we'd hit this worst case memory access.
Isn't the "homebrew" result at the end exactly the state machine from the DFA?
It has been a while, but I remember KMP calculating a prefix table of some sort beforehand, to skip the repeated checking on yet-unclear partial matches?
Yeah, back in university I've really only learned the version that is presented in part 3. I mean I feel like the prefix table is just a more succinct representation of the DFA but I do find it interesting to see one way that that approach could've been arrived at.
Thank you for the write-up, I love the way you use CMBC to construct your algorithm, seem like a way to synthesize code :)
Thanks also for point CMBC, I was looking for something like that for years!
Long ago, I wrote a text search that used the 8088 XLAT instruction in a very short loop to do a text search, the carry flag got set when there was a hit.
On modern machines, it would all end up in cache quickly, the branch would be well predicted, and it would just zip through text.
It handled upper/lower case (you just set the bits corresponding to both upper and lower case of the letter)... but it couldn't do regex.
Part 1 is a fairly naïve hand-rolled solution. Part 2 shows KMP generating a DFA, which does reduce branching (specifically in the search, where the branches in the first part become array lookups in the second part).
It is and it isn't. The DFA is not deliberately constructed, it's more a "happy accident". A renaming/reframing of the `seen` variable would make it obvious that what is constructed is a manually conceived DFA. As presented, `seen` is "how many characters have we seen of the target string?". Reframing it as "There exists a state for each character in the target, `seen` represents which of these states we are presently in" makes it clear that it is a DFA. But as constructed it's an accidental DFA, rather than a deliberate one.
Does avoiding memory accesses for short strings really provide much performance improvement? Surely the adjacent characters, which need to be read at least once anyway, will be in cache. I can see the benefit when the searched-for string is long.
It makes the algorithm (if not using a series of unoptimized if-else-ifs) linear in the length of the string being searched. You go from O(n*k) to O(n).
I mean, it may not matter much, but it is more efficient. A more impactful optimization is knowing how far to skip in the string being searched based on which state you're in and what character appears.
[+] [-] MattPalmer1086|5 years ago|reply
For those who are interested, there's a good tool to specifically test string searching algorithms here:
https://github.com/smart-tool/smart
There are so many string searching algorithms now, with different best and worst cases. Some work better on low alphabets (eg DNA), some are better for text or high entropy data, some take advantage of CPU instructions, some are generic. The real challenge is picking the right algorithm.
I've implemented a few of them in java here, and extended them to support multi byte matching at any position:
https://github.com/nishihatapalmer/byteseek
[+] [-] boxfire|5 years ago|reply
Edit: Now I see in part II/III where the title actually comes to fruition and there is a state machine oriented generator is made and used. That still doesn't satisfy me about part I, honestly I would have just skipped it in retrospect.
[+] [-] mlyle|5 years ago|reply
[+] [-] mlyle|5 years ago|reply
No, there's one memory access on average even if your compiler is maximally dumb (short circuit AND is something C compilers need to know how to do to be conformant). And in a smarter compiler this will be one memory access per byte anyways. Not to mention on any modern architecture, even if your compiler doesn't squash the second memory access, the microarchitecture will.
Also all the semicolons after his brackets bug me. Not to mention that the thing looks past the end of the string. (We have a length specified, so I assume it's not null terminated... indeed, we can get a match past the end).
[+] [-] klodolph|5 years ago|reply
[+] [-] Jtsummers|5 years ago|reply
[+] [-] maweki|5 years ago|reply
It has been a while, but I remember KMP calculating a prefix table of some sort beforehand, to skip the repeated checking on yet-unclear partial matches?
[+] [-] Jtsummers|5 years ago|reply
[+] [-] thdc|5 years ago|reply
[+] [-] victor82|5 years ago|reply
[+] [-] kuter|5 years ago|reply
[+] [-] mikewarot|5 years ago|reply
Long ago, I wrote a text search that used the 8088 XLAT instruction in a very short loop to do a text search, the carry flag got set when there was a hit.
On modern machines, it would all end up in cache quickly, the branch would be well predicted, and it would just zip through text.
It handled upper/lower case (you just set the bits corresponding to both upper and lower case of the letter)... but it couldn't do regex.
[+] [-] Jtsummers|5 years ago|reply
[+] [-] dennis714|5 years ago|reply
[+] [-] danmg|5 years ago|reply
[+] [-] Jtsummers|5 years ago|reply
[+] [-] bambataa|5 years ago|reply
[+] [-] Jtsummers|5 years ago|reply
I mean, it may not matter much, but it is more efficient. A more impactful optimization is knowing how far to skip in the string being searched based on which state you're in and what character appears.
[+] [-] tangjurine|5 years ago|reply
[+] [-] kreeben|5 years ago|reply