top | item 25856558

Knuth-Morris-Pratt string-searching algorithm: DFA-less version

109 points| ingve | 5 years ago |yurichev.com | reply

38 comments

order
[+] MattPalmer1086|5 years ago|reply
That was a fun read, I liked the use of cmbc to validate the algorithm.

For those who are interested, there's a good tool to specifically test string searching algorithms here:

https://github.com/smart-tool/smart

There are so many string searching algorithms now, with different best and worst cases. Some work better on low alphabets (eg DNA), some are better for text or high entropy data, some take advantage of CPU instructions, some are generic. The real challenge is picking the right algorithm.

I've implemented a few of them in java here, and extended them to support multi byte matching at any position:

https://github.com/nishihatapalmer/byteseek

[+] boxfire|5 years ago|reply
For Part I, though I get the point to demonstrate proofs with CBMC, this is a pretty contrived example. E.g. using state machines for the cocos search makes an on-line very lazily written implementation human-verifiable at a read. I would like to see an example that doesn't have such a contrived nature.

Edit: Now I see in part II/III where the title actually comes to fruition and there is a state machine oriented generator is made and used. That still doesn't satisfy me about part I, honestly I would have just skipped it in retrospect.

[+] mlyle|5 years ago|reply
Note that the use of CMBC here was not thorough enough to spot that his reference implementation reads past the end of the string.
[+] mlyle|5 years ago|reply
"You see, there are two memory accesses per one character of the input string."

No, there's one memory access on average even if your compiler is maximally dumb (short circuit AND is something C compilers need to know how to do to be conformant). And in a smarter compiler this will be one memory access per byte anyways. Not to mention on any modern architecture, even if your compiler doesn't squash the second memory access, the microarchitecture will.

Also all the semicolons after his brackets bug me. Not to mention that the thing looks past the end of the string. (We have a length specified, so I assume it's not null terminated... indeed, we can get a match past the end).

[+] klodolph|5 years ago|reply

    search_ok("ooooooooooooooo", 15)
30 memory accesses. length times two. The article is correct.
[+] Jtsummers|5 years ago|reply
"worst" versus "average". He probably could've phrased the sentence better, but his conclusion isn't wrong. If the string is all 'o' characters we'd hit this worst case memory access.
[+] maweki|5 years ago|reply
Isn't the "homebrew" result at the end exactly the state machine from the DFA?

It has been a while, but I remember KMP calculating a prefix table of some sort beforehand, to skip the repeated checking on yet-unclear partial matches?

[+] Jtsummers|5 years ago|reply
Yes. The author discusses the relation between the homebrew solution and a DFA solution in part 2, and more about partial matches in part 3.
[+] thdc|5 years ago|reply
Yeah, back in university I've really only learned the version that is presented in part 3. I mean I feel like the prefix table is just a more succinct representation of the DFA but I do find it interesting to see one way that that approach could've been arrived at.
[+] victor82|5 years ago|reply
Thank you for the write-up, I love the way you use CMBC to construct your algorithm, seem like a way to synthesize code :) Thanks also for point CMBC, I was looking for something like that for years!
[+] mikewarot|5 years ago|reply
Wouldn't it be better to do less branching?

Long ago, I wrote a text search that used the 8088 XLAT instruction in a very short loop to do a text search, the carry flag got set when there was a hit.

On modern machines, it would all end up in cache quickly, the branch would be well predicted, and it would just zip through text.

It handled upper/lower case (you just set the bits corresponding to both upper and lower case of the letter)... but it couldn't do regex.

[+] Jtsummers|5 years ago|reply
Part 1 is a fairly naïve hand-rolled solution. Part 2 shows KMP generating a DFA, which does reduce branching (specifically in the search, where the branches in the first part become array lookups in the second part).
[+] danmg|5 years ago|reply
Is this really DFA-less? Isn't the look-back table you construct just encoding the DFA?
[+] Jtsummers|5 years ago|reply
It is and it isn't. The DFA is not deliberately constructed, it's more a "happy accident". A renaming/reframing of the `seen` variable would make it obvious that what is constructed is a manually conceived DFA. As presented, `seen` is "how many characters have we seen of the target string?". Reframing it as "There exists a state for each character in the target, `seen` represents which of these states we are presently in" makes it clear that it is a DFA. But as constructed it's an accidental DFA, rather than a deliberate one.
[+] bambataa|5 years ago|reply
Does avoiding memory accesses for short strings really provide much performance improvement? Surely the adjacent characters, which need to be read at least once anyway, will be in cache. I can see the benefit when the searched-for string is long.
[+] Jtsummers|5 years ago|reply
It makes the algorithm (if not using a series of unoptimized if-else-ifs) linear in the length of the string being searched. You go from O(n*k) to O(n).

I mean, it may not matter much, but it is more efficient. A more impactful optimization is knowing how far to skip in the string being searched based on which state you're in and what character appears.

[+] tangjurine|5 years ago|reply
I took a look at part 1, and the use of CMBC was cool!
[+] kreeben|5 years ago|reply
This sound interesting. I wanted to read the documentation but the link in the README file returns 404.