jove_ | 2 years ago

As everyone has pointed out, this does not count. Note that the claim that regex can't parse HTML is specific and proven. What it means is that you can't write a single expression that matches an opening tag together with its matching closing tag at arbitrary nesting depth. There's no way to handle arbitrarily nested tags within a single regex; it's only possible to write one that matches up to a fixed, finite nesting limit.
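To make the "finite nesting limit" point concrete, here is a minimal sketch in Python. It uses balanced parentheses as a stand-in for tags (a simplification, not real HTML), and the helper name `nested_pattern` is made up for illustration. Each extra level of nesting has to be spelled out in the pattern itself, so any given pattern only works up to the depth it was built for.

```python
import re

def nested_pattern(depth):
    """Build a classic regex for balanced parentheses nested at most `depth` deep."""
    if depth == 0:
        # Depth 0: no parentheses allowed at all.
        return r"[^()]*"
    inner = nested_pattern(depth - 1)
    # Plain text, then any number of "( <one level shallower> )" chunks,
    # each followed by more plain text.
    return rf"[^()]*(?:\({inner}\)[^()]*)*"

def matches(depth, text):
    return re.fullmatch(nested_pattern(depth), text) is not None

print(matches(1, "(a)(b)"))   # flat pairs fit within depth 1
print(matches(1, "((a))"))    # nested two deep: too deep for this pattern
print(matches(2, "((a))"))    # the depth-2 pattern handles it
```

The pattern grows with every level you add, which is the practical face of the theoretical result: no single fixed pattern of this kind covers every depth.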

im3w1l | 2 years ago

I think this is the difference between the theoretician and the practitioner. You see, your interpretation is the obvious one for the former. But as any practitioner can tell you, a regular expression can't even parse a regular language!

See, normally the whole point of parsing something is to get data out right. And the way a regex gets data out is through capture groups. But herein lies the issue: a capture group can only capture one piece of information!

Consider a simple regular language: a non-empty sequence of comma-separated positive integers. We would like to get the integers out. An attempt:

  (\d+)(,(\d+))*

The first group captures the first number. The second group exists only to make the repetition work; we don't care about its value. The third (inner) group should ideally capture each of the subsequent numbers separately. But it doesn't! If you run that regex on 1,2,3,4,5,6,7,8,9, you will find that group 1 matches 1 and group 3 matches 9. Where did all the other numbers go?! In most engines, a group inside a repetition keeps only its last match.
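This is easy to check. A small sketch using Python's standard `re` module (the choice of engine and the helper name `groups_for` are assumptions for illustration; the comment does not name a language):

```python
import re

# The regex from the comment, unchanged.
PATTERN = re.compile(r"(\d+)(,(\d+))*")

def groups_for(text):
    """Return the capture groups for a full match, or None if there is no match."""
    m = PATTERN.fullmatch(text)
    return m.groups() if m else None

print(groups_for("1,2,3,4,5,6,7,8,9"))
# → ('1', ',9', '9')
```

Groups 2 and 3 were overwritten on every pass through the `*`, so only the final iteration survives; 2 through 8 matched but were never kept.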

So really, you have to give the regex some outside help, maybe an outside loop, maybe splitting on a regex rather than parsing with one. Even for this simple language!
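A minimal sketch of that kind of outside help, again assuming Python's `re` module (the function names are hypothetical). One version validates the whole input with a regex and then splits on the separator; the other lets a loop in the host language walk the individual matches.

```python
import re

def parse_ints_split(text):
    """Validate the whole string with one regex, then split on commas."""
    if not re.fullmatch(r"\d+(?:,\d+)*", text):
        raise ValueError(f"not a comma-separated list of integers: {text!r}")
    return [int(token) for token in re.split(r",", text)]

def parse_ints_loop(text):
    """Let an outside loop collect each number (no validation of the input)."""
    return [int(m.group()) for m in re.finditer(r"\d+", text)]

print(parse_ints_split("1,2,3,4,5,6,7,8,9"))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(parse_ints_loop("1,2,3,4,5,6,7,8,9"))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

In both cases the regex only recognizes pieces; the list is assembled by code outside it.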

And once you are already doing that, the step to giving it a bit more help, perhaps a stack, is quite small.
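As a rough sketch of how small that step is: below, a regex only finds the tags, and a plain list used as a stack does the matching. This assumes a toy tag syntax (`<name>` and `</name>` only, no attributes, comments, or self-closing tags), so it is nowhere near a real HTML parser; `tags_balanced` is a made-up name.

```python
import re

# Toy tag tokenizer: "<name>" or "</name>", nothing else.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)>")

def tags_balanced(text):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for m in TAG.finditer(text):
        closing, name = m.group(1), m.group(2)
        if not closing:
            stack.append(name)            # opening tag: remember it
        elif not stack or stack.pop() != name:
            return False                  # closing tag with no matching opener
    return not stack                      # leftover openers mean unclosed tags

print(tags_balanced("<a><b>x</b></a>"))   # properly nested
print(tags_balanced("<a><b>x</a></b>"))   # closed in the wrong order
```

The regex here does exactly the same job as in the integer example (recognizing tokens); the only addition is the stack.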

Tainnor | 2 years ago

That is true of "theoretical" regexes, not of the ones actually used by modern languages.