top | item 13942789


It seems like the "ASCII puke" concrete syntax for regex patterns doesn't scale that well. Regular expressions have binary operators, parenthesization, named groups, lookaheads, etc.-- if you're building a sophisticated regex of more than 10 characters or so, why not have some kind of an object model for this stuff so you can have reasonable forms of composition, naming of intermediate values in construction of a larger pattern, and the ability to attach modifiers to things without needing to pack more @#$%&*!^ un-Googleable junk into string literals?
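One rough idea of what such an object model could look like, sketched in Python. Everything here (the `Pat` class, `lit`, `named`, `repeat`) is hypothetical and invented for illustration; it is not an existing library, and it simply builds an ordinary regex string under the hood.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pat:
    """Hypothetical wrapper around a fragment of regex source."""
    src: str

    def __add__(self, other):
        # Sequence: a + b matches a followed by b.
        return Pat(self.src + other.src)

    def __or__(self, other):
        # Alternation, wrapped in a group so it composes safely.
        return Pat(f"(?:{self.src}|{other.src})")

    def named(self, name):
        # Attach a name to this sub-pattern.
        return Pat(f"(?P<{name}>{self.src})")

    def repeat(self, lo, hi=None):
        # Repeat between lo and hi times (no upper bound if hi is None).
        hi_s = "" if hi is None else str(hi)
        return Pat(f"(?:{self.src}){{{lo},{hi_s}}}")

    def compile(self, flags=0):
        return re.compile(self.src, flags)


def lit(text):
    """Match text literally, escaping any regex metacharacters."""
    return Pat(re.escape(text))


# Intermediate values get ordinary variable names.
digit = Pat(r"\d")
year = digit.repeat(4, 4).named("year")
brand = (lit("honda") | lit("toyota")).named("brand")
car = year + lit(" ") + brand

m = car.compile(re.IGNORECASE).fullmatch("1991 Honda")
print(m.group("year"), m.group("brand"))
```

Modifiers such as case-insensitivity are passed to `compile` as flags rather than packed into the pattern string.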


ebiester|9 years ago

Icon (and SNOBOL before that) had an alternate syntax that was more verbose but more readable.

  s := "this is a string"
  s ? {                               # Establish string scanning environment
      while not pos(0) do {           # Test for end of string
          tab(many(' '))              # Skip past any blanks
          word := tab(upto(' ') | 0)  # the next word is up to the next blank -or- the end of the line
          write(word)                 # write the word
      }
  }

I should implement something like this in Ruby some day... (I did it in Java long ago, but this was before you just threw things up on GitHub, and I have long since lost it.)
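For readers who don't know Icon, here is a step-by-step trace of that scanning loop in plain Python. It is only a sketch: the `words` function is a made-up name, the cursor is tracked by hand, and none of Icon's goal-directed evaluation is reproduced.

```python
def words(s):
    """Yield blank-separated words, mirroring the Icon loop line by line."""
    pos = 0
    while pos < len(s):                         # while not pos(0)
        while pos < len(s) and s[pos] == " ":   # tab(many(' '))
            pos += 1
        end = s.find(" ", pos)                  # upto(' ') ...
        if end == -1:
            end = len(s)                        # ... | 0 (end of string)
        yield s[pos:end]                        # word := tab(...)
        pos = end


for word in words("this is a string"):
    print(word)                                 # write(word)
```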

sublimeloge|9 years ago

For those who haven't seen the language before, I think it's also useful to know that expression evaluation in Icon works using a recursive backtracking algorithm. This means that the most natural way of writing a string scanning parser (like the one above) more or less automatically gives you a recursive backtracking parser. Like ebiester, I too have found it to be a nice way to do certain kinds of simple string parsing.

norswap|9 years ago

Nice notation. Looks like a grammar with a bit more fanciness thrown in. I have a tool that does this kind of stuff: https://github.com/norswap/whimsy/blob/master/doc/autumn/REA...

I should try to implement something like this, to see how hard it would be (I think not too much, but I might overlook something).

theamk|9 years ago

Regular expressions in Ruby, Perl and Python have supported "extended syntax" (the x flag) for a long time. For an example, see the first section of http://www.perl.com/pub/2004/01/16/regexps.html (this is Perl; Ruby is very similar).
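As a small illustration of what the extended form buys you, here is the same pattern written twice in Python, where the flag is spelled `re.VERBOSE`. The date pattern is just a made-up example, not one taken from the article.

```python
import re

# Compact form: everything packed into one line.
compact = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# Extended form: whitespace in the pattern is ignored and
# "#" starts a comment that runs to the end of the line.
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
""", re.VERBOSE)

sample = "2017-03-23"
print(compact.fullmatch(sample).groups())
print(verbose.fullmatch(sample).groups())
```

Both patterns compile to the same matcher; only the source layout differs.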

I think the article would be easier to read if all the regexes were in extended form, but I suppose the author is an expert regex user, so the examples were easy enough for him.

And finally, perl6 totally re-did text matching with "grammars" (https://docs.perl6.org/language/grammars.html) -- they use much more readable syntax, nameable groups, etc... It really is quite a wonderful thing; I wish it were available in other languages.

petercooper|9 years ago

Not so much an expert, but I did Perl for 8 years before 13 years (so far) of Ruby, so a lot of exposure(!). I've not been a fan of extended syntax, but it might be worth me giving it a proper go and writing something up if it's helpful, so thanks!

draegtun|9 years ago

That's why I much prefer to use something like the Parse dialect that comes with Rebol / Red - http://www.rebol.com/docs/core23/rebolcore-15.html

Here's ebiester's Icon example in the parse dialect:

    s: "This is a string"

    parse s [
        any " "                         ;; skip past any leading blanks (none in this example!)
        any [                           ;; repeat the block while it keeps matching
            copy word to [" " | end]    ;; next word up to next blank or EOL
            (print word)                ;; print word
            skip                        ;; skip past blank
        ]
    ]

And here is one translation of falsedan's example:

    ws:     [some [" " | "^-"]]
    sep:    #"^-"
    digits: charset "1234567890"
    lower:  charset [#"a" - #"z"]
    upper:  charset [#"A" - #"Z"]
    cost:   [3 5 digits "." 2 digits]
    quoted-description: [
        copy quot [{'} | {"}]
        some [{\} quot | upper | lower | " "]
        quot
    ]
    unquote-description: [to sep]
    description:         [quoted-description | unquote-description]
    SKU:                 [some [upper | digits]]

    parse/case s [ws cost sep description sep SKU ws]

falsedan|9 years ago

Don't all the languages which support complex regex also support composing regexes from objects/strings?

e.g.

  /\A\s*\d{3,5}\.\d{2}\t(?:(['"])(?:[\s\w]|\\\1)+\1|[^'"]+)\t[A-Z0-9]+\s*\Z/

would be written as

  $ws                   = /\s*/;
  $sep                  = /\t/;
  $cost                 = /\d{3,5} [.] \d\d/x;
  $quoted_description   = /
    (['"])       # opening quote; remember this for later
      (?:        # (non-capturing group, don't read this)
         [\w\s]  #   a word char or space
         |       #   OR
         [\\] \1 #   backslash-escaped quote, same as opening
      )+         # (description can't be empty)
    \1           # closing quote, matching opening
  /x;
  $unquoted_description = /[^'"]+/;
  $description          = /(?: $quoted_description | $unquoted_description )/x;
  $SKU                  = / [[:upper:][:digit:]]+ /x;

  /\A $ws $cost $sep $description $sep $SKU $ws \Z/x
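The same composition carries over to Python, roughly as sketched below. This is one possible translation, not falsedan's code: the fragment names are made up, `re.VERBOSE` stands in for the `/x` flag, a named group replaces `\1` so the back-reference survives being embedded in a larger pattern, and `[A-Z0-9]` is used for the SKU because Python's `re` has no POSIX character classes.

```python
import re

WS = r"\s*"
SEP = r"\t"
COST = r"\d{3,5} [.] \d\d"
QUOTED = r"""
    (?P<q>['"])      # opening quote, remembered as group "q"
    (?: [\w\s]       #   a word character or whitespace
      | \\ (?P=q)    #   OR a backslash-escaped quote, same as opening
    )+               # description can't be empty
    (?P=q)           # closing quote, matching the opening one
"""
UNQUOTED = r"""[^'"]+"""
DESCRIPTION = rf"(?: {QUOTED} | {UNQUOTED} )"
SKU = r"[A-Z0-9]+"

# Fragments are plain strings, so f-string interpolation composes them.
LINE = re.compile(
    rf"\A {WS} {COST} {SEP} {DESCRIPTION} {SEP} {SKU} {WS} \Z",
    re.VERBOSE,
)

print(bool(LINE.match("  12345.67\t'widget one'\tAB12  ")))
print(bool(LINE.match("123.45\tplain thing\tX9")))
```

One thing to watch with this approach: a fragment that ends in a `#` comment needs a trailing newline, otherwise the comment swallows whatever is interpolated after it.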

elchief|9 years ago

`x` mode and DEFINE can be your friends. Example:

  (?xi)
  (?(DEFINE)
  
    (?<year> \d{4} )  # I'm a comment
  
    (?<brand> honda|toyota )
  
    (?<model> crx|prius )
  )

...later...

  ((?&year)) ((?&brand)) ((?&model))

brudgers|9 years ago

There is this project, Verbal Expressions: https://github.com/VerbalExpressions

It appears from time to time on Hacker News. The JavaScript implementation has nearly 10k stars.

Johnny_Brahms|9 years ago

I have replaced most regular expressions with irregexes http://synthcode.com/scheme/irregex/ . They are sadly Scheme-specific, but do a lot for clarity.

For anything more complex than small things that can be understood in less than 20 seconds, I use a parser generator, be it Parsack (a Racket version of Haskell's Parsec) or whichever I have at hand.

derefr|9 years ago

Personally, I'm fine with it—but only because Regexps really belong to a different domain than people think.

Regexps do not exist to be a self-documenting syntax for writing code that gets read and maintained. If you are going to sit down, write, debug, commit, and PR some code that matches strings, for heaven's sake just write your pattern in BNF and apply a parser generator to it, or use a parser combinator library.

Regexps are intended as a fluent syntax for interacting with data. Regexps exist to be arguments to sed, awk, and vim's :s command. Regexps exist to let you type an SQL query into psql that finds rows with columns matching a pattern. They're meant to be a hand-tool, used by a craftsman during the manual work of analysis that comes before the job is planned.

And as such, regexp syntax features aren't meant to be composed into multi-line monstrosities that do all the work at once; they're meant to let you match chunks, and then pipe that to another regexp that winnows those chunks down, and then another that substitutes one part of each chunk, etc.
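A tiny sketch of that match, winnow, substitute style, written in Python rather than as a shell pipeline. The log lines are invented sample data, and each step uses one small regexp instead of a single pattern that does everything.

```python
import re

# Made-up sample input: method, path, status code.
log = """\
GET /index.html 200
POST /login 500
GET /img/logo.png 404
GET /about 200
"""

lines = log.splitlines()

# 1. Match chunks: keep lines whose status code is 4xx or 5xx.
errors = [l for l in lines if re.search(r" [45]\d\d$", l)]

# 2. Winnow: of those, keep only GET requests.
gets = [l for l in errors if re.match(r"GET ", l)]

# 3. Substitute: reduce each remaining line to just its path.
paths = [re.sub(r"^\w+ (\S+) \d+$", r"\1", l) for l in gets]

print(paths)
```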

If you've ever seen a Perl script written in "imperative mode", where every line is relying on the implicit assignment of its result to the $_ variable, each line doing one more thing to that variable, each little regexp sawing off one edge or patching one hole, then you've seen an example of the proper use of regexps. Such a script is effectively less a "program", and more simply a record of someone's keystrokes at a REPL.

And because of that, I honestly find it a bit strange that modern compiled languages build in "first-class" native support for regexps. They make sense in "scripting" languages like Ruby and Python because those languages can indeed be used for "scripting": writing code in their REPLs to do some manual tinkering, and then maybe saving a record of what you just did in case you need it again. But in languages like Go or Elixir? Why not just give the developer a batteries-included parser-combinator library instead? (If you, as a developer, need to parse regexps to support your users querying your system by passing it regexps, they could still be available from a library. But there's no need for a literal syntax for them in such languages.)

That being said, I wouldn't mind if an IDE for a particular compiled language accepted regular-expression syntax as a sort of Input Method Editor: you'd hit Ctrl+Shift+R or somesuch, a little "Input regexp: " window would pop up over your cursor, and then as you wrote and modified the regexp in the window, the equivalent BNF grammar would appear inside a text-selection at the cursor. That's a good use of regexps: to allow you to fluently, quickly create BNF grammars. As if they were a synthesizer keyboard, with each keystroke immediately performing a function.
