I don't know Regex

30 points| dyml | 12 years ago |ideasof.andersaberg.com | reply

52 comments

[+] microtonal|12 years ago|reply
> Which one is the simplest? I rest my case.

Actually, I like neither. The code is easier to read, but the regex gives a broader overview. This is something where parser combinators can shine. E.g., from Haskell's email-validate:

  addrSpec = do
  	localPart <- local
  	char '@'
  	domainPart <- domain
  	endOfInput
  	return (EmailAddress localPart domainPart)
Source: http://hackage.haskell.org/packages/archive/email-validate/1...
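
For readers who don't read Haskell, roughly the same shape can be sketched in Python. This is only an illustration, not how email-validate works internally: `many1`, `char`, `seq`, `end_of_input` and `parse_email` are made-up names, and the character classes used for the local and domain parts are simplifying assumptions.

```python
# A parser here is a function (text, index) -> (value, new_index) or None.

def many1(pred):
    """One or more characters satisfying pred."""
    def parse(s, i):
        j = i
        while j < len(s) and pred(s[j]):
            j += 1
        return (s[i:j], j) if j > i else None
    return parse

def char(c):
    """Exactly the character c."""
    def parse(s, i):
        return (c, i + 1) if s[i:i + 1] == c else None
    return parse

def end_of_input(s, i):
    """Succeeds only when nothing is left to read."""
    return ("", i) if i == len(s) else None

def seq(*parsers):
    """Run parsers one after another; fail if any of them fails."""
    def parse(s, i):
        values = []
        for p in parsers:
            result = p(s, i)
            if result is None:
                return None
            value, i = result
            values.append(value)
        return values, i
    return parse

# Assumed, deliberately simplistic character classes.
local = many1(str.isalnum)
domain = many1(lambda ch: ch.isalnum() or ch == ".")
addr_spec = seq(local, char("@"), domain, end_of_input)

def parse_email(text):
    result = addr_spec(text, 0)
    if result is None:
        return None
    (local_part, _, domain_part, _), _ = result
    return local_part, domain_part
```

The grammar reads top-down like the Haskell version, and each piece (`local`, `domain`) can be tested on its own.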

To end on a positive note: good work on the library! I think it will be useful for many people who dislike writing regexes.

[+] Lindrian|12 years ago|reply
You could always comment the regex. My website provides a full and accurate explanation, step by step, of almost any given regex. Have a look here: http://regex101.com/r/dG4lP3
[+] waps|12 years ago|reply
Just because it's a regex does not mean you can't document it. There are many regex tracers that can tell you exactly where a match fails. Plus regexes condense a lot of information in small spaces, which makes them easier (imho) to debug. Most other parsing syntaxes are one-offs, and very verbose.

And your average parsing library is not going to be using Boyer-Moore state machine parsing like you can easily achieve with regexes. It's complex, terse, fast, and the code that will be running your match is probably better debugged than any code you could hope to produce (it's most programmers' understanding of regexes that could use some debugging). Regexes also just make sense if you know the theory behind the state machines.

So how about this way of writing the regex :

  regex = r"""(?x)           # Extended syntax (ignores \n and whitespace, allows comments)
  # Regex to match email addresses
  \b                        # Word boundary
  (?P<username>\w+)         # Username part
  @
  (?P<domain>[\w.]+)        # Domain
  \b                        # Word boundary
  """

  # Example use
  import re
  m = re.match(regex, "john@snow")

  print(m.group('username'))   # john
  print(m.group('domain'))     # snow
I find parser combinators very hard to use. I wrote parser combinator libraries in C and one in Java thinking it'd be easier to use than a parser generator like ANTLR, and I've since rethought the process. ANTLR studio is just so useful for writing a parser to example data.

There's also the concern that parsers are strictly more expressive than regexes. If you need that, then regexes are simply out. However, most parser generators allow you to easily combine regex(-like) tokenization with parsing.
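
The textbook case of that expressiveness gap is arbitrarily deep nesting: checking that brackets balance needs a counter or a stack, which a regular expression in the formal sense cannot provide (some engines add extensions that go beyond this). A minimal sketch, with `balanced` being a name invented for the example:

```python
def balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a ')' appeared before its matching '('
                return False
    return depth == 0
```

A tokenizer built from regexes can still feed a check like this, which is the combination described above.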

[+] irahul|12 years ago|reply
Nice work. Personally I will still use the raw regex rather than the method calls to build the regular expression. As another commenter pointed out, the example regex is more complex than it should be. It can be reduced to:

    pat = re.compile(r'^ \w+ @ [A-Za-z]\w*  \. \w+ $', re.X)
    if pat.match('rahul@thoughtnirvana.com'):
        print("woot")
I won't bother explaining this regex (too simple). However, if it were something complex, I would put inline comments:

    pat = re.compile(r'''^ \w+  # rahul
                    @
                    [A-Za-z]\w*  # thoughtnirvana
                    \.
                    \w+ # com
                    $''',
                    re.X)
Notes about the example regex:

  var regEx = {(?:^)[A-Za-z]([A-Za-z]+|(?:\d+))(@{1,1})[A-Za-z]+(.{1,1})[A-Za-z]+(?:$)}
(?:^), (?:$) - These are the same as simply using ^ and $. Anchors don't capture anything, so there isn't a need to mark them non-capturing.

([A-Za-z]+|(?:\d+)) - What's going on here? You have a capturing group and within that capturing group, you have the or part marked as non capturing. What's the intent?

(@{1,1}) - @{1,1} is the same as @. Also, why are you capturing it? I think you are using parens for making the regex readable. You should use the IgnorePatternWhitespace option instead: http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx

[+] masklinn|12 years ago|reply
Seems to me the example regex is the (completely broken) output of the (completely broken) generator expression below.
[+] simpleigh|12 years ago|reply
Unless I'm being daft, it can't be reduced to that at all. Dropping matching brackets, removing redundant {1,1} blocks, and escaping the \. we get down to:

    ^[A-Za-z]([A-Za-z]+|\d+)@[A-Za-z]+\.[A-Za-z]+$
Your version isn't the same at all - \w allows letters, digits, and underscores anywhere, whereas the original is more subtle.
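
A quick way to see the difference is to run both patterns over a few inputs. This is just a sketch: the sample addresses are made up, and `original` is the cleaned-up pattern from this comment, not the library's actual output.

```python
import re

# Cleaned-up version of the article's regex, as given above.
original = re.compile(r"^[A-Za-z]([A-Za-z]+|\d+)@[A-Za-z]+\.[A-Za-z]+$")
# The proposed reduction, with the verbose-mode whitespace removed.
simplified = re.compile(r"^\w+@[A-Za-z]\w*\.\w+$")

samples = ["abc@foo.com", "a1b2@foo.com", "_x@foo.com", "a@foo.com"]

# Maps each sample to (matches original, matches simplified).
results = {s: (bool(original.match(s)), bool(simplified.match(s))) for s in samples}

for address, (in_original, in_simplified) in results.items():
    print(address, in_original, in_simplified)
```

Only the first sample is accepted by both; the other three are accepted by the \w version alone.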
[+] Argorak|12 years ago|reply
All these libraries suffer from the problem of describing regexes in a non-formal language (English!).

An example: What are .Letters()? [a-zA-Z]? Are diacritics included? The whole Unicode letter range?

And suddenly, you have to specify that character soup and the example goes to hell, because it reintroduces most of the complexities in the original regexp.
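
To make the ambiguity concrete, here are two defensible readings of "letters" in Python 3, where `\w` is Unicode-aware for `str` patterns. The names `ascii_letters` and `unicode_letters` are just for this sketch, and other engines may define these classes differently.

```python
import re

ascii_letters = re.compile(r"[A-Za-z]+")
# \w minus digits and underscore: any Unicode letter, as Python's re sees it.
unicode_letters = re.compile(r"[^\W\d_]+")

words = ["cafe", "caf\u00e9", "na\u00efve", "\u65e5\u672c\u8a9e"]

for word in words:
    print(word,
          bool(ascii_letters.fullmatch(word)),
          bool(unicode_letters.fullmatch(word)))
```

A method called `.Letters()` has to pick one of these (or something else), and the name alone doesn't say which.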

[+] BobTurbo|12 years ago|reply
I would like to point out that I am actually the creator of this idea, and not the author. The author has created a variation in C#, that has some differences.

The original repository is at:

https://github.com/thebinarysearchtree/RegExpBuilder

I came up with this idea 2 years ago. Some differences I see between my idea and this C# implementation are:

Or() is confusing by itself. In mine, you pass in objects or strings, such as:

   var regex = new RegExpBuilder()
     .either(pattern1)
     .or(pattern2);

   var regex = new RegExpBuilder()
     .either("sometime")
     .or("soon")
     .or("never");
Also, all the special characters are escaped properly (\ is not escaped).

There are shortcuts - you don't have to do

   .exactly(1).of("hackernews")
you can just do:

   .then("hackernews");
In terms of differences between this and VerbalExpressions, verbal expressions is very limited. It cannot represent many quantifiers (eg, at least 3 of something), does not have decent ways to group subexpressions, and so on. It can only represent (in a practical way), about 0.000001 % of regular expressions, as opposed to RegExpBuilder.
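
For anyone curious how little machinery the fluent style needs, here is a toy sketch in Python. It is not RegExpBuilder's API or implementation: `PatternBuilder` and its methods are hypothetical, `or_` has a trailing underscore only because `or` is reserved in Python, and `at_least` is an invented shape for a quantifier.

```python
import re

class PatternBuilder:
    """Hypothetical fluent regex builder, for illustration only."""

    def __init__(self):
        self._parts = []  # each entry is a list of alternative sub-patterns

    def then(self, literal):
        # Literals are escaped so characters like '.' lose their special meaning.
        self._parts.append([re.escape(literal)])
        return self

    def either(self, literal):
        # Starts a group of alternatives; or_() adds to the same group.
        return self.then(literal)

    def or_(self, literal):
        if not self._parts:
            raise ValueError("or_() needs a preceding either() or then()")
        self._parts[-1].append(re.escape(literal))
        return self

    def at_least(self, count, literal):
        # 'count' or more repetitions of the escaped literal.
        self._parts.append(["(?:%s){%d,}" % (re.escape(literal), count)])
        return self

    def compile(self):
        groups = ("(?:" + "|".join(alts) + ")" for alts in self._parts)
        return re.compile("".join(groups))


when = PatternBuilder().either("sometime").or_("soon").or_("never").compile()
dotted = PatternBuilder().then("a.b").at_least(3, "!").compile()
```

Here `when` accepts exactly the three words, and `dotted` treats the dot in "a.b" as a literal character because every literal goes through `re.escape`.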
[+] dyml|12 years ago|reply
You have my support! I read a blogpost showing off your RegExpBuilder and I got inspired to create something similar (as a chance to improve my regex and coding skills) in C#, although I have some things I would love to do differently than how your lib does it.

Thank you for a great library. After I have reached stable with this C# port, I'd like to create a TypeScript version. I hope you do not have anything against me writing spin-offs? :)

[+] asperous|12 years ago|reply
His example could be simplified to

    ^ ( [a-z0-9]+ @ [a-z]+ \. [a-z]+ ) $
With ignore case and ignore whitespace mode on. I work with Regex a lot so I find this very readable, set in a universal format, and more concise. I will gladly concede that the builder would be easier for those that aren't familiar with regex.
[+] dyml|12 years ago|reply
Thanks for the comments! I bet I could improve the way regex is generated, since I'm not so comfortable working with regex.

I'd also like to add features, a .Not operator would be really useful, and I'd gladly take a pull request if anyone has an implementation in mind :)

If I receive some signals that others find this library useful and would like me to add some feature, I'd be more than glad to do so.

[+] simpleigh|12 years ago|reply
I don't think it can - the local-part of the original matches as (a letter followed by letters) or (a letter followed by numbers):

    [A-Za-z]([A-Za-z]+|(?:\d+))
    =>
    [A-Za-z]([A-Za-z]+|\d+)
Your version doesn't match this:

- it allows letters and digits to be interleaved
- it allows the local-part to start with a digit

    [a-z0-9]+ != ([a-z]+|\d+)
[+] secoif|12 years ago|reply
Unfortunately you're going to encounter regex a lot in your programming career and this tool won't always be there to save you, so you are going to need to learn regex one way or the other. You might as well get it over with sooner rather than later.

This tool is just hindering your progression, and it is yet another abstraction someone has to learn if they are going to deal with this code. It would make sense if this was a one-off thing and you'd be saving someone the effort of learning some weird protocol or syntax, but since regex is so common and most programmers have just learned to deal with them, you're actually adding more cognitive load, since now they have to know two things instead of one. Imagine coming across this in someone else's code and discovering the regex didn't work as expected. Now I have to debug the regex and figure out whether it's a bug in the tool, or in my regex, etc…

[+] troels|12 years ago|reply
Seems there is a bug, since this:

    .Exactly(1).Of(".")
expands to this:

    (.{1,1})
Which is wrong, as dot is a meta-character. It should be escaped.
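
The effect is easy to reproduce. The sketch below uses Python's `re.escape` as a stand-in for whatever escaping the C# library would need; it is not the library's code.

```python
import re

# What the builder currently emits for Exactly(1).Of("."), anchored for testing.
unescaped = re.compile(r"^(.{1,1})$")

# What it presumably should emit: the literal is escaped first.
escaped = re.compile("^(" + re.escape(".") + ")$")

print(bool(unescaped.match("x")))  # the bare dot accepts any character
print(bool(escaped.match("x")))    # the escaped dot does not
print(bool(escaped.match(".")))    # but it still accepts a real dot
```

Escaping every user-supplied literal before inserting it into the pattern avoids this class of bug for the other meta-characters as well.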
[+] dyml|12 years ago|reply
You are absolutely right, I'll get right on this bug during lunch break (I'm at the office, so..).

Thanks a lot for the report!

[+] masklinn|12 years ago|reply
And {1,1} (twice) is completely unnecessary.
[+] vacri|12 years ago|reply
I've been making good use of http://www.regexper.com/ since it was linked here. It's made learning regexes much easier as it gives a clear workflow diagram.

For example, it showed that the horrible email regex in this article had a couple of errors: the dot before the TLD should be escaped (without the escape, it's 'any character'), and group #1 can be either letters or digits, but not a mix of both (when a mix should be allowed).

It's still not a good regex, since there are characters like hyphens, dots, and pluses that are valid pre-'@' characters, which both sample regexes fail to recognise.
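
As a rough check of that last point, here are two patterns from this discussion run against addresses that use those characters. I'm assuming the cleaned-up article regex and the simplified one posted elsewhere in the thread as the two samples; the addresses are invented.

```python
import re

# The article's regex with the dot escaped and redundant groups dropped.
article_regex = re.compile(r"^[A-Za-z]([A-Za-z]+|\d+)@[A-Za-z]+\.[A-Za-z]+$")
# The simplified alternative posted elsewhere in the thread.
looser_regex = re.compile(r"^[a-z0-9]+@[a-z]+\.[a-z]+$", re.IGNORECASE)

addresses = [
    "first.last@example.com",
    "user+tag@example.com",
    "first-last@example.com",
]

# Addresses that neither pattern accepts.
rejected = [a for a in addresses
            if not article_regex.match(a) and not looser_regex.match(a)]
print(rejected)
```

All three addresses end up in `rejected`, since neither character class allows '.', '+' or '-' before the '@'.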

[+] thiht|12 years ago|reply
Personally I use http://www.debuggex.com/ since it offers a step by step visualization, a live generation of the diagram, a live syntax checking of the regex, etc.
[+] dyml|12 years ago|reply
Great resource, it seems immensely useful! I will be sure to investigate any errors on my part! :)

And as I've said on the Github Repo, the library is not quite ready for prime time. When it's stable, I'll be sure to publish a NuGet package for simple access.

[+] deerpig|12 years ago|reply
There are a lot of cool tools for helping write regexes; my favorite is re-builder mode in Emacs. You write the regex in the minibuffer and see what matches in the text in the buffer. It makes debugging regexes very easy.

Tools like sed and regexes are compact and very powerful, and they aren't difficult to learn. I really don't understand the need for this library, which seems needlessly verbose. And you will still need to be able to read regexes in other people's code.

It's a nice idea, and good work, but in my opinion it's solving a problem that doesn't exist.

[+] draegtun|12 years ago|reply
Some languages provide alternatives to regexes. E.g., Rebol uses a parse dialect instead - http://www.rebol.com/docs/core23/rebolcore-15.html

Here is the article's example converted to Rebol's parse dialect (minus capturing, but it's easy to add):

  ; build some prereqs for parse
  num:         charset [#"0" - #"9"]
  alpha-lower: charset [#"a" - #"z"]
  alpha-upper: charset [#"A" - #"Z"]
  alpha:       union alpha-lower alpha-upper
  alpha-num:   union alpha num 

  ; create parse rule block
  simple-email-rule: [
      alpha
      any alpha-num
      #"@"
      some alpha-num
      #"."
      some alpha-num
      end 
  ]

  ;
  ; then later...

  parse "[email protected]" simple-email-rule  ; => true
  parse "notanemailaddress" simple-email-rule  ; => false
[+] lutusp|12 years ago|reply
With all respect, you're better off using a supportive regex environment that accepts your regex entries and quickly shows their effect on some example text you provide -- a builder/tester like this (just an example, there are many similar ones):

http://www.arachnoid.com/regex_lab/

Philosophically, there are two approaches to making regexes an effective tool -- expand regex syntax until it's so verbose that there's no possibility for confusion -- ironically a somewhat confusing tactic as this topic's comments demonstrate -- or learn native regex in an interactive way that shows its effect on example text, until you develop an instinct for it. I prefer the latter.

It's like learning music by keyboard -- shall we paint each keyboard key a different color and recode sheet music to agree, or shall we use a teaching method that makes the keyboard gradually seem more natural?

[+] Qantourisc|12 years ago|reply
Your email regex is wrong. There are some obscure email addresses that will not work. For example my.email [email protected]

For more see http://en.wikipedia.org/wiki/Email_address#Valid_email_addre...

[+] joe_the_user|12 years ago|reply
I think that actually says a lot about regexes as code.

The more corner cases you have to consider, the more unreadable the code gets, after a while to a seemingly exponential degree.

And if you have a tool to build the regex, why not just use the tool's code as your source so the final result is readable?

Basically, pasting a big regex into your code hardly seems more desirable than pasting a bunch of assembler there. But assembler can be unavoidable. This isn't.

Regex are cool as CS constructs though.

[+] zeckalpha|12 years ago|reply
> validates an email adress (something you should never do with Regex, because you won't get it right, but anyway)

I think it is clear that this isn't the point.

[+] islon|12 years ago|reply
> Which one is the simplest? I rest my case.

Of course the regex is simpler as everyone who knows regexes will understand it.

If you don't know regex you should invest time in learning it. It's the same if you say:

    I don't know german, look at this german sentence builder, it's so much nicer!
    > builder.firstPersonPronom().verb("like").directObject(new SecondPersonPronom());
    > => "Ich mag dich"
[+] daGrevis|12 years ago|reply
First thing to do when writing regexes is to write them on multiple lines. Also, you should use comments. It will make them more readable and easier to follow. Also, I want to suggest Zed's Learn Regex The Hard Way.

http://regex.learncodethehardway.org/book/

[+] kaeluka|12 years ago|reply
You could have named contents:

  var letter = "[a-zA-Z]";
  var letters = letter + "+";
  ...
then your regexp would look like this:

  var regEx = letter +
    letters +
    "|" + ...
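
Spelled out in full, the same idea might look like this in Python. The fragment names are invented for the sketch, and the pattern follows the cleaned-up version of the article's regex discussed elsewhere in the thread.

```python
import re

# Named building blocks, composed with plain string concatenation.
LETTER = "[A-Za-z]"
LETTERS = LETTER + "+"
DIGITS = r"\d+"

# A letter followed by either more letters or digits (but not a mix).
LOCAL_PART = LETTER + "(?:" + LETTERS + "|" + DIGITS + ")"
DOMAIN = LETTERS + r"\." + LETTERS

EMAIL = re.compile("^" + LOCAL_PART + "@" + DOMAIN + "$")
```

The result is still an ordinary regex, so anyone can print `EMAIL.pattern` and paste it into a tester.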
[+] roryokane|12 years ago|reply
I learned regexes entirely from the tutorial at http://www.regular-expressions.info/tutorial.html. It clearly explained how the regex engine works so I can simulate it in my head and understand why a given regex does or doesn’t work. I tried out various regexes in TextMate as I read through the tutorial – nowadays I would use one of the online sandboxes listed on http://stackoverflow.com/tags/regex/info. That free tutorial was enough to get me very comfortable with regexes.

I also tested my understanding afterwards with some online exercises chosen from these lists:

http://www.emacs.uniyar.ac.ru/doc/em24h/emacs081.htm

http://blogs.msdn.com/ericgu/archive/category/11323.aspx

This reference was handy while doing the exercises: http://www.regular-expressions.info/reference.html

Knowing regexes has been very helpful to me in general. I have used regexes in reformatting my code through find and replace, in finding the code that I need to edit next or that could be causing a certain problem, in writing Apache config URL rewriting rules, in writing poor man’s language parsers that assisted me in generating code, in converting raw data into programming language literals, in understanding user input validation rules, and in other ways. I think that any serious developer who expects to work with more than one programming language in their lifetime should understand regular expressions. Thus, I encourage the OP to try learning regexes, using the resources linked above.

That said, I agree that regexes could be easier to understand. I rather wish that Perl 6’s revised, simpler regex syntax (http://perlcabal.org/syn/S05.html) were the universal standard.

If you use regexes a lot, and get mentally strained by the complexity of some of your bigger ones, consider learning about parsers too, another type of tool that lets you manipulate text in more powerful ways, with longer but more readable code than regexes. http://kschiess.github.io/parslet/ is a simple parsing library to start with if you use Ruby. In fact, Parslet is rather like a more powerful and more theoretically-sound version of the OP’s library RegExpBuilder. Like RegExpBuilder, Parslet uses chains of methods with English names to build parsers.

[+] roryokane|12 years ago|reply
Here is one possible translation of asperous’s simplified email regex into a Parslet parser:

  #!/usr/bin/env ruby
  
  original_regex = /^ ( [a-z0-9]+ @ [a-z]+ \. [a-z]+ ) $/ix
  
  require 'parslet'
  include Parslet
  local_part = match['A-Za-z0-9'].repeat(1)
  letters = match['A-Za-z'].repeat(1)
  domain_part = letters >> str('.') >> letters
  email_parser = local_part >> str('@') >> domain_part
  
  user_input = "foo#bar.com"
  matches_regex = original_regex.match(user_input)
  matches_parser = email_parser.parse(user_input)
asperous’s regex: https://news.ycombinator.com/item?id=6319435

Parslet info: http://kschiess.github.io/parslet/