The biggest problem I've noticed with regex is that we use it every once in a while, and once it works we move on to other things. A few weeks or months later, we have forgotten much of it and have to relearn it all over again. Whereas you generally use your programming language (C++, C#, Java, etc.) every day, which keeps your skills sharp, regex is a "set it and forget it" situation for most people. And as you noted, different languages/shells/etc. implement their own flavor of regex, which can trip you up.
It's similar to SQL when you think about it. You set up a query to get the data you need and move on to other things. And every RDBMS implements its own flavor of SQL, which can complicate things.
I don't see the problem here. The regex itself is exactly the same, it's just that different languages have different string literal syntaxes (and some have dedicated regex syntax, thus solving the problem of double-escaping).
The only regex engine where this is a problem is Vim's, because there are characters that are special unless escaped, and characters that are normal but become special when escaped. And as if that wasn't enough, there are config options to determine which characters those are. My usual practice is to prefix all Vim regexes with \v so that all the special characters are at least consistent.
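To make the string-literal point concrete, here is a minimal Python sketch. It borrows the two file names from the parent comment (filexc and file.c); the variable names are only illustrative. The regex is the same in both spellings, and only the quoting around it changes.

```python
import re

# The regex itself is file\.c (a literal period). Only the quoting differs.
plain = "file\\.c"  # ordinary string literal: the backslash must be doubled
raw = r"file\.c"    # raw string literal: written exactly as the regex reads

names = ["filexc", "file.c"]

# Escaped period: only the real file.c matches.
escaped_hits = [n for n in names if re.fullmatch(raw, n)]

# Unescaped period: "." matches any character, so filexc matches too.
unescaped_hits = [n for n in names if re.fullmatch("file.c", n)]

print(plain == raw)    # True
print(escaped_hits)    # ['file.c']
print(unescaped_hits)  # ['filexc', 'file.c']
```

Languages with a dedicated regex literal avoid the doubled backslash entirely; in Python the raw string plays that role.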
I use RegexBuddy and it does a lot of this. The huge downside is that it's $40 and Windows only.
You can do things like write a more generic regex, then select your language (e.g. Python 2 or 3, Java, Perl, and so on) and one of a few common actions, such as "iterate over all matches in a string", and it will auto-generate a code stub for you. Whenever teammates of mine are working on a weird regex they usually email me to double check it for them. (My response is usually that they're trying to do too much with one regex, haha)
>what is a literal character and what is a control character?
I read a tip back in the Perl 5 book, that you can just escape any character if you don't remember if it has a special meaning. (You'll still get the literal character even if it didn't need escaping.)
So I basically do that a lot. Never had any issue with control characters.
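A hedged sketch of that tip in Python, using the standard library's re.escape to do the escaping; the sample text is made up. One caveat worth adding: the tip is only safe for punctuation, because a backslash in front of a letter or digit usually means something else (\d, \w, \1).

```python
import re

text = "total (approx.) is $5.00+tax"  # made-up sample text
needle = "$5.00+tax"                    # literal string we want to find

# re.escape adds the backslashes for us.
auto = re.search(re.escape(needle), text)

# Escaping each punctuation character by hand works the same way.
by_hand = re.search(r"\$5\.00\+tax", text)

# Without escaping, $ . and + act as operators and the literal is not found.
unescaped = re.search(needle, text)

print(auto.group(0))     # $5.00+tax
print(by_hand.group(0))  # $5.00+tax
print(unescaped)         # None

# The caveat: escaping a letter changes its meaning instead of keeping it literal.
print(re.fullmatch(r"\d", "d"))  # None, because \d means "a digit"
```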
- ^ character outside brackets
- $ end of line
- +
But the explanation above does not introduce these yet, so a real beginner user (like me) is lost. The ambiguous characters example is fine, since it uses all the concepts already explained.
I like the example based approach. I learn from examples far quicker than I learn from “explanations”. If I attempt to learn from an example and my brain hits an exception, only then do I start reading the supporting text.
Nice approach. You’ve made a valuable thing and implemented a powerful idea.
Examples being greater than explanations is one of the main reasons I emphasize explanative error messaging, clear simple typings, and verbose function/variable names over documentation.
Docs are really good for discovery and should cover many topics shallowly so you can glean a big picture quickly. I generally don't like going to them for specs that could have just been an error message, a type, or a better naming convention.
I personally had the most trouble with regexes because I didn't have a good mental model of how they worked. The hard part wasn't finding the correct symbol/character class I was trying to match, but coming to grips with repetition, greedy/nongreedy, etc.
I took a compilers class in college where one of the projects was to implement a simple regex matcher using NFAs. Bashing my head against this for a week really helped with being able to "read" a regex. Not sure if this was due to finally understanding the algorithm, or the fact that I was just constantly staring at broken regex matches all day.
IMO it was a fairly small time investment for something that is so widely used.
I'm not sure how to explain it, but the most important thing I've learned in over a decade of programming experience is to not use regular expressions for many things that may seem like regular expression problems.
For example, even something as simple as a phone number can have all sorts of weird but valid variants. Be sure you really need to validate its format and not just that it's present.
Trying to handle all of those variants via a regex is doable but a pain. And in practice you as the programmer should not be defining which variants are valid, as it's up to the business itself to define what type of data it considers to be valid for the field.
That said I've also worked for companies with small engineering teams where the goal has always been to be as efficient with development time as possible, as opposed to making a near ideal system. Software has different needs when it's used by a thousand people than when it's used by millions.
I think it is understanding the algorithm that does it.
I also recommend that people learn how to read a regex by writing a small recursive program to match specific regexes. After you look at a regex and think about how it might work, intuition follows.
Actually writing the bit that turns the regex expression into said program isn't as important though. Doing that by hand 5 times is enough IMO.
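For anyone who wants to try this, here is a minimal sketch of such a recursive matcher in Python. It is modelled on the tiny matcher Rob Pike wrote for The Practice of Programming, so it is a backtracking matcher rather than the NFA approach from the compilers class mentioned above, and it only supports literal characters, ., *, ^ and $. The function names are my own.

```python
def match(regex: str, text: str) -> bool:
    """Search for regex anywhere in text. Supports literals, . * ^ $ only."""
    if regex.startswith("^"):
        return match_here(regex[1:], text)
    # Try every starting position, including the empty tail of the text.
    return any(match_here(regex, text[i:]) for i in range(len(text) + 1))


def match_here(regex: str, text: str) -> bool:
    """Match regex at the very beginning of text."""
    if not regex:
        return True                      # nothing left to match
    if len(regex) >= 2 and regex[1] == "*":
        return match_star(regex[0], regex[2:], text)
    if regex == "$":
        return text == ""                # $ only matches at the end
    if text and regex[0] in (".", text[0]):
        return match_here(regex[1:], text[1:])
    return False


def match_star(c: str, regex: str, text: str) -> bool:
    """Match zero or more copies of c, then the rest of regex."""
    i = 0
    while True:
        if match_here(regex, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1                       # consume one more c and retry
        else:
            return False


print(match("ab*c", "xacx"))    # True
print(match("^a.c$", "abcd"))   # False
print(match("abe", "abcde"))    # False
```

Stepping through match_star by hand for a pattern like a*a is a cheap way to see what backtracking means before reading about greedy and non-greedy quantifiers.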
One thing not mentioned here which I think is good to be aware of as you write intermediate to advanced regexes is understanding "catastrophic backtracking" and how to mitigate it: https://www.regular-expressions.info/catastrophic.html
For some reason I enjoy figuring regexes out. What I usually do is TDD them: I have a mini test suite of strings I want to match and strings I don't want to match, and I write some code to apply a candidate regex to them all and validate. Then I iterate until it passes. Finally, I rewrite the regex in extended regex format and add comments so that other people, or future me, understand what's going on.
Doing what a good regex can do with regular code instead (which you might do for readability or maintainability) is usually much, much, MUCH slower, FYI.
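That workflow can be sketched in a few lines of Python. The target here (dates shaped like 2020-04-30) and the helper name failures are hypothetical choices for illustration; re.VERBOSE is Python's version of the extended format with comments.

```python
import re

# Candidate regex, written in verbose form so each piece carries a comment.
ISO_DATE = re.compile(
    r"""
    ^
    (?P<year>\d{4})               # four-digit year
    -
    (?P<month>0[1-9]|1[0-2])      # month 01 to 12
    -
    (?P<day>0[1-9]|[12]\d|3[01])  # day 01 to 31 (no per-month check)
    $
    """,
    re.VERBOSE,
)

SHOULD_MATCH = ["2020-04-30", "1999-12-31", "2024-02-29"]
SHOULD_NOT_MATCH = [
    "2020-4-30", "2020-13-01", "2020-00-10",
    "20200430", "2020-04-32", "x2020-04-30",
]


def failures(pattern, good, bad):
    """Return every example the pattern classifies wrongly."""
    wrong = [s for s in good if not pattern.search(s)]
    wrong += [s for s in bad if pattern.search(s)]
    return wrong


# An empty list means the candidate passes the mini test suite.
print(failures(ISO_DATE, SHOULD_MATCH, SHOULD_NOT_MATCH))  # []
```

When a new edge case shows up, it goes into one of the two lists first, and the regex is changed only after the suite fails.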
This looks really nice, but I think it suffers from the same issue a lot of regex tutorials suffer from. It focuses solely on the regex and not at all on how to actually execute one. This site in particular says it's going to use JavaScript, but at least the first few pages don't show anything except raw regular expressions.
For any tutorial about regular expressions I think the second thing (beyond a very simple example regex) to show should be how to actually execute one in code. Is it that all the tutorials want to be language-agnostic? Maybe just show a JavaScript example and point out which part is the JS function/method call and which part is the actual regex.
It's nice to be told what /[aeiou]/ means but without actually typing it in and executing it (against various inputs, not just one) it wouldn't really sink in for me.
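As an illustration of what that could look like, here is a minimal sketch in Python rather than JavaScript; the word list is made up. The regex is only the text inside r"...", and everything else is the language's function and method calls.

```python
import re

vowel = re.compile(r"[aeiou]")  # the regex is just [aeiou]; the rest is Python

for word in ["rhythm", "regex", "sky", "audio"]:
    first = vowel.search(word)   # method call: first match, or None
    every = vowel.findall(word)  # method call: list of all matches
    print(word, first.group(0) if first else None, every)

# rhythm None []
# regex e ['e', 'e']
# sky None []
# audio a ['a', 'u', 'i', 'o']
```

Running the same pattern against several inputs, including ones that should not match, is what makes the character class sink in.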
I agree that it would be nice if the learner was given some instruction on how to experiment right away. I'd recommend https://regex101.com/ as a tool to complement this (or any) regex tutorial, as well as when you're crafting/debugging regex. It's language-agnostic in that you don't need to write any code, just the regex and input -- but you can still pick different regex flavors like PCRE vs JavaScript.
However, I'd suggest reorganizing the chapters so that features not yet introduced aren't shown in examples without explanations. For example, you explain anchors and quantifiers many chapters later but use them liberally in earlier chapters without explaining them.
This looks like a great resource! Like others, I vastly prefer an example-based style, and the examples are really well chosen and very illustrative. I generally think I know my regexes, but I've already learned a few tricks. (Backreferences to match different delimiter options but not mixed delimiters is very cool!)
Feedback:
The highlighting of matches is slightly shifted to the left for me in Firefox 75 but not in Chrome (both on Ubuntu 16.04). The shift is subtle but enough to make me have to look two or three times at most examples, as the highlight covers half of the character before the match and only half of the last character in the match. Can I suggest adding Firefox to your test regimen, if you haven't already? :)
Also, on the Anchors page, I believe "carat" should be spelled "caret."
Thanks for this once again! I will definitely be revisiting this site to brush up and learn new tricks. Especially lookaround, which I have never quite wrapped my head around!
> The highlighting of matches is slightly shifted to the left … Can I suggest adding Firefox to your test regimen, if you haven't already? :)
Oh, I thought I had fixed that. I primarily test with Firefox, so this is a bit of a surprise. I'll check it out—I think it's something to do with CSS's `letter-spacing`.
Your examples use * and $ and + before they are explained. Inductive learning goes smoother if new concepts have context.
You explain [^ ...] So the use of these examples without explanation is .. unexpected. If you use examples which don't depend on * or + or $ I agree it's 'boring' but for a class of learner these surprise moments interfere with learning.
You only casually mention that a capitalised \thing is the inversion of \thing, with \d and \D. I think you would want to repeat that for \w and \W and for \s and \S; after three, it's established.
I see this a lot in e.g. Haskell tutorials: simple inductive constructive learning examples littered with 'oh I explain that later just ignore it for now' syntactic constructs.
\( and \) are dangerous in substitution. Their meaning shifts from regex to variable-marker. Surely this needs to be noted in passing?
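On the capitalised-class point, a short Python check makes the pairing easy to see. The sample string and the helper name split_by_class are made up for illustration.

```python
import re

sample = "a1 _\t9Z-"  # made-up mix of letters, digits, whitespace, punctuation


def split_by_class(lower: str, upper: str, text: str):
    """Sort each character of text by whether lower or upper matches it."""
    lower_hits = [c for c in text if re.fullmatch(lower, c)]
    upper_hits = [c for c in text if re.fullmatch(upper, c)]
    return lower_hits, upper_hits


for lower, upper in [(r"\d", r"\D"), (r"\w", r"\W"), (r"\s", r"\S")]:
    lower_hits, upper_hits = split_by_class(lower, upper, sample)
    # Every character lands in exactly one of the two lists.
    print(lower, lower_hits, upper, upper_hits)
```

Seeing the same split three times in a row is the repetition the comment asks for.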
RegExr [0] does a great job of showing individual highlights even when they are in a sequential string. You could try to implement this, if you want, instead of showing a callout with a note to let the reader know that the highlights should be on individual characters.
Constructive criticism: I was about to send this to a friend who is new to programming, but the introduction is just too short. It would be great if the introduction included one or two motivational examples for the types of trouble you run into when you don't know regular expressions.
Just some 2cent feedback - don't assume anything is known.
The BASIC lesson doesn't mention anything about /g. Having not touched regex in years I had no idea what that was and kept thinking 'why isn't he showing it matching a g if he has that in the example'.
I love this! I love RegEx but have struggled trying to “teach” others over the years. In addition to books like this, I often find writing RegEx with something like Expressions[1] (and I know there are many great website solutions, Expressions is just a great app that I find very approachable to newcomers) is a great way to learn. When you can see what you’re writing select what you want, you get a great grasp of how it works. This, alongside a good book with good examples, is pretty much how I learned RegEx ~12 years ago.
Nice intro. Tangential question: is there a regex tool that shows where the expression failed? Not in syntax, but the logical failure point? It would be useful for when an expression gets a little long and nested and modifications need to be made.
Edit: I mean like:
Target text is abcde
Regex is /abe/
Is there a tool that will tell me it matched a and b and then failed trying to match e?
Those sites are great resources but they are showing pass/fail and do show an excellent breakdown when something satisfies the expression, but I’m just wondering if there is something that shows partial matching until the failure point?
It might be nice to touch on composition, as a good way to get started is to test out individual pieces and be confident they work when you're putting them together.
If you're building a complex regular expression, setting smaller parts in variables and dropping them in with (?:${part}) makes things a bit more readable.
It also exposes a real weakness of most regex engines. In particular, alternation is a first-class operation, but complement and intersection, while theoretically possible[1], are typically not.
A person might guess that to match three keywords is /.*keyword1.*&.*keyword2.*&.*keyword3.*/
Or maybe /.*keyword1.*&(.*keyword2.*)!/ to match keyword1 and not keyword2.
But those won't work, so it's a good idea to explain some options, an obvious one being /keyword1/.test() && !/keyword2/.test()
In the section on lookaround assertions, it's probably useful to note that (?=thing1)(?=thing2) can match both, and it's a good mental model for it, but that it comes with a few gotchas.
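A rough Python sketch of the three ideas in this comment: composing a pattern from named parts, combining two separate tests for "this but not that", and stacking lookaheads for "all of these". The part names and helper functions are hypothetical, and Python's f-strings stand in for the ${part} template syntax.

```python
import re

# Small pieces that can be tested on their own before being combined.
word = r"[A-Za-z]+"
number = r"\d+"
pair = rf"(?:{word})=(?:{number})"  # e.g. width=42


def has_first_not_second(text: str, first: str, second: str) -> bool:
    """Two plain tests instead of one regex: contains first, lacks second."""
    return bool(re.search(first, text)) and not re.search(second, text)


def has_all(text: str, *keywords: str) -> bool:
    """Lookahead form of AND: every (?=.*kw) is checked from the same spot."""
    pattern = "".join(rf"(?=.*{re.escape(k)})" for k in keywords)
    return re.search(pattern, text, re.DOTALL) is not None


print(bool(re.fullmatch(pair, "width=42")))                                       # True
print(has_first_not_second("keyword1 only", "keyword1", "keyword2"))              # True
print(has_first_not_second("keyword1 and keyword2", "keyword1", "keyword2"))      # False
print(has_all("keyword2 before keyword1", "keyword1", "keyword2"))                # True
```

One thing to keep in mind with the lookahead version: the assertions are zero-width and all start at the same position, so they say nothing about the order in which the keywords appear.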
[+] [-] m463|5 years ago|reply
I've been using regexes for most of my career, and still struggle to get them right on first writing.
The #1 problem I run into is:
what is a literal character and what is a control character?
for example, both these are very common:
- match a parenthesis character or a period character
- use a parenthesis to group a match or use a period to match any one character
You would think I would learn it once, and be good.
but my #2 problem confounds this:
what is a literal character and what is a control character - in the language I am using?
for example I might need to escape a period to make it a literal for a regex.
If I am checking the files filexc and file.c and want to match the second, the regex I want is
in perl, I could say:
better would be:
in python I would write
better would be:
in a shell script, I might say:
EDIT: crap, I had to escape my comment because the asterisk in the regex was making my text italic
[+] [-] dntbnmpls|5 years ago|reply
[+] [-] 7786655|5 years ago|reply
[+] [-] squeaky-clean|5 years ago|reply
[+] [-] spdustin|5 years ago|reply
https://regex101.com
[+] [-] logicallee|5 years ago|reply
[+] [-] Gehinnn|5 years ago|reply
[+] [-] mycall|5 years ago|reply
[+] [-] nobrains|5 years ago|reply
Feedback:
- In the chapter https://refrf.shreyasminocha.me/chapters/character-classes an example is given which uses:
But the explanation above does not introduce these yet, so a real beginner user (like me) is lost. The ambiguous characters example is fine, since it uses all the concepts already explained.
[+] [-] shreyasminocha|5 years ago|reply
[+] [-] LeonB|5 years ago|reply
[+] [-] ehsankia|5 years ago|reply
[0] https://www.attrs.org/en/stable/examples.html
[+] [-] seph-reed|5 years ago|reply
[+] [-] wonnage|5 years ago|reply
I'll recommend this post that's been on HN many times: https://swtch.com/~rsc/regexp/regexp1.html
[+] [-] the-pigeon|5 years ago|reply
[+] [-] btilly|5 years ago|reply
[+] [-] pmarreck|5 years ago|reply
[+] [-] saberworks|5 years ago|reply
[+] [-] thedirt0115|5 years ago|reply
[+] [-] jehlakj|5 years ago|reply
[+] [-] asicsp|5 years ago|reply
[+] [-] shreyasminocha|5 years ago|reply
I'll work on making things clearer.
[+] [-] twicetwice|5 years ago|reply
[+] [-] shreyasminocha|5 years ago|reply
I've fixed the typo, thanks for pointing it out.
Thanks for the comments!
[+] [-] ggm|5 years ago|reply
[+] [-] vasili111|5 years ago|reply
Good regex book: https://www.amazon.com/gp/product/0596528124/
Good regex website: https://www.regular-expressions.info/
Interesting regex links: https://github.com/aloisdg/awesome-regex
[+] [-] kccqzy|5 years ago|reply
https://swtch.com/~rsc/regexp/regexp1.html
https://swtch.com/~rsc/regexp/regexp2.html
And actual implementations based on these articles: https://github.com/google/re2 and https://github.com/rust-lang/regex
[+] [-] shanecoin|5 years ago|reply
[0] https://regexr.com/
[+] [-] backzerman|5 years ago|reply
[+] [-] shreyasminocha|5 years ago|reply
[+] [-] tragomaskhalos|5 years ago|reply
:)
[+] [-] donaldihunter|5 years ago|reply
[+] [-] airstrike|5 years ago|reply
[+] [-] sakekasi|5 years ago|reply
This site is my go-to whenever I need to write a complex regex. It's got syntax highlighting, explanations and a tester all rolled into one!
[+] [-] evo_9|5 years ago|reply
[+] [-] shreyasminocha|5 years ago|reply
[+] [-] canada_dry|5 years ago|reply
More regex resources I rely on:
http://www.regexr.com/
https://gchq.github.io/CyberChef
https://regexper.com/#.%3F%5Bv%2Ci%5D.*
https://cheatography.com/davechild/cheat-sheets/regular-expr...
[+] [-] shreyasminocha|5 years ago|reply
Also, those are some amazing resources, especially CyberChef.
[+] [-] donaldihunter|5 years ago|reply
One visual enhancement that could be really helpful would be to hover over the regex or the match and see the reciprocal highlighted.
[+] [-] filmgirlcw|5 years ago|reply
[1]: https://www.apptorium.com/expressions
[+] [-] binstub|5 years ago|reply
[+] [-] bmn__|5 years ago|reply
http://p3rl.org/rxrx
rxrx -e'"abcde" =~ /abe/'
Demo: https://blog-cloudflare-com-assets.storage.googleapis.com/20...
http://p3rl.org/re#'debug'-mode
perl -Mre=debug -e'"abcde" =~ /abe/'
----
https://stackoverflow.com/questions/2348694/how-do-you-debug...
[+] [-] nickysielicki|5 years ago|reply
https://regex101.com/
These websites have saved me hours of time at this point.
[+] [-] ben509|5 years ago|reply
[1]: https://www.researchgate.net/publication/220994310_Succinctn...