top | item 34183944

Pomsky – A portable, modern regular expression language

208 points | dataminer | 3 years ago | pomsky-lang.org

145 comments

[+] vintermann|3 years ago|reply
It's probably better than regular expressions. However, is it enough better that it's worth learning yet another syntax?

Well, maybe. What I REALLY like about this one is that it fully reverses the quoting/escaping assumptions. The old default assumption that symbols should match themselves is, I think, regexes' million-dollar mistake. The literal matches are the least interesting part of regexes. If you reach for regexes, it's because you want something more complex than literal matches, and the syntax should be about taming that complexity. Even the Unix world sort of conceded that by reversing the quoting assumption for ?, +, (), [] and | in egrep. The mistake was in stopping there.

I will take a good look at this. I hope they provide good justifications for their choices.

[+] gpvos|3 years ago|reply
Perl regex syntax does what you describe, and has done so since the 1990s. Only alphanumeric characters are guaranteed to match themselves, and the others have a special meaning unless escaped. Some nonalphanumeric characters don't have a special meaning and you can use them either with or without a backslash, but the man page has always warned that it's safer to escape them anyway.
[+] cmovq|3 years ago|reply
The literal matches are useful in an editor search context where most of the time you want to search for the literal string.
[+] coffeeblack|3 years ago|reply
On first glance, I didn’t see anything better, tbh.
[+] mekster|3 years ago|reply
Why don't languages have "grok" patterns in their standard libraries?

It seems to only exist in log parsing ecosystems but this really helps with getting rid of little bugs and wrong parsing of specific regex patterns.

Instead of doing "^\d+(\.\d+){3}$" for IP checking which is clearly wrong, you'd do "%{IPV4:ip}" which is so much better.

List of known patterns : https://github.com/hpcugent/logstash-patterns/blob/master/fi...

Even for PHP a third party library only has 15 stars.
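To make the point concrete, here's a minimal Python sketch of why the naive pattern is wrong; it uses the stdlib ipaddress module as a stand-in for a grok-style %{IPV4} check (no grok library assumed):

```python
import re
import ipaddress

# The naive pattern from above: "digits, then three dot-digit groups",
# with no bound on each octet's value.
naive = re.compile(r"^\d+(\.\d+){3}$")
print(bool(naive.match("999.999.999.999")))  # True, yet not a valid IPv4

# A grok-style named pattern is closer to what the stdlib already offers:
# let a real parser decide, instead of a hand-rolled regex.
def is_ipv4(s: str) -> bool:
    try:
        ipaddress.IPv4Address(s)
        return True
    except ValueError:
        return False

print(is_ipv4("999.999.999.999"))  # False
print(is_ipv4("192.168.0.1"))      # True
```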

[+] pimlottc|3 years ago|reply
Perhaps because many devs, like me, haven't heard of "grok patterns"[0] before. Because of that, it took me a while to understand what your post was saying, since I was reading "grok" as a normal word.

It's not a bad idea, though you'd want to make it a more formal standard; since it's just used internally in one project, it may be subject to change based on that project's own needs. There could also be more documentation, showing examples of strings that are accepted and not accepted by each pattern, as well as advice for generating compliant strings.

0: https://logz.io/blog/logstash-grok/

[+] gregmac|3 years ago|reply
I'd never heard of this, and it just looks like a list of regex patterns. Am I missing something? There are lots of lists of patterns available [1].

If I came across this in code, I'd still have the same issues as with regex: I'd have to question the source, what unit testing exists, and whether the tests cover the particular cases we care about; and if patterns are composed, the composition needs tests anyway.

For context, I always split my regexes out into a function and add unit tests for them, and I ask for the same in PRs. It's one of the easiest tests to write and one of the most beneficial: even if I don't discover a missed case while coming up with example inputs, the tests protect any future modifications by future devs. They also empower devs who don't know regexes well to modify them (because they know they won't break anything) and to approve a PR containing one.

I don't see how grok patterns avoid the need for tests (especially if composed), and if there are tests anyway, they're just a level of indirection.

In the worst case, when there's inevitably a bug (or app-specific missing situation) in one of these, it's harder to fix than just having a regex directly.

[1] https://github.com/mmkjony/awesome-regex-1 , https://regexlib.com/DisplayPatterns.aspx?cattabindex=5&cate...
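The workflow described above might look like this in Python; the version-string pattern is a made-up example, not from the comment:

```python
import re

# Keep the regex behind a small, named function so it can be
# unit-tested and safely modified later.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")

def parse_version(s):
    """Parse a 'major.minor.patch' version string, or return None."""
    m = _VERSION_RE.match(s)
    return tuple(int(g) for g in m.groups()) if m else None

# The tests double as documentation of intended behavior, and let a dev
# who doesn't know regex well modify or review the pattern confidently.
def test_parse_version():
    assert parse_version("1.2.3") == (1, 2, 3)
    assert parse_version("1.2") is None        # too few parts
    assert parse_version("1.2.3.4") is None    # too many parts
    assert parse_version("a.b.c") is None      # non-numeric

test_parse_version()
```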

[+] Minor49er|3 years ago|reply
They don't have them because most of these checks are expensive and offer little benefit outside of things like log parsing

A common example is email address validation. Trying to validate that an email address is well-formed with regex will often miss valid addresses (the example you linked would miss some in the HOSTNAME portion alone). It would be better to simply send a confirmation link to the address and wait for it to be clicked since that would ensure that the address is valid

[+] andy81|3 years ago|reply
.NET has something close in standard libraries, in that you can try converting a string to System.Net.IPAddress and other similar classes.

It's for parsing rather than search though, you'd still need regex if e.g. searching unstructured logs for IPs.

For example (PowerShell syntax, for easy reproduction), each of these returns a parsed value, or throws if invalid:

    [uri]"\\ServerName1\Folder\UserName"
    [version]'1.2.3.4'
    [ipaddress]'1.1.1.1'

[+] zokier|3 years ago|reply
While I'm not sure about Pomsky specifically, I do think it's nice that people explore the language space for regex more. General programming languages have a huge variety of styles and syntaxes available, from APL to Haskell and Lisp, whereas regexes are pretty much the same everywhere. It feels like we got stuck with the first thing that Kleene, Thompson et al. thought of, and for 50 years didn't even really try anything else.
[+] turnsout|3 years ago|reply
To me, the lead example on the homepage ("Basic") is a major red flag. This is not clearer than a traditional regular expression:

    'Hello' ' '+ ('world' | 'pomsky')

Did you count the single quotes correctly? Even with syntax highlighting, people WILL mess this up.

[+] kstenerud|3 years ago|reply
The fundamental problem comes from assigning meaning to whitespace (in this case, concatenation). I had the same issues when developing KBNF ( https://github.com/kstenerud/kbnf/blob/master/kbnf.md ) which operates in a closely related space (grammars).

In early development, I took a number of cues from existing work that turned out to be bad ideas, in particular using whitespace for concatenation (which all BNF dialects seem to do).

Switching to '&' for concatenation (reading it as "x and then y") made things a lot clearer, as it would also do for Pomsky:

    'Hello' & ' '+ & ('world' | 'pomsky')
[+] blindseer|3 years ago|reply
I find something like this a lot more readable:

https://github.com/jkrumbiegel/ReadableRegex.jl

It is in Julia, but if you have it installed locally it’s just a few taps away. You can even generate the regex, and use that in Python and just add the ReadableRegex in a comment nearby.

[+] luuuzeta|3 years ago|reply
Named regexes (variables in Pomsky) remind me of Raku [1], which implements an improved flavor of PCRE regexes plus grammars in general as part of the language.

[1]: https://docs.raku.org/language/grammars

[+] jjice|3 years ago|reply
Grammars as part of the language are probably the most interesting thing about Raku to me. I've never seen that as a core concept in another language.
[+] echelon|3 years ago|reply
I don't know why I'd never previously considered regular expressions as being a compile/transpile target. It's pretty obvious from PL theory and makes a ton of sense.

That said, after looking at this syntax, I'm not sure that this is much of an improvement. Maybe I've spent far too much time in Regex land [1], but I know I'd perform much slower in this. It's not particularly beautiful, either. The verbosity doesn't seem clearer.

Variables and comments are great, though. We need to add them in future regexes.

Overall, good idea. I'd like to see more takes on this.

[1] https://jimbly.github.io/regex-crossword/

[+] TuringTest|3 years ago|reply
It may be a good way to standardize regexp syntax for users of all levels of expertise.

Every text editor, shell environment, programming language or desktop application seems to use regular expressions with a syntax which is slightly different to all the others, but not different enough to call it with a different name.

This means that a newbie learning regular expressions will be thrown into an environment where they can learn the basic principles, but the rules they learn are not generalisable to all applications (e.g. do I match an arbitrary string with '*' or '.*'? Can I reuse matched patterns with (1) or with {1}? Etc.)

A new readable and easy-to-learn syntax that is nevertheless portable may work like markdown, as a simple-yet-universal new way to apply regexp that newcomers may learn with confidence and apply everywhere, replacing all the previous slightly incompatible versions.

[+] jsnell|3 years ago|reply
Those improvements are not really novel, though. Perl regular expressions have had variables, comments, and non-significant whitespace basically forever.
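For illustration, Python's re.VERBOSE flag (the analogue of Perl's /x) gives the same trio: insignificant whitespace, inline comments, and "variables" via plain string composition. The IPv4 example here is mine, not from the comment:

```python
import re

# "Variables" are just strings interpolated before compiling.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"  # one octet, 0-255

IPV4 = re.compile(
    rf"""
    ^
    {OCTET} \.   # first three octets, dot-separated
    {OCTET} \.
    {OCTET} \.
    {OCTET}      # final octet
    $
    """,
    re.VERBOSE,  # ignore whitespace, allow # comments
)

print(bool(IPV4.match("192.168.0.1")))  # True
print(bool(IPV4.match("256.1.1.1")))    # False
```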
[+] hnlmorg|3 years ago|reply
I think that’s a fair summary.

Interesting idea but the syntax adds nothing to the readability of regex.

I’d be more impressed with this if it targeted multiple different regex targets, since not all implementations of regex are equal. But in its current state it has all the problems of regex plus now the problems of this new language on top.

Looks a fun personal project though. Hope the developers enjoyed building it.

[+] osigurdson|3 years ago|reply
I seem to need regular expressions about once every six months. Every time I do, I wonder how this horrible little language became ubiquitous. However, I'm not going to pull in a dependency just to avoid it (at my current regex cadence, in any case).
[+] biztos|3 years ago|reply
I once worked on a product that was heavily invested in regular expressions, and fairly non-technical users generated more of them, often hundreds per day.

Of course this led to a certain amount of UI whack-a-mole: people learn a mini-language really fast when they use it all day at their jobs, and people are creative; whereas the "computer people" really needed the system to not grind to a halt because of a massively inefficient regex.

From this I learned to see regexes everywhere when I look at text; and that we should always consider the "threat model" of our own users being good at their jobs.

[+] AkshatJ27|3 years ago|reply
> If you know RegExp's, the syntax will immediately make sense

If I know RegEx, why would I use pomsky?

[+] skrebbel|3 years ago|reply
Because if you know regexp, you know how terrible it is.
[+] echelon|3 years ago|reply
Variables and comments seem nice.

Insignificant whitespace is needed to support it, and as an added bonus it would make it easier to break up patterns across multiple lines.

The syntax changes and added verbosity do not seem great, though. They'd trip me up for sure.

In general, I think I'd like to see a language more like "Regex 2.0", i.e. an extension that doesn't depart too far from what we're used to.

[+] kyriakos|3 years ago|reply
I wrote a lot of regex but I find reading regex extremely hard. Even expressions I wrote myself become unreadable after a few months.
[+] DemocracyFTW2|3 years ago|reply
If you know RegEx, you have two problems.
[+] YesThatTom2|3 years ago|reply
Code readability isn’t for you.

It is for the next person that maintains the code.

[+] account-5|3 years ago|reply
I like regex, the syntax is succinct, powerful, and easy to learn. I learned regex before I learned any proper programming language, because you could use it in a text editor. I remember feeling like a wizard at the time.

I do admit it can be a write once hopefully never need to read again thing, but I still love it.

[+] sulami|3 years ago|reply
I had a similar kind of idea for a long time, which I put into action a few weeks ago via a standalone transpiler of Emacs' rx macro to common regexp syntaxes.[0] I ended up getting interrupted and didn't completely finish it, but it generally works, though is probably riddled with edge cases.

The basic idea of rx is to use S-expressions to describe regular expressions, and my elevator pitch would've been to embed rx invocations in shell scripts using $(syntax), the main use case being something like sed invocations.

I still think it's a neat idea, and complex regular expressions tend to be hard to parse for humans.

[0]: https://github.com/sulami/rx

[+] ngomile|3 years ago|reply
It would be nice if the Pomsky playground could show informative modal boxes when you hover over some of the Pomsky-style expressions, kind of like RegExr, which remains my favorite tool for quickly writing out a regular expression and seeing what its outcome will be. It's very nice to be able to quickly see what something does and how it affects your query, with documentation in an easily accessible location on the same page.

Not having to navigate to a different page would be a boon for the playground. Very interesting though, I use Regular Expressions for the times when you need to extract information but find and replace functions/methods just aren't enough. Mostly for my scrapers.

[+] xiphias2|3 years ago|reply
Number ranges look great, but it's much better to just extend + polyfill current regex syntax and keep compatibility.

[:0-255:] could be an option for example to write a number range.

Also regex / variable interpolation should be added to regexes in languages probably :)

[+] burntsushi|3 years ago|reply
Author of Rust's regex crate here.

> Also regex / variable interpolation should be added to regexes in languages probably

I've often wondered and desired this myself. It feels like the single most useful change one might make, because it enables (albeit very basic) logical decomposition. It comes in handy when a regex works for your use case, but the regex is somewhat long but with many repeated parts. In my experience, that isn't too rare, and the regex readability would likely be helped quite a bit by some kind of decomposition.
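A minimal sketch of that kind of decomposition, done by hand with Python string interpolation (the date-range pattern is illustrative, not from the comment):

```python
import re

# Name the repeated sub-patterns once, then interpolate them into the
# full regex instead of repeating them inline.
YEAR  = r"\d{4}"
MONTH = r"(?:0[1-9]|1[0-2])"
DAY   = r"(?:0[1-9]|[12]\d|3[01])"
DATE  = rf"{YEAR}-{MONTH}-{DAY}"

# The same DATE part is reused twice without being written twice.
RANGE = re.compile(rf"^{DATE}/{DATE}$")

print(bool(RANGE.match("2023-01-15/2023-02-28")))  # True
print(bool(RANGE.match("2023-13-01/2023-01-01")))  # False (month 13)
```

First-class interpolation in the regex language itself would do this without leaning on the host language's string machinery.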