Pattern Matching Without Regex – Introducing the Rosie Pattern Language

[+] greggyb|7 years ago|reply

I get that PEGs are more powerful than regexes, but it seems the example of componentizing is weak at best.

I can do this very easily on the command line or in a .<shell>rc file:

  ipv4component='[0-9]{1,3}'
  ipv4="$ipv4component"'\.'"$ipv4component"'\.'"$ipv4component"'\.'"$ipv4component"

  $ grep -E "$ipv4" ...

I am bad at regex, so I'm sure the example is a poor definition, but the idea is there. I can make variables that hold components of a regex, and since a regex is just a string I can compose these via concatenation.

If I did this a lot, I could build a small helper script (or probably just a set of shell functions) to maintain a library file of regex components that I can use in the shell with grep.

[+] taeric|7 years ago|reply

It isn't that hard to build functions around a lot of this, either. Emacs has the wonderful macro "rx". https://www.emacswiki.org/emacs/rx

[+] msoucy|7 years ago|reply

The article would be helped by using a full regex for ipv4 addresses - the one it uses would match invalid numbers (999.999.999.5 for instance), but the proper one is more complex (and would probably make for a better example as a result)
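
For reference, a stricter pattern can be sketched in Python by composing it from a single octet component. This is only an illustration of the "proper one is more complex" point, not the article's pattern; the names OCTET, IPV4 and is_ipv4 are made up here, and rejecting leading zeros such as "01" is a choice of this sketch.

```python
import re

# One octet in the range 0-255, listed from the widest alternative to the narrowest.
# Leading zeros ("01") are rejected; that is a choice made for this sketch.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"

# Four octets separated by literal dots.
IPV4 = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_ipv4(text: str) -> bool:
    """Return True if the whole string is a dotted quad with every octet in 0-255."""
    return IPV4.fullmatch(text) is not None
```

With this, is_ipv4("999.999.999.5") is False, whereas the simple [0-9]{1,3} version would accept that string.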

Also I think there's something wrong with this blog's formatting, it appears to be replacing underscores with italics even within code samples.

[+] setr|7 years ago|reply

I feel like that's fine: syntactically valid, semantically not.

Ideally syntax vs semantics should be decoupled in most parsing (hence the AST)

[+] wild_preference|7 years ago|reply

Though to be fair it's not obvious where to encode validation. You're saying it should be intrinsic in the parsing rules of ipv4. But there are reasons why you would want to validate in a separate pass.

For example, the more malformed input your parser accepts, the better the error messages and introspection (structured data) you have when you validate.

It's a classic trade-off.

You're right though, doesn't make a very good example.
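
A rough Python sketch of that separate-pass idea, under the assumption that the first pass only checks the shape of the address and the second pass checks ranges. LOOSE_IPV4, ParsedAddress, parse and validate are hypothetical names for illustration, not anything from Rosie.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Pass 1 (syntax): accept anything shaped like a dotted quad, even 999.999.999.5.
LOOSE_IPV4 = re.compile(r"([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})")


@dataclass
class ParsedAddress:
    text: str
    octets: Tuple[int, ...]


def parse(text: str) -> Optional[ParsedAddress]:
    """Return structured data for anything that looks like an address, else None."""
    m = LOOSE_IPV4.fullmatch(text)
    if m is None:
        return None
    return ParsedAddress(text, tuple(int(g) for g in m.groups()))


def validate(addr: ParsedAddress) -> List[str]:
    """Pass 2 (semantics): report every out-of-range octet, not just the first."""
    return [
        f"octet {i + 1} is {value}, expected 0-255"
        for i, value in enumerate(addr.octets)
        if value > 255
    ]
```

Because the loose parse succeeds on 999.999.999.5, validate can name each bad octet individually, which a single strict pattern could not do since it would simply fail to match.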

[+] drivers99|7 years ago|reply

Kind of looks like grok. Grok lets you name patterns, then build up larger patterns by those names, and then also name the groups that it matches to those sub-patterns so you can refer to them in the data. It's built on top of regex, as each pattern can be defined by a mix of other patterns and/or regex.

For example, this grok pattern (taken from [1])

%{TIMESTAMP_ISO8601:timestamp} \[%{IPV4:ip};%{WORD:environment}\] %{LOGLEVEL:log_level} %{GREEDYDATA:message}

refers to a pattern called TIMESTAMP_ISO8601 and calls it "timestamp" in the resulting output data structure.

In logstash, TIMESTAMP_ISO8601 is predefined in a patterns file, such as [2], and is made up of a mix of regex and other patterns like YEAR, MONTHNUM, etc.

TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

MONTHNUM is a regex (optional 0 followed by 1-9; or 10 through 12):

MONTHNUM (?:0?[1-9]|1[0-2])
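
To make the mechanism concrete, here is a minimal Python sketch of grok-style expansion: each %{NAME} or %{NAME:field} reference is replaced by the named pattern's regex, and a field name becomes a named capture group. This is not how logstash implements it; PATTERNS, REFERENCE and expand are hypothetical, only MONTHNUM is copied from above, and YEAR, MONTHDAY and DATE are simplified stand-ins rather than the logstash definitions.

```python
import re

# Hypothetical mini-library. Only MONTHNUM matches the definition quoted above.
PATTERNS = {
    "YEAR": r"[0-9]{4}",
    "MONTHNUM": r"(?:0?[1-9]|1[0-2])",
    "MONTHDAY": r"(?:0?[1-9]|[12][0-9]|3[01])",
    "DATE": r"%{YEAR:year}-%{MONTHNUM:month}-%{MONTHDAY:day}",
}

# Matches %{NAME} and %{NAME:field}.
REFERENCE = re.compile(r"%\{(\w+)(?::(\w+))?\}")


def expand(pattern: str) -> str:
    """Recursively replace pattern references with plain regex text."""

    def substitute(m: "re.Match[str]") -> str:
        name, field = m.group(1), m.group(2)
        body = expand(PATTERNS[name])  # assumes the library has no cyclic references
        return f"(?P<{field}>{body})" if field else f"(?:{body})"

    return REFERENCE.sub(substitute, pattern)
```

For example, re.fullmatch(expand("%{DATE:date}"), "2018-10-25") yields a match whose groupdict() contains year, month, day and date.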

I'm not sure what all of Rosie's base patterns are. This appears to be a valid regex though, from the example: [:alpha:]+ (regex character class and "+" meaning 1 or more). It's C instead of Java, which is useful in more/different places.

[1] https://www.elastic.co/blog/do-you-grok-grok

[2] https://github.com/logstash-plugins/logstash-patterns-core/b...

[+] eutropia|7 years ago|reply

Is there any way to make use of grok on the command line like the post does with Rosie patterns? I rather like the idea of `| grep 'net.ipv4'`

[+] neurotrace|7 years ago|reply

This is a very cool tool, but I feel like I'm missing something. This looks like any other PEG parser generator. The only difference I see is that it will automatically handle the case where a valid match starts somewhere other than at the start of the stream. I'm not sure that this justifies calling it a whole language unto itself.

What separates this from tools like PEG.js[1] or pest[2]?

[1]: https://pegjs.org/ [2]: https://github.com/pest-parser/pest

[+] yAak|7 years ago|reply

For the sake of discussion, here's what the author says: http://rosie-lang.org/blog/2018/02/25/why-rpl.html#why-not-u...

I guess pest is comparable then, but wasn't mature when the author started work on Rosie?

(I'd be curious for a proper comparison, but I'm not really knowledgeable in this area -- I had no idea there were so many alternatives to regex: https://en.wikipedia.org/wiki/Comparison_of_parser_generator...)

[+] gabiruh|7 years ago|reply

Yeah.. that got me thinking too. I would not have called it "new".

As of version 5.10, Perl's regex engine supports full recursive-descent parsing, which allows things like Regexp::Grammars[0] to exist. Perl also has a nice PEG parser framework called Pegex[1].

-- [0] - https://metacpan.org/pod/Regexp::Grammars [1] - https://metacpan.org/pod/Pegex

[+] KeyboardFire|7 years ago|reply

The example they give isn't really convincing to me. I can see the use case for this kind of language, but for, e.g., searching on the shell for a pattern that isn't just one of a few predefined special cases, it seems like it'd still be a lot easier to compose regexes on the fly.

[+] jmaa|7 years ago|reply

I don't see the difference between this and any other Context-Free Grammar specification language. Yacc is an industry standard, and even SNOBOL4 (1967) had first-class CFG datatypes. Maybe he's just excited about being able to use CFGs in the cmdline?

[+] dblotsky|7 years ago|reply

I was going to agree with everyone about how it’s not a language, but reading into it more, I proved myself wrong.

This is a different language insofar as it describes PEGs, not regexes, which is fundamentally different and more powerful (it can parse more things).

The naming of patterns isn’t unique, since you can just put regexes in variables in every other language too. However, the syntax in Rosie seems nicer, and sharing is easier.

[+] msla|7 years ago|reply

This is nice within a specific use case: being able to make files with all the pattern chunks you use repeatedly, so you can reuse them and add to them. If you can't make files, it at least looks no worse than composing regexes on the command line, but it also doesn't look all that different.

Edit: OK, I was wrong. It is strictly more powerful than regexes, in that it can correctly match nested pairs.
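
A small Python sketch of what "nested pairs" means in practice. It hand-codes the PEG-style rule balanced <- ( "(" balanced ")" )* as a recursive function; this is plain Python rather than RPL syntax, and match_balanced and is_balanced are names invented for the example. A classical regular expression (one without recursion extensions) cannot express this for arbitrary nesting depth.

```python
def match_balanced(text: str, pos: int = 0) -> int:
    """PEG-style rule: balanced <- ( '(' balanced ')' )*

    Returns the index just past the longest balanced run starting at pos.
    """
    while pos < len(text) and text[pos] == "(":
        inner_end = match_balanced(text, pos + 1)  # recurse for the nested part
        if inner_end >= len(text) or text[inner_end] != ")":
            break  # this group never closes, so the repetition stops before it
        pos = inner_end + 1
    return pos


def is_balanced(text: str) -> bool:
    """True if the entire string consists of correctly nested parentheses."""
    return match_balanced(text) == len(text)
```

The recursion in match_balanced is what tracks nesting depth; it is the part a finite automaton has no way to emulate.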

[+] ketralnis|7 years ago|reply

This talk about Rosie https://www.youtube.com/watch?v=MkTiYDrb0zg is also quite nice

[+] AndrewOMartin|7 years ago|reply

Can this be used to parse HTML?

[+] neurotrace|7 years ago|reply

From the third paragraph:

> Rosie has several benefits over traditional regexes, including the ability to parse recursive structures like HTML and JSON

[+] zwieback|7 years ago|reply

Third paragraph:

Rosie has several benefits over traditional regexes, including the ability to parse recursive structures like HTML and JSON, to create new patterns by combining other patterns, and to name patterns. You can combine these named patterns into libraries which you can import and use elsewhere.

[+] busterarm|7 years ago|reply

As an alternative, look into Nokogiri. I've never used a better XML/HTML library.
