Show HN: Regex.ai – AI-powered regular expression generator

[+] AdieuToLogic|3 years ago|reply

> Regex.ai is an AI-powered tool that generates regular expressions.

Or, just write regular expressions?

> ... Regex.ai's intuitive interface makes it easy to input sample text and generate complex regular expressions quickly and efficiently.

See: https://www.ibm.com/topics/overfitting

Inputting the sample text:

  foo bar baz
  baz bar foo

And highlighting the first "baz" produced patterns which all had "[A-Z][a-z]*@libertylabs\\.ai" included, presumably due to the default inclusions.

Removing those and highlighting the second "baz" produced "<Agent B>" as the result in one case.

There is no explanation of any patterns generated. If a person is to use one of the generated patterns and Regex.ai is supposed to "save you time and streamline your workflow", no matter "[w]hether you're a novice or an expert", then some form of verification and/or explanation must exist.

Otherwise, a person must know how to formulate regular expressions in order to determine which, if any, of the presented options are applicable. And if a person knows how to formulate regular expressions, then why would they use Regex.ai?

[+] anileated|3 years ago|reply

I often find it faster to write something from scratch rather than to work with someone else’s code to fix it. In the latter case I need to understand the intent, the whys behind the choices.

Well guess what, LLM-generated code is someone else’s code: an amalgamation derived from many peoples’ code. Except those people are ‘helpfully’ “abstracted away” from you by the middleman, so you can’t know their original intents and choices. What’s worse, it’s someone else’s code that will be treated as your code—unlike working with a legacy system that everyone knows was written by some guy, in this case any bugs will be squarely on you.

[+] regexLL|3 years ago|reply

Thanks for your feedback! Updated ver 1.1 coming soon with more descriptions and better performance :)

[+] barbariangrunge|3 years ago|reply

If you don’t understand regexes well enough to write them yourself, you should not get some AI to generate them for you. You won’t be able to verify whether they do what you want, and the bugs can be subtle and destructive.

[+] jameshart|3 years ago|reply

You gave it an example where inferring the semantics you were after was basically a crapshoot. It’s not going to do well under those conditions. Nor will a random human who lacks insight into what specifically you are after. Did you want all the bazzes that are at the end of lines? The bazzes that follow bars? Who knows?

Try giving it examples where the data provides context cues.

[+] qwertox|3 years ago|reply

Using your example and deselecting the email addresses I end up with these suggestions:

\b(foo|bar|baz)\b

\w(foo|bar|baz)\w

\bbaz\b

[fF][oO][oO]|[Bb][Aa][Rr]|[Bb][Aa][Zz]

It only lacks a dice button which randomly selects the "correct" answer.
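For anyone who wants to see how those four suggestions actually differ, here is a rough check against the two-line sample from upthread. It is only a sketch and assumes Python's `re` module treats these patterns the same way the tool's target flavor does; the helper name `matches` is made up for illustration.

```python
import re

# Sample text from the comment upthread.
text = "foo bar baz\nbaz bar foo"

# The four suggested patterns, copied verbatim.
candidates = [
    r"\b(foo|bar|baz)\b",
    r"\w(foo|bar|baz)\w",
    r"\bbaz\b",
    r"[fF][oO][oO]|[Bb][Aa][Rr]|[Bb][Aa][Zz]",
]

def matches(pattern, text):
    """Return the full text of every non-overlapping match."""
    return [m.group(0) for m in re.finditer(pattern, text)]

results = {p: matches(p, text) for p in candidates}
for pattern, found in results.items():
    print(f"{pattern!r}: {found}")
```

The first and last patterns match all six words, the third matches both occurrences of "baz", and the second matches nothing, because it requires a word character on each side and every word in the sample is bounded by whitespace or a line end. None of them singles out just one "baz".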

[+] 6510|3 years ago|reply

There are tools that somewhat explain what each part of a regex does.

[+] qsort|3 years ago|reply

Just so that you know, your problem is called "regular expression synthesis", there's vast literature on it and a LLM is by no means necessary.

https://arts.units.it/retrieve/handle/11368/2758954/57751/20...

https://arxiv.org/pdf/1908.03316

https://cs.stanford.edu/~minalee/pdf/gpce2016-alpharegex.pdf

[+] __lm__|3 years ago|reply

The first one is available here: http://regex.inginf.units.it/

It uses genetic programming to build the regular expression.

[+] hackernewds|3 years ago|reply

and yet a decent regex generator has not existed before.

[+] florianfmmartin|3 years ago|reply

How about, instead of an AI generating a regex we can't understand, we put energy into using actually well-developed methods for parsing & validating text? Why put code you can't understand in your database?

For complex inputs, use actual PEG parsers: https://docs.rs/peg/latest/peg/

For simpler inputs, express your intent with readable methods using a lib: https://github.com/sgreben/regex-builder/ & https://github.com/francisrstokes/super-expressive
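As a loose illustration of the "readable intent" idea, and not the API of either linked library, Python's built-in `re.VERBOSE` flag lets a pattern carry its own explanation through whitespace, comments and named groups. The date format here is just an assumed example.

```python
import re

# Assumed example: an ISO-style date, written so that each part explains itself.
DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

m = DATE.fullmatch("2023-03-28")
if m:
    print(m.group("year"), m.group("month"), m.group("day"))
```

This only checks the shape of the input (it would accept month 13), which is the kind of limit that is easier to spot when the pattern is laid out this way.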

[+] b5n|3 years ago|reply

Once you know regex well enough to replace regex you realize that regex is pretty well developed.

There are certainly cases where different parsing methods/grammars are a better fit, but regex shines in many places.

[+] column|3 years ago|reply

Reality check : there are people like my colleague who aren't software engineers and still have to occasionally maintain/create a regex in some corporate software config.

[+] thunky|3 years ago|reply

> How about instead of an AI generating a regex we can't understand

Would you feel better if it generated a regex-builder expression instead of a regex?

Even if regex-builder generates a regex under the hood?

In any case, the regex itself is only an implementation detail.

[+] textread|3 years ago|reply

If you would like to generate a regular expression by giving an example input text and an example output match, you could use this closed form solution tool:- https://regex-generator.olafneumann.org/

There is an excellent HN comment that provides more reading material around regex generation:- https://news.ycombinator.com/item?id=32037544

[+] elif|3 years ago|reply

I had a problem so I used an AI generated regex. Now I have an unknowable number of problems.

[+] jedberg|3 years ago|reply

Usually when you have an AI like this that is supposed to generate verifiable results, you do an adversarial test where you ask it to solve problems that you already know the answer to, to make sure it works.

It looks like no one did that here. Even using the sample data provided, if you highlight a few of the addresses, it can't find the rest of them, mainly because it generates a regex with ST/AVE/LN in it, missing all the ones with RD. And if you add an RD sample, it just adds that to the list.
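A minimal sketch of that kind of check, using made-up address lines rather than the tool's actual sample data: hold back a few known answers, run the generated pattern over everything, and report what it missed. Both patterns and the `evaluate` helper are assumptions for illustration.

```python
import re

# Hypothetical address lines standing in for the tool's sample data.
highlighted = ["12 Oak ST", "34 Pine AVE", "56 Elm LN"]  # examples shown to the generator
held_out = ["78 Birch RD", "90 Cedar ST"]                # known answers kept back for testing
text = "\n".join(highlighted + held_out + ["Call me at 5 pm"])
expected = highlighted + held_out

def evaluate(pattern, text, expected):
    """Return (missed, spurious) matches relative to the known answers."""
    found = {m.group(0) for m in re.finditer(pattern, text)}
    missed = sorted(set(expected) - found)
    spurious = sorted(found - set(expected))
    return missed, spurious

# A pattern that memorised the suffixes it was shown.
overfit = r"\d+ \w+ (?:ST|AVE|LN)\b"
# A hand-written generalisation: any 2-3 letter uppercase suffix.
general = r"\d+ \w+ [A-Z]{2,3}\b"

print(evaluate(overfit, text, expected))
print(evaluate(general, text, expected))
```

The overfit pattern misses the held-out RD address, while the generalised one finds all five without picking up the unrelated line.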

There's lots of great innovation coming with LLMs, but people are forgetting their "AI basics" when it comes to verifying them.

[+] pyuser583|3 years ago|reply

I really wonder if this sort of thing is how AI will work.

We tell AI what we want. AI produces a hyper-specific, but barely comprehensible result. We look over the result to make sure it’s all good.

Then execute.

[+] yawnxyz|3 years ago|reply

I just used ChatGPT to create a ton of permutations for product pricing that I'm putting on Stripe as products.

Except... it made ONE ERROR that I just spent two hours tracking down and fixing in my JSON file and now in the Stripe dash. (I coincidentally found the error using ChatGPT lol).

It's probably still faster and less error-prone than I could have done it manually. But it's still error-prone...

[+] globalise83|3 years ago|reply

AI generates a comprehensive set of unit tests with correct and incorrect inputs, then we run the tests to ensure that they all pass.
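Roughly what that could look like, whoever writes the cases: a list of inputs that must match, a list that must not, and a helper that reports every disagreement. The 24-hour time patterns and the `failures` helper are hypothetical examples, not anything the tool produces.

```python
import re

# Hypothetical candidate pattern: a 24-hour HH:MM time.
candidate = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")
# A looser pattern that looks plausible at a glance.
loose = re.compile(r"\d{1,2}:\d{2}")

should_match = ["00:00", "09:30", "23:59"]
should_not_match = ["24:00", "9:30", "12:60", "12:30pm", ""]

def failures(pattern, positives, negatives):
    """Return every input on which the pattern disagrees with the expectation."""
    bad = [s for s in positives if not pattern.fullmatch(s)]
    bad += [s for s in negatives if pattern.fullmatch(s)]
    return bad

print(failures(candidate, should_match, should_not_match))
print(failures(loose, should_match, should_not_match))
```

The tests are only as good as the negative cases: the loose pattern passes every positive input and is caught only because "24:00", "9:30" and "12:60" are in the reject list.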

[+] tomashubelbauer|3 years ago|reply

I was curious if this would be smart enough to generate a regex for any four letter word so I copied the tagline of the site and highlighted all four letter words in it. (I have deleted the previous highlights of course.) It generated three regexes that just had a union of those words and one which started off good-ish by looking for any word of length of three or four, but then tacked on some random suffix and in the end this most promising regex turned out to not even match anything in the source text. As a suggestion to the authors of this tool I'd propose to add a step where any generated regexes that don't match anything in the input text are removed from the results.
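The filtering step suggested here is cheap to sketch. The text below is the product description quoted at the top of the thread, standing in for the site's tagline, and the three candidate patterns are invented to mirror the kinds of output described; `drop_non_matching` is a hypothetical helper name.

```python
import re

def drop_non_matching(candidates, text):
    """Keep only candidate patterns that match somewhere in the input text."""
    kept = []
    for pattern in candidates:
        try:
            if re.search(pattern, text):
                kept.append(pattern)
        except re.error:
            # A pattern that does not even compile is discarded as well.
            continue
    return kept

text = "Regex.ai is an AI-powered tool that generates regular expressions."
candidates = [
    r"\b[A-Za-z]{4}\b",      # any four-letter word
    r"\b(?:tool|that)\b",    # union of the highlighted words
    r"\b[A-Za-z]{3,4}@x\b",  # plausible start, then a suffix that never occurs
]

kept = drop_non_matching(candidates, text)
four_letter_words = re.findall(candidates[0], text)
print(kept)
print(four_letter_words)
```

A stricter version of the same filter would also require each surviving pattern to match every highlighted span, not just something in the text.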

[+] gbro3n|3 years ago|reply

The results I got from this were unfortunately not useful. For example in trying to extract the property names from a connection string, I highlighted all of the property names along with the equals sign. So for

"PostgresSql": "Host=localhost;User ID=postgres;Password=xxxx;Database=test;Application Name=Test1234,Port=35432;Pooling=false;"

I selected User ID=, Host=, Application Name=, Password=

The results were pretty useless, using the literal inputs as pattern matches:

The following prompt given to Chat GPT 4 however:

> Write a regular expression to extract the property names from this PostgreSQL connection string: "PostgresSql": "Host=localhost;User ID=postgres;Password=xxxx;Database=test;Application Name=Test1234,Port=35432;Pooling=false;"

Yields the response with an explanation:

(?<=[^\\w])([A-Za-z ]+)(?==)

"This regular expression will match any sequence of alphabetic characters (upper or lower case) that are followed by an equal sign (=). The negative lookbehind (?<=[^\\w]) ensures that the property name is not preceded by another word character."

A quick test on regex101.com shows this works perfectly.
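The same check can be reproduced locally. This sketch assumes the doubled backslash in the response is an escaping artifact and that the intended class is `[^\w]`, and it uses Python's `re` rather than whichever flavor regex101 was set to.

```python
import re

# Connection string from the prompt, copied verbatim.
conn = '"PostgresSql": "Host=localhost;User ID=postgres;Password=xxxx;Database=test;Application Name=Test1234,Port=35432;Pooling=false;"'

# Pattern from the response, with the doubled backslash read as a single \w.
pattern = re.compile(r"(?<=[^\w])([A-Za-z ]+)(?==)")

names = pattern.findall(conn)
print(names)
```

Under those assumptions it returns Host, User ID, Password, Database, Application Name, Port and Pooling. One detail worth knowing when reading the quoted explanation: `(?<=...)` is the syntax for a positive lookbehind, so the pattern requires a preceding non-word character rather than forbidding a word character, which also means a property name at the very start of a string would not match.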

Sorry, don't like to be overly critical. Someone has attempted to solve a common problem for developers, but LLMs are going to blow applications like this away. And I think that Chat GPT at version 4 has become a truly useful tool.

I was interested to test this since I'd been writing a regular expression earlier in the day for a similar usecase, which I've written up here: https://journals.appsoftware.com/public/76/227/4221/blog/sof...

[+] _andrei_|3 years ago|reply

[+] majastro|3 years ago|reply

My dude, how do I get in touch with you?!

[+] bastardoperator|3 years ago|reply

Yes!

[+] waldenyan20|3 years ago|reply

Cool hack! I'm having some trouble thinking of a case where I wouldn't just explain to Copilot / ChatGPT what I need. Maybe specifically in cases where I had the raw data but not the column titles?

[+] ouraf|3 years ago|reply

Going a little more "end user needs vs new tech offers", the intent is in the right place, but the output isn't helping as much as normal programmatic tools.

At least for me, what would make this a killer app would be the ability to read a document or PDF or big text dump and 1: identify "possible fields" (first name, date of birth), "probable fields" (middle name or other fields that are part of the data set but don't appear in every line) and "probable junk data" (page numbers, page headers, useless PDF padding)

2: allow selection or tuning of these fields to generate regex to catch or remove only the data related to the parsed fields.

I THINK there's something done with pandas (pandoc?) that can help with tearing a document apart and getting fields or basic doc structure, but AI would need to take it from there and present it in a clear, concise and optionally explained way so a busy office worker could just copy the regex filter into a spreadsheet formula or program function.

[+] libraryatnight|3 years ago|reply

This doesn't seem to generate great regex, but it does seem to generally work(ish?) so I guess nobody would care. That said, how does this work? Are you just sending this off to one of the AI APIs? What's going on with the data pasted in the box after we hit run?

Struck me as funny when we have another thread going about people pasting company data into ChatGPT and here we have a regex AI with an example that looks like it's encouraging you to trust it with helping you regex through your PII, just paste it in the box and highlight what you need lol (not saying that's the intent, just that's what less savvy users may do)

Company site does not inspire much confidence: https://libertylabs.ai/

Light on details, heavy on philosophers, trend setters, idea banks, and radicals that make me worried I'm dealing with opportunists taking swings at monetizing a bunch of .ai domains. Especially the weird cinematic banner.

[+] jen729w|3 years ago|reply

> Company site does not inspire much confidence: https://libertylabs.ai/

Ah c'mon. It's a bunch of kids -- which I say with envy not malice -- giving something new a go. Let 'em at it!

The products will speak for themselves. This one, meh, not so much. But we should be encouraging not disparaging.

[+] danmur|3 years ago|reply

Prompt Engineer is such a pompous title. I think AI operator would be more accurate.

[+] rpigab|3 years ago|reply

It's nice and there are use cases for it, but if I ever need something like it, I'll prolly just explain what I want to ChatGPT and tell it what regexp engine I'm using, and it'll give me results I'll paste to regexr.com for tests. The only added value here is that I wouldn't need to think of a prompt, but I've become good at finding nice prompts for programming problems, so directly querying ChatGPT is what I'd go for, personally.

Also, I'm not sure what underlying tech is used, and the only explanation of the tool seems to be a YouTube video, so I didn't look further. I'd like to know more about how it's made, if that's possible and something the author would be OK with sharing.

[+] benrutter|3 years ago|reply

This is a really nice implementation, so full credit to the creator. Regex is always confusing to compose, but it's also one of those situations where I can't help but wonder if the solution is just to improve upon / provide a nice abstraction for regex rather than handing over full control to a non-deterministic AI.

I've seen at least 2 projects in the last 6 months using LLMs to generate bash code which seems like a similar solve. LLMs are super cool, but there's a massive advantage to actually understanding what your code does, and LLM-generated regex, bash, assembly etc. loses that.

[+] cutler|3 years ago|reply

Dang, there goes my investment in Jeffrey Friedl's "Mastering Regular Expressions" which launched my programming career after I discovered a reference to it in "Dreamweaver Bible" back in 2000.

[+] politician|3 years ago|reply

Now you've got 3 problems?

https://blog.codinghorror.com/regular-expressions-now-you-ha...

[+] bongobingo1|3 years ago|reply

And a bill.

[+] AdieuToLogic|3 years ago|reply

Or just swoop in with Perl?

https://xkcd.com/208/

[+] regexLL|3 years ago|reply

Great to see our service rank 2nd on Hacker News! Surprised by the sudden rise in traffic.

We will be deploying regex.ai v1.1 in the first week of April, with descriptions and 5x improved performance. Stay tuned!

[+] alexnew|3 years ago|reply

To be honest I find ChatGPT sufficient for regex. I usually ask it for test cases that I can then validate in a regex playground to make sure the regex is working as expected.

[+] laserbeam|3 years ago|reply

Exactly my thought. I can just go to GPT and ask it for regex related stuff. Why do I need a dedicated AI for that? I don't.

[+] ekiauhce|3 years ago|reply

Writing regexps by hand can indeed be a tedious task in some cases.

One case I am familiar with is matching a datetime interval, when you need to narrow down log rows to a particular time range.

So I built a tool just for it :) https://github.com/ekiauhce/interval-to-regexp
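To give a feel for the problem, here is a deliberately naive sketch that is not the linked tool's algorithm: it emits one literal alternative per minute in the interval, which is fine for short ranges and shows why a smarter generator is worth having for long ones. The function signature, timestamp format and log lines are all assumptions.

```python
import re
from datetime import datetime, timedelta

def interval_to_regex(start, end, fmt="%Y-%m-%d %H:%M"):
    """Build a regex matching lines whose timestamp prefix falls in [start, end].

    Naive approach: one literal alternative per minute, so the end minute
    is included in full (minute granularity).
    """
    if start > end:
        raise ValueError("start must not be after end")
    alternatives = []
    t = start.replace(second=0, microsecond=0)
    while t <= end:
        alternatives.append(re.escape(t.strftime(fmt)))
        t += timedelta(minutes=1)
    return "^(?:" + "|".join(alternatives) + ")"

# Hypothetical log lines that start with "YYYY-MM-DD HH:MM:SS".
log = [
    "2023-03-28 09:57:59 warm-up",
    "2023-03-28 09:58:00 request started",
    "2023-03-28 10:01:59 request finished",
    "2023-03-28 10:02:00 idle",
]

rx = re.compile(interval_to_regex(datetime(2023, 3, 28, 9, 58),
                                  datetime(2023, 3, 28, 10, 1)))
selected = [line for line in log if rx.match(line)]
print(selected)
```

The generated pattern grows linearly with the length of the interval, so a day-long range already means 1,440 alternatives; collapsing those into digit ranges is the tedious part when done by hand.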

123 comments