Show HN: Search code in GitHub repos using regular expressions

This is awesome!

@danfox, sent you an email, though commenting here too.

I'm the CTO @ GitHub. Would love to talk to you about this and other things we are building in this area at GitHub.

Feel free to email direct to jason at github.com

I do enjoy moments like these on Hacker News when someone presents a project for X and the CTO of X shows up and wants to talk. It shows how direct an impact one can potentially have in this community.

I hope this means we’re getting grep searches for github soon. Cheers.

latenightcoding|6 years ago

github's code search is notoriously bad, feels like a huge missed opportunity. Nice to see you guys reaching out to other people working in this area.

sixwing|6 years ago

hah. you beat me to it, Jason.

@danfox, i'm always down to talk code search as well - rand@github.com

glangdale|6 years ago

iirc, Github uses (used?) my old project (https://github.com/intel/hyperscan) at Intel. It's probably faster than the alternatives, although if you want to support all types of regex you'll need to use Hyperscan as a prefilter for a richer regex engine like PCRE.

This project looks like it pulls literal factors out of the regex that I type in, maybe to an index a la that Russ Cox blog post a while back about Code Search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent length literal involved somewhere.

It seems to have a very rudimentary literal extraction routine, as it decides to give a partial result set when fed an alternation between two literals that it handles pretty well on their own.
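To make the literal-extraction idea concrete, here is a minimal sketch. It is a toy, not grep.app's or Hyperscan's actual routine: `longest_literal` and `index_query` are hypothetical names, and the parser deliberately gives up on groups and counted repeats. It shows why a pattern made only of open-ended classes such as \w leaves the index with nothing to look up, and why an alternation needs a literal from every branch.

```python
META = set(".^$*+?{}[]()|\\")


def longest_literal(branch):
    """Longest run of plain characters that every match of this branch must contain.

    Toy parser: assumes the branch has no alternation, groups, or {m,n} repeats.
    """
    best, current, i = "", "", 0
    while i < len(branch):
        ch = branch[i]
        if ch == "\\":
            # Treat every escape (\w, \d, even \.) as a break in the run.
            best, current = max(best, current, key=len), ""
            i += 2
            continue
        if ch in META:
            if ch in "?*" and current:
                # The previous character is optional, so it is not required.
                current = current[:-1]
            best, current = max(best, current, key=len), ""
            if ch == "[":
                # Skip the whole character class.
                end = branch.find("]", i + 1)
                i = end if end != -1 else len(branch)
            i += 1
            continue
        current += ch
        i += 1
    return max(best, current, key=len)


def index_query(pattern, min_len=3):
    """Literals to OR together in an index lookup, or None if the index cannot help."""
    if any(c in pattern for c in "(){}"):
        return None  # toy parser: give up on groups and counted repeats
    literals = [longest_literal(branch) for branch in pattern.split("|")]
    if any(len(lit) < min_len for lit in literals):
        return None  # some branch has no usable literal, so a full scan is needed
    return literals


print(index_query(r"\w+"))             # None: nothing for the index to look up
print(index_query("foobar|bazqux"))    # one literal per branch, to be OR-ed
```

If an engine looked up only one branch's literal instead of OR-ing all of them, an alternation would return a partial result set. Whether that is what happens here is not stated in the thread.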

WhiteOwlLion|6 years ago

How about the ability to search code on forks? GitLab allows it. At least have feature parity? Thanks.

simonw|6 years ago

Impressive! Really fast, full featured code search across a huge corpus.

1. How did you build the index? Did you use a GitHub dump of some sort? How often do you refresh it?

2. Is it Elasticsearch or similar or a completely custom engine?

3. What kind of RAM/CPU are you using to power it?

4. Any plans to open source the code or commercialize the technology?

I could absolutely imagine paying for a private code search engine like this to run against a large internal company codebase spread across many repositories.

danfox|6 years ago

Thanks! It's built on top of Elasticsearch. It fetches the repos from GitHub - it should pick up any updates to repos within a few days. It's running on a couple of servers with 20 cores each, which is not really enough for the traffic it's getting right now.

heipei|6 years ago

I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!
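A rough sketch of the 3-gram guess above, using a plain in-memory dictionary rather than a real search backend. This is an illustration under assumptions, not how grep.app is built: `build_index` and `search` are hypothetical helpers, and `required_literal` stands in for whatever literal a real system would extract from the regex. The index narrows the candidate files, and the regex then runs only on those candidates.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(docs, index, pattern, required_literal):
    """Prefilter by the trigrams of a literal every match must contain, then verify."""
    grams = trigrams(required_literal)
    if grams:
        # A document can only match if it contains every trigram of the literal.
        candidates = set.intersection(*(index.get(g, set()) for g in grams))
    else:
        # Literal too short to give a trigram: fall back to scanning everything.
        candidates = set(docs)
    rx = re.compile(pattern)
    return sorted(d for d in candidates if rx.search(docs[d]))


docs = {
    "a.py": "def parse_config(path): pass",
    "b.py": "config = load()",
    "c.py": "print('hello')",
}
index = build_index(docs)
print(search(docs, index, r"parse_\w+\(", "parse_"))
```

The final regex pass matters because sharing all trigrams with the literal does not guarantee a match; the index only rules files out.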

dang|6 years ago

I still miss Google Code Search, which was a great way to find examples of anything I wanted to learn about in programming and usually answered my questions better than anything else, including Stack Overflow. Has it really been 8 years... https://news.ycombinator.com/item?id=3112029

If this tool can fill that hole in my world, I'll be stoked. I've bookmarked it.

londons_explore|6 years ago

Google code search still exists as long as you want to search Chromium source code.

[1]: https://cs.chromium.org/

thanatos_dem|6 years ago

Next post from danfox - “how to get 3 job offers in 3 hours”.

Already has been publicly contacted by:

- GitHub CTO

- SerpApi CEO

- SourceGraph CEO

Search is hot right now!

swat535|6 years ago

Actually, it would be more like: "How I failed at 3 interviews, despite being directly contacted by execs."

nonbirithm|6 years ago

If only the answer to "how" was as simple as "writing a web service for searching GitHub repos with regexes," even though the problem is probably in itself non-trivial if there's this much interest in search at all. At least the specification is clear enough.

I guess what I mean to ask is, how would people know this is a "correct" answer to the "how" question beforehand? Is the answer literally just "search" because that's simply what's trending right now?

sdesol|6 years ago

It also probably goes without saying he should be careful with what details to share.

Existenceblinks|6 years ago

I'm surprised as well. I wonder why big tech companies didn't have this awesome search already.

neonate|6 years ago

Also by the co-creator of Django: https://news.ycombinator.com/item?id=22397023

sqs|6 years ago

This is really cool. What are you using it for? Usage examples, debugging, etc.?

I'm the CEO at Sourcegraph (universal code search for companies to use on their internal code). Our product is really optimized for searching a company's internal code right now, but soon we'll start working on offering much better search for public and open-source code as well. If you'd like to help out or just chat, please reach out! sqs@sourcegraph.com

edwinyzh|6 years ago

Sorry, but his code search covers far more languages than yours the last time I tried yours :)

franciscop|6 years ago

This is amazing! One thing it allows me to do, which I couldn't before, is search for the repos that use some of my open source code.

While there were some tools for this, they fall short for older projects where using a library meant copy/pasting it into your project, which is not reported in the CDN stats, npm installs or github "uses".

Now I can run a search with a bit of code that is only present in my library and find reliably those who copy/pasted it. While I publish my code under the MIT, this would also be very useful for those publishing under the GPL to detect bad actors.

danielecook|6 years ago

Wow. This is incredibly helpful. You can use it to see how someone may have used a function with named parameters:

  my_function(label=x, option_1=2)
  my_function.*option_1 # search
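For anyone who wants to check what that search pattern will and won't catch before running it, here is a small local sketch with Python's `re` module. The sample lines are made up for illustration. Note that `.*` is loose: it also matches a line where `option_1` belongs to a different call.

```python
import re

SEARCH = re.compile(r"my_function.*option_1")

# Hypothetical sample lines, not taken from any real repository.
lines = [
    "my_function(label=x, option_1=2)",
    "my_function(label=y)",
    "result = my_function(option_1=True, label=z)",
    "my_function(a); other(option_1=3)",  # false positive: option_1 is on another call
]

hits = [line for line in lines if SEARCH.search(line)]
for line in hits:
    print(line)
```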

SlowRobotAhead|6 years ago

That was my first thought. I’ll have to wait until tomorrow to try it, but I have one super rarely used function in a rare package I’d love to see how other people are using.

hoorayimhelping|6 years ago

to grep specific repos locally, I use a tool called Hound, https://github.com/hound-search/hound developed by a couple of engineers at Etsy while I was there, but never released officially.

glouwbug|6 years ago

Amazing, why Microsoft hasn't built this for GitHub yet is beyond me.

Can it grep on individual repos?

hcm|6 years ago

I built https://grephub.com for that. It doesn’t maintain an index so it’s not super snappy, but it’s good enough / better than you’d expect in many cases!

funklute|6 years ago

Why would you want to use this tool to grep individual repos? If you know the repo you're interested in, you can just clone it and then grep it locally...?

oefrha|6 years ago

A tangent, my biggest gripe with GitHub code search (within a repo) off the top of my head is the inability to blacklist directories or only search whitelisted directories. Often times I want to look up the implementation of a function, and bam, three pages of results from tests.

Noctem|6 years ago

I'm glad I'm not the only one. It's very common that I'll be searching for a keyword that only appears in the actual code a handful of times but hundreds of times in tests. GitHub's search is practically useless in those cases.

I almost always just resort to cloning and searching with ripgrep, which can be annoying if I have no other reason to have the codebase on my machine or it's just a one-off.

cynicalreason|6 years ago

yeap .. having this issue as well, trying to easily find where a method is defined in JS/TS I'd so much want to be able to exclude `*.(spec|test).(js|ts|jsx|tsx)`
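Until a search UI supports this, the exclusion can be applied to a list of result paths after the fact. A minimal sketch, assuming the glob above is meant as the regex `\.(spec|test)\.(js|ts|jsx|tsx)$`; `without_tests` is a hypothetical helper and the paths are invented.

```python
import re

# Regex reading of the `*.(spec|test).(js|ts|jsx|tsx)` exclusion above.
TEST_FILE = re.compile(r"\.(spec|test)\.(js|ts|jsx|tsx)$")


def without_tests(paths):
    """Drop paths that look like JS/TS spec or test files."""
    return [p for p in paths if not TEST_FILE.search(p)]


paths = [
    "src/user.ts",
    "src/user.spec.ts",
    "src/App.test.jsx",
    "src/contest.js",  # contains "test" but is not a test file
]
print(without_tests(paths))
```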

patrickdevivo|6 years ago

This is really cool! Awesome work. I assume you've seen https://sourcegraph.com/ as well? This to me seems much clearer and a bit more intuitive (though I've only spent a little time in sourcegraph). Really really cool. Does it also search code comments?

edwinyzh|6 years ago

last time I tried sourcegraph doesn't cover the language I use, so it's useless to me.

hartator|6 years ago

Excellent work!

I am the CEO at SerpApi. If you need a job, shoot me an email at julien _at_ serpapi.com.

fanf2|6 years ago

I wonder how this compares to Debian Code Search (https://codesearch.debian.net/about) and Russ Cox’s code search tools (https://swtch.com/~rsc/regexp/regexp4.html).

Obviously the source material is different (Debian packages vs GitHub repos) and grep.app also uses re2, but that is all I can see from a look at the “about” blurb.

sciurus|6 years ago

Another related tool is

https://searchfox.org/

https://github.com/bgrins/searchfox

nickjj|6 years ago

Hey Dan, if you ever wanted to come on my podcast to talk about your tech stack (how your site is developed / deployed, lessons learned, etc.), I'd love to have you on.

That podcast is at: https://runninginproduction.com/, drop me a line at nick.janetakis@gmail.com if you're interested.

edwinyzh|6 years ago

@danfox, without revealing your tech/business secrets, I wonder if you can share some tips about building such a search app :)

lol768|6 years ago

How did you pick the 500k repositories to index out of the 28 million or so which are public?

danfox|6 years ago

It was based on the number of stars/forks and the size of the repository.

TACIXAT|6 years ago

I do not have a great example to try on my phone, but are results deduplicated? My big peeve with GitHub search is getting 5 pages of the same forked repo.

danfox|6 years ago

There isn't any deduplication, although that will hopefully be less of an issue at this point since there's a limited number of repositories in the index.

aodj|6 years ago

You have no idea how often I've wanted something like this for GitHub. Thanks so much!

j1elo|6 years ago

GitHub confirmed to me that their search is not able to find in substrings; this is annoying because if you want to find all affected code among all possibly involved repositories, before a change, you need to clone them and grep locally. In the end this means you need to clone absolutely everything you work with, because otherwise you might miss changing that one repo you didn't think of:

https://stackoverflow.com/questions/43891605/search-partial-...

I've used Sourcegraph and it was cool; will have a look at this new tool too. But, GitHub pretty please add plain good old grep abilities to your search!

w-m|6 years ago

Amazing feat!

Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:

https://grep.app/search?q=%3C.%2A%3F%40gmail.com%3E&regexp=t...

https://imgur.com/a/VyUXhcF

sn4pp|6 years ago

Seems to be good for stuff like

api_key="[a-z0-9]+"

Ty

bananaeater|6 years ago

"We didn't find any matching results."

ferenczy|6 years ago

I would say this needs a list of indexed repos and mainly an explanation of exactly how it works to be usable (how the index is built and how often it's refreshed, what types of files are being indexed, etc.). Otherwise, there's no much value in searching in an unknown data, is it?

Anyway, to not only criticize, good job! It's definitely one of GitHub's missing features. And I can imagine it's not an easy job to build something like that. But as I wrote, it really has to be well explained to be actually usable.

clarry|6 years ago

> there's no much value in searching in an unknown data, is it?

So you know exactly how Google's index works?

I think "best effort", whatever it is, is useful even if I don't know the specifics of what it captures or misses. As long as it returns useful results.

tekkk|6 years ago

Superb work. You built a better code search than Github (well with some of its features missing sure) with a lot less resources. Shows how stagnated the progress in big companies is after a service is deemed "good enough". Good for you kicking them in their butts to lead the way. Hope you get out of this something else too than HN karma.

Really like the minimalistic design, not too designy but still easy on my eyes. Just the way I want it to let me focus on the task at hand

jakear|6 years ago

Any plans to include backrefs? I'd like to see how many examples of /(\w+) && \1\./ are out there in .js/.ts compared to /(\w+)\?\./

tyingq|6 years ago

The about blurb mentions it uses RE2. So backreferences aren't likely. See https://github.com/google/re2/issues/101
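RE2 leaves backreferences out by design, so a comparison like the one asked about would have to run in a backtracking engine on a local checkout. A small sketch with Python's `re` module, which does support `\1`; the JavaScript snippet being scanned is invented for illustration.

```python
import re

# `foo && foo.` guard style, using a backreference to require the same name twice.
GUARDED = re.compile(r"(\w+) && \1\.")
# `foo?.` optional-chaining style.
OPTIONAL = re.compile(r"(\w+)\?\.")

# Hypothetical source text, standing in for files from a cloned repo.
source = """
const a = user && user.name;
const b = user?.name;
const c = cfg && other.value;
"""

guarded = [m.group(0) for m in GUARDED.finditer(source)]
optional = [m.group(0) for m in OPTIONAL.finditer(source)]
print(len(guarded), len(optional))
```

The third line is not counted as guarded because `cfg` and `other` differ, which is exactly the distinction a backreference-free pattern cannot express.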

dabei|6 years ago

It’s interesting how it took so many years for such an obviously useful tool to emerge. I guess hosting this is finally getting cheap enough.

edwinyzh|6 years ago

I've been wondering the same thing for many years. And I don't know why Google killed Code Search

blackandblue|6 years ago

thank you so much for doing this! i hope it continues to open more doors of opportunities to you!

primo, this is a crazy snappy proof that shows that github search can be done. next, the UI is amazing. and finally, all my queries worked!

i am now going to remove "github search sucks" from my to-be-published rants because this post demonstrates that 1. people care 2. github was already working on it.

mrkramer|6 years ago

Backend for codegrep was Play framework + Elasticsearch and you could search by programming languages.

Screenshot: http://archive.is/0mFML

edwinyzh|6 years ago

Awesome! To me it looks like the comeback of "Google Code Search" which I've been missing for many years!

enriquto|6 years ago

Curious that I found many "secret forks" of my stuff, but none of my repos is directly indexed.

justanotheratom|6 years ago

Can you elaborate how you found them?

welder|6 years ago

Can I search only additions/deletions? Recently when searching GitHub I wanted to find if anyone had replaced the usage of a deprecated method with the new one, because the docs for that library don't mention the non-deprecated method name.

yuz|6 years ago

Do you index the default branch of every repo? Or do you just index the master branch?

danfox|6 years ago

It indexes the default branch of each repo.

cddotdotslash|6 years ago

The interface for this is really clean and nice - did you use a theme or framework?

danfox|6 years ago

Thanks! It's using Elastic's Search UI (https://github.com/elastic/search-ui) and Ant Design (https://github.com/ant-design/ant-design).

inetknght|6 years ago

I was going to say that I didn't want javascript on this.

But it's actually pretty #neat. It's all tidied up into a single app without any dependencies.

This rocks and, so far, seems way way WAY better than Github's own search tool.

bilekas|6 years ago

This is cool, reminds me of the vulnerability search too.

https://shhgit.darkport.co.uk/

Existenceblinks|6 years ago

^(.)'(.)'(.)$
I got a tooltip saying:
Error: JSON.parse: unexpected character at line 1 column 1 of the JSON data

Update: Oh, ^(.)"(.)"(.)$ works, and fast.

danfox|6 years ago

I think that error was just because the server was overloaded - sorry about that.

stagas|6 years ago

I wish there was something this fast, but for searching error outputs instead (along with discussions/solutions).

AdrianEGraphene|6 years ago

Feels like magic to me! Lets me easily see who's working on similar topics. Thanks!

edwinyzh|6 years ago

Can you share your search string? Thanks.

OutsmartDan|6 years ago

This is one of the fastest, most responsive searches i've ever used. Great work!

thrownaway954|6 years ago

might be a good idea to have some sort of clickable "demo" search or "try these" example on the frontend page to show off the capabilities of this.

KhoomeiK|6 years ago

How is it that fast?

chasers|6 years ago

How do you handle expensive regex statements?

doubleorseven|6 years ago

My last name (Ament) is really rare where I come from, so I've used the tool to find other people with the same last name. Was not disappoint. Thank you!

mtnGoat|6 years ago

this is awesome stuff, thank you! great work!

habit20|6 years ago

Hello world

sonicxxg|6 years ago

[deleted]

dang|6 years ago

There's no need for personal attack. We ban accounts that do that, so please don't.

Cherry-picking one post from a statistical cloud and calling it typical is dodgy. Even the distribution in this thread doesn't match your description. Actually, even the comment you're picking on doesn't match your description.

We detached this subthread from https://news.ycombinator.com/item?id=22397156.

dbielik|6 years ago

[deleted]

appleflaxen|6 years ago

This seems unrelated.

I hope u/dang sees your comment history; you are basically just spamming nerdydata.com

whatever1|6 years ago

Why regex still exists? It is unintuitive, requires mastering an obscure syntax, it is very hard to debug, and very difficult to explain to others how it works. It feels like we are trying to write intermediate code by ourselves, while we should have a human readable language that generates regex.

roryokane|6 years ago

You might be interested in “Eggex”, which aims to be a human-readable language that generates regexes. It’s currently written as a feature of the Oil shell, but in theory any tool could support them. Eggex docs: https://www.oilshell.org/release/latest/doc/eggex.html. Recent blog post about their development: https://www.oilshell.org/blog/2019/12/22.html.

However, Eggexes are a thin, mostly-syntactic layer over regexes. You still have to understand the regex engine to use them. If this sounds useless to you because you don’t currently understand any flavor of regex or parsing, I encourage you not to give up on learning regexes. (https://www.regular-expressions.info/ was how I learned; it’s a great tutorial.) Text-parsing engines, including regex engines, are a powerful concept that can be used in many situations, and I think it’s worth spending the effort learning them until, to paraphrase another commenter, regexes become the human-readable language you were searching for. Or Eggexes, at least.

GrantZvolsky|6 years ago

The investment into learning regexes is worth it if you write or read enough of them. They become the human readable language you speak of, eventually. The question is where the threshold lies.

frabert|6 years ago

Do it! You will find that it's very easy, but the result will either be extremely verbose or just like regex. Since most regexes (at least for me) are meant as one-time-use, the extra verboseness has no added benefit. If you have complex needs, you should probably be using something other than regex, anyways.
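A middle ground that already exists in some engines is a verbose mode, where whitespace is ignored and comments are allowed inside the pattern. A short sketch using Python's `re.VERBOSE`; the pattern is a tightened, hypothetical variant of the `my_function` search mentioned earlier in the thread, written once compactly and once spelled out.

```python
import re

COMPACT = re.compile(r"\bmy_function\([^)]*\boption_1\s*=")

READABLE = re.compile(
    r"""
    \b my_function \(     # a call to my_function, up to the opening parenthesis
    [^)]*                 # any arguments before the one we care about
    \b option_1 \s* =     # the named parameter being set
    """,
    re.VERBOSE,
)

# Hypothetical sample lines for comparison.
samples = [
    "my_function(label=x, option_1=2)",
    "my_function(a); other(option_1=3)",
]
for line in samples:
    print(bool(COMPACT.search(line)), bool(READABLE.search(line)))
```

Both patterns compile to the same matcher; the readable one is longer, which is the trade-off described above.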

tyingq|6 years ago

"Why regex still exists?"

Is there an alternative that is clearly superior?

GordonS|6 years ago

Because it's really powerful, and some people actually like it (I'm one of them).

I can understand that a complex pattern might look scary if you're unfamiliar, but if you work with it long enough, you can put patterns together with relative ease.

155 comments