I do enjoy moments like these on Hacker News when someone presents a project for X and the CTO of X shows up and wants to talk. It shows how direct an impact one can potentially have in this community.
I hope this means we’re getting grep searches for github soon. Cheers.
iirc, Github uses (used?) my old project (https://github.com/intel/hyperscan) at Intel. It's probably faster than the alternatives, although if you want to support all types of regex you'll need to use Hyperscan as a prefilter for a richer regex engine like PCRE.
This project looks like it pulls literal factors out of the regex that I type in, maybe to an index a la that Russ Cox blog post a while back about Code Search. It seems to Not Like things that have very open-ended character classes (e.g. \w) unless there is a decent length literal involved somewhere.
It seems to have a very rudimentary literal extraction routine, as it decides to give a partial result set when fed an alternation between two literals that it handles pretty well on their own.
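A minimal sketch of the kind of literal extraction described above, in Python. It is only a guess at the technique, not grep.app's actual routine: the function names (`split_top_level`, `required_literals`, `prefilter_query`, `may_match`) and the `min_len=3` cutoff are assumptions made for illustration. An alternation becomes an OR of branches, and each branch contributes the literal runs that every match must contain, so `foo_bar|baz_qux` admits documents containing either literal rather than a partial set.

```python
def split_top_level(pattern):
    """Split a pattern on '|' characters that sit outside groups and classes."""
    branches, current, depth, in_class, i = [], [], 0, False, 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            current.append(pattern[i:i + 2])  # keep escapes intact
            i += 2
            continue
        if in_class:
            in_class = ch != "]"
        elif ch == "[":
            in_class = True
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "|" and depth == 0:
            branches.append("".join(current))
            current = []
            i += 1
            continue
        current.append(ch)
        i += 1
    branches.append("".join(current))
    return branches


def required_literals(branch, min_len=3):
    """Return literal runs (outside groups) that every match of the branch contains."""
    runs, run, depth, in_class, i = [], [], 0, False, 0

    def flush():
        if len(run) >= min_len:
            runs.append("".join(run))
        run.clear()

    while i < len(branch):
        ch = branch[i]
        if in_class:
            if ch == "\\":
                i += 1  # skip the escaped character inside the class
            elif ch == "]":
                in_class = False
        elif ch == "\\" and i + 1 < len(branch):
            nxt = branch[i + 1]
            if depth == 0 and not nxt.isalnum():
                run.append(nxt)  # escaped punctuation such as \. is a literal
            else:
                flush()  # \w, \d, \b and friends are not literals
            i += 1
        elif ch == "[":
            flush()
            in_class = True
        elif ch == "(":
            flush()
            depth += 1
        elif ch == ")":
            flush()
            depth -= 1
        elif ch in "*+?{":
            if run:
                run.pop()  # conservatively drop the quantified character
            flush()
            if ch == "{":  # skip the body of a {m,n} quantifier
                close = branch.find("}", i)
                i = close if close != -1 else len(branch)
        elif ch in ".^$":
            flush()
        elif depth == 0:
            run.append(ch)
        i += 1
    flush()
    return runs


def prefilter_query(pattern):
    """OR across branches; each branch is an AND of its required literals."""
    return [required_literals(b) for b in split_top_level(pattern)]


def may_match(query, text):
    # A branch with no usable literal cannot be filtered, so it admits everything.
    return any(all(lit in text for lit in lits) for lits in query)


print(prefilter_query(r"foo_bar|baz_qux"))  # [['foo_bar'], ['baz_qux']]
print(prefilter_query(r"\w+"))              # [[]]
```

The second call shows why an open-ended class such as `\w+` with no literal nearby is expensive: the prefilter has nothing to look up, so every document would have to be scanned by the real regex engine.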
Impressive! Really fast, full featured code search across a huge corpus.
1. How did you build the index? Did you use a GitHub dump of some sort? How often do you refresh it?
2. Is it Elasticsearch or similar or a completely custom engine?
3. What kind of RAM/CPU are you using to power it?
4. Any plans to open source the code or commercialize the technology?
I could absolutely imagine paying for a private code search engine like this to run against a large internal company codebase spread across many repositories.
Thanks! It's built on top of Solr. It fetches the repos from GitHub - it should pick up any updates to repos within a few days. It's running on a couple servers with 20 cores each, which is not really enough for the traffic it's getting right now.
I'd be curious how you built the step from regex to ElasticSearch. My guess would be an n-gram (3-gram) index in ElasticSearch and then translating the regexes to that, but just curious if you built that custom or used something off-the-shelf. Love the site!
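The 3-gram guess above can be sketched in a few lines of Python. This is an illustration of the general idea only, not a description of how grep.app or any Solr/Elasticsearch setup is actually configured; the `TrigramIndex` class and its method names are hypothetical. The index narrows the corpus to documents containing every trigram of a literal taken from the regex, and the real regex then confirms each candidate.

```python
import re
from collections import defaultdict


def trigrams(text):
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    def __init__(self):
        self.docs = {}                    # doc_id -> full text
        self.postings = defaultdict(set)  # trigram -> doc_ids containing it

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(doc_id)

    def candidates(self, literal):
        """Documents containing every trigram of the literal (a superset of true hits)."""
        grams = trigrams(literal)
        if not grams:
            return set(self.docs)  # literal too short to filter on
        return set.intersection(*(self.postings.get(g, set()) for g in grams))

    def search(self, pattern, literal):
        """Narrow with the index, then confirm each candidate with the real regex."""
        regex = re.compile(pattern)
        return sorted(d for d in self.candidates(literal) if regex.search(self.docs[d]))


index = TrigramIndex()
index.add("a.py", "import hyperscan")
index.add("b.py", "import re2")
index.add("c.txt", "hypers erscan")  # has every trigram of "hyperscan", but not the word

print(sorted(index.candidates("hyperscan")))          # ['a.py', 'c.txt']
print(index.search(r"import\s+hyperscan", "hyperscan"))  # ['a.py']
```

The `c.txt` document is a false positive at the index stage, which is why the confirmation pass with the full regex is still needed.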
I still miss Google Code Search, which was a great way to find examples of anything I wanted to learn about in programming and usually answered my questions better than anything else, including Stack Overflow. Has it really been 8 years... https://news.ycombinator.com/item?id=3112029
If this tool can fill that hole in my world, I'll be stoked. I've bookmarked it.
If only the answer to "how" was as simple as "writing a web service for searching GitHub repos with regexes," even though the problem is probably in itself non-trivial if there's this much interest in search at all. At least the specification is clear enough.
I guess what I mean to ask is, how would people know this is a "correct" answer to the "how" question beforehand? Is the answer literally just "search" because that's simply what's trending right now?
This is really cool. What are you using it for? Usage examples, debugging, etc.?
I'm the CEO at Sourcegraph (universal code search for companies to use on their internal code). Our product is really optimized for searching a company's internal code right now, but soon we'll start working on offering much better search for public and open-source code as well. If you'd like to help out or just chat, please reach out! sqs@sourcegraph.com
This is amazing! One thing this allows me to do, which I couldn't before, is search for the repos that use some of my open source code.
While there were some tools for this, they fall short for older projects where using a library meant copy/pasting it into your project, which is not reported in the CDN stats, npm installs or GitHub "uses".
Now I can run a search with a bit of code that is only present in my library and reliably find those who copy/pasted it. While I publish my code under the MIT license, this would also be very useful for those publishing under the GPL to detect bad actors.
That was my first thought. I’ll have to wait until tomorrow to try it, but I have one super rarely used function in a rare package, and I’d love to see how other people are using it.
to grep specific repos locally, I use a tool called Hound, https://github.com/hound-search/hound developed by a couple of engineers at Etsy while I was there, but never released officially.
I built https://grephub.com for that. It doesn’t maintain an index so it’s not super snappy, but it’s good enough / better than you’d expect in many cases!
Why would you want to use this tool to grep individual repos? If you know the repo you're interested in, you can just clone it and then grep it locally...?
A tangent, my biggest gripe with GitHub code search (within a repo) off the top of my head is the inability to blacklist directories or only search whitelisted directories. Often times I want to look up the implementation of a function, and bam, three pages of results from tests.
I'm glad I'm not the only one. It's very common that I'll be searching for a keyword that only appears in the actual code a handful of times but hundreds of times in tests. GitHub's search is practically useless in those cases.
I almost always just resort to cloning and searching with ripgrep, which can be annoying if I have no other reason to have the codebase on my machine or it's just a one-off.
yeap .. having this issue as well, trying to easily find where a method is defined in JS/TS I'd so much want to be able to exclude `*.(spec|test).(js|ts|jsx|tsx)`
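For a locally cloned repo, the exclusion asked for above is easy to approximate. The following Python sketch assumes the files are already loaded into a `path -> text` mapping; `search_non_test` and `TEST_FILE` are hypothetical names, and the path pattern is taken from the comment above.

```python
import re

# Paths ending in .spec.* or .test.* with a JS/TS extension are treated as tests.
TEST_FILE = re.compile(r"\.(spec|test)\.(js|ts|jsx|tsx)$")


def search_non_test(files, pattern):
    """files: mapping of path -> source text. Returns (path, line_no, line) hits."""
    regex = re.compile(pattern)
    hits = []
    for path, text in files.items():
        if TEST_FILE.search(path):
            continue  # skip test and spec files entirely
        for line_no, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, line_no, line.strip()))
    return hits


files = {
    "src/user.ts": "export function getUser() {\n  return 1;\n}",
    "src/user.test.ts": "function getUser() {}",
    "src/user.spec.jsx": "function getUser() {}",
}
print(search_non_test(files, r"function\s+getUser"))
# [('src/user.ts', 1, 'export function getUser() {')]
```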
This is really cool! Awesome work. I assume you've seen https://sourcegraph.com/ as well? This to me seems much clearer and a bit more intuitive (though I've only spent a little time in sourcegraph). Really really cool. Does it also search code comments?
Obviously the source material is different (Debian packages vs GitHub repos) and grep.app also uses re2, but that is all I can see from a look at the “about” blurb.
Hey Dan, if you ever wanted to come on my podcast to talk about your tech stack (how your site is developed / deployed, lessons learned, etc.), I'd love to have you on.
I do not have a great example to try on my phone, but are results deduplicated? My big peeve with GitHub search is getting 5 pages of the same forked repo.
There isn't any deduplication, although that will hopefully be less of an issue at this point since there's a limited number of repositories in the index.
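One simple way deduplication of forked copies could work is to group hits by a hash of the file content. This is a hypothetical sketch, not something the site does (the reply above says there is no deduplication); `dedupe_hits` and the `(repo, path, content)` tuple shape are assumptions.

```python
import hashlib


def dedupe_hits(hits):
    """hits: iterable of (repo, path, content).

    Returns one entry per distinct file content: the first (repo, path) seen,
    plus the number of additional identical copies that were collapsed.
    """
    seen = {}
    for repo, path, content in hits:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        seen.setdefault(digest, []).append((repo, path))
    return [(locations[0], len(locations) - 1) for locations in seen.values()]


hits = [
    ("upstream/lib", "src/a.js", "function a() {}"),
    ("fork1/lib", "src/a.js", "function a() {}"),
    ("fork2/lib", "src/a.js", "function a() {}"),
    ("other/app", "main.js", "function b() {}"),
]
print(dedupe_hits(hits))
# [(('upstream/lib', 'src/a.js'), 2), (('other/app', 'main.js'), 0)]
```

Exact hashing only collapses byte-identical files; forks that have drifted from upstream would need a fuzzier comparison.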
GitHub confirmed to me that their search is not able to find in substrings; this is annoying because if you want to find all affected code among all possibly involved repositories, before a change, you need to clone them and grep locally. In the end this means you need to clone absolutely everything you work with, because otherwise you might miss changing that one repo you didn't think of:
I've used Sourcegraph and it was cool; will have a look at this new tool too. But, GitHub, pretty please add plain good old grep abilities to your search!
Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:
I would say this needs a list of indexed repos and, above all, an explanation of exactly how it works to be usable (how the index is built, how often it's refreshed, what types of files are indexed, etc.). Otherwise, there's not much value in searching unknown data, is there?
Anyway, to not only criticize, good job! It's definitely one of GitHub's missing features. And I can imagine it's not an easy job to build something like that. But as I wrote, it really has to be well explained to be actually usable.
> there's not much value in searching unknown data, is there?
So you know exactly how Google's index works?
I think "best effort", whatever it is, is useful even if I don't know the specifics of what it captures or misses. As long as it returns useful results.
Superb work. You built a better code search than GitHub's (with some of its features missing, sure) with far fewer resources. It shows how progress stagnates at big companies once a service is deemed "good enough". Good for you for kicking their butts and leading the way. I hope you get something more out of this than HN karma.
Really like the minimalistic design, not too designy but still easy on my eyes. Just the way I want it to let me focus on the task at hand
thank you so much for doing this! i hope it continues to open more doors of opportunities to you!
primo, this is a crazy snappy proof that shows that github search can be done. next, the UI is amazing. and finally, all my queries worked!
i am now going to remove "github search sucks" from my to-be-published rants because this post demonstrates that 1. people care 2. github was already working on it.
Can I search only additions/deletions? Recently when searching GitHub I wanted to find if anyone had replaced the usage of a deprecated method with the new one, because the docs for that library don't mention the non-deprecated method name.
My last name (Ament) is really rare where I come from, so I've used the tool to find other people with the same last name. Was not disappoint.
Thank you!
There's no need for personal attack. We ban accounts that do that, so please don't.
Cherry-picking one post from a statistical cloud and calling it typical is dodgy. Even the distribution in this thread doesn't match your description. Actually, even the comment you're picking on doesn't match your description.
Why does regex still exist? It is unintuitive, requires mastering an obscure syntax, is very hard to debug, and is very difficult to explain to others. It feels like we are trying to write intermediate code by ourselves, while we should have a human-readable language that generates regex.
However, Eggexes are a thin, mostly-syntactic layer over regexes. You still have to understand the regex engine to use them. If this sounds useless to you because you don’t currently understand any flavor of regex or parsing, I encourage you not to give up on learning regexes. (https://www.regular-expressions.info/ was how I learned; it’s a great tutorial.) Text-parsing engines, including regex engines, are a powerful concept that can be used in many situations, and I think it’s worth spending the effort learning them until, to paraphrase another commenter, regexes become the human-readable language you were searching for. Or Eggexes, at least.
The investment into learning regexes is worth it if you write or read enough of them. They become the human readable language you speak of, eventually. The question is where the threshold lies.
Do it! You will find that it's very easy, but the result will either be extremely verbose or just like regex. Since most regexes (at least for me) are meant for one-time use, the extra verbosity has no added benefit. If you have complex needs, you should probably be using something other than regex anyway.
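To make the trade-off above concrete, here is a small Python sketch of a "readable" layer that generates regexes. The helper names (`literal`, `one_or_more`, `optional`, `seq`) are invented for this illustration and do not come from any particular library.

```python
import re


def literal(text):
    """Match the text exactly, escaping any regex metacharacters."""
    return re.escape(text)


def one_or_more(pattern):
    return f"(?:{pattern})+"


def optional(pattern):
    return f"(?:{pattern})?"


def seq(*patterns):
    """Match the given patterns one after another."""
    return "".join(patterns)


DIGIT = r"\d"

# A version string such as "v1.20.3" or "1.20.3", built from named pieces.
version = seq(
    optional(literal("v")),
    one_or_more(DIGIT), literal("."),
    one_or_more(DIGIT), literal("."),
    one_or_more(DIGIT),
)

# The same thing written by hand.
handwritten = r"v?\d+\.\d+\.\d+"

for sample in ["v1.20.3", "1.20.3", "1.2", "version"]:
    print(sample, bool(re.fullmatch(version, sample)), bool(re.fullmatch(handwritten, sample)))
```

The builder takes eight lines to say what the handwritten pattern says in one, which is the verbosity the comment above predicts. The payoff, if any, comes when patterns are long-lived and reviewed by people who do not read regex fluently.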
Because it's really powerful, and some people actually like it (I'm one of them).
I can understand that a complex pattern might look scary if you're unfamiliar, but if you work with it long enough, you can put patterns together with relative ease.
jasoncwarner|6 years ago
@danfox, sent you an email though commenting here too.
I'm the CTO @ GitHub. Would love to talk to you about this and other things we are building in this area at GitHub.
Feel free to email direct to jason at github.com
sovietmudkipz|6 years ago
latenightcoding|6 years ago
sixwing|6 years ago
@danfox, i'm always down to talk code search as well - rand@github.com
glangdale|6 years ago
WhiteOwlLion|6 years ago
simonw|6 years ago
danfox|6 years ago
heipei|6 years ago
dang|6 years ago
londons_explore|6 years ago
[1]: https://cs.chromium.org/
thanatos_dem|6 years ago
Already has been publicly contacted by:
- GitHub CTO
- SerpApi CEO
- SourceGraph CEO
Search is hot right now!
swat535|6 years ago
nonbirithm|6 years ago
sdesol|6 years ago
Existenceblinks|6 years ago
neonate|6 years ago
sqs|6 years ago
edwinyzh|6 years ago
franciscop|6 years ago
danielecook|6 years ago
SlowRobotAhead|6 years ago
hoorayimhelping|6 years ago
glouwbug|6 years ago
Can it grep on individual repos?
hcm|6 years ago
funklute|6 years ago
oefrha|6 years ago
Noctem|6 years ago
cynicalreason|6 years ago
patrickdevivo|6 years ago
edwinyzh|6 years ago
hartator|6 years ago
I am the CEO at SerpApi. If you need a job, shoot me an email at julien _at_ serpapi.com.
fanf2|6 years ago
sciurus|6 years ago
https://searchfox.org/
https://github.com/bgrins/searchfox
nickjj|6 years ago
That podcast is at: https://runninginproduction.com/, drop me a line at nick.janetakis@gmail.com if you're interested.
edwinyzh|6 years ago
lol768|6 years ago
danfox|6 years ago
TACIXAT|6 years ago
danfox|6 years ago
aodj|6 years ago
j1elo|6 years ago
https://stackoverflow.com/questions/43891605/search-partial-...
w-m|6 years ago
Something I found when testing the regexp: the highlights seem to be off sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing that came to mind to try out the regexp), the second highlight in the first result seems to be in the wrong location:
https://grep.app/search?q=%3C.%2A%3F%40gmail.com%3E&regexp=t...
https://imgur.com/a/VyUXhcF
sn4pp|6 years ago
api_key="[a-z0-9]+"
Ty
bananaeater|6 years ago
ferenczy|6 years ago
clarry|6 years ago
tekkk|6 years ago
jakear|6 years ago
tyingq|6 years ago
dabei|6 years ago
edwinyzh|6 years ago
blackandblue|6 years ago
mrkramer|6 years ago
Backend for codegrep was Play framework + Elasticsearch and you could search by programming languages.
Screenshot: http://archive.is/0mFML
edwinyzh|6 years ago
enriquto|6 years ago
justanotheratom|6 years ago
welder|6 years ago
yuz|6 years ago
danfox|6 years ago
cddotdotslash|6 years ago
danfox|6 years ago
inetknght|6 years ago
But it's actually pretty #neat. It's all tidied up into a single app without any dependencies.
This rocks and, so far, seems way way WAY better than Github's own search tool.
bilekas|6 years ago
https://shhgit.darkport.co.uk/
Existenceblinks|6 years ago
I got a tooltip saying:
Error: JSON.parse: unexpected character at line 1 column 1 of the JSON data
Update: Oh, ^(.)"(.)"(.)$ works, and it's fast.
danfox|6 years ago
stagas|6 years ago
AdrianEGraphene|6 years ago
edwinyzh|6 years ago
OutsmartDan|6 years ago
thrownaway954|6 years ago
KhoomeiK|6 years ago
chasers|6 years ago
doubleorseven|6 years ago
mtnGoat|6 years ago
habit20|6 years ago
sonicxxg|6 years ago
[deleted]
dang|6 years ago
We detached this subthread from https://news.ycombinator.com/item?id=22397156.
dbielik|6 years ago
[deleted]
appleflaxen|6 years ago
I hope u/dang sees your comment history; you are basically just spamming nerdydata.com
whatever1|6 years ago
roryokane|6 years ago
GrantZvolsky|6 years ago
frabert|6 years ago
tyingq|6 years ago
Is there an alternative that is clearly superior?
GordonS|6 years ago