

thanatos_dem | 6 years ago

If this were offered by an actual company (a first party solution), some of the features people would expect make the problem space a lot harder. Here's an "intro to search" article that's a good read, and I'll use it to highlight some of the things that would be different in a first party solution - https://medium.com/startup-grind/what-every-software-enginee...

(See the "Theory: the search problem" section)

Size: This is only indexing ~500k public repos. A first party solution would be expected to index every repo, public and private.

Indexing speed: This can take up to a few days to index. A first party solution would be expected to have a much lower index latency - seconds to minutes.

Query language: This can (and does) have its own simple query language. A first party solution would need to be built into the current query language without breaking backwards compatibility with it.

Context-dependence: A first party solution would be expected to index private repos as well, and now the query context (logged in user) becomes another variable in an already multi-variate problem space.

Latency: Gets harder with scale, and a first party solution would likely provide an SLA/SLO around latency.

Access control: Same issue as context-dependence, with private repos being included.
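
To make the context-dependence and access control points concrete, here is a minimal sketch of why the logged in user becomes part of every query. Everything in it (the `Doc` and `Index` names, the `readers` map, the substring match) is a hypothetical illustration, not how any real code search service is built; a real system would likely push the permission check into the index itself rather than filtering a full scan.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Doc:
    repo: str
    path: str
    text: str


@dataclass
class Index:
    """Toy in-memory index; all names here are hypothetical."""
    docs: list[Doc] = field(default_factory=list)
    private_repos: set[str] = field(default_factory=set)
    # repo name -> users allowed to read it (only consulted for private repos)
    readers: dict[str, set[str]] = field(default_factory=dict)

    def can_read(self, user: Optional[str], repo: str) -> bool:
        # Public repos are visible to everyone, including anonymous users.
        if repo not in self.private_repos:
            return True
        return user is not None and user in self.readers.get(repo, set())

    def search(self, term: str, user: Optional[str] = None) -> list[Doc]:
        # The same term can return different results for different users,
        # so results cannot be cached by term alone.
        return [
            d for d in self.docs
            if term in d.text and self.can_read(user, d.repo)
        ]
```

With only public repos the `user` argument disappears and results can be cached per query; once private repos are indexed, every lookup has to carry the caller's identity.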

There are also unknown but likely considerations around compliance and internationalization, which are quite tricky problems.

Note - I don't mean for this to be critical of the author at all. This is an awesome and useful tool, with a fantastic UX. I just want to make it clear that search at scale is a lot harder than it seems at first glance, especially as the feature requirements increase.


fjania|6 years ago

Engineering manager for code search at GitHub here... this is an excellent summary of many of the concerns we have as we work on code search at GitHub scale!

sdesol|6 years ago

For GitHub, I would have to imagine that only being able to search public repos with regexp would be good enough. GitHub has many strategies, but the main one is that they want to maintain, if not expand, their open source mind share.

The more reasons you give people to go to GitHub, the better off they will be in the future. So I do agree with you that as a commercial solution, this may not be viable, but for GitHub's public repos, this can turn into a very positive thing.

marceloabsousa|6 years ago

That might well be true, but scaling this type of service to all public repos with decent latency and update rate is a major technical challenge, and likely very costly to maintain.