rw | 8 years ago

This uses LDA only on the descriptions of your starred repositories, to find topic terms that describe them. These topic terms are then used to query the GitHub search API to find matching repositories. The results are then sorted by star count.
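A minimal sketch of the last two steps as I understand them, not the project's actual code. It assumes the topic terms have already been extracted by LDA. The helper names `build_search_url` and `rank_by_stars` and the sample data are made up for illustration. The `q`, `sort`, `order`, and `per_page` parameters and the `stargazers_count` field come from GitHub's repository search endpoint.

```python
from urllib.parse import urlencode


def build_search_url(topic_terms, per_page=10):
    """Build a GitHub repository-search URL from a list of topic terms.

    Hypothetical helper: the terms are simply joined into one query string.
    """
    params = {
        "q": " ".join(topic_terms),
        "sort": "stars",   # ask the API to sort by star count
        "order": "desc",
        "per_page": per_page,
    }
    return "https://api.github.com/search/repositories?" + urlencode(params)


def rank_by_stars(items):
    """Sort search results (the "items" list of the response) by stars, descending."""
    return sorted(items, key=lambda repo: repo["stargazers_count"], reverse=True)


# Made-up topic terms and results, only to show the shape of the data.
url = build_search_url(["topic", "model", "python"])
fake_items = [
    {"full_name": "example/a", "stargazers_count": 12},
    {"full_name": "example/b", "stargazers_count": 340},
]
ranked = rank_by_stars(fake_items)
```

Sorting again on the client only matters if results from several topic queries are merged; a single query with `sort=stars` already comes back ordered.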

That is a clever way to make use of a search API like GitHub's. The principled way to do this, though, is to run LDA over all descriptions on GitHub, then use the resulting topic vectors as a similarity index to find similar repositories. You could run LDA over code, too.
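To make "similarity index" concrete, here is a rough sketch that assumes each repository already has a topic distribution from an LDA model trained elsewhere. The repository names and vectors are invented, and cosine similarity is just one reasonable choice of measure (Hellinger or Jensen-Shannon distance are also common for topic distributions).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(query_vec, index, k=3):
    """Return the k (name, score) pairs in the index closest to query_vec."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical index: repository name -> topic distribution over 3 topics.
index = {
    "repo-a": [0.7, 0.2, 0.1],
    "repo-b": [0.1, 0.8, 0.1],
    "repo-c": [0.1, 0.1, 0.8],
}

# A user's profile could be, for example, the mean topic vector of their stars.
user_profile = [0.8, 0.1, 0.1]
top = most_similar(user_profile, index, k=2)
```

A linear scan like this is fine for a toy index; at GitHub scale you would want an approximate nearest-neighbour structure instead.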

I'll note that there is a cold start problem with this implementation: using LDA on such a small set of short documents will often lead to uninformative topics with words that are too specific. You need a big corpus to capture e.g. synonym relationships.

painted|8 years ago

Your point is quite interesting, although I'm not sure running LDA on the entire code would be useful. I spent half a year writing my postgraduate thesis on a recommender system for streaming services based on LDA; in particular, we wanted to infer who is watching what and when in a shared account. From all the tests I did with LDA, I believe the best thing would be to run it on the README files.

rw|8 years ago

Good idea, the READMEs would be best of all.

c5urf3r|8 years ago

That's right, but one additional level of depth in fetching repositories increases the API latency so much that it becomes almost unusable for a web app, at least for a hobby web app :P Hence the shortcuts. Open to ideas and suggestions.

rw|8 years ago

As I said, your approach is a clever way to use the GitHub API. I think you need to change the title and readme to indicate that this isn't an LDA index of GitHub descriptions. To ML practitioners, that's what you are implying with a title of "Show HN: Using LDA to suggest GitHub repositories based on what you have starred".

nl|8 years ago

Lab41 has done work on code recommendations by using a word2vec representation on the code itself.

c5urf3r|8 years ago

Links please?