
diegoveralli | 4 years ago

I scrape all my bookmarks, github / gitlab stars, favorite tweets, etc... regularly, and push that to an elasticsearch instance, so that I can do full text searches later.

This is the only reliable way for me to find something I have a vague recollection of, which happens quite often. I used to just type whatever I could recall into Google search and usually find it, but that doesn't work any more. It's not just the SEO spam: the bias towards recent content seems to have increased, and so has the sheer pace of content generation. The haystack is much bigger than it used to be.
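Not the commenter's actual pipeline, but the indexing step can be sketched like this: each scraped item is turned into the newline-delimited JSON that Elasticsearch's `_bulk` endpoint expects. The field names and index name here are illustrative assumptions, not taken from the linked project.

```python
import json

def to_bulk_actions(items, index="bookmarks"):
    """Convert scraped items into Elasticsearch bulk-API payload lines.

    Each item becomes two NDJSON lines: an action line and the document
    itself, which is the format the _bulk endpoint consumes. Using the
    URL as _id makes repeated scrapes idempotent (re-indexing overwrites
    instead of duplicating).
    """
    lines = []
    for item in items:
        lines.append(json.dumps({"index": {"_index": index, "_id": item["url"]}}))
        lines.append(json.dumps({
            "url": item["url"],
            "title": item.get("title", ""),
            "body": item.get("body", ""),
            "source": item.get("source", "bookmark"),
        }))
    return "\n".join(lines) + "\n"
```

The resulting payload would be POSTed to `/_bulk` with `Content-Type: application/x-ndjson`; a `match` query against `body` then gives the full-text lookup described above.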


skinnymuch | 4 years ago

Any public code of the scraping? I’ve been meaning to do something similar.

diegoveralli | 4 years ago

The code is here: https://github.com/diegov/searchbox

But don't let its attempt at a friendly README fool you: it's personal-grade, works-on-my-machine (and barely, at that) software.

If you want something more polished, I think https://github.com/amirgamil/apollo and https://github.com/thesephist/monocle are worth a look.

Also, there's the https://github.com/karlicoss/HPI library, which you could build on, though it mainly relies on data dumps from the different services rather than crawling and fetching through APIs, which is why I didn't use it. Keeping up with API changes is bad enough; I don't want to deal with undocumented dump formats...
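The API-crawling approach the commenter prefers usually boils down to walking paginated endpoints (GitHub's starred-repos listing works this way). A minimal sketch, with the page fetcher injected so the loop stays testable without network access; `fetch_page` is a hypothetical callback, not a real client API:

```python
def fetch_all_pages(fetch_page, max_pages=1000):
    """Collect every item from a paginated API endpoint.

    fetch_page(page) should return the list of items on that page,
    and an empty list once past the end. max_pages is a safety cap
    against endpoints that never return an empty page.
    """
    items = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:
            break
        items.extend(batch)
    return items
```

A real fetcher would wrap an HTTP call with authentication and rate-limit handling; the pagination loop itself stays the same.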