top | item 46665803


Zanfa | 1 month ago

I worked at an influencer marketing company a while ago, back when Instagram still allowed pretty much complete access through their API. As we were indexing the entire Instagram universe for our internal tooling, we had a graph traversal setup to crawl Instagram profiles, then each of their followers, and so on. We had to keep track of visited profiles to avoid loops, and ran an Apache Storm cluster for the entire scraping pipeline. It worked, but it was cumbersome to work with and monitor, and it couldn’t reach our desired throughput.

Given there were about a billion IG profiles total at the time, I just replaced the entire setup with a single Go script that iterated from 1 to a billion and tried to scrape every id in between. That gave us 10k requests per second on a single machine, which was more than enough.
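The rewrite described above — a flat loop over the numeric ID space fanned out to a pool of concurrent workers, with no visited-set or graph state at all — can be sketched roughly like this. All names here are illustrative, and `scrapeProfile` is a stub standing in for the real HTTP call; the actual script would have handled errors, 404s for deleted IDs, and rate limiting:

```go
// Sketch: enumerate numeric profile IDs linearly and fan them out
// to a bounded pool of workers, instead of graph traversal with a
// visited set. Hypothetical reconstruction, not the original code.
package main

import (
	"fmt"
	"sync"
)

// scrapeProfile is a stand-in for the real HTTP request to the
// (long-gone) Instagram API; here it just formats the id.
func scrapeProfile(id int64) string {
	return fmt.Sprintf("profile %d", id)
}

func main() {
	const (
		maxID   = 20 // ~1e9 in the real run
		workers = 4  // sized for target throughput (~10k req/s in the anecdote)
	)

	ids := make(chan int64)
	var wg sync.WaitGroup

	// Worker pool: each goroutine pulls ids off the channel until it closes.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for id := range ids {
				_ = scrapeProfile(id) // real code: issue request, handle 404/429
			}
		}()
	}

	// Producer: no dedup needed, since ids are visited exactly once by construction.
	for id := int64(1); id <= maxID; id++ {
		ids <- id
	}
	close(ids)
	wg.Wait()
	fmt.Println("done")
}
```

The appeal of this shape is that every ID is visited exactly once by construction, so the visited-set bookkeeping and the Storm cluster both disappear; throughput is then just a matter of the worker count and what the remote end will tolerate.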


anon_cow1111 | 1 month ago

>an influencer marketing company

I really, really, really wish this sequence of words did not exist in modern society.

/my unsubstantiated reddit-tier comment, which I'm only posting because I'm sure someone will piggyback off it with something related and actually insightful.

sjducb | 1 month ago

People forget that a billion rows isn’t big data anymore.

nojs | 1 month ago

How long ago? I’m surprised you got anywhere near their servers with 10k requests per second from a single machine.