mosseater | 3 years ago
I've done this method a lot. Honestly, scraping Google Reviews was the most difficult in terms of complexity. This was 6 or 7 years ago. You would get back these huge nested arrays that were mostly 0s. Occasionally a value would be set, and that's what I would go with. I'm assuming their internal tools were obfuscated and/or using protobuf. It certainly took me back to the good ol' days of hex-editing games to make your own cheat codes.
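Picking out the occasional set value from those mostly-zero nested arrays is easy to do mechanically. A minimal sketch (the sample `data` array is made up, not real Google Reviews output) that walks the structure and records the path to every non-default value:

```python
def find_set_values(node, path=()):
    """Recursively collect (path, value) pairs for anything that
    isn't a default placeholder (0, None, empty string)."""
    results = []
    if isinstance(node, list):
        for i, child in enumerate(node):
            results.extend(find_set_values(child, path + (i,)))
    elif node not in (0, None, ""):
        results.append((path, node))
    return results

# hypothetical response shape: mostly zeros, a rating and an ID buried inside
data = [0, [0, 0, ["4.5", 0]], 0, [0, [0, 1234]]]
print(find_set_values(data))
# → [((1, 2, 0), '4.5'), ((3, 1, 1), 1234)]
```

Once you know which paths carry real data across many responses, you can hardcode those indices and ignore the rest of the array.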
Another difficulty I faced was sites that relied on the previous UI state before they would accept the API call. You'd have to emulate "real" browsing by requesting the intermediate pages and extracting the ID number from each one. Still much faster than emulating the whole browser via Selenium.
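The chained-request pattern described above looks roughly like this. A sketch with a stubbed `fetch` standing in for real HTTP responses (the `data-next-id` attribute and the page contents are hypothetical):

```python
import re

# Stub for real HTTP GETs: each "page" embeds the ID needed for the next one.
PAGES = {
    "start": '<div data-next-id="a1f9">page 1</div>',
    "a1f9":  '<div data-next-id="c3e2">page 2</div>',
    "c3e2":  '<div>page 3 (no further pages)</div>',
}

def fetch(token):
    return PAGES[token]

def crawl(start="start"):
    """Follow the chain of page IDs until a page has no successor."""
    pages, token = [], start
    while token is not None:
        html = fetch(token)
        pages.append(html)
        m = re.search(r'data-next-id="([^"]+)"', html)
        token = m.group(1) if m else None
    return pages

print(len(crawl()))  # → 3
```

With a real site you'd replace `fetch` with an HTTP client call, but the loop structure is the same: each request's parameters come from the previous response.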
Honestly, it was the small sites that proved more troublesome, the ones with an actual admin reading logs. They would ban our whole IP block, then ban our whole proxy IP block. At one point I implemented Tor support in our scraper for a particularly valuable but small site, and they blocked that too. That site ended up implementing ludicrous rate limiting that had normal users waiting 2-3 seconds between requests, all because we were scraping their data. I kid you not, by the time we gave up trying, this Section 8 rental site for a small city had vastly more protections in place than Zillow and Apartments.com combined.
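That kind of blunt per-client throttling is simple to build server-side, which is probably why a one-admin site could deploy it. A minimal sketch (class name and 2.5 s interval are illustrative, not the site's actual implementation):

```python
import time

class MinIntervalLimiter:
    """Allow at most one request per `interval` seconds per client key
    (e.g. an IP address). Everything else gets rejected."""

    def __init__(self, interval, clock=time.monotonic):
        self.interval = interval
        self.clock = clock          # injectable for testing
        self.last_seen = {}         # key -> timestamp of last allowed request

    def allow(self, key):
        now = self.clock()
        last = self.last_seen.get(key)
        if last is not None and now - last < self.interval:
            return False            # too soon: make the client wait
        self.last_seen[key] = now
        return True

limiter = MinIntervalLimiter(interval=2.5)
print(limiter.allow("203.0.113.7"))  # → True
print(limiter.allow("203.0.113.7"))  # → False (must wait ~2.5 s)
```

The downside, as the parent comment notes, is that legitimate users pay the same 2-3 second tax as the scrapers.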