top | item 29060272

Avoiding bot detection: How to scrape the web without getting blocked?

586 points | proszkinasenne2 | 4 years ago | github.com | reply

298 comments

[+] bsamuels|4 years ago|reply
> I need to make a general remark to people who are evaluating (and/or) planning to introduce anti-bot software on their websites. Anti-bot software is nonsense. Its snake oil sold to people without technical knowledge for heavy bucks.

If this guy got to experience how systemically bad the credential stuffing problem is, he'd probably take down the whole repository.

None of these anti-bot providers give a shit about invading your privacy, tracking your every movement, or whatever other power fantasy can be imagined. Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts; they do it to stop credential stuffing.

[+] Mister_Snuggles|4 years ago|reply
I wish they'd limit it to just stopping credential stuffing.

Here's my scenario: My electricity provider publishes the month's electricity rates on the first of the month, I want to scrape these so that I can update the prices in Home Assistant. This is a very simple task, and it's something that Home Assistant can do with a little configuration. Unfortunately this worked exactly once, after that it started serving up some JavaScript to check my browser.

The information I'm trying to get is public and can be accessed without any kind of authentication. I'm willing to bet that they flipped the anti-bot stuff on their load balancer on for the entire site instead of doing the extra work to only enable it for just electricitycompany.com/myaccount/ (where you do have to log in).

I also asked the company if they'd be willing/able to push the power rates out via the smart meters so that my interface box (Eagle-200) could pick it up, they said they have no plans to do so.

The next step is to scrape the web site for the provincial power regulator, which shows the power rates for each provider. Of course, the regulator's site has different issues (rounding, in particular), I haven't dug any further to see if I can make use of this.

All of this effort to get public information in an automated fashion.

[+] jameshart|4 years ago|reply
Bots aren't just trying credential stuffing. They are:

- committing clickfraud to game ad and referral revenue systems

- posting fake or spam reviews and comments

- generating fake behavioral signals to bypass CAPTCHAs and create accounts on other sites that can post spam comments

- validating stolen credit card details

- screwing with your metrics collection if you can't identify them as bots

All of that is enough reason for sites to use bot detection and blocking technology. The fact that the same tech also has some utility against accidental or malicious traffic-based DoS is also a bonus.

[+] oxymoron|4 years ago|reply
Yeah, I used to work for one of the major anti-bot vendors. Customers weren't clueless. Nobody buys these solutions because they're so much fun; it's a cost center and they monitor their ROI quite closely. Credit card chargebacks, impact to infrastructure, extra incurred cost from underlying APIs (in the airline industry in particular), etc. are all reasons why bot mitigation is a better option than nothing for a lot of companies, even if it's not 100% effective.
[+] ivanhoe|4 years ago|reply
Saying that anti-bot software is nonsense is like saying that door locks are snake oil too. We've all seen the Lockpicking Lawyer on YouTube open any lock out there with ease, so how come we haven't all been robbed yet?

Well, because protection is not a binary thing, either 100% safe or 100% not working; it's a proportion between the skill/effort/time needed to break in and the reward you get for it.

To stop the majority of attacks you don't have to be absolutely unbreakable, you just need to make it hard enough for the majority of attackers that it doesn't pay off compared to the value of the data you're protecting. And that's where anti-bot software has its place: it slows down spiders and generic attacks, forcing custom-tailored scraping that has to be constantly fine-tuned, plus infrastructure to hide your IPs, and that makes the operation far more expensive and harder to run continuously...

[+] melony|4 years ago|reply
The gold standard is residential IP. It is not cheap but its effectiveness is indisputable.
[+] hattmall|4 years ago|reply
I've always thought credential stuffing and most password-guessing attempts could be defeated by simply logging attackers into randomly generated dummy accounts when the password is wrong. Just make it so the same username/password combo always leads to the same random info. Real users would notice something was wrong immediately, but bots would have no way to tell unless they already knew some of the real information.
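A minimal sketch of this idea, with a hypothetical server-side key and made-up field lists: derive the dummy profile deterministically from the failed username/password pair, so the same wrong combo always "logs in" to the same fake account, with no per-account state to store.

```python
import hashlib
import hmac

SERVER_SECRET = b"rotate-me-regularly"   # hypothetical server-side key
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Casey"]
CITIES = ["Springfield", "Riverton", "Lakeview", "Hillsboro"]

def fake_profile(username: str, password: str) -> dict:
    """Derive a dummy profile deterministically from the failed
    username/password pair: the same wrong combo always lands in
    the same fake account, so a bot can't tell it missed."""
    digest = hmac.new(SERVER_SECRET,
                      f"{username}:{password}".encode(),
                      hashlib.sha256).digest()
    return {
        "name": FIRST_NAMES[digest[0] % len(FIRST_NAMES)],
        "city": CITIES[digest[1] % len(CITIES)],
        "balance_cents": int.from_bytes(digest[2:4], "big"),
    }
```

Because the profile is an HMAC of the pair, an attacker can't distinguish a fake login from a real one without already knowing real account details.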
[+] sparkling|4 years ago|reply
> Anti-bot software is nonsense. Its snake oil sold to people without technical knowledge for heavy bucks.

I disagree. Obviously there is no way to stop scraping 100%, but for a rather small amount of money you can implement measures that make it harder. Services like https://focsec.com/ offer ways to detect web scrapers using proxies/VPNs (one of the most common techniques) for little money.

> Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.

Keep in mind that they may be legally or contractually forced to do this. Think of Netflix who are investing heavily into their Anti-VPN capabilities, most likely because they have contracts with content publishers & studios that force them to do so.

[+] devit|4 years ago|reply
If weak/reused user passwords are your problem, just don't let users choose a password (generate it for them), don't use passwords at all (send a link by e-mail that sets a cookie), or use OAuth login.
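A rough sketch of the emailed-link approach, using only the standard library; the secret and token format here are illustrative assumptions, not any particular library's API.

```python
import base64
import hashlib
import hmac
import time

SECRET = b"server-side-signing-key"   # hypothetical key

def make_login_token(email, now=None):
    """Sign email + timestamp into an opaque token for an emailed login link."""
    ts = str(int(time.time() if now is None else now))
    payload = f"{email}|{ts}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def verify_login_token(token, max_age=900, now=None):
    """Return the email if the token is authentic and fresh, else None."""
    try:
        b64, sig = token.rsplit(".", 1)
        payload = base64.urlsafe_b64decode(b64.encode())
        email, ts = payload.decode().rsplit("|", 1)
        issued = int(ts)
    except (ValueError, UnicodeDecodeError):
        return None
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None   # tampered or forged
    current = time.time() if now is None else now
    if current - issued > max_age:
        return None   # link expired
    return email
```

On visiting the link the server verifies the token and sets a session cookie, so no password ever exists to be stuffed.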
[+] jjav|4 years ago|reply
> None of these anti-bot providers give a shit about invading your privacy, tracking your every movements, or whatever other power fantasy that can be imagined.

There is a vast amount of profit available in doing just that (see e.g. GOOG and FB market caps). Even companies that truly have no intention of exploiting data collected as a side effect of whatever product they sell nearly always end up going after that profit eventually, because passing up extra income merely on moral grounds is too much of a temptation for a company to resist in the long term.

[+] chucksmash|4 years ago|reply
The credential stuffing wiki page didn't exist the last time I thought about invalid traffic so I'm pretty out of date.

How is there not an equilibrium here that cuts off credential stuffers? I'd naively imagine the residential-IP providers keep their own measure of bad actors to determine whether a client is worth it, and that someone getting all your IPs blacklisted would get dropped pretty quickly.

[+] astatine|4 years ago|reply
On a site I used to run, there was no content which needed protection, so it was not much of a pain except for a lot of bot-filled contact forms. Eventually the problem became severe enough that bandwidth fees started to be an issue, and I finally had to put Cloudflare in front to reduce bandwidth usage. It worked, but the side effect is that some valid users may now get blocked.
[+] nextaccountic|4 years ago|reply
> Nobody pays those vendors $10m/year to frustrate web crawler enthusiasts, they do it to stop credential stuffing.

I don't know about $10m/year, but many sites block bots just because they don't want competitors to access publicly available data. Which is bullshit.

[+] krageon|4 years ago|reply
> he'd probably take down the whole repository.

I know how bad this issue is, and I wouldn't take down this repository. Anti-bot software does not work; anyone who pays $10m per year for it simply has too much money.

[+] mindslight|4 years ago|reply
If your password DB is so broken that it's useful to coin a term for the attacks ("credential stuffing"), then the right answer is to actually fix that security (e.g. pick users' passwords for them, or replace passwords entirely with email auth), rather than thinking you're raising the bar by requiring attackers to come from a residential IP.
[+] Gigachad|4 years ago|reply
2FA should be a requirement on everything now. And if your site can't for some reason or you don't want to deal with it, then limit your site to external login providers only.

2FA, especially app based, has been proven to work really really well.
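App-based 2FA is usually TOTP (RFC 6238), which is small enough to sketch with the standard library; a real deployment would use a per-user secret shared via QR code rather than anything hard-coded.

```python
import hashlib
import hmac
import struct
import time

def totp(secret: bytes, for_time=None, step=30, digits=6) -> str:
    """RFC 6238 TOTP: HMAC-SHA1 over the time-step counter,
    dynamically truncated to a short decimal code."""
    t = time.time() if for_time is None else for_time
    counter = struct.pack(">Q", int(t // step))       # 8-byte big-endian counter
    digest = hmac.new(secret, counter, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F                        # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)
```

Both sides compute the same code from the shared secret and the current 30-second window, so a stuffed password alone is useless.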

[+] 5faulker|4 years ago|reply
Same thing goes with ad blocking to a similar degree.
[+] mdoms|4 years ago|reply
If it was just about credential stuffing they would only put limits on POST requests.
[+] ChuckMcM|4 years ago|reply
I am always amazed when otherwise intelligent people assert without data that the marginal cost of serving web traffic to scrapers/bots is zero. It is kind of like people who say "Why don't they put more fuel in the rocket so it can get all the way into orbit with just one stage?"

It sounds great but it is a completely ignorant thing to say.

[+] kodah|4 years ago|reply
When I worked in e-commerce as a SRE, bots were doing two things:

- trying to disrupt business processes (eg: false referral listings, gift card scams, etc)

- trying to disrupt systems

I'm sure there are folks who use bots and scrapers for home automation, but these users generate marginal traffic in comparison. The real cost, aside from successfully achieving the points above, is the bandwidth and hardware costs that become overhead. Bots are usually coded with retry mechanisms and ways to change connection criteria on subsequent retries.

[+] Aperocky|4 years ago|reply
Anyone who runs even a minor website knows that the majority of the traffic is bots.

Imagine if the goal is images and videos: now you've got yourself some heavy-duty scraping that could cost the website owner a lot in data fees.

[+] ohyeshedid|4 years ago|reply
Seemingly, most of those people don't have a realistic concept of scale.
[+] matheusmoreira|4 years ago|reply
What could my scraper which makes 1 HTTP request per day possibly cost the webmaster?
[+] ufmace|4 years ago|reply
What I really enjoy about this thread is all of the completely different perspectives. Lots of people doing anti-abuse research bemoaning that this stuff exists, and lots of people working against what are, from their perspective, ham-handed anti-abuse measures blocking legitimate useful automation, trading tips on how to do it better. I guess we don't see much of the other sides of those: people doing actual black-hat work probably don't post about it on public forums, and most of the over-broad anti-abuse is probably a side effect of taking some anti-abuse tech and blindly applying it to the whole site because that's simpler; often no technical people are really involved at all.
[+] marginalia_nu|4 years ago|reply
If someone is signalling to you that they do not want your bot on their site, then maybe respect that? Trying to circumvent it is, besides being legally questionable, a serious pain in the ass for the site owner, and it makes websites more prone to block bots in general.

Also, in my experience, most websites that block your bot do so because your bot is too aggressive, or because you are fetching some expensive resource that bots in general refuse to lay off. Bots with seconds between requests rarely get blocked, even by CDNs.
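The "seconds between requests" discipline can be sketched as a tiny rate limiter; this is an illustrative helper, not from the original post, and the `now`/`sleep` parameters exist only to make the pacing testable.

```python
import time

class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(self, min_interval=2.0):
        self.min_interval = min_interval
        self._next_free = None   # earliest moment the next request may go out

    def wait(self, now=None, sleep=time.sleep):
        """Block just long enough to keep min_interval seconds between
        requests; returns the number of seconds actually slept."""
        current = time.monotonic() if now is None else now
        pause = 0.0
        if self._next_free is not None and current < self._next_free:
            pause = self._next_free - current
            sleep(pause)
        self._next_free = current + pause + self.min_interval
        return pause
```

Calling `limiter.wait()` before each fetch keeps the bot under the pace that, per the comment above, rarely gets blocked.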

[+] al2o3cr|4 years ago|reply

    You use this software at your own risk. Some of them contain malwares just fyi
LOL why post LINKS to them then? Flat-out irresponsible...

    you build a tool to automate social media accounts to manage ads more efficiently
If by "manage" you mean "commit click fraud"
[+] abadger9|4 years ago|reply
I'm a lead engineer on the search team of a publicly traded company whose bread and butter is this domain. I was curious about this list, and candidly it misses the mark: the tech mentioned here is what you might get if you hired a competent consultant to build out a service without domain knowledge. In my experience, what's being used on the bleeding edge is two steps ahead of this.
[+] curun1r|4 years ago|reply
There’s one technique that can be very useful in some circumstances that isn’t mentioned. Put simply, some sites try to block all bots except for those from the major search engines. They don’t want their content scraped, but they want the traffic that comes from search. In those cases, it’s often possible to scrape the search engines instead using specialized queries designed to get the content you want into the blurb for each search result.

This kind of indirect scraping can be useful for getting almost all the information you want from sites like LinkedIn that do aggressive scraping detection.
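As a sketch of the idea: a query builder that restricts results to one site and quotes the text you want surfaced in the result snippet. The endpoint URL is a placeholder, not a real search API, and real engines rate-limit and block automated querying like this.

```python
from urllib.parse import urlencode

def blurb_query(site: str, wanted_text: str) -> str:
    """Build a search URL scoped to one site, quoting the text we want
    so it is likely to appear in the result snippet ('blurb'); the data
    can then be read from the results page without visiting the site."""
    query = f'site:{site} "{wanted_text}"'
    return "https://search.example.com/search?" + urlencode({"q": query})
```

Scraping the engine's results page then yields the blurbs without ever touching the target site's bot detection.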

[+] amelius|4 years ago|reply
But won't the search engines block you after some limit has been reached?
[+] janmo|4 years ago|reply
Or you can spoof the Googlebot or Bingbot user agent and try to scrape the site that way.
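A minimal sketch of the spoof with the stdlib, with one caveat worth adding: careful sites verify a claimed Googlebot by reverse-DNS on the source IP (it must resolve under googlebot.com), so this only fools naive user-agent checks.

```python
import urllib.request

GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")

def googlebot_request(url: str) -> urllib.request.Request:
    """Prepare a request that claims to be Googlebot; pass the result
    to urllib.request.urlopen() to actually fetch it."""
    return urllib.request.Request(url, headers={"User-Agent": GOOGLEBOT_UA})
```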
[+] rp1|4 years ago|reply
It's very easy to install Chrome on a Linux box and launch it with a whitelisted extension. You can run Xorg with the dummy driver and get a full Chrome instance (i.e. not headless). You can even enable the DevTools API programmatically. I don't see how this would be detectable, and it's probably a lot safer than downloading a random browser package from an unknown developer.
[+] walrus01|4 years ago|reply
Google "residential proxies for sale" if you want to see the weird shady grey market for proxies when you need your traffic to come from things like cablemodem operator ASNs' DHCP pools
[+] welanes|4 years ago|reply
Another great resource is incolumitas.com. A list of detection methods is here: https://bot.incolumitas.com/

I run a no-code web scraper (https://simplescraper.io) and we test against these.

Having scraped millions of webpages, I find dynamic CSS selectors a bigger time sink than most of the anti-scraping tech I've encountered so far (if your goal is to extract structured data).

[+] peterburkimsher|4 years ago|reply
2 of my social media accounts have fallen victim to bot detection, despite not using scripts. There are other websites for which I have used scripts, and sometimes ran into CAPTCHA restrictions, but was able to adjust the rate to stay within limits.

CouchSurfing blocked me after I manually searched for the number of active hosts in each country (191 searches), and posted the results on Facebook. Basically I questioned their claim that they have 15 million users - although that may be their total number of registered accounts, the real number of users is about 350k. They didn't like that I said that (on Facebook) so they banned my CouchSurfing account. They refused to give a reason, but it was a month after gathering the data, so I know that it was retaliation for publication.

LinkedIn blocked me 10 days ago, and I'm still trying to appeal to get my account back.

A colleague was leaving, and his manager asked me to ask people around the company to sign his leaving card. Rather than go to 197 people directly, I intentionally wanted to target those who could also help with the software language translation project (my actual work). So I read the list of names, cut it down to 70 "international" people, and started searching for their names on Google. Then I clicked on the first result, usually LinkedIn or Facebook.

The data was useful, and I was able to find willing volunteers for Malay, Russian, and Brazilian Portuguese!

After finding the languages from 55 colleagues over 2 hours, LinkedIn asked for an identity verification: upload a photo of my passport. No problem, I uploaded it. I also sent them a full explanation of what I was doing, why, how it was useful, and a proof of my Google search history.

But rather than reactivate my account, LinkedIn have permanently banned me, and will not explain why.

"We appreciate the time and effort behind your response to us. However, LinkedIn has reviewed your request to appeal the restriction placed on your account and will be maintaining our original decision. This means that access to the account will remain restricted.

We are not at liberty to share any details around investigations, or interpret the terms of service for you."

So when the CAPTCHA says "Are you a robot?", I'm really not sure. Like Pinocchio, "I'm a real boy!"

[+] arp242|4 years ago|reply
CouchSurfing is just shit, full stop. I love the concept and hosted many people, but the way the company has been run over the last few years is beyond atrocious. It's like AirBnB sent over some people to intentionally run it into the ground or something.

LinkedIn has to deal with a lot of scummy recruiters and scammers; I don't blame them for being very strict.

[+] nocturnial|4 years ago|reply
I knew there was a reason why I used client certificates and alternate ports.

Why is it so difficult to just respect robots.txt? Maybe there's an idea here for a browser plugin that determines whether you can easily scrape the data or not; if not, the website is blocked and its traffic will drop. I know this is a naive idea...
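Respecting robots.txt takes only a few lines with the stdlib's urllib.robotparser; this sketch checks a URL against an already-fetched robots.txt body.

```python
from urllib.robotparser import RobotFileParser

def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Check a URL against a robots.txt body before scraping it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Run the check once per site and skip any disallowed path before the request ever goes out.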

[+] teeray|4 years ago|reply
Never underestimate the scraping technique of last resort: paying people on Mechanical Turk or equivalent to browse to the site and get the data you want
[+] adinosaur123|4 years ago|reply
Are there any court cases that provide precedent regarding the legality of web scraping?

I'm currently looking for ways to get real estate listings in a particular area, and apparently the only real solution is to scrape the few big online listing sites.

[+] IceWreck|4 years ago|reply
Half of the short links to cutt.ly aren't working. Why use short links in markdown?
[+] dpryden|4 years ago|reply
It always amazes me how people believe they have a right to retrieve data from a website. The HTTP protocol calls it a request for a reason: you are asking for data. The server is allowed to say no, for any reason it likes, even a reason you don't agree with.

This whole field of scraping and anti-bot technology is an arms race: one side gets better at something, the other side gets better at countering it. An arms race benefits no one but the arms dealers.

If we translate this behavior into the real world, it ends up looking like https://xkcd.com/1499

[+] connectsnk|4 years ago|reply
For the row "Long-lived sessions after sign-in", the author mentions that this solution is for social media automation, i.e. you build a tool to automate social media accounts to manage ads more efficiently.

I am curious what the author means by automating social media accounts to manage ads more efficiently.

[+] kseifried|4 years ago|reply
Trying to stop credential stuffing by blocking bots will not work, and can often severely impact people depending on assistive technologies.

I think a better solution is to implement 2FA/MFA (even bad 2FA/MFA like SMS or email will block the mass attacks; people worried about targeted attacks can use a hardware token or an authenticator app) or SSO (e.g. sign in with Google/Microsoft/Facebook/LinkedIn/Twitter, who can generally do a better job securing accounts than some random website). SSO is also a lot less hassle in the long term than 2FA/MFA for most users (major caveat: public-use computers, but that's a tough problem to solve securely no matter what).

Better account security is, well, better, regardless of the bot/credential stuffing/etc problem.

[+] softwaredoug|4 years ago|reply
A lot of web scraping is needless, because often there's *an explicit API built for the scraper's needs*. Instead of looking for an API, many reach for web scraping first. This in turn puts load and complexity on the user-facing web app, which must now tell scrapers from real users.
[+] no_time|4 years ago|reply
Using the API almost always has more "strings attached". Like you have to register and get an API token or something. Or even pay. If you want people to use your API, don't make it less convenient than scraping the page.
[+] fragmede|4 years ago|reply
But if there's an API, then the overall load is the same, no?

Or to put it another way: naively, separating api.example.com and realpeople.example.com into separate sandboxes seems reasonable, but due to the aforementioned problem, it's not. And then it turns out to be the wrong axis for this anyway; you need your monitoring to work for you.