1) Don't have any data worth scraping.
2) Charge for access.
3) Provide APIs so people don't need to scrape your site.
Trying to essentially DRM your web site so that it's human-readable and not machine-readable is not only inherently impossible to do effectively (like any DRM), but is also solving the wrong problem.
1 and 3 are the most effective. Option 2 is useful until there is any real demand for your data, and then you're back to trying to prevent automated scraping by your paying customers.
#3 is only an option if you are trying to prevent scraping just to reduce the bandwidth consumed by rapid, repeated page requests. If you are trying to prevent someone from just coming in and scooping up all your data, then providing an API is worse than just allowing the scraper to scrape.
There is no bulletproof way to stop it, so you make it as painful for the scraper as possible. I like the randomized classes/IDs and the extraneous random invisible table cells and divs.
Not one of the methods listed here would deter a decent scraper. Moreover, you would either screw with your users or with SEO if you wanted to make this or that technique more aggressive.
If your database has really great content, it's not because some kid has a copy of your website online that you'll lose users. Stack Overflow has been scraped to death and nobody goes to the other sites to check out answers.
People will not go there; however, the other website will get accidental traffic, which might be 2% of Stack Overflow's traffic. That's a little bit of money for doing nothing but running an automated script.
Try running a search engine. :-) Needless to say, we constantly get folks who are trying to create or enhance databases out of our index. We even have an error page that suggests they contact business development, in the unlikely event they don't "get" the fact that our index is part of our economic value.
One of the humorous things we found is that scrapers can eat error pages very very quickly. Some of our first scrapers were scripts that looked for a page, then the next page, then the next page. We set up nginx so that it could return an error really cheaply and quickly, and once an IP crossed the threshold, blam! we start sending them the error page. What happened next was something over 20,000 hits per second from the IP as the page processing loop became effectively a no-op in their code.
We thought about sending them SERPs to things like the FBI or Interpol or something, so they would go charge off in that direction, but it's not our way. We settled on telling our router to dump them in the bit bucket.
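The threshold-then-cheap-error idea can be sketched as a small per-IP sliding-window counter. This is a minimal illustration, not ChuckMcM's actual setup; the window and limit values are made up:

```python
import time
from collections import defaultdict

WINDOW = 60       # seconds of history to keep per IP
THRESHOLD = 300   # hits per window before we switch to the cheap error page

hits = defaultdict(list)  # ip -> timestamps of recent requests

def should_block(ip, now=None):
    """Return True once an IP exceeds THRESHOLD hits in the last WINDOW seconds."""
    now = time.time() if now is None else now
    recent = [t for t in hits[ip] if now - t < WINDOW]
    recent.append(now)
    hits[ip] = recent
    return len(recent) > THRESHOLD
```

As the anecdote shows, even a cheap error response can turn a broken scraper's fetch loop into a hot loop hammering you at full speed, which is why the final escalation was dropping the packets at the router rather than answering at all.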
Ajaxification can be defeated if you scrape using a headless browser like PhantomJS (http://phantomjs.org/). Actually, all the markup/visual techniques you propose can also be defeated using Phantom: dump the page as PNG and OCR it.
Honey pots suppose that the scraper is an idiot... And even in that case, if he's dedicated, he'll come back later and be more careful.
The only potentially effective solutions are those that preclude usability for everyone, like truncating the content for logged-out users. And even then, with PhantomJS and some subtlety/patience in order not to trigger flood detection, an attacker could probably get away with it.
> By loading the paginated data through javascript without a page reload, this significantly complicates the job for a lot of scrapers out there. Google only recently itself started parsing javascript on page. There is little disadvantage to reloading the data like this.
Well, unless you're visually impaired and using a screen reader... and it doesn't really complicate things for any halfway dedicated scraper, as your AJAX pagination requests probably follow the same predictable pattern as the non-AJAX ones would've.
Do you have any real-world examples of commonly used screen readers that can't handle JavaScript? A screen reader gets its content from the DOM in the browser, so if the browser is able to put it in the DOM, it should be available to the screen reader.
98.4% of screen reader users in a 2010 survey (http://webaim.org/projects/screenreadersurvey3/#javascript) had JavaScript enabled.
- As mentioned: AJAXifying the data makes it easier to grab.
- Convert text to images? I'll OCR it. (http://code.google.com/p/tesseract-ocr/wiki/ReadMe)
- Honeypot a random link? I don't scrape every link on the page, only links that have my data.
- Randomize the output? And drive your real users crazy?
I have found that the best deterrent to drive-by scraping is to not put CSS IDs on everything. Apart from that, you'll need to put the data behind a paywall.
A lot of the people commenting on these techniques being fallible are missing the point: the idea isn't to make scraping impossible (despite the misleading title), it's to make it hard(er).
A determined scraper will defeat these techniques but most scrapers aren't determined, sufficiently skilled or so inclined to spend the time.
I've been curious about a variation of the honeypot scheme using something like Varnish. If you catch a scraper with a honeypot, how easy would it be to give them a version of your site that is cached and doesn't update very often?
C'mon cletus give us a bit of credit. Are you telling us that a company with the World's Smartest Engineers(tm) doesn't already do exactly this with their custom front end machines? :-) It's one of the more entertaining new hire classes.
You are correct that perfection is not achievable, and you don't even want to get so close that you get very many false positives. But honey pots are bandwidth, which, for folks who pay for bandwidth as part of their infrastructure charge, is a burden they are loath to bear. Better to simply toss the packets into the ether whence they came than to bother waking up their EC2 instance.
4. Provide a compressed archive of the data the scrapers want and make it available.
No one should have to scrape in the first place.
It's not 1993 anymore. Sites want Google and others to have their data. Turns out that allowing scraping produced something everyone agrees is valuable: a decent search engine. Sites are being designed to be scraped by a search engine bot. This is silly when you think about it. Just give them the data already.
There is too much unnecessary scraping going on. We could save a whole lot of energy by moving more toward a data dump standard.
Plenty of examples to follow. Wikimedia, StackExchange, Public Resource, Amazon's AWS suggestions for free data sources, etc.
One might argue that indexing from a data-dump will lead to search results that are only as up to date as the last dump.
In StackExchange's case, most of these are now a week or more old.
Maybe it's a good idea, but I'm not sure how many would want to dump their data on a daily basis to keep Google updated, when Google can quite easily crawl their sites as and when it needs to.
They are already supplying fake data to see if they are being scraped.
Using this fake data they can find all the sites that are using their scraped data. Congrats: we now know who is scraping you with a simple Google search.
Now comes the fun part. Instead of supplying the same fake data to all, we need to supply unique fake data to every IP address that comes to the site. Keep track of which IP got which data.
Build your own scrapers specifically for the sites that are stealing your content and scrape them looking for your unique fake data.
Once you find the unique fake data, tie it back to the IP address we stored earlier and you have your scraper.
This can all be automated at this point to auto-ban the crawler that keeps stealing your data. But that wouldn't be fun and would be very obvious. Instead, what we will do is randomize the data in some way so it's completely useless. Sit back and enjoy.
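The per-visitor watermarking scheme above can be sketched like this. The record format, field names, and the choice of deriving the tag with an HMAC are my own illustrative assumptions, not anything from the thread:

```python
import hmac
import hashlib

SECRET = b"rotate-me"  # server-side key; illustrative

def fake_record(ip):
    """Deterministically derive a unique planted record for this client IP."""
    tag = hmac.new(SECRET, ip.encode(), hashlib.sha256).hexdigest()[:10]
    # Looks like a plausible listing but encodes the visitor's identity.
    return {"name": f"Acme Widgets {tag}", "phone": f"555-01{tag[:2]}"}

def identify_leaker(found_name, candidate_ips):
    """Given a planted name found in the wild, recover which IP it was served to."""
    for ip in candidate_ips:
        if fake_record(ip)["name"] == found_name:
            return ip
    return None
```

A side benefit of deriving the tag rather than storing it: you don't strictly need the IP-to-data table the comment mentions, since any planted record found via a search engine maps straight back to the IP it was served to.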
In general I think that getting into an arms race with scrapers is not something that you will win, but if you have a dedicated account for each user you can at least take some action.
If this data is actually valuable, they should put it behind some sort of registration. Then they can swap out the planted data for each user to something that links back to the unique account, without wrecking things for users with accessibility needs or unusual setups.
I have yet to see any anti-scraping method that can protect against a full instance of Chrome automated with Sikuli. It's obviously very expensive to run, since you either need dedicated boxes or VMs, but it always works. In my experience the most consistent parts of any web application are the text and widgets that ultimately render on the screen, so you easily make up for the runtime costs with reduced maintenance. You could in theory make a site that randomly changes button labels or positions, but to the extent you annoy scrapers you're also going to annoy your actual users.
As pointed out by others, many of the suggestions here break core fundamentals of the web, and are generally horrible ideas. It's unsurprising to see suggestions in the comments such as, "add a CAPTCHA", which is nearly as bad of an idea. If you're willing to write bad code and damage user experience to prevent people from retrieving publicly accessible data, perhaps you should rethink your operation a bit.
Generally speaking, if you're in the business of collecting data, but you have a competitive incentive not to share and disseminate that data as broadly as possible, you're in the wrong business. This article seems to address a problem of business model more than anything else. And if you're using technology to solve a problem in your business model...
Let me start by saying that I am a sadochistic scraper (yeah I just made up that word) but I will get your database if I want it. This goes the same for other scrapers who I am sure are more persistent than even I am.
You don't have to read any further, but you should realise that...
* People will get your data if they want it *
The only way you can try and prevent it is to have a whitelist of scrapers and blacklist user agents that are hitting you faster than you deem possible. You should also paywall if the information is that valuable to you. Or work on your business model so that you can provide it free... so that reuse doesn't affect you.
---------------------------------
I thought I would provide an account of the three reasons why I scrape data.
There are lots of different types of data that I scrape for and it falls into a few different categories. I'll keep it all vague so I can explain in as much detail as possible.
[1] User information (to generate leads for my own services)...
This can be useful for a few reasons, but often it's to find people who might find my service useful... So many sites reveal their users' information. Don't do this unless you have good reason to do so.
If I'm just looking for contact information of users, I'll run something like httrack and then parse the mirrored site for patterns. (I'm so paranoid that you should check out how I write my email address in my user profile on this site.)
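"Parse the mirrored site for patterns" mostly means running a regex over every file httrack saved. A minimal sketch (the regex is deliberately simplistic, and this is my illustration rather than the commenter's actual tooling):

```python
import re

# Matches anything email-shaped; real-world patterns get much hairier.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def extract_emails(html):
    """Pull anything email-shaped out of a mirrored page, deduplicated."""
    return sorted(set(EMAIL_RE.findall(html)))
```

This is exactly why people write their addresses as "alice at example dot com" in profiles, as the commenter says he does: a dumb pattern pass never sees them.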
[2] Economically valuable data that I can repurpose...
A lot of the data that I scrape I won't use directly on sites. I'm not going to cross legal boundaries... and I certainly don't want to be slapped with a copyright notice (I might scrape content, but I'm not going to willfully break the law). But, for example, there is a certain very popular website that collects business information and displays it on their network of websites. They also display this information in Google Maps as markers.
One of my most successful scrapes of all time, was to pretend to be a user and constantly request different locations to their "private API". It took over a month to stay under the radar, but I got the data. I got banned regularly, but would just spawn up a new server with a new IP.
I'm not going to use this data anywhere on my sites. It's their database that they have built up. But, I can use this data to make my service better to my users.
[3] Content...
Back in the day... I used to just scrape content. I don't do this any more since I'm actually working on what will hopefully be a very successful startup... however, I used to scrape articles/content written by people. I created my own content management system that would publish entire websites for specific terms. This used to work fantastically when the search engines weren't that smart. I would guess it would fail awfully now. But I could quite easily generate a few hundred uniques per website. (This would be considerable when multiplied out to lots of websites!!!)
Anyway, content would be useful to me because I would spin it into new content using a very basic Markov chain. I'd have thousands of websites up and running, all on different .info domains (bought for 88 cents each) and running advertisements on them. The domains would eventually get banned from Google and you'd throw the domain away. You'd make enough more than 88 cents through affiliate systems, Commission Junction and the like, that this didn't matter, and you were doing it on such a large scale that it would be quite prosperous.
------------------------------------
I honestly couldn't really offer you any advice on how to prevent scraping.
The best you can do is slow us down.
And the best way to do that is to figure out who is hitting your pages in such a methodical manner and rate limit them. If you are smart enough, you might also try to "hellban" us by serving up totally false data. I really would have laughed if, the time I scraped 5 million longitudes and latitudes over a period of a few months, I had noticed at the end of the process that all of the latitudes were wrong.
Resistance is futile. You will be assimilated. </geek>
Yeah as a scraper I'd say that at most all these suggestions would do is make me turn to selenium/greasemonkey instead of mechanize/wget/httrack. Selenium is the bomb when people try to get fancy preventing scraping, how exactly are they supposed to detect the difference between a browser and a browser?
Getting banned is not a big deal, plenty of IPs & proxies out there. EC2 is your best friend as you can automate the IP recycling. Even Facebook/Twitter accounts are almost free.
Even the randomization wouldn't be particularly difficult to circumvent: just save the pages, then use a genetic algorithm with tunable parameters for the randomization and select the parameters that yield the most/best records.
What I'd actually fear is a system that just silently corrupted the records once scraping was detected, especially if it was intermittent, e.g. 10-75% of records on a page are bogus, and only every few pages. Or if they started displaying the records as images (but I'm guessing they want Google juice).
I actually came up with a very effective method for identifying scraping and blocking it in near real-time. The challenge I had was that I was being scraped via many, many proxies/IPs in short spurts using a variety of user agents, so as to avoid detection or at least make it difficult. The solution was simply to identify bot behavior and block it:
1. Scan the raw access logs via a 1-minute cron for the last 10,000 lines, depending on how trafficked your site is.
2. Parse the data by IP, and then by request time.
3. Search for IPs that have not requested universal and necessary elements, like anything in the images or scripts folder, and that made repetitive requests in a short period of time (like 1 second).
4. Run the shell command `csf -d IP_ADDY scraping` to add the IP to the firewall block list.
This process is so effective at identifying bots/spiders that I've had to create a whitelist for search engines and other monitoring services that I want to continue to have access to the site.
Most scrapers don't go to the extent of scraping via headless browsers - so, for the most part, I've pretty much thwarted the scraping that was prevalent on my site.
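The log-scanning steps above look roughly like this in Python. The log format, regexes, and thresholds are assumptions on my part, and the request-timing check from step 3 is omitted for brevity; the original runs against raw access logs and blocks via `csf`:

```python
import re
from collections import defaultdict

# Common-log-format-ish: ip ident user [timestamp] "GET /path ..."
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "GET (\S+)[^"]*" ')
ASSET_RE = re.compile(r'/(images|scripts)/|\.(css|js|png|jpg)$')

def suspicious_ips(log_lines, min_pages=20):
    """IPs that made many page requests but never fetched a single asset."""
    pages = defaultdict(int)
    fetched_asset = defaultdict(bool)
    for line in log_lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ip, path = m.groups()
        if ASSET_RE.search(path):
            fetched_asset[ip] = True
        else:
            pages[ip] += 1
    return sorted(ip for ip, n in pages.items()
                  if n >= min_pages and not fetched_asset[ip])
```

Each IP this returns would then be handed to the firewall (`csf -d <ip> scraping`), with known search engine crawlers whitelisted first, as the comment notes.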
> I honestly couldn't really offer you any advice on how to prevent scraping. The best you can do is slow us down.
> And the best way to do that is to figure out who is hitting your pages in such a methodical manner and rate limit them. If you are smart enough, you might also try to "hellban" us, by serving up totally false data.
Well, no, there are other ways too.
For example, any site behind a paywall probably has your identity, and unless you live in a faraway place with impotent copyright laws -- and there aren't that many of them any more -- there are often staggeringly disproportionate damages for infringement available through the courts these days, certainly enough to justify retaining legal representation to bring a suit in any major jurisdiction. Given a server log showing a pattern of systematic downloading that could only be done by an automated scraper in violation of a site's ToS, and given a credit card in your name linked to the account and an IP address linked to your residence where the downloads went, I imagine it's going to be a fairly short and extremely expensive lawsuit if you upset the wrong site owner.
There's nothing worse than spending lots of hard work scraping sites to build your search engine and then having bad guys turn around and scrape your search engine.
Maybe it's some sort of karma. If you scrape, then you will get scraped.
I don't know what kind of site this is, so it's hard to say if it applies, but do note that several of these can significantly harm usability for legitimate users as well. For example, someone might be copy/pasting a segment of data into Excel to do some analysis for a paper, fully intending to credit you as the source; if you insert fake cells, or render the data to an image, you make their life a lot more difficult.
The first suggestion (AJAX-ifying pagination) can be done without a major usability hit if you give the user permalinks with hash fragments, though, so example.com/foo/2 becomes example.com/foo#2.
I am currently working on a project that involves some scraping as well. The most annoying things I came across so far are:
- Totally broken markup (I fixed this by either using Tidy first or just using a Regex instead of a 'smart' HTML/XML parser)
- Sites that need Javascript even on 'deep links' (I fixed this by using PhantomJS and saving the HTML instead of just using curl)
- Inconsistency, by far the most annoying: different classes, different formatting, different elements for things that should more or less be identical (basically fixing this whenever I come across a problem but sometimes it's just too much hassle and well, ask yourself if you really need to get every single 'item' from your target)
One more thing: RSS is your friend. And often you can find a suitable RSS link (that's not linked anywhere on the site) by just trying some URLs.
PS: No, I am not doing anything evil. If this project ever goes live/public, I'll hit all the targeted sites up and ask for permission. Not causing any significant traffic either.
Anything that can be displayed on a screen can be scraped.
An approach I used to prevent scraping in the past is to start rate limiting anything that hits over N pageviews in an hour, where N is a value around what a high-use user could manually consume. Start with a small delay and increment it with each pageview (excess hits*100ms), then send HTTP 509 (Exceeded Bandwidth) for anything that is clearly hammering the server (or start returning junk data if you're feeling vengeful).
Added bonus is that the crawler will appear to function correctly during testing until they try to do a full production run and run into the (previously undetectable) rate limiting.
This project did not require search indexing so we didn't care about legit searchbots, but you could exclude known Google/Bing crawlers and log IPs of anything that hits the limit for manual whitelisting (or blacklisting of repeat offenders).
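The escalating-delay scheme reads roughly like this as a helper a request handler could call. The 100 ms step and 509 status come from the comment; the hourly limit, hard cap, and everything else are illustrative:

```python
from collections import defaultdict

LIMIT = 200           # pageviews/hour a heavy human user might plausibly reach
DELAY_STEP = 0.1      # extra 100 ms of delay per excess hit
HARD_CAP = 2 * LIMIT  # beyond this, stop delaying and send HTTP 509

views = defaultdict(int)  # ip -> pageviews this hour; reset by an hourly job

def throttle(ip):
    """Return (delay_seconds, status_code) for this request."""
    views[ip] += 1
    excess = views[ip] - LIMIT
    if excess <= 0:
        return 0.0, 200
    if views[ip] > HARD_CAP:
        return 0.0, 509  # "Bandwidth Limit Exceeded" (nonstandard code)
    return excess * DELAY_STEP, 200
```

The "added bonus" falls out naturally: a crawler tested against a handful of pages never trips `LIMIT`, so the throttling is invisible until the full production run.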
More trouble than it's worth. Plus, none of these solutions actually prevent site scraping... if the person is dedicated enough, they'll find a way. The time spent on implementing any of these approaches would be much better spent on site optimization, features, etc.