A few years back we came into work one morning to find that some bot was scanning our site so hard that it seemed the lights nearly dimmed. Some detective work suggested that it was a service performed on behalf of a competitor, to get our price list (bear in mind that our catalog has a few hundred thousand products).
We were really annoyed that rather than just ask us, they had launched what amounted to a DDoS attack. So we thought about how we might exact vengeance...
After a few hours we figured out a pattern to the rogue requests that allowed us to filter them, despite their efforts at stealth (for example, cycling through a list of user agent strings to make it look like there were multiple different users). We toyed with the idea of, rather than outright banning them, making our pages sensitive to their presence, so that when we detected them, we'd display a false price, defeating their whole operation.
We finally just decided to take the high road, temporarily banning any rogue IP addresses we detected (we couldn't make it permanent because many of the requests came from the Amazon cloud, from which we also receive some legitimate requests).
EDIT: you wouldn't think that requests for a few hundred thousand products would amount to a DDoS, but the bot was rather poorly written and grossly inefficient in the way it walked through the list.
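The temporary-ban approach described above can be sketched as a sliding-window rate check. Everything here is hypothetical — the thresholds, names, and in-memory storage are invented for illustration, not the commenter's actual system:

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # look-back window for counting requests
MAX_REQUESTS = 120         # requests per window before an IP is considered rogue
BAN_SECONDS = 3600         # temporary ban length (not permanent, per the comment)

_requests = defaultdict(deque)   # ip -> timestamps of recent requests
_banned_until = {}               # ip -> unix time when its ban expires

def is_allowed(ip, now=None):
    """Return False while `ip` is temporarily banned; otherwise record
    the request and ban the IP if it exceeded the rate threshold."""
    now = time.time() if now is None else now

    # Still serving out a ban?
    if _banned_until.get(ip, 0) > now:
        return False

    # Record this request and drop timestamps outside the sliding window.
    q = _requests[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()

    if len(q) > MAX_REQUESTS:
        _banned_until[ip] = now + BAN_SECONDS
        return False
    return True
```

A real deployment would keep this state in something shared (e.g. Redis) rather than process memory, but the windowed count plus an expiry timestamp is the whole idea.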
I built a system called caltrops that did almost exactly that. As a given session's requests grew more and more suspicious, their data would skew from reality further and further. A real user on the line would notice immediately (and the more real-looking the user interactions, the more it would reduce suspicion), but competitors scraping our data would get pretty deliciously bunk data.
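A minimal sketch of the caltrops idea — skewing returned data further from reality as suspicion grows. The names, the 0-to-1 suspicion scale, and the 40% cap are all made up for illustration; the real system isn't described in detail:

```python
import random

def displayed_price(true_price, suspicion, rng=None):
    """Return a price skewed further from reality as suspicion grows.

    suspicion: 0.0 (trusted; real-looking interactions would drive it
    down) through 1.0 (almost certainly a scraper). At 0 the real
    price comes back; at 1 the result can be off by up to +/-40%.
    """
    rng = rng or random.Random()
    max_skew = 0.4 * suspicion                    # distortion capped by suspicion
    factor = 1.0 + rng.uniform(-max_skew, max_skew)
    return round(true_price * factor, 2)
```

The appeal over an outright ban is that the scraper gets no error signal — just quietly worthless numbers.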
"Thousands of people use Mailinator every day, so clearly, it's a useful tool that many sites accept"
How many of you would have an outright revolt on your hands from your QA/QE folks if you banned mailinator? I think everyplace I worked would experience this same issue if we did this.
You can use + in the local part of an email address, such as [email protected], to create throwaways. Most sites consider those to be different email addresses than [email protected] for account purposes, but email services that respect the RFC will treat them as the same.
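A hedged sketch of the normalization a site could apply if it wanted to collapse plus-addressed variants to one account. The function name is invented, and the dot-stripping rule is Gmail-specific behavior, not part of the RFC:

```python
def normalize(address):
    """Collapse plus-addressed throwaways to the underlying mailbox.

    Providers that honor the convention ignore everything from '+'
    to '@' for delivery; Gmail additionally ignores dots in the
    local part (an assumption specific to that provider).
    """
    local, _, domain = address.lower().partition("@")
    local = local.split("+", 1)[0]            # drop the +tag
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")        # Gmail-only dot rule
    return f"{local}@{domain}"
```

Note that aggressive normalization has false positives: some providers treat dots and tags as genuinely distinct mailboxes.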
One way to get around a domain blacklist is to point your own domain to Mailinator. Heck, since last year you can even get your own private Mailinator...
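One defense against that trick, sometimes used by disposable-address blockers, is to look at where a domain's mail is actually delivered rather than at the domain name itself. A sketch, assuming the MX lookup has already been done elsewhere (e.g. with a DNS library); the function name is invented:

```python
def backed_by_mailinator(mx_hosts):
    """Given a domain's MX hostnames (already resolved elsewhere),
    report whether its mail is actually delivered to Mailinator.

    Handles trailing dots and mixed case as returned by DNS tooling.
    """
    return any(
        h.rstrip(".").lower().endswith("mailinator.com")
        for h in mx_hosts
    )
```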
It took me a bit to get my head around the use cases. It's sometimes amazing how many different ways you can twist a simple (complex really) thing like email into a product/idea.
However, tricking site scrapers may not work perfectly if the scrapers maintain a whitelist of known-legitimate domains. Say I am scraping mailinator.com for domain names: if I see gmail.com or yahoo.com, I might just not put them in my database, because they are in my whitelist.
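That whitelist filter on the scraper's side is a one-liner; the domain list here is purely illustrative:

```python
# Known-legitimate providers a honeypot page might plant as decoys.
TRUSTED = {"gmail.com", "yahoo.com", "outlook.com"}

def harvest(scraped_domains):
    """Keep only domains not on the trusted whitelist, discarding
    decoys so poisoned entries never reach the database."""
    return [d for d in scraped_domains if d.lower() not in TRUSTED]
```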
Mailinator seems to have added some other anti-scraping detection.
Unfortunately it does not work very well, as I was not scraping Mailinator but still somehow got IP banned. Fortunately my IP has changed. But they definitely have some strange and overzealous method now.
I would go one step further and look for {spam_words} in "username+{text}@{googledomain}.com", where spam_words can be "junk", "spam", etc. This is a very narrow edge case, but it still might catch something. Again, only if you're into that kind of thing. I'm quite skeptical that it brings any value.
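A sketch of that check, with an invented function name and an illustrative word list:

```python
import re

# Tags that suggest the address was created to be thrown away.
SPAM_WORDS = {"junk", "spam", "trash", "throwaway"}

def suspicious_plus_tag(address):
    """Flag addresses whose +tag suggests a throwaway, e.g. a tag
    of 'junk' or 'spam' on a Google-hosted mailbox."""
    m = re.fullmatch(r"[^@+]+\+([^@]+)@(gmail|googlemail)\.com", address.lower())
    return bool(m) and m.group(1) in SPAM_WORDS
```

As the comment itself concedes, a user can trivially defeat this by picking any other tag, so it's a heuristic at best.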
FTA: "Could I make it harder to scrape? Well, I could, but wouldn't really slow anyone down much."
I think that's the basic idea. He could spend his time making it harder to scrape, like the bar across the steering wheel. Some people would be deterred, others wouldn't, and time would be wasted all around.
At least at the time of writing, if you had enough foresight and engineering time to set something like that up, you had enough foresight and engineering time to not make your system treat email addresses as meaningful identities.
zinxq | 10 years ago
I hadn't read that in many years, and what fun to do a re-read.
Thanks Internet - don't stop being you.
dice | 10 years ago
I hope you don't mind that I wrote a quick one-liner to see if you're still detecting bots...
Yup :) I didn't see any "evil" insertions, though...
8ig8 | 10 years ago
http://mailinator.blogspot.com/2014/10/mailinator-launches-p...
botbot | 10 years ago
OCR requires a lot more programming effort than a text-based content scraper.