High-performance .NET by example: Filtering bot traffic

228 points | alexandrnikitin | 9 years ago | alexandrnikitin.github.io | reply

50 comments

[+] hdhzy|9 years ago|reply
Excellent post showing how to correctly improve code w.r.t. performance using the scientific method: hypothesis, measuring the baseline, change, measuring effect with real tools, real code.

Thanks for sharing!

[+] Someone1234|9 years ago|reply
I also found it an interesting post in that it kind of inadvertently proves that for most situations you shouldn't optimise to this extent.

Meaning, yes, the OP got impressive performance improvements but the code is also completely unreadable and utilises unsafe code sections which could expose you to security problems/memory leaks/memory corruption. Not to mention they've recreated and will need to maintain an in-house version of the Dictionary class.

Their first optimisations (from Enumerator to List and from Any() to Count()) are something every codebase could use. Most of their other optimisations make the code a maintenance minefield.

Plus programmers are expensive. Hardware is cheap. Why spend time on code that's harder to write and harder to maintain in the medium to long term, when instead you could just throw money at hardware and call it a day? Just food for thought, not really a criticism in and of itself.

PS - Please don't take this post too seriously. I am not really being critical, just playing devil's advocate. I actually enjoyed the linked article a lot.
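For readers unfamiliar with the kind of change meant by "from Enumerator to List and Any() to Count()", here is a rough illustration (not the article's actual code; names are made up for the example):

```csharp
using System.Collections.Generic;
using System.Linq;

static class Matches
{
    // Before: LINQ's Any() allocates an enumerator (and here a closure)
    // on every call when given a plain IEnumerable<T>.
    public static bool AnyMatchLinq(IEnumerable<string> keywords, string ua) =>
        keywords.Any(k => ua.Contains(k));

    // After: a plain indexed loop over List<T> using its Count property,
    // with no per-call allocations.
    public static bool AnyMatchList(List<string> keywords, string ua)
    {
        for (int i = 0; i < keywords.Count; i++)
            if (ua.Contains(keywords[i]))
                return true;
        return false;
    }
}
```

Both return the same answers; the second simply avoids enumerator and delegate allocations on a hot path.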

[+] throwasehasdwi|9 years ago|reply
Blocking access based on arbitrary user agent strings is a really bad idea. Every single bad bot will avoid known user agent strings or pretend to be Google, so you're only blocking well-behaved ones. Plus there are thousands of browser versions out there, so there's a very good chance you're blocking some users for no reason.

The proper way to do this is to block by IP, based on behavior. Block IPs slowing down the site or throw up a captcha like Cloudflare does.

Blocking bots sounds great but it just brings Google one step closer to a monopoly. Even good bots just pretend to be people nowadays because lots of people are implementing naive site protection strategies.

Edited: to be less mean

[+] alexandrnikitin|9 years ago|reply
Yes, you're right. There are many ways to block robots: IP, UA, behaviour analysis. An advertising company has to have UA-based filtering to be compliant with standards. However, the focus of the blog post is on performance rather than on how to block bots.
[+] Twirrim|9 years ago|reply
The article is not about filtering bad bots. It even says so right near the top:

>We won’t cover black bots because it is a huge topic with sophisticated analysis and Machine learning algorithms. We will focus on the white and grey bots that identify themselves as such.

This is about not wasting time & effort showing advertising banners to good bots.

[+] cm2187|9 years ago|reply
Plus it's baking planned obsolescence into your code. As soon as you stop updating it, it will start blocking newer browser versions.
[+] John23832|9 years ago|reply
> single dumbest idea

I agree with your sentiment, but you should try to be a little more constructive.

[+] jlg23|9 years ago|reply
In theory you are right, but in reality 99% of the rogue bots are actually just scraper tools where ignorant users changed sane defaults to "make it go faster". They usually don't have enough knowledge to even understand that they are being routed into a black hole, let alone do something about it.

Disclaimer: Getting rid of those idiots^Wmisguided poor souls is part of my job description.

[+] krzrak|9 years ago|reply
> Every single bad bot will avoid known user agent strings

If I were writing a bot, I would set the user agent to some well-known and very popular value, e.g. the newest Chrome on Windows, or something like that.
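That spoofing is trivial from any HTTP client, which is the whole weakness of UA-based blocking. A minimal .NET sketch (the URL and UA string are illustrative):

```csharp
using System.Net.Http;

// A request that claims to come from a mainstream Chrome on Windows.
// Nothing on the server side can tell from the User-Agent alone that
// this is a bot.
var request = new HttpRequestMessage(HttpMethod.Get, "https://example.com/");
request.Headers.TryAddWithoutValidation(
    "User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36");
```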

[+] senorjazz|9 years ago|reply
Rather than block on UA, just add some honeypots. An invisible link: any bot that pulls that page gets blocked, since scrapers tend to pull all links from a page and follow them.

Use robots.txt to ban the pulling of specific pages. Bots ignore robots.txt 99% of the time, so if they pull those pages: block.

Check how quickly pages are pulled. If it passes a threshold: block.
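The honeypot and rate-threshold ideas above could be sketched roughly as follows (the `/trap` path, the threshold, and all names are illustrative assumptions, not from the article or the comment):

```csharp
using System;
using System.Collections.Concurrent;

class BotTrap
{
    private readonly ConcurrentDictionary<string, bool> _blocked =
        new ConcurrentDictionary<string, bool>();
    private readonly ConcurrentDictionary<string, (DateTime WindowStart, int Count)> _hits =
        new ConcurrentDictionary<string, (DateTime WindowStart, int Count)>();

    private const int MaxRequestsPerMinute = 300; // illustrative threshold

    public bool IsBlocked(string ip) => _blocked.ContainsKey(ip);

    // Call for every request; returns true if the request should be blocked.
    public bool Observe(string ip, string path, DateTime now)
    {
        if (_blocked.ContainsKey(ip)) return true;

        // Honeypot: "/trap" is linked invisibly and disallowed in robots.txt,
        // so anything fetching it is assumed to be a scraper.
        if (path == "/trap")
        {
            _blocked[ip] = true;
            return true;
        }

        // Rate threshold: too many requests within a one-minute window.
        var entry = _hits.AddOrUpdate(ip,
            _ => (now, 1),
            (_, e) => now - e.WindowStart > TimeSpan.FromMinutes(1)
                ? (now, 1)
                : (e.WindowStart, e.Count + 1));
        if (entry.Count > MaxRequestsPerMinute)
        {
            _blocked[ip] = true;
            return true;
        }
        return false;
    }
}
```

A real deployment would also expire block entries and whitelist known good crawlers, but the shape of the check is the same.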

[+] alexandrnikitin|9 years ago|reply
Yes, using honeypots is one of the ways to identify bots. But that wasn't the focus of the post. I'll add some clarification.
[+] marklit|9 years ago|reply
I've seen bot traffic claiming to be recent versions of Firefox from residential IPs in Ukraine pulling robots.txt. Sometimes that is one of the few clues to go on.
[+] pc86|9 years ago|reply
I'm pretty sure the point of the article was performance testing of C#, not best practices for banning bots...
[+] doubleplusgood|9 years ago|reply
I did something similar with nginx, the data file from 51Degrees, and some Lua code; each instance only handles 10-20k requests/sec, so no clever optimization was needed.
[+] oblio|9 years ago|reply
Would you mind posting the Lua code?
[+] NKCSS|9 years ago|reply
I'd probably store cached results in Dictionary<int, HashSet<string>> allowed, notAllowed; where int == length of the user agent. This should be blazing fast as well, instead of repeatedly doing those lookups.
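The suggested cache could be sketched like this (an illustration of the comment's idea, with made-up names; whether it pays off in practice would need measuring):

```csharp
using System.Collections.Generic;

// Memoizes earlier match results per user-agent string, bucketed by the
// UA's length as the comment suggests.
class UaCache
{
    private readonly Dictionary<int, HashSet<string>> _allowed =
        new Dictionary<int, HashSet<string>>();
    private readonly Dictionary<int, HashSet<string>> _notAllowed =
        new Dictionary<int, HashSet<string>>();

    // Returns true/false for a previously seen UA, null on a cache miss.
    public bool? Lookup(string ua)
    {
        if (_allowed.TryGetValue(ua.Length, out var yes) && yes.Contains(ua)) return true;
        if (_notAllowed.TryGetValue(ua.Length, out var no) && no.Contains(ua)) return false;
        return null;
    }

    public void Store(string ua, bool allowed)
    {
        var map = allowed ? _allowed : _notAllowed;
        if (!map.TryGetValue(ua.Length, out var set))
            map[ua.Length] = set = new HashSet<string>();
        set.Add(ua);
    }
}
```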
[+] alexandrnikitin|9 years ago|reply
I doubt that will work exactly as described. There are tens of thousands of different UAs (maybe 100K). Perhaps some kind of tiny cache (a few CPU cache lines) for the most popular UAs could help. But again: measure, measure, measure :)
[+] andrewgrowles|9 years ago|reply
An FST (finite state transducer) seems like a good fit for this problem. I believe it would be much more compact than the trie structure used by the Aho-Corasick algorithm. It depends on the size of the dictionary.
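For reference, the Aho-Corasick structure being compared against can be sketched compactly; this is a generic textbook version (keyword set and class names are illustrative, not the article's implementation), which matches any of a set of keywords as substrings in a single pass:

```csharp
using System;
using System.Collections.Generic;

class AhoCorasick
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Next = new Dictionary<char, Node>();
        public Node Fail;
        public bool Terminal; // a keyword ends at this node (or via fail links)
    }

    private readonly Node _root = new Node();

    public AhoCorasick(IEnumerable<string> keywords)
    {
        // Build the trie over all keywords.
        foreach (var word in keywords)
        {
            var node = _root;
            foreach (var c in word)
            {
                if (!node.Next.TryGetValue(c, out var child))
                    node.Next[c] = child = new Node();
                node = child;
            }
            node.Terminal = true;
        }

        // BFS to set failure links (longest proper suffix that is also a
        // prefix of some keyword).
        var queue = new Queue<Node>();
        foreach (var child in _root.Next.Values)
        {
            child.Fail = _root;
            queue.Enqueue(child);
        }
        while (queue.Count > 0)
        {
            var node = queue.Dequeue();
            foreach (var pair in node.Next)
            {
                var fail = node.Fail;
                while (fail != null && !fail.Next.ContainsKey(pair.Key))
                    fail = fail.Fail;
                pair.Value.Fail = fail == null ? _root : fail.Next[pair.Key];
                pair.Value.Terminal |= pair.Value.Fail.Terminal;
                queue.Enqueue(pair.Value);
            }
        }
    }

    // True if any keyword occurs as a substring of text.
    public bool ContainsAny(string text)
    {
        var node = _root;
        foreach (var c in text)
        {
            while (node != _root && !node.Next.ContainsKey(c))
                node = node.Fail;
            if (node.Next.TryGetValue(c, out var next))
                node = next;
            if (node.Terminal)
                return true;
        }
        return false;
    }
}
```

An FST would minimize shared suffixes as well as shared prefixes, which is where the size saving over this trie comes from.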
[+] frik|9 years ago|reply
Good post!

A lot of manual work with various perf tools.

What's a bit missing is some production performance monitoring (APM) that gives you such data, with no manual interaction.

[+] alexandrnikitin|9 years ago|reply
I intend to write a separate blog post about low-overhead production monitoring (not sure when that will happen, though).
[+] tener|9 years ago|reply
So, the industry standard requires them not to serve ads to the bots... which means they have implemented the ad blocking themselves?
[+] brilliantcode|9 years ago|reply
What if the "grey" traffic came from residential IP addresses using a normally distributed range of user agents? How would you reliably distinguish them from regular traffic?
[+] Benfromparis|9 years ago|reply
Basically, we use two sorts of techniques: technical and behavioural.

Technical: if the User-Agent claims to be a regular browser (say, Chrome 43), we check at the network level whether the client implements the HTTP protocol the way Chrome 43 usually does, and on the JS side whether the JavaScript rendering is correct for Chrome. If it's a real Chrome, we check whether the browser is controlled by an automation tool.

Behavioural: we check whether the path of requests is regular according to the website's usage.

Disclaimer: I work at https://datadome.co, a bot protection tool.

[+] alexandrnikitin|9 years ago|reply
It depends. Usually you do nothing if that traffic is very low. There's no reliable way to distinguish them. Honeypots and behaviour analysis are very useful here.