Excellent post showing how to correctly improve code performance using the scientific method: form a hypothesis, measure the baseline, make a change, then measure the effect, all with real tools on real code.
I also found it an interesting post in that it kind of inadvertently proves that for most situations you shouldn't optimise to this extent.
Meaning, yes, the OP got impressive performance improvements but the code is also completely unreadable and utilises unsafe code sections which could expose you to security problems/memory leaks/memory corruption. Not to mention they've recreated and will need to maintain an in-house version of the Dictionary class.
Their first optimisations (from Enumerator to List, and Any() to Count()) are something every codebase could use. Most of their other optimisations make the code a maintenance minefield.
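A rough sketch of that first class of fix, in Python for brevity (the workload here is a made-up stand-in, not the article's code): materialize a lazy pipeline once and use O(1) length checks, rather than re-enumerating it for every emptiness test.

```python
def expensive_items():
    # Stand-in for a LINQ-style lazy pipeline (hypothetical workload).
    return (s for s in ("Googlebot", "bingbot", "curl/7.0") if "bot" in s)

# Anti-pattern: re-run the whole pipeline every time you need an
# emptiness check or a count.
def has_items_slow():
    return sum(1 for _ in expensive_items()) > 0

# Fix: materialize into a list once; len() on a list is O(1),
# just like List<T>.Count in .NET.
items = list(expensive_items())

def has_items_fast():
    return len(items) > 0
```

Both return the same answer; the difference is that the slow version pays the enumeration cost on every call.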
Plus programmers are expensive. Hardware is cheap. Why spend time on code that's harder to write and harder to maintain in the medium to long term, when instead you could just throw money at hardware and call it a day? Just food for thought, not really a criticism in and of itself.
PS - Please don't take this post too seriously. I am not really being critical, just playing devil's advocate. I actually enjoyed the linked article a lot.
Blocking access based on arbitrary user agent strings is a really bad idea. Every single bad bot will avoid known user agent strings or pretend to be Google, so you're only blocking well-behaved ones. Plus there are thousands of browser versions out there, so there's a very good chance you're blocking some users for no reason.
The proper way to do this is to block by IP, based on behavior. Block IPs slowing down the site or throw up a captcha like cloudflare does.
Blocking bots sounds great but it just brings Google one step closer to a monopoly. Even good bots just pretend to be people nowadays because lots of people are implementing naive site protection strategies.
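A minimal sketch of that behaviour-based approach, assuming a simple sliding-window rate limit per IP (the threshold and window values are illustrative, not recommendations):

```python
import time
from collections import defaultdict, deque

class IpRateLimiter:
    """Track request timestamps per IP in a sliding window and block
    IPs that exceed a threshold. A real deployment would serve a
    CAPTCHA (as Cloudflare does) rather than hard-block."""

    def __init__(self, max_requests=100, window_seconds=60):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)   # ip -> timestamps of recent requests
        self.blocked = set()

    def allow(self, ip, now=None):
        """Return True if the request should be served."""
        if ip in self.blocked:
            return False
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        q.append(now)
        # Drop timestamps that fell out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) > self.max_requests:
            self.blocked.add(ip)
            return False
        return True
```

Note this punishes speed, not identity: a bot that crawls politely never trips it, regardless of what its user agent claims.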
Yes, you're right. There are many ways to block robots: IP, UA, behaviour analysis. An advertising company has to have UA-based filtering to be compliant with standards. However, the focus of the blog post is on performance rather than on how to block bots.
The article is not about filtering bad bots. It even says so right near the top:
>We won’t cover black bots because it is a huge topic with sophisticated analysis and Machine learning algorithms. We will focus on the white and grey bots that identify themselves as such.
This is about not wasting time & effort showing advertising banners to good bots.
In theory you are right, but in reality 99% of the rogue bots are actually just scraper tools where ignorant users changed the sane defaults to "make it go faster".
They usually don't have enough knowledge to even understand that they are being routed into a black hole, let alone to do something about it.
Disclaimer: Getting rid of those idiots^Wmisguided poor souls is part of my job description.
Rather than blocking on UA, just add some honeypots: an invisible link, say. Any bot that pulls that page gets blocked, since scrapers tend to pull all links from a page and follow them.
Use robots.txt to disallow specific pages. Bad bots ignore robots.txt 99% of the time, so if they pull a disallowed page: block.
Check how quickly pages are pulled. If it passes a threshold: block.
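The three rules above could be sketched roughly like this (the paths and limits are invented for illustration, and the rate check is deliberately crude):

```python
class BotTrap:
    """Sketch of three trap rules: a honeypot link, robots.txt-disallowed
    paths, and a crawl-rate threshold."""

    HONEYPOT = "/totally-real-page"   # linked invisibly in the HTML
    DISALLOWED = ("/private/",)       # also listed in robots.txt

    def __init__(self, max_pages_per_minute=30):
        self.max_rate = max_pages_per_minute
        self.page_counts = {}         # ip -> pages seen (sketch; no window reset)
        self.blocked = set()

    def check(self, ip, path):
        """Return True if the request should be served, False if blocked."""
        if ip in self.blocked:
            return False
        # Rule 1: only a scraper following every link reaches the honeypot.
        # Rule 2: a well-behaved bot never requests a disallowed path.
        if path == self.HONEYPOT or path.startswith(self.DISALLOWED):
            self.blocked.add(ip)
            return False
        # Rule 3: crude rate check (real code would use a sliding window).
        count = self.page_counts.get(ip, 0) + 1
        self.page_counts[ip] = count
        if count > self.max_rate:
            self.blocked.add(ip)
            return False
        return True
```

The appeal of the honeypot rule is that it costs legitimate users nothing: no human ever clicks an invisible link.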
I've seen bot traffic claiming to be recent versions of Firefox from residential IPs in Ukraine pulling robots.txt. Sometimes this is one of the few clues to go on.
I did something similar with nginx, the data file from 51degrees and some Lua code; each instance only handles 10-20k requests/sec so no clever optimization was needed.
I'd probably store cached results in Dictionary<int, HashSet<string>> allowed, notAllowed; where int is the length of the user agent. This should be blazing fast as well, instead of repeatedly doing those lookups.
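A sketch of that idea in Python terms, keyed by UA length as suggested; `is_bot_ua` and its token list are hypothetical stand-ins for the expensive substring-matching check:

```python
BOT_TOKENS = ("bot", "crawler", "spider")   # illustrative token list

def is_bot_ua(ua: str) -> bool:
    # Stand-in for the slow full check (substring matching over a big list).
    ua_lower = ua.lower()
    return any(tok in ua_lower for tok in BOT_TOKENS)

allowed = {}      # length -> set of UAs known to be human
not_allowed = {}  # length -> set of UAs known to be bots

def check_cached(ua: str) -> bool:
    """Return True if the UA is a bot, consulting the cache first."""
    n = len(ua)
    if ua in not_allowed.get(n, ()):
        return True
    if ua in allowed.get(n, ()):
        return False
    bot = is_bot_ua(ua)               # slow path: full check
    (not_allowed if bot else allowed).setdefault(n, set()).add(ua)
    return bot
```

The length key only narrows the candidate set before the exact-string membership test; whether that beats a single flat hash lookup on the full UA string is exactly the kind of thing you'd have to measure.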
I doubt that exact approach will work. There are tens of thousands of different UAs (maybe 100K). Perhaps some kind of tiny cache (a few CPU cache lines) for the most popular UAs could help. But again: measure, measure, measure :)
An FST seems like a good fit for this problem. I believe it would be much more compact than the Aho-Corasick trie structure. Depends on the size of the dictionary.
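For reference, the Aho-Corasick side of that comparison can be sketched compactly. This is a minimal textbook construction, not the article's implementation: it matches any of a set of bot tokens inside a UA string in a single left-to-right pass.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: trie + failure links."""

    def __init__(self, patterns):
        self.goto = [{}]       # state -> {char: next state}
        self.fail = [0]        # state -> fallback state on mismatch
        self.out = [False]     # state -> does some pattern end here?
        # Build the trie of all patterns.
        for p in patterns:
            s = 0
            for ch in p:
                if ch not in self.goto[s]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append(False)
                    self.goto[s][ch] = len(self.goto) - 1
                s = self.goto[s][ch]
            self.out[s] = True
        # BFS to compute failure links (longest proper suffix in the trie).
        q = deque(self.goto[0].values())
        while q:
            s = q.popleft()
            for ch, t in self.goto[s].items():
                q.append(t)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(ch, 0) if self.goto[f].get(ch, 0) != t else 0
                self.out[t] = self.out[t] or self.out[self.fail[t]]

    def contains_any(self, text):
        """True if any pattern occurs as a substring of text."""
        s = 0
        for ch in text:
            while s and ch not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(ch, 0)
            if self.out[s]:
                return True
        return False
```

An FST (or a DAFSA) can share suffixes as well as prefixes, which is where the compactness claim comes from; the trie above only shares prefixes.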
What if the "grey" traffic came from residential IP addresses using a normally distributed range of user agents? How would you reliably distinguish them from regular traffic?
Basically, we use two sorts of techniques: technical and behavioral.
Technical: if the User-Agent claims to be a regular browser (let's say Chrome 43), we check at the network level whether the client implements the HTTP protocol the way Chrome 43 usually does, and on the JS side whether the JavaScript rendering is correct for Chrome.
In case it's a real Chrome, we check whether the browser is controlled by an automation tool.
Behavioral: we check whether the path of requests is regular according to typical website usage.
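A deliberately toy sketch of the "technical" consistency check described above; the profile entries, header values, and function names are invented placeholders, not real browser fingerprints:

```python
# Hypothetical per-browser profiles of expected request properties.
# Real systems fingerprint TLS, HTTP/2 settings, header order, etc.
EXPECTED_PROFILE = {
    "Chrome/43": {
        "accept_encoding": "gzip, deflate, sdch",  # placeholder value
        "sends_accept_language": True,
    },
}

def looks_consistent(claimed_browser, headers):
    """True if the observed headers are consistent with the claimed browser."""
    profile = EXPECTED_PROFILE.get(claimed_browser)
    if profile is None:
        return True   # no profile for this browser -> can't judge
    if profile["sends_accept_language"] and "Accept-Language" not in headers:
        return False  # a real browser of this kind always sends it
    return headers.get("Accept-Encoding") == profile["accept_encoding"]
```

The point is only the shape of the check: a client claiming an identity is tested against what that identity would actually send.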
hdhzy | 9 years ago
Thanks for sharing!
Someone1234 | 9 years ago
throwasehasdwi | 9 years ago
Edited: to be less mean
alexandrnikitin | 9 years ago
Twirrim | 9 years ago
cm2187 | 9 years ago
John23832 | 9 years ago
I agree with your sentiment, but you should try to be a little more constructive.
jlg23 | 9 years ago
krzrak | 9 years ago
If I was writing a bot, I would set the user agent to some well-known and very popular value, e.g. the newest Chrome on Windows, or something like that.
senorjazz | 9 years ago
alexandrnikitin | 9 years ago
marklit | 9 years ago
pc86 | 9 years ago
doubleplusgood | 9 years ago
oblio | 9 years ago
NKCSS | 9 years ago
alexandrnikitin | 9 years ago
andrewgrowles | 9 years ago
frik | 9 years ago
A lot of manual work with various perf tools.
What's a bit missing is production performance monitoring (APM) that would give you this data without any manual interaction.
alexandrnikitin | 9 years ago
tener | 9 years ago
brilliantcode | 9 years ago
Benfromparis | 9 years ago
Disclaimer: I'm working at https://datadome.co, a bot protection tool.
alexandrnikitin | 9 years ago