I don't get what ByteDance is doing here. Clearly they are not actively trying to evade blocks, since they identify their bot with a user agent that sites can block.
However, surely they have enough smart engineers there to realize that running a bot at full speed (and, based on other reports, completely ignoring robots.txt) will get them blocked by a lot of sites.
If they just had a well-behaved spider, almost no one would mind. Getting crawled is a fact of life on the internet, and most website owners recognize it as an essential cost of doing business. Once you get a reputation as a bad spider, though, that is very hard to shake.
Is returning a 403 based on the user agent worth a blog post? Also, couldn't Bytespider just change its user agent to Byte-Spider, or make its user agent a random string? It will be a forever arms race and require constant code updates to keep chasing that bot by user agent. You're probably better off whitelisting the known user agents and blocking everything else.
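The 403-by-user-agent approach the thread is discussing can be sketched as Rack-style middleware. This is a minimal illustration, not the gem from the post; the class name and the pattern list are hypothetical (though Bytespider and GPTBot are real published crawler tokens):

```ruby
# Illustrative Rack-style middleware that denies requests from
# known crawler user agents before they reach the application.
class BotBlocker
  # Hypothetical denylist; extend with whatever bots you want to refuse.
  BLOCKED_PATTERNS = [/bytespider/i, /gptbot/i].freeze

  def initialize(app)
    @app = app
  end

  def call(env)
    ua = env["HTTP_USER_AGENT"].to_s
    if BLOCKED_PATTERNS.any? { |pattern| ua.match?(pattern) }
      # Short-circuit with a 403 instead of serving the page.
      [403, { "Content-Type" => "text/plain" }, ["Forbidden"]]
    else
      @app.call(env)
    end
  end
end

# Stand-in downstream app for demonstration.
app = ->(env) { [200, { "Content-Type" => "text/plain" }, ["OK"]] }
blocker = BotBlocker.new(app)

status, _headers, _body =
  blocker.call("HTTP_USER_AGENT" => "Mozilla/5.0 (compatible; Bytespider)")
# status == 403 for the Bytespider user agent
```

The same check could be inverted into the allowlist approach suggested above by matching known-good agents and rejecting everything else, at the cost of occasionally blocking legitimate but unrecognized clients.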
Also, does it really require a specific "gem"? This is HTTP request filtering, the router (as in the real router, like the metal box with network cables) can probably do it by itself these days.
It might not be, but I couldn't find much about the topic, so I figured I'd write it up and share. And you're right that this may be a bit of whack-a-mole, but for now I've cut my bandwidth down, which means I may be able to downgrade my Cloudinary plan to a lower tier. That's a big win for me, since it accounts for something like 20-30% of my total operating cost.
This is the worst-behaved bot I have ever seen, and I suspect it is AI related. I recently decided to block all the AI crawlers - unlike search engines, I get nothing from them.
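For reference, blocking AI crawlers via robots.txt looks like the sketch below. The user-agent tokens shown are the ones these crawlers publish, but this is only a polite request: as noted elsewhere in the thread, Bytespider reportedly ignores robots.txt, so server-side 403s remain the fallback.

```text
# Example robots.txt entries refusing some known AI crawlers
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```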
Yeah...I suck at optimizing for dark mode and I think I'm about to get too much traffic from this post so I can't fix it right now. Probably a tomorrow task haha
gizmo686|1 year ago
jd20|1 year ago
hajimuz|1 year ago
chptung|1 year ago
chasd00|1 year ago
phartenfeller|1 year ago
Also, why should they not respect the 403? Crawlers just go after anything they can find. It is not a targeted attack.
chptung|1 year ago
braden_e|1 year ago
chptung|1 year ago
unknown|1 year ago
[deleted]
mmaunder|1 year ago
Edit: Nice try on the vote brigade guys. lol
chptung|1 year ago
catoc|1 year ago
jsnell|1 year ago
https://www.nerdcrawler.com/robots.txt
The domain serving the images is allowing everything:
https://res.cloudinary.com/robots.txt
unknown|1 year ago
[deleted]