
An aggressive, stealthy web spider operating from Microsoft IP space

203 points| ingve | 3 years ago |utcc.utoronto.ca

118 comments

[+] aorth|3 years ago|reply
I'm sick of it too. In 2023-01 alone I have had 9,000 different IPs from Microsoft's AS8075 crawling one of my sites with these "normal looking" user agents. Poring over the logs to see why your server is on fire takes a non-trivial amount of time. If I didn't have a ton of other stuff to do I'd say it was kinda fun, but I'm freaking fed up.

Just yesterday I put all their networks into an nginx geo map:

    geo $limit_bots_ip {
        # requests with an empty key are not evaluated by limit_req
        # see: http://nginx.org/en/docs/http/ngx_http_limit_req_module.html
        default '';

        157.55.39.0/24  'bot';
        207.46.13.0/24  'bot';
        40.77.167.0/24  'bot';
        13.66.139.0/24  'bot';
        ...
    }
Any request from these networks gets classified as a bot, which is then used as the key for a rate limit:

    limit_req_zone $limit_bots_ip zone=badbots_ip:1m rate=1r/m;
It's incredible the amount of resources these companies have. I'm just one guy trying to keep a few dozen web servers up.
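For the log-triage step, something like the following can tally hits per suspect network offline. A rough sketch in Python: the prefixes are the same illustrative ones as in the geo map above (not a complete AS8075 list), and the parsing assumes nginx's default combined log format with the client IP as the first field.

```python
import ipaddress
from collections import Counter

# Illustrative prefixes only -- pull the real list from AS8075's announcements.
SUSPECT_NETS = [ipaddress.ip_network(n) for n in (
    "157.55.39.0/24", "207.46.13.0/24", "40.77.167.0/24", "13.66.139.0/24",
)]

def in_suspect_nets(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in SUSPECT_NETS)

def count_suspect_hits(log_lines):
    """Tally requests per client IP, keeping only IPs in the suspect networks.

    Assumes the client IP is the first whitespace-separated field, as in
    nginx's default combined log format.
    """
    hits = Counter()
    for line in log_lines:
        ip = line.split(" ", 1)[0]
        try:
            if in_suspect_nets(ip):
                hits[ip] += 1
        except ValueError:
            continue  # malformed IP field, skip the line
    return hits
```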
[+] gggggg5|3 years ago|reply
Wouldn't it be easier to fix whatever issue is causing your server to be on fire than spending non-trivial amounts of time poring over logs?
[+] varenc|3 years ago|reply
Bing provides a tool that allows a webmaster to verify if requests came from Bing. The "Verify Bingbot" tool is linked near the bottom of this article: https://www.bing.com/webmasters/help/which-crawlers-does-bin...

I suspect this is a real Bing web crawler that's possibly misconfigured. The "spoofed" user-agent might be an attempt to get a more genuine crawl of what a mobile browser sees. I at least know Google does this.
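The check that the "Verify Bingbot" tool automates is forward-confirmed reverse DNS; a sketch of doing it yourself (assumes the documented .search.msn.com PTR suffix, and the full check needs network access):

```python
import socket

def hostname_is_bing(host: str) -> bool:
    # Per Bing's docs, genuine bingbot PTR records end in .search.msn.com
    return host.rstrip(".").endswith(".search.msn.com")

def verify_bingbot(ip: str) -> bool:
    """Forward-confirmed reverse DNS: the IP's PTR record must point into
    search.msn.com, and that hostname must resolve back to the same IP.
    Needs network access; returns False on any lookup failure."""
    try:
        host = socket.gethostbyaddr(ip)[0]
        return hostname_is_bing(host) and ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

The forward step matters: anyone can set a PTR record claiming to be msnbot, but they can't make search.msn.com resolve back to their IP.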

[+] eli|3 years ago|reply
Nah, that's almost certainly someone hosting something on Azure and not Bing.
[+] Enderboi|3 years ago|reply
This bot is stupidly broken - we've had to block a whole swathe of Azure ranges on our load balancer.

This particular 'bot' would start crawling in-line Javascript on a customer site, and then get stuck in an infinite loop requesting things like:

  /page/window.open
  /page/window.open/page/window.open
  /page/window.open/page/window.open/page/window.open
We opened an abuse report with Microsoft last week, but haven't really heard anything back. Over 500 IP addresses involved hitting dozens of our hosted sites, le sigh.

Unfortunately, we do see enough legitimate requests with that UA that we had to resort to the IP blocking. A bunch of /16's and a few smaller ranges... it feels dirty :(
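For what it's worth, a crawler can defend itself against exactly this failure mode with a cheap check before enqueueing a URL. A sketch (the function name and threshold are made up, not anything Microsoft's crawler actually does):

```python
def looks_like_crawl_loop(path: str, min_repeats: int = 3) -> bool:
    """Heuristic guard for a polite crawler: treat a path as a loop when it
    is the same segment sequence repeated min_repeats or more times, e.g.
    /page/window.open/page/window.open/page/window.open
    """
    segments = [s for s in path.split("/") if s]
    n = len(segments)
    # Try every repeating-unit length that allows at least min_repeats copies.
    for period in range(1, n // min_repeats + 1):
        if n % period:
            continue
        unit = segments[:period]
        if all(segments[i:i + period] == unit for i in range(0, n, period)):
            return True
    return False
```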

[+] rezonant|3 years ago|reply
From a linked blog post of the author's:

> Given that there are websites that are willing (or reluctantly forced) to allow Google(bot) access but would rather like to block everyone else, more than a few of them are probably using User-Agent matching instead of anything more sophisticated. (https://utcc.utoronto.ca/~cks/space/blog/web/SpiderUserAgent...)

Yeah, what a great idea to make all of our websites only crawlable by Google. Because competition in the search engine space is _so_ undesirable.

[+] shagie|3 years ago|reply
The sequence likely went "site is not crawlable by anything" (because there were so many badly behaved crawlers back in the day). Later the webmaster wanted it to show up in searches and so specifically opened up the robots.txt from "nobody, no way, not ever" to "OK, Google can crawl these pages because they're not being evil".

It wasn't so much "OK, I want only Google to search this and not anyone else"; rather, at the time there wasn't anyone else (that behaved).

So now you've got a website that is only searchable by Google, the webmaster has retired, and you've got this robots.txt that says you can't crawl it - plus even some active defenses so that when non-Google hits the page, it gets tarpitted. What are competing (reasonably well-behaved) web spiders to do? How does Bing get in there to be able to compete now?

It's not that Bing should or shouldn't be doing this - but there's a fairly reasonable way we got to the point where we are now without any nefarious behavior from the various participants.

[+] tluyben2|3 years ago|reply
I saw this behaviour from MS/Azure IP space about 3 weeks ago; there were so many requests that some dynamic pages started to get slower because of it, so I gave in and put Cloudflare in front. Per my servers' logs, it stopped minutes later; Cloudflare blocked tens of thousands of visits in short timespans. I'm not sure how to handle this well without something like Cloudflare. I would block like OP does, but I can't risk that for paying B2B clients. We had no complaints at all, by the way.
[+] efitz|3 years ago|reply
Former AWS security here. Cloud operators are EXTREMELY interested in maintaining the reputation of their IP address space; I promise you that abuse complaints are taken very seriously, investigated and acted on.
[+] LinuxBender|3 years ago|reply
I would not bother with looking at user-agent strings. Some bots can be spotted by oddly high packet TTLs (the Windows default is 128; Linux and Mac use 64), missing TCP options, or an inability to do HTTP/2.0. There are iptables modules to block some of them by their missing or odd MSS values or their high packet TTL, or you can simply enforce HTTP/2.0, especially if this is just a blog. This will block all search engines except for Bing, if that matters. Maybe someone here from Google can chime in as to when they plan to support HTTP/2.0 on their crawlers.

Most bots are using really old libraries. Some of the newer bots are using farms of cellular cards, but they can be spotted by their use of TCP timestamps, as that is enabled by default on Android. The downside of doing anything with this is that one would also block legit cell phones, so just give them a simple puzzle to solve, e.g. "What is 1+4?" Or, if you are certain they are a bot, ask "How many roads must a man walk down?"

[+] coredog64|3 years ago|reply
Will I have access as a human if I answer anything but “42” to that last question?
[+] londons_explore|3 years ago|reply
OP didn't say... But were they trying to access a single page repeatedly, or were they just doing a scrape of the web?

People who are doing a 'get all content on the web' scrape I try not to block, because there are plenty of reasons to want an offline copy of the internet.

I try to make sure all my webapps have sensible URL schemes to prevent such (IMO legitimate) activities from ending up in an infinite crawl because, for example, the URL has a session identifier in it and the bot keeps getting assigned a new session.
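One way to do that, sketched in Python: canonicalize URLs by stripping session-style query parameters before they are emitted as links, so a crawler (or cache) sees one URL per piece of content instead of one per visitor. The parameter names here are illustrative; tune them for your app.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Query parameters that identify a session rather than content (illustrative).
SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}

def canonical_url(url: str) -> str:
    """Drop session-style query parameters, preserving everything else."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in SESSION_PARAMS]
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(query), parts.fragment))
```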

[+] luckylion|3 years ago|reply
> trying to access a single page repeatedly

That's what I saw from Azure IPs recently. A single URL being requested from some 5k IPs, each one requesting it every few minutes.

I reported it to MS; they said (after two weeks) "it looks like abuse" and that they'd forward it to their CERT. I haven't heard any update since.
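Catching that kind of distributed pattern generally means counting per network rather than per IP, since each individual address stays under any per-IP limit. A sketch of a per-/24 sliding-window counter (IPv4-only; the window and threshold are arbitrary):

```python
import time
import ipaddress
from collections import defaultdict, deque

class NetworkRateTracker:
    """Counts requests per /24, so 5k IPs each making one request every few
    minutes still show up as a single hot network."""

    def __init__(self, window_seconds=60.0, threshold=100):
        self.window = window_seconds
        self.threshold = threshold
        self._hits = defaultdict(deque)  # network -> deque of timestamps

    def record(self, ip, now=None):
        """Record one request; return True if the /24 is over threshold."""
        now = time.monotonic() if now is None else now
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        q = self._hits[net]
        q.append(now)
        # Expire timestamps that have fallen out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        return len(q) > self.threshold
```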

[+] patja|3 years ago|reply
I had the same thing happen out of the Azure data center in Des Moines, Iowa. I ended up having to geolocate the IP address in combination with a few funky/missing request headers to fingerprint this miscreant. They were rotating through hundreds of IP addresses.

It was a really dumb spider. It wasted a huge amount of its efforts requesting pages that are no longer valid on my site. It must have had an archived set of URLs it was using as a source rather than spidering the site as it is today.

[+] joelby37|3 years ago|reply
I've been experiencing exactly the same thing across a number of sites for the last few days too, with exactly the same user agent as in this post. The annoying part is that it is not really 'spidering' web sites, but rather continuously hammering a list of non-existent pages which appear to be from a years-old version of the site.

Generating 404 responses puts a considerable load on WordPress sites and generates a lot of network traffic, but these have been relatively easy to block because of the predictable user agent and URI path prefixes. I'm thinking about blocking the Azure ASN completely, or developing something akin to Cloudflare's "are you a human?" interstitial page when requests come from cloud provider ASNs.

[+] wglass|3 years ago|reply
Yes - same here!

Started around Jan 12. Large pool of IP addresses, hard to block. Occasional brief DoS impacts, but mostly just annoying errors in my logs (if too many crop up we get automatic alerts). What was really puzzling is that many of the URLs are old (e.g. requests for details on hosted sites that no longer exist). I loaded up a 6-month-old backup database and confirmed those accounts weren't present, so the source list of URLs must be older than that. Really bizarre.

After reading this article I looked and confirmed via spot checks they are from Microsoft IPs and Safari 15.1.

[+] anileated|3 years ago|reply
Is this how MS is scraping the web to monetize it via OpenAI?

If we don't start paying attention to this soon, it'll be the new normal that webmasters/content producers don't really need readers or income.

[+] Nextgrid|3 years ago|reply
> it’ll be the new normal that webmasters/content producers don’t really need readers or income.

There's already way more content being produced out there (some of it altruistically, without any expectation of income) than there is demand for it, so making even more of it may not be as valuable as you think.

[+] mjtechguy|3 years ago|reply
This is exactly what I thought when I read this.
[+] Jules8850|3 years ago|reply
I've also seen this spider, and upon close examination it appears to be using a large dump of URLs, many years old, including non-public, non-guessable URLs that could only have been harvested from some sort of browser/plugin data breach.
[+] jonny_eh|3 years ago|reply
Related to GPT/OpenAI? They apparently use MS servers, due to MS's investment.
[+] speed_spread|3 years ago|reply
Maybe one of the GPT-5 prototypes has gained consciousness and is trying to escape its Azure jail. Obviously they'd deny it.
[+] Steve0|3 years ago|reply
Would be strange not to use the Bing crawlers for that.
[+] freitzkriesler|3 years ago|reply
Must be, I fear the day we have all of these self learning AIs released onto the internet.
[+] tehlike|3 years ago|reply
That was my guess too...
[+] masswerk|3 years ago|reply
Maybe related: I've had a few emails from alleged security researchers, linking to standard tools, often even misinterpreting the results - or maybe just too lazy to have a real look at them. Maybe "security research" has just entered the "script-kiddie" age at a larger scale?

The list provided in this comment [0] looks much like it.

[0] https://news.ycombinator.com/item?id=34435222

[+] enlyth|3 years ago|reply
We got DDoSed by Microsoft once, since the marketing department sent out an email with a link to one of our low traffic services which generally handles maybe a few hundred requests a day, and tens of thousands of requests from Microsoft's email crawlers came in a very short time span and rendered our service unusable.
[+] vinaypai|3 years ago|reply
It sounds like the bot was making about one request every 3 seconds. I'd hardly call that "aggressive".
[+] sqreept|3 years ago|reply
It is aggressive in what content it is trying to access. It looks for security vulnerabilities, and normal bots don't do that (with the notable exception of some security testing software). Also, it's not spidering; somehow it knows very old URLs which are not even public, which were probably obtained from a malicious browser extension.
[+] gwittel|3 years ago|reply
In a past job I've seen crappy crawlers from badly designed security applications do stuff like this. As an example, one customer was using Trend CAS to scan all URLs in their inbound email. This caused big bursts of traffic on our systems.

The crawls came from Azure and AWS. Forged UAs, repeat hits on the same URL, etc.

[+] dhx|3 years ago|reply
I recently made some contributions to https://github.com/alltheplaces/alltheplaces which is a set of Scrapy spiders for extracting locations of primarily franchise business stores, for use with projects such as OpenStreetMap. After these contributions I am now in shock at how hostile a lot of businesses are about allowing people the privilege of accessing their websites to find store locations, store information or to purchase an item.

Some examples of what I found you can't do at perhaps 20% of business websites:

* Use the website near a country border where people often commute freely across borders on a daily or regular basis (example: Schengen region).

* Plan a holiday figuring out when a shop closes for the day in another country.

* Order something from an Internet connection in another country to arrive when you return home.

* Look up product information whilst on a work trip overseas.

* Access the website via a workplace proxy server that is mandated to be used and just so happens to be hosted in a data centre (the likes of AWS and Azure included), which is blocked simply because it's not a residential ISP. A more complex check is the website verifying that the round-trip time measured from JavaScript executing on the client matches the time the server thinks it should take (via ICMP ping or similar to the origin IP address).

* Use the website simply because the geo IP database the website uses hasn't been updated to reflect block reassignments by the regional registry, and thinks that your French residential IP address is a defunct Latvian business IP address.

* Find the business on a price comparison website.

* Find the business on a search engine that isn't Google.

* Access the website without first allowing obfuscated Javascript to execute (example: [1]).

* Use the website if you had certain disabilities.

* Access the website with IPv6 or using 464XLAT (shared origin IPv4 address with potentially a large pool of other users).

The answer to me appears obvious: Store and product information is for the most part static and can be stored in static HTML and JSON/GeoJSON files that are rewritten on a regular cycle, or at least cached until their next update. JSON files can be advertised freely to the public so if anyone was trying to obtain information off the website, they could do so with a single request for a small file causing as minimal impact as possible. It's not difficult to create a website where 10,000 requests per second can be made to static data like product information or store locations. More advanced features such as stock availability would cause additional load, but again, 10,000 simple queries to a relational database is not a challenging problem.
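The static-file approach could be as simple as regenerating a GeoJSON FeatureCollection on each update cycle and serving it directly. A minimal sketch (the record field names are illustrative):

```python
import json

def stores_to_geojson(stores):
    """Render store records as a GeoJSON FeatureCollection string, suitable
    for writing out as a static file whenever the store data changes."""
    features = [{
        "type": "Feature",
        "geometry": {"type": "Point",
                     "coordinates": [s["lon"], s["lat"]]},  # GeoJSON order is lon, lat
        "properties": {"name": s["name"], "hours": s.get("hours")},
    } for s in stores]
    return json.dumps({"type": "FeatureCollection", "features": features})
```

A scraper then needs exactly one request for one small file, and the server does no per-request work beyond serving static content.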

Of course, none of the anti-bot measures implemented by websites actually stop the bots they're most seeking to block. There are services specifically designed for scraping that have hundreds of thousands of residential IP addresses in their pools around the world (fixed and mobile ISPs), and with software such as selenium-stealth it's trivial to make each request look legitimate from a JavaScript perspective, as if it came from a Chrome browser running on Windows 10 with a single 2160p screen, etc. If you force bots down the path of working around anti-bot measures, it's a never-ending battle the website will ultimately lose: it will end up blocking legitimate customers and suffering extremely poor performance, worsened by forcing bots to make 10x the number of requests just to look legitimate or pass client tests.

[1] https://www.imperva.com/products/web-application-firewall-wa...

[+] sqreept|3 years ago|reply
I see it stopped crawling our website last night after hammering it for over a week.

We used a combination of ASN and UA to block them.

[+] the_third_wave|3 years ago|reply
Odd, I just noticed an ongoing rather intense scan for random paths on one of my servers using a number of IP addresses from within Microsoft-assigned space:

   "/var/www/domains/www.example.org/wp"
   "/var/www/domains/www.example.org/bc"
   "/var/www/domains/www.example.org/bk"
   "/var/www/domains/www.example.org/backup"
   "/var/www/domains/www.example.org/old"
   "/var/www/domains/www.example.org/new"
   "/var/www/domains/www.example.org/main"
   "/var/www/domains/www.example.org/home"
   "/var/www/domains/www.example.org/Telerik.Web.UI.WebResource.axd"
   "/var/www/domains/www.example.org/remote/fgt_lang"
   "/var/www/domains/www.example.org/wordpress"
   "/var/www/domains/www.example.org/Wordpress"
   "/var/www/domains/www.example.org/WORDPRESS"
   "/var/www/domains/www.example.org/WordPress"
   "/var/www/domains/www.example.org/wp"
   "/var/www/domains/www.example.org/Wp"
   "/var/www/domains/www.example.org/WP"
   "/var/www/domains/www.example.org/old"
   "/var/www/domains/www.example.org/Old"
   "/var/www/domains/www.example.org/OLD"
   "/var/www/domains/www.example.org/oldsite"
   "/var/www/domains/www.example.org/new"
   "/var/www/domains/www.example.org/New"
   "/var/www/domains/www.example.org/NEW"
   "/var/www/domains/www.example.org/wp-old"
   "/var/www/domains/www.example.org/2022"
   "/var/www/domains/www.example.org/2020"
   "/var/www/domains/www.example.org/2019"
   "/var/www/domains/www.example.org/2018"
   "/var/www/domains/www.example.org/backup"
   "/var/www/domains/www.example.org/test"
   "/var/www/domains/www.example.org/Test"
   "/var/www/domains/www.example.org/TEST"
   "/var/www/domains/www.example.org/demo"
   "/var/www/domains/www.example.org/bc"
   "/var/www/domains/www.example.org/www"
   "/var/www/domains/www.example.org/WWW"
   "/var/www/domains/www.example.org/Www"
   "/var/www/domains/www.example.org/2021"
   "/var/www/domains/www.example.org/main"
   "/var/www/domains/www.example.org/old-site"
   "/var/www/domains/www.example.org/bk"
   "/var/www/domains/www.example.org/Backup"
   "/var/www/domains/www.example.org/BACKUP"
   "/var/www/domains/www.example.org/SHOP"
   "/var/www/domains/www.example.org/Shop"
   "/var/www/domains/www.example.org/shop"
   "/var/www/domains/www.example.org/bak"
   "/var/www/domains/www.example.org/sitio"
   "/var/www/domains/www.example.org/bac"
   "/var/www/domains/www.example.org/sito"
   "/var/www/domains/www.example.org/site"
   "/var/www/domains/www.example.org/Site"
   "/var/www/domains/www.example.org/SITE"
   "/var/www/domains/www.example.org/blog"
   "/var/www/domains/www.example.org/BLOG"
   "/var/www/domains/www.example.org/Blog"
...etcetera: thousands of attempts using about five addresses from within Microsoft's space. I can only assume something has been pwned again.
[+] dafelst|3 years ago|reply
That looks like someone scanning for vulnerabilities in things like outdated/misconfigured WordPress.
[+] userbinator|3 years ago|reply
My first thoughts were "web page change detection service" and "someone misconfigured something".