Many issues with this analysis, some of which others have already mentioned, including:
• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP.
• In many cases any responding web server will be on the `www.` subdomain, rather than the domain that was listed/probed – and not everyone sets up `www.` to respond/redirect. (The author misinterprets appearances of `www.domain` and `domain` in his source list as errant duplicates, when in fact that may be an indicator that those `www.domain` entries also have significant `subdomain.www.domain` extensions – depending on what Majestic means by 'subnets'.)
• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied by some error response) can take the form of a more aggressive dropped connection.
• `curl` given a naked hostname likely attempts a plain-HTTP connection, and given that even browsers now auto-prefix `https:` for a naked hostname, some active sites likely have nothing listening on the plain-HTTP port anymore.
• The author's burst of activity could've triggered other rate limits/failures – either at shared hosts/inbound proxies serving many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening.
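The `www.` and `https:` points above are easy to account for when probing: try every scheme/host combination before calling a domain dead. A minimal sketch, not the author's script (`variants` and the probe loop are illustrative names):

```shell
# Enumerate the URL variants worth trying before declaring a domain dead:
# both schemes, with and without the www. prefix.
variants() {
  for host in "$1" "www.$1"; do
    for scheme in https http; do
      echo "$scheme://$host/"
    done
  done
}

variants example.com

# A cheap probe over those variants might then look like this (network code,
# shown but not run here; curl prints 000 when it got no HTTP response at all):
#   for u in $(variants example.com); do
#     curl -s -o /dev/null -w "$u %{http_code}\n" --max-time 5 "$u"
#   done
```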
If you want to probe whether domains are still active:
• confirm they're still registered via a `whois`-like lookup
• examine their DNS records for evidence of current services
• ping them, or any DNS-evident subdomains
• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)
• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services
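A sketch of the cheapest of those steps – whether the name resolves at all – using only standard tools; everything network-heavy (whois, MX lookup, ping, the SMTP dialogue) is indicated in comments rather than executed. Nothing here is from the article:

```shell
# Does the domain resolve at all? getent consults /etc/hosts and DNS,
# so it doubles as a lightweight resolution check.
dns_alive() {
  getent hosts "$1" > /dev/null
}

# The other steps from the list above, sketched (network commands, not run here):
#   whois "$domain"            # still registered?
#   dig +short MX "$domain"    # any evidence of mail service?
#   ping -c 1 "$domain"        # does the host answer ICMP?
# For the MX check without sending mail: open an SMTP session to the MX host,
# send "MAIL FROM:<>" then "RCPT TO:<postmaster@$domain>", read the reply code,
# and QUIT before ever issuing DATA.
```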
If you want to probe whether web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.
> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP
Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.
> `subdomain.www.domain`
Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.
> Many sites may block `curl` requests because they only want attended human browser traffic,
Citation needed, because if you do this, you'll also cut yourself off every search engine in existence.
And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the check can't tell an actual website apart from a blank domain-parking page.
It dawned on me when I hit the Majestic query page [1] and saw the link to "Commission a bespoke Majestic Analytics report." They run a bot that scans the web, and (my opinion, no real evidence) they probably don't include sites that block the MJ12bot. This could explain why my site isn't in the list, I had some issues with their bot [2] and they blocked themselves from crawling my site.
So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?
[1] https://majestic.com/reports/majestic-million
[2] http://boston.conman.org/2019/07/09-12
I downloaded the file and looked at the second 000 in his file, which refers to wixsite.com.
It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.
It's misleading to say that the sites are dead. As noted elsewhere, his source data is crap (other sites I checked, such as wixstatic.com, don't appear to be valid), but his methodology is also bad, or at least describing the sites as dead is misleading.
wixsite.com is a domain for free sites built on Wix, so if your username on Wix is smugma, and your site name is mysite, then you'll have a URL like smugma.wixsite.com/mysite for your Home page.
That's why this domain is in the top.
But docs.wixstatic.com is valid.
100% agree his methodology is broken. Another example like this is googleapis.com. If I remember correctly there are quite a number of domains like this in the Majestic Million.
Not to mention a number of his requests may have been blocked.
> there's a longtail of sites that had a variety of non-200 response codes but just to be conservative we'll assume that they are all valid
I'm honestly amazed that out of the top million sites, which probably includes a ton of tiny tiny sites that are idle or abandoned, only ten percent are offline.
Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain much more than what can be called a "site," especially in 2022 when the internet has been all but destroyed and all that's left is a corporate oligopoly.
How is "top" defined here? If they were dead, wouldn't they fairly quickly stop being "top"?
EDIT: the article uses a list sorted by inlinks, and I guess other websites don't necessarily update broken links, but that may be less true in the modern age, where we have tools and automated services that warn us about dead links on our websites.
The biggest problem I find is that keeping redirects in place when you move content seems to be considered pretty "outdated". So many links to news websites, etc. will redirect to either / or a 404 (which is a very odd thing to redirect to, in my opinion).
If you are unlucky an article you wanted to find also completely disappeared. This is scary, because it's basically history disappearing.
I also wonder what will happen to text on websites when some AJAX or JavaScript breaks because a third party goes down. While the Internet Archive seems to be building tools for people to use to mitigate this, I found that they barely worked on websites that do something like this.
Another worry is the ever-increasing size of these scripts making archiving more expensive.
You can often pop the URL into the Wayback Machine to bring up the last live copy. It's better at handling dynamic stuff the more recent it is. Older stuff, especially early AJAX pages, is just gone because the crawler couldn't handle it at the time. It's far from a perfect solution, especially in light of the big publishers finally getting their excuse to go after the Internet Archive legally. It's a good silo, but just as vulnerable as any other.
this is why i like the archive.ph project so much and using it more as a kind of bookmarking service.
I’m a no-www advocate. All my sites can be accessed from the apex domain. But some people for whatever reason like to prepend www to my domains, so I wrote a rule in Apache’s .htaccess to rewrite www to the apex.
Here’s a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W...
I used to feel the same way. — Until the arrival of so many new TLDs.
Since then I always use www, because mentioning www.alice.band in a sentence is much more of a hint to a general audience as to what I’m referring to than just alice.band
25 years ago I added a rule to my employer’s firewall to allow the bare domain to work on our web server.
Inbound email immediately broke. I was still very new, and didn’t want to prolong the downtime, so I reverted instead of troubleshooting.
A few months after I left, I sent an email to a former co-worker, my replacement, and got the same bounce message. I rang him up and verified that he had just set up the same firewall rule.
Been much too long to have any clue now what we did wrong.
I'm a www advocate and reroute my domains from the apex domain to www. When you use an apex domain, you have to use an A record, which means that if you have a server outage it is going to take time to update the record to point at a new IP address. If you use www with a CNAME, the final server IP can be switched quickly, assuming you've set the CNAME and network up for that functionality – you can't do that with an apex domain.
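To illustrate with a hypothetical zone fragment (all names and addresses here are made up): the apex has to carry the address record itself, while `www` can delegate to a name the hosting provider controls:

```
; hypothetical zone fragment -- not from any real configuration
example.com.      300  IN  A      203.0.113.10          ; apex: fixed IP, edited by you on failover
www.example.com.  300  IN  CNAME  lb.provider.example.  ; www: the provider re-points lb's address
```

Worth noting that some DNS providers offer ALIAS/ANAME-style records to get CNAME-like behaviour at the apex, but that's a provider-specific workaround rather than standard DNS.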
Free.fr, one of the biggest ISPs in France a while back, and perhaps still today, still runs all the old-school websites it was hosting for people (for free). It's quite insane, but a lot of the French web 1.0 is still alive today thanks to them. Truly an ISP run by passionate technical people.
Good on them. Last year I randomly discovered an ancient email to my old Hotmail address from free website host Tripod, owned at the time by Lycos, that old search engine. As an 11 year old I had a website with them and wanted to dig it out to see what I had put there. I managed to convince them I was the owner and got my access back, only to discover nothing there. I guess at some point in the ~20 years since I made one they nuked their dormant sites.
All these top-million lists are very good at telling you the topmost 10K-50K sites on the web. After that, you're going into 'crapshoot' land, where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.
So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.
That's literally the phenomenon the article is describing.
How are people determining the "top" sites? We do some of this at work and we pay SimilarWeb a giant sum of money, are people able to find site traffic in inexpensive ways which allow for these analyses?
Dude, it's the second sentence of the first paragraph:
> For my purposes, the Majestic Million dataset felt like the perfect fit as it is ranked by the number of links that point to that domain (as well as taking into account diversity of the origin domains as well).
Last time I tried to crawl that many domains, I ran into problems with my ISP's DNS server. I ended up using a pool of public DNS servers to spread out all the requests. I'm surprised that wasn't an issue for the author?
I've been working lately on migrating sites I ran in 2008 or so into my new preferred hosting strategy: I know zero people look at them, since many are functionally broken at present, but I don't like the idea of actually removing them from the web. So I'm patching them up, migrating them to a more maintainable setting, and keeping them going. Maybe someday some historian will get something out of it.
It happens. Most of the stuff we do these days invokes a number of disciplines. I forget sometimes that maybe ten percent of us just play with random CS domains for “fun” and that most people are coming into big problems blind, even sometimes the explorers (though having comfort with exploring random fields is a skill set unto itself).
Before the Cloud, when people would ask for a book on distributed computing, which wasn’t that often, I would tell them seriously “Practical Parallel Rendering”. That book was almost ten years old by then. 20 now. It’s ostensibly a book about CGI, but CGI is about distributed work pools, so half the book is a whirlwind tour of distributed computing and queuing theory. Once they start talking at length about raytracing, you can stop reading if CGI isn’t your thing, but that’s more than halfway through the book.
I still have to explain some of that stuff to people, and it catches them off guard because they think surely this little task is not so sophisticated as that…
I think this is where the art comes in. You can make something fiddly that takes constant supervision, so much so that you get frustrated trying to explain it to others, or you can make something where you push a button and magic comes out.
"a very reasonable but basic check would be to check each domain and verify that it was online and responsive to http requests. With only a million domains, this could be run from my own computer relatively simply and it would give us a very quick temperature check on whether the list truly was representative of the “top sites on the internet”. "
This took him 50 minutes to run. Think about that when you want to host something smaller than a large commercial site. We live in the future now, where bandwidth is relatively high and computers are fast. Point being that you don't need to rent or provision "big infrastructure" unless you're actually quite big.
Your point has a truth behind it for sure, but there's a large difference between serving requests and making requests. Many sites are simple HTML and CSS pages, but many others have complex backends. It's those that are often hard to scale, and that's why the cloud is hugely popular: maintaining and scaling the backend is hard.
> you don't need to rent or provision "big infrastructure" unless you're actually quite big.
Or if you have hard response-time requirements. I really don't think it would be good to, for example, wait an hour to process the data from 800K earthquake sensors and send out an alert to nearby affected areas.
Just one thing: analyzing sites by total referring domains is not accurate, as your result showed. A backlink can be easily faked, and you can literally spam 1 million links within a day for any domain. Thus, this data source is not very useful.
For a more accurate result, try the Ahrefs top 1 million domains, ranked by their traffic. Ahrefs ranks sites by their ranking keywords and infers traffic numbers from those, meaning these websites are live and ranking for some keywords.
You will see the result is much more accurate then; maybe not even a single website will be offline, because they are earning good cash.
I don't have any particular opinions on the author's conclusions, but I learned a thing or two about the power of terminal commands by reading through the article. I had no idea that xargs had a parallel mode.
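For anyone else who hadn't met it, `-P` is the flag: it keeps up to N invocations running at once. An offline demo (the curl pairing the article describes is sketched in the comment, with `domains.txt` as an assumed input file):

```shell
# -P 4 runs up to four child processes at once, -n 1 passes one argument per
# invocation. Output order may vary across runs, hence the sort.
printf 'alpha\nbeta\ngamma\n' | xargs -P 4 -n 1 echo | sort

# Article-style use (network code, not run here):
#   xargs -P 50 -n 1 -I {} \
#     curl -s -o /dev/null -w '{} %{http_code}\n' --max-time 5 'https://{}' \
#     < domains.txt
```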
Probably not news to anyone who works with big data™, but I learned, after additional searches, that using (something like) duckdb as a CSV parser makes sense, especially if the alternative is loading the entire thing to memory with (something like) base R. This was informative for me: https://hbs-rcs.github.io/large_data_in_R/.
Having the luxury of scrutinizing the method and retesting: "normalizing" domains by skipping the www prefix skewed the results – not all websites redirect between the apex and www (and between schemes). Some servers also wouldn't answer a request with curl's default Accept header and needed encouragement.
I retested the 000 class of the .de ccTLD (1227 domains) and found more than a third (473) of them answering when prefixed with www. Lots of German universities were false negatives – whether this is representative I cannot tell, just a hint to retest.
The takeaway from this is slightly off. There aren't 107776 sites that are dead; there are 107776 sites that don't run an HTTP server, or are otherwise dead.
If you try to connect via HTTP or HTTPS, a quick run yields 91106 sites that are dead, or 9.11%.
(And I ran this test on an AWS EC2 node with a fairly aggressive timeout. No doubt some % of sites play dead to AWS, or didn't respond fast enough for me.)
Most cities in Poland have their own $city.pl domain and allow websites to buy $website.$city.pl. That might not be well known. And cities have their own websites, so I guess it's OK.
But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!
And you're entirely correct that the author should've referred to such a list.