10% of the top million sites are dead

375 points | Soupy | 3 years ago | ccampbell.io

143 comments

[+] gojomo|3 years ago|reply
Many issues with this analysis, some others have already mentioned, including:

• The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

• In many cases any responding web server will be on the `www.` subdomain, rather than the domain that was listed/probed – & not everyone sets up `www.` to respond/redirect. (Author misinterprets appearances of `www.domain` and `domain` in his source list as errant duplicates, when in fact that may be an indicator that those `www.domain` entries also have significant `subdomain.www.domain` extensions – depending on what Majestic means by 'subnets'.)

• Many sites may block `curl` requests because they only want attended human browser traffic, and such blocking (while usually accompanied by some error response) can take the more aggressive form of simply dropping the connection.

• `curl` given a naked hostname likely attempts a plain HTTP connection, and given that even browsers now auto-prefix `https:` for a naked hostname, some active sites likely have nothing listening on the plain-HTTP port anymore.

• Author's burst of activity could've triggered other rate-limits/failures - either at shared hosts/inbound proxies servicing many of the target domains, or at local ISP egresses or DNS services. He'd need to drill down into individual failures to get a better idea of the extent to which this might be happening.
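The `www.` and HTTPS points above can be folded into a more forgiving probe. A minimal sketch (hostnames, timeout, and the User-Agent string are illustrative, not from the article):

```python
# A more forgiving probe than a bare `curl <domain>`: try HTTPS before HTTP,
# and the www. variant as well as the apex.
import urllib.error
import urllib.request

def candidate_urls(domain: str) -> list[str]:
    """Expand a bare domain into the URL variants worth probing."""
    hosts = [domain] if domain.startswith("www.") else [domain, f"www.{domain}"]
    return [f"{scheme}://{host}/" for host in hosts for scheme in ("https", "http")]

def probe(domain: str, timeout: float = 10.0):
    """Return the first HTTP status seen across the variants, else None."""
    # A browser-like User-Agent reduces (but does not eliminate) bot blocking.
    headers = {"User-Agent": "Mozilla/5.0 (liveness-check)"}
    for url in candidate_urls(domain):
        req = urllib.request.Request(url, headers=headers)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.status
        except urllib.error.HTTPError as err:
            return err.code  # any response code at all means "not dead"
        except (urllib.error.URLError, OSError):
            continue  # try the next scheme/host variant
    return None
```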

If you want to probe if domains are still active:

• confirm they're still registered via a `whois`-like lookup

• examine their DNS records for evidence of current services

• ping them, or any DNS-evident subdomains

• if there are any MX records, check if the related SMTP server will confirm any likely email addresses (like postmaster@) as deliverable. (But: don't send an actual email message.)

• (more at risk of being perceived as aggressive) scan any extant domains (from DNS) for open ports running any popular (not just HTTP) services

If you want to probe if web sites are still active, start with an actual list of web site URLs that were known to have been active at some point.
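A minimal sketch of the DNS portion of that checklist, using only the Python stdlib (whois lookups, MX records, and SMTP verification need third-party tools, so they're only noted in comments):

```python
# DNS-first liveness checks: cheap, and far less intrusive than an HTTP burst.
import socket

def resolves(host: str) -> bool:
    """True if the name has any A/AAAA record, i.e. DNS still points somewhere."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

def domain_looks_active(domain: str) -> bool:
    # A live apex or www record is weak but cheap evidence of activity.
    # Stronger evidence would come from whois (e.g. python-whois) and
    # MX lookups (e.g. dnspython), per the checklist above.
    return resolves(domain) or resolves(f"www.{domain}")
```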

[+] thematrixturtle|3 years ago|reply
> The 'domains' collected by the source, as those "with the most referring subnets", aren't necessarily 'websites' that now, or ever, responded to HTTP

Majestic promotes their list as the "top 1 million websites of the world", not domains. You would thus expect that every entry in their list is (was?) a website that responds to HTTP.

> `subdomain.www.domain`

Is this really a thing? I mean, I know it's technically possible, but I don't think I've ever seen anybody do it.

> Many sites may block `curl` requests because they only want attended human browser traffic,

Citation needed, because if you do this, you'll also cut yourself off from every search engine in existence.

And for kicks, I'll add one reason why the 900k valid sites is almost certainly an overestimate: the search can't tell an actual website apart from a blank domain-parking page.

[+] spc476|3 years ago|reply
It dawned on me when I hit the Majestic query page [1] and saw the link to "Commission a bespoke Majestic Analytics report." They run a bot that scans the web, and (my opinion, no real evidence) they probably don't include sites that block the MJ12bot. This could explain why my site isn't in the list: I had some issues with their bot [2], and they blocked themselves from crawling my site.

So, is this a list of the actual top 1,000,000 sites? Or just the top 1,000,000 sites they crawl?

[1] https://majestic.com/reports/majestic-million

[2] http://boston.conman.org/2019/07/09-12

[+] smugma|3 years ago|reply
I downloaded the file and looked at the second 000 in his file, which refers to wixsite.com.

It appears that wixsite.com isn't valid but www.wixsite.com is, and redirects to wix.com.

It's misleading to say that the sites are dead. As noted elsewhere, his source data is crap (other sites I checked, such as wixstatic.com, don't appear to be valid), but his methodology is also bad, or at the least his describing the sites as dead is misleading.

[+] code123456789|3 years ago|reply
wixsite.com is a domain for free sites built on Wix, so if your username on Wix is smugma, and your site name is mysite, then you'll have a URL like smugma.wixsite.com/mysite for your Home page.

That's why this domain is in the top million.

[+] zinekeller|3 years ago|reply
> other sites I checked such as wixstatic.com don't appear to be valid

But docs.wixstatic.com is valid.

[+] winddude|3 years ago|reply
100% agree his methodology is broken. Another example like this is googleapis.com. If I remember correctly, there are quite a number of domains like this in the Majestic Million.

Not to mention a number of his requests may have been blocked.

[+] quickthrower2|3 years ago|reply
He takes this into account by generously considering any returned response code as “not dead”.

> there’s a longtail of sites that had a variety of non-200 response codes but just to be conservative we’ll assume that they are all valid

[+] bioemerl|3 years ago|reply
I'm honestly amazed that out of the top million sites, which probably includes a ton of tiny tiny sites that are idle or abandoned, only ten percent are offline.
[+] mike_hock|3 years ago|reply
Yeah, I'd expect a list of 1,000,000 "top" "sites" to contain much more than what can be called a "site," especially in 2022 when the internet has been all but destroyed and all that's left is a corporate oligopoly.
[+] MonkeyMalarky|3 years ago|reply
How many are placeholder pages thrown up by registrars like Network Solutions?
[+] ehsankia|3 years ago|reply
How is "top" defined here? If they were dead, wouldn't they fairly quickly stop being "top"?

EDIT: the article uses a list sorted by inlinks, and I guess other websites don't necessarily update broken links, but that may be less true in the modern age where we have tools and automated services to automatically warn us about dead links on our websites.

[+] Swizec|3 years ago|reply
Blows my mind that my blog is 210863rd on that list. That makes the web feel somehow smaller than I thought it was.
[+] tete|3 years ago|reply
The biggest problem I find is that keeping redirects in place when you move stuff seems to have fallen out of fashion. So many links to news websites, etc. will cause a redirect to either / or a 404 (which is a very odd thing to redirect to, in my opinion).

If you are unlucky an article you wanted to find also completely disappeared. This is scary, because it's basically history disappearing.

I also wonder what will happen to text on websites that use some AJAX, when the JavaScript breaks because a third party goes down. While the Internet Archive seems to be building tools for people to use to mitigate this, I found that they barely worked on websites that do something like this.

Another worry is the ever-increasing size of these scripts making archiving more expensive.

[+] Kye|3 years ago|reply
You can often pop the URL into the Wayback Machine to bring up the last live copy. It's better at handling dynamic stuff the more recent it is. Older stuff, especially early AJAX pages, is just gone because the crawler couldn't handle it at the time. It's far from a perfect solution, especially in light of the big publishers finally getting their excuse to go after the Internet Archive legally. It's a good silo, but just as vulnerable as any other.
[+] nikisweeting|3 years ago|reply
ArchiveWeb.page + ReplayWeb.page are the best I've found at handling ajax loaded content.
[+] gravitate|3 years ago|reply
> Domain normalization is a bitch

I’m a no-www advocate. All my sites can be accessed from the apex domain. But some people for whatever reason like to prepend www to my domains, so I wrote a rule in Apache’s `.htaccess` to rewrite the www to the apex.

Here’s a tutorial for doing that: https://techstream.org/Web-Development/HTACCESS/WWW-to-Non-W...
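One common mod_rewrite pattern for this (a generic sketch, not necessarily what the linked tutorial uses; the redirect target scheme is assumed to be HTTPS):

```apache
RewriteEngine On
# If the Host header starts with www., capture the rest of the hostname...
RewriteCond %{HTTP_HOST} ^www\.(.+)$ [NC]
# ...and issue a permanent redirect to the apex, preserving the path.
RewriteRule ^ https://%1%{REQUEST_URI} [R=301,L]
```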

[+] noizejoy|3 years ago|reply
> I’m a no-www advocate.

I used to feel the same way. — Until the arrival of so many new TLDs.

Since then I always use www, because mentioning www.alice.band in a sentence is much more of a hint to a general audience as to what I’m referring to than just alice.band

[+] macintux|3 years ago|reply
25 years ago I added a rule to my employer’s firewall to allow the bare domain to work on our web server.

Inbound email immediately broke. I was still very new, and didn’t want to prolong the downtime, so I reverted instead of troubleshooting.

A few months after I left, I sent an email to a former co-worker, my replacement, and got the same bounce message. I rang him up and verified that he had just set up the same firewall rule.

Been much too long to have any clue now what we did wrong.

[+] agraddy|3 years ago|reply
I'm a www advocate and reroute my domains from the apex domain to www. When you use an apex domain, you have to use an A record, which means that if you have a server outage, it takes time to update the record to point at a new IP address. If you use www with a CNAME, the final server IP can be quickly switched, assuming you've set the CNAME and network up for that functionality - you can't do that with an apex domain.
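The distinction looks like this in a zone file (a sketch; names, TTLs, and addresses are placeholders):

```
; apex must carry an address record, so a failover means editing this line:
example.com.      300  IN  A      203.0.113.10
; www can alias a hosting provider's name, which the provider can repoint:
www.example.com.  300  IN  CNAME  lb.hosting.example.
```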
[+] baby|3 years ago|reply
Free.fr, one of the biggest ISPs in France a while back, and perhaps still today, still runs all the old-school websites it was hosting for people (for free). It's quite insane, but a lot of the French web 1.0 is still alive today thanks to them. Truly an ISP run by passionate technical people.
[+] ssl232|3 years ago|reply
Good on them. Last year I randomly discovered an ancient email to my old Hotmail address from free website host Tripod, owned at the time by Lycos, that old search engine. As an 11 year old I had a website with them and wanted to dig it out to see what I had put there. I managed to convince them I was the owner and got my access back, only to discover nothing there. I guess at some point in the ~20 years since I made one they nuked their dormant sites.
[+] altdataseller|3 years ago|reply
All these top million lists are very good at telling you the top most 10K-50K sites on the web. After that, you're going into 'crapshoot' land, where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

So I would take this data with a grain of salt. You're better off just analyzing the top 100K sites on these lists.

[+] giantrobot|3 years ago|reply
> where the 500,000th most popular site is very likely to be a site that got some traffic a long time ago, but now isn't even up.

That's literally the phenomenon the article is describing.

[+] TuringNYC|3 years ago|reply
How are people determining the "top" sites? We do some of this at work and we pay SimilarWeb a giant sum of money, are people able to find site traffic in inexpensive ways which allow for these analyses?
[+] the_biot|3 years ago|reply
By what possible criteria are these the "top" million sites, if 10% are dead? I'd start with questioning that data.
[+] kjeetgill|3 years ago|reply
Dude, it's the second sentence of the first paragraph:

> For my purposes, the Majestic Million dataset felt like the perfect fit as it is ranked by the number of links that point to that domain (as well as taking into account diversity of the origin domains as well).

[+] MonkeyMalarky|3 years ago|reply
Last time I tried to crawl that many domains, I ran into problems with my ISP's DNS server. I ended up using a pool of public DNS servers to spread out all the requests. I'm surprised that wasn't an issue for the author?
[+] wumpus|3 years ago|reply
You have to run your own resolver. Crawling 101.
[+] ocdtrekkie|3 years ago|reply
I've been working on trying to migrate sites I ran in 2008 or so into my new preferred hosting strategy lately: I know zero people look at them, since many were functionally broken at present, but I don't like the idea of actually removing them from the web. So I'm patching them up, migrating them to a more maintainable setting, and keeping them going. Maybe someday some historian will get something out of it.
[+] macintux|3 years ago|reply
Title is misleading: that’s the outcome, but the bulk of the story is the data processing to reach that conclusion.
[+] hinkley|3 years ago|reply
It happens. Most of the stuff we do these days invokes a number of disciplines. I forget sometimes that maybe ten percent of us just play with random CS domains for “fun” and that most people are coming into big problems blind, even sometimes the explorers (though having comfort with exploring random fields is a skill set unto itself).

Before the Cloud, when people would ask for a book on distributed computing, which wasn’t that often, I would tell them seriously “Practical Parallel Rendering”. That book was almost ten years old by then. 20 now. It’s ostensibly a book about CGI, but CGI is about distributed work pools, so half the book is a whirlwind tour of distributed computing and queuing theory. Once they start talking at length about raytracing, you can stop reading if CGI isn’t your thing, but that’s more than halfway through the book.

I still have to explain some of that stuff to people, and it catches them off guard because they think surely this little task is not so sophisticated as that…

I think this is where the art comes in. You can make something fiddly that takes constant supervision, so much so that you get frustrated trying to explain it to others, or you can make something where you push a button and magic comes out.

[+] phkahler|3 years ago|reply
Read that again folks:

"a very reasonable but basic check would be to check each domain and verify that it was online and responsive to http requests. With only a million domains, this could be run from my own computer relatively simply and it would give us a very quick temperature check on whether the list truly was representative of the “top sites on the internet”. "

This took him 50 minutes to run. Think about that when you want to host something smaller than a large commercial site. We live in the future now, where bandwidth is relatively high and computers are fast. Point being that you don't need to rent or provision "big infrastructure" unless you're actually quite big.

[+] jayd16|3 years ago|reply
The flip side is anyone can run these kinds of tools against your site easily and cheaply.
[+] stevemk14ebr|3 years ago|reply
Your point has some truth behind it for sure, but there's a large difference between serving requests and making requests. Many sites are simple HTML and CSS pages, but many others have complex backends. It's those that are often hard to scale, and why the cloud is hugely popular: maintaining and scaling the backend is hard.
[+] cratermoon|3 years ago|reply
> you don't need to rent or provision "big infrastructure" unless you're actually quite big.

Or if you have hard response-time requirements. I really don't think it would be good to, for example, wait an hour to process the data from 800K earthquake sensors and send out an alert to nearby affected areas.

[+] mouzogu|3 years ago|reply
whenever i go through my bookmarks, i tend to find maybe 5-10% are now 404.

this is why i like the archive.ph project so much and using it more as a kind of bookmarking service.

[+] syedkarim|3 years ago|reply
What’s the benefit to using archive.ph instead of archive.org (Internet Archive)? Seems like the latter is much more likely to be around for a while.
[+] system2|3 years ago|reply
archive.ph = Russian federation website. Blocked by most firewalls by default.
[+] yajjackson|3 years ago|reply
Tangential, but I love the format for your site. Any plans to do a "How I built this blog" post?
[+] kerbersos|3 years ago|reply
Likely using Hugo with the congo theme
[+] terrycody|3 years ago|reply
Nice work.

Just one thing: ranking sites by total referring domains is not accurate, as your results showed. A backlink can be easily faked, and you can literally spam 1 million links within 1 day for any domain. Thus, this data source is not very useful.

For a more accurate result, try using Ahrefs' top 1 million domains, ranked by traffic. Ahrefs ranks sites by the keywords they rank for, and infers traffic numbers from that - meaning these websites are live and ranking for at least some keywords.

You will see the result is much more accurate then; maybe not even a single website will be offline, because they are earning good cash.

[+] allknowingfrog|3 years ago|reply
I don't have any particular opinions on the author's conclusions, but I learned a thing or two about the power of terminal commands by reading through the article. I had no idea that xargs had a parallel mode.
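For comparison, the fan-out that `xargs -P` provides can be sketched in Python with a thread pool (`check` here is a placeholder, not the article's code):

```python
# Fan a per-domain check out over N worker threads, like `xargs -P N`.
from concurrent.futures import ThreadPoolExecutor

def check(domain: str) -> tuple[str, bool]:
    # Placeholder: a real version would issue an HTTP request here.
    return domain, bool(domain)

def check_all(domains, workers: int = 50) -> dict[str, bool]:
    # map() preserves input order; threads suit this I/O-bound workload.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(check, domains))
```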
[+] thelamest|3 years ago|reply
Probably not news to anyone who works with big data™, but I learned, after additional searches, that using (something like) duckdb as a CSV parser makes sense, especially if the alternative is loading the entire thing to memory with (something like) base R. This was informative for me: https://hbs-rcs.github.io/large_data_in_R/.
[+] flas9sd|3 years ago|reply
Having the luxury of scrutinizing the method and retesting: "normalizing" domains by stripping the www skewed the results - not all websites redirect between apex and www (and schemes). Some servers also weren't answering requests with curl's default `Accept` header and needed encouragement.

I retested the 000 class of the .de ccTLD (1227 domains) and found more than a third (473) of them answering when prefixed with www. Lots of German universities were false negatives. Whether this is representative I cannot tell; just a hint to retest.

[+] banana_giraffe|3 years ago|reply
The takeaway from this is slightly off. There aren't 107776 sites that are dead; there are 107776 sites that don't run an HTTP server, or are otherwise dead.

If you try to connect via HTTP or HTTPS, then a quick run yields 91106 sites that are dead, or 9.11%

(And I ran this test on an AWS EC2 node with a fairly aggressive timeout. No doubt some % of sites play dead to AWS, or didn't respond fast enough for me)

[+] kozziollek|3 years ago|reply
Most cities in Poland have their own $city.pl domain and allow websites to buy $website.$city.pl. That might not be well known. And the cities have their own websites, so I guess it's OK.

But info.pl and biz.pl? Did nobody hear about country variants of gTLDs?!