By profiling only the top 1M sites, I wonder if this samples from an unrepresentative set. I suspect the frequency of WordPress use goes up the further down the list you go: some random blog is less likely to be in the top 1M, yet more likely to use WordPress.
I suspect this is so. The top million (specifically a subset of that) is where the sites fall whose owners can afford to hire people to build their own custom stuff. Small independent sites seem more likely to rely on something like WordPress.
It'd be interesting to do all that you said (and more), then determine the combined total, as well as what % of sites do some sort of obfuscation... and why.
I don't think there's a foolproof way to know if a site is built with WordPress, although you can infer it from clues: headers, the meta generator tag, robots.txt, the login cookie, the sitemap...
Some other notes:
1) you're not checking subdomains like blog.company.com or paths like company.com/blog
2) if you use something like zgrab you can do 1M site crawl in a couple of hours. Consider checking it out.
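As a sketch of the first note, a crawl could expand each bare domain into a few likely blog locations before probing them (the specific subdomains and paths below are illustrative, not an exhaustive list):

```python
def candidate_urls(domain):
    """Expand a bare domain into likely blog locations to probe.

    The subdomain/path choices here are assumptions; real crawls
    might also try www., /news/, etc.
    """
    return [
        f"https://{domain}/",
        f"https://{domain}/blog/",
        f"https://blog.{domain}/",
    ]
```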
The Readme says "Warning that it can take a long time: between 20 to 30 days."
How in the world can it take so long? The CSV file is only 24 MB, and the computation can't be that advanced. Did the author do something seriously wrong?
What? I guess this is a toy program used to learn Clojure or something; it even uses sed for line parsing. A 10-line PHP script could do the same with a few MB of RAM.
This is a pretty amazing feat. The top 1 million sites include many that have the money to afford custom sites, and yet WordPress still powers almost 1/5 of them.
WordPress is the software HN loves to hate, but while it certainly has plenty of warts it's also a very flexible, pliable system for building the kinds of web sites that most people want to build. It'll never win any architectural beauty contests, but market share is driven by utility, not beauty. And WordPress can be very useful software.
I'm wondering why this takes 20-30 days to run in total. That seems crazy for 1M requests. Could one make this concurrent and get much greater throughput?
The script seems to detect a WordPress site by looking for a meta generator tag containing WordPress:
https://github.com/tanrax/calculate-wordpress-usage/blob/5aa...
It's pretty common to remove that meta tag — popular WordPress theme frameworks like Genesis do it by default.
A more reliable test would be to look for additional strings in the source that point to the use of WordPress, such as “wp-content” and “wp-includes”.
A faster way that avoids string searches would be to send an HTTP HEAD request to `/wp-login.php` and check for:
Set-Cookie: wordpress_test_cookie
(/wp-login.php doesn't always appear in the root directory and it's not always accessible to all IPs, but that setup is most common).
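Combining those signals, a rough detector might look like the sketch below. This is an assumption about how one could implement the heuristics discussed here, not the author's actual script; `headers` stands in for the response headers of a HEAD request to /wp-login.php:

```python
import re

def looks_like_wordpress(html, headers=None):
    """Heuristic WordPress check combining the signals discussed above.

    `html` is the homepage source; `headers` is an optional dict of
    response headers from a HEAD request to /wp-login.php.
    """
    # 1) The meta generator tag the original script relies on
    #    (often removed by theme frameworks like Genesis).
    if re.search(r'<meta[^>]+generator[^>]+WordPress', html, re.I):
        return True
    # 2) Asset paths that usually survive even when the tag is removed.
    if "wp-content" in html or "wp-includes" in html:
        return True
    # 3) The login test cookie, when /wp-login.php is reachable.
    if headers and "wordpress_test_cookie" in headers.get("Set-Cookie", ""):
        return True
    return False
```

None of these checks is conclusive on its own, but together they should catch most unobfuscated installs.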
Custom themes etc. might choose to omit that, so it's not a 100% reliable check.
How is that a random sample?
However, this is an [embarrassingly parallel](https://en.wikipedia.org/wiki/Embarrassingly_parallel) problem, and renting some machines would speed it up.
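A minimal sketch of that parallelism: each site is checked independently, so a thread pool scales throughput with the worker count until the network becomes the bottleneck. The function name, worker count, and injected `check` callable are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def count_matches(domains, check, workers=200):
    """Apply `check` (e.g. fetch a homepage and test it for WordPress
    markers) to every domain concurrently.

    Each domain is independent, which is what makes the crawl
    embarrassingly parallel: there is no shared state between tasks.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order and blocks until all tasks finish.
        return sum(1 for hit in pool.map(check, domains) if hit)
```

With a few hundred workers, 1M mostly-I/O-bound requests should take hours, not weeks.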