
Cached Chrome Top Million Websites

261 points | edent | 3 years ago | github.com

97 comments

[+] mg|3 years ago|reply
Top level domains by popularity:

    grep -oP '\.[a-z]+(?=,)' current.csv | sort | uniq -c | sort -n

  ...
  15840 .pl
  17914 .it
  20182 .de
  21690 .in
  27812 .ru
  29194 .jp
  30359 .org
  35741 .br
  36675 .net
 406052 .com
.com domains by popularity:

    grep -oP '[a-z0-9-]+\.com(?=,)' current.csv | sort | uniq -c | sort -n

    ...
    365 tistory.com
    370 fc2.com
    408 skipthegames.com
    489 online.com
    515 wordpress.com
    707 uptodown.com
    880 schoology.com
   2570 fandom.com
   2651 instructure.com
   3244 blogspot.com
[+] azeemba|3 years ago|reply
It might be worth updating this comment to explain your second query.

People seem to think it is somehow measuring visits to those origins, but it's actually measuring how many unique subdomains are listed for each of those domains.
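A toy reproduction makes this concrete (the sample CSV below is hypothetical, standing in for the real current.csv): each subdomain is its own origin row, so grepping for the parent domain counts distinct subdomains, not traffic.

```shell
# Hypothetical sample in the CrUX origin,rank format
cat > sample.csv <<'EOF'
https://foo.blogspot.com,1000
https://bar.blogspot.com,5000
https://example.com,1000
https://baz.blogspot.com,10000
EOF

# Same pipeline as the parent comment's second query:
# blogspot.com gets a count of 3 because three subdomains are listed
grep -oP '[a-z0-9-]+\.com(?=,)' sample.csv | sort | uniq -c | sort -n
```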

[+] egman_ekki|3 years ago|reply
Rather amazing to see the almost-abandoned blogspot.com there at the top.

Also interesting that I hadn't heard of half of them. Some are NSFW, apparently.

[+] kristianp|3 years ago|reply
Loading the data into the duckdb cli [0] and doing the first query:

    create table current as select * from '202211.csv';
    select * from current limit 10;
    ┌────────────────────────────────────┬─────────┐
    │               origin               │  rank   │
    │              varchar               │  int32  │
    ├────────────────────────────────────┼─────────┤
    │ https://hochi.news                 │    1000 │
    │ https://www.xnxx.xxx               │    1000 │
    │ https://www.wordreference.com      │    1000 │
    │ https://finance.naver.com          │    1000 │
    │ https://www.macys.com              │    1000 │
    │ https://www.xv-videos1.com         │    1000 │
    │ https://fr.xhamster.com            │    1000 │
    │ https://poki.com                   │    1000 │
    │ https://salonboard.com             │    1000 │
    │ https://clgt.one                   │    1000 │


    select tld, count(*) 
    from (select reverse(substr(reverse(origin),1, position('.' in reverse(origin))-1)) tld 
            from current) 
    group by tld 
    order by count(*) desc;
    ┌───────────┬──────────────┐
    │    tld    │ count_star() │
    │  varchar  │    int64     │
    ├───────────┼──────────────┤
    │ com       │       406052 │
    │ net       │        36675 │
    │ br        │        35741 │
    │ org       │        30359 │
    │ jp        │        29194 │
    │ ru        │        27812 │
    │ in        │        21690 │
    │ de        │        20182 │
    │ it        │        17914 │
    │ pl        │        15840 │
    │ ·         │            · │
    │ ·         │            · │
    │ ·         │            · │
    │ za:5002   │            1 │
    │ lk:8090   │            1 │
    │ org:1445  │            1 │
    │ co:14443  │            1 │
    │ ar:3016   │            1 │
    │ net:8001  │            1 │
    │ care:9624 │            1 │
    │ au:8443   │            1 │
    │ com:333   │            1 │
    │ edu:9016  │            1 │
    ├───────────┴──────────────┤
    │   2076 rows (20 shown)   │
    └──────────────────────────┘
[0] https://duckdb.org/docs/installation/
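The odd `za:5002`-style rows at the bottom appear because CrUX origins can include an explicit port, so the reverse-to-first-dot trick captures `tld:port`. One way to sketch a fix in shell (with a hypothetical sample standing in for the real CSV) is to strip the port before extracting the last label:

```shell
# Hypothetical origins, one with an explicit port
cat > sample.csv <<'EOF'
https://example.com,1000
https://shop.example.co.za:5002,50000
https://foo.example.de,10000
EOF

# Drop any :port before the comma, then extract the TLD as in the grep above
sed -E 's/:[0-9]+,/,/' sample.csv | grep -oP '\.[a-z]+(?=,)' | sort | uniq -c | sort -rn
```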
[+] voytec|3 years ago|reply
sort -rn at the end (instead of sort -n) will reverse the order, from most to least popular.
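A minimal illustration with made-up counts: `uniq -c` pads counts to a fixed width, which is why a plain reverse sort happens to work here, but `sort -rn` is the robust way to get a descending numeric sort.

```shell
# Three fake "count TLD" lines in uniq -c style, sorted most-popular first
# prints .com, then .de, then .pl
printf '%s\n' '      9 .pl' ' 406052 .com' '     15 .de' | sort -rn
```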
[+] slim|3 years ago|reply
it's amazing that fandom is number 3 and wikipedia is not even there
[+] wirthjason|3 years ago|reply
Curious where Pornhub and other such sites rank. I always hear that porn sites are in the top X of all traffic, but people don't talk about it due to its nature.

I'm always amazed that they have a data science team. It's not something many would expect from the porn industry. I certainly didn't.

https://www.pornhub.com/insights/2022-year-in-review

[+] mtmail|3 years ago|reply
"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."
[+] layer8|3 years ago|reply
A quick grep shows that there are almost 2.5K domains in the top 1M with "porn" in their name.

The data science teams likely provide a considerable ROI in that industry.
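The exact command wasn't shown; something like `grep -ci porn current.csv` would do it. A toy version with a hypothetical sample:

```shell
cat > sample.csv <<'EOF'
https://example.com,1000
https://pornexample.com,50000
https://other.net,1000
EOF

# Count origins with "porn" anywhere in the name (case-insensitive) -> 1
grep -ci porn sample.csv
```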

[+] E39M5S62|3 years ago|reply
If you ignore the content, large-scale adult sites are just like any other high traffic (bandwidth, RPS) site out there. A lot of planning goes into where their content delivery PoPs should be placed.
[+] mtmail|3 years ago|reply
"New Year’s Eve kicked holiday ass with a massive –40% drop in worldwide traffic from 6pm to Midnight on December 31st." It's Dec/31, 1pm in New York right now.
[+] est|3 years ago|reply
Looks like not a single Chinese site made it into the top 1k. I guess that's reasonable, because all Google services are blocked there, so CrUX can't gather any data.
[+] themoonisachees|3 years ago|reply
Do Chinese people use Chrome? One would think the download page is blocked as well, so the demographic of Chrome users should be way smaller.

Also to consider: China uses in-app browsing a lot, with interactive experiences very similar to websites built right into the bilibili/Ali/WeChat apps.

[+] ksec|3 years ago|reply
I assume the data is aggregated across all devices. Chrome has 60% of desktop usage in China, but less than 10% on mobile.

But in a market of nearly 1B internet users, not having a single site in the top 1K suggests something is wrong with the stats. I wonder what we're missing from those numbers.

[+] kristianp|3 years ago|reply
> The CrUX dataset is based on data collected from Google Chrome and is thus biased away from countries with limited Chrome usage (e.g., China).
[+] modeless|3 years ago|reply
Wow, I'm kinda surprised to find my site in the top million worldwide. I have about 100k monthly visits as measured by Cloudflare web analytics, I guess that's all it takes.
[+] zX41ZdbW|3 years ago|reply
If you are interested in research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.

It contains data about ~7 million top websites, and for every website it also contains:
- the full content of the main page;
- the verbose output of curl, containing various timing info, the HTTP headers, protocol info, etc.

Using this dataset, you can build a service similar to https://builtwith.com/ for your research.

Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).

Description: https://github.com/ClickHouse/ClickHouse/issues/18842

You can easily try it with clickhouse-local without downloading:

  $ curl https://clickhouse.com/ | sh

  $ ./clickhouse local 
    ClickHouse local version 22.13.1.294 (official build).

    milovidov-desktop :) DESCRIBE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    DESCRIBE TABLE url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')

    Query id: 6746232f-7f5f-4c5a-ac68-d749d949a2dc

    ┌─name────┬─type───┬─default_type─┬─default_expression─┬─comment─┬─codec_expression─┬─ttl_expression─┐
    │ rank    │ UInt32 │              │                    │         │                  │                │
    │ domain  │ String │              │                    │         │                  │                │
    │ log     │ String │              │                    │         │                  │                │
    │ content │ String │              │                    │         │                  │                │
    └─────────┴────────┴──────────────┴────────────────────┴─────────┴──────────────────┴────────────────┘

    4 rows in set. Elapsed: 1.390 sec. 

    milovidov-desktop :) SELECT rank, domain, log, substringUTF8(content, 1, 100) FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst') LIMIT 1 FORMAT Vertical

    SELECT
        rank,
        domain,
        log,
        substringUTF8(content, 1, 100)
    FROM url('https://clickhouse-public-datasets.s3.amazonaws.com/minicrawl/data.native.zst')
    LIMIT 1
    FORMAT Vertical

    Query id: 8dba6976-0bf6-4ce8-a0f1-aa579c828175

    Row 1:
    ──────
    rank:                           1907977
    domain:                         0--0.uk
    log:                            *   Trying 213.32.47.30:80...
    * Connected to 0--0.uk (213.32.47.30) port 80 (#0)
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 302 Moved Temporarily
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:14 GMT
    < Content-Type: text/html
    < Content-Length: 154
    < Connection: keep-alive
    < Location: https://0--0.uk/
    < 
    * Ignoring the response-body
    { [154 bytes data]
    * Connection #0 to host 0--0.uk left intact
    * Issue another request to this URL: 'https://0--0.uk/'
    *   Trying 213.32.47.30:443...
    * Connected to 0--0.uk (213.32.47.30) port 443 (#1)
    * ALPN, offering h2
    * ALPN, offering http/1.1
    *  CAfile: /etc/ssl/certs/ca-certificates.crt
    *  CApath: /etc/ssl/certs
    * TLSv1.0 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    } [512 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    { [108 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Certificate (11):
    { [4150 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server key exchange (12):
    { [333 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Server finished (14):
    { [4 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
    } [70 bytes data]
    * TLSv1.2 (OUT), TLS header, Finished (20):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
    } [1 bytes data]
    * TLSv1.2 (OUT), TLS header, Certificate Status (22):
    } [5 bytes data]
    * TLSv1.2 (OUT), TLS handshake, Finished (20):
    } [16 bytes data]
    * TLSv1.2 (IN), TLS header, Finished (20):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS header, Certificate Status (22):
    { [5 bytes data]
    * TLSv1.2 (IN), TLS handshake, Finished (20):
    { [16 bytes data]
    * SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
    * ALPN, server accepted to use http/1.1
    * Server certificate:
    *  subject: CN=mail.htservices.co.uk
    *  start date: May 15 18:36:37 2022 GMT
    *  expire date: Aug 13 18:36:36 2022 GMT
    *  subjectAltName: host "0--0.uk" matched cert's "0--0.uk"
    *  issuer: C=US; O=Let's Encrypt; CN=R3
    *  SSL certificate verify ok.
    * TLSv1.2 (OUT), TLS header, Supplemental data (23):
    } [5 bytes data]
    > GET / HTTP/1.1
    > Host: 0--0.uk
    > Accept: */*
    > User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0
    > 
    * TLSv1.2 (IN), TLS header, Supplemental data (23):
    { [5 bytes data]
    * Mark bundle as not supporting multiuse
    < HTTP/1.1 200 OK
    < Server: nginx
    < Date: Sun, 29 May 2022 06:27:15 GMT
    < Content-Type: text/html;charset=utf-8
    < Transfer-Encoding: chunked
    < Connection: keep-alive
    < X-Frame-Options: SAMEORIGIN
    < Expires: -1
    < Cache-Control: no-store, no-cache, must-revalidate, max-age=0
    < Pragma: no-cache
    < Content-Language: en-US
    < Set-Cookie: ZM_TEST=true;Secure
    < Set-Cookie: ZM_LOGIN_CSRF=b2dda010-d795-4759-a9c3-80349f3b46ed;Secure;HttpOnly
    < Vary: User-Agent
    < X-UA-Compatible: IE=edge
    < Vary: Accept-Encoding, User-Agent
    < 
    { [13068 bytes data]
    * Connection #1 to host 0--0.uk left intact

    substringUTF8(content, 1, 100): <!DOCTYPE html>
    <!-- set this class so CSS definitions that now use REM size, would work relative to

    1 row in set. Elapsed: 0.539 sec. Processed 4.60 thousand rows, 273.86 MB (8.54 thousand rows/s., 508.28 MB/s.)
[+] simonw|3 years ago|reply
How does that work? How can clickhouse-local run queries against a 129 GB file hosted on S3 without downloading the whole thing?

Is it using HTTP range header tricks, like DuckDB does for querying Parquet files? https://duckdb.org/docs/extensions/httpfs.html

If so, what's the data.native.zst file format? Is it similar to Parquet?

[+] anonu|3 years ago|reply
> grep http: current.csv | wc -l

54679

So over 5% of the top 1m sites still don't use HTTPS.
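For the record, `https://` origins don't match `http:` (the `s` intervenes), so the count is sound; anchoring the scheme just makes the intent explicit. A toy version with hypothetical data:

```shell
cat > sample.csv <<'EOF'
http://plain.example.com,200000
https://secure.example.com,1000
http://old.example.org,900000
EOF

# Count origins served over plain http only -> 2
grep -c '^http://' sample.csv
```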

[+] zX41ZdbW|3 years ago|reply
I have prepared a report: websites grouped by rank (1..10, 11..100, ...), the percentage using TLS, and an example of a non-TLS website in each group:

https://play.clickhouse.com/play?user=play#U0VMRUNUIGZsb29yK...

    SELECT
        floor(log10(rank)) AS r,
        count() AS total,
        sum(log LIKE '%TLS%') AS tls,
        round(tls / total, 2) AS ratio,
        anyIf(domain, log NOT LIKE '%TLS%')
    FROM minicrawl
    WHERE log LIKE '%Content-Length:%'
    GROUP BY r
    ORDER BY r

    ┌─r─┬───total─┬─────tls─┬─ratio─┬─anyIf(domain, notLike(log, '%TLS%'))─┐
    │ 0 │       6 │       6 │     1 │                                      │
    │ 1 │      61 │      58 │  0.95 │ baidu.com                            │
    │ 2 │     599 │     562 │  0.94 │ google.cn                            │
    │ 3 │    5591 │    5057 │   0.9 │ volganet.ru                          │
    │ 4 │   51279 │   44291 │  0.86 │ furbo.co                             │
    │ 5 │  476181 │  361910 │  0.76 │ funygold.com                         │
    │ 6 │ 3797023 │ 2927052 │  0.77 │ funyo.vip                            │
    └───┴─────────┴─────────┴───────┴──────────────────────────────────────┘

    7 rows in set. Elapsed: 0.844 sec. Processed 7.59 million rows, 43.74 GB (8.99 million rows/s., 51.83 GB/s.)
[+] alfu|3 years ago|reply
If I am not mistaken, 8310 sites offer both http and https:

    grep -o -E "://.*?," current.csv | sort | uniq -c | grep -v "1 ://" | wc -l
    8310
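The pipeline works because everything after `://` is identical for the http and https variants of the same origin, so such origins show up with a count of 2. A toy run with a hypothetical sample:

```shell
cat > sample.csv <<'EOF'
http://both.example.com,500
https://both.example.com,400
https://onlyssl.example.com,100
EOF

# Origins listed under both schemes collapse to one line with count 2;
# grep -v drops the count-1 (single-scheme) lines, leaving 1 dual-scheme site
grep -o -E "://.*," sample.csv | sort | uniq -c | grep -v "1 ://" | wc -l
```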
[+] philipphutterer|3 years ago|reply
What about websites that are browsed over http first and then redirected? People might type a domain without the https:// prefix for convenience (or follow old links), and the browser defaults to http.
[+] kristianp|3 years ago|reply
This raises the question: how much user telemetry does Chrome send back to Google?
[+] Mortiffer|3 years ago|reply
Thanks for pointing this out. Can definitely put this dataset to use.
[+] cronaday|3 years ago|reply
This is very ethically dubious. Google is collecting raw URLs from Chrome users who turned on history syncing across their own devices, then reusing the data and funneling it through Stanford. No way Chrome users understand or approve of this.

The paper tries to justify its ethics with Google's privacy policy, which is laughable. There are so many papers about how meaningless privacy policies are. If Apple or Mozilla did anything remotely like this, Hacker News would riot.

Edit: I don't want to be a conspiracy theorist, but this post suddenly got a bunch of downvotes at the same time as defensive comments from a current Googler and recent ex-Googler. Then one of my responses below to a Chrome developer got flagged for no obvious reason. Hmm.

[+] dadrian|3 years ago|reply
1. They're not funneling it through Stanford. They're posting it publicly, on BigQuery: https://developer.chrome.com/docs/crux/

2. Chrome prompts you to opt out of metrics collection on install.

None of the reasons you've listed for this being ethically dubious are true.

[+] tristor|3 years ago|reply
I very much agree with you. This type of data collection MUST be opt-in to be ethical, and in Chrome it’s enabled by default and buried. The VAST majority of users have no idea this is even happening. It is grossly unethical and it is obvious that it is so, but unsurprisingly folks at Google are happy to do things like this given their salaries.
[+] pvg|3 years ago|reply
> Edit: I don't want to be a conspiracy theorist

That's not merely a good idea but also

https://news.ycombinator.com/newsguidelines.html

Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email [email protected] and we'll look at the data.

There's also just not writing in the high-dudgeon flamewar style which helps with the downvotes.

[+] jeffbee|3 years ago|reply
Maybe your posts would get better votes if you made any effort at all to back up your claim of unethical behavior. You provided nothing.
[+] chiefalchemist|3 years ago|reply
re: Edit.

I've noticed similar behavior in HN voting: downvote spikes but few if any comments in line with the voting. Not sure if it's bots, human-based click farms, or people who just don't understand that disagreement is not grounds for downvoting.

Perhaps a bit of all three?

[+] soneca|3 years ago|reply
> "If Apple or Mozilla did anything remotely like this, Hacker News would riot."

My perception is that, collectively, HN hates and criticizes Google much more than Apple or Mozilla. I mean, much more. That last accusation sounded bizarre to me.