Curious where PornHub and other sites rank. I always hear that porn sites are in the top X by traffic, but people don’t talk about it because of their nature.
"Pornhub’s statisticians make use of Google Analytics to figure out the most likely age and gender of visitors. This data is anonymized from billions of visits to Pornhub annually, giving us one of the richest and most diverse data sets to analyze traffic from all around the world."
If you ignore the content, large-scale adult sites are just like any other high traffic (bandwidth, RPS) site out there. A lot of planning goes into where their content delivery PoPs should be placed.
"New Year’s Eve kicked holiday ass with a massive –40% drop in worldwide traffic from 6pm to Midnight on December 31st." It's Dec/31, 1pm in New York right now.
I remember reading about their experience with Redis (https://groups.google.com/g/redis-db/c/d4QcWV0p-YM). There is something funny about reading engineering insights from a porn company, but they do deal with scale that not many others do!
I assume the data is aggregated across all devices. Chrome has 60% of desktop usage in China, but less than 10% on mobile.
Wow, I'm kinda surprised to find my site in the top million worldwide. I have about 100k monthly visits as measured by Cloudflare web analytics, I guess that's all it takes.
If you are interested in the research on technologies used on the Internet, I recommend playing with the "Minicrawl" dataset.
How about websites that are browsed http first and then redirected? People might browse for a domain without the https prefix for convenience (or old links) and the browser defaults to http.
This is very ethically dubious. Google is collecting raw URLs from Chrome users who turned on history syncing across their own devices, then reusing the data and funneling it through Stanford. No way Chrome users understand or approve of this.
Can you please make your substantive points without breaking the site guidelines? You did that here with your last paragraph, and worse at https://news.ycombinator.com/item?id=34197958.
I very much agree with you. This type of data collection MUST be opt-in to be ethical, and in Chrome it’s enabled by default and buried. The VAST majority of users have no idea this is even happening. It is grossly unethical and it is obvious that it is so, but unsurprisingly folks at Google are happy to do things like this given their salaries.
> ” If Apple or Mozilla did anything remotely like this, Hacker News would riot.”
[+] [-] mg|3 years ago|reply
[+] [-] azeemba|3 years ago|reply
People seem to think it is somehow measuring visits to those origins, but it's measuring how many unique subdomains are listed for those domains.
[+] [-] egman_ekki|3 years ago|reply
Also interesting I haven’t heard about half of them. Some are nsfw, apparently.
[+] [-] kristianp|3 years ago|reply
[+] [-] voytec|3 years ago|reply
[+] [-] slim|3 years ago|reply
[+] [-] wirthjason|3 years ago|reply
I’m always amazed that they have a data science team. It’s not something many would expect from the porn industry. I certainly didn’t expect it.
https://www.pornhub.com/insights/2022-year-in-review
[+] [-] mtmail|3 years ago|reply
[+] [-] layer8|3 years ago|reply
The data science teams likely provide a considerable ROI in that industry.
[+] [-] E39M5S62|3 years ago|reply
[+] [-] mtmail|3 years ago|reply
[+] [-] xwowsersx|3 years ago|reply
[+] [-] oars|3 years ago|reply
[+] [-] est|3 years ago|reply
[+] [-] themoonisachees|3 years ago|reply
Also to consider: China uses in-app browsing a lot, with interactive experiences very similar to websites built right in the bilibili/ali/wechat apps.
[+] [-] ksec|3 years ago|reply
But in a market of nearly 1B internet users, not having a single site in the top 1K suggests something is wrong with the stats. I wonder what we are missing from those numbers.
[+] [-] kristianp|3 years ago|reply
[+] [-] modeless|3 years ago|reply
[+] [-] zX41ZdbW|3 years ago|reply
It contains data about ~7 million top websites, and for every website, it also contains: - the full content of the main page; - the verbose output of curl, containing various timing info; the HTTP headers, protocol info...
Using this dataset, you can build a service similar to https://builtwith.com/ for your research.
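As a rough illustration of the builtwith-style idea, here is a minimal Python sketch that guesses technologies from the response headers in verbose curl output. The sample text, the signature table, and the function names are all hypothetical; the actual layout of the dataset's curl field may differ, and real detection would need far more signatures (including ones based on page content).

```python
import re

# Hypothetical sample of `curl -v` output; the dataset's real field may look different.
SAMPLE_CURL_VERBOSE = """\
* Connected to example.com port 443
> GET / HTTP/1.1
> Host: example.com
>
< HTTP/1.1 200 OK
< Server: nginx/1.18.0
< X-Powered-By: PHP/8.1.2
< Content-Type: text/html; charset=UTF-8
<
"""

# Hypothetical signature table: (header name, pattern on its value, technology label).
SIGNATURES = [
    ("server", re.compile(r"nginx", re.I), "nginx"),
    ("server", re.compile(r"apache", re.I), "Apache"),
    ("server", re.compile(r"cloudflare", re.I), "Cloudflare"),
    ("x-powered-by", re.compile(r"php", re.I), "PHP"),
    ("x-powered-by", re.compile(r"express", re.I), "Express"),
]


def response_headers(curl_verbose):
    """Collect response headers; curl -v prefixes received header lines with '< '."""
    headers = {}
    for line in curl_verbose.splitlines():
        if not line.startswith("< "):
            continue
        name, sep, value = line[2:].partition(":")
        if sep:  # the status line has no colon, so it is skipped here
            headers[name.strip().lower()] = value.strip()
    return headers


def detect_technologies(curl_verbose):
    """Return labels whose signature matches one of the response headers."""
    headers = response_headers(curl_verbose)
    found = []
    for header, pattern, label in SIGNATURES:
        if pattern.search(headers.get(header, "")) and label not in found:
            found.append(label)
    return found


print(detect_technologies(SAMPLE_CURL_VERBOSE))  # ['nginx', 'PHP']
```

Running the same function over every row of the dataset and grouping by label would give a crude technology-share table.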
Data: https://clickhouse-public-datasets.s3.amazonaws.com/minicraw... (129 GB compressed, ~1 TB uncompressed).
Description: https://github.com/ClickHouse/ClickHouse/issues/18842
You can easily try it with clickhouse-local without downloading:
[+] [-] simonw|3 years ago|reply
Is it using HTTP range header tricks, like DuckDB does for querying Parquet files? https://duckdb.org/docs/extensions/httpfs.html
If so, what's the data.native.zst file format? Is it similar to Parquet?
[+] [-] anonu|3 years ago|reply
54679
So over 5% of the top 1m sites still don't use HTTPS.
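A quick check of that arithmetic in Python, assuming 54679 is the number of non-HTTPS sites and the denominator is exactly 1,000,000 (the query that produced the count is not shown here, so both are assumptions):

```python
def share_percent(count, total):
    """Return count as a percentage of total."""
    return 100.0 * count / total


non_https = 54_679      # count quoted in the comment
top_sites = 1_000_000   # "top 1m", assumed to be exactly one million

print(f"{share_percent(non_https, top_sites):.2f}%")  # 5.47%
```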
[+] [-] zX41ZdbW|3 years ago|reply
https://play.clickhouse.com/play?user=play#U0VMRUNUIGZsb29yK...
[+] [-] alfu|3 years ago|reply
[+] [-] philipphutterer|3 years ago|reply
[+] [-] _nhynes|3 years ago|reply
[+] [-] Proven|3 years ago|reply
[deleted]
[+] [-] forgotmypw17|3 years ago|reply
[deleted]
[+] [-] kristianp|3 years ago|reply
[+] [-] tgsovlerkhgsel|3 years ago|reply
(I haven't checked whether the documentation is complete/accurate, of course.)
[+] [-] deterrence|3 years ago|reply
[deleted]
[+] [-] Mortiffer|3 years ago|reply
[+] [-] cronaday|3 years ago|reply
The paper tries to justify its ethics with Google's privacy policy, which is laughable. There are so many papers about how meaningless privacy policies are. If Apple or Mozilla did anything remotely like this, Hacker News would riot.
Edit: I don't want to be a conspiracy theorist, but this post suddenly got a bunch of downvotes at the same time as defensive comments from a current Googler and recent ex-Googler. Then one of my responses below to a Chrome developer got flagged for no obvious reason. Hmm.
[+] [-] dang|3 years ago|reply
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
[+] [-] jefftk|3 years ago|reply
This includes only listing publicly discoverable pages, only including data from users who have turned on "Make searches and browsing better (Sends URLs of pages you visit to Google)", and only including pages that are visited by a minimum number of users.
[+] [-] dadrian|3 years ago|reply
2. Chrome prompts you to opt-out of metrics collection on install.
None of the reasons you've listed for this being ethically dubious are true.
[+] [-] tristor|3 years ago|reply
[+] [-] pvg|3 years ago|reply
That's not merely a good idea but also
https://news.ycombinator.com/newsguidelines.html
Please don't post insinuations about astroturfing, shilling, bots, brigading, foreign agents and the like. It degrades discussion and is usually mistaken. If you're worried about abuse, email hn@ycombinator.com and we'll look at the data.
There's also just not writing in the high-dudgeon flamewar style which helps with the downvotes.
[+] [-] jeffbee|3 years ago|reply
[+] [-] chiefalchemist|3 years ago|reply
I've noticed similar behavior in HN voting: downvote spikes, but few if any comments in line with the voting. Not sure if it's bots, human click farms, or people who just don't understand that disagreement is not grounds for downvoting.
Perhaps a bit of all three?
[+] [-] soneca|3 years ago|reply
My perception is that, collectively, HN hates and criticizes Google much more than Apple and Mozilla. I mean, much more. The accusation in that last sentence sounded bizarre to me.