top | item 39806969

(no title)

HN is different: it's open access and downloadable. Reddit, as an example, sells its data and isn't scraping-friendly.


HN “exploits” its community by building the street cred of Y Co and also by being a venue where Y Co startups can advertise for help. It doesn’t bother me, or I wouldn’t be here, but a certain person could say it is some rich white (and Asian) dudes benefitting from it all.

As a “hacker” I feel open access to the data is “fair”, but I think a much less technical person might not care whether the surplus is reaped by anyone with a webcrawler or by Reddit’s administration.

fragmede|1 year ago

The points given to a comment aren't public. That information would be highly valuable for training an LLM.

PaulHoule|1 year ago

That’s interesting.

I have predictive models that take only a headline (without the rest of the article and without considering the URL) and predict (a) whether it will get more than 10 votes and (b), if it does get more than 10 votes, whether the votes/comments ratio will be more than 2 (which is roughly average).
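As a rough sketch of how those two targets could be derived from raw counts (the function name `make_labels` and the treatment of zero-comment submissions are assumptions for illustration, not a description of the actual pipeline):

```python
import math
from typing import Optional, Tuple


def make_labels(
    votes: int,
    comments: int,
    vote_threshold: int = 10,
    ratio_threshold: float = 2.0,
) -> Tuple[bool, Optional[bool]]:
    """Return (label_a, label_b) for one submission.

    label_a: did the submission get more than `vote_threshold` votes?
    label_b: only defined when label_a is True; is votes/comments
             above `ratio_threshold`? Otherwise None.
    """
    label_a = votes > vote_threshold
    if not label_a:
        # Model (b) is only trained on submissions that passed (a).
        return (False, None)
    # Assumption: a submission with no comments has an infinite ratio.
    ratio = votes / comments if comments > 0 else math.inf
    return (True, ratio > ratio_threshold)
```

For example, `make_labels(50, 10)` gives `(True, True)` because the ratio is 5, while `make_labels(20, 15)` gives `(True, False)`.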

The first model gets a ROC-AUC (see https://scikit-learn.org/stable/modules/generated/sklearn.me...) in the low 60’s (not good), the second model gets in the low 70’s (actually pretty good, though it is a heat-seeking missile for clickbait headlines), and my latest content-based recommender for RSS items gets almost 80. (I saw a paper saying that one system at TikTok gets about 85.)

To do all that you need about 10,000 headlines, and you don’t get a lot of benefit from having more than 100,000. The ceilings on performance have more to do with the nature of the problem than with my models: the same article can get submitted twice and get 0 votes one time and 200 the other time, so it can never be as accurate as “is this an article about galactic astronomy?”
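For anyone unfamiliar with the metric: ROC-AUC is the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one, so 0.5 (or "50") is chance and 1.0 (or "100") is perfect. A minimal pure-Python sketch of that definition follows; it is O(n²) and only meant to illustrate the idea, and a library routine would normally be used in practice.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise ROC-AUC: fraction of (positive, negative) pairs in which
    the positive example has the higher score. Ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Three of the four positive/negative pairs are ordered correctly.
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```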

I had it ingest the HN comments firehose and found the amount of articles was overwhelming; my YOShInOn RSS reader now ingests the “best comments” from

https://hnrss.github.io/

together with 110 other feeds and actually I like the comments it picks out a lot. Now that the system is adding about 3000 items per day it might be able to handle a big feed like the comments firehose since now those comments are diluted with so many quality articles. For a problem like that you might want a two-score system with: (i) is it relevant? (something I like) and (ii) is it popular? (like Google’s PageRank)
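One way a two-score system like that might be wired together is a weighted blend of the two scores. This is only a sketch: the weighted average, the `weight` parameter, and the `relevance`/`popularity` field names are assumptions, and both scores are assumed to be scaled to the range 0 to 1 already.

```python
from typing import Dict, List


def combined_score(relevance: float, popularity: float, weight: float = 0.5) -> float:
    """Blend two scores in [0, 1]; `weight` is the share given to relevance."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return weight * relevance + (1.0 - weight) * popularity


def rank_items(items: List[Dict], weight: float = 0.5) -> List[Dict]:
    """Sort items from highest to lowest combined score."""
    return sorted(
        items,
        key=lambda it: combined_score(it["relevance"], it["popularity"], weight),
        reverse=True,
    )


# Hypothetical items with made-up scores.
items = [
    {"id": "a", "relevance": 0.9, "popularity": 0.1},
    {"id": "b", "relevance": 0.2, "popularity": 0.95},
    {"id": "c", "relevance": 0.7, "popularity": 0.6},
]
print([it["id"] for it in rank_items(items)])  # ['c', 'b', 'a']
```

Setting `weight=1.0` ranks purely by relevance and `weight=0.0` purely by popularity; a filter-then-sort scheme would be another reasonable choice.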

I think you could make a model that compares comments in the best comments feed with other comments. I have tried formulating the problems above as regression problems, where I try to predict the actual score, and it does not work well because of the uncertainty problem. Formulated as a classification problem for a score over a threshold, though, it is easy to make a well-calibrated model that tells you “this article has a 20% chance of frontpaging”, which is about the best anyone can do.
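"Well-calibrated" here means that, among items given roughly a 20% chance, about 20% actually cross the threshold. A simple way to check that is a reliability table that bins predictions and compares the mean predicted probability with the observed rate in each bin. The sketch below assumes equal-width bins; the function name and the example numbers are made up for illustration.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    probs: Sequence[float],
    outcomes: Sequence[int],
    n_bins: int = 5,
) -> List[Tuple[int, int, float, float]]:
    """Return one row per non-empty bin:
    (bin_index, count, mean_predicted_probability, observed_positive_rate).
    """
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # Clamp so that p == 1.0 lands in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    rows = []
    for i, members in enumerate(bins):
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        rows.append((i, len(members), mean_p, rate))
    return rows


# Made-up predictions: five items scored 0.25, two items scored 0.9.
probs = [0.25] * 5 + [0.9] * 2
outcomes = [1, 0, 0, 0, 0, 1, 1]
for row in calibration_table(probs, outcomes):
    print(row)
```

For a well-calibrated model the last two columns stay close to each other in every bin that has enough items to be meaningful.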