(no title)
LeroyRaz | 2 months ago
Looking at the comment reviews on the actual website, the LLM seems to have mostly judged whether it agreed with the takes, not whether they came true, and it seems to have an incredibly poor grasp of its actual task of assessing whether the comments were predictive or not.
The LLM's comment reviews are often statements like "correctly characterized [programming language] as [opinion]."
This dynamic means the website mostly grades people on having the most conformist take (the take most likely to dominate the training data, and to be selected for in the LLM RL tuning process of pleasing the average user).
LeroyRaz | 2 months ago
Link to LLM review: https://karpathy.ai/hncapsule/2015-12-02/index.html#article-....
So the LLM is praising a comment for describing DF as unforgiving (a characterization of the game at the time, not a statement about the future). And worse, it seems like tptacek may in fact be implying the opposite of what actually happened (e.g., that the game would keep crashing, when the crashes were eventually fixed).
Here is the original comment, from tptacek on Dec 2, 2015: "
If you're not the kind of person who can take flaws like crashes or game-stopping frame-rate issues and work them into your gameplay, DF is not the game for you. It isn't a friendly game. It can take hours just to figure out how to do core game tasks. "Don't do this thing that crashes the game" is just another task to learn."
Note: I am paraphrasing the LLM review, as the website is also poorly designed: you cannot select the text of the LLM review!
N.b., this choice of comment review is not overly cherry-picked. I just scanned the "best commentators" list, and tptacek was number two, with this particular egregiously unrelated-to-prediction LLM summary given as justification for his #2 rating.
hathawsh | 2 months ago
https://karpathy.ai/hncapsule/2015-12-03/index.html#article-...
xpe | 2 months ago
It is unfortunate that the questions of "how well did the LLM do?" and "how does 'grading' work in this app?" seem to have gone out the window when HN readers see something shiny.
karmickoala | 2 months ago
Some of the issues could be resolved with better prompting (it was biased to interpret every comment through the lens of predictions) and LLM-as-a-judge, but still. For example, Anthropic's Deep Research prompts sub-agents to pass along original quotes instead of paraphrasing, because paraphrasing can degrade the original message (a rough sketch of such a judge prompt follows the examples below).
Some examples:
sebastiank123 got a C-, and was quoted by the LLM as saying: [...]
Now, let's read his full comment: [...]
I don't interpret it as a prediction, but a desire. The user is praising Swift. If it went the server way, perhaps it could replace JS, to the user's wishes. To make it even clearer, if someone asked the commenter right after, "Is that a prediction? Are you saying Swift is going to become a serious JavaScript competitor?", I don't think the answer would be 'yes' in this context.

Full quote: [...]
"Any reasonable definition of 'significant' is satisfied"? That's not how I would interpret this. We see it clearly as a duopoly in North America. It's not wrong per se, but I'd say misleading. I know we could take this argument and see other slices of the data (premium phones worldwide, for instance); I'm just saying it's not as clear cut as it was made out to be.

That's not what the user was saying: he was praising him, and he did miss opportunities at first. The OC did not make predictions about his later days. Full quote: [...]

Full quote: [...]
I thought the debate was useful, and so did pjbrunet, per his update.

I mean, we could go on; there are many others like these.
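For what it's worth, here is a minimal sketch of the kind of judge prompt karmickoala is describing: it makes the model decide whether a comment contains a prediction at all before grading, and requires the supporting text to be quoted verbatim so the paraphrasing problem becomes checkable. The rubric, the `call_llm` placeholder, and the substring check are all hypothetical illustrations, not the prompt Karpathy's site or Anthropic's Deep Research actually uses.

```python
import json


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client the grader uses."""
    raise NotImplementedError("wire this up to the LLM API of your choice")


def build_judge_prompt(author: str, comment: str) -> str:
    # The two-step rubric forces the model to decide whether a prediction
    # exists at all before it is allowed to grade anything, and to quote
    # the predictive text verbatim instead of paraphrasing it.
    return (
        "You are grading a 2015 Hacker News comment for predictive accuracy, "
        "with the benefit of hindsight.\n\n"
        "Step 1: decide whether the comment makes a falsifiable claim about "
        "the future. Opinions, preferences, and descriptions of the "
        "then-present do not count and must not be graded.\n"
        "Step 2: only if there is a prediction, copy the predictive "
        "sentence(s) verbatim into 'quote' (no paraphrasing), describe what "
        "actually happened in 'outcome', and assign an A-F 'grade' based "
        "solely on whether the quoted prediction came true.\n\n"
        f"Comment by {author}:\n{comment}\n\n"
        'Respond with JSON only, using the keys "contains_prediction" '
        '(bool), "quote" (string), "outcome" (string), and "grade" (string).'
    )


def judge_comment(author: str, comment: str) -> dict:
    response = call_llm(build_judge_prompt(author, comment))
    result = json.loads(response)
    # Cheap guard against the paraphrasing failure mode: if the "verbatim"
    # quote is not actually a substring of the comment, discard the grade.
    if result.get("contains_prediction") and result.get("quote", "") not in comment:
        result["grade"] = None
    return result
```

Requiring a verbatim quote is what makes the check mechanical: if the quoted text is not actually in the comment, or is plainly not a prediction (as in the tptacek example above), the grade can be rejected without re-reading everything.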
andy99 | 2 months ago
I understand this is just a fun exercise, so it's basically what LLMs are good at: generating plausible-sounding stuff without regard for correctness. I would not extrapolate from this to their utility on real evaluation tasks.