stingrae | 1 month ago

1 and 2 have been achieved.

4 is close; the interface needs some work to let nontechnical people use it. (Claude Code)

fxtentacle|1 month ago

I strongly disagree. I’ve yet to find an AI that can reliably summarise emails, let alone understand nuance or sarcasm. And I just asked ChatGPT 5.2 to describe an Instagram image. It didn’t even get the easily OCR-able text correct. Plus it completely failed to mention anything sports- or stadium-related, even though it was looking at a clichéd baseball photo taken by a fan inside the stadium.

protocolture|1 month ago

I have had ChatGPT read text in an image, give me a 100% accurate result, and then claim not to have the ability and to have guessed the previous result when I ask it to do it again.

pixl97|1 month ago

>let alone understand nuance or sarcasm

I'm still trying to find humans that do this reliably too.

To add on, 5.2 seems to be kind of lazy by default when reading text in images. When fed an image, it may give only the first word or so. But coming back with a prompt like 'read all the text in the image' makes it do a better job.

With one image in particular that I tested, I thought it was hallucinating some of the words, but there was a picture within the picture with small words that it saw and I had missed the first time.

I think a lot of AI capabilities are kind of hobbled for end users because providers limit how much GPU is used.

falloutx|1 month ago

I dispute 1 & 2 more than 4.

1) Is it actually watching a movie frame by frame, or just searching for information about it and then giving you the answer?

2) Again, can it handle very long novels? Context windows are limited, and it can easily miss something. Where is the proof for this?

4 is probably solved

4) This is more on the predictor, because this is easy to game. You can generate 10k lines of gibberish code with an LLM today without issues. Even a non-technical user can do it.

CjHuber|1 month ago

I think all of those are terrible indicators. 1 and 2, for example, only measure how well LLMs can handle long contexts.

If a movie or novel is famous, the training data is already full of commentary on and interpretations of it.

If it's something not in the training data, well, I don't know many movies or books that use only motifs that no earlier piece of content used, so interpreting based on what is similar in the training data still produces good results.

EDIT: With 1 I meant using a transcript of the Audio Description of the movie. If he really meant watching a movie, I'd say that's even sillier, because of course we could get another agent to first generate the Audio Description, which is definitely possible currently.