top | item 44333032

birdfood | 8 months ago

Perhaps related: after watching a talk by Gerald Sussman, I loaded an image of the Kanizsa triangle into Claude and asked it a pretty vague question to see if it could "see" the inferred triangle. It recognised the image and went straight into giving me a summary about it. So I rotated the image 90 degrees and tried again in a new conversation; this time it didn't recognise the image and got the number of elements wrong:

This image shows a minimalist, abstract geometric composition with several elements:

* Four black shapes that appear to be partial circles or "Pac-Man" like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image
* Two thin black triangular or arrow-like shapes - one pointing upward in the upper left area, and one pointing to the right in the center-right area
* All elements are arranged on a light gray or off-white background

latentsea | 8 months ago

I guess they will now just rotate all the images in the training data 90 degrees too, to fill this kind of gap.

recursivecaveat | 8 months ago

Everything old is new again: in the AlexNet paper that kicked off the deep learning wave in 2012, they describe horizontally flipping every image as a cheap form of data augmentation. Now that we expect models to actually read text, though, that seems potentially counter-productive. Rotations are similar, in that you'd hope the model would learn heuristics such as that the sky is almost always at the top.
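The kind of augmentation described here - mirroring plus 90-degree rotations - takes only a few lines. This is a generic numpy sketch, not AlexNet's actual pipeline (which also used random crops and PCA color jitter):

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Return cheap augmented variants of an (H, W, C) image:
    the original, its horizontal mirror, and the three 90-degree rotations."""
    variants = [image, np.fliplr(image)]
    for k in (1, 2, 3):
        variants.append(np.rot90(image, k))
    return variants

# A tiny 2x3 single-channel "image", just to show the shapes involved.
img = np.arange(6).reshape(2, 3, 1)
aug = augment(img)
# Rotating by 90 degrees swaps height and width.
assert aug[2].shape == (3, 2, 1)
```

Note that for any image containing text or a natural "up" direction, these variants change the semantics, which is exactly the tension raised above.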

mirekrusin | 8 months ago

That's how you train a neural network with synthetic data so that it extracts actual meaning.

That's how humans also learn, e.g. adding numbers: first there is naive memoization, followed by more examples until you get it.

LLM training seems to be falling into a memoization trap because models are extremely good at it - orders of magnitude better than humans.

IMHO what is missing in the training process is feedback explaining the wrong answer. Current training leaves that understanding as an "exercise to the reader": we feed correct answers to specific, individual examples, which promotes memoization.

What we should be doing in post-training is ditch direct backpropagation on the next token. Instead, let the model finish its wrong answer, append an explanation of why it's wrong, and then backpropagate on the final answer - now with the explanation in context to guide it to the right place in understanding.
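A minimal sketch of how such a training example could be laid out, assuming a per-token loss mask. Everything here is hypothetical (stand-in word "tokens" instead of tokenizer ids, illustrative function name); it only shows that loss would flow through the corrected answer while the wrong attempt and critique stay in context:

```python
def build_correction_example(prompt, wrong_answer, explanation, final_answer):
    """Lay out one post-training example: the model's wrong attempt and a
    critique remain in context, but the loss mask supervises only the
    corrected final answer."""
    segments = [
        (prompt, False),        # no loss on the prompt
        (wrong_answer, False),  # the model's own wrong completion, kept in context
        (explanation, False),   # appended feedback on why it's wrong
        (final_answer, True),   # backpropagate only through this part
    ]
    tokens, loss_mask = [], []
    for text, supervised in segments:
        words = text.split()    # stand-in for real tokenization
        tokens.extend(words)
        loss_mask.extend([supervised] * len(words))
    return tokens, loss_mask
```

In a real pipeline, the mask would zero out the cross-entropy loss on the unsupervised positions while the full sequence still conditions the prediction of the final answer.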

What all of this means is that current models are largely underutilized and unnecessarily bloated: they contain way too much memoized information. Making a model larger is easy - a quick illusion of improvement. Models need to be squeezed more, and more focus needs to go toward the training flow itself.

littlestymaar | 8 months ago

And it will work.

I just wish the people who believe LLMs can actually reason and generalize would see that they don't.

Workaccount2 | 8 months ago

Show any LLM a picture of a dog with 5 legs and watch it be totally unable to count.

pfdietz | 8 months ago

Or watch them channel Abraham Lincoln.

iknownothow | 8 months ago

As far as I can tell, the paper covers text documents only, so your example doesn't quite apply.

It is well known that LLMs have a ways to go when it comes to processing images like they process text or audio.

I don't think there's any well-performing multimodal model that accepts image pixels directly. Most vision capabilities are hacks, or engineered in: an image undergoes several processing steps, and each processor's outputs are fed to the transformer as tokens. This may all happen in one network, but there are non-transformer networks involved. Examples of preprocessing:

* OCR
* CNNs (2D pattern recognizers) with different zooms, angles, slices, etc.
* Others, maybe, too?

akomtu | 8 months ago

To generalise this idea: if we look at a thousand points that more or less fill a triangle, we'll instantly recognise the shape. IMO, this simple example reveals what intelligence is really about. We spot the triangle because so much complexity - a thousand points - fits into a simple, low-entropy geometric shape. What we call IQ is the ceiling of complexity of the patterns we can notice. For example, the thousand dots may in fact be the corners of a 10-dimensional cube, rotated slightly - an easy pattern to see for a 10-d mind.
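The "thousand points filling a triangle" stimulus is easy to generate, which makes the example concrete. A small sketch using the standard uniform barycentric-sampling trick (function name and vertices are illustrative):

```python
import random

def sample_in_triangle(a, b, c, n, seed=0):
    """Uniformly sample n points inside triangle (a, b, c).
    Two uniform parameters (u, v) are reflected back into u + v <= 1,
    which yields a uniform distribution over the triangle."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u, v = rng.random(), rng.random()
        if u + v > 1:              # reflect to stay inside the triangle
            u, v = 1 - u, 1 - v
        x = a[0] + u * (b[0] - a[0]) + v * (c[0] - a[0])
        y = a[1] + u * (b[1] - a[1]) + v * (c[1] - a[1])
        points.append((x, y))
    return points

# A thousand dots that "more or less fill" the unit right triangle.
dots = sample_in_triangle((0, 0), (1, 0), (0, 1), 1000)
```

A human looking at a scatter plot of `dots` sees "a triangle" immediately; the low-entropy description (three vertices) compresses the thousand coordinates, which is the compression-as-intelligence point being made above.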

saithound | 8 months ago

Cool. Since ChatGPT 4o is actually really good at this particular shape-identification task, what, if anything, do you conclude about its intelligence?