birdfood | 8 months ago
This image shows a minimalist, abstract geometric composition with several elements:
* Four black shapes that appear to be partial circles or "Pac-Man"-like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image

* Two thin black triangular or arrow-like shapes: one pointing upward in the upper-left area, and one pointing to the right in the center-right area

* All elements are arranged on a light gray or off-white background
latentsea|8 months ago
recursivecaveat|8 months ago
mirekrusin|8 months ago
That's how humans learn too, e.g. adding numbers: first there is naive memorization, followed by more examples until you get it.
LLM training seems to be falling into the memorization trap because models are extremely good at it, orders of magnitude better than humans.
IMHO what is missing in the training process is feedback explaining the wrong answer. Current training leaves that understanding as an "exercise to the reader": we feed correct answers to specific, individual examples, which promotes memorization.
What we should be doing in post-training is ditch direct backpropagation on the next token. Instead, let the model finish its wrong answer, append an explanation of why it's wrong, and continue backpropagation on the final answer, now with the explanation in context to guide it to the right place in understanding.
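The idea above can be sketched as a loss-masking scheme: the model's wrong attempt and the corrective explanation stay in context, but gradient flows only through the final-answer tokens. This is a minimal illustrative sketch, not a real training loop; the function name and whitespace tokenization are assumptions for clarity.

```python
# Sketch of the proposed post-training setup. The wrong answer and the
# explanation are kept in context (mask = 0, no gradient); only the final
# answer is supervised (mask = 1). Names here are hypothetical.
def build_training_example(prompt, wrong_answer, explanation, final_answer):
    """Return (tokens, loss_mask); mask is 1 only on final-answer tokens."""
    tokens, mask = [], []
    for segment, supervised in [
        (prompt, 0),        # original question
        (wrong_answer, 0),  # model's own failed attempt, kept in context
        (explanation, 0),   # feedback on why the attempt is wrong
        (final_answer, 1),  # only these tokens receive gradient
    ]:
        seg_tokens = segment.split()  # stand-in for a real tokenizer
        tokens.extend(seg_tokens)
        mask.extend([supervised] * len(seg_tokens))
    return tokens, mask

tokens, mask = build_training_example(
    "2 + 2 =", "5", "incorrect : 2 + 2 is 4 , not 5", "4"
)
```

In a framework like PyTorch the same effect is usually achieved by setting the label of every unsupervised position to the loss function's ignore index, so cross-entropy skips those tokens.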
What all of this means is that current models are largely underutilized and unnecessarily bloated: they contain far too much memorized information. Making a model larger is easy, a quick illusion of improvement. Models need to be squeezed harder, and more focus needs to go into the training process itself.
littlestymaar|8 months ago
I just wish the people who believe LLMs can actually reason and generalize would see that they don't.
Workaccount2|8 months ago
pfdietz|8 months ago
JohnKemeny|8 months ago
Oct 2011, 30 comments.
https://news.ycombinator.com/item?id=3163473
Strange loop video:
July 2011, 36 comments.
https://news.ycombinator.com/item?id=2820118
iknownothow|8 months ago
It is well known that LLMs have a long way to go when it comes to processing images the way they process text or audio.
I don't think there's any well-performing multimodal model that accepts image pixels directly. Most vision capabilities are hacks or engineered in: an image undergoes several processing steps, and each processor's outputs are fed to the transformer as tokens. This may all happen within one network, but non-transformer networks are involved. Examples of preprocessing:
* OCR

* CNNs (2D pattern recognizers) with different zooms, angles, slices, etc.

* Others, maybe, too?
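The pipeline described above (image in, "image tokens" out) can be sketched roughly as follows. This is a simplified stand-in, not any real model's API: the image is cut into patches and each patch is summarized into a feature vector, playing the role of a CNN/projection stage whose outputs a transformer would consume as tokens.

```python
# Illustrative sketch of image-to-token preprocessing. The (mean, max)
# summary is a stand-in for a learned CNN feature extractor.
def image_to_tokens(image, patch_size):
    """image: 2D list of pixel values; returns one feature tuple per patch."""
    h, w = len(image), len(image[0])
    tokens = []
    for py in range(0, h, patch_size):
        for px in range(0, w, patch_size):
            patch = [image[y][x]
                     for y in range(py, py + patch_size)
                     for x in range(px, px + patch_size)]
            # Stand-in for a CNN: summarize the patch as (mean, max) features.
            tokens.append((sum(patch) / len(patch), max(patch)))
    return tokens

image = [[0, 0, 255, 255],
         [0, 0, 255, 255],
         [10, 10, 0, 0],
         [10, 10, 0, 0]]
tokens = image_to_tokens(image, patch_size=2)
# A 4x4 image with 2x2 patches yields 4 tokens
```

In real systems the patch features are projected into the transformer's embedding space and concatenated with the text tokens, which is why the transformer itself never sees raw pixels.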
akomtu|8 months ago
saithound|8 months ago