item 40184467

The reason it can't do that is that, for example, "twenty" and "20" sit very close together in the vector embedding space, so in most contexts the model can't reliably tell them apart. The same holds for almost any task that depends on how the words look rather than what they mean. Any kind of meta request like this is going to be very difficult for an LLM, but a multimodal GPT model should be able to handle it.
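A toy sketch of the "look vs. mean" distinction, assuming a deliberately simplified picture: the 4-dimensional vectors below are made-up numbers, not output from any real model (real embeddings are learned and have hundreds or thousands of dimensions). Comparing by meaning (cosine similarity) puts "twenty" and "20" almost on top of each other, while a trivial surface-form check tells them apart instantly. The names `TOY_EMBEDDINGS`, `cosine_similarity`, and `looks_like_digits` are hypothetical, chosen for illustration.

```python
import math

# Hypothetical toy "embeddings": invented values for illustration only,
# NOT taken from any real model.
TOY_EMBEDDINGS = {
    "twenty": [0.81, 0.40, 0.12, 0.05],
    "20":     [0.79, 0.43, 0.10, 0.07],
    "banana": [0.02, 0.10, 0.90, 0.35],
}


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def looks_like_digits(word):
    """A surface-form question ("how the word looks"): is it written in digits?"""
    return word.isdigit()


if __name__ == "__main__":
    e = TOY_EMBEDDINGS
    # Meaning-level comparison: the two spellings of the number are nearly identical.
    print("twenty vs 20:    ", round(cosine_similarity(e["twenty"], e["20"]), 3))
    print("twenty vs banana:", round(cosine_similarity(e["twenty"], e["banana"]), 3))
    # Surface-level comparison: trivially different.
    print("digits?", looks_like_digits("twenty"), looks_like_digits("20"))
```

With these toy numbers the "twenty"/"20" similarity comes out just under 1.0 and "twenty"/"banana" far lower, while the digit check separates the two spellings perfectly. A model that mostly "sees" the first kind of comparison has little to go on for the second.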

Xenoamorphous|1 year ago

Thanks, I’ll try the multimodal one.

Xenoamorphous|1 year ago

Tried it; it didn't perform any better than the non-multimodal one.