
tifa2up | 3 months ago

We tried GPT-5 for a RAG use case, and found that it performs worse than 4.1. We reverted and didn't look back.


sigmoid10 | 3 months ago

4.1 is such an amazing model in so many ways. It's still my No. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x that of GPT-5). Definitely the best non-reasoning model out there for real-world tasks.

HugoDias | 3 months ago

Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.

tifa2up | 3 months ago

For large contexts (up to 100K tokens in some cases). We found that GPT-5: a) has worse instruction following and doesn't stick to the system prompt; b) produces very long answers, which resulted in a bad UX; c) has a 125K-token context window, so extreme cases resulted in an error.
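The overflow error in point c) is the kind of thing a pre-flight budget check catches. A minimal sketch, assuming a crude ~4-characters-per-token estimate (the real count depends on the model's tokenizer) and an illustrative limit, not an official one:

```python
# Guard against context-window overflows before calling the model.
# MAX_CONTEXT_TOKENS and the chars-per-token ratio are illustrative assumptions.
MAX_CONTEXT_TOKENS = 125_000
CHARS_PER_TOKEN = 4  # rough average for English text

def estimate_tokens(text: str) -> int:
    """Cheap token estimate; swap in the model's real tokenizer in production."""
    return len(text) // CHARS_PER_TOKEN

def truncate_to_budget(chunks: list[str], budget: int = MAX_CONTEXT_TOKENS) -> list[str]:
    """Keep retrieved chunks in rank order until the estimated budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            break  # dropping lower-ranked chunks beats a hard API error
        kept.append(chunk)
        used += cost
    return kept
```

Dropping the lowest-ranked chunks is the simplest policy; summarizing or re-chunking overflow content are common alternatives.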

mbesto | 3 months ago

How do you objectively tell whether a model "performs" better than another?

belval | 3 months ago

Not the original commenter, but I work in the space, and we have large annotated datasets with "gold" evidence that we want to retrieve; the evaluation of new models is actually very quantitative.
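The gold-evidence evaluation described above typically reduces to a retrieval metric like recall@k. A minimal sketch, assuming each query is annotated with a set of gold document IDs (all names here are illustrative, not from any specific framework):

```python
# Quantitative retrieval eval against "gold" evidence annotations.
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k retrieved results."""
    if not gold:
        return 0.0
    hits = len(set(retrieved[:k]) & gold)
    return hits / len(gold)

def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Mean recall@k over (retrieved, gold) pairs for a whole eval set."""
    return sum(recall_at_k(retrieved, gold, k) for retrieved, gold in runs) / len(runs)
```

Comparing two models is then a matter of running the same annotated queries through each pipeline and comparing the aggregate scores.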

teekert | 3 months ago

So… you did look back, then didn't look forward anymore… sorry, couldn't resist.