4.1 is such an amazing model in so many ways. It's still my No. 1 choice for many automation tasks. Even the mini version works quite well, and it has the same massive context window (nearly 8x GPT-5's). Definitely the best non-reasoning model out there for real-world tasks.
Can you elaborate on that? In which part of the RAG pipeline did GPT-4.1 perform better? I would expect GPT-5 to perform better on longer-context tasks, especially when it comes to understanding the pre-filtered results and reasoning about them.
For large contexts (up to 100K tokens in some cases). We found that GPT-5:

a) has worse instruction following and doesn't stick to the system prompt

b) produces very long answers, which resulted in a bad UX

c) has a 125K context window, so extreme cases resulted in an error
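The overflow failure in (c) is straightforward to guard against with a pre-flight length check before calling the model. A minimal sketch, assuming a rough 4-characters-per-token heuristic and the 125K limit the commenter mentions (in production you would use a real tokenizer such as tiktoken; none of these constants are official):

```python
MAX_CONTEXT_TOKENS = 125_000  # limit the commenter reports hitting (assumed)
CHARS_PER_TOKEN = 4           # rough heuristic for English text (assumed)

def estimate_tokens(text: str) -> int:
    """Cheap token estimate; swap in a real tokenizer for accuracy."""
    return len(text) // CHARS_PER_TOKEN

def fits_context(prompt: str, reserve_for_output: int = 4_000) -> bool:
    """True if the prompt leaves room for the response within the window."""
    return estimate_tokens(prompt) + reserve_for_output <= MAX_CONTEXT_TOKENS
```

When `fits_context` returns False, the caller can truncate or re-rank the retrieved passages instead of letting the API call fail.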
Not the original commenter, but I work in the space, and we have large annotated datasets with "gold" evidence that we want to retrieve, so the evaluation of new models is actually very quantitative.