This is basically how I respond to requests myself. Sometimes a single short sentence will cause me to slowly spit out a few words. Other times I can respond instantly to paragraphs of technical information with high accuracy and detailed explanations. There seems to be no way to predict my performance.
wolpoli|2 years ago
Is it possible that you have a caching system too so that you are able to respond instantly with paragraphs of technical information to some types of requests that you have seen before?
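The kind of caching being asked about could be sketched like this. Everything here is hypothetical, including the `expensive_model_call` stub: it just illustrates how a provider might key a cache on a normalized prompt so that repeated requests return instantly without re-running the model.

```python
from functools import lru_cache

def expensive_model_call(prompt: str) -> str:
    # Stand-in for the slow model inference step.
    return f"answer to: {prompt}"

def normalize(prompt: str) -> str:
    """Collapse case and whitespace so trivially different prompts share a cache key."""
    return " ".join(prompt.lower().split())

@lru_cache(maxsize=1024)
def cached_answer(normalized_prompt: str) -> str:
    # Only reached on a cache miss; hits skip the model entirely.
    return expensive_model_call(normalized_prompt)

def respond(prompt: str) -> str:
    return cached_answer(normalize(prompt))
```

Real systems would more likely cache at the level of attention key/value states or use semantic similarity rather than exact string matching, but the speedup mechanism is the same.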
clbrmbr|2 years ago
As far as I understand, the earlier GPT generations required a fixed amount of compute per token inferred.
But given the tremendous load on their systems, I wouldn’t be surprised if OpenAI is playing games with running a smaller model when they predict they can get away with it. (Is there evidence for this?)
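The fixed-compute point can be made concrete with the standard back-of-envelope estimate: a transformer forward pass costs roughly 2 FLOPs per parameter per generated token (one multiply plus one add per weight), independent of how "hard" the prompt is. The model size below is just an example at GPT-3 scale.

```python
def flops_per_token(n_params: float) -> float:
    """Rough estimate: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params

# A 175B-parameter model spends about 350 GFLOPs on every token,
# whether the question was trivial or technical.
print(flops_per_token(175e9))
```

This is why variable response speed points to serving-side factors (load, batching, or swapping in a smaller model) rather than the model "thinking harder" on some prompts.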