crazygringo|6 months ago
How does a prompt this long affect resource usage?
Does inference need to process this whole thing from scratch at the start of every chat?
Or is there some way to cache the state of the LLM after processing this prompt, before the first user token is received, and every request starts from this cached state?
My understanding is that's what the KV cache does in model serving. I would imagine they'd want to prime any such KV cache with the common prefix tokens, but retain a per-session cache to avoid leaks. It seems HF agrees with the concept, at least: https://huggingface.co/docs/transformers/kv_cache#prefill-a-...
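Roughly, the prefill pattern from that HF page looks like the sketch below: run the system prompt through the model once, keep the resulting KV cache, then deep-copy it per request so no session's tokens ever end up in the shared copy. (The model name and prompts here are placeholders, not anything any provider actually runs.)

    import copy
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

    model_id = "meta-llama/Llama-3.2-1B-Instruct"   # stand-in model
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    SYSTEM_PROMPT = "You are a helpful assistant. ..."   # imagine 50 kB of this

    # Prefill once: run only the system prompt and keep the KV cache it produces.
    prompt_inputs = tokenizer(SYSTEM_PROMPT, return_tensors="pt").to(model.device)
    prompt_cache = DynamicCache()
    with torch.no_grad():
        prompt_cache = model(**prompt_inputs, past_key_values=prompt_cache).past_key_values

    def answer(user_message: str) -> str:
        # Per-session copy, so one session's tokens never leak into the shared cache.
        session_cache = copy.deepcopy(prompt_cache)
        inputs = tokenizer(SYSTEM_PROMPT + user_message, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, past_key_values=session_cache, max_new_tokens=64)
        return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)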
It's a good thing people were enamored of how inexpensive GPT-5 is, given that the system prompt is (allegedly) 54 kB. I don't know how many tokens that is offhand, but that's a lot of them to burn just on setting the thing up.
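For a rough ballpark, at the usual ~4 characters per English token:

    chars = 54 * 1024      # the alleged "54 kB", taken at face value
    tokens = chars / 4     # ~4 chars/token is only a rule of thumb for English text
    print(round(tokens))   # ~13,800 tokens before the user has typed anything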
I might be wrong, but can't you checkpoint the model's state right after the system prompt and restore from there, trading memory for compute? Or is that too much extra state?
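The extra state is essentially just the KV cache for the prompt tokens, so you can estimate it from the model shape. A back-of-envelope sketch using Llama-3-70B-style dimensions purely as a stand-in (the real model's layout isn't public):

    # Hypothetical dims: 80 layers, 8 KV heads (GQA), head_dim 128, fp16 (2 bytes)
    layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2
    prompt_tokens = 13_824                        # the ~54 kB prompt estimated above

    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes   # K and V planes
    total_gib = per_token * prompt_tokens / 2**30
    print(f"{per_token // 1024} KiB per token, {total_gib:.1f} GiB for the whole prompt")
    # -> 320 KiB per token, ~4.2 GiB: sizeable, but it's read-only and can be shared
    #    across every session that uses the same system prompt.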
It's because they always put in things that seem way too specific to certain issues, like riddles and arithmetic. Also, I am not a WS, but the mention of "Proud Boys" is the kind of thing that can be used as fodder for claims of LLM bias. I wonder why they even have to use a system prompt; why can't they have a separate fine-tuned model for ChatGPT specifically, so that they don't need a system prompt?
dgreensp|6 months ago
> Place rich UI elements within tables, lists, or other markdown elements when appropriate.
Tadpole9181|6 months ago
These are NOT included in the model context size for pricing.