top | item 44240493

HugoDias | 8 months ago

This document explains the process very well. It’s a good read: https://platform.openai.com/docs/guides/prompt-caching

xmprt|8 months ago

That link explains how OpenAI uses it, but doesn't really walk through why it's any faster. I thought the whole point of transformers was that inference speed no longer depended on prompt length. So how does caching the prompt help reduce latency if the outputs aren't being cached?

> Regardless of whether caching is used, the output generated will be identical. This is because only the prompt itself is cached, while the actual response is computed anew each time based on the cached prompt

singron|8 months ago

> I thought the whole point of transformers was that inference speed no longer depended on prompt length

That's not true at all, and it's exactly what prompt caching is for. For one, you can at least populate the attention KV cache, and that prefill cost scales with prompt size. It's true that if your prompt is larger than the context window, then prompt size no longer affects inference speed, since the excess is essentially discarded.
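To make that concrete, here's a toy single-head attention decode step in numpy (all names and shapes are illustrative, not any real implementation). The expensive part a prompt cache skips is computing K/V for every prompt token before the first output token; once the cache exists, each generated token only needs its own Q/K/V plus one pass over the cached keys:

```python
import numpy as np

def attention_decode_step(x_new, K_cache, V_cache, Wq, Wk, Wv):
    """Decode one token: only the new token's Q/K/V are computed;
    K/V for all earlier tokens come straight from the cache."""
    q = x_new @ Wq
    k = x_new @ Wk
    v = x_new @ Wv
    K = np.vstack([K_cache, k])          # (t+1, d)
    V = np.vstack([V_cache, v])
    scores = K @ q / np.sqrt(q.shape[0]) # one dot product per cached token
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V, K, V

d = 8
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

# Prefill: without a cached prompt, K/V for every prompt token must be
# computed up front -- this cost grows with prompt length, and it is
# exactly the work a prompt cache lets the server reuse.
prompt = rng.standard_normal((100, d))
K_cache, V_cache = prompt @ Wk, prompt @ Wv

# With the cache warm, generating the next token is a single decode step.
out, K_cache, V_cache = attention_decode_step(
    rng.standard_normal(d), K_cache, V_cache, Wq, Wk, Wv)
print(K_cache.shape)  # (101, 8)
```

Outputs are still computed fresh each time (matching the quote above); only the prompt-side K/V tensors are reused.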

catlifeonmars|8 months ago

> OpenAI routes API requests to servers that recently processed the same prompt,

My mind immediately goes to rowhammer for some reason.

At the very least this opens up the possibility of some targeted denial of service

xmprt|8 months ago

Later they mention that they have some kind of rate limiting because if over ~15 requests are being processed per minute, the request will be sent to a different server. I guess you could deny cache usage but I'm not sure what isolation they have between different callers so maybe even that won't work.
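The routing behavior being described could be sketched like this (purely hypothetical, not OpenAI's actual implementation): hash a prompt prefix to pick a cache-warm server, and spill over to the next one once the preferred server sees more than ~15 requests in the last minute. The prefix length, server count, and limit are all assumptions for illustration:

```python
import hashlib
from collections import defaultdict, deque

NUM_SERVERS = 4
RPM_LIMIT = 15                 # assumption: ~15 req/min before spillover, per the thread
recent = defaultdict(deque)    # server -> timestamps of its recent requests

def route(prompt: str, now: float) -> int:
    """Pick a server by hashing the prompt prefix (so identical prefixes
    land on the same cache-warm machine), spilling to the next server
    when the preferred one exceeds the rate limit."""
    prefix = prompt[:1024]     # hypothetical cache-key prefix length
    base = int(hashlib.sha256(prefix.encode()).hexdigest(), 16) % NUM_SERVERS
    for hop in range(NUM_SERVERS):
        server = (base + hop) % NUM_SERVERS
        q = recent[server]
        while q and now - q[0] > 60:   # drop requests older than a minute
            q.popleft()
        if len(q) < RPM_LIMIT:
            q.append(now)
            return server
    return base  # everything busy: fall back to the cache-warm server anyway

# Identical prompts route to the same server until it saturates, then spill.
servers = [route("You are a helpful assistant...", t) for t in range(20)]
```

Under this sketch, a flood of identical prompts would indeed push later requests onto cache-cold servers, which is the denial-of-cache concern above; whether callers are isolated from each other's cache keys is the open question.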