ahzhou | 1 year ago

It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention paper (which also introduces vLLM) for how most systems implement it nowadays.
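For a rough idea of the paging part, here is a toy sketch of the bookkeeping only. It assumes the tensor in question is the attention key/value cache used during LLM inference, which the comment does not state outright. `PagedCache`, `BLOCK_SIZE`, and `append_token` are made-up names for illustration, not vLLM's API, and no actual GPU memory is touched.

```python
# Toy illustration of paged cache bookkeeping. Hypothetical names throughout;
# this is not vLLM's implementation and it allocates no GPU memory.
BLOCK_SIZE = 4  # tokens per physical block (assumed value for the example)


class PagedCache:
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # seq_id -> list of physical block ids, in logical order
        self.lengths = {}       # seq_id -> number of tokens stored so far

    def append_token(self, seq_id):
        """Reserve a slot for one new token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # No block yet, or the last block is full: grab a new one on demand.
            if not self.free_blocks:
                raise MemoryError("no free cache blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1
        return table[n // BLOCK_SIZE], n % BLOCK_SIZE

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


cache = PagedCache(num_blocks=3)
slots = [cache.append_token("request-a") for _ in range(5)]
print(slots)                              # five (block, offset) pairs
print(cache.block_tables["request-a"])    # two blocks, not necessarily contiguous
```

The point of the sketch is that a sequence's cache does not need one contiguous region sized for the maximum length. Blocks are handed out as tokens arrive and go back to the pool when the request finishes, so more requests fit in the same memory.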
