ahzhou | 1 year ago
It’s a tensor stored in GPU memory to improve inference throughput. Check out the PagedAttention paper (which also introduces vLLM) for how most systems implement it nowadays.
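For anyone who wants something concrete: here is a rough sketch of the block-table idea, assuming "it" refers to the attention key/value (KV) cache, which is what the PagedAttention paper manages. This is not vLLM's actual code or API. The class name, method names, and sizes are all made up for illustration, and NumPy on the CPU stands in for a GPU tensor.

```python
import numpy as np

BLOCK_SIZE = 4   # tokens per block (hypothetical value)
NUM_BLOCKS = 8   # physical blocks in the pool (hypothetical value)
HEAD_DIM = 2     # size of each key/value vector (hypothetical value)


class PagedKVCache:
    """Toy block-based KV cache; every name here is illustrative."""

    def __init__(self):
        # One preallocated pool. In a real system this tensor lives in GPU memory.
        # Axis 1 separates keys (index 0) from values (index 1).
        self.pool = np.zeros((NUM_BLOCKS, 2, BLOCK_SIZE, HEAD_DIM), dtype=np.float32)
        self.free_blocks = list(range(NUM_BLOCKS))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append(self, seq_id, key, value):
        """Store one token's key/value, allocating a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:
            # The last block is full (or none exists yet): take one from the pool.
            if not self.free_blocks:
                raise MemoryError("no free blocks left in the pool")
            table.append(self.free_blocks.pop())
        block, slot = table[n // BLOCK_SIZE], n % BLOCK_SIZE
        self.pool[block, 0, slot] = key
        self.pool[block, 1, slot] = value
        self.lengths[seq_id] = n + 1

    def gather(self, seq_id):
        """Return the sequence's keys and values in logical token order."""
        n = self.lengths[seq_id]
        blocks = self.pool[self.block_tables[seq_id]]
        keys = blocks[:, 0].reshape(-1, HEAD_DIM)[:n]
        values = blocks[:, 1].reshape(-1, HEAD_DIM)[:n]
        return keys, values

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.lengths[seq_id]
```

The point of the indirection is that a sequence's blocks do not have to be contiguous in the pool, so memory is handed out one block at a time instead of being reserved up front for the maximum sequence length.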