animan | 1 month ago | on: Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed
animan's comments
animan | 8 months ago | on: Life of an inference request (vLLM V1): How LLMs are served efficiently at scale
For decode, on the other hand, you have to generate each token sequentially.
animan | 8 months ago | on: Life of an inference request (vLLM V1): How LLMs are served efficiently at scale
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have somewhat different workloads:

1. Prefill does relatively little I/O from HBM and is compute-heavy.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage back from HBM for every output token.
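To make the prefill/decode split concrete, here is a minimal toy sketch (single attention head in NumPy, not vLLM's actual implementation): prefill computes keys and values for the whole prompt in one batched matmul, while each decode step produces one token and must re-read the entire growing KV cache.

```python
import numpy as np

d = 8                      # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query against the cached K/V."""
    scores = q @ K.T / np.sqrt(d)           # (1, T)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # (1, d)

# --- Prefill: one pass over the whole prompt (batched, compute-heavy) ---
prompt = rng.standard_normal((5, d))        # 5 prompt-token embeddings
K_cache = prompt @ Wk                       # keys for all prompt tokens at once
V_cache = prompt @ Wv

# --- Decode: one token at a time (reads every cached K/V row per step) ---
x = prompt[-1:]                             # stand-in for the last embedding
for _ in range(3):
    q = x @ Wq
    out = attend(q, K_cache, V_cache)       # touches the full cache
    x = out                                 # toy "next token" embedding
    K_cache = np.vstack([K_cache, x @ Wk])  # cache grows by one row per token
    V_cache = np.vstack([V_cache, x @ Wv])

print(K_cache.shape)   # -> (8, 8): 5 prompt + 3 generated tokens
```

Note how the decode loop's cost per token is dominated by reading K_cache/V_cache, which is exactly why decode ends up memory-bandwidth-bound while prefill is compute-bound.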
animan | 1 year ago | on: CrowdStrike Bug likely caused by unsafe NULL Pointer