animan | 1 month ago | on: Improving 15 LLMs at Coding in One Afternoon. Only the Harness Changed
animan's comments
animan | 8 months ago | on: Life of an inference request (vLLM V1): How LLMs are served efficiently at scale
For decode, on the other hand, you have to generate each token sequentially.
animan | 8 months ago | on: Life of an inference request (vLLM V1): How LLMs are served efficiently at scale
Decode is the next major step where you start generating output tokens one at a time.
Both run on GPUs but have somewhat different workloads:

1. Prefill does relatively little I/O from HBM and is compute-heavy.
2. Decode is light on compute but has to read the keys and values computed in the prefill stage back from HBM for every output token.
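To make the prefill/decode split concrete, here is a minimal toy sketch (single attention head in NumPy, not vLLM's actual implementation): prefill computes keys and values for the whole prompt in one batched matmul, while each decode step produces one token and must re-read the entire growing KV cache.

```python
import numpy as np

d = 8                      # head dimension (toy size)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    """Scaled dot-product attention for one query against the cached K/V."""
    scores = q @ K.T / np.sqrt(d)           # (1, T)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                      # (1, d)

# --- Prefill: one pass over the whole prompt (batched, compute-heavy) ---
prompt = rng.standard_normal((5, d))        # 5 prompt-token embeddings
K_cache = prompt @ Wk                       # keys for all prompt tokens at once
V_cache = prompt @ Wv

# --- Decode: one token at a time (reads every cached K/V row per step) ---
x = prompt[-1:]                             # stand-in for the last embedding
for _ in range(3):
    q = x @ Wq
    out = attend(q, K_cache, V_cache)       # touches the full cache
    x = out                                 # toy "next token" embedding
    K_cache = np.vstack([K_cache, x @ Wk])  # cache grows by one row per token
    V_cache = np.vstack([V_cache, x @ Wv])

print(K_cache.shape)   # -> (8, 8): 5 prompt + 3 generated tokens
```

Note how the decode loop's cost per token is dominated by reading K_cache/V_cache, which is exactly why decode ends up memory-bandwidth-bound while prefill is compute-bound.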
animan | 1 year ago | on: CrowdStrike Bug likely caused by unsafe NULL Pointer