
Rhubarrbb | 6 months ago

Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? Processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or GLM-4.5-Air as my main driver.


ghc | 6 months ago

Here's a sample run of the 120b model on Ollama with my MBP:

```
total duration:       1m14.16469975s
load duration:        56.678959ms
prompt eval count:    3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate:     363.34 tokens/s
eval count:           2479 token(s)
eval duration:        1m3.284597459s
eval rate:            39.17 tokens/s
```
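
For reference, the reported rates are just the token counts divided by the durations; a quick sanity check of the figures above:

```
# Figures copied from the Ollama run above; rate = count / duration.
prompt_tokens, prompt_seconds = 3921, 10.791
gen_tokens, gen_seconds = 2479, 63.285

print(f"prompt eval rate: {prompt_tokens / prompt_seconds:.2f} tokens/s")  # ~363.4
print(f"eval rate:        {gen_tokens / gen_seconds:.2f} tokens/s")        # ~39.2
```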

andai | 6 months ago

You mentioned this being a limitation "with local agents". I've noticed it too. How do ChatGPT and the others get around this and provide instant responses in long conversations?

bluecoconut | 6 months ago

They're not getting around it, just benefiting from the parallel compute and huge FLOPS of datacenter GPUs. Fundamentally, prefill compute is highly parallel, and HBM is simply that much faster than LPDDR. In practice, H100s and B100s can chew through the prefill of a ~50k-token prompt in under a second, so the TTFT (time to first token) feels amazingly fast.
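
A back-of-the-envelope sketch of what that difference means for a prompt the size of the Ollama run above (the 50k tokens/s figure is only the ballpark implied here, not a measurement):

```
# Rough TTFT estimate: prompt tokens / prefill throughput.
prompt_tokens = 3921       # prompt size from the Ollama run above

local_prefill = 363        # tokens/s measured on the MBP above
gpu_prefill = 50_000       # tokens/s, assumed H100-class ballpark (not a benchmark)

print(f"local TTFT: ~{prompt_tokens / local_prefill:.1f} s")  # ~10.8 s
print(f"GPU TTFT:   ~{prompt_tokens / gpu_prefill:.2f} s")    # ~0.08 s
```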

mike_hearn | 6 months ago

They cache the intermediate data (KV cache).
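
Roughly: the attention keys/values computed for an unchanged conversation prefix don't have to be recomputed on every turn. A toy sketch of prefix caching (the names are made up, purely to illustrate the lookup; real servers cache per block/page rather than per whole prefix):

```
kv_cache = {}  # tuple of prefix token ids -> precomputed attention keys/values

def prefill(tokens, run_attention):
    # run_attention(kv_state, new_tokens) stands in for the model's forward pass.
    # Reuse the cached KV state of the longest already-seen prefix.
    for cut in range(len(tokens), 0, -1):
        cached = kv_cache.get(tuple(tokens[:cut]))
        if cached is not None:
            kv_state, new_tokens = cached, tokens[cut:]  # only the tail needs compute
            break
    else:
        kv_state, new_tokens = None, tokens              # nothing cached: full prefill
    kv_state = run_attention(kv_state, new_tokens)
    kv_cache[tuple(tokens)] = kv_state                   # store for the next turn
    return kv_state
```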

anonymoushn | 6 months ago

It's odd that the result of this processing cannot be cached.

lostmsu | 6 months ago

It can be, and it is by most good inference frameworks.
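
For example, vLLM exposes this as automatic prefix caching. A minimal sketch (the model name is only illustrative, and the flag should be checked against the installed vLLM version):

```
from vllm import LLM, SamplingParams

# Prefix caching reuses KV-cache blocks for shared prompt prefixes
# (e.g. a long, fixed system prompt) across requests, so repeated
# turns skip most of the prefill work.
llm = LLM(model="openai/gpt-oss-120b", enable_prefix_caching=True)  # model name illustrative
outputs = llm.generate(["<long system prompt>\n\nuser question"],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```

llama.cpp and Ollama have broadly similar prompt/KV caching mechanisms, so a repeated system prompt only pays the prefill cost once.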