Generation is usually fast, but prompt processing is the main limitation with local agents. I also have a 128 GB M4 Max. How is the prompt processing on long prompts? Processing the system prompt for Goose always takes quite a while for me. I haven't been able to download the 120B yet, but I'm looking to switch to either that or GLM-4.5-Air as my main driver.
You mentioned "local agents". I've noticed this too. How do ChatGPT and the others get around this and provide instant responses in long conversations?
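One standard technique here (likely part of the answer, though I can't speak to any provider's internals) is prefix/KV caching: the attention state for the shared prefix (system prompt plus prior turns) is kept around, so only the newly added tokens pay the prompt-processing cost on each turn. A minimal back-of-the-envelope sketch, using the ~363 tokens/s prompt-eval rate from the ollama output in this thread and a hypothetical 50-token follow-up message:

```python
def time_to_first_token(prompt_tokens: int, cached_tokens: int, pp_rate: float) -> float:
    """Seconds of prompt processing before the first output token.

    Only tokens not covered by the cached prefix need to be processed.
    """
    return max(prompt_tokens - cached_tokens, 0) / pp_rate

PP_RATE = 363.34  # tokens/s, prompt-eval rate from the stats in this thread

# Cold start: the entire 3921-token prompt must be processed (~10.8 s).
cold = time_to_first_token(3921, 0, PP_RATE)

# Next turn: 3921 tokens already cached, user adds a 50-token message (~0.14 s).
warm = time_to_first_token(3921 + 50, 3921, PP_RATE)

print(f"cold: {cold:.2f} s, warm: {warm:.2f} s")
```

With the prefix cached, time-to-first-token drops by roughly two orders of magnitude, which is why long conversations can still feel instant.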
Rhubarrbb|6 months ago
ghc|6 months ago
```
total duration: 1m14.16469975s
load duration: 56.678959ms
prompt eval count: 3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate: 363.34 tokens/s
eval count: 2479 token(s)
eval duration: 1m3.284597459s
eval rate: 39.17 tokens/s
```
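For reference, the reported rates are just token counts divided by the corresponding durations, so the numbers above can be sanity-checked directly:

```python
# Sanity-check the rates from the verbose output: rate = tokens / seconds.
prompt_rate = 3921 / 10.791402416
eval_rate = 2479 / 63.284597459

print(f"prompt eval rate: {prompt_rate:.2f} tokens/s")  # matches 363.34
print(f"eval rate: {eval_rate:.2f} tokens/s")           # matches 39.17
```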