The pattern I found that works: use a small local model (Llama 3B via Ollama, only about 2GB) for heartbeat checks. It just needs to answer "is there anything urgent?", which is a yes/no classification task, not a frontier reasoning task. Reserve the expensive model for actual work. Done right, this can cut token spend by roughly 75% in practice without meaningfully degrading heartbeat quality. The tricky part is the routing logic: deciding which calls can go to the cheap model and which actually need the real one. It can be a doozy. I've done this with three lobsters, let me know if you have any questions.
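A minimal sketch of that routing, assuming a local Ollama server on its default port (the model tag, prompt wording, and the escalate_to_frontier_model hook are illustrative stand-ins, not any particular setup):

```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default REST endpoint

HEARTBEAT_PROMPT = (
    "You are a triage filter. Given this status summary, answer with "
    "exactly one word, YES or NO: is there anything urgent?\n\n{summary}"
)

def heartbeat_is_urgent(summary: str) -> bool:
    """Ask the small local model the yes/no heartbeat question."""
    resp = requests.post(OLLAMA_URL, json={
        "model": "llama3.2:3b",  # small ~2GB model; use whatever tag you run locally
        "prompt": HEARTBEAT_PROMPT.format(summary=summary),
        "stream": False,
    }, timeout=30)
    resp.raise_for_status()
    return resp.json()["response"].strip().upper().startswith("YES")

def escalate_to_frontier_model(summary: str) -> None:
    # Hypothetical hook: this is where your expensive-model call goes.
    raise NotImplementedError

def handle_heartbeat(summary: str) -> None:
    if heartbeat_is_urgent(summary):
        escalate_to_frontier_model(summary)
    # Otherwise do nothing: the heartbeat cost zero frontier tokens.
```

The design choice that matters is keeping the cheap model's job strictly binary, and treating anything ambiguous as YES, so the worst case is one wasted frontier call rather than a missed urgent event.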
theshrike79|7 days ago
Like "if it's raining, remind me to grab my umbrella before I leave for work"
-> "is it raining?" requires a tool call to a weather service
-> "before I leave for work" needs access to the user's calendar and information when they leave compared to the time their work day starts
-> "remind me" needs a way to communicate to the user in an efficient way, Telegram, iMessage or Whatsapp for example.