
mercutio2 | 1 month ago

What toolchain are you going to use with the local model? I agree that it's a strong model, but it's so slow for me with large contexts that I've stopped using it for coding.


embedding-shape | 1 month ago

I have my own agent harness, and the inference backend is vLLM.
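(For context: vLLM exposes an OpenAI-compatible HTTP API, so a custom harness can talk to it with plain JSON requests. A minimal sketch of building such a request; the endpoint URL and model name are hypothetical placeholders, not the commenter's actual setup:)

```python
import json
from urllib.request import Request

# vLLM's OpenAI-compatible server exposes /v1/chat/completions by default.
# The base URL and model name below are illustrative placeholders.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(messages, model="local-120b-model", max_tokens=512):
    """Build the JSON request a harness would POST to a vLLM server."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
    }
    return Request(
        VLLM_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request([{"role": "user", "content": "Hello"}])
```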

mercutio2 | 1 month ago

Can you tell me more about your agent harness? If it’s open source, I’d love to take it for a spin.

I would happily use local models if I could get them to perform, but they’re super slow if I bump their context window high, and I haven’t seen good orchestrators that keep context limited enough.
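(The context-limiting problem described above is something harnesses often handle by trimming old turns before each call. A hand-rolled sketch, using a character budget purely for illustration; a real orchestrator would count tokens with the model's tokenizer:)

```python
def trim_context(messages, budget=8000):
    """Keep the system prompt plus the most recent turns under a rough budget.

    `messages` is an OpenAI-style list of {"role": ..., "content": ...} dicts.
    The budget is measured in characters here for simplicity; swap in a real
    tokenizer for anything serious.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    used = sum(len(m["content"]) for m in system)
    kept = []
    # Walk backwards so the newest turns survive the cut.
    for m in reversed(rest):
        if used + len(m["content"]) > budget:
            break
        kept.append(m)
        used += len(m["content"])
    return system + list(reversed(kept))
```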

storystarling | 1 month ago

Curious how you handle sharding and KV cache pressure for a 120b model. I guess you are doing tensor parallelism across consumer cards, or is it a unified memory setup?
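(For anyone following along: vLLM's server takes a `--tensor-parallel-size` flag to shard a model across GPUs. A hypothetical launch for a 120B model on 4 cards might look like this; the model name, TP degree, and context length are illustrative, not the commenter's actual config:)

```shell
# Shard the model across 4 GPUs with tensor parallelism.
# Model name and sizes here are placeholders, not a known-good config.
vllm serve some-org/some-120b-model \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768
```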