top | item 41733068

exo maintainer here. tgtweak is correct.

This looks like promising research, and I'm working on reproducing it now. We want to lower the barrier to running large models as much as possible, so if this works it could be added to exo.

tgtweak|1 year ago

Yeah, combining these two would make a lot of sense. There is a big appetite to run larger models, even if slower, on clustered hardware. This way you can add compute to speed up token generation, rather than adding it just to be able to run the model at all.

It's also possible that some of these optimizations could help tune how the model is distributed across nodes, based on the latency and bandwidth between them.