So the neural scoring adds ~20 ms of latency, but this affects only time-to-first-token (TTFT), not inter-token latency (ITL). Using our public endpoints adds a further ~150 ms of network latency; if you deploy the router on-prem in your own cloud, that network hop goes away and you pay only the inference latency. In practice, the improvements in ITL outweigh the small addition to TTFT.
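To see why a one-time TTFT cost can be outweighed by a per-token ITL improvement, here is a minimal sketch. All of the concrete numbers (baseline TTFT/ITL, the routed model's ITL, the output length) are made-up assumptions for illustration; only the ~20 ms scoring overhead comes from the comment above:

```python
def total_latency_ms(ttft_ms: float, itl_ms: float, output_tokens: int) -> float:
    """End-to-end generation time: first token, then one ITL gap per remaining token."""
    return ttft_ms + itl_ms * (output_tokens - 1)

# Baseline, no router: assumed 300 ms TTFT and 25 ms ITL.
baseline = total_latency_ms(300, 25, 500)   # 300 + 25 * 499 = 12775 ms

# Routed on-prem: +20 ms of scoring on TTFT, but the request lands on a
# (hypothetically) faster model with an assumed 18 ms ITL.
routed = total_latency_ms(300 + 20, 18, 500)  # 320 + 18 * 499 = 9302 ms

print(baseline, routed)
```

For any response longer than a handful of tokens, the per-token savings dominate the fixed scoring cost; the break-even point shifts only for very short outputs.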