ankit219 | 14 days ago
The setup is parallel distill-and-refine. You start with parallel trajectories instead of one, distill from them, then refine that into an answer. Instead of taking all trajectories to completion, it distills them early and refines, so it produces outputs that are both faster and smarter.
- paper came out in Nov 2025
- three months is a good research-to-production pipeline
- one of the authors is at anthropic
- this approach will definitely burn more tokens than a usual simple run.
- > Anthropic explicitly warns that time to first token might still be slow (or even slower)
Contrary to what people are saying, speculative decoding won't make it smarter or make any difference here. Batching could make it faster, and then it wouldn't be as costly.
Gemini Deepthink and gpt-5.2-pro use the same underlying parallel test-time compute, but they take each trajectory to completion before distilling and refining for the user.
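As a rough control-flow sketch of the distinction above: launch N trajectories in parallel, cut each one off at a step budget instead of running it to completion, distill the partials into one context, and do a single refine pass. Every function here is a hypothetical stand-in for a model call, not Anthropic's actual implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def rollout(prompt: str, seed: int, budget: int) -> str:
    """Hypothetical model rollout: returns a *partial* trajectory,
    truncated at `budget` steps rather than run to completion."""
    return f"[seed {seed}] partial reasoning for {prompt!r} ({budget} steps)"

def distill(trajectories: list[str]) -> str:
    """Hypothetical distillation step: condense the parallel partials
    into one context (a real system would summarize with the model)."""
    return "\n".join(trajectories)

def refine(prompt: str, distilled: str) -> str:
    """Hypothetical final pass that produces the answer from the
    distilled context."""
    n_partials = distilled.count("\n") + 1
    return f"answer to {prompt!r} using {n_partials} partials"

def parallel_distill_refine(prompt: str, n: int = 4, budget: int = 8) -> str:
    # Parallel sampling: n trajectories run concurrently, none to completion.
    with ThreadPoolExecutor(max_workers=n) as pool:
        partials = list(pool.map(lambda s: rollout(prompt, s, budget), range(n)))
    # Distill the truncated trajectories, then refine once for the user.
    return refine(prompt, distill(partials))

print(parallel_distill_refine("2+2?"))
```

The Deepthink-style variant would differ only in `rollout`: each trajectory runs to completion (no `budget` truncation) before the distill and refine passes, which is why time to first token stays slow there.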
xcodevn | 14 days ago
> Fast mode is not a different model. It uses the same Opus 4.6 with a different API configuration that prioritizes speed over cost efficiency. You get identical quality and capabilities, just faster responses.