jncraton | 1 year ago

The speedup would not be that high in practice for folks already using speculative decoding[1]. ANPD is similar but uses a simpler and faster drafting approach. Both techniques are ways of producing draft tokens for the LLM to verify, so the two enhancements can't be meaningfully stacked. Here's how the paper describes it:

> ANPD dynamically generates draft outputs via an adaptive N-gram module using real-time statistics, after which the drafts are verified by the LLM. This characteristic is exactly the difference between ANPD and the previous speculative decoding methods.

ANPD does provide a more general-purpose solution to drafting that does not require training, loading, and running draft LLMs.
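
To make the draft-then-verify idea concrete, here is a rough Python sketch. It is not the paper's algorithm: the NGramDrafter class, the generate function, and the target_next callable (a stand-in for greedy decoding with the LLM) are all made-up names for illustration, and the N-gram table is a plain frequency count rebuilt from the tokens seen so far, not ANPD's adaptive module.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Token = str
# Hypothetical stand-in for the LLM: returns the greedy next token for a context.
NextTokenFn = Callable[[Sequence[Token]], Token]


class NGramDrafter:
    """Toy N-gram table built from the tokens seen so far (prompt plus output)."""

    def __init__(self, n: int = 2) -> None:
        self.n = n  # number of context tokens used as the lookup key
        self.table: Dict[Tuple[Token, ...], Counter] = defaultdict(Counter)

    def update(self, tokens: Sequence[Token]) -> None:
        # Rebuild the table: count which token followed each n-token context.
        self.table.clear()
        for i in range(len(tokens) - self.n):
            key = tuple(tokens[i:i + self.n])
            self.table[key][tokens[i + self.n]] += 1

    def draft(self, tokens: Sequence[Token], k: int) -> List[Token]:
        # Propose up to k tokens by chaining lookups on the last n tokens.
        ctx = list(tokens)
        out: List[Token] = []
        for _ in range(k):
            key = tuple(ctx[-self.n:])
            if key not in self.table:
                break  # no statistics for this context, stop drafting
            tok = self.table[key].most_common(1)[0][0]
            out.append(tok)
            ctx.append(tok)
        return out


def generate(
    prompt: Sequence[Token],
    target_next: NextTokenFn,
    max_new: int = 8,
    n: int = 2,
    k: int = 4,
) -> Tuple[List[Token], int]:
    tokens = list(prompt)
    drafter = NGramDrafter(n)
    passes = 0  # one "verification pass" per loop iteration
    while len(tokens) - len(prompt) < max_new:
        drafter.update(tokens)
        draft = drafter.draft(tokens, k)
        # Assumption: a real system scores all draft positions in one batched
        # forward pass. Here target_next is called per position for clarity,
        # but the whole check is counted as a single pass.
        passes += 1
        accepted: List[Token] = []
        for tok in draft:
            expected = target_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the model's token, drop the rest
                break
            accepted.append(tok)
        else:
            # Every draft token matched (or the draft was empty), so the
            # model contributes one extra token.
            accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:len(prompt) + max_new], passes
```

Because every accepted token is checked against target_next, the output is the same as plain greedy decoding with that function. The only thing the drafter changes is how many verification passes are needed, which is where any speedup would come from on repetitive text.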

[1] https://github.com/ggerganov/llama.cpp/pull/2926

MacsHeadroom|1 year ago

Who is already using speculative decoding? I haven't seen anything about it in the llama.cpp or ollama docs.

eshoyuan|1 year ago

https://github.com/ggerganov/llama.cpp/tree/master/examples/...