(no title)
lappa | 2 years ago
- A RoPE theta of 100,000, likely taken from the Llama 2 Long paper, which found that a larger theta helps regulate attention between distant tokens [0] (see the quick sketch after this list)
- A 16k (effective 32k) context window, improving on Mistral's 4k (effective 8k) context window
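For intuition on the theta point: RoPE's per-dimension rotation frequencies follow theta^(-2i/d), so raising theta stretches the wavelength of the slowest-rotating dimensions, and distant positions end up less rotated relative to each other. A minimal sketch of the standard RoPE frequency schedule (a 128-dim head is my assumption, not anything from this model's code):

    import numpy as np

    def rope_inv_freq(dim, theta):
        # Standard RoPE schedule: inv_freq[i] = theta^(-2i/dim)
        return 1.0 / (theta ** (np.arange(0, dim, 2) / dim))

    dim = 128  # assumed head dimension, typical for 7B models
    for theta in (10_000.0, 100_000.0):
        inv_freq = rope_inv_freq(dim, theta)
        # Wavelength of the slowest-rotating pair: tokens farther apart
        # than this complete less than one full rotation in that dimension.
        print(theta, round(2 * np.pi / inv_freq[-1]))

At theta = 10,000 the slowest pair wraps around roughly every ~54k tokens; at 100,000 that stretches to ~525k, so positions anywhere inside a 16k-32k window stay within a small fraction of a rotation of each other.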
In the Llama 2 Long paper, they saw improvement on short-context benchmarks as a result of long-context fine-tuning. I can't find any of the expected MMLU / HellaSwag / etc. numbers for this model yet, and no benchmarks have been submitted to MTEB.
Some users anecdotally seem to be having trouble generating quality responses [2][3][4]. I can't find any examples of users getting good results from the model outside of the exact examples from the documentation.
[0] https://arxiv.org/pdf/2309.16039.pdf
[2] https://old.reddit.com/r/LocalLLaMA/comments/17jd00g/mistral...
[3] https://old.reddit.com/r/LocalLLaMA/comments/17kzlbl/anyone_...
[4] https://old.reddit.com/r/LocalLLaMA/comments/17b0n8t/llama_2...