(no title)
lappa | 2 years ago
- A RoPE theta of 100,000, likely taken from the Llama 2 Long paper, which found that a larger theta helps regulate attention between distant tokens [0] (see the quick sketch after this list)
- A 16k (effective 32k) context window, improving on Mistral's 4k (effective 8k) context window
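For intuition on the theta point: RoPE's per-dimension rotation frequencies follow theta^(-2i/d), so raising theta stretches the wavelength of the slowest-rotating dimensions, and distant positions end up less rotated relative to each other. A minimal sketch of the standard RoPE frequency schedule (a 128-dim head is my assumption, not anything from this model's code):

    import numpy as np

    def rope_inv_freq(dim, theta):
        # Standard RoPE schedule: inv_freq[i] = theta^(-2i/dim)
        return 1.0 / (theta ** (np.arange(0, dim, 2) / dim))

    dim = 128  # assumed head dimension, typical for 7B models
    for theta in (10_000.0, 100_000.0):
        inv_freq = rope_inv_freq(dim, theta)
        # Wavelength of the slowest-rotating pair: tokens farther apart
        # than this complete less than one full rotation in that dimension.
        print(theta, round(2 * np.pi / inv_freq[-1]))

At theta = 10,000 the slowest pair wraps around roughly every ~54k tokens; at 100,000 that stretches to ~525k, so positions anywhere inside a 16k-32k window stay within a small fraction of a rotation of each other.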
In the Llama 2 Long paper, they saw improvement on short-context benchmarks as a result of long-context fine-tuning. I can't find any of the expected MMLU / HellaSwag / etc. numbers for this model yet, and no benchmarks have been submitted to MTEB.
Some users anecdotally seem to be having trouble generating quality responses [2][3][4]. I can't find any examples of users getting good results from the model outside of the exact examples from the documentation.
[0] https://arxiv.org/pdf/2309.16039.pdf
[2] https://old.reddit.com/r/LocalLLaMA/comments/17jd00g/mistral...
[3] https://old.reddit.com/r/LocalLLaMA/comments/17kzlbl/anyone_...
[4] https://old.reddit.com/r/LocalLLaMA/comments/17b0n8t/llama_2...