Summary of https://arxiv.org/pdf/2309.16609.pdf (q: how does one format lists on HN?)

* qwen-{1.8B,7B,14B}:

  * 3 trillion pretraining tokens; tokenizer is BPE, starting from tiktoken's cl100k_base vocab, augmented with Chinese, with numbers split into single digits; final vocab 152k.

  * RoPE - rotary positional embedding

  * context length 2048

  * qwen-14b scores (%, number of shots in parentheses): 66.3 MMLU(5), 72.1 CEval(5), 61.8 GSM8K(8), 24.8 MATH(4), 32.3 HumanEval(0), 40.8 MBPP(3), 53.4 BBH(3); beats LLaMA2-13B on all of them, but trails LLaMA2-70B on everything except CEval, MATH and HumanEval (somewhat surprising)

* code-qwen-{7B,14B}

  * additional 90B code tokens over base

  * context length 8192, flash attention

  * 14B perf: HumanEval 66.4, MBPP 52.4; OK, but not stellar (similar numbers to the OSS WizardCoder-Py, and lower than GPT-3.5)

* math-qwen-{7B,14B}-chat

  * math instructional dataset

  * context length 1024

  * 14B perf: GSM8K 69.8, MATH 24.2, Math401 85.0, Math23K 78.4. Substantially better on MATH than the OSS models in the same weight class (WizardMath and GAIRMath-Abel), but only in the same ballpark on GSM8K, which is surprising. Math23K is Chinese grade-school math, and Math401 tests arithmetic ability.

* comprehensive automatic evaluation in Appendix A.2.1 pg 36 (based on OpenCompass'23)
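
For anyone wondering what "numbers split into digits" means in practice, here is a minimal sketch of the idea. It is not the actual Qwen tokenizer code: `split_digits` is a hypothetical helper, and the real thing happens in the BPE pre-tokenization step rather than in a standalone function. The point is only that every digit becomes its own piece, so a number is never merged into a single multi-digit token.

```python
import re

def split_digits(text):
    """Illustrative pre-tokenization: each digit is its own piece,
    while runs of non-digit characters are kept together.
    (Hypothetical helper, not taken from the Qwen code base.)"""
    return re.findall(r"\d|\D+", text)

# Every digit of a number ends up as a separate piece.
pieces = split_digits("in 2023")
print(pieces)
```

A real tokenizer would then run BPE merges on the non-digit pieces; the digit pieces stay as single characters.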
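
And for RoPE, a rough pure-Python sketch of what the rotation does, assuming the adjacent-pair formulation from the RoFormer paper and the usual base of 10000. The names `rope` and `dot` are mine, and real implementations (Qwen's included) are vectorized and may pair dimensions differently, so treat this as an illustration only. The useful property is that the dot product of a rotated query and a rotated key depends only on the difference between their positions.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[i], vec[i+1]) by the angle
    pos * base**(-i/d), where i is the even index of the pair
    and d is the vector dimension. Illustrative sketch only."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# Same relative offset (2) at different absolute positions:
# the two attention scores agree up to floating-point error.
print(dot(rope(q, 5), rope(k, 3)))
print(dot(rope(q, 12), rope(k, 10)))
```

Since each pair is only rotated, vector norms are unchanged; only the relative angle between query and key carries position information.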
