
thntk | 6 months ago

The model architecture only uses and cites pre-2023 techniques from the GPT-2 and GPT-3 era. They probably made a deliberate effort to ship the most bare-bones transformer architecture possible. Kudos to them for finding a clever way to play the open-weights game: release competitive weights while hiding any architectural advances used in their closed models, and still claim a moat in data quality and training techniques.

They hide many things, but some speculated observations:

- Their 'mini' models must be smaller than 20B.

- Does the bitter lesson strike once again, this time against the architectural ideas recently explored in open models?

- Some architectural ideas cannot be stripped away even if they wanted to: MoE layers, mixed sparse attention, RoPE, and so on (a small RoPE sketch follows this list).
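
Of those, RoPE is the easiest to illustrate. Below is a minimal NumPy sketch of the half-split (GPT-NeoX-style) variant; the function name, shapes, and base value are illustrative assumptions, not anything taken from the released model:

    import numpy as np

    def rope(x, base=10000.0):
        # x: (seq_len, head_dim) queries or keys for one attention head.
        # Channel pairs are rotated by an angle that grows with position
        # and shrinks with channel index, so the q.k dot product ends up
        # depending on the relative offset between positions.
        seq_len, head_dim = x.shape
        half = head_dim // 2
        # Per-pair frequencies: theta_i = base^(-2i / head_dim)
        inv_freq = base ** (-np.arange(half) * 2.0 / head_dim)
        angles = np.outer(np.arange(seq_len), inv_freq)  # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        # Standard 2D rotation applied pairwise (half-split layout)
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

    q = np.random.randn(8, 64)  # toy example: 8 positions, head_dim 64
    q_rot = rope(q)

Because the same rotation is applied to both queries and keys, the score <rope(q, m), rope(k, n)> depends only on the offset m - n, which is the relative-position property that makes RoPE hard to strip out of a trained model.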
