lllllm | 7 months ago

No, the model has nothing to do with Llama. We are using our own architecture and training from scratch. Llama also does not have open training data, and is non-compliant, in contrast to this model.

Source: I'm part of the training team

danielhanchen|7 months ago

If you guys need help on GGUFs + Unsloth dynamic quants + finetuning support via Unsloth https://github.com/unslothai/unsloth on day 0 / 1, more than happy to help :)

lllllm|7 months ago

absolutely! i sent you a linkedin message last week, but here seems to work much better. thanks a lot!

d3m0t3p|7 months ago

Hey, really cool project, I’m excited to see the outcome. Is there a blog / paper summarizing how you are doing it? Also, which research group at ETH is currently working on it?

moffkalast|7 months ago

L3 has open pretraining data, it's just not official for obvious legal reasons: https://huggingface.co/datasets/HuggingFaceFW/fineweb

menaerus|7 months ago

Wait, the whole (English-language) web content dataset is ~50TB?

Al-Khwarizmi|7 months ago

So you're not going to use copyrighted data for training? That's going to be a disadvantage with respect to LLaMa and other well-known models; it's an open secret that everyone is using everything they can get their hands on.

Good luck though, very needed project!

badsectoracula|7 months ago

Not sure about the Swiss laws, but the EU AI Act, and Directive 2019/790 (copyright in the Digital Single Market) that it piggybacks on for this topic, do allow training on copyrighted data as long as any opt-out mechanisms (e.g. robots.txt) are respected. AFAICT this LLM was trained by respecting those mechanisms. As linked elsewhere, they didn't find any practical difference in performance (note that there is an exception allowing the opt-out mechanisms to be ignored for research purposes, so they could make that comparison).
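
For anyone wondering what "respecting robots.txt" can look like in practice, here is a minimal sketch using Python's standard urllib.robotparser. It is not the project's actual pipeline; the helper name is_crawl_allowed, the user agent ExampleAIBot, and the sample robots.txt are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration only; a real crawler would also
    fetch and cache robots.txt per host and handle missing files.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up robots.txt: one AI crawler opted out, everyone else allowed.
EXAMPLE_ROBOTS = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    page = "https://example.com/page"
    print(is_crawl_allowed(EXAMPLE_ROBOTS, "ExampleAIBot", page))
    print(is_crawl_allowed(EXAMPLE_ROBOTS, "OtherBot", page))
```

A compliance filter along these lines would drop any page whose site opts out for the crawler in question, and keep the rest.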

isusmelj|7 months ago

Thanks for clarifying! I wish you all the best luck!

blurbleblurble|7 months ago

Are you using dbpedia?

lllllm|7 months ago

no. the main source is fineweb2, but with additional filtering for compliance, toxicity removal, and quality filters such as fineweb2-hq