
Show HN: GEITje-7B – A New Large Open Dutch Language Model

20 points | erijgersberg | 2 years ago | github.com

Tried my hand at training a large open Dutch language model, for which the pickings are slim right now.

GEITje is a large open Dutch language model with 7 billion parameters, based on Mistral 7B. I've continued pretraining it on 10 billion tokens of Dutch text. This has improved its Dutch language skills and increased its knowledge of Dutch topics. There's an experimental chat-finetuned model too, called GEITje-chat.
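For readers curious what "continued pretraining" looks like in practice, here is a minimal sketch using the Hugging Face `transformers` and `datasets` APIs. To be clear: this is my illustration, not the author's actual training script. The dataset choice (`yhavinga/mc4_nl_cleaned`, a cleaned Dutch mC4 variant), sequence length, batch size, and learning rate are all assumptions for the sake of the example.

```python
# Sketch of continued pretraining of Mistral 7B on Dutch text.
# NOT the GEITje training script; dataset and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Stream a large Dutch corpus so it never has to fit on disk at once.
dataset = load_dataset(
    "yhavinga/mc4_nl_cleaned", "tiny", split="train", streaming=True
)

def tokenize(batch):
    # Plain causal-LM tokenization; 2048 is an assumed sequence length.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# mlm=False gives the standard next-token-prediction objective:
# labels are the input ids shifted by one inside the model.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="geitje-7b-continued",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    # Rough budget: ~10B tokens / (2048 tokens x 16 seqs) ≈ 300k steps.
    max_steps=300_000,
    bf16=True,
    logging_steps=100,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

In reality a 7B-parameter run like this needs multi-GPU sharding (e.g. DeepSpeed or FSDP) on top of this skeleton, but the objective is exactly the same as ordinary pretraining, just initialized from the Mistral checkpoint instead of from scratch.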

I have to say the experience of gathering a dataset and training a model has been very educational for me. Being forced to deal with every detail yourself really deepens your understanding of a subject.

It's all in the hands of the community now. Can't wait to see what they do with it!

Want to try it out? There's a demo live at Hugging Face Spaces right now: https://huggingface.co/spaces/Rijgersberg/GEITje-7B-chat

4 comments


erijgersberg|2 years ago

Version 2 of the chat demo is now out! It's trained on many more translated chat conversations (190k, vs 20k before).

Link is the same as before.

vermaat|2 years ago

Works great, and it does indeed have more knowledge of Dutch topics, not just the ability to speak Dutch.

erijgersberg|2 years ago

It sure does! There's still plenty of room for improvement, though: it sometimes has trouble with the industry-standard Bassie en Adriaan test.

It's too bad that there is still a distinct lack of suitable Dutch training sets and benchmarks.