top | item 45248065

(no title)

ndai | 5 months ago

I’m curious where you got your training data? I will look myself, but saw this and thought I’d ask. I have a CPU-first, no-backprop architecture that works very well on classification datasets. It can do single‑example incremental updates which might be useful for continuous learning. I made a toy demo to train on tiny.txt and it can predict next characters, but I’ve never tried to make an LLM before. I think my architecture might work well as an on-device assistant or for on-premises needs, but I want to work with it more before I embarrass myself. Any open-source LLM training datasets you would recommend?

discuss

electroglyph|5 months ago

https://huggingface.co/datasets/NousResearch/Hermes-3-Datase...

Snuggly73|5 months ago

To my untrained eye, this looks more like an instruct dataset.

For just plain text, I really like this one - https://huggingface.co/datasets/roneneldan/TinyStories

kachapopopow|5 months ago

huggingface has plenty of openai and antrophic user to assistant chains, beware there are dragons (hallucinations), but good enough for instruction training. I actually recommend distilling kimi k2 instead for instruction following capabilities.

hadlock|5 months ago

The training data is contained inside main.rs; it looks like about 50 statements about general facts. Probably to keep training time down. If you go off-script things fall apart pretty quickly:

Enter prompt: hello Model output: What are eclipses ? Assistant : Eclipses occur when one celestial body moves into the shadow of another </s>

Enter prompt: what are facts Model output: eclipses Assistant : Eclipses Eclipses What What was when What through you ? through you meet through using of What was What tall ? of What thank help you explain using , and do you a patterns you you a patterns through air in tall you help metal metal </s>

Enter prompt: how are mountains formed? Model output: I ? ' I ' : Of : in happy Hi wind in yeast altering it it </s>