top | item 38890717

jafitc | 2 years ago

Important to note that this model excels in reasoning capabilities.

But it was deliberately not trained on the big “web crawled” datasets, so it wouldn't learn how to build bombs etc. or be naughty.

So it is the “smartest thinking” model in its weight class, even comparable to higher-param models, but it is not as knowledgeable about the world and trivia.

This might change in the future, but it is the current state.

rolisz|2 years ago

But that still makes it great for RAG applications, where I want the answer to be based on my data, not on whatever it learned from the web.
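
The retrieve-then-ground flow described here can be sketched roughly as below. This is a toy illustration, not a real RAG stack: `retrieve` is a naive keyword scorer, and the names (`retrieve`, `build_prompt`, the sample docs) are made up for the example. A real pipeline would use embeddings, a vector store, and an actual LLM call on the assembled prompt.

```python
import re

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_terms = set(re.findall(r"\w+", query.lower()))
    return sorted(
        documents,
        key=lambda d: len(q_terms & set(re.findall(r"\w+", d.lower()))),
        reverse=True,
    )[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Ground the model: instruct it to answer only from the retrieved context."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using ONLY the context below. If the answer is not in "
        f"the context, say you don't know.\n\nContext:\n{joined}\n\n"
        f"Question: {query}"
    )

docs = [
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday through Friday.",
    "The warranty covers manufacturing defects for one year.",
]
prompt = build_prompt("What is the refund policy?",
                      retrieve("What is the refund policy?", docs))
```

The point of the comment holds here: the model's knowledge matters much less than its ability to read the retrieved context and follow the grounding instruction.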

monkeydust|2 years ago

Interesting. Anyone tried / benchmarked this for RAG?

dlojudice|2 years ago

If you think of LLMs as having basically two properties, the ability to use natural language and the knowledge to answer questions, then small language models should be seen as simply excellent at natural language. That's great, because for many tasks general knowledge is not needed, especially for RAG.

ethbr1|2 years ago

Which more or less mirrors human learning edges.

If someone read a set of dictionaries, but then talked to actual people... you'd get about the same.

E.g. complete obliviousness to colloquialisms, etc.

notnullorvoid|2 years ago

> This might change in the future but it is the current state

I hope it doesn't change. The focus of a model shouldn't be to embed data. Retrieval is a better way to provide data to a model, and leads to fewer "sounds smart" but very wrong results.

Having less data embedded also means that the model is more generally usable outside the realm of chat assistants, where you only want the model to be aware of data you provide it. One example could be in games: in a medieval fantasy setting, it would be really weird if you could get a character to start talking to you about US politics. That probably still wouldn't work with Phi-2 without fine-tuning (as I imagine it does have some data on US politics embedded), but I hope it illustrates the point.
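
The NPC idea can be sketched as a gate in front of the model: the character only speaks when a question touches lore you supplied, and refuses in-character otherwise. `LORE` and `npc_reply` are illustrative names invented for this sketch; a real game would feed the matched facts into the model's prompt rather than returning them verbatim.

```python
import re

# Hypothetical lore table a game might provide to the model.
LORE = {
    "blacksmith": "Garruk forges blades from star-iron mined in the Ashpeaks.",
    "ashpeaks": "The Ashpeaks are volcanic mountains north of the capital.",
}

def npc_reply(question: str) -> str:
    """Answer only when the question mentions a known lore topic."""
    terms = set(re.findall(r"\w+", question.lower()))
    facts = [fact for topic, fact in LORE.items() if topic in terms]
    if not facts:
        # Out-of-world topics (say, US politics) get an in-character refusal.
        return "I know nothing of such matters, traveler."
    return " ".join(facts)
```

The refusal branch is exactly where an over-knowledgeable model causes trouble: with lots of world data embedded, the model may answer from memory even when the lore lookup finds nothing.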

gumballindie|2 years ago

> But it was on purpose not trained on the big “web crawled” datasets to not learn how to build bombs etc, or be naughty.

It wasn't trained on web crawled data to make it less obvious that microsoft steals property and personal data to monetise it.

visarga|2 years ago

It was trained on "textbook quality" synthetic data + some high quality web data.

The question is - if we train a model on synthetic data generated by GPT-4 which has copyright issues, what is the status of this model? Will MS have to delete it as well? And all models trained with GPT-4 data?