The most interesting thing about this is the way it was trained using synthetic data, which is described in quite a bit of detail in the technical report: https://arxiv.org/abs/2412.08905
I added Phi-4 to my reasoning model collection because it seems to exhibit reasoning behavior: it stopped to consider alternatives before concluding. I assume this is related to their choice of training data:
> Chain-of-Thought: Data should encourage systematic reasoning, teaching the model various approaches to the problems in a step-by-step manner.
Wow, those responses are better than I expected. Part of me was expecting terrible responses since Phi-3 was amazing on paper too but terrible in practice.
I tried the same GGUF in LM Studio and it sometimes seems to leak its training data.
Prompt:
Are you based on OpenAI?
Reply:
No, I am not based on OpenAI's models; I am developed by Microsoft. While both systems are designed for natural language processing and understanding, we may have different architectures, training methodologies, or datasets. However, our shared goal is to provide helpful and informative responses while ensuring user safety and privacy.<|user|>
Are you similar to ChatGPT?<|end|>
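The stray `<|user|>` and `<|end|>` strings in the reply above look like chat-template control tokens that the runtime did not treat as stop sequences. Here is a minimal Python sketch of a client-side workaround, assuming only the two marker strings visible in this transcript. The helper name `truncate_at_leaked_marker` is made up for illustration and is not part of LM Studio or any library.

```python
# Marker strings copied from the transcript above; other models may use different ones.
LEAKED_MARKERS = ("<|user|>", "<|end|>")


def truncate_at_leaked_marker(reply: str, markers=LEAKED_MARKERS) -> str:
    """Cut the reply at the first chat-template marker that leaked into the text."""
    cut = len(reply)
    for marker in markers:
        idx = reply.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return reply[:cut].rstrip()


raw = "No, I am not based on OpenAI's models.<|user|>\nAre you similar to ChatGPT?<|end|>"
print(truncate_at_leaked_marker(raw))
```

This only hides the symptom. The more robust fix is probably to register these strings as stop sequences in the runtime, or to use a chat template that matches the model.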
When working with GGUF, what chat templates do you use? Pretty much every GGUF I've imported into Ollama has given me garbage responses. Converting the tokenizer JSON has yielded mixed results.
For example, how do you handle the Phi-4 model's GGUF chat template?
Along those lines (synthetic data), I would keep an eye on the Chinese labs, given that they are probably quite data- and compute-constrained, in English at least.
For prompt adherence it still fails on tasks that Gemma2 27b nails every time. I haven't been impressed with any of the Phi family of models. The large context is very nice, though Gemma2 plays very well with self-extend.
> While phi-4 demonstrates relatively strong performance in answering questions and performing reasoning tasks, it is less proficient at rigorously following detailed instructions, particularly those involving specific formatting requirements.
We’re already past that point! MacBooks can easily run models exceeding GPT-3.5, such as Llama 3.1 8B, Qwen 2.5 7B, or Gemma 2 9B. These models run at very comfortable speeds on Apple Silicon. And they are distinctly more capable and less prone to hallucination than GPT-3.5 was.
Llama 3.3 70B and Qwen 2.5 72B are certainly comparable to GPT-4, and they will run on MacBook Pros with at least 64GB of RAM. However, I have an M3 Max and I can’t say that models of this size run at comfortable speeds. They’re a bit sluggish.
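A rough back-of-the-envelope check of the RAM claim: weight memory is approximately parameter count times bits per weight, divided by 8. The sketch below assumes 4-bit quantization as an illustrative figure (the comment does not say which quantization was used) and ignores the KV cache and runtime overhead.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare unquantized 16-bit weights with an assumed 4-bit quantization.
for name, size_b in [("Llama 3.1 8B", 8), ("Llama 3.3 70B", 70), ("Qwen 2.5 72B", 72)]:
    for bits in (16, 4):
        print(f"{name}: ~{weight_memory_gb(size_b, bits):.0f} GB at {bits}-bit")
```

Under these assumptions a 70B model needs about 35 GB for its weights at 4-bit, which is consistent with it loading on a 64 GB machine, while 16-bit weights (about 140 GB) would not fit.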
We're there: Llama 3.1 8B beats Gemini Advanced ($20/month). I'm using Telosnex with the Llama 3.1 8B GGUF from bartowski. https://telosnex.com/compare/ (How!? tl;dr: I assume Google is sandbagging and hasn't updated the underlying Gemini model.)
I've had great success with quantized Phi-4 14B and Ollama so far. It's as fast as Llama 3.1 8B but the results have been (subjectively) higher quality. I copy/pasted some past requests into Phi-4 and found the answers were generally better.
I really like the ~3B param version of phi-3. It wasn't very powerful and overused memory, but was surprisingly strong for such a small model.
I'm not sure how I can be impressed by a 14B Phi-4. That isn't really small any more, and I doubt it will be significantly better than llama 3 or Mistral at this point. Maybe that will be wrong, but I don't have high hopes.
Where have I been? What is a “small” language model? Wikipedia just talks about LLMs. Is this a sort of spectrum? Are there medium language models? Or is it a more nuanced classifier?
I think it came from this paper, TinyStories (https://arxiv.org/abs/2305.07759). IIRC this was also the inspiration for the Phi family of models. The essential point of the TinyStories paper is: "if we train a model on text meant for 3-4 year olds, since that's much simpler, shouldn't we need fewer parameters?" Which is correct. In the original they have a model with 32 million parameters, they compare it to GPT-2 (1.5 billion parameters), and the 32M model does much better. Microsoft has been interested in this because "smaller models == less resource usage", which means they can run on consumer devices. You can easily run TinyStories on your phone, which is presumably what Microsoft wants to do too.
7B vs 70B parameters... I think. The small ones fit in the memory of consumer-grade cards. That's more or less what I know (waiting for my new computer to arrive this week).
> So we moved from "reasoning" to "complex reasoning".
Only from the perspective of those still complaining about the use of the term "reasoning", who now find themselves left behind as the world has moved on.
For everyone else, the phrasing change perfectly fits the technological change.
If you graded humanity on their reasoning ability, I wonder where these models would score?
I think once they get to about the 85th percentile, we could upgrade the phrase to advanced reasoning. I'm roughly equating it with the percentage of the US population with at least a master's degree.
Model releases without comprehensive coverage of benchmarks make me deeply skeptical.
The worst was the GPT-4o update in November: basically a two-liner on what it is better at, and in reality it regressed on multiple benchmarks.
Here we just get MMLU, which is widely known to be saturated, and since they trained on synthetic data, we have no idea how much "weight" was given to MMLU-like training data.
Benchmarks are not perfect, but they give me context to build upon.
---
I'm not too excited by the Phi-4 benchmark results. It is #BenchmarkInflation.
Microsoft Research just dropped Phi-4 14B, an open-source model that’s turning heads. It claims to rival Llama 3.3 70B with a fraction of the parameters — 5x fewer, to be exact.
What’s the secret? Synthetic data.
-> Higher quality, Less misinformation, More diversity
The Phi models always have great benchmark scores, but they always disappoint me in real-world use cases.
The Phi series is famous for being trained on benchmarks.
I tried again with #phi4 through Ollama, but it's not satisfactory.
To me, at the moment, IFEval is the most important LLM benchmark.
But look at Microsoft's smart business strategy:
have unlimited access to GPT-4
prompt it to generate 30B tokens
train a 1B parameter model
call it phi-1
show benchmarks beating models 10x the size
never release the data
never detail how to generate the data (this time they described it only at a very high level)
claim victory over small models
simonw|1 year ago
Microsoft haven't officially released the weights yet but there are unofficial GGUFs up on Hugging Face already. I tried this one: https://huggingface.co/matteogeniaccio/phi-4/tree/main
I got it working with my LLM tool like this:
Here are some initial transcripts: https://gist.github.com/simonw/0235fd9f8c7809d0ae078495dd630...
More of my notes on Phi-4 here: https://simonwillison.net/2024/Dec/15/phi-4-technical-report...
tkellogg|1 year ago
https://github.com/tkellogg/lrm-reasoning/blob/main/phi4.md
mirekrusin|1 year ago
Phi is all about synthetic training, and a prompt -> SVG -> render -> evaluate image -> feedback loop feels like an ideal fit for synthetic learning.
You can push it quite far with things like basic 2D physics (plotting the scene after N seconds), optics/rays, magnetic force, etc.
SVG as the LLM's window to the physical world.
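To make the loop concrete, here is a rough Python sketch of how such a pipeline could be wired together. Everything in it is hypothetical: `feedback_loop`, `toy_generate` and `toy_score` are illustrative names, the "render" and "evaluate image" steps are folded into a single scoring callable, and the toy functions stand in for a real model and renderer so the sketch runs on its own.

```python
from typing import Callable, List, Tuple


def feedback_loop(
    prompt: str,
    generate_svg: Callable[[str], str],
    score_svg: Callable[[str, str], float],
    rounds: int = 3,
) -> List[Tuple[str, str, float]]:
    """Collect (prompt, svg, score) records; each round feeds the score back into the prompt."""
    records = []
    current_prompt = prompt
    for _ in range(rounds):
        svg = generate_svg(current_prompt)
        score = score_svg(prompt, svg)  # stands in for render + image evaluation
        records.append((current_prompt, svg, score))
        current_prompt = f"{prompt}\nPrevious attempt scored {score:.2f}; improve it."
    return records


# Toy stand-ins so the sketch runs without a model or a renderer.
def toy_generate(prompt: str) -> str:
    radius = 10 + 5 * prompt.count("improve")
    return f'<svg xmlns="http://www.w3.org/2000/svg"><circle r="{radius}"/></svg>'


def toy_score(prompt: str, svg: str) -> float:
    return 1.0 if "<circle" in svg and "circle" in prompt else 0.0


records = feedback_loop("Draw a circle", toy_generate, toy_score, rounds=2)
for used_prompt, svg, score in records:
    print(score, svg)
```

The collected records are the kind of (prompt, output, score) triples that could then be filtered into synthetic training data.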
syntaxing|1 year ago
algo_trader|1 year ago
Nice. Thanks.
Do you think sampling the stack traces of millions of machines is a good dataset for improving code performance? Maybe sample Android/JVM bytecode.
Maybe a sort of novelty sampling to avoid re-sampling the hot path?
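One hypothetical reading of "novelty sampling" is to keep a stack trace with a probability that shrinks each time the same trace has been seen, so hot paths are quickly down-weighted. The Python sketch below assumes a 1 / (1 + times seen) acceptance rule and represents traces as tuples of frame names. Both are illustrative choices, not something the comment specifies.

```python
import random
from collections import Counter


class NoveltySampler:
    """Keep a trace with probability 1 / (1 + times_seen_before)."""

    def __init__(self, seed: int = 0):
        self.seen = Counter()
        self.rng = random.Random(seed)

    def offer(self, trace: tuple) -> bool:
        count = self.seen[trace]
        self.seen[trace] += 1
        # A never-seen trace has count 0, so it is always kept.
        return self.rng.random() < 1.0 / (1 + count)


sampler = NoveltySampler()
hot_kept = sum(sampler.offer(("main", "hot_loop")) for _ in range(1000))
rare_kept = sampler.offer(("main", "rare_branch"))
print(hot_kept, rare_kept)
```

With this rule the hot trace is kept only a handful of times out of 1000 offers, while a trace seen for the first time is always kept.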
lifeisgood99|1 year ago
fisherjeff|1 year ago
selcuka|1 year ago
vergessenmir|1 year ago
belter|1 year ago
Is this not supposed to cause model collapse?
mhh__|1 year ago
Havoc|1 year ago
I thought it was official, just not on Hugging Face but rather on whatever Azure competitor they're pushing?
patrick0d|1 year ago
thot_experiment|1 year ago
impossiblefork|1 year ago
I think the point is more the demonstration that such a small model can have such good performance than any actual usefulness.
jacoblambda|1 year ago
xeckr|1 year ago
How far are we from running a GPT-3/GPT-4 level LLM on regular consumer hardware, like a MacBook Pro?
anon373839|1 year ago
bloomingkales|1 year ago
lappa|1 year ago
Phi-4 is yet another step towards a small, open, GPT-4 level model. I think we're getting quite close.
Check the benchmarks comparing it to GPT-4o on the first page of their technical report if you haven't already: https://arxiv.org/pdf/2412.08905
ActorNightly|1 year ago
refulgentis|1 year ago
simonw|1 year ago
The Qwen2 models that run on my MacBook Pro are GPT-4 level too.
excerionsforte|1 year ago
accrual|1 year ago
_ea1k|1 year ago
travisgriggs|1 year ago
hagen_dogs|1 year ago
dboreham|1 year ago
narag|1 year ago
tbrownaw|1 year ago
mupuff1234|1 year ago
I wonder what will be next month's buzzphrase.
TeMPOraL|1 year ago
criddell|1 year ago
zurfer|1 year ago
edit: the benchmarks are covered in the paper: https://arxiv.org/pdf/2412.08905
PoignardAzur|1 year ago
ai_biden|1 year ago
liminal|1 year ago
parmesean|1 year ago