Over the holidays, we published a post[1] on using high-precision few-shot examples to get `gpt-4o-mini` to perform similar to `gpt-4o`. I just re-ran that same experiment, but swapped out `gpt-4o-mini` with `phi-4`.
`phi-4` really blew me away in terms of learning from few-shots. It measured as being 97% consistent with `gpt-4o` when using high-precision few-shots! Without the few-shots, it was only 37%. That's a huge improvement!
By contrast, with few-shots it performs as well as `gpt-4o-mini` (though `gpt-4o-mini`'s baseline without few-shots was 59% – quite a bit higher than `phi-4`'s).
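For anyone curious what that consistency number means mechanically: I read it as simple per-item agreement between the small model's outputs and `gpt-4o`'s on the same inputs. A toy sketch of that reading (the labels below are made up for illustration, not from the experiment):

```python
# Toy sketch: "consistency" as the fraction of items where the candidate
# model's label matches the reference model's label on the same inputs.

def consistency(reference_labels, candidate_labels):
    """Fraction of items where the candidate agrees with the reference."""
    assert len(reference_labels) == len(candidate_labels)
    matches = sum(r == c for r, c in zip(reference_labels, candidate_labels))
    return matches / len(reference_labels)

# Made-up labels, just to show the calculation:
gpt4o          = ["spam", "ham", "spam", "ham"]
phi4_zero_shot = ["ham",  "ham", "ham",  "spam"]
phi4_few_shot  = ["spam", "ham", "spam", "spam"]

print(consistency(gpt4o, phi4_zero_shot))  # 0.25
print(consistency(gpt4o, phi4_few_shot))   # 0.75
```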
I like the direction, but have a pretty different experience in practice. This spans legal analytics, social media analytics, code synthesis, news analysis, cyber security LLMs, etc:
1. The only ultimate absolute quality metric I saw in that blogpost afaict was expert agreement... at 90%. All of our customers would fire us at that level across all of the diff b2b domains we work in. I'm surprised 90% is considered acceptable quality in a paying business context like retail.
2. Gpt-4o-mini is great. I find we can get, for these kind of simple tasks you describe, gpt-4o-mini to achieve about 95-98% agreement with gpt-4o by iteratively manually improving prompts over increasingly large synthetic evals. Given data and a good dev, we do this basically same-day for a lot of simple tasks, which is astounding.
I do expect automatic prompt optimizers to win here long-term, and keep hopefully revisiting dspy et al. For now, they fall short of standard prompt engineering. Likewise, I do believe in example learning over time for areas like personalization... but doing semantic search recall of high-rated answers was a V1 thing we had to rethink due to too many issues.
This is really nice. I loved the detailed process and I'm definitely gonna use it. One nit though: I didn't understand what the graphs mean, maybe you should add the axes names.
Is anyone blown away by how fast we got to running something this powerful locally? I know it's easy to get burnt out on llms but this is pretty incredible.
I genuinely think we're only 2 years away from full custom local voice to voice llm assistants that grow with you like JOI in BR2049 and it's going to change how we think about being human and being social, and how we grow up.
I've been experimenting with running local LLMs for nearly two years now, ever since the first LLaMA release back in March 2023.
About six months ago I had mostly lost interest in them. They were fun to play around with but the quality difference between the ones I could run on my MacBook and the ones I could access via an online API felt insurmountable.
This has completely changed in the second half of 2024. The models I can run locally had a leap in quality - they feel genuinely GPT-4 class now.
They're not as good as the best hosted models (GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet) but they're definitely good enough to be extremely useful.
This started with the Qwen 2 and 2.5 series, but I also rate Llama 3.3 70B and now Phi-4 as GPT-4 class models that run on my laptop.
I am blown away: a year ago I bought an M2 32G Mac to run local models. It seems like what I can run locally now, just one year later, is 10x more useful for NLP, data wrangling, RAG, experimenting with agents, etc.
Not related to local LLMs, but JOI from BR2049 is essentially what Replika is striving for: https://replika.com/
In fact, during the onboarding process they ask the user to choose which AI companion movie they relate to the most: Her, BR2049 or Ex-Machina. The experience is then tailored to align closer to the movie chosen.
It's quite a terrible app from a product design perspective: filled with dark patterns (like sending the user blurred images to "unlock") and upsells, but it's become successful amongst the masses that have adopted it, which I find fascinating. 30m+ users https://en.wikipedia.org/wiki/Replika#:~:text=Replika%20beca....
I’ve thought for a while that Joi in BR2049 was less dystopian than what we will probably do with AI. She doesn’t constantly prompt K to buy more credits (like a mobile game) to continue engaging with her or deepen their relationship. (“If you really love me…”) I’ve been expecting that this is how our industry would operate given the customer hostile psychologically abusive hellscape of social and mobile. Of course there’s still time.
She appears to be a local model runnable on a small device without cloud.
It’s odd that MS is releasing models that make them a competitor to OA. This reinforces the idea that there is no real strategic advantage in owning a model. I think the strategy now is to offer cheap and performant infra to run the models.
> This reinforce the idea that there is no real strategic advantage in owning a model
For these models probably no. But for proprietary things that are mission critical and purpose-built (think Adobe Creative Suite) the calculus is very different.
MS, Google, Amazon all win from infra for open source models. I have no idea what game Meta is playing
According to many press stories in the past year, the relationship between Microsoft and OpenAI has been very strained. It looks more and more like both sides are looking for an opportunity to jump ship.
This is a very clever move by Microsoft. OpenAI has no technological moat and is a very unreliable partner.
Was disappointed in all the Phi models before this, whose benchmark results were way better than how they worked in practice, but I've been really impressed with how good Phi-4 is at just 14B. We've run it against the top 1000 most popular StackOverflow questions and it came in 3rd, beating out GPT-4 and Sonnet 3.5 in our benchmarks, only behind DeepSeek v3 and WizardLM 8x22B [1]. We're using Mixtral 8x7B to grade the quality of the answers, which could explain how WizardLM (based on Mixtral 8x22B) took 2nd place.
Unfortunately I'm only getting 6 tok/s on NVidia A4000 so it's still not great for real-time queries, but luckily now that it's MIT licensed it's available on OpenRouter [2] for a great price of $0.07/$0.14M at a fast 78 tok/s.
Because it yields better results and we're able to self-host Phi-4 for free, we've replaced Mistral NeMo with it in our default models for answering new questions [3].
Interesting eval but my first reaction is "using Mixtral as a judge doesn't sound like a good idea". Have you tested how different its results are from GPT-4 as a judge (on a small scale) or how stuff like style and order can affect its judgements?
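Two cheap sanity checks I'd want before trusting a Mixtral judge (a sketch of an assumed workflow, not anything the PvQ benchmark actually runs): agreement with a stronger judge on a small sample, and whether the verdict flips when the A/B answer order is swapped:

```python
# Sketch of two judge sanity checks on a small labelled sample.

def agreement_rate(judge_a, judge_b):
    """Fraction of items where two judges give the same verdict."""
    return sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)

def position_bias_rate(normal_order, swapped_order):
    """Fraction of items where swapping the A/B answer order did NOT
    invert the judge's pick (a consistent judge should invert it)."""
    flip = {"A": "B", "B": "A", "tie": "tie"}
    return sum(s != flip[n] for n, s in zip(normal_order, swapped_order)) / len(normal_order)

# Made-up verdicts for illustration:
mixtral = ["A", "B", "A", "A"]
gpt4    = ["A", "B", "B", "A"]
print(agreement_rate(mixtral, gpt4))          # 0.75

swapped = ["B", "A", "A", "B"]                # verdicts re-judged with answers swapped
print(position_bias_rate(mixtral, swapped))   # 0.25
```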
I tested Phi-4 with a Japanese functional test suite and it scored much better than prior Phis (and comparable to much larger models, basically in the top tier atm). [1]
The one red flag w/ Phi-4 is that its IFEval score is relatively low. IFEval tests for specific types of constraints (forbidden words, capitalization, etc) [2], but it's one area especially worth keeping an eye on for those testing Phi-4 for themselves...
IMO SO questions are not a good evaluation. These models were likely trained on the top 1000 most popular StackOverflow questions. You'd expect them to have similar results and perform well when compared to the original answers.
For structured output from anywhere I'm finding https://github.com/BoundaryML/baml good. It's more accurate than what gpt-4o-mini will do on its own, and than any of the other JSON schema approaches I've tried.
Yeah it's not as strong as constrained beam search like OpenAI uses (at least afaik) but it works on any models that support tool calling. Just keep it simple, don't have a lot of deep nested structures or complicated rules.
Lots of other models will work nearly as well though if you just give them a clear schema to follow and ask them to output json only, then parse it yourself. Like I've been using gemma2:9b to analyze text and output a json structure and it's nearly 100% reliable despite it being a tiny model and not supporting tools or structured output officially.
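The "parse it yourself" step is mostly: strip the ``` fence the model often adds anyway, `json.loads` the rest, and check the keys you asked for. A minimal sketch (the function name and schema here are made up for illustration):

```python
import json
import re

def extract_json(text, required_keys=()):
    """Pull a JSON object out of a model reply, tolerating ``` fences."""
    # Models often wrap output in ```json ... ``` even when told not to.
    m = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    payload = m.group(1) if m else text.strip()
    data = json.loads(payload)
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"model omitted keys: {missing}")
    return data

raw = '```json\n{"sentiment": "positive", "topics": ["llm"]}\n```'
print(extract_json(raw, required_keys=["sentiment", "topics"]))
```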
I’ve seen on the localllama subreddit that some GGUFs have bugs in them. The one recommended was by unsloth. However, I don’t know how the Ollama GGUF holds up.
Ollama can pull directly from HF, you just provide the URL and add to the end :Q8_0 (or whatever) to specify your desired quant. Bonus: use the short form url of `hf` instead of `huggingface` to shorten the model name a little in the ollama list table.
Edit: so for example, if you want the unsloth "debugged" version of Phi-4, you would run:
`$ ollama pull hf.co/unsloth/phi-4-GGUF:Q8_0`
(check on the right side of the hf.co/unsloth/phi-4-GGUF page for the available quants)
Phi-4's architecture changed slightly from Phi-3.5 (it no longer uses a sliding window of 2,048 tokens [1]), causing a change in the hyperparameters (and ultimately an error at inference time for some published GGUF files on Hugging Face, since the same architecture name/identifier was re-used between the two models).
For the Phi-4 uploaded to Ollama, the hyperparameters were set to avoid the error. The error should stop occurring in the next version of Ollama [2] for imported GGUF files as well
In retrospect, a new architecture name should probably have been used entirely, instead of re-using "phi3".
The brief of it is: curating a smaller synthetic dataset of high quality from textbooks, problem sets, etc., instead of dumping in a massive dataset with tons of information.
I have unfortunately been disappointed with the llama.cpp/ollama ecosystem of late, and thinking about moving my things to vllm instead.
llama.cpp basically dropped support for multimodal visual models. ollama still does support them, but only a handful. Also, ollama still does not support Vulkan even though llama.cpp has had Vulkan support for a long, long time now.
This has been very sad to watch. I'm more and more convinced that vllm is the way to go, not ollama.
I've just tried to make it run something, and I could not force it to put the python code inside bare ``` ``` fences. It always wants to put the word python after the three backticks, like this:
```python
.. code..
```
I wonder if that's the result of training.
(I use the LLM output to then run the resulting code)
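FWIW, rather than fighting what's likely baked in by instruction tuning, it can be easier to strip the fence (with or without a language tag) on the receiving side before running the code. A small sketch of that, nothing model-specific:

```python
import re

def extract_code(reply):
    """Return the body of the first ``` fence, tolerating an optional
    language tag like ```python; fall back to the whole reply."""
    m = re.search(r"```[\w+-]*\n(.*?)```", reply, re.DOTALL)
    return m.group(1).strip() if m else reply.strip()

reply = "Here you go:\n```python\nprint('hi')\n```\nHope that helps!"
print(extract_code(reply))  # print('hi')
```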
The ollama application has zero value; it’s just an easy-to-use front end to their model hosting, which is both what this is and why they’re important.
Only having one model host (hugging face) is bad for obvious reasons (and good in others, yes, but still)
Ollama offering an alternative as a model host seems quite reasonable and quite well implemented.
The frontend really is nothing; it’s just llama.cpp in a go wrapper. It has no value and it’s not really interesting, it’s simple stable technology that is perfectly fine to rely on and be totally unexcited or interested in, technically.
…but, they do a lot more than that; and I think it’s a little unfair to imply that trivial piece of their stack is all they do.
I’m not seeing what the issue is with ollama. Can you elaborate? There are tons of open source projects that other stuff gets built upon: that’s part of the point of open source.
I might be wrong about this but doesn't ollama do some work to ensure the model runs efficiently given your hardware? Like choosing between how much gpu memory to consume so you don't oom. Does llama.cpp do that for you with zero config?
It's very hard to put into words without coming off as being unfair to one side or the other, but the ollama project really does provide little-to-no _innovative_ value over simply running components of llama.cpp directly from the command line. 100% of the heavy lifting (from an LLM perspective) is in the llama.cpp codebase. The ollama parts are all simple, well understood, commodity components that most any developer could have produced.
Now, applications like ollama obviously need to exist, as not everyone can run CLI utilities, let alone clone a git repo and compile themselves. Easy to use GUIs are essential for the adoption of new tech (much like how there are many apps that wrap ffmpeg and are mostly UI).
However, if ollama are mostly doing commodity GUI things over a fully fleshed-out, _unique_ codebase to which their very existence is owed, they should do everything in their power to point that out. I'm sure they're legally within their rights because of the licensing, but just from an ethical perspective.
I think there is a lot of ill-will towards ollama in some hard-core OG LLM communities because ollama appears to be attempting to capture the value that ggerganov has provided to the world in this tool without adequate attribution (although there is a small footnote, iirc). Basically, the debt that ollama owes to llama.cpp is so immense that they need to do a much better job recognizing it imo.
[1] https://bits.logic.inc/p/getting-gpt-4o-mini-to-perform-like
blharr|1 year ago
> search(T, θ, m) retrieves the first m historical tasks that are semantically similar above the θ threshold

Are both m's here the same or different numbers? I found this a bit confusing.
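My reading (an assumption on my part, since I don't have the original notation in front of me) is that both m's are the same number: filter history by the θ similarity threshold, then take the top m. A toy version with cosine similarity over made-up vectors:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(task_vec, history, theta, m):
    """Return up to m historical task names with similarity >= theta,
    most similar first. Assumes both m's in the formula are the same."""
    scored = [(cosine(task_vec, vec), name) for name, vec in history]
    hits = sorted((s, n) for s, n in scored if s >= theta)
    return [n for _, n in reversed(hits)][:m]

history = [("fix-bug", [1.0, 0.0]), ("write-sql", [0.7, 0.7]), ("poem", [0.0, 1.0])]
print(search([1.0, 0.1], history, theta=0.5, m=2))  # ['fix-bug', 'write-sql']
```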
simonw|1 year ago
I wrote more about this here: https://simonwillison.net/2024/Dec/31/llms-in-2024/#some-of-...
mark_l_watson|1 year ago
BTW, a few days ago I published a book on using Ollama. Here is a link to read it online https://leanpub.com/ollama/read
SamPatt|1 year ago
Hunyuan (open source video) has been remarkable. Flux dev makes some incredible images.
The fact that it's only going to get better from here is hard to wrap my head around.
PittleyDunkin|1 year ago
> I think the strategy is now offer cheap and performant infra to run the models.
Is this not what microsoft is doing? What can microsoft possibly lose by releasing a model?
naasking|1 year ago
Yes, because you can't build a moat. Open source will very quickly catch up.
[1] https://pvq.app/leaderboard
[2] https://openrouter.ai/microsoft/phi-4
[3] https://pvq.app/questions/ask
KTibow|1 year ago
Edit: they have a blog post https://pvq.app/posts/individual-voting-comparison although it could go deeper
[1] https://docs.google.com/spreadsheets/u/3/d/18n--cIaVt49kOh-G...
[2] https://github.com/google-research/google-research/blob/mast...
solomatov|1 year ago
Did it have a different license before? If so, why did they change it?
hbcondo714|1 year ago
https://ollama.com/vanilj/Phi-4
raybb|1 year ago
Then a quick search revealed you can, as of a few weeks ago:
https://ollama.com/blog/structured-outputs
[1] https://arxiv.org/html/2412.08905v1
[2] https://github.com/ollama/ollama/releases/tag/v0.5.5
magicalhippo|1 year ago
[1]: https://news.ycombinator.com/item?id=42660335 Phi-4 Bug Fixes
gnabgib|1 year ago
Also on hugging face https://huggingface.co/microsoft/phi-4
mettamage|1 year ago
For context: I've made some simple neural nets with backprop. I read [1].
[1] http://neuralnetworksanddeeplearning.com/
k__|1 year ago
Does this mean the model was trained without copyright infringements?