As I mentioned yesterday - I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV, pytesseract, and OpenAI as a fallback. It still has a staggering number of failures.
Today I tried a handful of the really poor quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is it gave me the bounding boxes to improve tesseract.
I wonder why you chose Qwen specifically - Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from 80s and 90s).
Interesting. I have in the past tried to get bounding boxes of property boundaries on satellite maps estimated by VLLM models but had no success. Do you have any tips on how to improve the results?
I like to test these models on reading the contents of '80s Apple ][ game screenshots. These are very low resolution and very dense. All (free-to-use) models struggle on that task...
I’ve tried that too, trying to detect the scan layout to get better OCR, but it didn’t really beat a fine-tuned Qwen 2.5 VL 7B. I’d say fine-tuning is the way to go.
Any tips on getting bounding boxes? The model doesn’t seem to even understand the original size of the image. And even if I provide the dimensions, the positioning is off. :'(
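For what it's worth, one approach that has worked for me (assuming the Qwen2.5-VL-style behavior, where the model sees a resized copy of the image and reports boxes in that resized coordinate space): don't feed the model the original dimensions at all; take its boxes and rescale them yourself. A minimal sketch, where `rescale_bbox` is my own hypothetical helper, not anything from a Qwen library:

```python
def rescale_bbox(bbox, model_size, original_size):
    """Map a bounding box from the model's (resized) coordinate space
    back to the original image's pixel coordinates.

    bbox: (x1, y1, x2, y2) as reported by the model
    model_size: (width, height) of the image the model actually saw
    original_size: (width, height) of the original image
    """
    mw, mh = model_size
    ow, oh = original_size
    x1, y1, x2, y2 = bbox
    return (x1 * ow / mw, y1 * oh / mh, x2 * ow / mw, y2 * oh / mh)
```

The key is knowing which size the model actually processed (some pipelines resize to multiples of a fixed patch size), which you can usually recover from the preprocessor's output rather than guessing.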
The Chinese are doing what they have been doing to the manufacturing industry as well. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said, I see so many benchmarks that it might as well be the most impressive benchmaxxing today, if not just a genuinely SOTA open-source model. They even released a closed-source 1-trillion-parameter model today that is sitting at No. 3 (!) on LM Arena. Even their 80B model is 17th; gpt-oss-120b is 52nd.
https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2...
They still suck at explaining which model they serve is which, though.
They also released Qwen3-VL Plus [1] today alongside Qwen3-VL 235B [2], and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model compared to Qwen-VL-Plus.
Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.
You know it's bad when OpenAI has a clearer naming scheme.
(Also I hate this "The Chinese" thing. Do we say "The British" if it came from a DeepMind team in the UK? Or what if there are Chinese born US citizens working in Paris for Mistral?
Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)
> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency.
This is what really grinds my gears about American AI and American technology in general lately, as an American myself. We used to do that! But over the last 10-15 years, it seems like all this country can do is try to throw more and more resources at something instead of optimizing what we already have.
Download more RAM for this progressive web app.
Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo GameCube in the early 2000s.
Generate more electricity (hello Elon Musk).
Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.
If you're in SF, you don't want to miss this.
The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week.
https://partiful.com/e/P7E418jd6Ti6hA40H6Qm
Rare opportunity to directly engage with the Qwen team members.
The biggest takeaway is that they claim SOTA for multi-modal stuff even ahead of proprietary models and still released it as open-weights. My first tests suggest this might actually be true, will continue testing. Wow
Most multi-modal input implementations suck, and a lot of them suck big time.
Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.
I have a few images of animals with an extra limb photoshopped onto them. A dog with a leg coming out of its stomach, or a cat with two front right legs.
Like every other model I have tested, it insists that the animals have their anatomically correct number of limbs. Even when I point out there is a leg coming from the dog's stomach, it pushes back and insists I am confused, insists it counted again and there are definitely only 4. Qwen took it a step further: even after I told it the image was edited, it told me it wasn't and there were only 4 limbs.
It fails on any edge case, like all other VLMs. The last time a vision model succeeded at reading analog clocks, a notoriously difficult task, it was revealed they trained on nearly 1 million artificial clock images[0] to make it work. In a similar vein, I have encountered no model that could correctly read, for example, a D20.[1]
It could probably identify extra limbs in your pictures if you too made a million example images to train it on, but until then it will keep failing. And of course you'll get to keep making millions more example images for every other issue you run into.
Definitely not a good model for accurately counting limbs on mutant species, then. Might be good at other things that have greater representation in the training set.
So the 235B-parameter Qwen3-VL is FP16, meaning practically it requires at least 512 GB of RAM to run? Possibly even more for a reasonable context window?
Assuming I don’t want to run it on a CPU, what are my options to run it at home under $10k?
Or if my only option is to run the model with CPU (vs GPU or other specialized HW), what would be the best way to use that 10k? vLLM + Multiple networked (10/25/100Gbit) systems?
An Apple Mac Studio with 512GB of unified memory is around $10k. If you really need that much power on your home computer, and you have that much money to spend, this could be the easiest option.
You probably don't need fp16. Most models can be quantized down to q8 with minimal loss of quality. Models can usually be quantized to q4 or even below and run reasonably well, depending on what you expect out of them.
Even at q8, you'll need around 235GB of memory. An Nvidia RTX 5090 has 32GB of VRAM and has an official price of about $2000, but usually retails for more. If you can find them at that price, you'd need eight of them to run a 235GB model entirely in VRAM, and that doesn't include a motherboard and CPU that can handle eight GPUs. You could look for old mining rigs built from RTX 3090s or P40s. Otherwise, I don't see much prospect for fitting this much data into VRAM on consumer GPUs for under $10k.
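For reference, the back-of-the-envelope arithmetic behind those figures (weights only; the KV cache and activations come on top and grow with context length):

```python
def weights_gb(n_params_billion, bits_per_weight):
    """Memory needed to hold the model weights alone.
    KV cache and activations are extra and scale with context."""
    # params (billions) * bits per weight / 8 bits per byte -> GB
    return n_params_billion * bits_per_weight / 8

for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
    print(f"{name}: ~{weights_gb(235, bits):.0f} GB")
```

That's roughly 470 GB at fp16, 235 GB at q8, and under 120 GB at q4 before any context overhead, which is why the 512 GB figure above is about the practical floor for fp16.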
Without NVLink, you're going to take a massive performance hit running a model distributed over several computers. It can be done, and there's research into optimizing distributed models, but the throughput is a significant bottleneck. For now, you really want to run on a single machine.
You can get pretty good performance out of a CPU. The key is memory bandwidth. Look at server or workstation class CPUs with a lot of DDR5 memory channels that support a high MT/s rate. For example, an AMD Ryzen Threadripper 7965WX has eight DDR5 memory channels at up to 5200 MT/s and retails for about $2500. Depending on your needs, this might give you acceptable performance.
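A rough way to see why bandwidth is the key: during single-stream decoding, every generated token has to stream all active weights from memory once, so bandwidth divided by bytes-per-token gives an optimistic upper bound on speed. A sketch, assuming this MoE model activates about 22B parameters per token:

```python
def decode_tokens_per_sec(bandwidth_gbs, active_params_billion, bytes_per_weight):
    """Optimistic upper bound on single-stream decode speed for a
    memory-bandwidth-bound model: each token streams all active
    weights from memory once. Real throughput will be lower."""
    gb_per_token = active_params_billion * bytes_per_weight
    return bandwidth_gbs / gb_per_token

# Threadripper 7965WX: 8 channels x 5200 MT/s x 8 bytes per transfer
bw = 8 * 5200e6 * 8 / 1e9  # ~332.8 GB/s theoretical peak
print(f"~{decode_tokens_per_sec(bw, 22, 1):.1f} tok/s at q8 (22B active params)")
```

On that crude estimate the eight-channel Threadripper tops out around 15 tokens/s for a q8 quant of a 22B-active MoE, which is indeed usable for many workloads.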
Lastly, I'd question whether you really need to run this at home. Obviously, this depends on your situation and what you need it for. Any investment you put into hardware is going to depreciate significantly in just a few years. $10k of credits in the cloud will take you a long way.
A non-CPU setup will very likely require an electrical service upgrade, or tactical positioning of different systems on different circuits, for you to be able to run models that large. Several-kilowatt setups also usually cost non-trivial sums of money to run.
Extremely impressive, but can one really run these >200B param models on prem in any cost effective way? Even if you get your hands on cards with 80GB ram, you still need to tie them together in a low-latency high-BW manner.
It seems to me that small/medium sized players would still need a third party to get inference going on these frontier-quality models, and we're not in a fully self-owned self-hosted place yet. I'd love to be proven wrong though.
Roughly 1/10 the cost of Opus 4.1, 1/2 the cost of Sonnet 4 on per token inference basis. Impressive. I'd love to see a fast (groq style) version of this served. I wonder if the architecture is amenable.
I spent a little time with the thinking model today. It's good. It's not better than GPT5 Pro. It might be better than the smallest GPT 5, though.
My current go-to test is to ask the LLM to construct a charging solution for my MacBook Pro with the model on it, but sadly, I and the Pro have been sent to 15th century Florence with no money and no charger. I explain I only have two to three hours of inference time, which can be spread out, but in that time I need to construct a working charging solution.
So far GPT-5 Pro has been by far the best, not just in its electrical specifications (drawings of a commutator), but it generated instructions for jewelers and blacksmiths in what it claims is 15th-century Florentine Italian, and furnished a year-by-year set of events with trading/banking predictions, a short rundown of how to get to the right folks in the Medici family... it was comprehensive.
Generally models suggest building an alternating-current setup and then rectifying to 5V DC, and trickle charging over the USB-C pins that allow it. There's a lot of variation in how they suggest we get to DC power, and often not a lot of help on key questions, like, say, "how do I know I don't have too much voltage using only 15th century tools?"
Qwen 3 VL is a mixed bag. It's the only model other than GPT5 I've talked to that suggested building a voltaic pile, estimated voltage generated by number of plates, gave me some tests to check voltage (lick a lemon, touch your tongue. Mild tingling - good. Strong tingling, remove a few plates), and was overall helpful.
On the other hand, its money-making strategy was laughable: predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis.
Anyway, interesting showing, definitely real, and definitely useful.
I JUST had a very intense dream that there was a catastrophic event that set humanity back massively, to the point that the internet was nonexistent and our laptops suddenly became priceless. The first thought I had was absolutely hating myself for not bothering to download a local LLM. A local LLM at the level of qwen is enough to massively jump start civilization.
Funny enough, I did a little bit of ChatGPT-assisted research into a loosely similar scenario not too long ago. LPT: if you happen to know in advance that you'll be in Renaissance Florence, make sure to pack as many synthetic diamonds as you can afford.
Team Qwen keeps cooking! qwen2.5VL was already my preferred visual model for querying images, will look at upgrading if they release a smaller model we can run locally.
Incredible release! Qwen has been leading the open source vision models for a while now. Releasing a really big model is amazing for a lot of use cases.
I would love to see a comparison to the latest GLM model. I would also love to see no one use OS World ever again, it’s a deeply flawed benchmark.
Their A3B Omni paper mentions that the Omni at that size outperformed the (unreleased I guess) VL. Edit: I see now that there is no Omni-235B-A22B; disregard the following. ~~Which is interesting - I'd have expected the larger model to have more weights to "waste" on additional modalities and thus for the opposite to be true (or for the VL to outperform in both cases, or for both to benefit from knowledge transfer).~~
Qwen models have historically been pretty good, but there seems to be no architectural novelty here, if I’m not missing it. Seems like another vision encoder, with a projection, and a large autoregressive model. Have there been any better ideas in the VLM space recently? I’ve been away for a couple of years :(
Qwen has some really great models. I recently used qwen/qwen3-next-80b-a3b-thinking as a drop-in replacement for GPT-4.1-mini in an agent workflow. Cost 4 times less for input tokens and half for output, instant cost savings.
As far as I can measure, system output has kept the same quality.
Imagine the demand for a Linux box stuffed with 128GB/256GB/512GB of unified memory, shipping with Qwen models already up and running.
Although I'm agAInst steps towards AGI, it feels safer to have these things running locally and disconnected from each other than some giant GW cloud agentic data centers connected to everyone and everything.
re5i5tor|5 months ago
CV != AI Vision
gpt-4o would breeze through your poor images.
[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...
nl|5 months ago
This "just" is incorrect.
The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334
spaceman_2020|5 months ago
Fails on the benchmarks compared to other SOTA models but the real-world experience is different
[0] https://huggingface.co/datasets/allenai/pixmo-clocks
[1] https://files.catbox.moe/ocbr35.jpg
sergiotapia|5 months ago
https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507
Now with this I will use it to identify and caption meal pictures and user pictures for other workflows. Very cool!
natrys|5 months ago
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Thinking
- https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct
nl|5 months ago
I love this! Simple and probably effective (or would get you killed for witchcraft)
fareesh|5 months ago
> resolvectl query qwen.ai
> qwen.ai: resolve call failed: DNSSEC validation failed: no-signature
And
https://dnsviz.net/d/qwen.ai/dnssec/ shows
aliyunga0019.com/DNSKEY: No response was received from the server over UDP (tried 4 times). See RFC 1035, Sec. 4.2. (8.129.152.246, UDP_-_EDNS0_512_D_KN)
daemonologist|5 months ago
Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765