Qwen3-VL

434 points | natrys | 5 months ago | qwen.ai

160 comments

richardlblair|5 months ago

As I mentioned yesterday, I recently needed to process hundreds of low-quality images of invoices (for a construction project). I had a script that used PIL/OpenCV, pytesseract, and OpenAI as a fallback, and it still had a staggering number of failures.

Today I tried a handful of the really poor-quality invoices and Qwen spat out all the information I needed without an issue. What's crazier is that it gave me bounding boxes I can use to improve the tesseract pass.
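
The handoff is simple; a minimal sketch, assuming the boxes come back already scaled to the original image's pixel space (they may need rescaling first):

    from PIL import Image
    import pytesseract

    def ocr_fields(image_path, boxes):
        # boxes: [(x1, y1, x2, y2), ...] from the VLM, in original pixel space
        img = Image.open(image_path)
        out = []
        for x1, y1, x2, y2 in boxes:
            crop = img.crop((x1, y1, x2, y2))
            # tesseract does far better on a tight single-field crop than on
            # the full noisy page; --psm 7 treats the crop as one line of text
            out.append(pytesseract.image_to_string(crop, config="--psm 7").strip())
        return out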

benterix|5 months ago

I wonder why you chose Qwen specifically. Mistral has a specialized model just for OCR that they advertised heavily (I tested it and it works surprisingly well, at least on English-language books from the '80s and '90s).

VladVladikoff|5 months ago

Interesting. In the past I've tried to get VLMs to estimate bounding boxes for property boundaries on satellite maps, but had no success. Do you have any tips on how to improve the results?

wiz21c|5 months ago

I like to test these models on reading the contents of '80s Apple ][ game screenshots. These are very low resolution and very dense. All (free-to-use) models struggle on that task...

netdur|5 months ago

I've tried that too, detecting the scan layout to get better OCR, but it didn't really beat a fine-tuned Qwen2.5-VL 7B. I'd say fine-tuning is the way to go.

unixhero|5 months ago

So where did you load up Qwen, and how did you supply the PDF or photo files? I don't know how to use these models, but I want to learn.

lofaszvanitt|5 months ago

People actually use tesseract? It's one of the worst OCR solutions out there. Forget it.

creativebee|5 months ago

Any tips on getting bounding boxes? The model doesn't seem to even understand the original size of the image, and even when I provide the dimensions, the positioning is off. :'(
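
What I've been trying, without luck, is rescaling from what I assume is the model's resized input back to the original (a sketch; maybe the assumption itself is wrong):

    def rescale_box(box, model_size, orig_size):
        # map a box from the model's (resized) input pixel space back to the
        # original image; model_size is a guess at what the model actually saw
        x1, y1, x2, y2 = box
        mw, mh = model_size
        ow, oh = orig_size
        sx, sy = ow / mw, oh / mh
        return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)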

kardianos|5 months ago

Wait a moment... It gave you BOUNDING BOXES? That is awesome! That's a missing link I need for these models.

pouetpouetpoue|5 months ago

I had success with Tabula; you may not need AI. But fine if it works too.
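
For the table-extraction part, the tabula-py wrapper is essentially a one-liner (file name hypothetical; it shells out to the Java Tabula jar under the hood):

    import tabula

    # returns a list of pandas DataFrames, one per detected table
    dfs = tabula.read_pdf("invoice.pdf", pages="all", lattice=True)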

re5i5tor|5 months ago

I would strongly emphasize:

CV != AI Vision

GPT-4o would breeze through your poor-quality images.

deepdarkforest|5 months ago

The Chinese are doing here what they have been doing to the manufacturing industry. Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive. These models might be benchmaxxed, but as another comment said, I see so many benchmarks that it might as well be the most impressive benchmaxxing today, if not just a genuinely SOTA open-source model. They even released a closed-source 1-trillion-parameter model today that is sitting at No. 3 (!) on LM Arena. Even their 80B model is 17th, while gpt-oss-120b is 52nd. https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2...

jychang|5 months ago

They still suck at explaining which model they serve is which, though.

They also released Qwen3-VL-Plus [1] today, alongside Qwen3-VL-235B [2], and they don't tell us which one is better. Note that Qwen3-VL-Plus is a very different model from Qwen-VL-Plus.

Also, qwen-plus-2025-09-11 [3] vs qwen3-235b-a22b-instruct-2507 [4]. What's the difference? Which one is better? Who knows.

You know it's bad when OpenAI has the clearer naming scheme.

[1] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[2] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[3] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

[4] https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?...

nl|5 months ago

> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency. As simple as that. Super impressive.

This "just" is incorrect.

The Qwen team invented things like DeepStack https://arxiv.org/abs/2406.04334

(Also, I hate this "The Chinese" thing. Do we say "The British" when something comes from a DeepMind team in the UK? And what about Chinese-born US citizens working in Paris for Mistral?

Give credit to the Qwen team rather than a whole country. China has both great labs and mediocre labs, just like the rest of the world.)

spaceman_2020|5 months ago

Interestingly, I've found that models like Kimi K2 spit out more organic, natural-sounding text than American models

It falls short on benchmarks compared to other SOTA models, but the real-world experience is different.

dabockster|5 months ago

> Take the core technology and just optimize, optimize, optimize for 10x the cost/efficiency.

This is what really grinds my gears about American AI and American technology in general lately, as an American myself. We used to do that! But over the last 10-15 years, it seems like all this country can do is try to throw more and more resources at something instead of optimizing what we already have.

Download more RAM for this progressive web app.

Buy a Threadripper CPU to run this game that looks worse than the ones you played on the Nintendo GameCube in the early 2000s.

Generate more electricity (hello Elon Musk).

Y'all remember your algorithms classes from college, right? Why not apply that here? Because China is doing just that, and frankly making us look stupid by comparison.

helloericsf|5 months ago

If you're in SF, you don't want to miss this. The Qwen team is making their first public appearance in the United States, with the VP of Qwen Lab speaking at the meetup below during SF Tech Week. https://partiful.com/e/P7E418jd6Ti6hA40H6Qm A rare opportunity to engage directly with Qwen team members.

alfiedotwtf|5 months ago

Let’s hope they’re allowed in the country and get a visa… it’s 50/50 these days

dazzaji|5 months ago

Registration full :-(

be7a|5 months ago

The biggest takeaway is that they claim SOTA for multimodal tasks, even ahead of proprietary models, and still released it as open weights. My first tests suggest this might actually be true; I'll continue testing. Wow

ACCount37|5 months ago

Most multi-modal input implementations suck, and a lot of them suck big time.

Doesn't seem to be far ahead of existing proprietary implementations. But it's still good that someone's willing to push that far and release the results. Getting multimodal input to work even this well is not at all easy.

Computer0|5 months ago

I feel like most Open Source releases regardless of size claim to be similar in output quality to SOTA closed source stuff.

Workaccount2|5 months ago

Sadly it still fails the "extra limb" test.

I have a few images of animals with an extra limb photoshopped onto them: a dog with a leg coming out of its stomach, or a cat with two front-right legs.

Like every other model I have tested, it insists that the animals have the anatomically correct number of limbs. Even when I point out there is a leg coming from the dog's stomach, it pushes back and insists I am confused, that it counted again and there are definitely only four. Qwen took it a step further: even after I told it the image was edited, it told me it wasn't and there were only four limbs.

Jackson__|5 months ago

It fails on any edge case, like all other VLMs. The last time a vision model succeeded at reading analog clocks, a notoriously difficult task, it turned out it had been trained on nearly 1 million artificial clock images [0] to make it work. In a similar vein, I have encountered no model that can, for example, read a D20 correctly. [1]

It could probably identify extra limbs in your pictures if you too made a million example images to train it on, but until then it will keep failing. And of course you'll get to keep making millions more example images for every other issue you run into.

[0] https://huggingface.co/datasets/allenai/pixmo-clocks

[1] https://files.catbox.moe/ocbr35.jpg

brookst|5 months ago

Definitely not a good model for accurately counting limbs on mutant species, then. Might be good at other things that have greater representation in the training set.

ComputerGuru|5 months ago

I wonder whether, if you used their image-editing feature, it would insist on "correcting" the number of limbs even when you asked for unrelated changes.

willahmad|5 months ago

China is winning the hearts of developers in this race so far. At least, they won mine already.

cedws|5 months ago

Arguably they've already won. Check the names at the top the next time you see a paper from an American company; a lot of them are Chinese.

Workaccount2|5 months ago

They don't have to ever make a profit, so the game they are playing is a bit different.

swyx|5 months ago

So... why do you think they're trying this hard to win your heart?

sergiotapia|5 months ago

Thank you, Qwen team, for your generosity. I'm already using their thinking model to build some cool workflows that take care of boring tasks within my org.

https://openrouter.ai/qwen/qwen3-235b-a22b-thinking-2507

Now I'll use this one to identify and caption meal pictures and user photos for other workflows. Very cool!
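
Rough shape of the captioning call through OpenRouter's OpenAI-compatible endpoint; the VL model slug and image URL below are placeholders of mine, check openrouter.ai for the real slug:

    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
    resp = client.chat.completions.create(
        model="qwen/qwen3-vl-235b-a22b-instruct",  # hypothetical slug
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Caption this meal photo in one short sentence."},
                {"type": "image_url", "image_url": {"url": "https://example.com/meal.jpg"}},
            ],
        }],
    )
    print(resp.choices[0].message.content)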

causal|5 months ago

That has got to be the most benchmarks I've ever seen posted with an announcement. Kudos for not just cherry-picking a favorable set.

esafak|5 months ago

We should stop reporting saturated benchmarks.

BUFU|5 months ago

The open source models are no longer catching up. They are leading now.

buyucu|5 months ago

It has been like that for a while now, at least since DeepSeek R1.

vardump|5 months ago

So the 235B-parameter Qwen3-VL is FP16, meaning it practically requires at least 512 GB of RAM to run? Possibly even more for a reasonable context window?

Assuming I don’t want to run it on a CPU, what are my options to run it at home under $10k?

Or if my only option is to run the model on CPU (vs. GPU or other specialized hardware), what would be the best way to use that $10k? vLLM plus multiple networked (10/25/100 Gbit) systems?

loudmax|5 months ago

An Apple Mac Studio with 512GB of unified memory runs around $10k. If you really need that much power on your home computer, and you have that much money to spend, this could be the easiest option.

You probably don't need fp16. Most models can be quantized down to q8 with minimal loss of quality. Models can usually be quantized to q4 or even below and run reasonably well, depending on what you expect out of them.
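
The weights-only arithmetic (parameter count times bytes per parameter):

    params = 235e9
    for name, bits in [("fp16", 16), ("q8", 8), ("q4", 4)]:
        print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB of weights")
    # fp16: ~470 GB, q8: ~235 GB, q4: ~118 GB
    # KV cache and activations come on top of this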

Even at q8, you'll need around 235GB of memory. An Nvidia RTX 5090 has 32GB of VRAM and has an official price of about $2000, but usually retails for more. If you can find them at that price, you'd need eight of them to run a 235GB model entirely in VRAM, and that doesn't include a motherboard and CPU that can handle eight GPUs. You could look for old mining rigs built from RTX 3090s or P40s. Otherwise, I don't see much prospect for fitting this much data into VRAM on consumer GPUs for under $10k.

Without NVLink, you're going to take a massive performance hit running a model distributed over several computers. It can be done, and there's research into optimizing distributed models, but the throughput is a significant bottleneck. For now, you really want to run on a single machine.

You can get pretty good performance out of a CPU. The key is memory bandwidth. Look at server or workstation class CPUs with a lot of DDR5 memory channels that support a high MT/s rate. For example, an AMD Ryzen Threadripper 7965WX has eight DDR5 memory channels at up to 5200 MT/s and retails for about $2500. Depending on your needs, this might give you acceptable performance.
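
As a back-of-the-envelope for that Threadripper: decode speed is roughly memory bandwidth divided by bytes read per token, and the MoE helps a lot here, since only the ~22B active parameters are touched per token rather than all 235B (q8 assumed, KV-cache traffic ignored):

    bw = 8 * 5200e6 * 8           # channels * MT/s * 8 bytes ~= 333 GB/s
    bytes_per_token = 22e9        # ~22B active params at q8, 1 byte each
    print(bw / bytes_per_token)   # ~15 tokens/s, a rough upper bound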

Lastly, I'd question whether you really need to run this at home. Obviously, this depends on your situation and what you need it for. Any investment you put into hardware is going to depreciate significantly in just a few years. $10k of credits in the cloud will take you a long way.

bitflourjikg|5 months ago

A non-CPU setup will very likely require an electrical service upgrade, or tactical positioning of different systems on different circuits, for you to be able to run models that large. Several-kW setups also usually cost non-trivial sums to run.

isoprophlex|5 months ago

Extremely impressive, but can one really run these >200B-parameter models on-prem in any cost-effective way? Even if you get your hands on cards with 80 GB of VRAM, you still need to tie them together in a low-latency, high-bandwidth manner.

It seems to me that small/medium sized players would still need a third party to get inference going on these frontier-quality models, and we're not in a fully self-owned self-hosted place yet. I'd love to be proven wrong though.

Borealid|5 months ago

A Framework Desktop exposes 96GB of RAM for inference and costs a few thousand USD.

buyucu|5 months ago

I'm running them on a GMKtec Evo 2.

vessenes|5 months ago

Roughly 1/10 the cost of Opus 4.1 and 1/2 the cost of Sonnet 4 on a per-token inference basis. Impressive. I'd love to see a fast (Groq-style) version of this served; I wonder if the architecture is amenable.

petesergeant|5 months ago

Cerebras are hosting other Qwen models via OpenRouter, so probably

aitchnyu|5 months ago

Isn't it more like a 3x rate difference? $0.70 for Qwen3-VL vs. $3 for Sonnet 4?

vessenes|5 months ago

I spent a little time with the thinking model today. It's good. It's not better than GPT-5 Pro. It might be better than the smallest GPT-5, though.

My current go-to test is to ask the LLM to construct a charging solution for my MacBook Pro with the model on it, given that, sadly, I and the Pro have been sent to 15th-century Florence with no money and no charger. I explain I only have two to three hours of inference time, which can be spread out, but in that time I need to construct a working charging solution.

So far GPT-5 Pro has been by far the best, not just in its electrical specifications (drawings of a commutator): it generated instructions for jewelers and blacksmiths in what it claims is 15th-century Florentine Italian, and furnished a year-by-year set of events with trading/banking predictions, plus a short rundown of how to get to the right people in the Medici family... it was comprehensive.

Generally, models suggest building an alternating-current setup, rectifying down to 5V DC, and trickle charging over the USB-C pins that allow it. There's a lot of variation in how they suggest getting to DC power, and oftentimes not a lot of help on key questions like, say, "how do I know I don't have too much voltage using only 15th-century tools?"

Qwen3-VL is a mixed bag. It's the only model other than GPT-5 I've talked to that suggested building a voltaic pile, estimated the voltage generated by the number of plates, gave me some tests to check voltage (lick a lemon, touch your tongue; mild tingling: good; strong tingling: remove a few plates), and was overall helpful.

On the other hand, its money-making strategy was laughable: predicting Halley's comet, and demanding a workshop and 20 copper pennies from the Medicis in exchange.

Anyway, interesting showing, definitely real, and definitely useful.

ralusek|5 months ago

I JUST had a very intense dream that there was a catastrophic event that set humanity back massively, to the point that the internet was nonexistent and our laptops suddenly became priceless. The first thought I had was absolutely hating myself for not bothering to download a local LLM. A local LLM at the level of Qwen is enough to massively jump-start civilization.

nl|5 months ago

> predicting Halley's comet, and in exchange demanding a workshop and 20 copper pennies from the Medicis

I love this! Simple and probably effective (or would get you killed for witchcraft)

buu700|5 months ago

Funny enough, I did a little bit of ChatGPT-assisted research into a loosely similar scenario not too long ago. LPT: if you happen to know in advance that you'll be in Renaissance Florence, make sure to pack as many synthetic diamonds as you can afford.

ripped_britches|5 months ago

That is a freaking insanely cool answer from GPT-5

mythz|5 months ago

Team Qwen keeps cooking! Qwen2.5-VL was already my preferred vision model for querying images; I'll look at upgrading if they release a smaller model we can run locally.

fareesh|5 months ago

Can't seem to connect to qwen.ai with DNSSEC enabled

    $ resolvectl query qwen.ai
    qwen.ai: resolve call failed: DNSSEC validation failed: no-signature

And

https://dnsviz.net/d/qwen.ai/dnssec/ shows

    aliyunga0019.com/DNSKEY: No response was received from the server over UDP (tried 4 times). See RFC 1035, Sec. 4.2. (8.129.152.246, UDP_-_EDNS0_512_D_KN)

mountainriver|5 months ago

Incredible release! Qwen has been leading the open source vision models for a while now. Releasing a really big model is amazing for a lot of use cases.

I would love to see a comparison to the latest GLM model. I would also love to see no one use OSWorld ever again; it's a deeply flawed benchmark.

drapado|5 months ago

Cool! Pity they are not releasing a smaller A3B MoE model

daemonologist|5 months ago

Their A3B Omni paper mentions that the Omni at that size outperformed the (unreleased, I guess) VL. Edit: I see now that there is no Omni-235B-A22B; disregard the following. ~~Which is interesting: I'd have expected the larger model to have more weights to "waste" on additional modalities, and thus for the opposite to be true (or for the VL to outperform in both cases, or for both to benefit from knowledge transfer).~~

Relevant comparison is on page 15: https://arxiv.org/abs/2509.17765

jadbox|5 months ago

How does it compare to Omni?

ramon156|5 months ago

One downside is that it has less knowledge of lesser-known tools like orpc, which is easily fixed by something like Context7.

ashvardanian|5 months ago

Qwen models have historically been pretty good, but there seems to be no architectural novelty here, unless I'm missing it. It looks like another vision encoder with a projection plus a large autoregressive model. Have there been any better ideas in the VLM space recently? I've been away for a couple of years :(

clueless|5 months ago

This demo is crazy: "At what time was the goal scored in this match, who scored it, and how was it scored?"

addandsubtract|5 months ago

I had the same reaction, given the 100min+ runtime of the video.

michaelanckaert|5 months ago

Qwen has some really great models. I recently used qwen/qwen3-next-80b-a3b-thinking as a drop-in replacement for GPT-4.1-mini in an agent workflow. It costs four times less for input tokens and half as much for output: instant cost savings. As far as I can measure, system output has kept the same quality.
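
The drop-in part really is just the model string, assuming an OpenAI-compatible gateway (OpenRouter here, as an assumption about my setup):

    from openai import OpenAI

    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
    resp = client.chat.completions.create(
        # model="openai/gpt-4.1-mini",             # before
        model="qwen/qwen3-next-80b-a3b-thinking",  # after
        messages=[{"role": "user", "content": "..."}],
    )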

am17an|5 months ago

This model is literally amazing. Everyone should try to get their hands on an H100 and just call it a day.

whitehexagon|5 months ago

Imagine the demand for a 128GB/256GB/512GB unified-memory Linux box shipping with Qwen models already up and running.

Although I'm agAInst steps towards AGI, it feels safer to have these things running locally and disconnected from each other than to have giant GW-scale agentic cloud data centers connected to everyone and everything.

buyucu|5 months ago

I bought a GMKtec Evo 2, a 128 GB unified-memory system. Strongly recommend it.

Alifatisk|5 months ago

Wow, the Qwen team doesn't stop and keeps coming up with surprises. Not only did they release this, but also the new Qwen3-Max model.

buyucu|5 months ago

The Chinese are great. They are making major contributions to human civilization by open-sourcing these models.