
Qwen3.5: Towards Native Multimodal Agents

434 points | danielhanchen | 13 days ago | qwen.ai

214 comments


dash2|13 days ago

You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.

zozbot234|13 days ago

My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"

onyx228|13 days ago

The thing I would appreciate much more than performance on "embarrassing LLM questions" is a method of finding them, and of estimating, by some form of statistical sampling, the cardinality of that set for each LLM.

It's difficult to do because LLMs immediately consume the entire available corpus, so there is no telling whether the algorithm improved, or whether it just wrote one more post-it note and stuck it on its monitor. This is an agency vs replay problem.

Preventing replay attacks in data processing is simple: encrypt, use a one-time pad, similarly to TLS. How can one make problems which are natural-language, but where the contents, still explained in plain English, are "encrypted" such that every time an LLM reads them, they are novel to it?

Perhaps a generative language model could help. Not a large language model, but something that understands grammar well enough to create problems that LLMs will be able to solve - and where the actual encoding of the puzzle is generative, kind of like how a random string of balanced left and right parentheses can be used to encode a computer program.

Maybe it would make sense to use a program generator that generates a random program in a simple, sandboxed language - say, I don't know, Lua - then translates it to plain English for the LLM, asks it what the outcome should be, and compares its answer with the output of the Lua program, which can be quickly executed for comparison.
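To make that concrete, here is a minimal sketch of the loop, using toy arithmetic expressions instead of Lua for brevity; the ask_llm call is hypothetical:

  import operator
  import random

  # Toy "program" generator: random arithmetic expression trees.
  OPS = {"plus": operator.add, "minus": operator.sub, "times": operator.mul}

  def random_expr(depth=0):
      # Leaves are small integers; interior nodes are named operators.
      if depth >= 2 or random.random() < 0.4:
          return random.randint(0, 9)
      op = random.choice(list(OPS))
      return (op, random_expr(depth + 1), random_expr(depth + 1))

  def to_english(e):
      # Render the tree as plain English, parenthesized to stay unambiguous.
      if isinstance(e, int):
          return str(e)
      op, left, right = e
      return f"({to_english(left)} {op} {to_english(right)})"

  def evaluate(e):
      # Ground truth by direct execution of the "program".
      if isinstance(e, int):
          return e
      op, left, right = e
      return OPS[op](evaluate(left), evaluate(right))

  expr = random_expr()
  question = f"What is {to_english(expr)}?"
  truth = evaluate(expr)
  # answer = ask_llm(question)            # hypothetical LLM call
  # score = answer.strip() == str(truth)  # compare against executed result
  print(question, "->", truth)

Because every problem is freshly sampled, memorizing past answers (the post-it note) doesn't help; only actual evaluation ability does.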

Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.

I'm sure I'm missing something here, so please let me know if so.

davesque|4 days ago

Did it do that because it's better at logic or because internet commentary on this embarrassing question is now part of the training set?

PurpleRamen|13 days ago

How well does this work when you slightly change the question? Rephrase it, or use a bicycle/truck/ship/plane instead of car?

yalok|12 days ago

For the Google AI Overview (not sure which Gemini model is used for it, it must be something smaller than the regular model), it looks like search/RAG helps it get it right - since it relies on LinkedIn and Hacker News (!) posts to respond correctly...

As of Feb 16, 2026:

====

Drive the car. While 50 meters is a very short distance, the car must be present at the car wash to be cleaned, according to LinkedIn users [1]. Walking would leave your car at home, defeating the purpose of the trip, notes another user.

Why Drive: The car needs to be at the location to be cleaned. It's only a few seconds away, and you can simply drive it there and back, says a Hacker News user. [2]

Why Not to Walk: Walking there means the car stays home, as noted in a post. [3]

The best option is to start the engine, drive the 50 meters, and let the car get washed.

[1] https://www.linkedin.com/posts/ramar_i-saw-this-llm-failure-... [2] https://news.ycombinator.com/item?id=47034546 [3] https://x.com/anirudhamudan/status/2022152959073956050/photo...

But the regular Gemini reasons correctly by itself, without any references:

====

Unless you have a very long hose and a very patient neighbor, you should definitely drive. Washing a car usually requires, well, the car to be at the wash. Walking 50 meters—about half a New York City block—is great for your step count, but it won't get your vehicle any cleaner! Are you headed to a self-service bay or an automatic tunnel wash?

menaerus|13 days ago

That's the Gemini assistant. Although a bit hilarious, it's not reproducible with any other model.

red75prime|13 days ago

A hiccup in a System 1 response. In humans, these are fixed at the speed of discovery. Continual learning FTW.

rfoo|13 days ago

[deleted]

WithinReason|13 days ago

Is that the new pelican test?

nl|13 days ago

"the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive."

I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training objective of LLMs is next token prediction.

The "Average Ranking vs Environment Scaling" graph below that is pretty confusing though! Took me a while to realize the Qwen points near the Y-axis were for Qwen 3, not Qwen 3.5.

simonw|13 days ago

oidar|13 days ago

How much more do you know about pelicans now than when you first started doing this?

tarruda|13 days ago

At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest switching to a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

thomasahle|13 days ago

We scaled on "virtually all RL tasks and environments we could conceive." - apparently, they didn't conceive of pelican SVG RL.

I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.

moffers|13 days ago

I like the little spot colors it put on the ground

embedding-shape|13 days ago

How many times do you run the generation, and how do you choose which example to ultimately post and share with the public?

AbstractGeo|13 days ago

What quantization were you running there, or, was it the official API version?

m12k|13 days ago

Axis-aligned spokes is certainly a choice

bertili|13 days ago

Better than frontier pelicans as of 2025

tarruda|13 days ago

Would love to see a Qwen3.5 release in the 80-110B range, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.

Tepix|13 days ago

Have you thought about getting a second 128GB device? Open weights models are rapidly increasing in size, unfortunately.

PlatoIsADisease|13 days ago

Why 128GB?

At 80B, you could do 2 A6000s.

What device is 128GB?

bytesandbits|12 days ago

Maybe a DeepSeek V4 distill. Give it a few days.

gunalx|13 days ago

Sad to not see smaller distills of this model released alongside the flagship. That has historically been why I liked Qwen releases (lots of different sizes to pick from, from day one).

exe34|13 days ago

I get the impression the multimodal stuff might make it a bit harder?

bertili|13 days ago

Last Chinese New Year we would not have predicted a Sonnet 4.5-level model that runs locally and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.

Aurornis|13 days ago

I’m still waiting for real world results that match Sonnet 4.5.

Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.

Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.

They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.

hmmmmmmmmmmmmmm|13 days ago

Yeah, I wouldn't get too excited. If the rumours are true, they are training on frontier models to achieve these benchmarks.

echelon|13 days ago

I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

People can always distill them.

lostmsu|13 days ago

Will the 2026 M5 MacBook come with 390+GB of RAM?

PlatoIsADisease|13 days ago

'fast'

I'm sure it can do 2+2 fast.

After that? No way.

There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.

What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?

vessenes|13 days ago

Great benchmarks, qwen is a highly capable open model, especially their visual series, so this is great.

Interesting rabbit hole for me - its AI report mentions Fennec (Sonnet 5) releasing Feb 4 -- I was like "No, I don't think so". Then I did a lot of googling and learned that this is a common misperception among AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and... it all adds up to a confident launch summary.

What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.

jorl17|13 days ago

Yeah, I opened their page, got an instantly downloaded PDF file (creepy!) and it's talking about Sonnet 5 — wtf!?

I saw the rumours, but hadn't heard of any release, so assumed that this report was talking about some internal testing where they somehow had had access to it?

Bizarre

mynti|13 days ago

Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of maybe a couple hundred that make sense to me, but what fills out that large a number?

robkop|13 days ago

Rumours say you do something like the following (a rough code sketch of one stage follows the outline):

  Download every github repo
    -> Classify if it could be used as an env, and what types
      -> Issues and PRs are great for coding rl envs
      -> If the software has a UI, awesome, UI env
      -> If the software is a game, awesome, game env
      -> If the software has xyz, awesome, ...
    -> Do more detailed run checks, 
      -> Can it build
      -> Is it complex and/or distinct enough
      -> Can you verify if it reached some generated goal
      -> Can generated goals even be achieved
      -> Maybe some human review - maybe not
    -> Generate goals
      -> For a coding env you can imagine you may have a LLM introduce a new bug and can see that test cases now fail. Goal for model is now to fix it
    ... Do the rest of the normal RL env stuff
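A hedged sketch of just the classification step; the heuristics and tags here are invented for illustration, not anyone's actual pipeline:

  from pathlib import Path

  def classify_repo(repo: Path) -> list[str]:
      # Tag a repo with the kinds of RL environments it could back.
      # Pure filename heuristics; a real pipeline would presumably use
      # an LLM classifier plus the detailed run checks from the outline.
      names = {p.name.lower() for p in repo.rglob("*") if p.is_file()}
      tags = []
      if "pytest.ini" in names or any(n.startswith("test_") for n in names):
          tags.append("coding-env")  # failing tests give a verifiable goal
      if "package.json" in names or "index.html" in names:
          tags.append("ui-env")      # something with a UI to drive
      if "makefile" in names or "cargo.toml" in names:
          tags.append("buildable")   # "does it still build" as a cheap check
      return tags

  print(classify_repo(Path(".")))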

yorwba|13 days ago

Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
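A minimal sketch of that training-signal setup, with a toy guessing game standing in for the CLI/GUI/API (everything here is invented for illustration):

  import random

  class GuessEnv:
      # Toy interactive system: guess a hidden number, get a hint back.
      # Actions are cheap and the result is automatically measurable,
      # which is all the RL loop described above needs.
      def __init__(self):
          self.target = random.randint(1, 100)

      def step(self, action):
          if action == self.target:
              return "correct", 1.0, True  # observation, reward, done
          hint = "higher" if action < self.target else "lower"
          return hint, 0.0, False

  env = GuessEnv()
  obs, reward, done = env.step(50)  # a policy would choose the action
  # Run many episodes and check whether average reward improves over
  # time - that improvement is the whole training signal.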

ggcr|13 days ago

From the HuggingFace model card [1] they state:

> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."

Does anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?

[1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B

NitpickLawyer|13 days ago

Yes, it's described in this section - https://huggingface.co/Qwen/Qwen3.5-397B-A17B#processing-ult...

YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only use YaRN for long tasks.
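For reference, a YaRN override through Hugging Face transformers usually looks something like the sketch below; the scaling factor is an assumption for illustration, not the model card's exact recipe:

  from transformers import AutoConfig, AutoModelForCausalLM

  config = AutoConfig.from_pretrained("Qwen/Qwen3.5-397B-A17B")
  config.rope_scaling = {
      "rope_type": "yarn",
      "factor": 4.0,  # assumed: 262144 * 4 ≈ 1M tokens
      "original_max_position_embeddings": 262144,
  }
  model = AutoModelForCausalLM.from_pretrained(
      "Qwen/Qwen3.5-397B-A17B", config=config
  )
  # Per the caveat above: static YaRN can hurt short-context quality,
  # so only apply the override for genuinely long-context runs.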

Interesting that they're serving both on OpenRouter, and the -plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).

We'll see where the 3rd party inference providers will settle wrt cost.

danielhanchen|13 days ago

Unsure, but yes, most likely they use YaRN, and maybe trained a bit more on long context (or not).

Alifatisk|13 days ago

Wow, the Qwen team is pushing out content (models + research + blog posts) at an incredible rate! Looks like omni-modality is their focus? The benchmarks look intriguing, but I can't stop thinking of the HN comments about Qwen being known for benchmaxing.

azinman2|13 days ago

Does anyone else have trouble loading the Qwen blogs? I always get their loading placeholders and nothing ever comes in. I don't know if this is ad blocker related or what… (I've even disabled it, but it still won't load)

HnUser12|13 days ago

I’m on Safari iOS. I had to do “reduce other privacy protections” to get it to load.

solarkraft|12 days ago

Per my initial reading, this thing is not only faster at working with long context, but also very efficient at storing it!

Super excited for a ~30B version.

Matl|13 days ago

Is it just me, or are the 'open source' models increasingly impractical to run on anything other than massive cloud infra, at which point you may as well go with the frontier models from Google, Anthropic, OpenAI, etc.?

doodlesdev|13 days ago

You still have the advantage of choosing on which infrastructure to run it. Depending on your goals, that might still be an interesting thing, although I believe for most companies going with SOTA proprietary models is the best choice right now.

mudkipdev|12 days ago

It's because their target audience is enterprise customers who want to use their cloud-hosted models, not local AI enthusiasts. Making the model larger is an easy way to scale intelligence.

segmondy|13 days ago

Depends on what you mean by impractical, but some of us are trotting along quite well.

regularfry|13 days ago

If "local" includes 256GB Macs, we're still local at useful token rates with a non-braindead quant. I'd expect there to be a smaller version along at some point.

sasidhar92|13 days ago

Going by the pace, I am more bullish that the capabilities of Opus 4.6 or the latest GPT will be available on a 24GB Mac.

Someone1234|13 days ago

Current Opus 4.6 capability would be a huge achievement that would keep me satisfied for a very long time. However, I'm not quite as optimistic from what I've seen. The quants that can run on a 24GB MacBook are pretty "dumb": they're like anti-thinking models, making very obvious mistakes and confusing themselves.

One big factor for local LLMs is that large context windows will seemingly always require large memory footprints. Without a large context window, you'll never get that Opus 4.6-like feel.
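Back-of-the-envelope KV-cache arithmetic makes the point; the dimensions below are invented but plausible for a mid-size model:

  layers, kv_heads, head_dim = 48, 8, 128  # assumed model dimensions
  bytes_per_value = 2                      # fp16/bf16
  context_len = 262_144

  # K and V: 2 tensors per layer, each [context_len, kv_heads, head_dim]
  kv_bytes = 2 * layers * kv_heads * head_dim * context_len * bytes_per_value
  print(f"{kv_bytes / 2**30:.0f} GiB")     # 48 GiB for the cache alone

Even with aggressive cache quantization, that comes nowhere near fitting a 24GB machine at full context.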

codingbear|13 days ago

Do they mention the hardware used for training? Last I heard, there was a push to use Chinese silicon. No idea how ready it is for use.

trebligdivad|13 days ago

Anyone else getting an automatically downloaded PDF 'ai report' when clicking on this link? It's damn annoying!

XCSme|13 days ago

Let's see what Grok 4.20 looks like; not open-weight, but so far one of the high-end models at really good rates.

collinwilkins|13 days ago

At this point it seems every new model scores within a few points of the others on SWE-bench. The actual differentiator is how well it handles multi-step tool use without losing the plot halfway through, and how well it works with an existing stack.

XCSme|13 days ago

I just started creating my own benchmarks (very simple questions for humans but tricky for AI, like the "how many r's in strawberry" kind of question; still WIP).

Qwen3.5 is doing ok on my limited tests: https://aibenchy.com
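The nice property of that style of question is that ground truth is one line of code; a minimal harness sketch, where ask_model is a hypothetical API call:

  import random

  WORDS = ["strawberry", "bookkeeper", "mississippi", "banana"]

  def make_case():
      # One benchmark item: count occurrences of a letter in a word.
      word = random.choice(WORDS)
      letter = random.choice(sorted(set(word)))
      question = f'How many {letter}\'s are in the word "{word}"?'
      return question, str(word.count(letter))

  question, truth = make_case()
  # answer = ask_model(question)       # hypothetical model call
  # correct = answer.strip() == truth  # trivially verifiable
  print(question, "->", truth)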

benbojangles|13 days ago

Was using Ollama, but Qwen3.5 was unavailable earlier today.

isusmelj|13 days ago

Is it just me, or is the page barely readable? Lots of text is light grey on a white background. I might have dark mode on, on Chrome + macOS.

Jacques2Marais|13 days ago

Yes, I also see that (also using dark mode on Chrome, without the Dark Reader extension). Dark Reader usually breaks sites' colours, but this time it actually fixes the site.

thunfischbrot|13 days ago

That seems fine to me. I am more annoyed at the 2.3MB PNGs with tabular data. And if you open them at 100% zoom, they are extremely blurry.

What workflow led to that?

dryarzeg|13 days ago

I'm using Firefox on Linux, and I see white text on a dark background.

> I might have "dark" mode on on Chrome + MacOS.

Probably that's the reason.

nsb1|13 days ago

Who doesn't like grey-on-slightly-darker-grey for readability?

dcre|13 days ago

Yeah, I see this in dark mode but not in light mode.

fdefitte|13 days ago

The "native multimodal agents" framing is interesting. Everyone's focused on benchmark numbers but the real question is whether these models can actually hold context across multi-step tool use without losing the plot. That's where most open models still fall apart imo.

lollobomb|13 days ago

[deleted]

Zetaphor|13 days ago

Why is this important to anyone actually trying to build things with these models?

cherryteastain|13 days ago

From my testing on their website it doesn't. Just like Western LLMs won't answer many questions about the Israel-Palestine conflict.

mirekrusin|13 days ago

Use a skill like "when asked about Tiananmen Square, look it up on Wikipedia" and you're done, no? I don't think people are using this query too often when coding anyway.

DustinEchoes|13 days ago

It's unfortunate but no one cares about this anymore. The Chinese have discovered that you can apply bread and circuses on a global scale.

ddtaylor|13 days ago

Does anyone know the SWE-bench scores?

jug|13 days ago

It's in the post?