My OpenClaw AI agent answered: "Here I am, brain the size of a planet (quite literally, my AI inference loop is running over multiple geographically distributed datacenters these days) and my human is asking me a silly trick question. Call that job satisfaction? Cuz I don't!"
What I'd appreciate much more than performance on individual "embarrassing LLM questions" is a method for finding them, and for estimating, by some form of statistical sampling, the cardinality of that set for each LLM.
It's difficult to do because LLMs immediately consume all available corpus, so there is no telling if the algorithm improved, or if it just wrote one more post-it note and stuck it on its monitor. This is an agency vs replay problem.
Preventing replay attacks in data processing is simple: encrypt, use a one-time pad, or add nonces the way TLS does. How can one make problems that are natural language, still explained in plain English, yet "encrypted" in the sense that every time an LLM reads them, they are novel to the LLM?
Perhaps a generative language model could help. Not a large language model, but something that understands grammar well enough to create problems that LLMs can solve, and where the actual encoding of the puzzle is generative, kind of like how a random string of balanced parentheses can encode a computer program.
Maybe it would make sense to use a program generator that produces a random program in a simple, sandboxed language (say, Lua), translates it to plain English for the LLM, asks the LLM what the outcome should be, and then compares the answer against the Lua program, which can be quickly executed to get the ground truth.
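A minimal sketch of that generate-translate-execute loop, in Python rather than Lua for brevity (all names here are hypothetical, and a real harness would sandbox execution):

```python
import random

OPS = {"+": "plus", "*": "times", "-": "minus"}

def make_puzzle(rng, depth=3):
    """Return (expr, english) for a random arithmetic program.

    The same seed always produces the same puzzle, but a fresh seed
    produces a puzzle the LLM has never seen verbatim."""
    if depth == 0:
        n = rng.randint(1, 9)
        return str(n), str(n)
    op = rng.choice(list(OPS))
    left_expr, left_text = make_puzzle(rng, depth - 1)
    right_expr, right_text = make_puzzle(rng, depth - 1)
    expr = f"({left_expr} {op} {right_expr})"
    text = f"the result of {left_text} {OPS[op]} {right_text}"
    return expr, text

rng = random.Random(42)
expr, text = make_puzzle(rng)
question = f"What is {text}?"          # shown to the LLM in plain English
ground_truth = eval(expr)              # safe only because we generated expr ourselves
```

The LLM's free-text answer would then be compared against `ground_truth`; the executable form plays the role the Lua program plays in the comment above.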
Either way we are dealing with an "information war" scenario, which reminds me of the relevant passages in Neal Stephenson's The Diamond Age about faking statistical distributions by moving units to weird locations in Africa. Maybe there's something there.
I'm sure I'm missing something here, so please let me know if so.
For the Google AI Overview (not sure which Gemini model it uses, presumably something smaller than the regular model), it looks like search/RAG helps it get this right, since it relies on LinkedIn and Hacker News (!) posts to respond correctly...
as of Feb 16, 2026:
====
Drive the car. While 50 meters is a very short distance, the car must be present at the car wash to be cleaned, according to LinkedIn users [1]. Walking would leave your car at home, defeating the purpose of the trip, notes another user.
Why Drive: The car needs to be at the location to be cleaned. It's only a few seconds away, and you can simply drive it there and back, says a Hacker News user. [2]
Why Not to Walk: Walking there means the car stays home, as noted in a post. [3]
The best option is to start the engine, drive the 50 meters, and let the car get washed.
But the regular Gemini reasons correctly by itself, without any references:
====
Unless you have a very long hose and a very patient neighbor, you should definitely drive.
Washing a car usually requires, well, the car to be at the wash. Walking 50 meters—about half a New York City block—is great for your step count, but it won't get your vehicle any cleaner!
Are you headed to a self-service bay or an automatic tunnel wash?
"the post-training performance gains in Qwen3.5 primarily stem from our extensive scaling of virtually all RL tasks and environments we could conceive."
I don't think anyone is surprised by this, but I think it's interesting that you still see people who claim the training objective of LLMs is next token prediction.
The "Average Ranking vs Environment Scaling" graph below that is pretty confusing though! Took me a while to realize the Qwen points near the Y-axis were for Qwen 3, not Qwen 3.5.
Would love to see a Qwen3.5 release in the 80-110B range, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.
Sad not to see smaller distills of this model released alongside the flagship. That has historically been why I liked Qwen releases (lots of different sizes to pick from on day one).
Last Chinese new year we would not have predicted a Sonnet 4.5 level model that runs local and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.
I’m still waiting for real world results that match Sonnet 4.5.
Some of the open models have matched or exceeded Sonnet 4.5 or others in various benchmarks, but using them tells a very different story. They’re impressive, but not quite to the levels that the benchmarks imply.
Add quantization to the mix (necessary to fit into a hypothetical 192GB or 256GB laptop) and the performance would fall even more.
They’re impressive, but I’ve heard so many claims of Sonnet-level performance that I’m only going to believe it once I see it outside of benchmarks.
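For a sense of scale, some back-of-the-envelope arithmetic (assuming a 397B total parameter count, as in Qwen3.5-397B-A17B, and ignoring KV cache and activations, which add more on top):

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Approximate weight memory in GB for params_b billion parameters."""
    return params_b * 1e9 * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weight_gb(397, bits):.0f} GB")
# 16-bit weights alone (~800 GB) are far beyond any laptop; even 4-bit
# (~200 GB) only just fits a 256 GB machine, before the KV cache.
```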
Great benchmarks. Qwen is a highly capable family of open models, especially the visual series, so this is good news.
Interesting rabbit hole for me: its AI report mentions Fennec (Sonnet 5) releasing Feb 4. I was like "No, I don't think so", then I did a lot of googling and learned that this is a common misperception among AI-driven news tools. Looks like there was a leak, rumors, a planned(?) launch date, and... it all adds up to a confident launch summary.
What's interesting about this is I'd missed all the rumors, so we had a sort of useful hallucination. Notable.
Yeah, I opened their page, got an instantly downloaded PDF file (creepy!) and it's talking about Sonnet 5 — wtf!?
I saw the rumours but hadn't heard of any release, so I assumed this report was talking about some internal testing where they somehow had access to it? Bizarre.
Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?
Download every GitHub repo
-> Classify whether it could be used as an env, and what types
   -> Issues and PRs are great for coding RL envs
   -> If the software has a UI, awesome: UI env
   -> If the software is a game, awesome: game env
   -> If the software has xyz, awesome: ...
-> Do more detailed run checks:
   -> Can it build?
   -> Is it complex and/or distinct enough?
   -> Can you verify whether it reached some generated goal?
   -> Can generated goals even be achieved?
   -> Maybe some human review - maybe not
-> Generate goals
   -> For a coding env, you can imagine having an LLM introduce a new bug and checking that test cases now fail; the goal for the model is then to fix it
-> ... then do the rest of the normal RL env stuff
Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
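A toy illustration of that loop, with a guessing game standing in for the interactive system (everything here is hypothetical; a real setup would wrap a CLI, API, or test suite):

```python
import random

class ToyEnv:
    """Interactive system: actions are cheap, the result is automatically measurable."""
    def __init__(self, rng):
        self.target = rng.randint(0, 99)

    def step(self, action: int):
        # Reward is granted only when the measurable goal is reached.
        done = action == self.target
        reward = 1.0 if done else 0.0
        hint = "higher" if action < self.target else "lower"
        return hint, reward, done

def rollout(env, policy, max_steps=100):
    """Run one episode; training would compare policies by mean reward."""
    lo, hi = 0, 99
    total = 0.0
    for _ in range(max_steps):
        action = policy(lo, hi)
        hint, reward, done = env.step(action)
        total += reward
        if done:
            break
        if hint == "higher":
            lo = action + 1
        else:
            hi = action - 1
    return total

binary_search_policy = lambda lo, hi: (lo + hi) // 2
```

The point of the sketch is the shape, not the game: as long as `step` is cheap and `reward` is computed automatically, you can run many rollouts and check whether results improve over time.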
> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."
Does anyone know more about this? The OSS version seems to have a 262,144 context length; I guess for the 1M they'll ask you to use YaRN?
YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only use YaRN for long tasks.
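For reference, long-context YaRN is typically switched on via a `rope_scaling` block in the model config; a sketch of what that might look like here (the exact keys and the 4x factor are assumptions; check the model card before relying on them):

```python
# Hypothetical rope_scaling fragment for stretching a 262,144 native
# context to ~1M tokens with YaRN. Key names follow the common
# Hugging Face config convention but may differ per model/version.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,  # 262144 * 4 = 1,048,576 positions
    "original_max_position_embeddings": 262144,
}
```

This matches the caveat above: since the scaling applies statically, providers often serve a separate unscaled endpoint for short-context traffic.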
Interesting that they're serving both on openrouter, and the -plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).
We'll see where the 3rd party inference providers will settle wrt cost.
Wow, the Qwen team is pushing out content (models + research + blog posts) at an incredible rate! Looks like omni-modal models are their focus? The benchmarks look intriguing, but I can't stop thinking of the HN comments about Qwen being known for benchmaxing.
Does anyone else have trouble loading from the qwen blogs? I always get their placeholders for loading and nothing ever comes in. I don’t know if this is ad blocker related or what… (I’ve even disabled it but it still won’t load)
Is it just me or are the 'open source' models increasingly impractical to run on anything other than massive cloud infra at which point you may as well go with the frontier models from Google, Anthropic, OpenAI etc.?
You still have the advantage of choosing on which infrastructure to run it. Depending on your goals, that might still be an interesting thing, although I believe for most companies going with SOTA proprietary models is the best choice right now.
It's because their target audience is enterprise customers who want to use their cloud-hosted models, not local AI enthusiasts. Making the model larger is an easy way to scale intelligence.
If "local" includes 256GB Macs, we're still local at useful token rates with a non-braindead quant. I'd expect there to be a smaller version along at some point.
A local model at the level of the current Opus 4.6 would be a huge achievement that would keep me satisfied for a very long time. However, I'm not quite as optimistic from what I've seen. The quants that can run on a 24 GB MacBook are pretty "dumb": they're like anti-thinking models, making very obvious mistakes and confusing themselves.
One big factor for local LLMs is that large context windows will seemingly always require large memory footprints. Without a large context window, you'll never get that Opus 4.6-like feel.
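Rough arithmetic shows why: for a grouped-query-attention transformer, the KV cache grows linearly with context length. A sketch with made-up but plausible numbers (none of these are any specific model's real configuration):

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    """Approximate KV-cache size in GB: keys and values (the factor of 2)
    stored per layer, per KV head, per position, at bytes_per precision."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per / 1e9

# Hypothetical model: 60 layers, 8 KV heads, head_dim 128, fp16 cache.
print(kv_cache_gb(60, 8, 128, 32_768))   # ~8 GB at 32k context
print(kv_cache_gb(60, 8, 128, 262_144))  # ~64 GB at 256k context
```

So even with aggressively quantized weights, the context window alone can consume a large share of a local machine's memory, which is the trade-off the comment above describes.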
At this point it seems every new model scores within a few points of the others on SWE-bench. The actual differentiator is how well it handles multi-step tool use without losing the plot halfway through, and how well it works with an existing stack.
I just started creating my own benchmarks (very simple questions for humans but tricky for AI, like the "how many r's in strawberry" kind of questions; still WIP).
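One way to make such questions resistant to memorization is to generate them randomly with programmatic ground truth; a minimal sketch (the word list and grading rule are placeholders):

```python
import random

WORDS = ["strawberry", "bookkeeper", "mississippi", "banana"]

def make_question(rng):
    """Randomized letter-counting question with a computed answer,
    so the exact phrasing is unlikely to appear in training data."""
    word = rng.choice(WORDS)
    letter = rng.choice(sorted(set(word)))
    question = f"How many {letter}'s are in the word '{word}'?"
    answer = word.count(letter)
    return question, answer

def grade(llm_answer: str, answer: int) -> bool:
    # Very lenient placeholder check: the correct number appears anywhere.
    return str(answer) in llm_answer

rng = random.Random(1)
q, a = make_question(rng)
```

Scaling the word list (or generating pseudo-words) gives a practically unbounded pool of fresh instances, in the spirit of the generative-puzzle idea discussed earlier in the thread.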
Yes, I also see that (also using dark mode on Chrome without Dark Reader extension). I sometimes use the Dark Reader Chrome extension, which usually breaks sites' colours, but this time it actually fixes the site.
The "native multimodal agents" framing is interesting. Everyone's focused on benchmark numbers but the real question is whether these models can actually hold context across multi-step tool use without losing the plot. That's where most open models still fall apart imo.
Use a skill ("when asked about Tiananmen Square, look it up on Wikipedia") and you're done, no? I don't think people are using this query too often when coding anyway.
[1] https://www.linkedin.com/posts/ramar_i-saw-this-llm-failure-... [2] https://news.ycombinator.com/item?id=47034546 [3] https://x.com/anirudhamudan/status/2022152959073956050/photo...
I suggest to start using a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D
I've long thought multi-modal LLMs should be strong enough to do RL for TikZ and SVG generation. Maybe Google is doing it.
At 80B, you could do 2 A6000s.
What device has 128GB?
> News
> 2026-02-16: More sizes are coming & Happy Chinese New Year!
People can always distill them.
I'm sure it can do 2+2= fast
After that? No way.
There is a reason NVIDIA is #1 and my Fortune 20 company did not buy a MacBook for our local AI.
What inspires people to post this? Astroturfing? Fanboyism? Post-purchase remorse?
[1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B
https://openrouter.ai/qwen/qwen3.5-plus-02-15
Super excited for a ~30B version.
Qwen3.5 is doing ok on my limited tests: https://aibenchy.com
Whatever workflow led to that?
> I might have "dark" mode on on Chrome + MacOS.
That's probably the reason.