
Show HN: Only 1 LLM can fly a drone

180 points | beigebrucewayne | 1 month ago | github.com

92 comments


avaer|1 month ago

Gemini 3 is the only model I've found that can reason spatially. The results here are consistent with my experiments with putting LLM NPCs in simulated worlds.

I was surprised that most VLMs cannot reliably tell if a character is facing left or right; they will confidently lie no matter what you do (even Gemini 3 cannot do it reliably). I guess it's just not in the training data.

That said, Qwen3VL models are smaller/faster and better "spatially grounded" in pixel space, because pixel coordinates are encoded in the tokens. So you can use them for detecting things in the scene, and where they are (which you can project to 3D space if you are running a sim). But they are not good reasoning models, so don't ask them to think.

That means the best pipeline I've found at the moment is to tack a dumb detection prepass on before your action reasoning. This basically turns 3D sims into 1D text sims operating on labels -- which is something that LLMs are good at.
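Concretely, the prepass just serializes detector output into labels the reasoning model can operate on. A minimal sketch of the idea (assuming you already have world-space detections from the VLM; all names here are made up):

```python
# Hypothetical detection prepass: a VLM returns detections, which we project
# into the sim and serialize as plain-text labels for the reasoning model.

def detections_to_text(detections, agent_pos):
    """Turn (label, x, y, z) world-space detections into a 1D text state."""
    lines = []
    for label, x, y, z in detections:
        dx, dy, dz = x - agent_pos[0], y - agent_pos[1], z - agent_pos[2]
        dist = (dx * dx + dy * dy + dz * dz) ** 0.5
        heading = "ahead" if dz > 0 else "behind"
        lines.append(f"{label}: {dist:.1f}m {heading}")
    return "\n".join(lines)

state = detections_to_text([("creature", 3.0, 0.0, 4.0)], (0.0, 0.0, 0.0))
print(state)  # creature: 5.0m ahead
```

The text state then goes into the action-reasoning prompt, so the LLM never has to touch raw pixels or coordinates.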

general_reveal|1 month ago

We just need to fine tune these models on Ocarina of Time Water Temple - spatial reasoning solved.

storystarling|1 month ago

I suspect the latency on Gemini 3 makes it non-viable for a real-time control loop though. Even if the reasoning works, the input token costs would destroy the unit economics pretty quickly. I'd be worried about relying on that kind of API overhead for the critical path.

Krutonium|1 month ago

Neuro-sama, the V-Tuber/AI, actually does a decent job of it. Vedal seems to have cooked and figured out how to make an LLM move reasonably well in VRChat.

Not perfectly, there's a lot of abuse of gravity or the lack thereof, but yeah. Neuro has also piloted a robot dog in the past.

modeless|1 month ago

This is what VLA models are for. They would work much better. Would need a bit of fine tuning but probably not much. Lots of literature out there on using VLAs to control drones.

volkercraig|1 month ago

I don't understand. Surely training an LSTM with sensor input is a more practical and reasonable way than trying to get a text generator to speak commands to a drone.

encrux|1 month ago

Very much depends on what you want to do.

The fact that a language model can "reason" (in the LLM-slang meaning of the term) about 3D space is an interesting property.

If you give a text description of a scene and ask a robot to perform a peg-in-hole task, modern models are able to solve it fairly easily based on movement primitives. I implemented this on a UR robot arm back in 2023.

The next logical step is to have the model output tokens in action space, instead of text (code representing movement primitives). This is what models like pi0 are doing.
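The core trick behind action tokens is just discretization: each continuous action dimension gets binned so the model can emit actions as ordinary tokens. A toy sketch of the general idea (not pi0's actual tokenizer):

```python
# Illustrative action tokenization: discretize each action dimension in
# [low, high] into n_bins bins so an autoregressive model can emit actions
# as token ids instead of free text.

def action_to_tokens(action, low=-1.0, high=1.0, n_bins=256):
    """Map each action dimension to its nearest bin index (token id)."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)  # clamp into range
        tokens.append(int((a - low) / (high - low) * (n_bins - 1) + 0.5))
    return tokens

def tokens_to_action(tokens, low=-1.0, high=1.0, n_bins=256):
    """Inverse map: bin index back to the bin's representative value."""
    return [low + t / (n_bins - 1) * (high - low) for t in tokens]

print(action_to_tokens([0.0, 1.0]))  # [128, 255]
```

With 256 bins per dimension the quantization error is under 1% of the action range, which is typically fine for downstream low-level control.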

broast|1 month ago

On the discussion of the right or wrong tool, I find it possible that the ability to reason towards a goal is more valuable in the long run than an intrinsic ability to achieve the same result. Or maybe a mix of both is the ideal.

dimatura|1 month ago

This is neat! It's a bit amusing, since I worked on a somewhat similar project for my PhD thesis almost 10 years ago, although in that case we got it working on a real drone (heavily customized, based on a DJI Matrice) in the field, with only onboard compute. Back then it was just a fairly lightweight CNN for the perception, not that we could've gotten much more out of the Jetson TX2.

bigfishrunning|1 month ago

Why would you want an LLM to fly a drone? Seems like the wrong tool for the job -- it's like saying "Only one power drill can pound roofing nails". Maybe that's true, but just get a hammer.

notepad0x90|1 month ago

There are almost endless reasons why. It's like asking why would you want a self-driving car. Having a drone to transport things would be amazing, or to patrol an area. LLMs can be helpful with object identification, reacting to different events, and taking commands from users.

The first thought I had was those security guard robots that are popping up all over the place. If they were drones instead, and an LLM talked to people asking them to do/not-do things, that would be an improvement.

Or a waiter drone that takes your order in a restaurant, flies to the kitchen, picks up a sealed and secured food container, flies it back to the table, opens it, and leaves. It would monitor for gestures and voice commands to respond to diners and get their feedback (or abuse), take the food back if it isn't satisfactory, etc.

This is the type of stuff we used to see in futuristic movies. It's almost possible now. Glad to see this kind of tinkering.

munchler|1 month ago

Because we’re interested in AGI (emphasis on general) and LLMs are the closest thing to AGI that we have right now.

pavlov|1 month ago

Yeah, it feels a bit like asking "which typewriter model is the best for swimming".

avaer|1 month ago

Using an LLM is the SOTA way to turn plain text instructions into embodied world behavior.

Charitably, I guess you can question why you would ever want to use text to command a machine in the world (simulated or not).

But I don't see how it's the wrong tool given the goal.

dan-bailey|1 month ago

When your only tool is a hammer, every problem begins to resemble a nail.

Mashimo|1 month ago

> Why would you want an LLM to fly a drone?

We are on HACKER news. Using tools outside the scope is the ethos of a hacker.

smw1218|1 month ago

It's a great feature to tell my drone to do a task in English. Like "a child is lost in the woods around here. Fly a search pattern to find her" or "film a cool panorama of this property. Be sure to get shots of the water feature by the pool." While LLMs are bad at flying, better navigation models likely can't be prompted in natural language yet.

bob1029|1 month ago

The system prompt for the drone is hilarious to me. These models are horrible at spatial reasoning tasks:

https://github.com/kxzk/snapbench/blob/main/llm_drone/src/ma...

I've been working with integrating GPT-5.2 in Unity. It's fantastic at scripting but completely worthless at managing transforms for scene objects. Even with elaborate planning phases it's going to make a complete jackass of itself in world space every time.

LLMs are also wildly unsuitable for real-time control problems, and likely always will be. A PID controller or dedicated pathfinding tool being driven by the LLM will provide a radically superior result.
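The split being described is the LLM picking a goal at low frequency while a plain PID controller handles the real-time loop. A minimal sketch (toy plant model, gains chosen only for illustration):

```python
# LLM-sets-setpoint / PID-closes-the-loop split: the LLM decides *what*
# to do ("climb to 10m"); the PID tracks that setpoint at control rate.

class PID:
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measured, dt):
        error = setpoint - measured
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

# Pretend the LLM emitted "climb to 10m" once; the PID runs at 20 Hz.
pid = PID(kp=1.2, ki=0.1, kd=0.3)
altitude = 0.0
for _ in range(200):  # 10 seconds of simulated control
    thrust = pid.step(setpoint=10.0, measured=altitude, dt=0.05)
    altitude += thrust * 0.05  # toy first-order plant
```

The controller converges to the setpoint in a few seconds regardless of how slow or expensive the LLM call is, because the LLM is entirely out of the inner loop.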

infecto|1 month ago

What’s the right tool then?

This looks like a pretty fun project and in my rough estimation a fun hacker project.

ralusek|1 month ago

Why would you want an LLM to identify plants and animals? Well, they're often better than bespoke image classification models at doing just that. Why would you want a language model to help diagnose a medical condition?

It would not surprise me at all if self-driving models are adopting a lot of the model architecture from LLMs/generative AI, and actually invoking LLMs in moments where they would've needed human intervention.

Imagine if there's a decision engine at the core of a self-driving model, and it gets a classification result of what to do next. Suddenly it gets 3 options back with 33.33% weight attached to each of them and very low confidence about which is the best choice. Maybe that's the kind of scenario that used to trigger self-driving to refuse to choose and defer to human intervention. If that can instead first defer judgement to an LLM which could say "that's just a goat crossing the road, INVOKE: HONK_HORN," you could imagine how that might be useful. LLMs are clearly proving to be universal reasoning agents, and it's getting tiring to hear people continuously try to reduce them to "next word predictors."

peterpost2|1 month ago

Did you read his post?

He answers your question.

accrual|1 month ago

I think it's fascinating work even if LLMs aren't the ideal tool for this job right now.

There were some experiments with embodied LLMs on the front page recently (e.g. basic robot body + task) and SOTA models struggled with that too. And of course they would - what training data is there for embodying a random device with arbitrary controls and feedback? They have to lean on the "general" aspects of their intelligence which is still improving.

With dedicated embodiment training and an even tighter/faster feedback loop, I don't see why an LLM couldn't successfully pilot a drone. I'm sure some will still fall off the rails, but software guardrails could help by preventing certain maneuvers.

calchiwo|1 month ago

The detection prepass plus text reasoning pipeline is effectively a perception to symbol translation layer, and that is where most of the brittleness will hide. Once you collapse a continuous 3D scene into discrete labels, you lose uncertainty, relative geometry, and temporal consistency unless you explicitly model them. The LLM then reasons over a clean but lossy world model, so action quality is capped by what the detector chose to surface.

The failure mode is not just missed objects, it is state aliasing. Two physically different scenes can map to the same label set, especially with occlusion, depth ambiguity, or near boundary conditions. In control tasks like drone navigation, that can produce confident but wrong actions because the planner has no access to the underlying geometry or sensor noise. Error compounds over time since each step re-anchors on an already simplified state.
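State aliasing is easy to demonstrate: once geometry is discarded, two physically different scenes can serialize to the exact same label set. A toy illustration (made-up scenes and helper):

```python
# Two scenes with very different geometry collapse to the same label set
# once the lossy perception-to-symbol step drops positions.

def labelize(scene):
    """Keep only object labels, discarding positions (the lossy step)."""
    return sorted(label for label, _pos in scene)

scene_a = [("creature", (2.0, 0.0, 5.0)), ("tree", (1.0, 0.0, 3.0))]
scene_b = [("creature", (-9.0, 4.0, 1.0)), ("tree", (0.0, 0.0, 40.0))]

# The planner sees identical text states, so it cannot distinguish them.
assert labelize(scene_a) == labelize(scene_b)  # ['creature', 'tree']
```

Any action policy conditioned only on `labelize` output must pick the same action in both scenes, even when the correct actions differ.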

Are you carrying forward any notion of uncertainty or temporal tracking from the vision stage, or is each step a stateless label snapshot fed to the reasoning model?

fsiefken|1 month ago

I am curious how these models would perform, and how much energy they'd take, for semi-realtime object detection: SmolVLM2-500M - Moondream 0.5B/2B/2.5B - Qwen3-VL (3B) https://huggingface.co/collections/Qwen/qwen3-vl

I am sure this is already being worked on in Russia, Ukraine and the Netherlands. A lot can go wrong with autonomous flying. One could load the VLM on a high-end Android phone on the drone and have dual control.

Bender|1 month ago

LLMs seem like the wrong platform to operate a drone, in my opinion. I would expect that to be something more like a gaming engine. It should be small, simple, low latency and maybe based on a first-person shooter running on insane difficulty. Small enough to fit in a tiny firmware space. It should boot so fast the firmware could be upgraded mid-flight without missing a beat. Give it simple friend-or-foe and obliterate anything not green.

me551ah|1 month ago

In a real-world test you would have a tool call for the LLM which is a bit high level, like GoTo(object), and the tool calls another program which identifies the objects in frame and uses standard programs to go to that.
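That split is straightforward to wire up: the LLM only ever emits a high-level tool call, and a conventional perception/control stack does the rest. A sketch, with every name here hypothetical:

```python
# High-level tool-call split: the LLM emits go_to(object); a classical
# stack handles detection and path-following. Detector and navigator are
# stand-ins for real components (e.g. a YOLO-style model and a planner).

TOOLS = [{
    "name": "go_to",
    "description": "Fly toward a named object visible in the current frame.",
    "parameters": {
        "type": "object",
        "properties": {"object": {"type": "string"}},
        "required": ["object"],
    },
}]

def dispatch(call, detector, navigator):
    """Route an LLM tool call to the perception/control stack."""
    if call["name"] == "go_to":
        target = detector(call["arguments"]["object"])
        if target is None:
            return "object not found"
        return navigator(target)
    return "unknown tool"

result = dispatch(
    {"name": "go_to", "arguments": {"object": "creature"}},
    detector=lambda name: (3.0, 1.0, 5.0) if name == "creature" else None,
    navigator=lambda pos: f"arrived at {pos}",
)
print(result)  # arrived at (3.0, 1.0, 5.0)
```

The LLM never sees coordinates or issues low-level commands; it only picks targets, which is the part it is actually decent at.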

zahlman|1 month ago

> I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.

> Only one could do it.

If I understood the chart correctly, even the successful one only found 1/6 of the creatures across multiple runs.

uoaei|1 month ago

No science detected.

Without comparison to some null hypothesis (a random policy), this article is hogwash.

Havoc|1 month ago

I’m guessing Google's model has extensive Minecraft sandbox-mode YouTube vids in its training, which would match exactly this perspective.

andai|1 month ago

Gemini Flash beats Gemini Pro? How does that work?

Gemini Pro, like the other models, didn't even find a single creature.

arikrahman|1 month ago

Interesting. In some benchmarks I even see flash outperforming thinking in general reasoning.

kylehotchkiss|1 month ago

This sounds like a good way to get your drone shot down by a Concerned Citizen or the military.

SoftTalker|1 month ago

LLMs are trained on text. Why would we expect them to understand a visual and tactile 3D world?

azinman2|1 month ago

Because they’re also multimodal VLMs.

mbreese|1 month ago

I can’t really take this too seriously. This seems to me to be a case of asking “can an LLM do X?” Instead, the question I’d like to see is: “I want to do X, is an LLM the right tool?”

But that said, I think the author missed something. LLMs aren’t great at this type of reasoning/state task, but they are good at writing programs. Instead of asking the LLM to search with a drone, it would be very interesting to know how they performed if you asked them to write a program to search with a drone.

This is more aligned with the strengths of LLMs, so I could see this as having more success.
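For the drone task in the article, "write a program to search" might come out as something like a deterministic lawnmower sweep, with the detector invoked per waypoint. Purely an illustrative guess at what such generated code could look like:

```python
# Boustrophedon ("lawnmower") sweep over a width x depth area at fixed
# altitude: the kind of exhaustive search program an LLM could plausibly
# write instead of piloting step-by-step itself.

def lawnmower_waypoints(width, depth, spacing, altitude):
    """Return waypoints covering the area, alternating sweep direction."""
    waypoints = []
    x = 0
    while x <= width:
        row = list(range(0, depth + 1, spacing))
        if (x // spacing) % 2 == 1:
            row.reverse()  # reverse every other pass to avoid backtracking
        for z in row:
            waypoints.append((x, altitude, z))
        x += spacing
    return waypoints

path = lawnmower_waypoints(width=20, depth=20, spacing=10, altitude=5)
# A detector would then run once at each waypoint; coverage is guaranteed
# as long as spacing is within the detector's effective range.
```

Unlike the step-by-step piloting in the benchmark, this guarantees full coverage of the voxel world, so "found 1 of 6 creatures" failures become detector failures rather than navigation failures.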

antisthenes|1 month ago

LLMs flying weaponized drones is exactly how it starts.

popcornricecake|1 month ago

One day they'll fly to a drone factory, eliminate all the personnel, then start gently shooting at the machinery to create more weaponized drones and then it's all over before you know it!

SoftTalker|1 month ago

It's pretty entertaining seeing the plot lines and fictitious history in The Terminator movies actually happening in real time.