So I have been doing formal specification with TLA+ using AI assistance and it has been very helpful AFTER I REALIZED that quite often it was proving things that were either trivial or irrelevant to the problem at hand (and not the problem itself), but difficult to detect at a high level.
I realize formal verification with lean is a slightly different game but if anyone here has any insight, I tend to be extremely nervous about a confidently presented AI "proof" because I am sure that the proof is proving whatever it is proving, but it's still very hard for me to be confident that it is proving what I need it to prove.
Before the dog piling starts, I'm talking specifically about distributed systems scenarios where it is just not possible for a human to think through all the combinatorics of the liveness and safety properties without proof assistance.
I'm open to being wrong on this, but I think the skill of writing a proof and understanding the proof is different than being sure it actually proves for all the guarantees you have in mind.
I feel like closing this gap is make it or break it for using AI augmented proof assistance.
In my experience, finding the "correct" specification for a problem is usually very difficult for realistic systems. Generally it's unlikely that you'll be able to specify ALL the relevant properties formally. I think there's probably some facet of Kolmogorov complexity there; some properties probably cannot be significantly "compressed" in a way where the specification is significantly shorter and clearer than the solution.
But it's still usually possible to distill a few crucial properties that can be specified in an "obviously correct" manner. It takes A LOT of work (sometimes I'd be stuck for a couple of weeks trying to formalize a property). But in my experience the trade off can be worth it. One obvious benefit is that bugs can be pricey, depending on the system. But another benefit is that, even without formal verification, having a few clear properties can make it much easier to write a correct system, but crucially also make it easier to maintain the system as time goes by.
Yeah, even for simple things, it's surprisingly hard to write a correct spec. Or more to the point, it's surprisingly easy to write an incorrect spec and think it's correct, even under scrutiny, and so it turns out that you've proved the wrong thing.
This isn't to say it's useless; sometimes it helps you think about the problem more concretely and document it using known standards. But I'm not super bullish on "proofs" being the thing that keeps AI in line. First, like I said, they're easy to specify incorrectly, and second, they become incredibly hard to prove beyond a certain level of complexity. But I'll be interested to watch the space evolve.
(Note I'm bullish on AI+Lean for math. It's just the "provably safe AI" or "provably correct PRs" that I'm more skeptical of).
As a heavy user of formal methods, I think refinement types, instead of theorem proving with Lean or Isabelle, is both easier and more amenable to automation that doesn't get into these pitfalls.
It's less powerful, but easier to break down and align with code. Dafny and F* are two good showcases. Less power makes it also faster to verify and iterate on.
You have identified the crux of the problem, just like mathematics writing down the “right” theorem is often half or more of the difficulty.
In the case of digital systems it can be much worse because we often have to include many assumptions to accommodate the complexity of our models. To use an example from your context, usually one is required to assume some kind of fairness to get anything to go through with systems operating concurrently but many kinds of fairness are not realistic (eg strong fairness).
I was having the same intuition, but you verbalised it better: the notion of having a definitive yes/no answer is very attractive, but describing what you need in such terms using natural language, which is inherently ambiguous... that feels like a fool's errand. That's why I keep thinking that LLM usage for serious things will break down once we get to the truly complicated things: it's non-deterministic nature will be an unbreakable barrier. I wish I'm wrong, though.
Interesting. It's essentially the same idea as in this article: https://substack.com/home/post/p-184486153. In both scenarios, the human is relieved of the burden of writing complex formal syntax (whether Event-B or Lean 4). The human specifies intent and constraints in natural language, while the LLM handles the work of formalization and satisfying the proof engine.
But Lean 4 is significantly more rigid, granular, and foundational than e.g. Event-B, and they handle concepts like undefined areas and contradictions very differently. While both are "formal methods," they were built by different communities for different purposes: Lean is a pure mathematician's tool, while Event-B is a systems engineer's tool. Event-B is much more flexible, allowing an engineer (or the LLM) to sketch the vague, undefined contours of a system and gradually tighten the logical constraints through refinement.
LLMs are inherently statistical interpolators. They operate beautifully in an Open World (where missing information is just "unknown" and can be guessed or left vague) and they use Non-Monotonic Reasoning (where new information can invalidate previous conclusions). Lean 4 operates strictly on the Closed World Assumption (CWA) and is brutally Monotonic. This is why using Lean to model things humans care about (business logic, user interfaces, physical environments, dynamic regulations) quickly hits a dead end. The physical world is full of exceptions, missing data, and contradictions. Lean 4 is essentially a return to the rigid, brittle approach of the 1980s expert systems. Event-B (or similar methods) provides the logical guardrails, but critically, it tolerates under-specification. It doesn't force the LLM to solve the Frame Problem or explicitly define the whole universe. It just checks the specific boundaries the human cares about.
>> LLMs are inherently statistical interpolators. They operate beautifully in an Open World (where missing information is just "unknown" and can be guessed or left vague) and they use Non-Monotonic Reasoning (where new information can invalidate previous conclusions).
I think LLM reasoning is not so much non-monotonic as unsound, in the sense that conclusions do not necessarily follow from the premises. New information may change conclusions but how that happens is anyone's guess. There's some scholarship on that, e.g. there's a series of papers by Subarao Kamphampathi and his students that show how reasoning models' "thiking" tokens don't really correspond to sound reasoning chains, even if they seem to improve performance overall [1].
But it is difficult to tell what reasoning really means in LLMs. I believe the charitable interpretation of claims about LLM reasoning is that it is supposed to be informal. There is evidence both for and against it (e.g. much testing is in fact on formal reasoning problems, like math exam questions or Sokoban, but there's tests of informal reasoning also, e.g. on the bar exam). However, different interpretations are hard to square with the claims that "we don't understand reasoning"; not a direct quote, but I'm aware of many claims like that by people whose job it is to develop LLMs and that were made at the height of activity around reasoning models (which seems now to have been superseded by activity around "world models") [1].
If LLMs are really capable of informal reasoning (I'm not necessarily opposed to that idea) then we really don't understand what that reasoning is, but it seems we're a bit stuck because to really understand it, we have to, well, formalise it.
That said, non-monotonic reasoning is supposed to be closer to the way humans do informal reasoning in the real world, compared to classical logic, even though classical logic started entirely as an effort to formalise human reasoning; I mean, with Aristotle's Syllogisms (literally "rsasonings" in Greek).
I think it’s better to think of an LLM as a very good hint engine. It’s good at coming up with more possibilities to consider and less good at making sure they work, unless it has an external system to test ideas on and is trained to use it. In the case of applied math, it’s not enough to prove theorems. It also needs to be testing against the real world somehow.
Lean 4 is uses constructive logic. If a closed world assumption requires that a statement that is true is also known to be true, and that any statement that is not known to be true is therefore false, that is not true of constructive systems. I only use Rocq, but I believe the type theories in Rocq and Lean 4 are basically similar variations on the Calculus of Constructions in both cases, though there are important differences. In a constructive theory something is true if a proof can be constructed, but the lack of a proof does not entail that something is false. One needs to prove that something is false. In constructive type theory, one can say, that something is true or false.
So basically you are arguing a Type Theory vs Set Theory problem, Foundationalism or Engineering Refinement. Since we read here of multiple use cases for LLMs in both CS divides, we can conclude an eventual convergence in these given approaches; and if not that, some formal principles should emerge of when to use what.
I just completed the formal verification of my bachelor thesis about real time cellular automata with Lean 4, with heavy use of AI.
Over the past year, I went from fully manual mode (occasionally asking chat gpt some Lean questions) to fully automatic mode, where I barely do Lean proofs myself now (and just point AI to the original .tex files, in German).
It is hard to believe how much the models and agentic harnesses improved over the last year.
I cannot describe how much fun it is to do refactorings with AI on a verified Lean project!
Also, it's so easy now to have visualizations and typesetted documents generated by AI, from dependency visualizations of proofs using the Lean reflection API, to visual execution traces of cellular automatas.
Lean is a great idea, especially the 4th version, a huge level up from the 3rd one, but its core still deficient[1] in some particular scenarious (see an interesting discussion[2] in the Rock (formerly Coq) issue tracker). Not sure if it might hinder the automation with the AI.
Machine learning is definitely enabling writing _proofs_ within a proof assistant, and I'm sure it will help to make formal verification more viable in the future.
Where it cannot (fully) replace humans, is writing the _theorems_ themselves. A human has to check that the theorem being proven is actually what you were trying to prove, and this is not safe from LLM hallucinations. If you ask an LLM, is this bridge safe, and it writes `Theorem bridge_is_safe : 1 + 1 = 2.` and proves this theorem, that does _not_ mean the bridge is safe...
The article then also makes some wild extrapolations:
> We could imagine an LLM assistant for finance that provides an answer only if it can generate a formal proof that it adheres to accounting rules or legal constraints.
I guess it's true because you could imagine this, hypothetically. But it's not going to happen, because you cannot formalize a financial or legal statement in a proof assistant. It's a fundamentally informal, real-world thing, and proof assistants are fundamentally for proving formal, abstract things.
Here is another way to think of this. We all understand that the value of a lawyer in contract negotiations lies not only in drafting a document that, when fed to judge, produces the desired outcome. Rather, lawyers help clients (and counterparties) decide on what their interests consist in.
Developing software is always something of a principal-agent coordination problem, and comes with transaction costs.
Much of the time, most of us labor under the illusion that each of us understands our desires and interests better than any other party could.
I am using lean as part of the prd.md description handed to a coding agent. The definitions in lean compile and mean exactly what I want them to say. The implementation i want to build is in rust.
HOWEVER … I hit something i now call a McLuhen vortex error: “When a tool, language, or abstraction smuggles in an implied purpose at odds with your intended goal.”
Using Lean implies to the coding agent ‘proven’ is a pervasive goal.
I want to use lean to be more articulate about the goal. Instead using lean smuggled in a difficult to remove implicit requirement that everything everywhere must be proven.
This was obvious because the definitions i made in lean imply the exact opposite of everything needs to be proven. When i use morphism i mean anything that is a morphism not only things proven to be morphisms.
A coding agent driven by an llm needs a huge amount of structure to use what the math says rather than take on the implications that because it is using a proof system therefore everything everywhere is better if proven.
The initial way i used lean poisoned the satisficing structure that unfolds during a coding pass.
If you want to mess with this at home, I've been vibe coding https://github.com/kig/formalanswer to plug theorem provers into an LLM call loop. It's pretty early dev but it does have a logic rap battle mode.
I think I saw Terence Tao use a formal proof language but I don't remember if it was Lean. I'm not familiar with it but I do agree that moving to provable languages could improve AI but isn't the basis just having some immutable rigorous set of tests basically which could be replicated in "regular" programming languages?
You can think of theorem provers as really crazy type checkers. It's not just a handful of tests that have to run, but more like a program that has to compile.
He also has blogged about how he uses lean for his research.
Edit to add: Looking at that repo, one thing I like (but others may find infuriating idk) is that where in the text he leaves certain proofs as exercises for the reader, in the repo he turns those into “sorry”s, so you can fork the repo and have a go at proving those things in lean yourself.
If you have some proposition which you need to use as the basis of further work but you haven’t completed a formal proof of yet, in lean, you can just state the proposition with the proof being “sorry”. Lean will then proceed as though that proposition had been proved except that it will give you a warning saying that you have a sorry. For something to be proved in lean you have to have it done without any “sorry”s. https://lean-lang.org/doc/reference/latest/Tactic-Proofs/Tac...
Yes, though often the easiest way to replicate it in regular programming languages is to translate that language to Lean or another ITM, though auto-active like Verus is used for Rust pretty successfully.
Python and C though have enough nasal demons and undefined behavior that it's a huge pain to verify things about them, since some random other thread can drive by and modify memory in another thread.
A theorem prover is a dependently typed functional programming language. If you can generate a term with a particular type then the theorem is true. There is no testing involved.
I like a lot of the idea behind such theorem provers, however, I always have issues with them producing compatible code with other languages.
This happened to me with idris and many others, I took some time to learn the basics, wrote some examples and then FFI was a joke or code generators for JavaScript absolutely useless.
Apart from prioritizing FFI (like Java/Scala, Erlang/Elixir), the other two easy ways to bootstrap an integration of a new obscure or relatively new programming language is to focus on RPC (ffi through network) or file input-output (parse and produce well known file formats to integrate with other tools at Bash level).
I find it very surprising that nobody tried to make something like gRPC as an interop story for a new language, with an easy way to write impure "extensions" in other languages and let your pure/formal/dependently typed language implement the rest purely through immutable message passing over gRPC boundary. Want file i/o? Implement gRPC endpoint in Go, and let your language send read/write messages to it without having to deal with antiquated and memory unsafe Posix layer.
This has been the approach taken by some using LLMs, even in less type-heavy situations. Of course, it is part of a broader tradition in which search is combined with verification. Genetic programming and related areas come to mind. Here, LLMs are search, while Lean is used to express constraints.
> Large language models (LLMs) have astounded the world with their capabilities, yet they remain plagued by unpredictability and hallucinations – confidently outputting incorrect information. In high-stakes domains like finance, medicine or autonomous systems, such unreliability is unacceptable.
This misses a point that software engineers initmately know especially ones using ai tools:
* Proofs are one QA tool
* Unit tests, integration tests and browser automation are other tools.
* Your code can have bugs because it fails a test above BUT...
* You may have got the requirements wrong!
Working with claude code you can have productive loops getting it to assist you in writing tests, finding bugs you hadn't spotted and generally hardening your code.
It takes taste and dev experience definitely helps (as of Jan 26)
So I think hallucinations and proofs as the fix is a bit barking up the wrong tree
The solution to hallucinations is careful shaping of the agent environment around the project to ensure quality.
Proofs may be part of the qa toolkit for AI coded projects but probably rarely.
This site is getting invaded by AI bots... how long before its just AI speaking with AI, and just people reading the conversations thinking that its actual people?
upghost|10 days ago
I realize formal verification with lean is a slightly different game but if anyone here has any insight, I tend to be extremely nervous about a confidently presented AI "proof" because I am sure that the proof is proving whatever it is proving, but it's still very hard for me to be confident that it is proving what I need it to prove.
Before the dog piling starts, I'm talking specifically about distributed systems scenarios where it is just not possible for a human to think through all the combinatorics of the liveness and safety properties without proof assistance.
I'm open to being wrong on this, but I think the skill of writing a proof and understanding the proof is different than being sure it actually proves for all the guarantees you have in mind.
I feel like closing this gap is make it or break it for using AI augmented proof assistance.
oggy|10 days ago
But it's still usually possible to distill a few crucial properties that can be specified in an "obviously correct" manner. It takes A LOT of work (sometimes I'd be stuck for a couple of weeks trying to formalize a property). But in my experience the trade off can be worth it. One obvious benefit is that bugs can be pricey, depending on the system. But another benefit is that, even without formal verification, having a few clear properties can make it much easier to write a correct system, but crucially also make it easier to maintain the system as time goes by.
daxfohl|10 days ago
There was a post a few months ago demonstrating this for various "proved" implementations of leftpad: https://news.ycombinator.com/item?id=45492274
This isn't to say it's useless; sometimes it helps you think about the problem more concretely and document it using known standards. But I'm not super bullish on "proofs" being the thing that keeps AI in line. First, like I said, they're easy to specify incorrectly, and second, they become incredibly hard to prove beyond a certain level of complexity. But I'll be interested to watch the space evolve.
(Note I'm bullish on AI+Lean for math. It's just the "provably safe AI" or "provably correct PRs" that I'm more skeptical of).
nextos|9 days ago
It's less powerful, but easier to break down and align with code. Dafny and F* are two good showcases. Less power makes it also faster to verify and iterate on.
johnbender|10 days ago
In the case of digital systems it can be much worse because we often have to include many assumptions to accommodate the complexity of our models. To use an example from your context, usually one is required to assume some kind of fairness to get anything to go through with systems operating concurrently but many kinds of fairness are not realistic (eg strong fairness).
esafak|10 days ago
youknownothing|10 days ago
Rochus|10 days ago
But Lean 4 is significantly more rigid, granular, and foundational than e.g. Event-B, and they handle concepts like undefined areas and contradictions very differently. While both are "formal methods," they were built by different communities for different purposes: Lean is a pure mathematician's tool, while Event-B is a systems engineer's tool. Event-B is much more flexible, allowing an engineer (or the LLM) to sketch the vague, undefined contours of a system and gradually tighten the logical constraints through refinement.
LLMs are inherently statistical interpolators. They operate beautifully in an Open World (where missing information is just "unknown" and can be guessed or left vague) and they use Non-Monotonic Reasoning (where new information can invalidate previous conclusions). Lean 4 operates strictly on the Closed World Assumption (CWA) and is brutally Monotonic. This is why using Lean to model things humans care about (business logic, user interfaces, physical environments, dynamic regulations) quickly hits a dead end. The physical world is full of exceptions, missing data, and contradictions. Lean 4 is essentially a return to the rigid, brittle approach of the 1980s expert systems. Event-B (or similar methods) provides the logical guardrails, but critically, it tolerates under-specification. It doesn't force the LLM to solve the Frame Problem or explicitly define the whole universe. It just checks the specific boundaries the human cares about.
YeGoblynQueenne|10 days ago
I think LLM reasoning is not so much non-monotonic as unsound, in the sense that conclusions do not necessarily follow from the premises. New information may change conclusions but how that happens is anyone's guess. There's some scholarship on that, e.g. there's a series of papers by Subarao Kamphampathi and his students that show how reasoning models' "thiking" tokens don't really correspond to sound reasoning chains, even if they seem to improve performance overall [1].
But it is difficult to tell what reasoning really means in LLMs. I believe the charitable interpretation of claims about LLM reasoning is that it is supposed to be informal. There is evidence both for and against it (e.g. much testing is in fact on formal reasoning problems, like math exam questions or Sokoban, but there's tests of informal reasoning also, e.g. on the bar exam). However, different interpretations are hard to square with the claims that "we don't understand reasoning"; not a direct quote, but I'm aware of many claims like that by people whose job it is to develop LLMs and that were made at the height of activity around reasoning models (which seems now to have been superseded by activity around "world models") [1].
If LLMs are really capable of informal reasoning (I'm not necessarily opposed to that idea) then we really don't understand what that reasoning is, but it seems we're a bit stuck because to really understand it, we have to, well, formalise it.
That said, non-monotonic reasoning is supposed to be closer to the way humans do informal reasoning in the real world, compared to classical logic, even though classical logic started entirely as an effort to formalise human reasoning; I mean, with Aristotle's Syllogisms (literally "rsasonings" in Greek).
________________
[1] Happy to get links if needed.
skybrian|10 days ago
sinkasapa|10 days ago
mycall|10 days ago
Gehinnn|10 days ago
Over the past year, I went from fully manual mode (occasionally asking chat gpt some Lean questions) to fully automatic mode, where I barely do Lean proofs myself now (and just point AI to the original .tex files, in German). It is hard to believe how much the models and agentic harnesses improved over the last year.
I cannot describe how much fun it is to do refactorings with AI on a verified Lean project!
Also, it's so easy now to have visualizations and typesetted documents generated by AI, from dependency visualizations of proofs using the Lean reflection API, to visual execution traces of cellular automatas.
svara|10 days ago
xvilka|10 days ago
[1] https://artagnon.com/logic/leancoq
[2] https://github.com/rocq-prover/rocq/issues/10871
joomy|10 days ago
emih|10 days ago
Where it cannot (fully) replace humans, is writing the _theorems_ themselves. A human has to check that the theorem being proven is actually what you were trying to prove, and this is not safe from LLM hallucinations. If you ask an LLM, is this bridge safe, and it writes `Theorem bridge_is_safe : 1 + 1 = 2.` and proves this theorem, that does _not_ mean the bridge is safe...
The article then also makes some wild extrapolations:
> We could imagine an LLM assistant for finance that provides an answer only if it can generate a formal proof that it adheres to accounting rules or legal constraints.
I guess it's true because you could imagine this, hypothetically. But it's not going to happen, because you cannot formalize a financial or legal statement in a proof assistant. It's a fundamentally informal, real-world thing, and proof assistants are fundamentally for proving formal, abstract things.
bwestergard|10 days ago
Here is another way to think of this. We all understand that the value of a lawyer in contract negotiations lies not only in drafting a document that, when fed to judge, produces the desired outcome. Rather, lawyers help clients (and counterparties) decide on what their interests consist in.
Developing software is always something of a principal-agent coordination problem, and comes with transaction costs.
Much of the time, most of us labor under the illusion that each of us understands our desires and interests better than any other party could.
SteveJS|10 days ago
HOWEVER … I hit something i now call a McLuhen vortex error: “When a tool, language, or abstraction smuggles in an implied purpose at odds with your intended goal.”
Using Lean implies to the coding agent ‘proven’ is a pervasive goal.
I want to use lean to be more articulate about the goal. Instead using lean smuggled in a difficult to remove implicit requirement that everything everywhere must be proven.
This was obvious because the definitions i made in lean imply the exact opposite of everything needs to be proven. When i use morphism i mean anything that is a morphism not only things proven to be morphisms.
A coding agent driven by an llm needs a huge amount of structure to use what the math says rather than take on the implications that because it is using a proof system therefore everything everywhere is better if proven.
The initial way i used lean poisoned the satisficing structure that unfolds during a coding pass.
mycall|10 days ago
kig|10 days ago
nl|10 days ago
zmgsabst|10 days ago
- Lean supports calling out as a tactic, allowing you to call LLMs or other AI as judges (ie, they return a judgment about a claim)
- Lean can combine these judgments from external systems according to formal theories (ie, normal proof mechanics)
- an LLM engaged in higher order reasoning can decompose its thinking into such logical steps of fuzzy blocks
- this can be done recursively, eg, having a step that replaces LLM judgments with further logical formulations of fuzzy judgments from the LLM
Something, something, sheaves.
throwaway2027|10 days ago
iNic|10 days ago
seanhunter|10 days ago
https://github.com/teorth/analysis
He also has blogged about how he uses lean for his research.
Edit to add: Looking at that repo, one thing I like (but others may find infuriating idk) is that where in the text he leaves certain proofs as exercises for the reader, in the repo he turns those into “sorry”s, so you can fork the repo and have a go at proving those things in lean yourself.
If you have some proposition which you need to use as the basis of further work but you haven’t completed a formal proof of yet, in lean, you can just state the proposition with the proof being “sorry”. Lean will then proceed as though that proposition had been proved except that it will give you a warning saying that you have a sorry. For something to be proved in lean you have to have it done without any “sorry”s. https://lean-lang.org/doc/reference/latest/Tactic-Proofs/Tac...
gaogao|10 days ago
Python and C though have enough nasal demons and undefined behavior that it's a huge pain to verify things about them, since some random other thread can drive by and modify memory in another thread.
anon291|10 days ago
nudpiedo|10 days ago
This happened to me with idris and many others, I took some time to learn the basics, wrote some examples and then FFI was a joke or code generators for JavaScript absolutely useless.
So no way of leveraging an existing ecosystem.
densh|10 days ago
I find it very surprising that nobody tried to make something like gRPC as an interop story for a new language, with an easy way to write impure "extensions" in other languages and let your pure/formal/dependently typed language implement the rest purely through immutable message passing over gRPC boundary. Want file i/o? Implement gRPC endpoint in Go, and let your language send read/write messages to it without having to deal with antiquated and memory unsafe Posix layer.
seanhunter|10 days ago
lo_zamoyski|10 days ago
tokenless|10 days ago
This misses a point that software engineers initmately know especially ones using ai tools:
* Proofs are one QA tool
* Unit tests, integration tests and browser automation are other tools.
* Your code can have bugs because it fails a test above BUT...
* You may have got the requirements wrong!
Working with claude code you can have productive loops getting it to assist you in writing tests, finding bugs you hadn't spotted and generally hardening your code.
It takes taste and dev experience definitely helps (as of Jan 26)
So I think hallucinations and proofs as the fix is a bit barking up the wrong tree
The solution to hallucinations is careful shaping of the agent environment around the project to ensure quality.
Proofs may be part of the qa toolkit for AI coded projects but probably rarely.
whattheheckheck|10 days ago
45dsilicon|10 days ago
[deleted]
AxiomLab|10 days ago
[deleted]
youoy|10 days ago
tesserato|14 days ago
[deleted]
nudpiedo|10 days ago