This article undersells how much has been achieved in formalizing mathematical proofs in the last few years, and how close the problem is to being solved.
If we disregard programming and just look at formalizing math (Christian Szegedy has been doing it for a long time now), the length of proofs being formalized is growing exponentially, and there's a good chance that in 2026 close to 100% of big/important human-written proofs will be translated to and verified by Lean.
Just as an example for programming / modelling cache lines and cycle counts: we have quite good models for lots of architectures (even quite good reverse-engineered models for NVIDIA GPUs in some papers). The problem is that calculating exact numbers for cache reads/writes is tedious, with lots of constants involved, and whenever we change the model a little the calculations have to be redone.
It's a lot of boring constraints to solve, and the main bottleneck for me when I was trying to do it by hand was that I couldn't just trust the output of LLMs.
I think this misses a lot of the reasons why learning verification is important. For instance, learning the concept of invariants and their varieties, such as loop invariants, makes reasoning about code easier in general. Even if you never formally do any verification, it makes it easier to write tests or asserts. A substantial fraction of bugs come from the program being in a different state from the one the programmer assumed, and there are other tools that help with this. For example, a statically typed language is a form of verification: it verifies that a variable has a specific type, and thus which operations can be performed on it, and it limits the valid input and output range of any function. Languages like Rust also provide verification of memory correctness, and are extremely useful tools.
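As a sketch of the idea (a hypothetical example, not from the thread): a loop invariant can be written down as plain assertions, which makes the reasoning checkable even without any prover.

```rust
// Binary search over a sorted slice. The loop invariant, stated as
// debug_assert!s, is: every index below `lo` holds a value < x, and
// every index at or above `hi` holds a value > x. If the invariant
// holds on entry and is preserved by each iteration, correctness of
// the final answer follows.
fn binary_search(xs: &[i32], x: i32) -> Option<usize> {
    let (mut lo, mut hi) = (0, xs.len());
    while lo < hi {
        debug_assert!(xs[..lo].iter().all(|&v| v < x));
        debug_assert!(xs[hi..].iter().all(|&v| v > x));
        let mid = lo + (hi - lo) / 2;
        match xs[mid].cmp(&x) {
            std::cmp::Ordering::Less => lo = mid + 1,
            std::cmp::Ordering::Greater => hi = mid,
            std::cmp::Ordering::Equal => return Some(mid),
        }
    }
    None
}
```

In a proof assistant you would prove the invariant is preserved; here the debug asserts merely check it on each run, which is already enough to catch many off-by-one mistakes.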
"Beware of bugs in the above code; I have only proved it correct, not tried it." - Donald Knuth
Not that relevant in context, as the code in question is used to conclude a formal proof, not the other way around. But hey, it is a common quote when talking about proving software, and someone has to post it... Context: https://staff.fnwi.uva.nl/p.vanemdeboas/knuthnote.pdf
Before we start writing Lean, perhaps we can start with something "dumber" like Rust or any typed language. If you want to write something correct, or you care about correctness, you should not be using a dynamic language. The most useful and widely used kind of test is type checking.
Type errors, especially once you have designed your types to be correct by construction, are extremely useful for LLMs. Once you have the foundation correct, they just have to wiggle through that narrow gap until they find something that fits.
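To make "correct by construction" concrete, here is a minimal sketch (the `Percentage` type is invented for illustration): once the only way to build the value is through a checked constructor, every downstream function gets the guarantee from the type alone, which is exactly the narrow gap described above.

```rust
// The only way to obtain a Percentage is via `new`, so any function
// taking a Percentage can rely on the 0..=100 range without rechecking.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Percentage(u8);

impl Percentage {
    pub fn new(value: u8) -> Option<Percentage> {
        if value <= 100 { Some(Percentage(value)) } else { None }
    }

    pub fn get(self) -> u8 {
        self.0
    }
}
```

A function with signature `fn apply_discount(p: Percentage)` can never receive 250, and the type checker enforces that at every call site.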
But from what I have understood and read so far, I am not convinced by OP's "formal verification". A simple litmus test: take any recent task from your day job and try to write a formal specification of it. Is it even doable? Reasonable? Does it even exist? For me the most useful kind of verification is verification of the lower-level tools, i.e. data structures, languages, compilers, etc.
For example, the type signature of Vec::operator[usize] in Rust returns T. This cannot be true, because it cannot guarantee to return a T for ANY usize. To me, panicking is the laziest and worst way to put something in a specification. It means that every single line of Rust code can enter this termination state.
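For reference, Rust does offer a total alternative to the panicking index: `slice::get` returns `Option<&T>`, moving the partiality into the type instead of into a hidden termination state. A minimal sketch:

```rust
// `v[i]` has type T but can panic at runtime on an out-of-bounds index.
// `v.get(i)` makes the failure case part of the signature: the caller
// is forced by the type checker to handle the miss.
fn lookup(v: &[i32], i: usize) -> Option<&i32> {
    v.get(i) // total: defined for every usize, no panic path
}
```

Whether the standard library should have made `Option` the default and the panicking index the opt-in is exactly the specification question being raised here.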
I once attended a talk by someone who is or was big in the node.js world. He opened with the premise, "a static type check is just a stand-in for a unit test."
I wanted to throw a shoe at him. A static type check doesn't stand in for "a" unit test; static typing stands in for an unbounded number of unit tests.
Put another way, this common misconception among users of languages like JavaScript and Python, that unit testing is just as good as type checking (plus more flexible), is a confusion between the "exists" and "for all" logical quantifiers.
The author is in the comfortable position of working on a system that does have a formal specification and a formally verified reference implementation. The post is not about how they wish things would work, but how their existing system (Cedar) works.
Regarding your point on Rust, the vast majority of software has nowhere near the amount of static guarantees provided by Rust. If you need more, use static memory allocation, that's what people do for safety critical systems. By the way, it seems that Rust aborts on OOM errors, not panics: https://github.com/rust-lang/rust/issues/43596
I think it's possible to write correct systems with dynamic languages, just not the ones we commonly use like Python and JavaScript. I find Clojure, for example, to be a dynamic language that is pretty easy to manage, and I attribute that to its immutability and data-centric ethos. I'm sure there are other dynamic languages that would work as well.
Now, I wouldn't necessarily use Clojure on a huge multi-organization codebase (maybe it's fine, this is outside of my experience with it), but it can be the right tool for some jobs.
> Perhaps we can start with something "dumber" like Rust or any typed language. If you want to write something correct, or you care about correctness, you should not be using a dynamic language. The most useful and widely used kind of test is type checking.
Lean or TLA+ are to Rust/Java/Haskell's type systems what algebraic topology and non-linear PDEs are to "one potato, two potatoes". The level of "correctness" achievable with such simple type systems is so negligible in comparison to the things you can express and prove in rich formal mathematics languages that they barely leave an impression (they do make some grunt work easier, but if we're talking about a world where a machine can do the more complicated things, a little more grunt work doesn't matter).
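To illustrate the gap: a universally quantified property like commutativity of addition is a one-liner in Lean, while no mainstream type system can even state it. (A standard example; `Nat.add_comm` is in core Lean 4.)

```lean
-- A proposition quantified over ALL naturals, stated and proved.
-- No amount of unit tests, and no Rust/Java/Haskell signature,
-- can express "for every a and b" the way this type does.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```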
> To me, panicking is the laziest and worst way to put something in a specification.
This is why the "existing programs don't have specs!" hand-wringing is entirely premature. Just about every code base today has error modes the authors think won't happen.
All you have to do is start proving they won't happen. And if you do this, you will begin a long journey that ends up with a formal spec for, at least, a good part of your program.
Proving the panics are dead code is a Socratic method, between you and the proof assistant / type checker, for figuring out what your program is and what you want it to be :).
Yeah, Rust has been pretty good for formal verification so far. Hoare-style spec contracts are, I think, the way forward, especially since they flow fairly naturally from unit tests. I've been using Hax to pretty good effect so far. I'm generally skeptical that advances in Lean proof solving by dedicated models are that useful for program verification compared to generalist models, though they could lower costs a good bit.
I agree. And learning a typed language is significantly easier now that AI can explain everything. The types also help the AI write correct code. A very positive feedback loop.
- Lean will optimize Peano arithmetic with binary bignums under the hood
- Property based checking and proof search already exist on a continuum, because counterexamples are a valid (dis)proof technique. This should surprise no writer of tactics.
- The lack of formal specs for existing software should become less of a problem for greenfield software once these techniques go mainstream. People will be incentivized to actually figure out what they want, and successfully doing so vastly improves project management.
Finally, and most importantly, people who think there is a "big specification" and then a "big implementation" are totally missing the mark. Remember, tools like Lean are just More Types. When we program with types, do we have a single big type and a single untyped term, paired together? Absolutely not.
As always, the key to productive software development is more and more libraries. Fancier types will allow writing more interesting libraries that tackle the "reusable core" of many tasks.
For example, do you want to write a "polymorphic web app" that can be instantiated with an arbitrary SQL schema? Ideas like that become describable.
> the key to productive software development is more and more libraries
You had me until this statement. The idea that "more and more libraries" is going to solve the (rather large) quality problems we have in the software industry is... misguided. See:
https://www.folklore.org/Negative_2000_Lines_Of_Code.html
https://caseymuratori.com/blog_0031
There are many arguable points in this blog post, but I want to highlight just one: the need for formal specification. It is indeed a big issue. However, one must distinguish between a full specification, which is sufficient to prove functional correctness, and a specification of certain security or safety properties, which only allows us to verify those properties. For example, we can easily specify the property that "the program shall never read uninitialised memory" and prove it. That wouldn't guarantee that the program is functionally correct, but it would at least rule out a whole class of potential errors.
This is an aside because I agree with the author’s core point, but spelling, grammatical errors, and typos actually imply something authored by a human now. This sentence:
“It affects point number 1 because AI-assisted programming is a very natural fit fot specification-driven development.”
made me smile. Reading something hand made that hadn’t been through the filters and presses of modern internet writing.
I think more salient here (certainly in the near term) is setting up adversarial agents for testing/verification. That has been a big win for me in multi-agent workflows; when Claude's "computer use" was first released, that was a very big step in closing this loop and avoiding the manual babysitting involved in larger projects. PSA that it's not a silver bullet, as the "analyzer" can still get tripped up and falsely declare something broken (or functional), but it greatly reduces the "Hey, I've done the task" when the task is not done or the output is broken.
I agree completely with the author that AI assisted coding pushes the bottleneck to verification of the code.
But you don't really need complete formal verification to get these benefits. TDD gets you a lot of them as well. Perhaps your verification is less certain, but it's much easier to get high automated test coverage than it is to get a formally verifiable codebase.
I think AI assisted coding is going to cause a resurgence of interest in XP (https://en.wikipedia.org/wiki/Extreme_programming) since AI is a great fit for two big parts of XP. AI makes it easy to write well-tested code. The "pairing" method of writing code is also a great model for interacting with an AI assistant (much better than the vibe-coding model).
Trouble is that TDD, and formal proofs to much the same extent, assume a model of "double-entry accounting": you write both the test/proof and the implementation, and then make sure they agree. As in accounting, the assumption is that the probability of making the same mistake twice is fairly low, giving high confidence when the two sides agree. When there is a discrepancy, you can unpack whether the problem is in the test/proof or in the implementation. The fallible human can easily screw up either one.
But if you only fill out one side of the ledger, so to speak, an LLM will happily invent something that ensures that it is balanced, even where your side of the entry is completely wrong. So while this type of development is an improvement over blindly trusting an arbitrary prompt without any checks and balances, it doesn't really get us to truly verifying the code to the same degree we were able to achieve before. This remains an unsolved problem.
I find this discourse about AI and formal verification of software very confusing. It's like someone saying, let's assume I can somehow get a crane that would lift that vintage car and place it in my 15th floor apartment living room, but what will I do with my suitcases?
All the problems mentioned in the article are serious. They're also easier than the problem of getting an AI to automatically prove at least hundreds of correctness properties on programs that are hundreds of thousands, if not millions, of lines long. Bringing higher mathematics into the discussion is also unhelpful. Proofs of interesting mathematical theorems require ingenuity and creativity that isn't needed in proving software correct, but they also require orders of magnitude fewer lemmas and inference steps. For software, we're talking 100-1000 lines of proof per line of program code.
I don't know when AI will be able to do all that, but I see no reason to believe that a computer that can do that wouldn't also be able to reconcile the formal statements of correctness properties with informal requirements, and even match the requirements themselves to market needs.
What I'm trying to say is that a machine that can reliably write a complex, large piece of software and prove its correctness - something that, BTW, humans are not currently capable of doing - is also likely a machine that can do that from the prompt "Write a piece of software that will be popular among women aged 35-65", let alone "write a spreadsheet that's as powerful as Excel but easier to use". Of course, once that happens, the market value of any such software will drop to zero, because anyone could give such a prompt. In fact, there would be no need for software as we know it, because the AI could just do what the software is supposed to do (although perhaps it would choose to create an executable as an implementation detail).
What I see is people spending a lot of time imagining how we would work with an AI that could solve some huge problems and at the same time fail to solve easier problems. I don't understand the point of the exercise.
For the verification experts (and forgive me, because I have almost none of the math understanding of this stuff):
> This makes formal verification a prime target for AI-assisted programming. Given that we have a formal specification, we can just let the machine wander around for hours, days, even weeks.
Is this sentiment completely discounting that there can be many possible ways to write a program that satisfies certain requirements, all with correct outputs? Won't many of these be terrible in terms of performance, time complexity, etc.? I know that in the most trivial case, AI doesn't jump straight to O(n³) solutions or anything, but there's also no guarantee it won't have bugs that degrade performance, as long as they don't interfere with technical correctness.
Also, are we also pretending that having Claude spin for “even weeks” is free?
Verified software should also satisfy liveness properties; otherwise an infinite loop that never returns would pass as "correct."
Verifying realtime software goes even further and enforces an upper bound on the maximum number of ticks it takes to complete the algorithm in all cases.
I lack the author's level of education and eloquence, but I have my own notion that I think agrees with theirs: specification is difficult and slow, and bugs do not care whether they are part of the official specification or not.
Some software needs formal verification, but all software needs testing.
On another subject...
> Tests are great at finding bugs ... but they cannot prove the absence of bugs.
I wish more people understood this.
Unless people therefore decide that testing is unnecessary... which has happened a lot in academia. It's one of the reasons testing is not taught that well at some universities...
Doesn't this run into the same bottleneck as developing AI-first languages? AI needs tons of training material on how to write good formal verification code, or code in new AI-first languages, and that material doesn't exist. The only solution is large-scale synthetic generation, which is hard to do if humans, at some level, can't verify that the synthetic data is any good.
I think the article still sells formal specification a bit short in a few areas:
- A formal spec can be used to derive randomly generated tests attempting to find counterexamples to spec invariants (which the article hinted at but didn't describe explicitly)
- A formal spec can be used as input to a model checker, which will try to find counterexamples to spec invariants via bounded exploration of the model's state space
- A formal spec can be used to find spec invariant violations by analyzing traces from production systems
What all these examples have in common is that they attempt to falsify, not verify, that the spec accurately describes the desired design properties or the actual implementation. That tends to be much more feasible than an actual proof.
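The first bullet, deriving randomized tests from a spec, can be sketched without any framework (a toy illustration; the invariant and the LCG constants are my own choices): generate inputs, evaluate the spec's invariant, and report the first counterexample found.

```rust
// Toy property-based check: search for a counterexample to the
// invariant "reversing a vector twice yields the original vector".
// A fixed-seed LCG keeps the run deterministic and dependency-free.
fn lcg(seed: &mut u64) -> u64 {
    *seed = seed
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *seed
}

fn falsify_reverse_roundtrip(trials: u32) -> Option<Vec<u64>> {
    let mut seed = 0xC0FFEE;
    for _ in 0..trials {
        let len = (lcg(&mut seed) % 16) as usize;
        let xs: Vec<u64> = (0..len).map(|_| lcg(&mut seed) % 100).collect();
        let mut ys = xs.clone();
        ys.reverse();
        ys.reverse();
        if ys != xs {
            return Some(xs); // counterexample: invariant falsified
        }
    }
    None // no counterexample found: confidence, not proof
}
```

A `None` result after many trials is evidence, not a proof, which is exactly the falsify-vs-verify distinction being made above.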
> It is also painfully slow, the computational complexity of a + b, an operation so fast in CPU that it's literally an instant, is O(a + b), addition is linear in time to the added values instead of a constant operation.
To me, this reads as an insurmountably high hurdle for the application domain. We're talking about trying to verify systems which are produced very quickly by AIs. If the verification step is glacially slow (which, by any measure, a million cycles to add two integers is), I don't see how this could be considered a tractable solution.
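The cost the quote describes follows directly from the definition of Peano addition, sketched here in Rust for illustration: each step peels off one successor, so the number of recursive calls is linear in the value of `a`, versus one machine instruction for `u64`. (As another comment in the thread notes, Lean compiles `Nat` to binary bignums under the hood, so the practical gap is smaller than the naive definition suggests.)

```rust
// Unary (Peano) naturals. `add` recurses once per successor in `a`,
// so its running time is linear in the VALUE of `a`, not its bit width.
enum Nat {
    Zero,
    Succ(Box<Nat>),
}

fn add(a: Nat, b: Nat) -> Nat {
    match a {
        Nat::Zero => b,
        Nat::Succ(inner) => Nat::Succ(Box::new(add(*inner, b))),
    }
}

fn from_u64(n: u64) -> Nat {
    (0..n).fold(Nat::Zero, |acc, _| Nat::Succ(Box::new(acc)))
}

fn to_u64(n: &Nat) -> u64 {
    match n {
        Nat::Zero => 0,
        Nat::Succ(inner) => 1 + to_u64(inner),
    }
}
```

The unary form is the one proofs are easiest over; the binary bignum form is the one you actually want to execute.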
I wonder if Design by Contract or schema-first design might take off as a way of structuring AI output and allowing it to rapidly iterate toward goals. I'm starting to try these methods out for myself with AI to see where they lead. Looking into https://deal.readthedocs.io/
Formal verification is a nice idea but it's a big hill to climb from where we're at. Most people can't even get agents to robustly E2E QA code, which is a much smaller hill to climb for (probably) larger benefits. I'm sure this area will improve over time though, since it is an eventual unlock for fully autonomous engineering.
I think for most complex systems, robust E2E QA is a waste of money. A small handful of E2E smoke tests and a thoughtful application of smaller tests is usually enough. Though to be fair, agents aren't good at that either.
I think the section on AI from Zero to QED (a guide to proofs in Lean) gives a sober path forward from the perspective of market makers and trading:
"Imagine market infrastructure where agents must prove, before executing, that their actions satisfy regulatory constraints, risk limits, fairness properties, and eventually machine-checkable proofs of Pareto efficiency of market mechanisms. This is a big, hairy, ambitious goal. Not “we reviewed the code” but “the system verified the proof.” The agent that cannot demonstrate compliance cannot act."
https://sdiehl.github.io/zero-to-qed/20_artificial_intellige...
You can. Naturals are the default because the kinds of people who write and use proof assistants usually think in terms of natural numbers first, rather than the 2-adic numbers mod 2^N we use in programming, or even the plain 2-adics used in algorithmics. It's like mathematicians and 1-based indexing: it violates programming conventions, but the people using it find it easier.
I smell vaporware. Formal verification is easy on easy stuff like simple functions; for complex functions it might be impossible. Then you will most likely get a bunch of snake oil salesmen promising that you can verify a full system…
The fact that we're reading about it here today, and have read about it in the past weeks, is one piece of evidence. Another is that we hadn't been reading about it in the months before November. Opus 4.5 and GPT 5.2 have crossed a usefulness frontier.
Anecdotally, I've had some success (guiding LLMs) writing Alloy models in the past month and ensuring conformance with code. Making these would've been a fairy tale, unjustifiable from an ROI perspective, just this summer. The landscape has changed qualitatively.
Formal verification is going mainstream as watercooler weekend-project fodder. As someone who has been well-versed in functional programming and dependent types for over a decade, this is a vast improvement.
The hobby project to day job methodology pipeline is real.
LLM-style AI isn't great for formal verification, as far as I understand. And the recent advances in AI didn't do much for the kind of AI that is useful for formal verification.
We won't be formally verifying millions of LOC anytime soon, don't get your hopes that high up.
...but we will be modelling those 5-10 kLOC modules across multiple services doing critical business logic or distributed transactions. This was unthinkable a couple of months ago, and today it is a read-only-Friday experiment away (try it with a frontier model and you'll be surprised).
Thanks for the article. Perhaps you could write a follow-up article or tutorial on your favored approach, Verification-Guided Development? This is new to most people, including myself, and you only briefly touch on it after spending most of the article on what you don't like.
Good luck with your degree!
P.S. Some links in your Research page are placeholders or broken.
I'll add some links to the original VGD paper and related articles; that should help in the short term. Thank you! I'll look into writing something on VGD itself in the next few weeks.
See also Regehr's example[1] where a formally verified C compiler generates incorrect output because of an inconsistent value in <limits.h> (TL;DR: the compiler can pick whether "char" is signed or unsigned. CompCert picked one, but the Linux system header used the other for CHAR_MIN and CHAR_MAX).
1: https://blog.regehr.org/archives/482 (there were many issues here, not just with CompCert)
dalmo3|2 months ago
That's the AGI I want to see.
getregistered|2 months ago
This really resonates. We can write code a lot faster than we can safely deploy it at the moment.
marcosdumay|2 months ago
We always could. That has been true since the days we programmed computers by plugging jumper wires into a panel.
acedTrex|2 months ago
This has always been the case?
andai|2 months ago
AI will make formal verification go mainstream
https://news.ycombinator.com/item?id=46294574
nileshtrivedi|2 months ago
Why can't we just prove theorems about the standard two's complement integers, instead of Nat?
hacker_homie|2 months ago
How I learned to deploy faster.
badgersnake|2 months ago
This nonsense again. No. No it isn’t.
I’m sure the people selling it wish it was, but that doesn’t make it true.