I had a surprising interaction with Gemini 2.5 Pro that this project reminds me of. I was asking the LLM for help using an online CAS to solve a system of equations, and the CAS wasn't working as I expected. After a couple of back-and-forths with Gemini about it, Gemini just gave me the solution. I was surprised because it's the kind of thing I don't expect LLMs to be good at. It said it used Python's sympy symbolic computation package to arrive at the solution. So, yes, the marriage of fuzzy LLMs with more rigorous tools can have powerful effects.
Just like humans... we are not so good at hard number crunching, but we can invent computers that are amazing at it. And with a lot of effort we can make a program that uses a whole lot of number crunching to be ok at predicting text but kind of bad at crunching hard numbers. And then that program can predict how to create and use programs which are good at number crunching.
I really like LLM+sympy for math. I have the LLM write me a sympy program, so I can trust that the symbolic manipulation is done correctly.
The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.
I use helper functions to easily render the sympy code to LaTeX, etc.
A lot of the math behind this quantum eraser experiment was done this way: https://github.com/paul-gauthier/entangled-pair-quantum-eras...
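A minimal sketch of the LLM+sympy workflow described above: have the LLM emit a sympy program, keep assertions alongside it so regressions are caught on every run, and render results to LaTeX for write-ups. The system of equations here is a made-up stand-in, not the original one.

```python
# Sketch: sympy does the symbolic manipulation; assertions act as a
# regression test for the math; a helper renders expressions to LaTeX.
import sympy as sp

x, y = sp.symbols("x y")
equations = [sp.Eq(x + y, 3), sp.Eq(x - y, 1)]

# Solve symbolically; dict=True returns a list of solution dictionaries.
solution = sp.solve(equations, [x, y], dict=True)[0]

# Assertions build and maintain confidence that the math remains correct.
assert solution[x] == 2 and solution[y] == 1

def to_latex(expr) -> str:
    # Helper to render any sympy expression or equation to LaTeX.
    return sp.latex(expr)

print(solution)                # {x: 2, y: 1}
print(to_latex(equations[0]))  # x + y = 3
```

The code then lives in git alongside its assertions, so both the human and the LLM can iterate on it safely.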
The combination of LLMs and formal verification tools is pretty interesting. We've been thinking about this for compliance automation - there are a lot of regulatory requirements that could theoretically be expressed as formal constraints. Curious about the performance though. Z3 can be really slow on complex problems, and if you're chaining that with LLM calls, the latency could get rough for interactive use cases.
I get having it walk you through figuring out a problem with a tool: seems like a good idea and it clearly worked even better than expected. But deliberately coaxing an LLM into doing math correctly instead of a CAS because you’ve got one handy seems like moving apartments with dozens of bus trips rather than taking the bus to a truck rental place, just because you’ve already got a bus pass.
LLMs are statistical language models (duh), not reasoners, after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well, though [1], maybe because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMT-LIB syntax.
[1]: https://quantumprolog.sgml.net/llm-demo/part1.html
[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax
Yep! Datalog syntax for Z3 is pretty neat! We used SMT [1] in our grammars paper because it allowed the most interoperability with solvers, but our technique also works with Prolog, as tested at the behest of reviewers at NeurIPS. I would assume that this should also work with Datalog [2].
[1] https://arxiv.org/abs/2505.20047
[2] https://github.com/antlr/grammars-v4/blob/master/datalog/dat...
My team has been prototyping something very similar, encoding business operations policies in Lean. We have some internal knowledge bases (Google Docs / wiki pages) that we first convert to Lean using LLMs.
Then we run the solver to verify consistency.
When a wiki page is changed, the process is run again and it's essentially a linter for process.
Can't say it moved beyond the prototyping stage though, since the Lean conversion still requires some engineers to look through it.
But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.
The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper: https://arxiv.org/abs/2505.20047
If you ever feel like chatting and discussing more details, happy to chat!
This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.
I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.
I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934
Yep! I have read your work! Pretty cool! I also worked on a similar deep research agent for autoformalization this summer at AWS ARChecks, building on similar patterns.
Although that work is not public, you can play with the generally available product here [1]!
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Agent/LLM as a judge is biased and only good for bootstrapping. As capabilities improve, LLM-as-a-judge will artificially cap your performance; you need to graduate to either expert human judges or deterministic oracles.
LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post hoc, you're not "ensuring the correctness of your LLM's reasoning" (I went down this path a year ago); you're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not. The LLM's output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.
Yep, this is a genuine problem, and this is what we term the autoformalization gap in our follow-up paper. (https://arxiv.org/abs/2505.20047)
Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)
You can do uncertainty quantification with selective verification to reduce the "risk", e.g. shown as the Area Under the Risk-Coverage Curve in Tab 4.
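A minimal sketch of the selective-verification idea: accept only the most-confident formalizations, sweep the acceptance threshold to trace a risk-coverage curve, and summarize it by its area. The confidence scores and correctness labels below are invented illustrative data, not numbers from the paper.

```python
# Selective verification: risk (error rate among accepted samples) as a
# function of coverage (fraction of samples accepted), plus the area under
# that curve (AURC) via the trapezoidal rule.

def risk_coverage_points(samples):
    """samples: list of (confidence, is_correct) pairs.
    Returns (coverage, risk) pairs from accepting the top-k most confident."""
    ordered = sorted(samples, key=lambda s: -s[0])
    points, errors = [], 0
    for i, (_, correct) in enumerate(ordered, start=1):
        errors += 0 if correct else 1
        points.append((i / len(ordered), errors / i))
    return points

def aurc(points):
    """Area under the risk-coverage curve (lower is better)."""
    area, prev_c, prev_r = 0.0, 0.0, points[0][1]
    for c, r in points:
        area += (c - prev_c) * (r + prev_r) / 2
        prev_c, prev_r = c, r
    return area

samples = [(0.99, True), (0.95, True), (0.80, True), (0.60, False), (0.40, False)]
pts = risk_coverage_points(samples)
print(pts[-1])   # full coverage: (1.0, 0.4), i.e. 2 of 5 wrong
print(aurc(pts))
```

Cutting coverage to the top three samples here drives the risk to zero, which is exactly the trade selective verification makes.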
Probabilistic constraints are all around us. You learn that the sine function is the ratio of the length of the side of the right triangle opposite to the angle to the length of the side opposite to the right angle, so obviously the sine is always positive. Yet your thinking should be flexible enough to allow changing the definition to the ordinate of the point on the unit circle where the line corresponding to the given angle and drawn from zero intersects that circle. So your knowledge - the symbolic one - can also be probabilistic.
I posted about my year-long development effort on this very method on Reddit 25 days ago: https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n... My comment elsewhere in this thread provides a cautionary tale, and the author's response to the basic issue I raised is incomplete: it leaves out that certain problems simply cannot be solved with LLMs (they require logical constraints in the generative process, a layer LLMs lack). So I've pivoted to something else since (also mentioned in that comment).
So the core idea is to use an LLM to draft reasoning as a structured JSON domain-specific language (DSL), then deterministically translate that into first-order logic and verify it with a theorem prover (Z3).
Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.
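The deterministic-translation half of that pipeline can be sketched without any solver: lower a tiny JSON program into SMT-LIB text that Z3 could then check. The DSL field names here are invented for illustration and need not match the project's schema. Entailment is checked the standard way: assert the knowledge base plus the negated goal, and "unsat" means the goal is provably entailed (a "sat" result yields a counterexample).

```python
# Deterministic JSON-DSL -> SMT-LIB translation (illustrative schema).
import json

def to_smtlib(program: dict) -> str:
    lines = ["(set-logic QF_LIA)"]
    for name in program["variables"]:
        lines.append(f"(declare-const {name} Int)")
    for fact in program["knowledge_base"]:
        lines.append(f"(assert {fact})")
    # Negate the goal: if the result is unsat, the goal is entailed.
    lines.append(f"(assert (not {program['goal']}))")
    lines.append("(check-sat)")
    return "\n".join(lines)

program = json.loads("""
{
  "variables": ["a", "b"],
  "knowledge_base": ["(= a 2)", "(= b (+ a 1))"],
  "goal": "(= b 3)"
}
""")

smt = to_smtlib(program)
print(smt)
```

Feeding the output to Z3 (e.g. `z3 file.smt2`) would report `unsat`, i.e. `b = 3` follows from the facts; the generated text is also the human-auditable artifact the approach relies on.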
Formal logic serves as a useful filter. In other words, "crap in, filtered crap out" - remember, evolution works with absolutely random, "crap" mutations, which then are "filtered" by the environment.
That's subjective. One could argue all the things we invented in the past few thousands years were crap. Life would have been much easier in the caves, albeit shorter.
The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.
Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.
The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?
The paper reports a *51%* false positive rate on a logic benchmark! That’s shockingly high, and suggests the LLM is either bad at logical models or keeps creating unsoundnesses. Sadly, the evaluation is a bit thin on the ground about how this stacks up, and what causes it to fall short.
Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.
E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.
Illustration of this can be seen in Fig 14, 15 on Page 29.
In commercially available products like AWS Automated Reasoning Checks [1], you build a model from your domain (e.g. from a PDF policy document), cross-verify it for correctness, and during answer generation you only cross-check whether your Q/A pairs from the LLM comply with the policy, using a solver with guarantees.
This means they can give you a 99%+ soundness guarantee: if the service says the Q/A pair is valid or guaranteed with respect to the policy, it is right more than 99% of the time.
[1] https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...
Interesting. I wonder if you could implement tool calling with this approach so the LLM calls the tool with the formal specification and gets back the result. Just like a coding agent can run a compiler, get back errors and then self-correct.
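That loop can be sketched with stubs standing in for both sides, so the control flow is the only real content. In a real implementation, `fake_llm` would be a chat-API call and `solver_tool` would hand the spec to Z3 and return its error output; both names are invented here.

```python
# Tool-calling loop: the model emits a formal spec, the solver tool checks
# it, and any error is fed back so the model can self-correct.

def fake_llm(prompt: str, feedback=None) -> str:
    # Stub LLM: first attempt is malformed; after feedback it "self-corrects".
    return "(assert (= x 2))" if feedback else "(assert (= x ))"

def solver_tool(spec: str):
    # Stub solver: returns an error message, or None if the spec is accepted.
    return "parse error: missing operand" if "(= x )" in spec else None

def solve_with_tool(prompt: str, max_rounds: int = 3) -> str:
    feedback = None
    for _ in range(max_rounds):
        spec = fake_llm(prompt, feedback)
        feedback = solver_tool(spec)
        if feedback is None:
            return spec  # tool accepted the formal spec
    raise RuntimeError("model failed to produce a valid spec")

print(solve_with_tool("x equals two"))  # (assert (= x 2))
```

This is the same shape as a coding agent's compile-fix loop, just with a solver in place of the compiler.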
This is fascinating! An AI that doesn't just think out loud, but keeps a verifiable diary. It's like a philosopher with a cryptographic notary public living in its brain. Amazing work!
Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.
Hey! Thank you for the interest! I shall do that. Meanwhile, check out Page 11 onwards. We describe a lot of situations! (https://arxiv.org/pdf/2409.17270)
I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured JSON format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema-compliant output, without hallucinated keys, at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can write arbitrary schema-adhering inference engines yourself (i.e. not limited to JSON behind the scenes) with rather manageable effort. I'm constantly using this in my daily work, and I'm always baffled when people tell me about their hallucination problems, because so many of them could be fixed trivially these days.
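For concreteness, here is roughly what the structured-outputs route looks like. The payload shape follows OpenAI's documented `json_schema` response format, but check current API docs before relying on it; the `conforms` function is a deliberately tiny stdlib stand-in for a real JSON Schema validator, and the schema itself is an invented example.

```python
# Schema-constrained output: the provider enforces the schema at inference
# time, so the response parses as JSON with exactly the expected keys.
import json

schema = {
    "type": "object",
    "properties": {
        "variables": {"type": "array", "items": {"type": "string"}},
        "constraints": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["variables", "constraints"],
    "additionalProperties": False,
}

# This dict would be passed as response_format in the chat completion call.
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "logic_program", "strict": True, "schema": schema},
}

def conforms(obj: dict, schema: dict) -> bool:
    # Minimal check: exactly the required keys, each a list of strings.
    if set(obj) != set(schema["required"]):
        return False
    return all(isinstance(v, list) and all(isinstance(s, str) for s in v)
               for v in obj.values())

sample = json.loads('{"variables": ["x"], "constraints": ["(= x 2)"]}')
print(conforms(sample, schema))  # True
```

Even with provider-side enforcement, validating on the caller's side remains cheap insurance.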
Hey there! I designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn't available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.
When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.
I wouldn't find it amazing, there are so many new models, features, ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed by you.
I see JSON parse errors on occasion when using OpenAI structured outputs that resolve upon retry. It seems it's giving instructions to the LLM, but validation is still up to the caller. Wondering if others see this too.
I'd also be surprised if the models are better at writing code in some custom schema (assuming that's not Z3's native format) than writing code in something else. Decent models can write pretty good code and can fix a lot of their own mistakes, plus you get testing etc. setups for free.
It's a relatively new feature, and people need actual professional training to become true LLM developers who use these models to their fullest, not just developers who happen to call an LLM API here and there. Takes a lot of time and effort.
LLMs and their output are bounded by Rice's theorem. This is not going to ensure correctness; it's just going to validate that the model can produce an undecidable result.
I'm honestly confused why we can't determine how LLMs come to their decisions in the general sense. Is it not possible to log every step as the neural network / vector db / magic happens? Is it merely impractical, or is it actually something that's genuinely difficult to do?
My understanding is that it's neither impractical nor genuinely difficult, it's just that the "logging every step" approach provides explanations of their "reasoning" that are completely meaningless to us, as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But practically, the explanation would be phrased in terms of atomic configurations in a way that makes much less sense to us than "oh, this person likes red because they like roses".
Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.
There are people doing both types. Look up surveys of mechanistic interpretability of language models and of explainable AI for neural networks. Those will give you many techniques for illustrating what's happening.
You'll also see why their applications are limited compared to what you probably hoped for.
Imagine having a bunch of 2D matrices with a combined 1.8 trillion numbers in total, from which you pick out blocks of numbers in a loop and finally merge and combine them to form a token.
Good luck figuring out what number represents what.
I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.
Probably the main deficiencies are confusion as the context grows (therefore confusion as task complexity grows).
"Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice's brother have?"
But the generated program is not very useful:
{
  "sorts": [],
  "functions": [],
  "constants": {},
  "variables": [
    {"name": "num_brothers_of_alice", "sort": "IntSort"},
    {"name": "num_sisters_of_alice", "sort": "IntSort"},
    {"name": "sisters_of_alice_brother", "sort": "IntSort"}
  ],
  "knowledge_base": [
    "num_brothers_of_alice == 60",
    "num_sisters_of_alice == 212",
    "sisters_of_alice_brother == num_sisters_of_alice + 1"
  ],
  "rules": [],
  "verifications": [
    {
      "name": "Alice's brother has 213 sisters",
      "constraint": "sisters_of_alice_brother == 213"
    }
  ],
  "actions": ["verify_conditions"]
}
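The generated program above can be "verified" without any solver at all, which is the heart of the complaint: the knowledge base already encodes the answer (`sisters_of_alice_brother == num_sisters_of_alice + 1`), so the check merely replays the LLM's own arithmetic. A tiny evaluator makes that concrete; since every fact has the form `var == expr`, the facts can be evaluated in order.

```python
# Evaluate the generated program's facts in order, then "verify" the goal.
# No logic is being checked here, only the LLM's own arithmetic.

knowledge_base = [
    "num_brothers_of_alice == 60",
    "num_sisters_of_alice == 212",
    "sisters_of_alice_brother == num_sisters_of_alice + 1",
]
verification = "sisters_of_alice_brother == 213"

env = {}
for fact in knowledge_base:
    var, expr = (s.strip() for s in fact.split("==", 1))
    env[var] = eval(expr, {}, env)  # acceptable here: we control the strings

print(eval(verification, {}, env))  # True
```

The verification passes (213 happens to be right: 212 sisters plus Alice herself), but nothing in the program derives that from the family structure; the "+ 1" was asserted, not proved.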