top | item 45475529

ProofOfThought: LLM-based reasoning using Z3 theorem proving

326 points | barthelomew | 4 months ago | github.com

https://arxiv.org/abs/2409.17270

175 comments


chrchr|4 months ago

I had a surprising interaction with Gemini 2.5 Pro that this project reminds me of. I was asking the LLM for help using an online CAS system to solve a system of equations, and the CAS system wasn't working as I expected. After a couple of back-and-forths with Gemini about the CAS system, Gemini just gave me the solution. I was surprised because it's the kind of thing I don't expect LLMs to be good at. It said it used Python's sympy symbolic computation package to arrive at the solution. So, yes, the marriage of fuzzy LLMs with more rigorous tools can have powerful effects.
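
For reference, this is the kind of sympy call the model likely reached for behind the scenes; the system of equations here is a made-up stand-in:

```python
from sympy import symbols, Eq, solve

x, y = symbols('x y')

# Example system: x + y = 10, 2x - y = 5
solutions = solve([Eq(x + y, 10), Eq(2*x - y, 5)], [x, y])
print(solutions)  # {x: 5, y: 5}
```

The symbolic solver does the actual algebra, so the LLM only has to translate the problem statement into code.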

TrainedMonkey|4 months ago

Just like humans... we are not so good at hard number crunching, but we can invent computers that are amazing at it. And with a lot of effort we can make a program that uses a whole lot of number crunching to be ok at predicting text but kind of bad at crunching hard numbers. And then that program can predict how to create and use programs which are good at number crunching.

anotherpaulg|4 months ago

I really like LLM+sympy for math. I have the LLM write me a sympy program, so I can trust that the symbolic manipulation is done correctly.

The code is also a useful artifact that can be iteratively edited and improved by both the human and LLM, with git history, etc. Running and passing tests/assertions helps to build and maintain confidence that the math remains correct.

I use helper functions to easily render from the sympy code to latex, etc.
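
A minimal sketch of that workflow: a sympy script whose assertion doubles as a regression test for the math, plus LaTeX rendering for write-ups (the identity is chosen purely for illustration):

```python
from sympy import symbols, sin, cos, simplify, latex

theta = symbols('theta', real=True)
expr = sin(theta)**2 + cos(theta)**2

# The assertion acts as a regression test: re-running the script
# maintains confidence that the math remains correct.
assert simplify(expr) == 1

# Helper-style rendering from sympy to LaTeX for documents.
print(latex(expr))
```

Because the artifact is plain code, both human and LLM can iterate on it under git, and CI can re-run the assertions on every change.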

A lot of the math behind this quantum eraser experiment was done this way.

https://github.com/paul-gauthier/entangled-pair-quantum-eras...

selinkocalar|4 months ago

The combination of LLMs and formal verification tools is pretty interesting. We've been thinking about this for compliance automation - there are a lot of regulatory requirements that could theoretically be expressed as formal constraints. Curious about the performance though. Z3 can be really slow on complex problems, and if you're chaining that with LLM calls, the latency could get rough for interactive use cases.

fennecfoxy|4 months ago

Yeah, it feels like these early LLMs are pretty decent at the coming-up-with-a-plan and executing-the-plan parts.

Probably the main deficiencies are confusion as the context grows (therefore confusion as task complexity grows).

jansan|4 months ago

How did that work? Did Gemini call sympy on your machine, or is access to sympy built in and available through normal chat?

DrewADesign|4 months ago

I get having it walk you through figuring out a problem with a tool: seems like a good idea and it clearly worked even better than expected. But deliberately coaxing an LLM into doing math correctly instead of a CAS because you’ve got one handy seems like moving apartments with dozens of bus trips rather than taking the bus to a truck rental place, just because you’ve already got a bus pass.

tannhaeuser|4 months ago

LLMs are statistical language models (d'uh), not reasoners, after all. I found generating logic programs, and Prolog source specifically, to work unreasonably well [1], maybe because Prolog was introduced for symbolic natural language processing and there's a wealth of translation examples in the training set. Might be worth checking out Z3's alternative Datalog syntax [2] instead of its Lisp-ish SMT-LIB syntax.

[1]: https://quantumprolog.sgml.net/llm-demo/part1.html

[2]: https://microsoft.github.io/z3guide/docs/fixedpoints/syntax

larodi|4 months ago

Neurosymbolic systems are very likely the future, as has been mentioned so many times here already.

LASR|4 months ago

This is an interesting approach.

My team has been prototyping something very similar with encoding business operations policies with LEAN. We have some internal knowledge bases (google docs / wiki pages) that we first convert to LEAN using LLMs.

Then we run the solver to verify consistency.

When a wiki page is changed, the process is run again and it's essentially a linter for process.

Can't say it moved beyond the prototyping stage though, since the LEAN conversion does require some engineers to look through it at least.

But a promising approach indeed, especially when you have a domain that requires tight legal / financial compliance.

barthelomew|4 months ago

The autoformalization gap is pretty difficult to bridge indeed. We explored uncertainty quantification of autoformalization on well-defined grammars in our NeurIPS 2025 paper: https://arxiv.org/abs/2505.20047

If you ever feel like chatting and discussing more details, happy to chat!

viraptor|4 months ago

Could you share an example of such policy? I'm struggling to think of something defined well enough in the real world to apply in Lean.

pbronez|4 months ago

That’s pretty cool. It would be super useful to identify contradictory guidance systematically.

nextos|4 months ago

This is a very interesting area of research. I did something similar a couple of years ago using logic and probabilistic logic inference engines to make sure conclusions followed from premises.

I also used agents to synthesize, formalize, and criticize domain knowledge. Obviously, it is not a silver bullet, but it does ensure some degree of correctness.

I think introducing some degree of symbolism and agents-as-a-judge is a promising way ahead, see e.g.: https://arxiv.org/abs/2410.10934

CuriouslyC|4 months ago

Agent/LLM-as-a-judge is biased and only good for bootstrapping. As capabilities get better, LLM-as-a-judge will artificially cap your performance; you need to graduate to either expert human judges or deterministic oracles.

nakamoto_damacy|4 months ago

LLMs lack logical constraints in the generative process; they only learn probabilistic constraints. If you apply logic verification post hoc, you're not "ensuring the correctness of your LLM's reasoning" (I went down this path a year ago). You're classifying whether the LLM's statistically driven pattern generation happens to correspond to correct logic or not. The LLM's output may be wrong 100% of the time, and your theorem prover simply acts as a classifier, ensuring nothing at all.

barthelomew|4 months ago

Yep, this is a genuine problem, and this is what we term as the autoformalization gap in our follow up paper. (https://arxiv.org/abs/2505.20047)

Some LLMs are more consistent between text and SMT, while others are not. (Tab 1, Fig 14,15)

You can do uncertainty quantification with selective verification to reduce the "risk", for e.g. shown as the Area Under the Risk Coverage Curve in Tab 4.

avmich|4 months ago

Probabilistic constraints are all around us. You learn that the sine function is the ratio of the length of the side of the right triangle opposite to the angle to the length of the side opposite to the right angle, so obviously the sine is always positive. Yet your thinking should be flexible enough to allow changing the definition to the ordinate of the point on the unit circle where the line corresponding to the given angle and drawn from zero intersects that circle. So your knowledge - the symbolic one - can also be probabilistic.

nakamoto_damacy|4 months ago

Why is this being down voted? I believe the author acknowledged and responded. Anything wrong?

nakamoto_damacy|4 months ago

I posted about my year-long development effort on this very method on Reddit 25 days ago. My comment elsewhere in this thread provides a cautionary tale, and the author's response to the basic issue I raised is incomplete in that it leaves out that certain problems simply cannot be solved with LLMs (it requires logical constraints in the generative process, but LLMs lack that layer). So I've pivoted to something else since (also mentioned in my comment elsewhere in this thread).

https://www.reddit.com/r/healthIT/comments/1n81e8g/comment/n...

sytse|4 months ago

So the core idea is to use an LLM to draft reasoning as a structured, JSON domain-specific language (DSL), then deterministically translate that into first-order logic and verify it with a theorem prover (Z3).

Interesting that the final answer is provably entailed (or you get a counterexample), instead of being merely persuasive chain-of-thought.

0xWTF|4 months ago

Am I reading this right? Statistical LLM outputs pushed through a formal logic model? Wouldn't that be a case of "crap in, crap out"?

avmich|4 months ago

Formal logic serves as a useful filter. In other words, "crap in, filtered crap out" - remember, evolution works with absolutely random, "crap" mutations, which then are "filtered" by the environment.

varispeed|4 months ago

That's subjective. One could argue all the things we invented in the past few thousand years were crap. Life would have been much easier in the caves, albeit shorter.

baq|4 months ago

You assume it’s all crap when it clearly isn’t often enough to be useful.

ivanbakel|4 months ago

The repo is sparse on the details unless you go digging, which perhaps makes sense if this is just meant as the artifact for the mentioned paper.

Unless I’m wrong, this is mainly an API for trying to get an LLM to generate a Z3 program which “logically” represents a real query, including known facts, inference rules, and goals. The “oversight” this introduces is in the ability to literally read the logical statement being evaluated to an answer, and running the solver to see if it holds or not.

The natural source of doubt is: who’s going to read a bunch of SMT rules manually and be able to accurately double-check them against real-world understanding? Who double checks the constants? What stops the LLM from accidentally (or deliberately, for achieving the goal) adding facts or rules that are unsound (both logically and from a real-world perspective)?

The paper reports a *51%* false positive rate on a logic benchmark! That's shockingly high, and suggests the LLM is either bad at logical models or keeps introducing unsoundness. Sadly, the evaluation is a bit thin on how this stacks up and what causes it to fall short.

barthelomew|4 months ago

Yep. The paper was written last year with GPT-4o. Things have become a lot better since then with newer models.

E.g. https://arxiv.org/pdf/2505.20047 Tab 1, we compare the performance on text-only vs SMT-only. o3-mini does pretty well at mirroring its text reasoning in its SMT, vs Gemini Flash 2.0.

Illustration of this can be seen in Fig 14, 15 on Page 29.

In commercially available products like AWS Automated Reasoning Checks, you build a model from your domain (e.g. from a PDF policy document), cross verify it for correctness, and during answer generation, you only cross check whether your Q/A pairs from the LLM comply with the policy using a solver with guarantees.

This means that they can give you a 99%+ soundness guarantee, which basically means that if the service says the Q/A pair is valid or guaranteed w.r.t the policy, it is right more than 99% of the time.

https://aws.amazon.com/blogs/aws/minimize-ai-hallucinations-...

derekcheng08|4 months ago

Interesting. I wonder if you could implement tool calling with this approach so the LLM calls the tool with the formal specification and gets back the result. Just like a coding agent can run a compiler, get back errors and then self-correct.

jadelcastillo|4 months ago

Interesting approach, but I guess still lot of work to be done. I tried with this question:

"Alice has 60 brothers and she also has 212 sisters. How many sisters does Alice's brother have?"

But the generated program is not very useful:

    {
      "sorts": [],
      "functions": [],
      "constants": {},
      "variables": [
        {"name": "num_brothers_of_alice", "sort": "IntSort"},
        {"name": "num_sisters_of_alice", "sort": "IntSort"},
        {"name": "sisters_of_alice_brother", "sort": "IntSort"}
      ],
      "knowledge_base": [
        "num_brothers_of_alice == 60",
        "num_sisters_of_alice == 212",
        "sisters_of_alice_brother == num_sisters_of_alice + 1"
      ],
      "rules": [],
      "verifications": [
        {
          "name": "Alice's brother has 213 sisters",
          "constraint": "sisters_of_alice_brother == 213"
        }
      ],
      "actions": ["verify_conditions"]
    }

renshijian|4 months ago

This is fascinating! An AI that doesn't just think out loud, but keeps a verifiable diary. It's like a philosopher with a cryptographic notary public living in its brain. Amazing work!

tonerow|4 months ago

Cool research! I went to the repo to see what the DSL looked like but it was hard to find a clear example. It would be cool if you added a snippet to the README.

Yoric|4 months ago

That is exactly the kind of things that I hope LLM will help us achieve before the next AI winter.

sigmoid10|4 months ago

I always find it amazing how many people seem to fail to use current LLMs to the fullest, even though they apparently work with them in research settings. This benchmark pipeline simply calls the OpenAI API and then painstakingly tries to parse the raw text output into a structured json format, when in reality the OpenAI API has supported structured outputs for ages now. That already ensures your model generates schema compliant output without hallucinating keys at the inference level. Today all the major providers support this feature either directly or at least indirectly via function calling. And if you run open models, you can literally write arbitrary schema (i.e. not limited to json behind the scenes) adhering inference engines yourself with rather manageable effort. I'm constantly using this in my daily work and I'm always baffled when people tell me about their hallucination problems, because so many of them can be fixed trivially these days.
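
For illustration, the response_format payload below follows the publicly documented shape the OpenAI chat completions API accepts for structured outputs; the actual network call is elided and the sample reply is fabricated:

```python
import json

# JSON Schema wrapped in the structured-outputs envelope.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "z3_program",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "variables": {"type": "array", "items": {"type": "string"}},
                "constraints": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["variables", "constraints"],
            "additionalProperties": False,
        },
    },
}

# With constrained decoding, the raw text is schema-compliant by
# construction, e.g. via
#   client.chat.completions.create(..., response_format=response_format)
# so a reply like this parses without brittle post-hoc repair:
reply = '{"variables": ["x", "y"], "constraints": ["x > 0", "y == x + 1"]}'
parsed = json.loads(reply)
assert set(parsed) == {"variables", "constraints"}
print(parsed["constraints"])  # ['x > 0', 'y == x + 1']
```

With open models, the same idea extends to arbitrary grammars at the inference-engine level, not just JSON.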

barthelomew|4 months ago

Hey there! I mostly designed and wrote most of the actual interpreter during my internship at Microsoft Research last summer. Constrained decoding for GPT-4 wasn’t available when we started designing the DSL, and besides, creating a regex to constrain this specific DSL is quite challenging.

When the grammar of the language is better defined, like SMT (https://arxiv.org/abs/2505.20047) - we are able to do this with open source LLMs.

atrus|4 months ago

I wouldn't find it amazing; there are so many new models, features, and ways to use models that the minute you pause to take a deep dive into something specific, 43 other things have already passed you by.

jssmith|4 months ago

I see JSON parse errors on occasion when using OpenAI structured outputs that resolve upon retry. It seems it's giving instructions to the LLM, but validation is still up to the caller. Wondering if others see this too.

IanCal|4 months ago

I’d also be surprised if the models are better at writing code in some custom schema (assuming that’s not Z3’s native structure) than writing code in something else. Decent models can write pretty good code and can fix a lot of their own mistakes, plus you get testing/etc. setups for free.

eric-burel|4 months ago

It's a relatively new feature, also people need actual professional training to become true LLM developers using them to their fullest and not just developers that happen to call an LLM API here and there. Takes a lot of time and effort.

retinaros|4 months ago

Yes, this can also improve said reasoning.

dehsge|4 months ago

LLMs and their output are bounded by Rice's theorem. This is not going to ensure correctness; it's just going to validate that the model can produce an undecidable result.

ogogmad|4 months ago

Errr, checking correctness of proofs is decidable.

zwnow|4 months ago

Reasoning? LLMs cannot reason; why is it always assumed they reason? They mimic reasoning.

elcomet|4 months ago

How can you know?

moffkalast|4 months ago

It's so funny to me that people are still adamant about this like two years after it's become a completely moot point.

maiuki|4 months ago

What industrial problems would this solve?

everdrive|4 months ago

I'm honestly confused why we can't determine how LLMs come to their decisions in the general sense. Is it not possible to log every step as the neural network / vector db / magic happens? Is it merely impractical, or is it actually something that's genuinely difficult to do?

konmok|4 months ago

My understanding is that it's neither impractical nor genuinely difficult, it's just that the "logging every step" approach provides explanations of their "reasoning" that are completely meaningless to us, as humans. It's like trying to understand why a person likes the color red, but not the color blue, using a database recording the position, makeup, and velocity of every atom in their brain. Theoretically, yes, that should be sufficient to explain their color preferences, in that it fully models their brain. But practically, the explanation would be phrased in terms of atomic configurations in a way that makes much less sense to us than "oh, this person likes red because they like roses".

chpatrick|4 months ago

Everything happens in an opaque super-high-dimensional numerical space that was "organically grown" not engineered, so we don't really understand what's going on.

moffkalast|4 months ago

It would be like logging a bunch of random noise from anyone's perspective except the LLM's.

nickpsecurity|4 months ago

There are people doing both types. Look up surveys of mechanistic interpretability of language models and surveys of explainable AI for neural networks. Those will give you many techniques for illustrating what's happening.

You'll also see why their applications are limited compared to what you probably hoped for.

NotGMan|4 months ago

GPT-4 allegedly has 1.8 trillion parameters.

Imagine having a bunch of 2D matrices with a combined 1.8 trillion numbers, from which you pick out blocks of numbers in a loop and finally merge and combine them to form a token.

Good luck figuring out what number represents what.

hamonrye|4 months ago

RHEL knife-edge rolling kernel distribution for the proof of concept.

measurablefunc|4 months ago

This is proof of verifiable logic. Computers cannot think, so calling it proof of thought misrepresents what's actually happening.

aSanchezStern|4 months ago

I agree that "proof of thought" is a misleading name, but this whole "computers can't think" thing is making LLM skepticism seem very unscientific. There is no universally agreed upon objective definition of what it means to be able to "think" or how you would measure such a thing. The definition that these types of positions seem to rely upon is "a thing that only humans can do", which is obviously a circular one that isn't useful.

chpatrick|4 months ago

Do you understand human thinking well enough to determine what can think and what can't? We have next to no idea how an organic brain works.