Show HN: LLMs can generate valid JSON 100% of the time
854 points | remilouf | 2 years ago | github.com
Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: every regular expression has an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get the list of symbols that correspond to completions which partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol, and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.
Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.
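A minimal sketch of the idea, with a toy single-character vocabulary (the tiny DFA, names, and setup here are illustrative, not the library's actual API):

```python
import math
import random

# Toy DFA for the regex "ab+": state 0 --a--> 1 --b--> 2 --b--> 2.
DFA = {0: {"a": 1}, 1: {"b": 2}, 2: {"b": 2}}
VOCAB = ["a", "b", "c"]  # pretend vocabulary of single-character tokens

# One pass over the vocabulary at initialization:
# state -> set of token ids allowed from that state.
INDEX = {
    state: {i for i, tok in enumerate(VOCAB) if tok in edges}
    for state, edges in DFA.items()
}

def constrained_sample(logits, state):
    """Mask disallowed tokens' logits to -inf, softmax, then sample."""
    allowed = INDEX[state]
    masked = [x if i in allowed else -math.inf for i, x in enumerate(logits)]
    m = max(masked)
    weights = [math.exp(x - m) for x in masked]  # exp(-inf) == 0.0
    return random.choices(range(len(VOCAB)), weights=weights)[0]
```

From state 0 only `"a"` survives the mask, so sampling can never leave the language of the regex; real multi-character tokens are handled by the derived FSM mentioned above.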
From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.
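As a concrete (much simplified) picture of why this works: a JSON schema can be compiled down to a regular expression over the output text, and that regex then drives the FSM described above. A hypothetical two-field schema, for instance:

```python
import re

# Illustrative only: a regex for a schema requiring a string "name"
# and an integer "age". The real method also handles unions, optionals,
# nested objects, and arrays.
STRING = r'"[^"\\]*"'
INTEGER = r'-?\d+'
SCHEMA_REGEX = re.compile(
    r'\{\s*"name"\s*:\s*' + STRING +
    r'\s*,\s*"age"\s*:\s*' + INTEGER + r'\s*\}'
)

assert SCHEMA_REGEX.fullmatch('{"name": "Ada", "age": 36}')
assert not SCHEMA_REGEX.fullmatch('{"name": "Ada"}')
```

Any string the constrained sampler produces matches the regex by construction, which is why parsing the output cannot fail.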
I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.
I look forward to feedback, bug reports, feature requests and discussions!
Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702
[+] [-] activatedgeek|2 years ago|reply
I am curious, however, for those who have played around with such libraries that wrap base LLMs with output structure: do base models like Llama 2 work very well? My experience says "hell no!", and you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.
And even then, it seems very counter-intuitive to me that, given an instruction-tuned model, post-hoc masking of the state space during generation amounts to just changing the generation distribution, and is potentially detrimental to the instruction-tuning?
[+] [-] make3|2 years ago|reply
About your second point: the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output tokens can and cannot be used.
[+] [-] simonw|2 years ago|reply
I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.
[+] [-] LakshyAAAgrawal|2 years ago|reply
In our paper titled "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs to static analysis, and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token level generation at specific points leads to significantly improved quality of generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B) while also improving on match with ground truth.
[+] [-] ethbr1|2 years ago|reply
Isn't that what we did with test driven development?
The primary difference was our generator functions were human instead of LLM. Why not cut out the middle-human?
[+] [-] Havoc|2 years ago|reply
The instruction tuning part is "trivial"... it's the dealing-with-edge-cases part that gets me.
With classic code, edge cases are, well, insignificant edge cases. With an LLM you never know what will make it go off on a tangent, and the parsing code needs to deal with that chaos.
Or, put differently, the percentage of cases that are edge cases seems to have gone up dramatically.
[+] [-] panarky|2 years ago|reply
But it's still probabilistic, and nine times out of ten isn't good enough.
Occasionally it will hallucinate responses like this:
{"key1": "value1", "key2": "value2" for i in range(n)}
Re-prompting with the parsing error message is usually enough to get it on the second try.
But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.
Re-prompting for escaping errors still yields a ~50% success rate.
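For reference, the re-prompting loop described here looks something like this (the `llm` callable is a stand-in for whatever completion API is in use):

```python
import json

def json_with_retries(llm, prompt, max_tries=3):
    """Call the model; on a parse failure, re-prompt with the error
    message appended, up to `max_tries` attempts."""
    for _ in range(max_tries):
        reply = llm(prompt)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            prompt += (
                f"\n\nYour last reply was not valid JSON ({err}). "
                "Respond with valid JSON only."
            )
    raise ValueError(f"no valid JSON after {max_tries} attempts")
```

This fixes structural errors often enough, but, as noted above, it cannot systematically fix escaping mistakes: the corrected reply is still sampled from the same distribution.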
[+] [-] simonw|2 years ago|reply
Here's their prompt for that: https://github.com/microsoft/TypeChat/blob/c45460f4030938da3...
I think the approach using grammars (seen here, but also in things like https://github.com/ggerganov/llama.cpp/pull/1773 ) is a much more elegant solution.
[+] [-] padolsey|2 years ago|reply
[+] [-] caesil|2 years ago|reply
The chief error is not providing escape hatches. LLMs look for a right answer. If you are feeding it some texts and asking it to return structured data about them, but one of the texts is blank, it will be difficult for the model to determine a right answer, so you get hallucinations. The solution is an escape hatch where one of the arguments is a `textIsMissing` boolean or something.
As long as you've accounted for these failure modes, it works flawlessly.
[+] [-] andreygrehov|2 years ago|reply
[+] [-] karmasimida|2 years ago|reply
1. It consumes fewer tokens, no need to add too many examples into the prompt.
2. It suffers less from the forgetting issue.
Another minor advantage is that you can control precisely where your desired output begins.
But overall, those are nice perks, not too substantial IMO.
[+] [-] nextaccountic|2 years ago|reply
If this works, how do you select the optimal value? Maybe you can train a model that excels at the task of querying GPT-4 for valid JSONs.
[+] [-] MuffinFlavored|2 years ago|reply
right now you can inject prompts that the LLM takes into consideration before the output
I wonder if you can make it have a "post" generation function that says like "keep re-trying in a loop (aka hallucinating with randomness) until the output message passes XYZ format/checks/scoring"
[+] [-] msp26|2 years ago|reply
But you can do both. For my current use case of extracting information from articles, I have a json schema + one/two example articles along with their correct answers. This increases token costs but 3.5 is so cheap that it doesn't matter and for 4 you can use batching to decrease token cost per article.
[+] [-] phillipcarter|2 years ago|reply
[+] [-] keiferwiseman|2 years ago|reply
[+] [-] thumbsup-_-|2 years ago|reply
[+] [-] orasis|2 years ago|reply
[+] [-] hansvm|2 years ago|reply
As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".
The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.
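The skew in the toy example above is easy to verify by brute-force enumeration (word-level tokens, uniform priors); a sketch that computes both the rejection-sampling distribution and the token-masked one:

```python
from fractions import Fraction

OUTPUTS = {  # complete outputs as token tuples, each with prior 1/4
    ("hello", " world"): Fraction(1, 4),
    ("food",): Fraction(1, 4),
    ("hello",): Fraction(1, 4),
    ("good", " day"): Fraction(1, 4),
}

def valid(tokens):  # the "grammar": output must contain a space
    return " " in "".join(tokens)

# Rejection sampling: condition the full-sequence distribution on validity.
z = sum(p for seq, p in OUTPUTS.items() if valid(seq))
rejection = {seq: p / z for seq, p in OUTPUTS.items() if valid(seq)}

def masked(prefix=(), mass=Fraction(1), out=None):
    """Token-by-token masking: at each step drop continuations that cannot
    reach any valid output, renormalize, and recurse."""
    if out is None:
        out = {}
    cont = {}  # next token (None = end of sequence) -> probability mass
    for seq, p in OUTPUTS.items():
        if seq[:len(prefix)] == prefix:
            nxt = seq[len(prefix)] if len(seq) > len(prefix) else None
            cont[nxt] = cont.get(nxt, Fraction(0)) + p
    allowed = {
        nxt: p for nxt, p in cont.items()
        if (valid(prefix) if nxt is None else
            any(valid(seq) for seq in OUTPUTS
                if seq[:len(prefix) + 1] == prefix + (nxt,)))
    }
    total = sum(allowed.values())
    for nxt, p in allowed.items():
        if nxt is None:
            out[prefix] = out.get(prefix, Fraction(0)) + mass * p / total
        else:
            masked(prefix + (nxt,), mass * p / total, out)
    return out

print(rejection)  # both valid outputs at 1/2
print(masked())   # "hello world" at 2/3, "good day" at 1/3
```

"food" is masked away at the first step, and its probability mass is redistributed toward "hello", which the mask then forces to continue as "hello world"; hence the 2:1 ratio.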
As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.
[+] [-] sneedchucker|2 years ago|reply
https://news.ycombinator.com/item?id=36819906 https://github.com/ggerganov/llama.cpp/pull/1773
[+] [-] remilouf|2 years ago|reply
Our method is much more efficient. llama.cpp loops over the entire vocabulary (~50k tokens) at each step to generate the mask. We generate an index at initialization, and building the masks at each step then only requires a dictionary lookup (trading memory for speed). Sampling is just as fast as standard sampling.
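The trade-off can be sketched as follows (toy DFA, hypothetical names): the scan recomputes the mask from scratch at every step, while the index does that work once per DFA state at initialization.

```python
def step_token(dfa, state, token):
    """Advance the DFA over a (possibly multi-character) token;
    return None if the token leaves the language."""
    for ch in token:
        state = dfa.get(state, {}).get(ch)
        if state is None:
            return None
    return state

def mask_by_scan(dfa, state, vocab):
    """Scan-style: O(|vocab|) work at every generation step."""
    return {i for i, tok in enumerate(vocab)
            if step_token(dfa, state, tok) is not None}

def build_index(dfa, vocab):
    """One pass over the vocabulary at initialization; afterwards each
    step's mask is a single dictionary lookup (memory for speed)."""
    return {state: mask_by_scan(dfa, state, vocab) for state in dfa}

# Toy DFA for "ab+" and a vocabulary with multi-character tokens.
DFA = {0: {"a": 1}, 1: {"b": 2}, 2: {"b": 2}}
INDEX = build_index(DFA, ["a", "b", "ab", "bb", "c"])
```

Here `INDEX[0]` is `{0, 2}` ("a" and "ab" both keep the DFA alive from state 0), and states 1 and 2 both allow `{1, 3}` ("b" and "bb").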
[+] [-] btwillard|2 years ago|reply
[+] [-] xigency|2 years ago|reply
[+] [-] BoorishBears|2 years ago|reply
https://github.com/1rgs/jsonformer
or
https://github.com/newhouseb/clownfish
or
https://github.com/mkuchnik/relm
or
https://github.com/ggerganov/llama.cpp/pull/1773
or
https://github.com/Shopify/torch-grammar
Overall there are a ton of these logit based guidance systems, the reason they don't get tons of traction is the SOTA models are behind REST APIs that don't enable this fine-grained approach.
Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience)
[+] [-] J_Shelby_J|2 years ago|reply
After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?
Very slick!
[+] [-] Q6T46nT668w6i3m|2 years ago|reply
Edit: It is! https://brandonwillard.github.io/
[+] [-] YeGoblynQueenne|2 years ago|reply
As far as I can tell your approach requires a grammar to be given by a user. In that case, what is the advantage of using an LLM to generate text? Why can't you just run your grammar as a generator and generate the text you want? That would save you the considerable trouble and cost of training an LLM in the first place. And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?
[+] [-] aduffy|2 years ago|reply
I firmly believe that output format guarantees are going to be important for real (non-toy) use cases for LLMs
[1] https://github.com/ggerganov/llama.cpp/discussions/2494
[+] [-] Scaevolus|2 years ago|reply
[+] [-] Scene_Cast2|2 years ago|reply
[+] [-] Scarblac|2 years ago|reply
[+] [-] pshc|2 years ago|reply
[+] [-] remilouf|2 years ago|reply
[+] [-] contravariant|2 years ago|reply
[+] [-] Deukhoofd|2 years ago|reply
https://microsoft.github.io/TypeChat/blog/introducing-typech...
[+] [-] remilouf|2 years ago|reply
Our method, on the other hand, guarantees that the output will follow the specs of the JSON schema. No need to call the LLM several times.
[+] [-] 2bitencryption|2 years ago|reply
Guidance (and this project?): Let's not even bother with trying to convince the model; instead, we'll only sample from the set of tokens that are guaranteed to be correct for the grammar we want to emit.
[+] [-] Ilasky|2 years ago|reply
Here’s a bit more of a description of using the functions API for JSON returns: https://yonom.substack.com/p/native-json-output-from-gpt-4
[0] https://openai.com/blog/function-calling-and-other-api-updat...
[1] https://resgen.app
[2] https://github.com/guidance-ai/guidance
[+] [-] londons_explore|2 years ago|reply
From OpenAI's docs:
> note: the model may generate invalid JSON
I would guess they don't use your method - and perhaps they should!
[+] [-] thomasfromcdnjs|2 years ago|reply
[+] [-] Animats|2 years ago|reply
[+] [-] anotherpaulg|2 years ago|reply
https://aider.chat/docs/benchmarks.html
I’m curious if you have measured whether the “constrained generation” that you’re doing suffers from similar downsides?
[+] [-] simonw|2 years ago|reply
Being able to pass up some kind of grammar (a regular expression, or a JSON schema, or some other format) and have this trick run during their token sampling process to ensure the output was compliant would be incredibly useful.
[+] [-] coder543|2 years ago|reply
[+] [-] Havoc|2 years ago|reply
[+] [-] swyx|2 years ago|reply
[+] [-] lettergram|2 years ago|reply
Can't mention how we did it (there are a lot of public patents, if interested), but back in 2018 we had a way to generate synthetic data (statistically, structurally similar) off any dataset - https://medium.com/capital-one-tech/why-you-dont-necessarily... You could also design datasets if you wanted.
It'd keep similar relations and worked pretty darn well. Not the exact same, but always produced valid JSON.
[+] [-] remilouf|2 years ago|reply
[+] [-] visarga|2 years ago|reply