top | item 37125118

Show HN: LLMs can generate valid JSON 100% of the time

854 points | remilouf | 2 years ago | github.com

Outlines is a Python library that focuses on text generation with large language models. Brandon and I are not LLM experts and started the project a few months ago because we wanted to understand better how the generation process works. Our original background is probabilistic, relational and symbolic programming.

Recently we came up with a fast way to generate text that matches a regex (https://blog.normalcomputing.ai/posts/2023-07-27-regex-guide...). The basic idea is simple: regular expressions have an equivalent deterministic finite automaton (DFA) representation. We can transform this DFA into a generative model: in each state we get a list of symbols which correspond to completions that partially match the regular expression. We mask the other symbols in the logits returned by a large language model, sample a new symbol and move to the next state. The subtlety is that language models work with tokens, not symbols, so we derive a new FSM whose alphabet is the model's vocabulary. We can do this in only one pass over the vocabulary.
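A minimal sketch of the idea, using a character-level "vocabulary" and a hand-built DFA for the regex `ab*c`; everything here (the toy DFA, the four-symbol vocabulary, the uniform dummy model) is invented for illustration, and the real implementation works over the model's token vocabulary and a compiled regex:

```python
import math
import random

# Toy DFA for the regex "ab*c":
# state 0 --a--> 1, state 1 --b--> 1, state 1 --c--> 2 (accepting).
DFA = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
ACCEPTING = {2}
VOCAB = ["a", "b", "c", "d"]  # stand-in for an LLM's vocabulary

def sample_constrained(logits_fn):
    """Mask logits of symbols with no DFA transition, sample, advance the state."""
    state, out = 0, []
    while state not in ACCEPTING:
        logits = logits_fn(out)  # the "model's" scores for the next symbol
        allowed = [t for t in VOCAB if (state, t) in DFA]
        # softmax restricted to allowed symbols (the rest are masked to -inf)
        z = sum(math.exp(logits[t]) for t in allowed)
        r, acc = random.random(), 0.0
        for t in allowed:
            acc += math.exp(logits[t]) / z
            if r <= acc or t == allowed[-1]:
                out.append(t)
                state = DFA[(state, t)]
                break
    return "".join(out)

# A dummy "model" that scores every symbol equally.
uniform = lambda prefix: {t: 0.0 for t in VOCAB}
print(sample_constrained(uniform))  # e.g. "abc" or "abbbc"; always matches ab*c
```

By construction, every sampled string matches the regex: the symbol "d" (and any other transition the DFA lacks) can never be emitted.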

Generating the token masks thus only requires a dictionary lookup at each state. Our method blows other libraries like Microsoft's guidance out of the water.

From there it was only a small leap to be able to generate text that follows a JSON schema (https://json-schema.org/), or is parseable into a Pydantic model (https://docs.pydantic.dev/latest/usage/models/). The method works with union types, optional types, nested schemas, arrays, everything. It is guaranteed that the output is parseable.
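To illustrate that leap, here is a heavily simplified schema-to-regex sketch for flat objects of string and integer fields (the `schema_to_regex` helper and its patterns are invented for this example; the library's actual conversion also handles unions, optionals, nesting, and arrays):

```python
import re

# Map primitive JSON Schema types to regex fragments (toy version).
TYPE_PATTERNS = {"string": r'"[^"]*"', "integer": r"-?\d+"}

def schema_to_regex(schema):
    """Turn a flat object schema into a regex its instances must match."""
    fields = [fr'"{name}":\s*{TYPE_PATTERNS[spec["type"]]}'
              for name, spec in schema["properties"].items()]
    return r"\{\s*" + r",\s*".join(fields) + r"\s*\}"

schema = {"type": "object",
          "properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}
pattern = schema_to_regex(schema)
print(bool(re.fullmatch(pattern, '{"name": "Alice", "age": 30}')))  # True
```

Once the schema is a regex, the DFA-guided sampling above applies unchanged, which is why the output is guaranteed to be parseable.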

I think it's cool, and I've spent a lot of time watching even tiny models output valid JSON over the weekend. Hope you will too.

I look forward to feedback, bug reports, feature requests and discussions!

Edit: Link to our pre-print explaining the method and how this can be extended to generate text that follows a Context-Free Grammar https://arxiv.org/abs/2307.09702

303 comments

[+] activatedgeek|2 years ago|reply
Mechanistically, I think this library takes the simple idea of masking part of the vocabulary space at each step in time, and implements it efficiently. Great!

I am curious, however, for the ones who have played around with such libraries wrapping base LLMs with output structure: do base models like Llama2 work very well? My experience says "hell no!" and you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.

And even then, it seems very counter-intuitive to me that given an instruction-tuned model, post-hoc masking of the state-space during generation then amounts to just changing the generation distribution, and is potentially detrimental to instruction-tuning?

[+] make3|2 years ago|reply
I'm not sure why you would want to use raw Llama-2, though, when there are a million super-strong instruction-fine-tuned versions of Llama-2 on the HF Hub that would do the job a million times better. Like Stability-AI's Beluga-2. See https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderb...

About your second point, the goal is that the model can only generate JSON (for example), which can 100% be done by constraining which output token can and cannot be used.

[+] simonw|2 years ago|reply
I'm quite impressed with Llama 2 13B - the more time I spend with it the more I think it might be genuinely useful for more than just playing around with local LLMs.

I'm using the MLC version (since that works with a GPU on my M2 Mac) via my https://github.com/simonw/llm-mlc plugin.

[+] LakshyAAAgrawal|2 years ago|reply
In our experience, at least for code generation, base models can be improved significantly by guiding token-level generation.

In our paper titled "Guiding Language Models of Code with Global Context using Monitors" (https://arxiv.org/abs/2306.10763), we propose Monitor Guided Decoding, which interfaces LLMs to static analysis, and guides the model to generate type-consistent code. Without any kind of fine-tuning, we show that using static analysis to guide token level generation at specific points leads to significantly improved quality of generated code, both in terms of compilability and match with ground truth. Even very small models (1.1B) are able to generate more compilable code than much larger models (175B) while also improving on match with ground truth.
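A toy illustration of the monitor idea (the type table, names, and helper here are invented, not the paper's implementation): at a member-access point, a static analysis restricts the candidate completions to members that actually exist on the receiver's type.

```python
# Hypothetical static-analysis result: variable name -> members of its type.
STATIC_TYPES = {"reader": ["read", "readline", "close"]}

def allowed_completions(code_prefix):
    """At `<identifier>.`, return only type-valid members; else no constraint."""
    if code_prefix.endswith("."):
        receiver = code_prefix.rstrip(".").split()[-1]
        return STATIC_TYPES.get(receiver, [])
    return None  # not at a member-access point: leave the logits unmasked

print(allowed_completions("data = reader."))  # ['read', 'readline', 'close']
```

The same masking machinery as for regexes applies; the constraint source is a static analyzer instead of an automaton.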

[+] ethbr1|2 years ago|reply
> ...given an instruction-tuned model, post-hoc masking of the state-space during generation then amounts to just changing the generation distribution...

Isn't that what we did with test driven development?

The primary difference was our generator functions were human instead of LLM. Why not cut out the middle-human?

[+] Havoc|2 years ago|reply
>you do need a fair bit of instruction-tuning for specific use cases to actually get things to work.

The instruction tuning part is "trivial"...it's the dealing with edge cases part that gets me.

With classic code, edge cases are, well, insignificant edge cases. With an LLM you never know what will make it go off on a tangent, and the parsing code needs to deal with that chaos.

Or put differently, the percentage of cases that are edge cases seems to have gone up dramatically.

[+] panarky|2 years ago|reply
I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten.

But it's still probabilistic, and nine times out of ten isn't good enough.

Occasionally it will hallucinate responses like this:

{"key1": "value1", "key2": "value2" for i in range(n)}

Re-prompting with the parsing error message is usually enough to get it on the second try.

But escaping double-quotes and newline characters is less reliable. Even after giving it multiple examples, it correctly escapes only about half the time.

Re-prompting for escaping errors still yields a ~50% success rate.
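The re-prompting loop described above can be sketched like this (the `get_json` helper, the fake model, and the prompts are all illustrative):

```python
import json

def get_json(call_model, prompt, max_retries=3):
    """Ask the model for JSON; on a parse failure, re-prompt with the error."""
    message = prompt
    for _ in range(max_retries):
        reply = call_model(message)
        try:
            return json.loads(reply)
        except json.JSONDecodeError as err:
            message = (f"{prompt}\n\nYour previous reply was not valid JSON "
                       f"({err}). Reply with valid JSON only:\n{reply}")
    raise ValueError("model never produced valid JSON")

# Fake model: hallucinates Python-ish syntax once, then answers correctly.
replies = iter(['{"key": "value" for i in range(n)}', '{"key": "value"}'])
print(get_json(lambda msg: next(replies), "Return a JSON object."))  # {'key': 'value'}
```

Each retry costs another full model call, which is exactly the overhead that constrained decoding avoids.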

[+] padolsey|2 years ago|reply
I've had more luck with getting it to output XML as (1) You can imbue XML with actual language/meaning (which LLMs adore) and (2) parsers can be made to be more forgiving. I get why people want to make JSON, but to me it's a bit like trying to get a cat to swim - you might eventually succeed, but it's not their natural inclination.
[+] caesil|2 years ago|reply
With ChatGPT function calling I get valid JSON 100% of the time from GPT-4 unless I have made some error in prompting.

The chief error is not providing escape hatches. LLMs look for a right answer. If you are feeding it some texts and asking it to return structured data about the texts, but then one of the texts is blank, it will be difficult to determine a right answer, so you get hallucinations. The solution is an escape hatch where one of the arguments is a `textIsMissing` boolean or something.

As long as you've accounted for these failure modes, it works flawlessly.
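As a sketch, an escape hatch in a function-calling schema might look like the following; apart from `textIsMissing` (named in the comment above), the function and field names are invented:

```python
# Illustrative OpenAI-style function schema with an explicit escape hatch,
# so the model has a "right answer" even when the input text is blank.
extract_fn = {
    "name": "record_text_metadata",
    "description": "Store structured data extracted from a text.",
    "parameters": {
        "type": "object",
        "properties": {
            "textIsMissing": {
                "type": "boolean",
                "description": "True if the input text was blank or unusable.",
            },
            "summary": {"type": "string"},
            "topics": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["textIsMissing"],
    },
}
print(extract_fn["parameters"]["required"])  # ['textIsMissing']
```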

[+] andreygrehov|2 years ago|reply
Meh... I asked GPT-4 to return sample PHP code inside a random JSON object. It failed the JSON linter on the very first try, and I couldn't get it to pass validation despite many retries, e.g. follow-up corrections. Not once did it generate 100% valid JSON; I eventually gave up.
[+] karmasimida|2 years ago|reply
I see grammar constrained generation for 2 major advantages:

1. It consumes fewer tokens, no need to add too many examples into the prompt.

2. It suffers less from the forgetting issue.

Another minor advantage is that you can control precisely where your desired output begins.

But overall, those are nice perks, not too substantial IMO.

[+] nextaccountic|2 years ago|reply
What about reprompting with a different temperature value?

If this works, how do you select the optimal value? Maybe you could train a model that excels at the task of querying GPT-4 for valid JSON.

[+] MuffinFlavored|2 years ago|reply
I wonder if the next iteration of OpenAI features is something like:

right now you can inject prompts that the LLM takes into consideration before the output

I wonder if you can make it have a "post" generation function that says like "keep re-trying in a loop (aka hallucinating with randomness) until the output message passes XYZ format/checks/scoring"

[+] msp26|2 years ago|reply
>I can make GPT4 return valid JSON simply by providing examples in the system message. This works nine times out of ten

But you can do both. For my current use case of extracting information from articles, I have a json schema + one/two example articles along with their correct answers. This increases token costs but 3.5 is so cheap that it doesn't matter and for 4 you can use batching to decrease token cost per article.

[+] phillipcarter|2 years ago|reply
This is what we do, but for GPT-3.5. And it doesn't need to be system messages either. We even have it emitting only JSON in a specific structure (except for when it fails to produce an output altogether). This is without the function calling model.
[+] keiferwiseman|2 years ago|reply
It took some iterations, but I've managed to get the OpenAI API to give me valid JSON 100% of the time now (based on my testing). I think I put in the prompt to never use newlines because they were causing issues lol.
[+] thumbsup-_-|2 years ago|reply
Yeah, same thing. I have done the same with GPT-3.5: simply ask it to output using the provided schema only and give a few examples. It always outputs in the provided JSON format.
[+] orasis|2 years ago|reply
What about using ChatGPT’s new function calling mechanism?
[+] hansvm|2 years ago|reply
A major part of the power of an LLM is the calibrated probability distribution in its responses, and this technique probably throws that ability away. Why is it good enough?

As a brief example, suppose the only possible LLM outputs were "hello world", "food", "hello", and "good day" (and that they're all equally probable with no prompting). Suppose your grammar requires a space in the output somewhere and has no other constraints. If you sampled LLM outputs till something passed the grammar you'd receive "hello world" and "good day" with equal probability. If you apply the website's technique you'll receive "hello world" twice as frequently as "good day".
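The arithmetic in this example can be checked directly; the sketch below compares rejection sampling against greedy per-token masking, treating each word as one "token":

```python
from fractions import Fraction

outputs = ["hello world", "food", "hello", "good day"]
prior = {o: Fraction(1, 4) for o in outputs}  # equally likely with no prompting
valid = [o for o in outputs if " " in o]      # grammar: must contain a space

# Rejection sampling: discard invalid samples, renormalize over valid outputs.
total_valid = sum(prior[v] for v in valid)
rejection = {v: prior[v] / total_valid for v in valid}

# Greedy masking: the prefix "hello" carries prior mass 1/2 ("hello" plus
# "hello world"); masking then forces the continuation " world", so all of that
# mass flows to "hello world". "good day" keeps its 1/4; "food" is masked out.
unnormalized = {"hello world": Fraction(1, 2), "good day": Fraction(1, 4)}
z = sum(unnormalized.values())
masked = {o: p / z for o, p in unnormalized.items()}

print(rejection["hello world"], masked["hello world"])  # 1/2 vs. 2/3
```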

The core problem is that an answer prefix might have been extremely unlikely to yield a valid response, but the technique (probably -- assuming it succeeds -- my example assumed retries would eventually succeed) constructs a valid response from it regardless. Assuming enough independence in the right places everything is fine and dandy still, but correlated errors compound quickly in autoregressive models.

As a brief JSON-specific question, is an LLM more or less likely to make factual errors (hallucinations, truncated strings, missing main characters, ...) when it produces a response failing to adhere to a schema? If factual error rate relates nontrivially to schema error rate then this path is more perilous than it seems. Given the outsized impact certain words or schmooshed together word-phrases seem to have on LLM output, I'd be surprised if details like schema adherence didn't bleed into other characteristics of the output.

[+] sneedchucker|2 years ago|reply
Relevant; LLama.cpp implemented grammar-based sampling last month.

https://news.ycombinator.com/item?id=36819906 https://github.com/ggerganov/llama.cpp/pull/1773

[+] remilouf|2 years ago|reply
We can extend our approach to grammar-based sampling, as explained in the paper linked above. Relevant PR: https://github.com/normal-computing/outlines/pull/178

Our method is much more efficient. llama.cpp loops over the entire vocabulary (~50k tokens) at each step to generate the mask. We generate an index at initialization, and building the masks at each step only requires a dictionary lookup (trade speed for memory). Sampling is just as fast as standard sampling.
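A toy sketch of that trade-off (not the library's actual data structures): one pass over the vocabulary at initialization builds a state-to-allowed-tokens index, so each generation step is a dictionary lookup instead of a scan over ~50k tokens.

```python
def build_index(dfa, states, vocab):
    """Precompute state -> allowed tokens in a single pass over the vocabulary."""
    index = {s: [] for s in states}
    for token in vocab:              # the single vocabulary pass
        for state in states:
            if (state, token) in dfa:
                index[state].append(token)
    return index

# Toy DFA in the same shape as before: (state, token) -> next state.
DFA = {(0, "a"): 1, (1, "b"): 1, (1, "c"): 2}
index = build_index(DFA, states=[0, 1, 2], vocab=["a", "b", "c", "d"])
print(index[1])  # ['b', 'c'] -- an O(1) lookup at each generation step
```

The memory cost is one token list per FSM state, which is the speed-for-memory trade mentioned above.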

[+] btwillard|2 years ago|reply
We also had an implementation of grammar-driven guidance around the same time: https://github.com/normal-computing/outlines/pull/131. I imagine many others did as well, given all the papers we found on the subject. The point of this and our ongoing work is the availability of very low cost guidance, which was implemented a while ago for the regex case and expanded upon with JSON.
[+] xigency|2 years ago|reply
Thanks for building this. The mechanics are such an obvious idea that it's astounding that the first-party platforms haven't done this yet. I would be interested to see how this could be used for other tasks outside of JSON that require structured input.
[+] BoorishBears|2 years ago|reply
I'm not sure how this is different than:

https://github.com/1rgs/jsonformer

or

https://github.com/newhouseb/clownfish

or

https://github.com/mkuchnik/relm

or

https://github.com/ggerganov/llama.cpp/pull/1773

or

https://github.com/Shopify/torch-grammar

Overall there are a ton of these logit based guidance systems, the reason they don't get tons of traction is the SOTA models are behind REST APIs that don't enable this fine-grained approach.

Those models perform so much better that people generally settle for just re-requesting until they get the correct format (and with GPT-4 that ends up being a fairly rare occurrence in my experience)

[+] J_Shelby_J|2 years ago|reply
So to explain this another way:

After each token generated by the LLM you update the logit bias “mask” to only allow the next token to be a valid json token?

Very slick!

[+] YeGoblynQueenne|2 years ago|reply
Hi, remilouf. You say that your background is in "probabilistic, relational and symbolic programming". In that case I suspect you understand that it is no problem to generate text from a regular or context-free grammar, or really any level of grammar. For example, you can do that very easily in Prolog (a relational language) given a grammar in Definite Clause Grammars notation.

As far as I can tell your approach requires a grammar to be given by a user. In that case, what is the advantage of using an LLM to generate text? Why can't you just run your grammar as a generator and generate the text you want? That would save you the considerable trouble and cost of training an LLM in the first place. And why would you need an LLM, a model of natural language, if all you want is to generate structured text, anyway?
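The first point is easy to make concrete: a grammar alone already generates well-formed text (the toy grammar below is invented for illustration).

```python
import random

# A tiny context-free grammar run directly as a generator, no LLM involved.
# Nonterminals map to lists of productions; anything else is a terminal.
GRAMMAR = {
    "value":  [["object"], ["string"], ["number"]],
    "object": [['{', "string", ':', "value", '}']],
    "string": [['"', "word", '"']],
    "word":   [["cat"], ["dog"]],
    "number": [["0"], ["42"]],
}

def generate(symbol="value"):
    if symbol not in GRAMMAR:                      # terminal symbol
        return symbol
    production = random.choice(GRAMMAR[symbol])
    return "".join(generate(s) for s in production)

print(generate())  # e.g. '{"dog":42}' or '"cat"'; always grammatical
```

What the grammar by itself cannot do is pick content conditioned on a natural-language prompt; that conditional distribution over grammatical strings is the part the LLM contributes.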

[+] aduffy|2 years ago|reply
This is exciting, we built a similar tool[1] recently specifically targeted at constraining llama output to match a TypeScript interface.

I firmly believe that output format guarantees are going to be important for real (non-toy) use cases for LLMs

[1] https://github.com/ggerganov/llama.cpp/discussions/2494

[+] Scaevolus|2 years ago|reply
Are there temperature or sampling parameters for generate.regex? I'm poking around trying to generate password mnemonics (https://rmmh.github.io/abbrase/), and it really doesn't like actually giving me proper words:

    >> model = models.transformers("gpt2-medium")
    >> generate.regex(model, r"Rea[a-z']{,10} lik[a-z']{,10} acr[a-z']{,10} ene[a-z']{,10} sta[a-z']{,10}\.", max_tokens=30)("A memorable phrase is:")
    'Rearmingandme like acrowetteanda eneatubootank stackfishkies.'
[+] Scene_Cast2|2 years ago|reply
One potential drawback I can see is if the viable tokens are far down the list of predictions. In that case, filtering down to just those tokens is a distribution shift with resulting output being less stable / less sensible.
[+] Scarblac|2 years ago|reply
It can't be less sensible JSON than syntactically invalid JSON. All the tokens higher on the list are syntax errors.
[+] pshc|2 years ago|reply
Exactly my concern. If the model isn't sure-footed about the path forward, it seems prudent to take that fact as information and adjust the initial conditions, rather than forcing the model into a potentially hallucinatory idea-space.
[+] remilouf|2 years ago|reply
Indeed, this remains an empirical question.
[+] contravariant|2 years ago|reply
More concretely, sometimes it is not enough to simply constrain the next token, backtracking might end up being better.
[+] Deukhoofd|2 years ago|reply
Looks interesting! How would you say it compares to Microsoft's TypeChat (beyond the obvious Python/TypeScript difference)?

https://microsoft.github.io/TypeChat/blog/introducing-typech...

[+] remilouf|2 years ago|reply
Thanks for bringing this library to my attention! From my understanding, TypeChat proceeds by (1) generating, (2) attempting validation, (3) if it fails, calling the LLM again to fix the output, (4) etc.

Our method, on the other hand, guarantees that the output will follow the specs of the JSON schema. No need to call the LLM several times.

[+] 2bitencryption|2 years ago|reply
TypeChat: let's try really hard to convince the model to make the highest-scoring tokens follow the grammar we want.

Guidance (and this project?): Let's not even bother with trying to convince the model; instead, we'll only sample from the set of tokens that are guaranteed to be correct for the grammar we want to emit.

[+] Ilasky|2 years ago|reply
OpenAI has this capability built in with functions[0], I believe! Building my own project[1] I have implemented functions in combination with guidance[2] and haven’t had a hiccup yet! I have a JSON parser function there, just in case, but it seems to be working reliably.

Here’s a bit more of a description of using the functions API for JSON returns: https://yonom.substack.com/p/native-json-output-from-gpt-4

[0] https://openai.com/blog/function-calling-and-other-api-updat...

[1] https://resgen.app

[2] https://github.com/guidance-ai/guidance

[+] londons_explore|2 years ago|reply
>OpenAI has this capability built in with functions

From OpenAI's docs:

> note: the model may generate invalid JSON

I would guess they don't use your method - and perhaps they should!

[+] thomasfromcdnjs|2 years ago|reply
I do the same, just tell OpenAI to call a parser at the end, and voilà.
[+] Animats|2 years ago|reply
OK, you get syntactically valid JSON, but does it contain the correct info? This is effectively a polisher, like spell check, which gives the output superficially correct form but doesn't understand the content. Right?
[+] anotherpaulg|2 years ago|reply
For complex tasks like coding, my experience is that asking for a complex output format hurts performance on the underlying task. This showed up clearly in code editing benchmarks of GPT-3.5 and GPT-4:

https://aider.chat/docs/benchmarks.html

I’m curious if you have measured whether the “constrained generation” that you’re doing suffers from similar downsides?

[+] simonw|2 years ago|reply
I really hope OpenAI add something like this to their endpoints soon.

Being able to pass up some kind of grammar (a regular expression, or a JSON schema, or some other format) and have this trick run during their token sampling process to ensure the output was compliant would be incredibly useful.

[+] coder543|2 years ago|reply
As a more general comment, the repo README provides examples that all use gpt2. It would be nice to see at least one example that invokes llama2, since I feel like that would make sure the reader knows that this library can use models that are more modern and interesting.
[+] Havoc|2 years ago|reply
Inclined to disagree - gpt2 is far more likely to produce gibberish. So if you can force specific outputs on that, then it's a good demo that higher-quality models will be even better.
[+] swyx|2 years ago|reply
it would also be nice to see one example that uses gpt4.
[+] lettergram|2 years ago|reply
Few thoughts, you're effectively creating representations that can convert to JSON (kudos!)

Can't mention how we did it (there are a lot of public patents, if interested), but back in 2018 we had a way to generate synthetic data (statistically, structurally similar) off any dataset - https://medium.com/capital-one-tech/why-you-dont-necessarily... You could also design datasets if you wanted.

It'd keep similar relations and worked pretty darn well. Not the exact same, but always produced valid JSON.

[+] remilouf|2 years ago|reply
Thank you for the pointer. The best part of posting on HN is the long list of related work you get in response.
[+] visarga|2 years ago|reply
Enforcing JSON schemas, regexes and grammars is very useful. But how can we enforce decoding spans from a document? The decoded text should be copied from a list of spans in the input document. That would be useful for extractive tasks.