Here's my understanding of how this works (please someone correct me if I'm getting this wrong).
Language models emit tokens one at a time, starting with the prompt that you give them.
If you have a conversation with an LLM, effectively you can think of that as you giving it a sequence of tokens, then it generates some, then you generate more, and so on.
This grammar trick effectively takes advantage of this by giving you much more finely grained control over the tokens. So you can do things like this:
Give me the address of the
White House as JSON:
{"street": "
Then the LLM can return:
1600 Pennsylvania Ave NW"
The moment you see that closing double quote, you take over again and inject:
",
"City": "
It fills in:
Washington, DC"
And so on.
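The interleaving trick above can be sketched in a few lines. This is a hedged illustration, not any particular API: the `complete` function is a stub standing in for a real LLM call with a stop sequence, and the canned values are made up for the example.

```python
def complete(prompt: str, stop: str) -> str:
    """Stub standing in for a real LLM API call that honors a stop sequence."""
    canned = {
        '"street": "': "1600 Pennsylvania Ave NW",
        '"city": "': "Washington, DC",
    }
    for key, value in canned.items():
        if prompt.endswith(key):
            return value
    return ""

def fill_template(fields: list[str]) -> str:
    """Build a JSON object by injecting the structure ourselves and letting
    the model fill in only the values between the quotes."""
    prompt = 'Give me the address of the White House as JSON:\n{'
    for i, field in enumerate(fields):
        prompt += f'"{field}": "'
        # Model generates until it emits the closing double quote.
        prompt += complete(prompt, stop='"')
        prompt += '"' + (", " if i < len(fields) - 1 else "")
    return prompt + "}"

print(fill_template(["street", "city"]))
```

The point is that the application, not the model, owns the punctuation and keys; the model only ever fills in free-text spans.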
But because this is all based on a grammar, you can do way more with it than just JSON.
I saw a brilliant suggestion relating to this on Twitter a while ago:
> @OpenAI should add an API argument allowing passing up a deterministic context free grammar.
> [...]
> While I think DCFL is what you want here in the short term, the really best thing is passing up a small WASM binary that simply is the sampler.
> Allow a user to pass up a few KB of WASM binary and give it a few megabytes of RAM to run. Would enable next level LLM superpowers.
https://twitter.com/grantslatton/status/1637692033115762688
Not just that: the LLM doesn't output individual tokens, but a weighted list of candidates. The most probable (“best”) token has the highest weight, but there may be many alternatives, including JSON symbols like quote characters.
The “temperature” setting adjusts how likely it is that a token other than the top-rated option gets chosen. That helps prevent repetitive output.
Forcing an LLM to obey a grammar is mostly about filtering that list before the token choice is made. There may still be a random element controlled by the temperature!
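A minimal sketch of how temperature interacts with that weighted list: logits are divided by the temperature before the softmax, so higher temperatures flatten the distribution and make non-top tokens more likely. The token strings and logit values here are toy assumptions, not real model output.

```python
import math
import random

def sample(logits: dict[str, float], temperature: float,
           rng: random.Random) -> str:
    # Softmax over temperature-scaled logits (subtract max for stability).
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    exp = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}
    # Weighted random choice over the distribution.
    r = rng.random()
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r < cum:
            return tok
    return tok  # guard against floating-point rounding

logits = {'"': 4.0, "Washington": 3.5, "the": 1.0}
print(sample(logits, temperature=0.8, rng=random.Random(0)))
```

At temperatures near zero this collapses to greedy sampling; at high temperatures even low-weight tokens get a real chance, which is exactly what grammar filtering has to guard against.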
A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.
No, the way it works is that the current output + potential next tokens to be sampled are checked with the grammar. All potential tokens that don't match are removed. Then, with the list of valid tokens left, normal sampling strategies are used.
I don't think this is correct; previously you could already control output by reading tokens one at a time from the LLM until you hit a stop character.
My take from the grammar-based sampling PR is that you ask llama.cpp to constrain the next output token to a restricted set of possible tokens, using the grammar.
I'm struggling to understand what he's talking about. Starting with "passing up," did he invent this terminology? The only input you have to an LLM is the prompt, which gets tokenized. And if you were to send DCFG rules or a compiled version of it as part of the request, how would that fundamentally alter the way that the tokens are predicted? If the model predicts something that doesn't conform to the grammar you require, is he proposing re-prompting until it gets it right?
https://github.com/microsoft/guidance
I think it should be noted that this enforces grammatical constraints on the model's generated text, but it doesn't do anything to properly align the content. This would be useful if you needed to ensure a server delivered well-formatted JSON, but I suspect it won't solve a lot of alignment issues with current language generation. For example, current iterations of Llama and GPT often do not label markdown code-blocks correctly. Using grammar-based sampling, you could enforce that it labels code blocks, but you couldn't enforce correct labeling, since this is context-dependent. You also couldn't invent a novel domain-specific language without aligning against that language and expect good output.
I can imagine a system where, for instance, a markdown lora and a markdown grammar file can be hotswapped in and out.
Also important to call out that anytime you have a freeform string, it's pretty much an open invitation for the LLM to go completely haywire and run off into all sorts of weird tangents. So these methods are best used with other heuristics to bias sampling once you get to free-form text territory (e.g. a repetition penalty).
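One such heuristic, sketched: a repetition penalty divides the logit of every already-generated token (or multiplies it, when the logit is negative), making loops less likely. The dict-of-logits shape and the 1.2 default are illustrative assumptions; this follows the commonly used scheme from the CTRL paper.

```python
def apply_repetition_penalty(logits: dict[str, float],
                             generated: list[str],
                             penalty: float = 1.2) -> dict[str, float]:
    """Push down the logits of tokens that already appear in the output."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            l = out[tok]
            # Divide positive logits, multiply negative ones, so the
            # penalty always moves the token's score downward.
            out[tok] = l / penalty if l > 0 else l * penalty
    return out

penalized = apply_repetition_penalty(
    {"the": 2.0, "a": 1.0, "bad": -1.0},
    generated=["the", "bad"],
)
print(penalized)  # "the" and "bad" are penalized, "a" is untouched
```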
I will have to think about how I can build grammars that force things like syllable counts or syntactic rules. Current LLMs do very poorly on those kinds of tasks due to the tokenization schemes...
I implemented this for PyTorch too at https://github.com/Shopify/torch-grammar. I have a hacked version of text-generation-inference that uses it—happy to share that if it’s useful to anyone.
I've been meaning to play around with dumping the token probability vectors inside one of the LLM UIs. Having a different starting point would help a bunch.
Specifically for multi-choice string enums (essentially dropdowns), I wonder if this would work better if the full (joint/product) probability given the logits is considered when picking the final choice, rather than using a greedy algorithm. This will favor the right choice, as opposed to e.g. one of the choices that contains the most common start token, if that start token is shared among many items in the list.
Of course the probability needs to be adjusted once a subset of the logits goes to zero so it actually makes sense...
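A toy sketch of that joint-probability idea. The conditional probability table below is a made-up stand-in for real model logits, deliberately constructed so that the greedy first token ("north") loses to a different choice once full sequences are scored.

```python
import math

# p(token | tokens so far); "<end>" terminates a choice. Toy numbers:
# "north" is the most likely first token, but its continuations are weak,
# so "south" wins on joint probability.
COND_P = {
    (): {"north": 0.5, "south": 0.4},
    ("north",): {"<end>": 0.3, "east": 0.1},
    ("north", "east"): {"<end>": 0.9},
    ("south",): {"<end>": 0.9},
}

def joint_logprob(tokens: list[str]) -> float:
    """Sum of per-token log-probabilities for a full choice."""
    prefix, total = (), 0.0
    for tok in tokens + ["<end>"]:
        total += math.log(COND_P[prefix][tok])
        prefix = prefix + (tok,)
    return total

choices = {
    "north": ["north"],
    "northeast": ["north", "east"],
    "south": ["south"],
}
best = max(choices, key=lambda c: joint_logprob(choices[c]))
print(best)  # prints "south"; a greedy sampler would have committed to "north"
```

Renormalizing over just the enum's token sequences, as the comment suggests, is then a matter of dividing each joint probability by their sum.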
https://github.com/antlr/grammars-v4
There is everything in there from assembly and C++ to GLSL and scripting languages, arithmetic, games, and other weird formats like freedesktop shortcuts, LLVM IR, or Verilog.
A convenience feature in any inference API would be to specify a shortcut to a standardized grammar such as HTML, JSON, Python, etc. It is frankly strange to me that OpenAI have not already done this, considering the obvious effort they undertook to fine-tune the Code Interpreter model.
Can someone ELI5 what's going on here? I'm reasonably familiar with LLMs, but I can't quite grok what Georgi is doing here and why it's so exciting for some.
An LLM does not generate "the next token" - from an input text, it generates a vector of probabilities where each slot in the vector corresponds to a token. The value in a token's slot is (approximately) the probability that that particular token might appear next in the text.
Programs like ChatGPT "interpret" that vector of probabilities to generate text by selecting (sampling) one of the top tokens. But sometimes this is too flexible -- for example, ChatGPT might generate invalid JSON when you want JSON output because it chose a token that does not conform to the JSON grammar.
A way to "force" an LLM to generate e.g. JSON is to change the sampling process. Instead of choosing any top token, we first filter the tokens to just those that conform to the JSON grammar. Then, we sample one of the top tokens from that subset.
If you ask an LLM to generate JSON or another language that has a grammar, it will sometimes produce invalid syntax. This pull request constrains the LLM so that it can only output valid syntax according to whatever grammar you supply. It's a modification to the sampling procedure.
What is the sampling procedure? Well, the way an LLM generates text is one token (short sequence of characters) at a time. First the giant neural net assigns a probability to every possible token (this is the hard part). Then a sampling procedure uses the probabilities to pick one of the tokens, and the process repeats.
The sampling procedure is not a neural net and can be modified in many different ways. You might think that the sampling procedure should always simply pick the token with the highest probability (greedy sampling). You can do that, but it's usually better to pick at random weighted by the probabilities. This gives more diversity and is less likely to get stuck in loops. But this means that literally any token with nonzero probability might get picked, so you can see how this might lead to invalid JSON being generated sometimes. This pull request zeros out the probabilities of all the tokens that wouldn't be valid according to your grammar, so they can't be picked.
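The zero-out-and-sample mechanism described above, as a minimal sketch. The `grammar_ok` callback is a stand-in for a real grammar check (llama.cpp maintains a parser state for its grammar format rather than re-checking the whole string), and the probabilities are toy values.

```python
import random

def constrained_sample(probs: dict[str, float], output_so_far: str,
                       grammar_ok, rng: random.Random) -> str:
    # Drop (i.e. zero out) every candidate the grammar would reject.
    allowed = {t: p for t, p in probs.items()
               if grammar_ok(output_so_far + t)}
    if not allowed:
        raise ValueError("grammar rejects every candidate token")
    # Sample within the remaining probability mass; scaling the random
    # draw by the remaining total renormalizes implicitly.
    r = rng.random() * sum(allowed.values())
    cum = 0.0
    for tok, p in allowed.items():
        cum += p
        if r < cum:
            return tok
    return tok  # guard against floating-point rounding

# Toy "grammar": output must consist only of digits.
digits_only = lambda s: s == "" or s.isdigit()
probs = {"4": 0.3, "2": 0.2, "x": 0.5}
print(constrained_sample(probs, "", digits_only, random.Random(1)))
```

Note that "x" has the highest raw probability but can never be picked; temperature and other sampling tweaks still operate on the surviving candidates.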
BTW there are lots of other interesting modifications to the sampling process you could consider. For example, maybe you can see that in the process of sampling tokens one after the other you might paint yourself into a corner and end up with no good options to choose from. So maybe it makes sense to allow backtracking. In fact, maybe at each sampling step we can consider multiple options, making a tree of possible outputs, and at the end we can pick the path through the tree with the highest overall probability. Of course we can't consider every option; it would be a complete tree with a branching factor of the number of possible tokens, which would grow exponentially. Let's prune the tree at each step and only consider the top, say, five paths we've seen so far. This is called "beam search". It's not normally used for LLMs because the neural net that generates the probabilities is very expensive to run and multiplying that cost by a factor of e.g. five is unpalatable. But it can be done, and produces somewhat better results. You could also consider using MCTS like chess engines do.
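The beam search idea above can be sketched like this. The conditional probability table is a toy stand-in for the (expensive) neural net forward pass, chosen so that the greedy path loses to another path on overall probability.

```python
import math

# Toy p(token | tokens so far); "<end>" terminates a sequence. The greedy
# path starts with "A" (0.6), but the "B" path wins overall (0.38 vs 0.33).
COND_P = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"C": 0.55, "<end>": 0.45},
    ("A", "C"): {"<end>": 1.0},
    ("B",): {"<end>": 0.95},
}

def beam_search(cond_p, beam_width: int = 2, max_len: int = 3):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for toks, lp in beams:
            if toks and toks[-1] == "<end>":
                candidates.append((toks, lp))  # finished beams carry over
                continue
            for tok, p in cond_p.get(toks, {}).items():
                candidates.append((toks + (tok,), lp + math.log(p)))
        # Prune: keep only the top `beam_width` paths seen so far.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

print(beam_search(COND_P))  # prints ('B', '<end>')
```

With `beam_width=1` this degenerates to greedy sampling and would return the "A" path instead; the width is exactly the cost multiplier the comment mentions.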
LLMs are happy to generate arbitrary strings. You might want one to spit out something along the lines of "Alice: 42", and instead it spits out "hi, i'm helpful and Alice is exactly forty two, as far as I can tell, but I'm just a language model."
So you give it a grammar that says the response has to be an uppercase letter followed by lowercase letters, then a colon, then a space, then digits, then it's done. Now, when it looks for that first token, it will only consider tokens that are compatible with that pattern. Then it'll continue with only next tokens that are compatible with the next parts of the pattern.
These grammars do that with a flexible and useful kind of pattern.
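A toy version of that pattern, sketched as a hand-rolled state machine that accepts prefixes of "uppercase letter, lowercase letters, colon, space, digits", then used to filter a made-up candidate vocabulary. Real implementations track a parser state incrementally instead of re-scanning the string, but the filtering logic is the same.

```python
def valid_prefix(s: str) -> bool:
    """True if s is a prefix of some string matching [A-Z][a-z]*: [0-9]+."""
    state = "start"
    for c in s:
        if state == "start":
            if not c.isupper():
                return False
            state = "name"
        elif state == "name":
            if c.islower():
                continue
            if c == ":":
                state = "colon"
            else:
                return False
        elif state == "colon":
            if c != " ":
                return False
            state = "space"
        else:  # "space" or "digits": only digits may follow
            if not c.isdigit():
                return False
            state = "digits"
    return True

def allowed_tokens(output: str, vocab: list[str]) -> list[str]:
    """Keep only tokens that leave the output a valid prefix of the pattern."""
    return [t for t in vocab if valid_prefix(output + t)]

vocab = ["Alice", "alice", ": ", "42", " forty-two", "Hello there"]
print(allowed_tokens("", vocab))         # ['Alice']
print(allowed_tokens("Alice", vocab))    # ['alice', ': ']
print(allowed_tokens("Alice: ", vocab))  # ['42']
```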
I'm interested in this and I'm going to try incorporating it into something I'm doing. That said, I feel like this could be one of those Bitter Lesson situations where it's not the most effective approach in anything but the very short term: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
It may be a stop-gap, but it's an important one, as it is not obvious that LLMs in the next few years will "organically" solve their issues with generating text under constraints.
Not an expert at all, but I believe that OpenAI uses this in some of their GPT APIs which are meant for programmatic use. I've seen it theorized that offloading the rote grammar stuff to a simple process that is meant for it lets the LLM use its "brainpower" on the complicated stuff more effectively. No idea if this is true.
Can anyone recommend some paper or overview on how "sampling" / "decoding" is done in the e2e neural network age? I know how decoding was done for machine translation and speech recognition back in the HMM times (i.e. https://en.wikipedia.org/wiki/Viterbi_algorithm and https://en.wikipedia.org/wiki/Beam_search). These days I get the impression people just do "greedy" - but I don't really know. Any recommendations for info on that topic?
https://platform.openai.com/docs/api-reference/completions/c...
Of course we now know that GPT-4 is a Mixture of Experts, so under the hood they’re parallelizing computation. They also include a way to modify the logits with presence/frequency penalty terms.
But LLMs are usually very good at following grammars. I rarely see an LLM generating code that is OOD. Of course, this is only true for popular languages (JSON/Python/Java, etc.); I can see how this is handy for more niche, in-house DSLs.
You still need quite a lot of prompt engineering to get desired outputs; this just adds another layer of output verification IMO. But does it really save much compared to getting the output, then parsing and rejecting output that doesn't follow the grammar? Might be debatable.
But great work regardless.
> this just adds another layer of output verification IMO.
It's not verifying the output after it's done, it's constraining the output as it's generated.
> But does it really save much compared to getting the output, then parsing and rejecting output that doesn't follow the grammar? Might be debatable.
I don't think it's debatable at all. Forcing the model to conform to a grammar during generation means there is never a need to discard and regenerate because it got the grammar wrong.
Think of the compute involved in generating the whole output and then re-generating if it's non-conformant. There is no comparison.
So, umm, if you want to walk BNF and emit likely tokens you can do that without any "machine learning" or whatever you want to call it. So what is being added here? Training to tie the prompt to the output?
The difference is in the word “likely”. You can put unstructured data in the prompt and get structured data out. You could put in the beginning of a list and ask for a continuation.
Ah finally, this was discussed a lot and is well overdue. Remains to be seen how well the models will adapt to this new constraint, though the demo seems promising.
Something I’m wondering lately: if you are generating tokens fast enough, is restricting the logits actually worth it computationally? If tokens are cheap enough, it might be more efficient to validate/discard them as they come rather than place constraints on how they come out. I don’t know how this one works, but the sampling or renormalizing scheme would cost something too, right?
There are at least 6 orders of magnitude difference between the computation cost of producing a token (a pass through a multi-billion-parameter model) and doing a single step of a program. Even if your validation is really naive, it's hard to beat 6 orders of magnitude.
So no, you're not generating tokens fast enough.
This can constrain an LLM's output to an arbitrary grammar/format as it is generated, rather than asking the model to output a specific format and hoping it outputs something valid.