Here's my understanding of how this works (please someone correct me if I'm getting this wrong).
Language models emit tokens one at a time, starting with the prompt that you give them.
If you have a conversation with an LLM, effectively you can think of that as you giving it a sequence of tokens, then it generates some, then you generate more, and so on.
This grammar trick effectively takes advantage of this by giving you much more finely grained control over the tokens. So you can do things like this:
Give me the address of the
White House as JSON:
{"street": "
Then the LLM can return:
1600 Pennsylvania Ave NW"
The moment you see that closing double quote, you take over again and inject:
",
"City": "
It fills in:
Washington, DC"
And so on.
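The interleaving trick above can be sketched in a few lines. This is a hedged illustration, not any particular API: the `complete` function is a stub standing in for a real LLM call with a stop sequence, and the canned values are made up for the example.

```python
def complete(prompt: str, stop: str) -> str:
    """Stub standing in for a real LLM API call that honors a stop sequence."""
    canned = {
        '"street": "': "1600 Pennsylvania Ave NW",
        '"city": "': "Washington, DC",
    }
    for key, value in canned.items():
        if prompt.endswith(key):
            return value
    return ""

def fill_template(fields: list[str]) -> str:
    """Build a JSON object by injecting the structure ourselves and letting
    the model fill in only the values between the quotes."""
    prompt = 'Give me the address of the White House as JSON:\n{'
    for i, field in enumerate(fields):
        prompt += f'"{field}": "'
        # Model generates until it emits the closing double quote.
        prompt += complete(prompt, stop='"')
        prompt += '"' + (", " if i < len(fields) - 1 else "")
    return prompt + "}"

print(fill_template(["street", "city"]))
```

The point is that the application, not the model, owns the punctuation and keys; the model only ever fills in free-text spans.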
But because this is all based on a grammar, you can do way more with it than just JSON.
I saw a brilliant suggestion relating to this on Twitter a while ago:
> @OpenAI should add an API argument allowing passing up a deterministic context free grammar.
> [...]
> While I think DCFL is what you want here in the short term, the really best thing is passing up a small WASM binary that simply is the sampler.
> Allow a user to pass up a few KB of WASM binary and give it a few megabytes of RAM to run. Would enable next level LLM superpowers.
https://twitter.com/grantslatton/status/1637692033115762688
Not just that: the LLM doesn't output individual tokens, but a weighted list of candidates. The most probable (“best”) token has the highest weight, but there may be many alternatives, including JSON symbols like quote characters.
The “temperature” setting adjusts how likely it is that a token other than the top-rated option gets chosen. That helps prevent repetitive output.
Forcing an LLM to obey a grammar is mostly about filtering that list before the token choice is made. There may still be a random element controlled by the temperature!
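A minimal sketch of how temperature interacts with that weighted list: logits are divided by the temperature before the softmax, so higher temperatures flatten the distribution and make non-top tokens more likely. The token strings and logit values here are toy assumptions, not real model output.

```python
import math
import random

def sample(logits: dict[str, float], temperature: float,
           rng: random.Random) -> str:
    # Softmax over temperature-scaled logits (subtract max for stability).
    scaled = {tok: l / temperature for tok, l in logits.items()}
    m = max(scaled.values())
    exp = {tok: math.exp(s - m) for tok, s in scaled.items()}
    total = sum(exp.values())
    probs = {tok: e / total for tok, e in exp.items()}
    # Weighted random choice over the distribution.
    r = rng.random()
    cum = 0.0
    for tok, p in probs.items():
        cum += p
        if r < cum:
            return tok
    return tok  # guard against floating-point rounding

logits = {'"': 4.0, "Washington": 3.5, "the": 1.0}
print(sample(logits, temperature=0.8, rng=random.Random(0)))
```

At temperatures near zero this collapses to greedy sampling; at high temperatures even low-weight tokens get a real chance, which is exactly what grammar filtering has to guard against.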
A more advanced feature not commonly used is to also enable back-tracking if the AI gets stuck and can’t produce a valid output.
No, the way it works is that the current output + potential next tokens to be sampled are checked with the grammar. All potential tokens that don't match are removed. Then, with the list of valid tokens left, normal sampling strategies are used.
I don't think this is correct; previously you could already control output by reading tokens one at a time from the LLM until you hit a stop character.
My take from the grammar-based sampling PR is that you ask llama.cpp to constrain the next output token to a restricted set of possible tokens, using the grammar.
I'm struggling to understand what he's talking about. Starting with "passing up," did he invent this terminology? The only input you have to an LLM is the prompt, which gets tokenized. And if you were to send DCFG rules or a compiled version of it as part of the request, how would that fundamentally alter the way that the tokens are predicted? If the model predicts something that doesn't conform to the grammar you require, is he proposing re-prompting until it gets it right?
https://github.com/microsoft/guidance
I think it should be noted that this enforces grammatical constraints on the model's generated text, but it doesn't do anything to properly align the content. This would be useful if you needed to ensure a server delivered well-formatted JSON, but I suspect it won't solve a lot of alignment issues with current language generation. For example, current iterations of Llama and GPT often do not label markdown code-blocks correctly. Using grammar-based sampling, you could enforce that it labels code blocks, but you couldn't enforce correct labeling, since this is context-dependent. You also couldn't invent a novel domain-specific language without aligning against that language and expect good output.
I can imagine a system where, for instance, a markdown lora and a markdown grammar file can be hotswapped in and out.
Also important to call out that anytime you have a freeform string, it's pretty much an open invitation for the LLM to go completely haywire and run off into all sorts of weird tangents. So these methods are best used with other heuristics to bias sampling once you get to free-form text territory (e.g. a repetition penalty).
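One such heuristic, sketched: a repetition penalty divides the logit of every already-generated token (or multiplies it, when the logit is negative), making loops less likely. The dict-of-logits shape and the 1.2 default are illustrative assumptions; this follows the commonly used scheme from the CTRL paper.

```python
def apply_repetition_penalty(logits: dict[str, float],
                             generated: list[str],
                             penalty: float = 1.2) -> dict[str, float]:
    """Push down the logits of tokens that already appear in the output."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            l = out[tok]
            # Divide positive logits, multiply negative ones, so the
            # penalty always moves the token's score downward.
            out[tok] = l / penalty if l > 0 else l * penalty
    return out

penalized = apply_repetition_penalty(
    {"the": 2.0, "a": 1.0, "bad": -1.0},
    generated=["the", "bad"],
)
print(penalized)  # "the" and "bad" are penalized, "a" is untouched
```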
I will have to think about how I can build grammars that force things like syllable counts or syntactic rules. Current LLMs do very poorly on those kinds of tasks due to the tokenization schemes...
I implemented this for PyTorch too at https://github.com/Shopify/torch-grammar. I have a hacked version of text-generation-inference that uses it—happy to share that if it’s useful to anyone.
I've been meaning to play around with dumping the token probability vectors inside one of the LLM UIs. Having a different starting point would help a bunch.
Specifically for multi-choice string enums (essentially dropdowns), I wonder if this would work better if the full (joint/product) probability given the logits is considered when picking the final choice, rather than using a greedy algorithm. This will favor the right choice, as opposed to e.g. one of the choices that contains the most common start token, if that start token is shared among many items in the list.
Of course the probability needs to be adjusted once a subset of the logits goes to zero so it actually makes sense...
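A toy sketch of that joint-probability idea. The conditional probability table below is a made-up stand-in for real model logits, deliberately constructed so that the greedy first token ("north") loses to a different choice once full sequences are scored.

```python
import math

# p(token | tokens so far); "<end>" terminates a choice. Toy numbers:
# "north" is the most likely first token, but its continuations are weak,
# so "south" wins on joint probability.
COND_P = {
    (): {"north": 0.5, "south": 0.4},
    ("north",): {"<end>": 0.3, "east": 0.1},
    ("north", "east"): {"<end>": 0.9},
    ("south",): {"<end>": 0.9},
}

def joint_logprob(tokens: list[str]) -> float:
    """Sum of per-token log-probabilities for a full choice."""
    prefix, total = (), 0.0
    for tok in tokens + ["<end>"]:
        total += math.log(COND_P[prefix][tok])
        prefix = prefix + (tok,)
    return total

choices = {
    "north": ["north"],
    "northeast": ["north", "east"],
    "south": ["south"],
}
best = max(choices, key=lambda c: joint_logprob(choices[c]))
print(best)  # prints "south"; a greedy sampler would have committed to "north"
```

Renormalizing over just the enum's token sequences, as the comment suggests, is then a matter of dividing each joint probability by their sum.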
https://github.com/antlr/grammars-v4
There is everything in there from assembly and C++ to GLSL and scripting languages, arithmetic, games, and other weird formats like freedesktop shortcuts, LLVM IR, or Verilog.
A convenience feature in any inference API would be to specify a shortcut to a standardized grammar such as HTML, JSON, Python, etc. It is frankly strange to me that OpenAI have not already done this, considering the obvious effort they undertook to fine-tune the Code Interpreter model.
Can someone ELI5 what's going on here? I'm reasonably familiar with LLMs, but I can't quite grok what Georgi is doing here and why it's so exciting for some.
An LLM does not generate "the next token" - from an input text, it generates a vector of probabilities where each slot in the vector corresponds to a token. The value in a token's slot is (approximately) the probability that that particular token might appear next in the text.
Programs like ChatGPT "interpret" that vector of probabilities to generate text by selecting (sampling) one of the top tokens. But sometimes this is too flexible -- for example, ChatGPT might generate invalid JSON when you want JSON output because it chose a token that does not conform to the JSON grammar.
A way to "force" an LLM to generate e.g. JSON is to change the sampling process. Instead of choosing any top token, we first filter the tokens to just those that conform to the JSON grammar. Then, we sample one of the top tokens from that subset.
If you ask an LLM to generate JSON or another language that has a grammar, it will sometimes produce invalid syntax. This pull request constrains the LLM so that it can only output valid syntax according to whatever grammar you supply. It's a modification to the sampling procedure.
What is the sampling procedure? Well, the way an LLM generates text is one token (short sequence of characters) at a time. First the giant neural net assigns a probability to every possible token (this is the hard part). Then a sampling procedure uses the probabilities to pick one of the tokens, and the process repeats.
The sampling procedure is not a neural net and can be modified in many different ways. You might think that the sampling procedure should always simply pick the token with the highest probability (greedy sampling). You can do that, but it's usually better to pick at random weighted by the probabilities. This gives more diversity and is less likely to get stuck in loops. But this means that literally any token with nonzero probability might get picked, so you can see how this might lead to invalid JSON being generated sometimes. This pull request zeros out the probabilities of all the tokens that wouldn't be valid according to your grammar, so they can't be picked.
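The zero-out-and-sample mechanism described above, as a minimal sketch. The `grammar_ok` callback is a stand-in for a real grammar check (llama.cpp maintains a parser state for its grammar format rather than re-checking the whole string), and the probabilities are toy values.

```python
import random

def constrained_sample(probs: dict[str, float], output_so_far: str,
                       grammar_ok, rng: random.Random) -> str:
    # Drop (i.e. zero out) every candidate the grammar would reject.
    allowed = {t: p for t, p in probs.items()
               if grammar_ok(output_so_far + t)}
    if not allowed:
        raise ValueError("grammar rejects every candidate token")
    # Sample within the remaining probability mass; scaling the random
    # draw by the remaining total renormalizes implicitly.
    r = rng.random() * sum(allowed.values())
    cum = 0.0
    for tok, p in allowed.items():
        cum += p
        if r < cum:
            return tok
    return tok  # guard against floating-point rounding

# Toy "grammar": output must consist only of digits.
digits_only = lambda s: s == "" or s.isdigit()
probs = {"4": 0.3, "2": 0.2, "x": 0.5}
print(constrained_sample(probs, "", digits_only, random.Random(1)))
```

Note that "x" has the highest raw probability but can never be picked; temperature and other sampling tweaks still operate on the surviving candidates.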
BTW there are lots of other interesting modifications to the sampling process you could consider. For example, maybe you can see that in the process of sampling tokens one after the other you might paint yourself into a corner and end up with no good options to choose from. So maybe it makes sense to allow backtracking. In fact, maybe at each sampling step we can consider multiple options, making a tree of possible outputs, and at the end we can pick the path through the tree with the highest overall probability. Of course we can't consider every option; it would be a complete tree with a branching factor of the number of possible tokens, which would grow exponentially. Let's prune the tree at each step and only consider the top, say, five paths we've seen so far. This is called "beam search". It's not normally used for LLMs because the neural net that generates the probabilities is very expensive to run and multiplying that cost by a factor of e.g. five is unpalatable. But it can be done, and produces somewhat better results. You could also consider using MCTS like chess engines do.
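The beam search idea above can be sketched like this. The conditional probability table is a toy stand-in for the (expensive) neural net forward pass, chosen so that the greedy path loses to another path on overall probability.

```python
import math

# Toy p(token | tokens so far); "<end>" terminates a sequence. The greedy
# path starts with "A" (0.6), but the "B" path wins overall (0.38 vs 0.33).
COND_P = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"C": 0.55, "<end>": 0.45},
    ("A", "C"): {"<end>": 1.0},
    ("B",): {"<end>": 0.95},
}

def beam_search(cond_p, beam_width: int = 2, max_len: int = 3):
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for toks, lp in beams:
            if toks and toks[-1] == "<end>":
                candidates.append((toks, lp))  # finished beams carry over
                continue
            for tok, p in cond_p.get(toks, {}).items():
                candidates.append((toks + (tok,), lp + math.log(p)))
        # Prune: keep only the top `beam_width` paths seen so far.
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:beam_width]
    return beams[0][0]

print(beam_search(COND_P))  # prints ('B', '<end>')
```

With `beam_width=1` this degenerates to greedy sampling and would return the "A" path instead; the width is exactly the cost multiplier the comment mentions.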
LLMs are happy to generate arbitrary strings. You might want one to spit out something along the lines of "Alice: 42", and instead it spits out "hi, i'm helpful and Alice is exactly forty two, as far as I can tell, but I'm just a language model."
So you give it a grammar that says the response has to be an uppercase letter followed by lowercase letters, then a colon, then a space, then digits, then it's done. Now, when it looks for that first token, it will only consider tokens that are compatible with that pattern. Then it'll continue with only next tokens that are compatible with the next parts of the pattern.
These grammars do that with a flexible and useful kind of pattern.
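A toy version of that pattern, sketched as a hand-rolled state machine that accepts prefixes of "uppercase letter, lowercase letters, colon, space, digits", then used to filter a made-up candidate vocabulary. Real implementations track a parser state incrementally instead of re-scanning the string, but the filtering logic is the same.

```python
def valid_prefix(s: str) -> bool:
    """True if s is a prefix of some string matching [A-Z][a-z]*: [0-9]+."""
    state = "start"
    for c in s:
        if state == "start":
            if not c.isupper():
                return False
            state = "name"
        elif state == "name":
            if c.islower():
                continue
            if c == ":":
                state = "colon"
            else:
                return False
        elif state == "colon":
            if c != " ":
                return False
            state = "space"
        else:  # "space" or "digits": only digits may follow
            if not c.isdigit():
                return False
            state = "digits"
    return True

def allowed_tokens(output: str, vocab: list[str]) -> list[str]:
    """Keep only tokens that leave the output a valid prefix of the pattern."""
    return [t for t in vocab if valid_prefix(output + t)]

vocab = ["Alice", "alice", ": ", "42", " forty-two", "Hello there"]
print(allowed_tokens("", vocab))         # ['Alice']
print(allowed_tokens("Alice", vocab))    # ['alice', ': ']
print(allowed_tokens("Alice: ", vocab))  # ['42']
```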
I'm interested in this and I'm going to try incorporating it into something I'm doing. That said, I feel like this could be one of those Bitter Lesson situations where it's not the most effective approach in anything but the very short term: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
It may be a stop-gap, but it's an important one, as it is not obvious that LLMs in the next few years will "organically" solve their issues with generating text under constraints.
Not an expert at all, but I believe that OpenAI uses this in some of their GPT APIs which are meant for programmatic use. I've seen it theorized that offloading the rote grammar stuff to a simple process that is meant for it lets the LLM use its "brainpower" on the complicated stuff more effectively. No idea if this is true.
Can anyone recommend some paper or overview on how "sampling" / "decoding" is done in the e2e neural network age? I know how decoding was done for machine translation and speech recognition back in the HMM times (i.e. https://en.wikipedia.org/wiki/Viterbi_algorithm and https://en.wikipedia.org/wiki/Beam_search). These days I get the impression people just do "greedy" - but I don't really know. Any recommendations for info on that topic?
https://platform.openai.com/docs/api-reference/completions/c...
Of course we now know that GPT-4 is a Mixture of Experts, so under the hood they’re parallelizing computation. They also include a way to modify the logits with presence/frequency penalty terms.
But LLMs are usually very good at following grammars. I rarely see an LLM generating code that is OOD. Of course, this is only true for popular languages (JSON/Python/Java, etc.); I can see how this is handy for more niche, in-house DSLs.
You still need quite a lot of prompt engineering to get desired outputs; this just adds another layer of output verification IMO. But does it really save much compared to getting the output, then parsing and rejecting output that doesn't follow the grammar? Might be debatable.
But great work regardless.
> this just adds another layer of output verification IMO.
It's not verifying the output after it's done, it's constraining the output as it's generated.
> But does it really save much compared to getting the output, then parsing and rejecting output that doesn't follow the grammar? Might be debatable.
I don't think it's debatable at all. Forcing the model to conform to a grammar during generation means there is never a need to discard and regenerate because it got the grammar wrong.
Think of the compute involved in generating the whole output and then re-generating if it's non-conformant. There is no comparison.
So, umm, if you want to walk BNF and emit likely tokens you can do that without any "machine learning" or whatever you want to call it. So what is being added here? Training to tie the prompt to the output?
The difference is in the word “likely”. You can put unstructured data in the prompt and get structured data out. You could put in the beginning of a list and ask for a continuation.
Ah finally, this was discussed a lot and is well overdue. Remains to be seen how well the models will adapt to this new constraint, though the demo seems promising.
Something I’m wondering lately: if you are generating tokens fast enough, is restricting the logits actually worth it computationally? If tokens are cheap enough, it might be more efficient to validate/discard them as they come rather than place constraints on how they come out. I don’t know how this one works, but the sampling or renormalizing scheme would cost something too, right?
There are at least 6 orders of magnitude difference between the computation cost of producing a token (a pass through a multi-billion-parameter model) and doing a single step of a program. Even if your validation is really naive, it's hard to beat 6 orders of magnitude.
So no, you're not generating tokens fast enough.
This can constrain an LLM's output to an arbitrary grammar/format as it is generated, rather than asking the model to output a specific format and hoping it outputs something valid.