The thing I most want from this project is a technical explanation of what it's actually doing for me and how it works.
I took a stab at making something[1] like guidance - I'm not sure exactly how guidance does it (and I'm also really curious how it would work with chat api's) but here's how my solution works.
Is this Microsoft guidance? It looks like it is and they spun it out.
I found that with large prompts, the template-processing approach makes programs hard to read. The attractive part is that control flow is not separate from the prompt, as it is in langchain, which lets you write prompts as classical programs. But the syntax remains unintuitive for large programs.
Logit-bias guidance goes a long way -- LLM structure for regex, context-free grammars, categorization, and typed construction. I'm working on a hosted and model-agnostic version of this with thiggle
This is just a different way to write prompts. It allows some interleaving of calls to the API so you can build things up and write a conversation as a single file, with conventions around the text to send to the LLM.
OpenAI function calling + JSON schema is dead simple and has never failed for me, whereas I had a bunch of errors with guidance when trying to do things like nested, repeating values.
I've been trying to figure out how projects like this, semantic kernel (also msft), and langchain add value. Is the paradigm sort of like a web framework? It reduces the boilerplate you need to write so you can focus on the business problem?
It lets you actually control the output structure and more or less guarantee the LLM is doing what you want. Plus it reliably extracts structured results.
imo Guidance is valuable, the underlying logic is sufficiently complex that I'm glad I didn't need to DIY it. Same goes for the faster Outlines project from Normal Computing.
In my experience, they add cognitive load to working with LLMs, including when doing more than just calling an LLM, like RAG. But maybe others feel differently. I’m glad there’s variety.
The thing that’s bugging me about this ecosystem is that the library, although it only augments, has to become the thing running the LLM. I can’t use guidance as a plug-in on some other LLM system.
LMQL seems to be alive and takes some of these concepts even further. It's the project of 1 or 2 PhD students at ETH Zürich so I'm hopeful they'll see it through.
I'm hacking on a library (https://github.com/gsuuon/ad-llama) inspired by guidance, but in TS and for the browser. I think structured inference and controlled sampling are really good ways of getting consistent responses out of LLM's. It lets smaller models really punch above their weight.
I've seen this link pop up in various places now, but it seems like it's still mostly not being developed? Is there a reason it was posted today? Some new development in it?
They are changing the governance and contributors, maybe in prep to do something more or raise money? Every AI library seems to try that path these days.
I've been using this library a lot, it's amazing. However, I noticed a very considerable degradation (time taken + generation quality) with versions > 0.0.58 when used with local LLMs.
simonw | 2 years ago
I dug into this the other day, and just about figured out how the old text-davinci-003 version works.
When it runs against a text completion model (like text-davinci-003) the trick seems to be that it breaks your overall Mustache-templated program up into a sequence of prompts.
These are executed one at a time. Some of them will be open ended, but some of them will include restrictions based on the rules that you laid out.
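A minimal sketch of that splitting step, not guidance's actual implementation. It assumes a simplified `{{gen 'name'}}` tag instead of the real template grammar, and `run_program` and `fake_model` are hypothetical names; the stub stands in for a real completion call.

```python
import re
from typing import Callable, Dict

# Simplified stand-in for a generation tag; not guidance's actual grammar.
GEN_TAG = re.compile(r"\{\{gen '(\w+)'\}\}")

def run_program(template: str, complete: Callable[[str], str]) -> Dict[str, str]:
    """Run a template as a sequence of prompts, one request per gen tag."""
    prompt = ""  # everything sent to the model so far
    variables: Dict[str, str] = {}
    pos = 0
    for match in GEN_TAG.finditer(template):
        # Literal text up to the tag is appended to the running prompt.
        prompt += template[pos:match.start()]
        # One completion request per tag; its output feeds the later prompts.
        text = complete(prompt)
        variables[match.group(1)] = text
        prompt += text
        pos = match.end()
    variables["_text"] = prompt + template[pos:]
    return variables

def fake_model(prompt: str) -> str:
    """Stub model: canned answer chosen from how the prompt ends."""
    return "Paris" if prompt.endswith("Capital: ") else "France"

result = run_program(
    "Country: {{gen 'country'}}\nCapital: {{gen 'capital'}}\n", fake_model
)
```

Each tag sees the literal text plus everything generated before it, which is why the second request can depend on the first answer.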
So you might have a completion prompt that asks for a maximum of 1 token and uses the logit_bias argument to ensure that the returned value can only come from a specific set of tokens. That's how you would answer a piece in the program that says "next should be just the sequence 'true' or 'false'" for example.
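Roughly what such a restricted request could look like, as a sketch under assumptions. The `model`, `prompt`, `max_tokens` and `logit_bias` fields follow the OpenAI completions API, but the token ids below are made-up placeholders (real ids come from the model's tokenizer, and this assumes each option is a single token). `build_choice_request` and `pick_token` are hypothetical helpers; the toy sampler only illustrates why a large bias restricts the choice.

```python
from typing import Dict, List

# Placeholder token ids; real ids must be looked up with the model's tokenizer.
TOKEN_IDS = {"true": 1001, "false": 1002}

def build_choice_request(prompt: str, options: List[str]) -> dict:
    """Build a completion request limited to one token from `options`."""
    return {
        "model": "text-davinci-003",
        "prompt": prompt,
        "max_tokens": 1,
        # A bias of 100 makes the listed tokens overwhelmingly likely.
        "logit_bias": {str(TOKEN_IDS[o]): 100 for o in options},
    }

def pick_token(logits: Dict[int, float], logit_bias: Dict[str, float]) -> int:
    """Toy greedy sampler: add the bias to each logit, then take the argmax."""
    biased = {
        tok: score + logit_bias.get(str(tok), 0.0) for tok, score in logits.items()
    }
    return max(biased, key=biased.get)

request = build_choice_request("Is the sky blue? Answer: ", ["true", "false"])
# The unconstrained favorite is token 7, but only the biased tokens can win.
raw_logits = {7: 9.5, 1001: 2.0, 1002: 1.0}
chosen = pick_token(raw_logits, request["logit_bias"])
```

The model still decides between the allowed tokens; the bias only removes everything else from contention.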
What I don't yet understand is how it works against non-completion models. There are open issues complaining about broken examples using it with gpt-3.5-turbo for example.
And how does it work with models other than the OpenAI ones?
verdverm | 2 years ago
This is how we implemented it anyhow, with some more parameters to control how that all works (and the LLM params) at each "pause" point. The _neat_ part for us was that a template helper could make use of the partially generated content. Hadn't thought about that before for a templating engine, but was trivial to implement in the end
gsuuon | 2 years ago
Each expression becomes a new inference request, so it's not a single inference pass. Because each subsequent pass includes the previously inferenced text, the LLM ends up doing a lot of prefill and less decode. You only decode as much as you actually inference; the repeated passes only end up costing more in prefill (which tends to be much faster in tok/s).
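To put rough numbers on that trade-off, here is a small accounting sketch. It assumes every expression is its own request and that no KV cache is reused between requests; `count_tokens` and the segment sizes are made up for illustration.

```python
from typing import List, Tuple

def count_tokens(segments: List[Tuple[int, int]]) -> Tuple[int, int]:
    """Each segment is (literal_tokens, generated_tokens) for one expression.

    Returns (total_prefill, total_decode) when every expression is its own
    request and nothing is cached between requests.
    """
    context = 0  # prompt tokens plus prior output accumulated so far
    prefill = decode = 0
    for literal, generated in segments:
        context += literal
        prefill += context   # the whole context is re-read on each pass
        decode += generated  # only the new tokens are decoded
        context += generated
    return prefill, decode

prefill, decode = count_tokens([(50, 10), (5, 10), (5, 10)])
print(prefill, decode)  # 195 30
```

Decode stays at the 30 tokens actually generated, while prefill grows to 195 tokens because earlier text is re-read on every pass.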
To work with chat tuned instruction models, you can basically still treat it as a completion model. I provide the previously completed inference text as a partially completed assistant response, e.g. with llama 2 it goes after [/INST]. You can add a bit of instruction for each inference expression which gets added to the [INST]. This approach lets you start off the inference with `{ "someField": "` for example to guarantee (at least the start of) a json response and allow you to add a little bit of instruction or context just for that field.
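A sketch of that prompt layout, assuming the commonly documented single-turn Llama 2 chat format. `llama2_prompt` is a hypothetical helper, not ad-llama's API, and whether the `<s>` token is written literally or added by the tokenizer depends on the runtime.

```python
def llama2_prompt(system: str, instruction: str, partial_response: str = "") -> str:
    """Format a single-turn Llama 2 chat prompt that ends mid-response."""
    return (
        f"<s>[INST] <<SYS>>\n{system}\n<</SYS>>\n\n"
        f"{instruction} [/INST] {partial_response}"
    )

prompt = llama2_prompt(
    system="You write JSON only.",
    # Per-expression instruction goes inside [INST].
    instruction="Describe a character. For someField, give a one-word name.",
    # Text after [/INST] is treated as the start of the assistant's reply.
    partial_response='{ "someField": "',
)
```

Because the prompt has no closing `</s>`, the model continues from the open string rather than starting a fresh reply.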
I didn't even try with openai api's since afaict you can't provide a partial assistant response for it to continue from. Even if you were to request a single token at a time and use logit_bias for biased sampling, I don't see how you can get it to continue a partially completed inference.
[1] https://github.com/gsuuon/ad-llama
adamgordonbell | 2 years ago
I find guidance to be fantastic for doing complicated prompting. I haven't used the 'controlling' the output feature as much as used it for chain prompting. Ask to come up with answers to a prompt N times, then discuss pros and cons of each answer, then make a new answer based on the best parts of the output. Stuff like that.
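That kind of chain can be sketched in plain Python, without guidance's template syntax. `refine` and `stub` are hypothetical names, and the stub stands in for a real model call.

```python
from typing import Callable, List

def refine(question: str, n: int, complete: Callable[[str], str]) -> str:
    """Draft n answers, critique them, then synthesize a final answer."""
    drafts: List[str] = [
        complete(f"Question: {question}\nAnswer attempt {i + 1}:") for i in range(n)
    ]
    listing = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(drafts))
    critique = complete(
        f"Question: {question}\nCandidate answers:\n{listing}\nPros and cons of each:"
    )
    return complete(
        f"Question: {question}\nCandidate answers:\n{listing}\n"
        f"Critique:\n{critique}\nBest combined answer:"
    )

calls: List[str] = []

def stub(prompt: str) -> str:
    """Stub model that records each prompt and returns a numbered placeholder."""
    calls.append(prompt)
    return f"output {len(calls)}"

final = refine("How should I cache API responses?", 3, stub)
```

With n=3 this makes five calls: three drafts, one critique, and one final synthesis that sees both the drafts and the critique.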
bugglebeetle | 2 years ago
https://blog.simonfarshid.com/native-json-output-from-gpt-4
(it works perfectly with GPT-3.5 as well)
hexman | 2 years ago
rckrd | 2 years ago
[0] https://thiggle.com
lukasb | 2 years ago
verdverm | 2 years ago
I would not expect it to make a difference in your current applications. Getting JSON is all about the model, training, and prompt, in that order
If you are looking for low-hanging fruit to improve your JSON responses from LLMs, fine-tuning will likely get you the most bang for your buck. Start from a coding model like codellama, code-bison, or starcoder
bugglebeetle | 2 years ago
guyrt | 2 years ago
Is that needed in the LLM space yet? I'm just not convinced the abstraction pays for itself in reduced cognitive load, or at least not yet, but very happy to be convinced otherwise.
IshKebab | 2 years ago
It's obviously extremely valuable if you're doing anything with the LLM output other than displaying it as a block of text to the user, or if you care about the output format at all.
losteric | 2 years ago
LangChain: I found having a framework useful to ramp up people without prior LLM exposure, in an open-ended experimental space. The library covers many usecases and gets people thinking. But honestly their documentation is somewhat lacking for that purpose (stale text, shallow examples). Personally, coming from a search background I was able to DIY semantic RAG in the time it took to figure out how to do the same thing in LangChain.
phillipcarter | 2 years ago
verdverm | 2 years ago
You can get the same thing with Go text/templates by adding chat function(s) as a custom helper: https://github.com/hofstadter-io/hof/blob/_dev/lib/templates...
As a developer of these things, I don't get why they want to put so much effort into the mundane parts rather than focusing on the interesting parts. These things are mostly just the same as any other workflow or API call: https://github.com/hofstadter-io/hof/blob/_dev/flow/chat/cmd... (unless you get into the python and (i.e.) start messing with the logits or token probabilities)
PUSH_AX | 2 years ago
I look forward to when we have something that can run any LLM without compatibility issues, can expose APIs etc and has a robust plugin or augmentation system.
dave1010uk | 2 years ago
https://github.com/simonw/llm
avereveard | 2 years ago
There are many projects like these I'm tracking, but they all kinda cool off after the initial prototype and thus have many quirks and limitations.
So far the only one that I could reliably use was llamacpp grammars, and those are fairly slow
verdverm | 2 years ago
How often does a project need to release to not be considered dead? It's only been 10 weeks, in the summer, at the peak of vacation time
Look at the most recent commits, they are setting up new governance, which likely took more than 10 weeks to work through the bureaucracy of Microsoft
Forgotthepass8 | 2 years ago
I thought guidance was smart, but LMQL seems brilliant as it merges pythonic constructions with LLMs (I think it may be an outright superset of python with LLM functionalities?)
It's based on a paper as well: https://arxiv.org/pdf/2212.06094
gsuuon | 2 years ago
I wonder what other folks are building on this sort of workflow? I've been playing around with it and trying to figure out interesting applications that weren't possible before.
maccam912 | 2 years ago
verdverm | 2 years ago
Somehow, the VCs and investors made us think it was cool to be working for them rather than our users
ilovefood | 2 years ago
I haven't taken time to compare between the different releases but if anyone is having the same type of issues, I recommend downgrading even if it might mean fewer features.
startupsfail | 2 years ago
simonw | 2 years ago
https://github.com/microsoft/guidance redirects to https://github.com/guidance-ai/guidance now.