I was initially impressed with the landing page, but it does look a bit suspect when things are claimed to be 100x faster without much info on the HW acceleration or the model sizes.
My best guess is that they're using two approaches to get this running faster:
- structured generation techniques from sglang (https://github.com/sgl-project/sglang) that let them generate JSON faster (with look-ahead / pre-fill) with strong guarantees on the output (i.e. 100% reliable, without requiring any retries).
- distilling a gpt-3.5-turbo-esque model from GPT-4 JSON outputs, and using it in conjunction with the above to give additional performance boosts at inference.
It doesn't seem like they're deploying on any custom silicon, nor have they optimized GPU kernels to suggest that the speed ups came there.
I thought they pretty clearly explained where the additional performance came from. If there is only one valid schema-conforming option, you can skip the LLM entirely. If there are only a limited number of possible tokens (e.g. only } or ,), then you can run it on a smaller subset of the model. Between these two you cover a large fraction of the actual token count of most JSON.
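That first case (only one schema-conforming option) is easy to sketch. Below, a toy template "grammar" stands in for a real schema compiler, and `llm_pick` fakes the model, which is only consulted when more than one token is actually legal — everything else is emitted for free:

```python
class TemplateGrammar:
    """Toy 'grammar': a JSON template where only '?' slots are free.
    Every other character is forced and never hits the model."""
    def __init__(self, template):
        self.template = template

    def allowed_tokens(self, out):
        i = len(out)
        if i >= len(self.template):
            return set()                       # template complete
        ch = self.template[i]
        return set("0123456789") if ch == "?" else {ch}

def generate(grammar, llm_pick):
    out = []
    while True:
        allowed = grammar.allowed_tokens(out)
        if not allowed:
            break
        if len(allowed) == 1:
            out.append(next(iter(allowed)))    # forced token: skip the LLM
        else:
            out.append(llm_pick(out, allowed)) # real choice: call the model
    return "".join(out)

# stand-in "model" that always picks '9'
result = generate(TemplateGrammar('{"year": ????}'), lambda out, allowed: "9")
print(result)  # {"year": 9999}
```

Here the "model" is called 4 times for a 14-character output; the other 10 characters are forced by the template.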
I've been doing a lot of indie work with structured generation and llama.cpp; you can get extremely fast responses with caching and deterministic token skipping.
When generating JSON, calling the LLM when you already know that after
{ "brand": "Toyota"
comes
, "year":
is a massive waste. If the data itself is constrained, you can skip most of that too! You'll go down from needing 20 LLM calls to just three for a simple piece of data like
{ "brand": "Toyota", "year": 1995 }
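That count can be made concrete: with a fixed schema, every structural character is emitted by plain code and only the field values are model calls. A toy sketch, where `llm_value` is a stand-in dict lookup rather than a real model:

```python
import json

def fill_schema(keys, llm_value):
    """Emit all JSON scaffolding (braces, quotes, keys, commas) directly;
    only the field values ever hit the model -- one call per field."""
    calls = 0
    obj = {}
    for key in keys:
        obj[key] = llm_value(key)   # the only "LLM" calls
        calls += 1
    return json.dumps(obj), calls

# stand-in for the model: canned answers
fake = {"brand": "Toyota", "year": 1995}
text, calls = fill_schema(["brand", "year"], fake.get)
print(text, calls)  # {"brand": "Toyota", "year": 1995} 2
```

Two calls instead of ~20 token-by-token steps for the same output.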
If they combine these techniques with a model that's specifically trained for structured output, along with the novel inference-time pruning technique they talk about in the post, I can definitely see them getting these kinds of inference speeds.
I'm experimenting with a self-hosted API that is fast enough to not even need a GPU for single-user use cases (the latency is good, but there's no batching). Once I'm done with the finishing touches I'll rent a GPU server for actual hosting.
Despite reading it twice, I couldn't work out why they chose char/s or Hz as an appropriate measure. They also provided no benchmarks or model sizes beyond a relative comparison with models 10x or 100x the size, which leads me to assume this is a fairly small model.
My guess is they're generating the structure of the JSON programmatically (i.e. keys, commas, braces), doing JSON escaping for the strings programmatically, and not handling JSON in the LLM at all. Hence the char/s comparison: first, it's not just generating tokens, and second, it's better for their benchmarks to compare char/s (since a lot of their characters never hit the LLM) rather than LLM tokens/s (which are probably somewhat faster, but not 100x faster).
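Rough arithmetic for that hypothesis, using the toy object from upthread: if only the string and number values are sampled and every structural character is emitted programmatically, most of the char/s count is free.

```python
doc = '{"brand": "Toyota", "year": 1995}'

# Under the hypothesis, only the values ("Toyota", 1995) are sampled;
# braces, quotes, keys, colons, and commas are emitted by plain code.
free = len(doc) - len("Toyota") - len("1995")
print(free, len(doc))                                            # 23 33
print(f"{free/len(doc):.0%} of characters never hit the model")  # 70% ...
```

So even a modest tokens/s rate translates into a much larger char/s figure once the structural characters come along for free.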
Yeah, it's weird they don't mention parameter count or other reasoning metrics. It's a very cool approach to getting structured output from an LLM, but the benchmarks don't show us the whole picture. I'm wondering if their approach can be used to delegate to different models at each step in a structured output. If it could run with Mixtral 8x7B and still maintain its performance, that would be awesome.
This seems like a very practical and well thought out approach. Turning unstructured data into valid structured data is in my experience one of the most important things for integrating LLMs deeply into a pipeline. Doing that fast and cheap goes a very long way for these use cases. Also, if you need stronger content generation than this, nothing prevents you from generating some higher quality content in another LLM and then passing it through this to structure it.
Currently, LLMs are not state of the art at Named Entity Recognition. They are slower, more expensive, and less accurate than a fine-tuned BERT model.
However, they are far easier to get started with via in-context learning. Soon they will be cheap and fast enough that training your own model will be a waste of time for 95% of use cases (probably more, because it will unlock use cases that wouldn't have broken even with the old NLP approaches from a value perspective).
This is why I am tracking LLM structured outputs here:
https://github.com/imaurer/awesome-llm-json
And created an autocorrecting pydantic library that could be used for Named Entity Linking:
https://github.com/genomoncology/FuzzTypes
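The validate-or-retry pattern such libraries build on is simple to sketch. A stdlib-only toy (the schema and inputs are invented; a real setup would use pydantic or similar instead of the hand-rolled checks):

```python
import json

# toy NER schema: each entity needs a string name and a string label
REQUIRED = {"name": str, "label": str}

def parse_entities(raw):
    """Validate LLM NER output against the schema.
    Returns the parsed list, or None to signal 'retry the model'."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    for item in items:
        if not isinstance(item, dict):
            return None
        for field, typ in REQUIRED.items():
            if not isinstance(item.get(field), typ):
                return None
    return items

print(parse_entities('[{"name": "Toyota", "label": "ORG"}]'))
print(parse_entities('[{"name": "Toyota"}]'))  # None (missing label -> retry)
```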
I was excited about their AI capabilities, but seeing that they built and promote their own UI framework, I couldn't help but think that their focus is not where it should be. Why spend the time and resources to build your own UI framework when your main product is an API?
Interesting, but I think there's one comparison missing. When I use GPT-4 with function calling in a real system, a single call usually returns 5-6 responses: the first with content containing the plan / reasoning, followed by multiple function calls (parallel function calling).
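For reference, a parallel-tool-call message looks roughly like this (the nesting follows OpenAI's chat-completions `tool_calls` shape; the content and function names here are invented for the example):

```python
import json

# Illustrative chat-completions message with parallel tool calls.
# Note that "arguments" arrives as a JSON string, not a dict.
message = {
    "content": "Plan: look up the car first, then check pricing.",
    "tool_calls": [
        {"function": {"name": "lookup_car", "arguments": '{"brand": "Toyota"}'}},
        {"function": {"name": "check_price", "arguments": '{"year": 1995}'}},
    ],
}

# dispatch each call
for call in message["tool_calls"]:
    fn = call["function"]
    print(fn["name"], json.loads(fn["arguments"]))
```

A benchmark built around a single schema-filling call never exercises this multi-part case.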
"Ability against models with 10x or 100x more parameters"
This is a small model optimized for retrieval and function calling. "Reasoning" makes an appearance in the title but no standard benchmarks of general ability, such as MMLU or HumanEval, are mentioned. No details about the training process and no access to the models other than via API.
Nice marketing, but looks empty. I can also make an LLM that runs 1000x faster than Mistral:
def complete(prompt): print('As an AI language model...')
https://github.com/guidance-ai/guidance
Same principle
Has anyone seen a good JSON library that can handle slightly broken JSON? e.g. trailing commas, unescaped newlines, etc.? I have not found a good one.
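For the two failure modes mentioned, a couple of regex passes get surprisingly far. A toy sketch, deliberately fragile (e.g. the trailing-comma pass can also hit commas inside strings), not a replacement for a real lenient parser:

```python
import json
import re

def loads_lenient(text):
    """Best-effort parse of slightly broken JSON: strips trailing commas
    and escapes raw newlines inside string literals."""
    # drop trailing commas before } or ] (naive: scans strings too)
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # escape literal newlines inside double-quoted string literals
    text = re.sub(r'"(?:[^"\\]|\\.)*"',
                  lambda m: m.group(0).replace("\n", "\\n"),
                  text)
    return json.loads(text)

broken = '{"note": "line1\nline2", "nums": [1, 2,],}'
print(loads_lenient(broken))  # {'note': 'line1\nline2', 'nums': [1, 2]}
```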