I was initially impressed with the landing page, but it does look a bit suspect when things are claimed to be 100x faster without much info on the HW acceleration or the model sizes.
My best guess is that they're using two approaches to get this running faster:
- structured generation techniques from sglang (https://github.com/sgl-project/sglang) that let them generate JSON faster (with look-ahead / pre-fill) with strong guarantees on the output (i.e. 100% reliable, without requiring any retries).
- distilling a gpt-3.5-turbo-esque model from GPT-4 JSON outputs, and using it in conjunction with the above to give additional performance boosts at inference.
It doesn't seem like they're deploying on any custom silicon, nor have they optimized GPU kernels to suggest that the speed ups came there.
I thought they pretty clearly explained where the additional performance came from. If there is only one valid schema-conforming option, you can skip the LLM entirely. If there are only a limited number of possible tokens (e.g. only } or ,), then you can run it on a smaller subset of the model. Between these two you cover a large fraction of the actual token count of most JSON.
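That first case (only one schema-conforming option) is easy to sketch. Below, a toy template "grammar" stands in for a real schema compiler, and `llm_pick` fakes the model, which is only consulted when more than one token is actually legal — everything else is emitted for free:

```python
class TemplateGrammar:
    """Toy 'grammar': a JSON template where only '?' slots are free.
    Every other character is forced and never hits the model."""
    def __init__(self, template):
        self.template = template

    def allowed_tokens(self, out):
        i = len(out)
        if i >= len(self.template):
            return set()                       # template complete
        ch = self.template[i]
        return set("0123456789") if ch == "?" else {ch}

def generate(grammar, llm_pick):
    out = []
    while True:
        allowed = grammar.allowed_tokens(out)
        if not allowed:
            break
        if len(allowed) == 1:
            out.append(next(iter(allowed)))    # forced token: skip the LLM
        else:
            out.append(llm_pick(out, allowed)) # real choice: call the model
    return "".join(out)

# stand-in "model" that always picks '9'
result = generate(TemplateGrammar('{"year": ????}'), lambda out, allowed: "9")
print(result)  # {"year": 9999}
```

Here the "model" is called 4 times for a 14-character output; the other 10 characters are forced by the template.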
I've been doing a lot of indie work with structured generation and llama.cpp; you can get extremely fast responses with caching and deterministic token skipping.
When generating JSON, calling the LLM when you already know that after
{ "brand": "Toyota"
comes
, "year":
is a massive waste. If the data itself is constrained, you can skip most of that too! You'll go down from needing 20 LLM calls to just three for a simple piece of data like
{ "brand": "Toyota", "year": 1995 }
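That count can be made concrete: with a fixed schema, every structural character is emitted by plain code and only the field values are model calls. A toy sketch, where `llm_value` is a stand-in dict lookup rather than a real model:

```python
import json

def fill_schema(keys, llm_value):
    """Emit all JSON scaffolding (braces, quotes, keys, commas) directly;
    only the field values ever hit the model -- one call per field."""
    calls = 0
    obj = {}
    for key in keys:
        obj[key] = llm_value(key)   # the only "LLM" calls
        calls += 1
    return json.dumps(obj), calls

# stand-in for the model: canned answers
fake = {"brand": "Toyota", "year": 1995}
text, calls = fill_schema(["brand", "year"], fake.get)
print(text, calls)  # {"brand": "Toyota", "year": 1995} 2
```

Two calls instead of ~20 token-by-token steps for the same output.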
If they combine these techniques with a model that's specifically trained for structured output, along with the novel inference-time pruning technique they talk about in the post, I can definitely see them getting these kinds of inference speeds.
I'm experimenting with a self-hosted API that is fast enough to not even need a GPU for single-user use cases (the latency is good, but there's no batching). Once I'm done with the finishing touches I'll rent a GPU server for actual hosting.
Despite reading it twice, I couldn't work out why they chose char/s or Hz as an appropriate measure. They also provided no benchmarks or model sizes beyond a relative comparison with models 10x or 100x the size, which leads me to assume this is a fairly small model.
My guess is they're generating the structure of the JSON programmatically (i.e. keys, commas, braces), doing JSON escaping for the strings programmatically, and not handling JSON in the LLM at all. Hence the char/s comparison: first, it's not just generating tokens, and second, it's better for their benchmarks to compare char/s (since a lot of their characters never hit the LLM) rather than LLM tokens/s (which are probably somewhat faster, but not 100x faster).
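Rough arithmetic for that hypothesis, using the toy object from upthread: if only the string and number values are sampled and every structural character is emitted programmatically, most of the char/s count is free.

```python
doc = '{"brand": "Toyota", "year": 1995}'

# Under the hypothesis, only the values ("Toyota", 1995) are sampled;
# braces, quotes, keys, colons, and commas are emitted by plain code.
free = len(doc) - len("Toyota") - len("1995")
print(free, len(doc))                                            # 23 33
print(f"{free/len(doc):.0%} of characters never hit the model")  # 70% ...
```

So even a modest tokens/s rate translates into a much larger char/s figure once the structural characters come along for free.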
Yeah, it's weird they don't mention parameter count or other reasoning metrics. It's a very cool approach to getting structured output from an LLM, but the benchmarks don't show us the whole picture. I'm wondering if their approach can be used to delegate to different models at each step in a structured output. If it could run with Mixtral 8x7B and still maintain its performance, that would be awesome.
This seems like a very practical and well thought out approach. Turning unstructured data into valid structured data is in my experience one of the most important things for integrating LLMs deeply into a pipeline. Doing that fast and cheap goes a very long way for these use cases. Also, if you need stronger content generation than this, nothing prevents you from generating some higher quality content in another LLM and then passing it through this to structure it.
Currently, LLMs are not state of the art at Named Entity Recognition. They are slower, more expensive, and less accurate than a fine-tuned BERT model.
However, they are far easier to get started with via in-context learning. Soon they will be cheap and fast enough that training your own model will be a waste of time for 95% of use cases (probably more, because it will unlock use cases that wouldn't have broken even with the old NLP approaches from a value perspective).
This is why I am tracking LLM structured outputs here:
https://github.com/imaurer/awesome-llm-json
And created an autocorrecting pydantic library that could be used for Named Entity Linking:
https://github.com/genomoncology/FuzzTypes
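The validate-or-retry pattern such libraries build on is simple to sketch. A stdlib-only toy (the schema and inputs are invented; a real setup would use pydantic or similar instead of the hand-rolled checks):

```python
import json

# toy NER schema: each entity needs a string name and a string label
REQUIRED = {"name": str, "label": str}

def parse_entities(raw):
    """Validate LLM NER output against the schema.
    Returns the parsed list, or None to signal 'retry the model'."""
    try:
        items = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(items, list):
        return None
    for item in items:
        if not isinstance(item, dict):
            return None
        for field, typ in REQUIRED.items():
            if not isinstance(item.get(field), typ):
                return None
    return items

print(parse_entities('[{"name": "Toyota", "label": "ORG"}]'))
print(parse_entities('[{"name": "Toyota"}]'))  # None (missing label -> retry)
```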
I was excited about their AI capabilities, but seeing that they built and promote their own UI framework, I couldn't help but think that their focus is not where it should be. Why spend the time and resources to build your own UI framework when your main product is an API?
Interesting, but I think there's one comparison missing. When I use GPT-4 with function calling in a real system, a single call usually returns 5-6 responses: the first with content containing the plan / reasoning, followed by multiple function calls (parallel function calling).
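For reference, a parallel-tool-call message looks roughly like this (the nesting follows OpenAI's chat-completions `tool_calls` shape; the content and function names here are invented for the example):

```python
import json

# Illustrative chat-completions message with parallel tool calls.
# Note that "arguments" arrives as a JSON string, not a dict.
message = {
    "content": "Plan: look up the car first, then check pricing.",
    "tool_calls": [
        {"function": {"name": "lookup_car", "arguments": '{"brand": "Toyota"}'}},
        {"function": {"name": "check_price", "arguments": '{"year": 1995}'}},
    ],
}

# dispatch each call
for call in message["tool_calls"]:
    fn = call["function"]
    print(fn["name"], json.loads(fn["arguments"]))
```

A benchmark built around a single schema-filling call never exercises this multi-part case.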
"Ability against models with 10x or 100x more parameters"
This is a small model optimized for retrieval and function calling. "Reasoning" makes an appearance in the title but no standard benchmarks of general ability, such as MMLU or HumanEval, are mentioned. No details about the training process and no access to the models other than via API.
Nice marketing, but looks empty. I can also make an LLM that runs 1000x faster than Mistral:
def complete(prompt): print('As an AI language model...')
https://github.com/guidance-ai/guidance
Same principle
Has anyone seen a good JSON library that can handle slightly broken JSON? e.g. trailing commas, unescaped newlines, etc.? I have not found a good one.
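For the two failure modes mentioned, a couple of regex passes get surprisingly far. A toy sketch, deliberately fragile (e.g. the trailing-comma pass can also hit commas inside strings), not a replacement for a real lenient parser:

```python
import json
import re

def loads_lenient(text):
    """Best-effort parse of slightly broken JSON: strips trailing commas
    and escapes raw newlines inside string literals."""
    # drop trailing commas before } or ] (naive: scans strings too)
    text = re.sub(r",\s*([}\]])", r"\1", text)
    # escape literal newlines inside double-quoted string literals
    text = re.sub(r'"(?:[^"\\]|\\.)*"',
                  lambda m: m.group(0).replace("\n", "\\n"),
                  text)
    return json.loads(text)

broken = '{"note": "line1\nline2", "nums": [1, 2,],}'
print(loads_lenient(broken))  # {'note': 'line1\nline2', 'nums': [1, 2]}
```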