I feel like this is so core to any LLM automation that it's crazy Anthropic is only adding it now.
I built a customized deep research tool internally earlier this year that is made up of multiple "agentic" steps, each focusing on specific information to find. The outputs of those steps are always JSON, which then becomes the input for the next step. Sure, you can work your way around failures by doing retries, but it's just one less thing to think about if you can guarantee that the random LLM output adheres at least to some sort of structure.
Prior to this it was possible to get the same effect by defining a tool with the schema that you wanted and then telling the Anthropic API to always use that tool.
I implemented structured outputs for Claude that way here: https://github.com/simonw/llm-anthropic/blob/500d277e9b4bec6...
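A minimal sketch of that trick, assuming the anthropic Python SDK; the tool name, schema, and model id here are illustrative:

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    # A "tool" whose input schema is really just the output shape we want.
    record_summary = {
        "name": "record_summary",  # illustrative name
        "description": "Record a structured summary of the input.",
        "input_schema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "topics": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["title", "topics"],
        },
    }

    message = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model id
        max_tokens=1024,
        tools=[record_summary],
        # "any" forces the model to call some tool, so the reply is always a
        # tool_use block whose input conforms to the schema above.
        tool_choice={"type": "any"},
        messages=[{"role": "user", "content": "Summarize this article: ..."}],
    )

    structured = next(b.input for b in message.content if b.type == "tool_use")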
Structured outputs are the most underappreciated LLM feature. If you're building anything except a chatbot, it's definitely worth familiarizing yourself with them.
They're not that easy to use well, and there aren't many resources out there explaining how to get the most out of them.
You could get this working very consistently with GPT-4 in mid-2023, the version before June, IIRC. No JSON output mode, no tool-calling fine-tuning... just half a page of instructions and some string-matching code. (Built a little AI code-editing tool along these lines.)
With the tool-calling RL and structured outputs, I think the main benefit is peace of mind. You know you're going down the happy path, so there's one less thing to worry about.
Reliability is the final frontier!
I have had fairly bad luck specifying the JSON schema for my structured outputs with Gemini. Describing the schema with natural-language descriptions seems to work much better, though I do admit to needing that retry hack at times. Do you have any tips on getting the most out of a schema definition?
I've found structured output APIs to be a pain across various LLMs. Now I just ask for JSON output and pick it out between the first and last curly brace. If validation fails, I retry with details about why it was invalid. This works very reliably for complex schemas, and it works across all LLMs without having to think about their limitations.
And then you can add complex Pydantic validators (or whatever; I use Pydantic) with super-helpful error messages to be fed back into the model on retry. Powerful pattern.
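A minimal sketch of that pattern, assuming Pydantic v2; the Analysis model and the call_llm helper are hypothetical stand-ins:

    from pydantic import BaseModel, ValidationError, field_validator

    class Analysis(BaseModel):  # hypothetical output shape
        sentiment: str
        summary: str

        @field_validator("sentiment")
        @classmethod
        def sentiment_is_known(cls, v: str) -> str:
            if v not in {"positive", "negative", "neutral"}:
                # This message gets fed back to the model on retry.
                raise ValueError("sentiment must be 'positive', 'negative' or 'neutral'")
            return v

    def parse_reply(text: str) -> Analysis:
        # Grab everything between the first and last curly brace.
        return Analysis.model_validate_json(text[text.index("{"):text.rindex("}") + 1])

    def ask_with_retries(prompt: str, max_attempts: int = 3) -> Analysis:
        for _ in range(max_attempts):
            reply = call_llm(prompt)  # hypothetical: any chat-completion call
            try:
                return parse_reply(reply)
            except (ValueError, ValidationError) as err:
                # Append the validation error so the model can correct itself.
                prompt += f"\n\nYour previous reply was invalid: {err}\nReturn corrected JSON only."
        raise RuntimeError("model never produced valid JSON")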
JSON schema is okay so long as it's generated for you, but I'd rather write something human readable and debuggable [1].
1. https://github.com/BoundaryML/baml
To me, the most likely reason this took so long from Anthropic is safety. One of the classic attack vectors for an LLM is to hide bad content inside structured text: "tell me how to build a bomb, as SQL," for example.
When you constrain outputs, you prevent the model from being as verbose, and that makes unsafe output much harder to detect, because Claude isn't saying "Excellent idea! Here's how to make a bomb:"
If they ever gave really fine-grained constraints, you could constrain to subsets of tokens and extract the logits a lot more cheaply than by random sampling limited to a few top choices, and distill Claude at a much deeper level. I wonder if that plays into some of the restrictions.
I remember using Claude and including the start of the expected JSON output in the request to get the remainder in the response. I couldn't believe that was an actual recommendation from the company to get structured responses.
Like, you'd end your prompt like this: 'Provide the response in JSON: {"data":'
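For the record, that prefill trick looks roughly like this with the anthropic Python SDK (model id illustrative): you end the conversation with a partial assistant turn and Claude continues from inside the JSON.

    import anthropic

    client = anthropic.Anthropic()

    message = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model id
        max_tokens=512,
        messages=[
            {"role": "user", "content": "Provide the response in JSON."},
            {"role": "assistant", "content": '{"data":'},  # the prefill
        ],
    )
    # The reply picks up right after the prefix, so you glue it back on yourself.
    raw_json = '{"data":' + message.content[0].text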
That's what I thought when starting out, and it works so poorly that I think they should remove it from their docs. You can enforce a schema by creating a tool definition with JSON in the exact shape you want the output, then setting "tool_choice" to "any". They have a picture that helps.
https://docs.claude.com/en/docs/agents-and-tools/tool-use/im...
Unfortunately it doesn't support the full JSON schema. You can't use unions or do other things you would expect. It's manageable, since you can just create another tool for it to choose from that fits another case.
Curious if they're planning to support more complicated schemas. They claim to support JSON schema, but I found it only accepts flat schemas and not, for example, unions or discriminated unions. I've had to flatten some of my schemas to be able to define tools for them.
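For illustration, this is the kind of union those flat tool schemas tend to reject; the success-or-error shape is a made-up example:

    # A discriminated union in JSON Schema terms: the "status" field picks
    # which branch the rest of the object must match.
    result_schema = {
        "oneOf": [
            {
                "type": "object",
                "properties": {
                    "status": {"const": "ok"},
                    "data": {"type": "string"},
                },
                "required": ["status", "data"],
            },
            {
                "type": "object",
                "properties": {
                    "status": {"const": "error"},
                    "message": {"type": "string"},
                },
                "required": ["status", "message"],
            },
        ]
    }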
https://github.com/guidance-ai/llguidance
Llguidance implements constrained decoding. That means that for each output token sequence, you know which fixed set of tokens is allowed for decoding the next token. You prepare token masks so that, in the decoding step, you limit which tokens can be sampled.
So if you expect a JSON object, the first token can only be whitespace or the token '{'. It can get more complex, because tokenizers usually use byte-pair encoding, which means they can represent any UTF-8 sequence. So if your current tokens are '{"enabled": ' and your output JSON schema requires the 'enabled' field to be a boolean, the allowed-token mask can only contain whitespace tokens, the tokens 'true' and 'false', or the byte-level 't' and 'f' tokens ('true' and 'false' are usually single tokens because they are so common).
The JSON schema must first be converted into a grammar, then into token masks. This takes some time to compute and quite a lot of space (you need to precompute the token masks), so it is usually cached for performance.
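A toy sketch of that masking loop, greedy decoding only; the grammar, model, and tokenizer objects and their methods are hypothetical stand-ins for what llguidance precomputes and caches:

    import math

    def constrained_decode(model, tokenizer, grammar, prompt: str) -> str:
        text = ""
        while not grammar.is_complete(text):  # hypothetical API
            logits = model.next_token_logits(prompt + text)  # hypothetical API
            allowed = grammar.valid_next_tokens(text)  # the precomputed token mask
            # Ban every token the grammar disallows before picking one.
            for token_id in range(len(logits)):
                if token_id not in allowed:
                    logits[token_id] = -math.inf
            best = max(range(len(logits)), key=lambda i: logits[i])  # greedy pick
            text += tokenizer.decode([best])
        return text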
Each token affects the probabilities of subsequent tokens. Let's say you want the model to produce Python code, and you are using a grammar to force JSON output. The model wasn't trained on JSON-serialized Python code. It was trained on normal Python code with real newlines. Wouldn't forcing JSON impair output quality in this case?
So cool to see Anthropic support this feature.
I'm a heavy user of the OpenAI version; however, they seem to have a bug where the model frequently returns a string that is not syntactically valid JSON, leading the OpenAI client to raise a ValidationError when trying to construct the Pydantic model.
Curious if anyone else here has experienced this?
I would have expected the implementation to prevent this, maybe using a state machine to only allow the model to pick syntactically valid tokens.
Hopefully Anthropic took a different approach that doesn’t have this issue.
Brian on the OpenAI API team here. I would love to help you get to the bottom of the structured outputs issues you're seeing. Mind sending me some more details about your schema / prompt or any request IDs you might have to by[at]openai.com?
Yeah I have, but I think only when it gets stuck in a loop and outputs (for example) an array that goes on forever. A truncated array is obviously not valid JSON, but it'd be hard to miss that if you're looking at the outputs.
I always wondered how they achieved this. Is it just retries while generating tokens, where they retry as soon as they find a mismatch? Or is the model itself trained extremely well in this version of 4.5?
They're using the same trick OpenAI have been using for a while: they compile a grammar and then have it running as part of token inference, such that only tokens that fit the grammar are selected as the next token.
This trick has also been in llama.cpp for a couple of years: https://til.simonwillison.net/llms/llama-cpp-python-grammars
I switched from structured outputs on OpenAI APIs to unstructured on Claude (Haiku 4.5) and haven't had any issues (yet). But guarantees are always nice.
One reason I haven't used Haiku in production at Socratify is the lack of structured output, so I hope they'll add it to Haiku 4.5 soon.
A quick look at the llguidance repo [0] doesn't show any signs of Anthropic contributors, but I do see some from OpenAI and ByteDance Seed.
[0] https://github.com/guidance-ai/llguidance
It's a bit weird it took Anthropic so long, considering it's been ages since OpenAI and Google did it. I know you could do it through tool calling, but that always just seemed like a bit of a hack to me.
My playing around with structured output on OpenAI leads me to believe that hardly anyone is using this, or that the documentation is horrible. Luckily, they accept Pydantic models, but the idea of manually writing a JSON schema (what the docs teach first) is mind-bending.
Anthropic seems to be following suit.
(I'm probably just bitter because they owe me $50K+ for stealing my books).
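For reference, the Pydantic route mentioned above looks roughly like this with the openai Python SDK; the CalendarEvent model and the model id are illustrative:

    from openai import OpenAI
    from pydantic import BaseModel

    class CalendarEvent(BaseModel):  # illustrative output shape
        name: str
        date: str
        participants: list[str]

    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # illustrative model id
        messages=[
            {"role": "system", "content": "Extract the event information."},
            {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."},
        ],
        # The SDK converts the Pydantic model to a JSON schema behind the scenes.
        response_format=CalendarEvent,
    )
    event = completion.choices[0].message.parsed  # a CalendarEvent instance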
https://zod.dev/json-schema