_andrei_|1 year ago
You're comparing apples to oranges - structured output (a capability) with structured output + CoT (a technique) - and concluding that structured output isn't good for reasoning. Well, it's not supposed to "reason", and you didn't apply CoT to it!
I didn't like it overall; I think it can confuse people, and it shouldn't be presented as best practice:
1. To ensure a reasonable amount of variation, its temperature was set to 1.0.
Why would you use any other temperature than 0 when you are asserting the correctness of the data extraction and the "reasoning" of the LLM? You don't want variation.
2. true_answer = (50 * 29) + (1.7 * 50 * 9)
Why are you using the LLM to do math? If the data was extracted correctly (with structured output or function calling) let it write the formula and evaluate it.
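To make that split concrete, here's a minimal sketch: the model only supplies structured fields plus a formula string over those field names, and plain code does the arithmetic. The field names and the `safe_eval` helper are my own illustration, with the numbers taken from the article's John Doe example.

```python
import ast
import operator

# Hypothetical fields extracted by the LLM via structured output /
# function calling (values match the John Doe example in the thread).
extracted = {
    "base_rate": 50.0,
    "base_hours": 29.0,
    "overtime_multiplier": 1.7,
    "hours_worked": 38.0,
}

# Instead of asking the LLM for the final number, ask it only for a
# formula over the extracted field names and evaluate that ourselves.
formula = ("base_rate * base_hours + "
           "overtime_multiplier * base_rate * (hours_worked - base_hours)")

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str, names: dict) -> float:
    """Evaluate a small arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Name) and node.id in names:
            return float(names[node.id])
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        raise ValueError(f"disallowed node: {ast.dump(node)}")
    return walk(ast.parse(expr, mode="eval"))

total = safe_eval(formula, extracted)
print(total)  # 50*29 + 1.7*50*9 = 2215.0
```

The LLM never touches the numbers; if the extraction is right, the answer is right by construction.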
3. The new Structured Outputs in the API feature from OpenAI is a significant advancement, but it's not a silver bullet. / there's more to this story than meets the eye
The new API is just a nicer built-in way to extract structured data. Previously (still valid) you had to use function calling and pass it a "returnResult" function that had its payload typed to your expected schema. This is one of the most powerful and effective tools we have to work with LLMs, if used for what it's supposed to do, we shouldn't avoid this because it doesn't "reason" as well.
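A rough sketch of that pattern, assuming the standard OpenAI tools request shape; the schema fields are illustrative ones for the John Doe example, and no request is actually sent here:

```python
import json

# The older (still valid) trick: give the model a single "returnResult"
# tool whose parameters are your expected schema, then force that tool,
# so the only possible reply is a typed payload.
return_result_tool = {
    "type": "function",
    "function": {
        "name": "returnResult",
        "description": "Return the extracted billing data.",
        "parameters": {
            "type": "object",
            "properties": {
                "base_rate": {"type": "number"},
                "base_hours": {"type": "number"},
                "overtime_multiplier": {"type": "number"},
                "hours_worked": {"type": "number"},
            },
            "required": ["base_rate", "base_hours",
                         "overtime_multiplier", "hours_worked"],
        },
    },
}

request = {
    "model": "gpt-4o",
    "temperature": 0,
    "messages": [{"role": "user", "content": "...the John Doe prompt..."}],
    "tools": [return_result_tool],
    # Forcing this tool means the reply is always the typed payload.
    "tool_choice": {"type": "function", "function": {"name": "returnResult"}},
}

print(json.dumps(request["tool_choice"]))
```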
> John Doe is a freelance software engineer. He charges a
> base rate of $50 per hour for the first 29 hours of work
> each week. For any additional hours, he charges 1.7
> times his base hourly rate. This week, John worked on a
> project for 38 hours.
And the sample scenario is something I wouldn't use LLMs for. Nevertheless, CoT can still be applied with structured output; I'd like to see your structured-output-reasoning-cot code to figure out why it didn't work.
Structured outputs let you extend the prompt by providing descriptions for the fields you expect, which makes them much more effective than other approaches, and you can also implement features like self-healing loops (e.g. to get rid of the ~1% chance that gpt-4o doesn't reply with data following the schema), etc. The paper's authors used the plain "JSON mode", which is useless; glad to see you did it better.
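A self-healing loop of the kind mentioned above can be sketched like this; `call_model` is a stand-in for a real chat-completion call, and the stub below just simulates one malformed reply followed by a valid one:

```python
import json

# Expected schema fields (illustrative, matching the thread's example).
REQUIRED = {"base_rate", "base_hours", "overtime_multiplier", "hours_worked"}

def validate(reply: str) -> dict:
    data = json.loads(reply)          # raises on malformed JSON
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

def extract_with_retries(call_model, prompt: str, max_attempts: int = 3) -> dict:
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_attempts):
        reply = call_model(messages)
        try:
            return validate(reply)
        except ValueError as err:  # JSONDecodeError is a ValueError too
            # Self-heal: show the model its own bad output plus the error.
            messages += [
                {"role": "assistant", "content": reply},
                {"role": "user", "content": f"Invalid reply ({err}); "
                                            "resend valid JSON only."},
            ]
    raise RuntimeError("schema never satisfied")

# Stub model: fails once with prose, then returns valid JSON.
_replies = iter(['Sure! Here is the data...',
                 '{"base_rate": 50, "base_hours": 29, '
                 '"overtime_multiplier": 1.7, "hours_worked": 38}'])
data = extract_with_retries(lambda msgs: next(_replies), "...")
print(data["hours_worked"])  # 38
```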
---
Anyway, here's GPT-4o with function calling (not even structured output) solving the issue correctly every time: https://gist.ro/gpt-reason.mp4
As you can see, it's super consistent with GPT-4o at temp 0, with a silly simple prompt. If someone worked on the prompts or split it into a "multi-step pipeline" (come on, is this what we call fn(fn(x)) now?), they would achieve the same result with 4o-mini.
xiaofei_|1 year ago
I appreciate you taking the time to read the article and share your thoughts. Your arguments touch on many perspectives and dimensions. Please allow me to address each point:
> You’re comparing apples to oranges - structured output (a capability) with structured output + CoT (a technique), saying that structured output isn’t good for reasoning. Well, it’s not supposed to “reason,” and you didn’t apply CoT to it!
The goal of our evaluation is to address the original OpenAI JSON mode statement: https://openai.com/index/introducing-structured-outputs-in-t... (see the section “Separating a final answer from supporting reasoning or additional commentary”). It illustrates structured output used as CoT reasoning steps, which is the source of the confusion and the basis for both the research paper and our evaluation work. Our findings indicate that structured output is indeed not effective for reasoning (i.e., don’t trust the answer even if you specify a reasoning field).
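For reference, the CoT-style schema from that announcement looks roughly like this in the Structured Outputs `response_format` shape; the field names follow our reasoning_steps pipeline, and the exact schema in the announcement may differ:

```python
# A reasoning field placed next to the final answer - the pattern the
# announcement illustrates and the one our evaluation tested.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "reasoned_answer",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "reasoning_steps": {
                    "type": "array",
                    "items": {"type": "string"},
                    "description": "Step-by-step reasoning before answering.",
                },
                "final_answer": {"type": "number"},
            },
            "required": ["reasoning_steps", "final_answer"],
            "additionalProperties": False,
        },
    },
}
```

Our finding is that satisfying this schema does not by itself make the value in `final_answer` trustworthy.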
> Why would you use any other temperature than 0 when you are asserting the correctness of the data extraction and the “reasoning” of the LLM? You don’t want variation.
Good question. The example we provided involves not only correct arithmetic but also accurate interpretation of specific conditions (e.g., recognizing that the first 29 hours are charged at one rate, with additional hours at a higher rate). While setting the temperature to 0 ensures consistent, predictable results by choosing the most likely next word or token, we wanted to explore how the model handles variations and uncertainties. Note that all models were set with a temperature of 1.0 in the comparison. A consistent output in a multistep setup suggests a robust reasoning process we can trust. In contrast, the JSON mode with reasoning field (i.e., reasoning_steps in the structured-output-reasoning-cot pipeline, as detailed in the Chain-of-Thought Reasoning section of https://colab.research.google.com/github/instill-ai/cookbook...) did not show similar reliability.
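The robustness check described here can be sketched as sampling the pipeline several times at temperature 1.0 and measuring agreement; `run_pipeline` is a stand-in for a real pipeline call, and the stubbed answers are illustrative:

```python
from collections import Counter

def consistency(run_pipeline, prompt: str, n: int = 5):
    """Run the pipeline n times; return the majority answer and its rate."""
    answers = [run_pipeline(prompt) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

# Stub: a pipeline that answers 2215 four times out of five at temp 1.0.
_samples = iter([2215, 2215, 2140, 2215, 2215])
best, rate = consistency(lambda _: next(_samples), "...", n=5)
print(best, rate)  # 2215 0.8
```

A high agreement rate under sampling variation is the "robust reasoning" signal; temperature 0 would hide exactly the instability we wanted to surface.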
> Why are you using the LLM to do math? If the data was extracted correctly (with structured output or function calling), let it write the formula and evaluate it. The new API is just a nicer built-in way to extract structured data. Previously (still valid), you had to use function calling and pass it a “returnResult” function that had its payload typed to your expected schema. This is one of the most powerful and effective tools we have to work with LLMs. If used properly, we shouldn’t avoid it just because it doesn’t “reason” as well.
Function calling is outside the scope of our current exploration. Our focus, inspired by the research paper "Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models" (https://arxiv.org/abs/2408.02442v1), is on maintaining the model’s accuracy while producing structured outputs. As far as I know, functions need to be pre-defined for function calling with OpenAI’s LLMs. For example, to get the correct salary calculation, you would need to pre-define the relevant salary calculation functions. By the way, I’d be interested in seeing your full code (the code in your "./api") if you’re willing to share. Our experiment focuses on evaluating the model’s reasoning ability without relying on external tools.
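To illustrate what "pre-defined" means here: the billing rule has to exist as code before the model is ever called, and the model can only choose arguments for it. A hypothetical sketch:

```python
def calculate_salary(base_rate: float, base_hours: float,
                     overtime_multiplier: float, hours_worked: float) -> float:
    """Billing rule the model can only invoke, not invent."""
    overtime = max(hours_worked - base_hours, 0.0)
    billable = min(hours_worked, base_hours)
    return base_rate * billable + overtime_multiplier * base_rate * overtime

# The model would pick the arguments; the arithmetic stays deterministic.
print(calculate_salary(50, 29, 1.7, 38))  # 2215.0
```

The reasoning about which hours fall under which rate has already been done by whoever wrote the function, which is precisely the external help our experiment excludes.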
We also observed some intriguing results with the multi-step pipeline. In your gist, it appears that models like GPT-4o-mini or GPT-3.5-turbo didn’t produce accurate answers consistently. However, in our experiment, we achieved correct results even with these less powerful models (see video https://drive.google.com/file/d/19NZjZ8LZRazInImcm27XjperBMt...):
- GPT-3.5 for reasoning
- GPT-4o-mini for structured outputs (note that only GPT-4o related models support structured outputs)
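That pipeline can be sketched as two composed calls, one free-form reasoning step and one extraction step; both callables below are stubs standing in for the real GPT-3.5 and GPT-4o-mini requests:

```python
import json
import re

def pipeline(reason_model, extract_model, problem: str) -> dict:
    """Step 1: unconstrained reasoning. Step 2: structured extraction."""
    reasoning_text = reason_model(problem)            # free-form CoT
    return json.loads(extract_model(reasoning_text))  # schema-bound answer

# Stub for the reasoning model (GPT-3.5 in our run).
reasoning = lambda p: ("First 29 hours: 50*29 = 1450. "
                       "Extra 9 hours: 1.7*50*9 = 765. Total: 2215.")

# Stub for the extraction model (GPT-4o-mini with structured outputs).
extract = lambda text: json.dumps(
    {"final_answer": float(re.search(r"Total: (\d+)", text).group(1))})

result = pipeline(reasoning, extract, "John Doe ...")
print(result)  # {'final_answer': 2215.0}
```

The point is that the format restriction only ever applies to the second call, so the first call's reasoning is never constrained.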