xiaofei_ | 1 year ago
> You’re comparing apples to oranges - structured output (a capability) with structured output + CoT (a technique), saying that structured output isn’t good for reasoning. Well, it’s not supposed to “reason,” and you didn’t apply CoT to it!
The goal of our evaluation is to address the original OpenAI statement on Structured Outputs: https://openai.com/index/introducing-structured-outputs-in-t... (see the section “Separating a final answer from supporting reasoning or additional commentary”). That section illustrates structured output being used to carry CoT reasoning steps, which is the source of the confusion and the basis for both the research paper and our evaluation work. Our findings indicate that structured output is indeed not effective for reasoning (i.e., don’t trust the answer even if you specify a reasoning field).
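For concreteness, here is a minimal sketch of the pattern in question, assuming the current OpenAI Python SDK: chain-of-thought is requested *inside* a strict JSON schema, with a dedicated field for the reasoning steps. The schema name and field names are illustrative, not the exact cookbook code.

```python
# Strict JSON schema that asks the model to put its CoT into a
# "reasoning_steps" array alongside the final answer.
REASONING_SCHEMA = {
    "type": "json_schema",
    "json_schema": {
        "name": "reasoned_answer",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                # CoT captured as structured data
                "reasoning_steps": {"type": "array",
                                    "items": {"type": "string"}},
                "final_answer": {"type": "string"},
            },
            "required": ["reasoning_steps", "final_answer"],
            "additionalProperties": False,
        },
    },
}

def ask(question: str) -> str:
    from openai import OpenAI  # requires OPENAI_API_KEY at call time
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": question}],
        response_format=REASONING_SCHEMA,
    )
    return resp.choices[0].message.content  # JSON string matching the schema
```

The point of contention is whether the answer in `final_answer` is actually improved by the schema-constrained `reasoning_steps`, or whether the format restriction degrades the reasoning itself.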
> Why would you use any other temperature than 0 when you are asserting the correctness of the data extraction and the “reasoning” of the LLM? You don’t want variation.
Good question. The example we provided involves not only correct arithmetic but also accurate interpretation of specific conditions (e.g., recognizing that the first 29 hours are charged at one rate, with additional hours at a higher rate). Setting the temperature to 0 yields consistent, predictable results by always choosing the most likely next token, but we wanted to see how the models handle variation and uncertainty, so all models in the comparison were set to a temperature of 1.0. If a multi-step setup still produces consistent output under those conditions, that suggests a robust reasoning process we can trust. In contrast, the JSON mode with a reasoning field (i.e., reasoning_steps in the structured-output-reasoning-cot pipeline, as detailed in the Chain-of-Thought Reasoning section of https://colab.research.google.com/github/instill-ai/cookbook...) did not show similar reliability.
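The consistency check described above can be sketched in a few lines; `ask_model` stands in for any of the compared pipelines (a hypothetical callable, not cookbook code), sampled repeatedly at temperature 1.0:

```python
from collections import Counter

def consistency(answers: list[str]) -> float:
    """Fraction of samples agreeing with the most common answer."""
    most_common, count = Counter(answers).most_common(1)[0]
    return count / len(answers)

def check(ask_model, question: str, n: int = 5) -> float:
    # ask_model(question) would call the LLM with temperature=1.0;
    # a score near 1.0 across n samples indicates a stable pipeline.
    return consistency([ask_model(question) for _ in range(n)])
```

A pipeline whose answers stay identical even under sampling noise is giving evidence its reasoning is robust rather than lucky.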
> Why are you using the LLM to do math? If the data was extracted correctly (with structured output or function calling), let it write the formula and evaluate it. The new API is just a nicer built-in way to extract structured data. Previously (still valid), you had to use function calling and pass it a “returnResult” function that had its payload typed to your expected schema. This is one of the most powerful and effective tools we have to work with LLMs. If used properly, we shouldn’t avoid it just because it doesn’t “reason” as well.
Function calling is outside the scope of our current exploration. Our focus, inspired by the paper “Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models” (https://arxiv.org/abs/2408.02442v1), is on maintaining the model’s accuracy while producing structured outputs. As far as I know, functions must be pre-defined for function calling with OpenAI’s LLMs; for example, to get the correct salary calculation, you would need to pre-define the relevant salary-calculation functions. Our experiment evaluates the model’s reasoning ability without relying on external tools. By the way, I’d be interested in seeing your full code (the code in your "./api") if you’re willing to share.
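To make the trade-off concrete, this is roughly what such a pre-defined function would look like for the worked example (first 29 hours at a base rate, extra hours at a higher rate). The function name and parameters are assumptions for illustration; the point is that the model would only extract the inputs, while the arithmetic happens deterministically in code:

```python
def salary(hours: float, base_rate: float, overtime_rate: float,
           threshold: float = 29.0) -> float:
    """Deterministic pay calculation the LLM is otherwise asked to
    'reason' through: hours up to the threshold at base_rate, the
    remainder at overtime_rate."""
    if hours <= threshold:
        return hours * base_rate
    return threshold * base_rate + (hours - threshold) * overtime_rate
```

This sidesteps LLM arithmetic entirely, but it also means the evaluation would no longer be measuring the model's reasoning, which is what our experiment is about.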
We also observed some intriguing results with the multi-step pipeline. In your gist, it appears that models like GPT-4o-mini or GPT-3.5-turbo didn’t produce accurate answers consistently. However, in our experiment, we achieved correct results even with these less powerful models (see video https://drive.google.com/file/d/19NZjZ8LZRazInImcm27XjperBMt...):
- GPT-3.5 for reasoning
- GPT-4o-mini for structured outputs (note that only GPT-4o related models support structured outputs)
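The two-step pipeline above can be sketched as follows. `complete(model, prompt)` is a hypothetical thin wrapper over `client.chat.completions.create` (the second call would additionally pass `response_format` with the JSON schema); the prompts are illustrative, not the cookbook's exact wording:

```python
def reason_then_structure(question: str, complete) -> str:
    # Step 1: free-text chain-of-thought on the weaker model,
    # with no format restriction on the output.
    reasoning = complete(
        "gpt-3.5-turbo",
        f"Think step by step and solve:\n{question}",
    )
    # Step 2: the structured-output model only reformats the finished
    # reasoning into the schema; it does no arithmetic of its own.
    return complete(
        "gpt-4o-mini",
        f"Extract the final answer from this reasoning as JSON:\n{reasoning}",
    )
```

Separating the passes like this is what lets even the less powerful models produce correct structured results: the reasoning is never constrained by the schema.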
_andrei_ | 1 year ago
It might be that the new Structured Outputs feature is subpar when compared with "structured output through function calling", I'll try to compare them.
`/api` contains the OpenAI/Anthropic client builders, using llm-api (on npm) as a wrapper. There are no external tools; the LLM is just forced to call a shim "returnResult" function, which makes it fill each field according to the schema.
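For readers unfamiliar with the shim trick: in OpenAI API terms it looks roughly like this (a Python sketch of the pattern, not the actual llm-api TypeScript code; the result schema here is a made-up example). No tool is ever executed; forcing `tool_choice` simply guarantees the model emits arguments matching the schema:

```python
# Hypothetical extraction schema; the real one is whatever you want back.
RESULT_SCHEMA = {
    "type": "object",
    "properties": {
        "employee": {"type": "string"},
        "total_pay": {"type": "number"},
    },
    "required": ["employee", "total_pay"],
}

def build_request(prompt: str) -> dict:
    """Payload for client.chat.completions.create(**build_request(...))."""
    return {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "returnResult",
                "description": "Return the extracted result.",
                "parameters": RESULT_SCHEMA,
            },
        }],
        # Forcing this tool means the response is always a returnResult
        # call whose arguments conform to RESULT_SCHEMA.
        "tool_choice": {"type": "function",
                        "function": {"name": "returnResult"}},
    }
```

The structured result is then read from `response.choices[0].message.tool_calls[0].function.arguments` instead of the message content.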
segmondy | 1 year ago