item 43640533

fpgaminer | 10 months ago

I ran an interesting benchmark/experiment yesterday, which did not do Quasar Alpha any favors (from best to worst, score is an average of four runs):

  "google/gemini-2.5-pro-preview-03-25"    => 67.65
  "anthropic/claude-3.7-sonnet:thinking"   => 66.76
  "anthropic/claude-3.7-sonnet"            => 66.23
  "deepseek/deepseek-r1:free"              => 54.38
  "google/gemini-2.0-flash-001"            => 52.03
  "openai/o3-mini"                         => 47.82
  "qwen/qwen2.5-32b-instruct"              => 44.78
  "meta-llama/llama-4-maverick:free"       => 42.87
  "openrouter/quasar-alpha"                => 40.27
  "openai/chatgpt-4o-latest"               => 37.94
  "meta-llama/llama-3.3-70b-instruct:free" => 34.40
The benchmark is a bit specific, but challenging. It's a prompt optimization task where the model iteratively writes a prompt, the prompt gets evaluated and scored from 0 to 100, and then the model can try again given the feedback. The whole process occurs in one conversation with the model, so it sees its previous attempts and their scores. In other words, it has to do Reinforcement Learning on the fly.
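
The loop described above might look roughly like this. This is just a sketch, not the actual benchmark code; `chat` and `evaluate` are hypothetical stand-ins for the LLM client and the 0-100 prompt scorer:

```python
# Rough sketch of the iterative optimization loop described above.
# All names are hypothetical: `chat` takes a message history and returns
# the model's reply; `evaluate` scores a candidate prompt from 0 to 100.
def optimize(chat, evaluate, rounds=10):
    """Run the whole optimization in a single conversation, so the
    model sees its previous attempts and their scores."""
    messages = [{
        "role": "system",
        "content": "Write a prompt. After each attempt you will be "
                   "given a score from 0 to 100; use it to improve.",
    }]
    best_prompt, best_score = None, -1.0
    for _ in range(rounds):
        candidate = chat(messages)      # model proposes a prompt
        score = evaluate(candidate)     # 0-100 feedback signal
        if score > best_score:
            best_prompt, best_score = candidate, score
        # Feed the attempt and its score back into the same conversation.
        messages.append({"role": "assistant", "content": candidate})
        messages.append({"role": "user",
                         "content": f"Score: {score:.2f}. Try again."})
    return best_prompt, best_score
```

The key property is that the history keeps growing: the model has to do the credit assignment itself, on the fly, from the raw (attempt, score) pairs.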

Quasar did only barely better than 4o. I was also surprised that the thinking variant of Sonnet provided no benefit, since both Gemini and ChatGPT benefit from their thinking modes. That said, normal Sonnet 3.7 does a lot of thinking in its responses by default, even without explicit prompting, which seems to help it a lot.

Quasar was also very unreliable and frequently did not follow instructions. I had the whole process automated, and the automation would retry a request if the response was malformed. Quasar took on average 4 retries on the first round before it caught on to what it was supposed to be doing. None of the other models had that difficulty, and almost all of their retries were the result of a model re-using an existing prompt.
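
For reference, the retry wrapper in the automation amounts to something like this (a sketch with hypothetical names; `chat` is the LLM call, `is_valid` checks that the response follows the instructions):

```python
# Retry a model call until the response passes validation, as described
# above. Returns the response plus how many retries it took, which is
# the per-model reliability signal mentioned in the comment.
def call_with_retries(chat, messages, is_valid, max_retries=10):
    for attempt in range(max_retries):
        response = chat(messages)
        if is_valid(response):
            return response, attempt    # attempt == number of retries used
    raise RuntimeError("model never produced a valid response")
```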

Based on looking at the logs, I'd say only o3-mini and the models above it were genuinely optimizing. By that I mean they continued to try new things, tweaked the prompts in subtle ways to see what happened, and consistently introspected on the patterns they were observing. That enabled all of those models to keep finding better and better prompts. In a separate manual run I let Gemini 2.5 Pro go for longer, and it was eventually able to get a prompt to a score of 100.

EDIT: But yes, to the article's point, Quasar was the fastest of all the models, hands down. That does have value on its own.

krackers | 10 months ago

Didn't they say they were going to open-source some model? "Fast and good but not too cutting-edge" would be a good candidate for a "token model" to open-source without meaningfully hurting your own bottom line.

daemonologist | 10 months ago

I'd be pleasantly surprised: GPT-4o is their bread and butter (it powers paid ChatGPT), and QA seems to be slightly ahead of it on benchmarks at similar or lower latency (so, very roughly, it might be cheaper to run).

andai | 10 months ago

Are you willing to share this code? I'm working on a project where I'm optimizing the prompt manually, and I wonder if it could be automated. I guess I'd have to find a way to objectively measure the output quality.

fpgaminer | 10 months ago

https://gist.github.com/fpgaminer/8782dd205216ea2afcd3dda29d...

That's the model automation. To evaluate the prompts it suggests, I have a sample of my dataset with 128 examples. For this particular run, all I cared about was optimizing a prompt for Llama 3.1 that would get it to write responses like the ones I'm finetuning for. That way the finetuning has a better starting point.

So to evaluate how effective a given prompt is, I go through each example and run <user>prompt</user><assistant>response</assistant> (in the proper chat format, of course) through Llama 3.1 and measure the NLL on the assistant portion. I then use a simple linear formula to convert the NLL to a score between 0 and 100, scaled based on typical NLL values. It should _probably_ be a non-linear formula, but I'm lazy.
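
Concretely, the scoring looks something like this. The NLL bounds below are illustrative, not the actual values; `nll_of_assistant` is a hypothetical stand-in for running the formatted example through Llama 3.1 and taking the mean negative log-likelihood over the assistant tokens:

```python
# Sketch of the linear NLL-to-score mapping described above. The
# worst/best bounds are placeholder values; in practice they'd be
# scaled to whatever NLL range the model typically produces.
def nll_to_score(nll, nll_worst=4.0, nll_best=1.0):
    """Linearly map a mean NLL onto [0, 100], clamped at the ends."""
    frac = (nll_worst - nll) / (nll_worst - nll_best)
    return max(0.0, min(100.0, 100.0 * frac))

def score_prompt(prompt, examples, nll_of_assistant):
    """Average the per-example score over the evaluation sample
    (128 examples in the run described above)."""
    nlls = [nll_of_assistant(prompt, ex) for ex in examples]
    return sum(nll_to_score(n) for n in nlls) / len(nlls)
```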

Another approach to prompt optimization is to give the model something like:

  I have some texts along with their corresponding scores. The texts are arranged in ascending order based on their scores from worst (low score) to best (higher score).
  
  Text: {text0}
  Score: {score0}
  Text: {text1}
  Score: {score1}
  ...
  
  Thoroughly read all of the texts and their corresponding scores.
  Analyze the texts and their scores to understand what leads to a high score. Don't just look for literal patterns of words/tokens. Extensively research the data until you understand the underlying mechanisms that lead to high scores. The underlying, internal relationships. Much like how an LLM is able to predict the token not just from the literal text but also by understanding very complex relationships of the "tokens" between the tokens.
  Take all of the texts into consideration, not just the best.
  Solidify your understanding of how to optimize for a high score.
  Demonstrate your deep and complete understanding by writing a new text that maximizes the score and is better than all of the provided texts.
  Ideally the new text should be under 20 words.
Or some variation thereof. That's the "one-off" approach, where you don't keep a conversation with the model and instead just call it again each round with the updated scores. Supposedly that's "better" since the texts are in ascending order, letting the model easily track improvements, but I've had far better luck with the iterative, conversational approach.
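
A sketch of that one-off loop, under the same caveats as before (`generate` is a hypothetical fresh LLM call with no history; `evaluate` is the scorer):

```python
# Sketch of the "one off" prompt-optimization variant described above:
# each round rebuilds the full ascending (text, score) listing and makes
# a fresh model call with no conversation history.
def one_off_optimize(generate, evaluate, seed_texts, rounds=5):
    scored = [(t, evaluate(t)) for t in seed_texts]
    for _ in range(rounds):
        scored.sort(key=lambda ts: ts[1])   # ascending: worst -> best
        listing = "\n".join(f"Text: {t}\nScore: {s:.1f}" for t, s in scored)
        prompt = (
            "I have some texts along with their corresponding scores, "
            "arranged in ascending order from worst to best.\n"
            + listing
            + "\nWrite a new text that maximizes the score. "
            "Ideally the new text should be under 20 words."
        )
        new_text = generate(prompt)         # fresh call, no history kept
        scored.append((new_text, evaluate(new_text)))
    return max(scored, key=lambda ts: ts[1])
```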

Also, the constraint on how long the "new text" can be is important, as all models have a tendency to write longer and longer prompts with each iteration.