top | item 46874760

smeeth | 26 days ago

Just reading your description, it sounds like there are two variables:

1. Prompt adherence: how well the models follow your stated strategy

2. Decision quality: how well models do on judgment calls that aren’t explicitly in the strategy

Candidly, since you haven’t shared the strategy, there’s no way for me to evaluate either (1) or (2). A model’s performance could be coming from the quality of your strategy, the model itself, or an interaction between the two, and I can’t disentangle that from what you’ve provided.

So as presented, the benchmark is basically useless to me for evaluating models (not because it’s pointless overall, but because I can’t tell what it’s actually measuring without seeing the strategy).

porttipasi | 26 days ago

That's a fair point. You're right that without seeing the strategy, you can't fully disentangle what drives the differences.

But the strategy itself isn't really the point. Since every model gets the exact same prompt and the exact same market data, the only variable is the model. So relative performance differences are real regardless of what the strategy contains. If Model A consistently outperforms Model B under identical conditions, that tells you something meaningful about the model.

And honestly, that blend of prompt adherence and decision quality is how people actually use LLMs in practice. You give it instructions and context, and you care about the result.

You're right though that the strategy being private limits what outsiders can evaluate. It's something I'm thinking about.

smeeth | 26 days ago

> If Model A consistently outperforms Model B under identical conditions, that tells you something meaningful about the model.

Not really! Sorry to harp on this, but there are two ways one model could outperform another:

1) It adheres to your strategy better

2) It improvises

If the prompt was "maximize money, here's inspiration," improvising is fine. If the prompt was "implement the strategy," improvising is failure.

Right now you have a leaderboard; you don’t yet have a benchmark, because you can’t tell whether high P&L reflects correctness.
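One way to make this concrete (a minimal sketch, not the benchmark's actual scoring; `follows_strategy` is a hypothetical per-trade label you'd have to produce by checking each trade against the stated rules): split each model's P&L into the part earned while adhering to the strategy and the part earned by improvising off-strategy. A model whose profit comes mostly from the second bucket is a good improviser, not a good strategy-follower, and raw P&L alone can't tell you which you have.

```python
from dataclasses import dataclass

@dataclass
class Trade:
    pnl: float
    follows_strategy: bool  # hypothetical label: did this trade match the stated rules?

def decompose_pnl(trades):
    """Split total P&L into adherent and improvised components.
    You need both numbers to separate prompt adherence from
    decision quality; their sum is the leaderboard P&L."""
    adherent = sum(t.pnl for t in trades if t.follows_strategy)
    improvised = sum(t.pnl for t in trades if not t.follows_strategy)
    return adherent, improvised

# Hypothetical trades: net-profitable, but most of the profit is off-strategy.
trades = [Trade(5.0, True), Trade(-1.0, True), Trade(12.0, False)]
adherent, improvised = decompose_pnl(trades)
print(adherent, improvised)  # 4.0 from the strategy, 12.0 from improvising
```

With a breakdown like this you'd have a benchmark for either interpretation of the prompt: adherent P&L scores "implement the strategy," total P&L scores "maximize money."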