trq_'s comments

trq_ | 1 month ago | on: Claude Code daily benchmarks for degradation tracking

Yes, we do, but harnesses are hard to eval: people use them across a huge variety of tasks, and sometimes different behaviors trade off against each other. We have added some evals to catch this one in particular.

trq_ | 1 month ago | on: Claude Code daily benchmarks for degradation tracking

Hi everyone, Thariq from the Claude Code team here.

Thanks for reporting this. We fixed a Claude Code harness issue that was introduced on 1/26. This was rolled back on 1/28 as soon as we found it.

Run `claude update` to make sure you're on the latest version.

trq_ | 4 months ago | on: Claude Is Down

We're back up! It was about 30 minutes of downtime this morning; our apologies if it interrupted your work.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

Yeah! I want to use the logprobs API, but you can't for example:

- sample multiple logits and branch (we maybe could with the old text completion API, but this no longer exists)

- add in a reasoning token on the fly

- stop execution, ask the user, etc.

But a visualization of logprobs in a query seems like it might be useful.
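A minimal sketch of what such a visualization could build on. This assumes the OpenAI chat completions endpoint with `logprobs=True, top_logprobs=k`, which returns per-token log-probabilities for the top-k alternatives; the helper name `token_entropy` is hypothetical, and the sample logprobs below are made up rather than taken from a real response:

```python
import math

def token_entropy(top_logprobs):
    """Shannon entropy (in nats) over the top-k alternatives for one token.

    `top_logprobs` is a list of log-probabilities, like the per-token
    alternatives returned when logprobs=True, top_logprobs=k is passed to
    the chat completions endpoint. Since it is truncated to k entries,
    this is only a lower bound on the true entropy.
    """
    probs = [math.exp(lp) for lp in top_logprobs]
    return -sum(p * math.log(p) for p in probs if p > 0)

# A confident token: one alternative dominates.
confident = [math.log(0.97), math.log(0.02), math.log(0.01)]
# An uncertain token: probability mass spread across near-synonyms.
uncertain = [math.log(0.35), math.log(0.33), math.log(0.32)]

print(token_entropy(confident))  # low, roughly 0.15 nats
print(token_entropy(uncertain))  # close to log(3) ≈ 1.10 nats
```

A visualizer could simply color each generated token by this score to surface the uncertain spots in a completion.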

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

I mean, LLMs certainly hold representations of what words mean and their relationships to each other; that's what the Key and Query matrices capture, for example.

But in this case, it means that the underlying point in embedding space doesn't map clearly to only one specific token. That's not too different from when you have an idea in your head but can't think of the word.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

Definitely, but if you can detect when you might be in one of those states, you could reflect to see exactly which state you're in.

So far this has mostly been done using Reinforcement Learning, but catching it and handling it at inference time seems like it could be interesting to explore. It's also much more approachable for open source, since only the big ML labs can do this sort of RL.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

Yeah that's been my thinking as well.

There are definitely times when entropy can be high without the model actually being uncertain (again, synonyms are the best example), but it seems promising. I want to build a visualizer using the OpenAI endpoints.

trq_ | 1 year ago | on: Detecting when LLMs are uncertain

In this case it would be a low entropy, high varentropy situation. It's confident in a few possible answers, like if it's a set of synonyms.
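A quick numeric sketch of that distinction. This is illustrative only: "varentropy" is taken here as the variance of the per-token surprisal (-log p), the helper name is hypothetical, and the distributions are made up:

```python
import math

def entropy_varentropy(probs):
    """Entropy (nats) and varentropy (variance of the surprisal -log p)
    of a next-token probability distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    v = sum(p * (-math.log(p) - h) ** 2 for p in probs if p > 0)
    return h, v

# Broad uncertainty: uniform over 10 candidates.
# High entropy, zero varentropy (every token is equally surprising).
h_broad, v_broad = entropy_varentropy([0.1] * 10)

# Synonym-style split: two strong candidates plus a thin tail.
# Lower entropy, but nonzero varentropy (surprisal varies across tokens).
h_syn, v_syn = entropy_varentropy([0.48, 0.48, 0.04])

print(h_broad, v_broad)  # ≈ 2.30, 0.0
print(h_syn, v_syn)      # lower entropy, higher varentropy
```

So the two signals together separate "the model has no idea" from "the model is torn between a few specific candidates."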